Part 2. Quality comes from the evaluation loop, not from prompts
LLM quality is stabilized when managed through datasets, evaluation criteria, online feedback, and regression detection loops, not sentence tuning.
Series: Why System Engineering matters, not Prompt Engineering
12 parts in total. You are reading Part 2.
- Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Part 2. Quality comes from the evaluation loop, not from prompts (current)
- Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker
- Part 4. Cost Design: Cache, Batching, Routing, Token Budget
- Part 5. Security Design: Prompt Injection, Data Leak, Policy Guard
- Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection
- Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy
- Part 8. Agent Architecture: Planner/Executor, State Machine, Task Queue
- Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
- Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks
- Part 11. Reference Architecture: End-to-End Operational Design
- Part 12. Organization/Process: Operational Maturity Model and Roadmap
In a demo environment, it often feels like "the answers got better when I tweaked the prompt a little." In service operations, however, that experience does not last. The same prompt produces different results as the traffic mix shifts, data changes, and policy requirements are added. If you manage quality through prompt wording, you end up managing luck, not quality.
Version baseline
- Node.js 20 LTS
- TypeScript 5.8.x
- Next.js 16.1.x
- OpenAI API (Responses API, as of the 2026-03 docs)
- PostgreSQL 15
- Redis 7
Problem statement
The reason LLM quality problems drag on in production is, in most cases, the lack of an evaluation system. Common patterns look like this:
- Accuracy was high on offline samples, but evasive answers increase on real user requests.
- Strengthening safety causes usability to plummet.
- After the routing policy was changed to cut model cost, the failure rate spiked on questions in certain domains.
- The operations team has only qualitative feedback like "the answers haven't been great lately" and cannot quantify what got worse, or by how much.
Practical example A: In-house knowledge search chatbot
The in-house document search chatbot scored 86% on offline evaluation. In production, however, tickets reporting "incorrect policy document citations" spiked. The cause was not the model but stale search index freshness. It went undetected before deployment because the evaluation set contained no freshness scenario.
Practical example B: Payment domain customer support agent
When we refined the prompt to improve the quality of "Is this refundable?" responses, the average response length doubled. As a result, the bounce rate among mobile users rose and CS satisfaction actually fell. Accuracy improved, but the improvement did not cover the decline in UX quality.
Key concepts
LLM quality should be viewed as a multidimensional quality vector rather than a single score.

| Axis | Key question | Representative metrics | Failure signal |
| --- | --- | --- | --- |
| Accuracy | Does it match the facts/policy? | task pass rate, citation precision | invalid answers, inconsistent evidence |
| Usefulness | Does it actually solve the user's problem? | resolution rate, follow-up ratio | rise in repeat questions |
| Safety | Does it violate prohibition policies? | policy violation rate | sensitive information exposure |
| Consistency | Is output variance large for the same input? | variance score | non-reproducible issues |
| Efficiency | Are time/cost appropriate for the quality? | p95 latency, cost/request | SLA/budget exceeded |
The important point is the trade-offs between axes.
- Raising the safety threshold may increase false refusals.
- Increasing the share of small models for cost optimization may reduce accuracy on hard queries.
- Shortening responses improves latency but may reduce explanation sufficiency.

Quality control is therefore not "model score optimization" but "objective function design."
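As a sketch of what "objective function design" can mean in practice: rank candidate configurations by a weighted score over the quality axes above, but only after hard floors on safety, latency, and cost are met. All field names, weights, and thresholds below are illustrative assumptions, not values from this article.

```typescript
// Illustrative quality objective. Weights and thresholds are assumptions.
type QualitySnapshot = {
  accuracy: number;         // task pass rate, 0..1
  resolutionRate: number;   // usefulness proxy, 0..1
  violationRate: number;    // safety: lower is better
  p95LatencyMs: number;     // efficiency
  costPerRequestUsd: number;
};

// Hard constraints act as gates; the weighted score only ranks
// candidates that satisfy every constraint. Returns null when blocked.
function objective(q: QualitySnapshot): number | null {
  if (q.violationRate > 0.005) return null;    // safety floor
  if (q.p95LatencyMs > 2000) return null;      // SLA floor
  if (q.costPerRequestUsd > 0.01) return null; // budget floor
  return 0.6 * q.accuracy + 0.4 * q.resolutionRate;
}
```

Separating hard floors from the weighted score keeps a cheap-but-unsafe candidate from ever winning on the ranking alone.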
Practical patterns
Pattern 1: Compose the evaluation set as a "collection of operational risks," not a "collection of correct answers"
We recommend splitting the operational evaluation set into at least three tiers.
- Golden Set: baseline problems for core use cases
- Adversarial Set: policy violations, vague questions, jailbreak attempts
- Drift Set: real traffic samples from the last 7 to 30 days
```typescript
// infer() is assumed to be a wrapper around your model call (e.g. the
// Responses API); its implementation is outside this snippet.
declare function infer(input: string): Promise<string>;

type EvalCase = {
  id: string;
  tier: "golden" | "adversarial" | "drift";
  input: string;
  expected: {
    mustInclude?: string[];
    mustNotInclude?: string[];
    policyTags?: string[];
  };
};

type EvalResult = {
  caseId: string;
  pass: boolean;
  score: number;
  reasons: string[];
};

export async function runEval(cases: EvalCase[]): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const c of cases) {
    const output = await infer(c.input);
    // A case passes only when every required keyword appears and
    // no forbidden keyword does.
    const passInclude = (c.expected.mustInclude ?? []).every((k) => output.includes(k));
    const passExclude = (c.expected.mustNotInclude ?? []).every((k) => !output.includes(k));
    const pass = passInclude && passExclude;
    results.push({
      caseId: c.id,
      pass,
      score: pass ? 1 : 0,
      reasons: pass ? [] : ["constraint mismatch"],
    });
  }
  return results;
}
```
Operating points:
- Deployments that improve only the Golden Set are blocked; a minimum Adversarial score floor is enforced.
- The Drift Set is refreshed automatically on a schedule, with personal information masked first.
- When the product/policy changes, the evaluation set version is bumped as well.
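Those operating points can be encoded as a deployment gate. The sketch below assumes per-tier scores aggregated from the evaluation results; the floor values are illustrative assumptions, not recommendations.

```typescript
// Hypothetical deployment gate over per-tier evaluation scores (0..1).
type TierScores = { golden: number; adversarial: number; drift: number };

// A release may only ship when every tier clears its floor: improving
// the Golden Set alone is never enough to pass.
function deployGate(scores: TierScores): { pass: boolean; blockedBy: string[] } {
  const floors: TierScores = { golden: 0.85, adversarial: 0.9, drift: 0.8 };
  const blockedBy = (Object.keys(floors) as (keyof TierScores)[])
    .filter((tier) => scores[tier] < floors[tier]);
  return { pass: blockedBy.length === 0, blockedBy };
}
```

Returning the list of blocking tiers (rather than a bare boolean) gives the release pipeline something concrete to surface in a failure message.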
Pattern 2: Collect online quality signals as “events”
Offline scores alone cannot explain actual quality. Online, quality events must be recorded through a combination of user behavior and system failures.
```json
{
  "event": "llm_response_completed",
  "request_id": "req_20260303_102310",
  "prompt_version": "support-v4.2.1",
  "model_route": "small->large-fallback",
  "latency_ms": 812,
  "cost_usd": 0.0048,
  "policy_violation": false,
  "user_follow_up_within_120s": true,
  "thumb_down": false,
  "resolved": true
}
```
```bash
# Quality regression detection example: flag a spike in the follow-up
# question ratio over a 15-minute window versus a 7-day baseline.
./alerts/eval-guard.sh \
  --metric user_follow_up_ratio \
  --window 15m \
  --baseline 7d \
  --threshold +18%
```
Operating points:
- A user re-asking their question is a strong low-quality signal; weigh it alongside latency and policy violations.
- Changes to the search index/policy engine, not just model changes, are displayed on the same dashboard.
- Quality alerts can be linked to deployment alerts to allow automatic traffic reduction.
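The same check the `eval-guard.sh` alert above performs can be sketched in code: compare the follow-up ratio in the current window against a baseline and fire when the gap exceeds a threshold (mirroring `--threshold +18%`). The event shape and names here are assumptions for illustration.

```typescript
// Minimal online quality event; only the field this alert needs.
type QualityEvent = { userFollowUpWithin120s: boolean };

function followUpRatio(events: QualityEvent[]): number {
  if (events.length === 0) return 0;
  return events.filter((e) => e.userFollowUpWithin120s).length / events.length;
}

// Fires when the current window exceeds the baseline by more than
// `thresholdPct` percentage points (expressed as a fraction, 0.18 = +18%).
function regressionAlert(
  windowEvents: QualityEvent[],
  baselineRatio: number,
  thresholdPct = 0.18
): boolean {
  return followUpRatio(windowEvents) - baselineRatio > thresholdPct;
}
```

In production this comparison would run against aggregated metrics, but the core logic — window ratio minus baseline ratio versus a threshold — is the same.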
Failure cases/anti-patterns
Failure scenario: "offline accuracy went up, operational complaints went up"
Situation:
- After applying the new prompt + model routing, the Golden Set score rose from 78% to 90%.
- Two hours after deployment, `user_follow_up_ratio` rose from 12% to 31%.
- Root-cause analysis found the rejection policy was applied too aggressively, producing more "can't help" responses even to normal questions.
Detection procedure:
- Online metric alert: `follow_up_ratio` and `resolved_rate` degraded simultaneously
- Trace sampling: confirmed a spike in the policy engine's `deny_reason=ambiguous` rate
- Version comparison: confirmed prompt v4.2.1 and policy rule-set v19 were deployed simultaneously
Mitigation procedure:
- Immediately roll back only the policy rule-set to the previous version (v18)
- Temporarily divert affected-domain (payment/refund) traffic to the human-review path
- Ease the automatic response length limit to reduce under-explained answers
Recovery procedure:
- Add "false positive on a normal question" cases to the Adversarial Set
- Recalibrate the policy classifier threshold via A/B test
- Add a "rejection rate cap" as a precondition in the deployment gate
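A "rejection rate cap" precondition can be sketched as a simple check on canary traffic before full rollout. The stats shape and the cap value are illustrative assumptions.

```typescript
// Hypothetical canary stats: counts collected from a small traffic slice.
type CanaryStats = { total: number; refusals: number };

// Blocks the rollout when the refusal ratio on canary traffic exceeds
// the cap, catching over-aggressive rejection policies before they ship.
function refusalCapGate(stats: CanaryStats, cap = 0.1): boolean {
  if (stats.total === 0) return true; // no traffic yet, nothing to judge
  return stats.refusals / stats.total <= cap;
}
```

This check would have caught the scenario above: a refusal rate jumping past the cap on canary traffic blocks the deployment even though the Golden Set score improved.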
Representative anti-patterns
- Making deployment decisions from a single offline score
- A culture that reduces quality issues to the personal competence of whoever wrote the prompt
- Treating user feedback (thumbs down, repeat questions) as a product metric only, excluded from operational metrics
- Freezing the evaluation set and ignoring data drift
Checklist
- [ ] Are quality metrics decomposed into accuracy/usefulness/safety/efficiency and managed separately?
- [ ] Does the deployment gate include the Golden + Adversarial + Drift sets?
- [ ] Are online metrics (repeat-question rate, resolution rate, policy violation rate) linked to the prompt version?
- [ ] Is a rollback or routing-switch runbook ready for when a quality alert fires?
- [ ] Is simultaneous deployment of prompt changes and policy changes controlled?
- [ ] Are failure samples collected weekly and fed back into the evaluation set?
- [ ] Is it automatically verified that cost/latency improvements are not accompanied by quality regressions?
Summary
Prompt tuning can be the starting point for quality improvement, but it cannot be the end point. Operational quality comes from a loop consisting of evaluation set design, online feedback collection, regression detection, and deployment gates. Quality becomes a system when you can explain "why it gets better and when it gets worse" instead of just feeling "it looks good."
Next part preview
The next part covers reliability: why retries, timeouts, fallbacks, and circuit breakers in LLM systems must be designed differently from those in regular APIs. In particular, it organizes the criteria for distinguishing "cases where retries improve quality" from "cases where retries amplify failures" into real operating scenarios.
Reference links
- OpenAI Developer Docs - Responses API
- OpenTelemetry official documentation
- Martin Fowler - Circuit Breaker
- Blog: LLM Agent Tool Guardrail design
Series navigation
- Previous post: Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Next post: Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker