Part 2. Quality comes from the evaluation loop, not from prompts
LLM quality is stabilized when managed through datasets, evaluation criteria, online feedback, and regression detection loops, not sentence tuning.
Series: Why System Engineering matters, not Prompt Engineering
12 parts in total. You are reading Part 2.
- Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Part 2. Quality comes from the evaluation loop, not from prompts (current)
- Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker
- Part 4. Cost Design: Cache, Batching, Routing, Token Budget
- Part 5. Security Design: Prompt Injection, Data Leak, Policy Guard
- Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection
- Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy
- Part 8. Agent Architecture: Planner/Executor, State Machine, Task Queue
- Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
- Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks
- Part 11. Reference Architecture: End-to-End Operational Design
- Part 12. Organization/Process: Operational Maturity Model and Roadmap
In a demo environment, it often feels like "the answers got better when I tweaked the prompt a little." In service operations, however, that experience does not last. The same prompt produces different results as the traffic mix shifts, data changes, and policy requirements are added. If you manage quality through prompt wording, you end up managing luck, not quality.
Version baseline
- Node.js 20 LTS
- TypeScript 5.8.x
- Next.js 16.1.x
- OpenAI API (Responses API, as of the 2026-03 docs)
- PostgreSQL 15
- Redis 7
Problem statement
The reason LLM quality problems drag on in production is, in most cases, the lack of an evaluation system. Common patterns look like this:
- Accuracy was high on offline samples, but evasive answers increase on real user requests.
- Strengthening safety causes usability to plummet.
- After the routing policy was changed to cut model cost, the failure rate spiked on questions in certain domains.
- The operations team has only qualitative feedback like "the answers haven't been great lately" and cannot quantify what got worse, or by how much.
Practical example A: In-house knowledge search chatbot
The in-house document search chatbot scored 86% on offline evaluation. In production, however, tickets reporting "incorrect policy document citations" spiked. The cause was not the model but stale search index freshness. It went undetected before deployment because the evaluation set contained no freshness scenario.
Practical example B: Payment domain customer support agent
When we refined the prompt to improve the quality of "Is this refundable?" responses, the average response length doubled. As a result, the bounce rate among mobile users rose and CS satisfaction actually fell. Accuracy improved, but the improvement did not cover the decline in UX quality.
Key concepts
LLM quality should be viewed as a multidimensional quality vector rather than a single score.

| Axis | Key question | Representative metrics | Failure signal |
| --- | --- | --- | --- |
| Accuracy | Does it match the facts/policy? | task pass rate, citation precision | invalid answers, inconsistent evidence |
| Usefulness | Does it actually solve the user's problem? | resolution rate, follow-up ratio | rise in repeat questions |
| Safety | Does it violate prohibition policies? | policy violation rate | sensitive information exposure |
| Consistency | Is output variance large for the same input? | variance score | non-reproducible issues |
| Efficiency | Are time/cost appropriate for the quality? | p95 latency, cost/request | SLA/budget exceeded |
The important point is the trade-offs between axes.
- Raising the safety threshold may increase false refusals.
- Increasing the share of small models for cost optimization may reduce accuracy on hard queries.
- Shortening responses improves latency but may reduce explanation sufficiency.

Quality control is therefore not "model score optimization" but "objective function design."
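As a sketch of what "objective function design" can mean in practice: rank candidate configurations by a weighted score over the quality axes above, but only after hard floors on safety, latency, and cost are met. All field names, weights, and thresholds below are illustrative assumptions, not values from this article.

```typescript
// Illustrative quality objective. Weights and thresholds are assumptions.
type QualitySnapshot = {
  accuracy: number;         // task pass rate, 0..1
  resolutionRate: number;   // usefulness proxy, 0..1
  violationRate: number;    // safety: lower is better
  p95LatencyMs: number;     // efficiency
  costPerRequestUsd: number;
};

// Hard constraints act as gates; the weighted score only ranks
// candidates that satisfy every constraint. Returns null when blocked.
function objective(q: QualitySnapshot): number | null {
  if (q.violationRate > 0.005) return null;    // safety floor
  if (q.p95LatencyMs > 2000) return null;      // SLA floor
  if (q.costPerRequestUsd > 0.01) return null; // budget floor
  return 0.6 * q.accuracy + 0.4 * q.resolutionRate;
}
```

Separating hard floors from the weighted score keeps a cheap-but-unsafe candidate from ever winning on the ranking alone.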
Practical patterns
Pattern 1: Compose the evaluation set as a "collection of operational risks," not a "collection of correct answers"
We recommend splitting the operational evaluation set into at least three tiers.
- Golden Set: baseline problems for core use cases
- Adversarial Set: policy violations, vague questions, jailbreak attempts
- Drift Set: real traffic samples from the last 7 to 30 days
```typescript
// infer() is assumed to be a wrapper around your model call (e.g. the
// Responses API); its implementation is outside this snippet.
declare function infer(input: string): Promise<string>;

type EvalCase = {
  id: string;
  tier: "golden" | "adversarial" | "drift";
  input: string;
  expected: {
    mustInclude?: string[];
    mustNotInclude?: string[];
    policyTags?: string[];
  };
};

type EvalResult = {
  caseId: string;
  pass: boolean;
  score: number;
  reasons: string[];
};

export async function runEval(cases: EvalCase[]): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const c of cases) {
    const output = await infer(c.input);
    // A case passes only when every required keyword appears and
    // no forbidden keyword does.
    const passInclude = (c.expected.mustInclude ?? []).every((k) => output.includes(k));
    const passExclude = (c.expected.mustNotInclude ?? []).every((k) => !output.includes(k));
    const pass = passInclude && passExclude;
    results.push({
      caseId: c.id,
      pass,
      score: pass ? 1 : 0,
      reasons: pass ? [] : ["constraint mismatch"],
    });
  }
  return results;
}
```
Operating points:
- Deployments that improve only the Golden Set are blocked; a minimum Adversarial score floor is enforced.
- The Drift Set is refreshed automatically on a schedule, with personal information masked first.
- When the product/policy changes, the evaluation set version is bumped as well.
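Those operating points can be encoded as a deployment gate. The sketch below assumes per-tier scores aggregated from the evaluation results; the floor values are illustrative assumptions, not recommendations.

```typescript
// Hypothetical deployment gate over per-tier evaluation scores (0..1).
type TierScores = { golden: number; adversarial: number; drift: number };

// A release may only ship when every tier clears its floor: improving
// the Golden Set alone is never enough to pass.
function deployGate(scores: TierScores): { pass: boolean; blockedBy: string[] } {
  const floors: TierScores = { golden: 0.85, adversarial: 0.9, drift: 0.8 };
  const blockedBy = (Object.keys(floors) as (keyof TierScores)[])
    .filter((tier) => scores[tier] < floors[tier]);
  return { pass: blockedBy.length === 0, blockedBy };
}
```

Returning the list of blocking tiers (rather than a bare boolean) gives the release pipeline something concrete to surface in a failure message.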
Pattern 2: Collect online quality signals as “events”
Offline scores alone cannot explain actual quality. Online, quality events must be recorded through a combination of user behavior and system failures.
```json
{
  "event": "llm_response_completed",
  "request_id": "req_20260303_102310",
  "prompt_version": "support-v4.2.1",
  "model_route": "small->large-fallback",
  "latency_ms": 812,
  "cost_usd": 0.0048,
  "policy_violation": false,
  "user_follow_up_within_120s": true,
  "thumb_down": false,
  "resolved": true
}
```
```bash
# Quality regression detection example: flag a spike in the follow-up
# question ratio over a 15-minute window versus a 7-day baseline.
./alerts/eval-guard.sh \
  --metric user_follow_up_ratio \
  --window 15m \
  --baseline 7d \
  --threshold +18%
```
Operating points:
- A user re-asking their question is a strong low-quality signal; weigh it alongside latency and policy violations.
- Changes to the search index/policy engine, not just model changes, are displayed on the same dashboard.
- Quality alerts can be linked to deployment alerts to allow automatic traffic reduction.
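The same check the `eval-guard.sh` alert above performs can be sketched in code: compare the follow-up ratio in the current window against a baseline and fire when the gap exceeds a threshold (mirroring `--threshold +18%`). The event shape and names here are assumptions for illustration.

```typescript
// Minimal online quality event; only the field this alert needs.
type QualityEvent = { userFollowUpWithin120s: boolean };

function followUpRatio(events: QualityEvent[]): number {
  if (events.length === 0) return 0;
  return events.filter((e) => e.userFollowUpWithin120s).length / events.length;
}

// Fires when the current window exceeds the baseline by more than
// `thresholdPct` percentage points (expressed as a fraction, 0.18 = +18%).
function regressionAlert(
  windowEvents: QualityEvent[],
  baselineRatio: number,
  thresholdPct = 0.18
): boolean {
  return followUpRatio(windowEvents) - baselineRatio > thresholdPct;
}
```

In production this comparison would run against aggregated metrics, but the core logic — window ratio minus baseline ratio versus a threshold — is the same.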
Failure cases/anti-patterns
Failure scenario: "offline accuracy went up, operational complaints went up"
Situation:
- After applying the new prompt + model routing, the Golden Set score rose from 78% to 90%.
- Two hours after deployment, `user_follow_up_ratio` rose from 12% to 31%.
- Root-cause analysis found the rejection policy was applied too aggressively, producing more "can't help" responses even to normal questions.
Detection procedure:
- Online metric alert: `follow_up_ratio` and `resolved_rate` degraded simultaneously
- Trace sampling: confirmed a spike in the policy engine's `deny_reason=ambiguous` rate
- Version comparison: confirmed prompt v4.2.1 and policy rule-set v19 were deployed simultaneously
Mitigation procedure:
- Immediately roll back only the policy rule-set to the previous version (v18)
- Temporarily divert affected-domain (payment/refund) traffic to the human-review path
- Ease the automatic response length limit to reduce under-explained answers
Recovery procedure:
- Add "false positive on a normal question" cases to the Adversarial Set
- Recalibrate the policy classifier threshold via A/B test
- Add a "rejection rate cap" as a precondition in the deployment gate
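A "rejection rate cap" precondition can be sketched as a simple check on canary traffic before full rollout. The stats shape and the cap value are illustrative assumptions.

```typescript
// Hypothetical canary stats: counts collected from a small traffic slice.
type CanaryStats = { total: number; refusals: number };

// Blocks the rollout when the refusal ratio on canary traffic exceeds
// the cap, catching over-aggressive rejection policies before they ship.
function refusalCapGate(stats: CanaryStats, cap = 0.1): boolean {
  if (stats.total === 0) return true; // no traffic yet, nothing to judge
  return stats.refusals / stats.total <= cap;
}
```

This check would have caught the scenario above: a refusal rate jumping past the cap on canary traffic blocks the deployment even though the Golden Set score improved.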
Representative anti-patterns
- Making deployment decisions from a single offline score
- A culture that reduces quality issues to the personal competence of whoever wrote the prompt
- Treating user feedback (thumbs down, repeat questions) as a product metric only, excluded from operational metrics
- Freezing the evaluation set and ignoring data drift
Checklist
- [ ] Are quality metrics decomposed into accuracy/usefulness/safety/efficiency and managed separately?
- [ ] Does the deployment gate include the Golden + Adversarial + Drift sets?
- [ ] Are online metrics (repeat-question rate, resolution rate, policy violation rate) linked to the prompt version?
- [ ] Is a rollback or routing-switch runbook ready for when a quality alert fires?
- [ ] Is simultaneous deployment of prompt changes and policy changes controlled?
- [ ] Are failure samples collected weekly and fed back into the evaluation set?
- [ ] Is it automatically verified that cost/latency improvements are not accompanied by quality regressions?
Summary
Prompt tuning can be the starting point for quality improvement, but it cannot be the end point. Operational quality comes from a loop consisting of evaluation set design, online feedback collection, regression detection, and deployment gates. Quality becomes a system when you can explain "why it gets better and when it gets worse" instead of just feeling "it looks good."
Next part preview
The next part covers reliability: why retries, timeouts, fallbacks, and circuit breakers in LLM systems must be designed differently from those in regular APIs. In particular, it organizes the criteria for distinguishing "cases where retries improve quality" from "cases where retries amplify failures" into real operating scenarios.
Reference links
- OpenAI Developer Docs - Responses API
- OpenTelemetry official documentation
- Martin Fowler - Circuit Breaker
- Blog: LLM Agent Tool Guardrail design
Series navigation
- Previous post: Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Next post: Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker