
Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks

We explain why each type of change in LLM operations needs its own deployment gate, and cover experiment, canary, and rollback strategies.

Series: Why System Engineering Matters More Than Prompt Engineering

A 12-part series. You are reading Part 10.

In an LLM service, "change" happens every day: prompts are fixed, search indexes are updated, policy rules are adjusted, model routing is switched. Problems begin the moment we treat all of these changes at the same level. A one-line prompt edit and a one-line permission-policy edit carry very different risks. Change management should therefore be designed around scope of impact, not code type.

Versions used

  • Node.js 20 LTS
  • TypeScript 5.8.x
  • Next.js 16.1.x
  • OpenAI API (Responses API, as of the 2026-03 documentation)
  • PostgreSQL 15
  • Redis 7

Problem statement

Many operational incidents stem from change-management failures rather than functional defects.

  • Prompt and retrieval settings are changed at the same time, making it impossible to isolate the cause of a regression.
  • A policy change is rolled out at 100% without a canary, causing a sharp rise in refusal rate.
  • The only rollback path is a full code deployment, delaying prompt/policy recovery.
  • No experiment results are recorded, so the same mistakes are repeated.

Practical Example A: Deploying a Search Quality Improvement

A retrieval top-k adjustment and a prompt template improvement were deployed in one release. Quality dropped, but rollback was delayed because no one knew which change had caused it.

Practical Example B: Security Policy Update

A new sensitive-domain blocking rule was applied to all traffic immediately, and even normal questions were blocked. A policy change deployed without a canary led to a sharp deterioration in product KPIs.

Key concepts

Change management should be organized around "what is affected," not "what changed."

| Change type | Scope of impact | Baseline risk | Recommended deployment strategy |
| --- | --- | --- | --- |
| Prompt text | Response style/tone | Medium | Small canary + quality gate |
| Retrieval settings | Evidence/recency | High | Staged rollout by domain |
| Policy rules | Allow/block boundaries | Very high | Required approval + gradual expansion |
| Tool permissions | Execution results | Very high | Sandbox verification + limited canary |
| Model routing | Cost/quality/latency | Medium to high | Traffic-split experiment |

Core principles:

  1. Deploy only one type of high-risk change at a time.
  2. Have a dedicated gate (quality, security, cost) for each change.
  3. Rollback must be possible “per change” rather than “system-wide”.

Practical pattern

Pattern 1: Introducing a Change Classifier

Automatically classifying the risk tier at change-request time keeps deployment procedures consistent.

type ChangeType = "prompt" | "retrieval" | "policy" | "tool_permission" | "routing";

type ChangeRequest = {
  id: string;
  type: ChangeType;
  touchesDomains: string[];
  hasSecurityImpact: boolean;
  expectedMetricShift?: string;
};

function classifyRisk(req: ChangeRequest): "low" | "medium" | "high" {
  if (req.type === "policy" || req.type === "tool_permission") return "high";
  if (req.hasSecurityImpact) return "high";
  if (req.type === "retrieval" && req.touchesDomains.includes("payment")) return "high";
  if (req.type === "retrieval" || req.type === "routing") return "medium";
  return "low";
}

export function requiredGates(req: ChangeRequest) {
  const risk = classifyRisk(req);
  if (risk === "high") return ["security_review", "offline_eval", "canary_1_5_10", "rollback_plan"];
  if (risk === "medium") return ["offline_eval", "canary_10_25_50"];
  return ["offline_smoke", "canary_10"];
}

Operating points:

  • Require the risk tier and gate results in the change PR template.
  • Restrict deployment of high-risk changes at night and on holidays.
  • Enforce the minimum canary stage even for emergency changes.
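One way to enforce the deployment-window restriction is a small guard in the release pipeline. A minimal sketch, assuming UTC business hours and the risk tiers produced by classifyRisk; the helper name is hypothetical:

```typescript
// Hypothetical guard: high-risk changes may only deploy on weekdays during
// business hours (UTC here for simplicity); lower tiers deploy anytime.
type Risk = "low" | "medium" | "high";

export function isDeployWindowOpen(risk: Risk, now: Date): boolean {
  if (risk !== "high") return true; // low/medium changes are not window-restricted
  const day = now.getUTCDay();      // 0 = Sunday, 6 = Saturday
  const hour = now.getUTCHours();
  const weekday = day >= 1 && day <= 5;
  const businessHours = hour >= 9 && hour < 17;
  return weekday && businessHours;
}
```

An emergency override can bypass the window, but per the operating points it should still pass the minimum canary stage.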

Pattern 2: Prompt/Policy Version Registry and Independent Rollback

Fast rollback is possible only with a version registry that is decoupled from code deployment.

{
  "release_id": "rel_20260303_1900",
  "prompt_version": "assist-v8.2.0",
  "policy_version": "sec-v13",
  "retrieval_version": "ret-v5.4",
  "routing_version": "route-v3.1",
  "canary": {
    "percent": 10,
    "target_tenants": ["tenant_a", "tenant_b"]
  },
  "rollback": {
    "prompt_version": "assist-v8.1.4",
    "policy_version": "sec-v12",
    "retrieval_version": "ret-v5.3",
    "routing_version": "route-v3.0"
  }
}
# Example: per-component rollback
./ops/release/rollback-component.sh \
  --component policy_version \
  --to sec-v12 \
  --reason "false refusal spike"

Operating points:

  • Track per-component rollback time as a KPI.
  • Diversify canary target tenants to reduce bias toward specific traffic patterns.
  • Always tag the cause of the regression after a rollback.
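The per-component rollback that the script performs can be sketched in a few lines: given the active manifest, replace only one component's version with its rollback target. The manifest shape mirrors the JSON example above; rollbackComponent is a hypothetical helper, not from the original post.

```typescript
// Sketch of per-component rollback resolution over a release manifest.
type Component = "prompt_version" | "policy_version" | "retrieval_version" | "routing_version";

type Manifest = {
  release_id: string;
  prompt_version: string;
  policy_version: string;
  retrieval_version: string;
  routing_version: string;
  rollback: Record<Component, string>;
};

// Returns the set of active versions after reverting only one component.
export function rollbackComponent(m: Manifest, component: Component): Record<Component, string> {
  const active: Record<Component, string> = {
    prompt_version: m.prompt_version,
    policy_version: m.policy_version,
    retrieval_version: m.retrieval_version,
    routing_version: m.routing_version,
  };
  active[component] = m.rollback[component]; // only the chosen component reverts
  return active;
}
```

Because all other components keep their current versions, the blast radius of the rollback stays limited to the misbehaving change.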

Pattern 3: Pre-registered Promotion Rules for Experiments

An experiment is meaningless if it ends with a report. Promotion rules must be fixed before the experiment begins.

Example promotion rules:

  • Quality: resolved_rate within -1%p of baseline
  • Safety: no increase in policy_violation_rate
  • Cost: cost_per_resolved_request within +10%
  • Latency: p95_latency within +15%

Operating points:

  • The baseline is calculated as a 7-day moving average.
  • Require all metrics to pass simultaneously rather than relying on any single one.
  • Feed experiment failure cases back into the evaluation set to prevent recurrence.
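The promotion rules above can be collapsed into a single gate function. A minimal sketch: the metric names follow the post, while the function name and units are assumptions.

```typescript
// Minimal sketch of a promotion gate for the example rules.
type Metrics = {
  resolved_rate: number;             // fraction of resolved requests, e.g. 0.82
  policy_violation_rate: number;     // fraction of requests violating policy
  cost_per_resolved_request: number; // currency units per resolved request
  p95_latency_ms: number;            // p95 latency in milliseconds
};

// All four gates must pass at once; a single passing metric is not enough.
export function shouldPromote(baseline: Metrics, canary: Metrics): boolean {
  const qualityOk = canary.resolved_rate >= baseline.resolved_rate - 0.01;                     // within -1%p
  const safetyOk = canary.policy_violation_rate <= baseline.policy_violation_rate;             // no increase
  const costOk = canary.cost_per_resolved_request <= baseline.cost_per_resolved_request * 1.1; // within +10%
  const latencyOk = canary.p95_latency_ms <= baseline.p95_latency_ms * 1.15;                   // within +15%
  return qualityOk && safetyOk && costOk && latencyOk;
}
```

The baseline argument would be computed from the 7-day moving average mentioned above, so a single noisy day does not skew the comparison.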

Failure cases/anti-patterns

Failure scenario: “Cause missing due to simultaneous changes”

Situation:

  • Changed prompt/retrieval/policy simultaneously in the same release.
  • Within 30 minutes after deployment, follow_up_ratio and false_refusal_rate rose simultaneously.
  • With so many changed axes, the cause could not be isolated, and a full rollback took 90 minutes.

Detection procedure:

  1. Check the composite alert on quality/policy metrics
  2. Check the release manifest for concurrently changed components
  3. Compare detailed trace tags (prompt_version, policy_version, retrieval_version)

Mitigation procedure:

  1. Roll back only the policy version first to normalize the block rate
  2. Partially restore retrieval settings to their previous values
  3. Keep the new prompt but reduce its traffic to 20%

Recovery procedure:

  1. Add a rule prohibiting simultaneous deployment of high-risk changes
  2. Add a "2 components or fewer" limit to the release gate
  3. Codify the separate-deployment principle in the postmortem
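The two new gate rules can be encoded as a pre-deployment check. A minimal sketch: the component names follow the change-type table earlier, and validateRelease is a hypothetical helper.

```typescript
// Hypothetical release-gate check: at most 2 components per release, and
// high-risk components (policy, tool permissions) must deploy alone.
type Component = "prompt" | "retrieval" | "policy" | "tool_permission" | "routing";

const HIGH_RISK: Component[] = ["policy", "tool_permission"];

export function validateRelease(changed: Component[]): { ok: boolean; reason?: string } {
  if (changed.length > 2) {
    return { ok: false, reason: "more than 2 components changed in one release" };
  }
  const highRisk = changed.filter((c) => HIGH_RISK.includes(c));
  if (highRisk.length > 0 && changed.length > 1) {
    return { ok: false, reason: "high-risk changes must be deployed alone" };
  }
  return { ok: true };
}
```

Running this in CI against the release manifest turns the postmortem rule into an enforced gate rather than a convention.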

Representative antipatterns

  • Skipping gates on the intuition that "it will probably be fine"
  • Treating canaries as simple traffic splitting without wiring up quality metrics
  • Preparing only a forward-fix, with no rollback path
  • Relying on personal memory instead of documenting experiment results

Checklist

  • Are change requests classified into risk tiers automatically/manually?
  • For high-risk changes, are approvers and deployment windows managed separately?
  • Are prompt/policy/retrieval/routing versions tracked separately?
  • Is component-level rollback possible within 10 minutes?
  • Are quality/safety/cost/delay all included in the Canary promotion criteria?
  • Is there a rule limiting the number of simultaneous changes?
  • Are experiment failure cases returned to the evaluation set?

Summary

The essence of change management in LLM operations is controllability, not speed. Separating risks by change type and preparing dedicated gates and per-component rollback paths shrinks the blast radius. The moment prompt changes and system changes are treated the same way, operational resilience drops sharply.

Next part preview

In the next part, we present a reference architecture that combines the principles covered so far: separating the control plane from the data plane, and showing end to end how quality, cost, security, and reliability metrics feed into a single operations loop.
