Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks
We cover why each type of change in LLM operations needs its own deployment gate, along with experiment, canary, and rollback strategies.
Series: Why System Engineering Matters More Than Prompt Engineering
A 12-part series. You are reading Part 10.
- Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Part 2. Quality comes from the evaluation loop, not from prompts
- Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker
- Part 4. Cost Design: Cache, Batching, Routing, Token Budget
- Part 5. Security Design: Prompt Injection, Data Leak, Policy Guard
- Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection
- Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy
- Part 8. Agent Architecture: Planner/Executor, State Machine, Task Queue
- Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
- Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks (current)
- Part 11. Reference Architecture: End-to-End Operational Design
- Part 12. Organization/Process: Operational Maturity Model and Roadmap
In an LLM service, "change" happens every day: prompts are fixed, search indexes are updated, policy rules are adjusted, model routing is changed. The problem begins the moment we treat all of these changes the same way. One line of prompt text and one line of permission policy carry very different levels of risk. Change management should therefore be designed around scope of impact, not the kind of artifact that changed.
Version baseline
- Node.js 20 LTS
- TypeScript 5.8.x
- Next.js 16.1.x
- OpenAI API (Responses API, based on 2026-03 document)
- PostgreSQL 15
- Redis 7
Problem statement
Many operational failures stem from change-management failures rather than functional defects.
- Prompt and retrieval settings are changed at the same time, making it impossible to isolate the cause of a regression.
- A policy change is rolled out to all traffic without a canary, and the rejection rate spikes.
- The rollback path is tied to code deployment only, delaying prompt/policy recovery.
- No experiment results are recorded, so the same mistakes are repeated.
Practical Example A: Deploying a Search Quality Improvement
A retrieval top-k adjustment and a prompt template improvement were deployed at once. Quality dropped, but rollback was delayed because no one could tell which change was responsible.
Practical Example B: Security Policy Update
A new rule blocking sensitive domains was applied to all traffic immediately, and even normal questions started getting blocked. The policy change deployed without a canary led to a sharp drop in product KPIs.
Key concepts
Change management should be organized around "what is affected," not "what changed."
| Change Type | Scope of Impact | Baseline Risk | Recommended Deployment Strategy |
|---|---|---|---|
| Prompt text | Response format/tone | Medium | Small canary + quality gate |
| Retrieval settings | Evidence/recency | High | Staged rollout by domain |
| Policy rules | Allow/block boundaries | Very high | Approval required + gradual expansion |
| Tool permissions | Execution results | Very high | Sandbox verification + limited canary |
| Model routing | Cost/quality/latency | Medium to high | Traffic-split experiment |
Core principles:
- Deploy only one type of high-risk change at a time.
- Give each change type a dedicated gate (quality, security, cost).
- Rollback must be possible "per change," not only "system-wide."
Practical pattern
Pattern 1: Introducing a Change Classifier
Automatically classifying risk level at change-request time keeps the deployment procedure consistent.
```typescript
type ChangeType = "prompt" | "retrieval" | "policy" | "tool_permission" | "routing";

type ChangeRequest = {
  id: string;
  type: ChangeType;
  touchesDomains: string[];
  hasSecurityImpact: boolean;
  expectedMetricShift?: string;
};

function classifyRisk(req: ChangeRequest): "low" | "medium" | "high" {
  if (req.type === "policy" || req.type === "tool_permission") return "high";
  if (req.hasSecurityImpact) return "high";
  if (req.type === "retrieval" && req.touchesDomains.includes("payment")) return "high";
  if (req.type === "retrieval" || req.type === "routing") return "medium";
  return "low";
}

export function requiredGates(req: ChangeRequest) {
  const risk = classifyRisk(req);
  if (risk === "high") return ["security_review", "offline_eval", "canary_1_5_10", "rollback_plan"];
  if (risk === "medium") return ["offline_eval", "canary_10_25_50"];
  return ["offline_smoke", "canary_10"];
}
```
Operating points:
- Make the risk tier and gate results mandatory fields in the change PR template.
- Restrict deployment of high-risk changes at night and on holidays.
- Enforce the minimum canary level even for emergency changes.
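Enforcing the gate list from `requiredGates` can be sketched as a small CI check. This is a minimal illustration with assumed names (`GateResult`, `canDeploy`), not an existing API: a release proceeds only when every required gate has a passing result recorded.

```typescript
// Record of one gate run, e.g. from a CI job or a manual security review.
type GateResult = { gate: string; passed: boolean };

// A release is deployable only if every required gate has passed.
function canDeploy(
  required: string[],
  results: GateResult[],
): { ok: boolean; missing: string[] } {
  const passed = new Set(results.filter((r) => r.passed).map((r) => r.gate));
  const missing = required.filter((g) => !passed.has(g));
  return { ok: missing.length === 0, missing };
}

// A high-risk change with a failed canary and no rollback plan is blocked.
const verdict = canDeploy(
  ["security_review", "offline_eval", "canary_1_5_10", "rollback_plan"],
  [
    { gate: "security_review", passed: true },
    { gate: "offline_eval", passed: true },
    { gate: "canary_1_5_10", passed: false },
  ],
);
// verdict.ok is false; verdict.missing lists the unmet gates
```

Returning the concrete `missing` list (rather than a bare boolean) makes the block message in CI actionable: the author sees exactly which gates remain.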
Pattern 2: Prompt/Policy Version Registry and Independent Rollback
Fast rollback is only possible with a version registry that is decoupled from code deployment.
```json
{
  "release_id": "rel_20260303_1900",
  "prompt_version": "assist-v8.2.0",
  "policy_version": "sec-v13",
  "retrieval_version": "ret-v5.4",
  "routing_version": "route-v3.1",
  "canary": {
    "percent": 10,
    "target_tenants": ["tenant_a", "tenant_b"]
  },
  "rollback": {
    "prompt_version": "assist-v8.1.4",
    "policy_version": "sec-v12",
    "retrieval_version": "ret-v5.3",
    "routing_version": "route-v3.0"
  }
}
```
```bash
# Example: component-level rollback
./ops/release/rollback-component.sh \
  --component policy_version \
  --to sec-v12 \
  --reason "false refusal spike"
```
Operating points:
- Measure component-level rollback time as a KPI.
- Diversify canary target tenants to reduce bias toward specific usage patterns.
- Always tag the cause of the regression after a rollback.
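What the rollback script above does can be sketched against the manifest shape from the registry. `rollbackComponent` is a hypothetical helper, not part of any real tooling: it swaps only the targeted version field and mints a new release id so the rollback itself remains an auditable release.

```typescript
// The four independently versioned components from the release manifest.
type Versions = {
  prompt_version: string;
  policy_version: string;
  retrieval_version: string;
  routing_version: string;
};

type Release = Versions & { release_id: string };

// Roll back a single component: copy the live release, replace only the
// targeted version, and record the rollback as a new release id.
function rollbackComponent(
  live: Release,
  component: keyof Versions,
  to: string,
): Release {
  const next: Release = { ...live, release_id: `${live.release_id}-rb` };
  next[component] = to;
  return next;
}

const live: Release = {
  release_id: "rel_20260303_1900",
  prompt_version: "assist-v8.2.0",
  policy_version: "sec-v13",
  retrieval_version: "ret-v5.4",
  routing_version: "route-v3.1",
};

// Roll back only the policy, matching the shell example above.
const fixed = rollbackComponent(live, "policy_version", "sec-v12");
// fixed.policy_version is "sec-v12"; all other versions are unchanged
```

Keeping the rollback as a new release (rather than mutating the live one) is what lets rollback time and regression causes be tagged per release, as the operating points recommend.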
Pattern 3: Link experiment results to deployment decisions
An experiment that ends with a report is meaningless. Promotion rules must be fixed in advance.
Example promotion rules:
- Quality: `resolved_rate` within -1%p of baseline
- Safety: no increase in `policy_violation_rate`
- Cost: `cost_per_resolved_request` within +10%
- Latency: `p95_latency` within +15%
Operating points:
- Compute the baseline as a 7-day moving average.
- Require all indicators to pass simultaneously, not just a single one.
- Feed failed experiment cases back into the evaluation set to reduce recurrence.
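The promotion rules above can be encoded so a canary either promotes or holds automatically. The metric names follow this section and the thresholds are the example values; `shouldPromote` is an illustrative sketch, not a known library call.

```typescript
// Metrics compared between the baseline (7-day moving average) and the candidate.
type Metrics = {
  resolved_rate: number;             // ratio in [0, 1]
  policy_violation_rate: number;     // ratio in [0, 1]
  cost_per_resolved_request: number; // currency units
  p95_latency: number;               // milliseconds
};

// Apply every promotion rule; promote only if all of them pass.
function shouldPromote(
  baseline: Metrics,
  candidate: Metrics,
): { promote: boolean; failures: string[] } {
  const failures: string[] = [];
  // Quality: resolved_rate may drop at most 1 percentage point.
  if (candidate.resolved_rate < baseline.resolved_rate - 0.01) failures.push("resolved_rate");
  // Safety: no increase in policy_violation_rate.
  if (candidate.policy_violation_rate > baseline.policy_violation_rate) failures.push("policy_violation_rate");
  // Cost: at most +10% per resolved request.
  if (candidate.cost_per_resolved_request > baseline.cost_per_resolved_request * 1.1) failures.push("cost_per_resolved_request");
  // Latency: p95 may grow at most +15%.
  if (candidate.p95_latency > baseline.p95_latency * 1.15) failures.push("p95_latency");
  return { promote: failures.length === 0, failures };
}
```

Returning the list of failed rules, not just a boolean, supports the "simultaneous passage of multiple indicators" point: a hold decision names every indicator that blocked promotion.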
Failure cases/anti-patterns
Failure scenario: "Cause lost amid simultaneous changes"
Situation:
- Prompt, retrieval, and policy were changed simultaneously in the same release.
- Within 30 minutes of deployment, `follow_up_ratio` and `false_refusal_rate` rose at the same time.
- Because so many axes changed at once, the cause could not be isolated, and a full rollback took 90 minutes.
Detection procedure:
- Check the composite quality/policy indicator alarm
- Check the simultaneous-change composition in the release manifest
- Compare detailed trace tags (`prompt_version`, `policy_version`, `retrieval_version`)
Mitigation procedure:
- Roll back only the policy version first to normalize the block rate
- Restore the retrieval settings to their previous values
- Keep the prompt but reduce its traffic share to 20%
Recovery procedure:
- Add a rule prohibiting simultaneous deployment of high-risk changes
- Apply a "2 or fewer changed components" limit in the release gate
- Codify the separate-deployment principle in the postmortem
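The "2 or fewer changed components" release gate from the recovery procedure can be sketched by diffing the candidate manifest against the live one. `changedComponents` and `validateRelease` are illustrative names under this post's manifest shape, not an existing tool.

```typescript
// The independently versioned components tracked in the release manifest.
type Versions = {
  prompt_version: string;
  policy_version: string;
  retrieval_version: string;
  routing_version: string;
};

// List the components whose versions differ between live and candidate.
function changedComponents(live: Versions, next: Versions): (keyof Versions)[] {
  return (Object.keys(live) as (keyof Versions)[]).filter((k) => live[k] !== next[k]);
}

// Block any release that changes more than `maxChanges` components at once.
function validateRelease(live: Versions, next: Versions, maxChanges = 2): void {
  const changed = changedComponents(live, next);
  if (changed.length > maxChanges) {
    throw new Error(
      `release blocked: ${changed.length} components changed (${changed.join(", ")}), limit is ${maxChanges}`,
    );
  }
}
```

Running this check in the release pipeline turns the postmortem rule into an enforced gate: a release touching prompt, retrieval, and policy at once fails before deployment, which is exactly the situation the failure scenario above describes.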
Representative antipatterns
- Skipping gates on the intuition that "it will probably be fine"
- Treating canaries as simple traffic splitting without wiring up quality metrics
- Preparing only a forward fix, with no rollback path
- Relying on personal memory instead of documenting experiment results
Checklist
- Are change requests classified into risk tiers, automatically or manually?
- For high-risk changes, are approvers and deployment windows managed separately?
- Are prompt/policy/retrieval/routing versions tracked independently?
- Is component-level rollback possible within 10 minutes?
- Do canary promotion criteria include quality, safety, cost, and latency?
- Is there a rule limiting the number of simultaneous changes?
- Are failed experiment cases fed back into the evaluation set?
Summary
The essence of change management in LLM operations is not speed but controllability. Separating risk by change type and preparing dedicated gates and partial rollback paths shrinks the blast radius of any incident. The moment prompt changes and system changes are treated the same way, operational resilience drops sharply.
Next episode preview
In the next part, we present a reference architecture that ties together the principles discussed so far: separating the control plane and the data plane, and showing end-to-end how quality, cost, security, and reliability metrics connect into a single operational loop.
Reference links
- OpenAI Developer Docs - Responses API
- OpenTelemetry official documentation
- Martin Fowler - Circuit Breaker
- Blog: LLM Agent Tool Guardrail design
Series navigation
- Previous post: Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
- Next post: Part 11. Reference Architecture: End-to-End Operational Design