Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks
We cover why each type of change in LLM operations needs its own deployment gate, along with experiment, canary, and rollback strategies.
Series: Why System Engineering Matters More Than Prompt Engineering
A 12-part series. You are reading Part 10.
- Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Part 2. Quality comes from the evaluation loop, not from prompts
- Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker
- Part 4. Cost Design: Cache, Batching, Routing, Token Budget
- Part 5. Security Design: Prompt Injection, Data Leak, Policy Guard
- Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection
- Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy
- Part 8. Agent Architecture: Planner/Executor, State Machine, Task Queue
- Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
- Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks (current)
- Part 11. Reference Architecture: End-to-End Operational Design
- Part 12. Organization/Process: Operational Maturity Model and Roadmap
In an LLM service, "change" happens every day: prompts are fixed, search indexes are updated, policy rules are adjusted, model routing is changed. The problem begins the moment we treat all of these changes the same way. One line of prompt text and one line of permission policy carry very different levels of risk. Change management should therefore be designed around scope of impact, not the kind of artifact that changed.
Version baseline
- Node.js 20 LTS
- TypeScript 5.8.x
- Next.js 16.1.x
- OpenAI API (Responses API, based on 2026-03 document)
- PostgreSQL 15
- Redis 7
Problem statement
Many operational failures stem from change-management failures rather than functional defects.
- Prompt and retrieval settings are changed at the same time, making it impossible to isolate the cause of a regression.
- A policy change is rolled out to all traffic without a canary, and the rejection rate spikes.
- The rollback path is tied to code deployment only, delaying prompt/policy recovery.
- No experiment results are recorded, so the same mistakes are repeated.
Practical Example A: Deploying a Search Quality Improvement
A retrieval top-k adjustment and a prompt template improvement were deployed at once. Quality dropped, but rollback was delayed because no one could tell which change was responsible.
Practical Example B: Security Policy Update
A new rule blocking sensitive domains was applied to all traffic immediately, and even normal questions started getting blocked. The policy change deployed without a canary led to a sharp drop in product KPIs.
Key concepts
Change management should be organized around "what is affected," not "what changed."
| Change Type | Scope of Impact | Baseline Risk | Recommended Deployment Strategy |
|---|---|---|---|
| Prompt text | Response format/tone | Medium | Small canary + quality gate |
| Retrieval settings | Evidence/recency | High | Staged rollout by domain |
| Policy rules | Allow/block boundaries | Very high | Approval required + gradual expansion |
| Tool permissions | Execution results | Very high | Sandbox verification + limited canary |
| Model routing | Cost/quality/latency | Medium to high | Traffic-split experiment |
Core principles:
- Deploy only one type of high-risk change at a time.
- Give each change type a dedicated gate (quality, security, cost).
- Rollback must be possible "per change," not only "system-wide."
Practical pattern
Pattern 1: Introducing a Change Classifier
Automatically classifying risk level at change-request time keeps the deployment procedure consistent.
```typescript
type ChangeType = "prompt" | "retrieval" | "policy" | "tool_permission" | "routing";

type ChangeRequest = {
  id: string;
  type: ChangeType;
  touchesDomains: string[];
  hasSecurityImpact: boolean;
  expectedMetricShift?: string;
};

function classifyRisk(req: ChangeRequest): "low" | "medium" | "high" {
  if (req.type === "policy" || req.type === "tool_permission") return "high";
  if (req.hasSecurityImpact) return "high";
  if (req.type === "retrieval" && req.touchesDomains.includes("payment")) return "high";
  if (req.type === "retrieval" || req.type === "routing") return "medium";
  return "low";
}

export function requiredGates(req: ChangeRequest) {
  const risk = classifyRisk(req);
  if (risk === "high") return ["security_review", "offline_eval", "canary_1_5_10", "rollback_plan"];
  if (risk === "medium") return ["offline_eval", "canary_10_25_50"];
  return ["offline_smoke", "canary_10"];
}
```
Operating points:
- Make the risk tier and gate results mandatory fields in the change PR template.
- Restrict deployment of high-risk changes at night and on holidays.
- Enforce the minimum canary level even for emergency changes.
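Enforcing the gate list from `requiredGates` can be sketched as a small CI check. This is a minimal illustration with assumed names (`GateResult`, `canDeploy`), not an existing API: a release proceeds only when every required gate has a passing result recorded.

```typescript
// Record of one gate run, e.g. from a CI job or a manual security review.
type GateResult = { gate: string; passed: boolean };

// A release is deployable only if every required gate has passed.
function canDeploy(
  required: string[],
  results: GateResult[],
): { ok: boolean; missing: string[] } {
  const passed = new Set(results.filter((r) => r.passed).map((r) => r.gate));
  const missing = required.filter((g) => !passed.has(g));
  return { ok: missing.length === 0, missing };
}

// A high-risk change with a failed canary and no rollback plan is blocked.
const verdict = canDeploy(
  ["security_review", "offline_eval", "canary_1_5_10", "rollback_plan"],
  [
    { gate: "security_review", passed: true },
    { gate: "offline_eval", passed: true },
    { gate: "canary_1_5_10", passed: false },
  ],
);
// verdict.ok is false; verdict.missing lists the unmet gates
```

Returning the concrete `missing` list (rather than a bare boolean) makes the block message in CI actionable: the author sees exactly which gates remain.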
Pattern 2: Prompt/Policy Version Registry and Independent Rollback
Fast rollback is only possible with a version registry that is decoupled from code deployment.
```json
{
  "release_id": "rel_20260303_1900",
  "prompt_version": "assist-v8.2.0",
  "policy_version": "sec-v13",
  "retrieval_version": "ret-v5.4",
  "routing_version": "route-v3.1",
  "canary": {
    "percent": 10,
    "target_tenants": ["tenant_a", "tenant_b"]
  },
  "rollback": {
    "prompt_version": "assist-v8.1.4",
    "policy_version": "sec-v12",
    "retrieval_version": "ret-v5.3",
    "routing_version": "route-v3.0"
  }
}
```
```bash
# Example: component-level rollback
./ops/release/rollback-component.sh \
  --component policy_version \
  --to sec-v12 \
  --reason "false refusal spike"
```
Operating points:
- Measure component-level rollback time as a KPI.
- Diversify canary target tenants to reduce bias toward specific usage patterns.
- Always tag the cause of the regression after a rollback.
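What the rollback script above does can be sketched against the manifest shape from the registry. `rollbackComponent` is a hypothetical helper, not part of any real tooling: it swaps only the targeted version field and mints a new release id so the rollback itself remains an auditable release.

```typescript
// The four independently versioned components from the release manifest.
type Versions = {
  prompt_version: string;
  policy_version: string;
  retrieval_version: string;
  routing_version: string;
};

type Release = Versions & { release_id: string };

// Roll back a single component: copy the live release, replace only the
// targeted version, and record the rollback as a new release id.
function rollbackComponent(
  live: Release,
  component: keyof Versions,
  to: string,
): Release {
  const next: Release = { ...live, release_id: `${live.release_id}-rb` };
  next[component] = to;
  return next;
}

const live: Release = {
  release_id: "rel_20260303_1900",
  prompt_version: "assist-v8.2.0",
  policy_version: "sec-v13",
  retrieval_version: "ret-v5.4",
  routing_version: "route-v3.1",
};

// Roll back only the policy, matching the shell example above.
const fixed = rollbackComponent(live, "policy_version", "sec-v12");
// fixed.policy_version is "sec-v12"; all other versions are unchanged
```

Keeping the rollback as a new release (rather than mutating the live one) is what lets rollback time and regression causes be tagged per release, as the operating points recommend.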
Pattern 3: Link experiment results to deployment decisions
An experiment that ends with a report is meaningless. Promotion rules must be fixed in advance.
Example promotion rules:
- Quality: `resolved_rate` within -1%p of baseline
- Safety: no increase in `policy_violation_rate`
- Cost: `cost_per_resolved_request` within +10%
- Latency: `p95_latency` within +15%
Operating points:
- Compute the baseline as a 7-day moving average.
- Require all indicators to pass simultaneously, not just a single one.
- Feed failed experiment cases back into the evaluation set to reduce recurrence.
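The promotion rules above can be encoded so a canary either promotes or holds automatically. The metric names follow this section and the thresholds are the example values; `shouldPromote` is an illustrative sketch, not a known library call.

```typescript
// Metrics compared between the baseline (7-day moving average) and the candidate.
type Metrics = {
  resolved_rate: number;             // ratio in [0, 1]
  policy_violation_rate: number;     // ratio in [0, 1]
  cost_per_resolved_request: number; // currency units
  p95_latency: number;               // milliseconds
};

// Apply every promotion rule; promote only if all of them pass.
function shouldPromote(
  baseline: Metrics,
  candidate: Metrics,
): { promote: boolean; failures: string[] } {
  const failures: string[] = [];
  // Quality: resolved_rate may drop at most 1 percentage point.
  if (candidate.resolved_rate < baseline.resolved_rate - 0.01) failures.push("resolved_rate");
  // Safety: no increase in policy_violation_rate.
  if (candidate.policy_violation_rate > baseline.policy_violation_rate) failures.push("policy_violation_rate");
  // Cost: at most +10% per resolved request.
  if (candidate.cost_per_resolved_request > baseline.cost_per_resolved_request * 1.1) failures.push("cost_per_resolved_request");
  // Latency: p95 may grow at most +15%.
  if (candidate.p95_latency > baseline.p95_latency * 1.15) failures.push("p95_latency");
  return { promote: failures.length === 0, failures };
}
```

Returning the list of failed rules, not just a boolean, supports the "simultaneous passage of multiple indicators" point: a hold decision names every indicator that blocked promotion.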
Failure cases/anti-patterns
Failure scenario: "Cause lost amid simultaneous changes"
Situation:
- Prompt, retrieval, and policy were changed simultaneously in the same release.
- Within 30 minutes of deployment, `follow_up_ratio` and `false_refusal_rate` rose at the same time.
- Because so many axes changed at once, the cause could not be isolated, and a full rollback took 90 minutes.
Detection procedure:
- Check the composite quality/policy indicator alarm
- Check the simultaneous-change composition in the release manifest
- Compare detailed trace tags (`prompt_version`, `policy_version`, `retrieval_version`)
Mitigation procedure:
- Roll back only the policy version first to normalize the block rate
- Restore the retrieval settings to their previous values
- Keep the prompt but reduce its traffic share to 20%
Recovery procedure:
- Add a rule prohibiting simultaneous deployment of high-risk changes
- Apply a "2 or fewer changed components" limit in the release gate
- Codify the separate-deployment principle in the postmortem
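The "2 or fewer changed components" release gate from the recovery procedure can be sketched by diffing the candidate manifest against the live one. `changedComponents` and `validateRelease` are illustrative names under this post's manifest shape, not an existing tool.

```typescript
// The independently versioned components tracked in the release manifest.
type Versions = {
  prompt_version: string;
  policy_version: string;
  retrieval_version: string;
  routing_version: string;
};

// List the components whose versions differ between live and candidate.
function changedComponents(live: Versions, next: Versions): (keyof Versions)[] {
  return (Object.keys(live) as (keyof Versions)[]).filter((k) => live[k] !== next[k]);
}

// Block any release that changes more than `maxChanges` components at once.
function validateRelease(live: Versions, next: Versions, maxChanges = 2): void {
  const changed = changedComponents(live, next);
  if (changed.length > maxChanges) {
    throw new Error(
      `release blocked: ${changed.length} components changed (${changed.join(", ")}), limit is ${maxChanges}`,
    );
  }
}
```

Running this check in the release pipeline turns the postmortem rule into an enforced gate: a release touching prompt, retrieval, and policy at once fails before deployment, which is exactly the situation the failure scenario above describes.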
Representative antipatterns
- Skipping gates on the intuition that "it will probably be fine"
- Treating canaries as simple traffic splitting without wiring up quality metrics
- Preparing only a forward fix, with no rollback path
- Relying on personal memory instead of documenting experiment results
Checklist
- Are change requests classified into risk tiers, automatically or manually?
- For high-risk changes, are approvers and deployment windows managed separately?
- Are prompt/policy/retrieval/routing versions tracked independently?
- Is component-level rollback possible within 10 minutes?
- Do canary promotion criteria include quality, safety, cost, and latency?
- Is there a rule limiting the number of simultaneous changes?
- Are failed experiment cases fed back into the evaluation set?
Summary
The essence of change management in LLM operations is not speed but controllability. Separating risk by change type and preparing dedicated gates and partial rollback paths shrinks the blast radius of any incident. The moment prompt changes and system changes are treated the same way, operational resilience drops sharply.
Next episode preview
In the next part, we present a reference architecture that ties together the principles discussed so far: separating the control plane and the data plane, and showing end-to-end how quality, cost, security, and reliability metrics connect into a single operational loop.
Reference links
- OpenAI Developer Docs - Responses API
- OpenTelemetry official documentation
- Martin Fowler - Circuit Breaker
- Blog: LLM Agent Tool Guardrail design
Series navigation
- Previous post: Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
- Next post: Part 11. Reference Architecture: End-to-End Operational Design