Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
This part covers failure UX, human intervention, and operational governance design: what it takes to turn a technically working LLM feature into a product users can trust.
Series: Why System Engineering matters more than Prompt Engineering
12 parts in total. You are reading Part 9.
- Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Part 2. Quality comes from the evaluation loop, not from prompts
- Part 3. Reliability design: Retry, Timeout, Fallback, Circuit Breaker
- Part 4. Cost design: Cache, Batching, Routing, Token Budget
- Part 5. Security design: Prompt Injection, Data Leak, Policy Guard
- Part 6. Observability design: Trace, Span, Log Schema, Regression Detection
- Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy
- Part 8. Agent architecture: Planner/Executor, state machine, task queue
- Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance (this part)
- Part 10. Change management: Prompt changes vs system changes, experiments and rollbacks
- Part 11. Reference architecture: End-to-end operational design
- Part 12. Organization/Process: Operational maturity model and roadmap
LLM capabilities impress at the demo stage, but production is judged by different criteria. Users ask not how smart the model is but when they can trust it. The key to productization is therefore not just raising the correct-response rate, but making failure predictable and designing the user experience for when it happens.
Versions used
- Node.js 20 LTS
- TypeScript 5.8.x
- Next.js 16.1.x
- OpenAI API (Responses API, based on 2026-03 document)
- PostgreSQL 15
- Redis 7
Problem statement
Common product-level failure patterns in operational LLM features include:
- The UI projects excessive confidence even when answer quality is low, leading users to misjudge.
- Failure messages are abstract, so the user cannot decide on a next action.
- A human-in-the-loop exists, but with no standard for when to hand over to a human, it only creates a bottleneck.
- Policy/legal/security requirements are bolted on after feature release, causing frequent rollbacks.
Practical example A: Payment domain consultation assistant
For an inquiry about the refund policy, the model gave an ambiguous answer, but the UI presented it as if it were a confident "correct answer", and the user made the wrong decision. Technically there was no error, but from a product perspective it was a trust failure.
Practical Example B: Operations Automation Assistant
In a feature where the agent suggests risky actions (account suspension, settings changes), the approval UX was so poor that operators could not understand the context. Approval delays accumulated, and the operational SLA deteriorated.
Key concepts
Productization means designing a safe interface between model output and user action. There are three key points.
- Failure UX: Instead of hiding failure, guide the next action.
- Human-in-the-loop: Clearly define sections that require human intervention.
- Operational governance: Embed policies and responsibility boundaries into the release process.

| Area | Question | Design principle | Operational metrics |
| --- | --- | --- | --- |
| Failure UX | How do we show failure? | Suggest cause type + recovery action | Retry success rate, bounce rate |
| Expression of confidence | When may we speak confidently? | Expose rationale and limitations together | Rate of wrong high-confidence responses |
| Human intervention | Where is human approval required? | Risk-based routing | Approval SLA, automation rate |
| Governance | Who approves changes? | Role-based authorization | Rollback count, policy violation rate |
Practical pattern
Pattern 1: Anchoring failure UX in the API contract
Generic phrases like "an error has occurred" mask product failures. The response contract must include a user action guide for each failure type.
```typescript
type AssistantResponse = {
  status: "ok" | "degraded" | "needs_review" | "blocked";
  answer: string;
  confidence: number;
  reasonCode?:
    | "LOW_CONFIDENCE"
    | "POLICY_RESTRICTED"
    | "SOURCE_STALE"
    | "TOOL_UNAVAILABLE";
  nextAction?: "retry" | "ask_human" | "provide_more_context" | "view_policy_doc";
  citations?: string[];
};

// Map each failure type to a concrete next action for the user.
export function toUserFacingMessage(res: AssistantResponse): AssistantResponse {
  if (res.status === "ok") return res;
  if (res.reasonCode === "LOW_CONFIDENCE") {
    return { ...res, nextAction: "provide_more_context" };
  }
  if (res.reasonCode === "POLICY_RESTRICTED") {
    return { ...res, nextAction: "ask_human" };
  }
  return { ...res, nextAction: "retry" };
}
```
Operating points:
- Track conversion rate and user abandonment rate separately per reasonCode.
- Do not keep the confidence score for internal judgment only; expose it meaningfully in the UX.
- Include recovery actions such as "open the supporting document" in the product flow.
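The per-reasonCode tracking above can start with a small in-process counter. The sketch below is hypothetical: the `FailureMetrics` class and the outcome names (`retried_ok`, `abandoned`, `escalated`) are illustrative, not part of the contract.

```typescript
// Hypothetical sketch: counting recovery outcomes per reasonCode so that
// retry success rate can be reported per failure type.
type ReasonCode =
  | "LOW_CONFIDENCE"
  | "POLICY_RESTRICTED"
  | "SOURCE_STALE"
  | "TOOL_UNAVAILABLE";

type FailureOutcome = "retried_ok" | "abandoned" | "escalated";

class FailureMetrics {
  private counts = new Map<string, number>();

  record(reason: ReasonCode, outcome: FailureOutcome): void {
    const key = `${reason}:${outcome}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  // Retry success rate for one reasonCode:
  // retried_ok / (retried_ok + abandoned + escalated)
  retrySuccessRate(reason: ReasonCode): number {
    const outcomes: FailureOutcome[] = ["retried_ok", "abandoned", "escalated"];
    const total = outcomes.reduce(
      (sum, o) => sum + (this.counts.get(`${reason}:${o}`) ?? 0),
      0,
    );
    if (total === 0) return 0;
    return (this.counts.get(`${reason}:retried_ok`) ?? 0) / total;
  }
}
```

In practice these counters would feed a metrics backend; the point is that the key is `reasonCode × outcome`, not a single global error rate.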
Pattern 2: Risk-based human-in-the-loop routing
Attaching human review to every request does not scale; on the other hand, full automation carries a high accident risk. Route based on a risk score.
```json
{
  "risk_routing_policy": {
    "auto_allow_if": {
      "confidence_min": 0.82,
      "policy_decision": "allow",
      "domain": ["faq", "guide"]
    },
    "review_required_if": {
      "domain": ["payment", "account_lock", "legal"],
      "confidence_below": 0.82,
      "tool_call": ["change_account_state", "issue_refund"]
    },
    "hard_block_if": {
      "policy_decision": "deny",
      "pii_leak_risk": "high"
    }
  }
}
```
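A minimal sketch of how application code could evaluate this policy. The `RequestContext` shape and the `route()` helper are assumptions that mirror the JSON fields and thresholds above; the precedence order (block, then review, then allow) is the design point.

```typescript
// Hypothetical evaluator for the risk routing policy above.
type RouteDecision = "auto_allow" | "review_required" | "hard_block";

interface RequestContext {
  confidence: number;
  policyDecision: "allow" | "deny";
  domain: string;
  toolCalls: string[];
  piiLeakRisk: "low" | "medium" | "high";
}

const REVIEW_DOMAINS = ["payment", "account_lock", "legal"];
const REVIEW_TOOLS = ["change_account_state", "issue_refund"];
const AUTO_DOMAINS = ["faq", "guide"];
const CONFIDENCE_MIN = 0.82;

function route(ctx: RequestContext): RouteDecision {
  // Hard blocks take precedence over everything else.
  if (ctx.policyDecision === "deny" || ctx.piiLeakRisk === "high") {
    return "hard_block";
  }
  // Any risk signal forces human review.
  if (
    REVIEW_DOMAINS.includes(ctx.domain) ||
    ctx.confidence < CONFIDENCE_MIN ||
    ctx.toolCalls.some((t) => REVIEW_TOOLS.includes(t))
  ) {
    return "review_required";
  }
  // Auto-allow only for low-risk domains with sufficient confidence;
  // unknown domains default to the safe path.
  return AUTO_DOMAINS.includes(ctx.domain) ? "auto_allow" : "review_required";
}
```

Defaulting unknown domains to review rather than auto-allow is a deliberate fail-safe choice.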
```bash
# Monitor the review queue SLA
./ops/review-queue/check-sla.sh \
  --queue human_review \
  --target-p95-minutes 15 \
  --alert-channel support-oncall
```
Operating points:
- If the review queue backlog exceeds a threshold, adjust the automation rate.
- Feed review results back into the evaluation set to gradually expand the scope of automation.
- Show the original text, evidence, and policy log on one screen in the reviewer UI.
Pattern 3: Release governance and separation of responsibilities
At the productization stage, "responsibility for change" matters more than "feature completion." Approvers must be separated for prompt, policy, retrieval, and tool changes.
| Change type | Required approvers | Required verification |
|---|---|---|
| Prompt change | Product + Platform | Offline evaluation + online canary |
| Policy change | Security + Legal (as required) | Policy regression testing |
| Tool permission change | Platform + Security | Permission test + audit log check |
| Retrieval change | Search + Platform | Freshness/quality regression testing |
Operating points:
- Link approval records to tickets/PRs for traceability.
- Prohibit simultaneous deployment of high-risk changes; roll them out in stages.
- When a policy violation occurs, automatically surface the relevant change bundle.
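The approval matrix above can be enforced mechanically at release time. The sketch below is hypothetical: the change-type keys and role names follow the table, while the `ChangeBundle` shape and `missingApprovals` helper are assumptions for illustration.

```typescript
// Hypothetical release gate: verify a change bundle carries all
// required role approvals before it may deploy.
type ChangeType = "prompt" | "policy" | "tool_permission" | "retrieval";

const REQUIRED_APPROVERS: Record<ChangeType, string[]> = {
  prompt: ["product", "platform"],
  policy: ["security"], // legal is added case-by-case
  tool_permission: ["platform", "security"],
  retrieval: ["search", "platform"],
};

interface ChangeBundle {
  type: ChangeType;
  approvals: string[]; // roles that signed off, linked to tickets/PRs
}

// Returns the roles that still need to approve; empty means releasable.
function missingApprovals(bundle: ChangeBundle): string[] {
  return REQUIRED_APPROVERS[bundle.type].filter(
    (role) => !bundle.approvals.includes(role),
  );
}
```

A CI step that fails when `missingApprovals` is non-empty turns the table from documentation into an enforced gate.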
Failure cases/anti-patterns
Failure scenario: "The answer was correct, but the product failed"
Situation:
- Model response accuracy had improved over the previous version.
- However, low-confidence responses were displayed with the same emphasis in the UI, so users over-trusted the answers.
- Complaints grew, especially as confidently wrong answers to payment policy questions increased.
Detection procedure:
- The `high_confidence_wrong_answer_rate` indicator rises.
- In user behavior logs broken down by reasonCode, the rate of incorrect actions after `LOW_CONFIDENCE` responses increases.
- Clusters of identical UX issues are identified through CS ticket text analysis.
Mitigation procedures:
- Visually mark low-confidence responses as "requires review".
- Block automatic responses to policy-domain queries that lack a supporting link.
- Temporarily expand the human-review path for risk domains.
Recovery procedure:
- Fix the failure UX contract (reasonCode, nextAction) as the product standard.
- Add an "expression of confidence" item to the UX review checklist.
- Add accuracy and `user-safe-completion-rate` to the product KPIs.
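As a rough illustration of the `user-safe-completion-rate` KPI, it could be computed from session logs as the share of sessions the user finished without taking a harmful wrong action. The event shape below is an assumption, not a defined schema from this series.

```typescript
// Hypothetical KPI computation from per-session behavior logs.
interface SessionEvent {
  status: "ok" | "degraded" | "needs_review" | "blocked";
  // True when the session ended without the user acting on a wrong answer.
  userCompletedSafely: boolean;
}

function userSafeCompletionRate(events: SessionEvent[]): number {
  if (events.length === 0) return 0;
  const safe = events.filter((e) => e.userCompletedSafely).length;
  return safe / events.length;
}
```

Unlike raw accuracy, this metric also credits sessions where the system failed but the failure UX steered the user to a safe outcome.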
Representative antipatterns
- Papering over LLM errors with UX copy alone as a temporary fix
- Running human-in-the-loop as simple manual processing rather than quality correction
- Fast-deploying prompt-only changes without policy approval
- Excluding "recovery success rate after failure" from product KPIs
Checklist
- Is the failure type (reasonCode) defined in the API/UX contract?
- Are low-confidence responses clearly distinguished in the UI?
- Do high-risk domain requests have human-review routing rules?
- Are review queue SLAs and backlog thresholds managed as operational metrics?
- Are approvers for prompt/policy/permission changes separated?
- Do you collect user-behavior-based quality indicators (`safe completion`, `recovery success`)?
- Can the set of causal changes be traced when an incident occurs?
Summary
Productization is not about putting model performance on the screen. It is about turning failures into manageable experiences and embedding human intervention and governance into the operating system. For a technically sound system to be safe for users, the failure UX and the responsibility structure must be designed together.
Next episode preview
The next part covers change management: why it is dangerous to deploy prompt changes, system changes, and policy changes the same way, and how to design experiment/canary/rollback strategies for each change type.
Reference links
- OpenAI Developer Docs - Responses API
- OpenTelemetry official documentation
- Martin Fowler - Circuit Breaker
- Blog: LLM Agent Tool Guardrail design
Series navigation
- Previous post: Part 8. Agent architecture: Planner/Executor, state machine, task queue
- Next post: Part 10. Change management: Prompt changes vs system changes, experiments and rollbacks