
Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance

This part covers failure UX, human intervention, and operational governance design: what it takes to turn a technically functional LLM feature into a product users can trust.

Series: Why System Engineering Matters More Than Prompt Engineering

A 12-part series. You are reading Part 9.

LLM capabilities are impressive at the demo stage, but production is judged by different criteria. Users ask "when can I trust this?" rather than how smart the model is. The key to productization is therefore not only raising the correct-response rate, but making failure predictable and controlling the user experience when failure happens.

Version baseline

  • Node.js 20 LTS
  • TypeScript 5.8.x
  • Next.js 16.1.x
  • OpenAI API (Responses API, based on 2026-03 document)
  • PostgreSQL 15
  • Redis 7

Problem statement

Common product failure patterns in operational LLM functions include:

  • The UI projects excessive confidence even when answer quality is low, leading users to misjudge.
  • Failure messages are abstract, so the user cannot decide on a next action.
  • Even where a human-in-the-loop exists, there is no standard for when to hand off to a human, so it only creates a bottleneck.
  • Policy/legal/security requirements are bolted on after feature release, causing frequent rollbacks.

Practical example A: Payment domain consultation assistant

In a refund-policy inquiry, the model gave an ambiguous answer, but the UI presented it as if it were the definitive answer, and the user made the wrong decision. Technically it was not an error; from a product perspective, it was a trust failure.

Practical Example B: Operations Automation Assistant

In a feature where the agent suggests risky actions (account suspension, settings changes), the approval UX was poor and the operator could not understand the context. As a result, approval delays accumulated and the operational SLA worsened.

Key concepts

Productization is about designing a secure interface between “model output” and “user action.” There are three key points.

  1. Failure UX: Instead of hiding failure, guide the next action.
  2. Human-in-the-loop: Clearly define sections that require human intervention.
  3. Operational governance: Embed policies and responsibility boundaries into the release system.

| Area | Question | Design principle | Operational metrics |
| --- | --- | --- | --- |
| Failure UX | How do we show failure? | Suggest cause type + recovery action | Retry success rate, bounce rate |
| Expression of trust | When do we speak confidently? | Expose rationale and limitations together | Rate of wrong high-confidence responses |
| Human intervention | Where is human approval required? | Risk-based routing | Approval SLA, automation rate |
| Governance | Who approves changes? | Role-based authorization | Rollback count, policy-violation rate |

Practical pattern

Pattern 1: Anchoring a failing UX into an API contract

Generic phrases like "an error has occurred" mask product failures. The response contract must include a user action guide for each failure type.

type AssistantResponse = {
  status: "ok" | "degraded" | "needs_review" | "blocked";
  answer: string;
  confidence: number;
  reasonCode?:
    | "LOW_CONFIDENCE"
    | "POLICY_RESTRICTED"
    | "SOURCE_STALE"
    | "TOOL_UNAVAILABLE";
  nextAction?: "retry" | "ask_human" | "provide_more_context" | "view_policy_doc";
  citations?: string[];
};

export function toUserFacingMessage(res: AssistantResponse): AssistantResponse {
  // Successful responses pass through unchanged.
  if (res.status === "ok") return res;

  // Low confidence: ask the user for more context instead of guessing.
  if (res.reasonCode === "LOW_CONFIDENCE") {
    return { ...res, nextAction: "provide_more_context" };
  }

  // Policy-restricted: route to a human rather than refusing opaquely.
  if (res.reasonCode === "POLICY_RESTRICTED") {
    return { ...res, nextAction: "ask_human" };
  }

  // Transient failures (stale source, tool unavailable): suggest a retry.
  return { ...res, nextAction: "retry" };
}
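As a usage sketch, a low-confidence response maps to a concrete recovery action. The contract and mapper are restated in condensed form so the snippet stands alone; the sample answer and confidence value are hypothetical.

```typescript
// Condensed restatement of the contract and mapper above.
type AssistantResponse = {
  status: "ok" | "degraded" | "needs_review" | "blocked";
  answer: string;
  confidence: number;
  reasonCode?: "LOW_CONFIDENCE" | "POLICY_RESTRICTED" | "SOURCE_STALE" | "TOOL_UNAVAILABLE";
  nextAction?: "retry" | "ask_human" | "provide_more_context" | "view_policy_doc";
};

function toUserFacingMessage(res: AssistantResponse): AssistantResponse {
  if (res.status === "ok") return res;
  if (res.reasonCode === "LOW_CONFIDENCE") return { ...res, nextAction: "provide_more_context" };
  if (res.reasonCode === "POLICY_RESTRICTED") return { ...res, nextAction: "ask_human" };
  return { ...res, nextAction: "retry" };
}

// Hypothetical degraded response: instead of a bare "an error has occurred",
// the UI receives a concrete next action to render.
const degraded: AssistantResponse = {
  status: "degraded",
  answer: "The refund window may be 14 days, but I could not verify this.",
  confidence: 0.41,
  reasonCode: "LOW_CONFIDENCE",
};

console.log(toUserFacingMessage(degraded).nextAction); // "provide_more_context"
```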

Operating points:

  • Count conversion and abandonment rates separately per reasonCode.
  • Do not keep the confidence score for internal judgment only; surface it meaningfully in the UX.
  • Include recovery actions such as "open supporting document" in the product flow.

Pattern 2: Risk-based human-in-the-loop routing

Attaching human review to every request does not scale; on the other hand, full automation carries a high risk of incidents. Route based on a risk score.

{
  "risk_routing_policy": {
    "auto_allow_if": {
      "confidence_min": 0.82,
      "policy_decision": "allow",
      "domain": ["faq", "guide"]
    },
    "review_required_if": {
      "domain": ["payment", "account_lock", "legal"],
      "confidence_below": 0.82,
      "tool_call": ["change_account_state", "issue_refund"]
    },
    "hard_block_if": {
      "policy_decision": "deny",
      "pii_leak_risk": "high"
    }
  }
}
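A minimal TypeScript evaluator for this policy could look like the sketch below. The domain lists and the 0.82 threshold mirror the JSON above; the input shape (`RoutingInput`) is an assumption.

```typescript
type Route = "auto_allow" | "review_required" | "hard_block";

type RoutingInput = {
  confidence: number;
  policyDecision: "allow" | "deny";
  domain: string;
  toolCalls: string[];
  piiLeakRisk: "low" | "medium" | "high";
};

const AUTO_DOMAINS = ["faq", "guide"];
const REVIEW_TOOLS = ["change_account_state", "issue_refund"];
const CONFIDENCE_MIN = 0.82;

// Hard blocks win over everything; auto-allow requires every condition to
// hold; anything else falls through to human review.
function routeRequest(input: RoutingInput): Route {
  if (input.policyDecision === "deny" || input.piiLeakRisk === "high") {
    return "hard_block";
  }
  const autoAllowed =
    AUTO_DOMAINS.includes(input.domain) &&
    input.confidence >= CONFIDENCE_MIN &&
    !input.toolCalls.some((t) => REVIEW_TOOLS.includes(t));
  return autoAllowed ? "auto_allow" : "review_required";
}
```

Ordering matters here: checking the hard-block conditions first guarantees a policy deny can never be overridden by a high confidence score.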

# Review queue SLA monitoring
./ops/review-queue/check-sla.sh \
  --queue human_review \
  --target-p95-minutes 15 \
  --alert-channel support-oncall

Operating points:

  • When the review-queue backlog exceeds its threshold, adjust the automation rate.
  • Feed review results back into the evaluation set to gradually expand the scope of automation.
  • Show the original text, evidence, and policy log on a single screen in the reviewer UI.
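The first operating point can be sketched as a simple control: when the backlog grows past its limit, relax the auto-allow confidence bar so more traffic bypasses review. All numbers here are illustrative assumptions, not recommendations.

```typescript
// Illustrative backlog-driven adjustment of the auto-allow threshold.
// base: the policy's confidence_min (e.g. 0.82); backlogLimit: the ops threshold.
function effectiveConfidenceMin(
  base: number,
  backlog: number,
  backlogLimit: number
): number {
  if (backlog <= backlogLimit) return base;
  // Relax by at most 0.05, proportional to how far past the limit we are,
  // and never below a hard floor of 0.75.
  const overload = Math.min(1, (backlog - backlogLimit) / backlogLimit);
  return Math.max(base - 0.05 * overload, 0.75);
}
```

The hard floor keeps a degraded review queue from silently turning the product into full automation; once the floor is hit, the right response is to add reviewers, not to relax further.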

Pattern 3: Release governance and separation of responsibilities

At the productization stage, "responsibility for change" matters more than "feature completeness." Approvers must be separated for each change type: prompt, policy, retrieval, and tool.

| Change type | Required approver | Required verification |
| --- | --- | --- |
| Prompt change | Product + Platform | Offline evaluation + online canary |
| Policy change | Security + Legal (as required) | Policy regression testing |
| Tool permission change | Platform + Security | Permission test + audit log check |
| Retrieval change | Search + Platform | Freshness/quality regression testing |

Operating points:

  • Link approval records to tickets/PRs to ensure traceability.

  • Simultaneous deployment of high-risk changes is prohibited; roll them out in stages.
  • When a policy violation occurs, the related change bundle is automatically surfaced.
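A CI gate can enforce the matrix above by diffing the required approver roles against the recorded approvals. The data shape and role strings are assumptions; only the pairings follow the table.

```typescript
type ChangeType = "prompt" | "policy" | "tool_permission" | "retrieval";

// Required approver roles per change type, following the table above.
// "legal" is situational for policy changes and is omitted from the base set.
const REQUIRED_APPROVERS: Record<ChangeType, string[]> = {
  prompt: ["product", "platform"],
  policy: ["security"],
  tool_permission: ["platform", "security"],
  retrieval: ["search", "platform"],
};

// Returns the roles that still need to sign off; an empty array means the
// change bundle may proceed to its required verification step.
function missingApprovals(change: ChangeType, approvedBy: string[]): string[] {
  return REQUIRED_APPROVERS[change].filter((role) => !approvedBy.includes(role));
}
```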

Failure cases/anti-patterns

Failure Scenario: “It was correct, but the product failed”

Situation:

  • Model response accuracy had improved over the previous version.
  • However, low-confidence responses were displayed with the same emphasis in the UI, so users over-trusted the answers.
  • Complaints increased, especially as misplaced high trust grew on payment-policy questions.

Detection procedure:

  1. The high_confidence_wrong_answer_rate indicator rises
  2. Per-reasonCode user behavior logs show an increased rate of incorrect actions following LOW_CONFIDENCE responses
  3. CS ticket text analysis identifies clusters of the same UX issue
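One way to pin down the indicator in step 1: among responses the UI presented with high confidence, the share that turned out to be wrong. The event shape and the 0.8 display threshold are assumptions.

```typescript
type Outcome = { shownConfidence: number; wasCorrect: boolean };

// Share of wrong answers among responses shown as high-confidence in the UI.
function highConfidenceWrongAnswerRate(
  outcomes: Outcome[],
  threshold = 0.8
): number {
  const high = outcomes.filter((o) => o.shownConfidence >= threshold);
  if (high.length === 0) return 0;
  return high.filter((o) => !o.wasCorrect).length / high.length;
}
```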

Mitigation Procedures:

  1. Visually mark low confidence responses as “requires review”
  2. Automatic response is blocked for policy domain queries without a supporting link.
  3. Temporary expansion of risk domain human-review path

Recovery Procedure:

  1. Fix the failure UX contract (reasonCode, nextAction) as a product standard
  2. Add an "expression of trust" item to the UX review checklist
  3. Add accuracy and user safe-completion rate to the product KPIs

Representative antipatterns

  • Papering over LLM errors temporarily with UX copy alone
  • Running the human-in-the-loop as simple manual processing rather than as quality correction
  • Fast-deploying prompt-only changes without policy approval
  • Excluding "recovery success rate after failure" from product KPIs

Checklist

  • Is the failure type (reasonCode) defined in the API/UX contract?
  • Are low-confidence responses clearly distinguished in the UI?
  • Do red domain requests have human-review routing rules?
  • Are review queue SLAs and backlog thresholds managed as operational metrics?
  • Are the prompt/policy/permission change approvers separated?
  • Do you collect user behavior-based quality indicators (safe completion, recovery success)?
  • Is it possible to trace the set of causal changes in an accident?

Summary

Productization is not about putting model performance on the screen. It is about turning failures into manageable experiences and embedding human intervention and governance into the operating process. For a technically sound system to be safe for users, the failure UX and the responsibility structure must be designed together.

Next episode preview

The next part deals with change management. It explains why it is dangerous to deploy prompt changes, system changes, and policy changes in the same way, and outlines how to design experiment/canary/rollback strategies for each type of change.
