
Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker

A summary of why retries, timeouts, fallbacks, and circuit breakers in LLM systems must be designed differently from regular APIs, and the operating patterns that follow.

Series: Why System Engineering Matters More Than Prompt Engineering

A 12-part series. You are reading Part 3.

When adding LLM functionality to a product, the first question asked is usually "accuracy." In actual operation, the question that bites first is "when and how do we fail?" Even with the same model, the blast radius of a failure can differ completely depending on how the retry policy and timeout budget are set. In this part, we decompose the LLM workload from a reliability perspective and organize design patterns that can be applied directly in operations.

Version baseline

  • Node.js 20 LTS
  • TypeScript 5.8.x
  • Next.js 16.1.x
  • OpenAI API (Responses API, based on 2026-03 document)
  • PostgreSQL 15
  • Redis 7

Problem statement

When LLM system reliability is designed with only conventional REST API operating experience, the following problems commonly occur.

  • Applying "retry 3 times on failure" uniformly amplifies model overload.
  • Waiting for long inferences without a timeout budget breaks the SLA of the entire request.
  • There is no criterion for switching to a fallback model, so traffic suddenly concentrates on high-cost models.
  • Tool-call failures and model failures are not distinguished, making root-cause analysis impossible.

Practical example A: E-commerce chatbot

During the promotion period, traffic rose to four times the normal level and the primary model's response latency increased. The system retried every failure equally, the retries added yet more load, and queue depth exploded. From the user's perspective, the result was a "no response" problem rather than a "response quality" problem.

Practical example B: In-house operations assistant

The model was configured to be re-invoked even when a tool call failed with a permission error, so token costs were consumed repeatedly without resolving the root cause. The underlying issue was a design that did not separate failure types.

Key concepts

LLM reliability is not a single "model call" but a combination of probabilistic inference and a deterministic system. Failures should therefore be classified as follows:

| Failure type | Representative cause | Retryability | Recommended response |
| --- | --- | --- | --- |
| Transient infra failure | Network blip, temporary 5xx | High | Short-backoff retry |
| Capacity failure | Model provider saturation, rate limiting | Medium | Queueing + routing downgrade |
| Deterministic input failure | Schema violation, policy block | Low | Fail immediately + guide the user |
| Tool permission failure | Permission/policy mismatch | Low | No retry; point to the permission path |
| Long reasoning timeout | Excessive context/complex queries | Medium | Split processing, staged responses |
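The table above maps directly to code. A minimal classifier sketch, assuming a generic HTTP-style error shape (the status/code checks are illustrative, not a specific SDK's error contract):

```typescript
type FailureKind =
  | "TRANSIENT_INFRA"
  | "RATE_LIMIT"
  | "SCHEMA_INVALID"
  | "POLICY_DENIED"
  | "TOOL_PERMISSION"
  | "UNKNOWN";

// Assumed error shape: an HTTP-style error with an optional status code
// and an optional machine-readable code string. Adapt to your client.
type InferError = { status?: number; code?: string };

export function classifyFailure(e: InferError): FailureKind {
  if (e.code === "ETIMEDOUT" || e.code === "ECONNRESET") return "TRANSIENT_INFRA";
  if (e.status === 429) return "RATE_LIMIT";
  if (e.status !== undefined && e.status >= 500) return "TRANSIENT_INFRA";
  if (e.code === "schema_invalid") return "SCHEMA_INVALID";
  if (e.code === "policy_denied") return "POLICY_DENIED";
  if (e.status === 403 || e.code === "tool_permission") return "TOOL_PERMISSION";
  return "UNKNOWN";
}
```

The point is not the exact mapping but that the classification happens in one place, so the retry policy in Pattern 2 can branch on it.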

There are four core principles:

  1. Retry is applied differently for each failure type.
  2. Timeouts are divided into stages within the overall request budget (deadline budget).
  3. Fallback must be an explicit policy that accepts quality degradation.
  4. Circuit breakers are a device that protects not only the provider but also our own service.

Practical pattern

Pattern 1: Deadline Budgeting + Stepped Timeout

Decide the overall SLA first, then explicitly allocate a budget to each stage. The "if the model is slow, just wait" approach almost always fails in operations.

type Budget = {
  totalMs: number;
  modelPrimaryMs: number;
  retryMs: number;
  toolMs: number;
  renderMs: number;
};

const defaultBudget: Budget = {
  totalMs: 2500,
  modelPrimaryMs: 1200,
  retryMs: 500,
  toolMs: 400,
  renderMs: 300,
};

// callModel / callFallbackModel are assumed to reject when timeoutMs elapses;
// finalize and degraded are the response helpers of the surrounding service.
export async function handleRequest(input: string, budget = defaultBudget) {
  const started = Date.now();

  const primary = await callModel(input, { timeoutMs: budget.modelPrimaryMs }).catch(() => null);
  if (primary) return finalize(primary, started);

  // Spend only what is left of the total budget, reserving render time.
  const elapsed = Date.now() - started;
  const remaining = budget.totalMs - elapsed;
  if (remaining <= budget.renderMs) {
    return degraded("Response generation timed out; returning a summarized result.");
  }

  const fallback = await callFallbackModel(input, {
    timeoutMs: Math.min(budget.retryMs, remaining - budget.renderMs),
  }).catch(() => null);

  if (!fallback) return degraded("We are handling heavy traffic; returning a brief answer instead.");
  return finalize(fallback, started);
}
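The model call helpers above are assumed to reject once their timeout elapses. A minimal sketch of such a deadline wrapper over any promise-returning call (withTimeout is an illustrative helper, not an SDK API):

```typescript
// Wrap any promise-returning call with a hard deadline. If the deadline
// fires first, the returned promise rejects; the underlying work is
// abandoned (pass an AbortSignal to the SDK if it supports cancellation).
export function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`deadline of ${timeoutMs}ms exceeded`)),
      timeoutMs,
    );
    work.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}
```

Note that rejection does not stop the provider from continuing to generate; to stop paying for abandoned work, cancellation must also be propagated to the SDK where supported.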

Operating points:

  • For each request, deadline_ms is included in the trace so you can analyze where the time budget is consumed.
  • Long document summarization/analysis is separated from the synchronous path and handed off to an asynchronous task queue.
  • Manage user expectations by explicitly marking degraded responses in the UX.

Pattern 2: Retry based on failure type + circuit breaker

Retrying every failure turns the retry mechanism into a failure amplifier. Retries should be limited to reversible failures.

type FailureKind =
  | "TRANSIENT_INFRA"
  | "RATE_LIMIT"
  | "SCHEMA_INVALID"
  | "POLICY_DENIED"
  | "TOOL_PERMISSION"
  | "UNKNOWN";

function shouldRetry(kind: FailureKind, attempt: number) {
  if (attempt >= 2) return false;
  if (kind === "TRANSIENT_INFRA") return true;
  if (kind === "RATE_LIMIT") return true;
  return false;
}

function nextBackoffMs(attempt: number) {
  return Math.min(200 * 2 ** attempt + Math.floor(Math.random() * 100), 900);
}

export async function resilientInfer(input: string) {
  let attempt = 0;
  while (true) {
    try {
      return await callPrimaryModel(input);
    } catch (e) {
      const kind = classifyFailure(e);
      if (!shouldRetry(kind, attempt)) throw e;
      await sleep(nextBackoffMs(attempt));
      attempt += 1;
    }
  }
}
A separate operational hook can open the circuit when window-level metrics cross thresholds:

# Example: open the circuit if failure rate and latency exceed thresholds in a 5-minute window
./ops/circuit/open-if-needed.sh \
  --target primary_model \
  --window 5m \
  --error-rate-threshold 0.08 \
  --latency-p95-threshold-ms 1800

Operating points:

  • Whether to allow a lower-quality fallback path or block traffic entirely while the circuit is open depends on the service domain.
  • Circuit state changes are propagated immediately to the on-call channel, and the automatic recovery criteria (half-open probe) are documented.
  • Retry requests are tagged separately from original requests so the amplification ratio can be computed.
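The script above assumes some circuit state machine behind it. An in-process sketch of the closed/open/half-open transitions, with illustrative thresholds:

```typescript
type CircuitState = "closed" | "open" | "half_open";

// Minimal in-process circuit breaker: opens after consecutive failures,
// allows a probe call after a cooldown (half-open), and closes again on
// a successful probe. Thresholds are illustrative defaults, not tuned values.
export class CircuitBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly cooldownMs = 30_000,
    private readonly now: () => number = Date.now, // injectable clock for testing
  ) {}

  currentState(): CircuitState {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half_open"; // allow a single probe request
    }
    return this.state;
  }

  allowRequest(): boolean {
    return this.currentState() !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half_open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

In a multi-instance deployment the state would live in a shared store such as Redis rather than process memory, so that one instance's open circuit is visible to the others.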

Failure cases / anti-patterns

Failure scenario: "Secondary failure due to excessive retry attempts"

Situation:

  • Latency increased on the provider side after a new model version was deployed.
  • The application used timeout=4s and retry=3 as global settings.
  • Average response time rose from 1.2 seconds to 6.7 seconds, worker threads saturated, and the delay spread to internal APIs.

Detection procedure:

  1. The retry_amplification_ratio alert fires (ratio of retries to original requests)
  2. queue_depth surges and worker_busy_ratio stays above 90%
  3. Confirm the repeating infer -> timeout -> retry pattern in traces
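The retry_amplification_ratio in step 1 can be derived from two counters; a minimal sketch (the counter names and the cited threshold are assumptions):

```typescript
// Retry amplification: total model calls divided by original (user-initiated)
// requests in a window. 1.0 means no retries; values well above ~1.3 suggest
// retries are multiplying load instead of absorbing transient failures.
export function retryAmplificationRatio(
  originalRequests: number,
  retryRequests: number,
): number {
  if (originalRequests === 0) return 0; // guard empty windows
  return (originalRequests + retryRequests) / originalRequests;
}
```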

Mitigation Procedures:

  1. Immediately reduce the global retry count (3 -> 1) and shorten the timeout
  2. Force the circuit breaker open and divert 60% of traffic to the fallback model
  3. Force long request types (document digests) onto an asynchronous queue
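Step 3's routing decision can start as a simple predicate; a sketch with illustrative request-type names and thresholds:

```typescript
type RoutePath = "sync" | "async_queue";

// Route long-running request types away from the synchronous path.
// The kind list and length threshold are illustrative, not measured values.
export function routeRequest(kind: string, inputChars: number): RoutePath {
  const longRunningKinds = new Set(["document_digest", "bulk_analysis"]);
  if (longRunningKinds.has(kind)) return "async_queue";
  if (inputChars > 20_000) return "async_queue";
  return "sync";
}
```

Having the routing rule in code (rather than in an incident runbook only) means it can also run preventively in normal operation, not just during failures.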

Recovery Procedure:

  1. Introduce a failure classifier and prohibit retries of deterministic failures
  2. Standardize the per-stage deadline budget
  3. Add the retry amplification ratio and degraded-response rate to the weekly SLO review

Representative antipatterns

  • The assumption that "more retries means safer"
  • Designs that leave no slack in the deadline budget because the provider's SLA was mistaken for our own SLA
  • A fallback model that is activated only during failures, without quality verification in normal times
  • Relying only on client timeouts, without a circuit breaker

Checklist

  • Have you classified failure types into at least four types and coded a retry policy for each type?
  • Is the timeout budget divided by stage based on the entire request SLA?
  • Are the quality/cost/policy violation rate of the fallback path measured even in normal times?
  • Are circuit breaker open/half-open/close transition criteria documented?
  • Is the retry amplification rate (retry_amplification_ratio) monitored?
  • Are tool call failures and model failures collected as separate indicators?
  • Is there a runbook prepared that diverts synchronous requests to an asynchronous queue in case of failure?

Summary

The key to LLM reliability design is not "optimizing the success path" but "controlling the failure path." Designing retries, timeouts, fallbacks, and circuit breakers around failure types limits the blast radius of any single failure. Changing the failure classification and time budgets has a greater impact on operational stability than changing prompts.

Next episode preview

The next section deals with cost. Cache, batching, model routing, and token budget are explained not from a simple savings perspective, but as a control loop that satisfies both quality and reliability. In particular, we focus on operational indicators to avoid the typical pitfall of “reducing costs but simultaneously worsening quality/delays.”
