Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker
We summarize the reasons and operating patterns for retries, timeouts, fallbacks, and circuit breakers in LLM systems that should be designed differently from regular APIs.
Series: Why System Engineering Matters More Than Prompt Engineering
12 parts in total. You are reading Part 3.
- Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Part 2. Quality comes from the evaluation loop, not from prompts
- Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker (current)
- Part 4. Cost Design: Cache, Batching, Routing, Token Budget
- Part 5. Security Design: Prompt Injection, Data Leak, Policy Guard
- Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection
- Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy
- Part 8. Agent architecture: Planner/Executor, state machine, task queue
- Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
- Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks
- Part 11. Reference Architecture: End-to-End Operational Design
- Part 12. Organization/Process: Operational Maturity Model and Roadmap
When adding LLM functionality to a product, the first question people ask is usually about accuracy. In actual operation, however, the first problem is "when and how do we fail?" Even with the same model, the blast radius of a failure can differ completely depending on how the retry policy and timeout budget are set. In this part, we decompose the LLM workload from a reliability perspective and organize design patterns that can be applied directly in operations.
Baseline versions
- Node.js 20 LTS
- TypeScript 5.8.x
- Next.js 16.1.x
- OpenAI API (Responses API, based on 2026-03 document)
- PostgreSQL 15
- Redis 7
Problem statement
When LLM system reliability is designed with only conventional REST API operating experience, the following problems often occur.
- A blanket "retry 3 times on failure" policy amplifies model overload.
- Without a timeout standard, waiting on long inferences breaks the SLA of the entire request.
- With no criteria for switching to a fallback model, traffic rapidly concentrates on high-cost models.
- Tool-call failures and model failures are not distinguished, making root-cause analysis impossible.
Practical example A: E-commerce chatbot
During a promotion, traffic rose to four times the normal level and the primary model's response latency increased. The system retried all failures equally, the retry requests added further delay, and the queue backed up. From the user's perspective, the result was a "no response" problem rather than a "response quality" problem.
Practical example B: In-house operations assistant
The model was configured to be re-invoked even when a tool call failed with a permission error, so token costs were consumed repeatedly without resolving the cause of the failure. The root cause was a design that did not separate failure types.
Key concepts
LLM reliability is not a single model call but a combination of probabilistic inference and a deterministic system. Failures should therefore be classified as follows:

| Failure type | Representative cause | Retryability | Recommended response |
| --- | --- | --- | --- |
| Transient infra failure | Network blip, temporary 5xx | High | Short backoff retry |
| Capacity failure | Provider saturation, rate limiting | Medium | Queuing + routing downgrade |
| Deterministic input failure | Schema violation, policy block | Low | Fail fast + user guidance |
| Tool permission failure | Permission/policy mismatch | Low | No retry; guide to the permission path |
| Long reasoning timeout | Excessive context, complex query | Medium | Split processing, staged response |
There are four core principles:
- Retry is applied differently for each failure type.
- Timeouts are divided into stages within the overall request budget (deadline budget).
- Fallback must be an explicit policy that accepts quality degradation.
- Circuit breakers are a device not only to protect suppliers but also to protect our services.
Practical pattern
Pattern 1: Deadline Budgeting + Stepped Timeout
The overall SLA must be decided first, and a budget explicitly allocated to each stage. The "if the model is slow, just wait" approach almost always fails in operations.
type Budget = {
  totalMs: number;
  modelPrimaryMs: number;
  retryMs: number;
  toolMs: number;
  renderMs: number;
};

const defaultBudget: Budget = {
  totalMs: 2500,
  modelPrimaryMs: 1200,
  retryMs: 500,
  toolMs: 400,
  renderMs: 300,
};

// callModel, callFallbackModel, finalize, and degraded are application
// helpers elided here; degraded(msg) returns a reduced-quality response.
export async function handleRequest(input: string, budget = defaultBudget) {
  const started = Date.now();
  const primary = await callModel(input, { timeoutMs: budget.modelPrimaryMs }).catch(() => null);
  if (primary) return finalize(primary, started);
  const elapsed = Date.now() - started;
  const remaining = budget.totalMs - elapsed;
  if (remaining <= budget.renderMs) {
    return degraded("Response generation timed out; providing a summary result instead.");
  }
  // Give the fallback whatever time is left, minus the render budget.
  const fallback = await callFallbackModel(input, {
    timeoutMs: Math.min(budget.retryMs, remaining - budget.renderMs),
  }).catch(() => null);
  if (!fallback) return degraded("High request volume; providing a simplified answer instead.");
  return finalize(fallback, started);
}
Operating points:
- Include `deadline_ms` in the trace for each request to analyze where the time budget is consumed.
- Separate long document summarization/analysis from the synchronous path and hand it off to an asynchronous task queue.
- Mark degraded responses explicitly in the UX to manage user expectations.
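The second operating point, diverting long requests off the synchronous path, can be sketched as a simple router. `estimateTokens`, `enqueueSummaryJob`, the in-memory queue, and the threshold are all illustrative assumptions; a production system would use a durable (for example Redis-backed) queue.

```typescript
// Sketch: route long requests to an async queue instead of the sync path.
// The token estimate, threshold, and in-memory queue are illustrative.

type Job = { id: string; input: string };

const queue: Job[] = []; // stand-in for a durable task queue

function estimateTokens(input: string): number {
  // Rough heuristic: about 4 characters per token for English text.
  return Math.ceil(input.length / 4);
}

function enqueueSummaryJob(input: string): Job {
  const job = { id: `job-${queue.length + 1}`, input };
  queue.push(job);
  return job;
}

const ASYNC_THRESHOLD_TOKENS = 2000; // assumed cutoff for the sync path

export function routeRequest(input: string):
  | { mode: "sync"; input: string }
  | { mode: "async"; jobId: string } {
  if (estimateTokens(input) > ASYNC_THRESHOLD_TOKENS) {
    const job = enqueueSummaryJob(input);
    return { mode: "async", jobId: job.id };
  }
  return { mode: "sync", input };
}
```

The caller returns a job id immediately for the async branch, which is what makes the degraded-UX messaging above possible instead of a silent timeout.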
Pattern 2: Retry based on failure type + circuit breaker
Retrying every failure turns retries into a failure amplifier. Retries should be limited to recoverable failures.
type FailureKind =
  | "TRANSIENT_INFRA"
  | "RATE_LIMIT"
  | "SCHEMA_INVALID"
  | "POLICY_DENIED"
  | "TOOL_PERMISSION"
  | "UNKNOWN";

function shouldRetry(kind: FailureKind, attempt: number) {
  if (attempt >= 2) return false; // at most two retries in total
  if (kind === "TRANSIENT_INFRA") return true;
  if (kind === "RATE_LIMIT") return true;
  return false;
}

// Exponential backoff with jitter, capped at 900 ms.
function nextBackoffMs(attempt: number) {
  return Math.min(200 * 2 ** attempt + Math.floor(Math.random() * 100), 900);
}

// callPrimaryModel, classifyFailure, and sleep are application helpers
// elided here.
export async function resilientInfer(input: string) {
  let attempt = 0;
  while (true) {
    try {
      return await callPrimaryModel(input);
    } catch (e) {
      const kind = classifyFailure(e);
      if (!shouldRetry(kind, attempt)) throw e;
      await sleep(nextBackoffMs(attempt));
      attempt += 1;
    }
  }
}
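The `classifyFailure` helper is elided above. A minimal sketch, assuming the caught error exposes an HTTP-like `status` field and an optional provider `code` (real SDK error shapes differ, so adapt the field names):

```typescript
// Sketch of a failure classifier mapping error shapes to the table above.
// The `status`/`code` fields are assumptions about the error object, not
// a fixed provider contract. FailureKind is repeated for self-containment.

type FailureKind =
  | "TRANSIENT_INFRA"
  | "RATE_LIMIT"
  | "SCHEMA_INVALID"
  | "POLICY_DENIED"
  | "TOOL_PERMISSION"
  | "UNKNOWN";

export function classifyFailure(e: unknown): FailureKind {
  const err = e as { status?: number; code?: string };
  if (err.code === "schema_invalid") return "SCHEMA_INVALID";   // deterministic: do not retry
  if (err.code === "policy_denied") return "POLICY_DENIED";     // deterministic: do not retry
  if (err.status === 401 || err.status === 403) return "TOOL_PERMISSION";
  if (err.status === 429) return "RATE_LIMIT";                  // retry after backoff
  if (err.status !== undefined && err.status >= 500) return "TRANSIENT_INFRA";
  return "UNKNOWN";
}
```

Note that the unknown bucket deliberately falls through to "no retry" in `shouldRetry`, which keeps unclassified failures from becoming silent amplifiers.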
# Example: open the circuit if the failure rate and latency exceed thresholds in a 5-minute window
./ops/circuit/open-if-needed.sh \
--target primary_model \
--window 5m \
--error-rate-threshold 0.08 \
--latency-p95-threshold-ms 1800
Operating points:
- Whether to allow low-quality fallback paths or completely block them when the circuit is open varies depending on the service domain.
- Circuit state changes are immediately propagated to the on-call channel and automatic recovery criteria (half-open probe) are documented.
- Retry requests are tagged separately from original requests, and the amplification rate is computed from those tags.
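The half-open probe mentioned above can be sketched as a small state machine. The failure threshold, cooldown, and injectable clock are illustrative assumptions, not fixed values.

```typescript
// Minimal circuit breaker sketch: closed -> open after consecutive
// failures, open -> half-open after a cooldown, half-open -> closed on a
// successful probe (or back to open on a failed probe).

type CircuitState = "closed" | "open" | "half-open";

export class CircuitBreaker {
  private state: CircuitState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private cooldownMs = 30_000,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // allow a single probe request
    }
    return this.state !== "open";
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  onFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  current(): CircuitState {
    return this.state;
  }
}
```

Whether `canRequest() === false` returns a degraded fallback or a hard error is exactly the domain-specific policy decision noted in the first operating point.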
Failure cases / anti-patterns
Failure scenario: "Secondary failure caused by excessive retries"
Situation:
- Latency increased on the provider side after a new model version was deployed.
- The application used `timeout=4s` and `retry=3` as global settings.
- Average response time rose from 1.2 seconds to 6.7 seconds, worker threads saturated, and the delay spread to internal APIs.
Detection procedure:
- `retry_amplification_ratio` alert (ratio of retries to original requests)
- `queue_depth` surges while `worker_busy_ratio` stays above 90%
- Repeated `infer -> timeout -> retry` pattern in traces
Mitigation procedure:
- Immediately reduce global retries (`3 -> 1`) and shorten timeouts
- Force the circuit breaker open and divert 60% of traffic to the fallback model
- Force long request types (document digests) to an asynchronous queue
Recovery procedure:
- Introduce a failure classifier and prohibit retries for deterministic failures
- Standardize per-stage deadline budgets
- Add retry amplification rate and degraded response rate as weekly SLO review items
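The retry amplification rate added to the SLO review can be computed directly from request logs. A minimal sketch, assuming each log entry carries an `isRetry` tag (the field name and log shape are assumptions for illustration):

```typescript
// Sketch: compute retry_amplification_ratio from tagged request logs.
// Assumes each entry carries an `isRetry` flag; the log shape is
// illustrative, not a fixed schema.

type RequestLog = { requestId: string; isRetry: boolean };

export function retryAmplificationRatio(logs: RequestLog[]): number {
  const originals = logs.filter((l) => !l.isRetry).length;
  const retries = logs.filter((l) => l.isRetry).length;
  if (originals === 0) return 0;
  // 1.0 means every original request produced one retry on average.
  return retries / originals;
}
```

A sustained ratio well above your retry cap's expectation (here, at most 2 retries per request) is the signal that retries have become an amplifier rather than a recovery mechanism.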
Representative antipatterns
- The assumption that "more retries means more safety"
- Leaving the deadline budget undefined by mistaking the provider's SLA for our own SLA
- A fallback model that is activated only during incidents, without routine quality verification
- Relying on client timeouts alone, without a circuit breaker
Checklist
- Have you classified failure types into at least four types and coded a retry policy for each type?
- Is the timeout budget divided by stage based on the entire request SLA?
- Are the quality/cost/policy violation rate of the fallback path measured even in normal times?
- Are circuit breaker open/half-open/close transition criteria documented?
- Is the retry amplification rate (`retry_amplification_ratio`) monitored?
- Are tool-call failures and model failures collected as separate metrics?
- Is there a runbook prepared that diverts synchronous requests to an asynchronous queue in case of failure?
Summary
The key to LLM reliability design is not optimizing the success path but controlling the failure path. Designing retries, timeouts, fallbacks, and circuit breakers around failure types limits the blast radius of failures. Changing the failure classification and time budgets has a greater impact on operational stability than changing prompts.
Next episode preview
The next section deals with cost. Cache, batching, model routing, and token budget are explained not from a simple savings perspective, but as a control loop that satisfies both quality and reliability. In particular, we focus on operational indicators to avoid the typical pitfall of “reducing costs but simultaneously worsening quality/delays.”
Reference links
- OpenAI Developer Docs - Responses API
- OpenTelemetry official documentation
- Martin Fowler - Circuit Breaker
- Blog: LLM Agent Tool Guardrail Design
Series navigation
- Previous post: Part 2. Quality comes from the evaluation loop, not from prompts
- Next post: Part 4. Cost Design: Cache, Batching, Routing, Token Budget