Part 4. Cost Design: Cache, Batching, Routing, Token Budget
LLM costs are determined by how the system controls them, not by the model unit price. This part organizes cache, batching, routing, and token budgets from an operational perspective.
Series: Why System Engineering Matters More Than Prompt Engineering
This series has 12 parts. You are reading Part 4.
- Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Part 2. Quality comes from the evaluation loop, not from prompts
- Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker
- Part 4. Cost Design: Cache, Batching, Routing, Token Budget (current)
- Part 5. Security Design: Prompt Injection, Data Leak, Policy Guard
- Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection
- Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy
- Part 8. Agent architecture: Planner/Executor, state machine, task queue
- Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
- Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks
- Part 11. Reference Architecture: End-to-End Operational Design
- Part 12. Organization/Process: Operational Maturity Model and Roadmap
LLM costs are often reduced to the question of "which model do you use?" In real operation, however, the system structure determines total cost more than the model unit price does. Even with the same model, monthly cost can differ by a factor of 2 to 5 depending on the cache strategy, context length, routing policy, and retry policy. Cost optimization is therefore an architectural design problem, not a purchasing negotiation.
Versions used
- Node.js 20 LTS
- TypeScript 5.8.x
- Next.js 16.1.x
- OpenAI API (Responses API, based on 2026-03 document)
- PostgreSQL 15
- Redis 7
The problem
Cost failures in operation tend to follow recurring patterns.
- Attempting to secure “accuracy” by attaching an unlimited number of contexts causes an explosion in token usage.
- Requests that a small model could handle are routed wholesale to a large model, inflating the unit price.
- When a failed call is retried, cost is amplified by resending the same long prompt.
- Even if there is a cache, the hit rate is low due to weak key design, and as a result, the number of calls does not decrease.
Practical example A: Customer support FAQ chatbot
Although 40% of all requests concentrated on the top 30 questions, retrieval + generation was performed every time without a cache. Even though the response quality stayed the same, monthly costs kept climbing. The cause is not the model but the system's failure to recognize repeated requests.
Practical example B: Internal document analysis API
Document summary requests always attached the entire original text, averaging over 20k input tokens per request. The same accuracy could in fact be maintained through section-level splitting and summary recombination, but without that boundary design both cost and latency worsened.
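The section-level approach mentioned above can be sketched as a two-stage map/reduce, assuming a hypothetical `summarize(text)` call that wraps the model. Each section is summarized independently, and the partial summaries are then summarized once more:

```typescript
// Two-stage summarization: summarize each section (map), then
// summarize the concatenated section summaries (reduce). Input tokens
// per call stay bounded by the section size instead of the full document.
export async function summarizeDocument(
  sections: string[],
  summarize: (text: string) => Promise<string>
): Promise<string> {
  const partials = await Promise.all(sections.map((section) => summarize(section)));
  return summarize(partials.join("\n"));
}
```

The trade-off is one extra model call for the recombination step, which is usually far cheaper than repeatedly paying for the full 20k-token context.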
Key concepts
To control costs, you first need to break down the unit economics.

| Item | Question | Representative metric | Control lever |
| --- | --- | --- | --- |
| Input tokens | Why did the input get longer? | input_tokens/request | Context compression, RAG top-k |
| Output tokens | Why did the output get longer? | output_tokens/request | Response length policy, format enforcement |
| Model unit price | Why was the expensive model chosen? | model_mix_ratio | Routing policy, difficulty classification |
| Call count | Why is the same request processed repeatedly? | calls/session | Cache, idempotency keys, retry policy |
| Failure cost | Why do failures accumulate cost? | retry_token_cost | Failure classification, retry limits |
The key is not “use the lowest unit price model.” A control loop that satisfies both quality (SLO) and cost (SLA) must be designed.
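To make the unit-economics breakdown concrete, the cost of a single request can be computed from token counts and per-token prices. A minimal sketch; the prices below are hypothetical placeholders, not actual provider pricing:

```typescript
type Usage = { inputTokens: number; outputTokens: number };
type Price = { inputPer1K: number; outputPer1K: number }; // USD per 1K tokens

// Cost of one request = input tokens * input price + output tokens * output price.
export function costUsd(usage: Usage, price: Price): number {
  return (
    (usage.inputTokens / 1000) * price.inputPer1K +
    (usage.outputTokens / 1000) * price.outputPer1K
  );
}

// Hypothetical prices for illustration only.
const small: Price = { inputPer1K: 0.0005, outputPer1K: 0.0015 };
const large: Price = { inputPer1K: 0.005, outputPer1K: 0.015 };

// The same 20k-input / 600-output request costs ~10x more on the large
// model, which is why routing matters more than unit-price negotiation.
const usage: Usage = { inputTokens: 20000, outputTokens: 600 };
const ratio = costUsd(usage, large) / costUsd(usage, small); // ~10x
```

Tracking these per-request numbers by task type is what makes the control levers in the table actionable.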
Practical pattern
Pattern 1: Enforce token budget per request
The policy of "put in more for a better answer" is unsustainable in operation. Specify maximum input/output tokens per request purpose, and when the budget is exceeded, switch to compression or staged processing.
// Per-task token budget with an explicit strategy for overflow.
type BudgetPolicy = {
  inputMax: number;
  outputMax: number;
  overflowStrategy: "truncate" | "summarize" | "async";
};

const budgetByTask: Record<string, BudgetPolicy> = {
  faq_answer: { inputMax: 4000, outputMax: 600, overflowStrategy: "summarize" },
  policy_check: { inputMax: 2500, outputMax: 300, overflowStrategy: "truncate" },
  long_report: { inputMax: 8000, outputMax: 1200, overflowStrategy: "async" },
};

// Returns whether the request fits its budget; if not, the caller must
// apply the task's overflow strategy before calling the model.
export function enforceBudget(task: string, inputTokens: number) {
  const policy = budgetByTask[task] ?? budgetByTask.faq_answer;
  if (inputTokens <= policy.inputMax) return { allowed: true, strategy: "none" as const };
  return {
    allowed: false,
    strategy: policy.overflowStrategy,
    overBy: inputTokens - policy.inputMax,
  };
}
Operating points:
- Aggregate the token-budget violation rate per service to surface context design flaws.
- For requests switched to `async`, clearly show the user the expected completion time.
- Reevaluate budget policies periodically, but do not expand them automatically based on temporary campaign traffic.
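Enforcing a budget requires a token estimate before the call. A minimal sketch using the rough heuristic of ~4 characters per token for English text; a production system should use the provider's tokenizer instead:

```typescript
// Rough heuristic: ~4 characters per token for English text.
// Only an approximation -- use the provider's tokenizer for accuracy.
export function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// "truncate" overflow strategy: hard-cut the input at the budget.
// "summarize" and "async" strategies would route elsewhere instead.
export function truncateToBudget(text: string, maxTokens: number): string {
  const maxChars = maxTokens * 4;
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}
```

Truncation is the cheapest strategy but can cut mid-sentence; it fits tasks like `policy_check` where the relevant content is front-loaded.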
Pattern 2: Combined Cache + Routing Design
Operating the cache and model routing as separate concerns limits the effect of both. Cache lookup before routing and response storage after routing should share one consistent key scheme.
type Route = "small" | "large";

// hash/normalize, the cache client, classifyDifficulty, and the model
// clients are assumed to be defined elsewhere.
function buildCacheKey(input: string, tenantId: string, policyVersion: string) {
  return hash(`${tenantId}:${policyVersion}:${normalize(input)}`);
}

export async function answer(input: string, tenantId: string) {
  const policyVersion = "v3";
  const key = buildCacheKey(input, tenantId, policyVersion);

  // 1) Cache lookup before any routing decision.
  const hit = await cache.get(key);
  if (hit) return { source: "cache", ...hit };

  // 2) Route by difficulty: only "hard" requests go to the large model.
  const route: Route = classifyDifficulty(input) === "hard" ? "large" : "small";
  const model = route === "large" ? largeModel : smallModel;
  const result = await model.generate(input);

  // 3) Store after routing; expensive results get a longer TTL.
  await cache.set(key, result, { ttlSec: route === "large" ? 3600 : 900 });
  return { source: route, ...result };
}
Operating points:
- Prevent contamination by including the tenant/policy version in the cache key.
- Give high-cost model results a longer TTL to improve cost efficiency.
- Don’t just look at the cache hit rate, but also track the “amount of tokens saved by cache.”
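The `buildCacheKey` sketch above relies on `normalize` and `hash` helpers. One minimal implementation, assuming lowercase/whitespace normalization is acceptable for the workload (over-aggressive normalization risks returning cached answers for genuinely different questions):

```typescript
import { createHash } from "node:crypto";

// Map trivially different phrasings of the same request to one key.
export function normalize(input: string): string {
  return input.trim().toLowerCase().replace(/\s+/g, " ");
}

// Stable, fixed-length key component for cache storage.
export function hash(s: string): string {
  return createHash("sha256").update(s).digest("hex");
}
```

Normalization strength is a tuning knob: the stricter it is, the higher the hit rate, but the greater the risk of serving a cached answer to a semantically different question.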
Pattern 3: Automate batch processing and cost alerts
Collecting tasks that do not require real-time responses and processing them in batches greatly reduces unit cost and billing volatility.
# Example: batch processing at 5-minute intervals
./jobs/llm-batch-runner.sh \
--queue report_generation \
--max-batch-size 120 \
--max-total-input-tokens 300000 \
--model small-first \
--fallback large-on-fail
# Cost alert: notify when hourly cost exceeds the baseline by 30%
./alerts/cost-guard.sh \
--metric usd_per_hour \
--window 1h \
--baseline 14d \
--threshold +30%
Operating points:
- The batch size is determined by considering not only throughput but also reprocessing costs in case of failure.
- Batch retries are separated into individual work units to avoid re-executing the entire batch.
- Cost alerts should be viewed in conjunction with delay/quality indicators to reduce false positives.
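The "separate retries into individual work units" point can be sketched as follows. `processItem` is a hypothetical single-item worker; failed item ids go to a retry queue instead of forcing the whole batch to re-run:

```typescript
type ItemResult<T> =
  | { id: string; ok: true; value: T }
  | { id: string; ok: false; error: string };

// Run every item, record per-item outcomes, and return only the failed
// ids for re-queueing -- one failure never re-executes the whole batch.
export async function runBatch<T>(
  items: { id: string; payload: string }[],
  processItem: (payload: string) => Promise<T>
): Promise<{ results: ItemResult<T>[]; retryQueue: string[] }> {
  const results: ItemResult<T>[] = [];
  for (const item of items) {
    try {
      results.push({ id: item.id, ok: true, value: await processItem(item.payload) });
    } catch (e) {
      results.push({ id: item.id, ok: false, error: String(e) });
    }
  }
  const retryQueue = results.filter((r) => !r.ok).map((r) => r.id);
  return { results, retryQueue };
}
```

This keeps the reprocessing cost of a failure proportional to the failed items, not the batch size.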
Failure cases/anti-patterns
Failure Scenario: “Surge in Costs due to Fallback Flood”
Situation:
- A temporary failure of the primary small model pushed the fallback rate from 8% to 64%.
- The fallback model's per-call unit cost was 6x higher, and retries activated at the same time, so hourly cost rose 4.3x.
- Perceived quality actually dropped while cost alone went up.
Detection procedure:
- Detect the spike in large-model share via `model_mix_ratio`
- Simultaneous alerts on `retry_token_cost` and `usd_per_hour`
- Confirm the `small timeout -> large fallback -> retry` loop in traces
Mitigation Procedures:
- Apply an upper limit on fallback concurrency (cap per tenant)
- Immediately switch non-critical request types to degraded mode
- Restrict the retry policy to `TRANSIENT_INFRA` errors only
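The per-tenant fallback cap from the mitigation list can be as simple as an in-memory counter. A sketch; a multi-instance deployment would need a shared store such as Redis:

```typescript
// Caps concurrent fallback calls per tenant. When the cap is reached,
// callers should serve a degraded response instead of escalating to
// the large model.
export class FallbackCap {
  private active = new Map<string, number>();
  constructor(private readonly maxPerTenant: number) {}

  tryAcquire(tenantId: string): boolean {
    const n = this.active.get(tenantId) ?? 0;
    if (n >= this.maxPerTenant) return false;
    this.active.set(tenantId, n + 1);
    return true;
  }

  release(tenantId: string): void {
    const n = this.active.get(tenantId) ?? 0;
    this.active.set(tenantId, Math.max(0, n - 1));
  }
}
```

The cap turns a fallback flood into a bounded cost: at most `maxPerTenant` expensive calls per tenant are in flight, regardless of how badly the small model is failing.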
Recovery Procedure:
- Add a cost protection gate to the fallback activation conditions
- Reflect a "quality-to-cost upper limit" rule in the routing policy
- Add a `cost per resolved request` metric to the weekly review
Representative antipatterns
- Approaching cost issues only through model unit-price negotiation
- Aggressively optimizing only cache/routing without quality metrics
- Operating without observing retry token cost
- Cache keys without tenant separation, leading to cross-tenant data contamination
Checklist
- Is the input/output token budget for each request type defined in code?
- Does the cache key reflect tenant/policy version/normalization input?
- Are model routing results and quality indicators collected together?
- Are there fallback rate caps and cost protection policies?
- Is the retry cost (`retry_token_cost`) monitored separately?
- Is a reprocessing strategy for partial failures defined for batch jobs?
- Is `cost per resolved request` tracked as a key operating metric?
Summary
LLM cost optimization is not "buying a cheap model" but control-plane design. Token budget, cache, routing, batching, and alerts must be tied into one loop that secures both quality and cost. Prompt tuning is only an auxiliary means of reducing cost; sustainable savings come from system policy.
Next episode preview
The next section deals with security. It explains prompt injection, data leakage, permission policies, and tool sandboxing by linking them into a single threat model. In particular, it deals with how to block situations where “the system is dangerous even if the model appears safe” from the perspective of the policy engine and execution isolation.
Reference links
- OpenAI Developer Docs - Responses API
- OpenTelemetry official documentation
- Martin Fowler - Circuit Breaker
- Blog: LLM Agent Tool Guardrail design