Part 4. Cost Design: Cache, Batching, Routing, Token Budget
LLM costs are determined by how the system controls them, not by the model unit price. This part organizes cache, batching, routing, and token budgets from an operational perspective.
Series: Why System Engineering Matters More Than Prompt Engineering
This series has 12 parts. You are reading Part 4.
- Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Part 2. Quality comes from the evaluation loop, not from prompts
- Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker
- Part 4. Cost Design: Cache, Batching, Routing, Token Budget (current)
- Part 5. Security Design: Prompt Injection, Data Leak, Policy Guard
- Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection
- Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy
- Part 8. Agent architecture: Planner/Executor, state machine, task queue
- Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
- Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks
- Part 11. Reference Architecture: End-to-End Operational Design
- Part 12. Organization/Process: Operational Maturity Model and Roadmap
LLM costs are often reduced to the question of "which model do you use?" In real operation, however, the system structure determines total cost more than the model unit price does. Even with the same model, monthly cost can differ by a factor of 2 to 5 depending on the cache strategy, context length, routing policy, and retry policy. Cost optimization is therefore an architectural design problem, not a purchasing negotiation.
Versions used
- Node.js 20 LTS
- TypeScript 5.8.x
- Next.js 16.1.x
- OpenAI API (Responses API, based on 2026-03 document)
- PostgreSQL 15
- Redis 7
The problem
Cost failures in operation tend to follow recurring patterns.
- Attempting to secure “accuracy” by attaching an unlimited number of contexts causes an explosion in token usage.
- Requests that a small model could handle are routed wholesale to a large model, inflating the unit price.
- When a failed call is retried, cost is amplified by resending the same long prompt.
- Even if there is a cache, the hit rate is low due to weak key design, and as a result, the number of calls does not decrease.
Practical example A: Customer support FAQ chatbot
Although 40% of all requests concentrated on the top 30 questions, retrieval + generation was performed every time without a cache. Even though the response quality stayed the same, monthly costs kept climbing. The cause is not the model but the system's failure to recognize repeated requests.
Practical example B: Internal document analysis API
Document summary requests always attached the entire original text, averaging over 20k input tokens per request. The same accuracy could in fact be maintained through section-level splitting and summary recombination, but without that boundary design both cost and latency worsened.
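The section-level approach mentioned above can be sketched as a two-stage map/reduce, assuming a hypothetical `summarize(text)` call that wraps the model. Each section is summarized independently, and the partial summaries are then summarized once more:

```typescript
// Two-stage summarization: summarize each section (map), then
// summarize the concatenated section summaries (reduce). Input tokens
// per call stay bounded by the section size instead of the full document.
export async function summarizeDocument(
  sections: string[],
  summarize: (text: string) => Promise<string>
): Promise<string> {
  const partials = await Promise.all(sections.map((section) => summarize(section)));
  return summarize(partials.join("\n"));
}
```

The trade-off is one extra model call for the recombination step, which is usually far cheaper than repeatedly paying for the full 20k-token context.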
Key concepts
To control costs, you first need to break down the unit economics.

| Item | Question | Representative metric | Control lever |
| --- | --- | --- | --- |
| Input tokens | Why did the input get longer? | input_tokens/request | Context compression, RAG top-k |
| Output tokens | Why did the output get longer? | output_tokens/request | Response length policy, format enforcement |
| Model unit price | Why was the expensive model chosen? | model_mix_ratio | Routing policy, difficulty classification |
| Call count | Why is the same request processed repeatedly? | calls/session | Cache, idempotency keys, retry policy |
| Failure cost | Why do failures accumulate cost? | retry_token_cost | Failure classification, retry limits |
The key is not “use the lowest unit price model.” A control loop that satisfies both quality (SLO) and cost (SLA) must be designed.
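To make the unit-economics breakdown concrete, the cost of a single request can be computed from token counts and per-token prices. A minimal sketch; the prices below are hypothetical placeholders, not actual provider pricing:

```typescript
type Usage = { inputTokens: number; outputTokens: number };
type Price = { inputPer1K: number; outputPer1K: number }; // USD per 1K tokens

// Cost of one request = input tokens * input price + output tokens * output price.
export function costUsd(usage: Usage, price: Price): number {
  return (
    (usage.inputTokens / 1000) * price.inputPer1K +
    (usage.outputTokens / 1000) * price.outputPer1K
  );
}

// Hypothetical prices for illustration only.
const small: Price = { inputPer1K: 0.0005, outputPer1K: 0.0015 };
const large: Price = { inputPer1K: 0.005, outputPer1K: 0.015 };

// The same 20k-input / 600-output request costs ~10x more on the large
// model, which is why routing matters more than unit-price negotiation.
const usage: Usage = { inputTokens: 20000, outputTokens: 600 };
const ratio = costUsd(usage, large) / costUsd(usage, small); // ~10x
```

Tracking these per-request numbers by task type is what makes the control levers in the table actionable.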
Practical pattern
Pattern 1: Enforce token budget per request
The policy of "put in more for a better answer" is unsustainable in operation. Specify maximum input/output tokens per request purpose, and when the budget is exceeded, switch to compression or staged processing.
// Per-task token budget with an explicit strategy for overflow.
type BudgetPolicy = {
  inputMax: number;
  outputMax: number;
  overflowStrategy: "truncate" | "summarize" | "async";
};

const budgetByTask: Record<string, BudgetPolicy> = {
  faq_answer: { inputMax: 4000, outputMax: 600, overflowStrategy: "summarize" },
  policy_check: { inputMax: 2500, outputMax: 300, overflowStrategy: "truncate" },
  long_report: { inputMax: 8000, outputMax: 1200, overflowStrategy: "async" },
};

// Returns whether the request fits its budget; if not, the caller must
// apply the task's overflow strategy before calling the model.
export function enforceBudget(task: string, inputTokens: number) {
  const policy = budgetByTask[task] ?? budgetByTask.faq_answer;
  if (inputTokens <= policy.inputMax) return { allowed: true, strategy: "none" as const };
  return {
    allowed: false,
    strategy: policy.overflowStrategy,
    overBy: inputTokens - policy.inputMax,
  };
}
Operating points:
- Aggregate the token-budget violation rate per service to surface context design flaws.
- For requests switched to `async`, clearly show the user the expected completion time.
- Reevaluate budget policies periodically, but do not expand them automatically based on temporary campaign traffic.
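Enforcing a budget requires a token estimate before the call. A minimal sketch using the rough heuristic of ~4 characters per token for English text; a production system should use the provider's tokenizer instead:

```typescript
// Rough heuristic: ~4 characters per token for English text.
// Only an approximation -- use the provider's tokenizer for accuracy.
export function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// "truncate" overflow strategy: hard-cut the input at the budget.
// "summarize" and "async" strategies would route elsewhere instead.
export function truncateToBudget(text: string, maxTokens: number): string {
  const maxChars = maxTokens * 4;
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}
```

Truncation is the cheapest strategy but can cut mid-sentence; it fits tasks like `policy_check` where the relevant content is front-loaded.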
Pattern 2: Combined Cache + Routing Design
Operating the cache and model routing as separate concerns limits the effect of both. Cache lookup before routing and response storage after routing should share one consistent key scheme.
type Route = "small" | "large";

// hash/normalize, the cache client, classifyDifficulty, and the model
// clients are assumed to be defined elsewhere.
function buildCacheKey(input: string, tenantId: string, policyVersion: string) {
  return hash(`${tenantId}:${policyVersion}:${normalize(input)}`);
}

export async function answer(input: string, tenantId: string) {
  const policyVersion = "v3";
  const key = buildCacheKey(input, tenantId, policyVersion);

  // 1) Cache lookup before any routing decision.
  const hit = await cache.get(key);
  if (hit) return { source: "cache", ...hit };

  // 2) Route by difficulty: only "hard" requests go to the large model.
  const route: Route = classifyDifficulty(input) === "hard" ? "large" : "small";
  const model = route === "large" ? largeModel : smallModel;
  const result = await model.generate(input);

  // 3) Store after routing; expensive results get a longer TTL.
  await cache.set(key, result, { ttlSec: route === "large" ? 3600 : 900 });
  return { source: route, ...result };
}
Operating points:
- Prevent contamination by including the tenant/policy version in the cache key.
- Give high-cost model results a longer TTL to improve cost efficiency.
- Don’t just look at the cache hit rate, but also track the “amount of tokens saved by cache.”
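The `buildCacheKey` sketch above relies on `normalize` and `hash` helpers. One minimal implementation, assuming lowercase/whitespace normalization is acceptable for the workload (over-aggressive normalization risks returning cached answers for genuinely different questions):

```typescript
import { createHash } from "node:crypto";

// Map trivially different phrasings of the same request to one key.
export function normalize(input: string): string {
  return input.trim().toLowerCase().replace(/\s+/g, " ");
}

// Stable, fixed-length key component for cache storage.
export function hash(s: string): string {
  return createHash("sha256").update(s).digest("hex");
}
```

Normalization strength is a tuning knob: the stricter it is, the higher the hit rate, but the greater the risk of serving a cached answer to a semantically different question.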
Pattern 3: Automate batch processing and cost alerts
Collecting tasks that do not require real-time responses and processing them in batches greatly reduces unit cost and billing volatility.
# Example: batch processing at 5-minute intervals
./jobs/llm-batch-runner.sh \
--queue report_generation \
--max-batch-size 120 \
--max-total-input-tokens 300000 \
--model small-first \
--fallback large-on-fail
# Cost alert: notify when hourly cost exceeds the baseline by 30%
./alerts/cost-guard.sh \
--metric usd_per_hour \
--window 1h \
--baseline 14d \
--threshold +30%
Operating points:
- The batch size is determined by considering not only throughput but also reprocessing costs in case of failure.
- Batch retries are separated into individual work units to avoid re-executing the entire batch.
- Cost alerts should be viewed in conjunction with delay/quality indicators to reduce false positives.
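The "separate retries into individual work units" point can be sketched as follows. `processItem` is a hypothetical single-item worker; failed item ids go to a retry queue instead of forcing the whole batch to re-run:

```typescript
type ItemResult<T> =
  | { id: string; ok: true; value: T }
  | { id: string; ok: false; error: string };

// Run every item, record per-item outcomes, and return only the failed
// ids for re-queueing -- one failure never re-executes the whole batch.
export async function runBatch<T>(
  items: { id: string; payload: string }[],
  processItem: (payload: string) => Promise<T>
): Promise<{ results: ItemResult<T>[]; retryQueue: string[] }> {
  const results: ItemResult<T>[] = [];
  for (const item of items) {
    try {
      results.push({ id: item.id, ok: true, value: await processItem(item.payload) });
    } catch (e) {
      results.push({ id: item.id, ok: false, error: String(e) });
    }
  }
  const retryQueue = results.filter((r) => !r.ok).map((r) => r.id);
  return { results, retryQueue };
}
```

This keeps the reprocessing cost of a failure proportional to the failed items, not the batch size.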
Failure cases/anti-patterns
Failure Scenario: “Surge in Costs due to Fallback Flood”
Situation:
- A temporary failure of the primary small model pushed the fallback rate from 8% to 64%.
- The fallback model's per-call unit cost was 6x higher, and retries activated at the same time, so hourly cost rose 4.3x.
- Perceived quality actually dropped while cost alone went up.
Detection procedure:
- Detect the spike in large-model share via `model_mix_ratio`
- Simultaneous alerts on `retry_token_cost` and `usd_per_hour`
- Confirm the `small timeout -> large fallback -> retry` loop in traces
Mitigation Procedures:
- Apply an upper limit on fallback concurrency (cap per tenant)
- Immediately switch non-critical request types to degraded mode
- Restrict the retry policy to `TRANSIENT_INFRA` errors only
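The per-tenant fallback cap from the mitigation list can be as simple as an in-memory counter. A sketch; a multi-instance deployment would need a shared store such as Redis:

```typescript
// Caps concurrent fallback calls per tenant. When the cap is reached,
// callers should serve a degraded response instead of escalating to
// the large model.
export class FallbackCap {
  private active = new Map<string, number>();
  constructor(private readonly maxPerTenant: number) {}

  tryAcquire(tenantId: string): boolean {
    const n = this.active.get(tenantId) ?? 0;
    if (n >= this.maxPerTenant) return false;
    this.active.set(tenantId, n + 1);
    return true;
  }

  release(tenantId: string): void {
    const n = this.active.get(tenantId) ?? 0;
    this.active.set(tenantId, Math.max(0, n - 1));
  }
}
```

The cap turns a fallback flood into a bounded cost: at most `maxPerTenant` expensive calls per tenant are in flight, regardless of how badly the small model is failing.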
Recovery Procedure:
- Add a cost protection gate to the fallback activation conditions
- Reflect a "quality-to-cost upper limit" rule in the routing policy
- Add a `cost per resolved request` metric to the weekly review
Representative antipatterns
- Approaching cost issues only through model unit-price negotiation
- Aggressively optimizing only cache/routing without quality metrics
- Operating without observing retry token cost
- Cache keys without tenant separation, leading to cross-tenant data contamination
Checklist
- Is the input/output token budget for each request type defined in code?
- Does the cache key reflect tenant/policy version/normalization input?
- Are model routing results and quality indicators collected together?
- Are there fallback rate caps and cost protection policies?
- Is the retry cost (`retry_token_cost`) monitored separately?
- Is a reprocessing strategy for partial failures defined for batch jobs?
- Is `cost per resolved request` tracked as a key operating metric?
Summary
LLM cost optimization is not "buying a cheap model" but control-plane design. Token budget, cache, routing, batching, and alerts must be tied into one loop that secures both quality and cost. Prompt tuning is only an auxiliary means of reducing cost; sustainable savings come from system policy.
Next episode preview
The next section deals with security. It explains prompt injection, data leakage, permission policies, and tool sandboxing by linking them into a single threat model. In particular, it deals with how to block situations where “the system is dangerous even if the model appears safe” from the perspective of the policy engine and execution isolation.
Reference links
- OpenAI Developer Docs - Responses API
- OpenTelemetry official documentation
- Martin Fowler - Circuit Breaker
- Blog: LLM Agent Tool Guardrail design