Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection
To catch quality degradation that produces no failures in LLM operations, traces, logs, and quality metrics must be designed as a single observation system.
Series: Why System Engineering matters more than Prompt Engineering
A 12-part series. You are reading Part 6.
- Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Part 2. Quality comes from the evaluation loop, not from prompts
- Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker
- Part 4. Cost Design: Cache, Batching, Routing, Token Budget
- Part 5. Security Design: Prompt Injection, Data Leak, Policy Guard
- Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection (this part)
- Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy
- Part 8. Agent Architecture: Planner/Executor, State Machine, Task Queue
- Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
- Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks
- Part 11. Reference Architecture: End-to-End Operational Design
- Part 12. Organization/Process: Operational Maturity Model and Roadmap
In an LLM system, "failure" and "quality degradation" are not the same event. A request can return HTTP 200 while the user still finds the answer unhelpful, and operators watching only the infrastructure dashboard will miss that change. LLM observability therefore has to be broader than traditional infrastructure monitoring: the inference step, the retrieval step, the policy step, tool execution, and the user's response must be combined into one flow.
Version baseline
- Node.js 20 LTS
- TypeScript 5.8.x
- Next.js 16.1.x
- OpenAI API (Responses API, based on 2026-03 document)
- PostgreSQL 15
- Redis 7
Problem statement
In practice, the most dangerous situation is "no errors, but the service is getting worse." The following patterns are typical.
- The model response success rate is high, but the user follow-up ratio spikes.
- Average latency looks normal, but p95 degrades only for certain tenants or domains.
- Quality drops after cost-saving routing, but the cause cannot be analyzed because the log schema carries no routing information.
- RAG retrieval quality degrades, but because only model-stage metrics are collected, the regression is noticed late.
Practical example A: In-house search chatbot
After deployment, feedback that "answers are incorrect" increased, but infrastructure metrics looked normal. The cause was a change to the retrieval top-k that lowered the ratio of relevant documents. Because there was no span for the retrieval step, the team misjudged it as a model problem.
Practical example B: Automated customer support response
After the policy engine rules were tightened, the refusal rate rose and users repeated their questions. The system kept returning 200 responses, so no failure was detected; the problem surfaced only after CS tickets spiked.
Key concepts
LLM observability requires at least four axes at once.

| Axis | Question it answers | Representative signal | Symptom when the axis is missing |
| --- | --- | --- | --- |
| Path | Where did it slow down? | trace/span latency | Bottleneck sections cannot be identified |
| Output | What was returned? | structured logs, policy results | Causes cannot be reproduced |
| Quality | Was it actually useful to the user? | resolution rate, follow-up ratio | Quality decline stays latent |
| Cost | How much did this quality cost? | token cost per request | Cost spikes are detected late |
An important principle is to connect all signals with the same request ID. Without that connection, the dashboards may look impressive, but decision-making stays slow.
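As a minimal sketch of this principle, the helper below stamps the same correlation keys onto every signal before it is emitted. The names here (RequestContext, withCommonKeys) are illustrative, not part of any real SDK:

```typescript
// Sketch: one set of correlation keys shared by trace, log, and metric emitters.
type RequestContext = {
  requestId: string;
  tenantId: string;
  promptVersion: string;
};

// Stamp every outgoing signal (span attributes, log events, metric labels)
// with the same keys so they can be joined on the request axis later.
function withCommonKeys<T extends Record<string, unknown>>(
  ctx: RequestContext,
  payload: T,
): T & RequestContext {
  return { ...payload, ...ctx };
}

const ctx: RequestContext = {
  requestId: "req_1701",
  tenantId: "tenant_a",
  promptVersion: "assist-v7.1.0",
};

const logEvent = withCommonKeys(ctx, { event: "llm.response.completed" });
const metricLabels = withCommonKeys(ctx, { metric: "latency_ms" });
```

Whatever the concrete emitters look like, the point is that no signal leaves the process without the shared keys attached.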
Practical pattern
Pattern 1: Step-by-step Span standardization
If Span names and properties are different for each team, analysis is impossible. At a minimum, it is recommended to fix the following set of spans per request.
- llm.request
- llm.retrieval
- llm.inference
- llm.policy_check
- llm.tool_execution
- llm.response_render
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("llm-platform");

// retrieveDocs, callModel, and evaluatePolicy are application helpers
// assumed to exist elsewhere in the codebase.
export async function tracedInference(req: {
  requestId: string;
  tenantId: string;
  promptVersion: string;
  modelRoute: string;
}) {
  return tracer.startActiveSpan("llm.request", async (root) => {
    try {
      // Common keys on the root span make traces joinable with logs/metrics.
      root.setAttribute("request.id", req.requestId);
      root.setAttribute("tenant.id", req.tenantId);
      root.setAttribute("prompt.version", req.promptVersion);
      root.setAttribute("model.route", req.modelRoute);

      const retrieval = tracer.startSpan("llm.retrieval");
      const docs = await retrieveDocs(req).finally(() => retrieval.end());

      const infer = tracer.startSpan("llm.inference");
      const output = await callModel(req, docs).finally(() => infer.end());

      const policy = tracer.startSpan("llm.policy_check");
      const decision = await evaluatePolicy(output).finally(() => policy.end());

      root.setAttribute("policy.decision", decision.status);
      return decision;
    } catch (err) {
      // Mark the root span as failed so error traces are searchable.
      root.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      root.end();
    }
  });
}
Operating points:
- Always include tenant.id, prompt.version, and model.route as span attributes.
- Summarize response length, token counts, and the policy decision on the root span to improve searchability.
- Never store raw text that could contain PII in a span.
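The last point can be enforced mechanically before attributes are set. The sketch below is a simplistic illustration (maskForTelemetry and spanSafeSummary are hypothetical helpers; real deployments need patterns tuned to their data):

```typescript
// Sketch: strip obvious PII before any text reaches telemetry.
// The regexes are deliberately naive and only illustrate the shape.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const PHONE = /\+?\d[\d\s-]{7,}\d/g;

function maskForTelemetry(text: string): string {
  return text.replace(EMAIL, "[email]").replace(PHONE, "[phone]");
}

// Store only a masked, length-bounded preview and the original length,
// never the raw user text itself.
function spanSafeSummary(userText: string): { preview: string; length: number } {
  const masked = maskForTelemetry(userText);
  return { preview: masked.slice(0, 80), length: userText.length };
}
```

The summary object is what gets attached to the root span; the original text stays out of the observability pipeline entirely.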
Pattern 2: Manage log schemas as contracts rather than “debug strings”
LLM logs left as free text are of little use. Operational events must be schematized so that quality, security, and cost analysis can all share them.
{
  "event": "llm.response.completed",
  "schema_version": "1.0",
  "timestamp": "2026-03-03T03:42:10.221Z",
  "request_id": "req_1701",
  "tenant_id": "tenant_a",
  "prompt_version": "assist-v7.1.0",
  "model_route": "small->large_fallback",
  "retrieval_top_k": 8,
  "retrieval_hit_ratio": 0.62,
  "input_tokens": 3480,
  "output_tokens": 512,
  "latency_ms": 1320,
  "policy_decision": "allow",
  "resolved": false,
  "user_follow_up_within_120s": true
}
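One way to treat this event as a contract rather than a debug string is a typed interface plus a runtime guard, sketched below. Field names follow the example event; the schema_version field and the policy_decision union values are assumptions for illustration:

```typescript
// Sketch: the log event expressed as a typed contract. A runtime guard
// rejects events that drift from the schema, so downstream quality/cost
// parsers fail loudly instead of silently producing wrong dashboards.
interface LlmResponseCompleted {
  event: "llm.response.completed";
  schema_version: string;
  timestamp: string;
  request_id: string;
  tenant_id: string;
  prompt_version: string;
  model_route: string;
  retrieval_top_k: number;
  retrieval_hit_ratio: number;
  input_tokens: number;
  output_tokens: number;
  latency_ms: number;
  policy_decision: "allow" | "deny" | "review"; // assumed value set
  resolved: boolean;
  user_follow_up_within_120s: boolean;
}

const REQUIRED_KEYS: (keyof LlmResponseCompleted)[] = [
  "event", "schema_version", "timestamp", "request_id", "tenant_id",
  "prompt_version", "model_route", "latency_ms", "policy_decision", "resolved",
];

// Minimal presence check; a real system would validate types as well.
function isValidEvent(raw: Record<string, unknown>): boolean {
  return REQUIRED_KEYS.every((k) => raw[k] !== undefined);
}
```

In practice a schema library (zod, JSON Schema, etc.) would replace the hand-rolled check, but the contract idea is the same.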
# Regression detection: alert when the last 30 minutes' follow-up ratio is 20%+ above the 7-day baseline
./ops/alerts/regression-guard.sh \
--metric user_follow_up_within_120s_ratio \
--window 30m \
--baseline 7d \
--threshold +20% \
--group-by tenant_id,model_route
Operating points:
- Agree on resolved, follow_up, and policy_decision as shared product/platform KPIs.
- Specify the log schema version (schema_version) to prevent parser breakage.
- Alerts grouped by tenant/model_route/prompt_version are more effective than overall averages.
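The comparison the regression-guard script performs can be sketched roughly as follows. RatioSample and regressedGroups are illustrative names, and the numbers are made up:

```typescript
// Sketch: flag groups whose recent-window ratio exceeds the baseline
// by more than a relative threshold (in percent).
interface RatioSample {
  group: string;          // e.g. "tenant_a/small->large_fallback"
  windowRatio: number;    // follow-up ratio over the last 30 minutes
  baselineRatio: number;  // same ratio over the 7-day baseline
}

function regressedGroups(samples: RatioSample[], thresholdPct: number): string[] {
  return samples
    .filter((s) =>
      s.baselineRatio > 0 &&
      ((s.windowRatio - s.baselineRatio) / s.baselineRatio) * 100 > thresholdPct)
    .map((s) => s.group);
}

// tenant_a: 0.29 vs a 0.14 baseline is roughly +107%, far past +20%;
// tenant_b: 0.15 vs 0.14 is about +7% and stays quiet.
const flagged = regressedGroups(
  [
    { group: "tenant_a/small->large_fallback", windowRatio: 0.29, baselineRatio: 0.14 },
    { group: "tenant_b/small", windowRatio: 0.15, baselineRatio: 0.14 },
  ],
  20,
);
```

Grouping before comparing is what keeps the tenant-level regression visible instead of being averaged away.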
Pattern 3: Define a separate SLO for quality regression
If you only have availability SLOs, you will miss quality degradation. LLM services need a dedicated quality SLO.
Example quality SLOs:
- resolved_rate >= 78%
- follow_up_ratio <= 18%
- policy_false_refusal_rate <= 3%
- retrieval_hit_ratio >= 0.55
Operating points:
- Automatically halt deployment-pipeline promotion when a quality SLO is violated.
- Review quality SLOs weekly, separately by domain (payment, refund, delivery).
- Temporarily raise the human-review sampling rate where accuracy and safety conflict.
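A minimal sketch of such a promotion gate, with thresholds mirroring the example SLOs above (the function and field names are illustrative):

```typescript
// Sketch: evaluate the quality SLOs and decide whether a deployment
// may be promoted. Thresholds match the example SLOs in the text.
interface QualityMetrics {
  resolvedRate: number;            // fraction, e.g. 0.81
  followUpRatio: number;
  policyFalseRefusalRate: number;
  retrievalHitRatio: number;
}

function sloViolations(m: QualityMetrics): string[] {
  const violations: string[] = [];
  if (m.resolvedRate < 0.78) violations.push("resolved_rate");
  if (m.followUpRatio > 0.18) violations.push("follow_up_ratio");
  if (m.policyFalseRefusalRate > 0.03) violations.push("policy_false_refusal_rate");
  if (m.retrievalHitRatio < 0.55) violations.push("retrieval_hit_ratio");
  return violations;
}

// Any violated quality SLO blocks pipeline promotion automatically.
function canPromote(m: QualityMetrics): boolean {
  return sloViolations(m).length === 0;
}
```

Returning the list of violated SLOs, not just a boolean, makes the block reason visible in the pipeline logs.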
Failure cases/anti-patterns
Failure scenario: "no errors, but complaints spike"
Situation:
- After deployment on Monday morning, the infrastructure error rate was within the normal range at 0.3%.
- However, user_follow_up_within_120s_ratio rose from 14% to 29%.
- The model_route field was missing from the logs, making it hard to isolate the cause.
Detection procedure:
- Cross-check the increase in CS tickets and the increase in quality indicators
- In trace samples, llm.retrieval latency increases and retrieval_hit_ratio decreases.
- Recent deployment changes reveal a reduced retrieval top-k and a lengthened index refresh interval.
Mitigation procedure:
- Roll retrieval settings back to their previous values.
- Temporarily raise the large-model routing rate for the affected tenants.
- Route follow-up questions to the human-review queue.
Recovery procedure:
- Make model_route and retrieval_hit_ratio required fields in the log schema.
- Integrate quality alerts with deployment alerts.
- Add the "error rate normal + quality degraded" compound scenario to the failure-drill runbook.
Representative antipatterns
- Judging LLM quality from infrastructure metrics alone (5xx, CPU)
- Storing log/trace/quality events separately, without a request ID
- Recording overly sensitive information in spans, increasing security risk
- Running many dashboards but having no rollback trigger criteria
Checklist
- [ ] Are the request-level common keys (request_id, tenant_id, prompt_version) consistently included in traces, logs, and metrics?
- [ ] Are the llm.retrieval, llm.inference, and llm.policy_check spans standardized?
- [ ] Are quality metrics (resolved_rate, follow_up_ratio) operated alongside infrastructure metrics?
- [ ] Is there a log schema versioning and parser compatibility policy?
- [ ] Do regression alerts operate at tenant/model_route/prompt_version granularity?
- [ ] Can deployment promotion be halted, or an automatic rollback triggered, when a quality SLO is violated?
- [ ] Are PII minimization and masking policies applied to observability data?
Summary
The goal of LLM observability is not "showing data" but "fast cause identification and control." For stable operation, trace, log, and quality events must be joined on the same request axis, and regression detection must be wired to deployment control. The key is to build in the premise that quality can collapse even when nothing is failing.
Next episode preview
The next part deals with Context Engineering. We explain how to design RAG, memory, freshness, and multi-tenancy to maintain context quality. In particular, situations where “the model is the same but the answers suddenly start to differ” are analyzed from the perspective of retrieval quality and data life cycle.
Reference links
- OpenAI Developer Docs - Responses API
- OpenTelemetry official documentation
- Martin Fowler - Circuit Breaker
- Blog: Designing LLM Agent Tool Guardrails
Series navigation
- Previous post: Part 5. Security Design: Prompt Injection, Data Leak, Policy Guard
- Next post: Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy