Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection
To catch quality degradation that produces no failures in LLM operations, traces, logs, and quality metrics must be designed as a single observation system.
Series: Why System Engineering matters more than Prompt Engineering
A 12-part series. You are reading Part 6.
- Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Part 2. Quality comes from the evaluation loop, not from prompts
- Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker
- Part 4. Cost Design: Cache, Batching, Routing, Token Budget
- Part 5. Security Design: Prompt Injection, Data Leak, Policy Guard
- Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection (this part)
- Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy
- Part 8. Agent Architecture: Planner/Executor, State Machine, Task Queue
- Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
- Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks
- Part 11. Reference Architecture: End-to-End Operational Design
- Part 12. Organization/Process: Operational Maturity Model and Roadmap
In an LLM system, "failure" and "quality degradation" are not the same event. A request can return HTTP 200 while the user still finds the answer unhelpful, and operators watching only the infrastructure dashboard will miss that change. LLM observability therefore has to be broader than traditional infrastructure monitoring: the inference step, the retrieval step, the policy step, tool execution, and the user's response must be combined into one flow.
Version baseline
- Node.js 20 LTS
- TypeScript 5.8.x
- Next.js 16.1.x
- OpenAI API (Responses API, based on 2026-03 document)
- PostgreSQL 15
- Redis 7
Problem statement
In practice, the most dangerous situation is "no errors, but the service is getting worse." The following patterns are typical.
- The model response success rate is high, but the user follow-up ratio spikes.
- Average latency looks normal, but p95 degrades only for certain tenants or domains.
- Quality drops after cost-saving routing, but the cause cannot be analyzed because the log schema carries no routing information.
- RAG retrieval quality degrades, but because only model-stage metrics are collected, the regression is noticed late.
Practical example A: In-house search chatbot
After deployment, feedback that "answers are incorrect" increased, but infrastructure metrics looked normal. The cause was a change to the retrieval top-k that lowered the ratio of relevant documents. Because there was no span for the retrieval step, the team misjudged it as a model problem.
Practical example B: Automated customer support response
After the policy engine rules were tightened, the refusal rate rose and users repeated their questions. The system kept returning 200 responses, so no failure was detected; the problem surfaced only after CS tickets spiked.
Key concepts
LLM observability requires at least four axes at once.

| Axis | Question it answers | Representative signal | Symptom when the axis is missing |
| --- | --- | --- | --- |
| Path | Where did it slow down? | trace/span latency | Bottleneck sections cannot be identified |
| Output | What was returned? | structured logs, policy results | Causes cannot be reproduced |
| Quality | Was it actually useful to the user? | resolution rate, follow-up ratio | Quality decline stays latent |
| Cost | How much did this quality cost? | token cost per request | Cost spikes are detected late |
An important principle is to connect all signals with the same request ID. Without that connection, the dashboards may look impressive, but decision-making stays slow.
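As a minimal sketch of this principle, the helper below stamps the same correlation keys onto every signal before it is emitted. The names here (RequestContext, withCommonKeys) are illustrative, not part of any real SDK:

```typescript
// Sketch: one set of correlation keys shared by trace, log, and metric emitters.
type RequestContext = {
  requestId: string;
  tenantId: string;
  promptVersion: string;
};

// Stamp every outgoing signal (span attributes, log events, metric labels)
// with the same keys so they can be joined on the request axis later.
function withCommonKeys<T extends Record<string, unknown>>(
  ctx: RequestContext,
  payload: T,
): T & RequestContext {
  return { ...payload, ...ctx };
}

const ctx: RequestContext = {
  requestId: "req_1701",
  tenantId: "tenant_a",
  promptVersion: "assist-v7.1.0",
};

const logEvent = withCommonKeys(ctx, { event: "llm.response.completed" });
const metricLabels = withCommonKeys(ctx, { metric: "latency_ms" });
```

Whatever the concrete emitters look like, the point is that no signal leaves the process without the shared keys attached.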
Practical pattern
Pattern 1: Step-by-step Span standardization
If Span names and properties are different for each team, analysis is impossible. At a minimum, it is recommended to fix the following set of spans per request.
- llm.request
- llm.retrieval
- llm.inference
- llm.policy_check
- llm.tool_execution
- llm.response_render
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("llm-platform");

// retrieveDocs, callModel, and evaluatePolicy are application helpers
// assumed to exist elsewhere in the codebase.
export async function tracedInference(req: {
  requestId: string;
  tenantId: string;
  promptVersion: string;
  modelRoute: string;
}) {
  return tracer.startActiveSpan("llm.request", async (root) => {
    try {
      // Common keys on the root span make traces joinable with logs/metrics.
      root.setAttribute("request.id", req.requestId);
      root.setAttribute("tenant.id", req.tenantId);
      root.setAttribute("prompt.version", req.promptVersion);
      root.setAttribute("model.route", req.modelRoute);

      const retrieval = tracer.startSpan("llm.retrieval");
      const docs = await retrieveDocs(req).finally(() => retrieval.end());

      const infer = tracer.startSpan("llm.inference");
      const output = await callModel(req, docs).finally(() => infer.end());

      const policy = tracer.startSpan("llm.policy_check");
      const decision = await evaluatePolicy(output).finally(() => policy.end());

      root.setAttribute("policy.decision", decision.status);
      return decision;
    } catch (err) {
      // Mark the root span as failed so error traces are searchable.
      root.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      root.end();
    }
  });
}
Operating points:
- Always include tenant.id, prompt.version, and model.route as span attributes.
- Summarize response length, token counts, and the policy decision on the root span to improve searchability.
- Never store raw text that could contain PII in a span.
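The last point can be enforced mechanically before attributes are set. The sketch below is a simplistic illustration (maskForTelemetry and spanSafeSummary are hypothetical helpers; real deployments need patterns tuned to their data):

```typescript
// Sketch: strip obvious PII before any text reaches telemetry.
// The regexes are deliberately naive and only illustrate the shape.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const PHONE = /\+?\d[\d\s-]{7,}\d/g;

function maskForTelemetry(text: string): string {
  return text.replace(EMAIL, "[email]").replace(PHONE, "[phone]");
}

// Store only a masked, length-bounded preview and the original length,
// never the raw user text itself.
function spanSafeSummary(userText: string): { preview: string; length: number } {
  const masked = maskForTelemetry(userText);
  return { preview: masked.slice(0, 80), length: userText.length };
}
```

The summary object is what gets attached to the root span; the original text stays out of the observability pipeline entirely.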
Pattern 2: Manage log schemas as contracts rather than “debug strings”
LLM logs left as free text are of little use. Operational events must be schematized so that quality, security, and cost analysis can all share them.
{
  "event": "llm.response.completed",
  "schema_version": "1.0",
  "timestamp": "2026-03-03T03:42:10.221Z",
  "request_id": "req_1701",
  "tenant_id": "tenant_a",
  "prompt_version": "assist-v7.1.0",
  "model_route": "small->large_fallback",
  "retrieval_top_k": 8,
  "retrieval_hit_ratio": 0.62,
  "input_tokens": 3480,
  "output_tokens": 512,
  "latency_ms": 1320,
  "policy_decision": "allow",
  "resolved": false,
  "user_follow_up_within_120s": true
}
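One way to treat this event as a contract rather than a debug string is a typed interface plus a runtime guard, sketched below. Field names follow the example event; the schema_version field and the policy_decision union values are assumptions for illustration:

```typescript
// Sketch: the log event expressed as a typed contract. A runtime guard
// rejects events that drift from the schema, so downstream quality/cost
// parsers fail loudly instead of silently producing wrong dashboards.
interface LlmResponseCompleted {
  event: "llm.response.completed";
  schema_version: string;
  timestamp: string;
  request_id: string;
  tenant_id: string;
  prompt_version: string;
  model_route: string;
  retrieval_top_k: number;
  retrieval_hit_ratio: number;
  input_tokens: number;
  output_tokens: number;
  latency_ms: number;
  policy_decision: "allow" | "deny" | "review"; // assumed value set
  resolved: boolean;
  user_follow_up_within_120s: boolean;
}

const REQUIRED_KEYS: (keyof LlmResponseCompleted)[] = [
  "event", "schema_version", "timestamp", "request_id", "tenant_id",
  "prompt_version", "model_route", "latency_ms", "policy_decision", "resolved",
];

// Minimal presence check; a real system would validate types as well.
function isValidEvent(raw: Record<string, unknown>): boolean {
  return REQUIRED_KEYS.every((k) => raw[k] !== undefined);
}
```

In practice a schema library (zod, JSON Schema, etc.) would replace the hand-rolled check, but the contract idea is the same.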
# Regression detection: alert when the last 30 minutes' follow-up ratio is 20%+ above the 7-day baseline
./ops/alerts/regression-guard.sh \
--metric user_follow_up_within_120s_ratio \
--window 30m \
--baseline 7d \
--threshold +20% \
--group-by tenant_id,model_route
Operating points:
- Agree on resolved, follow_up, and policy_decision as shared product/platform KPIs.
- Specify the log schema version (schema_version) to prevent parser breakage.
- Alerts grouped by tenant/model_route/prompt_version are more effective than overall averages.
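The comparison the regression-guard script performs can be sketched roughly as follows. RatioSample and regressedGroups are illustrative names, and the numbers are made up:

```typescript
// Sketch: flag groups whose recent-window ratio exceeds the baseline
// by more than a relative threshold (in percent).
interface RatioSample {
  group: string;          // e.g. "tenant_a/small->large_fallback"
  windowRatio: number;    // follow-up ratio over the last 30 minutes
  baselineRatio: number;  // same ratio over the 7-day baseline
}

function regressedGroups(samples: RatioSample[], thresholdPct: number): string[] {
  return samples
    .filter((s) =>
      s.baselineRatio > 0 &&
      ((s.windowRatio - s.baselineRatio) / s.baselineRatio) * 100 > thresholdPct)
    .map((s) => s.group);
}

// tenant_a: 0.29 vs a 0.14 baseline is roughly +107%, far past +20%;
// tenant_b: 0.15 vs 0.14 is about +7% and stays quiet.
const flagged = regressedGroups(
  [
    { group: "tenant_a/small->large_fallback", windowRatio: 0.29, baselineRatio: 0.14 },
    { group: "tenant_b/small", windowRatio: 0.15, baselineRatio: 0.14 },
  ],
  20,
);
```

Grouping before comparing is what keeps the tenant-level regression visible instead of being averaged away.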
Pattern 3: Define a separate SLO for quality regression
If you only have availability SLOs, you will miss quality degradation. LLM services need a dedicated quality SLO.
Example quality SLOs:
- resolved_rate >= 78%
- follow_up_ratio <= 18%
- policy_false_refusal_rate <= 3%
- retrieval_hit_ratio >= 0.55
Operating points:
- Automatically halt deployment-pipeline promotion when a quality SLO is violated.
- Review quality SLOs weekly, separately by domain (payment, refund, delivery).
- Temporarily raise the human-review sampling rate where accuracy and safety conflict.
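A minimal sketch of such a promotion gate, with thresholds mirroring the example SLOs above (the function and field names are illustrative):

```typescript
// Sketch: evaluate the quality SLOs and decide whether a deployment
// may be promoted. Thresholds match the example SLOs in the text.
interface QualityMetrics {
  resolvedRate: number;            // fraction, e.g. 0.81
  followUpRatio: number;
  policyFalseRefusalRate: number;
  retrievalHitRatio: number;
}

function sloViolations(m: QualityMetrics): string[] {
  const violations: string[] = [];
  if (m.resolvedRate < 0.78) violations.push("resolved_rate");
  if (m.followUpRatio > 0.18) violations.push("follow_up_ratio");
  if (m.policyFalseRefusalRate > 0.03) violations.push("policy_false_refusal_rate");
  if (m.retrievalHitRatio < 0.55) violations.push("retrieval_hit_ratio");
  return violations;
}

// Any violated quality SLO blocks pipeline promotion automatically.
function canPromote(m: QualityMetrics): boolean {
  return sloViolations(m).length === 0;
}
```

Returning the list of violated SLOs, not just a boolean, makes the block reason visible in the pipeline logs.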
Failure cases/anti-patterns
Failure scenario: "no errors, but complaints spike"
Situation:
- After deployment on Monday morning, the infrastructure error rate was within the normal range at 0.3%.
- However, user_follow_up_within_120s_ratio rose from 14% to 29%.
- The model_route field was missing from the logs, making it hard to isolate the cause.
Detection procedure:
- Cross-check the increase in CS tickets and the increase in quality indicators
- In trace samples, llm.retrieval latency increases and retrieval_hit_ratio decreases.
- Recent deployment changes reveal a reduced retrieval top-k and a lengthened index refresh interval.
Mitigation procedure:
- Roll retrieval settings back to their previous values.
- Temporarily raise the large-model routing rate for the affected tenants.
- Route follow-up questions to the human-review queue.
Recovery procedure:
- Make model_route and retrieval_hit_ratio required fields in the log schema.
- Integrate quality alerts with deployment alerts.
- Add the "error rate normal + quality degraded" compound scenario to the failure-drill runbook.
Representative antipatterns
- Judging LLM quality from infrastructure metrics alone (5xx, CPU)
- Storing log/trace/quality events separately, without a request ID
- Recording overly sensitive information in spans, increasing security risk
- Running many dashboards but having no rollback trigger criteria
Checklist
- [ ] Are the request-level common keys (request_id, tenant_id, prompt_version) consistently included in traces, logs, and metrics?
- [ ] Are the llm.retrieval, llm.inference, and llm.policy_check spans standardized?
- [ ] Are quality metrics (resolved_rate, follow_up_ratio) operated alongside infrastructure metrics?
- [ ] Is there a log schema versioning and parser compatibility policy?
- [ ] Do regression alerts operate at tenant/model_route/prompt_version granularity?
- [ ] Can deployment promotion be halted, or an automatic rollback triggered, when a quality SLO is violated?
- [ ] Are PII minimization and masking policies applied to observability data?
Summary
The goal of LLM observability is not "showing data" but "fast cause identification and control." For stable operation, trace, log, and quality events must be joined on the same request axis, and regression detection must be wired to deployment control. The key is to build in the premise that quality can collapse even when nothing is failing.
Next episode preview
The next part deals with Context Engineering. We explain how to design RAG, memory, freshness, and multi-tenancy to maintain context quality. In particular, situations where “the model is the same but the answers suddenly start to differ” are analyzed from the perspective of retrieval quality and data life cycle.
Reference links
- OpenAI Developer Docs - Responses API
- OpenTelemetry official documentation
- Martin Fowler - Circuit Breaker
- Blog: Designing LLM Agent Tool Guardrails
Series navigation
- Previous post: Part 5. Security Design: Prompt Injection, Data Leak, Policy Guard
- Next post: Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy