OpenTelemetry Observability Baseline

Introduction

Failure response speed depends on the quality of context, not on log volume. Even after adopting OpenTelemetry, root-cause analysis time barely drops if trace_id does not cross service boundaries. This article presents an observability baseline that secures traceability at minimal cost.

OpenTelemetry observability baseline cover: free image from Wikimedia Commons

Problem definition

Observability efforts fail when they try to collect every signal at once.

Without a span naming convention, the dashboard becomes a pile of strings that cannot be queried.
Without a business key on error events, the extent of a failure's impact cannot be traced.
With a fixed sampling policy, important traces are dropped during peak traffic.
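
The first two failure modes above can be caught with small checks at instrumentation time. The following is a minimal sketch under stated assumptions: the dot-separated naming pattern, the `REQUIRED_ERROR_KEYS` list, and the `order.id` / `tenant.id` keys are hypothetical choices for illustration, not rules defined by OpenTelemetry.

```typescript
// Hypothetical convention: span names are lowercase, dot-separated segments,
// e.g. "checkout.submit_order". IDs and other variable values stay out of the name.
const SPAN_NAME_PATTERN = /^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$/;

// Hypothetical business keys that every error event is expected to carry.
const REQUIRED_ERROR_KEYS = ["order.id", "tenant.id"] as const;

type Attributes = Record<string, string | number | boolean>;

// True when the name follows the assumed convention.
export function isValidSpanName(name: string): boolean {
  return SPAN_NAME_PATTERN.test(name);
}

// Returns the business keys that are missing from an error event's attributes.
export function missingBusinessKeys(attrs: Attributes): string[] {
  return REQUIRED_ERROR_KEYS.filter((key) => !(key in attrs));
}
```

A check like this could run in a unit test or a lint step, so that names such as `GET /orders/12345` (which embed an ID) are rejected before they reach the dashboard.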

Tracking only three critical paths is enough to start. In exchange, tag standards and ID propagation rules must be fixed strictly.
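
To make the ID propagation rule concrete, here is a minimal sketch of the W3C Trace Context `traceparent` header, which is the format OpenTelemetry propagates by default. It only handles version `00`, and the helper names (`parseTraceparent`, `formatTraceparent`, `outgoingHeaders`) are hypothetical. A real service would normally rely on the SDK's propagator rather than hand-written parsing.

```typescript
// Minimal W3C Trace Context sketch: format and parse the `traceparent` header.
export interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  spanId: string;   // 16 lowercase hex characters
  sampled: boolean; // bit 0 of the trace-flags field
}

// Only version "00" is accepted in this sketch.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

export function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

export function parseTraceparent(header: string | undefined): TraceContext | null {
  const match = header ? TRACEPARENT.exec(header.trim()) : null;
  if (!match) return null;
  const [, traceId = "", spanId = "", flags = ""] = match;
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// A downstream call keeps the incoming trace ID and swaps in the caller's own span ID.
export function outgoingHeaders(incoming: TraceContext, ownSpanId: string): Record<string, string> {
  return { traceparent: formatTraceparent({ ...incoming, spanId: ownSpanId }) };
}
```

The point of the sketch is the invariant: the trace ID is copied unchanged across every hop, while the span ID changes per service. If any hop drops or regenerates the trace ID, the trace breaks at that boundary.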

Key concepts

Area	Design criteria	Verification points
Traces	Request-level correlation	Span connection rate across service boundaries
Metrics	SLO-centric indicators	False-positive alert rate
Logs	Structured logs with trace_id	Time required for root-cause analysis
Sampling	Dynamic sampling based on traffic and errors	Signal density relative to cost
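
The "Logs" row can be illustrated with a small formatter that always attaches correlation fields. This is a sketch only: the field names (`ts`, `level`, `msg`, `trace_id`, `span_id`) and the `logLine` helper are assumptions for illustration, not a format mandated by OpenTelemetry.

```typescript
// Sketch of a structured log line that always carries trace correlation fields.
interface LogContext {
  traceId: string;
  spanId: string;
}

export function logLine(
  level: "info" | "warn" | "error",
  msg: string,
  ctx: LogContext,
  fields: Record<string, string | number | boolean> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ts: now.toISOString(),
    level,
    msg,
    ...fields,
    // Correlation fields are written last so caller-supplied fields cannot overwrite them.
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
  });
}
```

With every line carrying `trace_id`, a log search can jump straight from an error message to the matching trace, which is what shortens the root-cause analysis time named in the table.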

Observability is a matter of design, not collection. Defining first which questions the data must answer cuts unnecessary collection cost.
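
As one way to picture traffic- and error-based dynamic sampling, the sketch below keeps every error trace and lowers the sampling rate as traffic rises. The budget (`TARGET_TRACES_PER_SECOND`), the floor (`MIN_RATE`), and the function names are hypothetical values chosen for illustration. Because an error is only known once a request finishes, a decision like this would have to run after the fact (tail sampling), for example in the Collector, rather than at span start.

```typescript
// Sketch of traffic/error-based dynamic sampling. All thresholds are hypothetical.
export interface SamplingInput {
  traceId: string;           // 32 hex characters
  isError: boolean;          // the request ended in an error
  requestsPerSecond: number; // current traffic estimate
}

const TARGET_TRACES_PER_SECOND = 50; // assumed storage budget
const MIN_RATE = 0.01;               // never sample below 1%

// The rate shrinks as traffic grows so the kept volume stays near the budget.
export function samplingRate(requestsPerSecond: number): number {
  if (requestsPerSecond <= TARGET_TRACES_PER_SECOND) return 1;
  return Math.max(MIN_RATE, TARGET_TRACES_PER_SECOND / requestsPerSecond);
}

export function shouldSample(input: SamplingInput): boolean {
  if (input.isError) return true; // errors are always kept
  // Map the last 8 hex digits of the trace ID to [0, 1) so that every service
  // reaches the same decision for the same trace.
  const bucket = parseInt(input.traceId.slice(-8), 16) / 0x100000000;
  return bucket < samplingRate(input.requestsPerSecond);
}
```

Deriving the decision from the trace ID instead of a random number keeps traces whole: either all spans of a trace are kept or none are.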

Code example 1: HTTP service tracing

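import { SpanStatusCode, trace } from "@opentelemetry/api";

// The tracer name identifies the instrumentation scope. service.name is a
// resource attribute and belongs in the SDK resource configuration, not on each span.
const tracer = trace.getTracer("8space-api");

export async function tracedHandler(requestId: string, fn: () => Promise<unknown>) {
  return tracer.startActiveSpan("http.request", async (span) => {
    span.setAttribute("request.id", requestId);

    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      // Record the exception as a span event, then mark the span as failed.
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: "handler failed" });
      throw error;
    } finally {
      // Always end the span, on success and on failure.
      span.end();
    }
  });
}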

Code example 2: Collector pipeline

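receivers:
  otlp:
    protocols:
      http:
      grpc:

processors:
  batch:
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  prometheus:
    # Assumed scrape address; the exporter needs one to expose metrics.
    endpoint: 0.0.0.0:8889
  otlphttp:
    # Base URL only; the exporter appends /v1/traces itself.
    endpoint: https://otel.example.com

# Components are only active once a pipeline references them.
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]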

Architecture flow

[Figure: architecture flow diagram (Mermaid)]

Tradeoffs

Raising the sampling rate improves analysis quality, but storage costs grow quickly.
Adding many tags improves searchability, but high cardinality can degrade query and storage performance.
Narrowing the initial scope speeds up adoption, but failures outside the critical paths require separate analysis.
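
The cardinality tradeoff is easier to judge with a quick upper-bound estimate: the worst-case number of time series for one metric is the product of the distinct values of each attribute. The sketch below assumes a hypothetical helper name (`worstCaseSeries`) and illustrative counts, not measured data.

```typescript
// Upper bound on time series for one metric:
// the product of the distinct values of each attribute.
export function worstCaseSeries(distinctValues: Record<string, number>): number {
  return Object.values(distinctValues).reduce((product, n) => product * n, 1);
}

// Illustrative counts only.
const bounded = worstCaseSeries({
  "http.request.method": 5,
  "http.route": 40,
  "http.response.status_code": 10,
});

// Adding one unbounded attribute multiplies the total.
const withUserId = worstCaseSeries({
  "http.request.method": 5,
  "http.route": 40,
  "http.response.status_code": 10,
  "user.id": 100_000,
});

console.log(bounded, withUserId);
```

In this example the bounded attribute set stays at 2,000 series, while adding a per-user attribute raises the bound to 200 million. Identifiers like that usually fit better on spans or logs than on metric attributes.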

Summary

The OpenTelemetry baseline is not about measuring everything, but about establishing a minimum operable standard first. Fixing ID propagation, tag rules, and dynamic sampling is what reliably improves failure response speed.

Image source

Cover: source link
License: CC BY-SA 3.0 / Author: Unknown
Note: The freely licensed image was downloaded from Wikimedia Commons and optimized as a 1600px JPG.