
OpenTelemetry Observability Baseline

Measurement standards that connect logs, metrics, and traces to shorten the time from failure to root cause


Introduction

Failure response speed is determined by context quality, not log volume. Even after OpenTelemetry is adopted, cause-analysis time barely drops if trace_id does not cross service boundaries. This article presents an observability baseline that guarantees traceability at minimal cost.

OpenTelemetry Observability Baseline cover
Free image from Wikimedia Commons

Problem definition

Observability usually fails because teams try to collect every signal at once.

  • Without a span naming convention, the dashboard becomes a pile of strings that cannot be queried.
  • Without a business key on error events, the blast radius of a failure cannot be tracked.
  • With a fixed sampling policy, important traces are missed at peak traffic.

At first, tracking only the three most critical paths is sufficient. What must be fixed firmly from the start are the tag standards and the ID propagation rules.
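The "ID propagation" half of that rule can be made concrete with the W3C Trace Context header, which is what OpenTelemetry uses to carry trace_id across service boundaries. A minimal sketch (the helper name `buildTraceparent` is hypothetical; the header layout follows the W3C spec):

```typescript
// Build a W3C `traceparent` header so trace_id survives service boundaries.
// Format: "00" version, 32-hex trace id, 16-hex parent span id, 2-hex flags.
export function buildTraceparent(
  traceId: string,
  spanId: string,
  sampled: boolean
): string {
  if (!/^[0-9a-f]{32}$/.test(traceId) || !/^[0-9a-f]{16}$/.test(spanId)) {
    throw new Error("invalid trace context ids");
  }
  const flags = sampled ? "01" : "00";
  return `00-${traceId}-${spanId}-${flags}`;
}
```

In practice the SDK's propagator injects this header for you; the point is that every outbound call on a critical path must carry it, or the span connection rate at the boundary drops to zero.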

Key concepts

| Perspective | Design criteria | Verification points |
| --- | --- | --- |
| Trace | Request-unit correlation | Span connection rate across service boundaries |
| Metrics | SLO-centric indicators | False-positive alert rate |
| Logs | Structured logs + trace_id | Time required for cause analysis |
| Sampling | Traffic/error-based dynamic sampling | Signal density relative to cost |

Observability is design, not collection. Define first which questions the data must answer; that alone eliminates most unnecessary collection cost.
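The "structured logs + trace_id" row above can be sketched as a one-line JSON emitter. The field names (`ts`, `level`, `trace_id`, `msg`) are illustrative, not a standard; the only hard requirement is that trace_id appears in every log line so logs can be joined to traces:

```typescript
// Emit one JSON line per event; trace_id is the join key back to the trace.
// Field names here are illustrative, not an OpenTelemetry requirement.
export function structuredLog(
  level: "info" | "error",
  msg: string,
  traceId: string,
  extra: Record<string, unknown> = {}
): string {
  return JSON.stringify({
    ts: new Date().toISOString(),
    level,
    trace_id: traceId,
    msg,
    ...extra,
  });
}
```

A call like `structuredLog("error", "payment failed", traceId, { order_id: "ord-42" })` carries both the trace join key and a business key, which is exactly what the problem-definition section said was missing.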

Code example 1: HTTP service tracing

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("8space-api");

export async function tracedHandler(requestId: string, fn: () => Promise<unknown>) {
  return tracer.startActiveSpan("http.request", async (span) => {
    // Business keys on the span make the failure blast radius queryable later.
    span.setAttribute("request.id", requestId);
    span.setAttribute("service.name", "blog-api");

    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: "handler failed" });
      throw error;
    } finally {
      // Always end the span, on both success and failure paths.
      span.end();
    }
  });
}

Code example 2: Collector pipeline

receivers:
  otlp:
    protocols:
      http:
      grpc:

processors:
  batch:
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  otlphttp:
    # traces_endpoint sets the full path; a bare `endpoint` would have the
    # signal path appended automatically.
    traces_endpoint: https://otel.example.com/v1/traces

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Architecture flow

Application SDK → OTLP → Collector (batch, attributes) → Prometheus / OTLP trace backend

Tradeoffs

  • Raising the sampling rate improves analysis quality, but storage costs grow quickly.
  • Adding more tags improves searchability, but high-cardinality attributes can degrade query performance.
  • Narrowing the initial scope speeds up adoption, but failures outside the covered paths require separate analysis.
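The first tradeoff is exactly what traffic/error-based dynamic sampling addresses: keep every error trace, and sample the rest at a fixed ratio decided deterministically from the trace id so all spans of one trace agree. A minimal sketch (the function name and hashing scheme are illustrative, not the OpenTelemetry sampler API):

```typescript
// Keep all error traces; ratio-sample the rest. The decision is derived
// from the trace id itself, so every span in a trace gets the same verdict.
export function keepTrace(
  traceId: string,
  isError: boolean,
  ratio: number
): boolean {
  if (isError) return true;
  // Map the last 8 hex chars of the 32-hex trace id into [0, 1).
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < ratio;
}
```

A production setup would implement this as a sampler in the SDK (or tail sampling in the collector), but the design point is the same: the error path never loses signal, and cost is controlled only on the healthy path.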

Conclusion

The OpenTelemetry baseline is not about complete instrumentation; it is about establishing a minimum operable standard first. Fixing ID propagation, tag rules, and dynamic sampling reliably improves failure response speed.

Image source

  • Cover: source link
  • License: CC BY-SA 3.0 / Author: Unknown
  • Note: After downloading the free license image from Wikimedia Commons, it was optimized to JPG at 1600px.
