
Distributed tracing setup

In short:

Distributed tracing follows a single request across multiple microservices. The key is trace context propagation via HTTP/gRPC headers (W3C traceparent). Each service creates a child span, taking the parent ID from the incoming header. Tools: OpenTelemetry auto-instrumentation, Jaeger UI, Tempo + Grafana. The payoff: pinpointing the latency culprit (a slow DB query on service C, five hops deep).

Below: step-by-step setup, working examples, common mistakes, FAQ.


Step-by-step setup

  1. Install OpenTelemetry in every service (see how-to-setup-opentelemetry)
  2. Enable W3C traceparent propagation (on by default in most SDKs)
  3. Let the HTTP client auto-inject traceparent into outgoing requests
  4. Test: trigger a frontend call → search for the trace ID in the Jaeger UI
  5. Instrument DB queries and cache hits via auto-instrumentation
  6. Add baggage for business context (order_id, user_id)
  7. Alert on slow spans (p99 > 500 ms on a specific endpoint)

Working examples

W3C traceparent header

```
# HTTP request header
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#            ^version  ^trace-id (16 bytes hex)   ^span-id (8 bytes)  ^flags
# All services read the header and create a child span with the same trace-id
```

Manual context propagation

```javascript
// Node.js — propagate context on a downstream HTTP call
const { context, propagation } = require('@opentelemetry/api');

const headers = {};
propagation.inject(context.active(), headers);
// headers now contains traceparent
await fetch('http://service-b/api', { headers });
```

gRPC metadata

```javascript
// gRPC — inject trace context into call metadata
const metadata = new grpc.Metadata();
propagation.inject(context.active(), metadata, {
  set(carrier, key, value) { carrier.set(key, value); }
});
client.someMethod(request, metadata, callback);
```

Query in the Jaeger UI

```
# Jaeger UI search
# service: frontend
# tags: http.status_code=500
# min duration: 1s
# Find slow traces → drill into the span tree → identify the bottleneck
```

Baggage propagation

```javascript
// Attach business context to the whole trace
import { context, propagation } from '@opentelemetry/api';

const ctx = propagation.setBaggage(
  context.active(),
  propagation.createBaggage({
    'user.id': { value: '123' },
    'order.id': { value: 'ord-456' }
  })
);
// Propagates across services in the baggage header
```

Common mistakes

  • Forgetting to instrument the HTTP client — the trace breaks at the service boundary
  • 100% sampling in prod gets expensive at scale. Use head-based sampling at 1% and keep 100% of errors
  • Asynchronous work (queue → worker) loses context. Propagate it manually in message headers
  • Too much baggage adds header bytes to every request. Keep it small (<4 KB)
  • Traces without a service name are impossible to attribute. Set the OTEL_SERVICE_NAME env var or configure it in the SDK
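
The queue → worker pitfall is worth a sketch: the fix is simply carrying the traceparent inside the message envelope. A minimal in-memory illustration (plain Node.js; `publishWithContext` and `consumeWithContext` are hypothetical names — with OTel you would use `propagation.inject`/`extract` against the message headers):

```javascript
// In-memory stand-in for a real message broker
const queue = [];

// Producer side: attach the current traceparent to the message headers
function publishWithContext(queue, payload, traceparent) {
  queue.push({ headers: { traceparent }, payload });
}

// Consumer side: read the traceparent back and hand it to the handler,
// which should use it as the parent when starting the worker's span
function consumeWithContext(queue, handler) {
  const msg = queue.shift();
  return handler(msg.payload, msg.headers.traceparent);
}

const tp = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
publishWithContext(queue, { orderId: 'ord-456' }, tp);
const restored = consumeWithContext(queue, (payload, traceparent) => traceparent);
// restored === tp: the worker's spans rejoin the original trace
```

Without this, the worker starts a brand-new trace and the queue hop shows up as a gap in Jaeger.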


Frequently asked questions

Jaeger vs Tempo vs Honeycomb?

Jaeger: open source, battle-tested, local storage is fine. Tempo: Grafana-native, S3 backend, cheap at scale. Honeycomb: SaaS, best query UX ($70+/mo).

Traces instead of logs?

Traces show flow and timing; logs show discrete events. They complement each other: a trace ID in the log line ties logs to a specific trace.
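
That correlation is just a field in the structured log line. A tiny sketch (the `logWithTrace` helper is hypothetical; real setups get the trace id from the active span context):

```javascript
// Emit a structured log line carrying the current trace id,
// so log search by trace id finds every line for that request
function logWithTrace(traceId, message) {
  return JSON.stringify({
    ts: new Date().toISOString(),
    trace_id: traceId,
    msg: message
  });
}

const line = logWithTrace('4bf92f3577b34da6a3ce929d0e0e4736', 'payment failed');
// line now contains trace_id alongside the message
```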

Performance overhead?

OTel auto-instrumentation: 1–3% CPU, negligible latency. Sample 1–10% in prod.
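
Head-based sampling can be deterministic on the trace-id, so every service makes the same keep/drop decision without coordination. A sketch of the idea in plain Node.js (`shouldSample` is a hypothetical name; it mirrors in spirit what OTel's TraceIdRatioBased sampler does, not its exact algorithm):

```javascript
// Decide sampling from the trace-id itself: interpret the last 8 hex
// chars as an unsigned 32-bit number and keep the trace if it falls
// below ratio * 2^32. Same trace-id → same decision on every hop.
function shouldSample(traceId, ratio) {
  const n = parseInt(traceId.slice(-8), 16);
  return n < ratio * 0x100000000;
}

// Note: keeping 100% of errors cannot be done head-based — at the
// first hop you don't yet know the request will fail; that part
// belongs in tail sampling at the collector.
```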

How do I find a latency issue fast?

<a href="/ping">Enterno Ping</a> for endpoint-level checks. For a full trace, filter by duration in the Jaeger/Tempo UI.