
Distributed Tracing Setup

Key idea:

Distributed tracing tracks a single request across multiple microservices. The core mechanism is trace context propagation via HTTP/gRPC headers (the W3C traceparent header): each service creates a child span whose parent span ID comes from the incoming header. Tools: OpenTelemetry auto-instrumentation, Jaeger UI, Tempo + Grafana. The payoff: pinpointing the latency culprit (e.g. a slow DB query on service C, five hops deep).

Below: step-by-step, working examples, common pitfalls, FAQ.
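The traceparent header mentioned above has a fixed four-field layout. A minimal sketch of pulling it apart (parseTraceparent is a hypothetical helper for illustration, not part of any SDK):

```javascript
// Split a W3C traceparent header into its four fields:
// version - trace-id (32 hex chars) - parent span-id (16 hex chars) - flags.
function parseTraceparent(header) {
  const [version, traceId, parentId, flags] = header.split('-');
  if (traceId.length !== 32 || parentId.length !== 16) {
    throw new Error('malformed traceparent');
  }
  return { version, traceId, parentId, flags };
}

const parsed = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
// parsed.traceId === '4bf92f3577b34da6a3ce929d0e0e4736'
```

Every service on the request path keeps the same trace-id and substitutes its own span-id as the parent for the next hop.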


Step-by-Step Setup

  1. OpenTelemetry in every service (see how-to-setup-opentelemetry)
  2. Enable W3C traceparent propagation (auto in most SDKs)
  3. HTTP client auto-injects traceparent in outgoing requests
  4. Test: trigger frontend call → search trace ID in Jaeger UI
  5. Instrument DB queries, cache hits via auto-instrumentation
  6. Add baggage for business context (order_id, user_id)
  7. Alert on slow spans (p99 > 500ms on specific endpoint)
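Steps 1–3 can be sketched in one Node.js init file, assuming the standard OTel packages (@opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node, @opentelemetry/exporter-trace-otlp-http) are installed and an OTLP-capable collector is running locally; the endpoint URL below is an assumption:

```javascript
// Sketch: bootstrap OpenTelemetry in a Node.js service.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'frontend', // see pitfalls: always set a service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // assumed local OTLP endpoint
  }),
  // Auto-instruments http, fetch, gRPC, popular DB drivers, etc.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start(); // W3C traceparent propagation is enabled by default
```

Load this file before any other require (e.g. `node -r ./tracing.js app.js`) so the auto-instrumentation can patch libraries as they load.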

Working Examples

Scenario: W3C traceparent header

    # HTTP request header
    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
    #            ^version ^trace-id (16 bytes hex)   ^span-id (8 bytes) ^flags
    # Every service reads this header and creates a child span with the same trace-id

Scenario: Manual context propagation

    // Node.js: propagate context on a downstream HTTP call
    const { context, propagation } = require('@opentelemetry/api');
    const headers = {};
    propagation.inject(context.active(), headers);
    // headers now contains traceparent
    await fetch('http://service-b/api', { headers });

Scenario: gRPC metadata

    // Inject trace context into gRPC metadata (@grpc/grpc-js)
    const grpc = require('@grpc/grpc-js');
    const { context, propagation } = require('@opentelemetry/api');
    const metadata = new grpc.Metadata();
    propagation.inject(context.active(), metadata, {
      set(carrier, key, value) { carrier.set(key, value); }
    });
    client.someMethod(request, metadata, callback);

Scenario: Query in Jaeger UI

    # Jaeger UI search:
    #   service: frontend
    #   tags: http.status_code=500
    #   min duration: 1s
    # Find slow traces, drill into the span tree, identify the bottleneck

Scenario: Baggage propagation

    // Set business context across the whole trace
    const { context, propagation } = require('@opentelemetry/api');
    const ctx = propagation.setBaggage(
      context.active(),
      propagation.createBaggage({
        'user.id': { value: '123' },
        'order.id': { value: 'ord-456' }
      })
    );
    // Propagates across services in the baggage header

Common Pitfalls

  • Forgetting to instrument the HTTP client: the trace breaks at the service boundary
  • 100% sampling in prod gets expensive at scale: use head-based 1% sampling, and keep 100% of errors (which requires tail-based sampling in a collector)
  • Asynchronous work (queue → worker) loses the context: manually propagate it in message headers
  • Too much baggage adds header bytes to every request: keep values small (well under 4 KB total)
  • Traces without a service name are impossible to attribute: set the OTEL_SERVICE_NAME env var or serviceName in SDK config
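The queue pitfall above can be sketched with the @opentelemetry/api propagation helpers; `queue.send` and `processFn` are hypothetical stand-ins for your messaging client and handler:

```javascript
// Sketch: carry trace context through a message queue by hand.
const { context, propagation } = require('@opentelemetry/api');

// Producer: copy the active trace context into the message headers.
function publish(queue, payload) {
  const headers = {};
  propagation.inject(context.active(), headers);
  queue.send({ payload, headers }); // headers carry traceparent (+ baggage)
}

// Consumer: restore the context before processing, so the worker's spans
// become children of the producer's trace instead of starting a new one.
function handleMessage(message, processFn) {
  const parentCtx = propagation.extract(context.active(), message.headers);
  return context.with(parentCtx, () => processFn(message.payload));
}
```

The same inject/extract pair works for any carrier with string keys: Kafka record headers, SQS message attributes, a JSON envelope, etc.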


Frequently Asked Questions

Jaeger vs Tempo vs Honeycomb?

Jaeger: open source, battle-tested, local storage OK. Tempo: Grafana-native, S3 backend, cheap at scale. Honeycomb: SaaS, best query UX ($70+/mo).

Traces instead of logs?

Traces show flow and timing; logs show discrete events. They complement each other: put the trace ID in every log line to tie logs to a specific trace.
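A minimal sketch of that log-to-trace tie-in; `withTraceId` is a hypothetical formatter, and in a real service you would pull the ID from the active span (e.g. `trace.getActiveSpan()?.spanContext().traceId` with @opentelemetry/api) inside your logger:

```javascript
// Prefix every log line with the current trace ID so it is searchable
// both in the log store and in the tracing UI.
function withTraceId(traceId, message) {
  return `trace_id=${traceId} ${message}`;
}

const line = withTraceId('4bf92f3577b34da6a3ce929d0e0e4736', 'checkout failed');
// "trace_id=4bf92f3577b34da6a3ce929d0e0e4736 checkout failed"
```

Searching Jaeger or Tempo for that trace_id then pulls up the full span tree around the failing log line.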

Performance overhead?

OTel auto-instrumentation: roughly 1-3% CPU, negligible latency. Sample 1-10% in prod.
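A sketch of that head-based sampling, assuming @opentelemetry/sdk-trace-base; keeping 100% of errors on top of this requires tail-based sampling in a collector, which is not shown:

```javascript
// Head-based 1% sampling with a parent-based wrapper, so the sampling
// decision made at the first service is honored by every downstream hop.
const {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} = require('@opentelemetry/sdk-trace-base');

const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.01), // sample 1% of new traces
});
// Pass `sampler` into the NodeSDK / tracer provider configuration.
```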

How do I find a latency issue fast?

Enterno Ping (/en/ping) for endpoint-level latency. For the full trace, filter by duration in the Jaeger/Tempo UI.