What is Observability

Key idea:

Observability is the ability to understand a system's internal state from its external outputs. It rests on three pillars: **metrics** (numeric measurements over time, such as CPU usage or QPS), **logs** (discrete event records, such as errors or an audit trail), and **traces** (the path of a single request across distributed services). It differs from monitoring: monitoring answers known unknowns (is CPU high?), while observability lets you explore unknown unknowns (a new type of bug).

Below: details, an example, and an FAQ.

Details

  • Metrics: Prometheus, Grafana, Datadog, New Relic. Aggregated, so cheap to store and query
  • Logs: Loki, ELK stack, CloudWatch. Full-text detail, but expensive at scale
  • Traces: Jaeger, Zipkin, Tempo. Detailed per-request flow
  • Correlation: a shared trace_id links all three signals (standardised via OpenTelemetry)
  • Cardinality explosion: every distinct label value creates a new time series, so high-cardinality labels (user_id) can overwhelm Prometheus
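
The cardinality point can be made concrete with some rough arithmetic. The sketch below assumes a worst case in which every combination of label values actually occurs; the label names and counts are hypothetical, and `estimateSeries` is an illustrative helper, not part of any Prometheus client.

```javascript
// Upper bound on time series for one metric: the product of the number of
// distinct values each label can take (assumes every combination occurs).
function estimateSeries(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((total, n) => total * n, 1);
}

// Hypothetical label sets for a single request-counter metric.
const bounded = { method: 5, status: 10, route: 50 };
const withUserId = { ...bounded, user_id: 100000 };

console.log(estimateSeries(bounded));    // 2500
console.log(estimateSeries(withUserId)); // 250000000
```

Adding one unbounded label multiplies the series count by the number of users, which is why identifiers like user_id usually belong in logs or trace attributes rather than in metric labels.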

Example

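// OpenTelemetry-instrumented code (needs @opentelemetry/api and a configured SDK)
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-app');

async function runQuery(db) {
  const span = tracer.startSpan('db-query');
  try {
    return await db.query('SELECT ...');
  } catch (err) {
    span.recordException(err);  // attach the error to the span
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    span.end();  // marks the span finished; the SDK's exporter ships it to Jaeger/Tempo
  }
}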
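
The span in the example belongs to a trace, and that trace's ID is what ties the pillars together (the "Correlation" point under Details). A minimal sketch of the idea, without any OpenTelemetry dependency: `logLine` and `logsForTrace` are hypothetical helpers, and the IDs are sample values in the W3C Trace Context format (32 and 16 hex characters).

```javascript
// Hypothetical helper: stamp a structured log line with the active trace context.
function logLine(traceContext, level, message) {
  return JSON.stringify({
    level,
    message,
    trace_id: traceContext.traceId,
    span_id: traceContext.spanId,
  });
}

// Sample trace contexts for two unrelated requests.
const checkout = { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7' };
const search = { traceId: '0af7651916cd43dd8448eb211c80319c', spanId: 'b7ad6b7169203331' };

const logs = [
  logLine(checkout, 'info', 'payment started'),
  logLine(search, 'info', 'query received'),
  logLine(checkout, 'error', 'card declined'),
];

// Starting from the trace_id of a failed trace, pull every log line for that request.
function logsForTrace(lines, traceId) {
  return lines
    .map((line) => JSON.parse(line))
    .filter((entry) => entry.trace_id === traceId);
}

console.log(logsForTrace(logs, checkout.traceId).map((e) => e.message));
// [ 'payment started', 'card declined' ]
```

In a real setup the log backend does this filtering, but the principle is the same: the trace_id is the join key between a trace and its logs.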

Frequently Asked Questions

Observability vs Monitoring?

Monitoring alerts on predetermined conditions. Observability supports ad-hoc investigation: you explore the telemetry to answer questions you did not anticipate. The two overlap heavily, but observability goes deeper.
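
A small sketch of that difference, using made-up request events held in memory. The field names, the threshold, and both helper functions are assumptions for illustration, not any vendor's API.

```javascript
// Hypothetical request events, one object per request.
const events = [
  { route: '/checkout', status: 500, region: 'eu', appVersion: '2.4.1' },
  { route: '/checkout', status: 200, region: 'us', appVersion: '2.4.0' },
  { route: '/search', status: 200, region: 'eu', appVersion: '2.4.1' },
  { route: '/checkout', status: 500, region: 'eu', appVersion: '2.4.1' },
];

// Monitoring: a condition fixed in advance (error rate above a threshold).
function errorRateAlert(evts, threshold) {
  const errors = evts.filter((e) => e.status >= 500).length;
  return errors / evts.length > threshold;
}

// Observability: group the failures by any field, chosen during the investigation.
function countErrorsBy(evts, field) {
  const counts = {};
  for (const e of evts.filter((x) => x.status >= 500)) {
    counts[e[field]] = (counts[e[field]] || 0) + 1;
  }
  return counts;
}

console.log(errorRateAlert(events, 0.25));        // true (2 of 4 requests failed)
console.log(countErrorsBy(events, 'appVersion')); // { '2.4.1': 2 }
console.log(countErrorsBy(events, 'route'));      // { '/checkout': 2 }
```

The alert only says that something is wrong; the group-by queries narrow it down to a version and a route nobody had written a rule for.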

Do I need all 3 pillars?

At minimum, metrics and logs. Add traces once you run microservices or another distributed architecture. For a monolith, start with the first two.

Stack suggestions?

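Small team: Datadog (SaaS, all-in-one) or Grafana Cloud (often cheaper). Self-hosted: Prometheus + Loki + Tempo + Grafana. This is close to Grafana's LGTM stack (Loki, Grafana, Tempo, Mimir), with Prometheus handling metrics in place of Mimir.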