OTEL collector — alert on dropped spans
The OTEL collector is overloaded: `otelcol_exporter_send_failed_spans` is climbing. Traces are lost and prod debugging goes blind, and the tracing backend hides the gap, because a dropped trace looks exactly like a request that was never traced at all.
Recipe
#!/usr/bin/env bash
# /etc/cron.d/otel-drops
# */5 * * * * root /opt/otel-drops.sh
set -euo pipefail
# OTEL collector exposes Prometheus metrics on :8888 by default
URL=${OTEL_METRICS:-http://localhost:8888/metrics}
THRESH=${THRESH:-100} # spans dropped / 5 min
STATE=/var/lib/otel-drops.state
# Sum every labeled series of the counter; # HELP / # TYPE lines
# never match ^otelcol, so no extra filter is needed.
NOW=$(curl -fsS "$URL" | awk '/^otelcol_exporter_send_failed_spans/ {sum += $2} END {print int(sum)}')
PREV=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$NOW" > "$STATE"
DELTA=$((NOW - PREV))
# A collector restart resets the counter, making DELTA negative — treat as zero
if [ "$DELTA" -lt 0 ]; then DELTA=0; fi
if [ "$DELTA" -gt "$THRESH" ]; then
  # HEARTBEAT_URL must be set in the cron environment (or hardcoded here)
  curl -fsS "$HEARTBEAT_URL" --data "dropped=$DELTA,window=5m"
  exit 2
fi
echo "OK ($DELTA spans dropped / 5m)"
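The collector usually exports one `otelcol_exporter_send_failed_spans` series per exporter, so the awk has to sum them. A quick sanity check of that sum against a captured payload (the metric values below are invented for illustration):

```shell
# Sum all labeled series; the # HELP / # TYPE comment lines never match ^otelcol
awk '/^otelcol_exporter_send_failed_spans/ {sum += $2} END {print int(sum)}' <<'EOF'
# HELP otelcol_exporter_send_failed_spans Spans in failed send attempts
# TYPE otelcol_exporter_send_failed_spans counter
otelcol_exporter_send_failed_spans{exporter="otlp"} 120
otelcol_exporter_send_failed_spans{exporter="jaeger"} 37
EOF
# → 157
```

If you run Prometheus against the collector anyway, the same window can be expressed as `increase(otelcol_exporter_send_failed_spans[5m]) > 100` without any state file.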
Same thing in Enterno.io
Wire this check to an Enterno heartbeat with retention: you can correlate "drops happen exactly at traffic peaks" and decide between a bigger collector and head-based sampling.
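If the drops track traffic peaks, head-based sampling can be enabled in the collector itself. A minimal sketch using the contrib `probabilistic_sampler` processor — the 25% rate and the otlp receiver/exporter names are placeholder assumptions, not a recommendation:

```yaml
processors:
  probabilistic_sampler:
    # Keep ~25% of traces, decided at the head of each trace; tune to your peak load
    sampling_percentage: 25

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp]
```

Sampling trades completeness for headroom, so keep the drop alert in place: it tells you when even the sampled volume exceeds what the exporter can push.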
Related recipes
Filebeat / Logstash silently died on one edge node. The Elasticsearch ingest rate fell 40%, but nobody watches dashboards. Sentry without logs is blindness.
Prometheus itself is alive, but one of its targets has up==0 — data stops flowing, graphs go blank, and alertmanager rules built on that target don't fire (no data = no alert).
An alertmanager alert sits in state=pending past its for-window — it should be active but is not firing (group_wait too big? notifier broken? misconfigured route?). Nobody gets paged.