Datadog — alert when a host stops reporting · monitoring cookbook · Enterno.io

Igor Verentsov

Datadog — alert when a host stops reporting

When the Datadog agent dies (OOM kill, a mismatched apt update, certificate expiry on dd-staging.com), the host disappears from the dashboard after 10 min (the default mute window), but nothing alerts you that monitoring itself has gone blind.

Stack: dd-api · curl · cron
Tags: datadog, apm, observability

Recipe

bash

#!/usr/bin/env bash
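# Install as /opt/dd-agent.sh and schedule it from /etc/cron.d/dd-agent:
# */10 * * * * root /opt/dd-agent.sh

# Required settings: abort with a clear message if any is unset or empty.
# cron does not inherit your login environment, so define these in the cron file.
: "${DD_API_KEY:?DD_API_KEY is required}"
: "${DD_APP_KEY:?DD_APP_KEY is required}"
: "${HEARTBEAT_URL:?HEARTBEAT_URL is required}"
EXPECTED=${EXPECTED:-host1,host2,host3}    # comma-separated hostname list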

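# Ask Datadog for hosts seen since 5 minutes ago (date -d needs GNU date).
# If the request fails, LIVE is empty and every expected host counts as missing.
# The endpoint returns one page of hosts; larger fleets need its start/count parameters.
LIVE=$(curl -fsS -H "DD-API-KEY: $DD_API_KEY" -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  "https://api.datadoghq.com/api/v1/hosts?from=$(date -d '5 minutes ago' +%s)" \
  | jq -r '.host_list[].name' | sort -u)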

MISSING=""
IFS=',' read -ra EXP <<< "$EXPECTED"
for H in "${EXP[@]}"; do
  # -F matches the hostname literally, so dots in names are not regex wildcards
  if ! printf '%s\n' "$LIVE" | grep -Fqx -- "$H"; then
    MISSING="$MISSING$H,"
  fi
done

if [ -n "$MISSING" ]; then
  curl -fsS "$HEARTBEAT_URL" --data-urlencode "missing=$MISSING"
  exit 2
fi
echo "OK (all hosts reporting)"
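
The comparison step is the part most worth testing without touching the Datadog API. A minimal sketch follows, with the loop pulled out into a function that can be fed canned data. The name `missing_hosts` and the sample hostnames are hypothetical and not part of the recipe above.

```bash
#!/usr/bin/env bash
# Sketch only: missing_hosts is a hypothetical helper, not part of the recipe.
# Usage: missing_hosts "host1,host2" "$LIVE"
#   $1  comma-separated list of expected hostnames
#   $2  newline-separated list of hostnames that are currently reporting
# Prints each expected host that is absent from the live list, one per line.
missing_hosts() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r host; do
    # Skip empty entries such as those left by a trailing comma.
    [ -n "$host" ] || continue
    # -F: literal match (a dot is not a wildcard); -x: the whole line must match.
    if ! printf '%s\n' "$2" | grep -Fqx -- "$host"; then
      printf '%s\n' "$host"
    fi
  done
}

# Example with canned data instead of a live API call.
missing_hosts "web-1,web-2,db-1" "$(printf 'web-1\ndb-1\n')"
```

Matching literally matters here because hostnames usually contain dots: with a plain regex match, an expected host `a.b` would be considered present if a host named `axb` were reporting.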

Same thing in Enterno.io

Wrap the script in an Enterno heartbeat: an independent channel that catches "monitoring went blind" without relying on the same DD stack.
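
One way to read "wrap in a heartbeat" is the dead man's switch pattern: ping the heartbeat URL only when the check succeeds, and let the heartbeat monitor raise the alert when pings stop arriving. The sketch below assumes that behaviour. The function names `ping_heartbeat` and `run_with_heartbeat` are hypothetical, and the actual Enterno heartbeat URL format and semantics are not shown in this recipe, so check them before relying on this.

```bash
#!/usr/bin/env bash
# Sketch only: both function names are hypothetical, and the heartbeat
# endpoint is assumed to accept a plain HTTP GET as a "still alive" ping.

# Send one ping; give up after 10 seconds so cron jobs do not pile up.
ping_heartbeat() {
  curl -fsS --max-time 10 "$1" >/dev/null
}

# Usage: run_with_heartbeat URL COMMAND [ARGS...]
# Runs COMMAND and pings URL only if COMMAND exits 0.
# Returns COMMAND's exit status either way.
run_with_heartbeat() {
  local url=$1 rc=0
  shift
  "$@" || rc=$?
  if [ "$rc" -eq 0 ]; then
    # A failed ping is logged but does not change the check's result.
    ping_heartbeat "$url" || echo "heartbeat ping failed" >&2
  fi
  return "$rc"
}

# Example cron usage (not executed here):
#   run_with_heartbeat "$HEARTBEAT_URL" /opt/dd-agent.sh
```

With this arrangement a crashed cron daemon, a deleted script, or a missing host all look the same to the heartbeat monitor: the ping stops, and the alert comes from outside the Datadog stack.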


Related recipes

Logging pipeline — alert when ingest rate drops

bash

Filebeat or Logstash silently died on one edge node. The Elasticsearch ingest rate fell 40%, but no one was watching the dashboards. Sentry without logs leaves you blind.

Prometheus — alert when a scrape target is unreachable

bash

Prometheus itself is alive, but one of its targets has up==0 — data stops flowing, graphs go blank, and alertmanager rules built on that target don't fire (no data = no alert).

OTEL collector — alert on dropped spans

bash

OTEL collector is overloaded — `otelcol_exporter_send_failed_spans` is climbing. Traces are lost, prod debugging goes blind. The tracing backend hides the gap.
