Datadog — alert when a host stops reporting
The Datadog agent dies (OOM, mismatched apt update, cert expiry on dd-staging.com) — host disappears from the dashboard after 10 min (default mute window), but nobody alerts that monitoring went blind.
Recipe
#!/usr/bin/env bash
# /etc/cron.d/dd-agent
# */10 * * * * root /opt/dd-agent.sh
DD_API_KEY=${DD_API_KEY}
DD_APP_KEY=${DD_APP_KEY}
EXPECTED=${EXPECTED:-host1,host2,host3} # comma-separated hostname list
LIVE=$(curl -fsS -H "DD-API-KEY: $DD_API_KEY" -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
"https://api.datadoghq.com/api/v1/hosts?from=$(date -d '5 minutes ago' +%s)" \
| jq -r '.host_list[].name' | sort -u)
MISSING=""
IFS=',' read -ra EXP <<< "$EXPECTED"
for H in "${EXP[@]}"; do
if ! echo "$LIVE" | grep -qx "$H"; then
MISSING="$MISSING$H,"
fi
done
if [ -n "$MISSING" ]; then
curl -fsS "$HEARTBEAT_URL" --data-urlencode "missing=$MISSING"
exit 2
fi
echo "OK (all hosts reporting)"
Same thing in Enterno.io
Wrap in an Enterno heartbeat — an independent channel that catches "monitoring went blind" without relying on the same DD stack.
Related recipes
Filebeat / Logstash silently died on one edge node. Elasticsearch ingest rate fell 40 % but no one watches dashboards. Sentry without logs is blindness.
Prometheus itself is alive, but one of its targets has up==0 — data stops flowing, graphs go blank, and alertmanager rules built on that target don't fire (no data = no alert).
OTEL collector is overloaded — `otelcol_exporter_send_failed_spans` is climbing. Traces are lost, prod debugging goes blind. The tracing backend hides the gap.