etcd — alert on frequent leader flapping
Inside a K8s cluster where etcd keeps re-electing its leader every ~30 s, the kube-apiserver stalls on writes and the controller-manager can't keep reconciling. The churn is visible only in etcd's own metrics.
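The counter lives in each member's /metrics output in Prometheus exposition format; pulling it out can be sketched like this (the scrape text is inlined here as a stand-in for a real curl against the member):

```shell
# Extract etcd_server_leader_changes_seen_total from exposition-format text.
parse_leader_changes() {
  # '# HELP' / '# TYPE' lines start with '#', so the anchor skips them
  awk '/^etcd_server_leader_changes_seen_total/ {print $2}'
}

scrape='# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 12'

printf '%s\n' "$scrape" | parse_leader_changes   # → 12
```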
Recipe
#!/usr/bin/env bash
# Install as /opt/etcd-flap.sh, then schedule it via /etc/cron.d/etcd-flap:
# */5 * * * * root /opt/etcd-flap.sh
ENDPOINTS=${ETCD_ENDPOINTS:-https://127.0.0.1:2379}
CACERT=${ETCD_CACERT:-/etc/etcd/ca.crt}
CERT=${ETCD_CERT:-/etc/etcd/etcd.crt}
KEY=${ETCD_KEY:-/etc/etcd/etcd.key}
# Use the metrics endpoint — Prometheus exposition format
TOTAL=$(curl -fsS --cacert "$CACERT" --cert "$CERT" --key "$KEY" \
"${ENDPOINTS%%,*}/metrics" | awk '/^etcd_server_leader_changes_seen_total/ {print $2}')
[ -n "$TOTAL" ] || { echo "metrics scrape failed"; exit 1; }  # don't alert on a failed scrape
STATE=/var/lib/etcd-flap.state
PREV=$(cat "$STATE" 2>/dev/null || echo "$TOTAL")  # first run: no baseline, delta is 0
echo "$TOTAL" > "$STATE"
DELTA=$(( ${TOTAL%.*} - ${PREV%.*} ))              # %.* strips a trailing ".0" from the metric value
[ "$DELTA" -lt 0 ] && DELTA=${TOTAL%.*}            # counter reset: the member restarted
THRESH=${THRESH:-3} # > 3 elections / 5 min = flapping
if [ "$DELTA" -gt "$THRESH" ]; then
# HEARTBEAT_URL is expected from the environment (e.g. set in the cron line)
curl -fsS "$HEARTBEAT_URL" --data "leader_changes=$DELTA,window=5m"
exit 2
fi
echo "OK ($DELTA leader changes / 5m)"
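The delta arithmetic can be factored into a standalone function and exercised offline. A minimal sketch, assuming a counter that went backwards (a member restart) should count the fresh value rather than go negative:

```shell
# flap_delta PREV TOTAL — new leader elections since the last run.
# TOTAL below PREV means the counter reset (member restart), so the
# whole current value counts as this window's elections.
flap_delta() {
  local prev=${1%.*} total=${2%.*}   # strip any trailing ".0"
  if [ "$total" -lt "$prev" ]; then
    echo "$total"
  else
    echo $(( total - prev ))
  fi
}

flap_delta 7 12    # → 5
flap_delta 12 2    # → 2 (reset after a restart)
```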
Same thing in Enterno.io
Wire this check to an Enterno heartbeat to correlate flap episodes with deploy windows and surface "on release day etcd flapped 12 times" instead of scattered metrics.
Related recipes
Readiness probes pass inside the pod, but no one sees that the LB refused to route traffic to the new deploy.
A CrashLoopBackOff in one namespace: kubectl shows a restart count of 47, but nobody sees it. You want an endpoint that flips to failing when the counter jumps.
A node goes NotReady (kubelet stopped pinging the apiserver, runtime is sick) — pods on it linger like zombies until a taint evicts them. Kubernetes events do not go to Slack by default.