
etcd — alert on frequent leader flapping

Inside a K8s cluster, etcd starts re-electing its leader every 30 s or so: kube-apiserver latency spikes and controller-manager falls behind on reconciling. Every election briefly blocks writes, and the churn is visible only in etcd's own metrics.
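The metric to watch is `etcd_server_leader_changes_seen_total`, a cumulative counter in Prometheus exposition format. A minimal sketch of what the scrape sees and how to pull the value out (the sample value `12` is made up for illustration):

```shell
# Two lines as they appear on etcd's /metrics endpoint; the awk pattern
# matches the value line but skips the "# TYPE" comment line.
SAMPLE='# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 12'
TOTAL=$(printf '%s\n' "$SAMPLE" | awk '/^etcd_server_leader_changes_seen_total/ {print $2}')
echo "$TOTAL"   # 12
```

Because it is a counter, the absolute value is uninteresting; what matters is how fast it grows between scrapes.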

Recipe

```bash
#!/usr/bin/env bash
# /etc/cron.d/etcd-flap
# */5 * * * * root /opt/etcd-flap.sh

ENDPOINTS=${ETCD_ENDPOINTS:-https://127.0.0.1:2379}
CACERT=${ETCD_CACERT:-/etc/etcd/ca.crt}
CERT=${ETCD_CERT:-/etc/etcd/etcd.crt}
KEY=${ETCD_KEY:-/etc/etcd/etcd.key}

# Scrape the metrics endpoint — Prometheus exposition format.
# ${ENDPOINTS%%,*} takes the first endpoint if several are configured.
TOTAL=$(curl -fsS --cacert "$CACERT" --cert "$CERT" --key "$KEY" \
  "${ENDPOINTS%%,*}/metrics" | awk '/^etcd_server_leader_changes_seen_total/ {print $2}')
[ -n "$TOTAL" ] || { echo "could not read etcd metrics" >&2; exit 1; }

STATE=/var/lib/etcd-flap.state
PREV=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$TOTAL" > "$STATE"

# Cumulative counter: strip any decimal part, diff against the last run.
DELTA=$((${TOTAL%.*} - ${PREV%.*}))
[ "$DELTA" -ge 0 ] || DELTA=0        # counter resets after an etcd restart
THRESH=${THRESH:-3}                  # > 3 elections / 5 min = flapping

if [ "$DELTA" -gt "$THRESH" ]; then
  curl -fsS "$HEARTBEAT_URL" --data "leader_changes=$DELTA,window=5m"
  exit 2
fi
echo "OK ($DELTA leader changes / 5m)"
```
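The state-file diffing above can be dry-run without touching etcd. This sketch feeds fake counter values (5, 10, 10) through the same delta logic, using a throwaway state path; note the first run diffs against 0, so it reports the counter's full lifetime value:

```shell
STATE=$(mktemp); rm -f "$STATE"      # nonexistent path, like the very first cron run
for TOTAL in 5 10 10; do             # three simulated 5-minute scrapes
  PREV=$(cat "$STATE" 2>/dev/null || echo 0)
  echo "$TOTAL" > "$STATE"
  DELTA=$(( ${TOTAL%.*} - ${PREV%.*} ))
  echo "delta=$DELTA"                # prints 5, then 5, then 0
done
rm -f "$STATE"
```

To avoid one spurious reading on a fresh host, seed the state file with the current counter value before enabling the cron entry.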

Same thing in Enterno.io

Point `HEARTBEAT_URL` at an Enterno heartbeat: Enterno correlates flap episodes with deploy windows and surfaces "on release day etcd flapped 12 times" instead of scattered metrics.

