nginx — alert on a proxy_cache hit-ratio drop
When the nginx proxy_cache hit ratio drops, the backend starts burning CPU. The usual culprits: a forgotten proxy_cache_valid in a new location block, a wiped cache, or a TTL that is too short.
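The recipe below needs $upstream_cache_status in the access log, as the last field of each line. One way to get that (the format name and log path here are illustrative, not required):

```nginx
# illustrative log_format: combined-style line with $upstream_cache_status appended last
log_format cache '$remote_addr - $remote_user [$time_local] '
                 '"$request" $status $body_bytes_sent $upstream_cache_status';
access_log /var/log/nginx/access.log cache;
```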
Recipe
#!/usr/bin/env bash
# /etc/cron.d/nginx-cache
# * * * * * root /opt/nginx-cache.sh
LOG=${LOG:-/var/log/nginx/access.log}
WINDOW=60                                # look at the last 60 s
MIN_HITS_PCT=${MIN_HITS_PCT:-50}         # alert when HIT < 50% of total
MIN_REQUESTS=${MIN_REQUESTS:-50}         # below this, too few requests to judge
HEARTBEAT_URL=${HEARTBEAT_URL:?HEARTBEAT_URL must be set}
SINCE=$(date -d "-${WINDOW} seconds" '+%d/%b/%Y:%H:%M:%S')
# Assumes $upstream_cache_status is logged as the LAST field of each line.
# The lexicographic timestamp compare assumes the log is rotated often enough
# that only recent entries remain; it has a brief blind spot at month rollover.
read -r TOTAL HIT < <(awk -v since="$SINCE" '
    $4 >= "[" since {
        t++
        if ($NF == "HIT") h++
    }
    END { print t+0, h+0 }
' "$LOG")
[ "$TOTAL" -lt "$MIN_REQUESTS" ] && exit 0
PCT=$((HIT * 100 / TOTAL))
if [ "$PCT" -lt "$MIN_HITS_PCT" ]; then
    curl -fsS "$HEARTBEAT_URL" --data "hit_pct=$PCT,window=${WINDOW}s,total=$TOTAL"
    exit 2
fi
echo "OK (${PCT}% hit / ${WINDOW}s)"
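A quick way to sanity-check the awk extraction before cronning it, using a synthetic three-line log (paths, IPs, and timestamps are made up):

```shell
# Sanity-check the hit-ratio extraction on a synthetic log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
1.2.3.4 - - [07/Oct/2025:12:00:01 +0000] "GET /a HTTP/1.1" 200 10 HIT
1.2.3.4 - - [07/Oct/2025:12:00:02 +0000] "GET /b HTTP/1.1" 200 10 MISS
1.2.3.4 - - [07/Oct/2025:12:00:03 +0000] "GET /c HTTP/1.1" 200 10 HIT
EOF
SINCE='07/Oct/2025:12:00:00'         # fixed instead of date(1) so the run is reproducible
read -r TOTAL HIT < <(awk -v since="$SINCE" '
    $4 >= "[" since { t++; if ($NF == "HIT") h++ }
    END { print t+0, h+0 }
' "$LOG")
echo "total=$TOTAL hit=$HIT"         # total=3 hit=2
rm -f "$LOG"
```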
Same thing in Enterno.io
Wrap the script in an Enterno heartbeat — catch a hit-ratio drop within 60 s of a new nginx config deploy, before the backend burns out.
Related recipes
Memcached fills up and starts evicting keys under load; the app cache-misses and hammers the DB. Want an evictions/min threshold.
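That evictions check boils down to parsing memcached's stats output. A sketch with a canned sample (the helper name and values are hypothetical; in production you would feed it printf 'stats\r\nquit\r\n' | nc 127.0.0.1 11211 once a minute and alert on the delta between successive readings):

```shell
# Hypothetical helper: pull the evictions counter out of `stats` output on stdin.
evictions_from_stats() { awk '$2 == "evictions" { print $3 + 0 }'; }

# Canned sample in place of a live memcached (values are made up).
SAMPLE='STAT uptime 120
STAT evictions 42
STAT get_hits 9000'
NOW=$(printf '%s\n' "$SAMPLE" | evictions_from_stats)
echo "evictions=$NOW"                # evictions=42
# Cron it every minute, stash the last reading in a state file, and alert
# when NOW minus the previous reading crosses your evictions/min threshold.
```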
The server starts returning 503/504 — but a plain uptime check misses it because the homepage is 200 while the API path is on fire.
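The fix is to classify the API path's status code, not the homepage's. A minimal sketch of the classification (the curl line in the comment is how the real code would be fetched; without -f, curl's %{http_code} write-out prints 000 when the server never answers):

```shell
# Treat only 2xx as healthy; 503/504 (or no response at all) should alert.
status_ok() { case "$1" in 2??) return 0 ;; *) return 1 ;; esac; }

# In production CODE would come from the API path, not the homepage, e.g.:
#   CODE=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$API_URL")
status_ok 200 && HOME_STATE=ok || HOME_STATE=alert
status_ok 503 && API_STATE=ok  || API_STATE=alert
echo "homepage=$HOME_STATE api=$API_STATE"   # homepage=ok api=alert
```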
long_query_time = 1, slow_query_log enabled. You need to know when the slow-query rate per minute suddenly jumps (a deploy broke an index, ORM went N+1).
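The slow-query-rate check can be sketched the same way as the nginx recipe: count "# Time:" headers in the slow log newer than one minute ago (MySQL 5.7+ writes ISO 8601 timestamps, which compare correctly as strings). The sample entries below are invented:

```shell
# Count slow-log entries whose "# Time:" header is at or after $1 (ISO 8601).
slow_since() { awk -v since="$1" '/^# Time:/ && $3 >= since { n++ } END { print n + 0 }'; }

# Canned slow-log fragment (made-up queries). Real use:
#   slow_since "$(date -u -d '-60 seconds' '+%Y-%m-%dT%H:%M:%S')" < /var/log/mysql/slow.log
SAMPLE='# Time: 2025-10-07T12:00:01.000000Z
# Query_time: 2.5
SELECT 1;
# Time: 2025-10-07T12:05:00.000000Z
# Query_time: 3.1
SELECT 2;'
N=$(printf '%s\n' "$SAMPLE" | slow_since '2025-10-07T12:04:00')
echo "slow_per_window=$N"            # slow_per_window=1
```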