PostgreSQL — alert when autovacuum is stuck
When all autovacuum_max_workers slots are pinned (a long-running transaction holds a conflicting lock, or vacuum_cost_limit throttles the workers too hard), tables bloat and disk usage climbs steadily. Postgres itself will not alert you.
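To confirm the situation by hand before automating anything, you can list the running autovacuum workers in pg_stat_activity; a quick sketch (the backend_type filter assumes PostgreSQL 10 or newer):

```sql
-- How long has each autovacuum worker been running, and on what?
SELECT pid,
       now() - query_start AS runtime,
       query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'  -- PostgreSQL 10+
ORDER BY query_start;
```

If a worker has been sitting on the same table for tens of minutes, that is the case the script below catches automatically.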
Recipe
#!/usr/bin/env bash
# Save as /opt/pg-autovacuum.sh and make it executable, then schedule it
# via /etc/cron.d/pg-autovacuum:
# */15 * * * * postgres /opt/pg-autovacuum.sh
DB=${PGDATABASE:-postgres}
THRESH_MIN=${THRESH_MIN:-30} # alert if any vacuum runs > 30 min
OLDEST=$(psql -At -d "$DB" -c "
  SELECT EXTRACT(EPOCH FROM (NOW() - query_start)) / 60
  FROM pg_stat_activity
  WHERE query LIKE 'autovacuum:%'
  ORDER BY query_start ASC
  LIMIT 1;
")
OLDEST_MIN=${OLDEST%.*}      # drop the fractional part
OLDEST_MIN=${OLDEST_MIN:-0}  # no autovacuum running -> 0
if [ "$OLDEST_MIN" -gt "$THRESH_MIN" ]; then
  TABLE=$(psql -At -d "$DB" -c "
    SELECT regexp_replace(query, '^autovacuum: VACUUM\s+', '')
    FROM pg_stat_activity
    WHERE query LIKE 'autovacuum:%'
    ORDER BY query_start ASC LIMIT 1;
  ")
  # HEARTBEAT_URL must be set in the cron environment
  curl -fsS "$HEARTBEAT_URL" \
    --data "vacuum_min=$OLDEST_MIN" \
    --data-urlencode "table=$TABLE"
  exit 2
fi
echo "OK (oldest vacuum=${OLDEST_MIN}m)"
Same thing in Enterno.io
Wrap the script in an Enterno heartbeat to spot patterns like "every Monday autovacuum hangs for two hours" and target the specific bloated table before the DBA team reacts.
Related recipes
A table occupies 200 GB, of which 150 GB is bloat (dead tuples). VACUUM FULL needs an exclusive lock, and autovacuum cannot keep up. You only notice when an index scan turns into a seq scan.
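A rough bloat signal is the dead-tuple count from the statistics collector; a sketch (these are estimates, not exact bytes, but they rank the worst offenders):

```sql
-- Tables with the most dead tuples, worst first.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1)
         AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```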
Detect the moment a replica falls behind the primary by more than 10 seconds.
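The core measurement for that recipe is a single query on the replica; a sketch (returns NULL until something has been replayed, and on an idle primary the value grows even when nothing is actually behind):

```sql
-- Run on the replica: seconds since the last replayed transaction.
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
  AS lag_seconds;
```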
Every week some analyst fires off a heavy SELECT without a LIMIT and blocks other transactions. You want to know within a minute, not within an hour.
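The detection side of that recipe is again pg_stat_activity; a sketch that lists active statements running longer than one minute (the one-minute cutoff is an illustrative threshold):

```sql
-- Active statements older than one minute, longest-running first.
SELECT pid,
       now() - query_start AS runtime,
       usename,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '1 minute'
  AND pid <> pg_backend_pid()  -- skip this monitoring session itself
ORDER BY query_start;
```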