Monitoring cookbook
Hand-written recipes for the monitoring problems we see most often. Each recipe shows a minimal DIY script and the one-click Enterno.io monitor that covers the same concern without extra infrastructure.
The consumer is up, but its offset isn't growing (a consumer-thread deadlock, or it's stuck without a heartbeat). A lag-only check shows 0 lag because the producer is also idle, yet the bug is live in production.
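A minimal DIY sketch for this one, assuming a Kafka consumer group and the kafka-python package (the broker address, group id, and interval below are placeholders): sample the committed offsets twice and alert if nothing moved.
```python
# Sketch: alert when a consumer group's committed offsets stop advancing,
# even while reported lag stays at 0. Assumes kafka-python; broker and
# group names are placeholders.
import time
from kafka import KafkaAdminClient

BROKERS = "kafka:9092"        # assumption: your bootstrap servers
GROUP = "orders-consumer"     # assumption: your consumer group id
WINDOW = 300                  # seconds between the two samples

def committed(admin):
    # list_consumer_group_offsets -> {TopicPartition: OffsetAndMetadata}
    return {tp: om.offset for tp, om in
            admin.list_consumer_group_offsets(GROUP).items()}

admin = KafkaAdminClient(bootstrap_servers=BROKERS)
before = committed(admin)
time.sleep(WINDOW)
after = committed(admin)

if before and all(after.get(tp) == off for tp, off in before.items()):
    print(f"ALERT: {GROUP} committed offsets unchanged for {WINDOW}s")
```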
Main SQS queue is processing fine, but the DLQ silently grows — some messages fail all 3 delivery attempts and end up there. Nobody looks at the DLQ until it's a thousand deep.
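A minimal sketch with boto3, assuming AWS credentials are already in the environment (the queue URL and threshold are placeholders):
```python
# Sketch: alert when the dead-letter queue depth crosses a threshold.
import boto3

DLQ_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders-dlq"  # placeholder
THRESHOLD = 10

sqs = boto3.client("sqs")
attrs = sqs.get_queue_attributes(
    QueueUrl=DLQ_URL,
    AttributeNames=["ApproximateNumberOfMessages"],
)
depth = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
if depth > THRESHOLD:
    print(f"ALERT: DLQ depth is {depth} (threshold {THRESHOLD})")
```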
Prometheus itself is alive, but one of its targets has up == 0 — data stops flowing, graphs go blank, and the alerting rules built on that target's metrics never fire (no data = no alert).
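A minimal sketch that asks Prometheus itself which targets are down, via its HTTP query API (the Prometheus URL is a placeholder):
```python
# Sketch: list targets currently reporting up == 0.
import requests

PROM = "http://prometheus:9090"   # placeholder

resp = requests.get(f"{PROM}/api/v1/query", params={"query": "up == 0"}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    print(f"ALERT: target down: job={labels.get('job')} instance={labels.get('instance')}")
```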
OTEL collector is overloaded — `otelcol_exporter_send_failed_spans` is climbing. Traces are lost, prod debugging goes blind. The tracing backend hides the gap.
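A minimal sketch, assuming the Collector exposes its own telemetry on the default port 8888 (adjust for your deployment): sample the failed-spans counter twice and alert if it grew.
```python
# Sketch: scrape the Collector's internal metrics endpoint and compare the
# failed-spans counter between two samples. Assumes the default internal
# telemetry port; the metric may carry a _total suffix in newer versions.
import time
import requests

METRICS_URL = "http://otel-collector:8888/metrics"   # assumption: default telemetry port

def failed_spans():
    total = 0.0
    for line in requests.get(METRICS_URL, timeout=10).text.splitlines():
        if line.startswith("otelcol_exporter_send_failed_spans"):
            total += float(line.split()[-1])   # last field is the sample value
    return total

first = failed_spans()
time.sleep(60)
second = failed_spans()
if second > first:
    print(f"ALERT: {second - first:.0f} spans failed to export in the last minute")
```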
docker info hangs >30 s — the daemon is wedged. Containers keep running (the kernel holds the namespaces), but you cannot deploy a new release, and `systemctl status` still shows active.
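A minimal sketch: run `docker info` under a hard timeout, so a hanging daemon counts as a failure even though systemd reports it active.
```python
# Sketch: treat a slow `docker info` as a failure, not just a non-zero exit.
# Assumes the docker CLI is on PATH.
import subprocess

try:
    subprocess.run(["docker", "info"], capture_output=True, timeout=30, check=True)
except subprocess.TimeoutExpired:
    print("ALERT: docker info did not answer within 30s (daemon wedged?)")
except subprocess.CalledProcessError as e:
    print(f"ALERT: docker info failed: {e.stderr.decode(errors='replace')[:200]}")
```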
A node goes NotReady (kubelet stopped pinging the apiserver, runtime is sick) — pods on it linger like zombies until a taint evicts them. Kubernetes events do not go to Slack by default.
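A minimal sketch with the official kubernetes Python client, flagging nodes whose Ready condition is not True:
```python
# Sketch: list nodes that are NotReady. Assumes a kubeconfig or
# in-cluster credentials.
from kubernetes import client, config

try:
    config.load_incluster_config()
except config.ConfigException:
    config.load_kube_config()

v1 = client.CoreV1Api()
for node in v1.list_node().items:
    ready = next((c for c in node.status.conditions if c.type == "Ready"), None)
    if ready is None or ready.status != "True":
        print(f"ALERT: node {node.metadata.name} is NotReady "
              f"({ready.reason if ready else 'no Ready condition'})")
```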
S3 endpoint starts 5xx-ing — your app gets random failures on upload. AWS Health shows 'healthy', and the CloudWatch alarm works on a 5-min aggregate, so the reaction is late.
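A minimal canary sketch with boto3: a tiny upload every minute surfaces 5xx bursts long before a 5-minute aggregate does (the bucket name is a placeholder):
```python
# Sketch: canary upload with tight timeouts and no retries, so failures
# show up immediately.
import time
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

BUCKET = "my-canary-bucket"   # assumption: a bucket dedicated to the canary
s3 = boto3.client("s3", config=Config(retries={"max_attempts": 1},
                                      connect_timeout=5, read_timeout=5))
try:
    s3.put_object(Bucket=BUCKET, Key="canary.txt", Body=str(time.time()).encode())
except (ClientError, EndpointConnectionError) as e:
    print(f"ALERT: S3 canary upload failed: {e}")
```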
An istio-proxy sidecar in a pod is restarting — the app keeps running, but mesh policy is broken, mTLS goes unchecked, and traffic flows in violation of policy.
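A minimal sketch with the kubernetes Python client, counting restarts of the istio-proxy container across all namespaces:
```python
# Sketch: flag pods whose istio-proxy sidecar has restarted.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
for pod in v1.list_pod_for_all_namespaces().items:
    for cs in pod.status.container_statuses or []:
        if cs.name == "istio-proxy" and cs.restart_count > 0:
            print(f"ALERT: {pod.metadata.namespace}/{pod.metadata.name} "
                  f"istio-proxy restarted {cs.restart_count} times")
```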
Envoy returns 503 (upstream timeout, no healthy hosts) — users get 5xx, but upstreams themselves are healthy. A standard 5xx-monitor shows "all OK" because it watches the app.
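A minimal sketch of one way to catch this: probe the same health endpoint through the proxy layer and directly against the upstream, and alert only when the proxied path fails (both URLs are placeholders):
```python
# Sketch: compare a request through the mesh/ingress with a direct request
# to the upstream, so proxy-layer 503s stand out even when the app is healthy.
import requests

VIA_PROXY = "https://shop.example.com/healthz"    # assumption: through the mesh
DIRECT = "http://orders.internal:8080/healthz"    # assumption: straight to the upstream

def status(url):
    try:
        return requests.get(url, timeout=5).status_code
    except requests.RequestException:
        return None

proxy_code, direct_code = status(VIA_PROXY), status(DIRECT)
if direct_code == 200 and proxy_code != 200:
    print(f"ALERT: upstream healthy ({direct_code}) but proxy path returns {proxy_code}")
```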
logrotate stopped (config syntax error on last edit, or the systemd timer was disabled) — the main log file grows. Nobody notices until the disk fills.
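A minimal sketch: the symptom is the main log file quietly growing, so alert on its size (path and threshold are placeholders):
```python
# Sketch: a dead logrotate shows up as the main log file ballooning.
import os

LOG = "/var/log/app/app.log"   # placeholder
MAX_SIZE_MB = 500

size_mb = os.stat(LOG).st_size / 1024 / 1024
if size_mb > MAX_SIZE_MB:
    print(f"ALERT: {LOG} is {size_mb:.0f} MB; has logrotate stopped?")
```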
A Borg backup fails (passphrase rotated, repo lock stuck, ssh key expired) — you only learn when you need to restore, and the last snapshot is a week old.
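A minimal sketch, assuming borg >= 1.1 for `--json` (the repo URL is a placeholder, and BORG_PASSPHRASE / SSH access must already be set up): alert when the newest archive is too old.
```python
# Sketch: check the age of the newest archive instead of trusting the
# backup job's exit code.
import json
import subprocess
from datetime import datetime, timedelta

REPO = "ssh://backup@backup-host/./repo"   # placeholder
MAX_AGE = timedelta(hours=26)              # daily backups plus some slack

out = subprocess.run(["borg", "list", "--json", "--last", "1", REPO],
                     capture_output=True, text=True, timeout=120, check=True)
archives = json.loads(out.stdout)["archives"]
if not archives:
    print("ALERT: borg repository has no archives")
else:
    newest = datetime.fromisoformat(archives[0]["time"])
    if datetime.now() - newest > MAX_AGE:
        print(f"ALERT: last borg archive is from {newest}, older than {MAX_AGE}")
```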
A Redis Streams consumer lags — it reads messages but never XACKs them (the worker hangs between read and ack). The stream length (XLEN) looks normal, but the pending-entries list (XPENDING) keeps growing.
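A minimal sketch with redis-py: alert on the size of the consumer group's pending-entries list (stream, group, and threshold are placeholders):
```python
# Sketch: XPENDING summary for the group; 'pending' is the count of
# delivered-but-unacknowledged messages.
import redis

r = redis.Redis(host="redis", port=6379)
STREAM, GROUP = "orders", "workers"   # placeholders
THRESHOLD = 1000

pending = r.xpending(STREAM, GROUP)["pending"]
if pending > THRESHOLD:
    print(f"ALERT: {pending} unacknowledged messages in {STREAM}/{GROUP}")
```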
PVC is created, but the provisioner did not allocate the volume (wrong StorageClass? capacity exhausted? a broken CSI driver? an upstream cloud quota?). The pod waits and never starts — deployment status does not show why.
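A minimal sketch with the kubernetes Python client: flag PVCs that sit in Pending past a grace period.
```python
# Sketch: list PVCs stuck in Pending longer than a grace period.
from datetime import datetime, timezone, timedelta
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
GRACE = timedelta(minutes=10)

for pvc in v1.list_persistent_volume_claim_for_all_namespaces().items:
    if pvc.status.phase == "Pending":
        age = datetime.now(timezone.utc) - pvc.metadata.creation_timestamp
        if age > GRACE:
            print(f"ALERT: PVC {pvc.metadata.namespace}/{pvc.metadata.name} "
                  f"Pending for {age} (StorageClass: {pvc.spec.storage_class_name})")
```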
cron is alive, but the job did not run last night (the timer was disabled, MAILTO=root dumped the error into an unread mailbox, or a shell syntax error crept into the crontab). The classic case: nobody notices until the morning reports come up empty.
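A minimal dead man's switch sketch: the cron job touches a sentinel file as its last step, and a separate checker alerts when that file goes stale (paths and ages are placeholders).
```python
# Sketch: the job ends with "... && touch /var/run/nightly-report.ok";
# this checker alerts when the sentinel is missing or too old.
import os
import time

SENTINEL = "/var/run/nightly-report.ok"   # placeholder
MAX_AGE_HOURS = 26

try:
    age_h = (time.time() - os.stat(SENTINEL).st_mtime) / 3600
except FileNotFoundError:
    age_h = float("inf")
if age_h > MAX_AGE_HOURS:
    print(f"ALERT: nightly job has not checked in for {age_h:.0f}h")
```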
Autovacuum is pinned at autovacuum_max_workers (a long-running query holds a lock, or vacuum_cost_limit is too low) — tables bloat, disk usage climbs linearly. Postgres itself does not alert.
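A minimal sketch with psycopg2: surface tables with lots of dead tuples and a stale last_autovacuum (the DSN and threshold are placeholders):
```python
# Sketch: query pg_stat_user_tables for the worst offenders.
import os
import psycopg2

conn = psycopg2.connect(os.environ["PG_DSN"])   # e.g. "host=db dbname=app user=monitor"
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT relname, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        WHERE n_dead_tup > 100000
        ORDER BY n_dead_tup DESC
        LIMIT 10
    """)
    for relname, dead, last_av in cur.fetchall():
        print(f"ALERT: {relname} has {dead} dead tuples (last autovacuum: {last_av})")
```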
A service's VAULT_TOKEN is close to expiry (no auto-renewal, or the token was issued as non-renewable). The service keeps hitting Vault — and one day it gets 403 and loses access to its secrets.
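A minimal sketch against the plain Vault HTTP API: look up the service's own token and alert when the remaining TTL gets short.
```python
# Sketch: GET /v1/auth/token/lookup-self and check the remaining TTL.
import os
import requests

VAULT_ADDR = os.environ.get("VAULT_ADDR", "http://vault:8200")
TOKEN = os.environ["VAULT_TOKEN"]
MIN_TTL_SECONDS = 24 * 3600

resp = requests.get(f"{VAULT_ADDR}/v1/auth/token/lookup-self",
                    headers={"X-Vault-Token": TOKEN}, timeout=10)
resp.raise_for_status()
data = resp.json()["data"]
if 0 < data["ttl"] < MIN_TTL_SECONDS:
    print(f"ALERT: Vault token expires in {data['ttl']}s "
          f"(renewable: {data.get('renewable', False)})")
```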
fail2ban bans sources by a per-IP threshold — but a campaign hits from thousands of IPs at one attempt each. No single IP crosses the threshold, so nothing gets banned, yet the overall noise on the SSH port is huge.
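A minimal sketch that counts total failed SSH logins and unique sources over the last hour from the journal, independent of any per-IP threshold (the unit is `ssh` on Debian-family hosts, `sshd` elsewhere):
```python
# Sketch: aggregate failed-password lines from journald over the last hour.
import re
import subprocess

out = subprocess.run(
    ["journalctl", "-u", "ssh", "--since", "-1h", "--no-pager", "-o", "cat"],
    capture_output=True, text=True).stdout

ips = re.findall(r"Failed password for .* from (\S+)", out)
if len(ips) > 500:
    print(f"ALERT: {len(ips)} failed SSH logins from {len(set(ips))} unique IPs in 1h")
```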
nginx proxy_cache hit ratio drops — the backend starts burning CPU. Usually someone forgot proxy_cache_valid in a new location block, the cache was wiped, or the TTL is too short.
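A minimal sketch, assuming the access log_format includes $upstream_cache_status (path and threshold are placeholders): compute the hit ratio from the log.
```python
# Sketch: rough hit-ratio from the access log; the substring match assumes
# $upstream_cache_status appears as its own field in the log_format.
from collections import Counter

LOG = "/var/log/nginx/access.log"   # placeholder
MIN_HIT_RATIO = 0.7

counts = Counter()
with open(LOG) as f:
    for line in f:
        for status in ("HIT", "MISS", "EXPIRED", "BYPASS"):
            if f" {status} " in line:
                counts[status] += 1
                break

total = sum(counts.values())
if total and counts["HIT"] / total < MIN_HIT_RATIO:
    print(f"ALERT: cache hit ratio {counts['HIT'] / total:.0%} ({dict(counts)})")
```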
Connection to a database / partner API loses 5–10 % of packets — the app sees timeouts, but `ping -c 4` says 'all good'. TCP retransmits silently chop throughput.
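A minimal sketch: a longer ping burst exposes the loss that `ping -c 4` hides (host and threshold are placeholders).
```python
# Sketch: 100 pings at 0.2s spacing, then parse the loss percentage.
import re
import subprocess

HOST = "db.internal.example.com"   # placeholder
MAX_LOSS_PCT = 2.0

out = subprocess.run(["ping", "-c", "100", "-i", "0.2", HOST],
                     capture_output=True, text=True, timeout=60).stdout
m = re.search(r"([\d.]+)% packet loss", out)
if m and float(m.group(1)) > MAX_LOSS_PCT:
    print(f"ALERT: {m.group(1)}% packet loss to {HOST}")
```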
A BGP session with an upstream or cloud peering drops — half of your routes are gone. The peer does not notify you, and your network monitoring (if any) is often not wired to BGP state.
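A minimal, heavily environment-specific sketch for FRR via vtysh; the JSON key names can differ between FRR versions, so verify against your own `show bgp summary json` output, and note that other stacks (BIRD, vendor routers) need their own query.
```python
# Sketch: flag BGP peers that are not Established, per FRR's JSON summary.
# Assumes FRR with vtysh on the local host; key names are an assumption.
import json
import subprocess

out = subprocess.run(["vtysh", "-c", "show bgp summary json"],
                     capture_output=True, text=True, check=True).stdout
summary = json.loads(out)
peers = summary.get("ipv4Unicast", {}).get("peers", {})
for peer, info in peers.items():
    if info.get("state") != "Established":
        print(f"ALERT: BGP session with {peer} is {info.get('state')}")
```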
Have a recipe we missed?
Tell us which stack to cover next — drop a line to support@enterno.io and we'll add the recipe (and credit you on the page).
Start monitoring — free →