Envoy — alert when proxy 5xx without upstream 5xx
Envoy returns 503 (upstream timeout, no healthy hosts) — users get 5xx, but upstreams themselves are healthy. A standard 5xx-monitor shows "all OK" because it watches the app.
Recipe
#!/usr/bin/env bash
# /etc/cron.d/envoy-mismatch
# */1 * * * * root /opt/envoy-mismatch.sh
ENVOY_METRICS=${ENVOY_METRICS:-http://localhost:9901/stats/prometheus}
STATE=/var/lib/envoy-mismatch.state
THRESH=${THRESH:-5} # ratio threshold
PROXY_5XX=$(curl -fsS "$ENVOY_METRICS" | awk '/^envoy_http_downstream_rq_5xx/ {sum += $2} END {print int(sum)+0}')
UPSTREAM_5XX=$(curl -fsS "$ENVOY_METRICS" | awk '/^envoy_cluster_upstream_rq_5xx/ {sum += $2} END {print int(sum)+0}')
# Difference = 5xx Envoy generated locally (upstream timeout, circuit-break)
LOCAL_GEN=$((PROXY_5XX - UPSTREAM_5XX))
[ "$LOCAL_GEN" -lt 0 ] && LOCAL_GEN=0
PREV=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$LOCAL_GEN" > "$STATE"
DELTA=$((LOCAL_GEN - PREV))
if [ "$DELTA" -gt "$THRESH" ]; then
curl -fsS "$HEARTBEAT_URL" --data "envoy_local_5xx=$DELTA,window=1m"
exit 2
fi
echo "OK ($DELTA Envoy-generated 5xx / 1m)"
Same thing in Enterno.io
Wire to an Enterno heartbeat — see "proxy itself returns 5xx" specifically, not the app, so you can target the fix at the mesh layer instead of pulling in the app team.
Related recipes
Ensure your site returns 2xx every minute, alert to Slack/Telegram on failure.
Minimal script that checks an SSL certificate and alerts 14 days before expiry.
Detect the moment a replica falls behind the primary by more than 10 seconds.