Skip to content

Envoy — alert when proxy 5xx without upstream 5xx

Envoy returns 503 (upstream timeout, no healthy hosts) — users get 5xx, but upstreams themselves are healthy. A standard 5xx-monitor shows "all OK" because it watches the app.

Recipe

bash
#!/usr/bin/env bash
# /etc/cron.d/envoy-mismatch
# */1 * * * * root /opt/envoy-mismatch.sh

ENVOY_METRICS=${ENVOY_METRICS:-http://localhost:9901/stats/prometheus}
STATE=/var/lib/envoy-mismatch.state
THRESH=${THRESH:-5}                   # ratio threshold

PROXY_5XX=$(curl -fsS "$ENVOY_METRICS" | awk '/^envoy_http_downstream_rq_5xx/ {sum += $2} END {print int(sum)+0}')
UPSTREAM_5XX=$(curl -fsS "$ENVOY_METRICS" | awk '/^envoy_cluster_upstream_rq_5xx/ {sum += $2} END {print int(sum)+0}')

# Difference = 5xx Envoy generated locally (upstream timeout, circuit-break)
LOCAL_GEN=$((PROXY_5XX - UPSTREAM_5XX))
[ "$LOCAL_GEN" -lt 0 ] && LOCAL_GEN=0

PREV=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$LOCAL_GEN" > "$STATE"
DELTA=$((LOCAL_GEN - PREV))

if [ "$DELTA" -gt "$THRESH" ]; then
  curl -fsS "$HEARTBEAT_URL" --data "envoy_local_5xx=$DELTA,window=1m"
  exit 2
fi
echo "OK ($DELTA Envoy-generated 5xx / 1m)"

Same thing in Enterno.io

Wire to an Enterno heartbeat — see "proxy itself returns 5xx" specifically, not the app, so you can target the fix at the mesh layer instead of pulling in the app team.

Set up HTTP monitor → ← All recipes

Related recipes