AWS Lambda — alert when p99 cold-start regresses
A release bumped the bundle size and p99 cold-start went from 800ms to 3s. The metric is in CloudWatch, but nobody’s watching. Want a heartbeat-style alert.
Recipe
#!/usr/bin/env bash
# Pulls Lambda Init Duration p99 over the last 10 min for one function.
FN="${FN:-my-api}"
THRESHOLD_MS="${THRESHOLD_MS:-1500}"
P99=$(aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda --metric-name Duration \
--dimensions Name=FunctionName,Value="$FN" Name=Resource,Value="$FN" \
--start-time "$(date -u -d '10 min ago' +%FT%TZ)" \
--end-time "$(date -u +%FT%TZ)" \
--period 60 --extended-statistics p99 \
--query 'Datapoints[-1].ExtendedStatistics.p99' --output text 2>/dev/null)
[ -z "$P99" ] || [ "$P99" = "None" ] && { echo "no-data"; exit 1; }
P99_MS=$(printf "%.0f" "$P99")
[ "$P99_MS" -ge "$THRESHOLD_MS" ] && echo "high $P99_MS" || echo "ok $P99_MS"
Same thing in Enterno.io
Drop next to the function. Wrap the endpoint via an Enterno HTTP monitor + a PageSpeed monitor on the function’s public endpoint — correlation of cold-start spikes vs full-page latency.
Related recipes
long_query_time = 1, slow_query_log enabled. You need to know when the slow-query rate per minute suddenly jumps (a deploy broke an index, ORM went N+1).
Memcached fills up and starts evicting keys under load; the app cache-misses and hammers the DB. Want an evictions/min threshold.
Node app stalls under CPU-blocking operations; user latency creeps up. Want an endpoint that exposes the live event-loop-lag value.