Monitoring cookbook

Hand-written recipes for the monitoring problems we see most often. Each recipe shows a minimal DIY script and the one-click Enterno.io monitor that covers the same concern without extra infrastructure.

Anonymous image pulls hit Docker Hub limits (100/6h per IP) — CI starts failing with TooManyRequests. Usually visible only after you are already over.
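A minimal DIY check (sketch): Docker Hub returns its rate-limit counters as response headers, and a HEAD request does not spend a pull. The alert threshold of 20 is an arbitrary assumption.

```python
import requests

# Anonymous token for the rate-limit preview repo (limits are per requesting IP).
token = requests.get(
    "https://auth.docker.io/token"
    "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull",
    timeout=10,
).json()["token"]

# HEAD returns the counters without counting as a pull.
resp = requests.head(
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)
# Header value looks like "87;w=21600": 87 pulls left in a 21600 s (6 h) window.
remaining = int(resp.headers["ratelimit-remaining"].split(";")[0])
if remaining < 20:  # threshold is an assumption; tune to your CI volume
    print(f"ALERT: only {remaining} anonymous Docker Hub pulls left this window")
```

Run it from the same egress IP your CI uses, otherwise you are measuring someone else's quota.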
Falco logs suspicious actions (write to /etc, shell in container, unexpected network connect) — but logs sit locally and nobody looks. An in-container attack develops silently.
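A minimal DIY forwarder (sketch; assumes Falco runs with `json_output: true` and `file_output` to the path below; both the path and the webhook URL are assumptions):

```python
import json, time, requests

FALCO_LOG = "/var/log/falco/events.json"     # assumes file_output is enabled
WEBHOOK = "https://hooks.example.com/falco"  # hypothetical alert endpoint
PAGE = {"Emergency", "Alert", "Critical", "Error", "Warning"}

with open(FALCO_LOG) as f:
    f.seek(0, 2)  # start at the end of the file, like tail -f
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)
            continue
        event = json.loads(line)
        if event.get("priority") in PAGE:
            requests.post(WEBHOOK, json={
                "rule": event.get("rule"),
                "output": event.get("output"),
            }, timeout=5)
```

In production you would likely run Falcosidekick instead; the point is that the events must leave the host.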
A Prometheus alert sits in state=pending past its `for` window — it should have flipped to firing but has not (group_wait too big? notifier broken? misconfigured route?). Nobody gets paged.
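A minimal DIY sweep over the Prometheus rules API (sketch; the Prometheus URL and the 60 s grace period are assumptions):

```python
import requests
from datetime import datetime, timezone

PROM = "http://prometheus:9090"  # hypothetical in-cluster URL

groups = requests.get(f"{PROM}/api/v1/rules", timeout=10).json()["data"]["groups"]
now = datetime.now(timezone.utc)

for group in groups:
    for rule in group["rules"]:
        if rule.get("type") != "alerting":
            continue
        for alert in rule.get("alerts", []):
            if alert["state"] != "pending":
                continue
            # activeAt is RFC3339; rule["duration"] is the `for:` clause in seconds
            active_at = datetime.strptime(
                alert["activeAt"][:19], "%Y-%m-%dT%H:%M:%S"
            ).replace(tzinfo=timezone.utc)
            overdue = (now - active_at).total_seconds() - rule["duration"]
            if overdue > 60:  # grace period is an assumption
                print(f"STUCK: {rule['name']} pending {overdue:.0f}s past its for-window")
```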
An Airflow DAG finished past its SLA (it did not fail, it just succeeded late). By default an SLA miss only triggers an email callback that is rarely configured. The pipeline shows a "red flag" an hour after the fact.
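A minimal DIY check against the Airflow REST API (sketch; the URL, the read-only user, and the per-DAG lateness budgets are all assumptions):

```python
import requests
from datetime import datetime, timezone, timedelta

AIRFLOW = "http://airflow:8080/api/v1"               # hypothetical URL
AUTH = ("monitor", "changeme")                        # hypothetical read-only user
SLAS = {"daily_revenue_report": timedelta(hours=2)}   # DAG id -> allowed lateness

def ts(s):
    return datetime.strptime(s[:19], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)

for dag_id, budget in SLAS.items():
    runs = requests.get(
        f"{AIRFLOW}/dags/{dag_id}/dagRuns",
        params={"order_by": "-end_date", "limit": 1, "state": "success"},
        auth=AUTH, timeout=10,
    ).json()["dag_runs"]
    if not runs:
        continue
    lateness = ts(runs[0]["end_date"]) - ts(runs[0]["data_interval_end"])
    if lateness > budget:
        print(f"SLA MISS: {dag_id} finished {lateness} after its interval ended")
```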
A table takes 200 GB, of which 150 GB is bloat (dead tuples). VACUUM FULL needs an exclusive lock, and autovacuum cannot keep up. You notice when an index scan turns into a seq scan.
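A minimal DIY query over the built-in statistics (sketch; the DSN and both thresholds are assumptions; a precise bloat estimate needs pgstattuple, but dead-tuple counts catch the worst offenders):

```python
import psycopg2

conn = psycopg2.connect("host=db.internal dbname=app user=monitor")  # hypothetical DSN
with conn.cursor() as cur:
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup
        FROM pg_stat_user_tables
        WHERE n_dead_tup > 100000                  -- absolute floor: assumption
          AND n_dead_tup > n_live_tup * 0.5        -- >50% dead: assumption
        ORDER BY n_dead_tup DESC
    """)
    for relname, live, dead in cur.fetchall():
        print(f"BLOAT: {relname}: {dead} dead vs {live} live tuples")
```

pg_repack can then compact the table without the exclusive lock VACUUM FULL takes.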
Someone set `spec.suspend: true` on a CronJob (debug or rushed release) and forgot to revert. The daily task does not run, reports are not generated — you only learn when finance asks.
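A minimal DIY sweep (sketch; assumes `kubectl` is configured for the cluster):

```python
import json, subprocess

out = subprocess.run(
    ["kubectl", "get", "cronjobs", "-A", "-o", "json"],
    capture_output=True, check=True, text=True,
).stdout

for cj in json.loads(out)["items"]:
    if cj["spec"].get("suspend"):
        ns, name = cj["metadata"]["namespace"], cj["metadata"]["name"]
        print(f"SUSPENDED: {ns}/{name}: did someone forget to re-enable it?")
```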
An Azure subscription approaches a quota (vCPUs per region, public IPs, storage accounts) — the next terraform apply fails with 429/ItemNotFound. Quota raises go via a support ticket, so you need a head start.
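A minimal DIY check via the Azure CLI (sketch; this covers compute quotas only, network and storage need their own `list-usages` calls; region and threshold are assumptions):

```python
import json, subprocess

REGION = "westeurope"  # assumption
THRESHOLD = 0.8        # alert at 80% of quota: assumption

usage = json.loads(subprocess.run(
    ["az", "vm", "list-usage", "--location", REGION, "-o", "json"],
    capture_output=True, check=True, text=True,
).stdout)

for item in usage:
    limit, current = int(item["limit"]), int(item["currentValue"])
    if limit > 0 and current / limit >= THRESHOLD:
        print(f"QUOTA: {item['name']['localizedValue']}: {current}/{limit} in {REGION}")
```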
The Datadog agent dies (OOM, a broken apt update, cert expiry on dd-staging.com) — the host disappears from the dashboard after 10 min (the default mute window), but nothing alerts you that monitoring has gone blind.
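A minimal DIY watchdog over the Datadog hosts API (sketch; run it anywhere except the hosts it watches; the 15 min threshold is an assumption, and large fleets need the API's pagination parameters):

```python
import os, time, requests

resp = requests.get(
    "https://api.datadoghq.com/api/v1/hosts",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    timeout=10,
).json()

now = time.time()
for host in resp["host_list"]:
    silent = now - host["last_reported_time"]
    if silent > 900:  # 15 min of silence: threshold is an assumption
        print(f"BLIND SPOT: {host['name']} silent for {silent / 60:.0f} min")
```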
Compliance mandates rotating DB credentials every 90 days. Vault static-creds engine should do it, but someone set max_ttl=0 — the secret lives forever. The auditor finds it first.
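A minimal DIY audit of the database static roles (sketch; assumes the engine is mounted at `database/` and that 90 days is the compliance bar):

```python
import os, requests

VAULT = os.environ.get("VAULT_ADDR", "https://vault.internal:8200")  # assumption
HEADERS = {"X-Vault-Token": os.environ["VAULT_TOKEN"]}
MAX_PERIOD = 90 * 24 * 3600  # 90 days, in seconds

roles = requests.get(f"{VAULT}/v1/database/static-roles?list=true",
                     headers=HEADERS, timeout=10).json()["data"]["keys"]
for role in roles:
    cfg = requests.get(f"{VAULT}/v1/database/static-roles/{role}",
                       headers=HEADERS, timeout=10).json()["data"]
    period = cfg.get("rotation_period") or 0
    if not period or period > MAX_PERIOD:
        print(f"ROTATION GAP: {role}: rotation_period={period}")
```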
MongoDB writes on the primary grow faster than oplog retention. If a secondary falls behind by more than the oplog window, it needs a full initial sync (hours of downtime). Usually noticed too late.
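A minimal DIY check with pymongo (sketch; the connection string and the half-window threshold are assumptions):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://monitor@mongo.internal/?replicaSet=rs0")  # hypothetical

# Oplog window: newest minus oldest entry in local.oplog.rs.
oplog = client.local.oplog.rs
first = oplog.find().sort("$natural", 1).limit(1).next()["ts"].time
last = oplog.find().sort("$natural", -1).limit(1).next()["ts"].time
window = last - first

# Worst secondary lag from replSetGetStatus.
status = client.admin.command("replSetGetStatus")
newest = max(m["optimeDate"] for m in status["members"])
lag = max((newest - m["optimeDate"]).total_seconds()
          for m in status["members"] if m["stateStr"] == "SECONDARY")

if lag > window * 0.5:  # half the window consumed: threshold is an assumption
    print(f"OPLOG RISK: worst lag {lag:.0f}s against a {window}s oplog window")
```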
Cassandra needs a full repair within `gc_grace_seconds` (default 10 days) — otherwise deletes resurrect as zombies on failover. Easy to miss without a scheduler.
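A minimal DIY check against `system_distributed.repair_history` (sketch; the contact point and table are hypothetical, and the status strings should be verified against your Cassandra version):

```python
from datetime import datetime, timedelta, timezone
from cassandra.cluster import Cluster

GC_GRACE = timedelta(days=10)       # default gc_grace_seconds; check your schema
KEYSPACE, TABLE = "app", "events"   # hypothetical table to watch

session = Cluster(["cassandra.internal"]).connect()  # hypothetical contact point
rows = session.execute(
    "SELECT finished_at, status FROM system_distributed.repair_history "
    "WHERE keyspace_name=%s AND columnfamily_name=%s",
    (KEYSPACE, TABLE),
)
done = [r.finished_at for r in rows if r.status == "SUCCESS" and r.finished_at]
newest = max(done, default=None)
if newest is None or datetime.now(timezone.utc) - newest.replace(tzinfo=timezone.utc) > GC_GRACE * 0.8:
    print(f"REPAIR OVERDUE: last successful repair of {KEYSPACE}.{TABLE}: {newest}")
```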
Someone ran `kubectl edit` directly on the cluster — the manifest diverges from git. ArgoCD shows OutOfSync, but auto-sync is off, so the divergence keeps accumulating.
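A minimal DIY sweep (sketch; assumes a logged-in `argocd` CLI):

```python
import json, subprocess

out = subprocess.run(
    ["argocd", "app", "list", "-o", "json"],
    capture_output=True, check=True, text=True,
).stdout

for app in json.loads(out):
    sync = app["status"]["sync"]["status"]
    if sync != "Synced":
        print(f"DRIFT: {app['metadata']['name']} is {sync}")
```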
The Jenkins queue grows — an agent went away, label mismatch, or executors are saturated. PR checks hang, devs start chat-pinging "what is up with CI?".
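A minimal DIY probe of the queue API (sketch; the URL, credentials, and the 10 min threshold are assumptions):

```python
import time, requests

JENKINS = "https://jenkins.internal"  # hypothetical URL
AUTH = ("monitor", "api-token")       # hypothetical user + API token

queue = requests.get(f"{JENKINS}/queue/api/json", auth=AUTH, timeout=10).json()
now_ms = time.time() * 1000
stuck = [i for i in queue["items"] if now_ms - i["inQueueSince"] > 10 * 60 * 1000]
if stuck:
    oldest = (now_ms - min(i["inQueueSince"] for i in stuck)) / 60000
    print(f"QUEUE: {len(stuck)} items waiting >10 min (oldest {oldest:.0f} min); "
          f"first reason: {stuck[0].get('why', 'unknown')}")
```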
ECR pulls start failing consistently (expired IRSA credentials, a network ACL, a repo policy mismatch) — pods cannot start and sit in ImagePullBackOff. The kubelet records an event, but events page nobody.
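A minimal DIY sweep (sketch; assumes `kubectl` access; the same pattern works with the kubernetes Python client):

```python
import json, subprocess

out = subprocess.run(
    ["kubectl", "get", "pods", "-A", "-o", "json"],
    capture_output=True, check=True, text=True,
).stdout

bad = set()
for pod in json.loads(out)["items"]:
    for cs in pod.get("status", {}).get("containerStatuses", []):
        reason = cs.get("state", {}).get("waiting", {}).get("reason")
        if reason in ("ImagePullBackOff", "ErrImagePull"):
            bad.add(f'{pod["metadata"]["namespace"]}/{pod["metadata"]["name"]}')
if bad:
    print(f"PULL FAILURES: {len(bad)} pods: {', '.join(sorted(bad)[:5])}")
```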
After a release the Lighthouse performance score drops from 90 to 65 (a new library without code-splitting, or an un-minified bundle). You only learn when RUM starts showing LCP > 4 s.
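A minimal DIY gate via the PageSpeed Insights API, which runs Lighthouse server-side (sketch; the page URL and the 0.85 floor are assumptions, and any real volume needs an API key):

```python
import requests

URL = "https://example.com"  # hypothetical page to audit
FLOOR = 0.85                 # fail below 85: threshold is an assumption

resp = requests.get(
    "https://www.googleapis.com/pagespeedonline/v5/runPagespeed",
    params={"url": URL, "category": "performance", "strategy": "mobile"},
    timeout=120,
).json()
score = resp["lighthouseResult"]["categories"]["performance"]["score"]
if score < FLOOR:
    raise SystemExit(f"PERF REGRESSION: Lighthouse score {score:.2f} < {FLOOR}")
```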
Someone added `import * as _ from 'lodash'` — the whole-library import grew the bundle by 70 KB. CI passed (tests OK), but first user load got 300 ms slower. Catch it in CI before merge.
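A minimal DIY budget check for CI (sketch; the `dist/` path and the 250 KB budget are assumptions; in a JS repo a tool like size-limit does this properly):

```python
import gzip, sys
from pathlib import Path

BUDGET_KB = 250       # gzipped JS budget: assumption, pick one and defend it
DIST = Path("dist")   # assumes the bundler emits here

total = sum(len(gzip.compress(p.read_bytes())) for p in DIST.glob("**/*.js"))
kb = total / 1024
print(f"bundle: {kb:.0f} KB gzipped (budget {BUDGET_KB} KB)")
if kb > BUDGET_KB:
    sys.exit(1)  # fail the PR check, not the postmortem
```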
Compliance mandates rotating k8s Secrets (DB passwords, API tokens) every 90 days. Nobody auto-rotates, Secrets live since cluster creation. The auditor finds it first.
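A minimal DIY audit (sketch; note that `creationTimestamp` survives in-place updates, so this flags Secrets that were never recreated; if you rotate by patching, track a rotated-at annotation instead, and that annotation name is hypothetical):

```python
import json, subprocess
from datetime import datetime, timezone

MAX_AGE_DAYS = 90
out = subprocess.run(
    ["kubectl", "get", "secrets", "-A", "-o", "json"],
    capture_output=True, check=True, text=True,
).stdout

now = datetime.now(timezone.utc)
for s in json.loads(out)["items"]:
    if s["type"] == "kubernetes.io/service-account-token":
        continue  # managed by the control plane
    created = datetime.strptime(
        s["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    age = (now - created).days
    if age > MAX_AGE_DAYS:
        ns, name = s["metadata"]["namespace"], s["metadata"]["name"]
        print(f"STALE SECRET: {ns}/{name} is {age} days old")
```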
Someone ran `vault secrets disable` (debugging, or drift) — the pipeline reaches for DB creds and gets a 404. Vault does not warn: to Vault this is a normal admin action.
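A minimal DIY canary (sketch; the Vault address and the expected mount set are assumptions):

```python
import os, requests

VAULT = os.environ.get("VAULT_ADDR", "https://vault.internal:8200")  # assumption
EXPECTED = {"database/", "kv/", "pki/"}  # mounts your pipelines depend on

resp = requests.get(
    f"{VAULT}/v1/sys/mounts",
    headers={"X-Vault-Token": os.environ["VAULT_TOKEN"]},
    timeout=10,
).json()
mounts = resp.get("data", resp)  # newer Vault wraps the listing in "data"

missing = EXPECTED - set(mounts.keys())
if missing:
    print(f"MOUNT GONE: {sorted(missing)}; check the audit log for who disabled it")
```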
Fastly soft-purge is typically sub-second, but sometimes hangs 30+ s (overload, key collision). After a release, new assets do not appear, users see the old version.
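A minimal DIY post-deploy probe (sketch; assumes the deploy bakes a build id into the bundle; the asset URL and the 120 s deadline are assumptions):

```python
import sys, time, requests

ASSET = "https://www.example.com/app.js"  # hypothetical URL served via Fastly
MARKER = sys.argv[1]                      # e.g. the git SHA embedded at build time

deadline = time.time() + 120
while time.time() < deadline:
    resp = requests.get(ASSET, headers={"Fastly-Debug": "1"}, timeout=10)
    if MARKER in resp.text:
        print(f"OK: edge serves build {MARKER} "
              f"(X-Served-By: {resp.headers.get('x-served-by')})")
        break
    time.sleep(5)
else:
    raise SystemExit("STALE: edge still serving the old build after 120 s")
```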
A GCP project quota (CPUs, IPs, persistent disks) creeps toward its limit. The next terraform apply fails with RESOURCE_EXHAUSTED. Quota requests take 1–2 days, so you need a head start.
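A minimal DIY check via gcloud (sketch; region and threshold are assumptions; project-wide quotas live under `gcloud compute project-info describe`):

```python
import json, subprocess

REGION = "europe-west1"  # assumption
THRESHOLD = 0.8          # alert at 80% of quota: assumption

region = json.loads(subprocess.run(
    ["gcloud", "compute", "regions", "describe", REGION, "--format", "json"],
    capture_output=True, check=True, text=True,
).stdout)

for q in region["quotas"]:
    if q["limit"] and q["usage"] / q["limit"] >= THRESHOLD:
        print(f"QUOTA: {q['metric']}: {q['usage']:.0f}/{q['limit']:.0f} in {REGION}")
```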
Have a recipe we missed?
Tell us which stack to cover next — drop a line to support@enterno.io and we'll add the recipe (and credit you on the page).
Start monitoring — free →