Diagnosing Intermittent Downtime

Anatoly Oshmanovsky



Published: 23.06.2026 · ~4 min

Short answer. Intermittent outages are sneaky: by the time you check, the site is back up. Catching them by hand is nearly impossible — you need continuous monitoring at a short interval (1 minute or 30 seconds) that records every failure with a timestamp and response code. Then correlate the outage moments with load, deploys, resource exhaustion, and traffic spikes. Multi-region checks show whether the site is down globally or only from one location.

Why intermittent outages are hard to catch

If a site goes down for 30-90 seconds once every few hours, the "open it and look" approach is useless — the odds of catching the failure are tiny. You need a system that knocks constantly and logs every response.

The core principle for diagnosing intermittent outages: don't catch the moment with your eyes, catch it with data. Continuous monitoring at short intervals turns an invisible problem into a table of timestamps.

Step 1. Enable continuous monitoring

Set a 1-minute check interval (or 30 seconds for short blips).
Log the response code, response time, and error text of every check.
Enable alerts so you know about an outage in real time.

Step 2. A simple health-check script

If you want to collect data yourself, a minimal continuous checker:

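#!/bin/bash
# Poll URL every 30 seconds and append one line per check to the log.
URL="https://example.com"
LOG="/var/log/healthcheck.log"   # needs write permission; change the path if not running as root
while true; do
  TS=$(date '+%Y-%m-%d %H:%M:%S')
  # -s: silent, -o /dev/null: discard the body, -w: print status code and total time
  RESULT=$(curl -o /dev/null -s -w "%{http_code} %{time_total}" \
    --max-time 15 "$URL")
  echo "$TS $RESULT" >> "$LOG"
  # 000 = timeout/no connection; 5xx = server error
  sleep 30
done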

After a day the log will show the exact outage moments. Then comes correlation.
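To pull the outage moments out of that log, a small filter is enough. The sketch below assumes the line format written by the checker in Step 2 (date, time, HTTP code, total time) and treats any 2xx or 3xx code as success. The function names and the 3-second "slow" threshold are illustrative assumptions, not part of any tool.

```shell
# Print every check that did not return a 2xx/3xx status.
# Expected line format: DATE TIME HTTP_CODE TOTAL_TIME
list_failures() {
  awk '$3 !~ /^[23][0-9][0-9]$/' "$1"
}

# Print successful checks slower than a threshold in seconds (default 3, an assumption).
list_slow() {
  awk -v limit="${2:-3}" '$3 ~ /^[23][0-9][0-9]$/ && $4 + 0 > limit + 0' "$1"
}
```

Run `list_failures /var/log/healthcheck.log` to get the table of failure timestamps. Slow responses from `list_slow` often show up shortly before a full outage.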

Step 3. Correlate outages with events

Log symptom	Likely cause	What to check
Outages at the same time of day	Cron/backup/log rotation	Task schedule, cron-induced load
Outages on traffic peaks	Worker/memory exhaustion	PHP-FPM limits, RAM, connection pool
Sporadic 5xx codes	DB or external API failures	DB logs, external service timeouts
Code 000 (timeout)	Network/firewall/origin down	fail2ban, network dips, restarts
Outages after a deploy	A release broke some requests	App logs, release rollback
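
One way to test the first row of the table is to count failures per hour of day. This is a minimal sketch that assumes the log format from Step 2; `failures_by_hour` is a hypothetical helper name.

```shell
# Count failed checks per hour of day (00-23) to expose schedule-driven outages.
# Expected line format: DATE TIME HTTP_CODE TOTAL_TIME
failures_by_hour() {
  awk '$3 !~ /^[23][0-9][0-9]$/ { count[substr($2, 1, 2)]++ }
       END { for (h in count) print h, count[h] }' "$1" | sort
}
```

If most failures land in one or two hours, compare those hours with the cron schedule. If they are spread evenly, a scheduled job is a less likely cause.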

Step 4. Check server resources over time

Memory: leaks cause periodic OOM-killer events and restarts.
CPU: spikes during heavy jobs drop responsiveness.
Disk space: a full log/disk breaks writes and serving.
DB connection pool: exhaustion produces periodic 5xx.
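
To check these over time rather than at a single moment, resource readings need timestamps too. The sketch below is Linux-only and assumes `/proc/meminfo` exposes `MemAvailable` and that the root filesystem is the one that fills up. The function name and output layout are illustrative.

```shell
# Print one line: DATE TIME MEM_AVAILABLE_MB LOAD_1MIN ROOT_DISK_USED
# Linux-only: reads /proc/meminfo and /proc/loadavg.
resource_snapshot() {
  ts=$(date '+%Y-%m-%d %H:%M:%S')
  mem_mb=$(awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo)
  load1=$(cut -d ' ' -f 1 /proc/loadavg)
  disk=$(df -P / | awk 'NR == 2 { print $5 }')
  echo "$ts $mem_mb $load1 $disk"
}
```

Appending `resource_snapshot` output to a file once a minute gives a second timeline to lay next to the health-check log. For suspected OOM kills, the kernel log is also worth a look, for example `dmesg | grep -i 'out of memory'`.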

Step 5. Check from multiple regions

Intermittent unreachability can be network, not server. If the site fails from RU but returns 200 from EU/US at the same moment, the problem is on the route between the region and the server, not the server itself.

Multi-region checks answer the key question: "is the server flickering, or is it the network between me and the server?" Without them, you can easily fix the wrong problem.
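
The decision rule can be sketched as a small function that takes the status codes seen from each region at the same moment. The name `classify_outage` and the labels it prints are assumptions for illustration. A real setup would also want a few consecutive samples before drawing a conclusion.

```shell
# Classify one moment in time from the status codes seen in each region.
# Usage: classify_outage CODE_REGION1 CODE_REGION2 ...
classify_outage() {
  total=0
  failed=0
  for code in "$@"; do
    total=$((total + 1))
    case "$code" in
      2??|3??) ;;                      # success from this region
      *) failed=$((failed + 1)) ;;     # 000, 4xx, 5xx count as failures
    esac
  done
  if [ "$failed" -eq 0 ]; then
    echo "ok"
  elif [ "$failed" -eq "$total" ]; then
    echo "server"                      # fails everywhere: likely the server
  else
    echo "network"                     # fails from some regions only: likely the route
  fi
}
```

For example, `classify_outage 000 200 200` prints `network`, while `classify_outage 502 502 000` prints `server`.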

Step 6. Automate via the monitoring API

To set up monitoring programmatically (for example, for each new service), use the REST API:

POST /api/v4/monitors
X-API-Key: your_key
Content-Type: application/json

{
  "url": "https://example.com",
  "check_type": "http",
  "interval_minutes": 1,
  "expected_code": 200,
  "notify_telegram": true,
  "check_regions": "ru-msk,eu-de,us-east"
}
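
From a provisioning script, the same request could be sent with curl. This is a sketch under assumptions: the path, header, and JSON fields are copied from the request above, but the host in `API_BASE` and the `API_KEY` variable name are placeholders, so check the API documentation for the real base URL.

```shell
# Build the JSON body for one monitor. Field names are copied from the request above.
# Assumes the URL needs no JSON escaping (no quotes or backslashes).
build_monitor_payload() {
  printf '{"url":"%s","check_type":"http","interval_minutes":1,"expected_code":200,"notify_telegram":true,"check_regions":"%s"}' \
    "$1" "${2:-ru-msk,eu-de,us-east}"
}

# Send the request. API_BASE is a placeholder host, not a documented value.
create_monitor() {
  curl -sS -f "${API_BASE:-https://enterno.io}/api/v4/monitors" \
    -H "X-API-Key: $API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(build_monitor_payload "$1")"
}
```

With `API_KEY` exported, `create_monitor https://example.com` would create one monitor. The `-f` flag makes curl exit non-zero on an HTTP error, so a deploy script can stop if the call fails.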

How enterno.io helps

enterno.io is built for exactly this. Uptime monitoring checks the site every minute (every 30 seconds on higher plans) and catches short outages that a manual check misses, recording each failure with a timestamp and response code. Multi-region checks from RU (ru-msk), EU, and US show whether the site is down globally or only from one location. Incidents open and close automatically, building an outage history, and public status pages show uptime to your customers. Alerts go to Telegram, Slack, email, and webhooks. SSL monitoring warns about upcoming certificate expiry at 14-day and 3-day thresholds. The REST API v4 creates monitors programmatically. enterno.io diagnoses and warns; the root-cause fix on the server remains the owner's job.

FAQ

What interval do I need for short outages?

If blips are under a minute, use a 30-second interval. At a 5-minute interval you will miss most short dips.
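
A rough estimate shows why. Assume checks are instantaneous and an outage starts at a random moment relative to the check schedule. The chance that at least one check lands inside the outage is then the outage length divided by the interval, capped at 100%. The helper below is an illustrative sketch of that simplified model, not a property of any monitoring service.

```shell
# Chance (in percent) that at least one check lands inside an outage.
# Usage: detection_chance OUTAGE_SECONDS INTERVAL_SECONDS
detection_chance() {
  awk -v d="$1" -v t="$2" 'BEGIN {
    p = d / t
    if (p > 1) p = 1
    printf "%d\n", p * 100 + 0.5   # round to the nearest whole percent
  }'
}
```

Under this model, `detection_chance 30 300` gives 10, `detection_chance 30 60` gives 50, and `detection_chance 30 30` gives 100.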

The site "sometimes won't load" but the server logs are clean?

Then the problem is probably somewhere before requests reach the server: network, firewall, DNS, or CDN. Multi-region checks and a client-side health check will show it.
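
A client-side check can also show which phase of the request is slow. curl reports cumulative timers through `-w`, and the sketch below converts them into per-phase durations. The function names are hypothetical, and the output is a hint about where to look, not a diagnosis.

```shell
# Convert curl's cumulative timers into per-phase durations (seconds).
# Usage: phase_durations NAMELOOKUP CONNECT APPCONNECT STARTTRANSFER
phase_durations() {
  awk -v dns="$1" -v conn="$2" -v tls="$3" -v ttfb="$4" 'BEGIN {
    tls_end = (tls > conn) ? tls : conn   # appconnect is 0 for plain HTTP
    printf "dns=%.3f tcp=%.3f tls=%.3f wait=%.3f\n",
      dns, conn - dns, tls_end - conn, ttfb - tls_end
  }'
}

# Probe a URL and print the breakdown. If the request fails, phases that were
# never reached report 0, so zero or negative values mark where it stopped.
probe_phases() {
  phase_durations $(curl -o /dev/null -s --max-time 15 \
    -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer}' "$1")
}
```

A large `dns` value points at name resolution, a large `tcp` or `tls` value at the network path or CDN edge, and a large `wait` value at the server or whatever sits directly in front of it.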

How do I tell a server failure from a network one?

Correlate the outage moments across regions. If it fails everywhere at once — server. If only from one location — network/route.

Outages line up with night-time — why?

A common cause is cron jobs, backups, or log rotation loading the server. Check the schedule and move heavy jobs to off-peak hours with resource limits.

Next step: enable monitoring at a 1-minute interval with multi-region checks. See also multi-region monitoring and the website monitoring guide.
