Diagnosing Intermittent Downtime

Short answer. Intermittent outages are sneaky: by the time you check, the site is back up. Catching them by hand is nearly impossible — you need continuous monitoring at a short interval (1 minute or 30 seconds) that records every failure with a timestamp and response code. Then correlate the outage moments with load, deploys, resource exhaustion, and traffic spikes. Multi-region checks show whether the site is down globally or only from one location.

Why intermittent outages are hard to catch

If a site goes down for 30-90 seconds once every few hours, the "open it and look" approach is useless: the odds of catching the failure by hand are tiny. You need a system that probes the site continuously and logs every response.

The core principle for diagnosing intermittent outages: don't catch the moment with your eyes, catch it with data. Continuous monitoring at short intervals turns an invisible problem into a table of timestamps.

Step 1. Enable continuous monitoring

  • Set a 1-minute check interval (or 30 seconds for short blips).
  • Log the response code, response time, and error text of every check.
  • Enable alerts so you know about an outage in real time.

Step 2. A simple health-check script

If you want to collect data yourself, a minimal continuous checker:

#!/bin/bash
URL="https://example.com"
while true; do
  TS=$(date '+%Y-%m-%d %H:%M:%S')
  RESULT=$(curl -o /dev/null -s -w "%{http_code} %{time_total}" \
    --max-time 15 "$URL")
  echo "$TS $RESULT" >> /var/log/healthcheck.log
  # 000 = timeout/no connection; 5xx = server error
  sleep 30
done

After a day the log will show the exact outage moments. Then comes correlation.
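
To turn the raw log into a list of outage moments, consecutive failed checks can be grouped into windows. The sketch below assumes the line format written by the checker above (date, time, HTTP code, total time) and treats 000 and 5xx as failures; the function name outage_windows is only illustrative.

```shell
#!/bin/bash
# Group consecutive failed checks into outage windows.
# Assumed log line format: DATE TIME HTTP_CODE TIME_TOTAL
outage_windows() {
  awk '
    # Print the window collected so far (if any) and reset the counter.
    function emit() {
      if (n > 0) printf "%s -> %s (%d failed checks)\n", start, last, n
      n = 0
    }
    {
      ts = $1 " " $2
      # 000 = timeout/no connection, 5xx = server error
      if ($3 == "000" || $3 ~ /^5/) {
        if (n == 0) start = ts
        last = ts
        n++
      } else {
        emit()
      }
    }
    END { emit() }
  ' "$1"
}
```

Calling outage_windows /var/log/healthcheck.log prints one line per outage with the first and last failed check, which is the table of timestamps the later steps work from.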

Step 3. Correlate outages with events

Log symptom | Likely cause | What to check
Outages at the same time of day | Cron/backup/log rotation | Task schedule, cron-induced load
Outages on traffic peaks | Worker/memory exhaustion | PHP-FPM limits, RAM, connection pool
Sporadic 5xx codes | DB or external API failures | DB logs, external service timeouts
Code 000 (timeout) | Network/firewall/origin down | fail2ban, network dips, restarts
Outages after a deploy | A release broke some requests | App logs, release rollback
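
One way to test the first row of the table is to count failed checks per hour of day. This is a minimal sketch under the same assumptions as the checker script (log lines of the form DATE TIME HTTP_CODE TIME_TOTAL, with 000 and 5xx counted as failures); failures_by_hour is a made-up helper name.

```shell
#!/bin/bash
# Count failed checks per hour of day to reveal time-of-day patterns.
# Assumed log line format: DATE TIME HTTP_CODE TIME_TOTAL
failures_by_hour() {
  awk '
    $3 == "000" || $3 ~ /^5/ {
      hour = substr($2, 1, 2)   # "03:00:30" -> "03"
      count[hour]++
    }
    END {
      for (h in count) printf "%s:00 %d\n", h, count[h]
    }
  ' "$1" | sort
}
```

If nearly all failures land in a single hour, compare that hour against the cron, backup, and log rotation schedules before looking anywhere else.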

Step 4. Check server resources over time

  • Memory: leaks cause periodic OOM-killer events and restarts.
  • CPU: spikes during heavy jobs drop responsiveness.
  • Disk space: a full log/disk breaks writes and serving.
  • DB connection pool: exhaustion produces periodic 5xx.
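
To compare resources against outage moments, it helps to record a snapshot on the same clock as the health check. The sketch below is Linux-specific and assumes /proc/meminfo, /proc/loadavg, and a POSIX df are available; the function names and the output format are illustrative. The DB connection pool is not covered, because reading it depends on the database in use.

```shell
#!/bin/bash
# Record memory, load, and disk usage with the same timestamp format
# as the health-check log, so the two logs can be lined up.

# Available memory in MB, read from a meminfo-format file.
mem_available_mb() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# 1-minute load average, read from a loadavg-format file.
load_1m() {
  cut -d' ' -f1 "${1:-/proc/loadavg}"
}

# Used space of the filesystem holding the given path, as a bare number.
disk_used_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# One snapshot line: timestamp plus the three readings.
snapshot() {
  printf '%s mem_avail_mb=%s load1=%s disk_used_pct=%s\n' \
    "$(date '+%Y-%m-%d %H:%M:%S')" \
    "$(mem_available_mb)" "$(load_1m)" "$(disk_used_pct /)"
}
```

Appending the output of snapshot to a file once a minute (from cron or a loop) gives a second log to read next to the outage timestamps: a steady fall in mem_avail_mb before each outage points to a leak, a load spike points to a heavy job.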

Step 5. Check from multiple regions

Intermittent unreachability can be network, not server. If the site fails from RU but returns 200 from EU/US at the same moment, the problem is on the route between the region and the server, not the server itself.

Multi-region checks answer the key question: "is the server flickering, or is the network between me and the server?" Without them, you can easily fix the wrong problem.
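
The decision rule can be sketched as a small function. It assumes you already have one status code per region for the same moment, passed as region=code pairs; the region labels are the ones used elsewhere in this article, and classify_outage is a hypothetical name. Codes 2xx and 3xx count as success, everything else (including 000) as failure.

```shell
#!/bin/bash
# Classify one moment in time from per-region results.
# Arguments: region=code pairs, e.g. ru-msk=000 eu-de=200 us-east=200
classify_outage() {
  local ok=0 bad=0 pair code
  for pair in "$@"; do
    code="${pair#*=}"            # strip the "region=" prefix
    if [[ "$code" == 2* || "$code" == 3* ]]; then
      ok=$((ok + 1))
    else
      bad=$((bad + 1))
    fi
  done
  if (( bad == 0 )); then
    echo "up"        # every region succeeded
  elif (( ok == 0 )); then
    echo "server"    # failing from everywhere at once
  else
    echo "network"   # failing from some regions only
  fi
}
```

For example, classify_outage ru-msk=000 eu-de=200 us-east=200 prints network, which is the case described above: the server answers, but the route from one region does not.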

Step 6. Automate via the monitoring API

To set up monitoring programmatically (for example, for each new service), use the REST API:

POST /api/v4/monitors
X-API-Key: your_key
Content-Type: application/json

{
  "url": "https://example.com",
  "check_type": "http",
  "interval_minutes": 1,
  "expected_code": 200,
  "notify_telegram": true,
  "check_regions": "ru-msk,eu-de,us-east"
}
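
From a shell, the same request could be sent with curl. This is only a sketch: the endpoint path, header, and field names are copied from the request above, but API_BASE is a placeholder host (an assumption, check the API documentation for the real one), and the helper names are invented. The payload builder assumes the URL contains no double quotes or backslashes.

```shell
#!/bin/bash
# ASSUMPTION: API_BASE is a placeholder; replace it with the real API host.
API_BASE="${API_BASE:-https://monitoring.example}"

# Build the JSON body for one monitored URL (fields as in the request above).
monitor_payload() {
  printf '{"url":"%s","check_type":"http","interval_minutes":1,"expected_code":200,"notify_telegram":true,"check_regions":"ru-msk,eu-de,us-east"}\n' "$1"
}

# Send the request; fails early if API_KEY is not set in the environment.
create_monitor() {
  curl -sS -X POST "$API_BASE/api/v4/monitors" \
    -H "X-API-Key: ${API_KEY:?set API_KEY first}" \
    -H "Content-Type: application/json" \
    --data "$(monitor_payload "$1")"
}
```

With API_KEY exported, create_monitor https://example.com would register one monitor; looping over a list of service URLs covers the "each new service" case.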

How enterno.io helps

enterno.io is built for exactly this. Uptime monitoring checks the site every minute (30 seconds on higher plans) and catches short outages a manual check misses, recording each failure with a timestamp and response code. Multi-region checks across RU (ru-msk)/EU/US show whether the site is down globally or only from one location. Incidents open and close automatically, building an outage history; public status pages show uptime to your customers. Alerts go to Telegram, Slack, email, and webhook. SSL monitoring (14/3-day thresholds) catches upcoming certificate expiry. The REST API v4 creates monitors programmatically. enterno.io diagnoses and warns — the root-cause fix on the server is done by the owner.

FAQ

What interval do I need for short outages?

If blips are under a minute, use a 30-second interval. At a 5-minute interval you will miss most short dips.
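
A rough way to see why: if an outage lasts d seconds and checks fire every t seconds at an unrelated phase, the chance that at least one check lands inside the outage is about min(1, d/t). This is a simplification that ignores check duration and timeouts; detection_chance is an illustrative helper.

```shell
#!/bin/bash
# Approximate chance that a single outage is seen by at least one check.
# $1 = outage duration in seconds, $2 = check interval in seconds
detection_chance() {
  awk -v d="$1" -v t="$2" 'BEGIN {
    p = d / t
    if (p > 1) p = 1
    printf "%.0f%%\n", p * 100
  }'
}

detection_chance 45 300   # 45-second outage, 5-minute interval -> 15%
detection_chance 45 60    # 1-minute interval                   -> 75%
detection_chance 45 30    # 30-second interval                  -> 100%
```

Under this model a 45-second dip is missed about 85% of the time at a 5-minute interval, and is always caught at 30 seconds.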

The site "sometimes won't load" but the server logs are clean?

Then the problem is probably upstream of the server: network, firewall, DNS, or CDN. Multi-region checks and a client-side health check will reveal it.
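
A client-side check can also show which phase of a request is slow. The sketch below uses curl's write-out timing variables (time_namelookup, time_connect, time_appconnect, time_starttransfer), which are cumulative, and converts them into per-phase durations. The function names are illustrative, and the breakdown is only meaningful for requests that completed: on a failed request, phases that were never reached are reported as 0.

```shell
#!/bin/bash
# Print cumulative timings for one request:
# namelookup connect appconnect starttransfer total
probe_timings() {
  curl -o /dev/null -s --max-time 15 \
    -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n' \
    "$1"
}

# Convert cumulative timings into per-phase durations.
# $1 = namelookup, $2 = connect, $3 = appconnect, $4 = starttransfer
phase_breakdown() {
  awk -v dns="$1" -v conn="$2" -v tls="$3" -v ttfb="$4" 'BEGIN {
    if (tls == 0) tls = conn   # plain HTTP: no TLS handshake took place
    printf "dns=%.3f tcp=%.3f tls=%.3f server=%.3f\n",
      dns, conn - dns, tls - conn, ttfb - tls
  }'
}
```

For instance, phase_breakdown $(probe_timings https://example.com) prints the four durations; a large dns or tcp value with a small server value suggests the delay is in front of the server, which matches clean server logs.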

How do I tell a server failure from a network one?

Correlate the outage moments across regions. If checks fail from every region at once, the server is at fault. If they fail from only one location, suspect the network or route.

Outages line up with night-time — why?

A common cause is cron jobs, backups, or log rotation loading the server. Check the schedule and move heavy jobs to off-peak hours with resource limits.

Next step: enable monitoring at a 1-minute interval with multi-region checks. See also multi-region monitoring and the website monitoring guide.
