Short answer. Intermittent outages are sneaky: by the time you check, the site is back up. Catching them by hand is nearly impossible — you need continuous monitoring at a short interval (1 minute or 30 seconds) that records every failure with a timestamp and response code. Then correlate the outage moments with load, deploys, resource exhaustion, and traffic spikes. Multi-region checks show whether the site is down globally or only from one location.
Why intermittent outages are hard to catch
If a site goes down for 30-90 seconds once every few hours, the "open it and look" approach is useless: the odds of catching the failure are tiny. You need a system that probes the site constantly and logs every response.
The core principle for diagnosing intermittent outages: don't catch the moment with your eyes, catch it with data. Continuous monitoring at short intervals turns an invisible problem into a table of timestamps.
Step 1. Enable continuous monitoring
- Set a 1-minute check interval (or 30 seconds for short blips).
- Log the response code, response time, and error text of every check.
- Enable alerts so you know about an outage in real time.
Step 2. A simple health-check script
If you want to collect data yourself, a minimal continuous checker:
#!/bin/bash
URL="https://example.com"
while true; do
  TS=$(date '+%Y-%m-%d %H:%M:%S')
  # http_code 000 = timeout/no connection; 5xx = server error
  RESULT=$(curl -o /dev/null -s -w "%{http_code} %{time_total}" \
    --max-time 15 "$URL")
  echo "$TS $RESULT" >> /var/log/healthcheck.log
  sleep 30
done
After a day the log will show the exact outage moments. Then comes correlation.
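To turn the raw log into a list of outage moments, consecutive failed checks can be grouped into incidents. The sketch below assumes the log format written by the checker above (date, time, HTTP code, total time) and treats 000 and 5xx as failures; the function name `incidents` is made up for illustration.

```shell
# Group consecutive failed checks into incidents.
# Expected input lines: YYYY-MM-DD HH:MM:SS <http_code> <time_total>
# A check counts as failed when the code is 000 (no response) or 5xx.
incidents() {
  awk '
    # Print the incident collected so far (if any) and reset the state.
    function emit() {
      if (n > 0) printf "%s -> %s  checks=%d  codes=%s\n", start, last, n, codes
      n = 0
      codes = ""
    }
    {
      ts = $1 " " $2
      if ($3 == "000" || $3 ~ /^5[0-9][0-9]$/) {
        if (n == 0) start = ts          # first failed check of a new incident
        last = ts
        n++
        if (index(codes, $3) == 0) codes = (codes == "" ? $3 : codes "," $3)
      } else {
        emit()                          # a successful check closes the incident
      }
    }
    END { emit() }
  ' "$@"
}
```

Run it as `incidents /var/log/healthcheck.log`. Each output line gives the first and last failed check of one incident, how many checks failed in a row, and which codes were seen.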
Step 3. Correlate outages with events
| Log symptom | Likely cause | What to check |
|---|---|---|
| Outages at the same time of day | Cron/backup/log rotation | Task schedule, cron-induced load |
| Outages on traffic peaks | Worker/memory exhaustion | PHP-FPM limits, RAM, connection pool |
| Sporadic 5xx codes | DB or external API failures | DB logs, external service timeouts |
| Code 000 (timeout) | Network/firewall/origin down | fail2ban, network dips, restarts |
| Outages after a deploy | A release broke some requests | App logs, release rollback |
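The first row of the table (outages at the same time of day) is the easiest to test against the data. A minimal sketch, assuming the log format from Step 2; `failures_by_hour` is a hypothetical helper name:

```shell
# Count failed checks (000 or 5xx) per hour of day.
# Expected input lines: YYYY-MM-DD HH:MM:SS <http_code> <time_total>
failures_by_hour() {
  awk '
    $3 == "000" || $3 ~ /^5[0-9][0-9]$/ {
      count[substr($2, 1, 2)]++         # HH part of HH:MM:SS
    }
    END {
      # Walk the hours in order so the output is sorted.
      for (h = 0; h < 24; h++) {
        key = sprintf("%02d", h)
        if (key in count) printf "%s:00  %d\n", key, count[key]
      }
    }
  ' "$@"
}
```

If nearly all failures fall into one or two hours, compare those hours with the cron schedule and backup windows. A flat distribution points toward load, network, or dependency causes instead.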
Step 4. Check server resources over time
- Memory: leaks cause periodic OOM-killer events and restarts.
- CPU: spikes during heavy jobs drop responsiveness.
- Disk space: a full log/disk breaks writes and serving.
- DB connection pool: exhaustion produces periodic 5xx.
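To have resource data to compare against the outage timestamps, one option is to log a snapshot at the same cadence as the health check. The sketch below is Linux-specific (it reads `/proc/loadavg` and `/proc/meminfo`) and only covers load, available memory, and root-disk usage; the function name and output format are assumptions, and DB pool usage would have to come from the database itself.

```shell
# Print one resource snapshot on a single line.
# Linux-only: relies on /proc/loadavg and the MemAvailable field of /proc/meminfo.
resource_snapshot() {
  local ts load mem disk
  ts=$(date '+%Y-%m-%d %H:%M:%S')
  load=$(cut -d ' ' -f 1 /proc/loadavg)                          # 1-minute load average
  mem=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)       # available RAM in kB
  disk=$(df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }') # root filesystem use, percent
  echo "$ts load=$load mem_avail_kb=$mem disk_used_pct=$disk"
}
```

Calling `resource_snapshot >> /var/log/resources.log` once a minute (from cron or a loop) gives a second timeline. Lining it up with the health-check log shows whether failures coincide with low available memory, high load, or a nearly full disk.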
Step 5. Check from multiple regions
Intermittent unreachability can be a network problem rather than a server problem. If the site fails from RU but returns 200 from EU/US at the same moment, the problem is on the route between that region and the server, not the server itself.
Multi-region checks answer the key question: "is the server flickering, or is the network between me and the server?" Without it, you can easily fix the wrong problem.
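The decision rule can be written down in a few lines. This is a sketch only: it assumes you already have the HTTP code from each region for the same check round, passed as `region=code` pairs, and the function name `classify_round` is invented for the example.

```shell
# Classify one check round observed from several regions.
# Arguments: region=http_code pairs, e.g. ru-msk=000 eu-de=200 us-east=200
classify_round() {
  local total=0 failed=0 pair code
  for pair in "$@"; do
    code=${pair#*=}                     # keep the part after "="
    total=$((total + 1))
    case "$code" in
      000|5[0-9][0-9]) failed=$((failed + 1)) ;;
    esac
  done
  if [ "$total" -eq 0 ]; then
    echo "no-data"
  elif [ "$failed" -eq 0 ]; then
    echo "ok"
  elif [ "$failed" -eq "$total" ]; then
    echo "server-side (failing from every region)"
  else
    echo "network/route (failing from $failed of $total regions)"
  fi
}
```

For example, `classify_round ru-msk=000 eu-de=200 us-east=200` reports a network/route problem, while three failing regions in the same round point at the server.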
Step 6. Automate via the monitoring API
To set up monitoring programmatically (for example, for each new service), use the REST API:
POST /api/v4/monitors
X-API-Key: your_key
Content-Type: application/json

{
  "url": "https://example.com",
  "check_type": "http",
  "interval_minutes": 1,
  "expected_code": 200,
  "notify_telegram": true,
  "check_regions": "ru-msk,eu-de,us-east"
}
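The same request could be sent with curl roughly as follows. The path, headers, and JSON fields are taken from the request above; the host in `API_BASE` is an assumption (the text does not state it), and the function names are illustrative, so check the API documentation before relying on this.

```shell
# Assumption: the API host is not given in the text; override API_BASE as needed.
API_BASE="${API_BASE:-https://enterno.io}"

# Build the JSON body for one monitor.
# $1 = URL to monitor (must not contain double quotes or backslashes)
monitor_payload() {
  printf '{"url":"%s","check_type":"http","interval_minutes":1,"expected_code":200,"notify_telegram":true,"check_regions":"ru-msk,eu-de,us-east"}\n' "$1"
}

# Send the request. API_KEY must be set in the environment.
# $1 = URL to monitor
create_monitor() {
  curl -sS -f --max-time 15 -X POST "$API_BASE/api/v4/monitors" \
    -H "X-API-Key: ${API_KEY:?set API_KEY first}" \
    -H "Content-Type: application/json" \
    --data "$(monitor_payload "$1")"
}
```

With `API_KEY` exported, `create_monitor https://example.com` would register one monitor; wrapping it in a loop over a list of service URLs covers the "each new service" case.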
How enterno.io helps
enterno.io is built for exactly this. Uptime monitoring checks the site every minute (30 seconds on higher plans) and catches short outages a manual check misses, recording each failure with a timestamp and response code. Multi-region checks across RU (ru-msk)/EU/US show whether the site is down globally or only from one location. Incidents open and close automatically, building an outage history; public status pages show uptime to your customers. Alerts go to Telegram, Slack, email, and webhook. SSL monitoring (14/3-day thresholds) catches upcoming certificate expiry. The REST API v4 creates monitors programmatically. enterno.io diagnoses and warns — the root-cause fix on the server is done by the owner.
FAQ
What interval do I need for short outages?
If blips are under a minute, use a 30-second interval. At a 5-minute interval you will miss most short dips.
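A rough way to see why: if checks are treated as instantaneous and an outage starts at a random moment, the chance that at least one check lands inside an outage of duration d at interval T is about min(1, d/T). The helper below is a back-of-the-envelope sketch under that simplification, not a property of any particular monitoring tool.

```shell
# Approximate chance that a single outage is seen by at least one check.
# $1 = outage duration in seconds, $2 = check interval in seconds
detect_probability() {
  awk -v d="$1" -v t="$2" 'BEGIN { p = d / t; if (p > 1) p = 1; printf "%.2f\n", p }'
}

detect_probability 60 300   # 60 s outage, 5-minute interval  -> 0.20
detect_probability 45 60    # 45 s outage, 1-minute interval  -> 0.75
detect_probability 60 30    # 60 s outage, 30-second interval -> 1.00
```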
The site "sometimes won't load" but the server logs are clean?
Then the problem is probably upstream of the server: network, firewall, DNS, or CDN. Multi-region checks and a client-side health check will show it.
How do I tell a server failure from a network one?
Correlate the outage moments across regions. If the site fails from every region at once, suspect the server. If it fails from only one location, suspect the network or route.
Outages line up with night-time — why?
A common cause is cron jobs, backups, or log rotation loading the server. Check the schedule and move heavy jobs to off-peak hours with resource limits.
Next step: enable monitoring at a 1-minute interval with multi-region checks. See also multi-region monitoring and the website monitoring guide.