SLA Monitoring for SaaS: Measuring and Holding Uptime
Short answer. A SaaS SLA is a public availability promise expressed as an uptime percentage over a period. 99.9% sounds solid, yet it allows almost 9 hours of downtime per year. To hold an SLA you must measure real availability with independent, short-interval monitoring, compute uptime with a formula, and record incidents automatically. Below: the nines table, the formula and a practical setup.
What an SLA is made of
An SLA (Service Level Agreement) fixes a target availability level and the penalties for breaching it, usually credits or partial refunds. Key elements: the target uptime percentage, the measurement window (month or quarter), what counts as downtime, and who measures it. If only the provider measures, the customer cannot verify the numbers and has no leverage. That is why SaaS teams run independent monitoring.
The nines table: how much downtime is allowed
| SLA | Downtime/month | Downtime/year |
|---|---|---|
| 99% ("two nines") | ~7.3 hours | ~3.65 days |
| 99.9% ("three nines") | ~43.8 minutes | ~8.76 hours |
| 99.95% | ~21.9 minutes | ~4.38 hours |
| 99.99% ("four nines") | ~4.38 minutes | ~52.6 minutes |
| 99.999% ("five nines") | ~26 seconds | ~5.26 minutes |
More nines means costlier infrastructure and a shorter monitoring interval — otherwise you simply won't catch a short incident.
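The figures in the table can be reproduced with a few lines of Python. This is a minimal sketch that assumes a 365-day year and an average month of 730 hours (a 30-day month gives slightly smaller monthly values); the function name `allowed_downtime` is illustrative.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24               # 8,760 h, non-leap year (assumption)
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730 h, average month (assumption)


def allowed_downtime(sla_percent: float, period_hours: float) -> timedelta:
    """Downtime budget for an SLA target over a period of the given length."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return timedelta(hours=period_hours * (100 - sla_percent) / 100)


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    month = allowed_downtime(target, HOURS_PER_MONTH)
    year = allowed_downtime(target, HOURS_PER_YEAR)
    print(f"{target}%: {month.total_seconds() / 60:.2f} min/month, "
          f"{year.total_seconds() / 3600:.2f} h/year")
```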
The uptime formula
Uptime is the share of time the service was available out of the total period:
uptime_% = (total_time - downtime) / total_time * 100
# Example: month = 30 days = 43,200 minutes
# 25 minutes of downtime recorded
uptime_% = (43200 - 25) / 43200 * 100 = 99.942%
# This meets a 99.9% SLA but breaches 99.95%
For the formula to reflect reality, the downtime source must be recorded incidents from independent monitoring, not a rough estimate.
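As a rough illustration of feeding recorded incidents into the formula, the sketch below takes a list of `(start, end)` incident pairs and returns the uptime percentage for a period. The function name and data shape are assumptions, not the format of any particular monitoring tool. Incidents are clipped to the period and overlapping ones are merged so the same minute is never counted twice.

```python
from datetime import datetime


def uptime_percent(period_start, period_end, incidents):
    """Uptime % over [period_start, period_end) from (start, end) incident pairs."""
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")
    # Keep only incidents that touch the period, clipped to its bounds.
    clipped = sorted(
        (max(s, period_start), min(e, period_end))
        for s, e in incidents
        if e > period_start and s < period_end
    )
    downtime = 0.0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Gap before this incident: close the previous merged interval.
            if cur_end is not None:
                downtime += (cur_end - cur_start).total_seconds()
            cur_start, cur_end = s, e
        else:
            # Overlap: extend the current merged interval.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        downtime += (cur_end - cur_start).total_seconds()
    return (total - downtime) / total * 100


# Same numbers as the worked example: a 30-day month with 25 minutes of downtime.
start, end = datetime(2024, 6, 1), datetime(2024, 7, 1)
incidents = [
    (datetime(2024, 6, 3, 10, 0), datetime(2024, 6, 3, 10, 15)),   # 15 min
    (datetime(2024, 6, 20, 2, 0), datetime(2024, 6, 20, 2, 10)),   # 10 min
]
print(f"{uptime_percent(start, end, incidents):.3f}%")
```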
The health-check endpoint
Reliable SLA monitoring relies on a dedicated health endpoint that checks not only that the web server is alive, but that dependencies (DB, cache, queue) are reachable. Example check:
# Hit the health endpoint and measure response time
curl -s -o /dev/null -w "HTTP %{http_code}, %{time_total}s\n" \
https://api.example.com/health
# Expected output for a healthy service:
# HTTP 200, 0.142s
In enterno.io you add this URL as an HTTP monitor with an expected 200 code and a 1-minute interval — and uptime is computed automatically.
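On the server side, such an endpoint might look roughly like the sketch below, which uses only the Python standard library. The check names and the `CHECKS` placeholders are hypothetical; in a real service each would run an actual probe (for example a trivial database query or a cache ping). Returning 503 when any dependency fails is one common convention, chosen here so that a monitor expecting 200 registers the failure.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every dependency check; return (HTTP status, JSON-serializable body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception:
            # A check that raises is treated the same as a failed check.
            results[name] = "fail"
    healthy = all(v == "ok" for v in results.values())
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return (200 if healthy else 503), body


# Hypothetical placeholders: replace each lambda with a real probe.
CHECKS = {
    "db": lambda: True,
    "cache": lambda: True,
    "queue": lambda: True,
}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        code, body = run_health_checks(CHECKS)
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the server; call serve() to expose /health on the given port."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```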
Holding the SLA in practice
- Short interval. 99.99% needs a 30-second interval: with a 5-minute interval, a 4-minute incident can fall entirely between two checks and go unrecorded.
- Multi-region. Probes from Russia, the EU and the US separate a network glitch from a real failure.
- Incident threshold. Don't open an incident on a single failed check — require several consecutive failures to filter noise.
- SSL control. An expired certificate is 100% downtime for users, so alert on expiry in advance, at 14 days and again at 3 days before the certificate expires.
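The rules in the list above can be sketched as three small functions. All names, the default of three consecutive failures, the two-region quorum, and the "warning"/"critical" labels are illustrative assumptions; only the 14-day and 3-day certificate thresholds come from the text.

```python
from typing import Dict, Iterable, List


def region_consensus_down(results: Dict[str, bool], quorum: int = 2) -> bool:
    """True when at least `quorum` regions report a failed check.

    `results` maps a region name to True if its check passed.
    """
    failures = sum(1 for ok in results.values() if not ok)
    return failures >= quorum


def incident_open_indexes(checks: Iterable[bool], threshold: int = 3) -> List[int]:
    """Indexes at which an incident opens: the threshold-th consecutive failure."""
    opened = []
    streak = 0
    for i, ok in enumerate(checks):
        if ok:
            streak = 0
        else:
            streak += 1
            if streak == threshold:
                opened.append(i)
    return opened


def cert_alert_level(days_left: int) -> str:
    """Map days until certificate expiry to an alert level (14/3-day thresholds)."""
    if days_left <= 3:
        return "critical"
    if days_left <= 14:
        return "warning"
    return "ok"


# One region failing alone looks like a network glitch, not an outage.
print(region_consensus_down({"ru": True, "eu": False, "us": True}))
# A single failed check is ignored; three in a row open an incident.
print(incident_open_indexes([True, False, True, False, False, False, False, True]))
print(cert_alert_level(10))
```

One trade-off to keep in mind: requiring N consecutive failures delays detection by roughly N - 1 check intervals, so for SLA math it is reasonable to count downtime from the first failed check of the streak rather than from the moment the incident opens.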
Reporting and transparency
A public status page and incident history turn an SLA from a promise into a verifiable fact. Customers see real uptime, and your team gets an accurate basis for SLA-credit math.
FAQ
How does 99.9% differ from 99.99% in practice?
99.9% allows ~8.76 hours of downtime per year; 99.99% only ~52.6 minutes. The infrastructure and cost gap is large.
What interval does a 99.99% SLA need?
30 seconds. A 99.99% target leaves only about 4.4 minutes of downtime per month, so at a longer interval short incidents can go unmeasured and uptime will be overstated.
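To put a number on this, here is a simplified model: checks are instantaneous, run at a fixed interval, and an outage starts at a uniformly random moment relative to the check schedule. Under those assumptions, the chance that at least one check lands inside the outage is the outage length divided by the interval, capped at 1. The function name is illustrative.

```python
def detection_probability(outage_s: float, interval_s: float) -> float:
    """Chance that at least one check falls inside an outage of `outage_s` seconds."""
    if outage_s < 0 or interval_s <= 0:
        raise ValueError("outage_s must be >= 0 and interval_s must be > 0")
    return min(1.0, outage_s / interval_s)


# A 4-minute outage against a 5-minute and a 30-second check interval.
print(detection_probability(240, 300))
print(detection_probability(240, 30))
```

In this model a 4-minute outage has a 20% chance of slipping past a 5-minute interval entirely, while a 30-second interval always sees it. Requiring several consecutive failures before opening an incident raises the minimum outage length that gets recorded, which is another reason to keep the interval short.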
Should I measure uptime myself or trust the provider?
Independently. Provider-only measurement leaves you without leverage in an SLA-credit dispute.
What counts as downtime?
Any time the service is unavailable to a user: 5xx errors, timeouts, DNS resolution failures, or an expired SSL certificate.
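That definition can be written down as an explicit rule so that every probe is classified the same way. The sketch below is one possible encoding; the `ProbeResult` fields are hypothetical and do not correspond to any specific monitoring tool. It also assumes that a probe with no HTTP response at all counts as downtime.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    status_code: Optional[int] = None   # None when no HTTP response was received
    timed_out: bool = False
    dns_failed: bool = False
    cert_expired: bool = False


def is_downtime(result: ProbeResult) -> bool:
    """Classify one probe as downtime using the definition above."""
    if result.dns_failed or result.timed_out or result.cert_expired:
        return True
    if result.status_code is None:
        return True  # assumption: no response at all counts as unavailable
    return 500 <= result.status_code <= 599


print(is_downtime(ProbeResult(status_code=200)))
print(is_downtime(ProbeResult(status_code=503)))
print(is_downtime(ProbeResult(timed_out=True)))
```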