MTTR, MTTF, MTBF: Reliability Metrics Explained for Web Operations
Reliability metrics are the language of uptime. When someone asks "how reliable is your service?", metrics like MTTR, MTTF, and MTBF provide objective answers. Understanding these metrics helps you set meaningful SLAs, prioritize improvements, and communicate with stakeholders.
MTTR — Mean Time to Repair
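MTTR measures the average time from the start of a failure to the point where the service is restored. Some teams start the clock at detection instead; this article counts detection time as part of MTTR, as the components below show. It's the most actionable reliability metric because it directly reflects your team's ability to respond to and fix issues.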
MTTR Formula
MTTR = Total repair time / Number of repairs
Example: 3 incidents took 30min + 120min + 15min = 165min total
MTTR = 165 / 3 = 55 minutes
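The calculation above can be sketched in a few lines of Python. The function name `mean_time_to_repair` is an illustrative choice, not part of any particular tool.

```python
def mean_time_to_repair(repair_minutes):
    """Return the arithmetic mean of the given repair durations, in minutes."""
    if not repair_minutes:
        raise ValueError("need at least one incident to compute MTTR")
    return sum(repair_minutes) / len(repair_minutes)


# The three incidents from the example: 30, 120, and 15 minutes.
mttr = mean_time_to_repair([30, 120, 15])
print(mttr)  # 55.0
```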
MTTR Components
- Detection time: How long until the failure is noticed (monitoring, alerts)
- Diagnosis time: How long to identify the root cause
- Repair time: How long to implement the fix
- Verification time: How long to confirm the service is restored
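One way to see where repair time goes is to split each incident into the four phases above. The sketch below assumes your incident tool records a timestamp for each transition; the `Incident` class and its field names are hypothetical and would need to be mapped to whatever your system actually stores.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical field names; adapt them to your incident records.
    started: datetime    # failure begins
    detected: datetime   # alert fires or someone notices
    diagnosed: datetime  # root cause identified
    fixed: datetime      # fix deployed
    verified: datetime   # service confirmed healthy


def phase_durations(incident):
    """Split one incident into the four MTTR components."""
    return {
        "detection": incident.detected - incident.started,
        "diagnosis": incident.diagnosed - incident.detected,
        "repair": incident.fixed - incident.diagnosed,
        "verification": incident.verified - incident.fixed,
    }


# Made-up timestamps for a single 30-minute incident.
start = datetime(2024, 1, 1, 12, 0)
incident = Incident(
    started=start,
    detected=start + timedelta(minutes=5),
    diagnosed=start + timedelta(minutes=20),
    fixed=start + timedelta(minutes=27),
    verified=start + timedelta(minutes=30),
)

phases = phase_durations(incident)
total = sum(phases.values(), timedelta())
for name, duration in phases.items():
    print(f"{name}: {duration}")
print(f"total: {total}")
```

Averaging each phase across incidents shows which of the four is the best target for improvement.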
Reducing MTTR
- Faster detection: Comprehensive monitoring with low-threshold alerts
- Faster diagnosis: Runbooks, good logging, observability tools
- Faster repair: Automated rollbacks, feature flags, pre-tested recovery procedures
- Faster verification: Automated health checks, synthetic monitoring
MTTF — Mean Time to Failure
MTTF measures the average time a system operates before its first failure. It's primarily used for non-repairable systems or new deployments. For web services, MTTF answers: "How long after deployment until something breaks?"
MTTF Formula
MTTF = Total uptime before failures / Number of failures
Example: 3 deployments ran for 72h, 168h, 48h before failing
MTTF = (72 + 168 + 48) / 3 = 96 hours
Improving MTTF
- Better testing (unit, integration, load)
- Gradual rollouts (canary deployments)
- Chaos engineering to find weaknesses proactively
- Capacity planning to prevent resource exhaustion
MTBF — Mean Time Between Failures
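MTBF measures the average time between consecutive failures for repairable systems. In the definition used here, it includes both uptime and repair time: MTBF = MTTF + MTTR. (Some sources count only uptime in MTBF, so check which definition a vendor or report is using.) This is the most commonly cited reliability metric for ongoing services.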
MTBF Formula
MTBF = Total elapsed time (uptime + downtime) / Number of failures
Example: Over a 720-hour month, the service had 3 failures
MTBF = 720 / 3 = 240 hours between failures
How They Relate
MTBF = MTTF + MTTR
|<------ MTBF ------>|<------ MTBF ------>|
|<-- MTTF -->|<MTTR->|<-- MTTF -->|<MTTR->|
|   uptime   | down  |   uptime   | down  |
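As a rough sketch, all three metrics can be derived from one observation window and a list of outage durations. The function name and the 3-hour outage lengths are assumptions made for illustration; they are chosen so the result matches the 720-hour, 3-failure example used in this article.

```python
def reliability_metrics(window_hours, downtime_hours):
    """Return (MTTF, MTTR, MTBF) in hours for one observation window.

    window_hours: total elapsed time observed (uptime + downtime).
    downtime_hours: duration of each outage within that window.
    """
    failures = len(downtime_hours)
    if failures == 0:
        raise ValueError("no failures in the window; the metrics are undefined")
    total_down = sum(downtime_hours)
    mttr = total_down / failures
    mttf = (window_hours - total_down) / failures
    mtbf = window_hours / failures
    return mttf, mttr, mtbf


# 720-hour month with three outages, each assumed to last 3 hours.
mttf, mttr, mtbf = reliability_metrics(720, [3, 3, 3])
print(mttf, mttr, mtbf)  # 237.0 3.0 240.0
```

With these inputs, MTTF plus MTTR equals MTBF, which is the relationship the diagram shows.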
Availability from Metrics
Availability follows directly from these metrics:
Availability = MTTF / MTBF = MTTF / (MTTF + MTTR)
Example: MTTF = 237h, MTTR = 3h
Availability = 237 / (237 + 3) = 237 / 240 = 98.75%
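Because unavailability is roughly MTTR / MTTF, cutting MTTR by some factor helps about as much as raising MTTF by the same factor, and shortening repairs is typically the easier of the two. In this example, reducing MTTR from 3h to 1h raises availability to about 99.58%, while doubling MTTF to 474h only reaches about 99.37%.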
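To check this kind of trade-off for your own numbers, a minimal sketch of the availability formula is enough. The function name and the scenario variables are illustrative.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the service is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


baseline = availability(237, 3)       # the example above
faster_repair = availability(237, 1)  # MTTR cut from 3h to 1h
doubled_mttf = availability(474, 3)   # MTTF doubled, MTTR unchanged

print(f"baseline:      {baseline:.2%}")       # 98.75%
print(f"faster repair: {faster_repair:.2%}")  # 99.58%
print(f"doubled MTTF:  {doubled_mttf:.2%}")   # 99.37%
```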
Setting Targets
| Availability | Annual Downtime | Example MTBF/MTTR |
|---|---|---|
| 99% | 3.65 days | MTBF 100h, MTTR 1h |
| 99.9% | 8.76 hours | MTBF 1000h, MTTR 1h |
| 99.95% | 4.38 hours | MTBF 2000h, MTTR 1h |
| 99.99% | 52.6 minutes | MTBF 10000h, MTTR 1h |
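The annual downtime column can be reproduced for any target with a small helper. This sketch assumes a 365-day year (8760 hours) and ignores leap years; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760; ignores leap years


def annual_downtime_hours(availability_percent):
    """Hours of downtime per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


for target in (99, 99.9, 99.95, 99.99):
    hours = annual_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} h ({hours * 60:.1f} min)")
```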
Practical Tips
- Track all three metrics: MTTR shows response capability, MTTF shows system robustness, MTBF shows overall reliability
- Focus on MTTR first: It's typically easier and more impactful to reduce repair time than to prevent all failures
- Use percentiles, not just averages: A 30-minute MTTR can hide an 8-hour incident, so also track the median and p90/p95 repair times
- Segment by severity: Track metrics separately for SEV-1, SEV-2, SEV-3 incidents
- Review monthly: Trends matter more than individual values
- Automate measurement: Pull data from your incident management system, not manual tracking
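To illustrate the point about percentiles, the sketch below compares the mean with nearest-rank percentiles on a made-up set of repair times that contains one 8-hour outlier. The data and the `percentile` helper are hypothetical; nearest-rank is only one of several common percentile definitions.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical repair times in minutes; the last incident took 8 hours.
repair_minutes = [10, 12, 15, 8, 20, 9, 11, 14, 13, 480]

mean = statistics.mean(repair_minutes)
print(f"mean: {mean} min")                           # 59.2
print(f"p50:  {percentile(repair_minutes, 50)} min")  # 12
print(f"p90:  {percentile(repair_minutes, 90)} min")  # 20
print(f"p99:  {percentile(repair_minutes, 99)} min")  # 480
```

Here the mean is nearly five times the median, so reporting the mean alone would misrepresent both the typical incident and the worst one.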
Conclusion
MTTR, MTTF, and MTBF are complementary metrics that together paint a complete picture of service reliability. Start by measuring MTTR — it's the most actionable. Then track MTBF to understand your overall reliability trend. Use these numbers to set realistic SLAs, justify infrastructure investments, and demonstrate improvement over time.