SRE — Site Reliability Engineering · Definition & Examples

Anatoly Oshmanovsky

What is SRE

By Anatoly Oshmanovsky · Updated May 23, 2026

Key idea:

SRE (Site Reliability Engineering) is a discipline that originated at Google (2003, Ben Treynor Sloss) and applies software engineering principles to infrastructure and operations. Core ideas: **error budgets** (the amount of unreliability an SLO permits), **toil reduction** (automating manual work), **SLO-driven decisions** (data over gut feel), and shared ownership with dev teams. How it differs from DevOps: SRE is a software engineering role in which at least 50% of time goes to engineering work rather than operations, while DevOps is a set of practices and a culture.

Below: details, example, related terms, FAQ.

Details

Error budget: 100% minus the SLO. A 99.9% SLO allows about 43 minutes of downtime per 30-day month
Toil: manual, repetitive work that should be automated. Target: under 50% of an engineer's time
Blameless postmortems: focus on system fixes, not individual blame
On-call rotations with rest schedules
Shared goals with product teams, expressed as SLOs
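
The error budget arithmetic above can be checked with a few lines of Python. This is a minimal sketch, assuming the SLO is a pure availability target and the window is 30 days; the function name `error_budget` is made up for illustration.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime that an availability SLO allows over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    # Error budget = 100% minus the SLO, expressed as a fraction of the window.
    budget_fraction = (100.0 - slo_percent) / 100.0
    return window * budget_fraction


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        minutes = error_budget(slo).total_seconds() / 60
        print(f"{slo}% SLO allows {minutes:.1f} min of downtime per 30 days")
```

Each extra nine cuts the budget by a factor of ten, which is why the SLO is usually agreed with product teams rather than set as high as possible.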

Example

// SLO (Service Level Objective):
99.9% of HTTP requests return 2xx/3xx within 200ms, measured over 30 days

// Error budget depletion triggers:
- Feature freeze if remaining budget < 25%
- Automated rollback if burn rate > 10x
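
The example SLO and its triggers could be evaluated roughly as follows. This is a sketch under stated assumptions, not a reference implementation: `Request`, `is_good`, `burn_rate`, `budget_remaining`, and `policy_actions` are hypothetical names, burn rate is taken to mean the observed bad-request fraction divided by the fraction the SLO allows, and "remaining budget" is measured over the requests in the 30-day window.

```python
from dataclasses import dataclass

# Thresholds taken from the example above.
SLO_TARGET = 0.999         # 99.9% of requests must be good
LATENCY_LIMIT_MS = 200     # "within 200ms"
FREEZE_BELOW = 0.25        # feature freeze if < 25% of the budget remains
ROLLBACK_BURN_RATE = 10.0  # rollback if burn rate > 10x


@dataclass
class Request:
    status: int
    latency_ms: float


def is_good(req: Request) -> bool:
    """A request is good if it returns 2xx/3xx within the latency limit."""
    return 200 <= req.status < 400 and req.latency_ms <= LATENCY_LIMIT_MS


def burn_rate(requests) -> float:
    """Observed bad fraction divided by the bad fraction the SLO allows."""
    if not requests:
        return 0.0
    bad = sum(1 for r in requests if not is_good(r))
    return (bad / len(requests)) / (1.0 - SLO_TARGET)


def budget_remaining(window_requests) -> float:
    """Share of the window's error budget still unspent (negative if overspent)."""
    return 1.0 - burn_rate(window_requests)


def policy_actions(window_requests, recent_requests) -> list:
    """Apply the two example triggers and return the actions they call for."""
    actions = []
    if budget_remaining(window_requests) < FREEZE_BELOW:
        actions.append("feature_freeze")
    if burn_rate(recent_requests) > ROLLBACK_BURN_RATE:
        actions.append("rollback")
    return actions


if __name__ == "__main__":
    # Hypothetical traffic: 8 failures in 10,000 requests over the window,
    # and 2 slow responses in the most recent 100 requests.
    window = [Request(200, 50)] * 9992 + [Request(500, 50)] * 8
    recent = [Request(200, 50)] * 98 + [Request(200, 900)] * 2
    print(policy_actions(window, recent))
```

Separating the long window (budget remaining) from the short recent sample (burn rate) lets the freeze react to slow depletion while the rollback reacts to a sudden spike.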

Related Terms

Understanding SRE Metrics and Monitoring

The Role of Automation in SRE

SRE Best Practices for Incident Management

Frequently Asked Questions

SRE vs DevOps?

SRE is a concrete role (software engineering plus operations); DevOps is a culture and a set of practices. SRE is one way to implement DevOps. Google treats them as distinct; many companies use the terms interchangeably.

Is SRE overkill for a small team?

A dedicated SRE role is usually overkill for teams of fewer than 10 developers. The principles (SLOs, blameless postmortems, toil reduction) still apply at any size.

Required reading?

"Site Reliability Engineering" (2016) and "The Site Reliability Workbook" (2018), both free at sre.google/books. These are the canonical sources.
