SRE (Site Reliability Engineering): discipline from Google (2003, Ben Treynor Sloss) that applies software engineering principles to infra and ops. Core ideas: **error budgets** (the unreliability an SLO permits, i.e. 100% minus the SLO target), **toil reduction** (automate manual, repetitive work), **SLO-driven decisions** (data over gut feel), shared ownership with dev teams. Differs from DevOps: SRE is a SWE role with at least 50% of time spent on engineering work rather than ops; DevOps is a set of practices plus a culture.
Below: details, example, related terms, FAQ.
// SLO (Service Level Objective):
99.9% of HTTP requests return 2xx/3xx within 200ms, measured over 30 days
// Error budget depletion triggers:
- Feature freeze if remaining budget < 25%
- Automated rollback if burn rate > 10x

SRE: concrete role (SWE + ops). DevOps: culture + practices. SRE is one way to implement DevOps. Google treats them as distinct; many companies use the terms interchangeably.
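The example SLO and its two error budget triggers can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the request counts, and the choice of "last hour" as the burn rate window are all assumptions made for the sketch. A "bad" request here means one that did not return 2xx/3xx within 200ms.

```python
from typing import List

SLO_TARGET = 0.999         # from the example: 99.9% of requests must be "good"
FREEZE_BELOW = 0.25        # from the example: freeze when < 25% of budget remains
ROLLBACK_BURN_RATE = 10.0  # from the example: roll back when burn rate > 10x


def budget_remaining(total: int, bad: int, slo: float = SLO_TARGET) -> float:
    """Fraction of the error budget still unspent in the SLO window.

    The budget is the number of bad requests the SLO tolerates:
    total * (1 - slo). The result goes negative once the budget is overspent.
    """
    if total == 0:
        return 1.0  # no traffic, nothing spent
    return 1.0 - bad / (total * (1.0 - slo))


def burn_rate(total: int, bad: int, slo: float = SLO_TARGET) -> float:
    """Observed bad-request ratio divided by the ratio the SLO allows.

    1.0 means the budget would be used up exactly at the end of the window;
    10.0 means it is being spent ten times too fast.
    """
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo)


def triggered_actions(window_total: int, window_bad: int,
                      recent_total: int, recent_bad: int) -> List[str]:
    """Apply the two example triggers and return the actions that fire."""
    fired = []
    if budget_remaining(window_total, window_bad) < FREEZE_BELOW:
        fired.append("feature_freeze")
    if burn_rate(recent_total, recent_bad) > ROLLBACK_BURN_RATE:
        fired.append("automated_rollback")
    return fired


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    # 30-day window: 1,000,000 requests, 800 bad -> about 20% of budget left.
    # Last hour (assumed window): 10,000 requests, 150 bad -> burn rate about 15x.
    print(triggered_actions(1_000_000, 800, 10_000, 150))
```

For scale: applied to time instead of requests, a 99.9% target over 30 days would leave about 43 minutes of budget (0.1% of 43,200 minutes).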
A dedicated SRE role is usually overkill for teams of fewer than 10 devs, but the principles (SLOs, blameless postmortems, toil reduction) apply at any size.
"Site Reliability Engineering" (2016) and "The Site Reliability Workbook" (2018): both free at sre.google/books. Canonical sources.