What is SRE

Key idea:

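SRE (Site Reliability Engineering): a discipline that originated at Google (2003, Ben Treynor Sloss) and applies software engineering principles to infrastructure and operations. Core ideas: **error budgets** (the amount of unreliability an SLO permits), **toil reduction** (automate manual work), **SLO-driven** decisions (data over gut feel), and shared ownership with dev teams. Differs from DevOps: SRE is a software engineering role in which Google caps operational work at 50% of an engineer's time, leaving the rest for engineering work; DevOps is a set of practices plus a culture.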

Below: details, an example, and FAQ.

Details

  • Error budget: 100% - SLO. A 99.9% SLO allows about 43 min of downtime per 30-day month
  • Toil: manual, repetitive work that should be automated. Target: under 50% of an engineer's time
  • Blameless postmortems: focus on system fixes, not individual blame
  • On-call rotations with rest schedules
  • Shared goals with product teams via SLOs
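
The error budget arithmetic above can be checked with a few lines of Python. This is a minimal sketch, assuming a time-based availability SLO and a 30-day window; the function name `error_budget` is hypothetical and not part of any library.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime an availability SLO permits over the given window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    # Error budget = 100% - SLO, expressed as a fraction of the window.
    budget_fraction = (100 - slo_percent) / 100
    return window * budget_fraction


budget = error_budget(99.9)
print(round(budget.total_seconds() / 60, 1))  # 43.2 (minutes per 30 days)
```

A tighter SLO shrinks the budget quickly: each additional "nine" divides the allowed downtime by ten.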

Example

// SLO (Service Level Objective):
99.9% of HTTP requests return 2xx/3xx within 200ms, measured over 30 days

// Error budget policy triggers:
- Feature freeze if remaining budget < 25%
- Automated rollback if burn rate > 10x (budget consumed 10 times faster than the SLO allows)
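
The two triggers above could be evaluated roughly as follows. This is a sketch under stated assumptions: the SLO is request-based, the remaining budget is computed from requests seen so far in the 30-day window, and the burn rate is computed over a shorter recent window. The names `Window`, `burn_rate`, `budget_remaining`, and `policy_actions` are hypothetical, not taken from any monitoring tool.

```python
from dataclasses import dataclass

SLO_TARGET = 0.999         # 99.9% of requests must be "good"
FREEZE_THRESHOLD = 0.25    # freeze features when under 25% of budget remains
ROLLBACK_BURN_RATE = 10.0  # roll back when burning budget faster than 10x


@dataclass
class Window:
    total: int  # all requests observed in the window
    bad: int    # requests that failed or took longer than 200 ms


def burn_rate(w: Window, slo: float = SLO_TARGET) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if w.total == 0:
        return 0.0
    return (w.bad / w.total) / (1 - slo)


def budget_remaining(w: Window, slo: float = SLO_TARGET) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = w.total * (1 - slo)
    if allowed_bad == 0:
        return 1.0
    return 1 - w.bad / allowed_bad


def policy_actions(month: Window, recent: Window) -> list[str]:
    """Return the triggers that fire for a 30-day window and a recent window."""
    actions = []
    if budget_remaining(month) < FREEZE_THRESHOLD:
        actions.append("feature_freeze")
    if burn_rate(recent) > ROLLBACK_BURN_RATE:
        actions.append("rollback")
    return actions


# 800 bad out of 1,000,000 leaves about 20% of the budget; 150 bad out of
# 10,000 recent requests is a burn rate of about 15x.
print(policy_actions(Window(1_000_000, 800), Window(10_000, 150)))
```

Using separate windows for the two checks is a design choice: the long window tracks how much budget is left, while the short window catches a fast-moving incident before the budget is gone.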

Frequently Asked Questions

SRE vs DevOps?

SRE is a concrete role (SWE + ops). DevOps is a culture plus a set of practices. SRE is one way to implement DevOps. Google treats them as distinct; many companies use the terms interchangeably.

Small team — overkill?

A dedicated SRE role is usually overkill for teams of fewer than 10 devs. The principles (SLOs, blameless postmortems, toil reduction) apply at any size.

Required reading?

"Site Reliability Engineering" (2016) and "The Site Reliability Workbook" (2018), both free at sre.google/books. These are the canonical sources.