Skip to content

Что такое SRE

Коротко:

SRE (Site Reliability Engineering) — discipline from Google (2003, Ben Treynor Sloss), применяющая software engineering principles к infra+ops. Core ideas: **error budgets** (acceptable downtime), **toil reduction** (automate manual work), **SLO-driven** (data над gut feel), shared ownership с dev teams. Differ DevOps: SRE — SWE role с 50% coding, DevOps — практика + culture.

Ниже: подробности, пример, смежные термины, FAQ.

Подробности

  • Error budget: 100% - SLO. 99.9% SLO = 43 min/month downtime allowed
  • Toil: manual/repeatable work → automate. Target <50% time
  • Blameless postmortems: focus на system fixes, не individual blame
  • On-call rotations с rest schedules
  • Shared goals с product teams через SLO

Пример

// SLO (Service Level Objective):
99.9% of HTTP requests return 2xx/3xx within 200ms, measured over 30 days

// Error budget depletion triggers:
- Feature freeze if budget < 25%
- Automated rollback if burn rate > 10x

Смежные термины

Больше по теме

Часто задаваемые вопросы

SRE vs DevOps?

SRE — concrete role (SWE + ops). DevOps — culture + practices. SRE is a way to implement DevOps. Google treats them как distinct; many companies use терминологию interchangeably.

Для small team — overkill?

Full SRE role — yes для <10 devs. Но principles (SLO, blameless postmortems, toil reduction) применимы в любом размере.

Required reading?

"Site Reliability Engineering" (2016) + "SRE Workbook" (2018) — free at sre.google/books. Canonical source.