Monitoring as Code: Prometheus Rules and Grafana Dashboards
Monitoring as Code (MaC) applies infrastructure-as-code principles to observability. Instead of manually configuring dashboards, alerts, and recording rules through web interfaces, all monitoring configuration is defined in version-controlled files, reviewed through pull requests, and deployed through CI/CD pipelines.
Why Monitoring as Code?
Manual monitoring configuration creates fragile, undocumented observability setups that are difficult to reproduce, audit, and maintain. Codifying monitoring configuration solves these problems.
Key Benefits
- Reproducibility — monitoring setup can be reliably recreated across environments
- Version control — all changes are tracked, reviewed, and reversible
- Consistency — standardized patterns reduce configuration drift between teams
- Automation — monitoring deploys alongside application code in CI/CD
- Disaster recovery — entire monitoring stack can be rebuilt from code
- Knowledge sharing — configuration files serve as living documentation
Prometheus Recording Rules
Recording rules pre-compute frequently used or expensive PromQL expressions and save the result as new time series. This reduces query load on Prometheus and speeds up dashboard rendering.
Recording Rules Configuration
# recording-rules.yml
groups:
  - name: http_request_rates
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
        labels:
          aggregation: "rate5m"
      - record: job:http_request_duration:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
      - record: job:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
Recording Rule Best Practices
- Follow the naming convention level:metric:operations (e.g., job:http_requests:rate5m)
- Only create recording rules for expressions used in multiple dashboards or alerts
- Set appropriate evaluation intervals matching your scrape intervals
- Document the purpose of each rule with comments
- Test rules in a staging environment before deploying to production
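As part of that workflow, rule files can be syntax-checked locally before a pull request is even opened, using promtool, the validation tool that ships with the Prometheus distribution:

```shell
# Syntax-check the recording rules file before committing.
# promtool exits non-zero and reports the offending expression
# if any rule fails to parse or evaluate.
promtool check rules recording-rules.yml
```

The same command works for alerting rule files, so a single pre-commit hook can cover both.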
Prometheus Alerting Rules
Alerting rules define conditions that trigger notifications when metrics cross thresholds. Well-designed alerts are actionable, properly scoped, and include sufficient context for responders.
Alerting Rules Configuration
# alerting-rules.yml
groups:
  - name: application_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High HTTP error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for job {{ $labels.job }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"
          dashboard: "https://grafana.internal/d/app-overview"
      - alert: HighLatency
        expr: job:http_request_duration:p99 > 2.0
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High p99 latency on {{ $labels.job }}"
          description: "p99 latency is {{ $value }}s for job {{ $labels.job }}"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
        for: 15m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
Alert Design Principles
- Alert on symptoms, not causes — alert on user-visible impact (high latency, errors) rather than internal metrics
- Use appropriate thresholds — set thresholds based on SLO targets and historical data
- Include a "for" duration — require conditions to persist before firing to avoid flapping
- Provide context — annotations should include runbook links, dashboard URLs, and current values
- Set proper severity — distinguish between critical (pages on-call) and warning (next business day)
Grafana Dashboards as Code
Grafana dashboards can be defined as JSON or generated programmatically with tools such as Grafonnet (a Jsonnet library) or the Grafana Terraform provider. This keeps dashboards consistent, version-controlled, and automatically deployed.
Dashboard JSON Structure
{
  "dashboard": {
    "title": "Application Overview",
    "uid": "app-overview",
    "tags": ["application", "production"],
    "timezone": "utc",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "job:http_requests:rate5m",
            "legendFormat": "{{ job }}"
          }
        ]
      }
    ]
  }
}
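Besides file-based provisioning (covered next), a dashboard file in this wrapper format can also be pushed through Grafana's HTTP API, which is convenient from a deployment script. The GRAFANA_URL and GRAFANA_TOKEN variables and the file name below are placeholders for your own setup:

```shell
# Upload the dashboard via Grafana's dashboard API.
# The file already contains the required {"dashboard": {...}} wrapper;
# adding "overwrite": true to the payload replaces an existing
# dashboard with the same uid instead of failing.
curl -sS -X POST "$GRAFANA_URL/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @overview.json
```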
Provisioning Dashboards
# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "Application"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
CI/CD Pipeline Integration
Monitoring configuration should be validated and deployed through the same CI/CD pipeline as application code.
Validation Pipeline
# .github/workflows/monitoring.yml
name: Monitoring Validation
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Prometheus rules
        run: promtool check rules monitoring/rules/*.yml
      - name: Validate alerting rules
        run: promtool check rules monitoring/alerts/*.yml
      - name: Unit test rules
        run: promtool test rules monitoring/tests/*.yml
      - name: Validate Grafana dashboards
        run: |
          for f in monitoring/dashboards/*.json; do
            python -m json.tool "$f" > /dev/null
          done
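Once validation passes and updated rule files land on the server, Prometheus has to be told to pick them up. One common approach, assuming Prometheus was started with the --web.enable-lifecycle flag, is a reload request; the hostname below is a placeholder:

```shell
# Ask Prometheus to reload its configuration and rule files in place.
# Requires the --web.enable-lifecycle startup flag; the endpoint
# returns HTTP 200 when the reload succeeds.
curl -X POST http://prometheus.internal:9090/-/reload
```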
Testing Monitoring Configuration
Prometheus provides built-in support for testing alerting and recording rules with unit tests defined in YAML files.
Rule Unit Tests
# tests/alert-tests.yml
rule_files:
  # HighErrorRate is built on the job:http_errors:ratio5m recording
  # rule, so both rule files must be loaded for the test to work.
  - ../rules/recording-http.yml
  - ../alerts/application.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api", status="500"}'
        values: "0+10x20"
      - series: 'http_requests_total{job="api", status="200"}'
        values: "0+100x20"
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              team: backend
              job: api
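The test file above runs locally with the same promtool binary used in CI:

```shell
# Execute the rule unit tests; promtool exits non-zero if an
# expected alert does not fire or an unexpected one does.
promtool test rules tests/alert-tests.yml
```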
Directory Structure
Organize monitoring configuration in a clear, maintainable directory structure alongside your application code.
monitoring/
  rules/
    recording-http.yml
    recording-database.yml
  alerts/
    application.yml
    infrastructure.yml
    business.yml
  dashboards/
    application/
      overview.json
      api-details.json
    infrastructure/
      nodes.json
      kubernetes.json
  tests/
    alert-tests.yml
    recording-tests.yml
  provisioning/
    dashboards.yml
    datasources.yml
Conclusion
Monitoring as Code transforms observability from a fragile manual process into a reliable, auditable, and automated practice. By defining Prometheus rules and Grafana dashboards in version-controlled files, teams gain consistency, reproducibility, and the ability to evolve monitoring alongside applications. Start by codifying your most critical alerts and dashboards, validate them in CI, and gradually expand coverage as the practice matures.