Monitoring as Code: Prometheus Rules and Grafana Dashboards
Monitoring as Code (MaC) applies infrastructure-as-code principles to observability. Instead of manually configuring dashboards, alerts, and recording rules through web interfaces, all monitoring configuration is defined in version-controlled files, reviewed through pull requests, and deployed through CI/CD pipelines.
Why Monitoring as Code?
Manual monitoring configuration creates fragile, undocumented observability setups that are difficult to reproduce, audit, and maintain. Codifying monitoring configuration solves these problems.
Key Benefits
- Reproducibility — monitoring setup can be reliably recreated across environments
- Version control — all changes are tracked, reviewed, and reversible
- Consistency — standardized patterns reduce configuration drift between teams
- Automation — monitoring deploys alongside application code in CI/CD
- Disaster recovery — entire monitoring stack can be rebuilt from code
- Knowledge sharing — configuration files serve as living documentation
Prometheus Recording Rules
Recording rules pre-compute frequently used or expensive PromQL expressions and save the result as new time series. This reduces query load on Prometheus and speeds up dashboard rendering.
Recording Rules Configuration
# recording-rules.yml
groups:
  - name: http_request_rates
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
        labels:
          aggregation: "rate5m"
      - record: job:http_request_duration:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
      - record: job:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
Recording Rule Best Practices
- Follow the naming convention level:metric:operations (e.g., job:http_requests:rate5m)
- Only create recording rules for expressions used in multiple dashboards or alerts
- Set appropriate evaluation intervals matching your scrape intervals
- Document the purpose of each rule with comments
- Test rules in a staging environment before deploying to production
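As part of that workflow, rule files can be syntax-checked locally before a pull request is even opened, using promtool, the validation tool that ships with the Prometheus distribution:

```shell
# Syntax-check the recording rules file before committing.
# promtool exits non-zero and reports the offending expression
# if any rule fails to parse or evaluate.
promtool check rules recording-rules.yml
```

The same command works for alerting rule files, so a single pre-commit hook can cover both.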
Prometheus Alerting Rules
Alerting rules define conditions that trigger notifications when metrics cross thresholds. Well-designed alerts are actionable, properly scoped, and include sufficient context for responders.
Alerting Rules Configuration
# alerting-rules.yml
groups:
  - name: application_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High HTTP error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for job {{ $labels.job }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"
          dashboard: "https://grafana.internal/d/app-overview"
      - alert: HighLatency
        expr: job:http_request_duration:p99 > 2.0
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High p99 latency on {{ $labels.job }}"
          description: "p99 latency is {{ $value }}s for job {{ $labels.job }}"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
        for: 15m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
Alert Design Principles
- Alert on symptoms, not causes — alert on user-visible impact (high latency, errors) rather than internal metrics
- Use appropriate thresholds — set thresholds based on SLO targets and historical data
- Include a "for" duration — require conditions to persist before firing to avoid flapping
- Provide context — annotations should include runbook links, dashboard URLs, and current values
- Set proper severity — distinguish between critical (pages on-call) and warning (next business day)
Grafana Dashboards as Code
Grafana dashboards can be defined as JSON or generated programmatically with tools such as Grafonnet (a Jsonnet library) or the Grafana Terraform provider. This keeps dashboards consistent, version-controlled, and automatically deployed.
Dashboard JSON Structure
{
  "dashboard": {
    "title": "Application Overview",
    "uid": "app-overview",
    "tags": ["application", "production"],
    "timezone": "utc",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "job:http_requests:rate5m",
            "legendFormat": "{{ job }}"
          }
        ]
      }
    ]
  }
}
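Besides file-based provisioning (covered next), a dashboard file in this wrapper format can also be pushed through Grafana's HTTP API, which is convenient from a deployment script. The GRAFANA_URL and GRAFANA_TOKEN variables and the file name below are placeholders for your own setup:

```shell
# Upload the dashboard via Grafana's dashboard API.
# The file already contains the required {"dashboard": {...}} wrapper;
# adding "overwrite": true to the payload replaces an existing
# dashboard with the same uid instead of failing.
curl -sS -X POST "$GRAFANA_URL/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @overview.json
```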
Provisioning Dashboards
# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "Application"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
CI/CD Pipeline Integration
Monitoring configuration should be validated and deployed through the same CI/CD pipeline as application code.
Validation Pipeline
# .github/workflows/monitoring.yml
name: Monitoring Validation
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Prometheus rules
        run: promtool check rules monitoring/rules/*.yml
      - name: Validate alerting rules
        run: promtool check rules monitoring/alerts/*.yml
      - name: Unit test rules
        run: promtool test rules monitoring/tests/*.yml
      - name: Validate Grafana dashboards
        run: |
          for f in monitoring/dashboards/*.json; do
            python -m json.tool "$f" > /dev/null
          done
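Once validation passes and updated rule files land on the server, Prometheus has to be told to pick them up. One common approach, assuming Prometheus was started with the --web.enable-lifecycle flag, is a reload request; the hostname below is a placeholder:

```shell
# Ask Prometheus to reload its configuration and rule files in place.
# Requires the --web.enable-lifecycle startup flag; the endpoint
# returns HTTP 200 when the reload succeeds.
curl -X POST http://prometheus.internal:9090/-/reload
```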
Testing Monitoring Configuration
Prometheus provides built-in support for testing alerting and recording rules with unit tests defined in YAML files.
Rule Unit Tests
# tests/alert-tests.yml
rule_files:
  # HighErrorRate is built on the job:http_errors:ratio5m recording
  # rule, so both rule files must be loaded for the test to work.
  - ../rules/recording-http.yml
  - ../alerts/application.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api", status="500"}'
        values: "0+10x20"
      - series: 'http_requests_total{job="api", status="200"}'
        values: "0+100x20"
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              team: backend
              job: api
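The test file above runs locally with the same promtool binary used in CI:

```shell
# Execute the rule unit tests; promtool exits non-zero if an
# expected alert does not fire or an unexpected one does.
promtool test rules tests/alert-tests.yml
```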
Directory Structure
Organize monitoring configuration in a clear, maintainable directory structure alongside your application code.
monitoring/
  rules/
    recording-http.yml
    recording-database.yml
  alerts/
    application.yml
    infrastructure.yml
    business.yml
  dashboards/
    application/
      overview.json
      api-details.json
    infrastructure/
      nodes.json
      kubernetes.json
  tests/
    alert-tests.yml
    recording-tests.yml
  provisioning/
    dashboards.yml
    datasources.yml
Conclusion
Monitoring as Code transforms observability from a fragile manual process into a reliable, auditable, and automated practice. By defining Prometheus rules and Grafana dashboards in version-controlled files, teams gain consistency, reproducibility, and the ability to evolve monitoring alongside applications. Start by codifying your most critical alerts and dashboards, validate them in CI, and gradually expand coverage as the practice matures.