DNS Failover: Automatic Traffic Switching for High Availability
What Is DNS Failover?
DNS failover is an automated mechanism that redirects traffic from a failing server or data center to a healthy one by modifying DNS responses in real time. When a health check detects that the primary server is down, the DNS provider automatically updates the record to point to a backup server, ensuring continuous service availability.
Traditional DNS is static: records are set manually and change only when an administrator updates them. DNS failover adds an intelligence layer that combines continuous health monitoring with automatic record switching. As a result, users are routed to working infrastructure without manual intervention once their resolvers pick up the updated record.
How DNS Failover Works
The process involves three components working together:
- Health checks: the DNS provider continuously monitors your servers by sending HTTP requests, TCP connection attempts, or ICMP pings at regular intervals (typically every 30–60 seconds)
- Failover logic: when a health check fails a configured number of times in a row (e.g., 3 consecutive failures), the system marks the server as unhealthy
- DNS record update: the DNS response changes to return the IP of the backup server instead of the failed primary
Example flow:
Normal operation:
User → DNS query → api.example.com → 203.0.113.10 (primary)
Primary fails health check (3 consecutive failures):
System marks 203.0.113.10 as DOWN
Failover activated:
User → DNS query → api.example.com → 203.0.113.20 (backup)
Primary recovers (3 consecutive successes):
System marks 203.0.113.10 as UP
User → DNS query → api.example.com → 203.0.113.10 (primary)
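The flow above can be sketched as a small state machine. This is a minimal illustration, not any provider's implementation: the class name `FailoverState`, its fields, and the default thresholds of 3 are assumptions chosen to mirror the example.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    """Tracks one primary/backup pair. All names here are illustrative."""
    primary: str
    backup: str
    unhealthy_threshold: int = 3  # consecutive failures before marking DOWN
    healthy_threshold: int = 3    # consecutive successes before marking UP
    primary_up: bool = True
    _streak: int = 0              # consecutive results that contradict primary_up

    def record_check(self, success: bool) -> None:
        """Feed in one health check result for the primary."""
        if success == self.primary_up:
            # The result agrees with the current state, so any
            # contradicting streak is broken.
            self._streak = 0
            return
        self._streak += 1
        needed = self.healthy_threshold if success else self.unhealthy_threshold
        if self._streak >= needed:
            self.primary_up = success
            self._streak = 0

    def answer(self) -> str:
        """IP address the DNS response should carry right now."""
        return self.primary if self.primary_up else self.backup


state = FailoverState("203.0.113.10", "203.0.113.20")
for ok in (False, False, False):
    state.record_check(ok)
print(state.answer())  # 203.0.113.20 (failover activated)
for ok in (True, True, True):
    state.record_check(ok)
print(state.answer())  # 203.0.113.10 (primary restored)
```

Requiring a full streak in both directions means a single stray failure or success never changes the answer, which is the same property the thresholds in the flow rely on.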
TTL and Failover Speed
The critical factor in DNS failover speed is TTL (Time to Live). When a DNS record has a TTL of 3600 seconds (1 hour), resolvers may keep serving the old IP for up to an hour after failover triggers. During this period, users behind those resolvers are still sent to the failed server.
To make failover effective, choose a TTL that matches how quickly traffic needs to move:
| TTL Value | Failover Speed | DNS Query Load | Use Case |
|---|---|---|---|
| 30s | ~30–90s | Very high | Critical services requiring fast failover |
| 60s | ~1–3 min | High | Production APIs and websites |
| 300s | ~5–10 min | Moderate | General web properties |
| 3600s | ~1 hour | Low | Not suitable for failover |
Trade-off: a low TTL gives faster failover but increases DNS query volume, which puts more load on DNS infrastructure and adds slight latency to any request that needs a fresh DNS lookup because the cached record has expired.
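To see how TTL combines with health check settings, here is a rough upper-bound estimate. It assumes the worst case on both sides: the primary fails right after a successful check, and a resolver cached the old record just before the switch. The function name and the formula are a simplification for illustration; it ignores resolvers that disregard TTL and any delay inside the DNS provider.

```python
def worst_case_failover_seconds(interval: int, unhealthy_threshold: int,
                                timeout: int, ttl: int) -> int:
    """Rough upper bound on how long a user may keep hitting a dead primary."""
    # Detection: wait for `unhealthy_threshold` checks to come due, and the
    # last of them may take up to `timeout` seconds to be declared failed.
    detection = interval * unhealthy_threshold + timeout
    # Cache expiry: a resolver that cached the record just before the
    # switch keeps serving it for the full TTL.
    return detection + ttl


print(worst_case_failover_seconds(30, 3, 10, 60))    # 160
print(worst_case_failover_seconds(30, 3, 10, 3600))  # 3700
```

With a 60-second TTL the estimate stays under three minutes, while a 3600-second TTL pushes it past an hour regardless of how quickly the failure is detected.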
Failover Strategies
Active-Passive
One primary server handles all traffic. One or more backup servers stay idle until failover. Simple and cost-effective, but backup capacity is wasted during normal operation.
# DNS configuration example
api.example.com A 203.0.113.10 (primary, active)
api.example.com A 203.0.113.20 (backup, returned only on failover)
Active-Active
Multiple servers share traffic during normal operation (round-robin or weighted). When one fails, its share is distributed among remaining servers. More efficient — no idle capacity — but requires all servers to handle extra load during failover.
# Active-active with health checks
api.example.com A 203.0.113.10 weight=70 (primary region)
api.example.com A 203.0.113.20 weight=30 (secondary region)
# If .10 fails, all traffic goes to .20
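A minimal sketch of how weighted answers with health filtering might be chosen, assuming the 70/30 weights from the example. The function `pick_record` and the `RECORDS` list are hypothetical; real providers implement this inside their authoritative servers.

```python
import random

# (ip, weight) pairs taken from the example above.
RECORDS = [("203.0.113.10", 70), ("203.0.113.20", 30)]


def pick_record(records, healthy, rng=random):
    """Pick one IP by weight, considering only IPs in the `healthy` set."""
    candidates = [(ip, weight) for ip, weight in records if ip in healthy]
    if not candidates:
        return None  # nothing healthy left to return
    ips, weights = zip(*candidates)
    return rng.choices(ips, weights=weights, k=1)[0]


# Both healthy: roughly 70% of answers are .10, 30% are .20.
print(pick_record(RECORDS, {"203.0.113.10", "203.0.113.20"}))
# .10 unhealthy: every answer is .20.
print(pick_record(RECORDS, {"203.0.113.20"}))
```

Filtering before weighting means the surviving servers keep their relative proportions, so with three or more servers the failed server's share is spread across the rest by weight.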
Geographic (GeoDNS) Failover
Route users to the nearest data center based on their location. If a regional server fails, users are redirected to the next closest healthy region. Combines latency optimization with high availability.
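One way to picture geographic failover is an ordered preference list per client region, where the first healthy entry wins. The region names and the ordering below are assumptions for illustration; actual GeoDNS services derive proximity from their own location databases.

```python
# Hypothetical preference order: nearest data center first.
REGION_PREFERENCE = {
    "na": ["us-east", "eu-west", "ap-southeast"],
    "eu": ["eu-west", "us-east", "ap-southeast"],
    "apac": ["ap-southeast", "us-east", "eu-west"],
}
DEFAULT_ORDER = ["us-east", "eu-west", "ap-southeast"]


def resolve_region(client_region, healthy_regions):
    """Return the closest healthy data center for a client, or None."""
    for region in REGION_PREFERENCE.get(client_region, DEFAULT_ORDER):
        if region in healthy_regions:
            return region
    return None


print(resolve_region("eu", {"us-east", "eu-west", "ap-southeast"}))  # eu-west
print(resolve_region("eu", {"us-east", "ap-southeast"}))             # us-east
```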
Health Check Configuration
Effective health checks must be:
- Specific: check the actual service, not just that the server responds to ping. An HTTP check to `/health` that verifies database connectivity is better than an ICMP ping
- Frequent: every 30–60 seconds for critical services
- Resilient: require multiple consecutive failures before triggering failover (prevent flapping from transient network issues)
- From multiple locations: a health check from a single location may fail due to network path issues, not actual downtime. Use checks from 3+ geographic locations
# Health check configuration example
endpoint: https://api.example.com/health
method: GET
interval: 30s
timeout: 10s
healthy_threshold: 3 # 3 successes to mark UP
unhealthy_threshold: 3 # 3 failures to mark DOWN
expected_status: 200
expected_body: "ok"
check_regions:
- us-east
- eu-west
- ap-southeast
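The configuration above could translate into a probe like the following sketch, using only the Python standard library. Two points are assumptions rather than anything the configuration states: `expected_body` is treated as a substring match, and results from several regions are combined by strict majority. The function names `probe` and `region_verdict` are illustrative.

```python
import http.client
import urllib.request
from typing import Mapping


def probe(url: str, timeout: float = 10.0,
          expected_status: int = 200, expected_body: str = "ok") -> bool:
    """Run one HTTP GET health check; any error or mismatch counts as failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            # Assumption: the body only needs to contain the expected text.
            return resp.status == expected_status and expected_body in body
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP 4xx/5xx responses
        # (urllib raises HTTPError, a subclass of OSError, for those).
        return False


def region_verdict(results: Mapping[str, bool]) -> bool:
    """Healthy unless a strict majority of regions report failure."""
    failures = sum(1 for ok in results.values() if not ok)
    return failures * 2 <= len(results)


# One region failing is treated as a network path issue, not downtime.
print(region_verdict({"us-east": True, "eu-west": False, "ap-southeast": True}))   # True
print(region_verdict({"us-east": False, "eu-west": False, "ap-southeast": True}))  # False
```

In practice each region would call `probe("https://api.example.com/health")` on its own schedule and report the result to whatever component applies the thresholds.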
DNS Failover vs Load Balancer Failover
Both provide failover, but at different layers:
| Feature | DNS Failover | Load Balancer |
|---|---|---|
| Layer | DNS (before connection) | Network/Application (L4/L7) |
| Speed | Seconds to minutes (TTL dependent) | Milliseconds to seconds |
| Scope | Cross-region, cross-provider | Within a cluster or region |
| Cost | Low (DNS service fee) | Higher (LB infrastructure) |
| Granularity | Server/IP level | Request level |
| Session persistence | Not possible | Supported |
Best practice: use both. Load balancers for fast failover within a region, DNS failover for cross-region disaster recovery.
Common Pitfalls
- High TTL — the most common mistake. A 3600s TTL makes DNS failover nearly useless. Lower to 60–300s for services requiring failover
- Flapping — aggressive health check thresholds cause rapid switching between primary and backup, confusing caches and users. Use 3+ consecutive failures before failover
- Untested backup — the backup server has not been tested under production load. Failover activates, and the backup immediately collapses. Test backup capacity regularly
- Sticky resolvers — some ISP resolvers ignore TTL and cache records longer. You cannot fully control client-side caching behavior
- No failback plan — once the primary recovers, when and how do you switch back? Automatic failback can be risky if the primary is still unstable
- Single point of failure in DNS — if the DNS provider itself goes down, failover does not work. Consider multi-provider DNS setups for critical services
Monitoring DNS Failover
Continuously verify your failover setup:
- Monitor health check status and response times from multiple regions
- Track DNS propagation after failover events — use tools like Enterno.io to verify records from global locations
- Alert on failover events (both activation and recovery)
- Regularly test failover by simulating primary failure
- Monitor TTL compliance across major resolvers
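As a starting point for such monitoring, a small script can resolve the hostname and report which side is currently being served. This sketch uses the standard library's `socket.gethostbyname_ex`, which only sees what the local resolver returns (IPv4, possibly cached), so it complements rather than replaces checks from multiple global locations. The function names are hypothetical.

```python
import socket


def system_resolve(hostname: str) -> set:
    """Resolve IPv4 addresses through the local system resolver."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return set(addresses)


def classify_answer(addresses: set, primary: str, backup: str) -> str:
    """Label a set of resolved addresses as primary, failover, or unexpected."""
    if addresses == {primary}:
        return "primary"
    if addresses == {backup}:
        return "failover"
    return "unexpected"


# Example with a fixed answer instead of a live lookup:
print(classify_answer({"203.0.113.20"}, "203.0.113.10", "203.0.113.20"))  # failover
```

A live check would pass `system_resolve("api.example.com")` as the first argument and raise an alert whenever the label changes.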
Conclusion
DNS failover is an essential component of high availability architecture. It provides cross-region resilience at low cost, complementing load balancers that handle intra-region failover. Configure low TTLs, implement robust health checks from multiple locations, test your backup infrastructure under load, and monitor failover events continuously. Combined with proper monitoring, DNS failover ensures your services remain accessible even when individual servers or entire data centers fail.