Zero-Downtime Deployment Strategies
Zero-downtime deployment is the practice of releasing new versions of an application without any interruption to end users. In a world where even seconds of downtime can mean lost revenue, damaged reputation, and broken SLAs, mastering zero-downtime deployment is essential for any production-grade web service.
Why Downtime Happens During Deployments
Traditional deployments cause downtime because the application must be stopped to replace its code and restarted to load the new version. During this window, incoming requests are refused, time out, or receive error responses. Common causes include:
- Stopping the old version before the new one is ready
- Database migrations that lock tables
- Configuration changes that require restarts
- Dependency updates that break compatibility
- Cold start time for the new application instance
Blue-Green Deployment
Blue-green deployment maintains two identical production environments. At any time, one (blue) serves live traffic while the other (green) is idle or being updated. When the new version is deployed to the green environment and validated, traffic is switched from blue to green.
# Conceptual flow
1. Blue environment serves production traffic
2. Deploy new version to Green environment
3. Run smoke tests on Green
4. Switch load balancer from Blue → Green
5. Green now serves production traffic
6. Blue becomes the staging/rollback environment
# Nginx switch (simplified)
upstream app {
    # Before switch:
    # server blue.internal:8080;
    # After switch:
    server green.internal:8080;
}
Advantages: Instant rollback by switching back to blue. Full environment validation before traffic switch. No mixed-version traffic.
Disadvantages: Requires double the infrastructure. Database schema changes must be backward compatible across both environments.
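The switch-and-rollback flow above can be sketched in a few lines of Python. This is only an illustrative model, not a real load balancer integration: `BlueGreenRouter`, `deploy`, `rollback`, and the `smoke_test` callback are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BlueGreenRouter:
    """Tracks which environment is live; stands in for a load balancer (hypothetical)."""
    live: str = "blue"
    idle: str = "green"

    def switch(self) -> None:
        # Swap roles: the idle environment goes live and the previous
        # live environment is kept as the rollback target.
        self.live, self.idle = self.idle, self.live


def deploy(router: BlueGreenRouter, smoke_test: Callable[[str], bool]) -> bool:
    """Switch traffic to the idle environment only if its smoke test passes."""
    candidate = router.idle
    if not smoke_test(candidate):
        return False  # live traffic never left the old environment
    router.switch()
    return True


def rollback(router: BlueGreenRouter) -> None:
    """Rollback is the same pointer swap in the opposite direction."""
    router.switch()
```

The point of the sketch is that both the release and the rollback are a single pointer swap, which is why blue-green rollbacks are fast as long as the old environment is left running.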
Rolling Deployment
Rolling deployment gradually replaces instances of the old version with the new version, one (or a few) at a time. At any point during the rollout, both versions may be serving traffic simultaneously.
# Kubernetes rolling update
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web-app          # Required in apps/v1; must match the template labels
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1     # At most 1 pod down during update
      maxSurge: 1           # At most 1 extra pod during update
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: app
          image: myapp:v2.0
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
Advantages: No need for double infrastructure. Gradual rollout reduces blast radius. Works natively with container orchestrators.
Disadvantages: Both versions run simultaneously, requiring backward compatibility. Rollback is slower (must roll back each instance). Session management across versions needs careful handling.
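To see how `maxUnavailable` and `maxSurge` bound a rollout, here is a small Python simulation. It is a simplified model and not the actual Kubernetes controller logic: the function name `rolling_update` is hypothetical, and it assumes every new pod becomes ready immediately, whereas a real rollout waits for the readiness probe.

```python
from typing import List, Tuple


def rolling_update(replicas: int, max_unavailable: int = 1,
                   max_surge: int = 1) -> List[Tuple[int, int]]:
    """Return the (old_pods, new_pods) counts after each rollout step."""
    if max_unavailable == 0 and max_surge == 0:
        raise ValueError("maxUnavailable and maxSurge cannot both be 0")

    old, new = replicas, 0
    steps = []
    while old > 0 or new < replicas:
        # Surge: start new pods without exceeding replicas + max_surge in total.
        start = min(replicas - new, replicas + max_surge - (old + new))
        new += start
        # Retire old pods without dropping below replicas - max_unavailable in total.
        stop = min(old, (old + new) - (replicas - max_unavailable))
        old -= stop
        steps.append((old, new))
    return steps
```

With the values from the manifest above (`replicas=4`, `max_unavailable=1`, `max_surge=1`), the total pod count in this model stays between 3 and 5 at every step, which is the capacity guarantee the two settings are meant to provide.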
Canary Deployment
Canary deployment routes a small percentage of traffic to the new version while the majority continues to hit the old version. If metrics look healthy, the percentage is gradually increased until 100% of traffic reaches the new version.
# Nginx canary with weighted upstream
upstream app {
    server old-version.internal:8080 weight=90;
    server new-version.internal:8080 weight=10; # 10% canary
}
# Gradual rollout stages:
# Stage 1: 5% to canary → monitor for 15 minutes
# Stage 2: 25% to canary → monitor for 30 minutes
# Stage 3: 50% to canary → monitor for 30 minutes
# Stage 4: 100% to canary → deployment complete
Advantages: Low risk, because only a fraction of users are exposed to the new version. Real production traffic validation. Issues are detected before full rollout.
Disadvantages: Requires traffic-splitting infrastructure. Metrics and monitoring must be granular enough to detect issues in the canary pool. Slower overall deployment time.
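The staged rollout can also be expressed in application code. The following Python sketch assumes a hash-based, per-user split (unlike the weighted Nginx upstream, which splits per request) and a 1% error-rate limit for promotion. `routes_to_canary`, `next_stage`, and `STAGES` are hypothetical names, and the threshold is an assumption for illustration rather than a recommendation.

```python
import hashlib

STAGES = [5, 25, 50, 100]  # canary percentages, mirroring the rollout stages above


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically map a user to one of 100 buckets and compare to the split."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent


def next_stage(current_percent: int, canary_error_rate: float,
               max_error_rate: float = 0.01) -> int:
    """Advance one stage if the canary looks healthy, otherwise fall back to 0%."""
    if canary_error_rate > max_error_rate:
        return 0  # abort: send all traffic back to the old version
    for stage in STAGES:
        if stage > current_percent:
            return stage
    return 100
```

Because the bucket depends only on the user ID, a user who lands on the canary at 5% stays on it as the percentage grows, which avoids flipping individual users back and forth between versions.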
Database Migration Strategies
Database schema changes are often the hardest part of zero-downtime deployment. A migration that locks a table or removes a column that is still in use can break the running application. A common solution is the expand-and-contract pattern:
Expand Phase
Add new columns, tables, or indexes without removing anything. The old application version continues to work because nothing it depends on has changed.
-- Step 1: Add new column with a default (typically an instant, metadata-only change on InnoDB in MySQL 8.0.12+)
ALTER TABLE users ADD COLUMN email_verified TINYINT(1) DEFAULT 0;
-- Step 2: Deploy new application version that writes to both old and new columns
-- Step 3: Backfill existing data (on large tables, run this in batches to keep row locks short)
UPDATE users SET email_verified = 1 WHERE verified_at IS NOT NULL;
Contract Phase
After the new application version is stable and all data has been migrated, remove the old columns or tables in a subsequent deployment.
-- Step 4: Deploy version that only reads from new column
-- Step 5: Remove old column (in a separate migration)
ALTER TABLE users DROP COLUMN verified_at;
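On large tables, the backfill in step 3 is usually safer when run in small batches rather than as one long `UPDATE`. The Python sketch below illustrates the idea with the standard-library `sqlite3` module standing in for the production database. It assumes the `users` table has an integer primary key named `id`, which the original schema does not show. The function names are hypothetical, and SQLite's `INTEGER` replaces MySQL's `TINYINT(1)`.

```python
import sqlite3


def expand_schema(conn: sqlite3.Connection) -> None:
    """Expand phase: add the new column without touching existing ones."""
    conn.execute("ALTER TABLE users ADD COLUMN email_verified INTEGER DEFAULT 0")
    conn.commit()


def backfill_email_verified(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Backfill email_verified in primary-key ranges, committing after each batch."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM users").fetchone()[0]
    updated = 0
    for start in range(0, max_id + 1, batch_size):
        cur = conn.execute(
            "UPDATE users SET email_verified = 1 "
            "WHERE id >= ? AND id < ? "
            "AND verified_at IS NOT NULL AND email_verified = 0",
            (start, start + batch_size),
        )
        updated += cur.rowcount
        conn.commit()  # short transactions keep lock times small
    return updated
```

The `email_verified = 0` condition makes the backfill idempotent, so it can be re-run after an interruption without rewriting rows that were already migrated.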
Graceful Shutdown
Applications must handle in-flight requests during shutdown. A graceful shutdown sequence looks like this:
1. Receive SIGTERM signal
2. Stop accepting new connections
3. Complete all in-flight requests (with a timeout)
4. Close database connections and other resources
5. Exit with code 0
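; PHP-FPM graceful shutdown (php-fpm.conf, global section)
; Wait up to 30s for workers to finish their current request.
; Note: PHP-FPM drains on SIGQUIT; SIGTERM stops it immediately.
process_control_timeout = 30s
# Nginx graceful shutdown
nginx -s quit # Waits for active connections to complete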
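For an application that manages its own request loop, the shutdown sequence above might look like the following Python sketch. `RequestDrainer` and `install` are hypothetical names. The sketch covers only steps 1 to 3, leaving resource cleanup and the exit code to the caller.

```python
import signal
import threading


class RequestDrainer:
    """Tracks in-flight requests so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._accepting = True

    def handle_sigterm(self, signum, frame) -> None:
        # Steps 1-2: only flip a flag; a plain attribute write is safe in a handler.
        self._accepting = False

    def try_begin(self) -> bool:
        """Register a new request, or refuse it once shutdown has begun."""
        with self._cond:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark a request as finished and wake up a waiting drain()."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        """Step 3: wait for in-flight requests; False means the timeout expired."""
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)


def install(drainer: RequestDrainer) -> None:
    """Register the handler; signal.signal() must be called from the main thread."""
    signal.signal(signal.SIGTERM, drainer.handle_sigterm)
```

A server would call `try_begin()` before handling each request and `end()` afterwards. Once SIGTERM arrives, it would call `drain(timeout=30)`, close its database connections, and exit with code 0.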
Health Checks and Readiness
Health checks are the glue that holds zero-downtime deployment together. The load balancer must know when a new instance is ready to receive traffic and when an old instance should stop receiving it.
// PHP health check endpoint
// GET /health
// checkDatabaseConnection() and checkRedisConnection() are application-defined helpers that return bool
$checks = [
    'database' => checkDatabaseConnection(),
    'redis' => checkRedisConnection(),
    'disk' => disk_free_space('/') > 100 * 1024 * 1024, // at least 100 MB free
];
$healthy = !in_array(false, $checks, true);
http_response_code($healthy ? 200 : 503);
header('Content-Type: application/json');
echo json_encode(['status' => $healthy ? 'ok' : 'unhealthy', 'checks' => $checks]);
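On the consuming side, a deploy script or load balancer usually waits for several consecutive healthy responses before sending traffic to a new instance. The Python sketch below shows one way to express that gate. The function name, its parameters, and the default of three consecutive successes are assumptions made for illustration. `probe` is any callable that returns `True` when the instance is healthy, for example a wrapper around an HTTP GET to `/health` that checks for status 200.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     required_successes: int = 3,
                     max_attempts: int = 30,
                     interval_seconds: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once the probe has passed `required_successes` times in a row."""
    streak = 0
    for attempt in range(max_attempts):
        if attempt > 0:
            sleep(interval_seconds)  # pause between probes, not before the first
        streak = streak + 1 if probe() else 0  # any failure resets the streak
        if streak >= required_successes:
            return True
    return False  # never became stable within max_attempts
```

Requiring a streak rather than a single success guards against routing traffic to an instance that answers once during startup and then fails again. The `sleep` parameter is injectable so that the function can be tested without real delays.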
Deployment Checklist
- Database migrations are backward compatible (expand-and-contract)
- Application supports graceful shutdown (handles SIGTERM)
- Health check endpoints are implemented and tested
- Load balancer is configured to check health before routing
- Rollback plan is documented and tested
- Feature flags are in place for risky changes
- Monitoring alerts are set for error rate, latency, and availability
- Static assets are versioned (cache busting)
- Session storage is externalized (Redis, database)
Monitoring During Deployment
Key metrics to watch during and after deployment:
| Metric | Normal Range | Alert Threshold |
|---|---|---|
| Error rate (5xx) | < 0.1% | > 1% |
| P95 latency | < 500ms | > 2x baseline |
| Request rate | Stable | > 20% drop |
| CPU usage | < 70% | > 90% |
| Memory usage | < 80% | > 95% |
| Active connections | Stable | Sudden spike or drop |
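The quantified thresholds in the table can be checked automatically during a rollout. The Python sketch below is one possible encoding, with hypothetical names (`Snapshot`, `breached_thresholds`). It leaves out active connections because the table gives no numeric threshold for that metric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    error_rate: float      # fraction of responses that are 5xx, e.g. 0.002 for 0.2%
    p95_latency_ms: float
    request_rate: float    # requests per second
    cpu_percent: float
    memory_percent: float


def breached_thresholds(current: Snapshot, baseline: Snapshot) -> List[str]:
    """Return the names of metrics that cross the alert thresholds in the table."""
    alerts = []
    if current.error_rate > 0.01:                          # > 1%
        alerts.append("error_rate")
    if current.p95_latency_ms > 2 * baseline.p95_latency_ms:  # > 2x baseline
        alerts.append("p95_latency")
    if current.request_rate < 0.8 * baseline.request_rate:    # > 20% drop
        alerts.append("request_rate")
    if current.cpu_percent > 90:
        alerts.append("cpu")
    if current.memory_percent > 95:
        alerts.append("memory")
    return alerts
```

A deployment pipeline could capture a `baseline` snapshot before the rollout, then pause or roll back whenever `breached_thresholds` returns a non-empty list.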
Summary
Zero-downtime deployment is achievable on most infrastructure stacks through a combination of strategies: blue-green for instant switchover, rolling for gradual replacement, and canary for risk-minimized validation. The keys to success are backward-compatible database migrations, graceful shutdown handling, robust health checks, and comprehensive monitoring. Start with the simplest strategy that meets your needs and evolve as your application and team grow.