Cron Job Monitoring with Dead Man's Switch
The Problem of Silent Failures
Cron jobs run in the background: processing payments, sending emails, generating reports, cleaning up data, running backups. When a cron job silently crashes or hangs, you find out hours, days, or even weeks later — when customers start complaining.
Cron job monitoring does not ask "is the server running?" but "did the task execute at the expected time, and did it succeed?"
Dead Man's Switch — How It Works
A Dead Man's Switch (DMS) is a monitoring pattern that inverts the usual logic. Instead of checking "is the service available?" it checks "has the service stopped reporting?"
How It Works
- You create a monitor with an expected interval (e.g., "cron should report every hour")
- After successful execution, the cron job sends an HTTP request (a ping, also called a heartbeat) to the monitoring URL
- If no heartbeat is received within the expected time + grace period, an alert fires
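The decision the monitor makes in the last step can be sketched as a small shell function. This is only an illustration of the rule, not how any particular monitoring service implements it; the function name `is_overdue` and its argument order are assumptions made for this example.

```shell
#!/usr/bin/env bash
# is_overdue LAST_PING NOW INTERVAL GRACE
# LAST_PING and NOW are Unix timestamps; INTERVAL and GRACE are in seconds.
# Hypothetical helper: returns 0 (true) when the heartbeat is overdue.
is_overdue() {
    local last_ping=$1 now=$2 interval=$3 grace=$4
    # Overdue once more than interval + grace has passed since the last ping
    [ $(( now - last_ping )) -gt $(( interval + grace )) ]
}

# Example: hourly job (3600 s) with a 10 minute grace period (600 s)
# if is_overdue "$last_ping" "$(date +%s)" 3600 600; then echo "ALERT"; fi
```

With these numbers the alert would fire only once more than 70 minutes have passed since the last heartbeat.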
Implementation
# Crontab: hourly job with heartbeat
0 * * * * /path/to/job.sh && curl -fsS https://monitor.example.com/ping/abc123 > /dev/null
Key point: && means the heartbeat is sent only on successful completion (exit code 0). If the job fails, no heartbeat is sent, and the alert fires.
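One practical concern is that the heartbeat request itself can fail or stall on a brief network problem, which would raise a false alert. A minimal sketch of a wrapper that bounds and retries the request, assuming bash and the same placeholder URL as above; the function names `send_heartbeat` and `run_with_heartbeat` are made up for this example:

```shell
#!/usr/bin/env bash
# Placeholder ping URL, as in the crontab example above
PING_URL="https://monitor.example.com/ping/abc123"

send_heartbeat() {
    # --max-time caps each attempt at 10 s; --retry repeats transient failures
    curl -fsS --max-time 10 --retry 3 "$PING_URL" > /dev/null
}

run_with_heartbeat() {
    # Run the given command; ping only if it exits with status 0
    "$@" && send_heartbeat
}

# Usage from a wrapper script called by cron:
# run_with_heartbeat /path/to/job.sh
```

The retry count and time limit here are arbitrary starting points, not recommended values.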
What to Monitor in Cron Jobs
Scheduled Execution
The task should run at the expected time. DMS detects missed runs whatever the cause: the cron daemon stopped, the server rebooted, or the crontab was corrupted.
Successful Completion
A task may start but fail with an error. Check the exit code and send the heartbeat only on success.
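The exit code is only useful if the job script actually propagates failures. One common gap in bash is pipelines: by default their status is that of the last command only. The sketch below shows the difference that `set -o pipefail` makes; `failing_export` and `pipeline_status` are hypothetical names standing in for a real job step.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a job step that fails
failing_export() { return 1; }

# pipeline_status on|off
# Runs the same pipeline in a subshell, with or without pipefail,
# and returns the pipeline's exit status.
pipeline_status() {
    (
        [ "$1" = "on" ] && set -o pipefail
        failing_export | sort > /dev/null
    )
}

# pipeline_status off  -> status of 'sort' only, so the failure is hidden
# pipeline_status on   -> non-zero, so a heartbeat chained after the job is skipped
```

In a real job script, placing `set -o pipefail` (often together with `set -e`) near the top is one way to keep a failed step from being reported as success.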
Execution Time
If a task normally takes 5 minutes but suddenly took 2 hours, that's a sign of trouble (growing database, deadlock, memory leak).
START=$(date +%s)
/path/to/job.sh
STATUS=$?  # capture the job's exit code now; later commands overwrite $?
END=$(date +%s)
DURATION=$((END - START))
if [ "$STATUS" -eq 0 ]; then
    curl -fsS "https://monitor.example.com/ping/abc123?duration=$DURATION" > /dev/null
fi
Hanging
A task may hang (deadlock, infinite loop). Use timeout:
timeout 3600 /path/to/job.sh
If the task doesn't complete within an hour (3600 seconds), timeout terminates the process and returns a non-zero exit code (124 with GNU timeout), so a heartbeat chained with && is not sent.
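To tell a timeout apart from an ordinary failure in the logs, the exit status can be inspected. A minimal sketch, assuming GNU coreutils `timeout`, which exits with status 124 when the time limit is hit; the wrapper name `run_with_timeout` is an assumption for this example.

```shell
#!/usr/bin/env bash
# run_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical wrapper: runs COMMAND under GNU timeout and reports a timeout on stderr.
run_with_timeout() {
    local limit=$1
    shift
    timeout "$limit" "$@"
    local status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    # Pass the original status through so callers can still chain with &&
    return "$status"
}

# Usage:
# run_with_timeout 3600 /path/to/job.sh && curl -fsS https://monitor.example.com/ping/abc123 > /dev/null
```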
Overlapping Runs
If the previous run hasn't finished before the next starts, overlap occurs. Use lock files:
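#!/bin/bash
LOCKFILE="/tmp/job.lock"
# With noclobber, the redirect fails if the file already exists,
# so checking for the lock and creating it happen in one atomic step.
if ! ( set -o noclobber; echo "$$" > "$LOCKFILE" ) 2>/dev/null; then
    echo "Job already running, exiting"
    exit 1
fi
# Remove the lock when the script exits, whether it succeeded or failed
trap 'rm -f "$LOCKFILE"' EXIT
# Main job logic
/path/to/actual-job.sh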
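A plain lock file can be left behind if the script is killed with SIGKILL or the machine loses power, which blocks every later run. Where `flock` from util-linux is available, a kernel-held lock avoids this because it is released automatically when the process exits. The following is a sketch under that assumption; the function name `with_lock` is made up for the example.

```shell
#!/usr/bin/env bash
# with_lock LOCKFILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND only if an exclusive lock on LOCKFILE
# can be taken without waiting; otherwise exits with status 1.
with_lock() {
    local lockfile=$1
    shift
    (
        # -n: fail immediately instead of waiting for the lock
        flock -n 9 || { echo "Job already running, exiting" >&2; exit 1; }
        "$@"
    ) 9> "$lockfile"
}

# Usage:
# with_lock /tmp/job.lock /path/to/actual-job.sh
```

The same idea also fits directly into a crontab entry, for example `flock -n /tmp/job.lock /path/to/job.sh`.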
Best Practices
Logging
Every cron job should write logs: start time, end time, records processed, errors. Without logs, diagnosing a failure is mostly guesswork.
# Crontab with logging
0 * * * * /path/to/job.sh >> /var/log/jobs/hourly-job.log 2>&1
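Redirected output is easier to read when every line carries a timestamp and a level. A minimal sketch of a logging helper the job script could use; the function name `log` and the line format are assumptions for this example, not a required convention.

```shell
#!/usr/bin/env bash
# log LEVEL MESSAGE...
# Hypothetical helper: prints a UTC timestamp, the level, and the message.
log() {
    local level=$1
    shift
    printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "$*"
}

# Usage inside the job script:
# log INFO "job started"
# log INFO "processed $count records"
# log ERROR "export failed"
```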
Grace Period
Don't alert immediately on a missed heartbeat. Add a grace period — extra time accounting for normal execution duration variance. For an hourly task, use a 10-15 minute grace period.
Separating Alerts by Severity
- Critical: payment processing, backups, order fulfillment
- Warning: report generation, old data cleanup, cache updates
- Info: statistics, analytics, non-essential notifications
Documentation
For each cron job, document:
- What the task does
- Schedule and expected execution time
- What to do on failure (runbook)
- Dependencies (database, external APIs, files)
Monitoring with Enterno.io
Set up the Heartbeat Monitor on Enterno.io for cron job monitoring. Create a monitor for each critical task, specify the expected interval and grace period. After task completion, send an HTTP request to the monitoring URL.
Use the monitors dashboard to view the status of all cron jobs in one place.
Summary
Cron jobs are an invisible but critically important part of infrastructure. A Dead Man's Switch is well suited to monitoring them: the task reports its own health, and the absence of a report signals a problem. Monitor not just execution but also duration, use timeouts and lock files, and log everything.