Ramblings of an aging IT geek
debugging

the cron job that fired twice and told no one

A nightly job that ran twice because two hosts both thought they owned the schedule, and how the duplicate writes stayed quiet for weeks.

The symptom was a report with every figure doubled, but only on the third of the month and only sometimes. Nothing in the logs complained. The job exited zero, both times.

It took embarrassingly long to spot because I was reading the application logs and not the crontab. We had migrated a box, kept the old one warm "just in case", and the deploy that was meant to disable cron on the retired host had silently failed on the one file that mattered. So both machines woke at 02:00, both pulled the same input, both wrote to the same table. No lock, no advisory check, no idempotency key. The second run didn't error because there was nothing telling it not to.

The fix for the immediate fire was a flock on a shared NFS path, which is exactly as horrible as it sounds and exactly what you reach for at 09:00 with coffee going cold. The real fix was making the write idempotent: a natural key on (date, source) and an upsert, so a second run is a no-op rather than a doubling.
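
For the curious, here is a minimal sketch of what an idempotent write on a natural key can look like. It is not the production code: it assumes SQLite (3.24 or newer, for `ON CONFLICT ... DO UPDATE`), and the table `nightly_totals`, its columns, the function `write_report_rows` and the sample rows are all made up for illustration. The point is only that the primary key on (date, source) plus an upsert turns a second run with the same input into a no-op.

```python
import sqlite3


def write_report_rows(conn, rows):
    """Upsert (report_date, source, total) rows.

    Hypothetical helper: re-running it with the same rows leaves the
    table unchanged instead of adding duplicates.
    """
    # The natural key (report_date, source) is the primary key, so the
    # database itself refuses a second row for the same date and source.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS nightly_totals (
            report_date TEXT    NOT NULL,
            source      TEXT    NOT NULL,
            total       INTEGER NOT NULL,
            PRIMARY KEY (report_date, source)
        )
        """
    )
    # Upsert: insert the row, or overwrite the total if the key exists.
    # Requires SQLite 3.24+ for the ON CONFLICT ... DO UPDATE clause.
    conn.executemany(
        """
        INSERT INTO nightly_totals (report_date, source, total)
        VALUES (?, ?, ?)
        ON CONFLICT (report_date, source)
        DO UPDATE SET total = excluded.total
        """,
        rows,
    )
    conn.commit()


# Simulate the two hosts both running the job against the same input.
conn = sqlite3.connect(":memory:")
rows = [("2024-03-03", "billing", 120), ("2024-03-03", "signups", 45)]
write_report_rows(conn, rows)  # first host
write_report_rows(conn, rows)  # second host, same input

stored = conn.execute(
    "SELECT report_date, source, total FROM nightly_totals ORDER BY source"
).fetchall()
```

After both calls `stored` still holds one row per (date, source) with the original totals. Note that this only protects writes that set a value; a job that increments a running total would need an idempotency key of its own.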

The lesson I keep relearning: a job that succeeds is not the same as a job that should have run. Exit code zero only tells you the code didn't fall over. It says nothing about whether anyone else got there first.