Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the cron job that ran twice and told no one

A nightly billing aggregation occasionally ran twice on the same data, and finding out why took longer than I'd like to admit.

A terminal full of log lines

The symptom was simple and infuriating: roughly once a fortnight, a nightly aggregation job produced double the numbers it should have. Not every night. Not predictably. And nothing in the logs said anything was wrong, because as far as the job was concerned, nothing was.

The job was a small Python script run from cron at 02:15, reading the previous day's events out of Postgres and writing a rolled-up summary into a reporting table. It had run happily for months. Then someone in finance noticed a Tuesday that looked like two Tuesdays stapled together, and it landed on me.

the first wrong assumption

My first instinct, and it was wrong, was that the script itself had a loop bug. So I read it. Then I read it again. It was about forty lines and there was nowhere for it to accidentally process the same day twice. The query was bounded by date, the insert was a single statement, and the whole thing exited cleanly.

So I did what you do: I added more logging. Timestamp on entry, timestamp on exit, row counts in and out, the hostname, the PID. Then I waited for it to misbehave, which is the worst part of any intermittent bug. You can't make it happen. You just have to live alongside it until it decides to show you.

two PIDs, one minute

It took nine days. When it finally went wrong again, the logs were unambiguous:

2016-09-28 02:15:00 host=batch01 pid=11423 start day=2016-09-27
2016-09-28 02:15:01 host=batch01 pid=11440 start day=2016-09-27
2016-09-28 02:15:31 host=batch01 pid=11423 done rows_in=84021 rows_out=312
2016-09-28 02:15:32 host=batch01 pid=11440 done rows_in=84021 rows_out=312

Two processes. Same host. Same second. Both convinced they were the only one. They each read the same 84,021 rows, each wrote 312 summary rows, and the reporting table happily accepted both because there was no unique constraint stopping it. Double Tuesday.

So the script wasn't looping. Cron was firing it twice. That felt impossible, because a crontab line runs once per schedule, and I'd checked the crontab a dozen times.

A diff of two crontab files side by side

the crontab was right, the other crontab was also right

Here is the bit I'm not proud of. The host had been rebuilt earlier in the year and migrated from a hand-maintained crontab to a config-managed one. The deployment that did the migration installed the new schedule into /etc/cron.d/billing. It did not remove the old user crontab, the one sitting under /var/spool/cron/crontabs/batch. Both existed. Both contained the same line. So most of the time, both ran.

Which raises the obvious question: if both ran every night, why wasn't every night doubled? Because the job was not idempotent in the way I'd assumed, but it was racy in a way that occasionally saved me. The summary insert used INSERT ... ON CONFLICT DO NOTHING against a column that wasn't actually unique under load. When the two processes overlapped closely enough, one of them would lose a few hundred row insertions to a deadlock and quietly roll back its batch, and the night would look clean. When they didn't overlap, both batches landed, and you got the double.

So the bug wasn't one bug. It was a duplicate schedule whose damage was usually, but not always, masked by a second latent bug in the conflict handling. Two wrongs were mostly making a right, which is the most dangerous state a system can be in, because it looks fine until it doesn't.

the fixes, in order of importance

The duplicate crontab was the root cause, so that went first. One scheduling source of truth, in /etc/cron.d, owned by config management, with the user crontab explicitly emptied as part of the run. No more two-places-to-look.

Then the script got a lock. A simple flock wrapper in the cron line, so even if something does try to start two copies, the second one falls over instead of cheerfully racing the first:

15 2 * * * batch /usr/bin/flock -n /run/lock/billing.lock /opt/billing/aggregate.py

-n means non-blocking: if the lock is held, the second invocation exits immediately rather than queuing up behind the first and running late. For a nightly job, late and doubled is worse than skipped.

Finally, the reporting table got a real unique constraint on (report_date, dimension), so a duplicate write becomes a loud error rather than silent extra rows. If something this stupid happens again, I want a constraint violation in my inbox at 02:16, not a confused email from finance a fortnight later.

The lesson I keep relearning is that "the logs said nothing was wrong" usually means the logs weren't being asked the right question. The job did exactly what it was told. It just got told twice, by two schedules I'd convinced myself were one.