Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the report that doubled and nobody noticed

A nightly aggregation that occasionally posted double figures, traced not to the code but to a cron job firing twice across a daylight-saving boundary and a host that had drifted onto two crontabs.

A terminal showing two near-identical job runs an hour apart

A daily figure in an internal dashboard came in roughly double one morning. Not wildly wrong, not negative, not zero, just about twice what the day before had been, and the day after was back to normal. Those are the bugs I dread, because there is nothing to grab. A crash leaves a stack trace. A number that is occasionally too big leaves you staring at a graph trying to decide whether you are imagining it.

The job was a nightly aggregation: read yesterday's events, sum them per account, write a row into a summary table. It had run untouched for months. So my first instinct, which is always to blame the code I most recently changed, was no use because I had not changed anything.

The summary table did have a clue, though, once I stopped looking at the dashboard and looked at the rows directly. The doubled day had two writes, an hour apart, where every other day had one.

SELECT account_id, run_at, total
FROM daily_summary
WHERE summary_date = '2019-10-21'
ORDER BY account_id, run_at;

Two runs. One at 02:00, one at 03:00. The aggregation was not idempotent: it inserted a row each time rather than upserting, and the dashboard summed the rows for the day. One run, one figure. Two runs, double.

A close-up of the crontab and timestamps that explained it

So why two runs on one night and one on all the others? Two things had quietly conspired. The first was a second crontab. The job had been added to the system crontab during a migration and never removed from the old user crontab, so on most hosts it was scheduled once but on this one it was scheduled twice, at 02:00 and at 03:00. That alone should have doubled every day, which it nearly did, except the 03:00 run usually landed after a different cleanup had already moved the day's events out of the source table, so it summed nothing and wrote a zero-ish row that rounded away in the noise.

The second thing was the clock. The doubled night was the night the clocks went back. With the hour repeating between 01:00 and 02:00, the cleanup that normally beat the 03:00 run to the data did not, and for that one night both the 02:00 and the 03:00 aggregation saw a full set of events. Two complete runs, two real rows, one doubled figure. The daylight-saving change did not cause the bug. It just lifted the curtain on a duplicate schedule that had been firing harmlessly into an empty table for weeks.

The fix was unglamorous and correct. The stray user crontab entry went, so the job is scheduled exactly once. The aggregation now upserts on (account_id, summary_date) instead of blindly inserting, so even if it somehow runs twice it lands on the same row with the same answer:

INSERT INTO daily_summary (account_id, summary_date, total, run_at)
VALUES (?, ?, ?, NOW())
ON CONFLICT (account_id, summary_date)
DO UPDATE SET total = EXCLUDED.total, run_at = EXCLUDED.run_at;

The lesson I keep paying for in slightly different currency each time: a scheduled job should be safe to run twice, because sooner or later it will. Retries, double crontabs, a clock that goes backwards once a year, an operator who runs it by hand to "check"… any of them will fire your job a second time. If running twice produces a different answer than running once, that is not a scheduling problem you can fix with a better schedule. It is a correctness problem in the job, and the schedule is just where it finally became visible.