The symptom was duplicate records. Not always, not every night, maybe one night in four, and only ever exactly two of each. A nightly job aggregated some figures and wrote a summary row per account. Sometimes there were two summary rows per account with identical numbers and timestamps a few seconds apart.
Two identical rows is a very specific kind of clue. It is not "the data is wrong". The numbers were right. It is not "we processed bad input". The input was fine. It is "the work happened twice". Once you frame it that way, the question is no longer about the data and entirely about how many times the thing ran.
the job that should not be able to run twice
The job was a single cron entry. One line in one crontab on one host.
30 2 * * * /opt/batch/run-nightly.sh >> /var/log/nightly.log 2>&1
A single cron line cannot fire twice. That is the whole appeal of cron. So my first assumption, and I want to be honest about this because it cost me a day, was that the job itself was internally looping. Maybe a retry. Maybe a function called from two places. I read that script far more carefully than it deserved. It was a hundred lines of plumbing that called one binary. No retry, no loop, no second invocation. It was innocent and I had been interrogating it for hours.
The log was the first thing that did not fit. On a duplicate night, the log had two complete runs.
2021-09-14 02:30:01 nightly start
2021-09-14 02:30:48 nightly done (412 accounts)
2021-09-14 02:30:03 nightly start
2021-09-14 02:30:51 nightly done (412 accounts)
Look at the timestamps. The second run starts at 02:30:03, two seconds after the first. They overlap. Cron does not do that. Cron fires once at 02:30 and would not fire the same entry again until the next day. Two starts two seconds apart means two processes, near simultaneous, which means two somethings deciding it was time.
one host, except it wasn't
The boring possibility is a duplicate crontab. Same line in root's crontab and in a deploy user's crontab. I checked. It wasn't that. The line existed exactly once, for exactly one user, on the host I was looking at.
On the host I was looking at.
That is the sentence that mattered, and I had said it to myself maybe ten times without hearing it. We had migrated this service a few months earlier. The migration had been done the sensible way: bring up the new host, run them in parallel for a bit to make sure the new one was healthy, then decommission the old one. The new host was batch02. I had been ssh'd into batch02 all day.
The old host, batch01, was still up. Still had the service deployed. Still had the crontab. It had been "decommissioned" in the sense that someone had stopped sending it traffic and moved on, and not in the sense that anyone had turned it off or removed its scheduled jobs. It had been quietly running the nightly batch every single night for months, into the same shared database, alongside batch02.
Which immediately explains the "one night in four" pattern, because it is not one in four. It is every night that both hosts happened to have NTP-aligned clocks tight enough, and the same job acquired no lock, and both completed their writes. On nights where one host was a touch slower or a transient hiccup delayed it, the second write landed late enough that a later cleanup step deduplicated it before anyone saw. The randomness was an artefact of timing, not logic. Timing-dependent bugs always feel random until you find the second clock.
why nothing screamed
The genuinely uncomfortable part is how silent this was. batch01 was healthy. It had no traffic, so no dashboard showed it, so no alert covered it. It was paying its cloud bill every month and doing real, harmful work, and the only evidence was two rows in a table that a downstream report quietly tolerated. If the report had crashed on duplicates I would have found this in an hour. Because it coped, the bug survived for months.
The fix was embarrassingly small. Stop and disable the service on batch01, remove its crontab, and actually shut the host down this time. The duplicates stopped that night.
# on batch01
crontab -r
systemctl stop batch.service
systemctl disable batch.service
what I took from it
The thing I keep coming back to is that I spent a day reading the right code on the wrong host. The script was never going to confess, because the script was not guilty. I had a strong, unexamined belief that "the job runs on one machine" and I never tested it, because it was true at the start and I assumed decommissioning had made it true again.
A few lessons that I actually intend to keep, rather than the ones you write down and forget:
- Decommissioned and powered off are different states. A host that still boots, still has the code, and still has cron is a live host wearing a "do not look at me" sign.
- A batch job that writes to shared state should hold a lock that is visible to every instance, not just refuse to overlap with itself locally. A simple advisory lock in the database would have made the second writer either wait or fail loudly. Either is better than succeeding.
- When the data says "this work happened twice", count the workers before you read the worker. The answer is more often "two of them" than "one of them, twice".
I also added an alert on hosts that have a deployed service but no traffic for seven days. It is a cheap query and it would have caught this in a week instead of a season. The most expensive bugs are the ones that are too well behaved to complain.