Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the job that fired twice because two boxes thought they were in charge

A nightly reconciliation job started double-processing after we added a second host for redundancy, and the fix was a distributed lock rather than cleverer scheduling.

A terminal scrolling log output

The symptom arrived as a quiet accounting discrepancy: a nightly reconciliation job, the one that matches yesterday's payments against yesterday's orders, occasionally posted its adjustments twice. Not every night. Most nights it was fine. But once or twice a week, a customer's ledger would show the same correction applied two times over, and someone in finance would have to unpick it by hand.

I'd seen the single-box version of this bug before, where a duplicate crontab entry fires the same line twice on one host. This wasn't that. I checked. One crontab, one entry, owned by config management, no stray user crontab lurking in the spool directory. The schedule was clean. And yet, twice.

the change that caused it

The clue was that this job had run flawlessly for a year and only started misbehaving in January. So I asked the obvious question: what changed in January? The answer was that we'd added a second host to the batch tier. The old single batch box had become a single point of failure that woke people at night, so we'd built a second one and put both behind the same configuration.

Both hosts got the same crontab. Of course they did, that's the whole point of config management: every box in a role looks identical. Which meant both hosts ran the reconciliation job at 02:30. Two boxes, same job, same data, no coordination. Most nights one would happen to finish its first big transaction a few seconds ahead, take a row lock, and the second would block, time out, and log a failure that nobody read. That was the "fine" case, where the doubling was masked by the database getting in its own way. When the timing was slightly different, both ran cleanly to completion against the same input, and you got two sets of adjustments.

A side-by-side diff of two log files

The logs, once I put the hostname in them, were damning:

2017-01-11 02:30:00 host=batch01 pid=9012 reconcile start date=2017-01-10
2017-01-11 02:30:00 host=batch02 pid=8841 reconcile start date=2017-01-10
2017-01-11 02:34:18 host=batch02 pid=8841 reconcile done adjustments=147
2017-01-11 02:34:55 host=batch01 pid=9012 reconcile done adjustments=147

Two hosts, same minute, same 147 adjustments, both convinced they were the only one doing the work.

the wrong fixes I considered first

My first instinct was to designate one host as the "batch leader" and only run the cron there. But that just reintroduces the single point of failure we'd spent effort removing: if the leader is down at 02:30, nothing runs, and a skipped reconciliation is its own kind of incident. Pinning the job to one named host is config that rots the moment someone rebuilds it.

My second instinct was to make the job idempotent, so running it twice produces the same result as running it once. That's genuinely the right long-term goal, and it's where the real adjustment table eventually went. But retrofitting idempotency onto a job that posts financial adjustments is a careful, weeks-long piece of work, and I had a bug bleeding into the ledgers right now.

the fix: let them race, but only let one win

What I wanted was for both hosts to try to run, so that either could cover for the other, but for only one of them to actually do it on any given night. That's a distributed lock, and we already had something that does distributed locks extremely well sitting in the stack: the database itself.

Postgres advisory locks are perfect for this. They're cheap, they're tied to a session so they release automatically if the process dies, and pg_try_advisory_lock returns immediately rather than blocking. The job now starts by trying to grab a named lock. If it gets it, it runs. If it doesn't, the other host already has it, so this host logs that fact and exits cleanly.

-- 1 if we got the lock, false if someone else holds it
SELECT pg_try_advisory_lock(hashtext('reconcile-daily'));
30 2 * * * batch /opt/batch/run-with-lock.sh reconcile-daily /opt/batch/reconcile.py

The wrapper opens a session, tries the advisory lock, and only execs the real job if it won. Both hosts fire at 02:30, both race for the lock, exactly one wins, and the loser exits with a log line saying so. If the winning host dies mid-job, its session drops, the lock releases, and the other host could pick it up on a retry. We got the redundancy we built the second box for, without the double-posting we accidentally bought along with it.

The deeper lesson is one I keep relearning: when you make a system redundant, you have to ask which of its actions are safe to do twice. Serving a read is. Posting a financial adjustment is emphatically not. Adding a second host didn't introduce a bug so much as expose an assumption I'd baked in years earlier, back when there was only ever one box and "only run once" was true for free. The moment that stopped being true, the assumption became a leak, and it leaked straight into the accounts.