The symptom arrived as a finance question, not an engineering one: "why did seventeen accounts get two reconciliation entries last night?" Not a stack trace, not a pager, not a single ERROR line. Seventeen rows that shouldn't exist, and a nightly job that, as far as every dashboard was concerned, had run exactly once and exited zero.
That gap between "the data is wrong" and "nothing looks wrong" is where I lost most of a day. So this is the writeup, partly for the next person and partly because I want it on record that the job was lying to me.
what the job did
The job is a fairly boring nightly reconciliation. It reads a window of transactions, groups them, writes one summary row per account, and marks the window as processed. It's idempotent in spirit: if you run it twice over the same window, the second run is supposed to notice the window is already marked and bail. That "supposed to" is doing a lot of work in that sentence, and we'll come back to it.
It runs from cron on the batch host. Standard stuff:
30 2 * * * /opt/recon/bin/run-nightly --window=yesterday >> /var/log/recon/nightly.log 2>&1
One line. One host. One run a night. Except it wasn't one host.
two boxes, one belief
Eighteen months ago we'd built a warm standby for the batch host. The idea was sensible: if the primary fell over, you'd fail the cron workload across to the standby and carry on. What actually happened is that someone (me, probably, the blame is academic) provisioned the standby from the same configuration management as the primary, including the crontab, and then nobody ever turned the cron off on the standby.
For eighteen months it didn't matter, because the standby couldn't reach the production database. Then, three weeks ago, a network change "tidied up" the firewall rules and, amongst other things, opened the standby's path to the primary database. Helpful. Now both boxes woke at 02:30, both ran the job, and both believed they were the only one doing so.
The "already processed, bail out" guard should have saved us. Here's why it didn't.
the guard was a race, not a lock
The bail-out check did exactly what you'd write on a whiteboard and exactly what you shouldn't ship:
SELECT processed FROM recon_windows WHERE window_id = $1;
-- if processed is true, exit
-- otherwise do the work, then:
UPDATE recon_windows SET processed = true WHERE window_id = $1;
Read, decide, then much later write. Both hosts started within the same second. Both ran the SELECT, both saw processed = false, both decided to proceed, both did the full reconciliation, and both eventually set the flag to true. Classic check-then-act, no transaction wrapping the decision and the work, no lock on the window row, no unique constraint on the output to catch the duplicate. The flag was a comment, not a guarantee.
The reason it ran silently is the part that actually cost me the day. The duplicate rows were valid rows. They inserted cleanly. No constraint violation, no exception, nothing to log. Both processes finished, both wrote their cheerful "nightly complete, 0 errors" line, both exited zero. The system did precisely what we told it to, twice, and the only evidence was in the data it produced.
finding it
What broke the case open was boring and I should have reached for it sooner: timestamps. The summary rows carry a created_at to the millisecond. The seventeen affected accounts each had two rows, and the two rows were always 200 to 900 milliseconds apart. One job doesn't produce the same account's summary twice within a second. Two jobs racing absolutely do.
From there it was a matter of asking which hosts could have written those rows. The application logs the hostname on startup, so I grepped the two nightly logs:
grep 'nightly start' /var/log/recon/nightly.log
Two start lines. Same minute. Different hostnames. There it was. The standby I'd forgotten about was quietly doing a full night's work alongside the primary, and had been for three weeks. We got lucky that only one of those nights produced visible duplicates; the window guard happened to win the race cleanly on the others, which is somehow worse, because it means the bug was there the whole time and only occasionally showed its face.
the fix, in order of how much I trust it
First, the immediate stop: I disabled the crontab on the standby. That ends the bleeding. It does not fix the bug, it just removes the second runner.
Second, the real fix, a unique constraint on the output:
ALTER TABLE recon_summary
ADD CONSTRAINT uq_recon_account_window UNIQUE (account_id, window_id);
Now a duplicate cannot be written at all. If two jobs race, the second insert fails loudly, which is exactly the noise I wanted last night and didn't get. Loud is good. Loud wakes people up. Silent corrupts spreadsheets.
Third, the mutual exclusion the job should have had from day one. A Postgres advisory lock at the top of the run, taken for the duration:
SELECT pg_try_advisory_lock(hashtext('recon-nightly'));
If you can't get the lock, someone else owns the night, so you log that and exit. One owner at a time, enforced by the database rather than by a flag and a prayer.
I cleaned up the seventeen duplicates by hand, kept the earlier row of each pair, and reran the affected accounts' downstream totals.
what I'm taking from it
Idempotency you've reasoned about on a whiteboard is not idempotency the database will enforce. If the only thing stopping a double-write is a sequence of "read, think, write" with a gap in the middle, you have a race, and a race will find you the day a firewall rule changes and you've forgotten which machines run cron.
The bit that genuinely unsettled me is the silence. I have spent years building the reflex that a clean exit and a zero error count mean the job did its one job. This one did its one job twice, reported success both times, and was completely honest by its own lights. The lesson isn't "add more logging", it's that correctness has to live in constraints the data must satisfy, not in the wellbehaved cooperation of the things writing it. A unique index doesn't care how many hosts you forgot about.
I have also, finally, written the standby a crontab of its own that contains nothing but a comment explaining why it contains nothing.