Ramblings of an aging IT geek

two boxes, one cron line, and a backup that ran in stereo

A nightly backup job started running on two machines at once after a failover, and the only symptom was a storage bill that quietly doubled.

This one didn't crash anything. It didn't even error. The first sign that something was wrong was a storage figure that had quietly doubled over a month, and a colleague asking, mildly, whether our offsite backup bucket was supposed to be growing quite that fast. It was not. And it took me far too long to admit that the cause was a cron job running twice, on two different machines, both equally certain they were the only one.

I have written before about a duplicate cron schedule: one host, two crontabs. This was the more embarrassing cousin: same schedule, two hosts.

the setup, and the lie I'd told myself

We had a primary application server and a warm standby. The standby existed so that if the primary fell over, we could fail across with minimal fuss. Part of the application's nightly housekeeping was a backup script, run from cron at 01:00, that tarred up a working directory and pushed it offsite. Perfectly sensible on the primary.

The lie I'd told myself, when I built the standby, was that it was "idle." It wasn't idle. It was a near-exact clone of the primary, including, it turned out, the application's crontab. So both machines ran the 01:00 backup. Both produced a valid archive. Both pushed it offsite, to timestamped paths that didn't collide, so neither overwrote the other. Two archives a night, every night, indefinitely. No error, because nothing was wrong from either machine's point of view. They were each doing exactly the job they'd been given.

why it stayed hidden for a month

A few things conspired to keep this quiet.

The archives were named with the hostname embedded, so they sorted into neat per-host prefixes in the bucket. If you only ever looked at one prefix, which I did, everything looked perfectly normal: one backup per night, retention working, all green. You had to step back and look at the whole bucket to notice there were two of everything.

The restore tests, which I do run, also passed. Of course they did. Both archives were valid. Restoring from either gave you a working directory. The duplication was invisible to every check I'd built, because every check I'd built asked "is there a good backup?" and the answer was an emphatic, doubly-redundant yes.

And the cost signal was slow. Object storage is cheap per gigabyte, so a doubling of a modest backup didn't ring any alarms until a month of it had accumulated into a number someone happened to glance at. Slow leaks are the worst kind, because by the time they're visible you've normalised the wrongness.

finding it

Once I knew what to look for, finding it took minutes. I listed the bucket and grouped by date:

aws s3 ls --recursive s3://backups/app/ \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head

Two objects per date, like clockwork. Then I checked the modification times and saw two uploads a minute apart every night, and a quick look at the hostnames in the keys told the whole story: app01 and app02, both backing up, both since the day I'd built the standby.
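The date count and the hostname check can be folded into one pass. What follows is a minimal sketch, not the command actually used: duplicate_nights is a hypothetical helper, and it assumes keys shaped like app/<host>/<archive> with no spaces in them, which the post doesn't spell out.

```shell
# Reads `aws s3 ls --recursive` output on stdin (date, time, size, key).
# Prints "date count hosts" for every date that has more than one upload.
# Assumption: the hostname is the second path segment of the key.
duplicate_nights() {
    awk '
        {
            split($4, parts, "/")
            count[$1]++
            # Record each host once per date, comma-separated.
            if (!seen[$1, parts[2]]++) {
                if (hosts[$1] == "")
                    hosts[$1] = parts[2]
                else
                    hosts[$1] = hosts[$1] "," parts[2]
            }
        }
        END {
            for (d in count)
                if (count[d] > 1)
                    print d, count[d], hosts[d]
        }
    ' | sort
}
```

Piping the listing through it (aws s3 ls --recursive s3://backups/app/ | duplicate_nights) would print one line per doubled night, with the hosts responsible.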

the actual problem, and the fix

The root cause wasn't really "a duplicate cron line." That was the symptom. The root cause was that I'd cloned a machine without thinking about which of its scheduled jobs should run on a passive node and which absolutely should not. A backup is a perfect example of the latter: you want it to run on whichever machine is active, and on no others.

So the fix had to encode "active" as something the job could check, rather than relying on me to remember to comb the crontab on every clone. The pattern I settled on: a small guard at the top of the script that asks "am I the active node?" and exits quietly if not.

#!/usr/bin/env bash
set -euo pipefail

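# Only the node that currently holds the service VIP should back up.
# Capture the address list first: piping straight into `grep -q` under
# pipefail can report failure if grep exits before `ip` finishes writing,
# which would make the active node skip its own backup.
addrs="$(ip -4 addr show)"
# -F matches the dots literally; "inet " stops 110.0.0.50 from matching.
if ! grep -qF "inet 10.0.0.50/32" <<<"$addrs"; then
    logger -t backup "not the active node, skipping"
    exit 0
fi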

# ... the actual backup follows

Tying it to the floating service IP means the job follows the role, not the hostname. After a real failover, the standby picks up the VIP and, correctly, picks up the backup duty with it. The old primary, now passive, stops. No human in the loop, no crontab to remember to edit.
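Since the guard decides whether a backup happens at all, it is worth being able to exercise it without waiting for a failover. One way, sketched here under assumptions, is to pull the match into a function that reads the address list on stdin so it can be fed canned input. The name holds_vip is hypothetical, and unlike the /32 check above it accepts any prefix length.

```shell
# Succeeds if the given IPv4 address appears in `ip -4 addr show` output
# supplied on stdin. Reading stdin keeps the check testable without
# touching real interfaces.
holds_vip() {
    # -F matches the dots literally. The leading "inet " and trailing "/"
    # stop 10.0.0.50 from matching 10.0.0.5 or 110.0.0.50.
    # Redirecting instead of using -q makes grep read all of its input,
    # so the writer never sees a closed pipe.
    grep -F "inet $1/" >/dev/null
}
```

In the backup script this would be called as ip -4 addr show | holds_vip 10.0.0.50, with the early exit taken when it fails.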

I also added the obvious monitoring I should have had from the start: an alert if the offsite location receives more than one backup per host-role per night, and a separate one if it receives zero. Both failure modes are now loud. Doubled-and-silent is the dangerous one, because it masquerades as healthy.
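The post doesn't show how those alerts were built, so the following is only a sketch of the counting behind them. backup_status is a hypothetical helper that treats everything under the listed prefix as one role and classifies a single night from aws s3 ls --recursive output. Wiring the result into an actual alert is left out.

```shell
# Classifies one night's uploads from `aws s3 ls --recursive` output on stdin.
# Prints "missing", "ok", or "duplicate" for the given date (YYYY-MM-DD).
# Assumption: one upload per night is the healthy state for this prefix.
backup_status() {
    awk -v day="$1" '
        $1 == day { n++ }
        END {
            if (n == 0)      print "missing"
            else if (n == 1) print "ok"
            else             print "duplicate"
        }
    '
}
```

Anything other than "ok" for last night's date is worth a page: "missing" is the failure everyone already watches for, and "duplicate" is the one that looks healthy.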

The lesson I keep paying tuition for is that a clone is not a copy of a machine, it's a copy of a machine's responsibilities, and some of those responsibilities are exclusive. "It's just a passive standby" is exactly the sort of thing you say right before it does a month of work nobody asked it to.