The symptom was almost nothing. The nightly export was finishing in roughly half the usual time on some nights and not on others, with no pattern I could see. No errors, no alerts, just a job that occasionally seemed suspiciously quick and then went back to normal. The kind of thing you notice once, frown at, and forget.
What was actually happening: the export was running on two hosts at once. We had failed the service over to a standby box a few weeks earlier and never failed it back, because the standby was working fine. But the standby had a copy of the same crontab, inherited from when it was provisioned as a clone of the primary. So both boxes happily ran the export every night, racing for the same source rows.
Jun 09 02:00:01 host-a EXPORT start id=4471
Jun 09 02:00:01 host-b EXPORT start id=4471
Same job id, same second, two hosts. The halved runtime came from each box grabbing roughly half the work before the other locked the rows it wanted, so on a good night they split it neatly and the whole thing finished fast. On a bad night they contended, one stalled, and the runtime looked normal again. That is why it flickered: the bug's visibility depended on a race.
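A pair of log lines like the two above is easy to miss by eye and easy to catch mechanically. Here is a minimal sketch in Python, assuming every start line has exactly the shape shown in the excerpt (timestamp, host, "EXPORT start id=N"). The function name and the regular expression are mine, not part of any existing tooling.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the excerpt:
#   "Jun 09 02:00:01 host-a EXPORT start id=4471"
START_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) EXPORT start id=(?P<job>\d+)$"
)


def find_duplicate_starts(lines):
    """Return {job_id: sorted hosts} for each job id started on more than one host."""
    hosts_by_job = defaultdict(set)
    for line in lines:
        match = START_RE.match(line.strip())
        if match:
            hosts_by_job[match.group("job")].add(match.group("host"))
    # A job id seen on a single host is the normal case, so it is dropped here.
    return {
        job: sorted(hosts)
        for job, hosts in hosts_by_job.items()
        if len(hosts) > 1
    }


if __name__ == "__main__":
    sample = [
        "Jun 09 02:00:01 host-a EXPORT start id=4471",
        "Jun 09 02:00:01 host-b EXPORT start id=4471",
    ]
    print(find_duplicate_starts(sample))  # {'4471': ['host-a', 'host-b']}
```

Run against logs merged from both hosts, a check like this reports the duplicate every night, including the nights when the race hid the timing symptom.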
The fix was boring and correct. The crontab belonged on whichever host currently held the service, not on both, so I moved the schedule behind the same failover check that everything else used and tore the orphaned line out of the standby. But the real lesson is the old one wearing new clothes: a clone inherits everything, including the jobs you forgot were jobs. When you copy a box, you copy its crontab, and a crontab is a loaded gun pointed at whatever shared resource it touches. Check it before you let the clone anywhere near production.
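For illustration, the guard can be sketched as a small wrapper that cron calls in place of the export itself. This is not the failover check we actually use. It assumes, hypothetically, that the failover tooling writes the hostname of the current service holder to a marker file, and both the marker path and the export command below are made-up placeholders.

```python
import socket
import subprocess
import sys
from pathlib import Path

# Hypothetical: failover tooling writes the active host's name to this file.
ACTIVE_HOST_FILE = Path("/var/run/service/active-host")
# Hypothetical: the real export command would go here.
EXPORT_CMD = ["/usr/local/bin/nightly-export"]


def holds_service(local_host, active_host_file=ACTIVE_HOST_FILE):
    """Return True only if the marker file is readable and names this host."""
    try:
        active = Path(active_host_file).read_text().strip()
    except OSError:
        # Missing or unreadable marker: fail closed and skip the run.
        return False
    return active == local_host


def main():
    if not holds_service(socket.gethostname()):
        # Not the active host, so exit quietly without touching shared rows.
        return 0
    return subprocess.run(EXPORT_CMD).returncode


if __name__ == "__main__":
    sys.exit(main())
```

With a guard like this, the same crontab line can sit on both hosts and only the current holder does any work. The sketch fails closed: if the marker is missing, neither host runs the export. That trades a possible double run for a possible skipped run, so it wants an alert on "no export tonight" next to it.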