Ramblings of an aging IT geek
← Ramblings of an aging IT geek
debugging

the code was correct, my mental model wasn't

A timezone bug that wasn't in the code at all but in an assumption I never thought to check, and how long that cost me before I went back to first principles.

A terminal showing log timestamps that don't line up

The worst debugging sessions aren't the ones where the code is wrong. Those are honest. You find the broken line, you fix it, you move on. The worst ones are where the code is exactly right and the thing that's wrong is something you believe so firmly you never think to question it. I lost the best part of an afternoon to one of those this week, and the lesson is the same one I keep having to relearn.

The symptom: a scheduled job that was meant to run at 09:00 was running at 10:00. One hour late, every time, consistently. Consistency is usually a gift in debugging, because it means there's a clean cause, no race, no intermittent gremlin. So I went looking for the off-by-one-hour bug.

the hour I spent in the wrong place

I started where any sensible person starts: the scheduling code. The cron expression said 0 9 * * *. That is nine o'clock. There is no ambiguity in that line. I read it, re-read it, checked it wasn't being overridden somewhere, checked there wasn't a second schedule fighting it. All correct.

So I went down a layer. Was the job computing its own target time and getting it wrong? I traced the time handling through the code. It used the right functions, it didn't do anything daft with offsets, it looked clean. I added logging around every point a time value was created or compared, convinced the bug was hiding in some arithmetic.

A diff with timestamps annotated in two different timezones

the assumption I never logged

The logging is what finally caught it, but not in the way I expected. Every timestamp the job emitted was internally consistent. 09:00 in, 09:00 out. The code genuinely believed it was running at nine.

The thing I'd never questioned was what "nine o'clock" meant to the machine. I had assumed, without ever checking, that the host ran in local UK time. It didn't. It had been provisioned in UTC, like a sensible server should be, and at this time of year the UK is an hour ahead of UTC because of British Summer Time. The job ran at 09:00 UTC, which is 10:00 here, exactly as observed. The code was flawless. My model of the environment it ran in was an hour out.

The fix took thirty seconds: be explicit about the timezone the schedule is interpreted in rather than relying on whatever the host happens to be set to. The afternoon it cost me was entirely spent looking inside the program for a bug that lived in my head.

the actual lesson

Every assumption you don't check is a place a bug can hide for free, because you've pre-cleared it. I never logged "what timezone is this host in" because I already knew the answer. I was certain. And certainty is exactly the condition under which you stop testing something.

The technique that would have saved me the afternoon isn't clever. When the code looks correct and the behaviour is still wrong, the bug is almost always in the gap between what you think the system is and what it actually is. So you go and check the things you'd never normally check, the things you're sure of. What timezone is the host in. Which config file is actually loaded. Which version is really deployed. Is this even the machine I think it is. One of those embarrassingly basic questions is usually the answer, and the discomfort of asking it is the price of getting unstuck.

I now treat strong certainty about an environment as a flag rather than a comfort. If I catch myself saying "it can't be that, that's obviously fine", that's now the first thing I go and verify. It's usually that.