Ramblings of an aging IT geek
debugging

the slow leak that was a cache i forgot to evict

A service that grew until it was killed every few days, traced to a map used as a cache that only ever had things added to it.


The service had a tell: it ran fine for a few days, got slower, then got OOM-killed and restarted clean. Sawtooth memory: climbing steadily, never coming back down on its own, dropping only when the process was killed. A leak, plainly, but the kind that takes days to show, which makes it tedious to chase because every experiment costs you a day of waiting to see if the line is still going up.

The usual instinct in a garbage-collected language is to assume you can't leak, and that's wrong in the most common way. The collector frees anything that has become unreachable, but if you hold a reference to something forever, it stays reachable and is never collected. That's not a leak the runtime can fix. That's a leak you wrote.

It was a map. Somewhere early on I'd added a little cache, a map[string]result keyed by request, to avoid recomputing the same expensive thing twice. Sensible. The problem is I wrote the half that adds and never wrote the half that removes. Every distinct key that ever came through got an entry, and the keys were effectively unbounded because they included things like timestamps and IDs. The map only ever grew. A cache with no eviction isn't a cache, it's a memory leak with good intentions.

The fix was to make it a bounded cache with LRU (least-recently-used) eviction, capped at a sane number of entries, so the keys that haven't been touched for longest fall out as new ones arrive. The memory line went flat that afternoon and has stayed flat. The lesson, filed alongside the others: any time you add a long-lived map, write the line that removes from it in the same sitting, or decide on purpose that it's bounded and small. "I'll add eviction later" is how you end up reading a heap profile at 2am, wondering why a cache you'd forgotten you wrote is the biggest thing in the process.
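
For illustration, a bounded LRU cache can be sketched with nothing but the standard library's container/list and a map. This is one common way to build it, not necessarily what went into the service: lruCache, entry, get, put and the capacity of 2 in the demo are all hypothetical, and the sketch is not safe for concurrent use without a mutex around get and put.

```go
package main

import (
	"container/list"
	"fmt"
)

// result stands in for the cached value type (hypothetical fields).
type result struct{ value string }

// entry is what each list node holds; the key is kept so eviction can delete from the map.
type entry struct {
	key string
	val result
}

type lruCache struct {
	capacity int
	order    *list.List               // front = most recently used, back = least
	items    map[string]*list.Element // key -> node in order
}

func newLRUCache(capacity int) *lruCache {
	return &lruCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// get returns the cached value and marks the key as recently used.
func (c *lruCache) get(key string) (result, bool) {
	el, ok := c.items[key]
	if !ok {
		return result{}, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).val, true
}

// put stores a value and evicts the least recently used entry once over capacity.
func (c *lruCache) put(key string, val result) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, val: val})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key) // the half that removes
	}
}

func main() {
	c := newLRUCache(2)
	c.put("a", result{"1"})
	c.put("b", result{"2"})
	c.get("a")              // touch "a" so "b" becomes the oldest
	c.put("c", result{"3"}) // over capacity: "b" is evicted
	_, okB := c.get("b")
	fmt.Println(len(c.items), okB) // prints: 2 false
}
```

The map and the list are always updated together, so the number of entries can never exceed the capacity, whatever the keys look like.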