the cache that grew until the box fell over

A terminal showing a stack trace

The symptom was a service that got OOM-killed roughly once a day, always in the small hours, always after enough uptime that nobody connected it to any particular deploy. Restart it and the clock started again. Classic slow leak in a language that's supposed to not have them.

It was a map. I'd added a little in-memory cache months ago, keyed by request id, to dedupe some work. Writing to it was easy and I did it everywhere. Removing from it was a job for "later", and later, as ever, never arrived. So the map only ever grew, one entry per unique request, forever, until the heap ran out and the kernel did the housekeeping I hadn't.

pprof made it obvious in about thirty seconds once I bothered to look. The heap profile pointed straight at the map, holding hundreds of megabytes of keys that would never be read again. The fix was a TTL and a sweep, the thing I should have written the day I added the cache.

The lesson I keep relearning: every cache is a memory leak you've agreed to tolerate. The moment you add the write, write the eviction too, or at least the comment admitting you haven't. A map with no upper bound isn't a cache, it's just a slow way to crash.