The service grew. Not in features, in resident memory. It started a deploy at a comfortable 200MB and, left alone, would be over a gigabyte by the end of the week. Then it got OOM-killed, restarted clean at 200MB, and began the climb again. A sawtooth on the memory graph, which is the unmistakable signature of a leak being papered over by a restart.
The thing about a slow leak is it's never the code you're currently looking at. It's something that was written months ago, works correctly, passes its tests, and quietly hoards. Mine was a map.
Someone, possibly me, had added a small in-process cache to avoid recomputing an expensive lookup for the same key twice in a request. It looked like this, more or less:
var seen = map[string]Result{}
func lookup(key string) Result {
if r, ok := seen[key]; ok {
return r
}
r := expensiveLookup(key)
seen[key] = r
return r
}
Read that and the bug is obvious in hindsight. Things go into seen. Nothing ever comes out. It isn't a cache, it's a permanent record of every distinct key the process has handled since it started. The keys were request-scoped identifiers, near enough unique per request, so the map grew by roughly the request rate, forever. A cache with a hit rate of essentially zero and a memory cost of everything.
The reason it survived review and testing is that in any single request, or any short test run, the cache works perfectly. You hit the same key twice in one request, you get the cached value, the test passes, everyone's happy. The unboundedness only shows up over days, across millions of distinct keys, which no test exercises.
The fix was to stop pretending it was a cache when it had no eviction. The keys were only useful within a single request, so the cache should live and die with the request rather than with the process. I moved it out of the package-level variable and into a per-request struct that gets garbage collected when the request finishes:
type request struct {
seen map[string]Result // lives as long as the request, no longer
}
The right tool when you genuinely want a process-lifetime cache is one with a bound and an eviction policy: an LRU with a size cap, or entries with a TTL, something that promises to forget. A bare map promises only to remember, and a thing that only remembers will, given enough time, remember you right out of memory.
The general lesson, the one I keep writing on the inside of my skull: any collection that's written to from a long-lived path needs an answer to "what removes from this?" If the answer is "nothing", you haven't built a cache. You've built a slow leak with good intentions.