The service didn't crash. That was almost the worst part. It just got slowly, steadily fatter over about a week until the OOM killer ended it, the container restarted, and the clock began again. A sawtooth on the memory graph with a period of roughly seven days. Classic leak, and in Go, which people will tell you doesn't leak. Go has a garbage collector. Go also has maps you hold references to forever, which is the same thing wearing a hat.
I'd been ignoring it for a while because a weekly restart is survivable and there were louder fires. But the sawtooth was getting steeper, and a leak that's getting worse means traffic is growing into it, which means it'll stop being survivable on its own schedule.
finding it
The honest tool here is pprof, and the honest admission is that I should have reached for it on day one instead of day five. The service already had the pprof HTTP handler registered, so it was just a matter of grabbing a heap profile while memory was high.
    go tool pprof http://localhost:6060/debug/pprof/heap
Inside pprof, top pointed straight at a single allocation site, and list on the function showed the line. It was a map.
The code was a per-request deduplication cache. Each incoming request had an ID, and to avoid processing duplicates we stashed the ID in a map. The check was there, the insert was there, and the delete was nowhere. We added to the map on every request and never removed anything. It worked perfectly in tests, which run for seconds and see a few hundred IDs. In production it saw millions, and held every one of them.
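The shape of the bug is easier to see in code. What follows is a hypothetical reconstruction, not the service's actual source: the type and method names are invented, and locking is left out to keep the leak itself in focus.

```go
package main

// leakyDedup is an illustrative stand-in for the original cache.
// It is not safe for concurrent use; a real handler would need a mutex.
type leakyDedup struct {
	seen map[string]struct{}
}

func newLeakyDedup() *leakyDedup {
	return &leakyDedup{seen: make(map[string]struct{})}
}

// IsDuplicate reports whether id has been seen before and records it if not.
func (d *leakyDedup) IsDuplicate(id string) bool {
	if _, ok := d.seen[id]; ok {
		return true // the check
	}
	d.seen[id] = struct{}{} // the insert
	return false            // and no delete, anywhere
}
```

Every distinct ID that passes through IsDuplicate stays in the map for the life of the process, so the map's size tracks total unique traffic rather than recent traffic.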
the fix, and the real fix
The immediate fix was embarrassingly small. The dedup only needed to remember IDs for a short window, so the map should never have been unbounded in the first place. I swapped it for a time-windowed cache that evicts entries older than the dedup window:
    // evict anything older than the dedup window
    for id, seen := range c.entries {
        if now.Sub(seen) > c.window {
            delete(c.entries, id)
        }
    }
A map you only ever insert into is a leak with extra steps. The compiler won't warn you, and the GC can't help you, because as far as it's concerned you still want all that data. You told it so by keeping the reference.
The real fix was a habit, not a patch. Any time I reach for a map as a cache now, the first question is what removes entries from it. If the answer is nothing, I either bound it or I find a structure that bounds itself. The memory graph has been a flat line for a week, which is the most boring and satisfying graph there is.
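As an illustration of a structure that bounds itself, here is a rough sketch of a fixed-capacity set that forgets its oldest ID when full. Everything in it is hypothetical (boundedSet, newBoundedSet, Add), it assumes a capacity greater than zero, and like the earlier examples it would need a lock before being shared between goroutines.

```go
package main

// boundedSet holds at most len(ring) IDs. When it is full, adding a new
// ID evicts the oldest one, so memory use has a hard ceiling.
type boundedSet struct {
	ring  []string            // fixed-size ring of IDs in insertion order
	next  int                 // slot to write next; once full, also the oldest entry
	items map[string]struct{} // membership index over the IDs in ring
}

// newBoundedSet assumes capacity > 0.
func newBoundedSet(capacity int) *boundedSet {
	return &boundedSet{
		ring:  make([]string, capacity),
		items: make(map[string]struct{}, capacity),
	}
}

// Add records id and reports whether it was already present.
func (s *boundedSet) Add(id string) bool {
	if _, ok := s.items[id]; ok {
		return true
	}
	if len(s.items) == len(s.ring) {
		delete(s.items, s.ring[s.next]) // full: forget the oldest ID
	}
	s.ring[s.next] = id
	s.items[id] = struct{}{}
	s.next = (s.next + 1) % len(s.ring)
	return false
}
```

The trade against the time-windowed version is that the bound is a count rather than a duration, so under a traffic spike an ID can be forgotten sooner than the dedup window would like.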