I wrote a short note about this last week, but the more I sat with it the more I thought the actual hunt was worth writing down properly. Not the conclusion, which is boring (a map that only ever grew), but the path to it, which I'd happily have skipped if I'd known where to look from the start.
The symptom
A backend service, written in Go, running fine for hours and then quietly OOM-killed. Memory climbed in a dead straight line on the dashboards. No sawtooth, no plateau, just up and to the right until the kernel intervened. Restart, and the same line started again from the bottom. That straight line is the tell. A healthy long-running service breathes: it allocates, it collects, it settles around some working set. A line that only climbs means something is being kept alive that should be allowed to die.
Actually looking
I'd been ignoring it for a while by leaning on a scheduled restart, which is the ops equivalent of putting a bucket under a leak. Eventually I got embarrassed and turned on profiling. The service already imported net/http/pprof, so it was a matter of exposing the endpoint and grabbing a heap profile off a box that had been up long enough to be fat.
go tool pprof http://localhost:6060/debug/pprof/heap
(pprof) top
(pprof) list dedupeCache
The top command put a single allocation site at the head of the list, miles ahead of anything else, and the list command walked me straight to the source line. It was a cache I'd added months earlier to dedupe some idempotent work, keyed by request id. I wrote to it on every request. I never deleted from it. Ever. So it accumulated one entry per unique request id for the entire lifetime of the process, which on a busy day is an awful lot of sixteen-byte keys plus their values.
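Reduced to its shape, the bug looks something like the sketch below. This is a reconstruction for illustration, not the service's code: the dedupeCache type, its do and len methods, and the result type are all assumed names, and holding the lock while the work runs is a simplification. The point is only that the map has a write path and no delete path.

```go
package main

import (
	"fmt"
	"sync"
)

// result stands in for whatever the idempotent work produces (assumed type).
type result struct{ body []byte }

// dedupeCache remembers the result for each request id it has seen.
type dedupeCache struct {
	mu      sync.Mutex
	entries map[string]result
}

func newDedupeCache() *dedupeCache {
	return &dedupeCache{entries: make(map[string]result)}
}

// do runs work once per request id and replays the stored result afterwards.
func (c *dedupeCache) do(requestID string, work func() result) result {
	c.mu.Lock()
	defer c.mu.Unlock()
	if r, ok := c.entries[requestID]; ok {
		return r
	}
	r := work()
	c.entries[requestID] = r // the only write to the map; nothing ever deletes
	return r
}

// len reports how many entries are currently held.
func (c *dedupeCache) len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.entries)
}

func main() {
	c := newDedupeCache()
	for i := 0; i < 100000; i++ {
		id := fmt.Sprintf("req-%d", i)
		c.do(id, func() result { return result{body: []byte("ok")} })
	}
	fmt.Println(c.len()) // prints 100000: one entry per unique id, none released
}
```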
The fix, and the actual lesson
The fix was dull: give entries a TTL, run a sweep on a ticker, and put a hard cap on the number of entries so that even if the sweep falls behind, the map can't eat the box. Twenty lines. The leak had been live for months.
The lesson is the one I keep failing to internalise. Every cache is a deliberate memory leak that you've decided is worth it, and the bargain only holds if you've also decided when entries leave. A map with writes and no removes isn't a cache. It's a list of everything that's ever happened, stored in RAM, waiting to kill you. Now my rule is simple: I don't merge the line that adds to a cache without the line that evicts from it in the same change. If I genuinely can't decide on eviction yet, that's a sign I don't understand the cache well enough to be adding it.