The pattern was almost polite about it. Every three or four days, one particular service would get OOM-killed, restart cleanly, and start the slow climb again. Resident memory crept up in a straight line, no spikes, no garbage collection pauses worth mentioning, just a steady march towards the cgroup limit. The kind of leak that doesn't crash today, so it never quite makes it to the top of the list.
It was a map. Of course it was a map. The service kept a map[string]*session keyed on a connection ID, populated on connect and read on every subsequent request. Adding to it was everywhere. Deleting from it was nowhere. The disconnect handler logged the event, flushed some metrics, and then did absolutely nothing about the entry it had just orphaned. So the map only ever grew, one dead session at a time, until the box ran out of room to hold the dead.
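The shape of the bug fits in a few lines. This is a minimal sketch, not the service's real code: the session fields, the onConnect and onDisconnect names, and the disconnects counter standing in for the metrics flush are all assumptions made for illustration.

```go
package main

import "fmt"

// session is a hypothetical stand-in for per-connection state.
type session struct {
	connID string
}

// sessions mirrors the map described above, keyed on connection ID.
var sessions = map[string]*session{}

// disconnects stands in for the metrics the real handler flushed.
var disconnects int

// onConnect registers a session for a new connection.
func onConnect(id string) {
	sessions[id] = &session{connID: id}
}

// onDisconnect records the event and returns without touching the map.
// The missing delete(sessions, id) is the whole leak.
func onDisconnect(id string) {
	disconnects++
}

func main() {
	// Simulate 1000 connections that each open and then close.
	for i := 0; i < 1000; i++ {
		id := fmt.Sprintf("conn-%d", i)
		onConnect(id)
		onDisconnect(id)
	}
	// Nothing is connected any more, but every entry is still held.
	fmt.Println("disconnects:", disconnects, "map entries:", len(sessions))
}
```

Every connection has closed by the end of the loop, yet the map still holds all of them, and each entry keeps its session reachable so the garbage collector can never reclaim it.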
What made it hard to see was that the leak rate tracked traffic, not time: every connection left one entry behind, so memory grew with connection count rather than uptime. On a quiet weekend the service would last five days; on a busy Wednesday it barely made three. I spent an embarrassing afternoon convinced it was a load-dependent bug somewhere clever before pprof pointed a finger straight at the map and I felt very silly.
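For anyone who hasn't reached for pprof before, here is one way to grab a heap profile from inside a Go process using the standard runtime/pprof package. It is a sketch under assumptions rather than a record of what I ran: the heapProfileBytes helper and the heap.pprof filename are made up for the example, and a long-running service would more commonly expose the net/http/pprof endpoints instead.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// heapProfileBytes is a hypothetical helper that returns a heap profile
// of the current process in the format `go tool pprof` reads.
func heapProfileBytes() ([]byte, error) {
	// Run a collection first so the profile reflects up-to-date statistics.
	runtime.GC()

	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	profile, err := heapProfileBytes()
	if err != nil {
		fmt.Fprintln(os.Stderr, "heap profile failed:", err)
		os.Exit(1)
	}
	// heap.pprof is an assumed filename; any path works.
	if err := os.WriteFile("heap.pprof", profile, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote heap.pprof")
}
```

Opening the file with go tool pprof heap.pprof and running top lists the call sites that hold the most live memory. In a leak like this one, the code that allocates the sessions should sit at or near the top.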
The fix was one line in the disconnect path, a delete(sessions, id), plus a mutex I should have had around the map anyway. The lesson, again, is that anything you add to must also have a story for removal, and "the connection will close eventually" is not a story unless something is actually listening for it.
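Put together, the fixed version looks roughly like this. Again a sketch, assuming a small registry type of my own invention to hold the map and its mutex; the method names are illustrative and the real handlers did more than this.

```go
package main

import (
	"fmt"
	"sync"
)

// session is a hypothetical stand-in for per-connection state.
type session struct {
	connID string
}

// registry pairs the sessions map with the mutex that guards it.
type registry struct {
	mu       sync.Mutex
	sessions map[string]*session
}

func newRegistry() *registry {
	return &registry{sessions: make(map[string]*session)}
}

// add is called on connect.
func (r *registry) add(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.sessions[id] = &session{connID: id}
}

// get is called on every request after connect.
func (r *registry) get(id string) (*session, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.sessions[id]
	return s, ok
}

// remove is called on disconnect. This is the one-line fix.
func (r *registry) remove(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.sessions, id)
}

// size reports how many sessions are currently held.
func (r *registry) size() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.sessions)
}

func main() {
	r := newRegistry()
	var wg sync.WaitGroup
	// Simulate 100 connections opening and closing concurrently.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			id := fmt.Sprintf("conn-%d", n)
			r.add(id)
			r.remove(id)
		}(i)
	}
	wg.Wait()
	// Every add was paired with a remove, so nothing is left behind.
	fmt.Println("entries after all disconnects:", r.size())
}
```

The mutex matters as much as the delete: Go maps are not safe for concurrent writes, and connect and disconnect handlers typically run on different goroutines. A sync.RWMutex would be a reasonable swap if reads heavily outnumber writes.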