Ramblings of an aging IT geek
debugging

the leak was a cache that only ever grew

A service was slowly eating memory until it got OOM-killed, and the cause was a map used as a cache with entries added but never removed.

The graph was the giveaway: a slow, relentless climb in resident memory, perfectly linear, with a sawtooth drop every few days where the OOM killer stepped in and the orchestrator restarted the pod. No spikes, no correlation with traffic bursts, just a steady ramp that always ended the same way. That shape almost always means one thing. Something is being added to a collection and never removed.

The service kept a map keyed by request ID, used to correlate an incoming request with its eventual asynchronous response. You put an entry in when the request arrived, and you pulled it out when the response came back to match them up. Classic, sensible pattern. The bug was that the cleanup only happened on the happy path.

func (h *Handler) onResponse(id string, resp *Response) {
    req, ok := h.pending[id]
    if !ok {
        return
    }
    h.reconcile(req, resp)
    delete(h.pending, id) // only reached if a response arrives
}

When a response arrived, the entry got deleted. When a response never arrived (a timeout, a dropped upstream, a request the other side simply forgot about) the entry sat in the map forever. Most requests got responses, so the map mostly drained, and the leak was slow enough to hide for weeks. But a small fraction of requests never came back, and that small fraction accumulated, one orphaned entry at a time, until the process was carrying millions of dead correlation records and the kernel ran out of patience.
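
To make the arithmetic concrete, here is a minimal sketch of the leaky pattern. It is not the service's actual code: the onRequest method, the simulate helper, the stand-in Request and Response types, and the 1-in-1000 drop rate are all assumptions made up for illustration.

```go
package main

import "fmt"

// Request and Response are stand-ins for the service's real types.
type Request struct{ ID string }
type Response struct{ ID string }

// Handler mirrors the leaky version: entries leave the map only in onResponse.
type Handler struct {
    pending map[string]*Request
}

// onRequest is a hypothetical insert path; the post only shows the delete side.
func (h *Handler) onRequest(req *Request) {
    h.pending[req.ID] = req
}

func (h *Handler) onResponse(id string, resp *Response) {
    if _, ok := h.pending[id]; !ok {
        return
    }
    delete(h.pending, id) // only reached if a response arrives
}

// simulate sends n requests and drops the response for every dropEvery-th one.
// It returns the number of entries left behind in the map.
func simulate(n, dropEvery int) int {
    h := &Handler{pending: make(map[string]*Request)}
    for i := 1; i <= n; i++ {
        id := fmt.Sprintf("req-%d", i)
        h.onRequest(&Request{ID: id})
        if i%dropEvery == 0 {
            continue // the response never arrives, so nothing removes this entry
        }
        h.onResponse(id, &Response{ID: id})
    }
    return len(h.pending)
}

func main() {
    // Assumed loss rate of 1 in 1000, purely for illustration.
    fmt.Println(simulate(1_000_000, 1000)) // prints 1000
}
```

A thousand orphans per million requests looks like nothing on a dashboard, which is roughly why this kind of leak can run for weeks before anyone connects it to the restarts.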

The thing I want to flag is why it was so hard to spot in review. The code was not wrong in the way bad code is wrong. Every line was reasonable. The insert was correct, the delete was correct, the reconcile was correct. The bug was in the absence of a line, not the presence of a bad one, and absences don't show up in a diff. You can stare at a correct-looking insert and a correct-looking delete and never notice that the set of paths reaching the delete is smaller than the set of paths reaching the insert. Nothing is highlighted because nothing is there.

The fix was to give every entry an expiry and sweep the map periodically, so an entry that never gets claimed gets evicted anyway:

type entry struct {
    req     *Request
    addedAt time.Time
}

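// background sweep, runs on a ticker
// Go maps are not safe for concurrent use, so h.pending must be guarded
// (for example by a sync.Mutex) against the request handlers; locking is omitted here.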
func (h *Handler) reap(maxAge time.Duration) {
    cutoff := time.Now().Add(-maxAge)
    for id, e := range h.pending {
        if e.addedAt.Before(cutoff) {
            delete(h.pending, id)
        }
    }
}
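
For completeness, here is one way the sweep could be wired to a ticker. This is a sketch under assumptions rather than the code that shipped: the mu field, the add and size helpers, runReaper, and the intervals in main are all hypothetical, and reap returns a count here only to make it easier to observe.

```go
package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

type Request struct{ ID string }

type entry struct {
    req     *Request
    addedAt time.Time
}

type Handler struct {
    mu      sync.Mutex // assumed: guards pending against concurrent access
    pending map[string]entry
}

// add records a request along with the time it was inserted.
func (h *Handler) add(req *Request) {
    h.mu.Lock()
    defer h.mu.Unlock()
    h.pending[req.ID] = entry{req: req, addedAt: time.Now()}
}

// size reports how many entries are currently pending.
func (h *Handler) size() int {
    h.mu.Lock()
    defer h.mu.Unlock()
    return len(h.pending)
}

// reap removes entries older than maxAge and reports how many it evicted.
func (h *Handler) reap(maxAge time.Duration) int {
    cutoff := time.Now().Add(-maxAge)
    h.mu.Lock()
    defer h.mu.Unlock()
    evicted := 0
    for id, e := range h.pending {
        if e.addedAt.Before(cutoff) {
            delete(h.pending, id) // deleting during range is allowed in Go
            evicted++
        }
    }
    return evicted
}

// runReaper sweeps once per interval until ctx is cancelled.
func (h *Handler) runReaper(ctx context.Context, interval, maxAge time.Duration) {
    t := time.NewTicker(interval)
    defer t.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-t.C:
            h.reap(maxAge)
        }
    }
}

func main() {
    h := &Handler{pending: make(map[string]entry)}
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    go h.runReaper(ctx, 10*time.Millisecond, 50*time.Millisecond)

    h.add(&Request{ID: "orphan"}) // no response will ever claim this entry
    time.Sleep(200 * time.Millisecond)
    fmt.Println(h.size())
}
```

The maxAge should sit comfortably above the longest legitimate response time, otherwise the sweep evicts entries whose responses are merely slow, and those responses then fall through the not-found branch in onResponse.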

The memory graph went flat and stayed flat. The broader lesson I keep relearning is that any map you treat as a cache needs a removal story that does not depend on the well-behaved path running. If the only thing that deletes an entry is success, then every failure is a permanent tenant, and a process is just a slow accumulation of every request that never quite finished. Bound it by size, bound it by age, bound it by something. An unbounded map keyed by anything that varies is a memory leak wearing a respectable jacket.
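
Bounding by size is the other easy option. The sketch below is not what I deployed; it is a hypothetical boundedMap that evicts the oldest entry once a cap is reached, with string values standing in for real request records. All the names in it (boundedMap, put, take, size) are made up for the example.

```go
package main

import (
    "container/list"
    "fmt"
)

// kv is what each list element stores, so eviction can find the map key.
type kv struct {
    key string
    val string
}

// boundedMap holds at most max entries; inserting past that evicts the oldest.
type boundedMap struct {
    max   int
    order *list.List               // front is the oldest insertion
    items map[string]*list.Element // key -> its element in order
}

func newBoundedMap(max int) *boundedMap {
    if max < 1 {
        max = 1 // a cap below one would make every insert pointless
    }
    return &boundedMap{max: max, order: list.New(), items: make(map[string]*list.Element)}
}

// put inserts or overwrites key. A new key at capacity evicts the oldest entry.
func (b *boundedMap) put(key, val string) {
    if el, ok := b.items[key]; ok {
        el.Value = kv{key, val} // overwrite in place, keeping its original age
        return
    }
    if len(b.items) >= b.max {
        oldest := b.order.Front()
        evicted := b.order.Remove(oldest).(kv)
        delete(b.items, evicted.key)
    }
    b.items[key] = b.order.PushBack(kv{key, val})
}

// take removes key and returns its value, as onResponse would on a match.
func (b *boundedMap) take(key string) (string, bool) {
    el, ok := b.items[key]
    if !ok {
        return "", false
    }
    delete(b.items, key)
    return b.order.Remove(el).(kv).val, true
}

func (b *boundedMap) size() int { return len(b.items) }

func main() {
    b := newBoundedMap(2)
    b.put("r1", "a")
    b.put("r2", "b")
    b.put("r3", "c") // at capacity, so r1 is evicted

    _, ok := b.take("r1")
    fmt.Println(ok, b.size()) // prints: false 2
}
```

A size cap trades the leak for a different failure: under a burst it can evict an entry whose response is still on its way. For a correlation map, an age bound is usually the better first choice, with a size cap as a backstop.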