Ramblings of an aging IT geek
linux

the runaway process that cgroups v2 caught for me

A memory-hungry import job that used to take a box down now gets quietly throttled by a cgroups v2 memory limit instead.

There's a nightly import job on one of my boxes that has, twice now, decided to allocate all the memory in the world and take the rest of the machine down with it. Classic OOM killer roulette: the kernel eventually steps in and shoots something, but it's rarely the thing you wanted shot. Last time it killed sshd, which is a special kind of insult.

The proper fix is to make the job not do that. The pragmatic fix, while I work out the proper one, is to put the job in a box it can't escape. On a modern systemd machine that's cgroups v2, and it's genuinely pleasant now.

Just give it a slice

I don't need a hand-rolled cgroup hierarchy. The job already runs from a systemd service, so I just set a memory ceiling on the unit:

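[Service]
# Soft limit: above this the kernel reclaims hard and throttles the job
MemoryHigh=1500M
# Hard limit: above this the OOM killer acts inside the unit's cgroup
MemoryMax=2G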

MemoryHigh is the soft limit. Cross it and the kernel starts reclaiming aggressively and throttling the cgroup, which slows the process down rather than killing it. MemoryMax is the hard wall. Hit that, and if reclaim can't pull usage back under it, the OOM killer fires, but crucially it fires inside the cgroup, so it kills the import job and leaves everything else alone. The blast radius is exactly the unit I scoped, not the whole machine.
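
One way to confirm the two limits actually landed is to read them back from the cgroup itself, where systemd writes them to memory.high and memory.max. Below is a minimal sketch, assuming the unit's cgroup sits at /sys/fs/cgroup/system.slice/import.service (the usual place for a system service). The helper name cg_limits is made up for illustration. systemd parses the M and G suffixes as base 1024, so 1500M should read back as 1572864000 bytes and 2G as 2147483648.

```shell
#!/bin/sh
# cg_limits DIR: print memory.high and memory.max for a cgroup directory, in MiB.
# Hypothetical helper; DIR is passed in so it can point at any cgroup.
cg_limits() {
    dir=$1
    for f in memory.high memory.max; do
        val=$(cat "$dir/$f") || return 1
        if [ "$val" = "max" ]; then
            # The kernel reports the literal string "max" when no limit is set
            echo "$f unlimited"
        else
            echo "$f $((val / 1024 / 1024))MiB"
        fi
    done
}
```

With the unit settings above, running cg_limits /sys/fs/cgroup/system.slice/import.service while the job is active should print memory.high 1500MiB and memory.max 2048MiB. If either line says unlimited, the setting never reached the running unit.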

Watching it actually work

The nice part of v2 is the unified hierarchy and the per-cgroup accounting that comes with it. I can watch the job approach its limit live:

$ systemctl status import.service
$ cat /sys/fs/cgroup/system.slice/import.service/memory.current
$ cat /sys/fs/cgroup/system.slice/import.service/memory.events

That memory.events file is the one to watch. The high counter ticks up every time the process gets throttled at the soft limit, and oom_kill tells you if the hard wall was ever hit. After I deployed this, high climbed steadily through the run and oom_kill stayed at zero. The job took about forty minutes longer than usual because it spent the back half being throttled, and I genuinely could not care less. It finished. Nothing else noticed. sshd lived.
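
Reading those two counters by eye gets old, so here is a rough sketch of turning memory.events into a one-line verdict. It assumes the file's flat "key value" layout, one counter per line. The helper names cg_event and cg_verdict are hypothetical, not part of any tool.

```shell
#!/bin/sh
# cg_event FILE KEY: print the counter for KEY from a memory.events file.
# Hypothetical helper; exits non-zero if KEY is not present in FILE.
cg_event() {
    awk -v key="$2" '$1 == key { print $2; found = 1 } END { exit !found }' "$1"
}

# cg_verdict FILE: one-line summary of whether the limits throttled or killed.
cg_verdict() {
    high=$(cg_event "$1" high) || return 1
    kills=$(cg_event "$1" oom_kill) || return 1
    if [ "$kills" -gt 0 ]; then
        echo "hard limit hit: $kills oom kill(s), throttled $high time(s)"
    elif [ "$high" -gt 0 ]; then
        echo "throttled $high time(s), no oom kills"
    else
        echo "never reached the soft limit"
    fi
}
```

Pointed at /sys/fs/cgroup/system.slice/import.service/memory.events during a run like the one described above, cg_verdict would report the throttled case. As far as I know, systemd removes the cgroup directory once the service stops, so this has to be run while the job is still going.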

The honest caveat is that this treats the symptom. Somewhere in that import there's an unbounded buffer or a query pulling far more than it should, and one day I'll find it. But there's real value in a fix that converts "the box falls over and I get paged at 3am" into "the slow job is a bit slower". cgroups v2 turned a machine-wide outage into a contained, observable, boring event, and boring is the whole point.