when one process ate the box, and cgroups v2 finally fenced it in

A build host went unresponsive on Wednesday, and the cause was the oldest story in the book: one process decided it needed all the memory, and nothing on the box was set up to tell it no. The OOM killer eventually woke up and shot something, but by then the load average had been north of 200 for several minutes and every ssh session was treacle. The real fix was not finding the bad process. It was admitting I should never have let any process get into that position in the first place.

This is a job for cgroups, and on a modern systemd machine that means cgroups v2, which by early 2023 is the default on the distributions I run. The mental model is simple: a control group is a box you put processes in, and the box has limits. Memory, CPU, IO. When a process in the box tries to exceed the memory limit, the kernel reclaims from inside that box rather than from the whole machine, and if it cannot, it kills something inside the box rather than rolling the dice across every process on the host.
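
Before relying on any of this, it is worth checking that the box really is on the unified v2 hierarchy. A minimal sketch, assuming GNU stat and the usual /sys/fs/cgroup mount point; the describe_cgroup_fs helper is my own name for illustration, not part of any tool:

```shell
# Map a filesystem type name (as printed by `stat -fc %T`) to a cgroup layout.
# describe_cgroup_fs is a hypothetical helper, named here for illustration.
describe_cgroup_fs() {
    case "$1" in
        cgroup2fs) echo "cgroup v2 (unified)" ;;
        tmpfs)     echo "cgroup v1 or hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# GNU stat prints the filesystem type of the mount. Errors are silenced so the
# check degrades to "unknown" on machines without /sys/fs/cgroup.
describe_cgroup_fs "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"

# Every process lists its cgroup membership here; on a pure v2 host it is a
# single line of the form "0::/path".
[ -r /proc/self/cgroup ] && cat /proc/self/cgroup || true
```

If the first command reports anything other than v2, the unit properties discussed here may be ignored or behave differently.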

The thing that makes v2 pleasant is that you rarely touch the /sys/fs/cgroup files directly. systemd owns the hierarchy, and you express limits as unit properties. For an ad-hoc command, systemd-run will wrap it in a transient scope with whatever limits you ask for:

systemd-run --scope -p MemoryMax=4G -p MemoryHigh=3G \
    --slice=builds make -j8

MemoryHigh is the throttle: cross it and the kernel leans on the cgroup with reclaim pressure and slows it down. MemoryMax is the hard wall: cross that and, if reclaim cannot bring usage back under the limit, a process inside the scope gets OOM-killed. Crucially, the kill is contained: the rest of the machine never feels it. That distinction is the whole point. The build can still fail, but it fails inside its own box instead of taking the host down with it.
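
The kernel keeps a tally of how often each limit was hit in the cgroup's memory.events file, which is a quick way to tell "throttled" from "killed" after the fact. A rough sketch, assuming the slice lands at /sys/fs/cgroup/builds.slice (that path is an assumption and will differ for user sessions or other unit names); memory_event is a helper name I made up:

```shell
# Print one counter from a cgroup v2 memory.events file, or 0 if it is absent.
# $1 = path to memory.events, $2 = counter name (high, max, oom, oom_kill)
# memory_event is a hypothetical helper, not a systemd or kernel tool.
memory_event() {
    awk -v key="$2" '$1 == key { print $2; found = 1 }
                     END { if (!found) print 0 }' "$1"
}

# Assumed path: adjust to wherever systemd placed the cgroup for your unit.
events=/sys/fs/cgroup/builds.slice/memory.events
if [ -r "$events" ]; then
    echo "throttled at MemoryHigh: $(memory_event "$events" high)"
    echo "ran into MemoryMax:      $(memory_event "$events" max)"
    echo "processes OOM-killed:    $(memory_event "$events" oom_kill)"
fi
```

A rising high count with oom_kill still at zero means the throttle is doing its job and the wall has not been needed yet.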

For the build service proper, I did it the durable way, in a drop-in rather than a one-off command:

# /etc/systemd/system/builder.service.d/limits.conf
[Service]
MemoryHigh=6G
MemoryMax=8G
CPUWeight=50

Then systemctl daemon-reload and a restart, and you can confirm it took with systemctl show builder.service -p MemoryMax. You can watch a live cgroup's actual usage with systemd-cgtop, which sorts the tree by resource use and is the first thing I now reach for when a box feels wrong, well before top.
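
Put together, the apply-and-verify step looks roughly like this. It is a sketch rather than a polished script: limit_gib and apply_builder_limits are names I invented, and the only real commands are the systemctl calls already mentioned. systemctl show reports memory limits in bytes (or "infinity"), so the helper converts them back to something readable:

```shell
# Convert a "Name=value" line from `systemctl show` into whole GiB.
# systemd reports memory limits in bytes, or "infinity" when no limit is set.
# limit_gib is a hypothetical helper for illustration.
limit_gib() {
    value=${1#*=}
    if [ "$value" = "infinity" ]; then
        echo "unlimited"
    else
        echo "$((value / 1073741824))G"   # integer division, rounds down
    fi
}

# Reload unit files, restart the service, and print the limits now in force.
# Defined but not called here, because it restarts a real service.
apply_builder_limits() {
    systemctl daemon-reload &&
    systemctl restart builder.service &&
    for prop in MemoryHigh MemoryMax; do
        echo "$prop: $(limit_gib "$(systemctl show builder.service -p "$prop")")"
    done
}
```

Run apply_builder_limits as root once the drop-in is in place. If either line comes back as "unlimited", the drop-in was not picked up.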

The lesson is not really about cgroups; it is about defaults. An unlimited host is a loaded gun pointed at every other tenant on it. A few lines of unit configuration turn "one bad process takes everything down" into "one bad process dies and logs a tidy error", and that is the difference between an incident and a non-event. I have gone round the other build hosts and given them all the same treatment. It should have been there from the start.