penning in a runaway process with cgroups v2

A nightly batch job on one of my boxes decided last week to allocate until the OOM killer woke up and started shooting things. Not the batch job, naturally. It took out a perfectly innocent Postgres instance instead, because the OOM killer scores by memory and the job was clever enough to spread its damage. I came down to a wedged machine and a log full of Out of memory: Kill process.

The job itself is fixable, eventually, but I didn't want a single misbehaving process to be able to take the whole host hostage while I sort it out. This is exactly what control groups are for, and on this box I'm running a recent enough kernel to lean on the v2 unified hierarchy rather than the old tangle of separate controllers.

the old hierarchy was a mess

cgroup v1 gave you a separate hierarchy per controller. Memory lived under one mount, CPU under another, blkio somewhere else, and a process could sit in a different group in each, which made reasoning about it genuinely painful. v2 collapses all of that into a single tree where a process belongs to exactly one group, and every controller enabled for that group applies to it there. Once you've used it, the v1 layout feels like an accident, which it sort of was.
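
If it isn't obvious which layout a given box is running, the filesystem type mounted at /sys/fs/cgroup gives it away. A minimal sketch, assuming GNU stat and the usual mount point; the function name classify_fstype is mine, not anything standard:

```shell
#!/bin/sh
# Map the filesystem type at the cgroup mount point to a layout name.
# cgroup2fs is the v2 unified hierarchy; a tmpfs there normally means v1
# (or a hybrid setup) with one mount per controller underneath it.
classify_fstype() {
    case "$1" in
        cgroup2fs) echo "v2 unified" ;;
        tmpfs)     echo "v1 or hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# GNU stat: -f reports on the filesystem, %T prints its type by name.
classify_fstype "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
```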

The practical upshot for my problem: I create one group, enable the memory controller on it, set a throttle and a hard limit, and drop the job in.

# enable the memory controller for children
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control

mkdir /sys/fs/cgroup/batch
# throttle first, hard ceiling above it
echo 512M > /sys/fs/cgroup/batch/memory.high
echo 600M > /sys/fs/cgroup/batch/memory.max

# launch the job inside it
echo $$ > /sys/fs/cgroup/batch/cgroup.procs
exec /opt/jobs/nightly-crunch

The distinction between memory.high and memory.max is the bit worth knowing. high is a throttle: cross it and the group gets put under heavy reclaim pressure and slowed down, but nothing dies. max is the hard ceiling: cross that and the kernel OOM-kills something inside the group rather than rampaging across the whole machine. So the job gets squeezed first, and only if it ignores the squeeze does it die, and crucially it dies on its own instead of taking Postgres with it.
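
That ordering only works if the throttle actually sits below the ceiling. A quick sanity check, sketched under the assumption that the group directory is passed in and that the kernel reports both files either in bytes or as the literal string max (meaning unlimited); throttle_below_ceiling is a made-up name:

```shell
#!/bin/sh
# Succeed only if memory.high is a real limit strictly below memory.max,
# i.e. the group gets squeezed before it can be OOM-killed.
# $1 is the cgroup directory, e.g. /sys/fs/cgroup/batch.
throttle_below_ceiling() {
    high=$(cat "$1/memory.high") || return 2
    max=$(cat "$1/memory.max") || return 2
    # No throttle configured at all: nothing squeezes the group first.
    [ "$high" = "max" ] && return 1
    # Throttle set, no hard ceiling: the throttle is trivially below it.
    [ "$max" = "max" ] && return 0
    [ "$high" -lt "$max" ]
}
```

Usage would be something like throttle_below_ceiling /sys/fs/cgroup/batch || echo "memory.high is not below memory.max" >&2, run once after writing the limits.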

I wired this into the systemd unit rather than the raw filesystem in the end, because systemd speaks cgroup v2 fluently and it's far less fiddly:

[Service]
MemoryHigh=512M
MemoryMax=600M
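
For a one-off run without touching a unit file, systemd-run can apply the same two properties to a transient scope. A sketch only: run_capped and the DRY_RUN switch are my own inventions for illustration, and the real invocation needs root (or a user session with the memory controller delegated):

```shell
#!/bin/sh
# Run a command in a transient scope with a memory throttle and ceiling.
# Usage: run_capped HIGH MAX COMMAND [ARGS...]
# With DRY_RUN set to a non-empty value, print the systemd-run command
# instead of running it.
run_capped() {
    high=$1
    max=$2
    shift 2
    ${DRY_RUN:+echo} systemd-run --scope \
        -p "MemoryHigh=$high" -p "MemoryMax=$max" "$@"
}
```

Called as run_capped 512M 600M /opt/jobs/nightly-crunch, with the throttle given first and the ceiling second.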

I ran the job again with the cap on and watched memory.events tick up the high counter as it pushed against the throttle, then settle. No OOM, no collateral, Postgres untouched. The job is slower now under load, which is fine, because a slow job is a problem I can schedule around and a dead database is a problem that pages me. Containment first, performance second. The runaway is still a runaway, but now it's a runaway in a field with a fence round it.
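
For the record, the watching amounts to reading counters out of the group's memory.events file, which holds one name and count per line (high, max and oom_kill among them). A small helper, assuming the batch group path from the raw-filesystem version; under systemd the file should live in the unit's own cgroup directory instead, and event_count is a hypothetical name:

```shell
#!/bin/sh
# Print one counter from a memory.events file.
# Usage: event_count FILE KEY
event_count() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Example: poll the throttle and kill counters once a second.
# while sleep 1; do
#     echo "high=$(event_count /sys/fs/cgroup/batch/memory.events high)" \
#          "oom_kill=$(event_count /sys/fs/cgroup/batch/memory.events oom_kill)"
# done
```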