A runaway process and the cgroup that caught it

A batch job on one of my boxes had a habit of occasionally deciding it needed all the memory in the world. Most runs were fine. Then one input would be pathological, the working set would balloon, the machine would start swapping, and within a minute everything else on the host, including sshd, was crawling. By the time I could log in to kill it, the kernel OOM killer had already had its own opinion about which process deserved to die, and it was rarely the right one.

The job was not going to get fixed quickly. So the goal changed: stop one greedy process from taking the whole host down with it. cgroups v2 is exactly the tool for that, and on a recent enough distribution it is already the default hierarchy mounted at /sys/fs/cgroup.
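
Before relying on any of this, it is worth confirming that the host really is on the unified hierarchy and that the memory controller is handed down to child cgroups. A minimal sketch, assuming the usual mount at /sys/fs/cgroup. The function names are mine and purely illustrative, not part of any tool:

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Map a filesystem type (as printed by `stat -fc %T <path>`) to a verdict.
# cgroup v2 reports "cgroup2fs"; anything else (typically "tmpfs" on a
# v1 or hybrid setup) is treated as not-v2.
cgroup_version() {
    case "$1" in
        cgroup2fs) echo v2 ;;
        *)         echo not-v2 ;;
    esac
}

# Succeed if controller $2 appears in the space-separated list in file $1
# (for example cgroup.controllers or cgroup.subtree_control).
has_controller() {
    [ -r "$1" ] || return 1
    tr ' ' '\n' < "$1" | grep -qx "$2"
}

# Usage (as root for the final write):
#   cgroup_version "$(stat -fc %T /sys/fs/cgroup)"
#   has_controller /sys/fs/cgroup/cgroup.subtree_control memory \
#       || echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
```

If the memory controller is missing from cgroup.subtree_control, a freshly made child cgroup will have no memory.max file at all, which is a confusing way to find out.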

Containing it by hand

The crude version is a few writes, run as root. Make a cgroup, set a soft and a hard memory limit, then move the current shell into it so that anything launched from that shell inherits both:

mkdir /sys/fs/cgroup/batch
echo "2G" > /sys/fs/cgroup/batch/memory.max
echo "512M" > /sys/fs/cgroup/batch/memory.high
echo $$ > /sys/fs/cgroup/batch/cgroup.procs

memory.high is the gentle one. Cross it and the process gets throttled and put under heavy reclaim pressure, which slows it down without killing it. memory.max is the wall: at that point the kernel reclaims as hard as it can, and only if it still cannot keep usage under the limit does the OOM killer fire. But now it fires inside the cgroup, against the job, and not against whatever else the kernel fancied.
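
To see where a cgroup sits relative to those two thresholds, the kernel exposes current usage next to the limits. The following is a rough sketch rather than a polished tool; mem_report is a made-up name, and it assumes the standard cgroup v2 files memory.current, memory.high and memory.max are present in the directory it is given:

```shell
#!/bin/sh
# Hypothetical helper: print usage and both limits for the cgroup directory $1.
# All three files hold byte counts, except that an unset limit reads back
# as the literal string "max".
mem_report() {
    dir=$1
    cur=$(cat "$dir/memory.current")
    high=$(cat "$dir/memory.high")
    max=$(cat "$dir/memory.max")
    echo "current=$cur high=$high max=$max"
    # Only compare numerically when memory.high is an actual number.
    if [ "$high" != max ] && [ "$cur" -ge "$high" ]; then
        echo "state=above-high"
    else
        echo "state=ok"
    fi
}

# Usage:
#   mem_report /sys/fs/cgroup/batch
```

The limits read back in bytes, so the 512M written earlier shows up as 536870912.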

Letting systemd do it properly

Writing to cgroup.procs is fine for a one-off, but it does not survive a reboot and it is fiddly. The better answer is to let systemd own the cgroup, because systemd has been driving the v2 hierarchy for a while and it speaks this natively. For an ad-hoc run, systemd-run is enough:

systemd-run --scope -p MemoryMax=2G -p MemoryHigh=512M ./batch-job input.dat

For the real thing, the limits live in the unit file:

[Service]
MemoryHigh=512M
MemoryMax=2G
MemorySwapMax=0

MemorySwapMax=0 was the line that actually saved me. The host slowdown was never really the job's fault; it was swap thrashing dragging everything else into the mud. Deny the cgroup swap entirely and a memory-hungry job fails fast and clean inside its own box rather than slowly poisoning the whole machine.
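
If the unit file belongs to a package and you would rather not edit it, the same three settings can go in a drop-in. This is a sketch under assumptions: write_memory_dropin is a name I made up, memory.conf is an arbitrary file name, and batch-job.service in the usage comment is a stand-in for whatever the unit is really called:

```shell
#!/bin/sh
# Hypothetical helper: write a memory-limit drop-in into directory $1.
# $2 = value for MemoryHigh, $3 = value for MemoryMax.
# Swap is always denied for the unit.
write_memory_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/memory.conf" <<EOF
[Service]
MemoryHigh=$2
MemoryMax=$3
MemorySwapMax=0
EOF
}

# Usage (as root), assuming a unit named batch-job.service:
#   write_memory_dropin /etc/systemd/system/batch-job.service.d 512M 2G
#   systemctl daemon-reload
#   systemctl restart batch-job.service
```

systemd merges drop-ins from a directory named after the unit plus a .d suffix, so the original unit file stays untouched.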

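You can watch it work. systemctl status on the unit shows current memory against the limit, and memory.events in the cgroup directory counts how often high and max have been tripped. The listing here is from the hand-made batch cgroup; for a systemd unit, the directory is whatever systemctl show -p ControlGroup reports for it, relative to /sys/fs/cgroup: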

cat /sys/fs/cgroup/batch/memory.events
# low 0
# high 1432
# max 3
# oom 0
# oom_kill 0

That high 1432 is the system quietly throttling the job more than a thousand times instead of letting it run free. Three trips to max, each one resolved by reclaim, zero kills, and the host never noticed.
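
Those counters are easy to pull out in a script, which is handy for alerting only on the events you care about. A small sketch; events_field and had_oom_kill are hypothetical names of my own, and the only thing assumed about memory.events is its "name value" line format:

```shell
#!/bin/sh
# Hypothetical helper: print the value of counter $2 from memory.events file $1.
# Prints nothing if the counter is not present.
events_field() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Hypothetical helper: succeed if the cgroup has recorded at least one OOM kill.
# A missing counter is treated as zero.
had_oom_kill() {
    n=$(events_field "$1" oom_kill)
    [ "${n:-0}" -gt 0 ]
}

# Usage:
#   events_field /sys/fs/cgroup/batch/memory.events high
#   had_oom_kill /sys/fs/cgroup/batch/memory.events && echo "job was killed"
```

A rising high count on its own is not a problem, since that is the throttle doing its job. oom_kill going above zero is the one worth a log line.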

The job still has its bad days. The difference is that a bad day is now a contained, boring failure with a log line, rather than a 2am page because a single process decided to eat the server. I did not fix the bug. I put a fence around it, and a good fence is sometimes worth more than a fix you do not have time to write.