caging a runaway process with cgroups v2

There was a batch job on a shared host that behaved itself ninety-five percent of the time. The rest of the time, on a particular shape of input, it ballooned its memory until the host went into swap and the OOM killer woke up in a foul mood and took out something important that had nothing to do with the batch job. The job was the cause; whatever the kernel decided to sacrifice was the victim. That's the worst kind of incident, because the page goes to the team whose service died, not the team whose job misbehaved.

The right answer wasn't to fix the job, or not only that. The job had a bug, sure, but the deeper problem was that one process could consume the whole machine. On a shared host, that's a design flaw regardless of whose code triggers it. The fix is to put a ceiling on what the job is allowed to take, so that when it runs away it hits a wall and dies, instead of running off the edge of the host and pulling its neighbours over with it.

a slice with limits

On a modern systemd host with cgroups v2, this is genuinely a few lines. You don't write to the cgroup filesystem by hand; you let systemd own the hierarchy and you express the limits as unit properties. I gave the job its own slice:

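# /etc/systemd/system/batch.slice
[Slice]
# Hard ceiling: the OOM killer acts inside the slice once this is reached
MemoryMax=2G
# Soft ceiling: reclaim and throttling start here
MemoryHigh=1500M
# Two cores' worth of CPU time in total
CPUQuota=200%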

MemoryMax is the hard wall. If the slice reaches it and the kernel can't reclaim enough memory to get back under, the OOM killer runs, but it picks its victim from inside this slice rather than rummaging through the whole host for something to kill. MemoryHigh is the softer one I actually care about more: at 1500M the kernel starts aggressively reclaiming and throttling the cgroup, which slows the runaway down and gives it a chance to either finish or fail cleanly before it hits the hard limit. CPUQuota=200% caps the slice at two cores' worth of CPU time, so a runaway can't peg every core either. All three limits apply to the slice as a whole, so anything else placed in it shares the same budget.
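One way to see which of the two limits a runaway is actually hitting is to read the slice's memory.events file, where the kernel keeps counters for both. What follows is a minimal sketch, not part of the original setup. It assumes the cgroup v2 hierarchy is mounted at /sys/fs/cgroup and that batch.slice sits directly under it; the function names are made up for illustration.

```python
from pathlib import Path
from typing import Dict

# Assumed location: a top-level slice named batch.slice normally appears
# directly under the cgroup v2 mount point, /sys/fs/cgroup.
EVENTS_FILE = Path("/sys/fs/cgroup/batch.slice/memory.events")


def parse_memory_events(text: str) -> Dict[str, int]:
    """Turn the 'key value' lines of memory.events into a dict of counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not exactly "name <non-negative integer>".
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def summarize(counters: Dict[str, int]) -> str:
    """One-line summary of the two counters that map onto the slice limits."""
    high = counters.get("high", 0)          # times throttled over memory.high
    oom_kill = counters.get("oom_kill", 0)  # processes OOM-killed in the cgroup
    return f"throttled at MemoryHigh: {high}, OOM kills at MemoryMax: {oom_kill}"


if __name__ == "__main__":
    if EVENTS_FILE.exists():
        print(summarize(parse_memory_events(EVENTS_FILE.read_text())))
    else:
        print(f"{EVENTS_FILE} not found; is the slice active?")
```

A rising high counter with oom_kill still at zero suggests the soft limit is doing its job; a non-zero oom_kill means something in the slice ran into the hard wall.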

Then the job runs in the slice rather than wherever it lands by default:

# in the batch job's own .service unit
[Service]
Slice=batch.slice
ExecStart=/usr/local/bin/run-batch
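It's worth confirming that the job really did land in the slice, because a limit on a slice nothing runs in protects nobody. On cgroups v2, /proc/<pid>/cgroup has a single line of the form 0::/path that names the process's cgroup. The sketch below parses it; the unit name run-batch.service in the usage note is hypothetical (the unit file isn't named above), and the helper names are invented for illustration.

```python
from pathlib import Path
from typing import Optional


def unified_cgroup_path(proc_cgroup_text: str) -> Optional[str]:
    """Return the cgroup v2 path from the contents of /proc/<pid>/cgroup.

    The v2 entry has hierarchy ID 0 and an empty controller list,
    for example "0::/batch.slice/run-batch.service".
    """
    for line in proc_cgroup_text.splitlines():
        hierarchy_id, _, rest = line.partition(":")
        controllers, _, path = rest.partition(":")
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def is_in_slice(cgroup_path: Optional[str], slice_name: str) -> bool:
    """True if slice_name is one of the components of the cgroup path."""
    if cgroup_path is None:
        return False
    return slice_name in cgroup_path.strip("/").split("/")


def cgroup_of_pid(pid: int) -> Optional[str]:
    """Read and parse /proc/<pid>/cgroup for a live process (Linux only)."""
    return unified_cgroup_path(Path(f"/proc/{pid}/cgroup").read_text())
```

With the job running, is_in_slice(cgroup_of_pid(pid), "batch.slice") should be true for its main PID, and the path would look something like /batch.slice/run-batch.service.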

what changed

The next time the bad input came through, the job hit its memory ceiling and was killed inside its own slice. The host didn't notice. Nothing else died. The batch job's own monitoring picked up the non-zero exit and retried, and the on-call page that used to go to an innocent team for a service they didn't break simply never fired. The incident became a log line.

That's the bit worth holding onto. The cgroup didn't fix the bug; the job still had its memory leak on that one input shape, and we fixed that properly later. What the cgroup did was change the blast radius. An unbounded process on a shared host is a host-wide risk wearing the costume of a single job. Put it in a slice with MemoryMax and MemoryHigh, and the worst case stops being "the machine falls over" and becomes "the misbehaving thing dies, alone, and tells you why". cgroups v2 makes that almost free to set up, and I now reach for a slice on anything that runs alongside services I care about, well before it has misbehaved even once.