cgroups v2 and a Runaway Process

A batch job on a shared host decided to allocate memory until there wasn't any left. Not a leak exactly, just an honest piece of code given a pathological input, fanning out into millions of objects it never freed. By the time anyone noticed, the OOM killer had been around the houses, sshd was struggling to fork, and the box was the special kind of alive where it answers ping and nothing else. The job was the problem. The shared host with no limits was the real problem.

The whole point of cgroups v2 is that this is a solved problem, and I'd just never bothered to wire it up for ad-hoc work. So let's fix that.

The quick containment

The first thing I wanted was a way to run something untrusted-by-circumstance without it being able to eat the host. systemd-run makes that a one-liner:

systemd-run --user --scope \
  -p MemoryMax=2G \
  -p CPUQuota=200% \
  ./the-batch-job --input big.json

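That drops the job into its own transient scope under cgroups v2. MemoryMax=2G is a hard ceiling: if the job reaches it and the kernel can't reclaim enough to get back under, the OOM killer acts inside that cgroup only. The job dies, the host carries on, sshd keeps forking, and I keep my afternoon. CPUQuota=200% caps it at two cores' worth of CPU time so it can't starve everything else either. One caveat: with --user, only the controllers delegated to your user manager are enforced. Memory normally is; cpu may not be on older systemd setups, in which case the quota may not take effect and the job is better run as a system scope with root.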
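
To check that the limit actually landed, it helps to read it back from inside the scope. Here's a minimal sketch, assuming a unified (v2-only) hierarchy mounted at /sys/fs/cgroup; cgroup_dir and show_memory_max are names I made up for illustration, not part of any tool.

```shell
#!/bin/sh
# Print the cgroup v2 directory for a process, given its /proc/<pid>/cgroup file.
# On cgroups v2 the relevant line has the form "0::/some/path".
cgroup_dir() {
    procfile="${1:-/proc/self/cgroup}"
    root="${2:-/sys/fs/cgroup}"   # assumed mount point of the v2 hierarchy
    # Keep only the v2 entry and strip its "0::" prefix; legacy v1 lines are ignored.
    rel=$(sed -n 's/^0:://p' "$procfile" | head -n 1)
    [ -n "$rel" ] || return 1
    printf '%s%s\n' "$root" "$rel"
}

# Print the memory.max value of the cgroup the current shell lives in.
# Prints "max" when no ceiling is set on that cgroup.
show_memory_max() {
    dir=$(cgroup_dir) || return 1
    cat "$dir/memory.max"
}
```

Saved as, say, cgroup-check.sh (a hypothetical filename), it can be sourced inside the scope: systemd-run --user --scope -p MemoryMax=2G sh -c '. ./cgroup-check.sh; show_memory_max'. That should print the ceiling in bytes rather than "max".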

Making it stick

One-off scopes are fine for firefighting, but the proper fix is a slice. I put all the batch work under one, so the category is bounded rather than each invocation:

# /etc/systemd/system/batch.slice
[Slice]
MemoryMax=8G
MemoryHigh=6G
CPUWeight=50
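
The slice only bounds what actually runs inside it, so jobs have to be launched there. A minimal sketch, assuming the unit file above is installed and that you have root (it's a system slice, so the transient service is created by the system manager); batch_run is a hypothetical helper name.

```shell
#!/bin/sh
# One-time step after creating or editing the unit file:
#   systemctl daemon-reload

# Run a command as a transient service inside batch.slice.
#   --same-dir  keep the caller's working directory (services otherwise start in /)
#   --wait      block until the job has finished
#   --collect   unload the transient unit afterwards, even if the job failed
#   --pipe      connect the job's stdin/stdout/stderr to this terminal
# Extra systemd-run options (for example -p MemoryMax=2G) pass through via "$@".
batch_run() {
    systemd-run --slice=batch.slice --same-dir --wait --collect --pipe "$@"
}
```

From a root shell that becomes batch_run ./the-batch-job --input big.json. The job runs as root unless you pass --uid= with a less privileged user, and a per-job -p MemoryMax= still works inside the slice, so one invocation can have a tighter ceiling than the shared 8G.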

MemoryHigh is the underrated setting here. Where MemoryMax is the wall, MemoryHigh is a soft threshold: cross it and the kernel aggressively reclaims and throttles the cgroup rather than killing it outright. In practice that turns "instant OOM" into "runs slowly and noisily", which is a far better signal. You watch it with:

systemd-cgtop
cat /sys/fs/cgroup/batch.slice/memory.current

memory.current is the live usage, and memory.events will show you the high and max counters ticking up so you know which limit a job is leaning on.
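
Reading the counters by eye gets old, so here's a small sketch that boils memory.events down to the three numbers I care about. It assumes the batch.slice path above as a default; memory_events_summary is a made-up helper name.

```shell
#!/bin/sh
# Summarise a cgroup v2 memory.events file, which holds "name count" pairs.
#   high      times the cgroup was throttled for exceeding memory.high
#   max       times usage was about to go over memory.max
#   oom_kill  processes in the cgroup killed by the OOM killer
memory_events_summary() {
    events="${1:-/sys/fs/cgroup/batch.slice/memory.events}"
    awk '
        $1 == "high"     { high = $2 }
        $1 == "max"      { max = $2 }
        $1 == "oom_kill" { kills = $2 }
        END { printf "high=%d max=%d oom_kill=%d\n", high, max, kills }
    ' "$events"
}
```

A rising high count means the job is being throttled at MemoryHigh; a rising max or oom_kill count means it has reached MemoryMax.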

What I actually learned

The technical fix took ten minutes. The lesson took longer: any host that runs work of unknown appetite should have that work boxed in before it misbehaves, not after. cgroups v2 plus systemd slices gives you that for free, and you don't need a container runtime or a scheduler to get it. A .slice file and a MemoryMax would have turned my dead host into a single failed job nobody noticed. That host has both now.