A Runaway Process, And The cgroup That Caught It

The pager went off at 06:40 because a box that hosts half a dozen unrelated services had gone unresponsive. Not down. Unresponsive, which is worse, because the monitoring agent was alive enough to say "I'm fine" right up until it wasn't. SSH eventually let me in after about forty seconds of staring at a blank prompt, and top told the whole story in one line: a data importer that normally sits at a couple of hundred megabytes was sat at eleven gigabytes and climbing.

The importer had hit a malformed input file and was buffering the entire thing in memory before parsing. A bug, certainly, and one I fixed properly later. But the thing that actually annoyed me was that one misbehaving process could take everything else on the host down with it. The web frontend, the metrics collector, the lot, all gasping for memory because the importer was greedy and nothing stopped it.

nice does nothing for memory

My first reflex, the lazy one, was to reach for nice. But nice only touches CPU scheduling. It does nothing for memory pressure, and memory was the problem. You can renice a process into the ground and it will still happily allocate until the OOM killer wakes up and shoots something, often not the thing you wanted shot.

The OOM killer had in fact fired, eventually. It just took its time, and when it did, its heuristics picked off a long-running service with a big resident set rather than the importer that was actually at fault. That is the OOM killer working as designed and being completely unhelpful, both at once.

a slice with a ceiling

The right answer on a modern systemd box is cgroups v2, and it is much nicer to use than the v1 era of mounting controllers by hand. I run the importer under systemd anyway, so all it took was a drop-in:

# /etc/systemd/system/importer.service.d/limits.conf
[Service]
MemoryHigh=1G
MemoryMax=1500M
CPUQuota=50%

MemoryHigh is the soft limit: cross it and the kernel starts aggressively reclaiming pages from this cgroup and throttling its allocations, which slows the process down rather than killing it. MemoryMax is the hard wall. Hit that, and if reclaim can't bring usage back under it, the OOM killer fires, but crucially it fires inside the cgroup, so it kills the importer and leaves everything else on the box untouched. The blast radius is now one service instead of the whole machine.
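
To make the two thresholds concrete, here is a minimal sketch of the three zones a cgroup's usage can sit in. The function name memory_zone and its labels are my own invention for illustration, not anything systemd or the kernel provides; all values are in bytes.

```shell
# Hypothetical helper: classify current usage against MemoryHigh and MemoryMax.
# Arguments: usage, high, max (all in bytes).
memory_zone() {
    usage=$1
    high=$2
    max=$3
    if [ "$usage" -ge "$max" ]; then
        # At the hard wall: the kernel must reclaim or OOM-kill inside the cgroup.
        echo "at-max"
    elif [ "$usage" -gt "$high" ]; then
        # Over the soft limit: reclaim and allocation throttling kick in.
        echo "throttled"
    else
        echo "ok"
    fi
}
```

With the limits from the drop-in, memory_zone 1200000000 1073741824 1572864000 prints throttled: over MemoryHigh, still under MemoryMax.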

CPUQuota=50% was belt and braces: it caps the unit at half of one CPU's time. The importer doesn't need a full core, and capping it means a runaway loop can't starve the other services of CPU either.
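
On cgroup v2 a CPU quota ends up in the unit's cpu.max file as two numbers, the quota and the period, both in microseconds. With systemd's default 100ms period, 50% should read back as 50000 100000. A small sketch that turns that pair back into a percentage; the function name is hypothetical, and the path in the usage note assumes the unit sits under system.slice.

```shell
# Hypothetical helper: convert the contents of cpu.max ("<quota> <period>")
# into a percentage of one CPU. A quota of "max" means no limit is set.
cpu_quota_percent() {
    # Intentionally unquoted so the shell splits the two fields.
    set -- $1
    if [ "$1" = "max" ]; then
        echo "unlimited"
    else
        echo "$(( $1 * 100 / $2 ))%"
    fi
}
```

Usage would look like cpu_quota_percent "$(cat /sys/fs/cgroup/system.slice/importer.service/cpu.max)".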

After systemctl daemon-reload and a restart, I fed it the same malformed file deliberately. It climbed to 1.5G, got killed, systemd restarted it, it climbed again, got killed again. Annoying, but contained. The rest of the host never noticed.
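
To confirm that those kills came from the cgroup limit and not a host-wide OOM, the kernel log is a reasonable place to look. As far as I know, cgroup-scoped kills are logged with the prefix "Memory cgroup out of memory", while global ones start with "Out of memory". A sketch that counts each kind from log lines on stdin; the function names are mine, and the exact message wording may differ between kernel versions.

```shell
# Hypothetical helpers: count OOM kills in kernel log lines read from stdin.
# Assumes the kernel's usual message prefixes; wording can vary by version.
count_cgroup_ooms() {
    grep -c 'Memory cgroup out of memory'
}

count_global_ooms() {
    # Case-sensitive on purpose: the cgroup message has a lowercase "out".
    grep -c 'Out of memory'
}
```

Fed from the kernel log, that is journalctl -k | count_cgroup_ooms. A non-zero cgroup count next to a zero global count is the outcome the limits are meant to produce.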

verify, don't assume

The bit people skip is checking the limit actually applies. cgroup v2 exposes everything through the unified hierarchy, so you can just read it back:

$ systemctl show importer.service -p MemoryMax -p MemoryHigh
MemoryMax=1572864000
MemoryHigh=1073741824

$ cat /sys/fs/cgroup/system.slice/importer.service/memory.current
214893056

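That last file is the live usage, and watching it under load is far more honest than top, because it's the number the kernel actually enforces against. It counts everything charged to the cgroup, page cache included, so it will usually read higher than the resident set top shows.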
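
The numbers systemctl reads back are plain bytes, and systemd treats the M and G suffixes as base 1024, which is why 1500M shows up as 1572864000. A minimal sketch of that arithmetic, with helper names of my own choosing:

```shell
# Hypothetical helpers for reading cgroup sizes.
# to_bytes: expand a size with an M or G suffix using base 1024.
to_bytes() {
    case "$1" in
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

# to_mib: convert a byte count to whole MiB (integer division).
to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}
```

to_bytes 1500M prints 1572864000, and to_mib 214893056 prints 204, so the importer in the readout above is idling at roughly 200 MiB against a 1500 MiB wall.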

I did fix the importer to stream the file instead of slurping it, which was the real bug. But I'd rather not rely on every process being well-behaved. A limit you've set and verified is worth more than a promise you've made to yourself about future code. The cgroup doesn't care how good my intentions were at 06:40.