Ramblings of an aging IT geek
linux

the night a build job ate the whole machine

A runaway compile job took out a build host, and the fix was finally learning to use cgroups v2 properly instead of fighting it.

The build host fell over at about half four in the afternoon, which is exactly when you don't want a build host to fall over. SSH still answered, eventually, after a long enough pause that I'd already started reaching for the IPMI console. The load average was 340-something. Something had forked itself into a small standing army.

The culprit was a CI job. Somebody's Makefile had make -j with no number after it, which tells make to run as many jobs at once as it can find, with no upper bound, and the machine had a lot of work to hand it. Each cc1plus wanted the better part of a gigabyte, swap filled, and the OOM killer started swinging. Because everything was sharing one big undifferentiated pool of memory and CPU, it happily took out things that had nothing to do with the build. The monitoring agent went. Then sshd's children started getting reaped. Great fun.

I rebooted it, told the team to put a number after -j, and then sat down to do the thing I'd been putting off for about a year: actually understand cgroups v2 rather than copying systemd directives off the internet and hoping.

why v1 never quite stuck for me

I'd used cgroups before, in the v1 sense, which mostly meant cpu, memory and blkio as separate hierarchies that didn't know about each other. You could put a process in the memory cgroup over here and the CPU cgroup over there, and the two had no shared notion of "this is the same job". It worked, but every time I tried to reason about it I ended up with a diagram that looked like a tube map.

v2 fixes the thing that actually bothered me: there is one unified hierarchy. A process lives in exactly one cgroup, and that cgroup has controllers enabled on it. CPU, memory, IO, all hanging off the same tree. You read /sys/fs/cgroup and it's a directory of directories, each one a group, each with a cgroup.controllers file listing the controllers available to it and a cgroup.subtree_control file showing which of those are switched on for its children.
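A quick way to see the "one process, one cgroup" rule for yourself is to ask /proc where a process lives. This is a sketch of mine, not something from the incident: v2_cgroup_path is a made-up helper name, and it assumes the unified layout, where /proc/PID/cgroup carries a line starting with 0:: followed by the path.

```shell
# Hypothetical helper: read the contents of /proc/<pid>/cgroup on stdin and
# print the cgroup v2 path. The v2 entry has the form "0::/some/path";
# v1 entries use a non-zero hierarchy ID and a controller name.
v2_cgroup_path() {
    awk -F: '$1 == "0" && $2 == "" { print substr($0, 4); exit }'
}

# Example: where does this shell live, and which controllers are available there?
if [ -r /proc/self/cgroup ]; then
    path=$(v2_cgroup_path < /proc/self/cgroup)
    echo "cgroup: ${path:-none found}"
    if [ -n "$path" ]; then
        cat "/sys/fs/cgroup${path}/cgroup.controllers" 2>/dev/null || true
    fi
fi
```

Inside a container the path often comes back as plain /, because the container only sees its own subtree.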

On this box, running a reasonably current distro, v2 was already mounted as the default:

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

If you see the v1 tmpfs mount with a dozen subdirectories instead, you're on the hybrid or legacy layout, and you can push it over to unified with systemd.unified_cgroup_hierarchy=1 on the kernel command line. I'll spare you the reboot dance.
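If you'd rather not eyeball mount output, the filesystem type at /sys/fs/cgroup tells the same story. A minimal sketch, assuming GNU stat; cgroup_layout is a hypothetical helper name, and anything it doesn't recognise is reported as unknown rather than guessed at.

```shell
# Hypothetical helper: classify the filesystem type mounted at /sys/fs/cgroup,
# as printed by `stat -fc %T`. On the unified layout that is cgroup2fs; on the
# hybrid and legacy layouts the mount point is a tmpfs holding v1 hierarchies.
cgroup_layout() {
    case "$1" in
        cgroup2fs) echo "unified" ;;
        tmpfs)     echo "hybrid or legacy" ;;
        *)         echo "unknown" ;;
    esac
}

cgroup_layout "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
```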

the bit that would have saved the afternoon

The whole incident comes down to one number not being set: a memory limit on the build's slice. Under systemd, every service and every user session already lives in a cgroup, so you don't have to build the tree by hand. You just set the knobs.

For the CI runner, that meant a drop-in:

# /etc/systemd/system/ci-runner.service.d/limits.conf
[Service]
MemoryMax=24G
MemoryHigh=20G
CPUWeight=50
TasksMax=2000

MemoryMax is the hard wall. Hit it, and if the kernel can't reclaim enough to get back under, processes in that cgroup get the OOM killer. Crucially, it's contained: the kernel picks its victim from inside the offending cgroup, so a runaway build takes itself out instead of taking out sshd. MemoryHigh is the softer one, a point at which the kernel starts aggressively reclaiming and throttling before usage reaches the hard limit. The gap between the two gives you a warning zone rather than a cliff.
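To tell after the fact whether a job only brushed the soft limit or actually hit the wall, the cgroup's memory.events file keeps counters. The sketch below is an illustration, not what ran on the build host: mem_event is a hypothetical helper, and the path assumes the runner is a system service sitting under system.slice.

```shell
# Hypothetical helper: print one counter from memory.events text on stdin.
# Useful keys: "high" (times the group went over memory.high and was
# throttled into reclaim), "max" (times it ran into memory.max) and
# "oom_kill" (processes in the group killed by the OOM killer).
mem_event() {
    awk -v key="$1" '$1 == key { print $2; found = 1 } END { if (!found) print 0 }'
}

# Assumed location of the runner's cgroup.
events=/sys/fs/cgroup/system.slice/ci-runner.service/memory.events
if [ -r "$events" ]; then
    echo "throttled at MemoryHigh: $(mem_event high < "$events")"
    echo "OOM kills inside the cgroup: $(mem_event oom_kill < "$events")"
fi
```

A rising high count with oom_kill still at zero means the warning zone is doing its job.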

CPUWeight is relative, not a cap. A weight of 50 against the default 100 means that when there's contention, this service gets roughly half the CPU time of a sibling running at the default weight. When the box is idle the build can still use everything, which is what you want. If you genuinely need a ceiling, that's CPUQuota=, but I find weights cause fewer surprises.
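The arithmetic behind the weights is just a proportion: under full contention a group gets its own weight divided by the sum of the weights of every busy sibling, itself included. A small sketch to make that concrete; cpu_share_pct is a hypothetical helper, and it ignores anything deeper in the tree than one level of siblings.

```shell
# Hypothetical helper: first argument is this group's weight, the rest are
# the weights of the siblings that are busy at the same time. Prints the
# percentage of CPU this group gets under full contention:
#   mine / (mine + siblings)
cpu_share_pct() {
    mine=$1
    total=0
    for w in "$@"; do          # "$@" includes our own weight
        total=$((total + w))
    done
    awk -v m="$mine" -v t="$total" 'BEGIN { printf "%.1f\n", 100 * m / t }'
}

cpu_share_pct 50 100        # one default-weight sibling  -> 33.3
cpu_share_pct 50 100 100    # two default-weight siblings -> 20.0
```

So against a single busy neighbour at the default weight, the build gets a third of the CPU and the neighbour gets two thirds.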

You can confirm the limits actually took with systemctl show:

$ systemctl show ci-runner.service -p MemoryMax -p MemoryHigh
MemoryMax=25769803776
MemoryHigh=21474836480
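Those byte counts are the same limits, since systemd parses the G suffix as base 1024. If you don't fancy doing the division in your head, a throwaway helper does it; bytes_to_gib is a hypothetical name and it only prints whole GiB.

```shell
# Hypothetical helper: convert a byte count to whole GiB (1 GiB = 1024^3 bytes).
bytes_to_gib() {
    echo $(( $1 / 1073741824 ))
}

bytes_to_gib 25769803776    # MemoryMax  -> 24
bytes_to_gib 21474836480    # MemoryHigh -> 20
```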

watching it work

The genuinely useful thing v2 gives you for free is pressure stall information (PSI), as long as the kernel has it enabled. Every cgroup has memory.pressure, cpu.pressure and io.pressure files, and they tell you what percentage of time tasks were stalled waiting for that resource. Not "is memory full", but "is anything actually suffering because of it", which is a far better question.

$ cat /sys/fs/cgroup/system.slice/ci-runner.service/memory.pressure
some avg10=4.21 avg60=2.88 avg300=1.10 total=9281736
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

some is "at least one task stalled", full is "every non-idle task stalled at once". The avg figures are percentages over the last 10, 60 and 300 seconds. On a healthy build you'll see some tick up under load and full stay near zero. If full starts climbing, the cgroup is thrashing and you've drawn the limits too tight. During the incident, before any of this existed, the whole machine was effectively at full and there was no per-job number to point at. Now there is.
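Because the pressure files are plain key=value text, pulling one number out for a script is a few lines of awk. A sketch under my own naming: psi_avg and psi_above are hypothetical helpers, and the threshold is whatever you pass in, not a recommendation.

```shell
# Hypothetical helper: read a *.pressure file on stdin and print one average.
#   $1 = which line: "some" or "full"
#   $2 = which window: avg10, avg60 or avg300
psi_avg() {
    awk -v kind="$1" -v win="$2" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == win) print kv[2]
            }
        }'
}

# Exits 0 when the chosen average is strictly above the threshold in $3.
psi_above() {
    value=$(psi_avg "$1" "$2")
    awk -v v="$value" -v limit="$3" 'BEGIN { exit !(v + 0 > limit + 0) }'
}

# Example (path assumes the runner lives under system.slice):
#   psi_above full avg60 10 \
#       < /sys/fs/cgroup/system.slice/ci-runner.service/memory.pressure \
#       && echo "ci-runner is thrashing"
```

Using avg60 rather than avg10 smooths out short spikes, which matters if the check feeds an alert.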

what I actually changed

Three things, in the end. The build slices got real memory limits so a runaway compile is now a failed build rather than an outage. The monitoring agent and sshd got moved into their own protected slice with MemoryMin= set, so the kernel treats their memory as load-bearing and won't reclaim it below that floor, however hard the rest of the box is squeezed. And I wrote a tiny alert on full pressure over 10% sustained, because that's the early sign of the same problem coming back wearing a different hat.
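For the protected slice, the shape of it is a slice unit carrying the floor plus a drop-in per service that moves it in. What follows is a sketch, not my actual config: protected.slice, the drop-in file name and the 256M figure are all placeholders, and the script only writes into a staging directory so you can read the files before they go anywhere near /etc/systemd/system.

```shell
# Sketch only: unit names and the memory floor are assumptions for illustration.
#   $1 = staging directory to write into
#   $2 = memory floor, in systemd syntax (e.g. 256M)
write_protected_units() {
    dest=$1
    floor=$2
    mkdir -p "$dest/sshd.service.d"

    # The slice itself: a memory floor for everything placed under it.
    cat > "$dest/protected.slice" <<EOF
[Slice]
MemoryMin=$floor
EOF

    # Drop-in that moves sshd into the slice and gives it the same floor.
    cat > "$dest/sshd.service.d/protected.conf" <<EOF
[Service]
Slice=protected.slice
MemoryMin=$floor
EOF
}

staging=$(mktemp -d)
write_protected_units "$staging" 256M
echo "review the files under $staging"
```

After copying them into place it needs a systemctl daemon-reload and a restart of the service. On Debian-family systems the unit is ssh.service rather than sshd.service. If you also want the OOM killer itself to steer away from a service, that is a separate knob, OOMScoreAdjust=.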

None of this is new. cgroups v2 has been stable for years and systemd has exposed these knobs for ages. The only thing that was new was me bothering to learn the model instead of treating it as magic that systemd does on my behalf. The unified hierarchy makes it genuinely simple to reason about: one tree, one place a process lives, controllers you can list with cat. I should have done it before the build host taught me the lesson at half four on a Wednesday, but here we are.