the afternoon one PHP worker tried to eat the box

The alert was the boring kind: load average climbing, then not stopping. By the time I was on the box, top showed one worker pinned at 400% CPU and the rest of the machine carrying on as if nothing had happened. A misbehaving job had got itself into a loop and would happily have consumed every core we had.

A couple of years ago this would have been a fire. SSH in, fight the scheduler for a slice of CPU long enough to find the offender, kill it, apologise to everyone whose request had timed out in the meantime. The difference this time was that we had already moved this service onto cgroups v2, and the unified hierarchy did the boring, correct thing without me.

what actually held it back

The service runs under a systemd slice with a handful of limits set. Nothing exotic:

[Slice]
CPUWeight=100
CPUQuota=400%
MemoryMax=6G
MemoryHigh=5G

CPUQuota=400% is the one that mattered here. The runaway worker wanted all sixteen cores; the slice said it could have four cores' worth of CPU time, and that was the end of the negotiation. The rest of the box stayed responsive. My SSH session was snappy, the database next door never noticed, and the only symptom anyone outside saw was that this one service got slower.
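For a sense of what that 400% becomes underneath: on cgroups v2 the CPU quota ends up in the slice's cpu.max file as a quota and a period, both in microseconds. Here is a minimal Python sketch of the arithmetic, assuming the default 100ms period (CPUQuotaPeriodSec= would change it). The function name is made up for illustration and is not part of systemd.

```python
def cpu_quota_to_cpu_max(quota_percent: float, period_usec: int = 100_000) -> str:
    """Return the 'quota period' pair a CPUQuota percentage corresponds to.

    Assumes the 100ms default period; both values are in microseconds.
    """
    quota_usec = round(period_usec * quota_percent / 100)
    return f"{quota_usec} {period_usec}"


print(cpu_quota_to_cpu_max(400))  # 400000 100000
```

Read that as: in every 100ms window the whole slice gets 400ms of CPU time to spend across however many cores it likes, and once it has spent it, it waits for the next window.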

You can watch it happen, which is the part I find satisfying. Under v2 everything hangs off /sys/fs/cgroup in one tree rather than the old split-per-controller mess of v1:

$ cat /sys/fs/cgroup/myworker.slice/cpu.stat
usage_usec 9123847221
nr_periods 41028
nr_throttled 40791
throttled_usec 7733120044

nr_throttled near enough equal to nr_periods is the runaway, caught red-handed. Nearly every scheduling period it asked for more than its quota and got told no. That throttled_usec figure is, roughly, all the damage that didn't happen to the rest of the machine.
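If you'd rather have a number than eyeball two counters, the ratio is easy to compute. This is a rough Python sketch that parses the cpu.stat text shown above and reports how often the group was throttled. The helper names are hypothetical, and the sample is pasted in as a string so the snippet runs anywhere.

```python
def parse_flat_keyed(text: str) -> dict[str, int]:
    """Parse 'key value' lines, the layout cpu.stat uses."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(stats: dict[str, int]) -> float:
    """Fraction of enforcement periods in which the group was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# The figures from the cpu.stat output above.
SAMPLE = """\
usage_usec 9123847221
nr_periods 41028
nr_throttled 40791
throttled_usec 7733120044
"""

stats = parse_flat_keyed(SAMPLE)
print(f"throttled in {throttle_ratio(stats):.1%} of periods")  # 99.4%
print(f"throttled for {stats['throttled_usec'] / 1_000_000:.0f}s in total")  # 7733s
```

To point it at a live machine, swap SAMPLE for the contents of the slice's cpu.stat file, for example via pathlib.Path(...).read_text().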

MemoryHigh is the other quiet hero. It's a soft limit: cross it and the kernel starts reclaiming and throttling allocations under pressure rather than waiting for MemoryMax and reaching for the OOM killer. The worker slowed down well before it could swell up and force a kill, which gave me time to look rather than time to panic.
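The memory side leaves a trail too. The slice's memory.events file keeps counters, including high (times the group was throttled into reclaim for crossing the soft limit) and oom_kill (processes killed). Below is a small Python sketch that tells the two outcomes apart. The sample numbers are invented for illustration, not output from this incident, and the function names are hypothetical.

```python
def parse_memory_events(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def describe(events: dict[str, int]) -> str:
    """Summarise whether the soft limit or the hard limit did the work."""
    if events.get("oom_kill", 0) > 0:
        return "hard limit hit: the OOM killer has fired"
    if events.get("high", 0) > 0:
        return "soft limit doing its job: throttled, nothing killed"
    return "no memory pressure events recorded"


# Made-up values for illustration only.
SAMPLE = """\
low 0
high 1284
max 0
oom 0
oom_kill 0
"""

print(describe(parse_memory_events(SAMPLE)))
```

A rising high count with oom_kill still at zero is the pattern described above: the worker is being slowed down, not shot.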

the fix, and the lesson

The actual bug was dull: a date-range query that occasionally went out with no upper bound. Found it, fixed it, deployed. Fifteen minutes, most of which was reading the query.

What I took from the afternoon wasn't really about cgroups; it was about blast radius. The old instinct is to make every service as fast as possible and trust it not to misbehave. The better instinct is to assume something will misbehave and decide in advance how much of the machine it's allowed to take with it. A quota that says "four cores, no more" costs you a little headroom on a good day and saves you the whole box on a bad one.

If you're still on v1 and your distro has moved on, the migration is mostly painless now and systemd has done the awkward parts for you. Set a quota on the things most likely to run away. Then go and read cpu.stat next time something climbs. It's a much nicer feeling, watching a process be politely refused, than watching it win.