the runaway process, revisited: cpu and io weights in cgroups v2


Last week I put a memory ceiling around a job that liked to eat all the RAM, and the box stopped falling over. Good. Except the same job has a second bad habit: when it does run, it pins every core and saturates the disk, and everything else on the machine turns to treacle. The OOM killer is no longer involved, but the experience is still miserable. So this is the same fight from a different angle: CPU and IO instead of memory.

The key idea in cgroups v2 is that you mostly do not want hard caps for this. You want weights. A cap wastes capacity when the rest of the box is idle; a weight only matters when there is contention, and then it splits the resource in proportion to each group's weight.

cpu weight, not cpu quota

systemd exposes the v2 CPU controller's weight as CPUWeight, from 1 to 10000, default 100 (IOWeight, covered further down, uses the same scale):

[Service]
CPUWeight=20
IOWeight=20

CPUWeight=20 says: when the CPU is contended, give this cgroup roughly a fifth of the share a default unit would get. When nothing else wants the cores, it still gets them all, so the job finishes just as fast at 3am when the box is quiet. That asymmetry is exactly what I wanted and is the bit a hard CPUQuota would have thrown away.
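
To put numbers on "roughly a fifth", here is a minimal sketch of the arithmetic. It assumes every competing cgroup is a sibling under the same parent and all of them want CPU at the same moment. The function name and the unit names are made up for illustration; the kernel scheduler does the real work, and actual shares move around with how busy each group is.

```python
def contended_shares(weights):
    """Return each cgroup's fraction of CPU when every sibling is runnable.

    weights: dict mapping a (hypothetical) unit name to its CPUWeight.
    Assumes all units are siblings under one parent and all want CPU.
    """
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# One throttled job (CPUWeight=20) against one default unit (CPUWeight=100).
shares = contended_shares({"batch-job": 20, "web": 100})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Against a single busy default unit that works out to about 17% for the job and 83% for the other unit, a 1:5 ratio. Take the other unit out of the dict (it went idle) and the job's share is 100%, which is the 3am case.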


io weight needs the right backend

IOWeight is the same idea for disk bandwidth, but with a catch worth knowing: the v2 IO controller needs a suitable backend to actually do the proportional thing. With bfq as the active scheduler for the device it behaves; under other schedulers, unless the kernel's io.cost controller has been set up, the weight is more of a polite suggestion than a guarantee. I checked what I had:

cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq  -> the [ ] is the active one
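
If there is more than one disk, the same check can be scripted. This is a rough sketch that assumes the usual sysfs layout under /sys/block and the bracket convention shown above; the function names are mine, not part of any tool.

```python
import re
from pathlib import Path


def active_scheduler(line):
    """Return the bracketed scheduler name from a queue/scheduler line, or None."""
    match = re.search(r"\[([^\]]+)\]", line)
    return match.group(1) if match else None


def schedulers_by_device(sys_block="/sys/block"):
    """Map each block device name to its active scheduler (None if unreadable)."""
    result = {}
    for path in sorted(Path(sys_block).glob("*/queue/scheduler")):
        device = path.parent.parent.name  # /sys/block/<device>/queue/scheduler
        try:
            result[device] = active_scheduler(path.read_text())
        except OSError:
            result[device] = None
    return result


if __name__ == "__main__":
    for device, scheduler in schedulers_by_device().items():
        print(f"{device}: {scheduler}")
```

Any device that does not report bfq is one where I would test the weight under real load before trusting it.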

If you genuinely need IO isolation you can fall back to IOReadBandwidthMax and IOWriteBandwidthMax as a hard limit per device, but I would reach for the weight first and only cap if the weight is not being honoured.

the result

With weights in place the job now runs whenever it likes and barely registers. Under load it politely steps aside; idle, it sprints. I did not make it faster and I did not make it nicer. I just told the kernel who matters when there is not enough to go round, and that turned out to be the whole problem all along.