The symptom was a host that went unresponsive for a few minutes at unpredictable intervals. SSH would hang, monitoring would flap, and by the time I got a shell it had usually recovered, smug and innocent. Load average in the dozens, then fine. The classic "nothing's wrong now" that makes you doubt your own dashboards.
It was a data-processing job. Most runs were modest. But on certain inputs it would allocate far more than it should, push the box into swap, and from there everything fell apart: the kernel thrashing pages in and out, every other process starved, the whole machine effectively hostage to one greedy job. Not crashed, which would have been cleaner. Just brought to its knees.
The right long-term fix is to fix the job. But the job is somebody else's code, the box is shared, and "please stop using so much memory" is not a patch I can apply this afternoon. What I can do is stop one job from taking down everything around it. That's exactly what cgroups are for, and on this host that's cgroups v2 via systemd.
Fencing it in with a drop-in
I run the job under its own systemd service, so the cleanest lever is a drop-in that puts resource limits on that unit. With the unified hierarchy these settings map straight onto cgroup controller files.
# /etc/systemd/system/dataproc.service.d/limits.conf
[Service]
MemoryMax=4G
MemoryHigh=3G
CPUQuota=200%
MemoryHigh is the gentle one: cross it and the kernel throttles the cgroup and reclaims its pages aggressively, which slows the job down but lets it limp along. MemoryMax is the hard ceiling: if usage reaches it and reclaim can't bring it back down, the cgroup OOM-killer steps in and kills inside this cgroup, taking the job down rather than some random victim the global OOM-killer picked. CPUQuota=200% caps it at two cores' worth of CPU time so it can't monopolise the scheduler either. One caveat: neither memory setting limits swap, which has its own knob in MemorySwapMax=.
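One way to confirm the drop-in landed where I expect is to read the controller files back from the service's cgroup. What follows is a small POSIX shell sketch, not part of the original setup: the function names are made up for illustration, and it assumes systemd's default CPU quota period of 100ms, under which CPUQuota=200% should show up in cpu.max as "200000 100000".

```shell
# Hypothetical helpers for reading back cgroup v2 limits; names are illustrative.

# Convert the contents of cpu.max ("<quota> <period>" in microseconds, or
# "max <period>" when no quota is set) into a percentage of one CPU.
cpu_max_to_percent() {
    quota=${1%% *}     # everything before the first space
    period=${1##* }    # everything after the last space
    if [ "$quota" = "max" ]; then
        echo "unlimited"
    else
        echo "$(( quota * 100 / period ))%"
    fi
}

# Print the three limits for the cgroup directory given as $1.
show_limits() {
    echo "memory.high: $(cat "$1/memory.high")"
    echo "memory.max:  $(cat "$1/memory.max")"
    echo "cpu quota:   $(cpu_max_to_percent "$(cat "$1/cpu.max")")"
}
```

Called as show_limits /sys/fs/cgroup/system.slice/dataproc.service while the unit is running, this should report 3221225472 and 4294967296 bytes for the two memory limits, since systemd treats the G suffix as base 1024.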
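The point worth sitting with: the limit is on the cgroup, so it bounds the job and all its children together. A process that forks a swarm of workers can't escape its allowance by spreading out. A ulimit, by contrast, is per process: every forked worker inherits the same limit as a fresh allowance of its own, so ten workers get ten times the budget. That's the whole reason this works where a per-process ulimit quietly didn't.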
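The shared accounting is easy to see from the cgroup directory itself: however many PIDs are listed in cgroup.procs, there is only one memory.current for all of them. A minimal sketch, with a hypothetical function name, assuming the service has no delegated sub-cgroups so that every process sits directly in the unit's cgroup:

```shell
# Hypothetical helper: summarise the cgroup directory given as $1.
# cgroup.procs lists one PID per line for processes directly in this cgroup;
# memory.current is a single byte count charged to the cgroup as a whole.
cgroup_summary() {
    # Arithmetic expansion strips any padding some wc implementations add.
    procs=$(( $(wc -l < "$1/cgroup.procs") ))
    bytes=$(cat "$1/memory.current")
    echo "$procs processes sharing $bytes bytes"
}
```

Run against /sys/fs/cgroup/system.slice/dataproc.service while the job is fanning out, the process count climbs but the byte count is still the one number that MemoryHigh and MemoryMax are compared against.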
What changed
After a systemctl daemon-reload and a restart of the service to pick up the drop-in, you can watch the accounting live:
systemctl show dataproc -p MemoryCurrent
cat /sys/fs/cgroup/system.slice/dataproc.service/memory.current
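A raw byte count is hard to eyeball, so a tiny helper that expresses usage as a share of the ceiling can be handy. This is a sketch with a made-up function name; it only reads the two standard cgroup v2 files memory.current and memory.max from whatever directory it is given.

```shell
# Hypothetical helper: print memory.current as a percentage of memory.max
# for the cgroup directory given as $1.
mem_usage_percent() {
    current=$(cat "$1/memory.current")
    max=$(cat "$1/memory.max")
    if [ "$max" = "max" ]; then
        # The kernel reports the literal string "max" when no limit is set.
        echo "no limit set"
    else
        echo "$(( current * 100 / max ))%"
    fi
}
```

Something like while sleep 5; do mem_usage_percent /sys/fs/cgroup/system.slice/dataproc.service; done gives a rough live view. The directory only exists while the unit is running, so expect errors between runs.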
Now when the job hits a bad input, it hits its own ceiling and gets killed in isolation. The host doesn't notice. The job's supervisor restarts it or alerts, depending on the run, and crucially the alert is "this job died" rather than "this host vanished". That's a categorically better page to receive: specific, contained, and pointing straight at the actual culprit.
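To tell a job that was merely throttled from one that was actually killed at its ceiling, the cgroup's memory.events file keeps counters for both. The helper below is a sketch with a hypothetical name; the keys it looks up (high, max, oom, oom_kill) are the ones the kernel documents for cgroup v2.

```shell
# Hypothetical helper: print one counter from memory.events.
#   $1 = cgroup directory, $2 = key such as "high", "max", "oom" or "oom_kill"
# memory.events is a list of "key value" lines, one counter per line.
memory_event() {
    awk -v key="$2" '$1 == key { print $2 }' "$1/memory.events"
}
```

A rising high counter means MemoryHigh throttling is doing its job; a non-zero oom_kill means something in the cgroup was killed by an OOM killer. Once the service has fully stopped, systemd removes the cgroup and the counters go with it, so after the fact systemctl show dataproc -p Result is the place to look, where I'd expect to see oom-kill.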
I still want the job fixed properly, and there's a ticket for it that will outlive us all. But there's a real difference between a bug that degrades one workload and a bug that takes a shared machine hostage. cgroups v2 turned the second into the first with three lines in a drop-in file. The runaway process is still a runaway. It just runs away inside a fence now, and the fence holds.