A batch importer went sideways on Thursday and tried to load a few million rows entirely into memory at once. The box did not crash so much as become philosophical: load average past 200, everything paging, SSH technically alive but answering with the urgency of a tax office. The OOM killer eventually picked something, and naturally it picked the wrong something.
The annoying part is that this was a solved problem I had not bothered to solve. The importer ran as a systemd service, and cgroups v2 will simply not let a unit exceed a memory ceiling if you tell it the ceiling. I had never told it.
[Service]
MemoryMax=2G
MemoryHigh=1500M
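MemoryHigh is the soft limit: go past it and the kernel throttles the job and reclaims hard before you hit the wall. MemoryMax is the wall, and a job that cannot be squeezed back under it gets OOM-killed inside its own cgroup rather than dragging the host into swap with it. Worth checking the accounting is actually on (a byte count means it is, "[not set]" means it is not):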
systemctl show import.service -p MemoryCurrent
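If you would rather have a script do the squinting, here is a minimal sketch that runs the same query and works out how much room is left under the ceiling. The function names are mine, not anything systemd ships. It assumes systemctl show prints one Key=Value pair per line and reports missing values as "[not set]" or "infinity", so treat it as a starting point rather than a monitoring tool.

```python
import subprocess

# Properties to request; all four are standard systemd resource-control names.
PROPS = ("MemoryAccounting", "MemoryCurrent", "MemoryHigh", "MemoryMax")


def parse_show(text):
    """Turn Key=Value lines (the assumed `systemctl show` format) into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key.strip()] = value.strip()
    return props


def as_bytes(value):
    """Return an int byte count, or None for 'infinity', '[not set]', or a missing value."""
    if value is not None and value.isdigit():
        return int(value)
    return None


def headroom(props):
    """Bytes left before MemoryMax, or None if either number is unavailable."""
    current = as_bytes(props.get("MemoryCurrent"))
    limit = as_bytes(props.get("MemoryMax"))
    if current is None or limit is None:
        return None
    return limit - current


def show(unit):
    """Ask systemd for the memory properties of one unit (hypothetical helper)."""
    cmd = ["systemctl", "show", unit]
    for prop in PROPS:
        cmd += ["-p", prop]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_show(result.stdout)
```

Calling headroom(show("import.service")) gives the remaining bytes, or None when accounting is off or no MemoryMax is set. The parsing is kept separate from the subprocess call so it can be checked against canned output on a machine that has no such unit.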
Now when the importer misbehaves it dies alone, which is all I ever wanted from it. The host stays responsive, the alert fires on a failed unit instead of a wedged machine, and I get to fix the actual bug in daylight instead of at the wrong end of a console that takes thirty seconds to echo each keystroke. The containment was free. I just had to ask for it.