when the writeback storm hits and everything stalls

A terminal showing kernel vm statistics

The box would run beautifully for a couple of minutes, then freeze. Not crash, freeze: SSH input would hang, a ls would take four seconds, and then everything would catch up at once. The CPU was idle. The disks were not the bottleneck on paper. It was the classic shape of a writeback stall, and the culprit was the default dirty_ratio.

On a machine with a lot of RAM, the defaults are generous to the point of being dangerous. vm.dirty_ratio of 20 means the kernel will happily accumulate dirty pages up to 20% of memory before it forces synchronous writeback. On a 128GB ingest box that is roughly 25GB of dirty pages, and when the kernel finally decides to flush them, every process that touches the page cache blocks until the flood drains. Hence the freeze.

The fix is to flush little and often rather than rarely and catastrophically:

vm.dirty_background_ratio = 3
vm.dirty_ratio = 10

Background writeback now kicks in at 3%, so the flusher threads are nearly always trickling pages out to disk instead of saving them up for one big punishing batch. The hard limit at 10% still exists, but we hit it far less often, and when we do there is much less to write. Stalls went from multiple seconds to nothing I could measure. On a box that mostly writes, consider dirty_bytes and dirty_background_bytes instead, so the thresholds track your disk throughput rather than a percentage of however much RAM happens to be fitted. The defaults are tuned for a desktop from a decade ago, not for a server with a fast NVMe and an alarming amount of memory.