when the writeback stalls everything

A box that ingests a lot of data was stalling in bursts. Everything fine, then a couple of seconds where the whole thing felt like it was wading through treacle, then fine again. The CPU was bored. The disks weren't saturated on average. It was the average that lied.

The culprit was dirty page writeback. vm.dirty_ratio (typically 20% of available memory by default) lets the system accumulate a large amount of unwritten dirty data before the kernel starts throttling the processes doing the writing, and on a box with plenty of memory that means a long quiet build-up followed by a burst of writeback that blocks new writers until the backlog drains. You see it clearly in /proc/meminfo as Dirty: climbing, then collapsing.
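To watch that sawtooth yourself, something like the following works as a minimal sketch. The function name dirty_snapshot and its optional file argument are my own additions for illustration; the only thing it relies on is the Dirty: and Writeback: lines that /proc/meminfo exposes.

```shell
#!/bin/sh
# Print the Dirty and Writeback counters (in kB) from a meminfo-style file.
# dirty_snapshot is a hypothetical helper name; with no argument it reads
# /proc/meminfo, and the optional argument exists so it can be pointed at
# a saved copy instead.
dirty_snapshot() {
    awk '/^(Dirty|Writeback):/ { printf "%s %s kB\n", $1, $2 }' "${1:-/proc/meminfo}"
}
```

Running it in a loop (while :; do dirty_snapshot; sleep 1; done) during an ingest burst should show Dirty growing steadily and then dropping sharply while Writeback spikes, if writeback is really what is stalling you.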

The fix is to flush little and often rather than rarely and all at once:

sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

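dirty_background_ratio is the point where the kernel starts flushing in the background; dirty_ratio is the hard wall where writers get throttled. Pulling both down keeps the dirty set small, so the kernel is always trickling pages to disk instead of saving them up for a painful spike. If you have a lot of RAM, the percentage-based knobs are coarse: on a 256 GB machine a single percentage point is roughly 2.5 GB. The byte-based variants, vm.dirty_background_bytes and vm.dirty_bytes, give you finer control. Setting a bytes knob overrides its ratio counterpart, which then reads back as 0.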
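For the byte-based route, here is a rough sketch that turns two sizes in MiB into sysctl settings. The helper name dirty_bytes_conf and the 64 MiB / 256 MiB figures are illustrative assumptions, not the values used on the box above; pick numbers that suit your own storage.

```shell
#!/bin/sh
# Emit sysctl.d-style lines for byte-based dirty limits.
# Arguments are in MiB. dirty_bytes_conf is a hypothetical helper and the
# sizes below are example values only.
dirty_bytes_conf() {
    bg_mib=$1      # background flushing starts above this much dirty data
    hard_mib=$2    # writers are throttled above this much dirty data
    printf 'vm.dirty_background_bytes = %d\n' "$((bg_mib * 1024 * 1024))"
    printf 'vm.dirty_bytes = %d\n' "$((hard_mib * 1024 * 1024))"
}

# Example: start background flushing at 64 MiB, throttle writers at 256 MiB.
dirty_bytes_conf 64 256
```

Note that sysctl -w only lasts until the next reboot. To make settings stick, write lines like these into a file under /etc/sysctl.d/ (the filename is up to you) and load them with sysctl --system as root.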

The stalls went away. p99 write latency dropped from "noticeable" to "boring", which is exactly where I want it.