when a busy box stalls every thirty seconds, look at dirty_ratio

A write-heavy box was misbehaving in a very particular way: smooth for half a minute, then a stall where everything that touched disk hung for a second or two, then smooth again. Classic dirty page writeback. The kernel lets dirty pages accumulate in the page cache, then panics and flushes a great wodge of them at once, and your latency spikes while it does.
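One way to confirm that diagnosis before touching anything is to watch the Dirty and Writeback counters in /proc/meminfo. With this pattern you'd expect Dirty to climb steadily and then collapse around each stall. A minimal sketch: the dirty_snapshot helper name is made up for illustration, and the optional path argument is only there so it can be pointed at a saved copy of the file.

```shell
# Print the Dirty and Writeback counters from a meminfo-style file.
# Defaults to /proc/meminfo; pass a path to read a saved copy instead.
# The trailing colon in the pattern keeps WritebackTmp out of the output.
dirty_snapshot() {
    grep -E '^(Dirty|Writeback):' "${1:-/proc/meminfo}"
}

# To sample once a second (Ctrl-C to stop):
#   while true; do dirty_snapshot; sleep 1; done
```

Run the loop in one terminal while the workload runs in another and line the numbers up against the stalls.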

The two knobs are vm.dirty_ratio and vm.dirty_background_ratio. The defaults on this kernel were 20 and 10, each expressed as a percentage of available memory (free plus reclaimable pages, not strictly total RAM). On a box with a lot of memory, 20% is a frankly enormous amount of dirty data to dump on the disk in one go. The background ratio is the point at which the kernel starts flushing quietly in the background; the hard ratio is the point at which writing processes get blocked until the flush catches up. That blocking is the stall you feel.
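To put rough numbers on that, here is a back-of-envelope sketch. It applies the percentage to total memory, which gives an upper bound, since the kernel works from available memory rather than everything installed. The 128 GiB figure is just an example size, and dirty_threshold_mib is a made-up helper name.

```shell
# Approximate dirty-data threshold in MiB for a memory size (GiB) and a ratio (%).
# Integer arithmetic, so the result is rounded down.
dirty_threshold_mib() {
    echo $(( $1 * 1024 * $2 / 100 ))
}

dirty_threshold_mib 128 20   # hard limit at ratio 20: prints 26214
dirty_threshold_mib 128 10   # background threshold at ratio 10: prints 13107
dirty_threshold_mib 128 5    # background threshold at ratio 5: prints 6553
```

Even at 5%, that is more than 6 GiB of dirty pages before background writeback starts on a box that size.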

I lowered both so the kernel flushes little and often rather than rarely and all at once:

sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

The stalls smoothed right out. Throughput didn't change in any way I could measure, but the latency got far more even, which is what actually mattered. On boxes with a lot of RAM I'd reach for vm.dirty_background_bytes (and its sibling vm.dirty_bytes) instead, because even 5% of 128GB is over 6GB; setting an absolute cap of a few hundred megabytes is more honest. Either way, the lesson is the same: a big page cache feels great until the bill comes due all at once.
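For the absolute-cap variant, a sketch might look like the following. The 256 MiB and 1 GiB figures are illustrative assumptions, not values tested on this box, and mib_to_bytes is a made-up helper. Note that the bytes and ratio forms are mutually exclusive: once you write vm.dirty_background_bytes or vm.dirty_bytes, the matching ratio reads back as 0, and vice versa.

```shell
# Convert mebibytes to bytes, since the *_bytes sysctls take plain byte counts.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Illustrative values only: start background writeback at 256 MiB of dirty
# data, block writers at 1 GiB.
bg_bytes=$(mib_to_bytes 256)
hard_bytes=$(mib_to_bytes 1024)

# Dry run: print the commands instead of applying them.
# Drop the leading "echo" and run as root to apply.
echo sysctl -w vm.dirty_background_bytes="$bg_bytes"
echo sysctl -w vm.dirty_bytes="$hard_bytes"
```

As written, this only prints the two sysctl commands, so you can check the numbers before committing to them.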