when a box stalls every thirty seconds, look at dirty_ratio

A write-heavy box of ours had a horrible tic: every thirty seconds or so, everything stalled for a second or two. Latency graphs looked like a heartbeat. The CPU was idle during the freezes, so it wasn't compute. iostat -x 1 told the story: long stretches of nothing, then %util slamming to 100 as the disk got a wall of writes dumped on it all at once.
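If you want to see the backlog build up before the disk gets hit, the Dirty and Writeback lines in /proc/meminfo are the ones to watch. Here's a minimal sketch of a poller; the function names (parse_meminfo, watch_dirty) are made up for illustration and not part of any tool.

```python
import time


def parse_meminfo(text):
    """Return {field: kB} for the Dirty and Writeback lines of /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            # Lines look like "Dirty:   812344 kB"; the first token is the value.
            result[name] = int(rest.split()[0])
    return result


def watch_dirty(samples=30, interval=1.0, path="/proc/meminfo"):
    """Print the dirty/writeback counters once per interval (hypothetical helper)."""
    for _ in range(samples):
        with open(path) as f:
            print(parse_meminfo(f.read()))
        time.sleep(interval)
```

Calling watch_dirty() on a box with this problem should show Dirty climbing steadily and then collapsing as Writeback spikes, lined up with the bursts in iostat.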

That's classic writeback. Linux buffers dirty pages in RAM and flushes them later. With the default vm.dirty_ratio of 20 (a percentage of available memory, not of total RAM), a box with a lot of memory will happily accumulate gigabytes of dirty pages. Once it hits that ceiling, every process that dares to write gets blocked while the backlog drains to disk. Hence the periodic stall.
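To put rough numbers on "gigabytes", here's a back-of-the-envelope sketch. The 64 GiB figure is an assumption for illustration, not the size of the box in this story, and the kernel's real calculation uses free plus reclaimable memory rather than a fixed number.

```python
GIB = 1024 ** 3


def threshold_bytes(dirtyable_bytes, ratio_percent):
    """Bytes of dirty page cache allowed at a given percentage threshold."""
    return dirtyable_bytes * ratio_percent // 100


def describe(dirtyable_gib, background_ratio, dirty_ratio):
    """Return both thresholds in GiB for a box with the given dirtyable memory."""
    dirtyable = dirtyable_gib * GIB
    return {
        "background_gib": round(threshold_bytes(dirtyable, background_ratio) / GIB, 2),
        "ceiling_gib": round(threshold_bytes(dirtyable, dirty_ratio) / GIB, 2),
    }


# Hypothetical 64 GiB of dirtyable memory with the stock 10/20 settings.
print(describe(64, 10, 20))
```

With those assumed inputs the background flusher doesn't start until about 6.4 GiB is dirty, and writers aren't blocked until about 12.8 GiB, which is a lot to push through one disk in a single burst.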

The fix is to make it flush little and often instead of rarely and catastrophically:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

dirty_background_ratio is the threshold at which the kernel starts flushing in the background, quietly, without blocking anyone. dirty_ratio is the hard ceiling at which writers themselves get blocked. Pulling both down means the background flusher wakes up sooner and keeps the queue shallow, so you never build up the giant backlog that causes the synchronous stall.
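After changing the settings it's worth reading them back, since the live values are exposed as files under /proc/sys/vm. A small sketch, with read_vm_knobs being a hypothetical helper name; the base argument exists only so it can be pointed at a test directory.

```python
from pathlib import Path

# The four writeback thresholds exposed under /proc/sys/vm.
KNOBS = (
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_background_bytes",
    "dirty_bytes",
)


def read_vm_knobs(base="/proc/sys/vm"):
    """Return {knob: int value} for each writeback knob present under base."""
    values = {}
    for name in KNOBS:
        path = Path(base) / name
        if path.exists():
            values[name] = int(path.read_text().strip())
    return values
```

On a box where the two lines above have been applied, read_vm_knobs() should report 5 and 10 for the ratio knobs.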

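On a machine with tens of gigabytes of RAM the percentage-based knobs are too coarse anyway: a single percentage point of 32 GB is already around 320 MB. So on the worst offenders I switched to the byte-based versions and set vm.dirty_background_bytes to a few hundred megabytes. (Writing a _bytes knob overrides its _ratio twin, which then reads back as 0.) Same idea, just expressed in numbers that actually mean something on a big box.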
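The byte-based knobs take a plain byte count, so the only fiddly part is the conversion. A tiny sketch, where 256 MiB is an assumed stand-in for "a few hundred meg" and bytes_setting is a made-up helper name:

```python
MIB = 1024 * 1024


def bytes_setting(name, mebibytes):
    """Render a sysctl.conf-style line for a byte-based vm knob."""
    return f"vm.{name} = {mebibytes * MIB}"


# 256 MiB is an illustrative value, not the one used on the real boxes.
print(bytes_setting("dirty_background_bytes", 256))
```

The printed line can go in the same place as the ratio settings it replaces.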

The heartbeat went away. Throughput barely moved, but the p99 latency dropped through the floor, which on this particular service is the number anyone actually cares about. Not glamorous, but few of the good fixes are.