Ramblings of an aging IT geek
linux

when the box freezes for a second every minute

A write-heavy server with periodic latency spikes turned out to be the kernel flushing a huge pile of dirty pages all at once, fixed by lowering dirty_background_ratio and dirty_ratio.

A terminal showing vmstat output with periodic write spikes

A write-heavy box had a horrible tic: every so often, latency would spike, the disks would saturate for a few seconds, and then everything settled down again until the next time. Steady-state load was fine. It was the periodic stutter that was killing us.

The culprit was page cache writeback. When you write to a file, the data sits in the page cache as "dirty" pages and gets flushed to disk later. Background writeback only starts once dirty pages pass dirty_background_ratio, and writers are only throttled once they pass dirty_ratio (the stock kernel defaults are 10 and 20, as a percentage of available memory). With lots of RAM, those percentages let a very large pile of dirty pages accumulate. Then dirty_ratio would trip and the kernel would block writers while it pushed the backlog out to disk. That dump was the spike: a long quiet period of buffering, then a brutal flush.
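To check which thresholds are actually in force on a given box, something like the following reads them straight from /proc/sys/vm. It is a minimal sketch: the function name show_dirty_knobs and its optional directory argument are my own, the latter only there so the function can be pointed at a copy of the files.

```shell
#!/bin/sh
# Print the writeback thresholds currently in force.
# vm_dir defaults to /proc/sys/vm; the argument is a hypothetical
# convenience so the function can be run against a copy of the files.
show_dirty_knobs() {
    vm_dir="${1:-/proc/sys/vm}"
    for knob in dirty_background_ratio dirty_ratio \
                dirty_background_bytes dirty_bytes \
                dirty_expire_centisecs dirty_writeback_centisecs; do
        # Skip anything this kernel does not expose.
        if [ -r "$vm_dir/$knob" ]; then
            printf '%s = %s\n' "$knob" "$(cat "$vm_dir/$knob")"
        fi
    done
}

show_dirty_knobs
```

A ratio that reads as 0 usually means the matching byte-based knob has been set instead.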

You can see the backlog directly:

grep -i dirty /proc/meminfo
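A single sample does not show much, so a small polling loop helps. The sketch below prints the Dirty and Writeback counters from /proc/meminfo once per interval; the function names dirty_kb and watch_dirty and the default of 60 samples are assumptions of mine, not anything standard.

```shell
#!/bin/sh
# Extract the Dirty and Writeback counters (in kB) from a meminfo-style file.
# The file argument defaults to /proc/meminfo.
dirty_kb() {
    awk '$1 == "Dirty:"     { d = $2 }
         $1 == "Writeback:" { w = $2 }
         END { printf "dirty_kB=%d writeback_kB=%d\n", d, w }' "${1:-/proc/meminfo}"
}

# Print one timestamped sample per interval, stopping after `count` samples.
# Usage: watch_dirty [interval_seconds] [count]
watch_dirty() {
    interval="${1:-1}"
    count="${2:-60}"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s %s\n' "$(date +%T)" "$(dirty_kb)"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Running watch_dirty 1 120 during a busy period gives two minutes of samples to line up against the latency spikes.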

Watch that number climb between flushes and you've found your sawtooth. The fix is to make the kernel flush sooner and in smaller bites, so writeback is a steady trickle instead of an occasional flood:

# start background writeback sooner
vm.dirty_background_ratio = 5
# throttle writers before the backlog gets huge
vm.dirty_ratio = 10

Lower dirty_background_ratio starts the background flusher earlier; lower dirty_ratio caps how much can pile up before writers get throttled. On a box with lots of RAM the byte-based knobs (dirty_background_bytes, dirty_bytes) are often saner than ratios, because even 5% of a large amount of memory is still an enormous amount of writeback: on a 256 GB machine that is roughly 13 GB. Each byte-based knob replaces its ratio counterpart; setting one makes the other read back as 0. Put the settings in /etc/sysctl.conf, run sysctl -p, and watch the sawtooth flatten into something you can live with.
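To put a number on what a ratio means for a particular machine, a rough conversion to bytes is enough. This sketch uses a hypothetical helper, ratio_to_bytes, and takes MemTotal as the base. That is a simplifying assumption: the kernel applies the ratio to available (free plus reclaimable) memory, so the real threshold will be somewhat lower than this estimate.

```shell
#!/bin/sh
# Rough upper bound, in bytes, on what a dirty ratio allows.
# Assumption: MemTotal is used as the base. The kernel works from
# available memory, which is smaller, so treat the result as a ceiling.
# Usage: ratio_to_bytes <percent> [meminfo_file]
ratio_to_bytes() {
    percent="$1"
    meminfo="${2:-/proc/meminfo}"
    # MemTotal is reported in kB.
    total_kb=$(awk '$1 == "MemTotal:" { print $2 }' "$meminfo")
    echo "$(( total_kb * 1024 * percent / 100 ))"
}
```

Comparing ratio_to_bytes 5 and ratio_to_bytes 10 against what the storage can realistically drain in a second or two is one way to decide whether the byte-based knobs are the better fit.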