when the box freezes for five seconds and writeback is to blame

A box of mine had a horrible habit: every minute or two, under sustained write load, it would freeze. Not crash, freeze. For three to five seconds everything stalled, latency spiked, requests queued, and then it carried on as if nothing had happened. The CPU wasn't pegged. There was plenty of RAM. The disks weren't full. It took me longer than I'd like to admit to realise the culprit was the kernel deciding to flush a vast pile of dirty pages to disk all at once.

The short version: the default vm.dirty_ratio lets the kernel cache a lot of unwritten data in RAM before it forces writeback, and on a machine with plenty of memory that "a lot" becomes "gigabytes". When it finally flushes, the disk can't keep up, and every process trying to write blocks until the backlog clears. The fix is to make the kernel flush little and often instead of rarely and catastrophically.

Two settings do most of the work. vm.dirty_background_ratio is the percentage of memory that can be dirty before the background flusher threads wake up and start writing, quietly, without blocking anyone. vm.dirty_ratio is the hard ceiling: hit that, and any process that tries to write is forced to block and help flush until you're back under it. Strictly, the kernel measures both against available memory (free plus reclaimable pages) and not installed RAM, but on a box with memory to spare the two are close enough for estimating. The defaults on this kernel were 10 and 20. On a box with 64GB of RAM, 20% is roughly 13GB of dirty pages allowed to pile up before the panic flush. No wonder it stalled.
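To put numbers on that, here is a rough sketch of the arithmetic in shell. dirty_limit_mib is a made-up helper name, not a system tool, and it assumes the percentage applies to all of RAM, so the real thresholds will come out a little lower.

```shell
# Sketch: turn a dirty-page percentage into MiB for a given RAM size.
# Assumption: the percentage is applied to total RAM. The kernel really
# applies it to free plus reclaimable memory, so treat this as an upper bound.
# dirty_limit_mib is a hypothetical helper, not a system tool.
dirty_limit_mib() {
    ram_gib=$1
    ratio_pct=$2
    # Integer arithmetic: GiB -> MiB, then take the percentage.
    echo $(( ram_gib * 1024 * ratio_pct / 100 ))
}

dirty_limit_mib 64 20   # hard ceiling at the default dirty_ratio
dirty_limit_mib 64 10   # background threshold at the default dirty_background_ratio
```

On the 64GB box that prints 13107 and 6553, so the background flusher doesn't even start until more than 6GB is dirty.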

The thing to understand is that these are percentages of memory, not fixed amounts, which made sense when servers had 512MB and 20% was about 100MB. On modern memory sizes the same percentages let the backlog grow far past what the disks can drain quickly. So I dropped them hard:

# /etc/sysctl.d/10-writeback.conf
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Then run sysctl -p /etc/sysctl.d/10-writeback.conf as root to apply it live. You can watch the effect with grep -E 'Dirty|Writeback' /proc/meminfo while load runs; the Dirty figure should now hover at a fraction of what it was, because the background flusher kicks in much sooner and keeps the queue short.
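If you'd rather not squint at kilobyte counts, a small wrapper along these lines prints the two figures in MiB. It's only a sketch: dirty_summary is a hypothetical name, and it reads meminfo-format text on stdin so you can also point it at a saved copy.

```shell
# Sketch: print the Dirty and Writeback lines of /proc/meminfo in MiB.
# dirty_summary is a hypothetical helper name, not a system tool.
# The trailing colon in the pattern keeps WritebackTmp out of the output.
dirty_summary() {
    awk '/^(Dirty|Writeback):/ { printf "%s %.1f MiB\n", $1, $2 / 1024 }'
}

# Sample once a second while the load runs (Ctrl-C to stop):
#   while sleep 1; do dirty_summary < /proc/meminfo; done
```

Run the loop before and after the change and compare where Dirty peaks.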

If you've enough RAM that even 5% is a lot, the better controls are vm.dirty_background_bytes and vm.dirty_bytes, which set absolute limits in bytes instead of percentages. Each _bytes setting and its matching _ratio are mutually exclusive: writing one makes the other read back as 0, so set one of each pair, not both. On the really big boxes I'll cap vm.dirty_bytes at something the disks can actually flush in a second or two, and not think about RAM size at all.
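One way to arrive at a byte limit is to work backwards from the disks. The sketch below assumes you've measured sustained write throughput yourself and picked a flush time you can live with. writeback_limits is a hypothetical helper, and setting the background limit to half the hard limit is just an assumption that mirrors the 5 and 10 split above.

```shell
# Sketch: size vm.dirty_bytes from what the disks can drain.
# Assumptions (measure these yourself): sustained write throughput in
# MiB/s and the longest flush you will tolerate, in seconds.
# writeback_limits is a hypothetical helper, not a system tool.
writeback_limits() {
    throughput_mib_s=$1
    max_flush_s=$2
    dirty_bytes=$(( throughput_mib_s * max_flush_s * 1024 * 1024 ))
    # Background limit at half the hard limit, mirroring the 5/10 ratios.
    echo "vm.dirty_background_bytes = $(( dirty_bytes / 2 ))"
    echo "vm.dirty_bytes = $dirty_bytes"
}

# Example figures only: 200 MiB/s of sustained writes, 2 second budget.
writeback_limits 200 2
```

The output is in sysctl.d format, so it can go into a file in place of the two _ratio lines, never alongside them.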

The result on my box was immediate. The five-second freezes vanished, replaced by a steady trickle of writeback that the disks handled without anyone noticing. Average throughput didn't really change, because the same data still gets written; it just gets written smoothly rather than in heaving great lurches. And smooth is the whole point. A server that does the work evenly is worth far more than one that's marginally faster on average and stops dead every ninety seconds. Flush little and often, and let the box breathe.