The long version: why your write-heavy box stalls, and how to stop it

I wrote a short note a fortnight ago about dropping vm.dirty_ratio to smooth out write stalls on an ingest box. It got enough "wait, what's dirty_ratio" replies that I think the long version is worth doing properly, because this is one of those settings that's been wrong-by-default on big machines for years and most people never touch it.

What actually happens when you write a file

When a process writes to a file, that data does not go to the disk. Not straight away. It goes into the page cache, in RAM, and the page is marked dirty: changed in memory, not yet on disk. The kernel writes it out later, in the background, batching things up so the disk gets nice big sequential writes instead of a constant dribble of tiny ones. This is a good design. It's why write() returns fast and why Linux feels responsive.

The problem is the word "later". How much dirty data is the kernel willing to let accumulate before it does something about it? That's what these knobs control.

There are two thresholds that matter:

vm.dirty_background_ratio is the soft limit. When dirty pages exceed this percentage of available memory, the kernel wakes its writeback threads and starts flushing in the background. Your processes don't notice; they keep running.
vm.dirty_ratio is the hard limit. When dirty pages hit this percentage, the kernel stops being polite. Any process that tries to write more gets blocked, synchronously, until enough has been flushed to bring the figure back down.

That second one is the stall. The process isn't waiting on its own IO; it's been conscripted into throttling because the system as a whole let too much pile up.
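Before changing anything, it helps to see what a box is currently running with. A minimal sketch: the function name and its optional directory argument are my own additions for illustration, while the two files it reads are the standard locations for these sysctls.

```shell
# show_dirty_limits [DIR]
# Print the soft and hard writeback thresholds.
# DIR is a hypothetical override, handy for testing; by default the
# live values are read from /proc/sys/vm.
show_dirty_limits() {
    dir="${1:-/proc/sys/vm}"
    for knob in dirty_background_ratio dirty_ratio; do
        # Each file holds a single integer percentage.
        printf 'vm.%s = %s\n' "$knob" "$(cat "$dir/$knob")"
    done
}
```

Calling show_dirty_limits with no argument prints the live values; sysctl vm.dirty_background_ratio vm.dirty_ratio gives the same information.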

Why the defaults bite on big boxes

The historic defaults are dirty_background_ratio of 10 and dirty_ratio of 20. Those numbers were chosen back when a server might have had a couple of gigabytes of RAM. Ten percent of 2GB is 200MB, which a disk can clear in a reasonable time.

Now put those same percentages on a box with 64GB of RAM. Twenty percent is up to 12.8GB of dirty pages allowed to accumulate before the hard throttle kicks in. Picture what happens when you hit that ceiling: the kernel now needs to flush gigabytes to disk before the blocked process can continue. On a spinning disk doing maybe 150MB/s sustained, clearing several gigabytes takes tens of seconds, and draining the full 12.8GB would take around 85. Your process is frozen for the duration. Everything sharing that disk is having a bad time too.
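The arithmetic is worth running against your own numbers. A rough sketch, with function names that are mine rather than anything standard. It uses decimal units (1GB = 1000MB) to match the figures above, and it treats the ratio as a share of total RAM, which overstates things slightly because the kernel applies it to available memory.

```shell
# dirty_ceiling_mb RAM_GB RATIO_PERCENT
# Approximate upper bound on dirty data, in MB.
dirty_ceiling_mb() {
    echo $(( $1 * 1000 * $2 / 100 ))
}

# drain_seconds DIRTY_MB DISK_MB_PER_SEC
# Whole seconds to flush that much at a sustained rate (rounded down).
drain_seconds() {
    echo $(( $1 / $2 ))
}
```

dirty_ceiling_mb 64 20 prints 12800, and drain_seconds 12800 150 prints 85. The same sums for the 2GB box at ten percent give 200 and 1, which is the whole story in two numbers.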

So the pattern you see is exactly what I described on the ingest box. Long stretches where the disk looks idle because everything is sitting happily in cache, then a sudden wall, then the disk pinned at 100% whilst it drains, then quiet again. Sawtooth. The averages look fine. The percentiles are dreadful, and percentiles are what your users feel.

What to set instead

For a busy write-heavy box, you want writeback to start early and the hard ceiling to be low enough that hitting it is cheap rather than catastrophic. On the ingest box I used:

vm.dirty_background_ratio = 3
vm.dirty_ratio = 10

Drop those into /etc/sysctl.d/30-writeback.conf so they survive a reboot, and apply them straight away with sysctl --system. The lower background ratio means the kernel starts trickling data out to disk much sooner, so the dirty pool rarely gets near the hard limit at all. And if it does, ten percent is a smaller wall to climb than twenty.
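If you'd rather script that than edit by hand, something along these lines should do. The function name and the optional path argument are hypothetical, added so it can be tried against a scratch file first. The default path and the two values are the ones above.

```shell
# write_writeback_conf [PATH]
# Write the two ratio settings to PATH.
# Writing to the default location under /etc needs root.
write_writeback_conf() {
    conf="${1:-/etc/sysctl.d/30-writeback.conf}"
    cat > "$conf" <<'EOF'
# Start background writeback early and keep the hard ceiling low.
vm.dirty_background_ratio = 3
vm.dirty_ratio = 10
EOF
}
```

Run write_writeback_conf as root, follow it with sysctl --system, and then sysctl vm.dirty_background_ratio vm.dirty_ratio should report 3 and 10.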

On really large memory boxes even small percentages are too coarse. Three percent of 256GB is still nearly 8GB of pending writes, which is daft. For those, use the absolute-value siblings instead:

vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 1073741824

That's 256MB and 1GB (strictly, 256MiB and 1GiB). Setting a _bytes value automatically zeroes the corresponding _ratio, and vice versa. They're mutually exclusive, so don't try to set both. Absolute bytes is the saner mental model on modern hardware: you're saying "never let more than a gigabyte of unwritten data build up", which you can reason about against your disk's actual throughput.
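Nobody remembers those byte counts, so here is a small sketch for deriving them and checking them against the disk. Both function names are made up for illustration, and the 150MB/s figure used below is the spinning-disk estimate from earlier, not a measurement.

```shell
# mib_to_bytes MIB
# Convert mebibytes to the byte count the _bytes sysctls expect.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# seconds_to_drain BYTES DISK_MB_PER_SEC
# Whole seconds to flush BYTES at a sustained rate given in
# decimal megabytes per second (rounded down).
seconds_to_drain() {
    echo $(( $1 / ($2 * 1000000) ))
}
```

mib_to_bytes 256 prints 268435456 and mib_to_bytes 1024 prints 1073741824. seconds_to_drain 1073741824 150 prints 7, so the worst-case wall at the hard limit is a handful of seconds rather than a minute and a half.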

Measuring it, because guessing is for amateurs

Before and after, watch the dirty page count directly:

watch -n1 grep -e Dirty -e Writeback /proc/meminfo

Dirty is data waiting to be written; Writeback is data currently being written. On an untuned box under bursty load you'll see Dirty climb and climb, then Writeback spike as the flush kicks in. After tuning, Dirty should hover much lower and Writeback should be a steadier trickle rather than periodic explosions.
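The raw kB figures are hard to eyeball at this scale. A small sketch that pulls just those two counters and prints them in MiB. The function name and its optional file argument are mine, and it assumes the usual /proc/meminfo layout of a label, a number and a kB unit.

```shell
# dirty_snapshot [FILE]
# Print the Dirty and Writeback counters in MiB, rounded down.
# FILE defaults to /proc/meminfo; the override exists for testing.
dirty_snapshot() {
    # An exact match on the label skips WritebackTmp.
    awk '$1 == "Dirty:" || $1 == "Writeback:" { printf "%s %d MiB\n", $1, $2 / 1024 }' "${1:-/proc/meminfo}"
}
```

Wrapped in a loop such as while sleep 1; do date +%T; dirty_snapshot; done, it gives a timestamped log you can line up against the stalls afterwards.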

For the actual stalls, run iostat -x 1 and watch the await column, the average time IO requests spend waiting (recent sysstat versions split it into r_await and w_await; for this, w_await is the one to watch). Tall spikes lining up with your sawtooth are the throttle biting. And if you want to catch the processes actually being blocked in uninterruptible sleep, a quick ps -eo state,pid,cmd | grep '^D' during a stall will show them sat in D state, waiting on the kernel.
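Stalls are easy to miss by hand, so it can help to sample the D-state check repeatedly. A sketch of the filter half, written to read ps output on stdin so it can be tried on canned input. The function name is hypothetical.

```shell
# d_state_only
# Filter `ps -eo state,pid,cmd` output on stdin down to processes in
# uninterruptible sleep. Matching the whole first field also drops
# the header line.
d_state_only() {
    awk '$1 == "D"'
}
```

Used as ps -eo state,pid,cmd | d_state_only inside a one-second loop, it shows which processes keep turning up during the drain. A process seen in D state once is unremarkable; the same writer appearing sample after sample is the one being throttled.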

The caveat

This is a trade, not a free win. Flushing earlier and in smaller batches means slightly more total IO and less opportunity for the kernel to coalesce writes into big sequential runs. On a workload that writes a lot then deletes it before it ever needs to hit disk (temporary files, scratch space, that sort of thing), an aggressively low ratio makes you write data you'd have thrown away.

So the rule isn't "always set it low". It's: if your box writes in bursts and you care about latency and predictability over raw throughput, tune it down. If it's a throughput machine churning data that genuinely needs persisting and nobody's watching latency, the defaults, or even higher ratios, can be fine.

Most of the boxes I look after fall into the first camp. The defaults assume the second, and they assume 2003. Worth ten minutes to check which one you actually have.