tuning dirty_ratio on a box that writes a lot

The symptom was a server that felt fine for thirty seconds and then froze for two. Not crashed, not swapping, just every process blocking at once on what looked like nothing, then carrying on as if nothing had happened. Latency graphs showed it as a regular sawtooth: smooth, smooth, smooth, cliff. It was a box that ingested a steady firehose of data and batched it to disk, and the freezes lined up suspiciously well with the disk write spikes.

The cause was page-cache writeback, specifically vm.dirty_ratio. When you write to a file, the data lands in the page cache first and gets flushed to disk later by the kernel's writeback threads. That deferral is the whole point; it lets bursty writes get absorbed and coalesced. But there are limits, and on a write-heavy box the defaults are wrong in an exciting way.

what the knobs actually do

Two thresholds matter, both expressed as a percentage of available memory:

$ sysctl vm.dirty_background_ratio vm.dirty_ratio
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20

dirty_background_ratio is the soft limit. When dirty pages cross it, the kernel starts flushing in the background and your processes don't notice. dirty_ratio is the hard limit. When dirty pages cross that, the kernel throttles any process trying to write, putting it to sleep inside the write call until writeback brings things back under control. That blocking is the freeze.
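
To make the two thresholds concrete, here is a minimal shell sketch of the simplified model described above. The function name writeback_state and the state labels are made up for illustration. The real kernel is more gradual than this: in practice it begins slowing writers down before the hard limit is actually reached, rather than at one sharp line.

```shell
# Simplified model of the two writeback thresholds (illustrative only).
# $1 = current dirty kB, $2 = background threshold kB, $3 = hard threshold kB
writeback_state() {
    if [ "$1" -gt "$3" ]; then
        echo "writers-blocked"     # past dirty_ratio: writing processes stall
    elif [ "$1" -gt "$2" ]; then
        echo "background-flush"    # past dirty_background_ratio: flusher threads kick in
    else
        echo "idle"                # below both: only the periodic, age-based flush applies
    fi
}

writeback_state 100000 500000 1000000     # prints: idle
writeback_state 600000 500000 1000000     # prints: background-flush
writeback_state 1200000 500000 1000000    # prints: writers-blocked
```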

Here's the trap. On a machine with a lot of RAM, 20% is enormous. On a 64G box that's up to roughly 12G of dirty pages allowed to accumulate before the hard limit bites. When it finally does, the kernel has to push gigabytes to a disk that can't take it that fast, so everything stops and waits. The more memory you have, the bigger the cliff.
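
The arithmetic is easy to check. This is a small shell sketch with a made-up helper name (dirty_limit_mib) that treats the whole 64G as the memory the ratio applies to. That makes it an upper bound, since the kernel applies the percentage to available memory rather than total RAM.

```shell
# Ceiling on dirty pages, in MiB, for a given amount of memory and a ratio.
# $1 = memory the ratio applies to, in MiB; $2 = ratio in percent
# Uses integer arithmetic, so the result is rounded down.
dirty_limit_mib() {
    echo $(( $1 * $2 / 100 ))
}

dirty_limit_mib 65536 20    # 64 GiB at 20%: prints 13107 (about 12.8 GiB)
dirty_limit_mib 65536 10    # 64 GiB at 10%: prints 6553  (about 6.4 GiB)
```

The same function with a 256 GiB box and the same 20% gives four times the backlog, which is the "bigger cliff" in numbers.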

[Figure: a latency sawtooth flattening out after tuning]

smaller, more often

The fix is to flush earlier and in smaller chunks, so writeback is a constant gentle background activity rather than a periodic avalanche. I dropped both thresholds hard:

# /etc/sysctl.d/30-writeback.conf
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.dirty_expire_centisecs = 1500
vm.dirty_writeback_centisecs = 500

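The two centisecs values control how old a dirty page can get before it's eligible for flush, and how often the writeback threads wake up to look. Both are in hundredths of a second: 1500 expires pages after 15 seconds, half the usual default of 3000, and 500 wakes the writeback threads every 5 seconds. That second value is already the usual default, so it's the shorter expiry doing the work here; lower the wake interval too if you want the threads to look more often. Expiring pages sooner keeps the backlog shallow.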

On systems with a lot of RAM the percentage knobs are a blunt instrument anyway, so it can be cleaner to set absolute byte limits with vm.dirty_background_bytes and vm.dirty_bytes instead, which override the ratios. I stuck with the ratios here because 5% and 10% were small enough to do the job.
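
If you do go the byte-limit route, the values are plain byte counts, so a helper to convert from MiB saves counting zeros. A sketch in shell: the helper names and the 256 MiB and 1024 MiB figures are purely illustrative, not values from this box, and would need sizing against what the disk can actually sustain.

```shell
# Convert MiB to bytes.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print a sysctl.d-style fragment that uses absolute limits.
# $1 = background limit in MiB, $2 = hard limit in MiB (both hypothetical here)
print_byte_limits() {
    printf 'vm.dirty_background_bytes = %s\n' "$(mib_to_bytes "$1")"
    printf 'vm.dirty_bytes = %s\n' "$(mib_to_bytes "$2")"
}

print_byte_limits 256 1024
# vm.dirty_background_bytes = 268435456
# vm.dirty_bytes = 1073741824
```

Once a bytes value is set, its ratio counterpart reads back as 0, so a config file should carry one form or the other for each limit, not both.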

Apply it with sysctl --system and watch /proc/meminfo:

$ grep -E 'Dirty|Writeback' /proc/meminfo
Dirty:            142908 kB
Writeback:          2048 kB

What you want is for Dirty to stay small and roughly steady, and for Writeback to rarely be large. If Dirty climbs into the gigabytes you're back to building an avalanche.
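
Rather than eyeballing it, the same check can be scripted. This is a rough sketch, not what I actually ran: the function names meminfo_kb and check_dirty are invented, and the warning threshold is whatever you decide counts as "too much" on your box.

```shell
# Print the value (in kB) of one field from a meminfo-style file.
# $1 = field name (e.g. Dirty), $2 = file to read (defaults to /proc/meminfo)
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Compare Dirty against a warning threshold.
# $1 = threshold in kB (hypothetical, pick your own), $2 = optional file
check_dirty() {
    dirty=$(meminfo_kb Dirty "$2")
    if [ "$dirty" -gt "$1" ]; then
        echo "WARN Dirty=${dirty}kB"
    else
        echo "ok Dirty=${dirty}kB"
    fi
}

# Example: poll once a second and warn above 1 GiB of dirty pages.
#   while sleep 1; do check_dirty 1048576; done
```

Matching the field name exactly, colon included, keeps Writeback from also matching WritebackTmp.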

the result

The sawtooth flattened into a low hum. p99 write latency dropped by more than half and, more importantly, the periodic two-second freezes were gone entirely. The total throughput was unchanged, because the same bytes still had to reach the same disk; I'd just stopped letting them pile up into a wall.

The lesson I keep relearning: the defaults are tuned for a desktop with a spinning disk and modest RAM, and a server is neither. When a box stalls in a rhythm rather than at random, suspect a buffer somewhere that's allowed to get too full before it drains. Writeback is a common culprit, and it's a handful of sysctl lines to fix.