Ramblings of an aging IT geek

when the write cache fights back: tuning dirty_ratio

Diagnosing periodic latency stalls on a write-heavy server and fixing them by tuning the kernel's dirty page writeback thresholds.

A write-heavy box of mine had a horrible habit. Most of the time it was fine, and then every minute or so everything would stall for a couple of seconds: SSH would hang, request latency would spike, the load average would jump for no obvious reason, and then it would all come good again. Classic sawtooth. The CPU was not busy. The disks were not full. Something was hitching, rhythmically, and rhythmic problems usually mean a timer.

The timer in question was the kernel flushing dirty pages.

what dirty pages are doing to you

When a process writes to a file, the data does not go straight to disk. It lands in the page cache and is marked "dirty", and the kernel writes it out later in the background. This is almost always what you want, because it lets the kernel batch writes and lets your application carry on without blocking on slow storage.

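The trouble starts when too much dirty data piles up. The kernel has two thresholds, and on most distributions the defaults are still expressed as a percentage of memory (strictly, of available memory: free plus reclaimable pages, which on a mostly idle box is close to total RAM). That made sense in 2008 and makes much less sense on a box with a lot of memory.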

  • vm.dirty_background_ratio: once this fraction of RAM is dirty, the kernel starts writing it out in the background, asynchronously. The application does not notice.
  • vm.dirty_ratio: once this fraction is dirty, the kernel makes writing processes block and flush synchronously. The application very much notices.
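
To see which thresholds a box is actually running with, the four knobs can be read straight from /proc/sys/vm. This is only a minimal sketch: the function name show_dirty_knobs and its optional directory argument are my own, added so it can be pointed at a copy of the files.

```shell
# Print the four dirty-page thresholds. In each ratio/bytes pair,
# whichever one is not in use reads back as 0.
# The optional argument (hypothetical, for testing) overrides the directory.
show_dirty_knobs() {
    dir="${1:-/proc/sys/vm}"
    for knob in dirty_background_ratio dirty_ratio dirty_background_bytes dirty_bytes; do
        if [ -r "$dir/$knob" ]; then
            printf 'vm.%s = %s\n' "$knob" "$(cat "$dir/$knob")"
        fi
    done
}

show_dirty_knobs
```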

On a box with 64 GB of RAM, the usual defaults of 10 and 20 percent mean the kernel will happily let up to roughly 6.4 GB of dirty pages accumulate before it even starts flushing in the background, and up to roughly 12.8 GB before it slams on the brakes and forces every writer to wait. When that synchronous flush kicks in, you get exactly the stall I was seeing: a big slug of writes dumped to disk all at once, everything blocked behind it.
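
The arithmetic behind those figures is simple enough to sketch in shell. This assumes the ratios apply to total RAM, which is the rough way to think about it. The kernel actually applies them to available memory, so real thresholds come out somewhat lower. The function name dirty_threshold_mib is my own.

```shell
# Upper-bound estimate of a dirty threshold in MiB:
# RATIO percent of a memory size given in KiB.
dirty_threshold_mib() {
    total_kib=$1
    ratio=$2
    echo $(( total_kib * ratio / 100 / 1024 ))
}

# 64 GiB of RAM is 67108864 KiB
dirty_threshold_mib 67108864 10   # background threshold
dirty_threshold_mib 67108864 20   # blocking threshold
```

That prints 6553 and 13107 (MiB), which is about 6.4 and 12.8 GiB. On a live box, the first argument can come from the MemTotal line of /proc/meminfo and the second from sysctl -n vm.dirty_ratio.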

seeing it happen

You do not have to guess. The kernel exposes the dirty page count in /proc/meminfo, and you can watch it climb and collapse:

watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

When I ran that during one of the stalls, Dirty climbed steadily into the gigabytes, then dropped off a cliff at the exact moment of the hitch, while Writeback briefly shot up. That is the synchronous flush: dirty pages being forcibly pushed into writeback all at once. The sawtooth in my latency graph and the sawtooth in Dirty lined up perfectly.
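
If you want to line the numbers up against a latency graph afterwards rather than by eye, a timestamped log helps. A rough sketch, not what I originally ran: the function name dirty_sample, its optional file argument and the dirty.log filename are all made up for illustration.

```shell
# Print one sample: epoch seconds, Dirty (kB), Writeback (kB).
# The optional argument (hypothetical, for testing) overrides the meminfo path.
dirty_sample() {
    awk -v now="$(date +%s)" '
        /^Dirty:/     { dirty = $2 }
        /^Writeback:/ { writeback = $2 }
        END           { print now, dirty + 0, writeback + 0 }
    ' "${1:-/proc/meminfo}"
}

# Log one sample per second until interrupted:
#   while sleep 1; do dirty_sample; done >> dirty.log
if [ -r /proc/meminfo ]; then
    dirty_sample
fi
```

The anchored patterns skip the WritebackTmp line, so each sample has exactly three fields.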

the fix

The goal is to stop letting so much dirty data accumulate in the first place. Instead of huge infrequent flushes, you want smaller, more frequent ones that the background writer can handle without ever hitting the synchronous wall.

For that, the byte-based knobs are far better than the ratio ones, because they do not scale with RAM and give you a number you can actually reason about:

# Run these as root (or via sudo)
# Start background writeback at 256 MiB dirty
sysctl -w vm.dirty_background_bytes=$((256 * 1024 * 1024))

# Make writers block only past 1 GiB dirty
sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024))

Writing either _bytes variant makes the kernel ignore its _ratio counterpart, which then reads back as 0, so you are not fighting two sets of thresholds. I also halved the expiry timer from its usual default of 3000 centiseconds (30 seconds) so dirty pages do not linger as long on a quiet box:

sysctl -w vm.dirty_expire_centisecs=1500
sysctl -w vm.dirty_writeback_centisecs=500

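That tells the flusher to wake every five seconds and write out anything that has been dirty for more than fifteen. The five-second wakeup matches the usual kernel default, so only the expiry value is actually a change.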

To make the settings stick across reboots, put them in a file under /etc/sysctl.d/:

# /etc/sysctl.d/60-dirty.conf
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 1073741824
vm.dirty_expire_centisecs = 1500
vm.dirty_writeback_centisecs = 500
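
After a reboot, or after reloading with sysctl --system, it is worth confirming that the live values match the file. A hedged sketch: check_sysctl_conf and its second argument are my own invention, and it only handles simple single-value keys like the ones above.

```shell
# Compare each "key = value" line in a sysctl config file with the live
# value under /proc/sys. Prints mismatches; returns non-zero if any differ.
# The second argument (hypothetical, for testing) overrides the /proc/sys root.
check_sysctl_conf() {
    conf=$1
    root="${2:-/proc/sys}"
    status=0
    while IFS='=' read -r key want || [ -n "$key" ]; do
        # Strip all whitespace (fine here: every value is a single token)
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        want=$(printf '%s' "$want" | tr -d '[:space:]')
        # Skip blank lines and comments
        case "$key" in ''|'#'*) continue ;; esac
        # vm.dirty_bytes lives at <root>/vm/dirty_bytes
        path="$root/$(printf '%s' "$key" | tr . /)"
        have=$(cat "$path" 2>/dev/null) || have=''
        if [ "$have" != "$want" ]; then
            echo "MISMATCH $key: want $want, have ${have:-unreadable}"
            status=1
        fi
    done < "$conf"
    return $status
}

# Typical use, as root:
#   sysctl --system && check_sysctl_conf /etc/sysctl.d/60-dirty.conf
```

Silence and a zero exit status mean everything in the file is in effect.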

the result

The stalls went away. The latency graph went flat. Dirty in /proc/meminfo now hovers in the low hundreds of megabytes and never makes the dramatic climb-and-collapse it used to. Throughput was unchanged: I was not writing less data, I was just writing it out steadily instead of in great lurching batches.

the caveats

A couple of honest warnings before you go editing sysctls on a Friday afternoon.

This is not a universal "make it faster" tweak. Smaller thresholds mean more frequent writeback, which on some workloads loses you a bit of batching efficiency. The win here was latency consistency, not throughput, and those two are sometimes in tension. If you have a bursty write workload that genuinely benefits from a large cache to absorb spikes, shrinking the buffer can hurt.

And the right numbers depend entirely on your storage. A box on NVMe can drain a gigabyte of dirty pages fast enough that the synchronous flush barely registers. A box on spinning rust, or worse on a network filesystem, cannot, and that is exactly where the defaults bite hardest. Measure your own Dirty count under load before and after. If it is not climbing into multiple gigabytes and then collapsing, this is probably not your problem and you should look elsewhere.
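
One way to do that measurement is to record the peak Dirty value over a fixed window, once before the change and once after. A minimal sketch under my own naming: peak_dirty_mib and its arguments are illustrative, not a standard tool.

```shell
# Sample Dirty from a meminfo-format file SAMPLES times, INTERVAL seconds
# apart, and print the peak in MiB.
# The third argument (hypothetical, for testing) overrides the meminfo path.
peak_dirty_mib() {
    samples=$1
    interval=$2
    file="${3:-/proc/meminfo}"
    peak=0
    i=0
    while [ "$i" -lt "$samples" ]; do
        kib=$(awk '/^Dirty:/ { print $2 }' "$file")
        if [ "${kib:-0}" -gt "$peak" ]; then
            peak=$kib
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo $(( peak / 1024 ))
}

# Five minutes at one sample per second, run while the box is under load:
#   peak_dirty_mib 300 1
```

A peak in the thousands of MiB before the change and the low hundreds after is the pattern described above. A peak that was never large to begin with points somewhere else.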

For my write-heavy box on slower storage, it was the whole fix. One config file, a flat latency graph, and a stall I had been quietly tolerating for months gone for good.