Ramblings of an aging IT geek
linux

when writeback stalls everything, look at dirty_ratio

Periodic write stalls on a busy server traced to the kernel's dirty page writeback, and how lowering dirty_ratio smoothed it out.

The symptom was a box that felt fine for thirty seconds and then froze for one. Not crashed, not swapping, just a periodic stall where everything writing to disk hung, drained, and carried on. Latency graphs looked like a heartbeat. The application logs were innocent. The problem was further down, in how the kernel handles dirty pages.

When you write to a file, the data doesn't go straight to disk. It sits in the page cache, marked dirty, and the kernel flushes it later in the background. That's usually a good thing: it batches writes and lets the application get on with its life. The trouble starts when too much dirty data piles up. Once you cross vm.dirty_ratio, the kernel stops being polite and blocks the writing process inside its write call until enough pages hit the disk.

On this box the defaults were the problem. With a lot of RAM, dirty_ratio at 20 per cent of available memory meant gigabytes of dirty pages could accumulate before the kernel cared. Then it cared all at once, dumped the lot at the disk, and the disk could not keep up. Everything writing stalled until the backlog cleared. The bigger the memory, the bigger the bomb.
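
To put rough numbers on that, here is a back-of-envelope sketch. The 64 GiB of available memory and the 150 MiB/s disk are assumptions picked for illustration, not measurements from this box, and the function names are made up. The drain time is a worst case for clearing the whole ceiling; in practice writers are let go once the dirty total falls back under the limit, so a real stall is shorter.

```shell
#!/bin/sh
# Rough arithmetic: how much dirty data a ratio allows, and how long a
# disk would need to drain all of it. All figures are illustrative.

# dirty_limit_kb MEM_KB RATIO -> kB of dirty data allowed at that ratio
dirty_limit_kb() {
    echo $(( $1 * $2 / 100 ))
}

# drain_seconds DIRTY_KB DISK_MIB_PER_S -> whole seconds to flush it all
drain_seconds() {
    echo $(( $1 / 1024 / $2 ))
}

# Assumed: 64 GiB available, 20 per cent ceiling, disk writing 150 MiB/s.
mem_kb=$(( 64 * 1024 * 1024 ))
limit_kb=$(dirty_limit_kb "$mem_kb" 20)
echo "ceiling: $(( limit_kb / 1024 )) MiB, drain: $(drain_seconds "$limit_kb" 150) s"
# prints: ceiling: 13107 MiB, drain: 87 s
```

Swap in your own MemAvailable figure from /proc/meminfo and whatever your disk actually sustains; the point is only that the ceiling scales with memory whilst the disk does not.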

You can watch it happen:

$ grep -E 'Dirty|Writeback' /proc/meminfo
Dirty:           1843264 kB
Writeback:         98304 kB

Catch it mid-stall and Dirty is enormous whilst Writeback crawls. That gap is your stall.
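
Catching it by hand is a matter of luck, so a small sampling loop helps. This is a minimal sketch; dirty_sample and watch_dirty are names I have invented for it, and it assumes a POSIX shell with awk available.

```shell
#!/bin/sh
# dirty_sample [FILE] -> prints "Dirty_kB Writeback_kB" from a
# meminfo-format file (defaults to /proc/meminfo). The exact match on
# "Writeback:" deliberately skips the separate WritebackTmp line.
dirty_sample() {
    awk '$1 == "Dirty:"     { d = $2 }
         $1 == "Writeback:" { w = $2 }
         END                { print d, w }' "${1:-/proc/meminfo}"
}

# watch_dirty [INTERVAL] -> one timestamped sample per interval
# (default 1 second) until interrupted with Ctrl-C.
watch_dirty() {
    while :; do
        printf '%s %s\n' "$(date +%T)" "$(dirty_sample)"
        sleep "${1:-1}"
    done
}
```

Run watch_dirty in one terminal whilst the box is under load and line the timestamps up against the spikes in the latency graph.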

The fix was to stop letting so much dirty data accumulate. Lower the thresholds so the kernel flushes earlier and more often, in smaller amounts the disk can actually absorb:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

dirty_background_ratio is when the kernel starts flushing in the background, quietly. dirty_ratio is the hard ceiling where it starts blocking writers. Setting the background trigger well below the ceiling means writeback gets going early and steadily, so you rarely hit the wall. On boxes with a lot of memory I'd go further and use the byte-based knobs (dirty_background_bytes, dirty_bytes) instead, because a percentage of 256 GB is an absurd amount of dirty data to ever allow. Each byte knob replaces its ratio counterpart: write one and the other reads back as 0, so only one of each pair is in force at a time.
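
For the byte-based variant, a sketch of generating the two settings from sizes in MiB. The 256 MiB and 1024 MiB figures are placeholders I have picked for illustration, not values tested on this box, and dirty_bytes_conf is a made-up helper name; size them against what your disk can drain in a second or so.

```shell
#!/bin/sh
# dirty_bytes_conf BACKGROUND_MIB CEILING_MIB
# Prints sysctl settings for the byte-based knobs, converting MiB to bytes.
dirty_bytes_conf() {
    printf 'vm.dirty_background_bytes = %d\n' $(( $1 * 1024 * 1024 ))
    printf 'vm.dirty_bytes = %d\n'            $(( $2 * 1024 * 1024 ))
}

# Assumed example sizes: start background flushing at 256 MiB,
# block writers at 1024 MiB.
dirty_bytes_conf 256 1024
# prints:
# vm.dirty_background_bytes = 268435456
# vm.dirty_bytes = 1073741824
```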

I set these with sysctl, watched the heartbeat in the latency graph flatten out over the next few minutes, then made it permanent in /etc/sysctl.d/. The throughput is fractionally lower on paper, because we're flushing in smaller batches, but nobody runs a server for its benchmark figures. They run it so it doesn't freeze for a second every half minute, and now it doesn't.
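
Roughly what that looked like, as a sketch rather than a transcript. The helper name and the file name under /etc/sysctl.d/ are made up for the example; sysctl -w and sysctl --system are the standard procps commands, and both need root.

```shell
#!/bin/sh
# persist_dirty_ratios FILE BACKGROUND CEILING
# Writes a sysctl.d-style fragment to FILE; the caller chooses the path.
persist_dirty_ratios() {
    cat > "$1" <<EOF
vm.dirty_background_ratio = $2
vm.dirty_ratio = $3
EOF
}

# Typical use, as root (the file name is a hypothetical example):
#   sysctl -w vm.dirty_background_ratio=5   # takes effect immediately
#   sysctl -w vm.dirty_ratio=10
#   persist_dirty_ratios /etc/sysctl.d/90-writeback.conf 5 10
#   sysctl --system                         # reload all sysctl.d fragments
```

Setting the live values first and only persisting them once the graph has settled means a bad guess costs you nothing more than a reboot to undo.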

The general lesson, which I keep relearning, is that more buffering is not free. A big cache hides small problems and saves up for one large one. When a busy box stalls in a rhythm, look at what's being batched and ask whether the batch has quietly grown bigger than the thing draining it.