Ramblings of an aging IT geek
linux

when a write-heavy box stalls every thirty seconds

A server that froze in periodic bursts under heavy writes, traced to the kernel flushing a huge backlog of dirty pages at once, and how lowering vm.dirty_ratio smoothed it out.

[Image: a terminal showing iostat output during a write burst]

The symptom was a box that ran perfectly for thirty seconds, then froze for two, then ran perfectly again. Like clockwork. Under a sustained write workload, latency would be fine and then, periodically, everything blocked: new writes hung, even unrelated processes stuttered, and iostat showed the disk pinned at 100% util in a sharp burst before going quiet again. Average throughput looked healthy. The averages were lying. Underneath was a sawtooth of long idle troughs punctuated by brutal flush spikes.
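To make the "averages were lying" point concrete, here is a small sketch using made-up numbers, not measurements from the box in question: thirty one-second samples of a mostly quiet disk followed by two samples pinned at 100%. The sample values and variable names are assumptions chosen only to mimic the sawtooth.

```python
# Synthetic per-second disk utilisation (%), NOT real measurements:
# thirty quiet seconds followed by a two-second flush burst.
from statistics import mean

quiet = [10.0] * 30   # assumed idle-ish utilisation between flushes
burst = [100.0] * 2   # disk pinned while the backlog is flushed
samples = quiet + burst

average = mean(samples)
peak = max(samples)

print(f"average util: {average:.1f}%")
print(f"peak util:    {peak:.1f}%")
```

The average lands in the mid-teens and looks harmless, while the peak shows the two seconds in which everything was waiting on the disk. Looking at the maximum or a high percentile rather than the mean is what exposes the spike.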

This is the kernel's writeback behaviour, and the knobs are vm.dirty_ratio and vm.dirty_background_ratio.

When you write to a file, the data doesn't go straight to disk. It lands in the page cache as "dirty" pages, and the kernel flushes them out later, in the background, so your writes return quickly. Two thresholds govern this, expressed as a percentage of available memory:

  • vm.dirty_background_ratio: once this fraction of memory is dirty, the kernel starts flushing in the background, asynchronously. Your processes don't notice.
  • vm.dirty_ratio: the hard ceiling. Once this fraction is dirty, the kernel stops being polite. Any process that tries to write is blocked and made to flush synchronously until the backlog comes down.
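One way to watch these two thresholds at work is to compare the current amount of dirty data against the configured ratios. The following is a minimal sketch, assuming a Linux box with /proc mounted. The Dirty and Writeback fields of /proc/meminfo and the files under /proc/sys/vm are standard, but the function names here are my own for illustration.

```python
# Report how much dirty data is waiting and what the thresholds are set to.
# Linux only: the report() helper reads /proc/meminfo and /proc/sys/vm/*.
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Map each /proc/meminfo field name to its numeric value (mostly kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def read_vm_knob(name: str) -> int:
    """Read an integer setting from /proc/sys/vm, e.g. 'dirty_ratio'."""
    return int(Path("/proc/sys/vm", name).read_text().strip())


def report() -> None:
    """Print the current dirty/writeback totals and both ratio knobs."""
    info = parse_meminfo(Path("/proc/meminfo").read_text())
    print(f"Dirty:     {info['Dirty']} kB")
    print(f"Writeback: {info['Writeback']} kB")
    for knob in ("dirty_background_ratio", "dirty_ratio"):
        print(f"vm.{knob} = {read_vm_knob(knob)}")
```

Calling report() once a second during a write burst should show Dirty climbing between flushes and dropping as writeback catches up, which is the sawtooth in raw form.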

[Image: a diagram of memory filling with dirty pages]

On a box with a lot of RAM the defaults are the problem. The usual default dirty_ratio of 20% sounds modest until you do the arithmetic. On a machine with, say, 64GB of RAM, 20% is up to about 12.8GB of dirty pages allowed to accumulate before the hard limit kicks in. When a slow-ish disk finally has to flush a backlog that size, that takes seconds, and for those seconds every writer on the box is frozen. That was my two-second stall, every time the dirty pages climbed to the ceiling.
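The arithmetic is easy to check. This sketch works out the ceiling for a few ratios on a 64GB machine, with two simplifications that are mine rather than the kernel's: decimal gigabytes, and treating all of RAM as "available memory". The 500MB/s disk throughput is a hypothetical figure, not a measurement.

```python
# Worked arithmetic for ratio-based dirty-page ceilings.
# Simplifications: 1GB = 10**9 bytes, and total RAM stands in for
# the kernel's notion of available memory.
GB = 10**9


def dirty_limit_bytes(available_bytes: int, ratio_percent: int) -> int:
    """Bytes of dirty data allowed before a ratio-based threshold trips."""
    return available_bytes * ratio_percent // 100


def drain_seconds(backlog_bytes: int, disk_bytes_per_sec: int) -> float:
    """Time to write out a backlog at a sustained throughput."""
    return backlog_bytes / disk_bytes_per_sec


ram = 64 * GB
for ratio in (20, 10, 5):
    limit = dirty_limit_bytes(ram, ratio)
    print(f"{ratio:>2}% of 64GB = {limit / GB:.1f}GB")

# Hypothetical disk sustaining 500MB/s: even 1GB of backlog costs seconds.
seconds = drain_seconds(1 * GB, 500 * 10**6)
print(f"1GB at 500MB/s takes {seconds:.1f}s")
```

Even at the lowest ratio shown, the ceiling on a 64GB machine is still measured in gigabytes, which is why a percentage can be a blunt tool on large-memory boxes.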

The fix is to let far less accumulate before writeback starts, so the disk drains a steady trickle instead of a periodic flood:

# /etc/sysctl.d/30-writeback.conf
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Apply with sysctl -p /etc/sysctl.d/30-writeback.conf. On a box with a lot of memory you may want the byte-denominated knobs instead, vm.dirty_background_bytes and vm.dirty_bytes, which cap the absolute amount of dirty data rather than a percentage. Each byte knob and its ratio counterpart are mutually exclusive: writing one makes the other read back as 0, so set one form or the other, not both. A fixed cap of a few hundred megabytes is often easier to reason about than a percentage that scales with RAM you weren't trying to use as a write buffer.
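To illustrate the difference between the two forms, here is a rough sketch of how the ceiling works out under each. It assumes the kernel's documented rule that a non-zero byte value takes precedence over the ratio. The function name, the 256MiB cap and the two machine sizes are hypothetical choices for illustration.

```python
# Compare a ratio-based ceiling with a fixed byte cap on two machine sizes.
MiB = 1024 * 1024
GiB = 1024 * MiB


def effective_dirty_limit(ratio_percent: int, limit_bytes: int,
                          available_bytes: int) -> int:
    """Return the dirty ceiling in bytes for one ratio/bytes pair.

    A non-zero byte value wins; otherwise the percentage is applied
    to available memory.
    """
    if limit_bytes > 0:
        return limit_bytes
    return available_bytes * ratio_percent // 100


# Same settings on two hypothetical machines: the ratio scales with RAM,
# the byte cap does not.
for ram_gib in (8, 64):
    by_ratio = effective_dirty_limit(10, 0, ram_gib * GiB)
    by_bytes = effective_dirty_limit(0, 256 * MiB, ram_gib * GiB)
    print(f"{ram_gib:>2} GiB RAM: ratio 10% -> {by_ratio // MiB} MiB, "
          f"byte cap -> {by_bytes // MiB} MiB")
```

The ratio-based ceiling grows eightfold between the two machines while the byte cap stays put, which is the whole appeal: the write buffer is sized to what the disk can drain, not to how much RAM happens to be installed.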

The result was exactly what I wanted: the sawtooth flattened. Background flushing kicks in early and often, the dirty page count never climbs high enough to slam into the hard ceiling, and the disk does steady continuous work instead of bursts. Peak throughput is fractionally lower, because you're not letting the cache absorb a huge burst and flush it all in one go. But the worst-case latency dropped from two seconds to nothing noticeable, and on a box people are actually waiting on, predictable-and-slightly-slower beats fast-then-frozen every single time.

The general lesson, which I relearn periodically: large buffers don't remove a bottleneck, they just postpone it and make it hurt more when it arrives. A small steady drain beats a big periodic flood whenever something is waiting on the other end.