Ramblings of an aging IT geek

when the page cache fights back

Why a write-heavy box stalled for seconds at a time, and how tuning vm.dirty_ratio and friends turned a wall of latency into a steady trickle.

The box was fine for hours, then it would freeze. Not crash, freeze. Every request in flight would hang for two, sometimes three seconds, then everything would catch up at once and carry on as if nothing had happened. No CPU spike worth mentioning. Load average sitting comfortably. The disks were busy but not saturated. Classic "everything looks fine, why is it terrible" territory.

It was the page cache, of course. It usually is.

what dirty_ratio actually does

When a process writes to a file, the kernel doesn't shove those bytes straight at the disk. It writes them into the page cache and marks the pages dirty, then returns to your process almost instantly. The actual write-back to disk happens later, asynchronously, by the flusher threads. This is why writes feel fast: most of the time you're writing to RAM.

The catch is that dirty pages can't accumulate forever. Two knobs govern how much they're allowed to pile up:

  • vm.dirty_background_ratio: the percentage of available memory that can be dirty before the kernel starts flushing in the background, quietly.
  • vm.dirty_ratio: the percentage at which the kernel decides enough is enough and starts throttling. Once you cross this line, your writing process gets blocked and made to do write-back itself until things calm down.

That second one is the stall. When you hit dirty_ratio, writes stop being asynchronous and become synchronous, for everyone, all at once. On a box with a lot of RAM and the default ratios, the amount of data that has to be forced to disk in that moment can be enormous.

The defaults on this machine were the old percentage-based ones:

$ sysctl vm.dirty_ratio vm.dirty_background_ratio
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10

Twenty percent doesn't sound like much until you remember the box had 128 GB of RAM. Twenty percent of that is about 25 GB of dirty pages allowed before the throttle kicks in. The background flush wouldn't even start until roughly 12 GB were dirty. So the machine would happily buffer gigabytes of writes in RAM, do nothing about them, and then the moment it crossed the line it would try to push the lot to a disk array that can sustain maybe a few hundred megabytes a second. Two or three seconds of everyone-stops while it drained. Then back to buffering. Lather, rinse, stall.
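
If you want to run the same sums for your own box, here is a minimal sketch. It assumes the ratio is applied to total RAM, which overstates things a little because the kernel works from available memory; ratio_to_gib is just a name I made up for the helper.

```shell
# ratio_to_gib RAM_GIB RATIO_PERCENT
# Prints the dirty-page threshold in GiB for a given RAM size and ratio.
# Assumption: total RAM is used as the base. The kernel's real base is
# available (free plus reclaimable) memory, so treat the result as an
# upper bound rather than the exact trigger point.
ratio_to_gib() {
    awk -v ram="$1" -v pct="$2" 'BEGIN { printf "%.1f\n", ram * pct / 100 }'
}

ratio_to_gib 128 20   # hard throttle (vm.dirty_ratio)
ratio_to_gib 128 10   # background flush (vm.dirty_background_ratio)
```

That prints 25.6 and 12.8. The real trigger points sit somewhat lower, since not all of the RAM counts as available.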

watching it happen

You can see the dirty pages directly:

$ watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'
Dirty:          11823104 kB
Writeback:             0 kB

Run that during a load test and you can watch the number climb steadily, then drop off a cliff the instant write-back begins. The cliff is the stall. If Writeback is large and Dirty is large at the same time, you're in it.
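
watch is fine for eyeballing, but a timestamped log is easier to line up against your latency graphs afterwards. A rough sketch of how that could look; dirty_state and log_dirty are hypothetical helper names, and the parsing assumes the usual "Name: value kB" layout of /proc/meminfo.

```shell
# dirty_state reads /proc/meminfo-style text on stdin and prints
# "dirty_mib writeback_mib" on one line (values truncated to whole MiB).
dirty_state() {
    awk '$1 == "Dirty:"     { d = $2 }
         $1 == "Writeback:" { w = $2 }
         END { printf "%d %d\n", d / 1024, w / 1024 }'
}

# log_dirty prints one timestamped sample per second until interrupted.
# Linux only, since it reads /proc/meminfo directly.
log_dirty() {
    while sleep 1; do
        printf '%s %s\n' "$(date +%T)" "$(dirty_state < /proc/meminfo)"
    done
}
```

Run log_dirty in a spare terminal during the load test and redirect it to a file. The cliff shows up as the first column of numbers collapsing while the second one jumps.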

I also kept vmstat 1 open. The bo column (blocks out) goes quiet while pages accumulate, then spikes to the device's ceiling and parks there. When bo is pinned and your wa (iowait) climbs, that's the kernel sweating to drain the cache.
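
If the full vmstat table is too noisy, you can trim it down to the two columns that matter. This is a sketch, assuming the standard vmstat layout where the column-name row starts with r; vmstat_bo_wa is a name I picked for illustration. It finds the columns by header name rather than by position, so it should survive minor differences between versions.

```shell
# vmstat_bo_wa reads vmstat output on stdin and prints "bo wa" per sample.
vmstat_bo_wa() {
    awk '
        # Column-name row: remember which fields hold bo and wa.
        $1 == "r" {
            for (i = 1; i <= NF; i++) {
                if ($i == "bo") bo = i
                if ($i == "wa") wa = i
            }
            next
        }
        # Data rows start with a number; skip anything before the header.
        bo && $1 ~ /^[0-9]+$/ { print $bo, $wa }
    '
}
```

Feed it a bounded run such as vmstat 1 10 | vmstat_bo_wa, so the output arrives as soon as vmstat exits even if your awk buffers its input.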

the fix

The trick is to stop the kernel hoarding writes. Lower the thresholds so write-back starts early and often, in small amounts, rather than rarely and catastrophically. On a box with lots of RAM I prefer the byte-based knobs to the percentage ones, because a percentage of 128 GB is a silly number of bytes.

# /etc/sysctl.d/30-dirty.conf
# Start background write-back at 256 MiB of dirty pages.
vm.dirty_background_bytes = 268435456
# Throttle writers at 1 GiB of dirty pages.
vm.dirty_bytes = 1073741824

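Each _bytes knob and its _ratio counterpart are mutually exclusive: writing dirty_background_bytes makes dirty_background_ratio read back as 0, and the same goes for dirty_bytes and dirty_ratio, in both directions. So you don't end up with the two fighting each other. Apply it: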

$ sudo sysctl --system
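
It's worth confirming the change actually took, and which flavour of knob is now in charge. A small sketch that reads the four files under /proc/sys/vm; dirty_mode is a hypothetical helper, and the optional directory argument is only there so it can be pointed at sample files.

```shell
# dirty_mode [DIR]
# Reports, for each threshold, whether the bytes or the ratio knob is active.
# DIR defaults to /proc/sys/vm.
dirty_mode() {
    dir="${1:-/proc/sys/vm}"
    for knob in dirty_background dirty; do
        bytes=$(cat "$dir/${knob}_bytes")
        ratio=$(cat "$dir/${knob}_ratio")
        # Whichever knob was written last is non-zero; the other reads 0.
        if [ "$bytes" -gt 0 ]; then
            echo "$knob: bytes=$bytes (ratio reads $ratio)"
        else
            echo "$knob: ratio=$ratio% (bytes reads $bytes)"
        fi
    done
}
```

Call dirty_mode with no arguments on the live box. If both lines report bytes, the new file has been picked up.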

Now background flushing starts once a mere 256 MB is dirty, which on this array is well under a second of write-back. The hard throttle at 1 GB still exists as a safety net, but in practice you rarely reach it because the background flusher is already chipping away. The writes become a steady trickle to disk instead of an occasional flood.
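
To sanity-check a threshold against your own storage, divide it by what the array can sustain. A back-of-envelope sketch: the 400 MB/s figure is an assumption standing in for "a few hundred megabytes a second", and drain_seconds is a made-up helper name. Plug in whatever your own array measures.

```shell
# drain_seconds BYTES MB_PER_SEC
# Prints how long BYTES of dirty data take to write at a sustained rate.
# MB_PER_SEC is in decimal megabytes per second, and the rate is assumed
# constant, which real arrays under mixed load will not quite manage.
drain_seconds() {
    awk -v b="$1" -v mbps="$2" 'BEGIN { printf "%.2f\n", b / (mbps * 1000000) }'
}

drain_seconds 268435456 400     # background threshold, 256 MiB
drain_seconds 1073741824 400    # hard throttle, 1 GiB
```

At that assumed rate the background threshold is about 0.67 seconds of work and the hard limit about 2.68 seconds, so the worst case is bounded by the safety net rather than by however much RAM you happen to have.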

The other knob worth a look is vm.dirty_expire_centisecs, which controls how old a dirty page can get before the flusher threads write it out on their next pass, regardless of the thresholds above. The default 3000 (30 seconds) is fine for most things, but if you have data you really don't want sitting in volatile RAM for half a minute, lower it. I left it alone here; the threshold change did the work.
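
The unit trips people up, so if you do lower it, a tiny helper to turn seconds into the right config line doesn't hurt. A sketch only: expire_conf_line is a hypothetical name, and the 15 seconds is an arbitrary example rather than a recommendation.

```shell
# expire_conf_line SECONDS
# Prints a sysctl.d line setting vm.dirty_expire_centisecs, converting
# whole seconds to centiseconds (hundredths of a second).
expire_conf_line() {
    echo "vm.dirty_expire_centisecs = $(( $1 * 100 ))"
}

expire_conf_line 15   # example value only
```

That prints vm.dirty_expire_centisecs = 1500, which can go in the same sysctl.d file as the thresholds.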

the result

The stalls vanished. Latency went from a sawtooth, flat then a cliff then flat, to a gentle, boring line. p99 dropped from somewhere north of two seconds to comfortably under fifty milliseconds. iowait stopped having its periodic tantrums. The disks are busier on average now, which is exactly the point: better to keep them gently occupied than to ignore them and then panic.

A word of caution before you go pasting dirty_bytes everywhere. These numbers are right for this box: lots of RAM, a write-heavy workload, an array that can't absorb a sudden multi-gigabyte dump. A laptop with an NVMe drive and a desktop workload wants different values, and the defaults are honestly fine there. Tuning the dirty ratios is a thing you do when you've measured a problem, not a thing you sprinkle on for luck. Measure the dirty pages, watch the cliff, then move the line so the cliff never comes.

And if your box freezes for seconds at a time while looking otherwise healthy, grep Dirty /proc/meminfo is a very good first thing to type.