Ramblings of an aging IT geek
← Ramblings of an aging IT geek
linux

the page cache was lying to me about disk writes

Why a box with plenty of RAM was stalling under bursty writes, and how lowering vm.dirty_ratio smoothed it out.

A Linux terminal

A batch job on one of our ingest boxes kept stalling for whole seconds at a time, and the disks weren't even busy when it happened. Then they'd be flat out for a moment, then quiet again. Classic write-cache thrash.

The default vm.dirty_ratio of 20 means the kernel will happily let dirty pages pile up until they're a fifth of available memory before it forces the writing process to block and flush. On a box with 64GB of RAM that's a lot of data to suddenly push at a spinning disk. Everything looks fine, then you hit the wall and the whole thing seizes whilst it catches up.

I dropped it, and dropped dirty_background_ratio further so writeback starts earlier and gentler:

vm.dirty_background_ratio = 3
vm.dirty_ratio = 10

The peaks flattened out almost immediately. Slightly more total IO over time, much less of it arriving as a single brick wall. On a busy box that trade is nearly always worth it. The newer dirty_bytes knobs give you absolute numbers instead of percentages if your memory size makes the ratios silly, which on big boxes they increasingly are. Either way, stop trusting the defaults on anything that writes in bursts.