when the box freezes for a second, look at dirty_ratio

The symptom was a heartbeat of misery. Every thirty seconds or so the ingest box would stall: request latency spiked, iostat showed the disks pinned at 100% util for a couple of seconds, then everything went quiet again until the next pulse. CPU was fine. Memory looked fine. It was the rhythm that gave it away.

That pulse is the kernel's writeback finally flushing a mountain of dirty page cache to disk all at once. On a box with a lot of RAM, the defaults let dirty pages pile up: background flushing does not start until vm.dirty_background_ratio is reached, and writers are not throttled until vm.dirty_ratio, both of which are percentages of available memory. On this machine that meant several gigabytes of "I'll write that later" sitting in memory. When later arrives, it arrives as one enormous synchronous gulp, and every process trying to do I/O queues up behind it.
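One way to see the pulse directly is to watch the Dirty and Writeback counters in /proc/meminfo, which Linux reports in kB. The sketch below is a minimal illustration, not the tooling used on the ingest box; the function names, the one-second interval and the sample count are all assumptions made for the example.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}.

    Dirty and Writeback are reported in kB; a few other fields
    (for example the HugePages_* counters) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def watch(interval=1.0, samples=60, path="/proc/meminfo"):
    """Print dirty and in-flight writeback memory once per interval."""
    for _ in range(samples):
        with open(path) as f:
            info = parse_meminfo(f.read())
        dirty_mib = info.get("Dirty", 0) / 1024
        writeback_mib = info.get("Writeback", 0) / 1024
        print(f"Dirty={dirty_mib:10.1f} MiB  Writeback={writeback_mib:10.1f} MiB")
        time.sleep(interval)


if __name__ == "__main__":
    watch()
```

If the stalls come from bursty writeback, Dirty should climb steadily and then drop sharply while Writeback jumps, in step with the latency spikes.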

The fix is to stop letting it hoard. I switched from the ratio knobs to the byte knobs, because a percentage of 256GB of RAM is an absurd amount of dirty data to flush in one go:
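To put a number on "absurd", here is the arithmetic as a rough sketch. It assumes the common kernel defaults of 10 for vm.dirty_background_ratio and 20 for vm.dirty_ratio, and it treats all 256 GiB as available memory, which overstates things because the kernel applies the ratios to free plus reclaimable memory. The results are an upper bound for illustration, not a measurement from this box.

```python
GIB = 1024 ** 3


def ratio_threshold_bytes(available_bytes, ratio_percent):
    """Bytes of dirty memory allowed at a given percentage of available memory."""
    return available_bytes * ratio_percent // 100


# Assumption: the whole 256 GiB counts as available memory (an upper bound).
available = 256 * GIB

# Assumption: stock ratios of 10 (background) and 20 (throttle).
background = ratio_threshold_bytes(available, 10)
hard_limit = ratio_threshold_bytes(available, 20)

print(f"background writeback starts near {background / GIB:.1f} GiB")
print(f"writers are throttled near {hard_limit / GIB:.1f} GiB")
```

Even if the real available memory is a fraction of that, the thresholds sit far above anything a disk can flush without everyone noticing.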

vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 536870912

That's 256 MiB to start background writeback and 512 MiB as the hard ceiling at which writers get throttled. Each byte knob overrides its ratio counterpart, which reads back as 0 once the byte value is set. Background flushing now begins early and trickles, so writeback is a steady stream rather than a periodic tidal wave. The stalls went from a couple of seconds every half-minute to nothing measurable. Throughput was identical; it just stopped arriving in lumps.

The lesson, again, is that a "busy disk" problem is often really a "we let the cache get greedy" problem, and the cure is to make the box flush little and often instead of rarely and catastrophically.
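Long byte counts are easy to mistype, so it can help to generate the two sysctl lines from MiB values rather than writing them by hand. This is a small sketch under that assumption; sysctl_fragment is a hypothetical helper name, and the check that the background threshold stays below the hard limit is a sanity guard added for the example.

```python
MIB = 1024 * 1024


def sysctl_fragment(background_mib, limit_mib):
    """Render vm.dirty_*_bytes settings from thresholds given in MiB."""
    # Keep the background threshold below the hard limit so background
    # writeback has room to start before writers are throttled.
    if not 0 < background_mib < limit_mib:
        raise ValueError("need 0 < background_mib < limit_mib")
    return (
        f"vm.dirty_background_bytes = {background_mib * MIB}\n"
        f"vm.dirty_bytes = {limit_mib * MIB}\n"
    )


if __name__ == "__main__":
    print(sysctl_fragment(256, 512), end="")
```

The output for 256 and 512 matches the two lines used above. Saved to a file under /etc/sysctl.d/ (any file name there is your choice), it can be loaded with sysctl --system.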