When too much page cache becomes a problem

The symptom was ugly. A write-heavy ingest box would run smoothly for thirty seconds, then everything would freeze for two or three seconds. Latency graphs looked like a heartbeat monitor. No CPU spike, no swap, nothing in the logs. Just periodic full-system stalls that nobody could explain.

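It was the page cache filling up with dirty pages. The kernel defaults (vm.dirty_background_ratio = 10, vm.dirty_ratio = 20) start background writeback once dirty pages reach 10% of available memory, but a slow disk cannot keep up with a fast writer, so the backlog keeps growing. At 20% the kernel decides it really must write them out: processes that are writing get blocked and made to flush synchronously, and the box grinds while it catches up. On a machine with a lot of memory and a single spinning disk behind it, 20% is a colossal amount of data to flush in one go.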
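To put a rough number on "colossal", here is a back-of-the-envelope sketch in Python. The memory size and disk throughput are hypothetical figures chosen for illustration, not measurements from the box in this story, and the function names are made up. The result is only an upper bound on how much dirty data can be outstanding, since the kernel does not have to drain the entire backlog before letting writers continue.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def dirty_backlog_bytes(available_bytes: int, ratio_percent: float) -> int:
    """Approximate ceiling on dirty pages for a given vm.dirty_ratio."""
    return int(available_bytes * ratio_percent / 100)


def drain_seconds(backlog_bytes: int, disk_bytes_per_sec: float) -> float:
    """Time to write the backlog at a steady sequential rate."""
    return backlog_bytes / disk_bytes_per_sec


# Hypothetical machine: 64 GiB of available memory, one disk at 150 MiB/s.
available = 64 * GIB
disk_rate = 150 * MIB

for ratio in (20, 10):
    backlog = dirty_backlog_bytes(available, ratio)
    seconds = drain_seconds(backlog, disk_rate)
    print(f"dirty_ratio={ratio}%: up to {backlog / GIB:.1f} GiB dirty, "
          f"~{seconds:.0f} s to drain")
```

With those assumed numbers, a 20% ceiling allows nearly 13 GiB of dirty data, which would take about a minute and a half to write out at full sequential speed. Real writes are rarely that sequential, so the true figure would likely be worse.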

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
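Before changing anything it is worth confirming what the box is actually running with. The live values are exposed as files under /proc/sys/vm, and settings like the two above are typically persisted in /etc/sysctl.conf or a file under /etc/sysctl.d/ (which file was used here is not recorded, so treat that as an assumption). A minimal sketch that reads the current values; the helper name read_vm_tunables is hypothetical:

```python
from pathlib import Path


def read_vm_tunables(names, root="/proc/sys/vm"):
    """Return {name: int or None} for each tunable file under root."""
    values = {}
    for name in names:
        path = Path(root) / name
        try:
            values[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Missing, unreadable, or non-numeric: report as unknown.
            values[name] = None
    return values


if __name__ == "__main__":
    tunables = ["dirty_background_ratio", "dirty_ratio"]
    for name, value in read_vm_tunables(tunables).items():
        print(f"vm.{name} = {value}")
```

The root parameter exists only so the function can be pointed at a test directory; on a real Linux machine the default is the one you want.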

Lower numbers mean background flushing starts earlier (at 5% instead of 10%) and writers are throttled sooner (at 10% instead of 20%), so the backlog never grows large enough to cause a long synchronous stall. The trade is slightly more frequent, smaller writes instead of rare enormous ones. For this workload that was exactly the right trade: the stalls vanished and average throughput barely moved.
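One way to see whether a change like this is working is to watch the Dirty and Writeback lines in /proc/meminfo, which report the current backlog in kB. This is a small sketch of a parser for that file, not the tooling used on the original box; the function name parse_dirty_kb is an assumption for illustration.

```python
def parse_dirty_kb(meminfo_text: str) -> dict:
    """Extract the Dirty and Writeback values (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    found = {}
    for line in meminfo_text.splitlines():
        # Lines look like "Dirty:              2048 kB".
        key, _, rest = line.partition(":")
        if key in wanted:
            found[key] = int(rest.split()[0])
    return found
```

Calling parse_dirty_kb(open("/proc/meminfo").read()) once a second during an ingest run should show Dirty climbing and falling in a sawtooth. With the lower ratios, the peaks should be visibly smaller.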

The general lesson is that "more cache" is not free. A big dirty page backlog is a debt the kernel eventually calls in all at once, and on a busy box you want it paying that debt little and often rather than in one terrifying lump. Check dirty_ratio before you blame the disk.