When the writeback stalls everything

The symptom was a box that hitched. Every minute or two, for a second or two, everything paused: SSH typing lagged, request latency spiked, the load average leapt and then settled. No swap thrash, no OOM kills, plenty of free RAM. Just a periodic stall that looked, on a graph, like a heartbeat nobody had asked for.

It was a write-heavy host, a queue worker chewing through batches and flushing a lot of data to a single spinning disk underneath an otherwise generous amount of memory. That combination is exactly where Linux's default dirty page thresholds bite.

Here is the mechanism, briefly, because it is the whole story. When a process writes, the data lands in the page cache as "dirty" pages and the write returns immediately. The kernel flushes those pages to disk in the background later. Two knobs govern when "later" becomes "now":

vm.dirty_background_ratio   # start flushing in the background at this %
vm.dirty_ratio              # block writers at this % until enough has been flushed
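
To see what a given host is currently running with, the two knobs can be read straight out of /proc/sys/vm. A minimal sketch follows; the function name and its optional directory argument are my own inventions for illustration, not anything the kernel or sysctl provides.

```shell
# Print the two ratio thresholds as "vm.<name> = <value>" lines.
# $1 = directory holding the knobs, defaulting to /proc/sys/vm
#      (the argument is a hypothetical convenience for pointing it elsewhere)
show_dirty_knobs() {
    dir=${1:-/proc/sys/vm}
    for knob in dirty_background_ratio dirty_ratio; do
        printf 'vm.%s = %s\n' "$knob" "$(cat "$dir/$knob")"
    done
}

# Usage on a Linux host:
#   show_dirty_knobs
```

Running `sysctl vm.dirty_background_ratio vm.dirty_ratio` gives the same information; the function only saves retyping the names.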

The ratios are a percentage of available memory. On a box with a lot of RAM that is the trap. The default vm.dirty_ratio of 20 on a machine with, say, 64 GB means the kernel will happily let around 12 GB of dirty pages pile up before it forces a synchronous flush. When it finally hits that ceiling, every writing process is blocked while a vast backlog drains to a disk that can only manage a hundred-odd megabytes a second. That is your stall. The disk cannot absorb in a moment what memory was allowed to hoard for a minute.
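
The arithmetic is worth doing once by hand. The sketch below assumes roughly 60 GiB of the 64 counts as available and that the disk sustains 120 MB/s; both figures are illustrative guesses rather than measurements from this host, and the function names are made up for the example.

```shell
# Dirty-page ceiling in bytes.
# $1 = ratio in percent, $2 = available memory in bytes
dirty_limit_bytes() {
    echo $(( $2 * $1 / 100 ))
}

# Whole seconds needed to drain a backlog completely.
# $1 = backlog in bytes, $2 = sustained disk write rate in bytes per second
drain_seconds() {
    echo $(( $1 / $2 ))
}

avail=$(( 60 * 1024 * 1024 * 1024 ))   # assumed: about 60 GiB available
rate=$(( 120 * 1000 * 1000 ))          # assumed: 120 MB/s sustained writes

limit=$(dirty_limit_bytes 20 "$avail")
echo "ceiling: $limit bytes"
echo "full drain: $(drain_seconds "$limit" "$rate") s"
```

With those assumed figures the ceiling comes out at 12 GiB, and emptying it completely would take a little over a hundred seconds. That is the time to drain the whole backlog, not the length of any one stall, but it shows how badly the buffer and the disk are mismatched.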

The fix is to stop letting the backlog grow so large. Smaller thresholds mean the kernel starts writing back sooner and more steadily, trading a single nasty stall for a constant gentle trickle. I dropped the ratios and, because percentages of large memory are clumsy, leaned on the byte-valued equivalents instead, which the kernel honours when they are non-zero:

vm.dirty_background_bytes = 268435456   # 256 MiB
vm.dirty_bytes            = 536870912   # 512 MiB

Background flushing kicks in at 256 MiB, hard blocking only past 512 MiB. The numbers are not magic. They are roughly "a few seconds of what this disk can actually swallow", which is the right way to think about it: size the buffer to the drain, not to the memory you happen to have.
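
One way to turn "a few seconds of drain" into numbers is to multiply a sustained write rate by a time budget. The sketch below assumes a disk good for about 128 MiB/s, with two seconds of backlog before background flushing and four before blocking. Those inputs are assumptions picked because they reproduce the values above, not a record of how the values were originally chosen.

```shell
# Bytes the disk can absorb in a given time.
# $1 = sustained write rate in MiB/s, $2 = seconds of backlog to tolerate
threshold_bytes() {
    echo $(( $1 * $2 * 1024 * 1024 ))
}

rate_mib=128   # assumed sustained write rate, in MiB/s

echo "vm.dirty_background_bytes = $(threshold_bytes "$rate_mib" 2)"
echo "vm.dirty_bytes = $(threshold_bytes "$rate_mib" 4)"
```

Measure the real sustained rate of your own disk before trusting any such number; the point is only that the inputs are disk speed and seconds, never total RAM.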

Apply it without a reboot and watch:

sysctl -w vm.dirty_background_bytes=268435456
sysctl -w vm.dirty_bytes=536870912
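
For the watching part, the Dirty and Writeback lines in /proc/meminfo show how much is waiting to be flushed and how much is in flight. A small sketch of sampling them once a second follows; the function names are hypothetical helpers, and the optional path argument exists only so the parser can be pointed at a saved copy of the file.

```shell
# Print the Dirty and Writeback counters from a meminfo-style file.
# $1 = path to read, defaulting to /proc/meminfo
dirty_snapshot() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { printf "%s %s kB\n", $1, $2 }' \
        "${1:-/proc/meminfo}"
}

# Print one combined line per second.
# $1 = number of samples to take, defaulting to 10
watch_dirty() {
    n=${1:-10}
    while [ "$n" -gt 0 ]; do
        dirty_snapshot | tr '\n' ' '
        echo
        sleep 1
        n=$(( n - 1 ))
    done
}

# Usage on a Linux host:
#   watch_dirty 60
```

Before the change I would expect Dirty to climb for a while and then collapse; afterwards it should hover somewhere under the configured ceiling.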

The heartbeat went away. Aggregate throughput barely changed, since the same bytes still have to reach the same disk, but they now arrive in a smooth stream rather than in periodic floods, and nothing blocks for a second to make it happen. Persist the settings in a file under /etc/sysctl.d/ so the next reboot does not quietly hand you the stall back.
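
A sketch of making the settings permanent, assuming a drop-in file. The file name in the usage comment is an arbitrary choice of mine, and the helper function is hypothetical; any .conf file in that directory is read at boot.

```shell
# Write the two byte thresholds to a sysctl drop-in file.
# $1 = destination path
write_dirty_conf() {
    cat > "$1" <<'EOF'
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 536870912
EOF
}

# Usage, as root (the file name is an arbitrary example):
#   write_dirty_conf /etc/sysctl.d/90-dirty-writeback.conf
#   sysctl --system    # reload all sysctl configuration files now
```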

The general lesson, which I keep relearning, is that a default expressed as a percentage of memory was tuned for machines with far less of it. On a busy box with plenty of RAM and a modest disk, the kernel will let memory write cheques the disk cannot cash. Tune to the slow part.