The box was fine ninety-five percent of the time and catastrophic the other five. Latency would sit nicely at single-digit milliseconds for a couple of minutes, then every request on the machine would stall together for a second or two, then it would recover and pretend nothing had happened. The graphs looked like a heartbeat. The on-call rota did not find this charming.
This is a write-heavy ingestion node: lots of small writes landing constantly, flushed to a single largeish array. The application was not the problem. The problem was the kernel's page cache deciding, all at once and on its own schedule, that it was time to write everything down to disk.
what dirty pages actually are
When a process writes to a file, the data does not go straight to the disk. It goes into the page cache as a "dirty" page, the kernel acknowledges the write immediately, and the actual flush to storage happens later in the background. This is why your writes feel fast: most of the time you are writing to RAM and the disk catches up afterwards. It is one of the nicest lies the kernel tells you.
The catch is that dirty pages cannot accumulate forever. Two thresholds govern when the kernel stops being relaxed about it:
vm.dirty_background_ratio is the point at which the kernel quietly kicks off background flushing. Your processes carry on, the flusher threads do the work behind the scenes.
vm.dirty_ratio is the hard ceiling. When dirty pages hit this fraction of available memory, the kernel stops being polite. Any process that tries to write is blocked, made to do the writeback itself, until things are back under control.
On a default Ubuntu box of this era those are 10 and 20 respectively. Ten percent of memory and twenty percent of memory. That sounds modest until you do the arithmetic on a machine with a lot of RAM.
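Before doing any arithmetic it is worth confirming what the box is actually running with. A minimal sketch, assuming the standard /proc/sys/vm layout; the function name show_dirty_knobs and its optional directory argument are mine, not part of any tool:

```shell
# show_dirty_knobs [DIR]
# Print the four writeback thresholds. DIR defaults to /proc/sys/vm; the
# optional argument exists only so the function can be pointed elsewhere.
show_dirty_knobs() {
    dir="${1:-/proc/sys/vm}"
    for knob in dirty_background_ratio dirty_ratio \
                dirty_background_bytes dirty_bytes; do
        printf 'vm.%s = %s\n' "$knob" "$(cat "$dir/$knob")"
    done
}

show_dirty_knobs
```

While the ratio knobs are in charge, the two _bytes values read back as 0.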
the arithmetic that bites you
This machine had 64 GB of RAM. So dirty_ratio at 20 means the kernel will happily let roughly 12.8 GB of unwritten data pile up in the cache before it slams on the brakes (strictly it is 20 percent of available memory, so a shade less). The background threshold at 10 percent is roughly 6.4 GB.
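The same sum as a few lines of shell, so you can plug in your own memory size. This is a rough sketch: it assumes the ratio is applied to all of RAM, which gives an upper bound, and the function name dirty_threshold_mib is made up for illustration.

```shell
# dirty_threshold_mib TOTAL_MIB RATIO
# Upper-bound estimate of a ratio threshold, in MiB (integer arithmetic).
# Assumption: the ratio is applied to total RAM. The kernel really applies
# it to available memory, so the true figure is somewhat lower.
dirty_threshold_mib() {
    echo $(( $1 * $2 / 100 ))
}

total_mib=65536   # 64 GiB, the size of the box in this post
echo "background flushing starts near: $(dirty_threshold_mib "$total_mib" 10) MiB"
echo "writers are blocked near:        $(dirty_threshold_mib "$total_mib" 20) MiB"
```

The thresholds scale linearly with memory, so the same default ratios on a box with four times the RAM allow four times the backlog.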
Now picture what happens. The application writes merrily into cache. Dirty pages climb past 6.4 GB and background flushing starts, but our write rate is high enough that the flushers cannot keep up with the incoming data plus the backlog. So dirty pages keep climbing. Eventually they touch 12.8 GB, hit dirty_ratio, and the kernel does the thing it does at the ceiling: it blocks every writing process and forces them to participate in writeback until the backlog drains.
That is the stall. That is the heartbeat in the graph. The machine spends a couple of minutes building up a 12 GB wall of dirty pages and then everyone stands around waiting while the array writes it all out at once. The disks are not slow on average. They are being asked to do nothing, then everything, then nothing.
You can watch the whole drama in real time:
watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'
Seeing Dirty ramp smoothly up to several gigabytes and then collapse, in lockstep with the latency spikes, was the moment the diagnosis clicked.
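Watching is fine for the first diagnosis, but to line the numbers up against a latency graph afterwards you want timestamped samples. A small sketch, assuming the usual "Name: value kB" layout of /proc/meminfo; the function names meminfo_kb and sample_dirty are hypothetical:

```shell
# meminfo_kb FIELD [FILE]
# Print the kB value of exactly one field. Matching the whole field name
# avoids also picking up WritebackTmp when asking for Writeback.
meminfo_kb() {
    awk -v f="$1:" '$1 == f { print $2 }' "${2:-/proc/meminfo}"
}

# sample_dirty [FILE]
# Emit one line: epoch seconds, Dirty in kB, Writeback in kB.
sample_dirty() {
    printf '%s %s %s\n' "$(date +%s)" \
        "$(meminfo_kb Dirty "${1:-/proc/meminfo}")" \
        "$(meminfo_kb Writeback "${1:-/proc/meminfo}")"
}

sample_dirty
```

Run it in a loop, for example while sleep 1; do sample_dirty; done >> dirty.log (the file name is arbitrary), and you have something to plot next to request latency.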
smaller ceilings, smoother behaviour
The fix is to stop the kernel from ever building up that enormous backlog. You want it flushing earlier and more continuously, so the writeback is a steady trickle rather than an occasional flood. The counter-intuitive bit is that you make the box more responsive by giving the cache less slack.
On a machine with this much RAM, the ratio knobs are too coarse, so I switched to the byte-denominated equivalents, which let you set an absolute figure that does not balloon with memory size:
# /etc/sysctl.d/30-writeback.conf
# 256 MiB: start background flushing
vm.dirty_background_bytes = 268435456
# 512 MiB: hard ceiling for dirty pages
vm.dirty_bytes = 536870912
Setting the _bytes variants automatically zeroes out the corresponding _ratio values, which is the documented behaviour and exactly what you want here. Apply with sysctl -p /etc/sysctl.d/30-writeback.conf.
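The byte figures are just MiB multiplied out, and it is easy to drop a digit typing them by hand. A minimal sketch that prints the two settings from MiB values; the helper names mib_to_bytes and writeback_conf are invented for this example:

```shell
# mib_to_bytes MIB
# Convert mebibytes to bytes.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# writeback_conf BACKGROUND_MIB CEILING_MIB
# Print sysctl.d lines for the two byte-denominated thresholds.
writeback_conf() {
    printf 'vm.dirty_background_bytes = %s\n' "$(mib_to_bytes "$1")"
    printf 'vm.dirty_bytes = %s\n' "$(mib_to_bytes "$2")"
}

writeback_conf 256 512
```

After applying, sysctl vm.dirty_ratio vm.dirty_background_ratio should report 0 for both, which confirms the byte values are the ones in force.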
The effect was immediate. Dirty pages now hover in the low hundreds of megabytes and the flushers keep pace with the incoming stream. There is no longer a giant backlog to drain, so there is no longer a stall to drain it. The latency graph went from a heartbeat to a flat line.
the trade-off, stated honestly
This is not free and I would not apply it blindly everywhere. By forcing writeback to happen sooner, you give up some of the cache's ability to coalesce writes and absorb bursts. A workload that does a big batch write and then goes quiet might actually prefer the generous defaults, because it can take the whole burst into cache and dribble it out during the idle period that follows. You also expose yourself slightly more to the underlying disk's real throughput, because you are no longer hiding behind a 12 GB buffer.
The deciding question is the shape of your writes. Sustained, high-rate, continuous writes (logging, ingestion, anything streaming) benefit from small thresholds because the backlog never had a quiet period to drain into anyway. Bursty writes with idle gaps may genuinely want the big buffer.
For this box, the workload was relentless and the requirement was predictable latency, so trading a little peak throughput for the elimination of multi-second stalls was an easy call. Measure your own /proc/meminfo against your own latency before you copy these numbers. The principle travels; the exact figures belong to your machine.