The symptom was a box that ran beautifully and then, roughly every thirty seconds, locked up for a second or two. Request latency would sit flat and lovely, then spike into the hundreds of milliseconds, then settle again. No errors. No obvious cause in the application. Just a heartbeat of misery on a regular interval, which is always a clue that something below the application is doing housekeeping.
The box was write-heavy: a lot of small writes, plenty of RAM, and a single spinning-rust array underneath that could not keep up with the burst rate the workload was capable of producing. That combination is the classic recipe for dirty page trouble.
what's actually happening
When you write to a file, Linux doesn't usually send it to disk straight away. It parks the data in the page cache as a "dirty" page and tells your application the write succeeded. The kernel flushes those dirty pages to disk in the background, later, when it suits. This is normally wonderful. It's why writes feel instant.
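If you want to watch this happen, the kernel reports the current amount of dirty and in-flight data in /proc/meminfo as the Dirty and Writeback fields. Here is a minimal sketch of reading them; the function names are mine, not from any library, and it assumes a Linux box with /proc mounted.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}.

    Most values are in kB; a few fields (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_summary(text):
    """Pull out the two fields that matter for this problem, in kB."""
    info = parse_meminfo(text)
    return {
        "dirty_kb": info.get("Dirty", 0),          # waiting to be written
        "writeback_kb": info.get("Writeback", 0),  # being written right now
    }


def read_dirty_summary(path="/proc/meminfo"):
    """Read the live values from the running kernel (Linux only)."""
    with open(path) as f:
        return dirty_summary(f.read())
```

Calling read_dirty_summary() once a second during a stall should show Dirty climbing steadily and then falling away as the backlog drains, if dirty pages really are the culprit.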
The trouble starts with two thresholds. vm.dirty_background_ratio is the percentage of available memory (free plus reclaimable pages, which on a box like this is most of the RAM) that can be dirty before the kernel starts flushing in the background, quietly. vm.dirty_ratio is the hard ceiling: once that fraction is dirty, the kernel stops being polite and blocks the writing process until enough has been flushed to get back under the limit.
On a box with a lot of RAM, the defaults are dangerous. If dirty_ratio is 20 percent of, say, 64GB, that's roughly 13GB of writes allowed to pile up in memory before any writer is made to wait. Background flushing starts earlier, but a slow array can't keep pace with it, so the workload happily fills that buffer anyway, the kernel finally hits the ceiling, and then it has a huge backlog to push onto a disk array that needs an age to absorb it. Everything writing to that filesystem stops dead until enough of the flood drains. That's your stall every thirty seconds.
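The arithmetic is worth doing for your own machine. This is a rough sketch, not a model of the kernel: it treats the ratio as a share of total RAM, and the disk throughput you feed it is a number you have to measure yourself. The 150 MB/s in the example is purely hypothetical.

```python
GIB = 1024 ** 3


def dirty_limit_bytes(mem_bytes, ratio_percent):
    """Approximate dirty-page limit for a ratio setting, in bytes."""
    return mem_bytes * ratio_percent // 100


def drain_seconds(backlog_bytes, disk_bytes_per_sec):
    """Time to write a backlog at a sustained rate (upper bound on the pain)."""
    return backlog_bytes / disk_bytes_per_sec


ram = 64 * GIB
ceiling = dirty_limit_bytes(ram, 20)     # dirty_ratio = 20
background = dirty_limit_bytes(ram, 10)  # dirty_background_ratio = 10

print(f"background flush starts at ~{background / GIB:.1f} GiB dirty")
print(f"writers block at           ~{ceiling / GIB:.1f} GiB dirty")

# Hypothetical array throughput; substitute a measured figure.
assumed_disk_rate = 150_000_000
print(f"full drain at that rate:   ~{drain_seconds(ceiling, assumed_disk_rate):.0f} s")
```

Writers only have to wait until the dirty count drops back under the limit, not until the whole backlog is gone, so the drain figure is a worst case rather than a predicted stall length.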
what I changed
The fix is to stop the kernel hoarding. Lower the thresholds so it flushes little and often instead of rarely and catastrophically. On a box with this much RAM, the ratio knobs are too coarse (they move in whole percentage points, and one percent of 64GB is already about 650MB), so I used the byte-based versions, which give you actual control:
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 536870912
That's 256MB before background flushing kicks in and 512MB as the hard ceiling. The numbers matter less than the principle: keep the dirty buffer small enough that flushing it never overwhelms the disk for more than a moment. A good starting point is a buffer the array can drain in a second or two, then measure.
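That sizing rule can be written down as a couple of lines of arithmetic. A sketch under stated assumptions: the throughput figure is whatever you measured for sustained writes on your own array, the two-second drain target is just the starting point suggested above, and putting the background threshold at half the ceiling mirrors the 256MB/512MB split rather than any rule the kernel enforces. The function name is made up for illustration.

```python
MIB = 1024 ** 2


def suggest_dirty_settings(disk_bytes_per_sec, drain_target_sec=2.0):
    """Suggest byte thresholds so a full buffer drains in about drain_target_sec."""
    dirty_bytes = int(disk_bytes_per_sec * drain_target_sec)
    background_bytes = dirty_bytes // 2  # assumption: background at half the ceiling
    return {
        "vm.dirty_background_bytes": background_bytes,
        "vm.dirty_bytes": dirty_bytes,
    }


# Hypothetical array that sustains 256 MiB/s of writes.
for key, value in suggest_dirty_settings(256 * MIB).items():
    print(f"{key} = {value}")
```

Treat the output as a first guess to try live, then adjust after watching latency under real load.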
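Set them live to test before committing. Changes made with sysctl -w apply immediately but don't survive a reboot, so nothing is permanent until you put the lines above in /etc/sysctl.conf or a file under /etc/sysctl.d/: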
sysctl -w vm.dirty_background_bytes=268435456
sysctl -w vm.dirty_bytes=536870912
The dirty_background_bytes and dirty_bytes settings override their ratio equivalents, so don't try to set both. Whichever of a pair you set last wins (the other one then reads back as 0), and mixing them is a good way to confuse your future self.
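Because the losing half of each pair reads back as 0, it's easy to check which one is actually in force. A small sketch, assuming the usual /proc/sys/vm layout on Linux; the helper names are hypothetical.

```python
SYSCTL_NAMES = (
    "dirty_background_bytes",
    "dirty_background_ratio",
    "dirty_bytes",
    "dirty_ratio",
)


def read_vm_sysctls(base="/proc/sys/vm"):
    """Read the four dirty-page sysctls as integers (Linux only)."""
    values = {}
    for name in SYSCTL_NAMES:
        with open(f"{base}/{name}") as f:
            values[name] = int(f.read().strip())
    return values


def active_dirty_limits(values):
    """Report which knob of each pair is in force: the nonzero bytes value wins."""
    def pick(bytes_key, ratio_key):
        if values.get(bytes_key, 0) > 0:
            return ("bytes", values[bytes_key])
        return ("ratio", values.get(ratio_key, 0))

    return {
        "background": pick("dirty_background_bytes", "dirty_background_ratio"),
        "ceiling": pick("dirty_bytes", "dirty_ratio"),
    }
```

Running active_dirty_limits(read_vm_sysctls()) after a change is a quick way to confirm the kernel is using the knob you think it is.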
the result, and the caveat
The stalls went away. Latency flattened out, the periodic spike vanished, and aggregate throughput barely moved, because the disk was always the bottleneck and now it was being fed steadily instead of in floods. You don't make a slow disk faster by tuning this. You stop the kernel from pretending the disk is fast and then dropping the bill on your application all at once.
The caveat is that this is a trade. Flushing more eagerly means slightly more disk I/O during quiet moments and a smaller cushion to absorb a genuine burst. For a latency-sensitive write-heavy box, that's the right trade every time: I'd far rather pay a steady, predictable tax than take an occasional two-second freeze. For a batch box that just wants maximum throughput and doesn't care about pauses, leave it alone. As ever, the defaults are a compromise for a machine that isn't yours, and yours is the one you should measure.