Ramblings of an aging IT geek
← Ramblings of an aging IT geek
news

the retry storm you wrote yourself

This week's brief provider blip was made worse for me by my own client retry logic, which turned a recoverable wobble into a self-inflicted pile-up.

A status board with one section flashing red

Short provider blip earlier this week, the kind that's resolved before you've finished reading the incident tweet. The provider's bit was fine, recovered in a few minutes. My bit was not fine, and that's on me.

When the dependency started timing out, every one of my workers retried immediately, then retried again, with no backoff and no jitter. So the moment the provider came back up, it came back up to a thundering herd of my own making, all my clients hammering it in lockstep, and I extended a five-minute provider blip into a fifteen-minute self-inflicted outage. The provider recovered. My retry logic refused to let it.

The fix is exponential backoff with jitter. Not exponential backoff on its own, which just synchronises everyone to retry at the same well-spaced moments. Add randomness so the herd spreads out. It's a dozen lines and it turns "everyone retries at once" into "retries fan out gently", which is the entire game during a recovery.

The lesson, again: outages aren't only what the provider does to you. Half the damage is what your client does in response. Make your retries polite, or they'll finish the job the outage started.