When S3 sneezed and half the web caught a cold


A fortnight on from the big S3 outage in us-east-1 and I'm still chewing on it. For those who somehow missed it, on 28 February a chunk of S3 in Amazon's oldest region fell over for several hours, and an alarming slice of the web went with it. Not just the obvious dependents either. Dashboards, status pages, IoT gadgets, things you'd never have guessed touched object storage. Amazon's own status page reportedly struggled, because its status icons were served from the thing that was down.

The cause, once they wrote it up, was mundane in the way these always are: a command run during routine debugging took out more capacity than intended, and a subsystem that hadn't been fully restarted in years took a long time to come back. Big systems rust in place. You don't find out until you have to restart them.

The lesson everyone reached for was "multi-region, multi-cloud", and that's not wrong, but it's expensive and most of us won't do it. The cheaper lesson is the one I keep relearning at home: know your single points of failure before they introduce themselves. If your entire app, your monitoring, and your means of telling customers what's happening all sit in one region, you haven't got three things, you've got one thing wearing a wig.
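That check is easy enough to sketch. Here's a rough version in Python, assuming you can write down which failure domain (provider plus region, say) each piece lives in. The component names, the inventory and the function name are all made up for illustration, and a real audit would need to chase indirect dependencies too, which this doesn't.

```python
from collections import defaultdict

# Hypothetical inventory: component -> the failure domain it lives in.
# These names and regions are invented for the example.
INVENTORY = {
    "app": "aws:us-east-1",
    "monitoring": "aws:us-east-1",
    "status_page": "aws:us-east-1",
}


def shared_failure_domains(inventory, must_be_independent):
    """Return the failure domains that hold more than one of the
    components we want to fail independently of each other."""
    by_domain = defaultdict(list)
    for component in must_be_independent:
        by_domain[inventory[component]].append(component)
    return {
        domain: sorted(names)
        for domain, names in by_domain.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    clashes = shared_failure_domains(
        INVENTORY, ["app", "monitoring", "status_page"]
    )
    for domain, names in clashes.items():
        print(f"{domain}: {', '.join(names)} all fail together")
```

An empty result only means the things you listed don't share a domain on paper. It says nothing about what you forgot to list, which is of course where the wig tends to be hiding.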

I went and looked at my own stuff afterwards, as you do, and found my status notifications would have died right alongside the service they were meant to report on. Fixed that this week. Mildly embarrassing, but better me than finance.