the outage wasn't the cloud's fault, mostly


Another month, another cloud provider has a wobble and half of Twitter declares the death of centralised computing. I watched the timeline light up this week and felt the usual mix of sympathy and déjà vu. The dashboards go yellow, the status page lags behind reality by twenty minutes, and everyone discovers at once that their "multi-region" setup shared one control plane after all.

Here is the uncomfortable bit. Most of these outages don't actually take down a whole region. They take down one service in one region, and with it everything you built that quietly depends on that one service: DNS, IAM, a metadata endpoint, the thing nobody put on the architecture diagram because everyone assumed it would always be there.
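Hidden shared dependencies like these are easy to miss by eye but easy to find mechanically, provided you have written the graph down somewhere. Below is a minimal sketch, assuming you can express "service X calls service Y" as a plain dictionary. Every service name in it is hypothetical and chosen only for illustration. It walks the graph from each supposedly redundant entry point and reports the services that all of them reach, directly or transitively; any one of those failing takes every replica down with it.

```python
from collections import deque

# Hypothetical dependency graph: each service maps to the services it calls.
# Names are illustrative only, not any real provider's topology.
DEPENDS_ON: dict[str, list[str]] = {
    "web-region-a": ["api-region-a", "dns"],
    "web-region-b": ["api-region-b", "dns"],
    "api-region-a": ["db-region-a", "iam"],
    "api-region-b": ["db-region-b", "iam"],
    "iam": ["control-plane"],
    "dns": [],
    "db-region-a": [],
    "db-region-b": [],
    "control-plane": [],
}


def reachable(graph: dict[str, list[str]], start: str) -> set[str]:
    """Return every service `start` depends on, directly or transitively (BFS)."""
    seen: set[str] = set()
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


def shared_dependencies(graph: dict[str, list[str]], replicas: list[str]) -> set[str]:
    """Services that every 'redundant' replica depends on.

    Each of these is a single point of failure for the whole set of replicas.
    """
    if not replicas:
        return set()
    common = reachable(graph, replicas[0])
    for replica in replicas[1:]:
        common &= reachable(graph, replica)
    return common


if __name__ == "__main__":
    spofs = shared_dependencies(DEPENDS_ON, ["web-region-a", "web-region-b"])
    print(sorted(spofs))
```

In this toy graph the two "independent" regions turn out to share `dns`, `iam` and the `control-plane` behind it. The sketch deliberately ignores fallback paths (a service that can use either of two providers, say), so treat its output as a list of things to go and check, not a verdict.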

I've done this myself. I once had a "redundant" pair of homelab servers that both pulled their config from the same single Consul node, because at 2am, mid-build, that was the pragmatic choice, and I never went back to fix it. The cloud just makes the same mistake more expensive and more public.

The lesson isn't "self-host everything" or "go multi-cloud", both of which usually trade one failure mode for three. The lesson is to actually know your dependency graph, test the failure you're afraid of, and accept that some downtime is cheaper than the architecture you'd need to avoid it. Work out how much downtime you can really tolerate, and write that number down. Most businesses can survive an hour offline far better than they can survive the complexity of pretending they can't.
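One way to make that number concrete is to translate between an availability target and hours of downtime per year. The sketch below assumes a 365-day year, and the targets it prints are illustrative examples, not figures from any particular provider's SLA.

```python
# Assumes a 365-day year; leap years barely move the numbers for budgeting.
HOURS_PER_YEAR = 365 * 24


def downtime_budget_hours(availability: float) -> float:
    """Hours per year you can be down and still meet `availability` (e.g. 0.999)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def availability_for_downtime(hours_down: float) -> float:
    """Availability implied by a given number of hours offline per year."""
    if not 0.0 <= hours_down <= HOURS_PER_YEAR:
        raise ValueError("hours_down must be between 0 and HOURS_PER_YEAR")
    return 1.0 - hours_down / HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative targets only.
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} -> {downtime_budget_hours(target):.2f} h/year")
    print(f"one hour offline per year -> {availability_for_downtime(1.0):.4%}")
```

Seen this way, a single hour offline a year is already better than 99.98% availability, which is a useful reference point when someone proposes a second cloud to shave off the last few minutes.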