Another month, another status page slowly turning amber while everyone refreshes it and pretends they aren't. The shape is always the same. A control plane has a bad day in one region, the dashboards lag behind reality, and a surprising number of things that "definitely run in multiple availability zones" turn out to share a dependency nobody documented.
I'm not going to name and shame a provider here, partly because by the time you read this it'll be a different one's turn, and partly because the interesting lesson isn't about them. It's about us. The outage is theirs. The blast radius is ours, and the blast radius is almost always bigger than the architecture diagram suggested.
Here's the thing that keeps catching teams out: the failure mode is rarely the compute. Instances dying is the boring case, and most people have at least thought about that. The painful ones are the shared services. DNS. The metadata or auth endpoint everything quietly calls on startup. The managed queue that backs both your async jobs and your health checks, so when it slows down your liveness probes start failing and the orchestrator helpfully kills the very things that might have recovered. The single regional dependency that's so reliable you forgot it was a dependency at all.
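To make the liveness-probe trap concrete, here is a minimal sketch of one way to avoid it: report "is the process alive" separately from "are my dependencies reachable", so a slow queue takes you out of rotation without getting you killed. The names (`HealthReport`, `evaluate_health`, the `"queue"` and `"db"` checks) are mine, invented for illustration, and how you wire the two answers into your orchestrator's probes depends on your platform.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HealthReport:
    live: bool           # should the orchestrator leave this process running?
    ready: bool          # should traffic be routed to it right now?
    failing: List[str]   # names of dependencies that failed their check


def evaluate_health(
    process_ok: Callable[[], bool],
    dependency_checks: Dict[str, Callable[[], bool]],
) -> HealthReport:
    """Liveness looks only at the process; readiness looks at dependencies.

    Assumption: each check is a cheap callable returning True/False.
    A check that raises is treated as a failed dependency, not a dead process.
    """
    failing = []
    for name, check in dependency_checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failing.append(name)
    return HealthReport(live=process_ok(), ready=not failing, failing=failing)


if __name__ == "__main__":
    # Hypothetical scenario: the managed queue is having a bad day.
    report = evaluate_health(
        process_ok=lambda: True,
        dependency_checks={"queue": lambda: False, "db": lambda: True},
    )
    print(report)
```

The point of the split is that a dependency outage flips `ready`, never `live`, so nothing gets restarted for a problem a restart can't fix.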
The other recurring pattern is the status page itself. Provider status pages are marketing-adjacent documents, and they lag. Your own monitoring will usually know before the green tick changes. So the first lesson, every time, is to trust your synthetic checks over someone else's dashboard, and to have alerting that points at symptoms your users feel rather than at the provider's self-assessment.
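A synthetic check that alerts on symptoms can be very small. The sketch below is an assumption-heavy illustration rather than a recommendation of any tool: `fetch` stands in for whatever makes a real user-shaped request and returns an HTTP status code, and the one-second latency budget and three-failure threshold are placeholder numbers, not advice.

```python
import time
from typing import Callable, List, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool
    latency_s: float
    reason: str


def run_probe(
    fetch: Callable[[], int],
    max_latency_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
) -> ProbeResult:
    """Judge one request the way a user would: did it work, and was it fast?

    `fetch` is a hypothetical callable that performs the request and
    returns the HTTP status code; it may raise on network errors.
    """
    start = clock()
    try:
        status = fetch()
    except Exception as exc:
        return ProbeResult(False, clock() - start, f"error: {type(exc).__name__}")
    elapsed = clock() - start
    if status >= 500:
        return ProbeResult(False, elapsed, f"status {status}")
    if elapsed > max_latency_s:
        return ProbeResult(False, elapsed, "too slow")
    return ProbeResult(True, elapsed, "ok")


def should_alert(results: List[ProbeResult], min_failures: int = 3) -> bool:
    """Alert only when the last `min_failures` probes all failed."""
    recent = results[-min_failures:]
    return len(recent) == min_failures and all(not r.ok for r in recent)
```

Notice what isn't in there: any reference to the provider's status page. The check only knows what a user would know.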
So what do I actually do with this, beyond shaking my fist?
A few things, none of them clever, all of them tedious:
- Write down the real dependency graph, not the intended one. Trace one critical request from edge to database and note every external call it makes, including the ones in libraries and sidecars you didn't write. That list is your single-points-of-failure list, whether you like it or not.
- Decide, explicitly and in advance, what "degraded" looks like. Can the product serve stale data? Read-only? A cached page and an honest banner? A graceful degraded mode you designed beats a cascade you didn't.
- Practise the failover before you need it. An untested multi-region setup is a very expensive comfort blanket. Game-day it, on a quiet afternoon, on purpose.
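The first two items on that list can start life as a few dozen lines rather than a wiki page. What follows is a sketch under my own assumptions: the `Dependency` record, its `scope` labels, and the `StaleCache` class are all hypothetical, and a real degraded mode needs expiry limits and a banner telling users the data is old.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Dependency:
    name: str
    scope: str               # illustrative labels, e.g. "zonal", "regional", "global"
    fallback: Optional[str]  # what we do when it is down; None means nothing planned


def single_points_of_failure(deps: List[Dependency]) -> List[str]:
    """Every dependency on the request path with no planned fallback."""
    return [d.name for d in deps if d.fallback is None]


class StaleCache:
    """Serve the last good value when the live source fails."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._store: Dict[str, Tuple[Any, float]] = {}
        self._clock = clock

    def get_or_stale(self, key: str, load: Callable[[], Any]) -> Tuple[Any, bool]:
        """Return (value, is_stale). Re-raises if there is no cached copy."""
        try:
            value = load()
        except Exception:
            if key in self._store:
                return self._store[key][0], True
            raise
        self._store[key] = (value, self._clock())
        return value, False


if __name__ == "__main__":
    # A made-up request path, written down the way the first bullet suggests.
    path = [
        Dependency("dns", "global", None),
        Dependency("auth-endpoint", "regional", None),
        Dependency("database", "regional", "serve stale reads from cache"),
    ]
    print(single_points_of_failure(path))
```

The `is_stale` flag is the useful part: it is what lets the page show an honest banner instead of quietly pretending everything is fine.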
And then the harder, more honest question: is multi-region even worth it for what you're running? For a lot of services, it isn't. The added complexity introduces its own outages, the data-consistency story gets genuinely hard, and you spend real money insuring against a few hours of downtime a year that your users would forgive. Sometimes the right answer is a clear status page of your own, a sensible RTO you've actually agreed with the business, and the discipline not to over-engineer. Resilience theatre is its own failure mode.
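It helps to put a number on "a few hours of downtime a year" before arguing about it. This is plain arithmetic, not a claim about any provider's actual record; the availability targets below are just common round figures I picked for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99):
        hours = downtime_hours_per_year(target)
        print(f"{target}% allows roughly {hours:.2f} hours down per year")
```

Going from 99.9% to 99.99% buys back a little under eight hours a year. Whether that is worth a second region is exactly the conversation to have with the business.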
What I won't do is pretend the cloud was a mistake. It mostly isn't. But "the cloud is someone else's computer" has a corollary we keep forgetting: someone else's computer has its own bad days, and they don't coordinate them with your launch calendar. The cloud doesn't remove single points of failure. It just relocates them somewhere you can't see and can't fix, which is exactly why you have to map them yourself.