Last month, on 19 November, a big chunk of Microsoft's cloud effectively stopped letting people in. Not because the servers went away, but because the multi-factor authentication that gates sign-ins through Azure Active Directory stopped working across multiple regions. If you couldn't get the second factor, you couldn't sign in, and if you couldn't sign in, it didn't matter how healthy the rest of the platform was. Office 365, Azure portals, anything leaning on that auth path: locked out, for hours, in the middle of the working day.
I watched the status pages with the particular grim sympathy you feel when you've been on the wrong side of one of these. Nobody woke up that morning planning to take out authentication for a continent. These things are always more boring and more human than the headlines suggest.
"the cloud" didn't go down. one thing did.
The phrase "cloud outage" does a lot of unhelpful work. It implies a vast, vaguely defined thing failed in some vast, vaguely defined way. Almost never true. What actually happens is that one shared component, sat upstream of everything, has a bad day, and because everything depends on it, everything inherits the bad day at once.
Authentication is the classic example because it's the perfect choke point. Every request has to prove who it is before it can do anything useful. So the auth service becomes a single dependency that the entire platform shares, by design, and a single dependency that the entire platform shares is, by definition, a single point of failure dressed up in a great deal of redundancy. You can run it across regions, replicate its data, load-balance it six ways, and it's still the one door everybody comes through. When the door jams, the size of the building behind it is irrelevant.
DNS does this. So does a certificate that quietly expires. So does a config push that's syntactically valid and semantically catastrophic, rolled out everywhere at once because rolling out everywhere at once is the whole point of the system. The pattern is always the same: something small, central, and shared, having a wobble that the architecture then faithfully amplifies to global scale.
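To make "small, central, and shared" concrete, here is a minimal sketch of how you might spot those components in a dependency map. Everything in it is an assumption for illustration: the service names, the `DEPENDS_ON` map, and the helpers `transitive_deps` and `shared_by_all` are hypothetical, not a description of any real provider's architecture.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each service lists what it calls directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "mail": ["auth", "storage"],
    "portal": ["auth", "billing"],
    "billing": ["auth", "database"],
    "auth": ["mfa", "dns"],
    "storage": ["dns"],
    "database": ["dns"],
    "mfa": [],
    "dns": [],
}


def transitive_deps(service: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Everything a service needs, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def shared_by_all(entry_points: List[str], graph: Dict[str, List[str]]) -> Set[str]:
    """Components that sit upstream of every listed entry point."""
    dep_sets = [transitive_deps(s, graph) for s in entry_points]
    return set.intersection(*dep_sets) if dep_sets else set()


if __name__ == "__main__":
    print(sorted(shared_by_all(["mail", "portal", "billing"], DEPENDS_ON)))
    # prints ['auth', 'dns', 'mfa']
```

In this toy map, three unrelated-looking products all inherit the same bad day from `auth`, `dns`, or `mfa`, however much redundancy each of them has on its own.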
the lesson we keep not learning
Here's the uncomfortable bit. After each of these, the industry collectively writes the same think-pieces, nods sagely about resilience, and then does very little, because the honest answer is awkward. We moved to the cloud precisely to stop running our own authentication, our own DNS, our own undifferentiated plumbing. That was the right call. The plumbing was a distraction and the providers genuinely do it better than most of us could. But the deal we struck, often without noticing, is that we handed our single points of failure to someone else and lost the ability to do anything about them when they break.
When your own DNS server falls over, it's three in the morning, it's your fault, and it's also yours to fix. You can be on the box. When the provider's auth tier falls over, it's the middle of the afternoon, it's emphatically not your fault, and there is absolutely nothing you can do but refresh a status page and field questions from people who reasonably want to know why they can't work. The powerlessness is the part nobody mentions in the migration business case.
So what actually helps? Not much, if I'm honest, and anyone selling you a tidy five-point resilience checklist is selling you comfort. But a few things genuinely reduce the blast radius:
- Don't let everything share the same door. If your second factor, your VPN, and your break-glass admin access all route through the same provider's auth, you have one failure that locks you out of the very tools you'd use to respond. Keep at least one independent path in. A local account, a separate factor, something that doesn't depend on the thing most likely to be down.
- Degrade instead of die. A surprising amount of software treats "I can't reach auth right now" as "deny everything", when "serve cached, read-only, and shout loudly" would keep the lights on. Outages are far more survivable when the failure mode is reduced function rather than no function.
- Know what you actually depend on. Most teams cannot list, off the top of their head, every shared service that sits on their critical path. Until you can, you can't reason about which provider hiccup turns into your incident. Draw the dependency graph before the outage draws it for you.
- Have a manual fallback for the things that matter most. Not for everything, that way lies madness, but for the handful of operations that genuinely cannot wait three hours. Be able to do them another way.
- Rehearse the outage before it happens. A fallback you've never exercised is a theory, not a plan. The first time you discover your break-glass account also needs the factor that's currently down should not be during the incident. Test the alternate path on a quiet afternoon, write down what you found, and fix the bit that didn't work while fixing it is still cheap.
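The "degrade instead of die" point is the easiest of these to show in code. What follows is a rough sketch under stated assumptions, not a recommended implementation: `DegradingAuthorizer`, `AuthUnavailable`, and the `verify` callable are hypothetical names, and the fifteen-minute cache window is an arbitrary choice. The one idea it carries is that "auth said no" and "auth didn't answer" are different events and deserve different handling.

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Callable, Dict

log = logging.getLogger("auth-fallback")


class AuthUnavailable(Exception):
    """The auth provider could not be reached (not the same as 'denied')."""


@dataclass
class DegradingAuthorizer:
    # Hypothetical call to the provider: returns True/False, or raises
    # AuthUnavailable when the provider cannot be reached at all.
    verify: Callable[[str], bool]
    cache_ttl_seconds: float = 900.0  # assumption: 15 minutes of grace
    clock: Callable[[], float] = time.monotonic
    _last_ok: Dict[str, float] = field(default_factory=dict, init=False)

    def allow(self, token: str, write: bool) -> bool:
        try:
            ok = self.verify(token)
        except AuthUnavailable:
            return self._degraded(token, write)
        if ok:
            self._last_ok[token] = self.clock()
        else:
            # An explicit "no" always wins and clears any cached approval.
            self._last_ok.pop(token, None)
        return ok

    def _degraded(self, token: str, write: bool) -> bool:
        # Shout loudly: this should page someone, not sit quietly in a log.
        log.error("auth provider unreachable; serving degraded, read-only")
        if write:
            return False  # never accept writes on a stale approval
        last = self._last_ok.get(token)
        return last is not None and self.clock() - last <= self.cache_ttl_seconds
```

A real version would key the cache on something safer than the raw token and would need a considered answer to how long a stale approval is acceptable. That is a security decision, not a purely operational one, which is exactly why it's better made on a quiet afternoon than mid-incident.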
None of that prevents the outage. The provider's auth tier will go down again, because shared central dependencies will always, eventually, have a bad day, and no amount of redundancy changes the topology. What those measures do is decide whether their bad day is your minor inconvenience or your major incident.
The thing I keep coming back to is that we didn't eliminate single points of failure by going to the cloud. We consolidated them, professionalised them, and made them somebody else's problem, which is mostly an improvement, until the rare day it very much isn't. The cloud will go down again. We'll write these same words again. And the only real question worth asking, before it happens rather than after, is a simple one: when the one door everybody uses jams shut, what's our other way in?