another cloud outage, and what we should learn

There has been a fresh round of cloud wobbles this month, and the usual cycle has played out exactly as it always does. A region has a bad few hours, status pages go yellow then red then back to a carefully worded green, and Twitter fills with people discovering that their multi-region failover was a slide in a deck rather than a thing that works. Then everyone forgets until the next one.

I am not going to name and shame a provider here, partly because I do not have the exact timeline in front of me and I would rather be vague than wrong, and partly because it does not matter. AWS has had its days. Azure has had its days. The world has been leaning on all of them harder than ever over the past month while half the planet works from the spare room. The interesting question is never "which provider failed". It is "why did your thing fail when they did".

the cloud is still someone else's computer

I keep coming back to this line because it keeps being true. Moving to the cloud did not abolish hardware, networks, power, or the laws of physics. It moved them behind an API and a credit card. That is genuinely useful. It is not the same as moving them out of existence.

What the cloud actually gives you is a much better menu of resilience options than most of us could build ourselves, plus the rope to ignore every single one of them. You can spread across availability zones. You can spread across regions. You can run active-active with health-checked DNS in front. You can do none of that, put everything in one AZ behind one load balancer, and it will work beautifully for two years until the morning it does not.

The failures I have seen up close almost always come down to a hidden single point that nobody drew on the architecture diagram. A shared NAT gateway. A single control-plane dependency that everything quietly calls on startup. One region's S3 that another region's "stateless" service turned out to need. The dependency you forgot about is the one that takes you down, and the cloud is very good at hiding dependencies inside friendly-sounding managed services.
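
One way to make those hidden single points visible is to write the dependencies down as data and ask two questions of them: what do several services share, and what stops if one location goes away. The sketch below is illustrative only. The service names, dependency names, and locations are all made up, and a real inventory would have to be built by hand or from whatever your own tooling exposes.

```python
from collections import defaultdict

# Hypothetical inventory: service -> (dependency, location) pairs it needs
# in order to start or serve traffic. Every name here is invented.
DEPENDENCIES = {
    "web-eu": [("nat-gw-1", "eu-az-a"), ("config-service", "us-region")],
    "api-eu": [("nat-gw-1", "eu-az-a"), ("orders-db", "eu-az-b")],
    "worker-us": [("config-service", "us-region"), ("object-store", "us-region")],
}


def shared_single_points(deps, min_dependents=2):
    """Return each dependency relied on by at least `min_dependents` services."""
    dependents = defaultdict(set)
    for service, needs in deps.items():
        for name, location in needs:
            dependents[(name, location)].add(service)
    return {
        key: sorted(services)
        for key, services in dependents.items()
        if len(services) >= min_dependents
    }


def blast_radius(deps, failed_location):
    """Return the services with at least one dependency in the failed location."""
    return sorted(
        service
        for service, needs in deps.items()
        if any(location == failed_location for _, location in needs)
    )


shared = shared_single_points(DEPENDENCIES)
affected = blast_radius(DEPENDENCIES, "us-region")
```

In this made-up inventory, losing `us-region` takes out `web-eu` as well as `worker-us`, which is exactly the kind of cross-region dependency that tends not to appear on the diagram.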

the bits people skip

When I sit down with a team after one of these, the gaps are depressingly consistent.

The control plane is a dependency too. During a big regional event, the ability to launch new instances, change DNS, or even read the console can degrade right when you need it most. If your recovery plan is "we'll just spin up capacity elsewhere", test that you actually can while the region is on fire, not while it is calm.
Failover that is never exercised does not work. I have lost count of the standby databases that had drifted, the secondary region missing a security group, the runbook referencing a hostname that was renamed eighteen months ago. A failover path you do not run on a schedule is a hypothesis, not a capability.
Health checks that check the wrong thing. A /healthz that returns 200 because the web server is up, while the database connection pool is exhausted, will happily keep routing traffic into a black hole. Check the thing that actually matters to a user.
DNS TTLs you cannot live with. If your failover story depends on changing a record, and that record has a one-hour TTL with resolvers that ignore it anyway, your "instant" failover has an hour-long tail. Know your real propagation behaviour, not the theoretical one.
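
To make the health check point concrete, here is a minimal sketch of a check that reports on the dependencies a user request actually needs, rather than on whether the web server process is alive. It is framework-free on purpose. `FakePool`, `deep_health`, and the check names are hypothetical stand-ins, not any particular library's API.

```python
from typing import Callable, Dict, Tuple


class FakePool:
    """Stand-in for a database connection pool (hypothetical, for illustration)."""

    def __init__(self, size: int, in_use: int) -> None:
        self.size = size
        self.in_use = in_use

    def can_acquire(self) -> bool:
        # A real check would try to borrow a connection and run a trivial query.
        return self.in_use < self.size


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every named check; return 200 only if all of them pass, else 503."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:  # a check that raises counts as a failure
            results[name] = f"error: {exc}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results


exhausted_pool = FakePool(size=10, in_use=10)
status, detail = deep_health({
    "web": lambda: True,                    # the process is up
    "db_pool": exhausted_pool.can_acquire,  # but nothing can reach the database
})
```

With the pool exhausted, `status` comes back as 503 even though the web process is fine, so the load balancer stops sending traffic into the black hole. One design caution: if every instance shares the failing dependency, a deep check can pull the whole fleet out of rotation at once, so it is worth deciding deliberately which dependencies belong in it.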

None of this is exotic. It is the unglamorous middle of the stack that does not demo well and does not get prioritised until an outage makes the business care for about a fortnight.

resilience is a budget, not a binary

The honest framing is that resilience costs money and complexity, and you have to decide how much you want to buy. Multi-region active-active is real engineering effort and a permanent tax on every change you ship. For a lot of systems that is the wrong trade, and a clear-eyed "we accept that a regional outage takes us down for a few hours, here is our communication plan" is a perfectly respectable answer. What is not respectable is having that be your posture by accident, discovered live.

So the practical takeaway, the same one as last year and the year before: write down what you are actually relying on, including the bits hidden inside managed services. Decide explicitly which failures you will survive and which you will ride out. Then test the survival paths often enough that they are boring. A failover drill that goes smoothly is the cheapest insurance in this industry, and the only kind that pays out when you need it.
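
Testing "often enough" is easier to enforce when staleness is something a script can flag. A minimal sketch, assuming you keep a record of when each failover path was last exercised. The path names, dates, and the 90-day threshold are all invented for illustration, not a recommendation.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def stale_drills(
    last_run: Dict[str, Optional[date]],
    today: date,
    max_age_days: int = 90,  # assumed threshold; pick one that suits your system
) -> List[str]:
    """Return failover paths never exercised, or not exercised within the window."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name for name, ran in last_run.items() if ran is None or ran < cutoff
    )


# Hypothetical drill log: path name -> date it was last run (None = never).
drill_log = {
    "db-standby-promotion": date(2020, 1, 10),
    "secondary-region-cutover": None,
    "dns-failover": date(2020, 3, 20),
}

overdue = stale_drills(drill_log, today=date(2020, 4, 15))
```

Anything that lands in `overdue` is, in the terms used earlier, still a hypothesis rather than a capability.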

The cloud will have another bad day. It always does. The goal is to make your bad day shorter than theirs.