Ramblings of an aging IT geek
news

another region falls over, and the lessons we keep refusing to learn

Reflecting on the latest cloud outage and why "the cloud" being someone else's computer keeps surprising people who should know better.

There has been another cloud wobble this month, and once again my feed filled up with the same two reactions, in the same order, from the same people. First, "is it just me?". Then, a few minutes later, the dawning horror that no, it is not just you, it is half the internet, and your perfectly engineered service is down because a thing you do not own and cannot see decided to have a bad morning.

I am not going to name and shame the provider, partly because the details are still settling as I write this and I would rather be vague than wrong, and partly because it genuinely does not matter which one it was. It is always one of them, eventually. That is the whole point I want to make.

the cloud is someone else's computer

This phrase has been doing the rounds for a while now and people repeat it as a joke. It is not a joke. It is the single most important operational fact about the platforms we have all moved onto. When you put your service in a region, you have made a bet that someone else's datacentre, someone else's network, and someone else's control plane will stay up. Most days that is a brilliant bet. The whole industry has voted with its wallet and the convenience is real.

But "most days" is doing an enormous amount of work in that sentence. The trouble is that we design as though the cloud's uptime is our uptime, when in fact we have simply rented someone else's uptime and inherited their bad days as our own.

what actually went wrong, usually

I have watched enough of these now to spot the pattern, and it is almost never the thing people assume. It is rarely a datacentre on fire. It is far more often the control plane: the APIs that let you launch, scale, and route. The compute that was already running often keeps running. What dies is your ability to change anything, which matters enormously the moment you are trying to fail over and discover that the very tools you need to fail over with are the ones that are down.

This is the cruel bit. The autoscaler that should have saved you needs the API that is broken. The load balancer reconfiguration that should have routed around the problem needs the API that is broken. Your beautiful self-healing architecture turns out to depend on the one component that is currently on fire, and it heals itself right into a wall.

what we should actually do

Here is where I am supposed to tell you to go multi-region, multi-cloud, and never trust a single provider again. I am not going to, because that advice is usually given by people who have never had to pay for it or operate it.

Real multi-region is expensive, slow to build, and adds a category of bug, the cross-region consistency bug, that is genuinely harder than anything it protects you from. Most teams that announce a multi-cloud strategy end up with two single points of failure and a doubled bill. The honest position is more boring.

First, know your actual dependencies. Not the diagram you drew at the start, the real ones. The DNS provider, the certificate authority, the third-party auth, the thing five layers down that quietly calls home on every request. The outages that hurt most are the ones through a dependency nobody remembered they had.

Second, decide what your service does when the control plane is gone but the data plane is fine. Can your running instances keep serving with no ability to scale? Often the answer is yes, and that is a perfectly respectable place to ride out a two-hour incident. Static fallback pages, cached responses, a read-only mode: these are unglamorous and they keep you alive while the giants sort themselves out.
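To make "cached responses" a little less hand-wavy, here is a minimal sketch in Python of one way to do it: keep the last good answer and serve it, flagged as degraded, when the upstream call fails. Everything here is hypothetical and for illustration only: the `StaleOnErrorCache` name, the `fetch` callable, and the 30-second freshness window are my assumptions, not anything a particular provider or library gives you.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Serve the last good response when the upstream call fails.

    Hypothetical helper for illustration; `fetch` stands in for whatever
    upstream call your service really makes.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 30.0,  # assumed freshness window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Tuple[str, bool]:
        """Return (value, degraded); degraded is True for a stale copy."""
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1], False  # fresh enough, no upstream call
        try:
            value = self._fetch(key)
        except Exception:
            if cached is not None:
                return cached[1], True  # upstream is down: serve stale
            raise  # nothing cached, so nothing to fall back to
        self._store[key] = (now, value)
        return value, False
```

The `degraded` flag is the important design choice. It lets the caller show a banner or switch to read-only rather than pretending everything is fine, and a key that was never cached still fails loudly instead of returning something invented.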

Third, and this is the one nobody likes, write down what your uptime promise honestly is. If you are single-region on one provider, your real availability is at best theirs, minus your own mistakes. Stop promising four nines on an architecture that mathematically cannot deliver them. Either pay for the redundancy or be honest about the number.
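The arithmetic behind that is short enough to sketch in Python. The function names are mine, the component figures in the usage lines are made-up examples rather than any provider's published numbers, and multiplying availabilities assumes the components fail independently, which real dependencies often do not.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year that a given availability allows."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up.

    Assumes independent failures, which is an optimistic simplification.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


def availability_after_outage(outage_minutes: float) -> float:
    """Best possible yearly availability after one outage of this length."""
    return 1.0 - outage_minutes / MINUTES_PER_YEAR


# Four nines leaves a budget of about 52.56 minutes a year.
four_nines_budget = downtime_budget_minutes(0.9999)

# A single two-hour incident caps the year at roughly 99.977%.
after_two_hours = availability_after_outage(120)

# Hypothetical chain: provider region, DNS, third-party auth.
chain = serial_availability(0.9999, 0.9999, 0.9995)
```

A single two-hour incident is more than double the entire yearly budget for four nines, and the chain comes out lower than its weakest link, which is why the forgotten dependency five layers down belongs in the sum.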

the part I keep relearning

I say all this as someone who has been burned by exactly this and will be burned again. Every time, I tell myself I will map the dependencies properly and test the failure modes for real. Every time, the next outage finds a dependency I had forgotten and a failover path I had never actually exercised under load.

The lesson is not "the cloud is bad". The cloud is extraordinary, and I would not go back to racking my own tin for most of what I run. The lesson is that resilience is something you build and rehearse, not something you rent. The provider sells you availability. They cannot sell you a plan for the day their availability runs out, and that day always comes. Best to have written the plan beforehand, while the internet is calm and your hands are not shaking.