Another month, another big provider falling over. IBM Cloud had a wide global wobble last month that took out the console and a pile of services for the best part of an afternoon, and depending on which provider and region you run in, you have probably had one of your own. I am not writing this to dunk on anyone. Running infrastructure at that scale is genuinely hard, and the people on the receiving end of those pagers were having a much worse day than the rest of us reading about it. What I keep coming back to is how surprised everyone sounds each time, as though this were the first one.
It isn't. It is the same lesson, repeated, and we keep filing it under "freak event" so we never have to do anything about it.
The uncomfortable bit is that most of the pain on these days is self-inflicted. The provider has a bad hour. Plenty of teams then discover that their "highly available" setup was three availability zones in one region, all sharing a control plane, all behind one DNS resolver they had quietly come to depend on. The provider's failure was the trigger. The blast radius was their own.
So what should we actually take from it, beyond a smug tweet and a return to business as usual?
First, know what you cannot survive. Not in the abstract, on a slide. Specifically. If your primary region vanishes for two hours, what breaks, who notices, and how much money or trust does it cost per hour? Most teams cannot answer that, which means every outage is a fresh discovery rather than a rehearsed event. You do not need to be multi-region. You need to have decided not to be multi-region, on purpose, with the cost of that decision written down.
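Writing the cost down can be as plain as a few lines of arithmetic. Here is a minimal sketch in Python. Every number in it is made up for illustration, and the names (`OutageScenario`, `worth_mitigating`) are hypothetical. The point is only that the decision has a figure attached to it.

```python
from dataclasses import dataclass


@dataclass
class OutageScenario:
    """One failure you have decided to think about in advance."""
    name: str
    duration_hours: float          # how long you assume it lasts
    loss_per_hour: float           # money (or a proxy for trust) lost per hour
    expected_events_per_year: float  # your honest guess at frequency

    def cost_per_event(self) -> float:
        return self.duration_hours * self.loss_per_hour

    def expected_annual_cost(self) -> float:
        return self.cost_per_event() * self.expected_events_per_year


def worth_mitigating(scenario: OutageScenario, mitigation_cost_per_year: float) -> bool:
    """True if the expected yearly loss exceeds what the fix would cost."""
    return scenario.expected_annual_cost() > mitigation_cost_per_year


# Illustrative figures only: a two-hour loss of the primary region,
# guessed at once every two years, against a made-up multi-region bill.
region_loss = OutageScenario(
    name="primary region unavailable",
    duration_hours=2.0,
    loss_per_hour=15_000.0,
    expected_events_per_year=0.5,
)

print(region_loss.cost_per_event())            # 30000.0
print(region_loss.expected_annual_cost())      # 15000.0
print(worth_mitigating(region_loss, 120_000))  # False
```

With these invented numbers the answer is "stay single-region", and that is a perfectly good answer. What matters is that it was arrived at on purpose and can be revisited when the inputs change.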
Second, the control plane is not the data plane, and they fail differently. A lot of these incidents leave running instances running but make it impossible to change anything: no new deploys, no scaling, no console, no API. If your recovery plan starts with "spin up replacement capacity", and the thing you need to spin it up with is the thing that is down, you do not have a plan. You have a wish.
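One way to catch this before the bad day is to list what each recovery step depends on and check it against what you assume will be down. This is a rough sketch, not a tool: the dependency names and the plan itself are hypothetical, and a real plan would be longer and messier.

```python
# Things assumed to be unavailable during a control-plane outage.
# These labels are illustrative, not any provider's real service names.
CONTROL_PLANE = {"provider_api", "console", "autoscaler", "deploy_pipeline"}

# Each recovery step paired with the dependencies it needs to work.
RECOVERY_PLAN = [
    ("fail DNS over to standby", {"external_dns"}),
    ("spin up replacement capacity", {"provider_api"}),
    ("scale out the standby", {"autoscaler"}),
    ("serve from warm standby already running", set()),
]


def find_wishes(plan, unavailable):
    """Return the steps that need something assumed to be down."""
    blocked = []
    for step, needs in plan:
        missing = needs & unavailable
        if missing:
            blocked.append((step, sorted(missing)))
    return blocked


for step, missing in find_wishes(RECOVERY_PLAN, CONTROL_PLANE):
    print(f"WISH: '{step}' needs {', '.join(missing)}")
```

Run against this made-up plan, it flags the "spin up replacement capacity" and "scale out the standby" steps, which are exactly the ones that quietly assume the control plane is still there. The steps that survive are the ones relying on capacity that is already running.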
Third, and this is the one I find hardest to get teams to take seriously: practise. A failover you have never actually run is a hypothesis, not a capability. The first time you exercise it should not be at 02:00 with the customer on the phone. Game day it. Break the region on purpose, on a Tuesday, with coffee, and watch what falls over. It always surprises you, and it is so much cheaper to be surprised on a Tuesday.
None of this is new. "The cloud is just someone else's computer" is a fridge magnet at this point, but it remains the single most useful sentence anyone has said about the subject. Someone else's computer has someone else's bad days, on someone else's schedule, and you do not get a vote. The only thing you control is how much of your world you have wired directly to it without a fallback.
I have been on both sides of this. I have built the brittle thing, called it resilient, and found out otherwise during an incident I did not cause but absolutely owned. The provider's status page will go green again in a day or two and everyone will move on. The question worth carrying forward is the boring one: when this happens again, and it will, what specifically have we changed so that it costs us less than it did this time? If the honest answer is "nothing", then we did not learn anything. We just got our outage out of the way early.