There was another cloud wobble this month, the sort where half the internet hiccups at once and the providers' status pages stay stubbornly green for the first quarter of an hour. I will not pretend to know the exact root cause yet, because the honest post-mortem is always weeks away and the early speculation is usually wrong. What interests me is not this particular failure but the shape it shared with every one before it.
The pattern is always the same. A single regional control plane, or an authentication service, or a shared dependency nobody thought of as shared, has a bad few minutes. And then services that have no obvious connection to each other all fall over together, because underneath they were leaning on the same invisible thing.
That is the lesson, and it is uncomfortable. Your architecture diagram shows the dependencies you chose. It does not show the ones you inherited: the metrics endpoint that quietly resolves through the same provider, the package mirror, the certificate authority, the DNS that everything assumes will simply always be there. You did not draw those arrows because you never decided to depend on them. You just did, by default.
So what should we actually learn, beyond the usual ritual of nodding at "multi-region" and doing nothing? A few things I am taking seriously this week:
- Map the dependencies you did not choose. Walk one request end to end and write down every external thing it touches, including the boring ones. The list is always longer than the diagram.
- Decide, in advance, what degraded looks like. If your auth provider is down, does the app fail open, fail closed, or just hang? Hanging is the worst answer and the most common default.
- Cache the things that should outlive a blip. A short-lived token or a DNS answer held a little longer can be the difference between a wobble and an outage on your side.
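
To make the last two points concrete, here is a minimal sketch of what I mean, not a finished implementation. It wraps a lookup so that a recent answer is served from memory, an older answer is served as explicitly "stale" when the dependency is failing, and anything beyond that fails closed with a clear error instead of hanging. Every name in it (`StaleOnErrorCache`, `DependencyUnavailable`, `fresh_for`, `stale_for`) is my own invention for illustration, and it assumes the `fetch` function you pass in honors the timeout it is given and raises when the timeout expires. Whether serving a stale value is acceptable at all is a per-dependency decision, and for auth in particular it deserves some thought.

```python
import time
from typing import Callable, Dict, Tuple


class DependencyUnavailable(Exception):
    """Raised when the dependency is down and nothing usable is cached."""


class StaleOnErrorCache:
    """Hypothetical helper: serve fresh values, fall back to stale ones on error."""

    def __init__(
        self,
        fetch: Callable[[str, float], str],
        fresh_for: float = 60.0,    # seconds a value is served without refetching
        stale_for: float = 900.0,   # seconds a value may still be served during a failure
        timeout: float = 2.0,       # passed to fetch; fetch is assumed to enforce it
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._fresh_for = fresh_for
        self._stale_for = stale_for
        self._timeout = timeout
        self._clock = clock
        # key -> (value, time the value was stored)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Tuple[str, str]:
        """Return (value, state) where state is "fresh" or "stale"."""
        now = self._clock()
        entry = self._entries.get(key)

        # Recent enough: do not touch the dependency at all.
        if entry is not None and now - entry[1] < self._fresh_for:
            return entry[0], "fresh"

        try:
            value = self._fetch(key, self._timeout)
        except Exception as exc:
            # The dependency is having a bad moment. Ride it out on the old
            # value if it is still inside the stale window.
            if entry is not None and now - entry[1] < self._stale_for:
                return entry[0], "stale"
            # Nothing usable: fail closed, loudly, rather than hang.
            raise DependencyUnavailable(key) from exc

        self._entries[key] = (value, self._clock())
        return value, "fresh"
```

The point of returning the state alongside the value is that the caller gets to decide what degraded looks like: log it, show a banner, or refuse the stale answer for anything sensitive.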
The temptation after every one of these is either to panic and re-platform or to shrug and assume it will not happen to you. Both are wrong. The provider will have another bad day; that is simply the cost of running on someone else's computers, and it is mostly a fair trade. The work is making sure their bad day means a degraded experience for your users rather than a total outage.
I would rather spend an afternoon now finding the dependencies I cannot see than discover them live, in front of everyone, with the status page still green.