Another fortnight, another patch of the internet going dark for a while because something upstream had a bad afternoon. I am being deliberately vague on the exact provider and timeline, because by the time you read this it will be a different one, and the specifics are not the point. The point is the shape of the thing, and the shape is always the same.
A shared service most of us route through, directly or not, had a degradation. Sites that looked unrelated went down together, because they were not unrelated at all: they leaned on the same DNS, the same CDN, the same identity provider, the same region. The status pages went yellow, then green, and everyone moved on until the next time.
I am not here to dunk on the providers. Running infrastructure at that scale is genuinely hard, and they are usually better at it than I would be. What gets me is how surprised we keep acting. We chose the single dependency. We chose it because it is convenient and cheap and works nine days out of ten, and then we act shocked on the tenth.
The lesson, again, is unglamorous. Know your dependency graph, including the parts you did not choose on purpose. If your login, your CDN and your DNS all terminate in one provider's one region, you do not have three vendors, you have one with extra steps. Decide on purpose which failures you will ride out and which you will engineer around, because "the cloud is up" is not a design.
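If it helps to make that concrete, here is a minimal sketch of the "one with extra steps" check, assuming you can write down each dependency alongside the provider and region it ultimately lands in. Every name in it (the inventory, the provider and region labels, the `shared_failure_domains` function) is made up for illustration; a real inventory would have to be assembled by hand or from your own tooling, and it will be messier than this.

```python
from collections import defaultdict

# Hypothetical inventory: dependency name -> (provider, region) it ends up in.
# These names are invented for illustration, not taken from any real setup.
DEPENDENCIES = {
    "login": ("provider-a", "region-1"),
    "cdn": ("provider-a", "region-1"),
    "dns": ("provider-a", "region-1"),
    "payments": ("provider-b", "region-2"),
}


def shared_failure_domains(deps):
    """Group dependencies by (provider, region) and keep groups with more than one member."""
    groups = defaultdict(list)
    for name, domain in deps.items():
        groups[domain].append(name)
    # A group of one is an ordinary dependency; a group of several fails together.
    return {
        domain: sorted(names)
        for domain, names in groups.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    for (provider, region), names in shared_failure_domains(DEPENDENCIES).items():
        print(f"{provider}/{region}: {', '.join(names)} fail together")
```

This only catches the dependencies you wrote down. The ones you did not choose on purpose, such as whatever your vendors themselves depend on, still have to be found the hard way.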
None of this is new. I wrote something close to it after the last one, and I will probably write it again after the next. But resilience is boring and outages are exciting, so the boring work keeps losing the argument until the exciting thing happens to you. Do the boring work before then. It is much cheaper when nobody is watching.