Strictly speaking this wasn't a security disclosure, it was a routing incident, but it's the thing everyone in my corner of the internet has been talking about this week and the lesson sits in the same drawer. On Sunday a large transit provider, CenturyLink (Level 3 as was), had a very bad morning. One misbehaving configuration, a flowspec rule that propagated where it shouldn't have, and large swathes of traffic across the internet started failing or looping. Cloudflare, among others, reported a significant chunk of their traffic dropping. For a few hours, if your packets happened to want to traverse that network, they had a thoroughly miserable time.
What makes it worth writing about isn't the provider. Everyone has an outage eventually. It's how far the blast radius reached, and how little most of the affected sites could do about it. You can be a tidy little service with healthy origins, good monitoring, redundant everything, and still go dark because a transit network three hops away from your customers decided to fall over. Your status page says all green. Your customers can't reach you. Both things are true at once.
The uncomfortable bit about shared fate
We've spent a decade consolidating onto a handful of CDNs, a handful of transit providers, a handful of clouds, because it's cheaper and faster and genuinely better most of the time. The trade is that we've also consolidated the failure modes. When one of those big shared dependencies has a wobble, it isn't one site that goes down, it's a whole neighbourhood at once, and the very fact that everyone uses the same provider means everyone's "is it just me?" check fails simultaneously. Twitter becomes the status dashboard.
I went and looked at my own setup with this in mind, which is the only honest response to someone else's outage. A few findings, none of them flattering:
- My health checks all egress through the same provider. If that provider is the thing that's broken, every check goes red at once and tells me nothing about where the break is.
- My "failover" DNS has a TTL long enough that switching away from a sick provider takes longer than most of these incidents last. By the time the change propagates, the original is usually back.
- I had genuinely never tested what happens when origins are fine but the path to them isn't. All my failure drills assumed the failure was mine.
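To put rough numbers on the TTL finding, here is a minimal sketch in Python. It assumes resolvers honour the TTL (not all do) and that the worst case is a resolver that cached the old record just before the change went out. The function names and the figures in the example are mine, made up for illustration, not measurements from Sunday.

```python
def worst_case_failover_seconds(ttl_s: int, detect_s: int, decide_s: int) -> int:
    """Upper bound, from incident start, until the last cached old record expires.

    detect_s: time until I notice something is wrong (assumed)
    decide_s: time to decide on and push the DNS change (assumed)
    ttl_s:    TTL of the record being switched
    """
    return detect_s + decide_s + ttl_s


def failover_beats_incident(ttl_s: int, detect_s: int, decide_s: int,
                            incident_s: int) -> bool:
    """True if the switch fully takes effect before the incident ends anyway."""
    return worst_case_failover_seconds(ttl_s, detect_s, decide_s) < incident_s


if __name__ == "__main__":
    incident = 3 * 3600  # a hypothetical three-hour incident
    for ttl in (4 * 3600, 300):  # a long TTL versus a short one
        total = worst_case_failover_seconds(ttl, detect_s=300, decide_s=600)
        print(ttl, total, failover_beats_incident(ttl, 300, 600, incident))
```

With a four-hour TTL the switch cannot finish inside a three-hour incident no matter how quickly I react. With a five-minute TTL, the human part (noticing and deciding) becomes the bottleneck.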
What I'm actually changing
Not a great deal, if I'm honest, and that's deliberate. The temptation after an incident like this is to architect for it: multi-CDN, multi-transit, anycast all the things. For most of what I run that's a lot of complexity bought to survive a few hours every couple of years, and complexity has its own outages.
What I am doing is cheaper. I've added an external check that egresses through a different network, so when everything goes red I can at least tell "the internet is sad" from "my thing is sad." I've dropped a couple of TTLs to something I could actually act on inside an incident window. And I've written down, in plain English, the sentence I'll put on the status page when the problem is demonstrably upstream and out of my hands, because writing that calmly at 3am is harder than it sounds.
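The "internet is sad" versus "my thing is sad" distinction can be sketched as a small decision rule over check results from more than one network. This is an illustration under assumptions, not my actual monitoring: the vantage names, the `classify` function and its wording are all hypothetical, and each boolean stands for "a check from that network reached the service".

```python
from typing import Mapping


def classify(results: Mapping[str, bool]) -> str:
    """Turn per-vantage check results into a rough guess at where the break is.

    results maps a vantage-point name (hypothetical, e.g. "usual-provider")
    to True if the check from that network succeeded.
    """
    if not results:
        raise ValueError("need at least one vantage point")
    if all(results.values()):
        return "healthy"
    if not any(results.values()):
        # Every path fails: most likely my own service, though a very
        # widespread outage would look the same from here.
        return "probably mine"
    # Some paths work and some don't: the service is up, a path to it is not.
    failing = sorted(name for name, ok in results.items() if not ok)
    return "probably upstream: " + ", ".join(failing)


if __name__ == "__main__":
    print(classify({"usual-provider": False, "other-network": True}))
    print(classify({"usual-provider": False, "other-network": False}))
```

The rule is deliberately crude. Two vantage points can't prove anything, but they are enough to stop a wall of red from being completely uninformative.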
The thing I keep coming back to is that resilience here isn't really technical. The provider will fix their config, publish a tidy post-incident report, and it'll happen again in some other form next year. What you control is whether you can tell what's going on, whether you can say something true to your users while it's happening, and whether you've made peace with the fact that some of your uptime belongs to someone else's network team having a good day. Mostly they do. Sunday they didn't.