Ramblings of an aging IT geek

when half the internet leans on the same control plane

A recent run of cloud control-plane wobbles is a reminder that our redundancy usually stops at the layer we can see, and the layer underneath is shared by everyone.

[Image: a newsroom screen showing a status dashboard going red]

There's been another round of cloud provider grief this month, and as usual the interesting part isn't the outage itself but the shape of it. A region has a bad afternoon, a control plane gets wedged, and suddenly a long list of companies who have nothing to do with each other all go quiet at the same time. The postmortems will be careful and well written, as they tend to be, and they'll describe something that almost nobody downstream could have prevented from where they were sitting.

I'm not going to name and shame a specific provider, partly because by the time you read this there will have been another one from a different vendor, and partly because the lesson generalises better without the brand attached. The point isn't that Vendor X is bad at this. The point is that we have all quietly agreed to share a small number of very large single points of failure, and we mostly don't think about it until one of them has a Thursday.

the redundancy we buy stops where we can see

Here's the thing that bites people. You do everything you're told. You run across multiple availability zones. You've got auto-scaling, health checks, the lot. Your application tier is genuinely resilient to a box dying, a rack dying, sometimes a whole zone dying. You've tested it. Good.

And then the thing that falls over isn't your application tier. It's the API you call to launch a replacement instance. Or the load balancer control plane that decides where traffic goes. Or the managed DNS that everything resolves through. Your data plane is fine, your servers are happily serving, but you can't change anything, can't scale, can't fail over, because the layer you'd use to do all of that is the layer that's broken. You bought redundancy at the level you could see and reason about, and the failure happened one level down, in the shared machinery that you can't see and don't operate.

[Image: a city skyline at dusk with the lights flickering]

This is the bit that makes these events feel so helpless from the customer side. During the worst of one of these, you often can't even read the status page reliably, because the status page is hosted on the thing that's down, or the dashboard that would tell you what's happening needs the same control plane that's currently on fire. You're reduced to refreshing third-party outage trackers and the vendor's social media, same as everyone else. It is a deeply undignified way to run a production incident.

what's actually new, and what isn't

None of this is new. We've had shared infrastructure failures for as long as we've had shared infrastructure. What's changed is the blast radius. When everyone's CDN, everyone's auth provider, everyone's DNS, and everyone's object storage resolve to a handful of companies, a single bad config push doesn't take down one site, it takes down a cross-section of the visible web. The correlation is the problem. It's not that failures got more likely, it's that they got more synchronised.
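
To put rough numbers on "more synchronised, not more likely", here is a back-of-envelope sketch. The figures are invented for illustration (100 sites, each down 0.1% of the time), and the shared case assumes the crudest possible model, where every site sits on one provider and they all go down together. It is not a measurement of any real vendor.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(at least k of n independent sites are down at once),
    where each site is down with probability p (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Assumed, illustrative numbers only.
N_SITES = 100
P_DOWN = 0.001  # chance a given site is down at any given moment

# Case 1: every site fails independently of the others.
independent_half_down = prob_at_least(N_SITES // 2, N_SITES, P_DOWN)

# Case 2: every site shares one provider, so they all fail together.
# "At least half are down" then happens exactly when the provider is down.
shared_half_down = P_DOWN

# The average number of sites down is identical in both cases.
expected_down = N_SITES * P_DOWN

print(f"expected sites down (both cases): {expected_down:.2f}")
print(f"P(half the sites down), independent: {independent_half_down:.3e}")
print(f"P(half the sites down), shared:      {shared_half_down:.3e}")
```

Each individual site is no less reliable in the shared case. What changes is that the "half the web is dark" event goes from effectively impossible to roughly as likely as a single outage.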

And the managed-everything trend, which I'm genuinely a fan of most days, makes this worse in one specific way. The whole pitch of managed services is "stop operating this yourself, we're better at it than you." That's usually true! I do not want to run my own load balancer fleet. But the flip side is that when the managed thing breaks, you have no levers. You can't ssh in and poke it. You can't roll back their deploy. You can only wait, and refresh, and update your own status page with increasingly apologetic wording.

so what do we actually do about it

I want to be careful here, because the easy answer is "go multi-cloud" and that answer is mostly wrong for most teams. Genuine active-active across two providers is enormously expensive in engineering time, doubles your operational surface, and very often introduces more outages from its own complexity than it prevents from provider failures. If you are not already very good at running on one cloud, you will not magically become good at running on two. I've watched teams burn a year on it and end up less reliable, not more.

What I think is actually defensible is smaller and less heroic:

  • Know your real dependency graph, including the boring shared bits. Not "we're on three AZs" but "every request we serve depends on this one managed DNS zone and this one auth provider, and if either is gone we are gone." Write it down. It's sobering.
  • Decide, deliberately, which dependencies are worth a fallback and which aren't. DNS with a sensible second provider is cheap insurance and worth it. A full hot standby of your entire stack in another region usually isn't. Pick on purpose, don't just inherit the defaults.
  • Make sure your incident process survives the dashboard being down. If your runbook assumes you can reach the cloud console, your runbook has a bug. Have the status pages, the contact paths, and the "we are aware, here's our holding update" templates somewhere that doesn't share fate with the thing most likely to break.
  • Be honest with the people who depend on you. "Our upstream cloud provider is having a regional incident, here's their status link, we'll update in 30 minutes" is a perfectly respectable thing to say. Pretending you have it handled when you're as stuck as everyone else just burns trust.
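
For the first bullet, "write it down" can be as unglamorous as a dictionary. The sketch below is a minimal example with entirely hypothetical service names (`checkout`, `managed-dns`, and so on); it walks a hand-written dependency map and reports, for each dependency, which user-facing entry points go dark if it does. It assumes every listed dependency is a hard one with no fallback, which is exactly the assumption worth challenging once you see the output.

```python
from collections import defaultdict

# Hypothetical dependency map: each service lists what it needs
# in order to serve a request. Names are made up for illustration.
DEPENDS_ON = {
    "checkout": ["api", "auth-provider"],
    "search": ["api", "search-index"],
    "status-page": ["cdn", "managed-dns"],
    "api": ["load-balancer", "managed-dns", "database"],
    "auth-provider": ["managed-dns"],
    "load-balancer": [],
    "database": [],
    "search-index": [],
    "cdn": [],
    "managed-dns": [],
}

# The things customers actually touch.
ENTRY_POINTS = ["checkout", "search", "status-page"]

def transitive_deps(service, graph):
    """Return everything `service` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

def shared_fate(entry_points, graph):
    """Map each dependency to the set of entry points that fail with it."""
    impacted = defaultdict(set)
    for entry in entry_points:
        for dep in transitive_deps(entry, graph):
            impacted[dep].add(entry)
    return dict(impacted)

report = shared_fate(ENTRY_POINTS, DEPENDS_ON)

# Worst offenders first: the dependencies that take the most with them.
for dep, entries in sorted(report.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print(f"{dep}: takes down {sorted(entries)}")
```

In this made-up map the managed DNS zone sits under everything, including the status page, which is the situation the third bullet warns about. The useful part is not the code but the argument you have with your colleagues while filling in the dictionary.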

the uncomfortable bit

The uncomfortable conclusion is that some of this risk is simply not yours to fix, and pretending otherwise is its own kind of mistake. You can do everything right and still be down because a company you pay had a bad config push two layers below where your architecture diagram stops. Accepting that is not the same as being careless. Accepting a risk on purpose is an engineering decision; pretending it isn't there is a comfort blanket.

What I'd ask for, if I had a vote, is just more honesty about shared fate, from vendors and from ourselves. Tell me what your service actually depends on internally. Tell me where the correlation lives. And on my side, stop drawing architecture diagrams that quietly assume the cloud is a flat infinite resource that never has a bad day, because it isn't, and it does, and it will again before the month is out.

The cloud didn't break the way we worry about. No region fell into the sea. A control plane got wedged, the same one a lot of us share, and for a few hours that was enough. We'll all nod along to the postmortem, resolve to map our dependencies properly, and mostly not do it. Then it'll happen again, and we'll be just as surprised, which is the part I find genuinely funny and slightly bleak in equal measure.