Ramblings of an aging IT geek

when the network goes down, everyone discovers their dependencies

Reflections after a large mobile network outage, and on how a failure like this reveals the dependencies nobody admitted they had.

This week a large US mobile network fell over for most of a morning, and a lot of people discovered, all at once, that "my phone has no signal" and "the whole region is down" feel identical from the inside. The explanation that came out afterwards pointed at a botched configuration change during network expansion rather than anything dramatic. No villain, no breach. Just a change that did not do what someone expected, at a scale where being slightly wrong is very visible.

I have no special insight into someone else's network. But I have shipped enough of my own changes that broke more than they touched to recognise the shape, and the shape is always the same: a change that was correct in the small and catastrophic in the large.

The thing an outage like this exposes is not the failing component. It is the dependency nobody wrote down. When a mobile network goes dark, the interesting part is not the people who could not text. It is the alarm panels that needed a data connection, the card readers that quietly assumed connectivity, the "offline-capable" apps that were offline-capable right up until they needed to phone home for a token. Everyone learns, simultaneously and unwillingly, exactly what they had been leaning on.

We do the same thing in our own systems, just smaller. We draw an architecture diagram, we feel good about it, and then an outage reveals the edge nobody drew: the service that "doesn't really depend on" the thing that just died, except for that one config fetch on startup, except for that one metrics endpoint that blocks the health check, except for the DNS lookup that everything secretly needs and nobody owns. The diagram is the system we designed. The outage shows the system we built.

So what should we actually learn, beyond the usual hand-wringing? A few honest things.

First, a change being correct is not the same as a change being safe to deploy everywhere at once. Staged rollout is boring and it is the single biggest difference between "we caught it in region two" and "it is on the news." If you cannot roll a change out gradually, that is itself the finding.
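For what it is worth, the idea fits in a few lines. This is a minimal sketch, not anyone's real deployment tooling: `staged_rollout`, the three callbacks (`apply_change`, `is_healthy`, `roll_back`) and the stage names are all hypothetical, invented here for illustration. The only point is the shape: the change goes to a small group first, gets checked, and is stopped and undone before it reaches everyone.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    stages: Iterable[List[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    roll_back: Callable[[str], None],
) -> List[str]:
    """Apply a change one stage at a time.

    If any target in a stage is unhealthy after the change, undo every
    target touched so far and stop. Returns the targets left with the
    change applied (an empty list after a rollback).
    """
    applied: List[str] = []
    for stage in stages:
        for target in stage:
            apply_change(target)
            applied.append(target)
        if not all(is_healthy(target) for target in stage):
            # Undo newest first, so later stages are never reached.
            for target in reversed(applied):
                roll_back(target)
            return []
    return applied


if __name__ == "__main__":
    # Hypothetical stages: one canary, then progressively larger groups.
    stages = [["canary"], ["region-1", "region-2"], ["region-3", "region-4"]]
    result = staged_rollout(
        stages,
        apply_change=lambda t: print(f"apply    {t}"),
        is_healthy=lambda t: t != "region-2",  # pretend region-2 breaks
        roll_back=lambda t: print(f"rollback {t}"),
    )
    print("still applied:", result)
```

In this toy run the change never touches region-3 or region-4, which is the entire argument for doing it this way. A real health check would need to wait long enough for the damage to show up, and that soak time is usually where the boredom comes from.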

Second, your real dependencies are the ones you only notice when they are gone. The way to find them before an outage does is to remove them on purpose, in a controlled way, and watch what screams. Nobody enjoys this and almost nobody does enough of it, myself included.
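A toy version of "remove it on purpose and watch what screams" might look like the sketch below. All of the names are my own assumptions for the sake of the example: `find_hard_dependencies`, `DependencyDown`, and the idea that each dependency can be represented as a plain callable handed to a `probe` function. Real systems need real fault injection at the network or service level, but the loop is the same: break one thing at a time and record what falls over.

```python
from typing import Callable, Dict, List

Dependency = Callable[[], object]


class DependencyDown(Exception):
    """Raised by a dependency that has been switched off on purpose."""


def find_hard_dependencies(
    dependencies: Dict[str, Dependency],
    probe: Callable[[Dict[str, Dependency]], bool],
) -> List[str]:
    """Disable each dependency in turn and list the ones that break the probe."""

    def broken() -> object:
        raise DependencyDown()

    screamers: List[str] = []
    for name in dependencies:
        # Copy the mapping so exactly one dependency is broken per experiment.
        experiment = dict(dependencies)
        experiment[name] = broken
        try:
            ok = probe(experiment)
        except Exception:
            # An unhandled failure counts as the system falling over.
            ok = False
        if not ok:
            screamers.append(name)
    return screamers


if __name__ == "__main__":
    # Hypothetical service: it claims that only "config" is required.
    deps: Dict[str, Dependency] = {
        "dns": lambda: "10.0.0.1",
        "config": lambda: {"feature": True},
        "metrics": lambda: None,
    }

    def startup_probe(d: Dict[str, Dependency]) -> bool:
        d["dns"]()      # the lookup nobody drew on the diagram
        d["config"]()
        try:
            d["metrics"]()
        except DependencyDown:
            pass        # metrics is handled as genuinely optional
        return True

    print(find_hard_dependencies(deps, startup_probe))
```

In this made-up service the list that comes back includes `dns`, which nobody wrote down as required. That gap between the list you expected and the list you got is the finding.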

Third, and this is the one I keep relearning: the failure is rarely the technology. It is a person making a reasonable change under normal pressure, with a blast radius they could not see. The fix is not "be more careful." The fix is to make the blast radius smaller, so that careful and careless produce roughly the same survivable outcome. Every time one of these makes the headlines, I go and look at my own deploys and ask which one of them could take a region down. The honest answer is usually "more than I would like," and that is the whole point of paying attention to someone else's bad morning.