Ramblings of an aging IT geek

Yet another outage, and the lessons we keep not learning

Reflections at the end of a year full of cloud and DNS outages, and the resilience habits we keep promising ourselves and never building.

A skyline of servers and headlines

We are closing out 2016 with the same story we opened it with: a big chunk of the internet went dark because something central wobbled. I won't pretend to know every detail of the most recent one while it's still fresh and the post-mortems aren't out, but the shape is wearily familiar by now. A single dependency, a long way upstream of you, has a bad day, and suddenly a hundred companies who thought they were independent discover they were all standing on the same paving slab.

This has been the year for it. Back in October, the attack on Dyn took out DNS resolution for a swathe of the web, and people who had never thought about who resolves their domain names found out the hard way. The detail that stuck with me wasn't the botnet, alarming as the Mirai business was. It was how many large, sophisticated, well-funded engineering organisations had quietly arranged things so that one provider failing meant they failed too. Single DNS provider, single record set, no secondary. We'd all read the runbook that says don't do that. We did that.

The city carries on. The internet underneath it, less so

The cloud didn't break the rules. We forgot them

There's a temptation, every time one of these happens, to write a piece about how the cloud is fragile and we should all go back to racking our own kit. That's the wrong lesson and it's lazy. The cloud is, on the whole, more reliable than the wiring cupboard most of us came from. A provider's region has better power, better networking and better people watching it than my old comms room ever did.

The point isn't that the cloud is unreliable. The point is that its failures are correlated. When your own server dies, it dies alone, and your competitors carry on. When a shared region or a shared DNS provider has a bad hour, you and everyone who made the same sensible default choice go down together, at the same moment, in front of the same customers. The old failures were independent. The new ones are synchronised. That changes the maths of resilience entirely, and most of our architectures haven't caught up.
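To put rough numbers on the independent-versus-synchronised point, here's a back-of-envelope sketch in Python. Every figure in it is invented for illustration, not measured: I'm assuming ten companies, each with a 0.1% chance of being down in any given hour, and a shared upstream dependency with the same 0.1% chance of a bad hour. The function names are mine.

```python
# Back-of-envelope comparison of independent vs correlated outages.
# All probabilities here are illustrative assumptions, not measurements.


def p_all_down_independent(p_each: float, n: int) -> float:
    """Chance that n companies are all down in the same hour,
    if each one fails independently with probability p_each."""
    return p_each ** n


def p_all_down_shared(p_shared: float, p_each: float, n: int) -> float:
    """Chance that all n are down in the same hour when they also share
    one upstream dependency that fails with probability p_shared.
    Either the shared dependency fails (everyone is down), or it holds
    up and all n happen to fail independently anyway."""
    return p_shared + (1 - p_shared) * p_each ** n


if __name__ == "__main__":
    n, p = 10, 0.001
    print(f"independent: {p_all_down_independent(p, n):.3e}")
    print(f"shared:      {p_all_down_shared(p, p, n):.3e}")
```

Under those assumptions the first figure is vanishingly small, while the second is roughly the shared dependency's own failure rate. Nobody's individual reliability got worse; the shared dependency simply sets a floor on how often everyone is down at once.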

What actually helps

None of this is novel. That's what's depressing about it. We know what to do and we don't do it because it costs money and effort on a Tuesday when everything's fine.

  • Secondary DNS. A second provider with a sane TTL is the cheapest insurance in this entire industry, and October proved most of us hadn't bought it.
  • Multi-region, and actually test the failover. A standby region you've never failed over to is not a standby region, it's a hope. Run the drill. Pull the plug deliberately, in daylight, with everyone watching, before the universe does it for you at 3am.
  • Know your transitive dependencies. Your status page, your payment provider, your error tracker, your CDN: how many of those sit in the same region as the thing they're meant to report on? A status page that goes down with the service it monitors is a special kind of useless, and plenty did this year.
  • Graceful degradation. Decide, on purpose and in advance, what your product does when a dependency is gone. Read-only mode, cached responses, a polite holding page. The default of "white screen and a stack trace" is a decision too, just not a good one.
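The last of those is the easiest to show in code. What follows is a minimal sketch of "cached responses, then a polite holding page", not a production pattern: the class name, the `HOLDING_PAGE` text and the one-hour staleness limit are all assumptions of mine, and `fetch` stands in for whatever call you make to the dependency.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical holding text shown when nothing better is available.
HOLDING_PAGE = "We're having trouble right now. Please try again shortly."


class CachedFallback:
    """Wraps a call to a dependency and serves the last good answer,
    marked as stale, when the dependency fails."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        max_stale_seconds: float = 3600.0,  # assumed limit, tune to taste
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Tuple[str, str]:
        """Return (status, body); status is 'fresh', 'stale' or 'unavailable'."""
        try:
            body = self._fetch(key)
        except Exception:
            # Deliberately broad: any failure of the dependency degrades
            # to the cache rather than to a stack trace.
            cached = self._cache.get(key)
            if cached is not None and self._clock() - cached[0] <= self._max_stale:
                return "stale", cached[1]
            return "unavailable", HOLDING_PAGE
        self._cache[key] = (self._clock(), body)
        return "fresh", body
```

The caller gets a status alongside the body, so the page can say "prices may be out of date" instead of pretending nothing is wrong. The design choice that matters is that the failure path was decided in advance, which is the whole point of the bullet above.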

The same skyline, a different lock, the same lesson

The bit nobody wants to hear

Real resilience costs real money and real complexity, every single day, to protect against an outage that might happen twice a year. That trade is genuinely hard to justify to a finance team when the graphs are green. So we don't, and then after each outage we write the incident review and promise we will, and then the green comes back and the promise quietly expires.

I'm not above this. We had a hard look at our own DNS after October and found exactly the single-provider arrangement I'd just been smugly criticising in other people. We fixed it. It took an afternoon and a small monthly bill, and the only reason it happened is that a real outage made the abstract risk feel concrete for about a week.
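If you want to run the same check, here's a rough sketch of the sort of thing I mean, not the exact script we used. It assumes you've already collected your domain's nameserver hostnames (the output of `dig +short NS yourdomain` will do) and it guesses the provider from the last two labels of each hostname. That guess is crude and will be wrong for suffixes like `.co.uk`, so treat the result as a prompt to go and look, not a verdict. The function names and the `.example` hostnames are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_provider(nameservers: Iterable[str]) -> Dict[str, List[str]]:
    """Group nameserver hostnames by a crude guess at their provider:
    the last two DNS labels, lower-cased. This is an assumption, not a
    proper public-suffix lookup."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ns in nameservers:
        labels = ns.rstrip(".").lower().split(".")
        provider = ".".join(labels[-2:])
        groups[provider].append(ns)
    return dict(groups)


def is_single_provider(nameservers: Iterable[str]) -> bool:
    """True if every nameserver appears to belong to one provider
    (or if there are none at all, which is no better)."""
    return len(group_by_provider(nameservers)) <= 1


if __name__ == "__main__":
    # Hypothetical hostnames, as you might paste them in from dig.
    ours = ["ns1.provider-a.example.", "ns2.provider-a.example."]
    if is_single_provider(ours):
        print("All nameservers look like one provider:", group_by_provider(ours))
```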

So that's my actual takeaway from a year of outages, and the only one I'd stand behind. The technical lessons are all old news; we've known them for a decade. The hard part is keeping the fear fresh enough to fund the fixes while everything still works. Use this latest one while it's raw. Spend the afternoon now, in the quiet bit between Christmas and New Year, finding the single points of failure you're quietly relying on. Next year will hand us another outage. The only question is whether we'll have done our homework before it arrives.