Ramblings of an aging IT geek

the morning a single file grounded a country

Watching the FAA NOTAM outage ground US flights this week, and why a corrupted file taking down a whole system felt uncomfortably familiar.

Earlier this week the FAA's NOTAM system went down and the United States halted domestic departures nationwide for the first time since 2001. Thousands of flights were delayed, hundreds were cancelled, and the cause, by the FAA's own early account, was a damaged database file. A corrupted file. The system that tells pilots which runways are closed and where the temporary hazards are was taken offline by something that, in spirit, is one bad row.

I watched it unfold on the flight trackers over morning coffee, the ground stop spreading across the map, and I had that uncomfortable feeling you get when someone else's incident looks a lot like one of your own near-misses. NOTAM, short for Notice to Air Missions, is genuinely ancient plumbing. The teletype heritage is right there in the all-caps abbreviated format. None of us should be surprised that a system of that vintage has a single point of failure, and that the single point is one file on one machine.

What struck me was not that it failed, but the shape of the failure. A corrupted file took down the primary and, per the reporting, the attempt to fail over to the backup did not go cleanly either. I have lived that exact sentence. The backup that turns out to be replicating the corruption faithfully, or the failover path that nobody had actually exercised under real conditions, so the first genuine test of it is during the incident. Replication is not a backup if it replicates the bad thing. A failover you have never failed over to is a guess.
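To make the "replication is not a backup" point concrete, here is a minimal Python sketch. It is not anything the FAA runs. It assumes you keep several point-in-time snapshots rather than one live mirror, and that you can write some check for what a sane file looks like. The names `newest_valid_snapshot` and `looks_valid`, and the `notices` key, are invented for illustration.

```python
import json
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical shape: (unix timestamp, raw file contents).
Snapshot = Tuple[int, bytes]


def looks_valid(raw: bytes) -> bool:
    """Illustrative sanity check: parses as JSON and has the key we expect."""
    try:
        doc = json.loads(raw)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return False
    return isinstance(doc, dict) and "notices" in doc


def newest_valid_snapshot(
    snapshots: Iterable[Snapshot],
    is_valid: Callable[[bytes], bool] = looks_valid,
) -> Optional[Snapshot]:
    """Walk snapshots newest-first and return the first one that validates.

    A mirror that copied the corruption fails the check just like the
    primary does, so it is skipped rather than restored.
    """
    for snap in sorted(snapshots, key=lambda s: s[0], reverse=True):
        if is_valid(snap[1]):
            return snap
    return None  # nothing trustworthy: that is an incident, not a restore


if __name__ == "__main__":
    history = [
        (100, b'{"notices": ["RWY 09 CLSD"]}'),  # older, intact
        (200, b'{"notices": ["RWY 09 CL'),        # newer, truncated
    ]
    print(newest_valid_snapshot(history))
```

The design choice that matters is that the newest copy does not win by default. It has to pass the same check the primary would, and if nothing passes you get `None` instead of a quiet restore of garbage.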

You can be smug about old systems, and people were, all week. But I have built modern things that would behave exactly the same. A poisoned record that every consumer trusts. A config push that propagates instantly to every region because fast propagation is a feature right up until the thing you are propagating is wrong. The age of NOTAM is a red herring. The lesson is that any system where corruption flows downstream faster than a human can intervene will, eventually, do this to you, and the technology stack does not save you from it.

The part I actually respect is that the safe failure mode was to stop. Faced with a system it could not trust, the FAA grounded flights rather than dispatch them on stale or missing information. That is expensive and embarrassing and absolutely the right call. A lot of software I have worked on did not have a "stop and refuse to serve" mode; it had a "serve confidently from a corrupted cache" mode, which is far worse and far quieter. Failing loud and safe is a design choice, and it is one you have to make on purpose, long before the bad file arrives.
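A "stop and refuse to serve" mode can be very small. The Python sketch below assumes the producer publishes a SHA-256 digest alongside the data, which is my assumption and not something the reporting describes. `UntrustedDataError` and `load_verified` are made-up names.

```python
import hashlib
import json


class UntrustedDataError(Exception):
    """Raised when data fails verification; callers should stop serving."""


def load_verified(payload: bytes, expected_sha256: str) -> dict:
    """Parse the payload only if its SHA-256 digest matches the expected one."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256:
        # Fail loud: no fallback to a cache, no best-effort partial parse.
        raise UntrustedDataError(
            f"digest mismatch: expected {expected_sha256}, got {actual}"
        )
    return json.loads(payload)


if __name__ == "__main__":
    good = b'{"runway": "closed"}'
    digest = hashlib.sha256(good).hexdigest()
    print(load_verified(good, digest))

    try:
        load_verified(good + b"\x00", digest)  # simulate a damaged file
    except UntrustedDataError as err:
        print("refusing to serve:", err)
```

There is deliberately no `except` inside `load_verified` that returns the last known value. Whether to stop entirely or to serve stale data with a loud warning is a decision for the caller, made on purpose, not a default the loader picks for you.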

I do not know yet what the real root cause was, and the early "corrupted file" line may turn out to be a simplification of something messier. It usually is. But the version I will be telling myself this week is the simple one: check that your backup is not just your corruption with a different timestamp, and test the failover before the day you need it. Everyone says that. Almost nobody has actually pulled the plug to find out.