the runc bug, and why container "isolation" keeps me up at night

CVE-2019-5736 landed a couple of weeks ago and the whole industry spent the following days patching in a hurry. The short version: a malicious container could overwrite the host's runc binary and, from there, get code execution as root on the host. It affected Docker, containerd, CRI-O, basically everything that shells out to runc to actually start a container. Which is everything.

I was on call the week it broke, so I got to live the fun part. The fix itself was undramatic: bump the runtime, restart the daemon, move on. The interesting bit was watching how many people reacted as though the sky had fallen, and how many reacted as though nothing had happened. Both were wrong.

Here is the thing I keep trying to get across to people who run containers in anger. A container is not a virtual machine. It is a process on your host with some namespaces and cgroups drawn around it, sharing the same kernel as everything else. That is a feature (it is why containers are cheap and fast), but it means the security boundary is the kernel and the runtime, not a hypervisor. When the runtime has a bug that lets you climb out, you climb out onto the host. There is no second wall behind the first.
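If you want to see the "just a process" part for yourself, here is a minimal sketch, assuming a Linux host where /proc is mounted and you have permission to read another process's namespace links (usually that means root). The function names are mine, made up for illustration. It reads the namespace identifiers for two PIDs and reports which ones they share.

```python
import os


def read_namespaces(pid):
    """Map each namespace name (e.g. 'pid', 'net') to its identifier for one process.

    Assumes Linux: /proc/<pid>/ns/<name> is a symlink whose target looks
    like 'net:[4026531992]'. Two processes with the same target are in
    the same namespace.
    """
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in os.listdir(ns_dir)
    }


def shared_namespaces(a, b):
    """Return the sorted names of namespaces that two processes have in common."""
    return sorted(name for name in a if b.get(name) == a[name])
```

Calling shared_namespaces(read_namespaces(1), read_namespaces(container_pid)) with the host PID of a containerised process shows which walls were actually drawn for it. Whatever the list says, both processes are still talking to the same kernel.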

So the real lesson of this CVE is not "runc had a bug", because all software has bugs. It is that a lot of shops were running untrusted or semi-trusted images on shared hosts and quietly assuming the container was doing more isolation work than it actually does. If you build and run only your own images, this was a Tuesday: patch and forget. If you let customers push arbitrary images onto multi-tenant hosts, this should have been a long, uncomfortable conversation about your threat model.

The mitigations that already existed are the ones worth internalising. Don't run your workload as root inside the container if you can avoid it. Turn on user namespaces so container-root maps to an unprivileged host UID. Keep a read-only root filesystem where you can. And if you genuinely need to run hostile workloads, reach for something with a real boundary: gVisor, or Kata, or just a plain VM per tenant. The point is to stop pretending a namespace is a sandbox when the workload is adversarial.
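The first three of those are easy to check mechanically. Below is a rough sketch, not a real audit tool: it assumes you hand it the parsed JSON for one container from `docker inspect`, and that the `Config.User`, `HostConfig.ReadonlyRootfs` and `HostConfig.UsernsMode` fields are present in the shape I expect. Whether the daemon has user namespace remapping turned on is not something I can read from the container's own record, so it is passed in as a plain boolean. The function name and the wording of the findings are my own.

```python
def audit_container(inspect, daemon_userns_remap):
    """List which of the basic mitigations a container is missing.

    `inspect` is assumed to be one parsed object from `docker inspect`.
    `daemon_userns_remap` is an assumption supplied by the caller: True
    if the daemon remaps container UIDs to unprivileged host UIDs.
    """
    findings = []
    config = inspect.get("Config") or {}
    host_config = inspect.get("HostConfig") or {}

    # An empty user means the default, which is root. Strip any ":group" part.
    user = (config.get("User") or "").split(":")[0]
    if user in ("", "0", "root"):
        findings.append("runs as root inside the container")

    if not host_config.get("ReadonlyRootfs", False):
        findings.append("root filesystem is writable")

    # A container can opt out of remapping by asking for the host user namespace.
    if not daemon_userns_remap or host_config.get("UsernsMode") == "host":
        findings.append("container root is not remapped to an unprivileged host UID")

    return findings
```

An empty list does not mean the container is a sandbox. It only means the cheap mitigations are in place; for adversarial workloads the stronger boundary is still the answer.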

I'm not down on containers. I've shipped most of my last few years of work in them and I'd do it again tomorrow. But a disclosure like this is a healthy jolt. It costs nothing to remember, every few months, what the boundary actually is and what it is not. The patch took me ten minutes. The mental model is the part worth keeping.