rootless containers, and making peace with subuid

The promise of rootless containers is simple and good: a container breakout lands you as an unprivileged user, not as root on the host. The reality, the first time you try it, is a confusing pile of permission errors that all look the same and none of which mention the actual problem. I have now done this enough times to have a path through it that does not involve swearing, so here is the path.

Why bother

The pitch is the threat model. With rootful Docker, the daemon runs as root, so a process that escapes the container can end up with root on your host. With rootless Podman, the whole stack runs as your user. A breakout gets your user's privileges, which on a properly set-up box is not much. For a home server that exposes things to the internet, that difference is worth a bit of upfront fiddling.

There is no daemon, either, which I have come to appreciate more than the security angle. Podman runs the container as a child of your shell or of systemd. No background process holding state, nothing to restart, nothing that keeps running after you log out unless you explicitly ask it to. That last clause is a footgun, and we will get to it.

"Rootless" does not mean "no setup", though, and pretending otherwise is how people end up frustrated. The whole point is to push the trust boundary down to the kernel's user-namespace machinery, and that machinery has to be told who you are and which UIDs you are allowed to pretend to be. Get that wrong and nothing works. Get it right once and it tends to stay working. The rest of this is mostly about getting it right once.

The plumbing that makes it work

Rootless containers lean entirely on user namespaces. The container thinks it is running things as root, UID 0, but that UID 0 is mapped to your real unprivileged UID on the host, and a range of other UIDs inside the container map to a block of "subordinate" UIDs allocated to you. This is the /etc/subuid and /etc/subgid mechanism, and it is the single most common thing people get wrong.

$ grep "$USER" /etc/subuid /etc/subgid
/etc/subuid:johnm:100000:65536
/etc/subgid:johnm:100000:65536

Those lines say my user gets 65536 subordinate IDs starting at 100000, for both UIDs and GIDs. The container's root maps to my UID, and container UIDs 1 through 65536 map to host UIDs 100000 through 165535. If that allocation is missing or too small, you get errors that look like the image is broken when the image is fine. On most modern distributions the range is created for you when the user is, but if you added the user with a minimal tool or are on something older, you may need to do it yourself:

sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 johnm
podman system migrate

That podman system migrate is the step everyone forgets. After changing subuid ranges you have to tell Podman to rebuild its mapping, otherwise it keeps using the old, broken one and you keep getting the same error and start questioning your sanity.
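The mapping arithmetic described above is simple enough to write down. This is a minimal sketch, not anything Podman ships: host_uid_for is a name I made up, and it assumes the default rootless layout, where container root is your own UID and the subordinate range follows from container UID 1.

```shell
# Sketch: translate an in-container UID to the host UID it maps to.
# Arguments (all illustrative): container UID, your host UID,
# subordinate range start, subordinate range size.
host_uid_for() {
    cuid=$1
    my_uid=$2
    sub_start=$3
    sub_count=$4
    if [ "$cuid" -eq 0 ]; then
        # Container root is your own unprivileged UID on the host.
        echo "$my_uid"
    elif [ "$cuid" -ge 1 ] && [ "$cuid" -le "$sub_count" ]; then
        # Container UID 1 is the first subordinate ID, and so on upward.
        echo $(( sub_start + cuid - 1 ))
    else
        echo "container UID $cuid is outside the mapped range" >&2
        return 1
    fi
}

# Example, assuming a host UID of 1000 and the range shown earlier:
#   host_uid_for 0 1000 100000 65536      prints 1000
#   host_uid_for 1000 1000 100000 65536   prints 100999
```

To see the map your own setup actually uses rather than my arithmetic, podman unshare cat /proc/self/uid_map prints it directly.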

The footguns, in the order they bit me

Volume permissions. This is the big one and it follows directly from the namespace mapping. A file your container writes as "root" lands on the host owned by your UID. A file it writes as some other in-container UID lands on the host owned by a number up in the 100000 range, which is a user that does not exist on the host, so ls shows you a bare number and you panic. The fix is usually podman unshare, which drops you into the same user namespace the container uses so that ownership makes sense again:

podman unshare chown -R 1000:1000 /home/johnm/containers/appdata

Run your chown inside unshare and the numbers line up. Run it outside and you are chowning host files to the wrong owner and wondering why nothing works.
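When ls hands you a bare number, it helps to run the map backwards and ask which in-container UID it corresponds to. Another hedged sketch under the same assumptions as the default rootless mapping. container_uid_for is a hypothetical helper, and the host UID of 1000 in the example is an assumption about my own box.

```shell
# Sketch: translate a host UID (as shown by ls -n) back to the container UID.
# Arguments (all illustrative): host UID, your host UID,
# subordinate range start, subordinate range size.
container_uid_for() {
    huid=$1
    my_uid=$2
    sub_start=$3
    sub_count=$4
    if [ "$huid" -eq "$my_uid" ]; then
        # Files owned by you on the host are owned by root in the container.
        echo 0
    elif [ "$huid" -ge "$sub_start" ] && [ "$huid" -lt $(( sub_start + sub_count )) ]; then
        # The first subordinate ID is container UID 1.
        echo $(( huid - sub_start + 1 ))
    else
        echo "host UID $huid is not mapped into the container" >&2
        return 1
    fi
}

# Example: a file owned by 100999 on the host belongs to container UID 1000.
#   container_uid_for 100999 1000 100000 65536   prints 1000
```

That is also why the chown above has to run inside unshare. In there, 1000:1000 means host owner 100999. Outside, it means host user 1000, which the container sees as root.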

Low ports. A rootless container cannot bind to ports below 1024 by default, because that is a privileged operation on the host and you are not privileged. You have three sane options: publish to a high port and put a reverse proxy in front, lower the unprivileged port floor as root with sysctl net.ipv4.ip_unprivileged_port_start=80 (which applies to every user on the host), or grant the CAP_NET_BIND_SERVICE capability to the binary that does the binding. I use the reverse-proxy approach because I want one thing terminating TLS anyway, and it keeps the containers boringly identical.
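Before choosing between those options, it is worth checking what the floor on your host actually is. A small sketch: needs_privilege is a hypothetical helper, and it assumes a Linux host where the kernel exposes the setting under /proc/sys.

```shell
# Sketch: succeed if publishing this host port needs more than an
# unprivileged user has. The optional second argument is the port floor.
# When omitted, it is read from the kernel.
needs_privilege() {
    port=$1
    floor=${2:-$(cat /proc/sys/net/ipv4/ip_unprivileged_port_start)}
    # Ports at or above the floor are open to unprivileged users.
    if [ "$port" -lt "$floor" ]; then
        return 0
    fi
    return 1
}

# Example:
#   if needs_privilege 443; then echo "publish to a high port instead"; fi
```

With the floor lowered to 80, port 80 itself becomes bindable and 79 does not, because the setting names the first unprivileged port.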

Lingering. Because there is no daemon, a rootless service started from your session dies when your session ends, and it will not start at boot unless you arrange it. The two pieces are systemd user services plus lingering:

loginctl enable-linger johnm
systemctl --user enable --now my-app.service

enable-linger is what lets your user's systemd instance keep running when you are not logged in. Without it, you reboot the server, nobody logs in, and none of your "always on" services are on. I learned this the way everyone learns it, by rebooting and finding the house quietly offline.
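A quick way to avoid learning it by reboot is to ask systemd whether lingering is on. This sketch only parses the output of loginctl show-user. linger_state is a name I invented. Treat any error from loginctl as "not lingering", since it may report an error rather than Linger=no for a user with no active session.

```shell
# Sketch: print "yes" or "no" given `loginctl show-user` output on stdin.
# Expected input is a line such as "Linger=yes".
linger_state() {
    sed -n 's/^Linger=//p'
}

# Usage on a real host (johnm is the example user from this post):
#   loginctl show-user johnm --property=Linger | linger_state
```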

Was it worth it

Yes, comfortably. The security improvement is real, the lack of a daemon genuinely simplifies my mental model, and quadlets, the newer systemd-native way of declaring rootless containers, have made the boot-time story much tidier than the hand-written unit files I started with. The cost is the one-time tax of understanding user namespaces well enough that the permission errors stop being mysterious and start being obvious.

If I had to give one piece of advice it would be this: the moment you hit a permissions wall, stop guessing and ask which UID namespace you are standing in. Almost every rootless problem I have had reduces to a file owned by a host UID I did not expect, because I forgot the container and the host are looking at the same bytes through two different maps. Once that clicks, the rest is just config.