Ramblings of an aging IT geek
homelab

running the whole house on one docker-compose file

How I consolidated a sprawl of homelab services onto a single Docker Compose stack, with reverse proxy, backups, and the bits that bit me along the way.

A server rack, the home for a small pile of self-hosted things

For about two years my homelab was a museum of how I happened to feel on any given weekend. One service ran in a screen session because I was in a hurry. One was a systemd unit pointing at a virtualenv I no longer dared upgrade. Two more lived on a Raspberry Pi I had to physically go and look at to remember what it did. When the box rebooted, half of it came back and half of it did not, and I never knew which half until something stopped working.

So I spent a wet weekend collapsing the lot into a single docker-compose.yml. Not because Compose is fashionable, but because I wanted one file that described the entire house, and one command that brought it back. This is what I ended up with and what I learned doing it.

one file, one source of truth

The thing that sold me on Compose for a homelab specifically is that the file is the documentation. There is no separate wiki page that has drifted out of date. If a service is running, it is in the file. If it is in the file, I can read exactly which image, which ports, which volumes, and which environment it needs. Six months from now when I have forgotten everything, the file still knows.

The shape is boring on purpose:

services:
  traefik:
    image: traefik:v2.4
    restart: unless-stopped
    command:
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/acme.json:/acme.json

  jellyfin:
    image: jellyfin/jellyfin:latest
    restart: unless-stopped
    volumes:
      - ./jellyfin/config:/config
      - /tank/media:/media:ro
    labels:
      - traefik.enable=true
      - traefik.http.routers.jellyfin.rule=Host(`media.house.lan`)

Everything follows that pattern. Pin the image on anything I rely on, set restart: unless-stopped so a reboot brings it back, mount config into a directory I can see, and let Traefik discover the service by label rather than hand-editing nginx vhosts at midnight.

The lab, mid-consolidation, cables and all

the reverse proxy earns its keep

The single best decision was putting Traefik in front of everything. Before, every service had its own port and I kept a mental note of which was which. Now everything is a hostname on the LAN, Traefik reads the Docker labels, and TLS happens without me thinking about it. Adding a service is three lines of labels, not a new nginx config and a reload I was always slightly afraid of.
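To make the "labels, not vhosts" point concrete, here is a small bash sketch that prints the two labels the jellyfin service uses, for any router name and hostname. The helper name traefik_labels and the photos.house.lan hostname are made up for illustration. The label keys are the ones from the Compose file above, and traefik.enable=true is required because exposedbydefault is set to false.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Traefik labels for one service, in the same
# list form used under `labels:` in the Compose file.
set -euo pipefail

traefik_labels() {
  local name="$1" host="$2"
  # Router names must be unique across the stack; the service name is a safe choice.
  printf '%s\n' \
    "- traefik.enable=true" \
    "- traefik.http.routers.${name}.rule=Host(\`${host}\`)"
}

traefik_labels photos photos.house.lan
```

The output pastes straight under the new service's labels: key. The router name is keyed on the service name so that two services can never collide on the same router.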

It also meant I could stop exposing ports I did not need. Most of these services only talk to Traefik now, on an internal Compose network, and the only things bound to the host are 80 and 443. That alone tidied up the attack surface considerably, which matters more than I would like to admit given how casually some of these images are maintained.

volumes, and where the data actually lives

The mistake I made first, and want to save you from, is letting Docker manage named volumes for anything I cared about. Named volumes are fine until you want to back them up, move them, or just look at them, and then they are a scavenger hunt under /var/lib/docker. So everything stateful uses a bind mount into a directory next to the Compose file. Config lives in ./servicename/config. Bulk media lives on the ZFS pool and gets mounted read-only where it can.
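Which kind of volume you get is decided by the short SOURCE:TARGET[:MODE] syntax itself. On Linux, a source that looks like a path (starting with /, . or ~) is a bind mount, anything else is the name of a Docker-managed named volume, and a lone path with no colon is an anonymous volume. Here is a minimal bash sketch of that rule, using a hypothetical volume_kind helper. It is handy for spotting entries that will quietly end up under /var/lib/docker:

```shell
#!/usr/bin/env bash
# Sketch: classify one Compose short-syntax volume entry, SOURCE:TARGET[:MODE].
set -euo pipefail

volume_kind() {
  local spec="$1"
  # A single path with no colon is an anonymous volume: Docker picks the name.
  if [[ "$spec" != *:* ]]; then
    echo "anonymous"
    return 0
  fi
  local src="${spec%%:*}"        # everything before the first colon
  case "$src" in
    /*|.*|"~"*) echo "bind" ;;   # looks like a host path
    *)          echo "named" ;;  # a Docker-managed named volume
  esac
}

for spec in "./jellyfin/config:/config" "/tank/media:/media:ro" "jellyfin_config:/config"; do
  printf '%-28s %s\n' "$spec" "$(volume_kind "$spec")"
done
```

Anything reported as named or anonymous is state that lives outside the directory tree I back up.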

This makes backups trivial. The entire configured state of the house is the Compose file plus a tree of small config directories. That tree is a few hundred megabytes, it lives in a git repo (minus secrets), and a nightly job rsyncs it off-box:

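#!/usr/bin/env bash
set -euo pipefail
cd /srv/house
# Back up first: the Compose file plus every service's config directory.
rsync -a --delete ./ backup@nas:/backups/house/
# Pull newer images afterwards, so a registry hiccup cannot cost a night's backup.
docker compose pull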

If the machine died tomorrow, recovery is: install Docker, clone the repo, restore the config tree, docker compose up -d. I have actually tested this, which is the only reason I trust it. A backup you have never restored is a hope, not a backup.
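Those recovery steps are short enough to write down as a script, which also makes the restore drill repeatable. This is a sketch, not the script I actually run. It assumes Docker, git and rsync are already installed, and REPO_URL points at a hypothetical git host. It also defaults to a dry run that only prints the commands until DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch of the recovery steps: clone the repo, restore the config tree,
# start the stack. Assumes Docker, git and rsync are already installed.
# Defaults to a dry run that only prints the commands; set DRY_RUN=0 to act.
set -euo pipefail

REPO_URL="${REPO_URL:-git@git.house.lan:me/house.git}"   # hypothetical repo
BACKUP="${BACKUP:-backup@nas:/backups/house/}"            # the nightly rsync target
DEST="${DEST:-/srv/house}"

run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover() {
  run git clone "$REPO_URL" "$DEST"                       # Compose file + non-secret config
  run rsync -a "$BACKUP" "$DEST/"                         # the full config tree from the NAS
  run docker compose -f "$DEST/docker-compose.yml" up -d  # bring the house back
}

recover
```

Defaulting to a dry run is deliberate. A script whose whole job is to write over /srv/house should not do so by accident, and a dry run doubles as a readable checklist.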

the bits that bit me

A few things were not obvious going in.

  • acme.json needs to be chmod 600 or Traefik refuses to use it, and the error message is not as helpful as you would like.
  • restart: unless-stopped, not restart: always, is the right policy. With always, a service you deliberately stopped comes back after a reboot, which is maddening.
  • Running :latest on a media server is a great way to discover a breaking change at the worst possible moment. I now pin real version tags on anything I rely on and only float on :latest for the things I am happy to babysit.
  • Compose does not solve secrets. I keep an .env file out of git and reference it, which is hardly Vault, but it is honest about the fact that this is a house, not a bank.
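Most of those gotchas can be caught before running docker compose up -d rather than after. Here is a rough bash sketch of a pre-flight check. It assumes the layout used in this post (docker-compose.yml, traefik/acme.json and .env in the same directory) and GNU stat as found on most Linux hosts. preflight is a hypothetical name, and the image check is a plain text match on image: lines rather than a real YAML parse:

```shell
#!/usr/bin/env bash
# Sketch of a pre-flight check for the gotchas above. Simple text checks
# against the layout in this post, not a full YAML parser. Uses GNU stat.
set -euo pipefail

preflight() {
  local dir="${1:-.}" problems=0 mode ref name
  local compose="$dir/docker-compose.yml" acme="$dir/traefik/acme.json"

  if [ ! -f "$compose" ]; then
    echo "no docker-compose.yml in $dir"
    return 1
  fi

  # Traefik refuses an acme.json whose permissions are looser than 600.
  if [ -f "$acme" ]; then
    mode="$(stat -c '%a' "$acme")"
    if [ "$mode" != 600 ]; then
      echo "acme.json is mode $mode, want 600: chmod 600 $acme"
      problems=$((problems + 1))
    fi
  fi

  # An image with no tag, or an explicit :latest, floats to whatever was pushed last.
  while read -r ref; do
    if [ -z "$ref" ]; then continue; fi
    name="${ref##*/}"                                      # drop registry and namespace
    if [[ "$name" == *@sha256:* ]]; then continue; fi      # pinned by digest
    if [[ "$name" != *:* || "$name" == *:latest ]]; then
      echo "unpinned image: $ref"
      problems=$((problems + 1))
    fi
  done < <(sed -n 's/^[[:space:]]*image:[[:space:]]*//p' "$compose" | tr -d "\"'")

  # Secrets stay out of git: an .env file should be listed in .gitignore.
  if [ -f "$dir/.env" ] && ! grep -qxF '.env' "$dir/.gitignore" 2>/dev/null; then
    echo ".env exists but is not listed in .gitignore"
    problems=$((problems + 1))
  fi

  [ "$problems" -eq 0 ]
}
```

Source the file, then run preflight /srv/house. It prints one line per problem and returns non-zero if it found any, so it can gate a deploy script.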

healthchecks and start order

The other thing a homelab teaches you, usually at the worst time, is that "the container is running" and "the service is working" are not the same statement. A database container can be up a good few seconds before the database inside it is ready to answer a query, and anything that depends on it will cheerfully crash-loop in that gap. A plain depends_on only waits for the container to start, not for the thing inside it to be ready, which trips everybody up once.

So I lean on healthchecks, with restart: unless-stopped to mop up anything that still starts too early, rather than trying to choreograph a perfect start order by hand:

  db:
    image: postgres:13
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -h 127.0.0.1 -U app"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - ./db/data:/var/lib/postgresql/data

  app:
    image: ghcr.io/me/app:1.4.2
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy

With condition: service_healthy, the app waits until the database actually answers pg_isready, not just until its container exists. It is a small thing, but it turns a reboot from a flurry of crash-loops that eventually settle down into a clean, boring start. Boring is the goal. A homelab that boots boringly is one you stop thinking about, which is the entire point of building it.
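Under the hood, interval and retries amount to a polling loop. For one-off scripts outside Compose, such as a restore drill that needs the database up before it can check anything, the same idea is easy to sketch in bash. wait_until is a hypothetical helper, and it approximates Docker's healthcheck behaviour rather than reproducing its exact state machine:

```shell
#!/usr/bin/env bash
# Sketch: run a readiness check until it passes or the attempts run out,
# roughly what a healthcheck's interval and retries settings describe.
set -euo pipefail

# Usage: wait_until RETRIES INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
  local retries="$1" interval="$2" attempt
  shift 2
  for ((attempt = 1; attempt <= retries; attempt++)); do
    if "$@"; then
      return 0                 # the check passed: ready
    fi
    if ((attempt < retries)); then
      sleep "$interval"        # not ready yet: wait before trying again
    fi
  done
  echo "gave up after $retries attempts: $*" >&2
  return 1
}
```

For example, wait_until 5 10 docker compose exec -T db pg_isready -U app tries up to five times, ten seconds apart, mirroring the healthcheck settings on the db service.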

monitoring, lightly

I resisted adding monitoring for ages because it felt like enterprise cosplay for a house. I was wrong. The cheapest useful thing was a single container that watches the others and pings me when one stops responding. I do not need Prometheus and a wall of Grafana dashboards to run a media server and a few tools. I need to know, before my partner does, that the photos service has fallen over. A small uptime checker hitting each Traefik hostname does exactly that, and it lives in the same Compose file as everything else, which is the recurring theme: if it is part of the house, it is in the file.
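The core of such a checker fits in a few lines of bash, which makes a reasonable stopgap from cron before you pick a proper tool. This is a sketch. It assumes curl is installed and that every service answers HTTPS through Traefik. photos.house.lan is a made-up hostname, and the step that actually notifies you is left out:

```shell
#!/usr/bin/env bash
# Sketch of a tiny uptime check: request each hostname through Traefik and
# report the ones that do not answer. Hostnames here are examples only.
set -euo pipefail

HOSTS=(media.house.lan photos.house.lan)   # hypothetical list

check_hosts() {
  local host down=0
  for host in "$@"; do
    # -f: treat HTTP 4xx/5xx as failure; --max-time: don't hang on a dead service.
    if curl -fsS -o /dev/null --max-time 5 "https://$host/"; then
      echo "up   $host"
    else
      echo "DOWN $host"
      down=$((down + 1))
    fi
  done
  [ "$down" -eq 0 ]
}
```

check_hosts "${HOSTS[@]}" prints one line per host and returns non-zero if any are down. That exit status is the hook for whatever actually pings you.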

was it worth a weekend?

Comfortably. The house now comes back from a power cut on its own, the file tells me the truth about what is running, and adding something new is a pull request to myself rather than an archaeology dig. It is not a clever setup and it is not trying to be. It is one file, one command, and the very pleasant feeling of knowing where everything is. After two years of not knowing, that turns out to be the feature I actually wanted.