Ramblings of an aging IT geek

the service that wouldn't die, and the restart loop that hid it

A systemd unit I tried to stop kept coming back from the dead, and the culprit was Restart=always quietly papering over a crash I needed to see.


I ran systemctl stop on a service and it stopped. Then, a second later, it was running again. Stop, running. Stop, running. For about a minute I genuinely wondered if I'd lost my mind, which is roughly the standard emotional arc of any systemd debugging session.

The unit wasn't haunted. It had Restart=always, but that alone doesn't explain a unit coming back after a stop: systemd honors an explicit systemctl stop and won't auto-restart a unit the operator stopped. What it can't do is tell a later start request from a stale orchestrator apart from a legitimate one. In my case a separate watchdog timer was noticing the gap and dutifully starting it back up. systemd was doing exactly what it was told. I just hadn't told it the whole truth.

The first thing that cut through the confusion was checking what systemd actually thought had happened, rather than what I assumed:

systemctl status myservice
journalctl -u myservice --since "5 min ago"

The journal showed the service exiting non-zero almost immediately after every start, well before I'd touched it. It wasn't a unit that wouldn't stay dead. It was a unit that wouldn't stay alive, crash-looping fast enough that the restarts blurred into one continuous "running" in my head.
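If you'd rather have a number than a hunch, systemd keeps a per-service restart counter. Here's a minimal sketch, assuming a systemd recent enough to expose the NRestarts property. The function names and the 10-second default window are my own inventions, not anything systemd ships. As far as I know the counter only covers automatic restarts from the Restart= policy, so starts issued by something external (like my watchdog timer) won't show up in it.

```shell
# restart_count UNIT: print how many times systemd has auto-restarted UNIT.
# Assumes the NRestarts property exists (older systemd versions lack it).
restart_count() {
  systemctl show "$1" --property=NRestarts --value
}

# is_crash_looping UNIT [SECONDS]: succeed if the restart counter climbs
# while we watch. The 10-second default is an arbitrary choice.
is_crash_looping() {
  unit=$1
  window=${2:-10}
  before=$(restart_count "$unit")
  sleep "$window"
  after=$(restart_count "$unit")
  # Treat an empty answer as 0 so the comparison never sees a blank.
  [ "${after:-0}" -gt "${before:-0}" ]
}
```

Something like is_crash_looping myservice && echo "dying repeatedly, not refusing to die" would have saved me that confused minute.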

Two things fixed it. First, the actual bug: a missing environment file meant the process bailed on startup, and Restart=always turned a clean, visible crash into an invisible flapping loop. The restart policy wasn't resilience, it was a blindfold. Second, for the debugging itself, systemctl stop followed immediately by masking gives you a moment of peace to think:

systemctl stop myservice
systemctl mask myservice

A masked unit is symlinked to /dev/null and physically cannot be started, by you or by an over-eager watchdog. Once I could keep it down, the journal told the real story in about thirty seconds.
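Since the mask is just a symlink, you can check for it without asking systemd at all. A small sketch follows. The is_masked name is hypothetical, and it only looks in /etc/systemd/system by default, which is where a plain systemctl mask puts the link; a mask made with --runtime lives under /run/systemd/system instead, so pass that directory as the second argument in that case.

```shell
# is_masked UNIT [DIR]: succeed if DIR/UNIT.service is a symlink to /dev/null.
# DIR defaults to /etc/systemd/system, the location of a persistent mask.
is_masked() {
  path="${2:-/etc/systemd/system}/$1.service"
  [ -L "$path" ] && [ "$(readlink "$path")" = "/dev/null" ]
}
```

Usage is along the lines of is_masked myservice && echo "still masked". systemctl is-enabled myservice reports the same state. Either way, run systemctl unmask myservice when you're done, or the unit stays unstartable long after you've forgotten why.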

The lesson I keep relearning: Restart=always is a fine policy for production and a terrible one while you're diagnosing, because it hides the very evidence you need. When a service "won't die", check whether it's actually dying over and over. Then read the journal before you read your own assumptions.
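One way to take the blindfold off without editing the real unit file is a temporary drop-in that overrides the restart policy. This is a sketch under assumptions, not what I actually did that day: the function name and the 90-debug-no-restart.conf filename are made up, and it defaults to /run/systemd/system so the override disappears on reboot. It only switches off the restart policy; anything external that starts the unit (a watchdog timer, say) will still do so, which is what masking is for.

```shell
# disable_restart_for_debug UNIT [DIR]: write a drop-in that sets Restart=no.
# DIR defaults to /run/systemd/system, which is cleared on reboot.
disable_restart_for_debug() {
  dropin_dir="${2:-/run/systemd/system}/$1.service.d"
  mkdir -p "$dropin_dir" &&
    printf '[Service]\nRestart=no\n' > "$dropin_dir/90-debug-no-restart.conf"
}
```

Run it as root, follow it with systemctl daemon-reload so systemd picks up the change, and delete the file and reload again once the crash is understood.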