Ramblings of an aging IT geek
linux

The systemd Unit That Refused to Stay Dead

A service that came back to life seconds after every stop, and the slightly humbling chase through restart policies, socket activation and reverse dependencies that found out why.

I spent the better part of an afternoon last week trying to stop a service. Not restart it, not reconfigure it, just make it not be running. I typed systemctl stop and it stopped, and then a few seconds later it was back, alive and serving traffic as if nothing had happened. I stopped it again. It came back again. At one point I genuinely said "stay down" out loud to a virtual machine, which is the sort of thing that should worry you about a career in operations.

The point of this post is the actual cause, which turned out to be two separate mechanisms conspiring behind one red herring, none of them a bug. The journey through all three is a decent tour of the ways a systemd unit can be brought back to life, and I learned a couple of things I should probably have known years ago.

First suspect: Restart=

The obvious culprit is the restart policy. If a unit has Restart=always and you do something that looks like a crash, it comes straight back. So my first move was to look at the unit:

[Service]
ExecStart=/usr/local/bin/widget-daemon
Restart=on-failure
RestartSec=2

Restart=on-failure, not always. As it happens the setting hardly matters here: systemd does not apply Restart= to a unit it stopped on request, so a systemctl stop is treated as deliberate and respected whatever the policy says, even with Restart=always. So in theory this unit should have stayed stopped. In practice it did not, which told me the restart was not coming from the restart policy at all. Worth ruling out early, because everyone reaches for Restart= first and it is very often a red herring.

To be sure, I checked what systemd actually thought had happened:

systemctl status widget-daemon

The status showed the unit going to inactive (dead) cleanly, then back to active (running) a moment later, with no "Failed" in sight. So it was not restarting because it had failed. Something was starting it. That is a different problem entirely, and it sent me looking at what could pull a stopped unit back up.
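
The same check can be scripted. This is a minimal sketch rather than anything from the original debugging session: the function name restart_verdict is made up, and it assumes a systemd recent enough to expose the NRestarts property and the --value flag of systemctl show.

```shell
# Report whether the restart policy has been involved in keeping a unit up.
# NRestarts counts automatic restarts scheduled by Restart=; Result records
# how the last run ended ("success" for a clean start or stop).
restart_verdict() {
    unit="$1"
    n=$(systemctl show --property=NRestarts --value "$unit")
    result=$(systemctl show --property=Result --value "$unit")
    # Treat a missing NRestarts value as zero so the comparison stays valid.
    if [ "${n:-0}" -gt 0 ] || [ "$result" != "success" ]; then
        echo "$unit: restart policy involved (NRestarts=$n, Result=$result)"
    else
        echo "$unit: no automatic restarts recorded; look for an external starter"
    fi
}
```

Called as restart_verdict widget-daemon.service, the second message is the one that matches what the status output was hinting at here: nothing failed, so something else is doing the starting.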

Second suspect: socket activation

The thing that can start a "stopped" service without anyone failing is socket activation. If there is a .socket unit listening on the daemon's port, then the moment any client connects, systemd happily starts the .service to handle it. You stop the service, a health check or a stray client hits the port a second later, and systemd does exactly what you asked of it weeks ago and brings the service up to answer.

Sure enough:

systemctl list-units --type=socket | grep widget

There it was, widget-daemon.socket, enabled and listening. We had set up socket activation ages ago to get faster boot ordering and never thought about it again. So stopping the service while leaving the socket up is almost pointless: the socket is the thing holding the door open, and the service is just whoever happens to be standing behind it at the time. The fix for actually taking it offline is to stop the socket too, or to mask it if you want it to stay down:

systemctl stop widget-daemon.socket widget-daemon.service

That got me much further. The service stayed down for a good thirty seconds, which after the previous hour felt like a triumph. And then it came back one more time, which is where the genuinely interesting bit lives.

Third suspect: a unit I did not write

The last resurrection was not socket activation, because the socket was stopped. It was a dependency I had not gone looking for. Another unit on the box, a little reconciliation timer that someone had written to "keep the platform healthy", had widget-daemon.service in its Wants=. Every two minutes it woke up, decided the daemon ought to be running because that was its whole job, and started it.

This is the failure mode nobody warns you about. systemd is a dependency graph, and anything in the graph can pull your unit up, not just its own restart policy. The way to find out who is responsible is to ask systemd directly rather than guessing:

systemctl list-dependencies --reverse widget-daemon.service

The reverse dependency list is the single most useful command I relearned that day. It shows you every unit that wants, requires, or is bound to the one you care about. The reconciliation timer was sitting right there in the output, smug as anything. Mystery solved, and slightly embarrassing, because I had spent an hour assuming the service was doing something to itself when in fact a well-meaning neighbour kept reviving it.
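
For a flatter view of the same information, the inverse dependency properties can be read straight off the unit. A rough sketch, with a made-up function name (who_can_start); the four properties it queries are standard systemctl show output, and TriggeredBy has the bonus of naming a socket or timer that activates the unit, which is a different relationship from Wants= or Requires=.

```shell
# List the units that can pull the given unit up, one relationship per line.
# Relationships with no entries are skipped.
who_can_start() {
    unit="$1"
    for prop in TriggeredBy WantedBy RequiredBy BoundBy; do
        value=$(systemctl show --property="$prop" --value "$unit")
        [ -n "$value" ] && printf '%s: %s\n' "$prop" "$value"
    done
    # The last test may be false; that is not an error.
    return 0
}
```

Running who_can_start widget-daemon.service prints only the relationships that actually have entries, so an unexpected name stands out quickly.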

What I took away

A few things stuck. A service can be brought up by its restart policy, by socket activation, or by any unit that declares a dependency on it, and those are genuinely different mechanisms with different fixes. systemctl status tells you whether it failed or was started cleanly, which immediately splits the problem in half. And list-dependencies --reverse is how you find the culprit when the service is being started by something else, which is the case people tend to forget exists.

The real lesson is older than systemd, though. When something refuses to stay dead, the question is not "why won't it die" but "who keeps bringing it back". On a modern Linux box the answer is almost always a relationship you set up once, for a good reason, and then completely forgot about.

A note on debugging this faster next time

If I had to do it again, I would skip the guessing and go straight to the journal with the unit's full lifecycle in view:

journalctl -u widget-daemon.service -u widget-daemon.socket --since "10 min ago" -o short-precise

Watching the timestamps line up is what would have told me, in about thirty seconds, that the restarts were not crashes. A failure shows up as code=exited, status=1/FAILURE or similar, followed by a "Scheduled restart job" line from the Restart= machinery. A start triggered by something else just appears as a clean Started line with no preceding failure, and widening the query to the whole journal for the same window usually shows which other unit ran just before it. The journal had the whole story the entire time; I simply did not read it carefully enough before reaching for theories.
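
That comparison is easy to mechanise. The filter below is only a heuristic sketch: the function name classify_starts is invented, and it assumes the default journal output where systemd's own messages carry a systemd[PID] prefix and automatic restarts are announced with "Scheduled restart job".

```shell
# Read journal lines on stdin and compare clean starts with automatic restarts.
# Every restart scheduled by Restart= is followed by a Started line, so any
# surplus of starts was initiated by something else: a person, a socket,
# or another unit.
classify_starts() {
    awk '
        /Scheduled restart job/        { restarts++ }
        /systemd\[[0-9]+\]: Started /  { starts++ }
        END {
            printf "starts=%d restarts=%d external=%d\n",
                   starts, restarts, starts - restarts
        }
    '
}
```

Piping journalctl -u widget-daemon.service --since "10 min ago" into classify_starts gives a one-line summary; a non-zero external count is the cue to stop staring at Restart=.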

The other habit worth forming is to treat .socket and .service as a pair in your head from the moment you enable socket activation. They are two units that behave as one thing, and operating on only half of the pair gives you exactly the confusing half-states I spent an afternoon in. systemctl stop foo.service when foo.socket is live is not a complete instruction; it is a suggestion the next connection will overrule. Stop both, or mask the socket, and the surprise goes away.
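
One way to make the habit stick is to wrap the pair in a helper. A small sketch, with hypothetical function names (stop_pair, take_offline), assuming the socket and service share a base name, which is the systemd default unless the socket sets Service=. Both need root.

```shell
# Stop a socket-activated service together with its socket, in one job.
stop_pair() {
    name="$1"
    systemctl stop "$name.socket" "$name.service"
}

# Take the pair offline until further notice: mask the socket (--now also
# stops it), then stop the service that was standing behind it.
take_offline() {
    name="$1"
    systemctl mask --now "$name.socket"
    systemctl stop "$name.service"
}
```

take_offline widget-daemon closes the door before stopping the service, so no stray connection can slip in between the two steps. systemctl unmask widget-daemon.socket undoes it. Neither helper does anything about other units that declare a dependency on the service; those still need finding separately.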

None of this is exotic. It is all in the manual, and I have read that manual more than once. But there is a difference between knowing that socket activation and reverse dependencies exist and remembering to suspect them when a service is misbehaving at two in the afternoon. That gap is where the afternoon went, and writing it down is my small attempt to close it before the next time.