the unit that came back from the dead, repeatedly

I stopped the service. The service did not stop. Not in the dramatic sense: it didn't ignore the signal and hang there refusing to die. It died perfectly happily, then came back about three seconds later, fresh and innocent, as if I'd never touched it. I'd run systemctl stop foo, watch it go, run systemctl status foo, and there it was again, active and running, mocking me.

This is the kind of problem that makes you doubt your own sanity, because every individual piece of it is doing exactly what it was told. There's no bug. There's no race in the usual sense. There's just a set of perfectly reasonable instructions that, taken together, mean "this thing must never be allowed to stay down", and I'd written most of them myself months earlier and forgotten.

what was actually happening

The first thing to understand is that systemctl stop and "this unit will now stay stopped" are not the same statement. stop sends the configured stop sequence and brings the unit to inactive. What happens next depends on the rest of the unit's configuration, and on anything else in the system that has an opinion about whether this unit should be running.

The unit had this, which I had put there deliberately and correctly:

[Service]
ExecStart=/usr/local/bin/foo
Restart=always
RestartSec=3

Restart=always does roughly what it says. If the process exits, for any reason, systemd starts it again after RestartSec. The crucial detail, and the one that bit me, is what counts as "the process exited" versus "an operator asked for the unit to stop". When you run systemctl stop, systemd knows the stop was intentional and does not trigger Restart=. That's the whole point of the distinction. So Restart=always on its own wasn't my problem. A manual stop should have stuck.
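
If you'd rather check that on a live system than take my word for it, systemd will tell you how many automatic restarts it has performed. This is a minimal sketch, assuming a systemd recent enough to expose the NRestarts property; restart_summary is just a name I've picked for illustration.

```shell
# Print the restart policy, restart delay, automatic-restart count and
# current state of a unit, one Key=Value pair per line.
# Usage: restart_summary foo.service
restart_summary() {
    systemctl show "$1" \
        --property=Restart \
        --property=RestartUSec \
        --property=NRestarts \
        --property=ActiveState
}
```

As I understand it, NRestarts only counts restarts that Restart= scheduled. If it stays at 0 while the service keeps reappearing, the restart policy probably isn't what's bringing it back.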

The problem was the second thing with an opinion. Some months back, during a different incident, I'd added a small "supervisor" timer, a belt-and-braces measure because the service had once died quietly overnight and nobody noticed until the morning. It was a .timer that fired every thirty seconds and ran a oneshot unit whose entire job was:

systemctl is-active --quiet foo || systemctl start foo

Read that back in the context of me trying to stop the service. I stop foo. Within thirty seconds the timer fires, the oneshot checks is-active, finds it inactive, and dutifully starts it again. From my terminal it looked like the unit was respawning itself in three seconds because I kept conflating the two mechanisms. It wasn't Restart=. It was a cron-in-systemd-clothing watchdog I'd written and then mentally filed under "harmless monitoring".
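
For illustration, the pair of units might have looked something like the sketch below. This is a reconstruction, not the original files: the timer's name, the descriptions and the exact timer settings are my assumptions, and write_watchdog_units is a hypothetical helper that just drops them into a directory so you can read them.

```shell
# Write a hypothetical reconstruction of the watchdog pair into a directory.
# Usage: write_watchdog_units /tmp/units
write_watchdog_units() {
    dir=$1

    # The oneshot: start foo if it is not currently active.
    cat > "$dir/foo-watchdog.service" <<'EOF'
[Unit]
Description=Start foo if it is not running

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'systemctl is-active --quiet foo || systemctl start foo'
EOF

    # The timer: fire the oneshot roughly every thirty seconds.
    cat > "$dir/foo-watchdog.timer" <<'EOF'
[Unit]
Description=Check on foo every thirty seconds

[Timer]
OnBootSec=30
OnUnitActiveSec=30
AccuracySec=1s

[Install]
WantedBy=timers.target
EOF
}
```

Pointing it at a scratch directory is enough to inspect the result. AccuracySec=1s is there because timers default to one-minute accuracy, which would blur a thirty-second interval.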

the part that wasted the most time

I spent a good half hour staring at Restart=, RestartSec, and StartLimitIntervalSec, convinced the answer was in there. It wasn't, because that machinery was behaving correctly. The actual culprit was a completely separate unit that didn't even appear in the status output of the one I was watching.

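The thing that finally pointed me at it was the journal with timestamps: journalctl -u foo for the timing, then the unfiltered journal for context, since -u foo hides every other unit's messages. The restart wasn't every three seconds. It was every thirty, give or take, and in the unfiltered view it was always immediately preceded by a log line from a different unit: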

foo-watchdog.service: Starting...
foo.service: Started foo.

Once you see the watchdog's name in the journal right before every resurrection, the whole thing collapses into something obvious and a bit embarrassing. The service wasn't haunted. I'd built it a defibrillator and forgotten to mention it to myself.
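
Once you have a name, it's worth confirming what fires it. A small sketch, assuming the culprit is triggered by a timer; who_triggers is a made-up helper name, and TriggeredBy is the unit property I'd expect to name that timer.

```shell
# Show which units trigger the given unit, then list every timer
# systemd knows about, including inactive ones.
# Usage: who_triggers foo-watchdog.service
who_triggers() {
    systemctl show "$1" --property=TriggeredBy
    systemctl list-timers --all --no-pager
}
```

If TriggeredBy comes back empty, the unit is being started some other way, and the timer list is at least a quick inventory of everything on the box that runs on a schedule.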

what I changed

The watchdog wasn't wrong to exist; it was wrong to be unconditional. A monitor that restarts a service as soon as it notices the service is down is fine for crashes and actively hostile for maintenance. The fix was to make "I am deliberately taking this down" a state the watchdog respects.

The cheap version, which is what I actually shipped, is to mask the service when you want it genuinely down:

systemctl mask foo

A masked unit is symlinked to /dev/null and cannot be started by anything, including a well-meaning systemctl start from a timer. Masking on its own doesn't stop a unit that's already running, so stop it first or use systemctl mask --now foo. After that, the watchdog's start fails, the unit stays down, and I get to do my maintenance in peace. systemctl unmask foo puts it back. The nicer long-term version is to give the watchdog a sentinel to check, a flag file that says "maintenance in progress, leave this alone", so an operator can suppress it without remembering the mask dance. But mask got me through the evening, and it's the right hammer for "no, really, stay down".
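
The sentinel version could look roughly like this. It's a sketch of the idea rather than anything I shipped: the flag path /run/foo.maintenance and the function name watchdog_check are both hypothetical.

```shell
# Hypothetical flag file; its presence means "maintenance in progress".
FOO_MAINTENANCE_FLAG=${FOO_MAINTENANCE_FLAG:-/run/foo.maintenance}

# Start foo only if it is down AND nobody has declared maintenance.
watchdog_check() {
    if [ -e "$FOO_MAINTENANCE_FLAG" ]; then
        echo "maintenance flag present, leaving foo alone"
        return 0
    fi
    systemctl is-active --quiet foo || systemctl start foo
}
```

An operator would touch the flag before stopping foo and remove it afterwards. Keeping it under /run means a forgotten flag shouldn't survive a reboot. A possibly neater alternative is to skip the script logic and put ConditionPathExists=!/run/foo.maintenance in the oneshot's [Unit] section, so systemd itself skips the check while the flag exists.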

While I was in there, I also reconsidered whether the watchdog deserved to exist at all in that form. The thing it was guarding against, a service dying quietly overnight, is exactly what Restart=always already handles, and handles better, because systemd restarts a crashed service after RestartSec, three seconds here, rather than waiting up to thirty seconds for a timer to notice. The watchdog was solving a problem that the unit file had already solved, just less reliably and with the added bonus of resurrecting things I'd deliberately stopped. So the right move wasn't to make the watchdog smarter; it was to delete it. If you want proper "tell me when this stays down" alerting, that belongs in monitoring, where its job is to page a human, not to silently fight the operator for control of the service.
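
Deleting it is only a few commands. A sketch, assuming the timer is called foo-watchdog.timer and both unit files live in /etc/systemd/system; both of those are guesses for illustration, as is the helper name remove_watchdog.

```shell
# Stop and disable the timer, remove both unit files, reload systemd.
# The default directory is an assumption; pass the real one as $1.
remove_watchdog() {
    unit_dir=${1:-/etc/systemd/system}
    systemctl disable --now foo-watchdog.timer
    rm -f "$unit_dir/foo-watchdog.timer" "$unit_dir/foo-watchdog.service"
    systemctl daemon-reload
}
```

The disable --now comes first so the timer is stopped before its file disappears, and the daemon-reload at the end makes systemd forget the units.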

what I'd tell past me

There's a general shape to this class of problem, and it's worth naming because I'll meet it again. Any time a system has more than one component that believes it knows the desired state of a service, those components can and will disagree, usually at the least convenient moment. systemd's restart policy, a homemade timer, a config-management agent doing service { 'foo': ensure => running } every fifteen minutes, an orchestrator with a replica count. Each one is reasonable alone. Put two of them on the same service and "stop" becomes a negotiation rather than a command.

The diagnostic that cuts through it is always the same. Don't stare at the unit you're trying to stop, because that unit is innocent: it's being started, not starting itself. Watch the journal across the whole system around the moment of resurrection:

journalctl --since "5 minutes ago" -o short-precise

The line you want is whatever fires immediately before your service comes back. That name is your culprit, every time. It might be a timer, it might be a config run, it might be something with a thoroughly misleading name that you wrote eight months ago and filed under "harmless". Find it, understand what state it considers correct, and decide which one of you is actually in charge. Then make the other one back down, by masking, by a sentinel, or by deletion. The unit won't stay dead until exactly one thing in the system has an opinion about whether it should be alive.
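
Picking out "whatever fires immediately before" by eye gets tedious in a busy journal, so here is a small filter that does it. It's a sketch: line_before is a name I made up, and the pattern you pass depends on how your service's start message actually reads.

```shell
# Read journal text on stdin and print the line just before each line
# that matches the given pattern (an awk regular expression).
# Usage: journalctl --since "5 minutes ago" -o short-precise | line_before 'Started foo'
line_before() {
    awk -v pat="$1" '
        $0 ~ pat && NR > 1 { print prev }
        { prev = $0 }
    '
}
```

On a noisy system the preceding line will sometimes be unrelated chatter, so treat the output as a list of suspects and look for the name that keeps turning up.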

The lesson I keep paying for: when a thing won't stay dead, the answer is almost never in the unit you're staring at. It's in whatever else on the box was given permission to bring it back. Timers, a separate monitoring agent, a config-management run that reconciles state every fifteen minutes and considers "running" the desired state: any of them will cheerfully undo your stop and never tell you which one did it. Read the journal, find the name that appears just before the resurrection, and go and have a word with that.