Ramblings of an aging IT geek
golang

a hundred lines of go that replaced a cron job and a prayer

I replaced a flaky shell-and-cron contraption with a small long-running Go daemon, and the part worth writing down is the graceful shutdown, not the feature.

A code editor showing a small Go program

The thing it replaced was a shell script run every minute from cron that polled a directory, did some work, and occasionally overlapped with itself in interesting ways. I'd been meaning to rewrite it for months. This week I finally did, in Go, and it came out to about a hundred and twenty lines and a single static binary. No interpreter, no virtualenv, no "which Python is this even using". Just a file you copy to the box and a systemd unit that runs it.

I'm not going to pretend the business logic is interesting. It watches a directory, processes files, writes results. The part actually worth writing down is the shutdown, because that's where small daemons usually let you down. A cron job is born and dies every minute, so cleanup is somebody else's problem. A long-running process has to clean up after itself, and if it doesn't drain in-flight work when systemd sends it a SIGTERM, you get half-processed files and a bad mood.

the shape that matters

The whole thing hangs off a context that cancels on signal:

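// Imports used by the snippets in this post: context, log, os/signal,
// syscall, time.
func main() {
	// ctx is cancelled when the process receives SIGINT or SIGTERM;
	// stop releases the signal registration.
	ctx, stop := signal.NotifyContext(context.Background(),
		syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	if err := run(ctx); err != nil {
		// log.Fatalf exits via os.Exit(1), which skips the deferred stop().
		// Harmless here, since the process is going away anyway.
		log.Fatalf("exited: %v", err)
	}
	log.Println("clean shutdown")
}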

signal.NotifyContext arrived in Go 1.16 and it's exactly the ergonomic I always wanted. Press Ctrl-C, or have systemd send SIGTERM, and ctx is cancelled. Everything downstream that respects the context unwinds in order. No global boolean flag, no channel I forgot to close, no signal.Notify boilerplate copied from a blog post and subtly wrong.
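
To see the cancellation happen without a terminal or systemd in the loop, here is a minimal sketch in which the process sends SIGTERM to itself and waits for the context to close. The helper name sigtermCancelsContext and the two-second timeout are made up for this illustration and are not part of the daemon. It assumes a Unix-like system, since it relies on syscall.Kill.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

// sigtermCancelsContext is a hypothetical helper for demonstration only.
// It reports whether a context built with signal.NotifyContext is
// cancelled after this process sends SIGTERM to itself.
func sigtermCancelsContext() bool {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	// Deliver the signal to our own PID, much as systemd would on stop.
	// Because NotifyContext is already registered, the default action
	// (terminating the process) does not happen.
	if err := syscall.Kill(syscall.Getpid(), syscall.SIGTERM); err != nil {
		return false
	}

	select {
	case <-ctx.Done():
		return true
	case <-time.After(2 * time.Second): // arbitrary safety timeout
		return false
	}
}

func main() {
	fmt.Println("SIGTERM cancelled ctx:", sigtermCancelsContext())
}
```

The useful property is that nothing downstream needs to know about signals at all. It only ever sees a context.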

The worker loop just selects on it:

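func run(ctx context.Context) error {
	// Poll every five seconds; Stop releases the ticker when run returns.
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			// Signal received: take no new batches and finish up.
			return drain()
		case <-ticker.C:
			// A failed batch is logged, not fatal; the loop carries on
			// to the next tick.
			if err := processBatch(ctx); err != nil {
				log.Printf("batch: %v", err)
			}
		}
	}
}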
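
The loop hands ctx to processBatch, whose body isn't shown here. One way a function like that might respect cancellation is to check the context between files, so a file that has already started is never abandoned halfway. The sketch below is an assumption about the shape, not the real importer code: processFiles, handle, and cancelAfter are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
)

// processFiles handles names in order and checks ctx before starting each
// one, so cancellation never interrupts a file that is already under way.
// It returns how many files were handled.
func processFiles(ctx context.Context, names []string, handle func(string) error) (int, error) {
	for i, name := range names {
		if err := ctx.Err(); err != nil {
			return i, err // cancelled: leave the rest for the next start
		}
		if err := handle(name); err != nil {
			return i, fmt.Errorf("%s: %w", name, err)
		}
	}
	return len(names), nil
}

// cancelAfter runs processFiles over total placeholder files and cancels
// the context once n of them have been handled. It returns the number of
// files processFiles reports as done.
func cancelAfter(total, n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	names := make([]string, total)
	handled := 0
	done, _ := processFiles(ctx, names, func(string) error {
		handled++
		if handled == n {
			cancel() // simulate a signal arriving mid-batch
		}
		return nil
	})
	return done
}

func main() {
	// Five files, cancellation after the second: two files are completed.
	fmt.Println(cancelAfter(5, 2))
}
```

Checking between units of work rather than inside them is the design choice that matters: the batch ends early, but every file is either untouched or finished.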

When the context is done, drain() finishes whatever's mid-flight and returns. systemd's default TimeoutStopSec is ninety seconds, after which it escalates to SIGKILL. That is comfortably more than this daemon ever needs, but the point is that it shuts down deliberately rather than being shot in the head.
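
The body of drain() isn't shown, so the following is only a sketch of one common way to do it, assuming the in-flight work runs in goroutines: count jobs with a sync.WaitGroup and wait for them with an upper bound that sits well inside systemd's stop timeout. The names tracker, drainDemo, and errDrainTimeout are hypothetical, as are the durations.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

var errDrainTimeout = errors.New("drain: timed out waiting for in-flight work")

// tracker counts in-flight jobs so shutdown can wait for them.
type tracker struct {
	wg sync.WaitGroup
}

// run starts job in its own goroutine and records it as in flight.
func (t *tracker) run(job func()) {
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		job()
	}()
}

// drain blocks until every started job has finished or limit elapses.
// On timeout the waiting goroutine lingers until the jobs do finish,
// which is acceptable when the process is about to exit anyway.
func (t *tracker) drain(limit time.Duration) error {
	done := make(chan struct{})
	go func() {
		t.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-time.After(limit):
		return errDrainTimeout
	}
}

// drainDemo starts one job that takes workMs milliseconds and drains with
// a limit of limitMs milliseconds. It reports whether the job had finished
// when drain returned, along with drain's error.
func drainDemo(workMs, limitMs int) (bool, error) {
	var t tracker
	var completed atomic.Bool
	t.run(func() {
		time.Sleep(time.Duration(workMs) * time.Millisecond)
		completed.Store(true)
	})
	err := t.drain(time.Duration(limitMs) * time.Millisecond)
	return completed.Load(), err
}

func main() {
	fmt.Println(drainDemo(20, 1000)) // short job, generous limit
	fmt.Println(drainDemo(500, 20))  // long job, tight limit
}
```

Returning an error on timeout lets main log that the shutdown was not clean, instead of silently leaving it to SIGKILL.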

A diagram of a signal cancelling a context which unwinds the worker loop

the unfair part

The bit that still feels like cheating after years of doing it is deployment. GOOS=linux GOARCH=amd64 go build, scp the binary, drop in a unit file:

[Service]
ExecStart=/opt/importer/importerd
Restart=on-failure

That's the whole production story. No runtime to match, no dependency tree to reconcile against whatever the box happens to have. The binary is the dependency tree. I know this is old news to anyone who's shipped Go, but coming off a decade of "it works on my machine because my machine has the right Python", it still lands every time.

It's been running for three days. The directory is processed, the overlaps are gone because there's now one process instead of sixty cron invocations an hour fighting over the same files, and when I stop it for a deploy it actually drains. A hundred and twenty lines, and I should have done it a year ago.