Ramblings of an aging IT geek
golang

the smallest useful daemon i've written, and how it got to production

Writing a small Go daemon that polls an API and exposes metrics, and the unglamorous work of making it survive in production.

I needed a tiny thing. Poll an internal API every thirty seconds, transform the response, and expose it as Prometheus metrics on a port so our existing scraping could pick it up. That's the whole brief. I could have done it in Python with a cron job and a flat file, and a year ago I would have. This time I wrote it in Go, and the interesting part wasn't the writing, it was everything between "it compiles" and "it's been running for a week without anyone thinking about it".

The core was about sixty lines. A time.Ticker, an http.Client with a sane timeout, a couple of prometheus.GaugeVecs, and promhttp.Handler() bolted onto http.ServeMux. Go's standard library does so much of this that reaching for a framework would have been silly. The single static binary is the real selling point: go build, scp it across, done. No virtualenv, no system Python version to fight, no dependency that gets removed from PyPI in two years.

the bits that actually took the time

Writing the happy path took an hour. Making it a daemon I'd trust took the rest of the day, and that's the honest ratio nobody puts in the tutorials.

Timeouts everywhere. The default http.Client has no timeout at all, which means one hung upstream connection and your poller blocks forever, silently stops updating metrics, and looks alive while being useless. So:

client := &http.Client{
    Timeout: 10 * time.Second,
}

That one line is the difference between a daemon and a liability.

Context for shutdown. systemd sends SIGTERM when it wants you gone, and a daemon that ignores it gets SIGKILLed after a grace period, which means a half-finished poll and ugly logs. Catching the signal and cancelling a context lets the in-flight work finish:

ctx, cancel := context.WithCancel(context.Background())
sig := make(chan os.Signal, 1)
signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
go func() {
    <-sig
    log.Println("shutting down")
    cancel()
}()

The poll loop selects on ctx.Done() and the ticker, so provided the request itself isn't bound to the same context, a SIGTERM stops it cleanly between iterations rather than mid-request.

Not dying when the upstream is down. The first version log.Fataled on a failed request, which felt tidy and was completely wrong. The upstream API has blips. A poller that exits the moment its target hiccups is a poller that's down more than the thing it watches. So a failed poll logs the error, increments an error counter (itself a metric, so I can alert on it), and waits for the next tick. The daemon's job is to keep trying.

the systemd unit, because that's how it actually runs

A binary isn't a service until something supervises it. The unit is boring, which is the point:

[Unit]
Description=metrics poller for the internal thing
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/poller -addr :9180 -target https://internal/api
Restart=on-failure
RestartSec=5
User=poller
DynamicUser=no

[Install]
WantedBy=multi-user.target

Restart=on-failure means that if it does crash on something I didn't anticipate, systemd brings it back five seconds later (that's the RestartSec=5), and the gap shows up as a hole in the metrics rather than a 3am page. Running as a dedicated unprivileged user costs nothing and means a compromise in my sixty lines of polling code can't do much.

the metrics naming, which I got wrong first

A quick aside that cost me an embarrassing half hour. Prometheus has conventions, and the convention is that a metric name carries its unit as a suffix and uses base units, so seconds not milliseconds, bytes not kilobytes. My first pass exposed poll_duration_ms because that's how I think about latency. The result was a dashboard that looked fine until someone tried to write an alert and the numbers didn't line up with everything else, all of which was in seconds. I renamed it to poll_duration_seconds, divided by a thousand, and stopped fighting the ecosystem. The lesson is small but general: when a tool has strong conventions, the cost of ignoring them isn't paid by you, it's paid by whoever reads your output six months later.

I also added a poll_errors_total counter, deliberately a counter and not a gauge, because what you want to alert on is the rate of errors over time, and rate(poll_errors_total[5m]) only makes sense on a monotonic counter. The "_total" suffix is, again, the convention that signals exactly that.

what I'd watch out for next time

Two things bit me that I'd flag to anyone writing their first poller of this shape. The first is that a time.Ticker does not wait for your work to finish before it fires again. If a poll takes longer than the tick interval, the ticker keeps to its schedule: its channel holds one pending tick and drops the rest, so the moment a slow poll returns the next one starts with no gap, and if you launch each poll in its own goroutine they overlap outright. Either way you end up hammering the upstream you were trying to gently observe. Because my poll has a 10-second timeout and a 30-second tick, I'm safe, but only by arithmetic, and if I ever shorten the interval I need to revisit that. The robust pattern is to skip a tick if the previous poll is still running rather than letting polls run back to back or pile up.

The second is logging volume. The naive "log every failed poll" is fine until the upstream goes down for an hour, at which point you've written a hundred and twenty identical error lines at one poll every thirty seconds, and a tighter interval or a chattier error path multiplies that until, if you're unlucky, you've filled a disk doing it (see also: every other ops post I've written). For now the volume is tolerable, but the right answer is to log the transition, down and back up, rather than every individual failure, so an outage costs you two log lines instead of a flood.

was Go the right call

Yes, and not because Go is fashionable. The two properties I wanted were a single deployable artifact and first-class concurrency primitives for the ticker-plus-signal-handling dance, and Go hands you both without ceremony. The thing has been running for a week, survived two upstream outages without flinching, and I've not had to think about it since the unit went in. That last part, not thinking about it, is the only review that matters for infrastructure this small.

If I were doing it again I'd add a /healthz endpoint from the start and feed the systemd watchdog from the poll loop (WatchdogSec= in the unit, a WATCHDOG=1 notification after each completed poll), because the one failure mode I haven't covered is the daemon being alive but wedged. A future evening's job. For now it ships, it works, and it's boring, which for a daemon is the highest praise I've got.