We had a cron job that renewed a TLS certificate, and a separate bit of monitoring that alerted when a certificate was close to expiring, and the two of them had no idea the other existed. The cron job ran weekly. The cert had a sixty-day life. That maths works right up until the week the cron host is rebooting for patches at exactly the wrong moment, and then you find out the alert and the renewal were never actually connected to each other; they just happened to agree most of the time.
So I wrote the small thing that connects them. A daemon that knows when the cert expires, decides for itself when to renew, and only shouts if it can't. About a hundred and twenty lines, and it has replaced both the cron job and half the monitoring.
the shape of it
The core is a ticker, the cert's own NotAfter, and a renewal threshold. There's nothing here I'm proud of, which is rather the point. It reads the certificate off disk, looks at how long it has left, and if that drops under the threshold it runs the renewal command. The threshold is generous on purpose: renew at twenty days left on a sixty-day cert, so there are nearly three weeks of retries before anything is actually on fire.
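import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
	"time"
)

// timeLeft reports how long the certificate at path has before it expires.
// The result is negative if the certificate has already expired.
func timeLeft(path string) (time.Duration, error) {
	pemBytes, err := os.ReadFile(path) // ioutil.ReadFile is deprecated since Go 1.16
	if err != nil {
		return 0, err
	}
	// Only the first PEM block is decoded; in a chain file that is normally the leaf.
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return 0, fmt.Errorf("no PEM data in %s", path)
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return 0, err
	}
	return time.Until(cert.NotAfter), nil
}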
crypto/x509 does all the real work here, which is exactly why I reached for Go rather than parsing the cert with openssl x509 -enddate and date arithmetic in a shell script. I have written that shell script. It is wrong about timezones in a way you only discover in the autumn.
the bits that matter more than the feature
The ticker loop is trivial. The work was, as ever, the things around it.
It checks on a sensible interval, every six hours, not every few seconds. A cert's expiry doesn't move quickly and there's no reason to spin. It checks once at startup too, so a fresh deploy doesn't sit idle for six hours before noticing the cert is already in the danger zone.
The renewal command runs with a context timeout, because the one thing worse than a cron job that doesn't renew is a daemon wedged forever on a renewal that hung. If the renewal fails, it logs loudly, leaves the old cert in place, and tries again next tick. It does not delete anything, it does not write a half-finished file over the live one. A renewal tool that can leave you with no certificate at all is worse than the problem it solves.
And it runs under systemd with Restart=on-failure, a dedicated user, and write access scoped to exactly the cert directory. Built with CGO_ENABLED=0 so it's a single static binary I could scp across and forget.
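For a sense of what that looks like, here is a unit file along those lines. It is a guess at the shape, not the unit actually deployed: the binary path, user name, and cert directory are made up, and ProtectSystem=strict with ReadWritePaths= is just one way to scope write access to a single directory.

```ini
[Unit]
Description=Certificate renewal daemon (hypothetical name)
After=network-online.target
Wants=network-online.target

[Service]
# Hypothetical install path for the static binary.
ExecStart=/usr/local/bin/certrenewd
# Hypothetical dedicated user.
User=certrenew
Restart=on-failure
# Mount the file system read-only for this service...
ProtectSystem=strict
# ...except for the (hypothetical) certificate directory.
ReadWritePaths=/etc/ssl/example

[Install]
WantedBy=multi-user.target
```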
It has been running for nine days and has renewed nothing yet, because nothing has needed renewing. That is the correct behaviour and it is also, I'll admit, slightly unsatisfying. The best a daemon like this can do is make a category of 3am page simply stop happening, and you only ever notice the absence.