Ramblings of an aging IT geek
rust

a christmas-break rust cli, and the honest cost of it

Rewriting a flaky log-rotation helper in Rust over a quiet week, with a frank tally of where the language saved me and where it just cost me time.

A terminal showing Rust source code on a dark background

Every team has the one script nobody owns. Ours was a log shipper. It tailed a directory of application logs, gzipped anything older than a day, and posted the compressed files to an archive endpoint. It was written in a hurry in 2015, in a mix of bash and a Python helper that had grown a config file, and it had failed silently twice in the last quarter. Both times the symptom was the same: disk slowly filling, nobody alerted, until a box fell over at three in the morning and ruined someone's sleep.

I had a slack week before the holidays and I decided to rewrite it in Rust. So the question I always end up asking: was it worth it? Mostly, with caveats, and the caveats are the interesting part.

why rust and not just fix the script

The honest first answer is that I wanted to. That's a real reason and I'm not going to pretend otherwise. But there was a defensible one too. The two failures had both been the same class of bug: the script did something, the something failed, and nothing checked the result. A gzip that ran out of space left a truncated file. An HTTP POST that returned a 503 was treated exactly like a 200. The control flow had no concept of "this step can fail and I must care".

That's a job for types and a language where ignoring a Result is something you have to do on purpose rather than by default. Bash will happily march past a non-zero exit unless you remember set -e, and even then the failure modes around pipes are subtle enough that I've watched experienced people get them wrong. I wanted the compiler nagging me instead.
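To make the 503-versus-200 case concrete, here is a minimal sketch. The names `UploadFailed`, `check_status` and `ship` are invented for illustration and are not taken from the tool itself; the only assumption is that "success" means a 2xx status.

```rust
/// Hypothetical error for an upload the server did not accept.
#[derive(Debug, PartialEq)]
struct UploadFailed {
    status: u16,
}

/// Treat only 2xx responses as success; anything else becomes an error
/// value the caller has to deal with.
fn check_status(status: u16) -> Result<(), UploadFailed> {
    if (200..300).contains(&status) {
        Ok(())
    } else {
        Err(UploadFailed { status })
    }
}

/// Propagating with `?` is the low-effort path, so a 503 stops the step
/// instead of being mistaken for a 200.
fn ship(status: u16) -> Result<&'static str, UploadFailed> {
    check_status(status)?;
    Ok("shipped")
}
```

Because `Result` is marked `#[must_use]`, writing `check_status(503);` as a bare statement draws a compiler warning. Discarding it has to be spelled out, for example as `let _ = check_status(503);`, which is the "on purpose" part.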

A terminal listing gzipped log files in a directory

the shape of the thing

The core is small. Walk a directory, find files older than a threshold, compress them, ship them, delete the original only once the ship succeeded. The ordering of that last clause is the whole point, and it's the thing the old script got wrong.

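use std::fs;
use std::path::Path;

// Client here is reqwest's blocking client; ShipError, compress and upload live elsewhere in the file.
fn process(path: &Path, client: &Client) -> Result<(), ShipError> {
    let gz = compress(path)?; // a failed compress stops here
    upload(client, &gz)?; // a failed upload returns before anything is deleted
    fs::remove_file(path)?; // the original goes only once the archive has it
    fs::remove_file(&gz)?; // then the local compressed copy
    Ok(())
}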

That reads as obvious, and it is, but every ? in there is a place the old code carried on regardless. If upload returns an error, we never reach remove_file, and the original log stays put for the next run to retry. The file lingering is a far better failure than the file vanishing. I used reqwest for the HTTP, in blocking mode, because this is a cron job that does one thing and exits and I had no appetite for an async runtime to ship a handful of files a day.
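The ordering can be checked without a network or a gzip library. The sketch below is a stand-in, not the tool's code: `fake_compress` just copies the file to a `.gz` name, and `process_with` takes the upload as a closure so a failure can be simulated. Both names are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Stand-in for the compress step: copies the file to `<name>.gz` without
/// actually compressing it, so the sketch needs no external crates.
fn fake_compress(path: &Path) -> io::Result<PathBuf> {
    let mut name = path.as_os_str().to_owned();
    name.push(".gz");
    let gz = PathBuf::from(name);
    fs::copy(path, &gz)?;
    Ok(gz)
}

/// Same step ordering as `process`, with the upload injected as a closure.
fn process_with<F>(path: &Path, upload: F) -> io::Result<()>
where
    F: Fn(&Path) -> io::Result<()>,
{
    let gz = fake_compress(path)?;
    upload(&gz)?; // an Err here returns before either delete runs
    fs::remove_file(path)?;
    fs::remove_file(&gz)?;
    Ok(())
}
```

Calling `process_with(&log, |_| Err(...))` leaves the original log on disk, along with the leftover `.gz`, which the next successful run overwrites and then removes. Calling it with a closure that returns `Ok(())` removes both.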

The other thing I leaned on was the iterator chain to find candidates, which in the shell version was find with a -mtime I could never remember the sign of:

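use std::{fs, path::PathBuf, time::{Duration, SystemTime}};

// Anything last modified before this instant is a candidate.
let cutoff = SystemTime::now() - Duration::from_secs(24 * 3600);
let candidates: Vec<PathBuf> = fs::read_dir(dir)?
    .filter_map(|e| e.ok()) // skip entries that fail to read
    .map(|e| e.path())
    .filter(|p| p.extension().map_or(false, |x| x == "log")) // only *.log
    .filter(|p| is_older_than(p, cutoff)) // small helper, not shown
    .collect();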

filter_map(|e| e.ok()) quietly drops directory entries that error rather than crashing the whole run, which is the behaviour I wanted: one unreadable file should not stop the other forty being shipped.
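The `is_older_than` helper is not shown in the post. One plausible shape for it, assuming it compares the file's modification time against the cutoff and treats any metadata error as "skip this file", would be:

```rust
use std::fs;
use std::path::Path;
use std::time::SystemTime;

/// True if the file's modification time is strictly before `cutoff`.
/// Any error reading the metadata counts as "not a candidate", so an
/// unreadable file is skipped rather than aborting the run.
fn is_older_than(path: &Path, cutoff: SystemTime) -> bool {
    fs::metadata(path)
        .and_then(|meta| meta.modified())
        .map(|modified| modified < cutoff)
        .unwrap_or(false)
}
```

Returning a plain `bool` keeps the iterator chain tidy, at the cost of hiding why a file was skipped. If that matters at three in the morning, logging inside the helper would be the place to recover it.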

where it actually fought me

Two places, and they're the usual suspects.

Build time first. Pulling in reqwest drags in a large chunk of the ecosystem and a clean build was not quick on the little VM I was testing on. For a tool whose whole pitch is "small and reliable", watching a couple of hundred crates compile felt faintly absurd. It only matters during development (the deployed binary is one file), but the edit-compile-run loop is where you actually live, and it dragged.

Second, I spent too long on the error type. I wanted to distinguish "couldn't compress" from "couldn't upload" from "couldn't delete" so the logs would be useful at three in the morning. That's reasonable. What's not reasonable is the forty-five minutes I spent hand-writing From implementations to make ? thread everything into one enum cleanly. I wasn't leaning on anyhow, so the boilerplate is real, and I'll have more to say about that in a separate post because it deserves its own grumble.
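For a sense of what that boilerplate looks like, here is a cut-down sketch of a per-step error enum. It is an illustration under assumptions, not the tool's real type: `UploadError`, `check_upload`, `ship` and `delete` are hypothetical names, and the real enum may be laid out differently.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical upload failure carrying the HTTP status.
#[derive(Debug)]
struct UploadError {
    status: u16,
}

/// One variant per step, so the log line says which step failed.
#[derive(Debug)]
enum ShipError {
    Compress(io::Error),
    Upload(UploadError),
    Delete(io::Error),
}

impl fmt::Display for ShipError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ShipError::Compress(e) => write!(f, "couldn't compress: {e}"),
            ShipError::Upload(e) => write!(f, "couldn't upload: HTTP {}", e.status),
            ShipError::Delete(e) => write!(f, "couldn't delete: {e}"),
        }
    }
}

impl std::error::Error for ShipError {}

/// This impl is what lets `?` turn an UploadError into a ShipError.
impl From<UploadError> for ShipError {
    fn from(e: UploadError) -> Self {
        ShipError::Upload(e)
    }
}

/// Hypothetical status check standing in for the real upload.
fn check_upload(status: u16) -> Result<(), UploadError> {
    if (200..300).contains(&status) { Ok(()) } else { Err(UploadError { status }) }
}

fn ship(status: u16) -> Result<(), ShipError> {
    check_upload(status)?; // converted through From<UploadError>
    Ok(())
}

/// Two variants wrap io::Error, so a single From<io::Error> impl could not
/// choose between them; map_err names the step explicitly instead.
fn delete(path: &Path) -> Result<(), ShipError> {
    fs::remove_file(path).map_err(ShipError::Delete)
}
```

The awkward part is visible even at this size: `?` can only pick a variant automatically when the source error type is unique, so steps that share `io::Error` need either wrapper types or an explicit `map_err`.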

A code editor showing the error enum and From implementations

the accounting

The old solution was maybe 60 lines of bash and 80 of Python with a config file. The Rust version is around 220 lines in one file with no external config: the threshold and the endpoint are arguments. That's more lines, and I want to be fair: a good slice of the extra length is genuinely new behaviour. Checking that the upload succeeded before deleting. Logging which step failed and why. Not falling over because one file in the directory had odd permissions.
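Taking the threshold and endpoint as arguments needs nothing beyond the standard library. A minimal sketch follows, assuming a positional layout of directory, maximum age in hours, and endpoint; that layout, the `Config` struct and `parse_args` are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Hypothetical run configuration built from the command line.
#[derive(Debug, PartialEq)]
struct Config {
    dir: String,
    max_age: Duration,
    endpoint: String,
}

/// Parse `<program> <dir> <max-age-hours> <endpoint>` from any iterator of
/// strings, so it can be fed `std::env::args()` or a test vector.
fn parse_args<I: Iterator<Item = String>>(mut args: I) -> Result<Config, String> {
    let usage = "usage: shipper <dir> <max-age-hours> <endpoint>";
    let _program = args.next(); // argv[0], not needed
    let dir = args.next().ok_or(usage)?;
    let hours = args
        .next()
        .ok_or(usage)?
        .parse::<u64>()
        .map_err(|e| format!("bad max-age-hours: {e}"))?;
    let endpoint = args.next().ok_or(usage)?;
    Ok(Config {
        dir,
        // saturating_mul avoids an overflow panic on absurdly large input
        max_age: Duration::from_secs(hours.saturating_mul(3600)),
        endpoint,
    })
}
```

In a real binary this would be called as `parse_args(std::env::args())`, with the `Err` string printed to stderr before exiting non-zero.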

So was it worth it for a log shipper that runs four times an hour on a dozen boxes? For the correctness, yes, and easily, because the failure mode of the old one was data loss and a filling disk, and both of those cost real money and real sleep. For the operational story, also yes: a single static binary with no runtime to keep patched is a lovely thing to deploy across a fleet, especially a fleet that's deliberately kept minimal.

Would I reach for Rust for the genuinely trivial glue, the script that moves one file once a day and whose worst failure is a shrug? No. Bash still wins that and it isn't close. The line I keep drawing is whether getting it wrong is expensive. Here it was, twice, with a pager going off to prove it. That's the threshold where an afternoon of Rust starts paying for itself, and the borrow checker, as ever, barely showed up. This was about results you can't ignore, not lifetimes.