Ramblings of an aging IT geek
rust

another small tool, another afternoon in rust

Building a small file-deduplication CLI in Rust, weighing iteration speed against the safety the compiler buys, and where the line now sits for me.

I keep doing this. A small job comes up, I have an afternoon, and instead of the sensible thirty-line Python script I reach for cargo new. This time it was a deduplicator: walk a directory tree, hash file contents, report the duplicates so I could decide what to delete. The kind of thing you could write in any language before lunch.

So why Rust again? I asked myself the same question, and the answer is roughly the same as last year, with one new wrinkle.

what the afternoon actually looked like

The core of it is unglamorous and that's the point.

use std::collections::HashMap;
use std::fs;
use std::path::{Path, PathBuf};

// Naive first version: reads the whole file into memory, then hashes it.
fn hash_file(path: &Path) -> std::io::Result<u64> {
    let bytes = fs::read(path)?;
    Ok(seahash::hash(&bytes))
}

Walk the tree with walkdir, hash each file, bucket the paths in a HashMap<u64, Vec<PathBuf>>, print any bucket with more than one entry. The whole thing is under a hundred lines including the clap argument parsing for --min-size and a --dry-run I didn't strictly need but always end up wanting.
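For anyone who wants the shape of that in code, here is a minimal sketch rather than the tool itself. To keep it dependency-free I've assumed a recursive std::fs::read_dir walk in place of walkdir and a hand-rolled FNV-1a in place of seahash; the names fnv1a, collect_files and find_duplicates are made up for illustration, and min_size stands in for the --min-size flag.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// FNV-1a, used here only as a dependency-free stand-in for seahash::hash.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Recursively collect regular files under `dir` (stand-in for walkdir).
fn collect_files(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            collect_files(&entry.path(), out)?;
        } else if file_type.is_file() {
            out.push(entry.path());
        }
    }
    Ok(())
}

/// Group files by content hash and keep only groups with more than one path.
fn find_duplicates(root: &Path, min_size: u64) -> io::Result<Vec<Vec<PathBuf>>> {
    let mut files = Vec::new();
    collect_files(root, &mut files)?;

    let mut buckets: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    for path in files {
        // Skip anything below the size threshold before paying for a read.
        if fs::metadata(&path)?.len() < min_size {
            continue;
        }
        let hash = fnv1a(&fs::read(&path)?);
        buckets.entry(hash).or_default().push(path);
    }

    Ok(buckets.into_values().filter(|paths| paths.len() > 1).collect())
}
```

A matching 64-bit hash is strong evidence of identical content, not proof, so a byte-for-byte comparison before deleting anything is a cheap extra check.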

The bit I'd have got subtly wrong in a quick script: reading entire files into memory to hash them is fine for documents and a terrible idea the moment you point it at a directory of disk images. In Rust the type system didn't stop me doing the naive thing, but the moment I thought about it the fix was obvious and the compiler kept me honest while I switched to streaming the read in chunks. In a hurried Python script I'd have shipped the naive version and discovered the problem later, on real data, at the worst time.
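The streaming version looks roughly like this. It's a sketch under assumptions: the hash is again a hand-rolled FNV-1a standing in for the real hashing crate (chosen because it can be fed one chunk at a time with no dependencies), the 64 KiB buffer size is arbitrary, and hash_reader and hash_file_streaming are illustrative names. The same loop should work with any hasher that accepts input incrementally.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

/// Fold one chunk of bytes into a running FNV-1a state.
fn fnv1a_update(mut state: u64, chunk: &[u8]) -> u64 {
    for &b in chunk {
        state ^= u64::from(b);
        state = state.wrapping_mul(FNV_PRIME);
    }
    state
}

/// Hash any reader in fixed-size chunks, so memory use stays at one buffer
/// regardless of how large the input is.
fn hash_reader<R: Read>(mut reader: R) -> io::Result<u64> {
    let mut buf = [0u8; 64 * 1024];
    let mut state = FNV_OFFSET;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            // A read interrupted by a signal is safe to retry.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        state = fnv1a_update(state, &buf[..n]);
    }
    Ok(state)
}

/// Open a file and hash it without loading it into memory.
fn hash_file_streaming(path: &Path) -> io::Result<u64> {
    hash_reader(File::open(path)?)
}
```

Taking a generic Read rather than a path means the hashing loop can be exercised against an in-memory buffer, which is handy for checking that the chunked result matches the all-at-once one.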

the cost, stated plainly

I'm not going to pretend this is free. The first build pulling in walkdir, clap and a hashing crate was the usual go-and-make-tea wait. For a tool I'm iterating on rapidly, that compile loop is a real tax, and anyone who tells you it isn't is either on a much faster machine than mine or not iterating much.

The binary is megabytes for something that does so little. I've made my peace with this. A single self-contained executable I can drop on any similar box and run with no interpreter or runtime to install is, on balance, worth the disk. But I notice it every time.

And there's the borrow checker, which on a job this simple barely showed up. One moment of "I can't both iterate this map and modify it", solved by collecting the keys first. Two minutes. Nothing like a fight, just the usual gentle reminder that Rust wants you to be precise about who owns what.
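The "collect the keys first" move, sketched on the same kind of map. What I was actually modifying isn't worth reproducing, so this assumes the job is dropping buckets that hold a single path; drop_unique is an illustrative name.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Drop every bucket that holds a single path, leaving only duplicates.
fn drop_unique(buckets: &mut HashMap<u64, Vec<PathBuf>>) {
    // Copy the keys out first so the immutable borrow of `buckets` ends here.
    let keys: Vec<u64> = buckets.keys().copied().collect();

    for key in keys {
        // No iterator is borrowing the map any more, so mutation is allowed.
        if buckets.get(&key).map_or(false, |paths| paths.len() < 2) {
            buckets.remove(&key);
        }
    }
}
```

For this particular shape, buckets.retain(|_, paths| paths.len() > 1) does the same thing in one call; collecting the keys is the more general fix when the mutation is something other than a removal.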

the new wrinkle

What's changed since I last wrote one of these is me, not the language. I'm faster now. The patterns that used to make me stop and think are muscle memory: propagating errors with ?, reaching for Result instead of unwrapping, structuring a small CLI with clap. The afternoon was genuinely an afternoon this time, not an afternoon plus two evenings of fighting lifetimes.

That shifts the maths. When the language tax was high and my fluency was low, "just write the script" won most arguments. Now the tax is lower for me specifically, and the safety I get on the file-handling edge cases is real. The line has moved.

so, worth it

For this tool, yes, and I'd defend it. Not because deduplicating files needs Rust (it plainly doesn't), but because the version I shipped handles a directory of huge files without falling over. I'm also confident it handles the empty-directory and permission-denied cases, because every fallible call returns a Result and the compiler made me look at each one.
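To be concrete about what "made me look at them" means: the compiler doesn't know what a permission error is, it just won't let a Result go unhandled without a deliberate choice. A sketch of one such choice, assuming a plain std::fs walk rather than walkdir and an illustrative walk_lenient that records what it couldn't read instead of aborting:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Walk `dir`, pushing files into `files` and recording unreadable paths
/// in `skipped` instead of aborting the whole run.
fn walk_lenient(
    dir: &Path,
    files: &mut Vec<PathBuf>,
    skipped: &mut Vec<(PathBuf, io::ErrorKind)>,
) {
    let entries = match fs::read_dir(dir) {
        Ok(entries) => entries,
        Err(e) => {
            // Covers permission-denied and vanished directories alike.
            skipped.push((dir.to_path_buf(), e.kind()));
            return;
        }
    };

    // An empty directory simply yields no entries, so the loop body never runs.
    for entry in entries {
        let entry = match entry {
            Ok(entry) => entry,
            Err(e) => {
                skipped.push((dir.to_path_buf(), e.kind()));
                continue;
            }
        };
        let path = entry.path();
        match entry.file_type() {
            Ok(t) if t.is_dir() => walk_lenient(&path, files, skipped),
            Ok(t) if t.is_file() => files.push(path),
            Ok(_) => {} // symlinks and special files are ignored in this sketch
            Err(e) => skipped.push((path, e.kind())),
        }
    }
}
```

Whether to skip and report or to stop at the first error is a judgement call for a tool that informs deletions; the point is only that each branch had to be written down.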

I'll still reach for Python when I want an answer in five minutes and I'll throw the script away after. But for anything I expect to keep, anything I'll run on machines I care about, the calculus now tips towards Rust more often than it used to. Same conclusion as before, arrived at with less effort, which is its own small endorsement.