Ramblings of an aging IT geek
rust

a rust tool to find duplicate files, and whether it earned the build time

A small Rust command-line tool that finds duplicate files by hashing only the candidates that share a size, and an honest look at where the language helped.

Rust source on a terminal

My photos directory had got out of hand. Years of imports from different cameras and phones, the same shots copied into three different folders, and a nagging suspicion that a meaningful fraction of the disk was just the same files under different names. I wanted a tool that told me which files were genuine duplicates, byte for byte, so I could delete with confidence rather than by vibes. There are existing tools that do this. I wrote my own anyway, in Rust, partly to have a reason to. So: was it worth it?

the trick is not hashing everything

The naive version hashes every file and groups by hash. On a directory of tens of thousands of photos, that means reading every byte off disk, and most of that reading is wasted: the vast majority of files are unique, and for most of them all you need to confirm is that they're a different size from everything else.

The actual algorithm is the obvious one once you see it: group files by size first, because two files of different sizes cannot be identical, then only hash within each size group, and only when a group has more than one file in it. The hashing, the expensive part, now only touches files that already have a same-size twin. On my photos that cut the bytes read by something like ninety percent.

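use std::{collections::HashMap, path::PathBuf};
use walkdir::WalkDir;

// Map each file size to the paths that have that size.
let mut by_size: HashMap<u64, Vec<PathBuf>> = HashMap::new();
for entry in WalkDir::new(&root).into_iter().filter_map(Result::ok) {
    if !entry.file_type().is_file() {
        continue;
    }
    // Skip files whose metadata can't be read rather than filing them under size 0.
    if let Ok(meta) = entry.metadata() {
        by_size.entry(meta.len()).or_default().push(entry.path().to_owned());
    }
}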

walkdir handles the recursion and filter_map(Result::ok) quietly drops the entries I can't read rather than aborting the whole walk, which on a real filesystem with the odd permission-denied directory is exactly what you want.
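To put a number on what the size-first pass saves, here is a minimal sketch that takes the file sizes from a walk and reports how many bytes exist in total against how many still need hashing. The function name hashing_budget is made up for illustration and is not part of the tool; it uses only the standard library.

```rust
use std::collections::HashMap;

/// Given the size of every file the walk found, return
/// (total bytes on disk, bytes that still need hashing).
/// A file needs hashing only if at least one other file shares its size.
fn hashing_budget(sizes: &[u64]) -> (u64, u64) {
    // Count how many files have each size.
    let mut count_by_size: HashMap<u64, u64> = HashMap::new();
    for &size in sizes {
        *count_by_size.entry(size).or_insert(0) += 1;
    }

    let total: u64 = sizes.iter().sum();

    // Sum the bytes of every file that has a same-size twin.
    let mut to_hash: u64 = 0;
    for (size, count) in &count_by_size {
        if *count > 1 {
            to_hash += size * count;
        }
    }
    (total, to_hash)
}
```

For the sizes 10, 20, 20, 30, 5, 5, 5 this returns (95, 55): the lone 10-byte and 30-byte files are never opened. On a real photo tree, where most sizes are unique, the second number should be a much smaller share of the first.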

A diagram of files grouped by size, then hashed within groups

where rust pulled its weight

The hashing within a group is where the types earned their keep. For each size group with more than one file I read each file and hash it, then group by digest. The thing Rust made pleasant was iterator chaining: walk, filter, group, retain only the groups with collisions, all as a pipeline rather than a heap of mutable bookkeeping.

by_size.retain(|_, paths| paths.len() > 1);

That one line throws away every size that has no possible duplicate, before I read a single byte of content. HashMap::retain, which filters the map in place, is the kind of small, sharp tool the standard library is full of, and it's the reason the program reads roughly like the description of the algorithm.
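The hash-within-groups step can be sketched along these lines. This is an illustration under stated assumptions, not the tool's actual code: the post doesn't name the hashing crate, so the sketch swaps in the standard library's DefaultHasher, which is not collision-resistant and should be replaced by a cryptographic hash for anything you'd delete on the strength of. The names digest_file and duplicate_sets are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::{Path, PathBuf};

/// Hash a file's contents. Stand-in only: DefaultHasher is NOT
/// collision-resistant, and reading the whole file into memory is a
/// shortcut for brevity (a real tool would stream it in chunks).
fn digest_file(path: &Path) -> io::Result<u64> {
    let bytes = fs::read(path)?;
    let mut hasher = DefaultHasher::new();
    hasher.write(&bytes);
    Ok(hasher.finish())
}

/// Turn size groups into sets of paths whose contents hash the same.
fn duplicate_sets(by_size: HashMap<u64, Vec<PathBuf>>) -> Vec<Vec<PathBuf>> {
    // Key on (size, digest) so equal digests from different sizes never merge.
    let mut by_digest: HashMap<(u64, u64), Vec<PathBuf>> = HashMap::new();
    for (size, paths) in by_size {
        if paths.len() < 2 {
            continue; // no same-size twin, so nothing to hash
        }
        for path in paths {
            // Files that can't be read are skipped rather than aborting the run.
            if let Ok(digest) = digest_file(&path) {
                by_digest.entry((size, digest)).or_default().push(path);
            }
        }
    }
    // Same trick as with sizes: keep only the digests with more than one file.
    by_digest.retain(|_, paths| paths.len() > 1);
    by_digest.into_values().collect()
}
```

Each inner Vec that comes back is one set of files that matched on both size and digest, which is what the tool would print.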

The other quiet win was correctness around the edges. Files that vanish between the walk and the hash, files I can't open, symlinks pointing at nothing: in a shell pipeline of find and md5sum these are silent corruptions of the result. Here they're Results I have to deal with, and dealing with them was a few lines, not an afternoon.
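As a rough illustration of what "Results I have to deal with" looks like in practice, the sketch below re-checks candidate paths just before hashing and keeps the failures alongside the reason, instead of letting them vanish from the output. The Checked struct and check_candidates function are invented for this example, and using InvalidInput for "exists but is not a regular file" is an arbitrary choice.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Outcome of checking candidate paths just before hashing.
struct Checked {
    /// Path and current length in bytes.
    readable: Vec<(PathBuf, u64)>,
    /// Path and the kind of error that caused it to be skipped.
    skipped: Vec<(PathBuf, io::ErrorKind)>,
}

fn check_candidates(paths: &[PathBuf]) -> Checked {
    let mut checked = Checked { readable: Vec::new(), skipped: Vec::new() };
    for path in paths {
        // fs::metadata follows symlinks, so a dangling link is an Err here,
        // as is a file deleted since the walk.
        match fs::metadata(path) {
            Ok(meta) if meta.is_file() => {
                checked.readable.push((path.clone(), meta.len()));
            }
            // Exists but is not a regular file (hypothetical choice of kind).
            Ok(_) => checked.skipped.push((path.clone(), io::ErrorKind::InvalidInput)),
            Err(err) => checked.skipped.push((path.clone(), err.kind())),
        }
    }
    checked
}
```

Printing the skipped list to stderr at the end of a run is one way to make sure a missing file shows up as a message rather than as a quietly wrong answer.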

the accounting

Build time, as always, was the cost. A binary depending on walkdir and a hashing crate is not a quick cargo build, and the edit-run loop against a real directory dragged until I pointed it at a small fixture tree for development. That's the recurring tax on doing this in Rust: the compiler is thorough and thoroughness isn't free.

Was it worth it for a one-off disk tidy? Honestly, for a true one-off, probably not; an existing tool would have done. What tipped it was that I now run this monthly, it's a single binary I trust to never delete anything itself (it only ever prints the duplicates; the deleting is my problem), and the size-first trick makes it fast enough that monthly is no bother. The Rust paid off not in the writing but in the trusting. I'd hand this tool a directory I cared about without a second thought, and that's worth a slow build or two.