an llm in my shell, and the handful of times it bit me

I've had an LLM wired into my shell for the best part of a year now: a little function that takes a plain-English description and hands back a command. Most days it's a quiet, useful thing. A few days it has confidently handed me a loaded gun. Both halves are worth writing down, because the failure modes are more interesting than the wins.

What it's good at

It's brilliant at the commands I use twice a year and forget every time. tar flag combinations. The exact ffmpeg incantation to extract a single audio stream. Which find -exec quoting actually works. The arguments to openssl for inspecting a certificate. These are things I genuinely know how to look up, but looking them up takes three minutes and breaks my train of thought, and the model returns them in two seconds with the flags in roughly the right order.

The other thing it's good at is reading errors back to me. Paste a wall of Go stack trace or a gnarly git rejection, ask what it means, and the summary is usually correct and always faster than my own first read. As a translator from machine-noise to English, it's excellent.

The times it bit me

The bites all share a shape: a command that looks right, runs without complaint, and does the wrong thing quietly.

Once it gave me a find ... -delete where the predicate matched a directory higher up than I intended. The command was syntactically perfect. It would have removed the wrong tree entirely. I caught it because I have a rule now, and the rule is the whole point of this post.
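For the find case, the read-only version is the same predicate with -print where -delete would go. This is a minimal sketch rather than the command I was actually given: the directory layout and file names are made up, and everything happens in a throwaway directory from mktemp.

```shell
# Hypothetical layout, built in a throwaway temp directory for illustration.
workdir=$(mktemp -d)
mkdir -p "$workdir/project/build" "$workdir/project/src"
touch "$workdir/project/build/a.o" "$workdir/project/build/b.o" \
      "$workdir/project/src/main.c"

# Step 1: the read-only version. Same predicate, -print instead of -delete.
matches=$(find "$workdir/project/build" -type f -name '*.o' -print | sort)
printf 'would delete:\n%s\n' "$matches"

# Step 2: only once that list is what I expected, swap -print for -delete.
find "$workdir/project/build" -type f -name '*.o' -delete

# What survived: only the source file should be left.
remaining=$(find "$workdir/project" -type f | sort)
printf 'left behind:\n%s\n' "$remaining"

# Tidy up the throwaway directory.
rm -r "$workdir"
```

If the first list contains a path you did not expect, the predicate or the starting directory is wrong, and nothing has been touched yet.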

Another time it produced an rsync with the trailing slash on the wrong side of the source path. Anyone who's used rsync in anger knows that the trailing slash on the source is the difference between copying a directory and copying its contents, and the model had it backwards. No error. Just files in the wrong place and an afternoon of confusion.
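The trailing-slash behaviour is easy to see side by side. The sketch below assumes rsync is installed and uses made-up directory names inside a mktemp directory; it previews with -n first, then does both variants for real.

```shell
# Throwaway directories for illustration; all names here are hypothetical.
workdir=$(mktemp -d)
mkdir -p "$workdir/photos" "$workdir/dst_a" "$workdir/dst_b"
touch "$workdir/photos/cat.jpg"

if command -v rsync >/dev/null 2>&1; then
    # Read-only preview: -n (--dry-run) with -v lists what would be sent.
    rsync -a -n -v "$workdir/photos" "$workdir/dst_a/"

    # No trailing slash on the source: the directory itself is copied,
    # so the file ends up at dst_a/photos/cat.jpg.
    rsync -a "$workdir/photos" "$workdir/dst_a/"

    # Trailing slash on the source: only the contents are copied,
    # so the file ends up at dst_b/cat.jpg.
    rsync -a "$workdir/photos/" "$workdir/dst_b/"
else
    echo "rsync is not installed; nothing copied"
fi

# Show where the files actually landed, relative to the temp directory.
layout=$(cd "$workdir" && find dst_a dst_b -type f | sort)
printf '%s\n' "$layout"

# Tidy up the throwaway directory.
rm -r "$workdir"
```

The dry run prints the paths relative to the transfer root, which is usually enough to spot an extra or missing directory level before any file moves.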

It also loves to invent flags. It will give you a plausible --dry-run for a tool that has no such thing, because plenty of tools do, and the average across its training is "this flag probably exists". It usually doesn't lie about flags that do exist; it lies about flags it wishes existed.
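One cheap check for an invented flag is to look for it in the tool's own help text before running anything. The has_flag function below is a hypothetical helper, not part of my actual setup, and it is only a heuristic: it assumes the tool answers to --help, and a plain substring match will also accept a longer flag that merely starts with the one you asked about.

```shell
# has_flag TOOL FLAG
# Succeeds if FLAG appears anywhere in the output of "TOOL --help".
# Hypothetical helper; assumes TOOL treats --help as a harmless request.
has_flag() {
    hf_tool=$1
    hf_flag=$2
    # Capture stdout and stderr: some tools print help on stderr,
    # and some exit non-zero after printing it.
    hf_help=$("$hf_tool" --help 2>&1) || true
    case "$hf_help" in
        *"$hf_flag"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: ask before trusting a flag the model suggested.
if has_flag rsync --dry-run; then
    echo "rsync --help mentions --dry-run"
else
    echo "no mention of --dry-run in rsync --help; check the man page"
fi
```

A miss does not prove the flag is fake, since not every tool lists everything under --help, but it is a good prompt to open the man page before pressing enter.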

The rule

So: never run a destructive command from the model without reading it first, and for anything with rm, dd, mkfs, --delete, or a redirect into a real file, run the read-only version first. find without -delete. rsync -n. git with the change staged but not committed. The model is a faster way to draft a command. It is not a faster way to decide whether to run it, and the moment I started treating it as the former and not the latter, it stopped biting me.
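The list of triggers in the rule is short enough to encode. The sketch below is a deliberately crude pattern match over the command string, with hypothetical function names; it never runs the command, it flags every redirect whether or not the target is a real file, and it will miss anything hidden inside a script or an alias. It is a reminder to slow down, not a safety net.

```shell
# is_destructive CMD
# Succeeds if the command string contains one of the triggers from the rule:
# rm, dd, mkfs, --delete (or find's -delete), or an output redirect.
# Hypothetical helper; plain substring matching, so expect false positives.
is_destructive() {
    case " $1 " in
        *" rm "*|*"/rm "*|*" dd "*|*" mkfs"*|*" --delete"*|*" -delete"*|*">"*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# review CMD
# Prints a verdict for a drafted command. It never executes the command.
review() {
    if is_destructive "$1"; then
        printf 'destructive, run the read-only version first: %s\n' "$1"
    else
        printf 'looks read-only, still read it: %s\n' "$1"
    fi
}

review 'find . -name "*.tmp" -delete'
review 'rsync -a --delete src/ dst/'
review 'ls -la'
```

Erring towards false positives is the design choice here: being told to do a dry run that was not strictly necessary costs a few seconds, which is cheaper than the alternative.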

I'd not give the shell function up. But I've stopped trusting it the way you trust a colleague, and started trusting it the way you trust autocomplete: useful, frequently right, and never the last word.