I Wired an LLM Into My Shell, Here's Where It Went Wrong


I've had an LLM bound to a keystroke in my shell for a few months now. Type a half-remembered intent, hit the binding, and it suggests the command. Most of the time it's genuinely good, and I want to be fair about that before I spend the rest of this post complaining. It has saved me more trips to man tar than I'd like to admit, and it's quietly brilliant at the commands I use twice a year and forget every time.

But this is a post about the times it bit me, because those are the ones worth writing down, and because "it's mostly fine" is exactly the attitude that lets the rare failure through.

The setup

Nothing clever. A shell function that takes my natural-language description, sends it off with a bit of context about my OS and shell, and drops the suggested command onto the prompt line for me to inspect before running. The key design choice, which I'll defend to the end, is that it never runs anything automatically. It populates the line. I press Enter. That gap is the only safety I have, and I've come to rely on it absolutely.
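
The function itself isn't shown in this post, so what follows is only a rough sketch of the "bit of context" step, not the real thing. The names build_context and build_request and the key=value layout are invented for illustration; a real API would more likely want JSON.

```shell
# Hypothetical helpers: the names and the key=value format are assumptions.

# Print the context the post mentions: the OS and the shell.
build_context() {
  printf 'os=%s\n' "$(uname -s)"
  # $SHELL is the login shell, which may differ from the shell in use.
  printf 'shell=%s\n' "${SHELL:-unknown}"
}

# Combine the context with the natural-language description ($1).
build_request() {
  build_context
  printf 'task=%s\n' "$1"
}
```

Calling build_request "extract a tar.gz into ./out" prints three lines, which could then be handed to whatever client talks to the model.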


Where it bit me

The first time was almost funny. I asked it to "find and remove the old build artefacts", expecting something pointed at a build/ directory. What it produced was a find from the current directory with a -delete, and I happened to be sitting in my home directory at the time. Had I been the sort of person who trusts the suggestion and hits Enter on reflex, that would have been a very bad afternoon. The command was valid. It did exactly what I asked. It just had no idea where I was standing, and neither, for that crucial half-second, did I.
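
A small guard might have caught this one. The sketch below is an assumption about what such a check could look like, not part of the setup described here, and guard_cwd is a made-up name. It refuses to proceed when the current directory is / or the home directory.

```shell
# guard_cwd: hypothetical helper, not part of the original setup.
# Succeeds only when the current directory is neither / nor $HOME.
guard_cwd() {
  here=$(pwd -P)                                  # physical path of the cwd
  home=$(cd "${HOME:-/}" 2>/dev/null && pwd -P)   # physical path of $HOME
  if [ "$here" = "/" ] || [ "$here" = "$home" ]; then
    echo "refusing: current directory is $here" >&2
    return 1
  fi
  return 0
}

# Usage sketch: preview with -print, read the list, then swap in -delete.
#   guard_cwd && find . -name '*.o' -print
```

It only covers the two most painful starting points. Any other directory full of things you care about still passes, so the preview step matters more than the guard.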

The second was subtler and worried me more. I asked for a command to rotate some logs, and it gave me something using a flag that didn't exist on the version of the tool installed on that box. It looked completely plausible. The flag had the right name, the right shape, the sort of thing that absolutely should exist. It just didn't, on that version. I ran it, got an error, no harm done, but it taught me that the model's confidence and the model's correctness are entirely separate quantities. It will invent a flag with exactly the same fluency it uses for a real one.
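
One cheap check is to ask the installed tool whether it has heard of the flag before trusting the suggestion. This is a rough heuristic under stated assumptions: has_flag is a hypothetical helper, it assumes the tool answers --help, and it assumes the flag is made of letters, digits, and hyphens only. Plenty of tools document their flags only in the man page, so a miss here isn't proof.

```shell
# has_flag TOOL FLAG: hypothetical helper.
# Succeeds if TOOL's --help output mentions FLAG as a complete option name.
# Assumes FLAG contains only letters, digits, and hyphens.
has_flag() {
  "$1" --help 2>&1 |
    grep -Eq -e "(^|[^[:alnum:]-])$2([^[:alnum:]-]|\$)"
}

# Usage sketch (the flag here is a placeholder):
#   has_flag sometool --some-flag || echo "flag not found in --help" >&2
```

The boundary checks on either side of the flag stop --rot from matching --rotate, and --rotate from matching --rotate-now.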

The third one is the one that changed how I use it. I asked for a git command to clean up some local branches. The suggestion was a pipeline that listed branches, filtered them, and fed the result into git branch -D. The filter was almost right. It would have force-deleted a branch I very much wanted to keep, because the pattern it generated was greedier than I'd intended and the model had no way to know which branches mattered to me. I caught it because the list of branches scrolling past in the suggested pipeline looked wrong, and I stopped to read it properly.
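
One way to make that kind of pipeline harder to get wrong is to split the listing from the deleting, and to strip out the branches that must survive whatever the filter says. The sketch below is an illustration only: drop_protected and PROTECTED_BRANCHES are hypothetical names, and the branch names in the list are placeholders.

```shell
# Hypothetical allow-list of branches that must never be deleted.
PROTECTED_BRANCHES="main master develop"

# Read branch names on stdin, one per line, and print only the ones
# that are not in PROTECTED_BRANCHES.
drop_protected() {
  while IFS= read -r branch; do
    case " $PROTECTED_BRANCHES " in
      *" $branch "*) ;;                 # protected: drop it from the list
      *) printf '%s\n' "$branch" ;;     # not protected: pass it through
    esac
  done
}

# Usage sketch: list first and read the output.
#   git branch --merged main --format='%(refname:short)' | drop_protected
# Only then append the delete step, with -d rather than -D:
#   ... | drop_protected | xargs -n 1 git branch -d
```

Lowercase -d is the deliberate choice in that last line: git refuses to delete a branch that isn't fully merged, whereas -D deletes it regardless.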

The pattern in the failures

None of these were the model being stupid. That's what I find unsettling. Every failure came from the same root: it doesn't know my context, and it states everything with identical confidence.

It doesn't know which directory I'm in, which matters enormously for anything with -delete or rm. It doesn't know what version of a tool is installed, so it'll cheerfully use flags from a newer release. And it doesn't know which of my files, branches, or services I actually care about, so any destructive operation is a coin flip dressed up as a recommendation.

The dangerous commands and the helpful ones come out in exactly the same tone. There is no tremor in its voice when it's about to delete your home directory. That's the real hazard: not that it's sometimes wrong, but that wrong and right are presented identically.

What I actually do now

I kept the tool, because on balance it's a clear win. But I changed how I treat it.

I read every suggestion before running it, properly, not the reflexive glance I started with. For anything with rm, find -delete, git ... -D, or a redirect that truncates a file, I read it twice and I check where I am with a quick pwd first. I treat it like a confident junior colleague who's pasted me a command from a forum: probably right, occasionally about to ruin my day, always worth a look before I act on it.
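
The "read it twice" list can be roughly mechanised, if only to print a warning next to the suggestion. This is a sketch under assumptions: is_destructive is a hypothetical name, and the patterns are deliberately crude. It will miss things, and it will flag harmless commands such as a redirect to /dev/null. It supplements reading the command; it doesn't replace it.

```shell
# is_destructive CMD: hypothetical heuristic, not a parser.
# Succeeds if CMD contains rm, -delete, -D, or a truncating '>' redirect.
is_destructive() {
  # Pad with spaces so the words also match at the start or end of CMD.
  case " $1 " in
    *" rm "*|*" -delete "*|*" -D "*) return 0 ;;
  esac
  # A lone '>' truncates its target; '>>' appends and '>&' duplicates a
  # file descriptor, so neither of those is flagged.
  printf '%s\n' "$1" | grep -Eq '(^|[^>])>([^>&]|$)'
}

# Usage sketch:
#   is_destructive "$suggested_command" && echo "read twice, check pwd" >&2
```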

# The function deliberately ends here, populating the line.
# It does NOT pipe into sh. That boundary is the whole safety model.
# -r keeps backslashes in the suggestion literal; -z pushes it onto the editing buffer.
print -r -z -- "$suggested_command"

That print -z instead of an eval is the entire philosophy in one line. The moment you remove the human checkpoint, every one of the failures above goes from "mild annoyance" to "incident". The value of the tool is in the suggestion. The safety is entirely in the pause before you accept it.

So: would I recommend wiring an LLM into your shell? Yes, genuinely. It's a real productivity gain and I'd miss it now. But never let it run anything on its own, read what it gives you as if a stranger wrote it, and remember that its calm, helpful tone is exactly as steady when it's about to delete the wrong thing. The tool is good. The discipline is what keeps it good.