a language model lives in my terminal now, and mostly that's fine

I've had a language model bound to a keystroke in my shell for a few months now, and I've reached the stage where I can be honest about it, which is to say it's genuinely useful and it has also lied to me with total confidence in ways that nearly cost me real money. Both of those things are true at once and the trick is knowing which mode you're in at any given moment.

The setup is nothing exotic. A small wrapper that takes a natural-language description, sends it off with a system prompt that says "you produce shell commands for Linux, output only the command", and drops the result onto my command line unexecuted so I can read it before I hit return. That last part, unexecuted, is the entire safety model and I'll come back to why it's load-bearing.
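A wrapper along these lines can be sketched in a few lines of bash. This is a minimal sketch under stated assumptions, not the exact script: `ask_model` is a hypothetical stub standing in for the real API call, `suggest_command` is a made-up name, and Ctrl-G is an arbitrary choice of key. The part that matters is that the function only rewrites the readline buffer and never runs anything.

```shell
# Hypothetical stand-in for the real model call. A real version would send
# the request to whatever API you use; this stub recognises one phrase so
# the sketch stays self-contained.
ask_model() {
  case "$1" in
    *"disk usage"*) printf '%s\n' 'du -sh -- * | sort -h' ;;
    *)              printf '%s\n' "# no suggestion for: $1" ;;
  esac
}

# Replace whatever is typed on the command line with the suggestion.
# READLINE_LINE and READLINE_POINT are the buffer and cursor position that
# bash exposes to functions bound with `bind -x`. Nothing is executed here.
suggest_command() {
  local request="$READLINE_LINE"
  READLINE_LINE=$(ask_model "$request")
  READLINE_POINT=${#READLINE_LINE}
}

# Bind to Ctrl-G (an assumed key), but only in an interactive shell,
# since line editing does not exist anywhere else.
case $- in
  *i*) bind -x '"\C-g": suggest_command' ;;
esac
```

With this sourced from a bash startup file, typing a description and pressing the key swaps the description for a command and leaves the cursor at the end of the line, waiting for a decision.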

where it's genuinely good

The thing it's brilliant at is the commands I use rarely enough to never remember but often enough to need. tar flags. ffmpeg invocations. The exact find incantation to delete files older than thirty days but only in one subtree. The awk one-liner I will look up for the rest of my life because some part of my brain has decided awk is write-only. For all of these, describing what I want in English and getting a plausible command back is faster than the man page and far faster than the four browser tabs I'd otherwise open.

$ ?? convert this mkv to a 720p mp4 with reasonable quality
ffmpeg -i input.mkv -vf scale=-2:720 -c:v libx264 -crf 23 \
  -preset medium -c:a aac -b:a 128k output.mp4

That's correct, it's the command I'd have written if I could remember the flags, and it took two seconds. Over a few months that adds up to a meaningful amount of not-context-switching. The win isn't that it does things I can't. It's that it does things I can do but resent doing, and removes the friction that used to send me off to a browser and back twenty minutes later having forgotten what I was doing.
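The find case mentioned earlier follows the same pattern, and it is a good one to run as a listing before it is ever run as a deletion. A sketch, assuming GNU find on Linux; `list_stale` is a hypothetical helper name and the path in the usage line is a placeholder.

```shell
# List regular files under the given directory whose contents were last
# modified more than 30 days ago. find counts age in whole 24-hour periods,
# so "+30" means at least 31 full days old.
list_stale() {
  find "$1" -type f -mtime +30 -print
}

# Usage (placeholder path):
#   list_stale ./logs/archive
#
# Only once the listing looks right, the destructive form swaps -print
# for -delete:
#   find ./logs/archive -type f -mtime +30 -delete
```

The path is quoted inside the function so that a directory name containing spaces stays a single argument.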

It's also a surprisingly good rubber duck for "what's the name of the tool that does X". I know the shape of the thing I want, I've just forgotten what it's called, and describing the shape gets me the name. That alone earns its keep.

where it bit me

Now the other side, because the failures are more instructive than the wins.

The first time it nearly bit me, I asked for a command to clean up old Docker images and it handed me something with --all and --force and a filter I didn't read carefully enough. As written, it would have removed images I was actively using rather than just the dangling ones I meant. The command was syntactically perfect. It was confident. It was also wrong in exactly the way that doesn't show up until you've pruned the thing you needed. I caught it because I read it. If I'd been in the habit of just running whatever it suggested, I'd have spent an hour pulling images back down.

The second one was worse, and it's the one that taught me the real lesson. I asked for a command to find and remove some large temporary files, and the path it constructed had an unquoted variable in it that, with the directory I happened to be standing in, would have expanded into something a great deal more enthusiastic than I intended. Spaces in a path, a variable, no quotes, and rm -rf. The classic. The model didn't see it as dangerous because in the abstract it wasn't; in my actual working directory it absolutely was. The model doesn't know your working directory. It doesn't know what's on your disk. It's pattern-matching plausible shell, and plausible shell and safe shell overlap most of the time but not all of the time, and rm -rf lives precisely in the gap.
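The mechanics are easy to see without deleting anything. The sketch below uses a hypothetical path and a small helper, `show_args` (my name, not a standard tool), that prints the arguments a command would receive instead of running it. It assumes bash or another POSIX-style shell; zsh does not split unquoted variables by default.

```shell
# Print each argument on its own line, in brackets, instead of running
# anything destructive.
show_args() {
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# Hypothetical path containing a space.
target="/tmp/old builds"

# Unquoted: the shell splits on the space, so rm would receive TWO paths,
# /tmp/old and builds. The second is relative, so what it hits depends
# entirely on the directory you happen to be standing in.
show_args rm -rf $target

# Quoted: rm would receive exactly one path, the one you meant.
show_args rm -rf "$target"
```

The unquoted call prints four arguments and the quoted call prints three. That one extra argument is the whole bug.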

the rules I actually keep

So a few hard rules have taken shape, and they're less about the model and more about me.

It never auto-runs. The command lands on my line and I press return, or I don't. The moment you wire one of these things to execute its own output you have handed your shell to something that hallucinates with a straight face. The unexecuted gap is where I do the one job the model can't do, which is know my actual context.

I read every destructive command in full before running it, and I treat anything with rm, dd, chmod -R, chown -R, or a redirect into a file as guilty until proven innocent. These are exactly the cases where confident and correct part company, and exactly the cases where being wrong is expensive.
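That watch list could be turned into a crude tripwire. The sketch below is an assumption about how one might do it, not something the wrapper necessarily does; `looks_destructive` is a hypothetical name. It is a plain text match rather than a parser, so it both over-flags (any `>` at all, including `2>/dev/null`) and misses variants (`/bin/rm`, `chmod --recursive`). Treat it as a nudge to slow down, never as a guard.

```shell
# Return 0 (true) if the suggested command matches one of the
# "guilty until proven innocent" patterns, 1 otherwise.
looks_destructive() {
  case " $1 " in
    *" rm "*|*" dd "*)             return 0 ;;
    *" chmod -R "*|*" chown -R "*) return 0 ;;
    *">"*)                         return 0 ;;
  esac
  return 1
}

# Example: label a few suggestions before they reach the command line.
for cmd in 'ffmpeg -i in.mkv out.mp4' 'rm -rf "$dir"' 'echo hi > notes.txt'; do
  if looks_destructive "$cmd"; then
    printf 'CHECK  %s\n' "$cmd"
  else
    printf 'ok     %s\n' "$cmd"
  fi
done
```

The ffmpeg line comes back as `ok` and the other two as `CHECK`. The reading still has to be done by a person; the label only says which lines deserve it most.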

I don't trust it for anything stateful or anything that touches money. Asking it for an ffmpeg flag is low-stakes; if it's wrong, the video looks bad and I try again. Asking it to construct a command that mutates a production database, or a cloud CLI call that spins up resources I pay for, is a different category. There it's a starting point I verify line by line, not an answer.

And I've stopped taking its explanations of why a command works as gospel. It's fluent, and fluent and correct are not the same thing. When it explains a flag I half-believe it and then I check, because the explanations are where it's most plausible and least accountable. A wrong command often fails loudly. A wrong explanation just quietly rewires your mental model and you carry the error forward for months.

is it worth it

On balance, yes, clearly, or I'd have unbound the key. The time it saves on the long tail of forgotten flags is real and daily. But the framing that matters is that it's a very fast, very confident junior who has read everything and understood the gist of most of it, and who has never once seen your actual machine. You wouldn't let that person run rm -rf on your homelab unsupervised, and you shouldn't let the model either, no matter how good the suggestion looks.

The two times it nearly bit me, it was stopped by the same thing: a human reading the command before it ran. Keep that human in the loop and the tool is excellent. Take them out and you've automated the production of confident mistakes, which is the one thing nobody actually needs more of.