a small agent that does the boring bits

A small robot on a workbench

I finally built one of these "agent" things that doesn't just produce a confident paragraph and then stop. It actually does things now: reads a ticket, greps the repo, opens the right file, suggests a diff, and runs the tests. Nothing groundbreaking. But the gap between a chatbot and something that touches your filesystem turns out to be mostly plumbing and nerve, and I want to write down what I learned before I forget it.

The core loop is unremarkable. Model gets a system prompt, a task, and a list of tools described as JSON schemas. It replies with either a final answer or a tool call. I run the tool, feed the result back, and go round again. The whole orchestration is maybe two hundred lines. The interesting part was never the loop, it was deciding which tools to expose and how much rope to give it.

A circuit board close up

My first instinct was to give it a shell tool. One run_command and let it figure out the rest. That works, briefly, and then it does something like find / -name "*.log" and you sit there watching it think for forty seconds about output it can't use. So I pulled that back hard. The tools it gets now are narrow and named for intent: read_file, list_dir, search_code, run_tests, propose_patch. The patch tool doesn't apply anything. It writes a diff to a file and stops. I apply it myself after reading it, like a code review where the other reviewer happens to be a very fast, very literal junior who has never once asked a clarifying question.

That last point matters more than the model choice. The agent is genuinely useful for the boring middle of a task: it's the bit where you know roughly what needs doing and just don't fancy the typing. It is not useful for deciding what the task is. When I gave it something underspecified it confidently went off and built the wrong thing, beautifully, with tests. So I keep the task definition tight and human, and let the agent fill in the mechanical bits between.

A few things that helped:

Log every tool call and its result somewhere I can scroll back through. When it goes wrong, and it will, you want the transcript.
Put a hard cap on iterations. Ten steps, then it stops and reports. Runaway loops are real and they cost actual money.
Make tools fail loudly and return the error text verbatim. The model is surprisingly good at reading a stack trace and adjusting, far better than it is at guessing.

The honest summary is that it saves me time on a narrow band of work and costs me vigilance everywhere else. I still read every diff. I still run the tests myself before I trust them. But for the "go and find where this config is actually loaded and show me" sort of question, it's quietly excellent, and I didn't expect to enjoy building it half as much as I did.