i built an agent that does the boring bits, and learned where the seams are

I have been deeply sceptical of the word "agent". It arrived as marketing, attached to demos that worked once on stage and never again, and for a long time the gap between the pitch and the reality was wide enough to fall into. But the building blocks have quietly got good enough that I wanted to find out for myself where the real line is between "impressive demo" and "thing I'd actually let near my life". So I built a small one. Not a framework, not a product, just enough code to answer the question.

The job I gave it was deliberately unglamorous. I wanted something that could take a vague instruction in plain English, decide which of a handful of tools to use, call them, look at the results, and either finish or try again. Check the state of my homelab. Search my own notes. Draft a reply to a routine email and leave it in drafts, never send. Tell me which of my backups last ran and when. Boring chores, the sort of thing that's individually trivial and collectively a nuisance.

the model is the easy part

Here's the thing nobody puts on the slide: the language model is the least of your problems. Wiring it up is a couple of hours. You give the model a description of your tools, it emits something that says "I would like to call check_backups with no arguments", you parse that, you actually call the function, you feed the result back, you loop. That's the whole trick. The loop is maybe forty lines.

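def run(model, messages):
    while True:
        reply = model.chat(messages, tools=TOOLS)
        if not reply.tool_calls:
            return reply.content  # no tool requested, so this is the answer
        messages.append(reply)  # keep the model's own request in the history
        for call in reply.tool_calls:
            result = TOOLS[call.name](**call.args)  # actually call the function
            messages.append(tool_result(call.id, result))  # feed the result back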

That works on the first afternoon. You feel like a wizard. And then you try to use it for real, and every single problem you have from that point on is an engineering problem, not an AI one.

The first wall I hit was that the model is confidently wrong about the shape of the world. It would invent a tool I hadn't given it, call a real tool with arguments that didn't exist, or decide a task was complete when it plainly wasn't. None of this is fixable by asking the model nicely. It's fixable by treating everything the model emits as untrusted input from a clever but unreliable junior, validating it hard, and giving it a clear, specific error back when it gets the shape wrong. The error messages I write for the model turned out to matter more than the prompt. "Unknown tool send_email; available tools are: ..." gets a sensible retry. A stack trace gets nonsense.
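
As a rough sketch of that kind of validation (not the code from this project: the two stand-in tools and the dispatch helper are made up for illustration), the idea is to check the tool name and the argument shape before anything runs, and to hand back a short, specific message rather than a traceback:

```python
import inspect

def check_backups():
    """Hypothetical stand-in for a real tool."""
    return "last backup: ok"

def search_notes(query: str, limit: int = 5):
    """Hypothetical stand-in for a real tool."""
    return [f"note matching {query!r}"][:limit]

TOOLS = {"check_backups": check_backups, "search_notes": search_notes}

def dispatch(name, args):
    """Run a tool call requested by the model, or return an error it can act on."""
    if name not in TOOLS:
        available = ", ".join(sorted(TOOLS))
        return f"Unknown tool {name}; available tools are: {available}"
    if not isinstance(args, dict):
        return f"Arguments for {name} must be an object of named values"
    func = TOOLS[name]
    signature = inspect.signature(func)
    try:
        # bind() checks argument names and required values without calling the tool.
        signature.bind(**args)
    except TypeError as exc:
        return f"Bad arguments for {name}: {exc}; expected {name}{signature}"
    try:
        return func(**args)
    except Exception as exc:
        # One readable line for the model instead of a stack trace.
        return f"Tool {name} failed: {type(exc).__name__}: {exc}"
```

Whatever dispatch returns goes back to the model as the tool result, so a bad call costs one retry instead of derailing the run. A real version would probably also check argument types, which this sketch does not.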

the boundaries are where the work is

The second wall, the important one, is permission. An agent that can do things is exactly as dangerous as the things it can do. The moment my toy could draft emails I drew a hard line: it may write to a drafts folder and nothing else. It cannot send. It cannot delete. Anything irreversible goes through me. This isn't a temporary safety measure I'll relax once I trust it more. It's the design. The model does not get to take an action I'd be upset to have taken on my behalf without a human in the loop, and that rule is enforced in the plumbing, not requested in the prompt. Prompts are suggestions. Code is policy.
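
One way to put that rule in the plumbing, sketched under assumptions: every tool is registered with a flag saying whether it is reversible, there is simply no send or delete tool to call, and anything irreversible is refused unless a human approver says yes. The Tool record, the in-memory DRAFTS list, and the prune_old_backups tool are all hypothetical, there only to show the gate.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    func: Callable[..., str]
    reversible: bool  # False means a human must approve every call

DRAFTS = []  # stand-in for a real drafts folder

def save_draft(to: str, subject: str, body: str) -> str:
    DRAFTS.append({"to": to, "subject": subject, "body": body})
    return f"Draft saved for {to}"

def prune_old_backups(days: int) -> str:
    # Hypothetical irreversible tool, included only to exercise the gate.
    return f"Pruned backups older than {days} days"

# Deliberately no send_email or delete_email entry: the model cannot call
# what is not registered, whatever the prompt says.
REGISTRY = {
    "save_draft": Tool(save_draft, reversible=True),
    "prune_old_backups": Tool(prune_old_backups, reversible=False),
}

def deny_all(name: str, args: dict) -> bool:
    """Default approver: refuse. A real one would ask the human and wait."""
    return False

def call_tool(name, args, approve=deny_all):
    tool = REGISTRY.get(name)
    if tool is None:
        available = ", ".join(sorted(REGISTRY))
        return f"Unknown tool {name}; available tools are: {available}"
    if not tool.reversible and not approve(name, args):
        return f"Refused: {name} is irreversible and was not approved"
    return tool.func(**args)
```

The default matters here: with no approver wired in, irreversible calls fail closed.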

The third wall was knowing when to stop. Left alone, the loop will happily run forever, each step slightly more lost than the last, like someone who's been given directions one too many times and is now just turning corners hopefully. I cap the number of steps, I cap the wall-clock time, and if it hasn't finished by then it reports what it managed and gives up gracefully. A confused agent that quits is fine. A confused agent that keeps going is how you end up with forty draft emails apologising for the previous thirty-nine.
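
A minimal sketch of those two caps, assuming a hypothetical step() callable that performs one model turn (plus any tool calls) and returns a (done, text) pair. The limits shown are placeholders, not the values used here.

```python
import time

def run_bounded(step, max_steps=8, max_seconds=60.0, clock=time.monotonic):
    """Call step() until it reports a final answer or a budget runs out."""
    deadline = clock() + max_seconds
    progress = []  # what the agent managed, reported if it gives up
    for _ in range(max_steps):
        if clock() >= deadline:
            reason = "time limit"
            break
        done, text = step()
        if done:
            return {"status": "finished", "answer": text, "progress": progress}
        progress.append(text)
    else:
        # The loop ran out of steps without finishing or hitting the deadline.
        reason = "step limit"
    return {"status": "gave_up", "reason": reason, "progress": progress}
```

Note that the deadline is only checked between steps, so a single slow step can still overrun it; the model and tool calls inside step() would need their own timeouts.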

is it any good?

For the narrow, boring jobs I built it for, genuinely yes. It checks my backups and tells me in a sentence. It finds the note I half-remember writing. It drafts the dull reply, and I skim it and hit send myself. That saves me the activation energy of starting from a blank box, which turns out to be the actual cost of those emails. None of this is magic. All of it is mildly useful, every day, which is a much higher bar than "magic" and one most magic fails to clear.

What surprised me most was how the value tracked the narrowness. The wider I let it roam, the worse it got. When I gave it a vague brief and a big pile of tools, it flailed. When I gave it three tools and a clear job, it was reliable enough that I stopped checking its work, which is the real test. Trust, in a system like this, is just the point at which you stop reading every line of output. I do not trust it broadly. I trust it for exactly the things it has earned, and the boundary between what it has earned and what it hasn't is something I maintain on purpose.

There is a temptation, once a thing like this works at all, to keep widening its remit until it breaks. I have resisted that, mostly because the failures aren't loud. A web service that errors throws a 500 and you find out. An agent that quietly does the wrong thing with a real action doesn't error; it just confidently completes, and you discover the mistake later when the consequences arrive. That asymmetry is the whole reason for the firm hand on irreversible actions. The cost of a wrong read is a wasted minute. The cost of a wrong write could be anything.

What I came away with is a quiet recalibration. The interesting work in "agents" is not the model and not the prompt. It's the same work it always was: clear contracts, hard validation, sensible failure modes, and a firm hand on which actions are reversible. The model is a new and strange component, brilliant at the fuzzy middle and hopeless at the precise edges, and your job is to build a system that plays to that. Give it the ambiguous bit, the "what does this person actually want", and hold the levers yourself. Do that and you get something that does real things. Skip it and you get a demo.

I'm keeping mine. It's small, it's a bit daft, and it has never once sent an email it shouldn't have, because I made quite sure it can't.