Most of the "AI agent" demos I had seen by the summer were a model talking to itself about a task it never actually performed. I wanted the opposite: something small that did one real, boring job end to end, with hands on actual tools, and that I could trust enough to leave running.
The job I picked was alert triage. Every morning I have a folder of overnight alerts, most of them noise, a few that matter. I wanted something to read them, group the duplicates, and tell me the two or three I should actually look at. Not fix anything. Just sort the inbox so I do not have to.
The shape of it
The loop is unglamorous and that is the point. The model gets a system prompt describing the tools it has, then I hand it the alerts and let it call functions until it is done. The function-calling support that landed in the OpenAI API back in June made this tidy, because the model returns a structured tool call rather than me parsing prose and praying.
The tools were deliberately tiny:
fetch_alerts(since)returns the raw alert objects.lookup_runbook(service)returns the wiki page for a service, if one exists.post_summary(text)writes a message to a Slack channel.
That is the whole toolbox. Three functions, each one a normal bit of code I had already written, with a JSON schema bolted on the front so the model knows how to call it.
Where it earned its keep
The grouping is the part that genuinely surprised me. Classic dedup logic struggles when the same underlying fault shows up as five different alerts with different wording from different systems. The model is good at this precisely because it does not need an exact match. "These four are all the same database being unreachable, worded by four different exporters" is the kind of judgement that is annoying to write rules for and easy for a language model to make.
So every morning I now get a Slack message: here are the three things that matter, here is why, here are the runbook links. The forty lines of noise underneath them I never read.
Where I drew the line
The thing I did not do is give it the power to act. No restart_service, no silence_alert, nothing that changes state. The moment an agent can take an irreversible action on the strength of a probabilistic guess, you have built a very confident intern with root and no supervision. I am not ready for that, and I am not sure I ever will be for production.
The honest summary is that the value is not autonomy. The value is summarisation and judgement applied to a firehose, with the actual doing left to deterministic code and to me. Keep the tools small, keep the dangerous ones out, and a little agent that "does things" turns out to be quietly, genuinely useful. Hand it the keys and you have built a liability with good manners.