The last fortnight has been wall-to-wall keynotes, the way January always is now. New frontier models, new benchmark charts where the bars all reach reassuringly close to the top, and a great deal of confident language about "agents" doing your work for you whilst you sip something. My feed has split neatly into people declaring it the end of the profession and people declaring it all hype. As usual, both camps are watching the same demo and seeing what they brought with them.
I sat through a fair bit of it. The thing I keep noticing is that the interesting work has quietly moved away from the headline. The number on the slide, whatever it is this month, isn't where I'm spending my attention. The questions I actually had were: what's the context window cost in practice, what's the latency under real load, and what happens when the clever agent confidently does the wrong thing to a production system at 2am. None of those make a good stage moment, so none of them got one.
That's not cynicism, or I don't mean it to be. When something is genuinely good I'm happy to say so, and some of what's shipped recently is genuinely good. Tool use that mostly works. Long-context retrieval that's stopped falling over halfway through. These are real improvements and I use them daily. But "real improvement I use daily" and "civilisation-altering moment deserving a standing ovation" are different categories, and the keynote format only has the one volume setting.
What I'd love, just once, is a keynote that opens with the failure cases. Here's where it hallucinates. Here's the class of task it's bad at. Here's the bill you'll get. I'd trust the impressive bits far more if they came with honest edges. Instead the edges turn up later, in a blog post from someone who tried it on real work and wrote down what broke, and that post is usually more useful than the entire event that preceded it.
There's also a quieter cost to the breathlessness. When every release is "the biggest leap yet", the phrase stops meaning anything, and the genuinely big leaps land with the same thud as the incremental ones. I'd rather a company undersell and let the work speak. The few that do it that way have earned a sort of credibility the loud ones can't buy, and I notice I reach for their tools first, almost without deciding to.
So my opinion, for what it's worth in a week where everyone has one: watch what people build with this stuff over the next quarter, not what gets demoed this week. The demo is the company's best case under controlled conditions. The thing that matters is the median engineer's Tuesday. That's where you find out whether the new model is a tool or a toy, and no amount of stage lighting answers it for you. Ask me in March.