Another week, another launch that has apparently changed everything. The big model releases of late April and early May have done their usual job of filling every feed with breathless takes, leaderboard screenshots and at least one person announcing that a whole category of jobs is now obsolete. I have read a fair amount of it, run a few things through the new toys, and come away with the same feeling I get most cycles: the demo is genuinely impressive, and the gap between the demo and my actual work is exactly as wide as it always is.
I want to be fair here, because cynicism is cheap and these things are not nothing. The capabilities are real. The thing that is harder to find in the launch-day coverage is the part where someone tries to use it for a week on a real codebase, with real constraints, and reports back honestly. That is the post I find useful, and it never arrives on day one.
The launch-day ritual
There is a predictable shape to these weeks now, and I say that with affection rather than contempt. The announcement drops, usually with a livestream and a benchmark table where the new thing is ahead on every row that made the cut and the benchmarks it loses on are conspicuously missing. Within the hour the threads start: the screenshots, the "I cancelled my other subscription", the person who got it to do something in one prompt that would have taken them an afternoon.
Then the second wave arrives, slightly more sober. Someone notices the benchmark was run with a prompt nobody would use in practice. Someone else points out the pricing, which is where a lot of the day-one excitement quietly goes to die. And by the weekend you get the first genuinely useful thing: a long, grumpy writeup from a person who actually shipped something with it, listing what worked and what fell over. That is the one I bookmark.
I do not think any of this is bad faith. It is just that a launch is a marketing event, and a marketing event is optimised to make you feel that you are behind. You are not behind. You are watching a film trailer and being asked to review the film.
The thing that makes this cycle slightly worse than previous ones is the sheer compression of it. We used to get a major release every few months and a fortnight to digest it. Now there are several a quarter, each one billed as the moment everything changed, and the half-life of a benchmark headline is measured in days. By the time you have read the careful third-party evaluation of one model, two more have launched claiming to beat it. The treadmill speed has gone up, and the temptation to jump on it has gone up with it. Resisting that is a skill now, not a luxury.
Benchmarks measure what is easy to measure
The other thing worth being honest about is what the leaderboards actually tell you, which is less than the slides imply. A benchmark is a proxy. It measures the thing that is easy to measure and score automatically, which is rarely the thing you care about. A model can top a reasoning benchmark and still be irritating to work with because it is verbose, or slow, or it ignores the format you asked for, or it is brilliant in one language and mediocre in yours. None of that shows up in a single number, and the single number is what gets screenshotted.
I have watched a model lose on a headline benchmark and win comprehensively on my actual work, because my actual work rewards consistency and following instructions far more than it rewards peak cleverness on a hard puzzle. The reverse happens too. The leaderboard is a starting point for which models are worth my time to evaluate properly. It is not a verdict, and treating it as one is how you end up migrating to something that benchmarks beautifully and annoys you daily.
What I actually do with these
My approach has settled into something fairly boring, which is usually a sign it is working. I do not rewire anything on launch day. I add the new model to the small set of evals I keep for the work I genuinely care about, run them, and look at the numbers that matter to me rather than the ones on the slide. Those are: does it get my domain-specific stuff right, how often does it confidently invent something, what does it cost per task at the volume I would actually run, and how does it behave when the input is messy rather than the curated example from the keynote.
That last one is where most of the day-one magic evaporates. The demo input is always clean. My inputs are half-finished, contradictory, and full of context that lives in my head rather than the prompt. A model that is brilliant on the trailer and merely good on my mess is still useful, but it is a different proposition from the one being sold, and it is worth knowing which you have bought before you reorganise your week around it.
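For anyone who wants the bookkeeping spelled out, here is a rough sketch of how those four questions can be turned into numbers. It is an illustration, not my actual harness: the EvalResult fields, the summarise function and the prices in the example are all hypothetical placeholders, and the prices in particular are not any provider's real rates. It assumes you have already graded each eval case by hand or by script and recorded the token counts.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One graded eval case. All fields are hypothetical, for illustration."""
    correct: bool        # did the answer match what I expected?
    invented: bool       # did it confidently state something untrue?
    messy: bool          # was the input realistic mess rather than a clean example?
    input_tokens: int
    output_tokens: int


def rate(flags):
    """Fraction of True values, or None when there is nothing to average."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def summarise(results, usd_per_mtok_in, usd_per_mtok_out, tasks_per_month):
    """Reduce graded results to the handful of numbers I actually look at."""
    if not results:
        raise ValueError("need at least one eval result")
    # Cost of each case from its token counts, priced per million tokens.
    total_cost = sum(
        (r.input_tokens * usd_per_mtok_in + r.output_tokens * usd_per_mtok_out)
        / 1_000_000
        for r in results
    )
    cost_per_task = total_cost / len(results)
    return {
        "accuracy_clean": rate(r.correct for r in results if not r.messy),
        "accuracy_messy": rate(r.correct for r in results if r.messy),
        "invention_rate": rate(r.invented for r in results),
        "cost_per_task": cost_per_task,
        "cost_per_month": cost_per_task * tasks_per_month,
    }


# Made-up data: one clean case, two messy ones, placeholder prices.
example = [
    EvalResult(True, False, False, 1000, 500),
    EvalResult(True, False, True, 3000, 1500),
    EvalResult(False, True, True, 2000, 1000),
]
print(summarise(example, usd_per_mtok_in=2.0, usd_per_mtok_out=10.0,
                tasks_per_month=10_000))
```

Splitting accuracy into clean and messy inputs is the point of the exercise: the gap between those two figures is roughly the gap between the trailer and the film.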
The other thing I have learned to watch is the boring operational stuff that never makes the announcement. Rate limits on the first few days. Latency under load when half the internet is hammering the same endpoint. Whether the API is actually stable or whether you are debugging their teething problems on their behalf. None of that is on the leaderboard, and all of it decides whether the thing is usable in anger.
The honest verdict, for now
So where does that leave this week's launch? Roughly where I expected. The headline capability is real and I am glad it exists. I have already found one or two genuine improvements over what I was using, and a couple of places where the older, cheaper option is still the right call because the new one is overkill and the cost adds up. I will keep both around and route to whichever fits the job, which is the unglamorous truth of how most of this works once the confetti settles.
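The routing itself does not need to be clever. A minimal sketch, assuming you already have per-task accuracy and cost figures from your own evals: pick the cheapest model that clears a quality bar for the kind of job in hand, and fall back to the most accurate one when nothing does. The ModelProfile type, the route function, the model names and every number below are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelProfile:
    """What my own evals say about one model. Hypothetical structure."""
    name: str
    cost_per_task: float                          # measured at my volume
    accuracy: dict = field(default_factory=dict)  # task kind -> measured accuracy


def route(kind, models, min_accuracy=0.9):
    """Cheapest model that is good enough for this kind of task."""
    if not models:
        raise ValueError("no models to route between")
    good_enough = [m for m in models if m.accuracy.get(kind, 0.0) >= min_accuracy]
    if good_enough:
        return min(good_enough, key=lambda m: m.cost_per_task)
    # Nothing clears the bar, so take the best available and accept the cost.
    return max(models, key=lambda m: m.accuracy.get(kind, 0.0))


# Invented figures: an older cheap model and a newer expensive one.
older = ModelProfile("older-cheap", 0.002, {"summarise": 0.95, "refactor": 0.70})
newer = ModelProfile("newer-pricey", 0.020, {"summarise": 0.97, "refactor": 0.92})

print(route("summarise", [older, newer]).name)  # the cheap one is good enough
print(route("refactor", [older, newer]).name)   # only the new one clears the bar
```

The threshold is the only real design decision here, and it should come from how much a wrong answer costs you on that kind of task, not from the leaderboard.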
If you are feeling the familiar launch-week pressure to drop everything and migrate, my advice is the same as it is every cycle. Run it against the work you actually do, not the work the demo does. Look at the cost at your real volume. Wait for the grumpy writeup. And remember that the people most confident on day one are, almost by definition, the ones who have not yet tried it on anything hard.
The good ones earn their place quietly, over weeks, by being reliably useful rather than spectacularly impressive once. I will check back in a fortnight and tell you whether this one has.