It's early May 2024 and my feed is wall-to-wall model launches again. OpenAI is teasing something, Meta has Llama 3 out, and everyone has a benchmark chart proving their thing is best. I've stopped reading the charts. Every model wins on whichever axis its makers chose to plot.
What I actually want to know never makes the keynote. What's the latency at the 99th percentile when the thing is under real load? What does it cost per million tokens once the launch-week free tier runs out? How does it behave when the prompt is messy and the user is in a hurry, which is to say always? The demo is a person asking a clean question in a quiet room. Production is neither clean nor quiet.
I'm not jaded about it. The pace is genuinely remarkable, and Llama 3 8B running on a card I had spare is the kind of thing that would have been science fiction two years ago. I just notice that the excitement lives in the demo and the work lives in the integration. The launch is the easy part. Wiring it into something that doesn't fall over at 9am on a Monday is the part nobody live-streams.
So I'll let the launch dominate the week, and I'll be over here reading the rate-limit documentation.