I had a boring problem, which is the best kind to throw a model at. Incoming support messages needed sorting into about a dozen buckets: billing, bug report, feature request, account access, and so on. A human could do it in a second. A regex couldn't, because people don't write to spec. And a large frontier model could, easily, but I'd be paying per call to do something a child could manage, at a volume that made the bill add up.
So the question was whether a small model, fine-tuned on our actual tickets, could do this one narrow job well enough to run locally and for nearly nothing.
The honest first step was data, and there's no shortcut. I pulled a few thousand historical tickets that already had a category, because they'd been hand-sorted over the years. That's the whole trick with the boring tasks: you usually already have the labels, you just have to go and find them. I cleaned them, stripped signatures and quoted replies, and held back a slice for evaluation that the model would never see during training.
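For the curious, here is a minimal sketch of what that cleaning and hold-back step can look like. The signature and quote patterns, the `clean_ticket` and `stratified_split` names, and the 20% evaluation fraction are all assumptions made for illustration; real tickets will need their own rules. The split is done per category so that rare categories still show up in the held-back slice.

```python
import random
import re
from collections import defaultdict

# Hypothetical patterns for illustration; tune these to your own ticket data.
SIGNATURE_START = re.compile(
    r"^(?:--\s*$|sent from my |best regards|kind regards|thanks,?\s*$)",
    re.IGNORECASE,
)
QUOTE_HEADER = re.compile(r"^on .+ wrote:\s*$", re.IGNORECASE)


def clean_ticket(text):
    """Keep the customer's own words; drop quoted replies and the signature."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if SIGNATURE_START.match(stripped) or QUOTE_HEADER.match(stripped):
            break  # everything below is signature or quoted thread
        if stripped.startswith(">"):
            continue  # a single quoted line inside the message
        kept.append(stripped)
    return "\n".join(kept).strip()


def stratified_split(examples, eval_fraction=0.2, seed=0):
    """Split (text, label) pairs per label so every category reaches the eval set."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    train, held_out = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Hold back at least one example per label, unless there is only one.
        n_eval = max(1, round(len(items) * eval_fraction)) if len(items) > 1 else 0
        held_out.extend(items[:n_eval])
        train.extend(items[n_eval:])
    return train, held_out
```

The one rule that matters is that the held-back examples are set aside before any training happens and are never fed back in.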
I didn't do a full fine-tune. A small base model with a LoRA adapter was plenty, and it ran on a single GPU without drama. The point of a parameter-efficient approach here isn't just speed. You're not trying to teach the model language; it already has that. You're teaching it your dozen labels and the shape of your particular customers' grumbles.
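To put a rough number on "parameter-efficient": LoRA freezes a weight matrix of shape d_out × d_in and trains two thin matrices instead, one of shape d_out × r and one of shape r × d_in, whose product is added to the frozen weights. The sketch below just does that arithmetic. The 768-wide layer and rank 8 are illustrative assumptions, not the sizes used in this project.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, adapter) trainable-parameter counts for one weight matrix.

    full:    parameters updated by a full fine-tune of the matrix (d_out * d_in)
    adapter: parameters in the two low-rank LoRA factors (rank * (d_in + d_out))
    """
    full = d_in * d_out
    adapter = rank * (d_in + d_out)
    return full, adapter


# Hypothetical sizes: a 768 x 768 projection with a rank-8 adapter.
full, adapter = lora_param_counts(d_in=768, d_out=768, rank=8)
print(f"full fine-tune: {full:,} parameters")
print(f"LoRA adapter:   {adapter:,} parameters ({adapter / full:.1%} of full)")
```

Under those assumed sizes the adapter trains about 2% of what a full fine-tune of the same matrix would, which is why a single GPU is enough.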
The result that mattered: on the held-back set, the tuned small model hit accuracy in the high nineties, comfortably better than the few-shot prompt I'd been limping along with, and it ran in a few milliseconds per ticket on hardware I already owned. No per-call cost, no rate limit, no data leaving the building, which the compliance people appreciated more than any accuracy number.
The mistakes were instructive. My first run overfit gloriously: near-perfect on training data, mediocre on anything new, because I'd trained too long on too little. The categories were also imbalanced, mostly "bug report" and "billing", which meant the model learned to shrug and guess the common ones. Reweighting and a more even sample fixed most of it.
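Both fixes can be sketched in a few lines. The weighting below is the common inverse-frequency heuristic (total examples divided by number of classes times the class count), which is one reasonable reading of "reweighting" rather than the exact scheme used here. The early-stopping check is an assumption on my part about how to avoid training too long: stop once the loss on a validation split has not improved for a couple of evaluations. The function names and the patience value are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: rare categories count for more in the loss."""
    counts = Counter(labels)
    n_examples, n_classes = len(labels), len(counts)
    return {label: n_examples / (n_classes * c) for label, c in counts.items()}


def should_stop(eval_losses, patience=2):
    """True once the last `patience` evaluations failed to beat the earlier best.

    Use a validation split carved out of the training data for this, not the
    final held-back set, so the held-back accuracy stays an honest number.
    """
    if len(eval_losses) <= patience:
        return False
    best_before = min(eval_losses[:-patience])
    return min(eval_losses[-patience:]) >= best_before


# Toy label distribution, made up for illustration.
labels = ["bug report"] * 6 + ["billing"] * 3 + ["account access"] * 1
for label, weight in sorted(balanced_class_weights(labels).items()):
    print(f"{label}: {weight:.2f}")
```

The weights would typically be passed to a weighted cross-entropy loss, so that a missed "account access" ticket costs the model more than a missed "bug report".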
The lesson I'd offer is about scope. The instinct with these things is to reach for the biggest, cleverest model and ask it to do everything. But for a narrow, repetitive, well-defined task with labels you already have, a small fine-tuned model is faster, cheaper, private, and frequently more accurate, precisely because it's not trying to be clever. It just does the one dull thing, well, all day, and never asks for a holiday.