fine-tuning a tiny model to do one dull job well

I had a boring, well-defined task: take an incoming support email and tag it with one of about fifteen categories. For months I did this by handing a large model a careful prompt with the category list and a few examples. It worked, mostly. It was also slow, cost real money per call, and occasionally invented a sixteenth category out of sheer creativity, which is exactly what you don't want from a classifier.

So I tried the unfashionable thing. I fine-tuned a small model to do just this one job, and it's better at it than the big general model was.

why small, and why fine-tune

The task is narrow. There's a fixed output space, plenty of historical examples, and no need for the model to be witty or knowledgeable about the world. That's the ideal shape for fine-tuning a small model: you're not teaching it new facts, you're teaching it a single, consistent reflex. A model with 1 to 3 billion parameters has more than enough capacity to learn "this kind of text means this category", and once it has, the category list no longer needs to be in the prompt, because the categories are baked into the weights.
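To make "fixed output space" concrete, here is a minimal sketch of turning labelled emails into (text, class id) pairs. The category names, `encode_example`, and the whitespace cleanup are all hypothetical stand-ins for illustration, not the actual list or pipeline.

```python
# Hypothetical category names; the real list has about fifteen entries.
CATEGORIES = ["Billing", "Escalation", "Shipping"]

# The fixed output space: every label maps to exactly one integer id.
LABEL2ID = {name: i for i, name in enumerate(CATEGORIES)}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}


def encode_example(text: str, label: str) -> tuple[str, int]:
    """Turn one labelled email into a (text, class id) training pair.

    Raises ValueError for empty text or for a label outside the fixed
    set, so an invented category in the historical data is caught
    before it reaches training.
    """
    cleaned = " ".join(text.split())  # collapse runs of whitespace
    if not cleaned:
        raise ValueError("empty email text")
    if label not in LABEL2ID:
        raise ValueError(f"unknown category: {label!r}")
    return cleaned, LABEL2ID[label]
```

Rejecting unknown labels at this stage matters if the training data came from a prompt-based system: any sixteenth category the big model made up would otherwise be learned as if it were real.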

I used a few thousand labelled examples I already had from the prompt-based system, cleaned up the messy ones, and held back a chunk for evaluation. The training itself was a LoRA fine-tune: the base model's weights stay frozen and only a small adapter is trained, which is why it fits on a single consumer GPU and finishes in an evening.
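A rough bit of arithmetic shows why the adapter is so much cheaper to train than the whole model. LoRA learns the update to a weight matrix as the product of two thin matrices, so for a matrix of shape d_out x d_in and rank r it adds r * (d_out + d_in) trainable parameters. The matrix size and rank below are illustrative assumptions, not the settings used here.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The original matrix stays frozen; the update is learned as B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in).
    """
    return rank * (d_out + d_in)


def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters updated if the whole matrix were fine-tuned instead."""
    return d_out * d_in


# Illustrative numbers only: one square 2048 x 2048 projection, rank 8.
d, rank = 2048, 8
adapter = lora_param_count(d, d, rank)
full = full_param_count(d, d)
fraction = adapter / full

print(adapter, full, f"{fraction:.2%}")  # 32768 4194304 0.78%
```

Under those assumptions the adapter trains under one percent of the parameters in that matrix, and gradients and optimizer state are only needed for that fraction.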

the results that actually mattered

Three things got better, and they're the three I cared about.

Consistency first. The fine-tuned model never invents a category, because it has only ever seen the fifteen that exist. The big model, for all its capability, would occasionally decide an email was "Billing/Refund/Escalation" when the options were "Billing" and "Escalation". The small one simply can't do that, and for a classifier that rigidity is a feature.
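How strict that guarantee is depends on how the model emits its answer. The sketch below assumes a classification head that produces one score per category, in which case the prediction is an index into the fixed list and cannot fall outside it. If the model instead generates the label as text, a membership check against the list is still worth keeping. The category names and `predict` are hypothetical.

```python
import math

# Hypothetical category names standing in for the real list.
CATEGORIES = ["Billing", "Escalation", "Shipping"]


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores to probabilities, shifted by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: list[float]) -> tuple[str, float]:
    """Pick the highest-scoring category from one score per category.

    The result is always a member of CATEGORIES, because it is chosen
    by index rather than generated as free text.
    """
    if len(logits) != len(CATEGORIES):
        raise ValueError("expected one logit per category")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], probs[best]


label, confidence = predict([0.2, 2.5, -1.0])
print(label)  # Escalation
```

The returned probability is a useful side effect: low-confidence emails can be routed to a human instead of being tagged blindly.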

Cost second. It runs locally on hardware I already own, so the per-email cost is electricity and nothing else. The big-model approach was a fraction of a penny per call, which sounds like nothing until you multiply by the volume and the retries.
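The multiplication is worth doing explicitly. This is a back-of-the-envelope sketch with made-up numbers: the volume, per-call price, and retry rate are placeholders, not my actual figures, and the result is in whatever currency the price is quoted in.

```python
def monthly_api_cost(emails_per_month: int,
                     cost_per_call: float,
                     retry_rate: float) -> float:
    """Expected monthly spend when a fraction of calls is retried once.

    retry_rate is the share of emails that need a second call,
    for example 0.05 for five percent.
    """
    calls = emails_per_month * (1 + retry_rate)
    return calls * cost_per_call


# Placeholder inputs: 200,000 emails, 0.004 per call, 5% retried.
cost = monthly_api_cost(200_000, 0.004, 0.05)
print(f"{cost:.2f}")  # 840.00
```

Swap in real volume and pricing and compare the result with the electricity cost of the local machine over the same period.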

Latency third. A small local model returns a category in well under a second. No network round-trip, no queue, no rate limit to back off from.
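A claim like "well under a second" is easy to check on your own hardware. Here is a minimal timing harness, assuming the model is wrapped in a `classify(email)` callable; that name and the stub used below are hypothetical, and the stub only exists so the sketch runs on its own.

```python
import math
import statistics
import time


def measure_latency(classify, emails, warmup=1):
    """Time classify() on each email and summarise the results in seconds.

    A few warm-up calls run first so one-off costs such as model
    loading do not skew the figures.
    """
    if not emails:
        raise ValueError("need at least one email to time")
    for email in emails[:warmup]:
        classify(email)

    timings = []
    for email in emails:
        start = time.perf_counter()
        classify(email)
        timings.append(time.perf_counter() - start)

    timings.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(timings)) - 1)
    return {
        "median_s": statistics.median(timings),
        "p95_s": timings[p95_index],
        "max_s": timings[-1],
    }


# Stub classifier standing in for the real model.
stats = measure_latency(lambda email: "Billing", ["hello"] * 20)
print(stats)
```

Looking at the 95th percentile and the maximum, not just the median, shows whether the occasional long email blows the budget.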

The honest caveat: this only works because the task is narrow and I had labelled data. Fine-tuning a small model to be a general assistant is a fool's errand, and people who try it come away disappointed. But for one boring, bounded job that you do thousands of times, teaching a small model the reflex and retiring the elaborate prompt is genuinely the right call. It's less clever and it works better, which is usually a sign you've found the correct answer.