a full walkthrough of fine-tuning a small model for one dull task

I wrote a short note earlier about replacing a clever prompt with a fine-tuned small model for one boring classification task. This is the long version: the actual pipeline, the decisions, and the mistakes I made along the way, because the mistakes are where the useful detail lives. If you've got a narrow, repetitive job that a big model currently does adequately and expensively, this is the shape of the thing that might replace it.

The task, again: classify an incoming support email into one of fifteen fixed categories. High volume, fixed output space, lots of historical examples. About as friendly a fine-tuning target as you'll find.

start with the data, not the model

The temptation is to pick a model first because that's the fun bit. Don't. The data is the whole game with fine-tuning, and I lost a day learning that the boring way.

I had roughly four thousand emails that the old prompt-based system had categorised, with a human having corrected a subset of them. My first instinct was to throw all four thousand at the trainer. That was wrong, because a good chunk of those labels were the old model's guesses, errors included. Fine-tuning on your previous system's mistakes just teaches the new model to make the same mistakes with more confidence. Garbage in, fluent garbage out.

So I spent the time properly. I pulled out every example a human had verified, which was about twelve hundred. I sampled the rest and hand-checked a few hundred more, fixing the labels as I went. That left me with a smaller but genuinely clean training set, and a held-back evaluation set of three hundred examples that I never let the model see during training. The clean twelve hundred beat the dirty four thousand comfortably. More data is not better data.
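A minimal sketch of that filtering and hold-out step, assuming each record is a dict with hypothetical "text", "label" and "verified" fields (the field names are mine, not from any particular tool). The split is stratified by label so rare categories still show up in the evaluation set, and per-label rounding means the evaluation size is approximate.

```python
import random
from collections import defaultdict

def verified_only(records):
    """Keep only records whose label a human has checked."""
    return [r for r in records if r["verified"]]

def split_holdout(records, eval_size, seed=0):
    """Split records into (train, held_out), stratified by label.

    Each label gives up roughly the same fraction of its examples,
    but a label always keeps at least one example for training.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    fraction = eval_size / len(records)
    train, held_out = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_eval = min(len(items) - 1, max(1, round(len(items) * fraction)))
        held_out.extend(items[:n_eval])
        train.extend(items[n_eval:])
    return train, held_out
```

Fixing the seed matters more than it looks: it means the same three hundred examples stay held out every time you rerun the pipeline.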

A few data lessons worth stating plainly:

Balance the categories as best you can. If "Billing" is 60% of your examples and "API Bug" is 1%, the model learns to shout "Billing" and you'll think it's accurate until you look at the rare classes.
Keep the input format identical to production. If real emails arrive with headers and quoted replies, train on that, not on a cleaned-up version you'll never actually feed it.
Hold out your evaluation set before you do anything else, and never look at it until the end. Peeking is how you fool yourself.
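To make the balance point concrete, here is a small sketch that reports each category's share of the data and flags the rare ones. The 2% threshold is an arbitrary assumption for illustration, not a recommendation from my pipeline.

```python
from collections import Counter

def class_shares(labels):
    """Return each label's share of the dataset, most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}

def rare_classes(labels, min_share=0.02):
    """List labels whose share falls below min_share (threshold is a guess)."""
    shares = class_shares(labels)
    return sorted(label for label, share in shares.items() if share < min_share)
```

Anything this flags is a category to collect more examples for, or at least to check by hand in the evaluation results.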

the training itself

I used a small open instruct model in the few-billion-parameter range and a LoRA fine-tune. LoRA, low-rank adaptation, means you freeze the original weights and train a small set of adapter matrices instead. The practical upshot is that training fits on a single consumer GPU, finishes in a couple of hours, and produces a small adapter file rather than a whole new multi-gigabyte model. You can keep several adapters around and swap them, which is handy if you end up with more than one narrow task.
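Why the adapter is so small comes down to arithmetic. For one weight matrix of shape d_out x d_in, LoRA trains two thin matrices, B (d_out x r) and A (r x d_in), and uses W + BA in place of W. The sketch below just counts parameters; the 4096 x 4096 layer and rank 8 in the usage note are illustrative assumptions, not the settings I used.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare trainable parameters for one weight matrix.

    Full fine-tuning updates every entry of W (d_out * d_in values).
    LoRA freezes W and trains B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return full, adapter
```

For a hypothetical 4096 x 4096 layer at rank 8, that is 16,777,216 frozen weights against 65,536 trainable ones, so the adapter is 1/256 of the layer it modifies.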

The training data was formatted as instruction pairs: the email as input, the single category label as the target output. No category list in the prompt, no examples, because the whole point is to bake the behaviour into the weights so you don't pay for those prompt tokens on every call.

# the shape of each training example, roughly
{
  "instruction": "Classify this support email.",
  "input": "<the full raw email text>",
  "output": "Billing"
}
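Building those records is a few lines. This sketch assumes the trainer accepts JSON Lines (one JSON object per line), which is a common convention but an assumption here; check what your training library actually expects.

```python
import json

INSTRUCTION = "Classify this support email."

def to_record(email_text, label):
    """Build one training example in the shape shown above."""
    return {"instruction": INSTRUCTION, "input": email_text, "output": label}

def to_jsonl(pairs):
    """Serialise (email_text, label) pairs as one JSON object per line."""
    lines = (
        json.dumps(to_record(text, label), ensure_ascii=False)
        for text, label in pairs
    )
    return "\n".join(lines) + "\n"
```

Letting json.dumps handle the escaping matters, because raw emails are full of quotes, newlines and odd characters that will break hand-built JSON.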

The hyperparameters I'll spare you, because the honest truth is I used sensible defaults from the training library and changed almost nothing. The one knob that mattered was epochs: too many and the model overfit, memorising the training set and getting worse on the held-out evaluation. I watched the evaluation accuracy per epoch and stopped when it stopped improving, which was around three epochs. If your eval score is still climbing, keep going; when it plateaus or dips, you're done, and pushing further just teaches the model your training set by heart.
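The stopping rule is simple enough to write down. This is a sketch of the idea, not the training library's own early-stopping hook, and the patience parameter is my addition. It is safest to feed it accuracy from a validation split carved out of the training data, so the final held-out set stays untouched until the end.

```python
def best_epoch(accuracy_by_epoch, patience=1):
    """Return the 1-based epoch with the best score before progress stalls.

    Scanning stops once `patience` consecutive epochs fail to beat the
    best accuracy seen so far.
    """
    best_accuracy = float("-inf")
    best = 0
    epochs_without_gain = 0
    for epoch, accuracy in enumerate(accuracy_by_epoch, start=1):
        if accuracy > best_accuracy:
            best_accuracy, best = accuracy, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
    return best
```

With made-up scores of 0.80, 0.86, 0.89, 0.89, 0.87, this returns 3: the fourth epoch fails to improve on the third, so the third checkpoint is the one to keep.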

evaluating like you mean it

Accuracy on its own is a trap, especially with imbalanced classes. A model that always guesses the most common category can post a deceptively high number: if "Billing" is 60% of the mail, always answering "Billing" scores 60% while getting every other category wrong. So I looked at a confusion matrix: which categories get mistaken for which. That immediately showed me two categories the model genuinely couldn't tell apart, and when I looked, I couldn't reliably tell them apart either from the email alone. The fix wasn't more training; it was merging those two categories, which should never have been separate. The model was right and my taxonomy was wrong, which is a humbling but common outcome.
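A confusion matrix needs nothing beyond the standard library. The sketch below counts (true, predicted) pairs, computes per-category recall, and lists the most frequent mix-ups; the function names are mine, and a library such as scikit-learn offers the same thing ready-made.

```python
from collections import Counter, defaultdict

def confusion_counts(true_labels, predicted_labels):
    """Map each true label to a Counter of what the model predicted for it."""
    counts = defaultdict(Counter)
    for truth, predicted in zip(true_labels, predicted_labels):
        counts[truth][predicted] += 1
    return counts

def per_class_recall(counts):
    """Fraction of each true category that the model got right."""
    return {truth: row[truth] / sum(row.values()) for truth, row in counts.items()}

def most_confused_pairs(counts, top=3):
    """The (true, predicted) mistakes that occur most often."""
    mistakes = [
        ((truth, predicted), n)
        for truth, row in counts.items()
        for predicted, n in row.items()
        if truth != predicted
    ]
    return sorted(mistakes, key=lambda item: -item[1])[:top]
```

Per-class recall is the number that exposes the "shout Billing" failure, and the top confused pairs are where a taxonomy problem like mine shows up.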

After the merge and a second training pass on the cleaned data, the fine-tuned model landed comfortably ahead of where the prompt-based big model had been, on the same held-out set. Fewer mistakes, and crucially, no invented categories, because the model has only ever been trained to emit one of the fixed labels. That's a strongly learned habit rather than a hard guarantee, so it's still worth checking each output against the category list.
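A cheap guard for the "no invented categories" property, as a minimal sketch: compare the model's raw text against the fixed label set and treat anything else as unclassified. The function name and the choice to return None are assumptions for illustration.

```python
def parse_category(raw_output, allowed):
    """Return the label if the model emitted a known category, else None.

    Matching is exact after trimming whitespace, so a near miss such as
    a lowercase variant is treated as unknown rather than silently fixed.
    """
    cleaned = raw_output.strip()
    return cleaned if cleaned in allowed else None
```

In use, a None result would be routed to a human queue or a fallback rather than stored as a category.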

deployment, and what it costs

It runs locally. The adapter loads on top of the base model, the whole thing sits on a GPU I already had, and a classification comes back in well under a second with no network call. The running cost is electricity. The big-model version cost a fraction of a penny per call, which is fine until you remember the volume and the retries and the rate limits.
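The "fraction of a penny" arithmetic is worth doing once for your own numbers. This sketch models retries as a flat fraction of calls that get billed twice, which is a simplifying assumption, and every input is a placeholder you supply rather than a figure from my setup.

```python
def monthly_api_cost(calls_per_day, cost_per_call, retry_rate=0.0, days=30):
    """Estimate a month of per-call API spend.

    retry_rate is the fraction of calls that are retried once and
    therefore billed a second time (a deliberately crude model).
    """
    billed_calls = calls_per_day * (1 + retry_rate) * days
    return billed_calls * cost_per_call
```

Run it with your real volume and price per call, then compare the result with the electricity and the evening of maintenance.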

The trade I made is real and worth naming: I now own a model. It needs retraining when the categories drift, it needs the base model kept current, and it's one more thing in the estate to look after. That's not free. But for a task this narrow and this high-volume, owning a small specialist that does one job very well beats renting a generalist that does it pretty well, and the maintenance is an evening every few months rather than a constant API bill.

If I were giving one piece of advice to someone about to try this, it would be the data one, twice over. Spend your effort on a small, clean, honest dataset and an evaluation set you never peek at. The model is almost an afterthought. The fine-tune was the easy evening. The data prep was the actual work, and it's the only part that determined whether the thing was any good.