Ramblings of an aging IT geek

i didn't need a bigger model, i needed a smaller, dumber one

Fine-tuning a small base model with LoRA to do one narrow classification job reliably, why a 7B model beat prompting a much larger one, and what the whole process actually cost.


The task was deliberately boring: take a free-text support message and label it with one of eleven internal categories. That's it. No reasoning, no creativity, just a consistent mapping from messy human text to a fixed set of tags. We'd been doing it by prompting a large hosted model, and it worked, mostly. But it was slow, it cost a fraction of a penny per message, which added up alarmingly at volume, and every so often it would confidently invent a twelfth category that didn't exist.

I had a hunch that this was the wrong tool. A big general model is a polymath you're asking to do filing. What I wanted was a clerk who only knows the filing and does it instantly. So I tried fine-tuning a small model on the one task, and it turned out to be the right call by a comfortable margin.

Why small wins here

A 7B model knows far less about the world than a frontier model, but for a narrow classification job it doesn't need to. It needs to know your eleven categories and the patterns that map to them. Everything else is dead weight. Once you've shown it a few thousand examples of your actual data, the small model isn't guessing from general knowledge any more, it's pattern-matching against exactly your distribution. That's where it gets sharp.

The other wins are operational. A fine-tuned 7B runs locally on a single GPU, the inference is fast, the cost per message is essentially the electricity, and the data never leaves our infrastructure, which the people who own the support data were very happy about.


The data mattered far more than the model

I'll be honest about where the time actually went. The fine-tuning itself was a small fraction of the effort. The bulk was building a clean training set: pulling historical messages, getting their correct labels, and being ruthless about quality. A few hundred carefully checked examples per category beat tens of thousands of noisy ones. Garbage labels in, confidently wrong model out.

I formatted everything as a simple instruction/response pair, one category word as the output, nothing else.

{"instruction": "Classify this support message into one category.\n\nMessage: my card was charged twice for the same order",
 "response": "billing_duplicate_charge"}

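Keeping the output to a single known label from a fixed set is half the trick. The tokenizer may split that label into several tokens, but the model only ever sees eleven possible answers during training, which gives it almost nothing to hallucinate.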
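For what it's worth, here is a minimal sketch of how pairs in that format could be built and sanity-checked before training. The category names other than `billing_duplicate_charge` are made up for illustration, and `to_record` and `to_jsonl` are hypothetical helpers, not the script I actually ran.

```python
import json

# Hypothetical category set: only "billing_duplicate_charge" appears in the
# post; the other names are placeholders for the remaining categories.
CATEGORIES = {"billing_duplicate_charge", "login_problem", "shipping_delay"}

PROMPT = "Classify this support message into one category.\n\nMessage: {message}"


def to_record(message: str, label: str) -> dict:
    """Build one instruction/response pair, rejecting labels outside the set."""
    label = label.strip()
    if label not in CATEGORIES:
        raise ValueError(f"unknown label: {label!r}")
    return {
        "instruction": PROMPT.format(message=message.strip()),
        "response": label,
    }


def to_jsonl(rows) -> str:
    """Serialise (message, label) pairs as one JSON object per line."""
    return "\n".join(json.dumps(to_record(m, l)) for m, l in rows)


if __name__ == "__main__":
    rows = [("my card was charged twice for the same order",
             "billing_duplicate_charge")]
    print(to_jsonl(rows))
```

Failing loudly on an unknown label at this stage is the cheap way to catch a typo in the training data before the model learns it as a twelfth category.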

LoRA, not a full retrain

I used LoRA, which trains a small set of adapter weights on top of the frozen base model rather than updating all of it. The practical effect is that the whole thing fits on one consumer GPU and trains in a couple of hours, not days. The adapter is a few tens of megabytes you can swap in and out, so the base model stays untouched and reusable.
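The adapter size is easy to sanity-check with some arithmetic. LoRA adds two small matrices next to each adapted weight matrix, one of shape rank × d_in and one of shape d_out × rank. The sketch below assumes a typical 7B layout (hidden size 4096, 32 layers), two adapted projection matrices per layer, and 16-bit storage. Those figures are assumptions for illustration, not details of my run; only the rank of 16 is.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one weight matrix.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


# Assumed model shape (typical for a 7B transformer, not taken from the post).
HIDDEN = 4096
LAYERS = 32
ADAPTED_PER_LAYER = 2   # assumption: two attention projections per layer
BYTES_PER_PARAM = 2     # assumption: adapter saved in 16-bit precision
RANK = 16               # from the run described in the post

per_matrix = lora_param_count(HIDDEN, HIDDEN, RANK)
total_params = per_matrix * ADAPTED_PER_LAYER * LAYERS
size_mib = total_params * BYTES_PER_PARAM / 2**20
frozen_per_matrix = HIDDEN * HIDDEN

print(f"adapter params per matrix: {per_matrix:,}")
print(f"frozen params per matrix:  {frozen_per_matrix:,}")
print(f"total adapter params:      {total_params:,}")
print(f"adapter size:              {size_mib:.1f} MiB")
```

Under those assumptions each adapted matrix gains about 131 thousand trainable parameters against nearly 17 million frozen ones. Adapting more matrices per layer scales the total linearly, which is how you get from there to a few tens of megabytes.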

The hyperparameters I didn't agonise over. A low learning rate, a handful of epochs, and an eye on the validation loss to stop before it started memorising rather than generalising. The single most useful thing I did was hold back a proper test set the model never saw during training, so the accuracy number meant something.
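Holding back a test set is simple, but two details are easy to get wrong: duplicates leaking across the split, and rare categories vanishing from the test side. A rough sketch of one way to handle both, using only the standard library, looks like this. The `stratified_split` helper and the 15% default are illustrative assumptions, not what I actually used.

```python
import random
from collections import defaultdict


def stratified_split(examples, test_fraction=0.15, seed=0):
    """Split (message, label) pairs into train and test, per category.

    Exact-duplicate messages (after lowercasing and trimming) are dropped
    first so the same text cannot land on both sides of the split.
    """
    seen = set()
    by_label = defaultdict(list)
    for message, label in examples:
        key = message.strip().lower()
        if key in seen:
            continue
        seen.add(key)
        by_label[label].append((message, label))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Keep at least one test example per category.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

Splitting per category rather than over the whole pile means the accuracy number covers all eleven labels, not just the common ones.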

# rough shape of the run
base model:        7B instruct
method:            LoRA, rank 16
training examples: ~4,000 (hand-checked)
held-out test:     ~600
epochs:            3
hardware:          one 24GB GPU
wall-clock:        about 2 hours
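The "eye on the validation loss" part can be automated with a plain early-stopping rule: keep the checkpoint with the lowest validation loss and stop once it has failed to improve for a set number of evaluations. This is a minimal sketch of that rule. The `patience` default and the loss values in the example are made up, and `best_checkpoint` is a hypothetical helper, not part of any training library.

```python
def best_checkpoint(val_losses, patience=2, min_delta=0.0):
    """Return (best_index, stop_index) for a sequence of validation losses.

    best_index is the evaluation with the lowest loss seen so far;
    stop_index is the evaluation at which training would be halted,
    or the last index if the patience budget is never exhausted.
    """
    best = float("inf")
    best_index = 0
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_index, bad_evals = loss, i, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_index, i
    return best_index, len(val_losses) - 1


if __name__ == "__main__":
    # Illustrative numbers only: loss falls, then starts creeping back up.
    losses = [0.90, 0.60, 0.50, 0.55, 0.70]
    print(best_checkpoint(losses))
```

A validation loss that turns upward while training loss keeps falling is the usual sign of memorising, and the adapter saved at `best_index` is the one worth keeping.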

The result

On the held-out set the fine-tuned 7B matched the big hosted model on accuracy and beat it on consistency: it never once produced a label outside the eleven categories, because as far as it was concerned no other labels existed. Inference dropped from a network round-trip to a local call measured in milliseconds. The running cost fell off a cliff.
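The two numbers that mattered, accuracy and how often a label fell outside the allowed set, take a few lines to compute. This is a sketch of the kind of check involved, with a hypothetical `evaluate` helper, not my actual evaluation script.

```python
def evaluate(predictions, gold, allowed):
    """Score predicted labels against gold labels.

    Returns accuracy, the share of predictions outside the allowed set,
    and the distinct out-of-set labels that were produced.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must be the same length")
    if not gold:
        raise ValueError("nothing to evaluate")

    cleaned = [p.strip() for p in predictions]
    correct = sum(p == g for p, g in zip(cleaned, gold))
    out_of_set = [p for p in cleaned if p not in allowed]
    return {
        "accuracy": correct / len(gold),
        "out_of_set_rate": len(out_of_set) / len(gold),
        "out_of_set_labels": sorted(set(out_of_set)),
    }
```

Tracking the out-of-set rate separately is worthwhile because an invented category is a different kind of failure from a wrong-but-valid one: the first breaks whatever consumes the label downstream, the second just misfiles a message.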


I don't want to over-sell this. Fine-tuning a small model is the right answer for a narrow, stable, high-volume task with good training data. It is the wrong answer for anything open-ended, anything that changes weekly, or anything where you don't have clean labelled examples to learn from. For those, keep prompting the big model and pay the toll.

But the instinct to reach for a bigger, smarter model when something isn't quite working is worth resisting. Sometimes the problem isn't that your model is too dumb. It's that it's too clever for a job that wanted a specialist. The smaller, dumber, narrower model did the boring task better, faster, and cheaper, and it never tried to be interesting about it. For filing, that's exactly what you want.