I did not set out to fine-tune anything. I had a pile of supplier emails, all roughly the same shape, and I wanted three fields out of each one: order reference, delivery date, total. A regex got me most of the way and then died a slow death against every supplier who formatted things their own special way.
Prompting a general model worked, but it was overkill and the results drifted. So I fine-tuned a small one to do exactly this, and nothing else.
The dataset is the work
This is the part nobody tells you cleanly: the model is almost an afterthought, the dataset is the job. I spent one evening writing training code and three evenings building examples. I hand-labelled about 400 emails into a plain JSONL file, one record per line: the email text as the input, and the exact JSON I wanted back as the output, stored as a string.
{"input": "Subject: Order AB-2231 confirmed...", "output": "{\"ref\":\"AB-2231\",\"date\":\"2026-06-01\",\"total\":\"148.50\"}"}
Four hundred felt small. It was enough. The task is narrow, the variety is low, and the model only has to learn one mapping rather than the whole of human language.
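With a hand-built file like this, a quick structural check before training is cheap insurance. The sketch below assumes the record shape shown in the example line (an "input" string and an "output" string holding JSON with ref, date and total); the function name check_record is made up for illustration.

```python
import json

# The three fields from the example record above.
REQUIRED_KEYS = {"ref", "date", "total"}


def check_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"line is not valid JSON: {exc.msg}"]

    if not isinstance(record, dict) or set(record) != {"input", "output"}:
        return ["expected an object with exactly 'input' and 'output'"]
    if not isinstance(record["input"], str) or not isinstance(record["output"], str):
        return ["'input' and 'output' must both be strings"]

    # The label is itself JSON, serialized into a string.
    try:
        target = json.loads(record["output"])
    except json.JSONDecodeError as exc:
        return [f"'output' is not valid JSON: {exc.msg}"]
    if not isinstance(target, dict):
        return ["'output' must decode to a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - set(target)
    extra = set(target) - REQUIRED_KEYS
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    return problems
```

Run over every line of the file, it turns a silent labelling slip (a misspelled key, a broken escape) into a line number to go and fix.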
The labelling itself taught me where my data was messy. A surprising number of those emails had two order references because of part-shipments, and a handful quoted the total in a currency I had not accounted for. None of that was visible until I sat there reading examples one by one. By the time I finished the dataset I understood the problem far better than I had when I started, which is the quiet bonus nobody mentions: curating training data is also the most thorough audit of your inputs you will ever do.
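Reading every email is what surfaced these problems, but a rough scan can point at where to look first. This is a sketch under loud assumptions: the reference pattern is guessed from the single example "AB-2231", and the list of currency markers is invented for illustration, so both would need adjusting to real data.

```python
import re
from collections import Counter

# Hypothetical pattern: two capitals, a hyphen, four digits, as in "AB-2231".
REF_PATTERN = re.compile(r"\b[A-Z]{2}-\d{4}\b")
# Hypothetical set of currency markers worth counting.
CURRENCY_PATTERN = re.compile(r"[€£$¥]|\b(?:USD|EUR|GBP|JPY)\b")


def audit(emails):
    """Flag emails with more than one distinct reference and count currency markers.

    Returns (multi_ref, currencies), where multi_ref is a list of
    (index, sorted references) and currencies counts how many emails
    mention each marker.
    """
    multi_ref = []
    currencies = Counter()
    for index, text in enumerate(emails):
        refs = set(REF_PATTERN.findall(text))
        if len(refs) > 1:
            multi_ref.append((index, sorted(refs)))
        # Count each marker once per email, however often it repeats.
        currencies.update(set(CURRENCY_PATTERN.findall(text)))
    return multi_ref, currencies
```

A scan like this does not replace reading the examples; it only shows which ones deserve a second look before they are labelled.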
LoRA, on a single card
I used LoRA rather than a full fine-tune, which meant I trained only a small set of adapter weights and left the base model frozen. That is the difference between "fits on my one GPU in an evening" and "rent a cluster". I took a small instruct base, the 400 or so labelled examples, and three epochs, and the whole run finished in under twenty minutes on my 12GB card.
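The arithmetic behind "a small set of adapter weights" is worth seeing once. LoRA leaves a weight matrix of shape d_out x d_in frozen and learns two thin matrices of shapes d_out x r and r x d_in in its place, so the trainable count per adapted matrix is r * (d_in + d_out). The sizes below (a 2048 x 2048 layer, rank 8) are illustrative assumptions, not the configuration used in this post.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Weights in one dense layer, i.e. what a full fine-tune would update."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in the two low-rank LoRA matrices (d_out x rank and rank x d_in)."""
    return rank * (d_in + d_out)


# Illustrative sizes only: one 2048 x 2048 projection with a rank-8 adapter.
d_in, d_out, rank = 2048, 2048, 8
full = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
print(f"full: {full:,}  lora: {lora:,}  share: {lora / full:.2%}")
```

For those sizes the adapter comes to 32,768 weights against 4,194,304 in the frozen matrix, under 1% of the layer, which is why the gradients and optimizer state fit comfortably on one card.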
The result was a model that does the one boring task better than my regex and more reliably than the general prompt. It extracts the three fields, returns clean JSON, and when it sees a supplier format it has never met it usually still gets it right, which the regex never managed.
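A claim like "better than the regex" is easiest to trust when it is counted on held-out emails. The sketch below scores exact matches per field and treats any output that is not a JSON object as wrong on all three; the function names and the exact-match rule are assumptions for illustration, not the evaluation behind the results above.

```python
import json

FIELDS = ("ref", "date", "total")


def parse_prediction(text):
    """Return the decoded object if text is a JSON object, otherwise None."""
    try:
        value = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    return value if isinstance(value, dict) else None


def field_accuracy(predictions, gold):
    """Exact-match accuracy per field over paired prediction and gold strings."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("need equal-length, non-empty prediction and gold lists")

    correct = {field: 0 for field in FIELDS}
    for pred_text, gold_text in zip(predictions, gold):
        pred = parse_prediction(pred_text) or {}  # unparseable output scores zero
        target = json.loads(gold_text)
        for field in FIELDS:
            if field in pred and pred[field] == target[field]:
                correct[field] += 1
    return {field: correct[field] / len(gold) for field in FIELDS}
```

Feeding the regex output and the model output through the same function, against the same held-out labels, gives two sets of numbers that can be compared directly.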
What I would tell past me
Do not fine-tune to make a model cleverer. Fine-tune to make it consistent at something specific. The wins came from narrowing the task and curating the examples, not from any heroics with hyperparameters. I touched the learning rate once, shrugged, and left it.
The unglamorous version of machine learning is the useful one. A small model, a small dataset, one dull job done well, and a regex I was finally allowed to delete.