a tiny model on a tinier machine


I have an old mini PC under the telly running a few homelab odds and ends. It has eight gigs of RAM and an Intel chip that was unremarkable even when it was new. On a whim I built llama.cpp on it and pointed a small quantised model at it, fully expecting it to swap into oblivion.

It runs. Not quickly, just a handful of tokens a second on a Q4 quant of a 3B model, but it runs, on CPU, on a box I bought to be a glorified DNS server. The trick is that quantisation does the heavy lifting, shrinking the weights to roughly a quarter of their 16-bit size, and llama.cpp is genuinely lean about memory. No GPU, no fans screaming, no cloud bill.
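To see why eight gigs is enough, here is a back-of-envelope sketch of the weight memory. It assumes exactly 3 billion parameters and llama.cpp's Q4_0 layout (blocks of 32 weights stored as 4-bit values plus one fp16 scale, i.e. 4.5 bits per weight). The post doesn't say which Q4 variant was used, so Q4_0 is an assumption here; K-quants such as Q4_K_M come out slightly larger. KV cache and runtime overhead are ignored.

```python
# Rough weight-memory estimate for a quantised model (weights only).
# Assumptions: exactly 3e9 parameters; Q4_0-style blocks of 32 4-bit
# weights plus one fp16 scale (18 bytes per 32 weights = 4.5 bits/weight).
# KV cache, context buffers and OS overhead are NOT included.

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to hold n_params weights at the given bits per weight."""
    return n_params * bits_per_weight / 8

Q4_0_BITS = (32 * 4 + 16) / 32  # 32 four-bit weights + one 16-bit scale -> 4.5
FP16_BITS = 16.0                # unquantised half-precision weights

n = 3e9
fp16_gb = weight_bytes(n, FP16_BITS) / 1e9
q4_gb = weight_bytes(n, Q4_0_BITS) / 1e9

print(f"fp16: {fp16_gb:.2f} GB")  # fp16: 6.00 GB
print(f"Q4_0: {q4_gb:.2f} GB")    # Q4_0: 1.69 GB
```

Roughly 1.7 GB of weights leaves most of the 8 GB free for the OS, the context cache and the box's actual homelab duties, whereas the 6 GB of fp16 weights would already be crowding it.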

It is not going to write your code. But for a little local "summarise this paragraph" endpoint that never leaves the house, it is honestly fine, and there is something quietly satisfying about a language model purring along on hardware that has absolutely no business doing this. The bar for "good enough, locally" has dropped further than I'd noticed.