The machine in question is an old desktop that was due for the tip: a six-core CPU from a few generations ago, 32GB of DDR4, and a GPU so feeble it would be an insult to call it one. The plan was to find out how far llama.cpp could push it before it gave up. The answer, irritatingly for my disposal plans, was "further than it has any right to."
The whole trick is quantisation. A modern model at full precision wants more memory than this box has, but the GGUF format and the quant levels llama.cpp ships with let you trade a little quality for an enormous amount of headroom. I pulled a 7B-class model down at a 4-bit quant and it fit into RAM with room to spare. That is the entire reason any of this works on a machine like mine.
Build was unremarkable, which is praise:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
No CUDA, no fuss, just a plain CPU build. Then point it at the GGUF and start the server:
./build/bin/llama-server \
-m models/model-q4_k_m.gguf \
-c 4096 \
-t 6
The thing nobody tells you when you start down this road is that on a CPU you are not compute-bound, you are memory-bandwidth-bound. Token generation is essentially streaming the model's weights through the cores, and the bottleneck is how fast you can move those weights out of RAM, not how fast the cores can multiply. That is why a smaller quant is faster as well as smaller: there are simply fewer bytes to haul around per token. It also explains why setting the thread count to the number of physical cores helped and setting it higher hurt. Past a point you are just adding contention, not throughput.
So how fast? Single-digit tokens per second on the 4-bit 7B, which is comfortably faster than I can read and perfectly usable for a chat-style back and forth. Prompt processing on a long context is the painful bit: feeding it a couple of thousand tokens of context takes a noticeable beat before the first reply token appears. For interactive use that is fine. For batch work over big documents it would test your patience.
Is it good enough to be useful? For drafting, summarising, rubber-ducking a problem and asking the sort of question I would otherwise have typed into a search box, yes, easily. It is not going to out-reason the big hosted models, and I did not expect it to. What it gives me is a model that runs entirely on a machine I own, with no API key, no per-token cost and no data leaving the house, on hardware that was about to become landfill.
That last part is the bit I keep coming back to. The interesting story in local inference is not the top of the range, it is the bottom. The fact that a dead-end desktop with no GPU can serve a genuinely helpful assistant says more about how far quantisation and llama.cpp have come than any benchmark on a rack of accelerators. The old box has earned a stay of execution. It lives under the desk now, quietly answering questions.