Running llama.cpp on a Machine That Should Know Better


The machine in question is a fanless mini PC I bought years ago to run Pi-hole and a couple of cron jobs. Four cores, 16GB of RAM, an integrated GPU that has never once been asked to do anything more strenuous than draw a login prompt. By every reasonable measure it has no business running a language model. So naturally I spent a wet weekend making it do exactly that, and I'm pleased to report it works, for a sensible definition of "works".

The point of this, beyond the obvious "because it's there", was to find out where the floor is. Everyone benchmarks the big rigs. I wanted to know what the cheap box in the cupboard could actually manage, because that's the hardware most homelabs are full of, and a model you run locally on hardware you already own is a different proposition to one you rent by the token.

Getting it built

llama.cpp builds cleanly from source, which is half the reason it's the right tool for this job. No CUDA, no Python dependency hell, just a C++ codebase and a Makefile that knows about CPU features.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j4

On this box the integrated GPU is useless for offload, so it's pure CPU inference. What matters most there is that the build picks up the right SIMD instructions. The mini PC has AVX2, which llama.cpp uses for a meaningful speedup, and the build detected it without me asking. If you're on something older without AVX2 you can still run it; it's just slower, and below a certain point "slower" tips over into "not worth it".
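If you'd rather check for AVX2 yourself than trust the build output, Linux lists the CPU's feature flags in /proc/cpuinfo. Here is a minimal sketch in Python, assuming an x86 Linux box (ARM kernels label the line "Features" rather than "flags"). The helper name cpu_flags is made up for this example and is not part of llama.cpp.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the text of /proc/cpuinfo (x86 Linux)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Every core repeats the same line, so the first one is enough.
            return set(value.split())
    return set()


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        flags = cpu_flags(cpuinfo.read_text())
        # Instruction sets worth having for CPU inference.
        for name in ("avx", "avx2", "fma", "f16c"):
            print(f"{name:5} {'yes' if name in flags else 'no'}")
    else:
        print("No /proc/cpuinfo here; this check only works on Linux.")
```

If avx2 comes back "no", expect the slower path described above.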

Choosing a model that fits

16GB of RAM sounds like plenty until you remember the OS and everything else wants its share. A 7B model at full precision is out of the question: at 16 bits per weight the weights alone come to roughly 14GB, and at 32 bits it's double that. Quantisation is the whole game here. I pulled a 7B model in GGUF format at Q4_K_M, which is the quantisation level I keep coming back to as the sweet spot: small enough to fit comfortably, good enough that the quality loss is real but not ruinous for everyday use.

The arithmetic is roughly: a 7B model at Q4 lands around 4 to 4.5GB on disk and a bit more in memory once you account for the context. That leaves headroom on a 16GB box, which matters because the moment you swap, inference speed falls off a cliff and you may as well go and make tea.
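That arithmetic can be sketched in a few lines of Python. The figures here are my assumptions rather than anything llama.cpp reported: about 7.24 billion parameters and the published Mistral 7B shape (32 layers, 8 key/value heads, head dimension 128), an average of roughly 4.85 bits per weight for Q4_K_M, and a 16-bit KV cache. Treat the result as a ballpark, since the runtime also needs some scratch memory on top.

```python
def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantised weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for a full context window.

    Each layer stores one key and one value vector per token, hence the 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


# Assumed figures for a Mistral-7B-style model at Q4_K_M with -c 2048.
weights = weights_bytes(n_params=7.24e9, bits_per_weight=4.85)
kv = kv_cache_bytes(n_ctx=2048, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"weights : {weights / 1e9:.2f} GB")
print(f"kv cache: {kv / 1024**2:.0f} MiB")
print(f"total   : {(weights + kv) / 1e9:.2f} GB")
```

With these assumptions the weights come to about 4.4GB and the cache at a 2048-token context to 256 MiB. That is why a modest context barely moves the total, while a very long one starts to.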

./llama-cli -m models/mistral-7b-instruct-q4_k_m.gguf \
  -p "Explain what a fencepost error is, briefly." \
  -n 256 -c 2048 -t 4


What "useful" actually means at this speed

Here's the honest number: I get somewhere around 5 to 7 tokens per second on a 7B Q4 model with all four threads going. That is not fast. It's slower than you can read comfortably, and it's a world away from anything GPU-accelerated. For a back-and-forth chat it's frankly a bit tiring: you ask a question and watch the words trickle out.
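To put that in wall-clock terms, here is a rough sketch of how long a full reply takes at those rates. It assumes the reply runs to the whole 256-token limit from the earlier command and ignores the time spent reading the prompt, so real waits will be a little longer.

```python
def seconds_for(n_tokens: int, tokens_per_second: float) -> float:
    """Generation time for n_tokens at a steady rate, ignoring prompt processing."""
    return n_tokens / tokens_per_second


for rate in (5, 7):
    print(f"{rate} tok/s: 256 tokens in about {seconds_for(256, rate):.0f} s")
```

That works out at roughly 37 to 51 seconds for a full-length answer.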

But "interactive chat" isn't where this earns its keep. The things that work well at 5 tokens a second are the things that don't need to be instant. Summarising a document I've fed it. Drafting a first pass at some boilerplate I'll rewrite anyway. Classifying a batch of text overnight, where the whole point is that nobody's waiting. Run as a queue rather than a conversation, the speed stops mattering, because the machine is working while I'm not.
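The queue idea can be as simple as a loop that writes each result to disk as it goes, so an interrupted overnight run picks up where it left off. This is a sketch of the shape rather than my actual job: run_queue and the JSON-lines output file are made up for illustration, and generate stands in for whatever function actually calls the model.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_queue(items: Iterable[str], generate: Callable[[str], str],
              out_path: Path) -> int:
    """Run generate() over items one at a time, appending results as JSON lines.

    Items already present in out_path are skipped, so the job can be restarted.
    Returns the number of items processed in this run.
    """
    done: set[str] = set()
    if out_path.exists():
        with out_path.open() as f:
            done = {json.loads(line)["input"] for line in f if line.strip()}

    processed = 0
    with out_path.open("a") as f:
        for item in items:
            if item in done:
                continue
            record = {"input": item, "output": generate(item)}
            f.write(json.dumps(record) + "\n")
            f.flush()  # keep finished work even if the run is killed later
            done.add(item)
            processed += 1
    return processed
```

Something like run_queue(texts, classify, Path("results.jsonl")) then runs unattended, where classify is your own function wrapping the model call.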

The quality at Q4 is the other half of the trade. For straightforward tasks (summarise this, reformat that, answer a factual question that's in its training data) it's genuinely fine. Where it falls down is anything needing careful multi-step reasoning, or anything where a subtle error matters. The quantisation rounds off exactly the precision you'd want for the hard cases. So I treat it as a competent intern that's quick with the boring stuff and not to be trusted with the fiddly stuff, which is a fair description of what it is.

A couple of practical notes for anyone trying this on similar hardware. Pin the thread count to your physical cores, not the hyperthreaded count, because on a four-core box -t 4 consistently beat -t 8 for me; the extra threads just fought each other for the same execution units. Keep the context window modest too. I left it at 2048 because every extra token of context costs memory and slows the per-token maths, and for the batch work I'm doing I rarely need more. If you do need a long context, watch the memory closely, because the moment you tip into swap the whole thing becomes unusable rather than merely slow. And if you want it running as a service rather than a one-shot, llama-server gives you a small HTTP endpoint you can point a script at, which is how I ended up wiring the overnight classification job to it.
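For the script side of that, here is a minimal sketch of a client using only the Python standard library. It assumes llama-server is running locally on its default address of 127.0.0.1:8080 and uses its /completion endpoint, which takes a JSON body with prompt and n_predict and returns the generated text under content. The helper names build_request and complete are mine, and the generous timeout is a guess suited to a slow CPU.

```python
import json
import urllib.request


def build_request(prompt: str, n_predict: int = 128,
                  base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    """Build a POST request for llama-server's /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, n_predict: int = 128,
             base_url: str = "http://127.0.0.1:8080",
             timeout: float = 600.0) -> str:
    """Send the prompt and return the generated text.

    The long timeout is deliberate: at a few tokens a second,
    a reply can take the best part of a minute.
    """
    request = build_request(prompt, n_predict, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["content"]
```

The server itself takes the same model, context and thread flags as the one-shot command, along the lines of ./llama-server -m models/mistral-7b-instruct-q4_k_m.gguf -c 2048 -t 4.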

Was it worth it

Yes, with caveats. The novelty of a language model running entirely offline on a box that draws about as much power as a light bulb hasn't worn off. There's no API key, no rate limit, no data leaving the house, and no bill. For a homelab that's a genuinely nice property, and it's the reason I'd point people at llama.cpp over the heavier frameworks for this kind of hardware: it asks almost nothing of the machine beyond a compiler and some RAM.

The caveat is that you must be honest about the floor. This is not a replacement for the hosted frontier models, and anyone telling you a 7B Q4 on a CPU is "just as good" is selling something. It's a different tool for a different job: small, local, private, slow, and surprisingly capable within its lane. If your expectations match that, the cheap box in the cupboard will pleasantly surprise you. If they don't, you'll be disappointed at 5 tokens a second and you'll go back to the cloud, which is also a perfectly reasonable outcome.

What I like most is what it says about where the floor has moved to. A few years ago the idea of running anything resembling a useful language model on a fanless mini PC would have been absurd. Now it's a weekend project with a Makefile. The hardware hasn't changed. The models, and the people quantising them, have. That's the genuinely impressive part, and it didn't need a keynote to tell me so.