running a 7b model on hardware that should know better

A small robot, lights blinking, working harder than it should

The machine in question is an old desktop with a six-core Ryzen, 32GB of DDR4, and no GPU worth the name. It is the box I use as a always-on tinkering server, and the idea of it running a language model locally felt faintly absurd. It is absurd. It also works, comfortably enough that I now use it daily for small drafting and summarising jobs that I would rather not send off to an API. This is how I got there with llama.cpp, and which knobs actually mattered.

the model and the quantisation

The single biggest decision is quantisation, and it is the one people overthink. The whole reason CPU inference is feasible at all is that you are not running the model in full precision. A 7B model in fp16 is 14GB and crawls; the same model quantised to a 4-bit GGUF is roughly 4GB and runs at a usable speed.

I settled on a Q4_K_M quant of a 7B instruct model. The K_M variants use a mixed scheme that keeps the more sensitive weights at higher precision, and the quality difference against the heavier Q5 and Q6 quants is, for my use, not worth the extra memory bandwidth they demand. On a CPU you are bandwidth-bound, not compute-bound, so a smaller file genuinely runs faster. I tried Q8_0 out of curiosity and it was both larger and slower for no quality I could perceive in summarisation.

A circuit board, close up, doing more than it was designed for

building it for the actual CPU

The default build is portable and therefore slow. The first real speed-up came from building llama.cpp with the optimisations my chip actually supports rather than a lowest-common-denominator binary.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j

GGML_NATIVE=ON lets the compiler target the host CPU directly, so AVX2 and FMA get used instead of sitting idle. On this Ryzen that alone bought me a noticeable jump in tokens per second over a generic build. If you are deploying the binary somewhere other than where you built it, do not use native, but for a box that builds and runs the same model, it is free performance.

the threads lie everyone tells

The intuitive move is to throw every thread at it. --threads 12 on a six-core, twelve-thread part. It is slower. CPU inference of this kind saturates memory bandwidth long before it saturates the cores, and oversubscribing with hyperthreads just adds contention.

The number that worked best was --threads 6, matching physical cores, and on this machine that was meaningfully faster than 12. Measure it on yours, because it depends on your memory subsystem, but the rule of thumb "physical cores, not logical" held. I also found that pinning with taskset to keep the threads off the cores doing other work smoothed out the variance, though it did not change the average much.

The other lever is context size. Every token of context you reserve costs memory and slows the prompt-processing phase. I run with --ctx-size 4096 rather than the model's full window because my jobs do not need more, and a smaller KV cache means faster first-token latency.

serving it

I do not call the CLI directly. llama.cpp ships a server with an OpenAI-compatible endpoint, which means everything I already have that speaks to an OpenAI API can point at it with a changed base URL.

./build/bin/llama-server \
  -m models/mistral-7b-instruct-Q4_K_M.gguf \
  --threads 6 \
  --ctx-size 4096 \
  --host 127.0.0.1 --port 8080

Then a curl to http://127.0.0.1:8080/v1/chat/completions behaves like the real thing. I put it behind the reverse proxy I already run, scoped to the LAN, and that is the whole deployment. No Docker, no Python environment to rot, one static binary and a model file.

is it actually good

For the honest use cases, yes. Summarising a long email thread, drafting a first pass at a commit message, reformatting some notes, pulling structure out of a wall of text: all of these it does well enough that the round trip to a frontier API is not worth it. First token lands in a second or two and it streams out at a pace I can read along with.

What it is not: a coding assistant I would trust unsupervised, or a replacement for the big models when I actually need the reasoning. A 7B model at 4-bit on a CPU is a competent, fast, slightly dim assistant. Knowing which jobs are "competent and fast" jobs, and routing only those to the local box, is the trick. The rest still goes to the proper tools.

The genuinely pleasing part is the independence. This thing runs with no API key, no rate limit, no per-token meter ticking, and no data leaving the house. That it does so on hardware I was about to retire is the bit that still makes me smile.