running a 7b model on a laptop that should know better

A small robot beside a circuit board

I had a free evening and a laptop with no dedicated GPU, and I wanted to know whether running a language model locally was a genuinely useful thing or a party trick. The honest answer is that it's both, and which one you get depends entirely on how much you ask of it. But the fact that it runs at all, on hardware that has no business doing this, is the part that still makes me grin.

The tool is llama.cpp, and the whole project is a quiet marvel. It's the inference engine that made local models go from "you need an A100" to "you need a laptop and some patience". It does this through quantisation: taking model weights stored as 16-bit floats and squeezing them down to 4 or 5 bits each, trading a sliver of quality for an enormous saving in memory and compute. A 7-billion-parameter model that would want 14GB at full precision drops to around 4GB at 4-bit, and suddenly it fits in the RAM I actually have.

getting it running

Building it is refreshingly old-fashioned. Clone, make, done. No Python environment to wrangle, no CUDA toolkit to match against a driver version, none of the dependency archaeology that usually accompanies anything in this space.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Then you need a model in GGUF format, which is the file format the project settled on after some churn earlier in the year. You can pull pre-quantised ones straight off Hugging Face. I grabbed a 4-bit quantisation of a 7B chat model, dropped it in a models directory, and pointed the binary at it:

./main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -p "Explain a unix pipe to someone who knows what a function is." \
  -n 256 -t 8

The -t 8 tells it how many threads to use, and on a CPU-only machine that's the dial that matters most. Match it to your physical cores, not your logical ones, because hyperthreading does very little for this workload and oversubscribing just adds scheduling overhead. I learned that the slow way, watching tokens crawl while every core sat at 60%.

A circuit board close up

what it's actually like

The first thing you notice is the speed, or the lack of it. On this laptop I get something like 6 to 9 tokens a second with the 7B model. That's slower than reading speed but not painfully so, and it's a world away from the watch-paint-dry experience I'd braced for. For a chatty back-and-forth it's fine. For generating a long document you'll go and make tea.

The second thing you notice is the quality ceiling. A quantised 7B model is not going to out-reason the big hosted models, and pretending otherwise leads to disappointment. Ask it something fiddly and it'll confidently produce something plausible and wrong. But that's the wrong frame. The question isn't whether it beats the frontier. It's whether it's useful for the things you'd actually run locally, and for a surprising number of those, it is.

where the line falls

Here's where I landed after a few evenings of poking at it. The model is genuinely useful for the small, bounded, repetitive language tasks where I don't want to round-trip to an API: rephrasing a paragraph, drafting a commit message from a diff, summarising a chunk of text, turning rough notes into prose. None of those need a frontier model, all of them benefit from running instantly and privately, and the failure mode when it gets one wrong is that I notice and fix it in a second.

It is not useful, for me, as a coding assistant of any depth, or for anything that needs current facts, or for reasoning across more than a couple of steps. The bigger hosted models are better at those by a wide margin and it isn't close. I'd rather use the right tool than be precious about keeping everything local.

What I didn't expect was how much the privacy angle would change my behaviour. There's a category of thing I'd never paste into a hosted model, half-formed ideas, work-adjacent text, the contents of a personal note, and having a local model means those get the same treatment as everything else. The data never leaves the laptop. That alone justified the evening.

The takeaway, then, isn't that local models have caught up. They haven't, and the gap is real. It's that the floor has dropped through the basement. A capable-enough model now runs on ordinary hardware with no GPU, no account, and no network, and that quietly opens a door that was firmly shut a year ago. The party trick turned out to do real work after all, as long as I'm honest about which work to give it.