running a 13b model on a laptop that cost less than the gpu i wanted

A small robot rendered on a circuit-board background

The machine in question is a four-year-old ThinkPad with integrated graphics and 16GB of RAM. It has no business running a language model. The received wisdom, until very recently, was that you needed a beefy NVIDIA card and a pile of VRAM, and if you didn't have one you watched other people have all the fun. llama.cpp quietly demolished that assumption, and I have spent an evening being pleasantly surprised.

The trick is quantisation. The original weights are 16-bit floats, which is how you end up needing 26GB just to hold a 13B model in memory. llama.cpp ships with GGUF quantised variants that pack each weight down to four bits or thereabouts, with a clever scheme to keep the accuracy loss small. A Q4_K_M quant of a 13B model lands around 7 to 8GB on disk, which fits in RAM on this laptop with room to spare. CPU inference, no GPU required.

Build is the usual cmake dance, and it pulls in nothing exotic:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
./main -m models/llama-2-13b.Q4_K_M.gguf -p "Explain a mutex like I am tired" -n 256 -t 8

The -t 8 matters more than I expected. It pins the number of threads, and on a machine with eight logical cores the sweet spot was the physical core count, not the logical one. Going past it made things slower, presumably because the hyperthreads were squabbling over the same execution units rather than adding anything. Worth measuring on your own hardware rather than trusting mine.

A close-up of a circuit board

Performance is the honest part. I get roughly four to five tokens a second on the 13B Q4 quant. That is not fast. It is slower than you read, and noticeably slower than any hosted API. For a chat where you sit and wait, it is just on the right side of usable. For anything throughput-shaped, batch jobs, summarising a directory of notes, it is a non-starter unless you are happy to leave it running overnight.

But the point isn't the speed. The point is that it runs at all, locally, with no account, no API key, no data leaving the laptop, and no monthly bill. I can ask it about a half-finished thought at eleven at night with the wifi off and it answers. The 7B models run closer to nine or ten tokens a second and are perfectly fine for quick questions, and that is the configuration I have actually kept.

There is something genuinely good happening here, and it is easy to miss amongst the noise about ever-larger hosted models. The frontier is moving in two directions at once. One is up, towards models nobody can run at home. The other is sideways and down, towards models small and well-quantised enough that a second-hand laptop can hold a conversation. The second direction is the one I find myself caring about, because it is the one I can actually own.