Ramblings of an aging IT geek
running a 7b model on a thin client i bought for a tenner

Getting llama.cpp to run a quantised 7B model on an ancient low-power thin client, and being pleasantly surprised it works at all.

I have a thin client that cost me about a tenner at a recycling sale. It has a low-power dual-core CPU, eight gigabytes of RAM that I added myself, and the thermal design of a paperback book. It is, in every sense, the wrong machine to run a language model on. So naturally I tried.

The point of llama.cpp is that it doesn't care how unsuitable your hardware is. It's CPU-first, it leans on GGUF quantised weights, and it will quite happily run on things that should be running a thermostat. I pulled a 7B model down at a 4-bit quantisation, which gets the whole thing under five gigabytes, and built llama.cpp from source because the prebuilt binaries assumed an instruction set this CPU has never heard of.

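# Fetch the source and build it here, so the compiler targets this CPU's own instruction set
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Recent checkouts build with CMake (the old Makefile build is deprecated); GGML_NATIVE=ON compiles for the host CPU
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j 2
# Generate up to 128 tokens from a single prompt; the model path is wherever you saved the GGUF
./build/bin/llama-cli -m models/model.Q4_K_M.gguf -p "explain a unix pipe to a cat" -n 128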
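
For anyone wondering where "under five gigabytes" comes from, it is back-of-envelope arithmetic: parameter count times bits per weight, divided by eight. The sketch below takes "7B" at face value as seven billion parameters and assumes an effective rate of about 4.85 bits per weight for a Q4_K_M-style file, since 4-bit formats store scaling data alongside the weights. That rate is my assumption, not a measurement from this machine.

```python
def quantised_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9    # "7B", taken at face value
FP16_BITS = 16.0  # unquantised half-precision weights
Q4_BITS = 4.85    # ASSUMED effective rate for a Q4_K_M-style mix, not measured

fp16_gb = quantised_size_gb(N_PARAMS, FP16_BITS)
q4_gb = quantised_size_gb(N_PARAMS, Q4_BITS)

print(f"fp16:  {fp16_gb:.1f} GB")  # far too big for 8 GB of RAM
print(f"4-bit: {q4_gb:.1f} GB")    # fits, with room left for the OS
```

Under those assumptions the half-precision weights come to 14 GB and the 4-bit file to a little over 4 GB, which is the difference between not loading at all and fitting into eight gigabytes with room to spare.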

It works. I want to be clear about the scale of "works" here. We are talking a couple of tokens per second, which is roughly the speed of someone typing thoughtfully whilst also thinking about something else. You ask a question, you go and make tea, you come back to a paragraph. It is not interactive in any way a normal person would accept. But it is genuinely generating coherent text, locally, on a fanless box that draws less power than the lamp next to it.
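
To put a number on the tea break, here is the arithmetic as a small sketch. The rates are illustrative guesses bracketing "a couple of tokens per second", not benchmarks, and the sum ignores the time spent reading the prompt before the first token appears.

```python
def wait_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a steady rate, ignoring prompt processing."""
    return n_tokens / tokens_per_second


# Illustrative rates only; 128 matches the -n 128 used in the command above.
for rate in (1.0, 2.0, 3.0):
    print(f"{rate:.0f} tok/s -> {wait_seconds(128, rate):.0f} s for 128 tokens")
```

At two tokens per second, the 128-token reply takes a little over a minute, which is about right for a kettle.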

The reason this matters to me isn't the speed, it's the proof. There's a persistent assumption that you need a fat GPU and a small fortune to touch any of this, and for serious work you do. But for a private little helper that summarises some text overnight, or classifies my own notes, or runs a job where I genuinely do not care if it takes ten minutes, this absurd little machine is enough. The quantisation does most of the heavy lifting; the 4-bit weights lose some precision that, for the things I'm asking, I cannot detect.
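
The overnight summarising job could be as small as the sketch below. Everything named in it is an assumption of mine for illustration: the binary and model paths, the directory names, and the prompt wording. It uses only the llama-cli flags already shown in this post (-m, -p, -n), and it makes no attempt to keep a long note within the model's context window.

```python
import subprocess
from pathlib import Path

# ASSUMED paths; adjust to wherever your build put the binary and the GGUF.
LLAMA_CLI = "./llama-cli"
MODEL = "models/model.Q4_K_M.gguf"


def build_command(text: str, max_tokens: int = 128) -> list[str]:
    """Build a llama-cli invocation using only the -m, -p and -n flags."""
    prompt = f"Summarise the following note in two sentences:\n\n{text}\n\nSummary:"
    return [LLAMA_CLI, "-m", MODEL, "-p", prompt, "-n", str(max_tokens)]


def summarise_folder(notes_dir: str, out_dir: str, run=subprocess.run) -> int:
    """Summarise every .txt file in notes_dir, one at a time; return the count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for note in sorted(Path(notes_dir).glob("*.txt")):
        text = note.read_text(encoding="utf-8")
        # One model run per note; slow, but nobody is waiting for it.
        result = run(build_command(text), capture_output=True, text=True, check=False)
        # stdout may include the echoed prompt as well as the summary.
        (out / f"{note.stem}.summary.txt").write_text(result.stdout, encoding="utf-8")
        count += 1
    return count


# Hypothetical usage, e.g. from a nightly cron job:
#   summarise_folder("notes", "summaries")
```

Running the notes strictly one after another is deliberate: with two cores and eight gigabytes there is nothing to gain from running them in parallel, and by morning nobody cares how long it took.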

I am not going to pretend this is practical. If you want answers in real time, buy the right hardware or rent it by the hour. But I find something cheering in the fact that the floor has dropped this low. A decade ago "run a language model" meant a research cluster. Now it means a thin client from a recycling bin, a 4-bit GGUF, and the patience to wait for your tea. The machine has no business doing this. It does it anyway.