a language model on the gpu that was gathering dust

I had a spare GPU doing nothing useful, an old card with 12GB that got demoted out of the desktop a while back, and after months of watching the local-model scene move I finally cleared an evening to see what I could actually run on it. The short version: more than I expected, and the experience of having it sit on your own hardware with no API key and no rate limit is genuinely different from poking a hosted endpoint.

The thing that made this possible at all is quantisation, and the project that made it approachable is llama.cpp. A full 7-billion-parameter model in fp16 wants something like 14GB just for the weights, which won't fit on my card with anything left over. Quantise it down to 4 bits and the same model wants roughly 4GB, with a quality hit that, for casual use, I mostly couldn't feel.
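The arithmetic behind those figures is short enough to sketch. This is a back-of-envelope estimate of the weights alone, not anything llama.cpp itself reports. It assumes a nominal 7 billion parameters, and the 4.5 bits-per-weight figure is my assumption for a 4-bit format that also stores a small scale factor alongside each block of weights.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # assumption: a nominal "7B" model

print(f"fp16:            {weight_gb(N_PARAMS, 16):.1f} GB")   # 14.0 GB
print(f"4-bit, no extras: {weight_gb(N_PARAMS, 4):.1f} GB")   # 3.5 GB
# Assumed overhead: roughly half a bit per weight for per-block scales.
print(f"4-bit + scales:  {weight_gb(N_PARAMS, 4.5):.1f} GB")  # 3.9 GB
```

The gap between 3.5GB and "roughly 4GB" is that bookkeeping overhead, and none of this counts the context, which also needs room on the card.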

getting it built

The build itself is refreshingly unfussy. It's C++ with optional CUDA, and the CUDA path is what you want if you've a GPU to feed.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1

The model format moves fast in this corner of the world, which is the one real friction point. The community has been shifting from the older GGML quantised formats to GGUF, and the tooling churns underneath you, so a model file you downloaded a fortnight ago may want a freshly built binary, or vice versa. If you get a cryptic load error, suspect a format mismatch before you suspect anything clever.
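A quick way to test the format-mismatch theory is to look at the first few bytes of the file. The sketch below assumes the file is meant to be GGUF (a .gguf extension), which starts with the four ASCII bytes GGUF followed by a little-endian 32-bit version number. The function name is my own, and it makes no attempt to identify older formats; it only says whether the header looks like GGUF.

```python
import struct


def sniff_model_file(path: str) -> str:
    """Report whether a file begins with the GGUF magic, and its version."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8:
        return "too short to be a model file"
    magic = header[:4]
    # Assumption: little-endian header, which is the common case.
    version = struct.unpack("<I", header[4:8])[0]
    if magic == b"GGUF":
        return f"GGUF, format version {version}"
    return f"not GGUF (first bytes: {magic!r}); possibly an older format"
```

Calling sniff_model_file on the path you pass to -m tells you in a second whether the file or the binary is the likelier culprit. A truncated download would still pass this check, so it rules out only one kind of failure.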

Once it's built, running is a one-liner, and the flag that matters most is -ngl, which offloads layers onto the GPU:

./main -m models/llama-2-7b.q4_0.gguf \
  -ngl 35 \
  -c 2048 \
  -p "Explain a TCP three-way handshake to a new engineer."

the bit that takes the tuning

-ngl is the knob you'll spend your time on. It's how many transformer layers get pushed onto the GPU; the rest run on the CPU. Push all the layers onto a card with enough VRAM and you get the GPU's full speed. Push too many and you run out of VRAM and it falls over, or worse, swaps and crawls.

With my 12GB card and a 4-bit 7B model, the whole thing fit on the GPU with room for the context, and it was quick: comfortably faster than I can read, several tokens a second, and the prompt processing was near-instant. A 13B model at 4 bits was tighter. It fit, but only just, and a longer context window pushed it over, so I dialled -ngl back a few layers and ate the CPU penalty on those. The lesson is that VRAM is the budget you're actually managing, and context length spends from the same pot as the weights. A bigger model with a short context can beat a smaller model with a huge one, depending on what you're doing.
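To make the budgeting concrete, here is a rough sketch of how I'd guess a starting value for -ngl. Everything in it is an assumption for illustration: it treats every layer as the same size (ignoring the embedding and output tensors), the 7.3GB figure is an estimate for a 13B model at about 4.5 bits per weight, the 40-layer count is what I understand the 13B Llama models to have, and the reserve values are placeholders rather than measurements from my card.

```python
def layers_that_fit(vram_gb: float, reserve_gb: float,
                    weights_gb: float, n_layers: int) -> int:
    """Estimate how many equal-sized layers fit in VRAM after a reserve.

    reserve_gb stands in for the context cache, scratch buffers and
    whatever else is already using the card.
    """
    budget = vram_gb - reserve_gb
    if budget <= 0:
        return 0
    per_layer = weights_gb / n_layers
    return min(n_layers, int(budget // per_layer))


# Illustrative numbers only: a 12GB card, an assumed 7.3GB of weights.
print(layers_that_fit(12, 2, 7.3, 40))  # small reserve: everything fits
print(layers_that_fit(12, 5, 7.3, 40))  # bigger reserve: a couple of layers stay on the CPU
```

The estimate is only a starting point. In practice I'd begin there and nudge -ngl down until the load stops failing.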

The other knob worth knowing is -c, the context size. Bigger context costs VRAM and it costs time, and most of the time I didn't need it. For chatting and quick drafting, 2048 was plenty. I only reached for more when feeding in a chunk of a document to summarise, and then I felt every token of it.
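The VRAM cost of context is mostly the key/value cache, and its size is easy to estimate. The sketch below assumes plain multi-head attention with a 16-bit cache, and the layer counts and hidden sizes (32 layers of width 4096 for a 7B Llama, 40 layers of width 5120 for a 13B) are figures I'm assuming from the published model shapes rather than anything read out of llama.cpp. Sizes are in GiB (2^30 bytes), so they come out slightly smaller than the same amount expressed in GB.

```python
def kv_cache_gib(n_layers: int, n_embd: int, n_ctx: int,
                 bytes_per_value: int = 2) -> float:
    """Size of the K and V caches: one n_embd-wide vector each,
    per layer, per context position."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_value / 2**30


for n_ctx in (2048, 4096):
    small = kv_cache_gib(32, 4096, n_ctx)  # assumed 7B shape
    large = kv_cache_gib(40, 5120, n_ctx)  # assumed 13B shape
    print(f"-c {n_ctx}: 7B ~{small:.2f} GiB, 13B ~{large:.2f} GiB")
```

The cache grows linearly with -c, so doubling the context doubles this figure, and it comes straight out of the same VRAM the weights are using.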

was it worth the evening

Yes, with caveats. What it's good at: it never refuses to start, it never rate-limits me, nothing I type leaves the machine, and for rubber-ducking, rephrasing, and first-draft boilerplate it's genuinely handy. The 4-bit 7B model is not going to out-reason the big hosted models, and I didn't expect it to. For "remind me of the rough shape of this", or "rewrite this paragraph less stiffly", it's completely adequate, and it's mine.

What it's not: it's not a replacement for the frontier models when you actually need the reasoning. It hallucinates confidently, it loses the thread on anything long, and the quantisation does cost you, mostly by making it a little more likely to wander. For code it'll do small, well-trodden snippets and then quietly invent an API that doesn't exist, so trust nothing it tells you about a library without checking.

The real value, I think, isn't this evening's output. It's that the whole stack now runs on a card that was otherwise a paperweight. The barrier to running your own model has collapsed from "rent a cloud instance" to "build a C++ project and download a file", and that changes what's worth experimenting with. I left it set up. I find myself reaching for it for the small stuff precisely because there's no friction, no tab to open, no quota to spend. The spare GPU has a job again, and the job turns out to be surprisingly pleasant company.