a language model on the spare gpu, and what it's actually good for

I had a spare GPU sitting in a drawer, a 12GB card pulled out of an old gaming build, and a growing irritation at not understanding what these large language models actually do when you take them out of the demo. The cloud APIs are a black box on purpose. I wanted one running on hardware I could touch, where I could watch the VRAM fill up and see what falls over when it does.

So this is the report from doing exactly that. The short version: it works, it's genuinely impressive in places, and the distance between "running locally" and "useful daily" is wider than the breathless threads suggest. Both halves of that are worth writing down.

getting it running

The thing that's changed in the last few months is quantisation getting good enough to matter. At 16-bit precision, a model of any interesting size simply will not fit in 12GB. Quantised down to 4-bit, the same weights drop to roughly a quarter of the size, and suddenly a model that would have needed a datacentre card fits on a card I'd otherwise have sold for forty quid.
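The arithmetic behind that is simple enough to sketch. This is a back-of-envelope estimate, not a measurement: it assumes the weights start at 16 bits each, counts only the weights themselves, and the function name is mine. Real quantised files come out somewhat larger, because the format stores scale factors alongside the 4-bit values.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts are the nominal 7B and 13B figures, not exact counts.
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_size_gb(n_params, 16)
    q4 = weight_size_gb(n_params, 4)
    print(f"{name}: 16-bit ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB")
```

On those assumptions the 7B model goes from about 14 GB to about 3.5 GB, and the 13B from about 26 GB to about 6.5 GB, which is the difference between not fitting on a 12GB card and fitting with room to spare.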

I started with the LLaMA weights doing the rounds and the llama.cpp project, which has become the centre of gravity for running these things on modest hardware. It's a single C++ binary, builds with CUDA support, and reads the quantised weight files directly. The setup, once you have the weights, is genuinely undramatic:

make LLAMA_CUBLAS=1
./main -m models/model-q4_0.bin -ngl 32 -p "Explain a TCP handshake to a sysadmin."

The -ngl 32 is the lever that matters: it offloads that many layers onto the GPU. Push too many and you run out of VRAM and it dies; too few and the CPU does the work and it crawls. Finding the number where the whole model just fits is the one bit of genuine fiddling. With the 12GB card I could fit the 7B model comfortably and the 13B model if I was careful and closed everything else.
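A rough way to pick a starting value for -ngl, rather than guessing, is to divide the VRAM you can spare by the size of one layer. The sketch below assumes every layer is the same size and that a fixed chunk of VRAM is already taken by the desktop and working buffers; both are simplifications, and the function name and the reserve figures are mine, not anything llama.cpp reports.

```python
def layers_that_fit(model_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    # VRAM left once the desktop, buffers and so on have taken their share.
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# 13B model: 40 layers, roughly 6.5 GB of weights at 4-bit (rough estimate).
print(layers_that_fit(6.5, 40, vram_gb=12.0, reserve_gb=2.0))  # everything closed
print(layers_that_fit(6.5, 40, vram_gb=12.0, reserve_gb=8.0))  # VRAM mostly in use
```

With a guessed 2 GB reserve all 40 layers fit; with 8 GB already in use only 24 do, and the rest land on the CPU. Treat the result as a first guess only: the real test is still whether the run survives.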

When it fits on the GPU, generation is fast enough to feel conversational, faster than I can read. When a few layers spill onto the CPU, it slows to a deliberate, word-by-word crawl that's oddly hypnotic to watch but useless for anything interactive. The cliff between the two is sharp. There's no graceful degradation, just "fits" and "doesn't".

what it's actually good at

Here's where I have to be honest, because the demos oversell it and the cynics undersell it, and the truth is more interesting than either.

The 7B model is a confident, fluent, frequently wrong colleague. Ask it to rephrase a paragraph, summarise a block of text, or draft boilerplate, and it's genuinely useful. Ask it a factual question and it will answer with total conviction whether or not it has the faintest idea, which is the failure mode everyone warns you about and which is much more unsettling when it's happening on your own desk. It told me, fluently and in detail, about a Linux command-line flag that does not exist. The fluency is the dangerous part. It sounds exactly as sure when it's wrong.

Where it earned its keep was the small, low-stakes, language-shaped jobs. Turning a terse commit message into a readable one. Drafting the first pass of a function's doc comment. Renaming a pile of variables to something consistent. None of these need it to be right about the world, only fluent about text, and at that it's good. The bigger 13B model is noticeably steadier and hallucinates a little less, but the shape of the thing is the same.

A few practical notes from the running of it. The first token takes a moment while the prompt is processed, then the rest stream steadily, so the initial pause is most noticeable on quick exchanges, where it makes up a bigger share of the total wait. Context length is the real constraint on doing anything serious: these models forget the start of a long conversation, and feeding them a whole source file plus a question fills the window fast. And it is genuinely a space heater. The card sits at full tilt for the duration of a generation and the room knows about it.
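To get a feel for how quickly the window fills, here is a crude budget check. It assumes a 2048-token window (the length the original LLaMA models were trained with) and about four characters per token, which is a rule of thumb rather than a real tokenizer count. The function names are mine.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the real count depends on the tokenizer."""
    return max(1, math.ceil(len(text) / chars_per_token))


def fits_in_context(prompt: str, reply_tokens: int,
                    context_tokens: int = 2048) -> bool:
    """True if the prompt plus the room reserved for the reply fits the window."""
    return estimate_tokens(prompt) + reply_tokens <= context_tokens


question = "Explain a TCP handshake to a sysadmin."
source_file = "x = 1\n" * 2000  # stand-in for a 2000-line source file

print(fits_in_context(question, reply_tokens=256))
print(fits_in_context(source_file + question, reply_tokens=256))
```

The short question fits easily. The stand-in source file alone is about 12,000 characters, or roughly 3,000 tokens by this estimate, so it overflows the window before the model has written a word.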

was it worth it

For learning, unreservedly yes. For understanding what these models are and aren't, there is no substitute for watching the VRAM gauge, hitting the context limit, and seeing the thing confabulate in front of you. The mystique evaporates and what's left is a tool with sharp, specific edges.

As a daily driver, not yet, and I want to be clear-eyed about that. The local 7B is to the hosted models roughly what a learner's bike is to a car: the same idea, an order of magnitude less capable, but yours, and the thing you learn to ride on. I'll keep it running, partly for the offline privacy of not sending half-formed thoughts to someone else's server, and partly because the field is moving fast enough that the spare-GPU tier might be properly useful by the summer. The drawer GPU has earned its slot back in the case. That's more than I expected when I started.