a language model on the spare GPU, and what it was actually good for

There's a spare GPU in the homelab box that mostly exists for the occasional transcode and for keeping me honest about power draw. A few weeks back I decided to point it at the thing everyone's been talking about and actually run a language model locally, on my own hardware, with no API key and no rate limit. Partly curiosity, partly a stubborn dislike of paying per token to find out whether something is useful.

This is not a benchmark post. It's an account of getting one running, what it cost in faff, and where it earned its keep.

the hardware, and lowering expectations

The card is a consumer GPU with 12GB of VRAM. That number is the whole story. VRAM is the wall you hit first, long before you run out of patience or compute. The headline open models people quote run comfortably on a stack of A100s in someone's datacentre. On 12GB at home, you are firmly in the world of smaller models and quantisation, and the sooner you make peace with that the happier you'll be.

The trick that makes any of this possible on consumer kit is quantisation: storing the weights at lower precision so the whole model fits. Full 16-bit weights for anything interesting won't fit. Drop to 8-bit and a fair bit more becomes possible; drop to 4-bit and you can run models you had no business running, at a quality cost that is real but, for a lot of tasks, surprisingly tolerable.
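To put rough numbers on that, here is a back-of-envelope sketch. It assumes a hypothetical 7-billion-parameter model (chosen purely for illustration, not necessarily what ran on this card) and counts only the weights. Real quantised formats carry a little extra for scales and metadata, and the KV cache and activations come on top.

```python
# Back-of-envelope weight footprint at different precisions.
# Assumption: weights only; no KV cache, activations or quantisation metadata.

GIB = 1024 ** 3


def weight_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B model, purely for illustration
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit: {weight_gib(n_params, bits):5.1f} GiB")
```

Under those assumptions the 16-bit weights alone come to about 13 GiB, which is already past a 12GB card before anything else is loaded, while the 4-bit version leaves room for context.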

getting it running

I won't pretend the toolchain is settled, because it isn't. It moves week to week and half the instructions you find online were already stale when they were written. The shape of it, though, is roughly this:

# CUDA toolkit and a sane Python environment first
python -m venv ~/llm && source ~/llm/bin/activate
pip install --upgrade pip

# then whichever inference stack you've settled on
pip install torch --index-url https://download.pytorch.org/whl/cu117

The first hour is always the same fight: getting three versions to agree. There's the CUDA version the toolkit expects, the CUDA version your PyTorch build was compiled against, and the driver actually installed. Get those three lined up and most things work. Get them out of step and you'll burn an evening on cryptic loader errors that have nothing to do with the model at all. Keep a note of the exact versions that worked, because you will need it again.
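One way to keep that note is to have a small script write it for you. This is a minimal sketch, assuming `nvidia-smi` is on the PATH and PyTorch is installed in the active environment; if either is missing it records `None` rather than failing. It does not look at the system CUDA toolkit (that would mean parsing `nvcc --version`), only at the driver and at what PyTorch itself reports.

```python
# Record the versions that have to agree, so there is a note to go back to.
# Assumption: nvidia-smi is on PATH and torch is installed; either may be
# missing, in which case the entry is None rather than an error.
import platform
import shutil
import subprocess


def driver_version():
    """NVIDIA driver version as reported by nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return None
    lines = out.strip().splitlines()
    return lines[0].strip() if lines else None


def torch_versions():
    """(torch version, CUDA version torch was built against), or (None, None)."""
    try:
        import torch
    except (ImportError, OSError):
        return None, None
    return torch.__version__, torch.version.cuda


def collect_versions():
    """Gather everything into one dict that can be printed or saved."""
    torch_ver, torch_cuda = torch_versions()
    return {
        "python": platform.python_version(),
        "driver": driver_version(),
        "torch": torch_ver,
        "torch_cuda": torch_cuda,
    }


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name:>10}: {value}")
```

Redirect the output into a text file next to the virtual environment and the next rebuild starts from known-good numbers instead of guesswork.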

Once it's loading, the first thing to watch is nvidia-smi:

$ nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv
memory.used [MiB], memory.total [MiB], utilization.gpu [%]
10980 MiB, 12288 MiB, 97 %

That's a 4-bit model sitting at the edge of the card with roughly 1.3GB of headroom. Push the context length up or load anything alongside it and you'll tip over into an out-of-memory error mid-generation, which is exactly as annoying as it sounds. Most of the tuning here is a balancing act between model size, quantisation level, and how much context you're willing to give it.
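The balancing act can be roughed out in a few lines. The sketch below reads the headroom off the `nvidia-smi` line above (1308 MiB free) and estimates what extra context would cost in KV cache. The architecture figures (32 layers, 32 KV heads of dimension 128, 16-bit cache) are hypothetical placeholders, not the model from this post, and the estimate ignores activations and allocator fragmentation, so treat the result as optimistic.

```python
# Headroom from an nvidia-smi CSV line, and what a longer context would cost.
# Assumption: the architecture numbers below are hypothetical; swap in the
# values from your own model's config.


def headroom_mib(csv_line: str) -> int:
    """Free VRAM in MiB from a line like '10980 MiB, 12288 MiB, 97 %'."""
    used, total = (int(field.split()[0]) for field in csv_line.split(",")[:2])
    return total - used


def kv_cache_mib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """Approximate KV cache size: one key and one value vector per layer per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token / (1024 ** 2)


if __name__ == "__main__":
    free = headroom_mib("10980 MiB, 12288 MiB, 97 %")
    print(f"headroom: {free} MiB")
    # Hypothetical 32-layer model, 32 KV heads of dimension 128, 16-bit cache.
    for extra_tokens in (1024, 2048, 4096):
        cost = kv_cache_mib(extra_tokens, layers=32, kv_heads=32, head_dim=128)
        verdict = "fits" if cost <= free else "out of memory"
        print(f"+{extra_tokens} tokens: {cost:.0f} MiB ({verdict})")
```

With those placeholder numbers each token costs half a MiB of cache, so a couple of thousand extra tokens is about all the spare room buys. Models that share KV heads across attention heads are cheaper per token, which is one reason the real figure varies so much between models.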

what it was actually good for

Here's the honest part.

For anything where correctness matters and I'd have checked the answer anyway, it was genuinely useful. Reshaping a blob of text. Drafting the boring first version of a shell script. Explaining an unfamiliar config file. Rubber-ducking a problem at one in the morning when there's nobody to rubber-duck with. The output needs reading carefully, but so does a colleague's, and it never minds being asked the same question four different ways.

For anything where I'd have trusted the answer blind, it was a liability. It will state something wrong with exactly the same calm confidence as something right, and a local model running quantised at 4-bit is, predictably, a bit worse at this than the big hosted ones. It does not know what it doesn't know, and it will happily invent a command-line flag that has never existed. I caught it doing precisely that more than once, and the flag always sounded plausible, which is the dangerous part.
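A cheap first check for an invented flag is to look for it in the tool's own help text before running anything. This is a rough sketch, assuming the tool prints its options when called with `--help`, which not every program does. The sample help text is made up for the demonstration. A hit only shows the flag exists, not that it does what the model claimed, and a miss is a prompt to read the man page rather than proof.

```python
# Check that a flag the model suggested actually appears in the tool's help text.
# Assumption: the tool documents its flags in --help output; not all do.
import re
import subprocess


def flag_in_text(flag: str, help_text: str) -> bool:
    """True if `flag` appears as a whole option name, not as a prefix of another."""
    pattern = r"(?<![\w-])" + re.escape(flag) + r"(?![\w-])"
    return re.search(pattern, help_text) is not None


def flag_in_help(command: str, flag: str) -> bool:
    """Run `command --help` and look for the flag in whatever it prints."""
    result = subprocess.run(
        [command, "--help"], capture_output=True, text=True, timeout=10
    )
    return flag_in_text(flag, result.stdout + result.stderr)


if __name__ == "__main__":
    # Made-up help text, for illustration only.
    sample = "  --format=csv   output format\n  --query-gpu=    fields to report\n"
    print(flag_in_text("--format", sample))     # flag present in the sample
    print(flag_in_text("--fast-mode", sample))  # plausible-sounding, not present
```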

The speed surprised me, in a good way. Tokens come out at reading pace or faster, which is more than enough for interactive use. Nobody's running a production service off this card, but for a single person at a keyboard it never feels like waiting.

was it worth it

For the curiosity, absolutely. There is something clarifying about running the whole thing locally, where you can watch the VRAM fill and feel exactly where the limits are, rather than treating it as magic behind an API.

For day-to-day work, it's a useful tool that I reach for perhaps once a day and never trust without checking. The faff of getting it running was real and the toolchain will have moved on by the time you read this. But the spare GPU now earns slightly more of its electricity than it did, and I've learned more about where these models help and where they quietly mislead than I would have from any amount of reading about them. That, more than the output, was the point.