Ramblings of an aging IT geek
← Ramblings of an aging IT geek
ai

a spare gpu, a quantised model, and lower expectations

What it actually takes to run a usable language model on one consumer GPU at home, and where the wheels come off.

A small robot, doing its best

I had a spare GPU doing nothing in the garage rack, an old 12GB card pulled from a desktop, and a vague sense that I should run a language model on it rather than pay someone else by the token. The short version: it works, it's genuinely useful, and the gap between "works" and "as good as the hosted ones" is exactly as large as the VRAM number suggests.

I went with Ollama because I wanted to spend the evening using a model, not building one. It wraps llama.cpp, pulls quantised GGUF weights, and exposes a small HTTP API, which is all I needed. Install, ollama run, and a few minutes later a 7B model was answering questions on my own hardware.

The number that matters is VRAM, and quantisation is how you make a model fit. A 7B model at 4-bit quantisation lands around 4 to 5GB, which leaves room on a 12GB card for context and comfort. A 13B is tighter but doable. Anything in the 30B-and-up range either won't fit or spills into system RAM, at which point the GPU sits idle whilst your tokens trickle out at reading speed. The first time I tried to be ambitious, the model "worked" and produced about one word per second, which is a fascinating way to lose interest in a sentence.

The circuitry doing the thinking

ollama pull mistral:7b-instruct-q4_K_M
ollama run mistral:7b-instruct
# watch it actually use the card
nvidia-smi -l 1

Quantisation costs you something. A 4-bit model is measurably less sharp than the same model at full precision, and both are a step below the big hosted models. For summarising a document, drafting a function, or rephrasing an email, the 7B is fine and the latency is lovely: no round trip, no rate limit, no bill. For anything that needs real reasoning over a long context, you feel the ceiling quickly.

The genuinely good part is what local buys you beyond cost. Nothing leaves the house. I can point it at notes I'd never paste into a web form, and the whole thing keeps running when my internet doesn't. That alone justified the spare card.

My honest take after a few weeks: a single consumer GPU at home is a brilliant tool for the boring 80% of tasks, and a frustrating one for the interesting 20%. Set your expectations to "fast local assistant" rather than "the clever one", keep the card cool, and it earns its place. Just don't go shopping for a 70B on the strength of one good afternoon.