Ramblings of an aging IT geek
← Ramblings of an aging IT geek
ai

a spare gpu, ollama, and a model that runs on my own metal

Setting up a quantised local LLM on a spare GPU with Ollama, and what it is and is not good for once the novelty wears off.

A small robot figure on a desk

I had a spare GPU sitting in a drawer and a vague feeling that I should be able to run a language model locally without renting someone else's hardware by the token. So I did. The short version: it works, it is genuinely useful for a narrow set of things, and the quantisation is doing more heavy lifting than I expected.

The card is nothing special, an older consumer GPU with twelve gigabytes of VRAM. That number turns out to be the whole story. VRAM is the constraint that decides everything: which models you can run, at what quantisation, and how much context you can give them before things fall over.

A close-up of a circuit board

Ollama made the setup almost embarrassingly easy. Install it, pull a model, and it sorts out the GPU offload for you.

ollama pull mistral
ollama run mistral

That is the entire ceremony. Behind the scenes it is pulling a quantised GGUF, working out how many layers fit in VRAM, and offloading the rest to the CPU if it has to. On twelve gigabytes I can comfortably run a 7B model at a four-bit quantisation with room to spare, and it responds fast enough that the conversation feels live rather than batched.

what the quantisation costs

A four-bit quant of a 7B model is not the same animal as the full-precision version, and you feel it. The prose is slightly less crisp, it makes more confident mistakes, and asking it anything that needs real reasoning produces answers that look right and occasionally are not. For drafting, summarising, rephrasing, and rubber-ducking a problem out loud, it is completely fine. For anything where being wrong matters, I check it.

The thing I did not anticipate was how much the privacy changes my behaviour. When the model runs on my own metal, I stop hesitating before pasting things into it. Config files, half-finished code, the contents of a log I would never send to a hosted service. It all stays on the box. That alone justifies the spare GPU.

It is not replacing the frontier hosted models for the hard stuff. It was never going to. But for the constant stream of small language tasks that I would otherwise not bother automating, having a capable model that costs nothing per query and never leaves the house is quietly excellent. The drawer GPU has earned its keep.