a spare gpu, a quiet llm, and a weekend gone

I'd had a 3090 sitting in a box since the last desktop rebuild, and a vague guilt about paying per token for things I could probably run at home. So I spent a weekend turning the spare card into a small, private, occasionally useful language model that lives in the cupboard under the stairs. Here's what worked, what didn't, and what I'd tell myself before starting.

The short version: a single 24GB card is genuinely capable now, a quantised model will surprise you, and the bottleneck is rarely the GPU. It's everything around it.

the hardware, such as it is

The card is an RTX 3090, 24GB of VRAM, second-hand and already paid for. It went into an old Ryzen box with 64GB of system RAM and an NVMe drive with enough room for a few models. Nothing exotic. The one thing worth saying is that these cards pull real power under load, so I capped the board at 280W with nvidia-smi and lost almost nothing in throughput:

sudo nvidia-smi -pl 280

That dropped the fan noise and the room temperature noticeably, and the tokens-per-second barely moved. If you're running a model overnight, your electricity meter will thank you.
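For a rough sense of what the cap is worth, here's a back-of-envelope sketch. Everything except the 280W figure is an assumption: the 350W stock limit is the reference RTX 3090 board power (partner cards vary), and the eight hours and the tariff are placeholders to swap for your own. It also assumes the card sits at its limit the whole time, so treat the result as an upper bound.

```python
def overnight_kwh(watts: float, hours: float) -> float:
    """Energy used by a steady load, in kilowatt-hours."""
    return watts * hours / 1000.0

STOCK_LIMIT_W = 350    # assumption: reference RTX 3090 board power
CAPPED_LIMIT_W = 280   # the cap set with nvidia-smi above
HOURS = 8              # assumption: one overnight run
PRICE_PER_KWH = 0.30   # assumption: placeholder tariff, use your own

# Difference between running pinned at the stock limit and pinned at the cap.
saved_kwh = overnight_kwh(STOCK_LIMIT_W, HOURS) - overnight_kwh(CAPPED_LIMIT_W, HOURS)
saved_cost = saved_kwh * PRICE_PER_KWH

print(f"up to {saved_kwh:.2f} kWh saved per night, about {saved_cost:.2f} at the assumed tariff")
```

Not life-changing for one night, but it adds up if the box is busy most evenings.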

getting something running quickly

I started with Ollama, because it's the path of least resistance and I wanted a working chat before I lost the evening to compiler flags. Install, pull a model, talk to it:

ollama pull llama3.1:8b
ollama run llama3.1:8b

That's it. Eight billion parameters at a sensible quantisation sits comfortably in VRAM with room to spare, and it's fast enough that the latency feels conversational rather than batch. For a first answer to "is this even worth it", Ollama is the right call. It hides the awkward bits and gets you to a prompt in about ninety seconds.
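Ollama also runs a local HTTP server, so the same model can be reached from a script. The sketch below is a minimal example using only the standard library. It assumes Ollama is listening on its default address (localhost, port 11434) and that the model above has already been pulled. The helper names `build_request` and `ask` are mine, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default listen address

def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str, model: str = "llama3.1:8b", timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `print(ask("Summarise this in one line: ..."))` should behave much like typing into `ollama run`, minus the streaming.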

Once I wanted to understand what was actually happening, I dropped down to llama.cpp directly. It's more fiddly, but you can see the levers: how many layers you're offloading to the GPU, the context length, the exact quantisation. That visibility matters once you start caring about why something is slow.
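To make those levers concrete, here's a small sketch that assembles a llama.cpp command line. It assumes a recent build where the chat binary is called `llama-cli` (older builds named it `main`), and the model path is a made-up placeholder. The flags themselves are llama.cpp's own: `-m` for the model file, `-ngl` for the number of layers to offload, `-c` for the context size.

```python
import shlex

def llama_cli_command(model_path: str, gpu_layers: int, ctx_size: int, prompt: str) -> list[str]:
    """Assemble an argv list for llama.cpp's CLI with the main levers exposed."""
    return [
        "llama-cli",              # assumption: binary name in recent llama.cpp builds
        "-m", model_path,         # GGUF file; the quantisation is baked into the file
        "-ngl", str(gpu_layers),  # layers offloaded to the GPU
        "-c", str(ctx_size),      # context length in tokens
        "-p", prompt,
    ]

# Hypothetical model path; 99 is simply "more layers than the model has", i.e. offload everything.
cmd = llama_cli_command("models/example-8b-q4_k_m.gguf", 99, 8192, "Hello")
print(shlex.join(cmd))
```

Lowering the `-ngl` value is how you find out, the slow way, what each layer left on the CPU costs you.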

what fits, and what it costs you

The honest accounting is about VRAM, not cleverness. A 7B or 8B model at a 4-bit quant fits with the whole context window in memory and runs quickly. A 13B or 14B fits, but you start trading context length for headroom. Anything in the 30B range needs aggressive quantisation and you'll feel it, both in quality and in the slow grind as layers spill out of VRAM and onto the CPU.
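The arithmetic behind that accounting is simple enough to sketch. This estimates the weights alone: parameters times bits per weight, divided by eight. The 4.8 bits figure is an assumption, a rough stand-in for a typical 4-bit quant, which lands above 4 bits because the format stores scales alongside the weights. The KV cache and runtime buffers come on top, which is where the headroom goes.

```python
GIB = 1024 ** 3

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB

BITS_Q4 = 4.8  # assumption: rough average for a 4-bit quant including scales

for size in (8, 14, 30):
    print(f"{size}B at ~{BITS_Q4} bits/weight: {weight_gib(size, BITS_Q4):.1f} GiB of weights")
```

On a 24GB card the 30B row looks like it fits, and on paper it does. What's left over is the problem: every gigabyte of weights is a gigabyte you can't spend on context.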

I settled, for now, on a 14B model for the times I want better reasoning and an 8B for everything quick. The 14B is noticeably more coherent on anything that needs holding two ideas at once. The 8B is fast enough that I'll throw a throwaway question at it without thinking, which turns out to be the thing that gets used.

Quantisation is the lever nobody warns you about properly. A Q4 model is dramatically smaller than its full-precision parent and, for chat, mostly fine. Push down to Q2 or Q3 and it starts making the kind of confident mistakes that are worse than no answer. I'd rather run a smaller model at a sane quant than a bigger one mangled into fitting.

what it's actually good for

This is the part I went in naive about. A local model is not a frontier model, and pretending otherwise leads to disappointment. What it is good at, in my use, is the boring private stuff:

Reformatting and summarising text I don't want leaving the house.
First-draft commit messages and tidying up notes.
Rubber-ducking a problem out loud when nobody's around to be bored by it.
Classification and extraction tasks where I can check the output at a glance.

What it's bad at is anything needing current knowledge, long careful chains of reasoning, or the kind of reliability you'd stake a deploy on. For those I still reach for the hosted models, and I've made my peace with that. The local one earns its keep by being there, private, and free at the margin.

the bits that bit me

CUDA versions and driver mismatches ate an hour. The error messages are not kind, and the fix was, as ever, making the driver and the toolkit agree. Once they did, the problem never came back.

Context length is the other quiet trap. It's tempting to crank it to the maximum, but a long context eats VRAM and slows everything down, and most of the time I don't need it. I keep it modest and only raise it when a specific task wants the room.
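Here's roughly why a long context is expensive. The KV cache stores a key and a value vector per layer, per KV head, per token, so it grows linearly with context length. The sketch below assumes the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and an unquantised 16-bit cache. Other models and cache settings will give different numbers.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values for every layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value

# assumption: Llama 3.1 8B architecture, 16-bit cache
LLAMA_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (4096, 8192, 32768, 131072):
    gib = kv_cache_bytes(n_ctx, **LLAMA_8B) / 1024**3
    print(f"{n_ctx:>6} tokens: {gib:.1f} GiB")
```

Under those assumptions an 8K context costs about 1 GiB and the full 128K window costs 16 GiB, which is most of the card before the weights are even loaded.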

And the obvious one: a model running in a cupboard is only useful if you can reach it. I put a small API in front of it on the local network so my laptop and phone can talk to the same instance, which is the difference between a novelty and a tool.
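One possible shape for that small API, as a sketch rather than what I'd call production code: a standard-library forwarder that accepts POSTs from the LAN and passes them to the model server on the same box. The upstream address assumes Ollama on its default port, and the names `Forwarder`, `upstream_url` and `serve`, along with port 8080, are made up for the example. It buffers the whole reply, so callers should ask for non-streaming responses, and it has no authentication, so it belongs on a network you trust.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

UPSTREAM = "http://127.0.0.1:11434"  # assumption: Ollama on its default port

def upstream_url(path: str) -> str:
    """Map an incoming request path onto the local model server."""
    return UPSTREAM + path

class Forwarder(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the caller's JSON body and replay it against the upstream server.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        req = urllib.request.Request(
            upstream_url(self.path),
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with urllib.request.urlopen(req, timeout=300) as resp:
                payload, status = resp.read(), resp.status
                content_type = "application/json"
        except OSError as exc:  # URLError, HTTPError and timeouts are OSError subclasses
            payload, status = str(exc).encode("utf-8"), 502
            content_type = "text/plain; charset=utf-8"
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(host: str = "0.0.0.0", port: int = 8080) -> None:
    """Listen on all interfaces so other machines on the LAN can connect."""
    ThreadingHTTPServer((host, port), Forwarder).serve_forever()
```

Calling `serve()` on the cupboard box lets the laptop POST to port 8080 there. If all you need is reachability, Ollama can also listen on the LAN directly by setting its `OLLAMA_HOST` environment variable, which skips the wrapper entirely.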

would i do it again

Yes, and I already have. The card was sunk cost, the weekend was going to be lost to something anyway, and what I've got now is a quiet, private little assistant that costs nothing per question and never sends my notes anywhere. It won't replace the hosted models for the hard problems. But for the steady drip of small private tasks, having it just there, in the cupboard, humming gently, is exactly the right amount of overkill.