A Spare GPU and a Model That Lives in My House

A robot and a glowing circuit board

I had a spare RTX 3060 sitting in a drawer, 12GB of VRAM doing nothing, and a vague feeling that I should stop paying per token for things I keep doing badly anyway. So I built a local LLM box. Not a cluster, not a rack, just one card in an old desktop under the stairs running models that fit in memory.

The first lesson is that VRAM is the whole game. 12GB will comfortably hold a 7B or 8B model at a 4-bit quant with room for context, and it will not hold a 70B no matter how much you wish it would. Once you accept the ceiling, everything else is easy.

Getting it running

Ollama did the boring parts for me. Install it, point it at a model, and it handles the quantisation download and the GPU offload without me reading a single line of CUDA documentation.

curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.1:8b

That was genuinely it. First token came back in under a second, and a full response streamed at something like 40 tokens per second, which is faster than I read. nvidia-smi showed the whole model sitting in VRAM and the card pulling about 130W under load, which it can do all day.

A close-up of a circuit board

What it is actually good for

Here is the honest bit. A local 8B model is not going to out-reason the big hosted frontier models, and pretending otherwise will only disappoint you. What it is good for is the high-volume, low-stakes, privacy-adjacent work where round trips and per-token cost add up.

I use mine for three things now. Summarising my own notes, which never need to leave the house. Drafting commit messages from a diff, piped straight in over the local API. And as a first-pass classifier for a pile of scanned documents, where "roughly right and free" beats "perfect and metered". For that last one I wired it to the OpenAI-compatible endpoint Ollama exposes on localhost:11434, and most of my existing scripts worked unchanged.

The thing that surprised me is how much the latency matters. When the model is on the same machine, there is no network, no rate limit, no worrying about whether a loop will cost me twenty quid by accident. I can throw a thousand requests at it overnight and the only cost is electricity and the fan noise.

It is not a replacement for the hosted models. It is a different tool, the way a bicycle is not a worse car. For anything I want to run a lot, locally, without thinking about a bill, the spare GPU has more than paid for the drawer it was sitting in.