a spare 3060 and an evening with ollama

I had a spare RTX 3060 doing nothing in a drawer, 12GB of VRAM, a card too slow for the games I don't have time to play anyway. So I dropped it into the home server to find out what running an LLM locally actually feels like, away from anyone's API and anyone's bill.

Ollama makes the first ten minutes embarrassingly easy. Install it, type ollama run llama3.1:8b, and it pulls the weights, loads them onto the GPU and gives you a prompt. No virtualenv archaeology, no CUDA version roulette, none of the suffering I'd braced for. It just worked, which after years of GPU driver pain felt almost suspicious.
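Once the interactive prompt works, the same model can be reached from a script. Here is a minimal sketch, assuming Ollama is listening on its default local port (11434) and that llama3.1:8b has already been pulled. The helper names build_request and generate are my own, not part of Ollama.

```python
import json
import urllib.request

# Default address of a local Ollama server (assumption: port not changed).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> dict:
    """Send the prompt to the local server and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, generate("llama3.1:8b", "Rewrite this email more politely: ...")["response"] should hold the model's text. Nothing in that round trip leaves localhost.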

The VRAM is the wall you hit, and you hit it fast. A model has to fit in 12GB or it spills into system RAM and slows to a crawl, so once you've picked a parameter count, the number that matters is the quantisation. An 8B model at 4-bit quantisation (the Q4_K_M builds) sits around 5GB and runs comfortably, leaving headroom for context. A 13B at 4-bit comes to roughly 8GB, which is tight once context is added but fits. Anything in the 70B class is simply not happening on this card, not without offloading so much to CPU that you may as well make a cup of tea between tokens.
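The arithmetic behind those sizes is simple enough to sketch. This is a back-of-envelope estimate, not how Ollama itself decides what to load. It assumes Q4_K_M averages about 4.85 bits per weight (it mixes precisions, so it is a little over 4) and sets aside a flat 1.5GB for context and runtime overhead. Both figures are my assumptions and will vary with the model and context length.

```python
# Assumed average bits per weight for Q4_K_M builds (approximate).
BITS_PER_WEIGHT_Q4_K_M = 4.85
# Assumed flat allowance for KV cache and runtime buffers, in GB.
OVERHEAD_GB = 1.5


def weights_gb(params_billion: float, bits_per_weight: float = BITS_PER_WEIGHT_Q4_K_M) -> float:
    """Approximate size of the quantised weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, vram_gb: float = 12.0, overhead_gb: float = OVERHEAD_GB) -> bool:
    """True if weights plus the assumed overhead fit in the given VRAM."""
    return weights_gb(params_billion) + overhead_gb <= vram_gb


for size in (8, 13, 70):
    verdict = "fits" if fits(size) else "does not fit"
    print(f"{size}B at Q4_K_M: ~{weights_gb(size):.1f} GB of weights, {verdict} in 12 GB")
```

Under those assumptions the 8B and 13B models fit and the 70B, at over 40GB of weights alone, does not come close.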

So what is it actually good for? The 8B models are genuinely useful for the small, private jobs. Summarising a chunk of text, rewriting an email, answering a quick "what's the flag for X" question, classifying some notes. They are not going to out-reason the frontier hosted models, and you can feel the gap the moment you ask anything that needs real depth. But they are fast enough to be pleasant, around 30 to 40 tokens per second on this card, and crucially the data never leaves the house. For anything I'd hesitate to paste into a cloud box, that matters more than the quality gap.
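If you want to check the tokens-per-second figure on your own card, the final JSON reply from Ollama's generate endpoint includes eval_count (tokens generated) and eval_duration (in nanoseconds). A small sketch of the calculation follows. The function name is mine, and the sample numbers in the usage line are made up for illustration rather than measured.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama reply's eval_count and eval_duration fields."""
    duration_s = response["eval_duration"] / 1e9  # eval_duration is in nanoseconds
    if duration_s <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / duration_s


# Illustrative values only: 350 tokens generated over 10 seconds.
sample = {"eval_count": 350, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

This measures generation only. Prompt processing is reported separately in the same reply, so a long prompt will feel slower than this number suggests.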

The thing nobody mentions in the setup guides is the heat and the noise. The 3060 sits near silent at idle and then spins up like a small hairdryer under sustained generation. My home server lives in a cupboard, and the cupboard is now noticeably warmer. A minor cost, but a real one, and the sort of detail you only learn by running the thing for a week rather than for a demo.
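To put numbers on the hairdryer, nvidia-smi can report temperature, fan speed and power draw in CSV form. A rough sketch, assuming nvidia-smi is on the PATH and the card is the first GPU listed. The function names and dictionary keys are my own, and some cards report "[N/A]" for a field, which is turned into None here.

```python
import subprocess

# Query temperature (C), fan speed (%) and power draw (W) as bare CSV values.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,fan.speed,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_line(line: str) -> dict:
    """Parse one CSV line of nvidia-smi output into floats (None if unavailable)."""
    def num(field: str):
        try:
            return float(field)
        except ValueError:
            return None  # e.g. "[N/A]" when the card does not expose the value

    temp, fan, power = (num(f.strip()) for f in line.split(","))
    return {"temp_c": temp, "fan_pct": fan, "power_w": power}


def read_gpu() -> dict:
    """Run nvidia-smi once and return readings for the first GPU listed."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_line(out.splitlines()[0])
```

Calling read_gpu() every few seconds during a long generation, and again at idle, gives a before-and-after picture of what the cupboard is putting up with.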

Is it worth it? For me, yes, on balance. Not as a replacement for the hosted models when I want the best answer, but as a default for the small stuff, and as a genuinely instructive way to feel where the limits are. There is something clarifying about watching the VRAM gauge and understanding, viscerally, why model size and quantisation are the whole game. The drawer GPU has earned its slot. Just keep an 8B model and modest expectations, and 12GB goes further than you'd think.