the cupboard model, one card and one prompt

A few days on from standing up a local model on a spare 3090, the dust has settled and I can be blunter about what matters. The whole thing comes down to one number, and it isn't speed. It's VRAM.

Twenty-four gigabytes decides what you can run, full stop. A 7B or 8B model at a 4-bit quant fits with room for a decent context window and runs fast enough to feel conversational. A 13B or 14B at the same quant still fits, but the bigger weights leave less VRAM for context. Anything much bigger only fits with quantisation aggressive enough that you'll feel the quality drain away. I run an 8B for quick throwaway questions and a 14B when I want it to hold two thoughts at once, and that pairing covers almost everything I ask of it.
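For anyone who wants the arithmetic behind that, here is a rough back-of-envelope sketch. The numbers in it are my assumptions, not measurements: roughly 4.5 bits per weight for a typical 4-bit quant, an fp16 KV cache, and a Llama-3-style 8B shape (32 layers, 8 KV heads, head dimension 128). It ignores compute buffers and driver overhead, so treat the result as a floor rather than a promise.

```python
# Rough VRAM budget for a quantised model on a 24 GB card.
# All inputs are approximations for illustration, not measurements.

GIB = 1024 ** 3


def weights_gib(n_params, bits_per_weight):
    """Memory taken by the quantised weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Keys plus values for every layer and every token (fp16 by default), in GiB."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value / GIB


# Assumed shape for an 8B Llama-3-style model with an 8192-token context.
w = weights_gib(8.0e9, 4.5)
kv = kv_cache_gib(32, 8, 128, 8192)
print(f"weights ~{w:.1f} GiB, KV cache ~{kv:.1f} GiB, total ~{w + kv:.1f} GiB of 24")
```

Plug in a bigger parameter count or a longer context and you can watch the headroom shrink, which is the whole trade-off in two function calls.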

Getting there is genuinely easy now. Ollama hides the awkward bits:

    ollama pull llama3.1:8b
    ollama run llama3.1:8b

Ninety seconds and you're talking to it. If you want to understand the levers (layers offloaded to the GPU, context length, exact quant), drop down to llama.cpp, but you don't need to for a working setup.
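Once the model is pulled, you can also script it rather than chat with it. The sketch below talks to Ollama's local HTTP endpoint using only the standard library. It assumes the server is running on its default port (11434); the helper names `build_request` and `ask` are mine, and the 8192-token `num_ctx` is just an example value for the context-length lever.

```python
import json
import urllib.request

# Ollama's default local address; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3.1:8b", num_ctx=8192):
    """Build the JSON payload. num_ctx sets the context window for this call."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON reply instead of a token stream
        "options": {"num_ctx": num_ctx},
    }


def ask(prompt, **kwargs):
    """POST the prompt to the local server and return the generated text."""
    body = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"]


# Usage, with the server running:
# print(ask("Summarise in one sentence: VRAM decides what you can run."))
```

Keeping the payload builder separate from the network call makes it easy to inspect exactly what is being sent before anything touches the GPU.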

The honest part is what it's for. This is not a frontier model and pretending otherwise leads straight to disappointment. What it's good at is the private boring stuff: summarising text I don't want leaving the house, first-draft commit messages, rubber-ducking out loud, the odd classification job I can eyeball. For anything needing current knowledge or careful reasoning I still reach for the hosted models, and that's fine.

One practical note that paid off twice over: I capped the card at 280W with nvidia-smi -pl 280 and lost almost no throughput whilst the room got noticeably cooler and quieter. For something that might run overnight, that's free.
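If you want to check that the cap is actually in force, nvidia-smi can report the current draw next to the enforced limit. This is a small sketch of how I'd read it from a script; the function names are hypothetical, the sample reading is made up, and it assumes a card that reports power draw at all. Setting the limit normally needs root, and it typically resets on reboot.

```python
import subprocess

# Query current draw and the enforced power limit, as bare CSV numbers in watts.
QUERY = [
    "nvidia-smi",
    "--query-gpu=power.draw,power.limit",
    "--format=csv,noheader,nounits",
]


def parse_power(line):
    """Turn one CSV line such as '212.40, 280.00' into (draw_watts, limit_watts)."""
    draw, limit = (float(field) for field in line.split(","))
    return draw, limit


def read_power(gpu_index=0):
    """Ask nvidia-smi for the current draw and enforced cap of one GPU."""
    out = subprocess.run(
        QUERY + ["-i", str(gpu_index)], capture_output=True, text=True, check=True
    )
    return parse_power(out.stdout.strip().splitlines()[0])


# Example with a made-up reading, not a measurement:
draw, limit = parse_power("212.40, 280.00")
print(f"{draw:.0f} W of {limit:.0f} W cap")
```

Calling `read_power()` during a long generation and again at idle gives a quick sense of how close the card runs to the cap.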

So: a spare card, the right-sized model, and modest expectations. It won't replace anything for the hard problems. But as a quiet, private, free-at-the-margin helper in the cupboard, it's exactly enough.