Ramblings of an aging IT geek

putting a spare 3090 to work as a local llm box

How I turned an idle RTX 3090 into a usable local LLM server with llama.cpp and Ollama, what fit in 24GB, and where it stopped being worth the bother.

I had a 3090 sitting idle. It came out of the gaming rig when I upgraded, and it spent a few months in a drawer doing nothing but depreciating. So I did the obvious thing and bolted it into a spare case to see how far a single 24GB card gets you with the local models that have landed this year.

The short version: further than I expected, and with far less faff than the same exercise would have been twelve months ago. The tooling has caught up. You no longer need a research lab's worth of Python to get a model answering questions on your own hardware.

the hardware, such as it is

Nothing exotic. An old Ryzen 5 3600, 32GB of system RAM, the 3090, and a 1TB NVMe drive that mostly holds model weights. The card is the only part that matters. Everything else is there to feed it and stay out of the way.

The one thing worth getting right is power. A 3090 will happily pull 350W under load, and if your PSU is marginal you will find out about it the hard way, halfway through generating a response, with the whole box dropping dead. I put a 750W unit in and stopped worrying. I also capped the card with nvidia-smi -pl 280 because the last 70W buys you almost nothing in tokens per second and a lot in fan noise.

# cap board power at 280W (the limit resets at reboot), then check it stuck
sudo nvidia-smi -pl 280
nvidia-smi --query-gpu=power.limit,temperature.gpu --format=csv

ollama for the easy life

I started with Ollama because I wanted something working before I lost interest. It wraps llama.cpp, handles model downloads, and gives you an HTTP API on port 11434 that behaves itself. Install, pull a model, talk to it:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b "explain what a page fault is, briefly"
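
The HTTP API on port 11434 can be poked directly with curl, which is how you would wire the box into scripts. A minimal sketch, assuming the server is listening on its default address. The helper names (json_escape, ollama_payload, ollama_generate) are my own invention for illustration, and the escaping only covers backslashes and double quotes, so treat it as a starting point rather than anything robust.

```shell
# Sketch: call a local Ollama server's /api/generate endpoint with curl.
# Assumption: default address localhost:11434, overridable via OLLAMA_URL.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Escape backslashes and double quotes so a string is safe inside JSON.
# Newlines and control characters are NOT handled; use jq for real work.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body. "stream": false asks for a single JSON reply
# instead of a stream of chunks.
ollama_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# POST the payload. The reply is JSON with the generated text in "response".
ollama_generate() {
  curl -sS "$OLLAMA_URL/api/generate" -d "$(ollama_payload "$1" "$2")"
}

# Usage (needs a running server):
#   ollama_generate llama3.1:8b "explain what a page fault is, briefly"
```

Keeping the payload builder separate from the curl call means you can eyeball the JSON before sending anything.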

Llama 3.1 landed last month and the 8B variant is genuinely useful. It fits in roughly 5GB at a sensible 4-bit quantisation, leaves most of the card free, and answers fast enough that you stop thinking about latency. For the kind of thing I actually want a local model for (rephrasing, summarising a wall of logs, drafting a commit message I cannot be bothered to write), it is plenty.

The interesting question is what happens when you stop being polite about VRAM.

how much fits in 24GB

This is the bit everyone wants the table for, so here is the rough mental model I ended up with.

A model's weights at full 16-bit precision want roughly two bytes per parameter. A 7B model is therefore about 14GB before you have done anything useful, and you also need room for the KV cache, which grows with context length. Quantise it down to 4-bit and that 7B drops to around 4GB, which is why everyone runs quantised weights at home. The quality loss at 4-bit is real but small, and for most tasks I genuinely cannot tell.
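
The arithmetic above is simple enough to script. A rough sketch, where weights_gb is a name I made up: it takes the parameter count in billions and the bits per weight, and returns decimal gigabytes for the weights alone. It ignores the KV cache and runtime overhead, and real 4-bit files come out a little larger than the nominal figure because the quantised formats also store scaling data alongside the weights.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Decimal gigabytes for the weights alone: params * bits / 8.
# Hypothetical helper for back-of-envelope sizing, nothing more.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 7 16    # prints 14.0
weights_gb 7 4     # prints 3.5
```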

On a single 24GB card that means:

  • 7B and 8B models run with room to spare, long context and all.
  • 13B models are comfortable at 4-bit.
  • 34B models fit at 4-bit if you are careful with context length.
  • 70B models technically load at aggressive quantisation but you are scraping the barrel, and the KV cache for any real context pushes you over the edge.
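
The KV cache is the part that is easy to forget, so here is a rough way to size it. A sketch under assumptions: kv_cache_gib is a made-up helper, the cache is taken to be 16-bit (2 bytes per value) unless told otherwise, and the example shape is what I understand Llama 3.1 8B to use (32 layers, 8 KV heads, head dimension 128). Other models need their own numbers from their config.

```shell
# kv_cache_gib LAYERS KV_HEADS HEAD_DIM CONTEXT [BYTES_PER_VALUE]
# Size in GiB of the K and V tensors for one sequence:
#   2 (K and V) * layers * kv_heads * head_dim * context * bytes_per_value
kv_cache_gib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="${5:-2}" \
    'BEGIN { printf "%.2f\n", 2 * l * h * d * c * b / (1024 ^ 3) }'
}

# Assumed shape for Llama 3.1 8B: 32 layers, 8 KV heads, head dim 128.
kv_cache_gib 32 8 128 8192      # prints 1.00
kv_cache_gib 32 8 128 131072    # prints 16.00
```

Add the result to the weight size and compare against 24GB. The cache scales linearly with context, which is why a bigger model that only just fits leaves you very little context to play with.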

I spent an evening trying to coax a 70B model onto the card. It worked, briefly, at 2-bit, and the output was bad enough that I deleted it and felt no loss. The lesson is that quantisation has a floor, and below 4-bit you are paying for the privilege of running a bigger model that has been lobotomised down to the quality of a smaller one. Run the smaller one. It is faster and it is better.

llama.cpp when you want the knobs

Ollama is the right default, but the moment you want to understand what is actually happening, drop down to llama.cpp directly. It is what Ollama runs underneath, and running it yourself exposes every dial: how many layers to offload to the GPU, how big to make the context, which quantisation to load.

./llama-cli -m models/llama-3.1-8b-q4_k_m.gguf \
  -ngl 99 -c 8192 -p "summarise this changelog:"

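-ngl 99 asks llama.cpp to offload up to 99 layers to the GPU. An 8B model has far fewer than that, so in practice it means all of them, and on a 24GB card they fit with room to spare. It is a request rather than a fit check: ask for more than the card can hold and the load will typically fail with an out-of-memory error rather than quietly scale back, so for bigger models you lower the number by hand. The first time I watched the whole model land in VRAM and the CPU go quiet, it clicked. This is the thing that makes a local card worth it. When the model fits entirely on the GPU, generation is fast and steady. The instant some of the layers have to stay in system RAM, throughput falls off a cliff and you feel every token.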
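
To check how much of the card a loaded model is actually using, nvidia-smi will report memory per GPU. A small sketch: vram_summary is a name I made up, and it just reformats the "used, total" pairs that nvidia-smi emits in CSV mode. Numbering GPUs by line order is my assumption.

```shell
# vram_summary: read "used, total" lines (MiB) on stdin and print one
# summary line per GPU. GPUs are numbered by line order (an assumption).
vram_summary() {
  awk -F', *' '{ printf "GPU %d: %d/%d MiB (%.0f%%)\n", NR - 1, $1, $2, 100 * $1 / $2 }'
}

# Usage on the real box, while a model is loaded:
#   nvidia-smi --query-gpu=memory.used,memory.total \
#     --format=csv,noheader,nounits | vram_summary
```

If the used figure is well below the size of the model file, some of the layers did not make it onto the card.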

where it stopped being worth it

I will be honest about the limits, because the local-LLM enthusiasm online tends to skip this part.

For coding help, the frontier hosted models are still meaningfully better, and it is not close. An 8B model running on my 3090 is a capable assistant for small, well-scoped questions. It is not a replacement for the big hosted ones on anything that needs real reasoning across a large context. If your day job depends on the quality of the answer, the local model is a supplement, not a substitute.

What the local box is genuinely good at is the high-volume, low-stakes, privacy-sensitive stuff. Throwing logs at it that I would rather not paste into someone else's API. Bulk reformatting. Anything I want to run in a loop without watching a meter tick. The card has paid for itself in not-thinking-about-billing alone, and it was already bought, so the marginal cost was a power cable and an evening.
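
For the run-it-in-a-loop case, something like the following is all it takes. A sketch only: summarise_logs is a hypothetical helper, the directory layout and the prompt wording are placeholders, and the 200-line cut-off is an arbitrary way of keeping each prompt inside the context window.

```shell
# summarise_logs SRC_DIR OUT_DIR [MODEL]
# Sends the tail of each *.log file in SRC_DIR to a local model via
# `ollama run` and writes one summary file per log into OUT_DIR.
summarise_logs() {
  sl_src="$1"; sl_out="$2"; sl_model="${3:-llama3.1:8b}"
  mkdir -p "$sl_out"
  for sl_file in "$sl_src"/*.log; do
    [ -e "$sl_file" ] || continue    # no matches: glob stays literal, skip
    sl_name=$(basename "$sl_file" .log)
    # Last 200 lines only (arbitrary), passed as part of the prompt.
    sl_prompt="Summarise this log in three bullet points:
$(tail -n 200 "$sl_file")"
    ollama run "$sl_model" "$sl_prompt" > "$sl_out/$sl_name.txt"
  done
}

# Usage:
#   summarise_logs /var/log/myapp ./summaries
```

Nothing leaves the machine, and there is no meter to watch while it grinds through the directory.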

If you have a capable card gathering dust, do it. Pull Ollama, pull an 8B model, and have it answering questions inside ten minutes. Then spend a happy weekend finding out exactly where 24GB runs out, which is the genuinely educational part.