putting a spare gpu to work as a private model server

I had a spare GPU sitting in a drawer, a 24GB card that came out of my desktop when I upgraded, and a low-grade unease about how much of my working day involves pasting code and half-formed thoughts into someone else's model behind someone else's API. None of it is secret, exactly. But "not secret" and "I'm comfortable sending this to a third party" are different bars, and a lot of what I paste clears the first and not the second. So I built a local model server. This is what it took and, more usefully, what I actually got out of it.

the hardware reality

Let me set expectations honestly, because the gap between the demos and the desk is where people get disappointed. A single 24GB card will comfortably run a quantised model in the 7B to 14B parameter range at a quantisation level (Q4 or Q5 of some flavour) that keeps quality respectable. It will run a 32B model if you quantise it hard and accept slower output. It will not run anything resembling the frontier hosted models, and pretending otherwise just leads to a week of fiddling followed by a sulk.
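
The back-of-envelope arithmetic behind those ranges, as a rough sketch: weight memory is approximately parameter count times bits per weight, divided by eight. The bits-per-weight figures below (4.5 for a Q4-ish format, 3.5 for "quantised hard") are assumed round numbers, not measurements, and `weights_gb` is a hypothetical helper name. The estimate covers weights only; the KV cache for the context window and runtime overhead come out of whatever is left of the 24GB.

```shell
# Rough weight-memory estimate in GB: params (billions) x bits per weight / 8.
# Ignores KV cache and runtime overhead, which need headroom on top.
weights_gb() {
  # $1 = parameters in billions, $2 = average bits per weight (assumed)
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 14 4.5   # 14B at an assumed 4.5 bits per weight
weights_gb 32 4.5   # 32B at the same level: fits, but leaves little for context
weights_gb 32 3.5   # 32B quantised harder
```

On these assumptions a 14B model leaves well over half the card free for context, while a 32B model only gets comparable breathing room once it is squeezed further.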

What 24GB does give you is enough headroom to keep a useful model resident with room for a decent context window, and that combination is where local inference stops being a toy. The thing that makes a local model feel sluggish is usually not the tokens-per-second once it's running; it's the model getting evicted from VRAM and reloaded from disk every time you come back after a coffee. Keep it pinned.

the software stack

I went with Ollama for the management layer, because it does the boring parts well: pulling models, holding them in memory, exposing a stable HTTP endpoint, and handling the quantisation formats without me having to think about them. Under the bonnet it's llama.cpp doing the actual work, and you can drop down to that directly if you want more control, but for a server that just needs to be there when I call it, the higher-level tool is the right altitude.

# pull a model, then load it and open an interactive prompt
ollama pull qwen2.5-coder:14b
ollama run qwen2.5-coder:14b

# or hit the API directly from anything on the LAN
# ("stream": false returns one JSON object instead of a stream of chunks)
curl http://gpubox:11434/api/generate -d '{
  "model": "qwen2.5-coder:14b",
  "prompt": "explain this stack trace",
  "stream": false,
  "keep_alive": "30m"
}'

The keep_alive parameter is the one that matters for daily comfort. Set it generously and the model stays in VRAM between requests, so the second prompt of the morning is as fast as the tenth. The default is conservative (five minutes at the time of writing): the model is unloaded after a short idle and the next request pays for a reload from disk, which is exactly the behaviour that makes people conclude local inference is slow when really it's just cold.
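
One way to warm the model before the first real question, sketched under a few assumptions: a request to the generate endpoint that names a model but carries no prompt loads the model without generating anything, and keep_alive on that request sets how long it stays resident. `preload_body` and `preload` are hypothetical helper names, `gpubox` is the hostname from the example above, and the `printf` templating assumes plain model tags with no quotes in them.

```shell
# Hypothetical helper: build the JSON body for a preload request.
# $1 = model tag, $2 = keep_alive duration string such as 30m or 24h
preload_body() {
  printf '{"model": "%s", "keep_alive": "%s"}' "$1" "$2"
}

# Send it to the box. With no prompt there is nothing to generate,
# so the call returns once the model is loaded.
# $1 = model tag, $2 = optional keep_alive (defaults to 30m here)
preload() {
  curl -s http://gpubox:11434/api/generate -d "$(preload_body "$1" "${2:-30m}")"
}
```

Calling `preload qwen2.5-coder:14b 8h` from a login script would cover a working day; running `ollama ps` on the box afterwards should show the model and when it is due to be unloaded.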

I exposed the endpoint on the LAN, not the internet, behind the firewall, and pointed my editor's assistant plugin and a couple of small scripts at it. That part is genuinely lovely: the same completion and chat affordances I'm used to, served from a box in the next room, with nothing leaving the house.
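
For reference, a sketch of the LAN side. Out of the box Ollama listens only on the loopback address, and the OLLAMA_HOST environment variable changes the bind address; how to set it persistently depends on how the service is launched, so only the one-off form appears in the comment. The `ollama_url` and `list_models` helpers and the `OLLAMA_BOX` variable are hypothetical names for client scripts, not anything Ollama itself reads.

```shell
# On the GPU box (shown as a comment, not run here):
#   OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Keep port 11434 closed at the router; this is for the LAN only.

# Hypothetical client helper: build a URL for the server.
# OLLAMA_BOX overrides the hostname; it defaults to gpubox.
ollama_url() {
  printf 'http://%s:11434%s\n' "${OLLAMA_BOX:-gpubox}" "$1"
}

# List the models the server has pulled; doubles as a reachability check.
list_models() {
  curl -s --max-time 5 "$(ollama_url /api/tags)"
}
```

Running `list_models` from a laptop on the same network is a quick way to confirm the box is reachable before pointing an editor plugin at it.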

what it is actually good at

Here is the honest assessment after a few weeks. For a large, well-defined class of tasks, a local 14B model is completely sufficient, and the privacy and the lack of a per-token meter running change how freely I use it.

It's good at the small, frequent, low-stakes jobs. Explaining an unfamiliar stack trace. Translating a chunk of code from one language to another. Drafting a commit message from a diff. Writing the boring half of a shell script. Summarising a log file. Rubber-ducking a design where I mostly need a competent interlocutor rather than a genius. For all of these, the local model is fast enough, private, and "good enough" by a comfortable margin, and the marginal cost of asking is zero, so I ask more often and learn more.

Where it falls down is exactly where you'd expect. Long, subtle reasoning over a large context. Tasks where the difference between a good answer and a nearly-good answer is the whole point. Anything where I need the model to hold a sprawling codebase in its head and reason across it. For those I still reach for a hosted frontier model, and I do so knowingly, having decided that this particular thing is worth sending out.

the unexpected benefit

The thing I didn't predict is how the local model changed my relationship with the hosted ones. Because I now have a free, private, always-on model for the cheap stuff, I'm far more deliberate about what I send to the expensive cloud ones. The local box absorbs the constant trickle of small questions, and the hosted model gets reserved for the genuinely hard problems where its extra capability earns the trip. That triage was not the goal when I started. It's turned out to be the best part. A spare GPU and an afternoon bought me both a privacy improvement and a clearer head about which tool the job in front of me actually wants.