Ramblings of an aging IT geek

the spare gpu finally earns its keep

Putting a leftover GPU to work serving a quantised local model with Ollama, and what it actually feels like day to day.


There has been an old card sitting in the bottom of the case for the better part of a year, doing nothing but warming the air. A 12GB board, too slow to bother with for anything serious, too good to throw out. So I gave it a job.

Ollama makes this almost embarrassingly easy. Install it, run ollama pull llama3.1:8b, and a couple of minutes later you have a quantised model answering on localhost:11434. The 8B at Q4 fits comfortably in 12GB with room for a decent context window, and on this card I get something like 30 tokens a second. Not fast. Fast enough that I stop noticing the wait, which is the only threshold that actually matters.
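
For anyone wondering why an 8B model at Q4 leaves headroom in 12GB, here is a back-of-envelope sketch. The numbers are assumptions for illustration rather than measurements from my card: roughly 4.5 bits per weight for the quantised model, the commonly quoted Llama 3.1 8B layout (32 layers, 8 key/value heads of dimension 128), a 16-bit KV cache, and an 8192-token context. Runtime overhead and scratch buffers are left out.

```python
# Rough VRAM budget for an 8B model at 4-bit quantisation.
# Every figure passed in below is an assumption, not a measurement.

GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantised weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes for the key and value tensors across all layers and tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


weights = weight_bytes(8.0e9, 4.5)     # assumed ~4.5 bits per weight
kv = kv_cache_bytes(32, 8, 128, 8192)  # assumed architecture and context size
total = weights + kv

print(f"weights : {weights / GIB:.2f} GiB")
print(f"kv cache: {kv / GIB:.2f} GiB")
print(f"total   : {total / GIB:.2f} GiB of 12 GiB")
```

Under those assumptions the weights come to a little over 4 GiB and the cache to 1 GiB, so even with generous overhead there is plenty of space left for a longer context.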

What surprised me is how much I reach for it now that it costs nothing per call. Rubber-ducking a function, rewording a paragraph, summarising a wall of log output I cannot be bothered to read in full. None of it is work I would have paid an API for, so it simply never happened before. The card is not clever, but it is always there and it never sends my half-formed nonsense to anyone else's servers.
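
The log-summarising habit is about twenty lines of Python. This is a minimal sketch rather than exactly what I run: it assumes Ollama is listening on its default port, uses the /api/generate endpoint with streaming turned off, and the helper names (build_payload, tokens_per_second, summarise) plus the 200-line cut-off are invented for the example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default address


def build_payload(log_text: str, model: str = "llama3.1:8b",
                  max_lines: int = 200) -> dict:
    """Keep only the tail of the log and wrap it in a summarisation prompt."""
    tail = "\n".join(log_text.splitlines()[-max_lines:])
    prompt = ("Summarise the following log output in a few bullet points, "
              "errors first:\n\n" + tail)
    # stream=False asks for one JSON object instead of a stream of chunks.
    return {"model": model, "prompt": prompt, "stream": False}


def tokens_per_second(reply: dict) -> float:
    """Generation speed: eval_count is tokens, eval_duration is nanoseconds."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


def summarise(log_text: str) -> dict:
    """POST the prompt to the local model and return the decoded JSON reply."""
    body = json.dumps(build_payload(log_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())
```

Calling summarise(text) on the contents of a log file returns a dictionary whose "response" field holds the summary, and tokens_per_second on that same dictionary is a quick way to check the throughput on your own card. Trimming to the tail of the log is a blunt way to stay inside the context window; adjust max_lines to taste.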

It will not replace the big models for the things they are genuinely good at. For the small, constant, slightly embarrassing questions, though, having a local one running is a quiet little luxury. The spare GPU has finally earned its keep.