I had a 12GB RTX 3060 doing nothing in a spare machine, so I pointed Ollama at it to see how far a single consumer card gets you now. The answer, in May 2024, is surprisingly far.
ollama run llama3 pulls the 8B model and you're talking to it in under a minute. The quantised 8B fits comfortably in 12GB with room to spare, and it generates fast enough to read along with, perhaps 40 tokens a second. It is not GPT-4. It will cheerfully invent a function signature. But for drafting commit messages, rephrasing an awkward paragraph, or summarising a wall of logs I pasted in, it's genuinely useful and it never leaves the house.
The thing that surprised me was the tooling. The Ollama API speaks an OpenAI-compatible endpoint, so existing scripts mostly just worked once I changed the base URL. I now have a little shell function that pipes text through the local model for tidying, and it costs nothing per call.
Would I run production traffic on it? No. But as a free, private, always-on assistant for low-stakes text wrangling, a spare GPU and one command is a remarkably good deal.