putting a language model on the gpu that was just sitting there

A stylised robot head representing a language model

I had a GPU doing nothing useful, an older card with a respectable but not enormous amount of VRAM, and a vague itch to see how far you can get running a language model on your own hardware rather than poking someone else's API. The short version: further than I expected, but the gap between this and the hosted models is wide and obvious the moment you ask for anything subtle.

The first wall everyone hits is VRAM. The weights of a large model do not politely compress to fit your card, and the model you actually want is two or three sizes too big for what you have. The way around it is quantisation, loading the weights at reduced precision so they fit, and it works better than it has any right to. An 8-bit load of a mid-sized model squeezes into a consumer card with room to spare, and the quality drop is real but not ruinous for casual use.

Getting it running was mostly a fight with the Python environment, which is the part nobody puts in the demo videos. The model code wants one version of the CUDA toolkit, the framework wants another, and your distribution's packaged drivers want a third. I gave up trying to make the system Python happy and put the whole thing in its own environment with pinned versions, which is the only way I have ever got this class of software to behave.

A close-up of a circuit board

Once it loads, the experience is genuinely interesting. Generation runs at a readable pace, a few tokens a second, fast enough to feel like a conversation if you are patient and slow enough to remind you constantly that you are not on someone's datacentre. For drafting, summarising a block of text, or rephrasing something, it is perfectly serviceable. The lights flicker, the fans spin up, and a paragraph appears that is mostly coherent.

Where it falls down is anything that needs reasoning over more than a sentence or two. Ask it to follow a multi-step instruction and it loses the thread halfway. Ask it a factual question and it will answer confidently and wrongly, with the particular smoothness that makes the wrong answer hard to catch. The hosted models do this too, but less, and the difference in how often you get burned is the difference between a toy and a tool.

So why bother, when the hosted version is better and a browser tab away? A few honest reasons. The data never leaves the house, which matters for anything I would not want to paste into someone's web form. It keeps working when the internet does not. And there is something clarifying about watching the thing run on metal you can touch, with a power bill you can measure, rather than as a magic endpoint that bills you per token. It demystifies it. The model is a big pile of numbers being multiplied very fast, and seeing my own fans struggle to do it made that concrete in a way no amount of reading had.

I am not about to replace the hosted models with this for real work. But the spare GPU has gone from doing nothing to doing something, and I have a much better feel for where the floor is. Next I want to try fine-tuning something small on my own notes, which is where local actually wins, because that is the one thing you cannot easily do with someone else's endpoint.