a spare gpu and a language model that runs in my own house

I had a spare GPU doing nothing, an old card with 8GB of VRAM that used to live in my gaming machine, and a vague itch to run a language model that didn't phone home. So I dropped it into the homelab box and spent an evening finding out what 8GB actually buys you in 2022.

The short answer: less than the demos suggest, but more than I expected. The headline models everyone's writing about want far more memory than I have, and the moment a model's weights don't fit in VRAM you're either spilling to system RAM (slow) or not running it at all. Where the card earns its keep is the smaller models. A few-billion-parameter model loads, runs, and generates text at a perfectly usable pace for tinkering, and crucially it does it on hardware I already own with no API key in sight.
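The "does it fit" question is mostly arithmetic, and a rough sketch of it looks like the snippet below. It assumes the weights dominate memory use, that each parameter takes 4, 2, or 1 bytes depending on precision, and that about 1 GiB should be kept free for activations and framework overhead. That headroom figure, the function names, and the parameter counts in the loop are illustrative guesses, not measurements from my card.

```python
# Back-of-envelope check: do a model's weights fit in a given amount of VRAM?
# Assumption: weights dominate; activations and framework overhead are
# covered by a flat headroom allowance, which is a guess, not a measurement.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_gib(n_params: float, dtype: str = "fp16") -> float:
    """Size of the weights alone, in GiB, for a given parameter count."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


def fits(n_params: float, vram_gib: float, dtype: str = "fp16",
         headroom_gib: float = 1.0) -> bool:
    """True if the weights plus the headroom allowance fit in VRAM."""
    return weight_gib(n_params, dtype) + headroom_gib <= vram_gib


if __name__ == "__main__":
    # Hypothetical parameter counts, chosen only to show the trend.
    for n_params in (1.3e9, 2.7e9, 6e9, 20e9):
        size = weight_gib(n_params)
        verdict = "fits" if fits(n_params, vram_gib=8) else "does not fit"
        print(f"{n_params / 1e9:>5.1f}B params at fp16: {size:5.1f} GiB, {verdict} in 8 GiB")
```

By this estimate a model with a few billion parameters at half precision sits comfortably inside 8GB, while anything in the tens of billions is out of reach before a single token is generated.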

Getting there was mostly a fight with CUDA versions and Python environments, which is the recurring tax on anything GPU-shaped. Once the driver, the toolkit, and the framework all agreed on which version of CUDA they were living in, the model loaded without drama and the GPU fans spun up in a way that felt appropriately serious.
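A quick way to see whether the pieces agree is to ask each layer what it thinks it is running. The sketch below assumes the framework is PyTorch and that `nvidia-smi` is on the PATH; neither is stated above, so treat both as assumptions. The function names and the shape of the report are made up for illustration. It degrades to `None` values if either piece is missing.

```python
import shutil
import subprocess


def driver_version():
    """Ask nvidia-smi for the driver version; None if it cannot be queried."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        out = subprocess.run(
            [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = out.stdout.strip().splitlines()
    return lines[0].strip() if out.returncode == 0 and lines else None


def framework_report():
    """Collect the versions each layer reports (PyTorch is an assumption)."""
    report = {"driver": driver_version(), "torch": None,
              "torch_cuda": None, "gpu_visible": False, "gpu_name": None}
    try:
        import torch
    except (ImportError, OSError):
        return report  # framework missing or broken: report the driver only
    report["torch"] = torch.__version__
    report["torch_cuda"] = torch.version.cuda  # CUDA version torch was built for
    report["gpu_visible"] = bool(torch.cuda.is_available())
    if report["gpu_visible"]:
        report["gpu_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in framework_report().items():
        print(f"{key:12s} {value}")
```

If `gpu_visible` comes back `False` while the driver is reported, the usual suspect is a framework build targeting a CUDA version the driver does not support.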

Is it as good as the hosted offerings? No, and it would be daft to pretend otherwise. The smaller models wander, repeat themselves, and occasionally produce confident nonsense. But it's mine, it runs offline, and there's a particular satisfaction in watching tokens appear from a card that was otherwise gathering dust. That's reason enough for an evening's work.