Ramblings of an aging IT geek
← Ramblings of an aging IT geek
ai

putting a language model on the spare gpu in the cupboard

Running a quantised LLaMA derivative locally on an old 8GB GPU, what fits, what doesn't, and what it's actually good for.

A small robot figure on a desk

I have a spare GPU. An old one, 8GB of VRAM, pulled out of a machine I rebuilt and never quite found a home for. It has been sitting in a cupboard accusing me. So this weekend I put a language model on it, locally, to see what an 8GB card can actually do without an API key or a network round-trip.

The short version: more than I expected, less than the demos promise. Which is roughly where I'd put my expectations for most of this space at the moment.

what fits in 8GB

The headline numbers from the big hosted models are useless here. Those run on cards with ten times the memory. The question for a local box is what survives quantisation down to something that fits, and the answer in late 2022 is: the smaller LLaMA-family weights, quantised hard, do.

Full fp16 weights are roughly two bytes a parameter, so a 7B model wants about 14GB just to sit there, before you've fed it a single token. That's already over my budget. Quantise to 4-bit and you're nearer 4GB of weights, which leaves headroom for the context and the activations. So 7B at 4-bit is comfortable, 13B is a squeeze, and anything bigger is wishful thinking on this card.

A close-up of a circuit board

The quantisation isn't free. You can feel the model get slightly woollier as you push the bits down, more inclined to wander, more inclined to confidently invent. At 8-bit it's barely noticeable. At 4-bit it's there if you're looking. The trade is memory for coherence, and on a small card you take the trade because the alternative is the model not fitting at all.

what it's actually good for

Useful, in rough order of how often I reach for it:

  • Rephrasing and tightening text I've already written. It's good at this and the stakes are low.
  • Rubber-ducking a problem out loud when nobody's around to be bored by it.
  • Quick one-off shell or regex snippets, which I then read carefully because it lies with a straight face.

What it is not good for is anything where being wrong is expensive and not obvious. It will hand you a plausible command that does the wrong thing, formatted beautifully. The failure mode of a local model is the same as the hosted ones, just with worse spelling occasionally.

the bit i actually liked

It's all local. No request leaves the box. For the rubber-ducking and the rephrasing, where I'm half-thinking out loud about things I'd rather not post to someone's server, that matters more than the quality gap. The latency is fine, a sentence or two a second once it's warmed up, which is conversational enough.

I'm not going to pretend the cupboard GPU has changed my life. But it's earning its keep now, and the experiment cost me an afternoon and some electricity. The thing that struck me most is how quickly this went from research-lab-only to something that runs on a card I'd written off. That curve is steep, and I don't think it's done bending yet.