Ramblings of an aging IT geek

Running a local LLM on a spare GPU

Standing up an open language model on a single consumer GPU at home, what fits in the VRAM, and what the experience is actually like in late 2022.


I had a spare GPU doing nothing useful, an old card I demoted when I built a new desktop, and I wanted to know what running a language model at home actually felt like in practice rather than in a press release. Not training one. Just inference, locally, on hardware I already owned. The short version: it works, it is genuinely interesting, and the gap between "a model that fits" and "a model that is good" is wider than the demos let on.

What fits on one card

The constraint is VRAM, and it is unforgiving. A model's parameters have to live in memory, and at full FP16 precision you need roughly two bytes per parameter just to hold the weights, before you add anything for activations or the KV cache. So a 6 billion parameter model wants something like 12 GB of VRAM at FP16 for the weights alone. A 12 GB card has nothing left over after that, and some of its memory is already taken by the driver and the desktop, so in practice you get very short contexts at best and a lot of staring at out-of-memory errors.
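The arithmetic above is easy to check for yourself. The sketch below is only a back-of-envelope estimate: the function names are made up for illustration, the layer count and hidden size are GPT-J-6B's published shape (28 layers, hidden size 4096), and it ignores activations, framework overhead, and whatever the desktop is already using.

```python
def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, hidden_size: int, n_tokens: int,
                   bytes_per_value: int = 2) -> int:
    """Keys and values: two tensors per layer, one hidden-size vector per token."""
    return 2 * n_layers * hidden_size * n_tokens * bytes_per_value


GB = 1e9
card = 12 * GB

fp16_weights = weight_bytes(6e9, 2)  # 6B parameters at 2 bytes each
# GPT-J-6B shape, FP16 cache, full 2048-token context
cache = kv_cache_bytes(n_layers=28, hidden_size=4096, n_tokens=2048)

print(f"FP16 weights:            {fp16_weights / GB:.1f} GB")
print(f"KV cache at 2048 tokens: {cache / GB:.2f} GB")
print(f"Headroom on a 12 GB card: {(card - fp16_weights - cache) / GB:.2f} GB")
```

The headroom comes out negative, which is the out-of-memory error in numerical form.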

That maths is why the sweet spot for a single consumer card right now is the 6B to 7B class: EleutherAI's GPT-J-6B, the 6.7B member of Meta's OPT family, the 7.1B multilingual BLOOM variant. These are the open weights you can actually pull down and run. EleutherAI's larger GPT-NeoX-20B needs around 40 GB at FP16, and the genuinely huge models, the ones that make the headlines, are well out of reach of a single card without serious compromises.

The compromise that helps most is quantisation. Loading weights in 8-bit instead of 16-bit roughly halves the memory, and bitsandbytes plugged into Hugging Face transformers makes this almost too easy:

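# Needs the accelerate and bitsandbytes packages alongside transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate decide where each layer lives
    load_in_8bit=True,   # quantise the weights to int8 via bitsandbytes
)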

That load_in_8bit=True is the line that turns "does not fit" into "fits with room to breathe". There is a small quality cost and the load is slower, but on a 12 GB card it is the difference between running a 6B model and not.
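To make the "small quality cost" concrete, here is a deliberately simplified sketch of 8-bit quantisation using a single absmax scale. It is not what bitsandbytes actually does (its int8 scheme works per vector and treats outlier values separately); the function names and the sample weights are invented for illustration. The idea is the same, though: store each weight as a one-byte integer plus a shared scale, and accept a small rounding error when you convert back.

```python
def quantise_absmax(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantise(quantised, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in quantised]


weights = [0.12, -0.5, 0.33, 0.01, -0.27]  # made-up values
q, scale = quantise_absmax(weights)
restored = dequantise(q, scale)

# The rounding error per weight is at most half a quantisation step.
worst = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", q)
print(f"worst error: {worst:.4f} (step size {scale:.4f})")
```

Each weight now costs one byte instead of two, which is where the halving comes from.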


What the experience is actually like

Here is where I temper expectations, including my own. These open models are not the polished, helpful assistants you may have read about. They are raw language models. They complete text. You hand them a prompt and they continue it, plausibly, in the statistical shape of their training data, which is a great deal of the internet with all that implies.

Ask one a direct question and you are as likely to get a list of three more questions, or a forum signature, or a confident paragraph of nonsense, as an answer. The instruction-following behaviour that makes the hosted services feel like talking to something is the result of additional fine-tuning that these base weights have not had. You can coax a lot out of them with careful prompting, few-shot examples, and a firm hand on the generation parameters, but it is coaxing, not conversation.
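A minimal sketch of what that coaxing looks like in practice, assuming a simple Q/A layout. The helper names and the example questions are hypothetical, and nothing here is specific to any one model. The prompt shows the model a pattern to continue, and the second helper cuts the completion off at the point where the model starts inventing the next question.

```python
def build_few_shot_prompt(examples, question):
    """Lay out worked examples so the model continues the pattern."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")  # left open for the model to fill
    return "\n\n".join(blocks)


def first_answer(completion):
    """Keep only the text before the model starts a new 'Q:' of its own."""
    return completion.split("\nQ:", 1)[0].strip()


examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of Italy?")
print(prompt)

# A base model will typically keep going after the answer, like this:
raw = " Rome\n\nQ: What is the capital of Spain?\nA: Madrid"
print(first_answer(raw))
```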

The generation settings matter more than I expected. Temperature, top-p, repetition penalty: get them wrong and the model either repeats itself into a loop or wanders off into incoherence. A reasonable starting point for something that stays on the rails:

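prompt = "The main constraint when running a language model at home is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,          # sample instead of always taking the top token
    temperature=0.7,         # below 1.0 sharpens the distribution
    top_p=0.9,               # nucleus sampling: drop the unlikely tail
    repetition_penalty=1.1,  # discourage tokens that already appeared
)
print(tokenizer.decode(output[0], skip_special_tokens=True))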
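For anyone curious what temperature and top-p do to the next-token choice, here is a small standard-library sketch. It follows the general technique rather than the transformers implementation, the function name and the toy logits are invented, and the repetition penalty is left out to keep it short.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Pick one token index using temperature scaling and nucleus sampling."""
    # Temperature: divide the logits before the softmax.
    # Values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most likely tokens until their combined mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample from the survivors in proportion to their probability.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.5, 0.5, -1.0, -3.0]  # one score per vocabulary entry
picks = [sample_top_p(toy_logits) for _ in range(1000)]
print({i: picks.count(i) for i in sorted(set(picks))})
```

With these toy numbers the two least likely tokens fall outside the 0.9 nucleus and are never picked, which is the "stays on the rails" effect in miniature.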

Even tuned, the throughput on an older single card is measured in a handful of tokens per second, not the instant walls of text you get from a cloud endpoint. You watch it write. There is something quite calming about that, actually, but it is not fast.
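If you want a number rather than a feeling, timing a generation call is enough. The helper below is a sketch with hypothetical names, and a stand-in generator takes the place of the model so it runs anywhere. With a real model you would pass something like `lambda: model.generate(input_ids, max_new_tokens=200).shape[1]` and the length of the prompt in tokens.

```python
import time


def tokens_per_second(generate_fn, n_prompt_tokens):
    """Time one call; generate_fn must return the total sequence length."""
    start = time.perf_counter()
    n_total_tokens = generate_fn()
    elapsed = time.perf_counter() - start
    # Only the newly generated tokens count towards throughput.
    return (n_total_tokens - n_prompt_tokens) / elapsed


def fake_generate():
    """Stand-in for a model: pretends 10 prompt tokens grew to 30."""
    time.sleep(0.05)
    return 30


rate = tokens_per_second(fake_generate, n_prompt_tokens=10)
print(f"{rate:.1f} tokens/s")
```

The first call to a real model often includes one-off warm-up cost, so it is worth timing a second run before drawing conclusions.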

Why bother

So if it is slower, rougher, and dumber than the hosted alternatives, why run it at home at all?

Three honest reasons. The first is that nothing leaves the house. The prompt, the output, whatever you are feeding it, all of it stays on hardware you control, which for some uses matters a great deal. The second is cost: once the card is paid for, inference is just electricity, and there is no meter ticking on every request. The third, and the real one for me, is that it demystifies the whole thing. When you have watched a 6B model load layer by layer, run out of memory, get quantised down to fit, and then dribble out text a few tokens at a time, the magic resolves into engineering. It is matrices and memory bandwidth and a very large lookup of what word tends to come next. That is not a criticism. It is the most interesting part.

The field is moving fast enough that this whole post will look quaint in a year. The models will get smaller for the same quality, the tooling will get friendlier, and the fine-tuned open variants will close some of the gap with the hosted services. But if you have a spare card and an idle weekend, pull down a 6B model and run it. Not because it will replace anything you currently use, but because understanding the shape of the thing, from the inside, is worth a great deal more than another demo.