Ramblings of an aging IT geek

a language model on the spare gpu in the cupboard

What it actually takes to run a large language model locally on consumer hardware in 2022, and what you get for the effort.


I had a spare GPU sitting in a drawer, a 12GB card pulled out of my desktop when I upgraded, and the obvious question for 2022: can I run a real language model on it, at home, without renting anything? The short answer is yes, with caveats, and the caveats are most of the interesting part.

This is not GPT-3. You cannot run GPT-3 at home; it lives on a cluster and OpenAI rents you access to it through an API. But the open models have got genuinely usable in the last year, and the tooling has caught up enough that a determined evening gets you something running.

what is actually runnable

What made this practical is the EleutherAI lineage: GPT-NeoX at the large end, and the smaller GPT-J 6B, which is the model I kept coming back to. Six billion parameters sounds enormous until you realise the truly large models are more than an order of magnitude bigger (GPT-3 is 175 billion), but it is about the largest thing that fits on a 12GB card without heroics.

The memory arithmetic is the first thing to understand, because it governs everything. A parameter in full 32-bit float is four bytes. GPT-J 6B in fp32 is therefore around 24GB of weights alone, which does not fit. In fp16 it halves to roughly 12GB, which fits, barely, with nothing left over for the activations and the KV cache that grow with your sequence length. So in practice you are running fp16 and watching nvidia-smi like a hawk.

watch -n1 nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv
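The arithmetic above is easy to check for yourself. What follows is a rough sketch, not a measurement: the layer count and hidden size are the published GPT-J 6B configuration (28 layers, hidden size 4096), the parameter count is rounded to 6 billion to match the figures in the text, and the function names are my own. It ignores activations and the CUDA context, so treat the results as a floor.

```python
# Back-of-envelope memory estimate for GPT-J 6B on a single GPU.
# Assumptions: parameter count rounded to 6e9; 28 layers and hidden size
# 4096 from the published GPT-J 6B config. Activations and CUDA context
# overhead are NOT included, so real usage will be higher.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

N_PARAMS = 6_000_000_000
N_LAYERS = 28
HIDDEN = 4096


def weights_gb(dtype: str, n_params: int = N_PARAMS) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def kv_cache_gb(seq_len: int, dtype: str = "fp16", batch: int = 1) -> float:
    """Keys and values stored in every layer for every token so far."""
    per_token = 2 * N_LAYERS * HIDDEN * BYTES_PER_PARAM[dtype]  # 2 = keys + values
    return batch * seq_len * per_token / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weights_gb(dtype):.0f} GB of weights")
    for seq_len in (256, 1024, 2048):
        print(f"{seq_len} tokens: {kv_cache_gb(seq_len):.2f} GB of fp16 KV cache")
```

At the full 2048-token context the fp16 cache alone comes to a little under 1GB on top of the weights, which is exactly the headroom a 12GB card does not have.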

If you cannot fit even that, the trick of the moment is 8-bit loading. The bitsandbytes work and the Hugging Face integration around it let you load weights at int8 and roughly halve the memory again, at a small and usually acceptable quality cost. This is the single biggest reason a model that "needs" a datacentre card will suddenly run on a gaming card.
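To see where both the saving and the quality cost come from, here is a deliberately simplified sketch of absmax int8 quantisation on a single row of weights. It is not what bitsandbytes actually does (as I understand it, the real scheme handles outlier values separately), and the weights, function names, and row size are invented for illustration.

```python
# Simplified absmax int8 quantisation of one row of weights.
# Illustration only: the real bitsandbytes scheme is more involved,
# and the weights below are made-up numbers, not taken from any model.

def quantise_absmax(row):
    """Map floats to integers in [-127, 127] using one scale per row."""
    largest = max(abs(w) for w in row)
    scale = largest / 127 if largest > 0 else 1.0  # avoid dividing by zero
    return [round(w / scale) for w in row], scale


def dequantise(q_row, scale):
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in q_row]


weights = [0.12, -0.53, 0.07, 0.91, -0.002, 0.33]  # hypothetical values
q, scale = quantise_absmax(weights)
restored = dequantise(q, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)            # each value now fits in one byte instead of two
print(worst_error)  # rounding error, bounded by scale / 2
```

Each weight shrinks from two bytes to one, plus a single scale per row, which is where the "roughly halve" comes from. The rounding error is the quality cost, and one very large value in a row stretches the scale and makes it worse for everything else in that row.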

the actual setup

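The path of least resistance is Hugging Face Transformers. You install a recent PyTorch built against the right CUDA version, install transformers and accelerate, then load the model in half precision and move it onto the GPU. (Passing device_map="auto" instead lets accelerate do the placement for you, which starts to matter once a model has to be split between GPU and system RAM.)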

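from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

name = "EleutherAI/gpt-j-6B"
tok = AutoTokenizer.from_pretrained(name)

# Pull the half-precision branch of the repo rather than the fp32 weights.
model = AutoModelForCausalLM.from_pretrained(
    name,
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,  # avoid building a full extra copy in system RAM
).to("cuda")

prompt = "The three hardest problems in distributed systems are"
ids = tok(prompt, return_tensors="pt").to("cuda")  # inputs must be on the same device
out = model.generate(
    **ids,
    max_new_tokens=80,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tok.eos_token_id,  # GPT-J defines no pad token; reuse end-of-text
)
print(tok.decode(out[0], skip_special_tokens=True))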

The two flags that save you here are revision="float16" so you download the half-precision weights rather than the full fp32 ones, and low_cpu_mem_usage=True so it does not try to materialise the whole model in system RAM before moving it across. Get those wrong and you will spend twenty minutes downloading and then OOM on a machine with plenty of memory, which is a special kind of annoying.

The first load is slow. You are pulling many gigabytes of weights, and on a domestic connection that is a cup of tea, possibly two. After that they cache locally and subsequent loads are limited by how fast you can read from disk into the GPU.

what it is actually like

It is genuinely impressive and genuinely limited, in roughly equal measure, and holding both thoughts at once is the right way to use it.

The good: it completes text fluently, it has clearly absorbed a great deal of the public internet, and for tasks like rephrasing, brainstorming, and generating boilerplate it is useful in a way that feels like a step change from what I could do locally a year ago. Ask it to continue a paragraph in a particular style and it will, often well.

The limitations are equally clear once you push on it. It has no memory beyond the context window, so it cannot hold a conversation in any real sense without you feeding the history back in each turn, and the context window is small enough that long documents are out. It hallucinates confidently, inventing citations and facts with the same calm tone it uses for true ones. And it has no notion of when it does not know something, which is the failure mode you have to design around rather than hope away.
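Feeding the history back in each turn amounts to rebuilding the prompt from the most recent exchanges and dropping the oldest once the budget runs out. A minimal sketch, assuming a 2048-token window: count_tokens is a crude whitespace stand-in for a real tokenizer, and build_prompt, the speaker labels, and the reply reserve are all hypothetical choices of mine.

```python
# Sketch of keeping a running conversation inside a fixed context window.
# count_tokens is a rough stand-in for a real tokenizer; the function
# names, speaker labels, and budget numbers are illustrative assumptions.

def count_tokens(text: str) -> int:
    return len(text.split())


def build_prompt(history, user_msg, context_window=2048, reserve_for_reply=200):
    """Return a prompt made of the newest turns that fit in the budget."""
    budget = context_window - reserve_for_reply
    tail = f"User: {user_msg}\nBot:"
    used = count_tokens(tail)
    kept = []
    # Walk backwards so the most recent turns survive and the oldest drop off.
    for speaker, text in reversed(history):
        line = f"{speaker}: {text}"
        cost = count_tokens(line)
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    return "\n".join(list(reversed(kept)) + [tail])


history = [
    ("User", "What is a KV cache?"),
    ("Bot", "It stores attention keys and values for earlier tokens."),
]
print(build_prompt(history, "Why does it grow with sequence length?"))
```

The model sees only what this function returns, so anything trimmed off the front is simply gone as far as it is concerned. In real use you would count with the model's own tokenizer, since whitespace splitting undercounts.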

The other honest caveat is throughput. On a single consumer card you are generating a handful of tokens per second, not the instant walls of text the hosted services produce. For experimentation that is fine. For anything interactive at scale it is not, and that gap is exactly what you are paying for when you use a hosted API.
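If you want a number rather than a feeling for throughput, timing a generation call is enough. This is a small sketch with a fake generator standing in for the model so that it runs anywhere; tokens_per_second and fake_generate are names I made up, not part of any library.

```python
import time


def tokens_per_second(generate, n_new_tokens, clock=time.perf_counter):
    """Time one generation call and return new tokens per second.

    generate is any callable that takes the number of tokens to produce
    and returns how many it actually produced (a hypothetical wrapper).
    """
    start = clock()
    produced = generate(n_new_tokens)
    elapsed = clock() - start
    return produced / elapsed


def fake_generate(n):
    """Stand-in for a real model: pretends each token takes about 1 ms."""
    time.sleep(0.001 * n)
    return n


rate = tokens_per_second(fake_generate, 50)
print(f"{rate:.1f} tokens/s")
```

With the real model, the wrapper would call model.generate with max_new_tokens set to n and return the output length minus the prompt length. It is worth discarding the first call, which tends to include one-off start-up cost on the GPU.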

was it worth it

Yes, for the understanding alone. Running the thing locally, watching the memory fill, feeling exactly where the limits bite, taught me more about how these models actually work than any amount of reading about them from a distance. The abstraction of an API hides all of this, deliberately and sensibly, but hiding it also makes it mysterious.

If you have a spare card with 8GB or more (below 12GB you will want the 8-bit route), an evening, and a tolerance for dependency wrangling, I would recommend it. Not because the local model will replace anything you currently use a hosted service for (it will not), but because the gap between "magic in the cloud" and "matrix multiplications I can run in my own cupboard" is worth closing in your head. The cupboard is warm now, mind, and the fans are not subtle. Small price.