I had a spare GPU and a curiosity that needed satisfying: how much of this can you actually run at home, on your own hardware, without an API key and a credit card attached? The answer in December 2022 is "more than I expected, with caveats", and the caveats are all about memory.
VRAM is the wall. Everything else is negotiable, but a model has to fit in the card's memory or it doesn't run, and the cards most of us have are not generous. A 12GB card sounds like plenty until you do the arithmetic. Parameters in full 16-bit precision want roughly two bytes each, so a 6-billion-parameter model is around 12GB before you have loaded a single token of context. That leaves no headroom, and it falls over.
quantisation is the answer
The way through is quantisation: storing the weights at lower precision so they take less room. Eight bits per weight roughly halves the memory, four bits quarters it, and the quality cost is smaller than you would fear, especially at 8-bit. With bitsandbytes you can load a model in 8-bit almost transparently:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"EleutherAI/gpt-j-6B",
device_map="auto",
load_in_8bit=True,
)
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
device_map="auto" is doing real work there. If the model still doesn't fit, it will offload layers to system RAM rather than crashing, at the cost of speed. Slow is better than dead.
what it's actually like
Honest expectations matter. A 6B model running locally is not going to feel like the hosted models everyone is talking about this month. It rambles, it loses the thread, and it will confidently invent things. But it runs entirely on my own hardware, offline, with no rate limits and no data leaving the room, and for a lot of tinkering that trade is exactly the one I want.
A few practical notes from the afternoon. Keep an eye on nvidia-smi in another terminal so you can watch the VRAM fill up and catch the moment it starts swapping. Pin your library versions, because this whole ecosystem is moving weekly and a working environment can rot overnight. And do not start with the biggest model you can theoretically fit; start with one that fits comfortably, get the plumbing working end to end, then scale up.
The headline, though, is that the barrier to running this stuff yourself has dropped through the floor. A year ago "run a language model at home" meant a multi-GPU rig and a tolerance for pain. Now it is a spare card, an evening, and a couple of pip install lines. The models are mediocre and the field is changing under your feet, but you can sit at your own desk and have the thing answer back, locally, for free. That is genuinely new, and it is going to get a lot more interesting very quickly.