I had a spare GPU spinning its fans in the homelab and a nagging curiosity about whether I could run a decent language model without renting time from anyone. The answer in October 2022 is "yes, with caveats", and the caveats are almost entirely about VRAM.
What actually fits
The honest constraint is memory. The big, impressive models, the 175B-parameter ones, are not going anywhere near a single consumer card. What you can run locally today is the smaller open family: GPT-J 6B, the smaller GPT-Neo models, BLOOM in its more modest sizes, Meta's OPT, and the instruction-tuned FLAN-T5 models if you want something that follows a prompt rather than just continuing it.
A rough rule of thumb: fp16 stores each parameter in two bytes, so a 6B model wants about 12GB just for the weights, before you account for activations and the KV cache. On a 24GB card you have comfortable room for a 6B model and can stretch to larger ones with 8-bit loading, which roughly halves the weight footprint again. On 8 to 12GB you are firmly in "smaller model or bust" territory.
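The arithmetic above is easy to turn into a quick check. This is only a weights-only sketch: the function names and the 2GB headroom figure are illustrative assumptions, not measurements, and real usage depends on context length and batch size.

```python
# Weights-only VRAM estimate. Activations and the KV cache come on top.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Approximate size of the weights in GB (1 GB taken as 1e9 bytes)."""
    return float(params_billion * BYTES_PER_PARAM[dtype])


def fits(params_billion: float, vram_gb: float, dtype: str = "fp16",
         headroom_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed headroom fit in vram_gb."""
    return weight_gb(params_billion, dtype) + headroom_gb <= vram_gb


print(weight_gb(6))          # 12.0, the 6B-in-fp16 figure from the text
print(fits(6, 24))           # True: 12 + 2 <= 24
print(fits(6, 12))           # False: 12 + 2 > 12
print(weight_gb(6, "int8"))  # 6.0
```

Treat a True here as "worth trying", not a guarantee; long prompts can eat the headroom quickly.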
Getting it running
The path of least resistance is Hugging Face transformers. Nothing exotic:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

name = "EleutherAI/gpt-j-6B"
tok = AutoTokenizer.from_pretrained(name)

# Load the weights in half precision and let transformers place the layers
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain MTU to a sysadmin in two sentences:\n"
# Move the tokenized prompt onto the same device as the model
ids = tok(prompt, return_tensors="pt").to(model.device)
# Sample up to 80 new tokens; lower temperature means less random output
out = model.generate(**ids, max_new_tokens=80, do_sample=True, temperature=0.7)
print(tok.decode(out[0], skip_special_tokens=True))
```
device_map="auto" does a surprising amount of heavy lifting: it relies on the accelerate package to place layers across whatever you have, filling the GPU first and spilling to CPU RAM if it must. If the weights do not quite fit in fp16, installing bitsandbytes and passing load_in_8bit=True will often squeeze a model onto a card it would otherwise overflow, at some cost to speed and a little to quality.
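As a sketch of the fp16-versus-8-bit choice described above: the helper names load_kwargs and load_model are hypothetical, and the 8-bit path assumes bitsandbytes, accelerate and a CUDA GPU are all present.

```python
def load_kwargs(use_8bit: bool) -> dict:
    """Build from_pretrained keyword arguments for fp16 or 8-bit loading."""
    kwargs = {"device_map": "auto"}  # needs the accelerate package
    if use_8bit:
        kwargs["load_in_8bit"] = True  # needs bitsandbytes and a CUDA GPU
    else:
        import torch  # imported lazily, only the fp16 branch needs it here
        kwargs["torch_dtype"] = torch.float16
    return kwargs


def load_model(name: str, use_8bit: bool = False):
    """Load a causal LM; the first call downloads the weights."""
    from transformers import AutoModelForCausalLM
    return AutoModelForCausalLM.from_pretrained(name, **load_kwargs(use_8bit))


print(load_kwargs(use_8bit=True))
# Usage, once the environment is set up:
# model = load_model("EleutherAI/gpt-j-6B", use_8bit=True)
```

The two options are kept as alternatives rather than combined, so it stays obvious which one a given run actually used.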
The rough edges
Two things to set expectations on. First, these open models are not the hosted assistants that have everyone excited. Those, OpenAI's InstructGPT models among them, are heavily tuned with instruction data and reinforcement learning from human feedback, and the raw open weights you run at home are noticeably blunter. FLAN-T5 follows instructions best of the bunch, but it is small and it shows.
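One practical wrinkle if you try FLAN-T5: it is an encoder-decoder model, so it loads through AutoModelForSeq2SeqLM rather than the AutoModelForCausalLM class used for GPT-J. A minimal sketch, assuming the google/flan-t5-small checkpoint on the Hugging Face Hub; the helper name ask is hypothetical.

```python
def ask(prompt: str, name: str = "google/flan-t5-small",
        max_new_tokens: int = 40) -> str:
    """Run one instruction through a FLAN-T5 checkpoint and return the text."""
    # Imported lazily; the first call downloads the checkpoint.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=max_new_tokens)
    return tok.decode(out[0], skip_special_tokens=True)


# Usage:
# print(ask("Explain MTU to a sysadmin in two sentences."))
```

The small checkpoint runs happily on CPU, which makes it a cheap way to test the environment before pulling down a 6B model.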
Second, the tooling is young. You are wiring together transformers, CUDA, the right Torch build and bitsandbytes, and the version dance is real. Budget an evening for the environment alone, and pin everything once it works.
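Pinning can be as simple as writing the installed versions out in requirements.txt form. A small sketch using only the standard library; the function name pin_lines and the package list are illustrative assumptions.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_lines(packages):
    """Return requirements-style pins for the given package names."""
    lines = []
    for pkg in packages:
        try:
            lines.append(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            # Keep a visible marker so a missing package is not silently dropped
            lines.append(f"# {pkg}: not installed")
    return lines


print("\n".join(pin_lines(["torch", "transformers", "accelerate", "bitsandbytes"])))
```

This does not capture the CUDA toolkit or driver version, so note those down separately.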
Is it worth it? For me, yes, easily. There is something genuinely satisfying about a model that runs entirely on a box in my own garage, with no API key and no per-token meter ticking over. It is not going to write my emails to a professional standard yet. But the trajectory is steep, the open models are improving month on month, and I would rather be learning the plumbing now than scrambling to catch up later.