I had a spare GPU sitting in a second machine doing nothing but the occasional transcode, so I spent an evening seeing how large a language model I could actually run at home without paying anyone for an API. The answer, with some fiddling, is GPT-J 6B, and it just about fits.
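The naive approach fails immediately. Load the model in fp32 with transformers and you are asking for roughly 24 GB of weights alone (about 6 billion parameters at 4 bytes each), which my card does not have. The trick is to load in half precision, which cuts the weights to roughly 12 GB, and to pass low_cpu_mem_usage=True so the loader does not build a second full copy of the model in system RAM along the way: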
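from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",         # fetch the fp16 branch of the model repo
    torch_dtype=torch.float16,  # keep the weights in half precision once loaded
    low_cpu_mem_usage=True,     # avoid a second full copy in system RAM
).cuda()                        # move the loaded model onto the GPU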
That revision="float16" matters; it pulls the fp16 weights directly rather than downloading 24 GB and converting. With that, generation runs at a few tokens a second, which is glacial next to a hosted API but perfectly usable for tinkering offline.
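For anyone who wants to check the 24 GB figure, here is a back-of-the-envelope sketch of the arithmetic. It assumes a round 6 billion parameters (the real count is slightly different) and decimal gigabytes, and it counts the weights only, not activations or anything else the GPU needs at generation time. The names weight_gb and N_PARAMS are mine, not part of any library.

```python
# Rough size of the weights alone for a given parameter count and dtype.
# Assumption: 6e9 is a round-number stand-in for GPT-J 6B's parameter count.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_gb(n_params: float, dtype: str) -> float:
    """Return the weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


N_PARAMS = 6e9  # assumed nominal parameter count

fp32_gb = weight_gb(N_PARAMS, "float32")
fp16_gb = weight_gb(N_PARAMS, "float16")
print(f"fp32: {fp32_gb:.0f} GB, fp16: {fp16_gb:.0f} GB")
```

Whatever headroom the card has above the fp16 number is what is left for everything else, which is why it only just about fits.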
Is it any good? It is GPT-2's clever older cousin. It will happily continue a prompt, write passable boilerplate, and then confidently invent a function that does not exist. But it is mine, it runs with the network cable unplugged, and nothing I type leaves the room. For that alone the evening was worth it.