I had a spare GPU doing nothing useful and a nagging sense that running a language model locally ought to be possible by now. The cloud APIs are fine, but I wanted something on my own hardware, where the weights sit on my disk and nothing leaves the house. So I spent a weekend finding out where the demos stop and the reality starts.
The reality starts at VRAM, and it is unforgiving. The model I most wanted to run did not fit. A model's parameter count translates fairly directly into memory: at fp16 (half precision, the usual format for inference) you need two bytes per parameter just to hold the weights, before you account for activations and the framework's overhead. A 6-billion-parameter model therefore needs about twelve gigabytes of VRAM for its weights alone in fp16, and my spare card does not have twelve gigabytes spare. The 13B and 20B class models that everyone screenshots, at roughly 26 and 40 gigabytes of weights, are simply not going on consumer hardware at that precision.
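The arithmetic is simple enough to sketch in a few lines. This is a back-of-the-envelope estimate only: the function name `weight_memory_gb` is my own, it counts a gigabyte as 10^9 bytes, and it ignores activations and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int = 16) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Ignores activations and framework overhead, so treat the result
    as a lower bound on what the card actually needs.
    """
    bytes_per_param = bits_per_param / 8
    return n_params * bytes_per_param / 1e9


# Model sizes mentioned above, all at fp16 (16 bits per parameter).
for name, n_params in [("6B", 6e9), ("13B", 13e9), ("20B", 20e9)]:
    print(f"{name}: ~{weight_memory_gb(n_params):.0f} GB of weights in fp16")
```

Passing a smaller `bits_per_param` shows how much room a narrower weight format would free up.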
The way through is quantisation: store the weights in eight or even four bits instead of sixteen. You lose a little quality and you gain a lot of room, because halving the bits per weight halves the memory the weights need. With 8-bit loading I could fit a model that would otherwise have laughed at my card, and honestly the output was hard to tell apart from the fp16 version for the things I was asking it to do. The bitsandbytes library does the heavy lifting, and loading a model in 8-bit is mostly a single flag once the stack is installed:
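    from transformers import AutoModelForCausalLM, AutoTokenizer

    # "the-model" is a placeholder: substitute a real model ID or local path.
    model = AutoModelForCausalLM.from_pretrained(
        "the-model",
        device_map="auto",   # let the library decide where each layer lives
        load_in_8bit=True,   # quantise the weights to 8-bit via bitsandbytes
    )
    tokenizer = AutoTokenizer.from_pretrained("the-model")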
device_map="auto" is the other quiet hero here. If the model still does not fit, it spills layers across GPU and CPU so the thing at least runs, slowly, rather than dying with an out-of-memory error before it generates a single token.
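To make the quantisation trade concrete, here is a toy sketch of the simplest 8-bit scheme I know of, absmax quantisation: scale the weights so the largest one maps to 127, round to integers, and multiply back by the scale when you need floats again. This is an illustration under my own assumptions, not what bitsandbytes does internally (its real scheme is more sophisticated), and the function names and sample weights are made up.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights, purely for illustration.
weights = [0.12, -0.5, 0.33, 0.0, -0.07]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

# Each weight is now one small integer plus a shared scale factor.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(f"max round-trip error: {max_error:.4f}")
```

The round-trip error per weight is bounded by half the scale, which is the "little quality" you give up in exchange for halving the memory.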
The honest assessment, in June 2022: this is squarely a hobbyist activity. Generation is slow compared to the hosted APIs, the open models are noticeably less capable than the best closed ones, and getting CUDA, the right driver, and the Python stack to agree took longer than the inference did. But it runs. It runs on a card I already owned, the weights never leave my network, and I can poke at the internals in a way no API will ever let me.
That last part is what makes it worth the faff. Running a model locally is not about beating the cloud on quality or speed, because it won't, not this year. It is about the difference between using a thing and understanding it. The spare GPU has a job now.