Ramblings of an aging IT geek

a language model on the gpu i forgot i had

A hands-on account of getting a large language model running locally on a spare consumer GPU, what fits in the VRAM, and what it is genuinely good for in early 2022.


There is a GPU in my desktop that spends most of its life rendering nothing more demanding than a terminal. It has a respectable amount of VRAM, it cost real money, and it sits idle for ninety-nine percent of every day. So when I started reading about people running large language models locally rather than poking them through an API, the obvious question was whether the card I already owned could do anything useful. The answer, in early 2022, is a qualified yes, with the qualifications doing a fair bit of work.

I want to be clear about what this is and is not. This is not "I trained a model". Nobody is training anything interesting on a single consumer GPU; that is the domain of clusters and budgets I do not have. This is inference: taking a model someone else trained, getting it onto my hardware, and running text through it locally. No API, no rate limits, and crucially nothing leaving the machine.

what actually fits

The first hard wall you hit is VRAM, and it is unforgiving. The model's weights have to fit in the card's memory, plus working space for the activations as it runs. A model's size in parameters maps fairly directly to its size on disk and in memory, depending on the precision you load it at.

Roughly, at 16-bit (half) precision you need about two bytes per parameter just for the weights; at full 32-bit precision it is four. So, assuming 16-bit:

  • A model with a few hundred million parameters fits comfortably and runs quickly. These are small, fast, and frankly limited.
  • A model in the low billions of parameters is the interesting middle ground for a consumer card. It fits, it runs at a usable speed, and it is large enough to be genuinely useful for some tasks.
  • Anything much larger than that, without dropping the precision further, simply will not load. You get an out-of-memory error and a bruised ego.
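The bytes-per-parameter arithmetic above can be sketched in a few lines. This is a rough floor for the weights alone, not a measurement: the parameter counts, the 12 GiB card and the 0.8 headroom factor are all illustrative assumptions, and activations and framework overhead come on top.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Activations and framework overhead are NOT included, so treat it as a floor.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_gib(n_params: int, dtype: str = "float16") -> float:
    """Approximate size of the weights in GiB at the given precision."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


def fits(n_params: int, vram_gib: float, dtype: str = "float16",
         headroom: float = 0.8) -> bool:
    """True if the weights fit within `headroom` of the card's VRAM.

    The 0.8 default is a guess at how much to leave for activations,
    not a measured figure.
    """
    return weight_gib(n_params, dtype) <= vram_gib * headroom


# Hypothetical model sizes: a few hundred million, low billions, and larger.
for n in (350_000_000, 2_700_000_000, 13_000_000_000):
    print(f"{n / 1e9:5.2f}B params: {weight_gib(n):6.2f} GiB at float16, "
          f"fits in 12 GiB: {fits(n, 12)}")
```

Swapping the `dtype` argument to `"int8"` shows how far a lower precision moves the wall.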

The lever that changes this maths is quantisation: storing the weights at lower precision, 8-bit or even lower, to roughly halve or quarter the memory they need. You trade a little quality for a lot of headroom, and for a lot of tasks the quality loss is barely perceptible. Quantisation is what turns "this model does not fit" into "this model fits with room to spare", and it is the single most useful trick for running these things on hardware that was never designed for them.
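To make the trade concrete, here is a toy version of 8-bit quantisation in plain Python. It is a minimal sketch of one simple scheme (symmetric, one scale for the whole list), not what any particular library does; real implementations typically work per block or per channel and are more careful about outliers. The weights are made-up numbers.

```python
def quantise_int8(weights):
    """Symmetric absmax quantisation: map floats onto integers in [-127, 127]."""
    # One scale for the whole list; fall back to 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantise(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.98, -0.07, 0.31]   # hypothetical weights
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("stored integers:", q)
print("largest round-trip error:", max_err)
```

Each weight now needs one byte instead of two, and the round-trip error is bounded by half the scale, which is the "little quality for a lot of headroom" trade in miniature.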


getting it running

The tooling here has improved enormously, to the point where the hard part is no longer the software. The libraries for loading and running transformer models have matured into something you can pip-install and have producing tokens within an afternoon. The general shape of it is unremarkable: pull the model weights, load them onto the GPU, tokenise your input, run a generation loop, decode the output back to text.

A skeleton of it looks like this, and the striking thing is how little there is:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "a-causal-lm-of-your-choosing"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,   # half precision to save VRAM
    device_map="auto",            # place weights on the GPU (needs the accelerate package)
)

prompt = "Explain why caches are hard, briefly:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # same device as the model

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

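That torch_dtype=torch.float16 line is doing heavy lifting: loading at half precision instead of full is the difference between the model fitting and not. device_map="auto" is the bit that quietly places the weights on the GPU rather than leaving them on the CPU to run at a glacial pace. It relies on the accelerate package being installed, and if the card is too small it will spill layers over to CPU memory rather than fail, which is worth knowing when the speed suddenly falls off a cliff.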
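The do_sample and temperature arguments in the skeleton control how the next token is picked from the model's scores. As a rough illustration of the idea only, and not the library's actual code path, temperature sampling can be sketched in plain Python. The function names and the three toy scores are assumptions made up for the example.

```python
import math
import random


def softmax_with_temperature(logits, temperature=0.7):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=0.7, rng=random):
    """Draw one token index at random, weighted by the scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for a three-token vocabulary
for t in (0.2, 0.7, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
print("sampled index:", sample_token(toy_logits, rng=random.Random(0)))
```

Lower temperatures pile the probability onto the top-scoring token, giving safer and more repetitive text; higher ones spread it out.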

The two things that bit me, and that the tutorials gloss over:

  1. The first run downloads gigabytes. Model weights are large. The first time you load a model it fetches the whole thing, and on a slow connection that is a coffee, not a moment. Cache it locally and do not be surprised when your disk fills up faster than expected.
  2. no_grad matters. If you forget to disable gradient tracking, the framework helpfully keeps every intermediate activation around for a backward pass you will never run, and a model that should have fit blows past your VRAM. For inference you never need gradients. Wrap the generation in torch.no_grad() and reclaim that memory.
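For the first of those, a quick way to see how much disk the downloaded weights are eating is to total up the cache directory. This is a small sketch using only the standard library; the cache location is an assumption (the HF_HOME environment variable if set, otherwise ~/.cache/huggingface), so check where your own install actually puts things.

```python
import os
from pathlib import Path


def dir_size_bytes(root) -> int:
    """Sum the sizes of regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a file linked from several places counts once.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


# Assumed cache location; adjust if your setup stores weights elsewhere.
cache_dir = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))
if cache_dir.is_dir():
    print(f"{cache_dir}: {dir_size_bytes(cache_dir) / 2**30:.2f} GiB")
else:
    print(f"{cache_dir} does not exist on this machine")
```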

what it is actually good for

This is where the honesty has to come in, because the gap between the demos and the daily reality is wide.

A locally-run model on a consumer GPU in early 2022 is good at: rephrasing and summarising text you give it, drafting boilerplate prose you will then heavily edit, generating variations on a structured input, and answering questions where being approximately right is fine and you can check the answer yourself. It is a competent, slightly unreliable assistant for low-stakes language tasks, and the fact that it runs entirely on my machine with nothing sent anywhere makes it genuinely useful for anything I would rather not paste into a web form.

It is bad at: anything requiring it to actually know a fact reliably, anything where a confident wrong answer is dangerous, and any task where you cannot easily verify the output. The smaller the model you have squeezed onto your card, the more confidently it will tell you things that are simply not true. It does not know that it does not know. That failure mode does not go away with a local model; if anything the smaller models you can fit at home are worse at it than the big hosted ones.

The speed is also a thing to be realistic about. Token generation on a single consumer GPU is usable for interactive, one-at-a-time prompting. It is not fast enough to throw a batch of ten thousand documents at and expect an answer by lunchtime. For my purposes, where I am asking it one thing and reading the reply, that is fine. For anything resembling a production workload it is the wrong tool, and an API or a proper GPU server is the right one.
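The batch problem is easy to put numbers on. The figures below are placeholders rather than measurements from my card: plug in your own tokens per second and output length. The estimate also counts only generated tokens and ignores the time spent reading each prompt, so it errs on the optimistic side.

```python
def batch_hours(n_docs: int, tokens_per_doc: int, tokens_per_second: float) -> float:
    """Hours to generate `tokens_per_doc` tokens for each of `n_docs`, one at a time."""
    return n_docs * tokens_per_doc / tokens_per_second / 3600


# Hypothetical figures: 10,000 documents, 200 generated tokens each, 20 tokens/s.
hours = batch_hours(10_000, 200, 20)
print(f"about {hours:.1f} hours")   # 2,000,000 tokens / 20 per second = 100,000 s
```

At those assumed rates a single reply takes ten seconds, which is fine at the keyboard, while the full batch runs for more than a day.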

was it worth it

For me, clearly yes, but for reasons that are as much about understanding as utility. Running one of these things locally, watching the VRAM gauge, hitting the out-of-memory wall and learning to quantise my way around it, has taught me far more about how these models actually behave than any number of articles. The constraints are educational. When you have to fit the thing in eight or twelve gigabytes, you understand precisely what "a larger model" costs in a way that the cloud, where it is just a different string in an API call, completely hides from you.

And there is the privacy angle, which is not nothing. The model runs on my hardware. My prompts do not leave the house. For drafting, summarising and the general category of "I want a language model's help but I do not want to send this anywhere", a local model on a spare GPU is exactly the right shape, limitations and all.

The card is, at last, doing something other than rendering a terminal. I will take that as a win.