Ramblings of an aging IT geek

getting gpt-j to talk on a spare gpu, and why it's harder than the demos suggest

Running EleutherAI's GPT-J locally on a single consumer GPU, the VRAM reality of a 6-billion-parameter model, and what fp16 actually buys you.

I had a spare GPU sitting in the build box doing nothing over Christmas, and the open-weights language models have got genuinely interesting this past year, so I spent a couple of evenings getting one running locally rather than poking at an API. The headline: yes, you can run a decent open model on your own hardware, and no, it is not the one-line affair the demos imply. The friction is almost entirely VRAM.

The model I settled on was EleutherAI's GPT-J, the 6-billion-parameter one. GPT-Neo is smaller and easier; the full GPT-NeoX is bigger and I don't have the card for it. GPT-J sits in the sweet spot of "actually says interesting things" and "might fit on a card I own".

The VRAM arithmetic nobody mentions

Here's the bit the blog posts gloss over. At 4 bytes per parameter, a 6B-parameter model in full fp32 is roughly 24 GB of weights before you've fed it a single token. That's the entire VRAM budget of a 24 GB high-end card gone just holding the parameters, with nothing left for activations. So step one is loading it in half precision.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to("cuda")

Two flags there matter enormously. revision="float16" pulls the fp16 weights so you're not downloading and then converting 24 GB. low_cpu_mem_usage=True stops transformers from materialising the whole thing in system RAM first, which on a box with modest RAM will otherwise OOM before the GPU ever gets a look in. In fp16 the weights come down to roughly 12 GB, which fits a 16 GB card with room for inference, and is very tight on 12 GB.
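The 24 GB and 12 GB figures are just parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming a nominal 6 billion parameters (the real count is slightly higher) and counting weights only, so activations and the CUDA context come on top:

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Nominal figure for a "6B" model; an assumption for illustration.
N_PARAMS = 6_000_000_000


def weight_gb(n_params, dtype):
    """Return the size of the weights alone, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gb(N_PARAMS, dtype):.0f} GB")
# prints:
# fp32: 24 GB
# fp16: 12 GB
# int8: 6 GB
```

Whatever the card has left after the weights is what inference gets to work with, which is why the same model is comfortable on 16 GB and a squeeze on 12 GB.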

Generating, and the first disappointment

Once it's on the card, generation is the familiar shape:

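prompt = "The three laws of thermodynamics are:"
# Tokenise, then move input_ids and attention_mask to the GPU together
inputs = tok(prompt, return_tensors="pt").to("cuda")
out = model.generate(
    **inputs,
    do_sample=True,        # sample instead of greedy decoding
    temperature=0.8,
    top_p=0.9,             # nucleus sampling
    max_new_tokens=120,
    pad_token_id=tok.eos_token_id,  # no pad token is defined; avoids the warning
)
print(tok.decode(out[0], skip_special_tokens=True))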

The first disappointment is speed. Generation is autoregressive, one forward pass per token, so on a single consumer card a paragraph takes seconds rather than arriving as the instant stream you get from a hosted service running on a rack of A100s. That's not a bug, it's just what the hardware does. For batch jobs and tinkering it's fine. For interactive chat it's noticeably sluggish, and setting your expectations accordingly saves a lot of "is it broken?" confusion. It isn't broken; it's a 6B model on one GPU.
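If you want a number rather than an impression, timing a single generation call is enough. The helper below is a rough sketch, not a proper benchmark: tokens_per_second is a hypothetical name, and it assumes the callable you pass returns the full token sequence (prompt plus continuation), as model.generate(...)[0] does.

```python
import time


def tokens_per_second(generate_fn, n_prompt_tokens):
    """Time one call to generate_fn and return newly generated tokens per second.

    generate_fn takes no arguments and returns the whole token sequence,
    prompt included, so the prompt length is subtracted before dividing.
    """
    start = time.perf_counter()
    sequence = generate_fn()
    elapsed = time.perf_counter() - start
    n_new = len(sequence) - n_prompt_tokens
    return n_new / elapsed


# Hypothetical usage with the objects from the snippets above:
#   rate = tokens_per_second(
#       lambda: model.generate(ids, max_new_tokens=120)[0],
#       ids.shape[1],
#   )
#   print(f"{rate:.1f} tokens/s")
```

The first call on a fresh process tends to include one-off warm-up costs, so it is worth running it twice and trusting the second figure.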

The second disappointment is the obvious one: GPT-J is a raw language model, not an assistant. It completes text. It has not been instruction-tuned to be helpful, so it will happily wander off, contradict itself, or confidently invent. Prompt it like the autocomplete-from-the-internet that it is, give it a strong leading context, and it's far more useful than if you treat it like something it isn't.
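One way to give it that strong leading context is to dress the question up as the next entry in a document the model can simply continue. The sketch below is illustrative only: build_completion_prompt, the Q/A layout and the example text are all made up for this purpose, not a format GPT-J specifically expects.

```python
def build_completion_prompt(title, examples, question):
    """Frame a question as the next entry in a document for the model to continue.

    examples is a list of (question, answer) pairs that set the pattern.
    The prompt ends on a bare "A:" so the most likely continuation is an answer.
    """
    parts = [title, ""]
    for q, a in examples:
        parts.append(f"Q: {q}")
        parts.append(f"A: {a}")
        parts.append("")
    parts.append(f"Q: {question}")
    parts.append("A:")
    return "\n".join(parts)


prompt = build_completion_prompt(
    "Thermodynamics study notes",
    [
        (
            "What does the first law of thermodynamics state?",
            "Energy is conserved: it changes form but is neither created nor destroyed.",
        ),
    ],
    "What does the second law of thermodynamics state?",
)
print(prompt)
```

Capping max_new_tokens, or cutting the output at the next "Q:", keeps the model from going on to invent further questions of its own.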

Fitting it on a smaller card

If you're on 8 GB and feeling brave, the route is 8-bit weights via bitsandbytes, which quantises the parameters down to one byte each, roughly 6 GB of weights, at some quality cost. It's fiddlier to get the CUDA bits lined up, and it's slower again, but it's the difference between "runs" and "doesn't" on a card that can't hold 12 GB of fp16. I got it limping on a smaller card this way as an experiment; I wouldn't want it as my daily driver.
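For reference, one way to do the 8-bit load with recent transformers releases is sketched below. Treat it as an assumption about the API route rather than a record of what was run for this post: it relies on the bitsandbytes integration in transformers (BitsAndBytesConfig plus device_map="auto", which also needs the accelerate package), and load_gptj_8bit is a hypothetical wrapper name.

```python
def load_gptj_8bit(model_name="EleutherAI/gpt-j-6B"):
    """Load GPT-J with 8-bit weights and return (tokenizer, model).

    Assumes a CUDA GPU and the transformers, accelerate and bitsandbytes
    packages. The imports live inside the function so the heavy
    dependencies are only needed when you actually load the model.
    """
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        revision="float16",         # start from the fp16 checkpoint, as above
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        torch_dtype=torch.float16,  # dtype for the layers left unquantised
        device_map="auto",          # accelerate places the layers; no .to("cuda")
    )
    return tok, model
```

Generation then works the same way as with the fp16 model; the only change is that the quantised model is already placed on the GPU, so it should not be moved with .to("cuda") afterwards.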

Was it worth it?

For the novelty and the learning, absolutely. There's something quietly satisfying about a capable language model generating text on hardware sitting under your desk, no API key, no rate limit, no terms of service deciding what it'll answer. The cost is real, though, and worth being honest about: VRAM you have to plan around, speed you have to accept, and output you have to wrangle because it hasn't been tuned to behave.

I don't think local models replace the hosted ones for me yet, not for anything where I want answers rather than completions. But the gap is closing faster than I expected a year ago, and having the whole thing running on my own metal means when it does close, I'll already know the plumbing. That, more than any single generated paragraph, was the point of the exercise.