Running a model on the spare 3090, and what it actually costs

I had a spare 3090 sitting in a half-built machine, and the entire internet has spent the last month talking about nothing but large language models. So over the quiet days between Christmas and the new year I decided to stop reading about them and actually run one on hardware I own, on the desk, with the network unplugged if I felt like it.

The short version: it works, it is more useful than I expected for a few specific things, and it is far less magical than the demos make it look. The longer version is below, because the gap between "it loads" and "it is useful" is where all the interesting detail lives.

What actually fits in 24GB

The first thing you learn is that the marketing numbers and the numbers that fit on your card are very different quantities. Seven billion parameters does not mean seven billion bytes. At 16-bit precision you are looking at two bytes per parameter just for the weights, so a 7B model wants about 14GB before you have done anything. A 13B model at 16-bit wants about 26GB, so it will not fit in 24GB at all, even before you account for the activations and the KV cache.
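The arithmetic is simple enough to keep in a scratch file. This is a rough sketch that counts the weights only; the function name `weight_gb` is mine, and real usage will be higher once activations, the KV cache and framework overhead are added on top.

```python
# Back-of-envelope VRAM needed for the weights alone.
# Ignores activations, the KV cache and framework overhead.

def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Decimal gigabytes needed to store the weights at a given precision."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


CARD_GB = 24  # the 3090

for size in (7, 13, 30):
    for bits in (16, 8, 4):
        gb = weight_gb(size, bits)
        verdict = "fits" if gb < CARD_GB else "does not fit"
        print(f"{size:>2}B at {bits:>2}-bit: {gb:5.1f} GB of weights ({verdict})")
```

A result close to 24GB should be read as "does not fit in practice", since the weights are never the only thing on the card.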

The trick that makes any of this practical on consumer hardware is quantisation: storing the weights at lower precision. Drop to 8-bit and the 13B model squeezes in. Drop to 4-bit and you can run it with room to spare, and you can even get a 30B-class model to load if you are patient and willing to accept it being slow.

There is, of course, a cost. Quantisation is lossy. The 4-bit version of a model is measurably worse than the 8-bit version, which is worse than the full-precision one. For a lot of tasks the difference is small enough not to care. For anything where precise wording or careful reasoning matters, you notice. I settled on 8-bit for the 13B model as the sweet spot on the 3090: it fits, it is reasonably quick, and the quality drop is small enough that I stopped noticing it after the first hour.
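To see why fewer bits means more error, here is a toy round trip: scale some numbers to fit a small signed integer range, round them, and scale back. It is a deliberately naive absmax scheme on made-up Gaussian "weights", not what bitsandbytes actually does (real libraries use more careful schemes), so treat the numbers as illustrative only.

```python
import random


def quantise_roundtrip(values, bits):
    """Absmax-quantise to signed `bits`-bit integers, then map back to floats."""
    levels = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, bits):
    """Average absolute difference between the originals and the round trip."""
    restored = quantise_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


random.seed(0)
# Assumption: small zero-centred values standing in for real model weights.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

for bits in (8, 4):
    print(f"{bits}-bit mean abs error: {mean_abs_error(weights, bits):.6f}")
```

The 4-bit error comes out larger than the 8-bit one because there are far fewer levels to round to. Whether that matters for a given task is the thing you only find out by using the model.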

The practical recipe was unremarkable, which is the nicest thing I can say about it:

# CUDA toolkit matching the driver, then a fresh venv
python -m venv ~/llm/venv
source ~/llm/venv/bin/activate

pip install torch --index-url https://download.pytorch.org/whl/cu117
pip install transformers accelerate bitsandbytes sentencepiece

nvidia-smi   # confirm the card is seen and the VRAM is free

The one thing worth saying clearly: get the driver, the CUDA version and the torch build agreeing with each other first, before you touch a single model. Ninety per cent of the pain I read about online is people fighting a torch build that was compiled against a different CUDA than the one on their machine. Sort that out in isolation, confirm torch.cuda.is_available() returns True, and the rest is downloading large files and waiting.
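A small check along these lines puts the versions that have to agree side by side. It is a sketch, and the helper name `cuda_report` is my own; it only reports what torch itself sees, so compare the CUDA build it prints against what nvidia-smi shows for the driver.

```python
def cuda_report():
    """Collect the versions that need to agree. Returns a dict either way."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment at all
        return {"torch": None, "cuda_build": None,
                "cuda_available": False, "device": None}

    available = torch.cuda.is_available()
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,   # None on a CPU-only build
        "cuda_available": available,
        "device": torch.cuda.get_device_name(0) if available else None,
    }


for key, value in cuda_report().items():
    print(f"{key:>15}: {value}")
```

If `cuda_build` is None, the wheel that got installed is a CPU-only one, which usually means the index URL was not picked up by pip.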

Loading something and talking to it

With the environment sane, loading a quantised model is a few lines. The key arguments are the ones that push the weights onto the GPU at reduced precision rather than trying to materialise the whole thing in 16-bit first:

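from transformers import AutoModelForCausalLM, AutoTokenizer

name = "the-model-you-downloaded"  # local path or hub id of the model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",   # let accelerate place the layers on the GPU
    load_in_8bit=True,   # quantise to 8-bit via bitsandbytes while loading
)

prompt = "Explain what a balancing feedback loop is, briefly."
# Put the input tensors on the same device as the model
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=200)
# out[0] holds the prompt tokens followed by the newly generated ones
print(tok.decode(out[0], skip_special_tokens=True))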

The first generation is slow because the weights are still being shuffled into place and the caches are cold. After that, on the 3090 with an 8-bit 13B model, I was getting throughput that felt conversational: fast enough to read along with, not fast enough to feel instant. Good enough.
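If you want a number rather than a feeling, a rough timing helper is enough. This sketch assumes only that you can wrap a generation in a zero-argument callable that returns the full sequence of token ids, prompt included; the helper name and its parameters are mine. It throws away one warm-up call, since the first generation is the slow one.

```python
import time


def tokens_per_second(generate, prompt_len, runs=3):
    """Rough new-tokens-per-second for `generate`, after one warm-up call.

    `generate` is any zero-argument callable returning a sequence of token
    ids that includes the prompt; `prompt_len` is subtracted from each
    result so that only newly generated tokens are counted.
    """
    generate()                       # warm-up, not timed
    new_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        new_tokens += len(generate()) - prompt_len
    elapsed = time.perf_counter() - start
    return new_tokens / elapsed
```

With the loading code above, something like `tokens_per_second(lambda: model.generate(**ids, max_new_tokens=200)[0], ids["input_ids"].shape[1])` should give a usable figure. Treat it as a ballpark: it depends on the prompt, the generation settings and whatever else the card is doing.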

Where it earns its keep, and where it does not

Here is the honest assessment after a few days of poking at it.

It is genuinely good at the boring middle of language tasks. Rewording a paragraph, summarising a wall of log output into something readable, turning a rough bullet list into prose, drafting the first version of an email I do not want to write. None of these need the model to be correct in any deep sense; they need it to be fluent, and fluency is the thing these models have in abundance.

It is unreliable for anything where the answer has to be right. Ask it for a specific command-line flag and there is a decent chance it confidently invents one that does not exist. Ask it for a fact and it will give you a fact-shaped sentence, which is not the same thing. I have learned to treat every factual claim it makes as a hypothesis to check, not an answer to trust, and once you internalise that it stops being annoying and starts being useful: it is a fast way to generate candidate answers that I then verify.
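For the invented-flag problem in particular, the check can be mechanical. A minimal sketch, assuming the tool in question prints its options when run with `--help`; both function names are hypothetical helpers of mine, not part of any library.

```python
import re
import subprocess


def flag_in_text(help_text: str, flag: str) -> bool:
    """True if `flag` appears as a whole option, not as part of a longer one."""
    pattern = r"(?<![\w-])" + re.escape(flag) + r"(?![\w-])"
    return re.search(pattern, help_text) is not None


def flag_in_help(command: list[str], flag: str) -> bool:
    """Run `command --help` and look for `flag` in whatever it prints."""
    result = subprocess.run(
        command + ["--help"], capture_output=True, text=True, timeout=10
    )
    # Some tools print help on stderr, so search both streams.
    return flag_in_text(result.stdout + result.stderr, flag)
```

Finding the flag in the help text only shows that it exists, not that it does what the model said it does, so this narrows the checking rather than replacing it.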

The thing the local setup gives me that a hosted service does not is total control and zero data leaving the machine. I can feed it the contents of a private repository, a draft I am not ready to share, a log full of internal hostnames, and none of it goes anywhere. For a class of work where that matters, the slightly worse quantised model on my own GPU beats a better model behind someone else's API, every time.

Would I recommend buying a GPU specifically for this? Not yet. The field is moving so fast that whatever you buy will feel dated by spring, and a hosted service will be cheaper and better for most uses. But if you already have the silicon spare, as I did, an afternoon spent getting a model running locally is an afternoon well spent. You come away with a much more grounded sense of what these things actually are: very capable autocomplete, running on your desk, that you should never quite trust and can nonetheless get a lot of mileage out of.

I have left it running. It has already saved me from writing three emails by hand, which on its own is reason enough to keep the card warm.