running code models on my own gpu, a fair-ish comparison

The interesting thing about local code models right now is not whether they beat the hosted ones. They do not. The interesting thing is how narrow the gap has become, and how much of that narrowing is real versus an artefact of asking them easy questions. I spent a few evenings trying to be fair about it, which mostly meant catching myself being unfair and starting again.

The setup is one machine: a single 24GB consumer card, an ageing but adequate CPU, and llama.cpp doing the heavy lifting via quantised GGUF weights. Everything I tried had to fit in that 24GB with enough headroom for a usable context, which immediately rules out the models that would actually be good. That constraint is the whole story of local inference at the moment. You are not choosing the best model, you are choosing the best model that fits.
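The fit constraint is mostly arithmetic, so a rough sketch may help. The Go below estimates weight size as parameters times bits per weight, adds a key/value cache for a given context length, and compares the total with the card. Every number in it (the 34B shape, the 4.5 bits per weight, the 1.5 GiB of overhead) is an illustrative assumption rather than a measurement, and real runtimes allocate extra buffers that this ignores.

```go
package main

import "fmt"

// ModelShape holds the handful of numbers that drive memory use.
// All values used below are hypothetical, chosen only for illustration.
type ModelShape struct {
	Params        float64 // total parameter count
	BitsPerWeight float64 // average bits per weight after quantisation
	Layers        int
	KVHeads       int
	HeadDim       int
}

const gib = 1024 * 1024 * 1024

// WeightBytes approximates the size of the quantised weights.
func (m ModelShape) WeightBytes() float64 {
	return m.Params * m.BitsPerWeight / 8
}

// KVCacheBytes approximates the key/value cache for ctx tokens:
// two tensors (K and V) per layer, each ctx x KVHeads x HeadDim elements.
func (m ModelShape) KVCacheBytes(ctx int, bytesPerElem float64) float64 {
	return 2 * float64(m.Layers) * float64(ctx) * float64(m.KVHeads) * float64(m.HeadDim) * bytesPerElem
}

// Fits reports whether weights, a 16-bit KV cache and a fixed overhead
// all fit within vramBytes.
func (m ModelShape) Fits(ctx int, vramBytes, overheadBytes float64) bool {
	return m.WeightBytes()+m.KVCacheBytes(ctx, 2)+overheadBytes <= vramBytes
}

func main() {
	// Assumed shape for a 34B-class model; not taken from any real model card.
	m := ModelShape{Params: 34e9, BitsPerWeight: 4.5, Layers: 48, KVHeads: 8, HeadDim: 128}
	for _, ctx := range []int{4096, 16384, 65536} {
		total := m.WeightBytes() + m.KVCacheBytes(ctx, 2) + 1.5*gib
		fmt.Printf("ctx=%d total=%.1f GiB fits=%v\n", ctx, total/gib, m.Fits(ctx, 24*gib, 1.5*gib))
	}
}
```

The useful part is less the verdict than the shape of the sum: the weights are a fixed cost, and the context is the term you end up trading away to make a bigger model fit.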

what "fair" had to mean

My first instinct was to throw clever prompts at each model and see what came back. That is how you fool yourself. A code model will happily produce plausible-looking output for a vague prompt, and if you are grading on vibes you will conclude they are all wonderful. So I made rules.

Every model got the same prompts, in the same order, with the same system prompt, at temperature 0 where the runtime allowed it. The prompts were things I had actually needed that week, not benchmark puzzles:

write a small Go function with a specific signature and a documented edge case
explain what a gnarly bit of someone else's bash actually does
take a failing test and a stack trace and suggest the fix
refactor a function to remove a data race I had deliberately left in

And crucially, I ran each thing three times. A model that gets it right once and wrong twice has not got it right, it has got lucky. Temperature 0 does not save you here: determinism in these runtimes is more aspirational than real, and even a small phrasing difference in a prompt could flip the answer.
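The three-runs rule is easy to write down as code. This is a minimal sketch of the idea rather than the harness I used: `Generate` is a hypothetical stand-in for whatever calls the local runtime, and `Check` is whatever you decide counts as a correct answer for that prompt.

```go
package main

import (
	"fmt"
	"strings"
)

// Generate stands in for a call to the local runtime. It is a hypothetical
// signature, not a real llama.cpp binding.
type Generate func(prompt string) (string, error)

// Check decides whether one output counts as correct for one prompt.
type Check func(output string) bool

// Result records how many of the repeated runs passed.
type Result struct {
	Passes, Runs int
}

// Reliable is true only when every run passed: right once and wrong twice is luck.
func (r Result) Reliable() bool { return r.Runs > 0 && r.Passes == r.Runs }

// RunRepeated sends the same prompt `runs` times and counts the passes.
// A runtime error counts as a failed run.
func RunRepeated(gen Generate, prompt string, check Check, runs int) Result {
	res := Result{Runs: runs}
	for i := 0; i < runs; i++ {
		out, err := gen(prompt)
		if err == nil && check(out) {
			res.Passes++
		}
	}
	return res
}

func main() {
	// A fake model that answers correctly only on its first call,
	// to show why a single run is not enough.
	calls := 0
	flaky := func(prompt string) (string, error) {
		calls++
		if calls == 1 {
			return "use sync.Mutex around both the read and the write", nil
		}
		return "looks fine to me", nil
	}
	mentionsMutex := func(out string) bool { return strings.Contains(out, "sync.Mutex") }

	res := RunRepeated(flaky, "fix the data race", mentionsMutex, 3)
	fmt.Printf("%d/%d passed, reliable=%v\n", res.Passes, res.Runs, res.Reliable())
}
```

A substring check like `mentionsMutex` is far too weak for real grading. For code, the check should compile and run the output.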

There is one more thing the constraint hides, which is speed. A model that fits with room to spare runs at a comfortable interactive pace on this card, tokens arriving faster than I read them. A model that only just fits, with the context buffer eating into the headroom, drops to a crawl that breaks the conversational rhythm entirely. By the time it has finished a paragraph I have lost the thread of why I asked. So "fits" is not binary. There is "fits and is usable" and there is "fits and you will go and make tea", and the gap between them is narrow and matters enormously to whether you actually reach for the thing.
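To put rough numbers on "usable" versus "make tea", here is a small sketch that turns a generation rate into a wait. The 300-token reply and the reading speed of 6 tokens per second are assumptions for illustration, as are the three rates. None of them are measurements from this card.

```go
package main

import (
	"fmt"
	"time"
)

// TimeToFinish estimates how long a reply of replyTokens takes at a given
// generation rate. It ignores prompt processing, which only adds to the wait.
func TimeToFinish(replyTokens int, tokensPerSecond float64) time.Duration {
	return time.Duration(float64(replyTokens) / tokensPerSecond * float64(time.Second))
}

// KeepsUpWithReader reports whether tokens arrive at least as fast as the
// reader consumes them. readerTokensPerSecond is an assumed figure.
func KeepsUpWithReader(tokensPerSecond, readerTokensPerSecond float64) bool {
	return tokensPerSecond >= readerTokensPerSecond
}

func main() {
	const reply = 300  // tokens in a medium-length answer (assumed)
	const reader = 6.0 // tokens per second a person reads (assumed)
	for _, rate := range []float64{40, 6, 1.5} {
		fmt.Printf("%.1f tok/s: %v for %d tokens, keeps up: %v\n",
			rate, TimeToFinish(reply, rate), reply, KeepsUpWithReader(rate, reader))
	}
}
```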

what they were good at

Explanation. Across the board, the local models were genuinely useful at reading code and telling me what it does. The bash explainer task was the standout: feed in a wall of find ... -exec horror and you get back a clear, correct, paragraph-by-paragraph account. This makes sense if you think about it. Explaining is a translation task, and these models are very good at translation. There is also a lower bar for failure. A slightly clumsy explanation is still useful, whereas slightly wrong code is worse than useless.

Filling in boilerplate was also solid. Give a model a function signature and a clear docstring describing the edge case, and a 7B-class model will often produce exactly the body you would have typed, including the edge case, because you told it the edge case. That is not intelligence, it is dictation, and dictation is a real productivity win when you are typing the forty-third small helper of the day.
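For a sense of what that dictation looks like, here is a made-up example of the kind of prompt and result I mean. `Chunk` is a hypothetical helper, not one of the functions from my test set. The doc comment spells out the edge case, and the body does exactly what the comment says and nothing more.

```go
package main

import "fmt"

// Chunk splits items into consecutive slices of at most size elements.
// The last chunk may be shorter. Edge case: if size <= 0, Chunk returns nil
// rather than panicking or looping forever.
func Chunk(items []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	fmt.Println(Chunk([]string{"a", "b", "c", "d", "e"}, 2)) // [[a b] [c d] [e]]
	fmt.Println(Chunk([]string{"a"}, 0))                     // []
}
```

Delete the edge-case sentence from the comment and you are back to hoping the model guesses it.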

where they fell over

Reasoning across more than a few moving parts. The data race refactor was the test that separated the runs. The smaller models would dutifully add a mutex in the wrong place, or lock around the read but not the write, or, my favourite, add a comment saying // fixed the race next to code that had not fixed the race. The larger quantised models did better, but inconsistently, and "inconsistently fixes your concurrency bugs" is a sentence that should end any deployment conversation.
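The "lock around the read but not the write" failure is worth seeing in miniature. The sketch below uses a hypothetical `Counter`, not the function from my actual test. `IncBroken` is the shape of the wrong answer, and `Inc` is the shape of the right one. Only the correct version is exercised concurrently.

```go
package main

import (
	"fmt"
	"sync"
)

// Counter is a hypothetical stand-in for the shared state in the refactor task.
type Counter struct {
	mu sync.Mutex
	n  int
}

// Get locks around the read. On its own this fixes nothing if writers
// do not take the same lock.
func (c *Counter) Get() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// IncBroken is the "lock around the read but not the write" failure:
// the increment is an unsynchronised read-modify-write. It is shown for
// contrast only and must not be called from more than one goroutine.
func (c *Counter) IncBroken() {
	c.n++ // data race when called concurrently
}

// Inc takes the same mutex as Get, so every access to n is serialised.
func (c *Counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

// IncConcurrently calls Inc once from each of the given number of goroutines
// and waits for all of them to finish.
func IncConcurrently(c *Counter, goroutines int) {
	var wg sync.WaitGroup
	for i := 0; i < goroutines; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Inc()
		}()
	}
	wg.Wait()
}

func main() {
	var c Counter
	IncConcurrently(&c, 1000)
	fmt.Println(c.Get()) // 1000
}
```

Swap `c.Inc()` for `c.IncBroken()` and run with `go run -race`, and Go's race detector should flag the unsynchronised write. That is a better judge of a "fixed" race than the model's own comment.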

The other failure was subtle and worth naming. They are confidently wrong in a register that reads exactly like being confidently right. There is no tell. With a junior engineer you can hear the hedging. The model hands you a wrong answer in the same calm prose as a correct one, so you cannot skim its output. You have to verify it, and for code that means running it. Once you are running everything anyway, some of the time saving evaporates.

the quantisation question

I expected aggressive quantisation to be where quality fell off a cliff. It was gentler than that. Dropping from the higher-bit quants to a middling 4-bit one cost me less on the explanation tasks than I feared and more on the reasoning tasks than I hoped, which is the same pattern as everything else: the harder the thinking, the more the compression hurts. If your use is "explain this and write me boilerplate", you can quantise hard and barely notice. If your use is "reason through my bug", you want every bit you can fit, and you still will not have enough.
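If the mechanics of "fewer bits, more damage" are hazy, a toy round trip shows the basic effect. The scheme below is a simple symmetric quantiser with one scale per block, written for illustration. It is not the format llama.cpp uses, and the "weights" are made-up values. It says nothing about why reasoning suffers more than explanation, only that the rounding error grows as the bits shrink.

```go
package main

import (
	"fmt"
	"math"
)

// QuantiseRoundTrip maps each value to one of the signed integer levels
// available at the given bit width (symmetric, one scale per block) and back.
// bits must be at least 2. This is a toy scheme, not a real GGUF quant type.
func QuantiseRoundTrip(block []float64, bits int) []float64 {
	maxAbs := 0.0
	for _, v := range block {
		maxAbs = math.Max(maxAbs, math.Abs(v))
	}
	out := make([]float64, len(block))
	if maxAbs == 0 {
		return out // an all-zero block stays all zero
	}
	levels := float64(int(1)<<(bits-1) - 1) // 7 for 4-bit, 127 for 8-bit
	scale := maxAbs / levels
	for i, v := range block {
		out[i] = math.Round(v/scale) * scale
	}
	return out
}

// MeanAbsError is the average absolute difference between two equal-length slices.
func MeanAbsError(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		sum += math.Abs(a[i] - b[i])
	}
	return sum / float64(len(a))
}

func main() {
	// Deterministic made-up "weights"; real weight distributions differ.
	block := make([]float64, 32)
	for i := range block {
		block[i] = math.Sin(float64(i)*0.7) * 0.1
	}
	for _, bits := range []int{8, 4, 2} {
		err := MeanAbsError(block, QuantiseRoundTrip(block, bits))
		fmt.Printf("%d-bit: mean abs error %.5f\n", bits, err)
	}
}
```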

what I actually concluded

I am keeping a local model in the loop, but for a narrower job than the hype suggests. It is my rubber duck and my boilerplate typist. It reads code I cannot be bothered to read closely, and it writes code I cannot be bothered to type. For anything where being wrong is expensive, I either verify everything it produces or I do not use it.

That is not a triumphant conclusion and I am suspicious of anyone whose comparison ends in a triumph. The honest version is that local code models in April 2023 are a useful tool with sharp edges, running on hardware that cannot hold the models that would dull those edges. The gap is closing. It is not closed. And the part that bigger weights will not fix is the confident wrongness, which is a property of how these things work, not how big they are. Verify the output. Always verify the output.