Ramblings of an aging IT geek
ai

which local model actually helps me write code

A practical comparison of the local code models I can actually run on a single workstation GPU, judged on real tasks rather than benchmarks.

I wanted to know which local model is genuinely useful for code, not which one scores best on a leaderboard. Leaderboards measure a model's ability to pass HumanEval. I measure a model's ability to stop me alt-tabbing to a browser. Those are not the same thing, and the gap between them is where most of the disappointment lives.

The constraint is hardware. Everything here runs on a single 24GB consumer GPU, because that's what I have and what most people reading this have. That rules out the very large models, which is fine, because the interesting question in mid-2025 is how good the things you can actually run at home have become. The answer is: better than I expected, with caveats.

the setup

All of this ran through Ollama with the default quantisations, mostly 4-bit, because that's what fits and that's what people use. I'm not comparing full-precision cloud weights. I'm comparing the thing that runs locally with no setup beyond ollama pull. The models on test:

  • Qwen2.5-Coder, the 7B and 14B variants
  • DeepSeek-Coder-V2 Lite, the 16B mixture-of-experts
  • Codestral 22B
  • Llama 3.1 8B, as a general-purpose baseline rather than a code specialist
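Why these all fit on a 24GB card comes down to simple arithmetic: parameters times bits per weight, divided by eight. The sketch below is a back-of-envelope estimate only. The 4.5 bits per weight is an assumed ballpark for 4-bit quantisation, not a figure from Ollama, and the parameter counts are the nominal ones in the model names.

```python
# Rough size of the quantised weights alone, for the models listed above.
# ASSUMPTION: ~4.5 bits per weight as a ballpark for 4-bit quantisation;
# the real figure depends on the exact quantisation you pull.
BITS_PER_WEIGHT = 4.5
GPU_GB = 24  # the single consumer GPU this post is constrained to

# Nominal parameter counts in billions, read off the model names.
MODELS = {
    "Qwen2.5-Coder 7B": 7,
    "Qwen2.5-Coder 14B": 14,
    "DeepSeek-Coder-V2 Lite 16B": 16,
    "Codestral 22B": 22,
    "Llama 3.1 8B": 8,
}


def weight_gb(params_billion: float, bits: float = BITS_PER_WEIGHT) -> float:
    """Approximate weight size in GB (10^9 bytes): params * bits / 8."""
    return params_billion * bits / 8


for name, params in MODELS.items():
    size = weight_gb(params)
    # Headroom is what remains for the context cache and runtime overhead.
    print(f"{name:28s} ~{size:5.1f} GB weights, ~{GPU_GB - size:5.1f} GB headroom")
```

The weights are only part of the story. The context cache and runtime overhead come out of the headroom, and a mixture-of-experts model still needs all of its weights loaded even though only some are active per token.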

I gave each one the same four tasks, drawn from things I'd actually done that week rather than puzzles. Write a non-trivial function from a docstring. Explain a chunk of unfamiliar code. Fix a bug given the code and the error. Refactor a function to be testable. I ran each task a few times to account for sampling, and I judged the output myself, because "did this save me time" isn't something a metric captures.
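A harness along these lines would reproduce the shape of that test: every model gets every task several times, and a human marks each output. It is a sketch, not the script behind these results. The model tags and the prompts are placeholders I've assumed for illustration, so check `ollama list` for the tags you actually have. The endpoint is Ollama's default local HTTP API.

```python
import json
import urllib.request
from collections import defaultdict

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

# ASSUMPTION: illustrative tags; substitute whatever `ollama list` shows.
MODELS = ["qwen2.5-coder:7b", "qwen2.5-coder:14b", "deepseek-coder-v2:16b",
          "codestral:22b", "llama3.1:8b"]

# Hypothetical placeholder prompts, one per task type described above.
TASKS = {
    "write": "Implement the function described by this docstring: ...",
    "explain": "Explain what this code does: ...",
    "fix": "This code raises the error below. Fix it: ...",
    "refactor": "Refactor this function so it can be unit tested: ...",
}


def ollama_generate(model: str, prompt: str) -> str:
    """Send one non-streaming generation request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


def run_matrix(models, tasks, repeats, generate=ollama_generate):
    """Collect `repeats` outputs for every (model, task) pair."""
    outputs = defaultdict(list)
    for model in models:
        for task, prompt in tasks.items():
            for _ in range(repeats):
                outputs[(model, task)].append(generate(model, prompt))
    return dict(outputs)


def summarise(verdicts: dict[tuple[str, str], list[bool]]) -> dict[str, float]:
    """Per-model fraction of runs a human marked as 'this saved me time'."""
    wins, total = defaultdict(int), defaultdict(int)
    for (model, _task), marks in verdicts.items():
        wins[model] += sum(marks)
        total[model] += len(marks)
    return {m: wins[m] / total[m] for m in total if total[m]}
```

Calling `run_matrix(MODELS, TASKS, repeats=3)` gathers the outputs. The verdicts passed to `summarise` are typed in by hand after reading each one, since the judgement is the part that can't be automated.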

what actually happened

The headline: Qwen2.5-Coder 14B was the best all-rounder, and it wasn't especially close. It wrote correct, idiomatic code more often than anything else on the list, understood the intent behind a vague docstring, and when it got something wrong it got it wrong in obvious ways rather than subtle ones. Subtle wrongness, the plausible code that compiles and is quietly incorrect, is the dangerous failure mode, and the 14B produced far less of it than the smaller models.

The 7B Qwen was the surprise. For autocomplete and small, well-specified functions it was genuinely good, and it's fast enough that the latency never breaks your flow. If you mostly want a better tab-complete and the occasional "write me this small thing", the 7B is plenty and leaves you GPU headroom. It falls down on tasks that need it to hold more context, which is exactly what you'd expect, but within its range it punches well above its size.

DeepSeek-Coder-V2 Lite was the most interesting on paper and the most frustrating in practice. The mixture-of-experts design means it's fast for its parameter count, and on explaining code it was excellent, often clearer than the Qwen models. But it had a habit of being confidently wrong on bug-fix tasks, inventing an explanation that sounded right and a fix that didn't work. When it was good it was very good, but I couldn't trust it without checking, which eats the time it saves.

Codestral 22B was solid and a bit boring, which is meant as a compliment. It rarely surprised me in either direction. It's the largest thing I could comfortably run and it showed in the quality of longer outputs, but it was slow enough that for quick tasks I'd reach for the 7B Qwen instead and only bring out Codestral for something meatier. The licence is also worth a look before you build anything commercial on it; it's not a standard permissive licence.
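If you want to put a number on "fast" and "slow" for your own card, a non-streaming reply from Ollama's `/api/generate` endpoint carries the timing you need: `eval_count` is the number of tokens generated and `eval_duration` is the generation time in nanoseconds. A minimal sketch, with made-up sample values rather than measurements from this test:

```python
def tokens_per_second(reply: dict) -> float:
    """Generation speed from a non-streaming Ollama /api/generate reply."""
    seconds = reply["eval_duration"] / 1e9  # eval_duration is in nanoseconds
    return reply["eval_count"] / seconds


# Hypothetical reply fragment, NOT a measurement from this comparison.
sample_reply = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_second(sample_reply):.1f} tokens/s")
```

This covers generation only. Model load time and prompt processing are reported in separate fields of the same reply, and they matter more for the first request after a model swap.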

Llama 3.1 8B, the non-specialist, was the control. It could write code, and for very common patterns it was fine, but it lagged the code-specific models on anything that needed real knowledge of an API or an idiom. The lesson there is unglamorous: for code, a model trained on code wins. The general model is the one to reach for when you want to talk about the code rather than write it.

the honest verdict

If you want one model, run Qwen2.5-Coder 14B and don't overthink it. If you're GPU-constrained or you care most about latency, the 7B version of the same family is the pragmatic pick and it'll genuinely help. The rest are interesting and worth a try, but for my actual day-to-day they didn't displace the Qwen models.

The bigger point is that local code models crossed a line this year. A year ago "useful local code model on one consumer GPU" was aspirational: the output needed so much correction that it wasn't worth it. Now it's real. Not as good as the best cloud models, clearly, but good enough that I keep the browser tab closed more often than I open it, and that was the whole point of the exercise. The privacy and the lack of a network round trip are a bonus on top of work that's finally good enough to use.