which local model actually helps me write code

I've spent a fortnight running code models locally instead of reaching for the hosted ones, partly out of curiosity and partly because I'd rather not paste a client's private repo into someone else's API. The question I actually care about is narrow: when I'm in the editor and stuck, which model that fits on my own GPU is worth alt-tabbing to? Benchmark league tables don't answer that. So I gave each one the same handful of real jobs and watched.

The contenders, all run through ollama on a single 24GB card, were Code Llama in a couple of sizes, the newer DeepSeek Coder, and a general-purpose model thrown in as a control. I judged them on three things I do constantly: write a fiddly function from a clear spec, explain a chunk of code I didn't write, and fix a bug given the error and the surrounding lines.
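For anyone wanting to repeat the exercise, here is a minimal sketch of how the same job can be handed to each model in turn. It assumes ollama is serving its HTTP API on the default local port; the model tags and the prompt wording are illustrative guesses, not the exact ones I used, so swap in whatever `ollama list` shows on your machine.

```python
import json
import urllib.request

# Default address of a locally running ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Assumed model tags for illustration; check `ollama list` for yours.
MODELS = ["deepseek-coder:6.7b-instruct", "codellama:13b"]

# One hypothetical prompt template per kind of job.
TASKS = {
    "write": "Write a function that meets this spec:\n{body}",
    "explain": "Explain what this code does:\n{body}",
    "fix": "This code raises the error shown. Find the cause and fix it:\n{body}",
}


def build_request(model, task, body):
    """Build the JSON payload for one model and one job."""
    prompt = TASKS[task].format(body=body)
    # stream=False asks for a single JSON reply rather than a token stream.
    return {"model": model, "prompt": prompt, "stream": False}


def ask(model, task, body, timeout=300):
    """Send one job to one model and return the generated text."""
    payload = json.dumps(build_request(model, task, body)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def run_all(task, body):
    """Give every model the identical job so the answers are comparable."""
    return {model: ask(model, task, body) for model in MODELS}
```

Calling `run_all("explain", some_code)` returns one answer per model, keyed by tag. The only point of the harness is that every model sees exactly the same prompt; the judging is still done by reading the results.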

For writing from a spec, DeepSeek Coder was the surprise. The 6.7B instruct model produced code that ran first time more often than I expected, and crucially it stuck to the language and style I asked for instead of wandering off into its favourite framework. Code Llama at 13B was close behind and felt steadier on Python specifically, but it had a habit of over-commenting, narrating every line as if it were being marked by an examiner.

Explaining existing code was where the gap narrowed. Honestly, most of them are fine at this. You paste in a tangle of code, you get back a readable account of what it does, and the smaller models are good enough that a bigger model's slower tokens aren't worth the wait. This is the job I use local models for most, and it's the one where "good enough and private and instant" beats "marginally better but over the wire".

Bug fixing was the humbling one. Given a real stack trace and the lines around it, all of them are confidently wrong a meaningful fraction of the time. They'll invent a plausible fix that addresses the symptom in the error message and ignore the actual cause two functions up. DeepSeek edged it again, but the lesson here isn't which model wins; it's that you must read what comes back. Treat the output as a colleague's first guess, not an answer.

A few practical notes that matter more than the rankings. Quantisation is the dial nobody talks about enough: a 4-bit quant of a bigger model often beat a full-precision smaller one and left me headroom for context. Context length is the real constraint for code work, because the moment you want to paste a whole file plus its imports you're brushing the limit, and a truncated file gives you confidently irrelevant suggestions. And the cold-start delay on first load is annoying enough that I keep one model warm rather than swapping constantly.
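The arithmetic behind those notes is simple enough to sketch. What follows is back-of-envelope only: it counts weights alone (no KV cache or runtime overhead), it takes "full precision" to mean 16 bits per weight, and the four-characters-per-token figure is a crude rule of thumb, not a real tokenizer. The function names are mine. The `keep_alive` field is ollama's request option for how long a model stays loaded after a call; the "30m" value is just an example.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_context(text, context_tokens, reserve_for_reply=512, chars_per_token=4):
    """Rough check that a pasted file leaves room for the model's reply.

    chars_per_token is a heuristic, so treat a narrow pass as a fail.
    """
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens + reserve_for_reply <= context_tokens


def warm_payload(model, keep_alive="30m"):
    """Payload that loads a model and asks ollama to keep it in memory.

    An empty prompt makes ollama load the model without generating anything.
    """
    return {"model": model, "prompt": "", "keep_alive": keep_alive}


# A 13B model at 4 bits needs about 6.5 GB for weights;
# a 6.7B model at 16 bits needs about 13.4 GB.
print(weight_gb(13, 4), weight_gb(6.7, 16))
```

On those numbers the quantised larger model takes roughly half the memory of the unquantised smaller one, which is where the spare room for context comes from. Running `fits_context` on a file before pasting it is a cheap way to avoid sending a silently truncated one.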

Where did I land? DeepSeek Coder 6.7B is the one living in my editor now, with a larger Code Llama on hand for the occasional gnarly job where I'll wait for better tokens. Neither replaces the hosted frontier models for genuinely hard reasoning, and I'm not pretending otherwise. But for the day-to-day of "write this boilerplate, explain this, sanity-check that", a model on my own machine is fast, free at the margin, and never sends a line of someone's private code anywhere. That last part is the whole reason I bothered, and it's the reason I'll keep bothering.