I wanted to know one thing: can a model running on hardware I own do enough of my coding day to be worth the bother? Not "is it as good as the frontier APIs," because it isn't, but "is it good enough often enough that I'd reach for it before the network round trip?" After a month I think the answer is a qualified yes, and the qualifications are the interesting part.
The setup, so you can calibrate everything that follows. One machine, a single 24GB GPU, models served through Ollama for the convenience and llama.cpp directly when I wanted to fiddle. I tried three families that were the obvious candidates at the back end of this year: Qwen2.5-Coder (the 7B and 14B), DeepSeek-Coder-V2-Lite, and Codestral 22B. All quantised, mostly Q4_K_M, because that is what fits and runs at a usable speed on one card.
the honest scorecard
I am not going to pretend I ran a rigorous benchmark suite. I ran my actual work through them for four weeks and kept notes. Here is roughly where they landed.
Qwen2.5-Coder 14B was the surprise. For the size, it is genuinely good at the bread-and-butter stuff: write me a function with this signature; here's a stack trace, explain it; convert this loop to use the standard library. It held context well enough across a couple of files. It is the one I left running.
Codestral 22B felt a touch stronger on larger, more ambiguous prompts, the kind where you describe a behaviour rather than a function. But on my single card the Q4 quant was slower to first token and I noticed the wait, which matters more than I expected. A model you have to wait for is a model you stop reaching for.
DeepSeek-Coder-V2-Lite was the best value on pure throughput and very solid on fill-in-the-middle completion, which is most of what I want from an in-editor model anyway.
The 7B variants are real and useful, not toys. For autocomplete and "rename this, refactor that" they are quick and right often enough. They fall down the moment a problem needs you to hold three things in your head at once.
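For anyone who hasn't poked at fill-in-the-middle directly, here is a minimal sketch of what an editor plugin sends: the code before the cursor and the code after it, with the model asked for the gap. It assumes Ollama's local HTTP API on its default port and its `suffix` field on the generate endpoint. Whether a given model tag honours `suffix` depends on that model's prompt template, so treat the default model below as an assumption, along with the helper names, which are made up for illustration.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if yours is configured differently.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_fim_request(prefix: str, suffix: str, model: str = "qwen2.5-coder:14b") -> dict:
    """Build a fill-in-the-middle request body (hypothetical helper)."""
    return {
        "model": model,
        "prompt": prefix,   # text before the cursor
        "suffix": suffix,   # text after the cursor
        "stream": False,    # one JSON reply instead of a token stream
        # Keep completions short and low-variance, as an editor would.
        "options": {"num_predict": 64, "temperature": 0},
    }


def complete(prefix: str, suffix: str, model: str = "qwen2.5-coder:14b") -> str:
    """Send the request to a locally running Ollama server and return the infill."""
    body = json.dumps(build_fim_request(prefix, suffix, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["response"]
```

Calling `complete("def add(a, b):\n    return ", "\n\nprint(add(1, 2))\n")` with the server running should come back with something like the missing expression. The whole round trip stays on localhost, which is where the latency advantage comes from.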
A word on quantisation, because it is the lever that decides whether any of this fits. Q4_K_M is the sweet spot I kept coming back to: it needs a little over half the memory of an 8-bit quant, and the quality drop is small enough that I struggled to feel it on day-to-day coding. Go lower, to Q3 or below, and the model starts making the kind of small, confident mistakes that cost you more time to catch than the model saved you. Go higher and the 14B no longer fits alongside a sensible context window. So the real comparison is not "which model is best" in the abstract; it is "which model is best at the quant and context that fit on the card I actually own," and that reshuffles the order. A 14B at Q4 that fits beats a 22B I have to cripple to load.
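The arithmetic behind that is simple enough to sketch. The bits-per-weight figures below are approximations I am assuming for illustration (K-quants mix block types, so real GGUF files vary a little), and the parameter counts are the nominal ones from the model names, not exact counts.

```python
# Rough weight-memory estimate per quantisation level.
# Bits-per-weight values are approximate and assumed for illustration only.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
}


def weight_gib(params_billion: float, quant: str) -> float:
    """Approximate size of the weights alone, in GiB (no KV cache, no overhead)."""
    total_bits = params_billion * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    for size in (7, 14, 22):
        row = ", ".join(f"{q}: {weight_gib(size, q):5.1f}" for q in BITS_PER_WEIGHT)
        print(f"{size:>2}B  {row}")
```

On these assumptions a nominal 14B comes out near 8 GiB at Q4_K_M and near 14 GiB at Q8_0, and a 22B at Q4_K_M lands around 12 GiB. That is before the context window takes its share, which is why the 8-bit 14B gets tight on a 24GB card.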
Context length is the other quiet constraint. The marketing numbers are large, but every extra token of context costs VRAM for the attention cache, and on a single card you are choosing between a bigger model and a longer window. For in-editor work a modest window is fine: you mostly want the current file and its near neighbours. For "reason across my whole module" you want both the bigger model and the bigger window, and that is precisely the workload a single consumer card cannot give you. The hardware draws the line, and it draws it right where the hard problems start.
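To put a number on the context cost, here is a back-of-envelope sketch of key/value cache size. The formula is the standard one for a transformer with grouped-query attention. The layer and head counts in the example are a hypothetical model shape chosen for illustration, not the published configuration of any model named here, and the estimate assumes a 16-bit cache.

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_tokens: int,
    bytes_per_value: int = 2,  # 16-bit cache entries (assumed)
) -> float:
    """Approximate KV cache size in GiB for a given context length."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical mid-sized model: 48 layers, 8 KV heads, head dimension 128.
    for tokens in (4096, 8192, 32768):
        print(f"{tokens:>6} tokens: {kv_cache_gib(48, 8, 128, tokens):.2f} GiB")
```

For that hypothetical shape the cache costs 192 KiB per token, so an 8K window takes 1.5 GiB and a 32K window takes 6 GiB. Stack that on top of the weights and the choice between a bigger model and a longer window stops being abstract.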
where local genuinely wins
Three things, and they are not the things the demos sell.
First, latency for small completions. An in-editor fill-in-the-middle model running locally responds before a network call would have finished its handshake. For tab-completion and boilerplate, local is not a compromise, it is the better experience, full stop.
Second, the stuff you cannot or should not send anywhere. Client code under NDA, anything with credentials in the buffer, a private repo you've not cleared for third-party processing. The local model is the only one I can point at that code without a conversation with someone in legal first. That is not a performance argument, it is a "this is the only option that's allowed" argument, and it's the one that actually changed my habits.
Third, cost stops being a meter you watch. I stopped rationing questions. When the marginal cost of asking is electricity I already pay for, I ask more, and asking more is most of where the value of these tools comes from.
where it loses, plainly
It loses on the hard, sprawling problems. The ones where you give a frontier model six files and a vague description of a bug and it reasons across all of it and finds the thing. The local models, at sizes I can run, lose the thread. They'll confidently rewrite a function in a way that's locally plausible and globally wrong. The bigger the blast radius of the change, the more I want the big model and a careful read of its output.
They also lose on knowing things. Ask about a recent library version, a new API, an obscure error from a tool released this year, and the local model's smaller training set and older cutoff show. The frontier APIs are simply more current and broader. For "how do I" questions about anything moving fast, I still go online.
what I actually do now
A split, settled into without really deciding it. Local for the constant, low-stakes, high-frequency work: completion, boilerplate, explain-this-error, refactor-this-function, anything touching code I shouldn't send away. That is most of my keystrokes. The frontier API for the occasional hard problem where reasoning across a lot of context earns its keep, and where I'll read the answer carefully anyway.
```shell
# the one I keep warm
ollama run qwen2.5-coder:14b
```
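Written down as a rule, the split looks something like the sketch below. In practice I do this by feel. The task categories, the two-file threshold, and every name in the code are hypothetical stand-ins for illustration, not something I have wired into a tool.

```python
from dataclasses import dataclass

# Task kinds that stay local in the split described above (illustrative labels).
LOCAL_KINDS = {"completion", "boilerplate", "explain_error", "refactor"}

# Hypothetical cut-off: "a couple of files" is about where the local models held up.
MAX_LOCAL_FILES = 2


@dataclass
class Task:
    kind: str              # e.g. "completion", "refactor", "debug"
    files_in_context: int  # how many files the model has to reason across
    sensitive: bool        # code that must not leave the machine


def choose_backend(task: Task) -> str:
    """Return "local" or "frontier" for a task, following the split above."""
    # Sensitive code never leaves the machine, whatever the difficulty.
    if task.sensitive:
        return "local"
    # Frequent, low-stakes work with a small context stays local.
    if task.kind in LOCAL_KINDS and task.files_in_context <= MAX_LOCAL_FILES:
        return "local"
    # Everything else is the occasional hard problem.
    return "frontier"
```

The ordering is the point: the privacy check comes first and overrides everything else, so a six-file bug hunt in NDA code still goes to the local model, with expectations adjusted accordingly.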
The thing I did not expect was how much the privacy and the zero marginal cost changed my behaviour, independent of raw quality. I ask the local model things I would never have bothered to ask a metered, off-site one, because asking is free and the code never leaves the room. A "worse" model I use ten times a day beats a better one I use twice. For the everyday grind, local is no longer the consolation prize. It is just the tool that's closest to hand, and most days that's the one that gets used.