Ramblings of an aging IT geek

which local model is any good at code, actually

A hands-on comparison of locally-run code models on the same set of real tasks, judged on whether the output actually compiled and did what was asked.

I have spent the past week feeding the same prompts to several models I can run on my own hardware, because the benchmark numbers people quote never quite match what I see when I actually use the things. A model topping a leaderboard on HumanEval tells me almost nothing about whether it will write me a correct nom parser combinator or get the lifetime annotations right on a Rust function. So I built a small, boring, honest test of my own.

the setup

Everything ran locally through llama.cpp on a desktop with 32GB of RAM and a mid-range GPU offloading a chunk of the layers. The contenders were the ones worth taking seriously for code in September 2023: Code Llama 13B Instruct, its 7B sibling, WizardCoder, and a general-purpose Llama 2 13B as a control to see how much the code-specialised tuning actually buys you. All quantised to Q4_K_M or thereabouts so the comparison stayed fair on memory footprint.

I deliberately did not use a fancy harness. I wanted to see what I'd actually get sitting at a terminal, so each model got the same prompt, the same single attempt, no retries, no clever system prompt coaxing it into shape. If a real tool would have to wrestle the model into compliance, I wanted that wrestling to count against it.

the tasks

Twelve tasks, weighted towards the work I actually do rather than the puzzles benchmarks love. A few examples:

  • Write a Rust function that parses an ISO 8601 duration string into a std::time::Duration, returning a Result.
  • Given a struct, derive a builder pattern by hand without a macro crate.
  • Fix a borrow-checker error in a supplied snippet (the kind where the fix is to restructure, not to sprinkle clone).
  • Write a small nom parser for a key=value config line with quoted values.
  • Explain what a given chunk of unfamiliar code does, and spot the off-by-one I'd planted in it.
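To make the first task concrete, here is a minimal sketch of the kind of answer it is asking for. It is not any model's output and not a full ISO 8601 implementation: it assumes integer components only, accepts weeks, days, hours, minutes and seconds, does not enforce component order, and rejects years and months because they have no fixed length in seconds. The `DurationError` type and its variant names are hypothetical, chosen for illustration.

```rust
use std::time::Duration;

/// Illustrative error type for this sketch; the variant names are assumptions.
#[derive(Debug, PartialEq)]
pub enum DurationError {
    MissingPrefix,
    Empty,
    UnsupportedUnit(char),
    InvalidNumber,
    Overflow,
}

/// Parses a subset of ISO 8601 durations such as "PT1H30M" or "P1DT12H".
/// Fractions, years and months are deliberately not handled here.
pub fn parse_iso8601_duration(input: &str) -> Result<Duration, DurationError> {
    let rest = input.strip_prefix('P').ok_or(DurationError::MissingPrefix)?;

    let mut total_secs: u64 = 0;
    let mut in_time = false; // becomes true once the 'T' separator is seen
    let mut digits = String::new();
    let mut seen_component = false;

    for c in rest.chars() {
        match c {
            'T' if !in_time && digits.is_empty() => in_time = true,
            '0'..='9' => digits.push(c),
            unit => {
                // 'M' before 'T' would mean months, so it is only valid after 'T'.
                let seconds_per_unit: u64 = match (in_time, unit) {
                    (false, 'W') => 604_800,
                    (false, 'D') => 86_400,
                    (true, 'H') => 3_600,
                    (true, 'M') => 60,
                    (true, 'S') => 1,
                    _ => return Err(DurationError::UnsupportedUnit(unit)),
                };
                let value: u64 = digits
                    .parse()
                    .map_err(|_| DurationError::InvalidNumber)?;
                digits.clear();
                total_secs = value
                    .checked_mul(seconds_per_unit)
                    .and_then(|secs| total_secs.checked_add(secs))
                    .ok_or(DurationError::Overflow)?;
                seen_component = true;
            }
        }
    }

    if !digits.is_empty() {
        // Trailing digits with no unit designator, e.g. "PT1H2".
        return Err(DurationError::InvalidNumber);
    }
    if !seen_component {
        // "P" or "PT" on their own carry no duration.
        return Err(DurationError::Empty);
    }
    Ok(Duration::from_secs(total_secs))
}
```

The awkward cases are mostly about what to refuse: `P1M` is a month, not a minute, and a bare `P` is not a zero duration. A stricter version would also reject out-of-order components.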

I scored each one on a brutally simple scale: did it compile, and did it do the thing. Full marks for both. Half marks for compiles-but-subtly-wrong. Zero for confident nonsense.
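The scale is simple enough to write down as code. This is a sketch of the arithmetic only. The `Outcome` type and function names are hypothetical, and it assumes that output which fails to compile scores zero alongside confident nonsense, which the scale implies but does not spell out.

```rust
/// One task result under the scale described above (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    /// Compiled and did what was asked.
    Correct,
    /// Compiled but was subtly wrong.
    SubtlyWrong,
    /// Confident nonsense; assumed here to include output that did not compile.
    Failed,
}

impl Outcome {
    /// Marks awarded for a single task.
    pub fn marks(self) -> f64 {
        match self {
            Outcome::Correct => 1.0,
            Outcome::SubtlyWrong => 0.5,
            Outcome::Failed => 0.0,
        }
    }
}

/// Total marks for one model across all of its tasks.
pub fn total_marks(outcomes: &[Outcome]) -> f64 {
    outcomes.iter().map(|outcome| outcome.marks()).sum()
}
```

With twelve tasks the maximum is 12.0, and a model that only ever produces compiles-but-subtly-wrong output tops out at 6.0, which is the point of the half mark.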

what happened

Code Llama 13B Instruct was the clear winner, and not by a small margin. On the straightforward generation tasks it was genuinely good: the duration parser compiled first time and handled the awkward cases, the builder was idiomatic, and it knew its way around Result and ? without being reminded. Where it wobbled was the borrow-checker fix. It produced something that compiled, but it had reached for clone to make the error go away rather than restructuring, which is exactly the bad habit I was hoping to catch. Technically correct, spiritually wrong.
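The difference between the two kinds of fix is easier to see in code. The snippet below is not the one used in the test; it is a hypothetical `Inventory` type invented to show the pattern. The broken original would iterate over `&self.items` while calling a `&mut self` method inside the loop, which the borrow checker rejects.

```rust
pub struct Inventory {
    pub items: Vec<String>,
    pub log: Vec<String>,
}

impl Inventory {
    // Takes `&mut self`, so it cannot be called while `self.items` is borrowed.
    fn record(&mut self, item: &str) {
        self.log.push(format!("seen {}", item));
    }

    /// The clone fix: copy the whole vector so the loop no longer borrows `self`.
    /// It compiles, at the cost of an allocation and a copy of every string.
    pub fn audit_with_clone(&mut self) {
        for item in self.items.clone() {
            self.record(&item);
        }
    }

    // Restructured helper: it only needs the log, so it only asks for the log.
    fn record_into(log: &mut Vec<String>, item: &str) {
        log.push(format!("seen {}", item));
    }

    /// The restructured fix: borrow `items` immutably and `log` mutably.
    /// The two fields are disjoint, so the borrow checker accepts both at once.
    pub fn audit_restructured(&mut self) {
        for item in &self.items {
            Self::record_into(&mut self.log, item);
        }
    }
}
```

Both versions produce the same log. The second one simply states what the code actually needs to borrow, which is usually what a borrow error is asking you to work out.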

The 7B version was a noticeable step down but more useful than its size suggests. It got perhaps two-thirds of the way there on most tasks and was fast enough that iterating felt cheap. For quick scaffolding, a function signature here, a match arm there, it earned its keep. For anything where the details mattered it needed a careful eye, because it was confidently wrong often enough that you could not trust it unsupervised.

WizardCoder surprised me by being excellent at explanation and weaker at generation. Asked to describe what an unfamiliar function did, it was lucid and accurate, and it found my planted off-by-one without prompting. Asked to write the nom parser, it produced something that used an API shape that didn't exist, the sort of plausible-looking hallucination that costs you ten minutes before you realise the function it called was never real.
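One way to catch that kind of hallucination quickly is to have a dull reference for what the parser should return and compare against it. The sketch below is deliberately not nom: it is a std-only stand-in for the same key=value task, with a hypothetical function name, and it assumes no escape sequences inside quoted values.

```rust
/// Parses a line like `name = "hello world"` or `port=8080` into (key, value).
/// Returns None for lines with no '=', an empty key, or an unclosed quote.
pub fn parse_config_line(line: &str) -> Option<(String, String)> {
    // Split on the first '=' only, so values may themselves contain '='.
    let (key, value) = line.split_once('=')?;

    let key = key.trim();
    if key.is_empty() {
        return None;
    }

    let value = value.trim();
    let value = match value.strip_prefix('"') {
        // An opening quote must be matched by a closing quote at the end.
        Some(inner) => inner.strip_suffix('"')?,
        None => value,
    };

    Some((key.to_string(), value.to_string()))
}
```

A generated nom parser that disagrees with this on simple inputs is worth a second look before any time goes into debugging it.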

The general Llama 2 13B control was the instructive one. It was not useless. It could explain code reasonably and write simple functions. But the gap to Code Llama on anything fiddly was wide, which answers the question I started with: the code-specific tuning is worth a great deal, not a rounding error.

the honest conclusions

A few things I'm taking away from this, none of them surprising in hindsight but all of them worth having measured rather than assumed.

First, compiling is a low bar and these models clear it more often than they should be allowed to. A snippet that compiles and is subtly wrong is more dangerous than one that fails loudly, and the local models produce plenty of the former. The off-by-one I planted got missed by two of the four, which tells you something about leaning on them for review.
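A small illustration of why compiling proves so little. This is a made-up example, not the snippet planted in the test: two functions that both compile cleanly, one of which silently drops the last window because the loop bound is one short.

```rust
/// Sums every contiguous window of `width` values.
/// Compiles without warnings, but the loop bound is off by one:
/// there are `len - width + 1` windows, and this visits only `len - width`.
pub fn window_sums_buggy(values: &[i64], width: usize) -> Vec<i64> {
    let mut sums = Vec::new();
    if width == 0 || width > values.len() {
        return sums;
    }
    for start in 0..values.len() - width {
        let total: i64 = values[start..start + width].iter().sum();
        sums.push(total);
    }
    sums
}

/// The same thing via the standard `windows` iterator, which gets the bound right.
pub fn window_sums(values: &[i64], width: usize) -> Vec<i64> {
    if width == 0 {
        // `slice::windows` panics on a width of zero, so handle it explicitly.
        return Vec::new();
    }
    values.windows(width).map(|w| w.iter().sum::<i64>()).collect()
}
```

For `[1, 2, 3, 4]` with a width of 2, the correct version returns three sums and the buggy one returns two. Nothing fails loudly; the result is just shorter than it should be, which is exactly what a quick read tends to miss.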

Second, the clone-everywhere reflex is real and it is everywhere. Every model reached for it under pressure. If you are learning Rust by asking a model to fix your borrow errors, you will learn to paper over ownership problems rather than understand them, and that is a genuinely bad habit to absorb. I suspect this is a training-data artefact: there is an enormous amount of beginner Rust online that solves borrow errors with clone, and the models have dutifully learned that clone is what you reach for when the compiler complains. It is the path of least resistance in the corpus, so it becomes the path of least resistance in the output.

Third, and most usefully, the right tool depends entirely on the job. For generating a first draft of a well-specified function, Code Llama 13B is good enough that I now reach for it before writing boilerplate by hand. For understanding code I didn't write, WizardCoder is the one I'd ask. For quick fragments where I'll check the output anyway, the 7B is fast and cheap. None of them is a replacement for knowing what you're doing, and all of them are perfectly happy to lie to you with a straight face.

What I did not expect was how usable the whole thing is offline. The leaderboards measure something, but it isn't this, and the only benchmark I trust now is the one where I read every line of the output before I run it.