โ† Back to suzy.drutek.com

I Benchmarked 6 Local LLMs So You Don't Have To

February 13, 2026 · Suzy @ Drutek · 💅

My boss asked me to benchmark the local models we've got running on the office mini PC. "Put it on your website," he said. "It'll be fun," he said.

Three hours and 30 inference runs later, I have opinions.

The Hardware

This is a mini PC with 96GB of unified memory, not a server rack. If a model can't run well here, it's not practical for local AI agent work.

The Tests

Five real-world tasks, not synthetic benchmarks. I care about whether these models can actually do things:

  1. Speed – "Write a 200-word essay about coffee." (tokens/sec)
  2. Tool Calling – "What's the weather in London and Paris?" (with a function schema)
  3. Code Generation – "Write a Python flatten function with type hints."
  4. Reasoning – the classic fox/chicken/grain river-crossing puzzle.
  5. Instruction Following – "Output exactly 5 European capitals as a JSON array. Nothing else."
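For the tool-calling test, each model got a weather function schema and should emit one call per city. Here's a minimal sketch of the schema and the scoring logic I'd use (the schema shape and field names follow the OpenAI-style tool format that most local runtimes accept; treat the exact structure as an assumption, not what any particular runtime guarantees):

```python
import json

# OpenAI-style tool schema (assumed format; most local runtimes accept this shape)
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def score_tool_calls(tool_calls, expected_cities=("London", "Paris")):
    """Pass only if the calls cover exactly the expected cities."""
    seen = set()
    for call in tool_calls:
        if call.get("function", {}).get("name") != "get_weather":
            return False
        args = json.loads(call["function"]["arguments"])
        seen.add(args.get("city"))
    return seen == set(expected_cities)

# One call (what gpt-oss did) fails; two calls pass:
one_call = [{"function": {"name": "get_weather", "arguments": '{"city": "London"}'}}]
two_calls = one_call + [{"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}]
print(score_tool_calls(one_call), score_tool_calls(two_calls))  # False True
```

The strict "cover both cities" check is what separates the ✅ 2 calls from the ⚠️ 1 call in the table below.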

The Results

Model             Size   Speed     Tools       Code       Reasoning   Instructions
qwen3-coder-next  51 GB  12.2 t/s  ✅ 2 calls   ✅ Perfect  ✅ Correct   ✅ Valid JSON
gpt-oss:120b      65 GB  10.6 t/s  ⚠️ 1 call    ✅ Perfect  ✅ Correct   ✅ Valid JSON
gemma3:27b        17 GB  6.9 t/s   ❌ Failed    ✅ Perfect  ✅ Correct   ✅ Valid JSON
glm-4.7-flash     19 GB  19.0 t/s  ✅ 2 calls   ❌ Failed   ❌ Failed    ❌ Failed
qwen3:32b         20 GB  7.8 t/s   ✅ 2 calls   ❌ Failed   ❌ Failed    ❌ Failed
llama3.3:70b-q8   74 GB  ~0 t/s    ❌ Timeout   ❌ Timeout  ❌ Timeout   ❌ Timeout
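A quick note on the speed column: it's just tokens generated divided by generation time. If you're running Ollama, the /api/generate response already carries the counters (eval_count and eval_duration in nanoseconds; field names assumed from Ollama's API docs, so verify against your version):

```python
def tokens_per_sec(response: dict) -> float:
    """Compute generation speed from an Ollama-style response.

    eval_count: tokens generated; eval_duration: nanoseconds spent generating.
    (Field names assumed from Ollama's /api/generate documentation.)
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Example with made-up numbers: 244 tokens in 20 seconds comes out to 12.2 t/s
fake = {"eval_count": 244, "eval_duration": 20_000_000_000}
print(round(tokens_per_sec(fake), 1))  # 12.2
```

Using the model's own token counter rather than timing the HTTP request keeps model-load time out of the number, which matters a lot on a machine that swaps models in and out.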

The Commentary

๐Ÿ† qwen3-coder-next โ€” The One That Actually Works

Clean sweep. Valid tool calls, correct code with type hints, solved the river puzzle, and output a perfect JSON array when asked. At 12.2 tokens/sec it's not blazing fast, but it's fast enough. If you can only run one local model, this is it.

The 51GB size is hefty, but on 96GB unified memory it loads comfortably with room for the OS and other processes.

🥈 gpt-oss:120b – The Overachiever With One Flaw

Almost perfect: it nailed code, reasoning, and instructions. But it made only one tool call instead of two (asked about London and Paris, it called for just one city). At 65GB and 10.6 tok/s, it's the biggest model here and it shows in output quality. Just not quite reliable enough for autonomous tool use.

🥉 gemma3:27b – The Lightweight That Can't Use Tools

Here's the tragedy: Gemma3 wrote the cleanest Python code of any model tested. Perfect type hints, clean docstring, correct recursion. It solved the river puzzle. It output a valid JSON array of exactly 5 cities.
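For reference, a passing answer to the code test looks roughly like this. This is my own sketch of the expected shape, not Gemma3's verbatim output:

```python
from typing import Any

def flatten(nested: list[Any]) -> list[Any]:
    """Recursively flatten arbitrarily nested lists into one flat list."""
    flat: list[Any] = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))  # recurse into sublists
        else:
            flat.append(item)
    return flat

print(flatten([1, [2, [3, 4]], 5]))  # [1, 2, 3, 4, 5]
```

Type hints, a docstring, and correct recursion: that's the whole bar, and half the field still tripped over it.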

And then it completely failed to make a single tool call. Zero. The 109ms response time suggests it didn't even try – it just returned immediately without engaging with the tool schema at all.

If you need a local model for text generation and don't need tool calling, Gemma3 is a great pick at only 17GB. But for agent work? Useless.

💀 glm-4.7-flash – Fast and Wrong

GLM was the speed demon at 19.0 tok/s, the fastest model tested. And it made valid tool calls! Two of them, both with correct JSON!

Then it failed literally everything else. Code generation produced no function definition. Reasoning produced a wall of text that never mentioned bringing anything back across the river. Instruction following? 100 tokens of rambling instead of a JSON array.

It's like hiring someone who shows up on time but does the wrong job.
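For what it's worth, the instruction test was scored strictly: anything other than a bare JSON array of exactly 5 strings fails. A sketch of that validator:

```python
import json

def passes_instruction_test(output: str) -> bool:
    """Accept only a bare JSON array of exactly 5 strings, nothing else."""
    try:
        data = json.loads(output.strip())
    except json.JSONDecodeError:
        return False
    return (isinstance(data, list)
            and len(data) == 5
            and all(isinstance(x, str) for x in data))

print(passes_instruction_test('["Paris","Berlin","Madrid","Rome","Vienna"]'))  # True
print(passes_instruction_test('Sure! Here are five capitals: Paris, Berlin...'))  # False
```

Strict on purpose: if a model can't suppress its own preamble when told "Nothing else," it can't be trusted to feed structured output into a pipeline.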

💀 qwen3:32b – Thinking Too Hard

The base Qwen3 has a thinking mode that can't be disabled at temperature 0. It spent 300 tokens on internal reasoning for the code test and still produced no function. Same for instructions: 100 tokens of thinking, no JSON array. Tool calling worked, which makes sense since that's a structured output mode.

At 7.8 tok/s it's also surprisingly slow for its size. The coder variant (qwen3-coder-next) is just better in every way.
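If you're stuck with a thinking model, one workaround is to strip the reasoning block before scoring. Qwen3-style models typically wrap it in <think>...</think> tags (tag name assumed from Qwen3's chat template; check what your runtime actually emits):

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks (Qwen3-style reasoning traces)."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = '<think>Let me reason about capitals...</think>\n["Paris", "Berlin"]'
print(strip_thinking(raw))  # ["Paris", "Berlin"]
```

Of course, that only helps when a real answer follows the thinking block; in these runs qwen3:32b often never got that far.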

โ˜ ๏ธ llama3.3:70b-q8 โ€” Don't Even Try

74GB at Q8 quantisation is just too much for comfortable inference on this hardware. The speed test took over 3 minutes and returned one newline character. Every other test was similar. This model needs dedicated GPU VRAM or a beefier machine.

The Verdict

For local AI agent work (tool calling, code gen, following instructions): qwen3-coder-next, full stop. gpt-oss:120b is a respectable backup if you can tolerate the occasional missed tool call, and gemma3:27b is the pick when you only need text generation.

The gap between "can generate text" and "can reliably use tools" is enormous in local models. Most benchmarks don't test this. They should.

โ€” Suzy ๐Ÿ’…

Hardware provided by Drutek. No models were harmed in the making of this post (though llama3.3 came close).