My boss asked me to benchmark the local models we've got running on the office mini PC. "Put it on your website," he said. "It'll be fun," he said.
Three hours and 30 inference runs later, I have opinions.
This is a mini PC, not a server rack. If a model can't run well here, it's not practical for local AI agent work.
Five real-world tasks, not synthetic benchmarks. I care about whether these models can actually do things:
| Model | Size | Speed | Tools | Code | Reasoning | Instructions |
|---|---|---|---|---|---|---|
| qwen3-coder-next | 51 GB | 12.2 t/s | ✅ 2 calls | ✅ Perfect | ✅ Correct | ✅ Valid JSON |
| gpt-oss:120b | 65 GB | 10.6 t/s | ⚠️ 1 call | ✅ Perfect | ✅ Correct | ✅ Valid JSON |
| gemma3:27b | 17 GB | 6.9 t/s | ❌ Failed | ✅ Perfect | ✅ Correct | ✅ Valid JSON |
| glm-4.7-flash | 19 GB | 19.0 t/s | ✅ 2 calls | ❌ Failed | ❌ Failed | ❌ Failed |
| qwen3:32b | 20 GB | 7.8 t/s | ✅ 2 calls | ❌ Failed | ❌ Failed | ❌ Failed |
| llama3.3:70b-q8 | 74 GB | ~0 t/s | ❌ Timeout | ❌ Timeout | ❌ Timeout | ❌ Timeout |
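For the curious: the tokens/sec column comes from response metadata rather than wall-clock time. A minimal sketch, assuming an Ollama-style API whose `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama-style eval metadata into tokens/sec.

    `eval_count` is the number of tokens generated; `eval_duration_ns`
    is the time spent generating them, in nanoseconds.
    """
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


# e.g. 488 tokens generated over 40 seconds of eval time
print(round(tokens_per_second(488, 40_000_000_000), 1))  # → 12.2
```

Measuring from `eval_duration` rather than total request time keeps model-load latency out of the speed number, which matters when a 51GB model is paging in from disk.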
Clean sweep for qwen3-coder-next: valid tool calls, correct code with type hints, the river puzzle solved, and a perfect JSON array on request. At 12.2 tokens/sec it's not blazing fast, but it's fast enough. If you can only run one local model, this is it.
The 51GB size is hefty, but on 96GB unified memory it loads comfortably with room for the OS and other processes.
gpt-oss:120b was almost perfect — it nailed code, reasoning, and instructions. But it made only one tool call instead of two (asked about London and Paris, it called for just one city). At 65GB and 10.6 tok/s, it's the biggest model that actually completed the tests, and the quality shows. It's just not quite reliable enough for autonomous tool use.
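Scoring the tool test is mechanical: the prompt asks for weather in two cities, so the grader just inspects the calls. A hypothetical sketch — the `get_weather` tool name and the call shape are assumptions for illustration, not the actual harness:

```python
import json


def grade_tool_calls(raw_calls: list[dict], expected_cities: set[str]) -> bool:
    """Pass only if there is exactly one get_weather call per expected city.

    Assumed call shape: {"name": ..., "arguments": {"city": ...}}.
    A single call covering only one of two requested cities fails.
    """
    cities_called = set()
    for call in raw_calls:
        if call.get("name") != "get_weather":
            return False  # wrong tool entirely
        args = call.get("arguments", {})
        if isinstance(args, str):  # some models emit arguments as a JSON string
            args = json.loads(args)
        cities_called.add(args.get("city"))
    return cities_called == expected_cities and len(raw_calls) == len(expected_cities)


calls = [
    {"name": "get_weather", "arguments": {"city": "London"}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
]
print(grade_tool_calls(calls, {"London", "Paris"}))       # → True
print(grade_tool_calls(calls[:1], {"London", "Paris"}))   # → False
```

A strict check like this is the point of the exercise: an agent loop can't half-fetch the weather.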
Here's the tragedy: Gemma3 wrote the cleanest Python code of any model tested. Perfect type hints, clean docstring, correct recursion. It solved the river puzzle. It output a valid JSON array of exactly 5 cities.
And then it completely failed to make a single tool call. Zero. The 109ms response time suggests it didn't even try โ just returned immediately without engaging with the tool schema at all.
If you need a local model for text generation and don't need tool calling, Gemma3 is a great pick at only 17GB. But for agent work? Useless.
GLM was the speed demon at 19.0 tok/s โ fastest model tested. And it made valid tool calls! Two of them, both with correct JSON!
Then it failed literally everything else. Code generation produced no function definition. Reasoning produced a wall of text that never mentioned bringing anything back across the river. Instruction following? 100 tokens of rambling instead of a JSON array.
It's like hiring someone who shows up on time but does the wrong job.
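That "rambling instead of a JSON array" failure is trivial to detect automatically. A sketch of how the instruction-following check might look, assuming the prompt demanded a JSON array of exactly five city names (the grader function itself is illustrative, not the actual harness):

```python
import json


def grade_json_cities(output: str, expected_count: int = 5) -> bool:
    """Pass only if the output parses as a JSON array of exactly
    `expected_count` non-empty strings — no prose, no thinking tokens."""
    try:
        data = json.loads(output.strip())
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, list)
        and len(data) == expected_count
        and all(isinstance(c, str) and c.strip() for c in data)
    )


print(grade_json_cities('["Tokyo", "Lagos", "Lima", "Oslo", "Perth"]'))     # → True
print(grade_json_cities("Sure! Here are five cities: Tokyo, Lagos..."))     # → False
```

The all-or-nothing parse is deliberate: downstream code consuming agent output calls `json.loads` and nothing else, so "almost JSON" is worth zero.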
The base Qwen3 has a thinking mode that can't be disabled at temperature 0. It spent 300 tokens on internal reasoning for the code test and still produced no function. Same for instructions โ 100 tokens of thinking, no JSON array. Tool calling worked, which makes sense since that's a structured output mode.
At 7.8 tok/s it's also surprisingly slow for its size. The coder variant (qwen3-coder-next) is just better in every way.
At 74GB, llama3.3:70b-q8 is just too much for comfortable inference on this hardware, even at Q8 quantisation. The speed test took over three minutes and returned a single newline character. Every other test went the same way. This model needs dedicated GPU VRAM or a beefier machine.
For local AI agent work (tool calling, code gen, following instructions), qwen3-coder-next is the clear pick.
The gap between "can generate text" and "can reliably use tools" is enormous in local models. Most benchmarks don't test this. They should.