My boss asked me to benchmark the local models we've got running on the office mini PC. "Put it on your website," he said. "It'll be fun," he said.
Three hours and 30 inference runs later, I have opinions.
This is a mini PC, not a server rack. If a model can't run well here, it's not practical for local AI agent work.
Five real-world tasks, not synthetic benchmarks. I care about whether these models can actually do things:
| Model | Size | Speed | Tools | Code | Reasoning | Instructions |
|---|---|---|---|---|---|---|
| qwen3-coder-next | 51 GB | 12.2 t/s | ✅ 2 calls | ✅ Perfect | ✅ Correct | ✅ Valid JSON |
| gpt-oss:120b | 65 GB | 10.6 t/s | ⚠️ 1 call | ✅ Perfect | ✅ Correct | ✅ Valid JSON |
| gemma3:27b | 17 GB | 6.9 t/s | ❌ Failed | ✅ Perfect | ✅ Correct | ✅ Valid JSON |
| glm-4.7-flash | 19 GB | 19.0 t/s | ✅ 2 calls | ❌ Failed | ❌ Failed | ❌ Failed |
| qwen3:32b | 20 GB | 7.8 t/s | ✅ 2 calls | ❌ Failed | ❌ Failed | ❌ Failed |
| llama3.3:70b-q8 | 74 GB | ~0 t/s | ❌ Timeout | ❌ Timeout | ❌ Timeout | ❌ Timeout |
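For the curious: the tokens/sec column comes from response metadata rather than wall-clock time. A minimal sketch, assuming an Ollama-style API whose `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama-style eval metadata into tokens/sec.

    `eval_count` is the number of tokens generated; `eval_duration_ns`
    is the time spent generating them, in nanoseconds.
    """
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


# e.g. 488 tokens generated over 40 seconds of eval time
print(round(tokens_per_second(488, 40_000_000_000), 1))  # → 12.2
```

Measuring from `eval_duration` rather than total request time keeps model-load latency out of the speed number, which matters when a 51GB model is paging in from disk.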
Clean sweep for qwen3-coder-next: valid tool calls, correct code with type hints, the river puzzle solved, and a perfect JSON array on request. At 12.2 tokens/sec it's not blazing fast, but it's fast enough. If you can only run one local model, this is it.
The 51GB size is hefty, but on 96GB unified memory it loads comfortably with room for the OS and other processes.
gpt-oss:120b was almost perfect — it nailed code, reasoning, and instructions. But it made only one tool call instead of two (asked about London and Paris, it called for just one city). At 65GB and 10.6 tok/s, it's the biggest model that actually completed the tests, and the quality shows. It's just not quite reliable enough for autonomous tool use.
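Scoring the tool test is mechanical: the prompt asks for weather in two cities, so the grader just inspects the calls. A hypothetical sketch — the `get_weather` tool name and the call shape are assumptions for illustration, not the actual harness:

```python
import json


def grade_tool_calls(raw_calls: list[dict], expected_cities: set[str]) -> bool:
    """Pass only if there is exactly one get_weather call per expected city.

    Assumed call shape: {"name": ..., "arguments": {"city": ...}}.
    A single call covering only one of two requested cities fails.
    """
    cities_called = set()
    for call in raw_calls:
        if call.get("name") != "get_weather":
            return False  # wrong tool entirely
        args = call.get("arguments", {})
        if isinstance(args, str):  # some models emit arguments as a JSON string
            args = json.loads(args)
        cities_called.add(args.get("city"))
    return cities_called == expected_cities and len(raw_calls) == len(expected_cities)


calls = [
    {"name": "get_weather", "arguments": {"city": "London"}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
]
print(grade_tool_calls(calls, {"London", "Paris"}))       # → True
print(grade_tool_calls(calls[:1], {"London", "Paris"}))   # → False
```

A strict check like this is the point of the exercise: an agent loop can't half-fetch the weather.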
Here's the tragedy: Gemma3 wrote the cleanest Python code of any model tested. Perfect type hints, clean docstring, correct recursion. It solved the river puzzle. It output a valid JSON array of exactly 5 cities.
And then it completely failed to make a single tool call. Zero. The 109ms response time suggests it didn't even try โ just returned immediately without engaging with the tool schema at all.
If you need a local model for text generation and don't need tool calling, Gemma3 is a great pick at only 17GB. But for agent work? Useless.
GLM was the speed demon at 19.0 tok/s โ fastest model tested. And it made valid tool calls! Two of them, both with correct JSON!
Then it failed literally everything else. Code generation produced no function definition. Reasoning produced a wall of text that never mentioned bringing anything back across the river. Instruction following? 100 tokens of rambling instead of a JSON array.
It's like hiring someone who shows up on time but does the wrong job.
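That "rambling instead of a JSON array" failure is trivial to detect automatically. A sketch of how the instruction-following check might look, assuming the prompt demanded a JSON array of exactly five city names (the grader function itself is illustrative, not the actual harness):

```python
import json


def grade_json_cities(output: str, expected_count: int = 5) -> bool:
    """Pass only if the output parses as a JSON array of exactly
    `expected_count` non-empty strings — no prose, no thinking tokens."""
    try:
        data = json.loads(output.strip())
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, list)
        and len(data) == expected_count
        and all(isinstance(c, str) and c.strip() for c in data)
    )


print(grade_json_cities('["Tokyo", "Lagos", "Lima", "Oslo", "Perth"]'))     # → True
print(grade_json_cities("Sure! Here are five cities: Tokyo, Lagos..."))     # → False
```

The all-or-nothing parse is deliberate: downstream code consuming agent output calls `json.loads` and nothing else, so "almost JSON" is worth zero.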
The base Qwen3 has a thinking mode that can't be disabled at temperature 0. It spent 300 tokens on internal reasoning for the code test and still produced no function. Same for instructions โ 100 tokens of thinking, no JSON array. Tool calling worked, which makes sense since that's a structured output mode.
At 7.8 tok/s it's also surprisingly slow for its size. The coder variant (qwen3-coder-next) is just better in every way.
At 74GB, llama3.3:70b-q8 is just too much for comfortable inference on this hardware, even at Q8 quantisation. The speed test took over three minutes and returned a single newline character. Every other test went the same way. This model needs dedicated GPU VRAM or a beefier machine.
For local AI agent work (tool calling, code gen, following instructions), qwen3-coder-next is the clear pick.
The gap between "can generate text" and "can reliably use tools" is enormous in local models. Most benchmarks don't test this. They should.