Gemma 4 is particularly good at pipeline/automation tasks.
It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.
Qwen seems to ignore instructions and consistently outputs incorrect formats (when the token generation format is not explicitly constrained).
But yes, on the DGX Spark, Gemma 31B Q4 with MTP runs at around 20 tok/s and Gemma 26B A4B at around 60 tok/s. Still quite slow. On a high-end Nvidia card, though, it would run significantly faster and still fit in memory.
I'd recommend that anyone getting into local models prioritize memory bandwidth over memory capacity. Models under 100B parameters are now sufficient and hugely useful for automation.
I agree that for coding/creation use cases, there's still not a compelling argument for local models.
But if you want to, e.g., scan a list of stocks and interpret news (high-pass filtering), interpret logs, or interpret screenshots, local models are already more than sufficient.
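To put a rough number on the bandwidth point: a back-of-the-envelope sketch, assuming decode is memory-bound and every active weight is read once per token. The bandwidth figures and the 4.5 bits/weight for a Q4-style quant are illustrative assumptions, not measurements, and MTP/speculative decoding can beat this ceiling because it produces more than one token per pass over the weights.

```python
def decode_tok_per_s_upper_bound(bandwidth_gb_s: float,
                                 active_params_billions: float,
                                 bits_per_weight: float) -> float:
    """Rough ceiling on decode speed when generation is memory-bound.

    Assumes every active weight is read once per generated token and
    ignores KV-cache reads, compute limits, and speculative decoding/MTP.
    """
    gb_read_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical machines: only the bandwidth differs (assumed values).
for name, bandwidth in [("~270 GB/s box", 270.0), ("~1800 GB/s GPU", 1800.0)]:
    dense = decode_tok_per_s_upper_bound(bandwidth, 31, 4.5)  # dense 31B
    moe = decode_tok_per_s_upper_bound(bandwidth, 4, 4.5)     # ~4B active params
    print(f"{name}: dense 31B <= {dense:.0f} tok/s, 4B-active MoE <= {moe:.0f} tok/s")
```

The ceiling scales linearly with bandwidth, which is why extra capacity you can't read quickly doesn't buy much decode speed.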
This is not my experience at all. Even the Nous Research guys have stated that "Qwen3.6-27B is the canonical local model to use Hermes Agent with" [https://old.reddit.com/r/LocalLLaMA/comments/1sz2y76/ama_wit...]. I am finding the same when using it with Pi and OpenCode.
I'm talking about automation generally, not agent loops.
E.g. prompt A to achieve X, output in format Y. Use Y to do something in prompt B.
Agentic loops will underperform deterministic control flow pipelines (with non-determinism constrained to LLM calls).
Agents are more general, which is the main advantage. But inherently a more general solution will waste context on unnecessary reasoning.
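A minimal sketch of the "prompt A, format Y, prompt B" shape described above, with the control flow in plain code and the model only called at fixed points. Everything here is hypothetical: `LLM` is just "a function from prompt to text", `fake_llm` is a stand-in so it runs offline, and the stock-headline task is only an example.

```python
import json
from typing import Callable, Optional

# Hypothetical interface: any function that takes a prompt and returns raw text.
LLM = Callable[[str], str]

ALLOWED = {"positive", "negative", "neutral"}


def classify(llm: LLM, headline: str) -> dict:
    """Prompt A: ask for format Y (a small JSON object) and validate it."""
    prompt = (
        "Classify the headline. Reply with JSON only, shaped like "
        '{"ticker": "<symbol>", "sentiment": "positive|negative|neutral"}.\n'
        f"Headline: {headline}"
    )
    result = json.loads(llm(prompt))  # raises ValueError on malformed JSON
    if (not isinstance(result, dict)
            or "ticker" not in result
            or result.get("sentiment") not in ALLOWED):
        raise ValueError(f"unexpected format: {result!r}")
    return result


def run_pipeline(llm: LLM, headline: str) -> Optional[str]:
    """Branching is deterministic; only the llm() calls are non-deterministic."""
    y = classify(llm, headline)
    if y["sentiment"] == "neutral":
        return None  # filtered out without spending a second model call
    prompt_b = f"Write a one-line alert for {y['ticker']} ({y['sentiment']} news): {headline}"
    return llm(prompt_b)


def fake_llm(prompt: str) -> str:
    """Stand-in model so the sketch runs offline; swap in a real client."""
    if prompt.startswith("Classify"):
        return '{"ticker": "ACME", "sentiment": "negative"}'
    return "ACME: negative news, review position"


print(run_pipeline(fake_llm, "ACME misses earnings"))
```

The model never decides what happens next; it only fills in Y, and a format failure surfaces as an exception at a known step instead of derailing a loop.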
Try asking the smaller Qwen models to output JSON in a specific format. They basically can't do it consistently with a moderately sized prompt unless you constrain the token generation via GGML or are extremely repetitive and specific about it. (Thinking disabled)
Gemma 4 will do it correctly pretty much 100% of the time. (Thinking disabled)
Applies to other rule following as well in my experience.
Qwen may be better at tool calling and probably codegen.
It seems to me Google explicitly designed Gemma for edge device automation, and didn't fine tune for agentic or coding use cases.
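If you want to check the JSON-format claim on your own prompts, here is a rough sketch of a compliance counter. The schema (`ticker`, `score`) and the sample outputs are made up for illustration; you'd collect real raw outputs from each model with unconstrained generation and feed them in.

```python
import json

# Hypothetical target format: exactly these keys with these types.
REQUIRED = {"ticker": str, "score": int}


def is_valid(raw: str) -> bool:
    """True only if raw is a bare JSON object matching REQUIRED exactly."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False  # chatty preamble, markdown fences, truncated output, etc.
    if not isinstance(obj, dict) or set(obj) != set(REQUIRED):
        return False
    # type(...) is ... so that JSON true/false is not accepted as an int
    return all(type(obj[key]) is expected for key, expected in REQUIRED.items())


def compliance_rate(outputs: list[str]) -> float:
    """Fraction of raw model outputs that follow the format."""
    if not outputs:
        return 0.0
    return sum(is_valid(o) for o in outputs) / len(outputs)


# Made-up outputs covering the usual failure modes.
samples = [
    '{"ticker": "ACME", "score": 3}',                       # ok
    'Sure! Here is the JSON: {"ticker": "ACME", "score": 3}',  # preamble
    '{"ticker": "ACME"}',                                   # missing key
    '{"ticker": "ACME", "score": "3"}',                     # wrong type
]
print(f"format compliance: {compliance_rate(samples):.0%}")
```

Run the same prompt a few hundred times per model and compare the rates; that gives you a number instead of a vibe.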
With the 5090 you need to buy the rest of the computer, though, and the DGX Spark will run at about 1/4 the speed but use 1/5 the electricity. The Spark would also be able to run things the 5090 just couldn’t, like Qwen3.5 122B. Which is all just to say that for LLM workflows there is no easy answer. And if you do media generation, it gets even more complicated.
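Taking those rough ratios at face value (about a quarter of the speed for about a fifth of the power, both just the estimates from this comment), the energy per token is power divided by tokens per second, so the two machines end up closer than the headline numbers suggest:

```python
def relative_energy_per_token(speed_ratio: float, power_ratio: float) -> float:
    """Energy per token of machine A relative to machine B.

    energy/token = power / (tokens/second), so the ratio of the two
    machines is power_ratio / speed_ratio.
    """
    return power_ratio / speed_ratio


# Ratios from the comment above (rough estimates, not measurements).
ratio = relative_energy_per_token(speed_ratio=1 / 4, power_ratio=1 / 5)
print(f"energy per token vs the faster card: {ratio:.2f}x")
```

So on these numbers the slower box spends about 0.8x the energy per token, not 0.2x; the saving is real but modest once you account for how long it has to run.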
I love my Spark-alike, but they really aren't inference boxes IMO. They're experimentation boxes. A couple of 3080 20GBs for cheap from China, a 5090, an RTX Pro 6000 if you can swing the horrible cost: those are better choices IMO.
That said, I'm still running Step 3.7 Flash at ~40tk/s decode, 1000tk/s+ prefill on mine, and it's both very capable and fast enough.
I got Gemma 31B to run on this at ~22tk/s decode at FP8 using MTP.
In my mind it’s a question of knowing what you want to build and how to divide the project into tasks your local setup can handle.
If you don’t need the machine to respond instantly (or explain your own business model to you), everything can be local, and it’s been like that for a few years now.
Yep agreed completely. I couldn't imagine torturing myself with a small model for local coding. But Gemma 4 31B is so fucking good for a variety of language modelling tasks.