My system is quite similar to yours: my GPU is a 6950 XT and my CPU a Ryzen 5 2600X, with the same amount of RAM, and I feel your pain. It sounds very similar to my experience from a few months ago. When it comes to tool calling, there are multiple possible issues: some models have borked templates bundled with the model file, some models are not trained on tool calling, some agent harnesses don't support the tool call output from the model very well, and some quantizations ruin the models' abilities to call tools.
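To tell those failure modes apart, it can help to look at the raw assistant message rather than at what the agent harness reports. Below is a rough sketch, assuming the server returns OpenAI-style chat messages where each tool call carries a `function.name` and a JSON string in `function.arguments`. The `check_tool_call` helper and the `tools` mapping (tool name to required argument names) are made up for illustration and are not part of any library.

```python
import json


def check_tool_call(message, tools):
    """Return a list of problems found in one assistant message (empty list = OK).

    `message` is assumed to be an OpenAI-style assistant message dict.
    `tools` is a hypothetical mapping: tool name -> list of required argument names.
    """
    problems = []
    calls = message.get("tool_calls") or []
    if not calls:
        # Typical symptom of a broken template or a model not trained on tools:
        # the call ends up as plain text in "content" instead.
        problems.append("no tool_calls in message")
    for call in calls:
        fn = call.get("function", {})
        name = fn.get("name")
        if name not in tools:
            problems.append(f"unknown tool: {name!r}")
            continue
        try:
            args = json.loads(fn.get("arguments") or "{}")
        except (json.JSONDecodeError, TypeError):
            problems.append(f"{name}: arguments are not valid JSON")
            continue
        if not isinstance(args, dict):
            problems.append(f"{name}: arguments are not a JSON object")
            continue
        missing = [p for p in tools[name] if p not in args]
        if missing:
            problems.append(f"{name}: missing required {missing}")
    return problems
```

Running this over a few dozen responses from the same prompt gives a quick feel for whether a given model, quantization, or template is the weak link.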
My suggestions if you want to further experiment with local models are to use llama.cpp instead of ollama [1], learn a little about the parameters that tune how much VRAM is used [2], look online for jinja template fixes for the model you're testing [3], and choose a model that was designed to do the task you want to achieve, at the highest-precision quantization (most bits per weight) you can fit. The maximum model size you can run is roughly VRAM + RAM, although you want as little of the model to be in system RAM as possible.
I'm running North Mini Code IQ3_XXS with some tuned parameters to fit my current tasks, and while it is not perfect for everything, it has not failed any tool calls I've asked it to make, or that it figured it should make on its own.
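To make the VRAM + RAM point concrete, here is a back-of-the-envelope sketch of how many layers to offload to the GPU and what a matching `llama-server` command line might look like. It assumes all layers are the same size (they are not exactly) and that a fixed reserve covers the context cache and the desktop, so treat the result as a starting point to tune by hand. The flags used (`-m`, `-ngl`, `-c`, `--jinja`, `--chat-template-file`) are llama.cpp server options; the function names, the reserve value, and the file names in the usage note are illustrative only.

```python
def split_layers(model_gib, n_layers, vram_gib, reserve_gib=1.5):
    """Estimate how many layers fit in VRAM and how much spills to system RAM.

    Assumes equal-size layers; reserve_gib is a guess for the context cache
    and whatever the OS and desktop need.
    """
    per_layer = model_gib / n_layers
    usable = max(vram_gib - reserve_gib, 0.0)
    gpu_layers = min(n_layers, int(usable // per_layer))
    ram_gib = (n_layers - gpu_layers) * per_layer
    return gpu_layers, ram_gib


def server_args(model_path, gpu_layers, ctx_size=8192, template_path=None):
    """Build a llama-server argument list for the estimated split."""
    args = ["llama-server", "-m", model_path,
            "-ngl", str(gpu_layers),   # layers offloaded to the GPU
            "-c", str(ctx_size),       # context size; larger costs more memory
            "--jinja"]                 # use the jinja chat template for tool calls
    if template_path:
        # Override the bundled template with a fixed one found online.
        args += ["--chat-template-file", template_path]
    return args
```

For example, with a hypothetical 24 GiB model of 48 layers on a 16 GiB card and a 2 GiB reserve, `split_layers(24.0, 48, 16.0, reserve_gib=2.0)` suggests offloading 28 layers and leaving about 10 GiB in system RAM. `server_args("model.gguf", 28, template_path="fixed.jinja")` then gives the argument list to start from.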
Interesting. Making correct tool calls with low latency is pretty important in cascading voice AI pipelines (STT, LLM, TTS). Realtime models are still 2x the cost, and there are only 2 providers in the race, OpenAI and Google. For cost control it has to be cascading models.
Sadly, the only LLM right now that fits the bill is GPT 4.1, and it's standard in my stack because thinking models have unacceptable latency (>= 1 sec) even though they are good at tool calling. The main issue with 4.1 is that it can still make mistakes, and the prompt prose has to be tuned quite a bit.
I wonder if any local models can be tuned to match the response time and tool calling while supporting many languages.
"My suggestions if you want to further experiment with local models are to use llama.cpp instead of ollama"
Or at least LM Studio if you want to play around with a lot of different models. I'm currently using it with my 7800 XT and Vulkan, as I found it leaves my OS more stable than ROCm does. With ROCm I had a few system crashes from running out of VRAM for the OS.
[1]: https://sleepingrobots.com/dreams/stop-using-ollama/
[2]: https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...
[3]: https://gist.github.com/jscott3201/e4b155885cc68c038d6ac8909...