My system is quite similar to yours, my GPU is a 6950 XT and my CPU a Ryzen 5 2600X,...

itissid · 2026-06-17T10:28:28 1781692108

Interesting. Making correct tool calls with low latency is pretty important in cascading voice AI pipelines (STT, LLM, TTS). Realtime models are still 2x the cost, and only two providers, OpenAI and Google, are in the race. For cost control it has to be cascading models.

Sadly, the only LLM right now that fits the bill is GPT-4.1, and it's standard in my stack because thinking models have unacceptable latency (>= 1 sec) even though they are good at tool calling. The main issue with 4.1 is that it can still make mistakes, and the prompt prose has to be tuned quite a bit.

I wonder if any local models can be tuned to match the response time and tool calling while supporting many languages.

calgoo · 2026-06-17T10:26:58 1781692018

"My suggestions if you want to further experiment with local models are to use llama.cpp instead of ollama"

Or at least LM Studio if you want to play around with a lot of different models. I'm currently using it with my 7800 XT and Vulkan, as I found it left my OS more stable than ROCm does. I had a few system crashes with ROCm, and it kept running out of VRAM for the OS.