When running on a GPU, dense models are shaping up to be the best option, for two reasons:
- Maximum intelligence per GB of VRAM (you don't have much)
- Dense models can benefit from MTP (multi-token prediction) to get an almost 2x speedup in decode (i.e., a 27B dense model with MTP decodes at about the same speed as a MoE model with 14B active params would). This is important because a local LLM rarely has parallel streams to batch together.
When running on large unified memory like Strix Halo or DGX Spark, MoE models are usually best:
- You can get intelligence similar to a smaller dense model with fewer active params (to compensate for the slower memory) by throwing RAM at the problem.
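To put rough numbers on the 27B-vs-14B comparison, here is a back-of-the-envelope sketch. It assumes single-stream decode is purely memory-bandwidth bound (every active weight is read once per decode step) and that MTP yields about 1.93 accepted tokens per step. The bandwidth figure, the 4-bit quantization, and the acceptance rate are illustrative assumptions, not measurements.

```python
def decode_tokens_per_s(active_params_b, bytes_per_param, bandwidth_gb_s,
                        tokens_per_step=1.0):
    """Bandwidth-bound estimate: each step reads all active weights once."""
    bytes_per_step = active_params_b * 1e9 * bytes_per_param
    steps_per_s = bandwidth_gb_s * 1e9 / bytes_per_step
    return steps_per_s * tokens_per_step

# Assumed hardware and quantization (hypothetical values).
BANDWIDTH_GB_S = 1000.0   # dGPU-class memory bandwidth
BYTES_PER_PARAM = 0.5     # roughly 4-bit weights

dense_plain = decode_tokens_per_s(27, BYTES_PER_PARAM, BANDWIDTH_GB_S)
# Assumed MTP acceptance: about 1.93 tokens emitted per forward pass.
dense_mtp = decode_tokens_per_s(27, BYTES_PER_PARAM, BANDWIDTH_GB_S,
                                tokens_per_step=1.93)
moe_14b_active = decode_tokens_per_s(14, BYTES_PER_PARAM, BANDWIDTH_GB_S)

print(f"27B dense, no MTP : {dense_plain:6.1f} tok/s")
print(f"27B dense, MTP    : {dense_mtp:6.1f} tok/s")
print(f"MoE, 14B active   : {moe_14b_active:6.1f} tok/s")
```

Under these assumptions the dense model with MTP and the 14B-active MoE land within about 1% of each other, which is where the "almost 2x" figure comes from (27 / 14 ≈ 1.93). Real numbers will differ because of KV-cache reads, the MTP head's own cost, and kernel efficiency.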
The problem with batching local LLMs is not any inherent lack of parallel sessions. Local dGPUs lack the VRAM capacity to host the KV cache for several sessions at once, whereas unified memory platforms broadly lack the compute headroom relative to their memory bandwidth that would make batching pay off.
(SSD streaming a larger-than-RAM model "solves" that latter issue very nicely: it radically slashes the effective memory bandwidth, so any saving on weight reads becomes highly significant.)
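The compute-versus-bandwidth point can be sketched with a simple roofline-style estimate. It assumes a decode step costs about 2 FLOPs per active parameter per stream, and that weights are read once per step regardless of batch size. The FLOP/s and bandwidth figures below are hypothetical, chosen only to show the shape of the argument.

```python
def break_even_batch(peak_flops, bandwidth_bytes_s, bytes_per_param):
    """Batch size at which decode stops being bandwidth-bound.

    Per step and per parameter: reading costs bytes_per_param / bandwidth
    seconds, compute costs 2 * batch / peak_flops seconds. Setting the two
    equal and solving for batch gives the value returned here.
    """
    return peak_flops * bytes_per_param / (2.0 * bandwidth_bytes_s)

PEAK_FLOPS = 100e12      # assumed usable compute (hypothetical)
BYTES_PER_PARAM = 0.5    # assumed roughly 4-bit weights

# Weights read from fast memory vs. streamed from an SSD (assumed speeds).
from_memory = break_even_batch(PEAK_FLOPS, 1000e9, BYTES_PER_PARAM)
from_ssd = break_even_batch(PEAK_FLOPS, 7e9, BYTES_PER_PARAM)

print(f"break-even batch, weights in memory: {from_memory:.0f}")
print(f"break-even batch, weights on SSD   : {from_ssd:.0f}")
```

Below the break-even batch, extra streams are nearly free in this model because the weight read dominates. The lower the effective bandwidth (as with SSD streaming), the higher that break-even point, so sharing each weight read across more streams matters more.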
> This is important because local llm rarely has parallel streams to batch together.
I think most people with agent-style workloads could easily run any number of parallel streams pretty often, but you run out of VRAM for multiple KV caches, unfortunately.
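For a sense of scale, here is a minimal sketch of the KV-cache arithmetic. The model shape (64 layers, 8 KV heads, head dimension 128, fp16 cache), the 32k context, and the 24 GiB card with 14 GiB of weights are all hypothetical values picked for illustration.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Keys and values (hence the factor 2) for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def max_streams(vram_gib, weights_gib, context_len, per_token_bytes):
    """Full-context KV caches that fit in the VRAM left after the weights."""
    free_bytes = (vram_gib - weights_gib) * 2**30
    return int(free_bytes // (per_token_bytes * context_len))

# Hypothetical dense model shape with grouped-query attention, fp16 cache.
per_token = kv_bytes_per_token(n_layers=64, n_kv_heads=8, head_dim=128,
                               bytes_per_elem=2)
per_stream_gib = per_token * 32768 / 2**30

streams = max_streams(vram_gib=24, weights_gib=14, context_len=32768,
                      per_token_bytes=per_token)

print(f"KV cache per token : {per_token // 1024} KiB")
print(f"KV cache per stream: {per_stream_gib:.1f} GiB at 32k context")
print(f"streams that fit   : {streams}")
```

With these assumed numbers a single 32k-context stream already takes 8 GiB, so only one fits next to the weights. Quantizing the KV cache or using shorter contexts raises that count, but the budget stays tight.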