When running on a GPU, dense models are shaping up to be the best way due to two...

zozbot234 · 2026-06-17T09:06:53 1781687213

The problem with batching local LLMs is not an inherent lack of parallel sessions. Rather, local dGPUs lack the VRAM capacity to host the KV cache for several sessions at once, while unified memory platforms broadly lack the compute headroom relative to memory bandwidth that would actually make batching useful.

(Streaming a larger-than-RAM model from SSD "solves" the latter issue very nicely: it radically slashes the effective memory bandwidth, so any saving on bandwidth becomes highly significant.)
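
To make the bandwidth-versus-compute argument concrete, here is a rough roofline-style sketch. It assumes that each decode step reads all weights once regardless of batch size, that compute costs about 2 FLOPs per parameter per token, and that KV-cache traffic is ignored. The function names and the numbers in the usage note are hypothetical, not measurements of any real hardware.

```python
def decode_tokens_per_sec(batch, weight_bytes, params, bandwidth_bytes_per_s, flops_per_s):
    """Aggregate decode throughput under a simple roofline model (an assumption)."""
    # Time to stream the weights once; shared by every sequence in the batch.
    load_time = weight_bytes / bandwidth_bytes_per_s
    # Compute time grows linearly with batch size (~2 FLOPs per parameter per token).
    compute_time = batch * 2 * params / flops_per_s
    # The slower of the two bounds the step; each step emits `batch` tokens.
    return batch / max(load_time, compute_time)


def break_even_batch(weight_bytes, params, bandwidth_bytes_per_s, flops_per_s):
    """Batch size at which compute time catches up with weight-load time."""
    load_time = weight_bytes / bandwidth_bytes_per_s
    compute_time_per_token = 2 * params / flops_per_s
    return load_time / compute_time_per_token
```

With made-up figures of 1e9 parameters at one byte each, 1e9 bytes/s of effective bandwidth (SSD-like), and 1e12 FLOP/s, the break-even batch is about 500, so batching pays off almost linearly up to that point. Raise the bandwidth without raising compute and the break-even batch shrinks, which is the unified-memory case described above.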

nullc · 2026-06-17T11:44:55 1781696695

> This is important because local llm rarely has parallel streams to batch together.

I think most people with agent-like workloads could easily run any number of parallel streams pretty often, but unfortunately you run out of VRAM for multiple KV caches.
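
For a sense of scale, here is a back-of-the-envelope sketch of how much VRAM each parallel stream's KV cache takes. It assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token, with no cache quantization or paging. The function names and the model shape in the usage note are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_elem=2):
    """Approximate KV-cache size for one session (fp16 by default)."""
    # Factor of 2 covers keys plus values.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem


def max_parallel_sessions(vram_bytes, weight_bytes, per_session_bytes):
    """How many full-context sessions fit in the VRAM left over after the weights."""
    free = vram_bytes - weight_bytes
    return max(0, free // per_session_bytes)
```

For an illustrative model with 32 layers, 8 KV heads, head dimension 128, and a 32768-token context in fp16, `kv_cache_bytes(32, 8, 128, 32768)` comes to 4 GiB per session. On a 24 GiB card with 16 GiB of weights, `max_parallel_sessions` then gives 2 streams.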