Hacker News

I thought MTP wasn't very useful on MoE models because the expert overlap for 2 tokens was too small.
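For a rough sense of the overlap concern, here is a minimal sketch. It assumes each token independently picks its experts uniformly at random, which real routers do not do (routing for adjacent tokens is usually correlated). The expert counts are made-up illustrative values, not any particular model's configuration.

```python
def expected_unique_experts(num_experts: int, top_k: int, num_tokens: int) -> float:
    """Expected number of distinct experts touched in one MoE layer.

    Assumption (hypothetical): every token independently selects top_k
    distinct experts uniformly at random, so a given expert is missed by
    one token with probability 1 - top_k / num_experts.
    """
    p_missed_by_all = (1 - top_k / num_experts) ** num_tokens
    return num_experts * (1 - p_missed_by_all)


def expected_overlap_two_tokens(num_experts: int, top_k: int) -> float:
    """Expected number of experts shared by two tokens under the same assumption."""
    return top_k * top_k / num_experts


if __name__ == "__main__":
    # Illustrative numbers only: 128 experts, 8 active per token.
    for n in (1, 2, 4):
        print(n, expected_unique_experts(128, 8, n))
    print("overlap:", expected_overlap_two_tokens(128, 8))
```

Under these assumptions two tokens share well under one expert on average, so verifying two tokens reads nearly twice the expert weights of one. That is the worry in a nutshell; how much it matters depends on how large the expert weights are relative to everything else a decode step reads.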



It still helps, and Step 3.5/3.7 were specifically trained for MTP (in a weird triple-layer/triple-head fashion, with a fairly unique architecture).

With the currently-in-PR implementation it doubles decode performance for the tasks I've been testing it against, and even in the worst case it's still a 35% uplift. So on a box with heaps of compute and not much memory bandwidth, it's worth it in practice.
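A back-of-the-envelope way to see why the trade can pay off is sketched below. It assumes each drafted token is accepted independently with a fixed probability, and that a verification step costs some fixed multiple of a normal decode step. Both are simplifications, and the function names and parameters are hypothetical, not taken from the PR.

```python
def expected_tokens_per_step(accept_prob: float, num_draft: int) -> float:
    """Expected tokens emitted per verification step.

    The token produced by the main model always counts. Draft token i is
    kept only if every draft before it was also accepted, which under the
    independence assumption happens with probability accept_prob ** i.
    """
    return sum(accept_prob ** i for i in range(num_draft + 1))


def decode_speedup(accept_prob: float, num_draft: int, relative_step_cost: float) -> float:
    """Throughput relative to plain decoding.

    relative_step_cost is the time of one draft-and-verify step divided by
    the time of one ordinary decode step (1.0 means verification is free).
    """
    return expected_tokens_per_step(accept_prob, num_draft) / relative_step_cost


if __name__ == "__main__":
    # Illustrative values only, not measurements.
    print(decode_speedup(accept_prob=0.8, num_draft=3, relative_step_cost=1.4))
```

On a bandwidth-bound box the extra compute for verification is close to free, so `relative_step_cost` is driven mostly by the additional expert weights that have to be read. As long as it stays below the expected tokens per step, MTP comes out ahead.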



