Special interest topics we touched on:
- why encoder-decoder is not actually that different from decoder-only architectures (Reka is notably enc-dec vs. GPT, which is dec-only)
- why the "Noam architecture" is All You Need
- The chaotic vs stable periods of spinning up GPU clusters ( per this great post I submitted https://news.ycombinator.com/item?id=39609997 )
- The NeurIPS Mirage paper vs Jason/Yi's Emergent Abilities paper
- The Efficiency Misnomer - a reminder, per Arvind Narayanan's recent callout, that "active params" is not actually the same thing as lower cost, nor the same thing as faster inference
- Echoing Founders Fund's skepticism that Open Source AI can have any real lasting impact
please AMA! (Yi will see this)
The practical motivation for MoEs is very clear, but I do worry about a loss of compositional abilities (which I think just emerge from superposed representations?) that some tasks may require, especially with the many-experts phenomenon we're seeing. This is an observation from smaller MoE models (with top-k gating etc.) that may or may not scale: denser models trained to the same loss tend to perform complex tasks "better".
Intuitively, do you think MoEs are just another stopgap trick we're using while we figure out more compute and better optimizers, or could there be enough theoretical motivation to justify their continued use? If there isn't, perhaps we at least need to figure out "expert scaling laws" :)
HOWEVER i do opine that MoEs are kiiind of a stopgap (both on the pod and on https://latent.space/p/jan-2024). they're definitely a validated efficiency/sparsity technique (esp see deepseek's MoE work if you haven't already, with >100 experts https://buttondown.email/ainews/archive/ainews-deepseek-v2-b...), but mostly a one-off boost over the single small dense expert-equivalent model, rather than comparable to the capabilities of a large dense model of the same param count (aka I expect an 8x22B MoE to never outperform a 176B dense model, ceteris paribus). that's a difficult like-for-like comparison to run, partly because these things are expensive, partly because the MoE is usually just upcycled instead of trained from scratch, and partly because the routing layer is deepening every month. so TLDR: there is more than enough evidence and practical motivation to justify their continued use (i would go so far as to say that all inference endpoints, incl gpt4 and above, should be MoEs), but they are not really an architectural decision that matters for the next quantum leap in capabilities.
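to make the sparsity point concrete, here's a minimal top-k gating sketch in plain Python (the function name, logits, and k=2-of-8 setup are mine for illustration, not from any particular model; real routers like Switch/Mixtral-style ones also add noise, load-balancing losses, and capacity limits):

```python
import math

def topk_gating(logits, k=2):
    """Softmax over router logits, then keep only the top-k experts.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    Only these k experts' FFN params run for the token ("active params"),
    but ALL experts' weights must stay resident in memory, which is why
    active params != cost per the Efficiency Misnomer point above.
    """
    # numerically stable softmax over all experts
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # pick the top-k experts and renormalize their weights
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    z = sum(probs[i] for i in top)
    return [(i, probs[i] / z) for i in top]

# one token routed to 2 of 8 experts (hypothetical router logits)
routing = topk_gating([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```

the token's output is then the weight-blended sum of just those two experts' outputs, which is the whole efficiency trick: compute scales with k, memory scales with the full expert count.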
Most of the progress and experimentation is done by complete AI laypersons on Civitai, ComfyUI, and SillyTavern.
Consider this: we have hit the top of the S-curve when it comes to compute and data.
The only way forward is brute force manual experimentation.
And this is done completely without financial or academic incentives.
The big advancements will come from a thousand little steps and ComfyUI workflows made by an anon NEET in the chase for the perfect AI waifu, not from some AI research lab or startup.
We will see breakthroughs in Gaussian splatting and inpainting in 3D. We will be able to create our own 3D fantasy worlds and wear VR glasses all the time. The metaverse meme is surprisingly becoming real.
But it won't happen with a grand splash or a big product announcement. It will happen quietly in some anon NEET's cave.
Mark my words :D