Why does DeepSeek v3 route to 8 experts per token?

This post dives into the system details of the recent DeepSeek v3 model and compares it with Llama 3 405B. We cover training economics, the use of FP8, and parallelization strategies on the way to a hypothesis for the question on everyone's mind: why route to 8 out of 256 experts?