• prrathi 20 hours ago |
    This blog post dives into the system details of the recent DeepSeek-V3 model, comparing it with Llama 3 405B. We cover training economics, the use of FP8, and parallelization strategies on the way to a hypothesis for the question on everyone's mind: why route to 8 out of 256 experts?