December 26, 2024
$5.6M for GPT-4o-class performance
DeepSeek-V3 dropped: 671B total parameters, 37B active per token, trained on 14.8 trillion tokens, with FP8 mixed-precision training and Multi-Token Prediction adopted throughout. Training used just 2,048 H800 GPUs at a total cost of about $5.6 million, less than a tenth of many comparable models' training budgets. On multiple evaluation sets, V3 outperformed GPT-4o and Claude 3.5 Sonnet. The paper detailed every engineering choice, including extensive experiments to tame numerical instability under FP8. The community called it “a miracle of efficiency”: it is not the team with the most compute that wins, but the team that uses compute most ruthlessly.
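The headline figures are easy to sanity-check. A minimal back-of-envelope sketch follows, assuming the roughly 2.79M H800 GPU-hours and $2-per-GPU-hour rental rate cited in the V3 technical report (both are approximations, not exact accounting):

```python
# Back-of-envelope check of the headline numbers.
# Assumptions: ~2.79M H800 GPU-hours and a $2/GPU-hour rental rate,
# as reported in the V3 technical report; treat both as approximate.

TOTAL_PARAMS_B = 671      # total parameters, in billions
ACTIVE_PARAMS_B = 37      # parameters activated per token, in billions
GPU_HOURS_M = 2.79        # assumed total H800 GPU-hours, in millions
COST_PER_GPU_HOUR = 2.0   # assumed rental price, in USD

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
total_cost_m = GPU_HOURS_M * COST_PER_GPU_HOUR  # millions of USD

print(f"Active parameters per token: {active_fraction:.1%} of the model")
print(f"Estimated training cost: ${total_cost_m:.2f}M")
```

The two prints land at roughly 5.5% of parameters active per token and about $5.6M, matching the figures above: the MoE design keeps per-token compute small relative to total capacity, which is where most of the efficiency comes from.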
Sources