Concepts
Mixture of Experts (MoE)
One-line definition
A sparse-activation architecture where each token routes to a few 'expert' sub-networks. DeepSeek-V3 runs 671B params but activates only 37B per token.
Full explainer
Mixture of Experts is a sparse-activation architecture in which each token is routed at runtime to a small number of specialized 'expert' sub-networks rather than passing through every parameter in the model. DeepSeek-V3 totals 671B parameters but activates only ~37B per token; V4-Pro totals 1.6T with 49B active. The benefit is that capacity is decoupled from inference cost: an MoE model can have the breadth of a trillion-parameter dense network while its per-token compute stays closer to that of a 40B dense model. The cost is training complexity: load balancing, expert collapse, and routing stability all become first-order problems. DeepSeek's contribution has been the engineering discipline to make these tradeoffs work at frontier scale on a tight budget; the V3 technical report is the definitive write-up.
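To make the routing idea concrete, here is a minimal sketch of generic softmax top-k routing in plain Python. It is an illustration under stated assumptions, not DeepSeek's actual router (see the V3 technical report for that). The names `route_token` and `moe_layer`, the toy scalar experts, and the sizes `NUM_EXPERTS = 8` and `TOP_K = 2` are all hypothetical choices made for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns [(expert_index, gate_weight), ...], with gate weights
    renormalized so they sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, router_logits, experts, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = 0.0
    for idx, gate in route_token(router_logits, k):
        out += gate * experts[idx](x)
    return out


# Toy setup (assumed sizes): 8 scalar "experts", 2 active per token.
NUM_EXPERTS = 8
TOP_K = 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

# Count how often each expert is chosen for random router scores.
# A heavily skewed count is the load-balancing problem in miniature.
rng = random.Random(0)
load = [0] * NUM_EXPERTS
for _ in range(1000):
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    for idx, _gate in route_token(logits, TOP_K):
        load[idx] += 1

# Active share of parameters per token, using the figures quoted above.
active_fraction = 37 / 671
print(f"expert load: {load}")
print(f"DeepSeek-V3 active fraction: {active_fraction:.1%}")
```

Only `TOP_K` of the `NUM_EXPERTS` expert functions run for any given token, which is where the compute saving comes from. The `load` counter gives a rough feel for why balancing matters: with a trained router, nothing forces those counts to stay even unless training is designed to keep them so.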
Related events
Primary sources