Concepts

Mixture of Experts (MoE)

One-line definition

A sparse-activation architecture in which each token is routed to a few 'expert' sub-networks. DeepSeek-V3 has 671B total parameters but activates only ~37B per token.

Full explainer

Mixture of Experts is a sparse-activation architecture in which each token is routed at runtime to a small number of specialized 'expert' sub-networks rather than passing through every parameter in the model. DeepSeek-V3 totals 671B parameters but activates only ~37B per token; V4-Pro totals 1.6T with 49B active. The benefit is that capacity is decoupled from inference cost: the model has the breadth of a trillion-parameter dense network, but its per-token compute is closer to that of a 40B dense model. The cost is training complexity: load balancing, expert collapse, and routing stability all become first-order problems. DeepSeek's contribution has been the engineering discipline to make these tradeoffs work at frontier scale on a tight budget; the V3 technical report is the definitive write-up.
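To make the routing idea concrete, here is a minimal sketch of generic top-k gating for a single token, in plain Python. It assumes a softmax router with renormalized top-k weights, which is the textbook formulation and not necessarily the exact router DeepSeek uses. The names route_top_k, moe_layer, and make_expert, the toy experts, and the router scores are all hypothetical and exist only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, experts, router_logits, k=2):
    """Combine the outputs of only the k selected experts for one token."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)  # unselected experts never run
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


def make_expert(scale):
    # Hypothetical stand-in for a feed-forward expert: it just scales the input.
    return lambda x: [scale * v for v in x]


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for this token

selected = route_top_k(router_logits, k=2)
output = moe_layer(token, experts, router_logits, k=2)

# Share of parameters active per token, from the DeepSeek-V3 figures quoted above.
active_fraction_v3 = 37 / 671  # roughly 5.5%
```

Only two of the four toy experts are evaluated for this token, which is the source of the compute saving. In a real model the router scores come from a learned projection of the token's hidden state, and training typically needs some load-balancing mechanism so that a few experts do not absorb most of the traffic.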

Primary sources

https://arxiv.org/html/2412.19437v1