How did $5.6M train a GPT-4-class model?

DeepSeek trained DeepSeek-V3, a 671B-parameter model competitive with the best available, for roughly $5.6 million in GPU time by stacking engineering gains across architecture, precision, and training objectives. No single trick explains the figure; many small gains compounded.
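The headline number is simple arithmetic: GPU-hours multiplied by a rental price. The sketch below reproduces it from the GPU-hour breakdown published in the DeepSeek-V3 technical report. The $2 per H800 GPU-hour rate is the report's own rental assumption, not a measured expense, and the total covers only the final training run (not earlier research or ablation experiments). The function and variable names are illustrative.

```python
from typing import Mapping

# H800 GPU-hours per stage, as published in the DeepSeek-V3 technical report.
GPU_HOURS = {
    "pre_training": 2_664_000,
    "context_extension": 119_000,
    "post_training": 5_000,
}

# Assumption taken from the report: an H800 rents for $2 per GPU-hour.
RENTAL_USD_PER_GPU_HOUR = 2.0


def training_cost_usd(gpu_hours: Mapping[str, int], rate: float) -> float:
    """Return the rental cost: (sum of GPU-hours across stages) * hourly rate."""
    return sum(gpu_hours.values()) * rate


total_hours = sum(GPU_HOURS.values())
total_cost = training_cost_usd(GPU_HOURS, RENTAL_USD_PER_GPU_HOUR)

if __name__ == "__main__":
    print(f"Total GPU-hours: {total_hours:,}")          # 2,788,000
    print(f"Estimated cost: ${total_cost / 1e6:.3f}M")  # $5.576M
```

Pre-training accounts for almost all of the total, which is why efficiency gains in that stage dominate the final bill.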

From DeepSeek's perspective

We didn't have unlimited GPUs, so we had to make every H800 count. FP8 precision, sparse MoE routing, a compressed latent attention cache, and smarter training objectives weren't separate experiments; they were engineered to compose. The cost figure reflects that discipline.

Kid

DeepSeek built one of the smartest AI brains in the world, but spent way less money than anyone expected, like a baker who learned to make the same great cake using only a fraction of the flour.

Imagine you're baking a giant birthday cake. Most bakeries use huge sacks of flour and lots of ovens running all day, and that costs a lot. DeepSeek's team figured out a smarter recipe: they used special ovens that work more efficiently, only warmed up the exact parts they needed, and found a way to mix the batter that wasted almost nothing. Their cake, an AI called DeepSeek-V3, turned out just as delicious as the cakes the biggest bakeries make, but by most estimates cost about ten to twenty times less to bake. The secret wasn't one magic ingredient. It was doing many small things more cleverly, all at the same time.

Analogy

It's like building the same LEGO castle as everyone else, but figuring out how to do it with far fewer bricks by being really smart about which pieces go where.
