May 7, 2024 · Breakthrough · 1 min read
MLA rewrites attention; the price war begins
DeepSeek-V2 arrived: 236B total parameters, 21B activated per token, 128K context window. Its debut feature, Multi-Head Latent Attention (MLA), slashed KV cache requirements by 93.3% and boosted maximum generation throughput 5.76x. Training cost ran 42.5% below the previous 67B model. Pricing was even more aggressive: just 0.14 yuan per million tokens, blowing through the floor of China's API market. Major cloud providers scrambled to cut prices in response. Commentators called it "a price war ignited by a quantitative fund." That day MLA became shorthand for architectural innovation, and DeepSeek went from "interesting open-source team" to "a competitor everyone has to watch."
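The core idea behind MLA's cache savings is low-rank compression: instead of caching full per-head keys and values, the model caches a single small latent vector per token and reconstructs K and V from it at attention time. A minimal NumPy sketch of that idea, using illustrative dimensions (the sizes and weight names here are assumptions for demonstration, not DeepSeek-V2's actual configuration):

```python
import numpy as np

# Illustrative dimensions (assumptions, not DeepSeek-V2's real config)
d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128
seq_len = 8

rng = np.random.default_rng(0)
h = rng.standard_normal((seq_len, d_model))  # hidden states for 8 cached tokens

# Standard multi-head attention: cache full K and V for every token
W_k = rng.standard_normal((d_model, n_heads * d_head))
W_v = rng.standard_normal((d_model, n_heads * d_head))
mha_cache = np.concatenate([h @ W_k, h @ W_v], axis=-1)  # (seq, 2*n_heads*d_head)

# MLA-style: cache only a compressed latent per token,
# then up-project to K and V on the fly during attention
W_down = rng.standard_normal((d_model, d_latent))          # down-projection
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) # K up-projection
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) # V up-projection
latent_cache = h @ W_down                                  # (seq, d_latent)
k = latent_cache @ W_up_k                                  # reconstructed keys
v = latent_cache @ W_up_v                                  # reconstructed values

savings = 1 - latent_cache.shape[-1] / mha_cache.shape[-1]
print(f"cache per token: {mha_cache.shape[-1]} floats -> "
      f"{latent_cache.shape[-1]} floats ({savings:.1%} smaller)")
```

With these toy dimensions the per-token cache shrinks from 2048 floats to 128, a reduction in the same ballpark as the headline 93.3% figure; the real model also handles RoPE with a separate decoupled key path, which this sketch omits.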