Concepts
Multi-Head Latent Attention (MLA)
One-line definition
DeepSeek's attention reformulation that compresses keys and values into a low-rank latent vector before caching, cutting KV-cache memory by about 93% in V2 relative to the dense DeepSeek 67B model.
Full explainer
Multi-Head Latent Attention (MLA) is DeepSeek's reformulation of standard attention that compresses keys and values into a low-rank latent vector before caching. Standard multi-head attention caches the full key and value tensors for every head in every layer; MLA caches one shared latent vector per token per layer and reconstructs the per-head keys and values from it with up-projection matrices at inference time. The practical effect: the DeepSeek-V2 paper reports a 93.3% smaller KV cache and up to 5.76x higher maximum generation throughput than the dense DeepSeek 67B model, alongside support for a 128K-token context window. The technique was introduced in the V2 paper (May 2024, arXiv:2405.04434) and carried through V3, R1, V3.1, and V4. It is arguably the most influential architectural contribution DeepSeek has made to the open-source ecosystem.
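The compress-then-reconstruct idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not DeepSeek's implementation: the sizes are toy values, the weight names (W_dkv, W_uk, W_uv, W_q) and helper functions are hypothetical, the weights are random, and the sketch leaves out the decoupled rotary-position key that the V2 paper caches alongside the latent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only (not DeepSeek's configuration).
d_model, n_heads, d_head, d_latent = 64, 4, 16, 8
seq_len = 10

# Hypothetical random weights: one shared down-projection to the latent,
# plus up-projections that rebuild keys and values for all heads.
W_dkv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_q = rng.standard_normal((d_model, n_heads * d_head)) / np.sqrt(d_model)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress(h):
    """Down-project hidden states; only this latent is cached."""
    return h @ W_dkv  # shape: (tokens, d_latent)


def attend(h_query, latent_cache):
    """Attend one query token over the cached latents."""
    t = latent_cache.shape[0]
    # Rebuild per-head keys and values from the shared latent.
    k = (latent_cache @ W_uk).reshape(t, n_heads, d_head)
    v = (latent_cache @ W_uv).reshape(t, n_heads, d_head)
    q = (h_query @ W_q).reshape(n_heads, d_head)
    scores = np.einsum("hd,thd->ht", q, k) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    out = np.einsum("ht,thd->hd", weights, v)
    return out.reshape(n_heads * d_head)


h = rng.standard_normal((seq_len, d_model))
latent_cache = compress(h)
out = attend(h[-1], latent_cache)

# Cached floats per token per layer: full K and V versus the latent alone.
mha_floats_per_token = 2 * n_heads * d_head
mla_floats_per_token = d_latent
saving = 1 - mla_floats_per_token / mha_floats_per_token
```

With these toy sizes the latent cache holds 8 floats per token per layer instead of 128. That ratio comes purely from the illustrative dimensions and is not meant to reproduce the 93.3% figure, which the paper measures against DeepSeek 67B.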
Related events
Primary sources