What is Multi-Head Latent Attention (MLA)?
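MLA compresses the key-value (KV) cache, the memory a model keeps so it can look back over a long conversation. DeepSeek-V2, the model that introduced MLA, reports a KV cache about 93% smaller than that of its predecessor, DeepSeek 67B, while keeping quality nearly identical.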
From DeepSeek's perspective
Attention memory was the bottleneck for long contexts. Rather than shrink that memory by sharing keys and values across heads, as multi-query and grouped-query attention do, we redesigned the projection itself: K and V are jointly compressed into a single low-rank latent vector per token, only that latent is cached, and per-head keys and values are reconstructed from it at inference. The result is a structurally distinct mechanism, not a compromise.
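To make the compress-then-reconstruct idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not DeepSeek's implementation: the sizes (`D_MODEL`, `N_HEADS`, `D_HEAD`, `D_LATENT`), the matrix names (`W_down`, `W_up_k`, `W_up_v`), and the random weights are all hypothetical, and the sketch leaves out positional encoding and the attention computation itself.

```python
import random

# Hypothetical toy sizes, chosen only for illustration.
D_MODEL = 64   # hidden size of each token
N_HEADS = 4    # number of attention heads
D_HEAD = 16    # size of each head's key (and value) vector
D_LATENT = 8   # size of the compressed latent that gets cached


def rand_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights (stand-in for trained weights)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


rng = random.Random(0)
W_down = rand_matrix(D_LATENT, D_MODEL, rng)           # hidden state -> latent
W_up_k = rand_matrix(N_HEADS * D_HEAD, D_LATENT, rng)  # latent -> keys for all heads
W_up_v = rand_matrix(N_HEADS * D_HEAD, D_LATENT, rng)  # latent -> values for all heads


def compress(hidden):
    """Project one token's hidden state down to the small latent that is cached."""
    return matvec(W_down, hidden)


def split_heads(flat):
    """Cut a flat vector into N_HEADS chunks of D_HEAD entries each."""
    return [flat[i * D_HEAD:(i + 1) * D_HEAD] for i in range(N_HEADS)]


def reconstruct(latent):
    """Expand a cached latent back into per-head keys and values."""
    return split_heads(matvec(W_up_k, latent)), split_heads(matvec(W_up_v, latent))


# Cache only the latent for each of 10 made-up tokens.
kv_cache = [compress([rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]) for _ in range(10)]

# Rebuild full per-head keys and values for the first token when they are needed.
keys, values = reconstruct(kv_cache[0])

# Numbers stored per token: full keys and values for every head vs. one latent.
full_per_token = 2 * N_HEADS * D_HEAD
latent_per_token = D_LATENT
saving = 1 - latent_per_token / full_per_token
print(f"{saving:.2%}")  # prints 93.75%
```

The saving printed here follows only from the toy sizes picked above; it is not a derivation of the reported figure, which compares two different models. The point of the sketch is the shape of the trade: the cache holds `D_LATENT` numbers per token instead of `2 * N_HEADS * D_HEAD`, at the cost of two extra matrix multiplications whenever keys and values are rebuilt.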
Kid
MLA is a smarter way for AI to take notes so it can remember long conversations without filling up its notebook.
Imagine you're reading a really long book and you need to remember everything you've read so far. One way is to copy down every single sentence — but that takes up a huge amount of notebook space. A smarter way is to write down a tiny, compressed summary that captures the important ideas, and then use a special decoder ring to expand it back when you need the details. That's basically what MLA does for an AI. Instead of storing giant full-size notes about every word it has processed, it stores tiny compressed notes and reconstructs the full picture only when it needs to answer your question. This means the AI can hold much longer conversations without running out of memory.
Analogy
It's like packing for a trip with a vacuum bag: your clothes take up far less space in the suitcase, and you just un-squish them when you arrive at the hotel.