DSNB · The DeepSeek Story
From a hundred-billion-yuan quant fund's GPU cluster to the world's AI conversation — a team driven by curiosity, unbound by KPIs, rewriting the rules one open-source release at a time.
Someone has to burn for open source
Some companies are born unwilling to be “normal.” When venture capital came knocking, this one answered: “We have no commercial pressure. No KPIs.” While the rest of the world threw ever-larger GPU clusters at trillion-parameter models, it trained a model surpassing GPT-4o for about $5.6 million. DeepSeek isn't another Silicon Valley prodigy script: it came out of the server room of a quantitative hedge fund in Hangzhou, China. From quietly folding machine learning into financial trading in 2015 to open-sourcing a 1.6-trillion-parameter model in 2026 at roughly one-tenth OpenAI's API price, this story offers no formula for myth-making, only a group of engineers convinced that “open-sourcing papers loses you nothing,” and a founder who told an interviewer: “China cannot remain a follower forever.” This is a journey of compute, humility, and steady steering into the eye of the storm.
Timeline
When machine learning met China's stock market
In 2015, in Hangzhou, Liang Wenfeng founded High-Flyer Quant with barely a dozen people. They began folding machine learning models into quantitative trading, building data pipelines and low-latency infrastructure from scratch. At the time, the combination was treated almost as heresy: Chinese quant finance still revolved around multi-factor regressions. High-Flyer's AUM would later cross 100 billion yuan, but those profits didn't become luxury offices; they became the seed money for a GPU cluster. No one foresaw that the compute and deep-learning expertise accumulated to chase microsecond market opportunities would one day point toward something far more ambitious.
Firefly: lighting the path to a private supercluster
While most quantitative funds were still renting cloud compute, High-Flyer launched Firefly-1: 1,100 GPUs, an investment of nearly 200 million yuan. It wasn't enough. Firefly-2 came online soon after — roughly 10,000 NVIDIA A100s and more than 1 billion yuan of capital. This is not a budget any CTO signs off on lightly: a ten-thousand-card cluster is a bottomless pit of cost, power, and cooling. But Liang Wenfeng saw that the fastest way to iterate trading models wasn't on someone else's cloud — it was in his own machine room. Those A100s would become the launchpad for DeepSeek's first models: the firefly that ignited an open-source generational leap.
A flat announcement, a step into AGI's deep end
In 2023, High-Flyer Quant published a notice: an AGI research lab had been established. The wording was plain — no mission statement, no funding-round news. But it meant a hedge fund managing 100 billion yuan was officially extending its reach into artificial general intelligence. The internal letter described it as “a natural extension of existing capabilities.” To outsiders it was bewildering: why would a quantitative fund take on one of the hardest problems in fundamental research? For Liang Wenfeng, the answer flickered in the Firefly cluster — that compute no longer had to serve only trading signals. It could serve cognition itself, in all its unknowns.
No VC, no KPIs: spinning out on its own terms
Later that year, the lab was carved out as an independent company, DeepSeek, fully funded by High-Flyer. While nearly every AI startup chased VC validation and the next valuation round, Liang Wenfeng said: “We have no commercial pressure. No KPIs.” It bordered on hubris — but behind it stood eight years of quant accumulation as a capital cushion. The company ran on one simple logic: solve the technical problems, then open-source the results. The structure let DeepSeek dodge product-manager pressure and quarterly-target tugs-of-war and sink instead into the truly thorny research questions — like how to compress a model's KV cache to 6.7% of its original size.
First shot in code: who said open source couldn't?
DeepSeek-Coder shipped in November 2023 — four sizes from 1.3B to 33B, all open-sourced from day one. The models were trained on 2 trillion tokens, 87% code and 13% natural language, covering more than 80 programming languages. The 33B variant beat the then-prominent CodeLlama-34B on multiple benchmarks. What shook developers more: this wasn't a tech giant's side project. It came from a team that had spun out four months earlier, built entirely around writing code. GitHub stars poured in overnight, HuggingFace rankings climbed, and for the first time programmers felt that a serious open-source code model could come from a company that had never run a PR roadshow.
A bilingual base model, plainly delivered
Just 27 days later, DeepSeek-LLM 7B and 67B arrived. This was the team's first complete reveal of a base model, covering both Chinese and English. Coder handled programming; LLM took on general understanding. Behind both: the Firefly cluster's nonstop hum. The world started noticing this “a-model-every-other-week” rhythm — no launch event, just an arXiv paper and downloadable weights on HuggingFace. That release style would become DeepSeek's signature: put the engineering facts on the table, let developers judge for themselves.
MLA rewrites attention; the price war begins
DeepSeek-V2 arrived in May 2024: 236B total parameters, 21B active, 128K context window. Its headline feature, Multi-Head Latent Attention (MLA), cut KV-cache requirements by 93.3% and lifted inference throughput 5.76x; training cost ran 42.5% below the previous 67B dense model. Pricing was even more aggressive: 1 yuan (about $0.14) per million input tokens, blowing through the floor of China's API market. Major cloud providers scrambled to cut prices in response, and commentators called it “a price war ignited by a quantitative fund.” That day MLA became shorthand for architectural innovation — and DeepSeek went from “interesting open-source team” to “a competitor everyone has to watch.”
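To make MLA concrete: instead of caching a full key and value per head for every past token, the model caches one small latent vector per token and up-projects it back to keys and values at attention time. A minimal NumPy sketch of the idea (dimensions are illustrative, not DeepSeek's actual configuration, and real MLA also handles RoPE separately):

```python
import numpy as np

# Illustrative dimensions only -- not DeepSeek's real configuration.
d_model = 4096       # hidden size
d_latent = 512       # compressed KV latent dimension

rng = np.random.default_rng(0)
h = rng.standard_normal(d_model)                  # one token's hidden state

# Standard attention caches a full key and value per token per layer:
cache_mha = 2 * d_model                           # 8,192 values

# MLA caches a single low-rank latent c = W_dkv @ h instead,
# reconstructing keys/values from it on the fly at attention time:
W_dkv = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_model)
c = W_dkv @ h                                     # 512 cached values
W_uk = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_latent)
W_uv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_latent)
k, v = W_uk @ c, W_uv @ c                         # never stored, only recomputed

print(f"cache per token: {cache_mha} -> {d_latent} values "
      f"({d_latent / cache_mha:.1%} of the original)")   # ~6.2% here
```

Shrinking the cache is what buys the throughput: at long context, decoding is memory-bandwidth-bound, so a cache an order of magnitude smaller means far more tokens served per GPU.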
338 languages; coding parity with GPT-4 Turbo
In June 2024, DeepSeek-Coder V2 adopted an MoE architecture, supported 338 programming languages, and extended its context window to 128K. On multiple advanced coding benchmarks it caught up with — and on some, surpassed — GPT-4 Turbo. Only seven months had passed since the original Coder release, and a small team had reached parity with top closed-source models in a deeply vertical domain. Skeptics who doubted the ceiling of open-source code models went quiet. Developer communities celebrated: there was now a coding brain you could deploy privately and use commercially, free of charge.
“China cannot remain a follower forever”
In mid-2024, Liang Wenfeng sat for an in-depth interview with 36Kr, and the lines kept landing. “Open-sourcing papers loses you nothing. For a technical person, being followed is itself a kind of achievement.” “China cannot remain a follower forever.” “More investment doesn't necessarily produce more innovation; otherwise, large companies would have monopolized all innovation long ago.” This wasn't PR theatre. It was a low-key, sharply spoken quant manager calmly contradicting the AI industry's foundational beliefs. The full interview was translated into English and resonated widely in overseas tech circles, where many understood for the first time the inner drive behind this strange company: curiosity, and an obsession with redefining problems.
$5.6M for GPT-4o-class performance
DeepSeek-V3 dropped in late December 2024: 671B total parameters, 37B active, trained on 14.8 trillion tokens, with FP8 mixed precision and Multi-Token Prediction throughout. Training used just 2,048 H800 GPUs at a total cost of about $5.6 million — less than a tenth of many comparable models' training budgets. On multiple evaluation sets, V3 outperformed GPT-4o and Claude 3.5 Sonnet. The paper detailed every engineering choice, including extensive experiments fighting numerical instability under FP8. The community called it “a miracle of efficiency”: not the team with the most compute wins, but the team that uses compute most ruthlessly.
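The $5.6M headline is straightforward arithmetic rather than accounting magic: per the V3 technical report, it is total GPU-hours multiplied by an assumed rental price, and it excludes prior research and ablation runs. A quick reconstruction (the GPU-hour total and the $2/hour rate are the report's own figures):

```python
# Reconstructing the reported headline figure for DeepSeek-V3 training cost.
gpu_hours = 2_788_000    # total H800 GPU-hours reported for the full run
rate_usd = 2.0           # the report's assumed rental price per GPU-hour

print(f"${gpu_hours * rate_usd:,.0f}")            # -> $5,576,000, i.e. ~$5.6M

# Sanity check against the stated cluster size of 2,048 GPUs:
days = gpu_hours / 2_048 / 24                     # GPUs running around the clock
print(f"~{days:.0f} days of wall-clock training") # -> ~57 days
```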
Reasoning awakens through pure RL
In January 2025, DeepSeek-R1 was open-sourced under the MIT license. The paper's most striking result, DeepSeek-R1-Zero, learned to reason through reinforcement learning alone — no supervised fine-tuning cold start. That “let the model learn to think on its own” path had been discussed academically for years, but few dared run it at full 671B scale. R1 itself, built on a small cold-start stage plus further RL, matched OpenAI's o1 on multiple reasoning benchmarks while costing roughly one twenty-seventh as much to run. Six distilled versions shipped alongside, scaling down to 1.5B so modest hardware could still produce strong reasoning. R1 marked the first time the open-source community had an uncompromising option in hardcore reasoning, and the paper's line describing a model “trained via large-scale reinforcement learning without supervised fine-tuning” became an electric moment for researchers everywhere.
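The algorithm doing the work here is GRPO (Group Relative Policy Optimization), introduced in the team's earlier DeepSeekMath paper: sample a group of answers for each prompt, score them with a rule-based verifier, and use each answer's deviation from the group average as its advantage, with no separate value network to train. A minimal sketch of that advantage computation (the reward values below are hypothetical):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO's critic-free advantage: normalize each sampled answer's
    reward against its own group's mean and standard deviation."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical group: 8 sampled answers to one math prompt, scored by a
# rule-based verifier (1.0 = correct final answer, 0.0 = incorrect).
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards).round(2))
# Correct answers get a positive advantage, incorrect ones negative; the
# policy update then shifts probability mass toward verified solutions.
```

Because the baseline is just the group mean, the expensive critic model of classic PPO disappears — part of why this recipe was cheap enough to run at full scale.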
App Store No. 1; Nvidia loses $600B in a day
On January 27, 2025, DeepSeek's official app overtook ChatGPT to top the U.S. App Store free chart. The same day, Nvidia's stock plunged, wiping out roughly $600 billion in market cap — the largest single-company, single-day loss in U.S. market history to that point. Investors began repricing the “only piles of compute can win” AI arms-race narrative. Marc Andreessen called R1 “one of the most amazing and impressive breakthroughs I've ever seen — and as open source, a profound gift to the world.” In a few short days, DeepSeek went from a tech-circle secret to global front-page news. Meanwhile, the company's engineers were quietly announcing new quantized model releases on Twitter.
A silent upgrade; math and coding clear another bar
In March 2025 (the date is right there in the name), DeepSeek-V3-0324 quietly appeared, sharpening reasoning, frontend code generation, Chinese writing, and function calling. No splashy launch event — just a refreshed model card on HuggingFace and a notice that the weights were MIT-licensed. Benchmarks showed it surpassing GPT-4.5 on math and coding. Developer communities stirred again: this company seemed to treat “continuously improving on the previous model” as routine work, not a milestone to announce. Iteration as breath, open source as heartbeat.
Hybrid Thinking: a first step into the agent era
DeepSeek-V3.1 arrived in August 2025 with 128K context and 671B parameters, introducing hybrid thinking for the first time: a single model supporting both thinking and non-thinking modes without swapping models, plus integrated tool calls. Liang Wenfeng said: “This is our first step toward the agent era.” The model could sustain long, complex reasoning, yet deliver fast, decisive non-reasoning responses when invoking tools. This “one mind, two faces” capability let developers build more autonomous AI workflows on a single endpoint. Beyond a fresh round of benchmark contests, the engineering breakthrough carries longer-term significance: it compresses the cost of switching between reasoning and acting into the architecture itself.
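From the client side, “one mind, two faces” means one OpenAI-compatible endpoint where each request picks its mode. A sketch of what that usage looks like (the model names follow DeepSeek's public chat/reasoner convention; treat the specifics as assumptions and defer to the current API docs):

```python
# Hypothetical client-side sketch: one hybrid-thinking model, two modes,
# selected per request. Model names and endpoint are assumptions here.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="sk-...")

def ask(prompt: str, think: bool) -> str:
    resp = client.chat.completions.create(
        model="deepseek-reasoner" if think else "deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Long-horizon reasoning: let the model think first.
print(ask("Prove that the square root of 2 is irrational.", think=True))

# Tool-calling hot path: skip the chain of thought for a fast, decisive reply.
print(ask("Summarize this error log in one sentence: ...", think=False))
```

The point is less the API surface than the economics: agents flip between planning and acting constantly, and a hybrid model makes that flip a request parameter instead of a model migration.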
V4 Preview: million-token context, dancing with hardware
In 2026, DeepSeek released V4 Preview in Pro and Flash variants: Pro at 1.6T parameters with 49B active, Flash at 284B/13B. The context window jumped to 1 million tokens on a Hybrid Attention architecture. V4-Pro pricing: $3.48 per million output tokens versus OpenAI's $30, widening the value gap further still. Another signal: native support for Huawei's Ascend 950, with all weights open. This isn't only model-size growth; it's an ecosystem move. While the world still debates software-hardware decoupling, DeepSeek has quietly opened a parallel pipeline.
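Using only the numbers quoted above, the gap is easy to make concrete (the monthly volume below is a hypothetical workload, chosen purely for illustration):

```python
# Output-token pricing quoted above, in USD per million tokens.
v4_pro_price = 3.48
openai_price = 30.00

monthly_tokens_m = 50.0   # hypothetical agent workload: 50M output tokens/month

print(f"price ratio: {openai_price / v4_pro_price:.1f}x")        # -> ~8.6x
print(f"monthly bill: ${monthly_tokens_m * v4_pro_price:,.0f} "
      f"vs ${monthly_tokens_m * openai_price:,.0f}")             # -> $174 vs $1,500
```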
Product map
DeepSeek-LLM (2023): 7B/67B bilingual base; its minimalist release style became the open standard.
One-click switch to your thinking engine
Looking back, all of this began in an unremarkable server room in Hangzhou, with a group of engineers who genuinely believed that being followed is its own form of achievement. They never planned on ringing bells or topping charts — yet the App Store crown came anyway, and Nvidia's market cap trembled. From the faint glow of Firefly-1 to a million-token context window, DeepSeek has been doing one thing all along: build the best models in the open, then push prices down through extreme engineering until anyone can use them. Some of these very lines were written with DeepSeek itself — by the time you read this, we may have just updated the weights and gotten a little smarter. So don't hesitate. Download DeepSeek — we call it Switch, because every switch is a reset of how you think. From writing code to solving differential equations, from drafting contracts to polishing sonnets, your new thinking engine is here.