Since the Spring Festival, DeepSeek's popularity has continued to climb, accompanied by a great deal of misunderstanding and controversy. Some say it is "the pride of the nation that has beaten OpenAI"; others say it is "just a clever copy of foreign large models' homework."
These misunderstandings and controversies center on five main areas:
1. Excessive mythologizing versus mindless disparagement: is DeepSeek genuine underlying innovation, and is there any basis for the claim that it merely distilled ChatGPT?
2. Is DeepSeek's cost really only $5.5 million?
3. If DeepSeek can really train this efficiently, is the enormous AI capital expenditure of the world's major giants all a waste of money?
4. Does DeepSeek's use of PTX programming really bypass the dependency on NVIDIA CUDA?
5. DeepSeek has caught fire globally, but will it be banned abroad one after another over compliance, geopolitics, and other issues?
1. Over-mythologizing and mindless disparagement: is DeepSeek an underlying innovation or not?
Internet veteran caoz believes that its value in pushing the industry forward deserves recognition, but that it is too early to talk about disruption. Some professional evaluations show that it has not surpassed ChatGPT in solving certain key problems.
For example, in a test asking the model to write code simulating a ball bouncing inside an enclosed space, the program written by DeepSeek still fell short of the one written by ChatGPT o3-mini in terms of physical plausibility.
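The published comparisons do not include the exact prompt or the generated programs, but a minimal sketch of the kind of task involved might look like the following (the constants, function names, and structure here are illustrative assumptions, not output from either model); "physical plausibility" then means things like gravity, energy loss on impact, and no wall penetration:

```python
# Illustrative only: a minimal bouncing-ball simulation of the kind used in such tests.
def simulate(steps=1000, dt=0.01, gravity=-9.8, restitution=0.9,
             box=(0.0, 10.0), x=5.0, v=0.0):
    """Integrate 1-D vertical motion of a ball confined to [box[0], box[1]]."""
    trajectory = []
    for _ in range(steps):
        v += gravity * dt          # gravity accelerates the ball downward
        x += v * dt                # advance the position
        if x < box[0]:             # hit the floor: reflect and lose some energy
            x, v = box[0], -v * restitution
        elif x > box[1]:           # hit the ceiling: reflect and lose some energy
            x, v = box[1], -v * restitution
        trajectory.append(x)
    return trajectory

if __name__ == "__main__":
    heights = simulate()
    print(f"final height after 10 s: {heights[-1]:.3f} m")
```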
Don't over-mythologize it, but don't mindlessly disparage it either.
There are two extreme views of DeepSeek's technological achievements: one calls its breakthrough a "disruptive revolution"; the other sees it as nothing more than an imitation of foreign models, with some even speculating that it made its progress by distilling OpenAI's models.
Microsoft has claimed that DeepSeek distilled ChatGPT's outputs, and some people have seized on this to dismiss DeepSeek as worthless.
In fact, both views are too one-sided.
More accurately, DeepSeek's breakthrough is an engineering paradigm upgrade for industrial pain points, opening up a new path of "less is more" for AI reasoning.
Its innovation operates at three main levels:
First, it slims down the training architecture: the GRPO algorithm, for example, drops the critic model required by traditional reinforcement learning (the "dual-engine" design), turning a complex algorithm into a field-deployable engineering solution (a toy sketch of this critic-free setup follows this list);
Second, it adopts simple evaluation criteria, typically replacing manual scoring with deterministic signals such as compilation results and unit-test pass rates in code-generation scenarios; this rule-based reward system effectively sidesteps the problem of subjective bias in AI training;
Finally, it strikes a delicate balance in its data strategy: by combining the Zero mode, in which the model evolves purely through the algorithm itself, with the R1 mode, which needs only a few thousand manually labeled examples, it preserves the model's capacity for autonomous evolution while keeping it interpretable to humans.
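As a rough illustration of the critic-free, rule-based setup described above (this is not DeepSeek's training code; the reward function, group size, and every name here are simplified assumptions), a GRPO-style step scores a group of sampled answers with a cheap verifiable reward and normalizes within the group instead of querying a learned value model:

```python
import statistics

def rule_based_reward(answer: str, expected: str) -> float:
    """Toy verifiable reward: 1.0 if the final answer matches, else 0.0.
    In code-generation settings this could be a compile check or unit-test pass rate."""
    return 1.0 if answer.strip() == expected.strip() else 0.0

def group_relative_advantages(answers, expected):
    """GRPO-style advantages: normalize rewards within the sampled group,
    so no separate critic/value network is needed."""
    rewards = [rule_based_reward(a, expected) for a in answers]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled completions for the same prompt, graded against the answer "42".
group = ["42", "41", "42", "The answer is 43"]
print(group_relative_advantages(group, "42"))   # correct samples get positive advantage
```

Because the baseline comes from the group itself, no second "critic" network has to be trained or held in memory, which is where much of the engineering simplification comes from.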
However, these improvements neither break through the theoretical boundaries of deep learning nor completely overturn the technical paradigm of leading models such as OpenAI's o1/o3; rather, they address the industry's pain points through system-level optimization.
DeepSeek has fully open-sourced its work and documents these innovations in detail in its papers and repositories, so teams around the world can draw on them to improve their own model training.
Tanishq Mathew Abraham, former head of research at Stability AI, also highlighted three of DeepSeek's innovations in a recent blog post:
1. Multi-head Latent Attention: large language models are usually built on the Transformer architecture and use the multi-head attention (MHA) mechanism. The DeepSeek team developed a variant of MHA, Multi-head Latent Attention (MLA), that uses memory more efficiently while delivering better performance (a toy sketch of the underlying idea follows this list).
2. GRPO with verifiable rewards: DeepSeek demonstrated that a very simple reinforcement learning (RL) pipeline can actually achieve GPT-4-like results. Moreover, they developed GRPO, a variant of the PPO reinforcement learning algorithm that is more efficient and performs better.
3. DualPipe: training AI models across multiple GPUs involves a host of efficiency considerations. The DeepSeek team designed a new pipeline-scheduling method called DualPipe that is significantly more efficient and faster.
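To make the memory argument in point 1 concrete, here is a toy sketch of the low-rank idea behind caching a small latent vector instead of full per-head keys and values (this is not DeepSeek's actual MLA implementation; the dimensions and weight names are arbitrary, and details such as positional-encoding handling are omitted):

```python
import numpy as np

# Toy illustration: instead of caching full per-head keys/values for every token,
# cache one small latent vector per token and expand it when attention is computed.
d_model, d_latent, n_heads, d_head, seq_len = 1024, 64, 8, 128, 512
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) * 0.02           # compress hidden state
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # reconstruct keys
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # reconstruct values

hidden = rng.standard_normal((seq_len, d_model))
latent_cache = hidden @ W_down        # this is all that needs to be kept in the KV cache

keys   = latent_cache @ W_up_k        # recomputed on the fly at attention time
values = latent_cache @ W_up_v

full_cache_floats   = seq_len * 2 * n_heads * d_head   # standard KV cache size
latent_cache_floats = seq_len * d_latent               # compressed cache size
print(f"cache size ratio: {latent_cache_floats / full_cache_floats:.3f}")
```

In this toy configuration the cache shrinks to about 3% of a standard KV cache, which is the kind of saving that matters most at long context lengths.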
Strictly speaking, "distillation" refers to training on the teacher model's token probabilities (logits), but ChatGPT does not expose such data, so it is essentially impossible to "distill" ChatGPT in that sense.
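For reference, here is a minimal sketch of what logit-level distillation actually requires (the temperature, vocabulary size, and names are illustrative): the loss is computed over the teacher's full probability distribution at every position, which a closed chat API simply does not return.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the full vocabulary distribution.
    This needs the teacher's logits for every token, not just its sampled text."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return float(np.sum(p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9))))

# Toy example: a vocabulary of 5 tokens at a single position.
teacher = np.array([2.0, 1.0, 0.1, -1.0, -2.0])
student = np.array([1.5, 0.5, 0.3, -0.5, -1.5])
print(f"distillation loss: {distillation_loss(student, teacher):.4f}")
```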
So from a technical standpoint, DeepSeek's achievement should not be in doubt. Since OpenAI has never disclosed o1's chain-of-thought reasoning process, it would be difficult to reach this result simply by "distilling" ChatGPT.
caoz's own view is that DeepSeek's training may have made partial use of some distilled corpus material or done a little distillation-based verification, but the impact of this on the quality and value of the overall model should be very small.
Moreover, using a leading model's outputs to validate and optimize one's own model is routine practice for many large-model teams. It requires calling networked APIs, so the amount of information obtainable is very limited and unlikely to be a decisive factor: compared with the massive volume of Internet data, the corpus that can be gathered by calling a leading model's API is a drop in the bucket. It is reasonable to guess that such data is used more for validating and analyzing strategies than for large-scale training.
All large models draw their training corpus from the Internet, and the leading models are constantly contributing new text back to it. From this perspective, no leading model can escape the fate of being scraped and distilled, but there is no need to treat this as the key to success or failure.
In the end, everyone moves forward iteratively, each model carrying traces of the others.
2. Does DeepSeek cost only $5.5 million?
The $5.5 million figure is a conclusion that is both right and wrong, because it depends on which costs are being counted.
Tanishq Mathew Abraham gives an objective estimate of DeepSeek's cost:
First, it is necessary to understand where this number comes from. This number first appeared in the DeepSeek-V3 paper, which was published a month before the DeepSeek-R1 paper;
DeepSeek-V3 is the base model for DeepSeek-R1, which means that DeepSeek-R1 is actually trained with additional reinforcement learning on top of DeepSeek-V3.
So, in a sense, this cost figure is inherently inaccurate because it doesn't account for the additional cost of reinforcement learning training. But that additional cost is probably in the hundreds of thousands of dollars.
Fig: Discussion of costs in the DeepSeek-V3 paper
So is the $5.5 million cost claimed in the DeepSeek-V3 paper accurate?
Multiple analyses based on GPU cost, dataset size, and model size all yield similar estimates. It is worth noting that although DeepSeek V3/R1 has 671 billion parameters, it uses a mixture-of-experts architecture, so only about 37 billion parameters are active in any single call or forward pass, and it is this active-parameter count that underpins the training-cost calculation.
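As a back-of-envelope check (the GPU-hour totals and the $2-per-GPU-hour rental assumption below are the figures reported in the DeepSeek-V3 paper; everything else is simple arithmetic), the headline number reproduces like this:

```python
# Back-of-envelope reproduction of the headline figure, using the commonly cited
# numbers from the DeepSeek-V3 paper: ~2.788M H800 GPU-hours at an assumed $2/GPU-hour.
gpu_hours = 2.664e6 + 0.119e6 + 0.005e6   # pre-training + context extension + post-training
price_per_gpu_hour = 2.0                  # assumed market rental price in USD

cost = gpu_hours * price_per_gpu_hour
print(f"total GPU-hours: {gpu_hours / 1e6:.3f}M")        # ~2.788M
print(f"estimated training cost: ${cost / 1e6:.3f}M")    # ~$5.58M

# Rough wall-clock sanity check against the reported 2,048-GPU H800 cluster.
days = gpu_hours / 2048 / 24
print(f"implied wall-clock time on 2,048 GPUs: ~{days:.0f} days")
```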
It is also important to note that DeepSeek reports estimated costs based on current market prices. We do not know how much its cluster of 2,048 H800 GPUs (note: H800, not H100, a common misconception) actually cost. Buying GPU clusters in bulk is usually cheaper than buying them piecemeal, so the actual cost may be lower.
But the point is that this is only the cost of the final training run. Many smaller experiments and ablation studies precede the final run, and they all incur considerable costs that are not captured in this figure.
There are also numerous other costs, such as researcher salaries. According to SemiAnalysis, DeepSeek's researcher salaries are rumored to be as high as $1 million. That's comparable to the top end of the pay scale at AGI frontier labs like OpenAI or Anthropic.
Some have used these additional costs to dismiss DeepSeek's low cost and operational efficiency. That is grossly unfair: other AI companies also spend heavily on personnel, and those salaries are usually not counted into the cost of a model either.
SemiAnalysis (an independent research and analysis firm focused on semiconductors and AI) also published an analysis of DeepSeek's AI TCO (total cost of ownership). Its table covers four GPU models (A100, H20, H800, and H100) and includes the cost of purchasing the equipment, building the servers, and operating them. Over a four-year cycle, the total cost of these 60,000 GPUs comes to $2.573 billion, most of it server purchases ($1.629 billion) and operating costs ($944 million).
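A quick sanity check on those totals (the per-GPU figures below are derived by simple division for intuition and are not part of the SemiAnalysis report):

```python
# Simple division over the SemiAnalysis totals quoted above.
total_tco  = 2.573e9    # USD over four years, across 60,000 GPUs of mixed models
servers    = 1.629e9    # server purchase cost
operations = 0.944e9    # operating cost
gpus, years = 60_000, 4

print(f"capex + opex = ${(servers + operations) / 1e9:.3f}B")   # matches the quoted total
print(f"TCO per GPU over 4 years: ${total_tco / gpus:,.0f}")    # ~$42,883
print(f"TCO per GPU per year: ${total_tco / gpus / years:,.0f}")
```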
Of course, no outsider knows exactly how many cards DeepSeek owns or the actual mix of models; these are all just estimates.
To summarize: once equipment, servers, operations, and so on are all accounted for, the cost is certainly far more than $5.5 million; but as the net compute cost of the final training run, $5.5 million is already remarkably efficient.
3. Is the giants' huge capital expenditure on compute just a huge waste?
This is a widely circulated but rather one-sided view. It is true that DeepSeek has demonstrated an advantage in training efficiency, and it has exposed the possibility that some leading AI companies use their compute resources inefficiently. Even NVIDIA's short-term plunge may be related to the wide circulation of this misreading.
But that does not mean having more computing resources is a bad thing. From the perspective of scaling laws, more compute has always meant better performance. This trend has held since the Transformer architecture was introduced in 2017, and DeepSeek's models are themselves based on the Transformer architecture.
While the focus of AI development has evolved - from model size, to dataset size, and now to inferential computation and synthetic data - the core rule of "more computation equals better performance" has not changed.
DeepSeek has found a more efficient path, but the scaling laws still hold: more computational resources still yield better results.
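For reference, the commonly cited Chinchilla-style form of the scaling law (this is the general formulation from the scaling-law literature, not a formula taken from DeepSeek's papers) writes loss as a power law in model size and data:

```latex
% Chinchilla-style scaling law: loss falls as a power law in parameter count N
% and training tokens D, toward an irreducible error term E.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Engineering efficiency gains of DeepSeek's kind can be read as improving the constants in such a relation rather than changing its shape: with the same compute you get further down the curve, and more compute still takes you further.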
4. Does DeepSeek use PTX to bypass the dependency on NVIDIA CUDA?
DeepSeek's paper mentions the use of PTX (Parallel Thread Execution) programming; this kind of customized PTX optimization allows its system and models to better unleash the performance of the underlying hardware.
The original passage in the paper reads:
"We employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs."
Two interpretations of this passage are circulating online. One holds that it is an attempt to "bypass the CUDA monopoly"; the other holds that, because DeepSeek could not obtain the highest-end chips, it was forced to drop down to a lower layer to work around the limited interconnect bandwidth of the H800 GPUs and improve cross-chip communication.
According to Dai Guohao, an associate professor at Shanghai Jiao Tong University, neither statement is quite accurate. The PTX (Parallel Thread Execution) instruction set is in fact a component that sits inside the CUDA driver layer and still depends on the CUDA ecosystem, so the claim that PTX is used to bypass CUDA's monopoly is false.
Prof. Dai Guohao used a PPT to clearly explain the relationship between PTX and CUDA:
Fig: PPT by Dai Guohao, Associate Professor, Shanghai Jiao Tong University
CUDA is a relatively high-level layer that provides a range of user-facing programming interfaces. PTX, by contrast, is generally hidden inside the CUDA driver stack, so almost no deep-learning or large-model algorithm engineers ever touch this layer.
So why does this layer matter? Because, as the slide shows, PTX interacts directly with the underlying hardware, enabling finer-grained programming and control of it.
In layman's terms, DeepSeek's optimization is not a last resort forced by chip constraints, but a proactive optimization that improves communication and interconnect efficiency regardless of whether the chip is an H800 or an H100.
5. Will DeepSeek be banned abroad?
After DeepSeek took off, the five overseas giants NVIDIA, Microsoft, Intel, AMD, and AWS all listed or integrated DeepSeek, while domestically Huawei, Tencent, Baidu, Alibaba, and Volcano Engine also support DeepSeek deployment.
However, some of the commentary online is overly emotional. On one side, the fact that the foreign cloud giants have listed DeepSeek is read as "the foreigners have been beaten into submission."
In fact, these companies are deploying DeepSeek mostly for business reasons. As cloud vendors, supporting the deployment of as many popular, capable models as possible lets them serve customers better, while also riding the wave of DeepSeek-related traffic and perhaps converting some new users along the way.
It is true that vendors rushed to deploy DeepSeek during the boom, but claims of being obsessed with DeepSeek or "overwhelmed" by it are overblown.
Even more absurdly, a story was fabricated that after DeepSeek came under attack, the Chinese tech community formed an "Avengers Alliance" to come to its aid.
On the other hand, there are also voices saying that DeepSeek will soon be banned abroad one after another because of geopolitical and other practical reasons.
On this point, caoz offers a clearer reading: what we call DeepSeek actually comprises two products. One is the DeepSeek app that has taken the world by storm; the other is the open-source code repository on GitHub. The former can be thought of as a demo of the latter, a complete demonstration of its capabilities, while the latter may grow into a thriving open-source ecosystem.
What is being restricted is DeepSeek's app; what the giants are integrating and offering is deployment of DeepSeek's open-source software. These are two completely different things.
DeepSeek has entered the global AI arena as a Chinese large model, adopting one of the most permissive open-source licenses, Apache License 2.0, which even allows commercial use. The discussion around it has gone far beyond technical innovation, but technological progress is never a black-and-white debate of right versus wrong. Rather than falling into excessive hype or wholesale denial, it is better to let time and the market test its true value. After all, in the marathon of AI, the real race has only just begun.
References:
caoz, "Some Common Misinterpretations About DeepSeek": https://mp.weixin.qq.com/s/Uc4mo5U9CxVuZ0AaaNNi5g
ZeR0, "DeepSeek's Strongest Professional Teardown Is Here: A Super Hardcore Interpretation by Professors from Tsinghua, Jiao Tong, and Fudan": https://mp.weixin.qq.com/s/LsMOIgQinPZBnsga0imcvA
Tanishq Mathew Abraham (former head of research at Stability AI), "Debunking DeepSeek Delusions": https://www.tanishq.ai/blog/posts/deepseek-delusions.html