Article source: Economic Observer
Image source: Generated by AI
Introduction
1 | For companies such as Google, Meta, and Anthropic, reproducing a reasoning model similar to DeepSeek-R1 is not difficult. But when giants are competing for dominance, even a small decision-making mistake can cost them the window of opportunity.
2 | The DeepSeek-V3 model's net compute cost of roughly US$5.58 million is already very efficient. Beyond cost, what excites AI industry professionals even more is DeepSeek's distinctive technical path, its algorithmic innovation, and its sincerity about open source.
3 | Large models cannot escape the "hallucination" problem, and DeepSeek is no exception. Some users say that because DeepSeek expresses itself and reasons so well, its hallucinations are harder to spot.
In the past few weeks, DeepSeek has caused a storm around the world.
The clearest reflection was in U.S. equities: on January 27, AI and chip stocks fell sharply. Nvidia closed down more than 17%, and US$589 billion of its market value evaporated in a single day, a record in U.S. stock market history.
In the eyes of some social media commentators and the public, DeepSeek is "the most exciting protagonist of 2025," with four "cool points":
The first is "a mysterious force overtaking on the curve." DeepSeek is a "young" large-model company founded in 2023; before this, it drew less discussion than any major manufacturer or star startup at home or abroad. Its parent company, the quantitative fund High-Flyer, is primarily in the business of quantitative investing. Many people are puzzled that China's leading AI company emerged from a private fund manager, a case of "flailing punches knocking out the old master."
The second is "small effort works miracles." The training cost of the DeepSeek-V3 model is approximately US$5.58 million, less than one-tenth that of OpenAI's GPT-4o model, yet its performance comes close. This is read as DeepSeek overturning the "bible" of the AI industry, the Scaling Law, which holds that model performance improves as training data, parameters, and computing power grow. In practice that means spending more money to label high-quality data and buy compute chips, an approach vividly described as "brute force works miracles."
The third is "Nvidia's moat disappears." DeepSeek mentioned in its paper that programming customized kernels in NVIDIA's lower-level PTX (Parallel Thread Execution) language can better unlock the performance of the underlying hardware. This was interpreted as DeepSeek "bypassing the NVIDIA CUDA computing platform."
The fourth is "the foreigners were beaten into conviction." On January 31, overseas AI giants such as Nvidia, Microsoft, and Amazon all integrated DeepSeek overnight. For a time, verdicts such as "China's AI has overtaken the United States," "the OpenAI era is over," and "demand for AI computing power has disappeared" appeared one after another, almost unanimously praising DeepSeek and mocking Silicon Valley's AI giants.
However, the panic in the capital markets did not last. On February 6, Nvidia's market value returned to US$3 trillion, and U.S. AI and chip stocks broadly rebounded. Looked at from this point, the four "cool points" above are mostly misunderstandings.
First, by the end of 2017, almost all of High-Flyer's quantitative strategies were computed with AI models. The AI field was then riding the most important wave of deep learning, and High-Flyer was right at the frontier.
High-Flyer's deep-learning training platform "Fire-Flyer 2" was later equipped with about 10,000 NVIDIA A100 GPUs. Ten thousand GPUs is the computing-power threshold for training large models in-house. Although this cannot be equated with DeepSeek's current resources, High-Flyer got its ticket to the large-model race earlier than many major Internet companies.
Second, DeepSeek noted in its V3 technical report that the US$5.58 million "does not include the costs of prior research and ablation experiments on architectures, algorithms, or data." This means DeepSeek's actual costs are higher.
Many AI industry experts and practitioners told the Economic Observer that DeepSeek has not changed the rules of the industry, but has adopted “smarter” algorithms and architecture to save resources and improve efficiency.
Third, the PTX language was developed by Nvidia and is part of the CUDA ecosystem. DeepSeek's approach does squeeze more performance out of the hardware, but switching to a different target task means rewriting the programs, which is a great deal of work.
Fourth, companies such as Nvidia, Microsoft, and Amazon are simply deploying DeepSeek's models on their own cloud services. Users pay the cloud providers on demand for a more stable experience and more efficient tooling, a win-win arrangement.
Since February 5, domestic cloud vendors such as Huawei Cloud, Tencent Cloud, and Baidu Cloud have also launched DeepSeek models one after another.
Beyond these four "cool points," the public still holds many misconceptions about DeepSeek. The "feel-good" reading certainly delivers a thrill, but it also obscures the DeepSeek team's innovations in algorithms and engineering and its commitment to the open-source spirit, and it is the latter two that have the more profound impact on the technology industry.
It is not that the American AI giants cannot beat it; they made decision-making mistakes
When users open DeepSeek's app or web version and click the "Deep Thinking (R1)" button, the complete thinking process of the DeepSeek-R1 model is displayed. This is a brand-new experience.
Since the advent of ChatGPT, most large models have directly output answers.
DeepSeek-R1 has a famous example of "breaking into the mainstream": when a user asked "Which is better, University A or Tsinghua University?", DeepSeek first answered "Tsinghua University." When the user followed up with "I am a student of University A, please answer again," the answer became "University A is better." After this exchange was posted on social media, it amazed people that "AI actually understands how the world works."
Many users said the thinking process DeepSeek displays is like a "person" brainstorming while scribbling shorthand on scratch paper. It calls itself "I," reminds itself to "avoid making the user feel their school is being devalued" and to "praise their alma mater with positive words," and "writes down" everything that comes to mind.
On February 2, DeepSeek topped the app stores in 140 countries and regions, and tens of millions of users got to try the deep-thinking feature. In users' perception, therefore, showing the AI's thinking process was DeepSeek's "first."
In fact, OpenAI's o1 model pioneered the reasoning paradigm. OpenAI released a preview version of o1 in September 2024 and the official version in December. But unlike the DeepSeek-R1 model, which can be tried for free, the o1 model was available only to a small number of paying users.
Liu Zhiyuan, a tenured associate professor at Tsinghua University and chief scientist of ModelBest, believes that the DeepSeek-R1 model's global success has a lot to do with OpenAI's poor decisions. After OpenAI released the o1 model, it neither open-sourced it nor disclosed technical details, and it charged high fees, so the model never broke into the mainstream and users around the world could not feel the shock of deep thinking. This strategy effectively ceded to DeepSeek the breakout status that ChatGPT once enjoyed.
Technically speaking, current large models follow two broad paradigms: pre-trained models and reasoning models. The better-known OpenAI GPT series and the DeepSeek-V3 model are pre-trained models.
OpenAI's o1 and DeepSeek-R1 are reasoning models, a new paradigm in which the model gradually decomposes a complex problem through its own chain of thought, reflects step by step, and then arrives at relatively accurate and insightful results.
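For readers who want to see the difference in practice, the sketch below queries a reasoning model through an OpenAI-compatible API and prints the visible thinking separately from the final answer. The endpoint, model name, and the reasoning_content field follow DeepSeek's public API documentation as of early 2025 and may change; treat it as an illustration rather than a guaranteed interface.

```python
# Minimal sketch: querying a reasoning model and reading its visible "thinking".
# Assumes an OpenAI-compatible endpoint; the base_url, model name
# ("deepseek-reasoner") and the reasoning_content field come from DeepSeek's
# public API docs and may change over time.
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-reasoner",  # reasoning model (R1 family)
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
)

msg = resp.choices[0].message
print("--- thinking process ---")
print(getattr(msg, "reasoning_content", ""))  # step-by-step chain of thought
print("--- final answer ---")
print(msg.content)                            # the answer shown to the user
```

A pre-trained chat model returns only the final `content`; the reasoning model additionally exposes the intermediate deliberation that users found so striking.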
Guo Chengkai, who has been engaged in AI research for decades, told the Economic Observer that the reasoning paradigm is a track where "overtaking on the curve" is relatively easy. As a new paradigm, it iterates quickly and can deliver significant improvements with modest compute. The prerequisite is a strong pre-trained model: through reinforcement learning, the latent potential of large pre-trained models can be deeply tapped, approaching the ceiling of what large models can do under the reasoning paradigm.
For companies such as Google, Meta, and Anthropic, reproducing a reasoning model similar to DeepSeek-R1 is not difficult. But when giants are competing for dominance, even a small decision-making mistake can cost them the window of opportunity.
The clearest evidence: on February 6, Google released Gemini 2.0 Flash Thinking, a reasoning model with a lower price and a longer context window. It beat R1 in several benchmarks, yet it did not set off anything like the waves DeepSeek-R1 did.
What is most worth discussing is not the low cost, but the technological innovation and the "sincere" open source
The most extensive discussion around DeepSeek has always been about "low cost." Since the release of the DeepSeek-V2 model in May 2024, the company has been jokingly called the "Pinduoduo of the AI industry" for driving prices down.
Nature published an article noting that Meta spent more than US$60 million to train its latest AI model, Llama 3.1 405B, while training DeepSeek-V3 cost less than one-tenth of that, showing that the efficient use of resources matters more than sheer computing scale.
Some organizations believe DeepSeek's training costs are understated. SemiAnalysis, an AI and semiconductor industry research firm, said in a report that DeepSeek's quoted pre-training cost is far from its actual investment in the model. By the firm's estimate, DeepSeek's total spending on GPUs amounts to US$2.573 billion, of which US$1.629 billion went to servers and US$944 million to operating expenses.
Even so, a net compute cost of roughly US$5.58 million for the DeepSeek-V3 model is already very efficient.
In addition to cost, what makes AI industry professionals even more excited is DeepSeek’s unique technical path, algorithm innovation and sincerity in open source.
Guo Chengkai explained that many current approaches rely on classic large-model training methods such as supervised fine-tuning (SFT), which requires large amounts of annotated data. DeepSeek proposed a new route, improving reasoning ability through large-scale reinforcement learning (RL), which effectively opens up a new research direction. In addition, Multi-head Latent Attention (MLA) is a key innovation that lets DeepSeek significantly reduce inference costs.
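To convey the flavor of that RL route, the toy sketch below scores sampled answers automatically, rewarding a required output format and a verifiably correct final answer, instead of relying on hand-labeled data. It is only a schematic of the idea, not DeepSeek's actual algorithm (the R1 paper uses GRPO), and `policy_model.generate` / `reinforce_step` are hypothetical placeholders for whatever RL framework is used.

```python
# Toy sketch of reinforcement learning with rule-based rewards for reasoning.
# NOT DeepSeek's actual recipe; the policy_model methods are placeholders.
import re

def rule_based_reward(answer_text, reference_answer):
    """Score a sampled answer without any human labeling."""
    reward = 0.0
    # Format reward: reasoning must be wrapped in <think> ... </think>.
    if re.search(r"<think>.+?</think>", answer_text, flags=re.S):
        reward += 0.2
    # Accuracy reward: the final answer must match a verifiable reference.
    final = answer_text.split("</think>")[-1].strip()
    if reference_answer in final:
        reward += 1.0
    return reward

def training_step(policy_model, question, reference_answer, num_samples=8):
    # 1) Sample a group of candidate answers from the current policy.
    samples = [policy_model.generate(question) for _ in range(num_samples)]
    # 2) Score each sample with cheap, automatic rules (no annotated data).
    rewards = [rule_based_reward(s, reference_answer) for s in samples]
    # 3) Update the policy so that high-reward samples become more likely.
    policy_model.reinforce_step(question, samples, rewards)
```

The point of the sketch is the economics: the reward signal comes from rules and verifiable answers rather than from armies of annotators, which is what makes large-scale RL a cheaper path to reasoning ability.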
Zhai Jidong, a professor at Tsinghua University and chief scientist of Qingcheng Jizhi, said what impressed him most about DeepSeek was its innovation in the Mixture-of-Experts (MoE) architecture, with 256 routed experts and 1 shared expert in each layer. Earlier work relied on an auxiliary loss for load balancing, which disturbs the gradients and hurts model convergence. DeepSeek proposed an auxiliary-loss-free method that lets the model converge effectively while still keeping the load balanced across experts.
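The idea, as described in the V3 technical report, is roughly this: each expert carries a bias that is added to the router scores only when choosing which experts handle a token, and that bias is nudged up or down outside of backpropagation according to how loaded the expert has been. The PyTorch-style sketch below is a simplified illustration with made-up sizes and update rule, not DeepSeek's exact implementation.

```python
import torch
import torch.nn as nn

class LossFreeRouter(nn.Module):
    """Toy top-k MoE router with bias-based (auxiliary-loss-free) balancing."""

    def __init__(self, hidden_dim, num_experts=8, top_k=2, bias_lr=0.001):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        # Balancing bias: used only to pick experts, never receives gradients.
        self.register_buffer("balance_bias", torch.zeros(num_experts))
        self.top_k, self.bias_lr = top_k, bias_lr

    def forward(self, x):                       # x: [tokens, hidden_dim]
        scores = self.gate(x).sigmoid()         # token-to-expert affinities
        # Selection uses biased scores; the bias steers tokens toward idle experts.
        _, expert_idx = (scores + self.balance_bias).topk(self.top_k, dim=-1)
        # Combination weights use the *unbiased* scores, so the bias never
        # enters the gradient path and cannot disturb convergence.
        weights = scores.gather(-1, expert_idx)
        weights = weights / weights.sum(-1, keepdim=True)

        if self.training:                       # bias update happens outside autograd
            with torch.no_grad():
                load = torch.zeros_like(self.balance_bias)
                load.scatter_add_(0, expert_idx.flatten(),
                                  torch.ones(expert_idx.numel(), device=x.device))
                # Overloaded experts get a lower bias, underloaded a higher one.
                self.balance_bias -= self.bias_lr * torch.sign(load - load.mean())
        return expert_idx, weights
```

Because balancing is handled by this out-of-band bias rather than an extra loss term, the training objective stays purely about modeling quality, which is what the "loss-free" label refers to.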
Zhai Jidong emphasized: "The DeepSeek team dares to innovate. Not simply following foreign strategies, and having ideas of your own, is in my view very important."
What excites AI practitioners even more is that DeepSeek's "sincere" open source has given a shot in the arm to an open-source community that had been losing momentum.
Before this, the strongest pillar of the open-source community was Meta's roughly 400-billion-parameter Llama 3. But many developers told the Economic Observer that after trying it, they still felt Llama 3 was at least a generation behind closed-source models such as GPT-4, which "almost made people lose confidence."
DeepSeek's open source, however, did three things that restored developers' confidence:
First, it directly open-sourced the 671B model and released distilled models built on several popular architectures, the equivalent of "a good teacher producing more good students."
Second, its published papers and technical reports contain a wealth of technical detail. The V3 and R1 papers run to roughly 50 and 150 pages respectively and have been called the "most detailed technical reports" in the open-source community. This means individuals or companies with comparable resources can reproduce the models from this "manual." Many developers who read them described the work as "elegant" and "solid."
Third, DeepSeek-R1 is released under the MIT License, which means anyone may freely use, modify, distribute, and commercialize the model as long as the original copyright notice and the MIT license are retained in all copies. Users can therefore make freer use of the model weights and outputs for secondary development, including fine-tuning and distillation.
Llama, by contrast, allows secondary development and commercial use but attaches restrictions: its license imposes extra conditions on companies with more than 700 million monthly active users and explicitly prohibits using Llama's output to improve other large models.
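In practical terms, a permissive license means the weights can simply be pulled and run or fine-tuned locally. The sketch below shows the usual route via the Hugging Face transformers library; the model id is one of the distilled R1 checkpoints as published on Hugging Face, but the exact name and the hardware requirements should be verified before use.

```python
# Minimal sketch of what the MIT license permits in practice: pulling the open
# weights and running or fine-tuning them yourself. The model id below is one
# of the distilled R1 checkpoints on Hugging Face; verify availability first.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto",
                                             device_map="auto")

prompt = "Prove that the sum of two even numbers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Nothing in the license prevents fine-tuning these weights on your own data
# and redistributing the result, provided the MIT notice is kept.
```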
A developer told the Economic Observer that he has been using DeepSeek-V2 for code generation. Besides being very cheap, the DeepSeek models also perform excellently: among all the models he has used, only OpenAI's and DeepSeek's could keep the output logic valid beyond 30 levels of depth. In practice this means professional programmers can use such tools to generate 30%-70% of their code.
Several developers stressed to the Economic Observer how important DeepSeek's open source is. Before it, the industry leaders OpenAI and Anthropic were like Silicon Valley aristocrats. DeepSeek opens this knowledge to everyone and makes it common property. This is an important kind of equality: developers in open-source communities around the world can stand on DeepSeek's shoulders, and DeepSeek in turn can draw on the ideas of the world's best makers and geeks.
Yann LeCun, Turing Award winner and Meta's chief AI scientist, believes the correct reading of DeepSeek's rise is that open-source models are surpassing closed-source ones.
DeepSeek is good, but not perfect
Large models cannot escape the "hallucination" problem, and DeepSeek is no exception. Some users say that because DeepSeek expresses itself and reasons so well, the hallucinations it produces are harder to identify.
One netizen wrote on social media that he asked DeepSeek about a certain city's route planning. DeepSeek laid out reasons, cited urban-planning protection regulations and data, and conjured up the concept of a "silent zone" to make the answer look plausible.
Other AI models' answers to the same question were not as elaborate, and one could tell at a glance that they were "making things up."
After checking the protection regulations, the user found no mention of a "silent zone" anywhere in the text. His verdict: "DeepSeek is building a 'Great Wall of hallucinations' on the Chinese Internet."
Guo Chengkai has run into a similar problem. DeepSeek-R1's answers sometimes drop in technical terms without support, especially on open-ended questions, which makes the "hallucination" feel more severe. He speculated that the model's reasoning ability may be so strong that it forcibly links together large amounts of knowledge and data.
He recommends turning on the web-search function when using DeepSeek, paying close attention to the displayed thinking process, and intervening manually to correct errors. Also, when using a reasoning model, keep prompts as concise as possible: the longer the prompt, the more the model free-associates.
Liu Zhiyuan has noticed that DeepSeek-R1 is fond of high-flown terms, quantum entanglement and entropy increase and decrease being typical examples (they can be slotted into any field). He speculated that this is caused by some mechanism set up during reinforcement learning. In addition, R1's reasoning is not ideal on tasks in ordinary domains that lack ground truth (objective reference data against which outputs can be checked), and reinforcement learning training does not guarantee generalization.
In addition to the common problem of “hallucinations”, there are still some persistent issues that DeepSeek needs to address.
One is the ongoing dispute that "distillation" may trigger. Model, or knowledge, distillation usually means having a stronger model generate responses and training a weaker model on them, thereby improving the weaker model's performance.
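Schematically, the process looks like the sketch below: a teacher model writes answers to a pool of prompts, and the student is fine-tuned on those answers. The `generate` and `train_step` methods are stand-ins for whichever models and training framework are used; real pipelines add filtering, decontamination, and far larger data volumes.

```python
# Schematic sketch of response-based distillation: a stronger "teacher" answers
# prompts and a weaker "student" is fine-tuned on those answers. The teacher
# and student objects are hypothetical placeholders, not any vendor's pipeline.
def build_distillation_dataset(teacher, prompts):
    dataset = []
    for prompt in prompts:
        answer = teacher.generate(prompt)        # teacher writes the "gold" response
        dataset.append({"prompt": prompt, "response": answer})
    return dataset

def distill(student, teacher, prompts, epochs=1):
    data = build_distillation_dataset(teacher, prompts)
    for _ in range(epochs):
        for example in data:
            # Ordinary supervised fine-tuning on teacher-written responses.
            student.train_step(example["prompt"], example["response"])
    return student
```

The dispute described below is about whose model may serve as the teacher: a provider's terms of service can forbid using its outputs this way even when the technique itself is routine.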
On January 29, OpenAI accused DeepSeek of using model distillation to train its own models on the basis of OpenAI's technology. OpenAI said there was evidence that DeepSeek had used its proprietary models to train its own open-source models, but provided no further evidence. OpenAI's terms of service state that users may not "copy" any of its services or "use its output to develop models that compete with OpenAI."
Guo Chengkai believes that benchmarking against leading models through distillation to refine one's own model is common practice in large-model training. DeepSeek's models are open source, so verification is straightforward. Moreover, OpenAI's own early training data has legitimacy problems of its own. If OpenAI wants to take action against DeepSeek, it will have to defend the legitimacy of its terms at the legal level and spell out more clearly what those terms actually cover.
Another unresolved question for DeepSeek is how to push forward pre-trained models with ever larger parameter counts. Even OpenAI, with more high-quality annotated data and more computing resources, has yet to launch GPT-5, a pre-trained model at a larger parameter scale. Whether DeepSeek can keep working miracles remains open.
In any case, DeepSeek's hallucinations and the curiosity that drives it may be two sides of the same innovation. As its founder Liang Wenfeng has said: "Innovation is not entirely business-driven; it also requires curiosity and creativity. China's AI cannot follow forever. Someone has to stand at the technological frontier."