
DeepSeek is here: is a ten-thousand-card cluster still the ticket to AI?

By | Semiconductor Industry

For a time, the ceiling of artificial intelligence seemed to be the ceiling of GPU stacking, and top AI companies set a ten-thousand-card single cluster as the entry threshold for this contest of brute-force aesthetics.

OpenAI's single cluster has 50,000 cards, Google's has 26,000, and Meta's 24,500. Zhang Jianzhong, founder and CEO of Moore Threads, once said at a press conference: "On the main AI battlefield, ten thousand cards is the minimum standard."

With the birth of DeepSeek, a big drama about rewriting AI rules is underway.

01 Is the ten-thousand-card cluster still the ticket to AI?

In 2020, Microsoft took the lead in building a ten-thousand-card intelligent computing center for its AI strategy. Major technology giants soon followed: Amazon, Google, Meta, Tesla, and xAI abroad, and ByteDance, Baidu, Ant Group, Huawei, iFlytek, and Xiaomi at home have all built ten-thousand-card clusters, while Tencent and Alibaba have already moved on to 100,000-card clusters.

Building such a center requires enormous capital; the GPUs alone cost several billion yuan. Despite the cost, a ten-thousand-card intelligent computing center makes it possible to train complex large models, so the industry has regarded it as the entry ticket to the AI race.

Changjiang Securities noted in a research report that model size and training-data volume have become the key determinants of model capability. With the same model parameters and datasets, a larger, more advanced cluster can significantly shorten training time, enabling rapid iterative training and timely response to market trends. Overall, a cluster of more than 10,000 cards helps shorten large-model training time, iterate model capabilities rapidly, and catch up with or lead in large-model technology.

DeepSeek-V3 used only 2,048 H800 GPUs for training, yet achieved excellent results on multiple standard benchmarks, surpassing earlier large models on mathematical tests such as GSM8K and MATH and on coding tests such as LiveCodeBench. This naturally raises a question: if DeepSeek can be trained on a thousand-card-level cluster, is the ten-thousand-card intelligent computing center still the ticket to AI?

First, it must be acknowledged that ten-thousand-card clusters are still necessary for training large models. Second, private deployment of large models has become an industry consensus, and the market for small enterprise-deployed data centers is poised to erupt.

After DeepSeek's emergence, many companies have rushed to integrate it and deploy it locally. An enterprise can build its own small intelligent computing center with 1 to 10 servers (under 100 cards), or 10 to 20 servers (around 100 cards), and still deliver efficient AI services. This has undoubtedly changed the AI entry ticket: brute-force card stacking is no longer the only way in, and more companies can join the AI boom through algorithmic optimization.

Take RuiPath, a clinical-grade multimodal interactive pathology model jointly released by Huawei and Ruijin Hospital: using only 16 compute cards, it learned from more than 300 pathology diagnosis textbooks, and its question-answering accuracy on common questions compiled by pathologists reaches 90%.

Qualcomm believes that today's advanced small AI models already perform excellently. New techniques such as model distillation and novel AI network architectures can simplify the development process without compromising quality, allowing new models to outperform the larger, cloud-only models launched a year earlier.
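The model distillation mentioned here can be illustrated with the classic soft-label loss: a small student model is trained to match the temperature-softened output distribution of a large teacher. A minimal sketch in plain Python; the logits and temperature values are illustrative, not taken from any particular model:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions,
    the soft-label term of classic knowledge distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits give zero loss; diverging logits give a positive loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
print(distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]) > 0)  # True
```

A higher temperature softens both distributions, exposing the teacher's relative ranking of wrong answers, which is where much of the transferred "dark knowledge" lives.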

Beyond that, enterprise deployment of small intelligent computing centers also brings new opportunities to the major operators and tower companies. Small data centers need stable infrastructure: sites, power, and networks, and the physical machine-room resources of operators and tower companies are ready-made. China Tower, for example, currently has 2.1 million site resources and energy facilities, nearly one million machine rooms, and 220,000 communication towers already upgraded to digital towers. Moreover, small data centers sit close to where data is generated and can process and analyze it quickly, and demand for edge computing power is rising. China Tower's computing power is shifting from a centralized to a cloud-edge distributed paradigm: each data center adds tens of terabytes of new data daily, is estimated to connect about 200,000 stations in 2025, and its data scale will reach the tens-of-petabytes level in the future.

Gartner forecasts that by 2025, 75% of enterprise data will be processed at the edge, and the number of edge data centers will exceed three times the number of traditional data centers.

02 Data center chip revolution: training slows, inference rises

DeepSeek adopts a pure reinforcement-learning training path, removing the reliance on the supervised fine-tuning stage. Its new GRPO algorithm lets groups of model outputs learn from one another, cutting memory consumption to one-third of the traditional PPO algorithm and allowing training with fewer hardware resources. FP8 mixed-precision training reduces memory consumption by 50% and improves computing throughput by 30%; its data-distillation technology cuts the share of invalid data from an industry average of 15% to under 3%; and NVLink + InfiniBand dual-channel transmission improves intra-cluster GPU communication efficiency by 65%.
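GRPO's memory saving over PPO comes from scoring each sampled response relative to the statistics of its own group, instead of training a separate value (critic) network. A minimal sketch of the group-relative advantage computation; the reward values are illustrative:

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards):
    """Group-relative advantages as used in GRPO: each sampled response
    to a prompt is normalized against the mean and standard deviation of
    its own group, so no separate critic network is needed -- the source
    of the memory saving relative to PPO."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards)  # sample std over the group
    if sigma == 0:
        # all responses scored equally: no learning signal
        return [0.0 for _ in group_rewards]
    return [(r - mu) / sigma for r in group_rewards]

# Four sampled answers to one prompt, scored by a reward model:
print(grpo_advantages([0.2, 0.9, 0.5, 0.4]))
```

Responses above the group mean get positive advantages (their tokens are reinforced); those below get negative ones, all without a value-function forward/backward pass.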

DeepSeek's innovations have lowered training costs and are reshaping data center chips. Going forward, growth in demand for high-end training GPUs may slow, while demand for inference computing power is set for long-term growth.

On this point, the major research institutions agree. Gartner predicts that inference cluster computing power will exceed training computing power in 2025, and IDC predicts that by 2025, chips used for inference will account for 60.8%. Gong Mingde, an analyst at TrendForce, noted: "DeepSeek's momentum will prompt cloud service providers to invest more actively in low-cost, in-house ASIC solutions and to shift the focus from AI training to AI inference. By 2028, inference chips are expected to account for 50%."

As the name suggests, a training chip serves the training phase of an AI model, which requires large volumes of labeled data to fit the system to a specific function, so it emphasizes raw compute performance and memory capability. Once training is complete, an inference chip uses new data to make predictions, placing more weight on the combined metrics of compute per unit of energy, latency, and cost.

Unlike the training chip market, where Nvidia holds roughly a 98% share, the inference chip market is not yet mature and is far more diverse. Groq, a US AI chip company that earlier caused a stir online, was founded in 2016 and has raised five rounds of financing; after its latest US$640 million round in August 2024, its valuation reached US$2.8 billion. Groq's new LPU accelerator is tailored for large language models, claiming 10 to 100 times the performance of conventional GPUs and TPUs and inference speeds 10 times faster than Nvidia GPUs.

In overseas markets, Broadcom and Marvell are the major suppliers of inference chips. Broadcom has co-designed six generations of TPUs with Google, with the seventh generation expected in 2026-2027, and its AI-infrastructure cooperation with Meta may reach billions of dollars. Marvell works with Amazon, Google, and Microsoft: it currently produces Amazon's 5nm Trainium chips and Google's 5nm Axion Arm CPU chips, and is expected to take on Amazon's Inferentia chip project in 2025 and Microsoft's Maia chip project in 2026.

In the domestic market, major technology companies are also actively entering the AI inference chip field.

  • The Hanguang 800 AI chip launched by Alibaba's DAMO Academy delivers single-chip performance 8.5 times that of Google's TPU v3 and 12 times that of Nvidia's T4.
  • Baidu's Kunlun series AI chips were the first to support 8-bit inference. The Baige DeepSeek all-in-one machine is equipped with the Kunlun Core P800, with low inference latency averaging under 50 milliseconds; the Kunlun 3A surpasses the Nvidia A800.
  • Cambricon's Siyuan 590 chip supports almost all mainstream models. Its single-card computing power exceeds the Nvidia A100, and its cluster computing power approaches the A100 level, though thousand-card interconnected clusters lose some performance.

At present, large-model inference faces several optimization challenges. The first is KV cache management: inference produces large volumes of intermediate results that are cached to reduce computation, and how to manage this data is critical. Page-based management is one approach, but whether the page size should be fixed or adjusted dynamically to the load characteristics requires careful design. The second is multi-card collaboration: when a model is large, multiple GPUs must cooperate, for example running large-model inference across 8 GPUs, and optimizing inter-card parallelism is a major challenge. The third is algorithmic optimization: how to optimize from a quantization perspective to fully exploit the underlying hardware's compute.
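The page-based KV-cache management mentioned above (the idea popularized by vLLM's PagedAttention) can be sketched as a simple page allocator: each sequence's cache grows in fixed-size pages drawn from a shared pool, so a sequence wastes at most one partially filled page. The page and pool sizes below are illustrative choices, not values from any production system:

```python
class PagedKVCache:
    """Minimal sketch of paged KV-cache bookkeeping: token slots are
    allocated in fixed-size pages from a shared pool, and a per-sequence
    page table maps logical positions to physical pages."""

    def __init__(self, num_pages=64, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}   # seq_id -> list of physical page ids
        self.lengths = {}      # seq_id -> number of tokens cached

    def append_token(self, seq_id):
        """Reserve cache space for one more token, allocating a new
        page only when the sequence's last page is full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.page_size == 0:  # last page full, or no pages yet
            if not self.free_pages:
                raise MemoryError("KV-cache pool exhausted")
            self.page_table.setdefault(seq_id, []).append(self.free_pages.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's pages to the shared pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache()
for _ in range(20):                    # 20 tokens, page_size 16
    cache.append_token("req-0")
print(len(cache.page_table["req-0"]))  # 2 pages: ceil(20 / 16)
```

The fixed-vs-dynamic page-size trade-off the text raises shows up directly here: smaller pages cut internal fragmentation (the unused tail of the last page) but enlarge the page table and allocation overhead.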

03 Algorithms compensating for hardware: chip competition turns to software-hardware co-design

One key reason DeepSeek could amaze the world with 2,048 H800 chips is its extreme engineering of the hardware. Through custom CUDA kernels and operator-fusion techniques, it raised the H800 GPU's MFU (model FLOPs utilization) to 23%, far above the industry average of 15%, completing more computing work on the same hardware, improving training efficiency, and achieving 98.7% sustained utilization on its GPU clusters.
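MFU is simply achieved model FLOP/s divided by the cluster's aggregate peak FLOP/s, commonly using the 6N-FLOPs-per-token estimate for transformer training (forward plus backward pass over N parameters). A small sketch; all the numbers in the example call are illustrative, not DeepSeek's reported figures:

```python
def model_flops_utilization(params_billion, tokens_per_second,
                            num_gpus, peak_tflops_per_gpu):
    """MFU = achieved model FLOP/s / aggregate peak FLOP/s, using the
    standard ~6 FLOPs per parameter per token estimate for transformer
    training (forward + backward)."""
    achieved = 6 * params_billion * 1e9 * tokens_per_second
    peak = num_gpus * peak_tflops_per_gpu * 1e12
    return achieved / peak

# Hypothetical example: a model with 37B active parameters processing
# 1.0M tokens/s on 2,048 GPUs rated ~989 dense FP8 TFLOPS each.
mfu = model_flops_utilization(37, 1.0e6, 2048, 989)
print(f"MFU: {mfu:.1%}")
```

The gap between MFU and 100% is exactly the headroom that kernel fusion, communication overlap, and precision tricks like FP8 try to reclaim.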

This innovative approach of compensating for hardware with algorithms has been called Chinese AI's "changing lanes to overtake" by Professor Ma Jianpeng, dean of the Institute for Multiscale Research of Complex Systems at Fudan University, a leading scientist of the Shanghai Artificial Intelligence Laboratory, and an internationally renowned computational biologist. It will also force chip makers to shift from piling on process nodes to algorithm-adaptive design, reserving more interfaces to support dynamic algorithm iteration, such as programmable NPU architectures.

As AI use cases keep evolving, deploying them on hardware with completely fixed functions is clearly impractical. A programmable NPU architecture provides rich programming interfaces and development tools, supports multiple programming languages and frameworks, and lets developers easily program and configure it for new algorithm requirements. It also supports dynamic reconfiguration of computing resources, such as compute units and storage units, according to different algorithms' needs.

Most importantly, chip development is expensive. Reserved interfaces that support dynamic algorithm iteration keep a chip competitive for longer: when a new algorithm arrives, the hardware need not be redesigned; it can be adapted through software upgrades instead, with no fear of algorithm churn.

DeepSeek-V3 uses PTX, which sits below CUDA, for hardware-level optimization: it bypasses CUDA's high-level APIs and operates directly on the PTX instruction set for fine-grained hardware tuning. This reduces dependence on the CUDA high-level framework to some extent and gives developers a path to optimizing GPU resources without relying on CUDA. Meanwhile, DeepSeek's GPU code is written in Triton, the programming language proposed by OpenAI; Triton's backend can target CUDA and other GPU languages, laying a foundation for adapting to more kinds of compute chips.

Hence the many reports claiming DeepSeek has broken through Nvidia's CUDA moat. In fact, DeepSeek's move shows that chip competition has shifted from the crude hardware stacking of the early days to a new round of software-hardware co-design. The combination of open-source frameworks and domestic chips will be a breakthrough: DeepSeek runs not only on Nvidia chips but also efficiently on non-mainstream chips such as Huawei Ascend and AMD.

The more far-reaching impact is that the AI chip field is no longer dominated by Nvidia alone, and more chip companies can take part. Memory chip companies upstream of Nvidia, such as Samsung Electronics and SK Hynix, may also be forced to transform.

Until now, the strategies of memory giants such as Samsung Electronics and SK Hynix have centered on mass-producing general-purpose memory, with businesses heavily reliant on bulk supply to major customers such as Intel, Nvidia, and AMD. Bank of America earlier predicted that SK Hynix may win more than 60% of orders for Nvidia's Blackwell GPUs in 2025.

DeepSeek's release will reduce technology companies' demand for Nvidia's highest-end chips, but total market demand for AI chips will not necessarily fall. As the economist Jevons observed: although technological progress improves the efficiency of resource use, rising demand often increases total consumption.

Amazon CEO Andy Jassy has said that DeepSeek's breakthroughs will instead drive growth in overall demand for artificial intelligence. Falling costs for technologies such as AI inference do not mean companies will cut technology investment. On the contrary, lower costs let companies pursue innovative projects previously shelved for budget reasons, which ultimately raises total technology spending.

This is undoubtedly a huge opportunity for Samsung Electronics and SK Hynix to transform, shed their dependence on Nvidia, and embrace a broader market, as HBM demand shifts from high-end GPUs toward customized storage solutions, with a diversified product lineup for AI services.
