
Closed-door discussion among Chinese and American AI entrepreneurs: Changes and new trends in AI entrepreneurship after DeepSeek-R1

Article source: FounderPark

Image source: Generated by AI

DeepSeek was undoubtedly the focus of the 2025 Spring Festival. From its app topping the Apple App Store free chart to cloud vendors racing to deploy DeepSeek-R1, DeepSeek has even become the first AI product many people ever experience. Among entrepreneurs, the conversation ranges from its technical innovations and analyses of training and inference costs to its impact on the entire AI industry.

On February 2, Founder Park and Global Ready, a global closed-door community under Geek Park, organized a closed-door discussion, inviting founders and technical experts from more than 60 AI companies in Silicon Valley, China, London, Singapore, Japan, and elsewhere for an in-depth discussion of the new technology directions and product trends triggered by DeepSeek, from the perspectives of technological innovation, product implementation, and the compute shortage.


After anonymizing the discussion, we have compiled its key points below.

01 What are DeepSeek’s innovations?

DeepSeek released its V3 base model at the end of December. It is one of the most capable open-source models in the industry: a large Mixture-of-Experts (MoE) model with 671B total parameters, of which 37B are activated per token.

The “Aha moment” of the R1 model released in January 2025 refers to the model exhibiting a degree of reflection during reasoning. For example, while solving a problem, the model may realize that its current approach no longer applies and switch to a more effective one mid-solution. This reflective ability emerges from reinforcement learning (RL).

R1 is DeepSeek’s flagship model, comparable to OpenAI’s o1 in reasoning capability. Its training can be summarized as two rounds of reinforcement learning and two rounds of SFT; the earlier RL and SFT rounds are mainly used to build a teacher model whose generated data guides the subsequent data-generation stage. The model aims to be the most powerful reasoning model currently available.

  • The core innovation of the DeepSeek R1-Zero model is that it skips the traditional supervised fine-tuning (SFT) stage and optimizes reasoning directly through reinforcement learning (RL). In addition, using DeepSeek R1 as a teacher model to distill open-source small and medium-sized models (such as Qwen at 1.5B/7B/14B/32B) can significantly improve the capabilities of small models.
  • In coding, DeepSeek’s R1 is comparable to OpenAI’s newly released o3-mini, with o3-mini slightly stronger overall. The difference is that R1 is open source, which will encourage more applications to adopt it.
  • The core of DeepSeek’s success lies in a highly integrated engineering solution that drives prices down. Taken apart, each individual technique can be found in papers from the past year; what DeepSeek does is apply the latest methods very aggressively. These methods do have side effects, such as extra storage overhead, but together they greatly improve cluster utilization.
  • Without a large cluster serving users at scale, the MLA architecture would mostly show its side effects. Many of DeepSeek’s techniques reach maximum performance only in specific scenarios and environments; used in isolation, they bring side effects rather than gains. The system design is so finely interlocked that individual techniques lose their effect when taken out of it.
  • Training only a process reward model is risky: the final result may fall short of expectations and may even overfit. DeepSeek instead chose the most primitive form of reinforcement learning, scoring the final result with heuristic rules and then correcting the process with traditional RL methods (a minimal sketch of such rule-based scoring follows this list). This choice came out of constant trial and error, made possible by DeepSeek’s highly efficient infrastructure.
  • Even if DeepSeek does not release its inference code, other teams can roughly deduce the methods it uses. The open-source model weights are enough for other teams to replicate its performance; the difficulty lies in discovering some of the special internal configurations, which takes time.
  • A reward model that relies only on data annotation makes it difficult to reach superhuman intelligence. A reward model grounded in real data or real environmental feedback is needed to achieve more advanced reward optimization and, ultimately, superhuman capability.
  • Technical speculation: if the base model itself is highly general-purpose, adding mathematical and coding capabilities on top may produce even stronger generalization. For example, take a fairly intelligent base model that is already good at writing; combined with some reinforcement learning on math and code, it may generalize well and end up with very strong abilities, such as writing in genres from parallel prose to quatrain poems, where several other models perform poorly.
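
To make the rule-based scoring above concrete, here is a minimal Python sketch of an outcome-level heuristic reward, in the spirit of R1-Zero’s format-plus-accuracy rewards. The tag names and weights are illustrative assumptions, not DeepSeek’s actual reward code.

```python
import re

def rule_based_reward(completion: str, ground_truth: str) -> float:
    """Score a completion with heuristic rules instead of a learned reward
    model. Sketch only: the tags and weights are assumptions."""
    reward = 0.0

    # Format reward: the completion should wrap its reasoning in the
    # expected tags (R1-Zero used a <think>...</think>-style template).
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        reward += 0.1

    # Accuracy reward: extract the final answer and compare it to the
    # known ground truth -- only possible for tasks with clear answers.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0

    return reward
```

Because the score depends only on checkable facts (format and final answer), no process reward model has to be trained or trusted, which is exactly what makes this approach robust against reward overfitting.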

02 Why are DeepSeek’s costs so low?

  • The model’s sparsity is very high. Although it is a large model with more than 600B parameters, only 37B are actually activated per token during inference, which means its inference speed and resource consumption are roughly those of a 37B-parameter model (see the routing sketch after this list). Achieving this, however, requires design changes across the entire system.
  • In DeepSeek V3, the MoE architecture contains 256 expert modules, but only a small number are activated for each inference step. Under high load it can dynamically adjust resource utilization, which in theory can reduce costs to as little as 1/256 of the original. This design reflects forward-looking software architecture on DeepSeek’s part: with sufficient system optimization, prices can drop significantly at the same capability level.
  • Model training generally has three axes of parallel partitioning. The first splits at the data level (Data Parallelism). The second splits at the model level: because the model’s layers are relatively independent, they can be partitioned into stages (Pipeline Parallelism). The third splits the model’s weights across different GPUs (Tensor Parallelism). To accommodate its sparse model design, DeepSeek heavily adjusted its training framework and pipeline: it abandoned Tensor Parallelism during training, kept only Data Parallelism and Pipeline Parallelism, and on top of these applied finer-grained Expert Parallelism, dividing the experts (up to 256) across different GPUs (a simple expert-placement sketch also follows this list). Abandoning Tensor Parallelism additionally lets DeepSeek bypass hardware limitations, making the H800 similar to the H100 in training performance.
  • In terms of model deployment, experiments show that the compute cost is controllable and the technical difficulty is not high; reproduction usually takes only one to two weeks, which is very good news for application developers.
  • A possible model architecture: instead of confining reasoning RL to the large language model itself, attach an external “thinking machine” to handle the reasoning, which could reduce overall cost by several orders of magnitude.
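
To illustrate the sparse activation described above, here is a toy top-k MoE routing layer in PyTorch. The sizes are deliberately tiny, and the layer omits DeepSeek’s real refinements (shared experts, MLA, load-balancing losses); it only demonstrates that compute per token scales with the number of activated experts, not the total.

```python
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Toy mixture-of-experts layer: only top_k of num_experts experts run
    per token. Toy sizes; DeepSeek-V3's real config (256 experts, 37B of
    671B parameters active per token) is far larger."""

    def __init__(self, d_model=64, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_experts)]
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)         # (tokens, experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep top_k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens routed to e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out

layer = SparseMoELayer()
tokens = torch.randn(16, 64)
print(layer(tokens).shape)  # torch.Size([16, 64])
```

With `top_k=2` out of 8 experts, each token touches only a quarter of the expert weights; scaling the same ratio up is what lets a 671B model behave like a 37B model at inference time.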
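
And a minimal sketch of the Expert Parallelism placement idea: experts are partitioned across GPUs, so each token’s computation touches only the GPUs holding its routed experts. The round-robin assignment and the counts here are illustrative assumptions, not DeepSeek’s actual scheduler.

```python
def assign_experts(num_experts: int = 256, num_gpus: int = 32) -> dict[int, list[int]]:
    """Round-robin expert-to-GPU placement. Each GPU hosts a disjoint
    slice of the experts; a token is dispatched only to the few GPUs
    that hold its routed experts. Illustrative numbers only."""
    placement: dict[int, list[int]] = {g: [] for g in range(num_gpus)}
    for e in range(num_experts):
        placement[e % num_gpus].append(e)
    return placement

placement = assign_experts()
print(placement[0])  # experts 0, 32, 64, ... live on GPU 0
```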

03 A chatbot may not be users’ first AI product

  • The success of DeepSeek R1 lies not only in its reasoning capability but also in its combination with search. A reasoning model plus search is, to some extent, a micro agent framework (see the sketch after this list). For most users this is their first experience with a reasoning model, and even for users who have already tried other reasoning models (such as OpenAI’s o1), DeepSeek R1 combined with search is a completely new experience.
  • For users who have never used an AI product, their first AI product may not be a language-interaction product like ChatGPT; it may be a product in some other model-driven scenario.
  • The competitive barrier for application-oriented AI companies lies in product experience. Whoever ships faster and better, with features that make users feel more comfortable, will gain a competitive advantage in the market.
  • Showing the model’s thinking process is a satisfying design for now, but it is more an early artifact of using reinforcement learning (RL) to improve model capability. The length of the reasoning process is not the only criterion for the correctness of the final result; in the future, models will likely shift from long, complex reasoning traces to more concise short ones.
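
A minimal sketch of the “reasoning model + search” pattern mentioned above. The `search_fn` and `reason_fn` callables are placeholders for whatever search API and model endpoint a product actually wires in; the prompt format is an assumption.

```python
def answer_with_search(question: str, search_fn, reason_fn) -> str:
    """Sketch of the 'reasoning model + search' micro-agent pattern:
    retrieve documents first, then let the reasoning model answer with
    the results in context. search_fn and reason_fn are placeholders."""
    docs = search_fn(question)          # e.g. top-5 web snippets
    context = "\n\n".join(docs)
    prompt = (
        "Use the search results below to answer the question.\n\n"
        f"Search results:\n{context}\n\n"
        f"Question: {question}"
    )
    return reason_fn(prompt)            # single call to a reasoning model
```

Even this one retrieve-then-reason step already behaves like a tiny agent loop, which is why pairing R1 with search felt like a new product category to many users.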

04 AI deployment in vertical scenarios becomes easier

  • For relatively vertical tasks, evaluation can be done with a rule system and need not rely on complex reward models. On a well-defined vertical task, projects like TinyZero or a 7B model can quickly reach usable results.
  • Training a DeepSeek-distilled model of 7B parameters or larger on a well-defined vertical task can quickly produce an “aha moment”. In cost terms, tasks with clear answers, such as simple arithmetic or blackjack, need only 2-4 H100 or H200 cards on a 7B model; in less than half a day the model can converge to a usable state.
  • In vertical domains, especially tasks with clear answers such as mathematical calculation or physical-rule judgment (whether object placement and movement obey physical laws), DeepSeek R1 does outperform other models at a controllable cost, so it can be applied across a wide range of vertical domains. For tasks without clear answers, however, such as judging whether something is beautiful or whether an answer delights people, subjective assessment cannot be handled well by a rule-based approach; it may take three to six months before a better method emerges to solve these problems.
  • With supervised fine-tuning (SFT) or similar methods, assembling datasets is time-consuming, and their domain distribution often fails to cover every level of the task. There is now a newer and better set of tools which, paired with a high-quality model, can resolve the old data-collection difficulties for vertical tasks with clear answers.
  • Purely rule-based approaches work where clear rules can be defined, as in mathematics and code, but become very difficult for more complex or open-ended tasks. So teams may eventually explore models better suited to evaluating results in these complex scenarios, adopting an ORM (outcome reward model) approach instead of a PRM (process reward model), or other similar approaches (a sketch of the ORM/PRM distinction follows this list). Eventually, simulators resembling a “world model” may be built to give models better feedback on their decisions.
  • When training reasoning skills into small models, you do not even need to rely on token-based solutions. In one e-commerce solution, the reasoning capability was separated from the Transformer-based model entirely: another small model handles all the reasoning and is combined with the Transformer to complete the task.
  • For companies that develop models for their own use (such as hedge funds), the challenge is cost. Large companies can amortize costs across customers, but small teams cannot afford high R&D costs. DeepSeek’s open source matters greatly to them: teams that previously could not bear the research and development costs can now build models.
  • In the financial field, especially quantitative funds, it is often necessary to analyze a large amount of financial data, such as company earnings reports and Bloomberg data. These companies typically build their own datasets and conduct supervised training, but the cost of data annotation is very high. For these companies, the application of reinforcement learning (RL) in the fine-tuning stage can significantly improve model performance and achieve a qualitative leap.
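
A small sketch of the ORM-versus-PRM distinction discussed above: an outcome reward scores only the final answer, while a process reward scores every intermediate step (here via a placeholder `step_scorer`). Both functions are illustrative, not any production reward implementation.

```python
def outcome_reward(final_answer: str, ground_truth: str) -> float:
    """ORM-style scoring: only the final answer matters, so it works
    wherever a task has a single checkable answer."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0

def process_reward(steps: list[str], step_scorer) -> float:
    """PRM-style scoring: each intermediate reasoning step is judged by
    step_scorer (a placeholder for a learned model or rule checker) and
    the per-step scores are averaged."""
    if not steps:
        return 0.0
    return sum(step_scorer(s) for s in steps) / len(steps)
```

The trade-off follows directly from the discussion: ORMs are cheap and hard to game when answers are verifiable, while PRMs give denser feedback but inherit whatever biases the step scorer has.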

05 Domestic chips are expected to solve the inference compute problem

  • There are still many domestic chips in China benchmarked against the A100 and A800, but the biggest bottleneck for domestic chips is not chip design; it is fabrication (tape-out). DeepSeek has also adapted to Huawei, because Huawei can manufacture chips relatively stably and keep training and inference stable under future, stricter sanctions.
  • As for Nvidia’s future development: from a single-card training perspective, these high-end chips offer excess compute in some application scenarios. For example, a single card’s compute may go underutilized during training because of cache and memory limitations, making it not the best fit for training tasks.
  • In the domestic chip market, if vendors focus entirely on AI applications, ignore scientific computing, and sharply reduce high-precision floating-point capability to concentrate on AI tasks alone, they can catch up with Nvidia’s flagship chips on some performance metrics.

06 More powerful Agents and cross-application invocation capabilities

  • For many vertical areas, agent capabilities will improve greatly. You can start from a base model and turn a set of rules into a rule model, which may be a purely engineered solution, then use that engineering solution to iteratively train the underlying model on top of it. The result may already show some superhuman capability. On that basis, apply some preference tuning so its answers better match human reading habits, and you may end up with a more powerful reasoning agent for a particular vertical field.
  • This raises a problem: you may not get an agent that generalizes strongly across all verticals. Once an agent is trained for a specific domain, it works only in that domain and does not transfer to others. But this is a viable direction, because DeepSeek itself brings inference costs down: you can pick a model, run a series of reinforcement training passes, and after training it serves only one vertical field, ignoring the rest. For vertical AI companies, this is an acceptable solution.
  • From an academic perspective, an important trend in the coming year is transferring existing reinforcement learning methods to large models to address today’s problems of insufficient generalization and inaccurate evaluation, further improving model performance and generalization. As reinforcement learning is applied, the ability to output structured information will improve greatly, ultimately supporting a wider range of application scenarios, especially better generation of charts and other structured content.
  • More and more people can use R1 for post-training, and everyone can build their own agent. The model layer will split into different agent models, with different tools solving problems in different fields, ultimately realizing a multi-agent system.
  • 2025 may be the first year of agents, and many companies will launch agents that can plan tasks. However, there is currently a lack of sufficient data to support these tasks. Planning tasks might include helping users order takeout, book travel, or check ticket availability for attractions. These tasks require large amounts of data and reward mechanisms to evaluate model accuracy: for a task like planning a trip to Zhangjiajie, how do you judge right from wrong, and how does the model learn? These questions will become the next research focus, with reasoning capability ultimately applied to practical problems.
  • Cross-application invocation will become a hot spot in 2025. On Android, thanks to its open-source nature, developers can implement cross-application operations through low-level permissions, so agents will eventually be able to control browsers, phones, computers, and other devices. In the Apple ecosystem, by contrast, strict permission management means agents still face great difficulty in fully controlling all the applications on a device; Apple would have to build such an all-application agent itself. And although Android is open source, opening up low-level permissions on phones, tablets, and computers still requires cooperation with manufacturers such as OPPO and Huawei to obtain data and support agent development.
