
Illustrated: How DeepSeek R1 Was Trained

Article source: Generating electricity for AI

Image source: Generated by AI

How does DeepSeek train its R1 reasoning model?

This article interprets the training process of DeepSeek-R1 based on the technical report released by DeepSeek, focusing on four strategies for building and improving reasoning models.

The original article is by researcher Sebastian Raschka and was published at:

https://magazine.sebastianraschka.com/p/understanding-reasoning-llms

This article summarizes the core training process of the R1 reasoning model.

First, based on the technical report released by DeepSeek, here is the R1 training pipeline diagram.

[Illustration 1: DeepSeek-R1 training pipeline]

The process shown in the figure above can be summarized as follows:

(1) DeepSeek-R1-Zero: This model is based on the DeepSeek-V3 base model released in December 2024. It was trained using reinforcement learning (RL) with two types of rewards. This approach is referred to as "cold start" training because it did not include a supervised fine-tuning (SFT) step, which is usually part of reinforcement learning with human feedback (RLHF).

(2) DeepSeek-R1: This is DeepSeek's main reasoning model, built on DeepSeek-R1-Zero. The team refined it with an additional supervised fine-tuning stage and further reinforcement learning training, improving upon the "cold-start" R1-Zero model.

(3) DeepSeek-R1-Distill: The DeepSeek team used the supervised fine-tuning data generated in the previous steps to fine-tune Qwen and Llama models and enhance their reasoning capabilities. Although this is not distillation in the traditional sense, the process involves training smaller models (Llama 8B and 70B, and Qwen 1.5B to 32B) on the outputs of the larger 671B DeepSeek-R1 model.

Here are the four main methods for building and improving reasoning models:

1. Inference-time scaling

One way to improve an LLM's reasoning capability (or its capability in general) is inference-time scaling: adding computational resources during inference to improve output quality.

As a rough analogy, people tend to give better answers to complex questions when they have more time to think. Similarly, there are techniques that encourage an LLM to "think" more deeply while generating its answer.

A simple way to implement inference-time scaling is clever prompt engineering. A classic example is chain-of-thought (CoT) prompting, which adds phrases such as "think step by step" to the input prompt. This encourages the model to generate intermediate reasoning steps rather than jumping directly to the final answer, which often yields more accurate results on complex questions. (Note that this strategy makes no sense for simple knowledge-based questions such as "What is the capital of France?", and that is a practical rule of thumb for deciding whether a reasoning model is appropriate for a given input query.)
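As a concrete illustration of CoT prompting, here is a minimal sketch that sends the same question once with and once without a "think step by step" instruction. It assumes an OpenAI-compatible chat API; the client setup and model name are placeholders, not anything taken from the DeepSeek report.

```python
# Minimal sketch of chain-of-thought (CoT) prompting.
# Assumes an OpenAI-compatible chat API; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

prompts = {
    # Plain prompt: the model may jump straight to an answer.
    "plain": question,
    # CoT prompt: the added phrase encourages intermediate reasoning steps.
    "cot": question + "\nLet's think step by step.",
}

for name, content in prompts.items():
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": content}],
    )
    print(f"--- {name} ---")
    print(reply.choices[0].message.content)
```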

[Illustration 2: chain-of-thought (CoT) prompting]

The chain-of-thought (CoT) approach described above can be seen as a form of inference-time scaling, because it increases inference cost by generating more output tokens.

Another way to scale compute at inference time is to use voting and search strategies. A simple example is majority voting: the LLM generates multiple answers, and the final answer is selected by majority vote. Similarly, beam search and other search algorithms can be used to generate better answers.
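The sketch below shows the majority-voting idea under stated assumptions: generate_answer is a hypothetical helper that samples one answer from an LLM with temperature above zero, and only the final line of each response is compared so that different reasoning paths can still agree on the same answer.

```python
# Minimal sketch of majority voting at inference time.
# `generate_answer` is a hypothetical helper, not part of any DeepSeek code.
from collections import Counter

def generate_answer(question: str) -> str:
    """Placeholder: sample one answer from an LLM with temperature > 0."""
    raise NotImplementedError

def majority_vote(question: str, n_samples: int = 8) -> str:
    # Sample several independent answers; keep only the final line of each
    # response as the candidate answer.
    candidates = [
        generate_answer(question).strip().splitlines()[-1]
        for _ in range(n_samples)
    ]
    # The most frequent candidate wins.
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer
```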

The paper "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" is recommended here.

[Illustration 3: search-based inference-time scaling methods]

These different search-based methods rely on a process reward model to select the best answer.

The DeepSeek R1 technical report states that its models do not use inference-time scaling. However, this technique is usually implemented at the application layer on top of the LLM, so DeepSeek may still apply it within its applications.

I suspect that OpenAI's o1 and o3 models use inference-time scaling, which would explain why they are relatively expensive to use compared with models such as GPT-4o. In addition to inference-time scaling, o1 and o3 were likely trained through a reinforcement learning process similar to that of DeepSeek R1.

2. Pure reinforcement learning (pure RL)

A particularly noteworthy point in the DeepSeek R1 paper is their finding that reasoning can emerge as a behavior from pure reinforcement learning. Let's discuss what this means.

As mentioned earlier, DeepSeek developed three kinds of R1 models. The first is DeepSeek-R1-Zero, which is built on top of the DeepSeek-V3 base model. Unlike a typical reinforcement learning pipeline, in which supervised fine-tuning (SFT) is applied before RL, DeepSeek-R1-Zero was trained entirely through reinforcement learning, without an initial SFT stage, as shown in the figure below.

[Illustration 4: DeepSeek-R1-Zero trained with pure RL, without an SFT stage]

Still, this reinforcement learning process is similar to the reinforcement learning with human feedback (RLHF) approach commonly used for preference fine-tuning of LLMs. However, as mentioned above, the key difference with DeepSeek-R1-Zero is that the supervised fine-tuning (SFT) stage for instruction tuning was skipped. That is why they call it "pure" reinforcement learning (pure RL).

In terms of rewards, they did not use a reward model trained on human preferences, but instead adopted two reward types: accuracy rewards and format rewards (an illustrative sketch of such rule-based rewards follows the list below).

  • Accuracy reward: uses the LeetCode compiler to verify coding answers and a deterministic system to evaluate mathematical answers.
  • Format reward: relies on an LLM judge to ensure that responses follow the expected format, such as placing reasoning steps inside <think> tags.
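As referenced above, here is an illustrative sketch of what such rule-based rewards could look like. The exact checks DeepSeek used are not published as code, so the regular expression for the format check, the exact-match comparison for math answers, and the simple sum of the two rewards are all assumptions; coding answers would instead be verified by running a compiler or tests, as the report describes.

```python
# Illustrative approximation of the two rule-based reward types described above.
# Not DeepSeek's implementation; the specific checks are simplified assumptions.
import re

def format_reward(response: str) -> float:
    # Reward 1.0 if the response wraps its reasoning in a <think> block
    # and provides a final answer after the closing tag.
    ok = re.search(r"<think>.+?</think>\s*\S+", response, flags=re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    # For math-style tasks: compare the text after </think> with a known
    # reference answer. Coding tasks would instead run a compiler or tests.
    final = response.split("</think>")[-1].strip()
    return 1.0 if final == reference_answer.strip() else 0.0

def total_reward(response: str, reference_answer: str) -> float:
    # In the RL loop, this scalar would be fed to the policy optimizer.
    return accuracy_reward(response, reference_answer) + format_reward(response)

# Example usage:
resp = "<think>9 * 7 = 63</think> 63"
print(total_reward(resp, "63"))  # 2.0
```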

Surprisingly, this approach was enough for the LLM to develop basic reasoning skills. The researchers observed an "aha moment," in which the model began to generate reasoning traces in its responses even though it was never explicitly trained to do so, as shown in the figure below from the R1 technical report.

[Illustration 5: the "aha moment" example from the R1 technical report]

Although R1-Zero is not a top-tier reasoning model, as shown in the figure above, it does demonstrate reasoning capabilities by generating intermediate "thinking" steps. This confirms that developing a reasoning model with pure reinforcement learning is feasible, and DeepSeek was the first team to demonstrate (or at least publish) this approach.

3. Supervised fine-tuning and reinforcement learning (SFT + RL)

Next, let's look at the development process of DeepSeek's main reasoning model, DeepSeek-R1, which can be considered a textbook example of how to build a reasoning model. Building on DeepSeek-R1-Zero, this model incorporates additional supervised fine-tuning (SFT) and reinforcement learning (RL) to improve its reasoning performance.

It should be noted that adding a supervised fine-tuning stage before reinforcement learning is common in the standard reinforcement learning with human feedback (RLHF) pipeline. OpenAI's o1 was likely developed using a similar approach.

[Illustration 6: the SFT + RL stages used to build DeepSeek-R1]

As shown in the figure above, the DeepSeek team used DeepSeek-R1-Zero to generate what they call "cold start" supervised fine-tuning (SFT) data. The term "cold start" refers to the fact that this data was produced by DeepSeek-R1-Zero, which itself had not been trained on any supervised fine-tuning data.

Using this cold-start SFT data, DeepSeek first trained the model with instruction fine-tuning and then entered another reinforcement learning (RL) stage. This RL stage reused the accuracy and format rewards from DeepSeek-R1-Zero's RL process. However, they added a language consistency reward to prevent the model from mixing languages, that is, switching between multiple languages within a single response.

After the RL stage came another round of SFT data collection. In this stage, 600,000 chain-of-thought (CoT) SFT examples were generated using the latest model checkpoint, while an additional 200,000 knowledge-based SFT examples were created using the DeepSeek-V3 base model.

These 600,000 + 200,000 SFT samples were then used for instruction fine-tuning of the DeepSeek-V3 base model, followed by a final round of RL. In this stage, they again used a rule-based approach to determine accuracy rewards for math and coding problems, while human preference labels were used for other types of problems. All in all, this is very similar to regular reinforcement learning with human feedback (RLHF), except that the SFT data contains (more) chain-of-thought examples, and the RL stage uses verifiable rewards in addition to rewards based on human preferences.
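To make the staged pipeline above easier to follow, here is a high-level sketch. All helper functions are hypothetical placeholders (DeepSeek has not released its training code); only the ordering of the stages and the reward types come from the report.

```python
# High-level sketch of the DeepSeek-R1 pipeline described above.
# Every helper is a hypothetical placeholder; only the stage order and the
# reward types are taken from the technical report.

def supervised_finetune(base_model, dataset):   # placeholder
    raise NotImplementedError

def reinforcement_learning(model, rewards):     # placeholder
    raise NotImplementedError

def generate_examples(model, kind, n):          # placeholder
    raise NotImplementedError

def train_deepseek_r1(v3_base, r1_zero):
    # Stage 1: instruction fine-tuning on "cold start" CoT data from R1-Zero.
    cold_start = generate_examples(r1_zero, kind="cot", n=None)
    model = supervised_finetune(v3_base, cold_start)

    # Stage 2: RL with accuracy, format, and language-consistency rewards.
    model = reinforcement_learning(model, ["accuracy", "format", "language_consistency"])

    # Stage 3: ~600k CoT examples from the improved checkpoint plus ~200k
    # knowledge-based examples from DeepSeek-V3, used to fine-tune V3 again.
    sft_data = (generate_examples(model, kind="cot", n=600_000)
                + generate_examples(v3_base, kind="knowledge", n=200_000))
    model = supervised_finetune(v3_base, sft_data)

    # Stage 4: final RL round mixing verifiable (rule-based) rewards for
    # math/code with human-preference rewards for other prompt types.
    return reinforcement_learning(model, ["accuracy", "format", "human_preference"])
```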

The final model, DeepSeek-R1, shows a significant performance improvement over DeepSeek-R1-Zero thanks to the additional SFT and RL stages, as shown in the table below.

[Illustration 7: benchmark comparison of DeepSeek-R1 and DeepSeek-R1-Zero]

4. Pure supervised fine-tuning (SFT) and distillation

So far, we have introduced three key methods for building and improving reasoning models:

1/ Inference-time scaling, a technique that improves reasoning capability without training or otherwise modifying the underlying model.

2/ Pure RL, as in DeepSeek-R1-Zero, which showed that reasoning can emerge as a learned behavior without supervised fine-tuning.

3/ Supervised fine-tuning (SFT) + reinforcement learning (RL), which produced DeepSeek's reasoning model DeepSeek-R1.

What remains is model "distillation." DeepSeek also released smaller models trained through what they call a distillation process. In the context of LLMs, distillation does not necessarily follow the classic knowledge distillation method used in deep learning. Traditionally, in knowledge distillation, a smaller "student" model is trained on the logits of a larger "teacher" model as well as a target dataset.

However, distillation here refers to instruction fine-tuning smaller LLMs, such as the Llama 8B and 70B and Qwen 2.5 (0.5B to 32B) models, on a supervised fine-tuning (SFT) dataset generated by larger LLMs. Specifically, these larger LLMs are DeepSeek-V3 and an intermediate checkpoint of DeepSeek-R1. In fact, the SFT data used for this distillation process is the same dataset described in the previous section that was used to train DeepSeek-R1.
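Conceptually, this kind of distillation is just supervised fine-tuning on teacher-generated responses. The sketch below makes that explicit; query_teacher and the JSONL layout are hypothetical, and no teacher logits or KL terms are involved, only (prompt, response) pairs.

```python
# Minimal sketch of distillation as used here: plain SFT of a smaller student
# on responses generated by a larger teacher (e.g. DeepSeek-R1).
# `query_teacher` is a hypothetical helper; the JSONL layout is an assumption.
import json

def query_teacher(prompt: str) -> str:
    """Placeholder: call the large teacher model and return its full response,
    including the reasoning trace."""
    raise NotImplementedError

def build_distillation_dataset(prompts, path="distill_sft.jsonl"):
    # Each record is an ordinary instruction-tuning example; the teacher's
    # response becomes the training target for the student model.
    with open(path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            record = {"prompt": prompt, "response": query_teacher(prompt)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path

# Any standard SFT script for Llama or Qwen checkpoints could then train the
# student on distill_sft.jsonl; no teacher logits are needed.
```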

To illustrate this process, I highlight the distillation section in the figure below.

[Illustration 8: the distillation step highlighted within the DeepSeek-R1 pipeline]

Why did they develop these distillation models? There are two key reasons:

1/ Smaller models are more efficient. This means they are cheaper to run and can also run on lower-end hardware, making them particularly attractive to many researchers and enthusiasts.

2/ A case study of pure supervised fine-tuning (SFT). These distilled models serve as an interesting benchmark, showing how far pure supervised fine-tuning alone can take a model without reinforcement learning.

The following table compares the performance of these distilled models with other popular models, as well as with DeepSeek-R1-Zero and DeepSeek-R1.

[Illustration 9: benchmark comparison of the distilled models, other popular models, DeepSeek-R1-Zero, and DeepSeek-R1]

As we can see, although the distilled models are orders of magnitude smaller than DeepSeek-R1, they are significantly stronger than DeepSeek-R1-Zero, yet still weaker than DeepSeek-R1. It is also interesting that these models perform well compared to o1-mini (it is suspected that o1-mini itself may be a similarly distilled version of o1).

There is another interesting comparison worth mentioning. The DeepSeek team tested whether the emergent reasoning behavior seen in DeepSeek-R1-Zero could also appear in smaller models. To investigate this, they applied the same pure reinforcement learning approach used for DeepSeek-R1-Zero directly to Qwen-32B.

The following table summarizes the results of this experiment, where QwQ-32B-Preview is a reference reasoning model based on Qwen 2.5 32B developed by the Qwen team. This comparison provides some additional insight into whether pure reinforcement learning alone can induce reasoning in a model much smaller than DeepSeek-R1-Zero.

[Illustration 10: Qwen-32B trained with pure RL versus distillation, compared with QwQ-32B-Preview]

Interestingly, the results show that distillation is much more effective than pure reinforcement learning for smaller models. This is consistent with the view that reinforcement learning alone may not be enough to induce strong reasoning capabilities in models of this size, and that supervised fine-tuning based on high-quality reasoning data may be a more effective strategy when dealing with small models.

Conclusion

We explored four different strategies for building and improving reasoning models:

  1. Inference-time scaling: requires no additional training but increases inference cost, so large-scale deployment becomes more expensive as the number of users or queries grows. Still, it remains a simple and effective way to improve the performance of an already strong model. I strongly suspect that o1 uses inference-time scaling, which helps explain why it costs more per token than DeepSeek-R1.
  2. Pure RL: interesting from a research perspective because it gives us insight into reasoning as an emergent behavior. In practical model development, however, combining reinforcement learning with supervised fine-tuning (RL + SFT) is the better choice because it builds stronger reasoning models. I also strongly suspect that o1 was trained with RL + SFT; more precisely, I think o1 starts from a weaker, smaller base model than DeepSeek-R1 but closes the gap through RL + SFT and inference-time scaling.
  3. RL + SFT: as described above, this is a key method for building high-performance reasoning models, and DeepSeek-R1 offers an excellent blueprint for it.
  4. Distillation: an attractive method, especially for creating smaller, more efficient models. Its limitation, however, is that distillation alone does not drive innovation or produce the next generation of reasoning models, since it always relies on an existing, stronger model to generate the supervised fine-tuning (SFT) data.

Next, an interesting direction I look forward to seeing is combining RL + SFT (method 3) with inference-time scaling (method 1). This is probably what OpenAI's o1 is doing, except that o1 is likely based on a weaker base model than DeepSeek-R1, which explains why DeepSeek-R1 performs so well while remaining relatively cheap at inference time.
