If you have ever struggled with a tough math problem, you know how useful it is to think a little longer and work through it carefully. OpenAI's o1 model showed that when LLMs are trained to do the same, by using more compute during inference, they get significantly better at solving reasoning tasks like mathematics, coding, and logic.
However, the recipe behind OpenAI's reasoning models has been kept secret. That is, until last week, when DeepSeek released their DeepSeek-R1 model and promptly broke the internet (and the stock market!).
DeepSeek-R1 not only performs as well as o1, but its release is accompanied by a detailed technical report that outlines the key steps of its training recipe. This recipe involves several innovations, most notably the application of pure reinforcement learning to teach a base language model how to reason without any human supervision. As shown in the figure below, creating a powerful reasoning model becomes very simple if you have access to a capable base model and a high-quality data mixture.
However, the DeepSeek-R1 release leaves open several questions:
- Data collection: How were the reasoning-specific datasets curated?
- Model training: DeepSeek did not release the training code, so it is unknown which hyperparameters work best and how they differ across model families and scales.
- Scaling laws: What are the compute and data trade-offs in training reasoning models?
These questions motivated the Open-R1 project, an initiative to systematically reconstruct DeepSeek-R1's data and training pipeline, validate its claims, and push the boundaries of open reasoning models. By building Open-R1, we aim to provide transparency on how reinforcement learning can enhance reasoning, share reproducible insights with the open-source community, and create a foundation for future models that leverage these techniques.
In this blog post, we take a look at the key ingredients behind DeepSeek-R1, which parts we plan to replicate, and how you can contribute to the Open-R1 project.
Let's jump in 🚀!
How did they do it?
DeepSeek-R1 is a reasoning model built on the foundation of DeepSeek-V3. Like any good reasoning model, it starts with a strong base model, and DeepSeek-V3 is exactly that: a 671B Mixture of Experts (MoE) model that performs on par with heavyweights like Sonnet 3.5 and GPT-4o. Particularly impressive is how efficiently it was trained, thanks to Multi-Token Prediction (MTP), Multi-Head Latent Attention (MLA), and a great deal of hardware optimization.
DeepSeek also introduced two models: DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero skipped supervised fine-tuning altogether and relied entirely on reinforcement learning (RL), using Group Relative Policy Optimization (GRPO) to make the process more efficient. A simple reward system was used to guide the model, providing feedback based on the accuracy and structure of its answers. This approach helped the model develop useful reasoning skills, such as breaking problems down into steps and verifying its own outputs. However, its responses often lacked clarity and were difficult to read.
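To make the reward idea concrete, here is a minimal Python sketch of the kind of rule-based reward described above: one term for answer accuracy and one for response structure. The `<think>`/`<answer>` tag format and the equal weighting of the two terms are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

def format_reward(completion: str) -> float:
    # Assumed format: reasoning inside <think>...</think>, final result inside <answer>...</answer>.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    # Compare the content of the <answer> tag with the reference solution.
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Simple unweighted sum; a real setup may weight or normalize the terms differently.
    return format_reward(completion) + accuracy_reward(completion, reference)

# A well-formatted, correct completion scores 2.0.
demo = "<think>2 + 2 equals 4.</think>\n<answer>4</answer>"
print(total_reward(demo, "4"))
```

A reward like this needs no learned reward model: both terms can be checked automatically, which is what makes pure RL on verifiable tasks feasible at scale.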
That is where DeepSeek-R1 comes in. It started with a "cold start" phase, fine-tuning on a small set of carefully crafted examples to improve clarity and readability. From there, further RL and refinement steps, which reject low-quality outputs using both human-preference-based and verifiable rewards, produced a model that gives polished and consistent answers.
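As a rough illustration of that rejection idea, the sketch below filters sampled completions with a verifiable check and then ranks the survivors with a preference score. The helpers `generate_candidates`, `preference_score`, and `is_correct` are hypothetical placeholders, not part of DeepSeek's released tooling.

```python
from typing import Callable, List, Optional

def rejection_sample(
    prompt: str,
    reference: str,
    generate_candidates: Callable[[str, int], List[str]],  # hypothetical: returns n sampled completions
    preference_score: Callable[[str, str], float],          # hypothetical: reward-model-style score
    is_correct: Callable[[str, str], bool],                 # verifiable check, e.g. exact answer match
    n_samples: int = 16,
) -> Optional[str]:
    """Keep only verifiably correct completions, then return the one the preference model likes best."""
    candidates = generate_candidates(prompt, n_samples)
    verified = [c for c in candidates if is_correct(c, reference)]
    if not verified:
        return None  # nothing passed the verifiable filter; skip or retry this prompt
    return max(verified, key=lambda c: preference_score(prompt, c))
```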
This all sounds great, but what is actually missing? Let's look at the missing pieces of the puzzle.
Open-R1: the missing pieces
The DeepSeek-R1 release is a great boon for the community, but they did not release everything. Although the model weights are open, the datasets and code used to train the model are not 😢.
The goal of Open-R1 is to build these missing pieces so that the whole research and industry community can build similar or better models using these recipes and datasets. And by doing this in the open, everybody in the community can contribute!
As shown in the figure below, here is our plan of attack:
- Step 1: Replicate the R1-Distill models by distilling a high-quality reasoning dataset from DeepSeek-R1 (see the sketch after this list).
- Step 2: Replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.
- Step 3: Show that we can go from a base model to an RL-tuned reasoning model via multi-stage training: base model → SFT → RL.

The synthetic datasets will allow everybody to fine-tune existing or new LLMs into reasoning models simply by training on them. The training recipes involving RL will serve as a starting point for building similar models from scratch and will let researchers develop even more advanced methods on top.
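For Step 1, here is a minimal sketch of what distilling a reasoning dataset could look like: querying an R1-style model through an OpenAI-compatible endpoint and storing the full completions as prompt/completion records for later SFT. The endpoint URL, model name, sampling settings, and output schema are placeholders for illustration, not the actual Open-R1 setup.

```python
import json
from openai import OpenAI

# Placeholder endpoint and deployment name; any OpenAI-compatible server hosting an R1-style model would do.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "deepseek-r1"  # assumed model name on the server

def distill(prompts: list[str], out_path: str = "reasoning_traces.jsonl") -> None:
    """Collect long-form reasoning completions for a list of prompts and save them as JSONL records."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            response = client.chat.completions.create(
                model=MODEL,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.6,   # some diversity across reasoning traces
                max_tokens=4096,   # reasoning traces can be long
            )
            record = {"prompt": prompt, "completion": response.choices[0].message.content}
            f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    distill(["What is the sum of the first 100 positive integers?"])
```

The resulting file of prompt/completion pairs is the kind of synthetic dataset that a smaller model could then be fine-tuned on in the SFT stage, although the exact schema and filtering used in Open-R1 may differ.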
Note that we do not want to stop at math datasets. There is a lot of potential in exploring other areas, not only obvious ones like code, but also scientific fields such as medicine, where reasoning models could have a significant impact.
This initiative is not just about reproducing results; it is about sharing insights with the community. By documenting what works, what does not, and why, we hope to save others from wasting time and compute on unproductive paths.
If this sounds interesting, we would love your help! Whether it is contributing code or joining the discussions on Hugging Face, there are many ways to get involved. Let's build this together!