
github.com/huggingface/open-r1 @main sqlite

246 symbols 973 edges 39 files 124 documented · 50%
README

Open R1

A fully open reproduction of DeepSeek-R1. This repo is a work in progress, let's build it together!

Table of Contents
1. Overview
2. Plan of attack
3. Installation
4. Training models
- SFT
- GRPO
5. Evaluating models
6. Reproducing Deepseek's evaluation results
7. Data generation
- Generate data from a smol distilled R1 model
- Generate data from DeepSeek-R1
8. Contributing

Overview

The goal of this repo is to build the missing pieces of the R1 pipeline such that everybody can reproduce and build on top of it. The project is simple by design and mostly consists of:

  • src/open_r1: contains the scripts to train models as well as generate synthetic data:
    • grpo.py: trains a model with GRPO on a given dataset.
    • sft.py: performs a simple SFT of a model on a dataset.
    • generate.py: generates synthetic data from a model using Distilabel.
  • Makefile: contains easy-to-run commands for each step in the R1 pipeline leveraging the scripts above.

Plan of attack

We will use the DeepSeek-R1 tech report as a guide, which can roughly be broken down into three main steps:

  • Step 1: replicate the R1-Distill models by distilling a high-quality corpus from DeepSeek-R1.
  • Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will likely involve curating new, large-scale datasets for math, reasoning, and code.
  • Step 3: show we can go from base model to RL-tuned via multi-stage training.

News 🗞️

  • 🧑‍🍳 [2025/05/26] (Step 1 completed!) We release Mixture-of-Thoughts--a curated reasoning dataset of 350k verified traces distilled from R1. The dataset spans tasks in mathematics, coding, and science, and is designed to teach language models to reason step-by-step. We also provide a recipe to train OpenR1-Distill-7B, which replicates the reasoning capabilities of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B and marks the completion of step 1 in the Open R1 project.
  • ⚡️ [2025/03/11] (update #3): We release the CodeForces-CoTs dataset of 10k competitive programming problems and 100k solutions distilled from R1. We also release IOI24: a new benchmark of very hard problems from international olympiads. A 7B Qwen model trained on CodeForces-CoTs can outperform Claude 3.7 Sonnet on IOI24, while a 32B model can outperform R1 itself.
  • ∞ [2025/02/10] (update #2): We release the OpenR1-Math-220k dataset of 220k traces distilled from R1 on a new version of NuminaMath. Models trained on this dataset match the performance of DeepSeek's distilled ones.
  • 🔥 [2025/02/02] (update #1): We implement the first parts of the training, inference, and evaluation pipelines. Let's go!

Installation

[!CAUTION] Libraries rely on CUDA 12.4. If you see errors related to segmentation faults, double check the version your system is running with nvcc --version.
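The version check mentioned above can also be scripted. The following is a minimal sketch that assumes `nvcc --version` prints a line containing `release <major>.<minor>`; the helper names `parse_nvcc_release` and `installed_cuda_release` are illustrative and not part of this repo.

```python
import re
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Return the CUDA release (e.g. "12.4") found in `nvcc --version` output, if any."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_cuda_release() -> Optional[str]:
    """Run `nvcc --version`; return None if nvcc is missing or its output is unrecognized."""
    try:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    release = installed_cuda_release()
    if release != "12.4":
        print(f"Expected CUDA 12.4, found {release!r}")
```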

To run the code in this project, first, create a Python virtual environment using e.g. uv. To install uv, follow the UV Installation Guide.

[!NOTE] As a shortcut, run make install to set up development libraries (spelled out below). Afterwards, if everything is set up correctly, you can try out the Open-R1 models.

uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip

[!TIP] For Hugging Face cluster users, add export UV_LINK_MODE=copy to your .bashrc to suppress cache warnings from uv

Next, install vLLM and FlashAttention:

uv pip install vllm==0.8.5.post1
uv pip install setuptools && uv pip install flash-attn --no-build-isolation

This will also install PyTorch v2.6.0 and it is very important to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via pip install -e .[LIST OF MODES]. For most contributors, we recommend:

GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"
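Since the vLLM binaries depend on a specific PyTorch build, it can help to confirm the pins after installation. This is a rough sketch using the standard library's `importlib.metadata`; the `EXPECTED` table copies the versions named above, and the prefix comparison is an assumption made so that local build suffixes (for example a CUDA tag after the version) still count as a match.

```python
from importlib import metadata
from typing import Dict, Optional

# Pins taken from the installation steps above; update them if the project changes its pins.
EXPECTED = {"vllm": "0.8.5.post1", "torch": "2.6.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    installed: Dict[str, Optional[str]], expected: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Return the packages that are missing or whose version does not start with the pin."""
    return {
        name: installed.get(name)
        for name, pin in expected.items()
        if installed.get(name) is None or not installed[name].startswith(pin)
    }


if __name__ == "__main__":
    current = {name: installed_version(name) for name in EXPECTED}
    for name, found in find_mismatches(current, EXPECTED).items():
        print(f"{name}: expected {EXPECTED[name]}, found {found}")
```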

Next, log into your Hugging Face and Weights and Biases accounts as follows:

huggingface-cli login
wandb login

Finally, check whether your system has Git LFS installed so that you can load and push models/datasets to the Hugging Face Hub:

git-lfs --version

If it isn't installed, run:

sudo apt-get install git-lfs

Training models

[!NOTE] The training commands below are configured for a node of 8 x H100s (80GB). For different hardware and topologies, you may need to tune the batch size and number of gradient accumulation steps.

We support training models with either DDP or DeepSpeed (ZeRO-2 and ZeRO-3). For example, to perform SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as open-r1/Mixture-of-Thoughts, run:

# Train via command line
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path open-r1/Qwen2.5-Math-7B-RoPE-300k \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --eos_token '<|im_end|>' \
    --learning_rate 4.0e-5 \
    --num_train_epochs 5 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 2 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/OpenR1-Distill-7B

# Train via YAML config
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml

Currently, the following tasks are supported:

  • Supervised Fine-Tuning sft
  • Group Relative Policy Optimization grpo

[!TIP] If you change the number of GPUs, we recommend adjusting the per-device batch size or the number of gradient accumulation steps in the opposite direction to keep the global batch size constant: with fewer GPUs, scale them up; with more GPUs, scale them down.
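The tip above amounts to holding the product of GPU count, per-device batch size, and gradient accumulation steps fixed. Here is a small sketch of that arithmetic; the function names are illustrative, and the accumulation value of 8 in the usage note is an assumed example rather than a setting from the recipes.

```python
def global_batch_size(num_gpus: int, per_device_batch_size: int, grad_accum_steps: int) -> int:
    """Number of samples that contribute to a single optimizer step."""
    return num_gpus * per_device_batch_size * grad_accum_steps


def grad_accum_for_target(target_global_batch: int, num_gpus: int, per_device_batch_size: int) -> int:
    """Gradient accumulation steps needed to reach the target global batch size."""
    per_step = num_gpus * per_device_batch_size
    if target_global_batch % per_step != 0:
        raise ValueError(
            f"Target {target_global_batch} is not a multiple of {per_step} samples per step"
        )
    return target_global_batch // per_step
```

For instance, 8 GPUs with a per-device batch size of 2 and 8 accumulation steps give a global batch of 128. Moving to 4 GPUs at the same per-device batch size would need `grad_accum_for_target(128, 4, 2)`, which is 16 steps.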

By default, these scripts will push each model to your Hugging Face Hub username, i.e. {username}/{model_name}-{task}. You can override the parameters in each YAML config by appending them to the command as follows:

# Change the base model to a smaller variant
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml \
    --model_name_or_path Qwen/Qwen3-0.6B-Base \
    --hub_model_id OpenR1-Distill-0.6B \
    --output_dir data/OpenR1-Distill-0.6B
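Conceptually, values appended on the command line take precedence over those read from the YAML file. The sketch below illustrates that precedence with plain dictionaries. It is not how the training scripts parse arguments, and `merge_overrides` as well as the sample YAML values are hypothetical.

```python
from typing import Any, Dict


def merge_overrides(yaml_config: Dict[str, Any], cli_overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new config in which command-line values replace YAML values."""
    merged = dict(yaml_config)
    # Skip flags that were not passed (represented here as None).
    merged.update({key: value for key, value in cli_overrides.items() if value is not None})
    return merged


# Placeholder values standing in for the contents of a YAML recipe.
yaml_config = {
    "model_name_or_path": "placeholder/base-model",
    "hub_model_id": "placeholder-model-id",
    "learning_rate": 4.0e-5,
}
cli_overrides = {
    "model_name_or_path": "Qwen/Qwen3-0.6B-Base",
    "hub_model_id": "OpenR1-Distill-0.6B",
    "output_dir": "data/OpenR1-Distill-0.6B",
}
config = merge_overrides(yaml_config, cli_overrides)
```

Settings that are not overridden, such as `learning_rate` here, keep their YAML values.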

If you also wish to override the Weights and Biases default settings, you can do so as follows:

accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml \
    --wandb_entity huggingface --wandb_project open-r1 --run_name Qwen2.5-1.5B-GRPO

🚨 WARNING 🚨

Most base models like meta-llama/Llama-3.2-1B do not have a chat template, so we set ChatML as the default during training. However, for Qwen base models like Qwen/Qwen2.5-1.5B, a chat template is pre-defined in the tokenizer, so the EOS token must be set accordingly, e.g.

# Align EOS token with chat template for Qwen base models
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path Qwen/Qwen2.5-1.5B \
+   --eos_token '<|im_end|>' \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --learning_rate 4.0e-5 \
    --num_train_epochs 1 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill
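To see why the EOS token has to match the template, it helps to look at what a ChatML-formatted conversation ends with. The sketch below renders ChatML by hand rather than through a tokenizer, so `render_chatml` and `eos_closes_last_turn` are illustrative helpers, and `<|endoftext|>` is used only as an example of a mismatched token.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]]) -> str:
    """Render messages in ChatML: each turn is wrapped in <|im_start|> ... <|im_end|>."""
    return "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    )


def eos_closes_last_turn(rendered: str, eos_token: str) -> bool:
    """Check whether the rendered conversation ends with the given EOS token."""
    return rendered.rstrip("\n").endswith(eos_token)


conversation = render_chatml(
    [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ]
)
aligned = eos_closes_last_turn(conversation, "<|im_end|>")
mismatched = eos_closes_last_turn(conversation, "<|endoftext|>")
```

Here `aligned` is true and `mismatched` is false. With a mismatched token, the training text never ends a turn with the token that signals the end of generation, which is presumably why the flag is needed.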

If you wish to use a custom chat template (e.g. Llama or Gemma), then the chat template and associated EOS token must be provided:

# Align EOS token with custom chat template
accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path meta-llama/Llama-3.2-1B \
+   --chat_template "$(cat llama_chat_template.jinja)" \
+   --eos_token '<|eot_id|>' \
    --dataset_name open-r1/Mixture-of-Thoughts \
    --dataset_config all \
    --learning_rate 4.0e-5 \
    --num_train_epochs 1 \
    --max_seq_length 32768 \
    --per_device_train_batch_size 16 \
    --gradient_checkpointing \
    --bf16 \
    --use_liger_kernel \
    --output_dir data/Llama-3.2-1B-Open-R1-Distill

SFT distillation

We provide a recipe to reproduce the reasoning capabilities of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B, starting from the same base model. To do so, run:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    src/open_r1/sft.py \
    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml

The result will be a model like open-r1/OpenR1-Distill-7B, with the following downstream performance:

| Model | AIME 2024 | MATH-500 | GPQA Diamond | LiveCodeBench v5 |
| --- | --- | --- | --- | --- |
| OpenR1-Distill-7B | 52.7 | 89.0 | 52.8 | 39.4 |
| DeepSeek-R1-Distill-Qwen-7B | 51.3 | 93.5 | 52.4 | 37.4 |

You can adjust the YAML config to train on a different base model or dataset.

GRPO

We use TRL's vLLM backend to scale training to large models across multiple nodes. For single-node training of smol models across 8 GPUs, use vllm_mode="colocate" to run vLLM in the same process as the training script:

ACCELERATE_LOG_LEVEL=info \
    accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    src/open_r1/grpo.py --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml \
    --vllm_mode colocate

[!WARNING] The chat template used in the distilled DeepSeek models omits the contents of the reasoning block within the <think> and </think> tags. It also prefills the assistant response with <think> which interferes with the format reward function. To handle that, it is important to override the chat template as done in e.g. recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml.
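The interference with the format reward can be illustrated with a simplified check. The regular expression below is an assumed stand-in for a format reward, not necessarily the pattern used in this repo. It requires the completion itself to open with `<think>`.

```python
import re

# Assumed, simplified format: a think block followed by an answer block.
FORMAT_PATTERN = re.compile(
    r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$", re.DOTALL
)


def follows_format(completion: str) -> bool:
    """Return True if the completion has the expected think/answer layout."""
    return FORMAT_PATTERN.match(completion) is not None


full_completion = "<think>\n2 + 2 = 4\n</think>\n<answer>\n4\n</answer>"

# If the chat template prefills "<think>", that tag belongs to the prompt,
# so the text the model generates starts after it.
prefilled_completion = full_completion[len("<think>"):]

full_ok = follows_format(full_completion)
prefilled_ok = follows_format(prefilled_completion)
```

Under these assumptions `full_ok` is true while `prefilled_ok` is false: a well-formed generation would still score zero on format because the opening tag never appears in the completion.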

For multi-node training on N+1 nodes, with 1 node running the vLLM server and N nodes running training, we provide an example Slurm script. For example, to run the above example on 1+1 nodes with data parallelism, run:

sbatch --nodes=2 slurm/train.slurm --model Qwen2.5-1.5B-Instruct --task grpo --config demo --accelerator zero2 --dp 8 --tp 1

See the Launching jobs on a Slurm cluster section for more details.

GRPO dataset filtering

We provide support for filtering datasets by generating completions and computing their pass rate on verifiable tasks; see this README.
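As a rough illustration of pass-rate filtering, the sketch below keeps only prompts whose sampled completions are neither always correct nor always wrong. The thresholds and function names are assumptions for this example and do not reflect the actual filtering script.

```python
from typing import Dict, List


def pass_rate(rewards: List[float], threshold: float = 1.0) -> float:
    """Fraction of sampled completions whose reward reaches the threshold."""
    if not rewards:
        return 0.0
    return sum(reward >= threshold for reward in rewards) / len(rewards)


def filter_by_pass_rate(
    rewards_per_prompt: Dict[str, List[float]],
    min_rate: float = 0.1,
    max_rate: float = 0.9,
) -> List[str]:
    """Keep prompts whose pass rate lies within [min_rate, max_rate]."""
    return [
        prompt
        for prompt, rewards in rewards_per_prompt.items()
        if min_rate <= pass_rate(rewards) <= max_rate
    ]


samples = {
    "too_easy": [1.0, 1.0, 1.0, 1.0],
    "too_hard": [0.0, 0.0, 0.0, 0.0],
    "useful": [1.0, 0.0, 1.0, 0.0],
}
kept = filter_by_pass_rate(samples)
```

One motivation for such a filter is that GRPO computes advantages relative to the group of completions for a prompt, so a prompt where every completion receives the same reward contributes no learning signal.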

👨‍💻 Training with a code interpreter

We provide a code reward function for executing code generated by the policy during training. Currently, this reward function targets code contests like Codeforces, where solutions are executed against a set of test cases and the overall success rate is returned as the final reward. To ensure safe execution, we support multiple sandbox providers:

  1. E2B - Fast, cloud-based sandboxes with focus on Python execution
  2. Morph - Cloud-based sandboxes with broader language support - Python/JS/C++/Rust
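The success-rate reward described above can be sketched as follows. The `run_solution` callable is a stand-in for executing the policy's code inside a sandbox, and the `"output"` key on each test case is an assumption made for this example, so treat the names as hypothetical.

```python
from typing import Callable, Dict, List


def success_rate_reward(
    run_solution: Callable[[str], str], test_cases: List[Dict[str, str]]
) -> float:
    """Return the fraction of test cases for which the solution prints the expected output."""
    if not test_cases:
        return 0.0
    passed = 0
    for case in test_cases:
        try:
            actual = run_solution(case["input"])
        except Exception:
            # A crash counts as a failed test case.
            continue
        if actual.strip() == case["output"].strip():
            passed += 1
    return passed / len(test_cases)


def double_the_input(stdin: str) -> str:
    """Toy 'solution' standing in for sandboxed execution of generated code."""
    return str(int(stdin) * 2)


cases = [
    {"input": "2", "output": "4"},
    {"input": "3", "output": "6"},
    {"input": "x", "output": "0"},  # int("x") raises, so this case fails
]
reward = success_rate_reward(double_the_input, cases)
```

In this toy run two of three cases pass, so the reward is 2/3. In real training the callable would dispatch to one of the sandbox providers instead of running code in-process.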

To use the code reward function, first install the necessary dependencies:

uv pip install -e '.[code]'

E2B Provider

To use E2B sandboxes, create a .env file and add your E2B API token:

E2B_API_KEY="e2b_xxx"

Morph Provider

To use Morph, first install the morphcloud package:

pip install morphcloud

Then add your Morph API token to the .env file:

MORPH_API_KEY="YOUR_MORPH_API_KEY"

To specify which provider to use, add the provider_type parameter in your configuration:

# For E2B
provider_type: e2b

# For Morph
provider_type: morph

Dataset Requirements

Make sure your dataset contains a verification_info column with the following schema (adopted from PrimeIntellect's excellent datasets of verifiable problems):

```python
{
    "language": "python",  # Morph supports more languages including C++, Java, etc.
    "test_cases": [
        {
            "input": "4\n4\n0001\n1000\n0011\n0
```

Core symbols most depended-on inside this repo

  • get_repetition_penalty_reward (src/open_r1/rewards.py): called by 21
  • code_reward (src/open_r1/rewards.py): called by 9
  • get_code_format_reward (src/open_r1/rewards.py): called by 7
  • run_code (src/open_r1/utils/routed_sandbox.py): called by 7
  • get_dataset (src/open_r1/utils/data.py): called by 7
  • tag_count_reward (src/open_r1/rewards.py): called by 6
  • register_lighteval_task (src/open_r1/utils/evaluation.py): called by 6
  • deps_list (setup.py): called by 5

Shape

Method 114
Function 96
Class 32
Route 4

Languages

Python 100%

Modules by API surface

tests/test_rewards.py: 51 symbols
src/open_r1/rewards.py: 23 symbols
src/open_r1/utils/competitive_programming/piston_client.py: 16 symbols
src/open_r1/utils/competitive_programming/morph_client.py: 15 symbols
src/open_r1/utils/code_providers.py: 14 symbols
tests/slow/test_code_reward.py: 13 symbols
src/open_r1/utils/competitive_programming/ioi_scoring.py: 12 symbols
tests/utils/test_data.py: 9 symbols
scripts/morph_router.py: 9 symbols
scripts/e2b_router.py: 9 symbols
src/open_r1/utils/callbacks.py: 8 symbols
src/open_r1/configs.py: 7 symbols

For agents

$ claude mcp add open-r1 \
  -- python -m otcore.mcp_server <graph>
