hub / github.com/Jiayi-Pan/TinyZero

github.com/Jiayi-Pan/TinyZero @main sqlite

1,496 symbols 6,093 edges 214 files 361 documented · 24%

README

TinyZero

⚠️ Deprecation Notice: This repo is no longer actively maintained. For running RL experiments, please directly use the latest veRL library. For the archived original documentation, see OLD_README.md.

TinyZero is a reproduction of DeepSeek R1 Zero in countdown and multiplication tasks. We built upon veRL.

Through RL, the 3B base LM develops self-verification and search abilities all on its own.

You can experience the Aha moment yourself for < $30.

Twitter thread: https://x.com/jiayi_pirate/status/1882839370505621655

Full experiment log: https://wandb.ai/jiayipan/TinyZero

📢: We release Adaptive Parallel Reasoning, where we explore a new dimension in scaling reasoning models.

Installation

conda create -n zero python=3.9
# install torch [or you can skip this step and let vllm install the correct version for you]
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# install vllm
pip3 install vllm==0.6.3 # or you can install 0.5.4, 0.4.2 and 0.3.1
pip3 install ray

# verl
pip install -e .

# flash attention 2
pip3 install flash-attn --no-build-isolation
# quality of life
pip install wandb IPython matplotlib

Countdown task

Data Preparation

conda activate zero
python ./examples/data_preprocess/countdown.py --local_dir {path_to_your_dataset}

Run Training

conda activate zero

For the following code, if you see out-of-VRAM, try adding critic.model.enable_gradient_checkpointing=True to the script, and check out the discussion here.

Single GPU

Works for model <= 1.5B. For Qwen2.5-0.5B base, we know it fails to learn reasoning.

export N_GPUS=1
export BASE_MODEL={path_to_your_model}
export DATA_DIR={path_to_your_dataset}
export ROLLOUT_TP_SIZE=1
export EXPERIMENT_NAME=countdown-qwen2.5-0.5b
export VLLM_ATTENTION_BACKEND=XFORMERS

bash ./scripts/train_tiny_zero.sh

3B+ model In this case, the base model is able to develop sophisticated reasoning skills.

export N_GPUS=2
export BASE_MODEL={path_to_your_model}
export DATA_DIR={path_to_your_dataset}
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS

bash ./scripts/train_tiny_zero.sh

Instruct Ablation

We experiment with Qwen-2.5-3B Instruct too. Data Preparation To follow chat template, we need to reprocess the data:

conda activate zero
python examples/data_preprocess/countdown.py --template_type=qwen-instruct --local_dir={path_to_your_dataset}

Training

export N_GPUS=2
export BASE_MODEL={path_to_your_model}
export DATA_DIR={path_to_your_dataset}
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b-instruct
export VLLM_ATTENTION_BACKEND=XFORMERS

bash ./scripts/train_tiny_zero.sh

Acknowledgements

We run our experiments based on veRL.
We use Qwen2.5 series base model Qwen2.5.

Citation

@misc{tinyzero,
author       = {Jiayi Pan and Junjie Zhang and Xingyao Wang and Lifan Yuan and Hao Peng and Alane Suhr},
title        = {TinyZero},
howpublished = {https://github.com/Jiayi-Pan/TinyZero},
note         = {Accessed: 2025-01-24},
year         = {2025}
}

Core symbols most depended-on inside this repo

get

called by 147

verl/utils/memory_buffer.py

verl/utils/memory_buffer.py

log_gpu_memory_usage

called by 45

verl/utils/debug/performance.py

update

called by 43

verl/trainer/ppo/core_algos.py

tests/e2e/envs/digit_completion/tokenizer.py

copy_local_path_from_hdfs

called by 22

verl/utils/fs.py

Shape

Method 765

Function 528

Class 173

Route 30

Languages

Python100%

Modules by API surface

verl/workers/megatron_workers.py38 symbols

verl/single_controller/ray/base.py38 symbols

verl/protocol.py38 symbols

verl/models/llama/megatron/modeling_llama_megatron.py35 symbols

verl/workers/fsdp_workers.py32 symbols

verl/third_party/vllm/vllm_v_0_3_1/config.py32 symbols

verl/workers/sharding_manager/megatron_vllm.py31 symbols

verl/third_party/vllm/vllm_v_0_3_1/llm_engine_sp.py31 symbols

verl/utils/torch_functional.py30 symbols

verl/single_controller/base/decorator.py29 symbols

verl/third_party/vllm/vllm_v_0_6_3/spmd_gpu_executor.py25 symbols

verl/third_party/vllm/vllm_v_0_5_4/spmd_gpu_executor.py25 symbols

Dependencies from manifests, versioned

accelerate1×

codetiming1×

datasets1×

dill1×

hydra-core1×

numpy1×

pybind111×

ray1×

tensordict1×

vllm0.6.3 · 1×

For agents

$ claude mcp add TinyZero \
  -- python -m otcore.mcp_server <graph>

⬇ download graph artifact