github.com/Simple-Efficient/RL-Factory @v0.1.0 sqlite

2,902 symbols · 12,556 edges · 417 files · 815 documented (28%)

README

Description

RLFactory is an easy and efficient RL post-training framework for Agentic Learning.

RL-Factory decouples the environment from RL post-training, enabling training with just a tool config and reward function while supporting async tool-calling to make RL post-training 2x faster.

The current version natively supports one-click DeepSearch training and features multi-turn tool-calling, model-judge rewards, and training of multiple models including Qwen3. More modules for easy and efficient agentic learning will be added in upcoming releases.

Now, everyone can easily and quickly train an Agent model with Qwen3 (as the base model) and MCP tools!

Our Framework Design

Our goal is to enable users to focus on reward logic and tool setup for fast agentic learning with minimal code, while hardcore developers can focus on improving training efficiency and model performance.

For ease of use, we decouple the environment from RL-based post-training, which brings several advantages.
+ Easy-to-design reward function: Calculate rewards through rules, a model judge, or even tools to meet all your requirements for the reward function.
+ Seamless tool setup: Simply provide the configuration file for your MCP tools and custom tools to integrate them into RL learning.
+ Multi-Agent extension: Convert your agent to the MCP format for easy Multi-Agent interaction. LLM chat simulation will also be added in the future to improve multi-turn dialogue capabilities.
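To make the rule-based reward idea concrete, here is a minimal sketch for a search-style QA task. It is an illustration only: the function names, the `<answer>` tag convention, and the 0.1 format bonus are assumptions made for this example, not RL-Factory's actual interface (docs/rl_factory/main_tutorial.md describes the real one).

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def rule_based_reward(response: str, gold_answers: list[str]) -> float:
    """Hypothetical reward: 1.0 for an exact match inside <answer>...</answer>,
    0.1 for a well-formed but wrong answer, 0.0 if no answer tag is present."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0  # the model never produced a final answer
    predicted = normalize_answer(match.group(1))
    if any(predicted == normalize_answer(gold) for gold in gold_answers):
        return 1.0
    return 0.1  # small bonus for following the output format


if __name__ == "__main__":
    reply = "I searched twice. <answer>The Eiffel Tower.</answer>"
    print(rule_based_reward(reply, ["Eiffel Tower"]))
```

A model-judge or tool-based reward would keep the same shape (response in, scalar out) and replace the exact-match check with a call to the judge or tool.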

For efficient learning, we develop several essential modules within the RL post-training framework, making training 2x faster.
+ Efficient tool-call: Improve online RL training efficiency through batch processing and asynchronous parallel tool calls.
+ Efficient reward calculation: Deploy an LRM (like QwQ-32B) in a distributed manner for efficient model judging, and use asynchronous parallelism to speed up reward calculation.
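The benefit of asynchronous parallel tool calls can be sketched with plain `asyncio`. This is not RL-Factory's implementation: `call_tool` is a hypothetical stand-in that only sleeps to simulate I/O latency, and `run_batch` and `max_concurrency` are names chosen for this example. The point is that a batch of I/O-bound calls finishes in roughly the time of the slowest call rather than the sum of all calls.

```python
import asyncio
import time


async def call_tool(name: str, query: str, latency: float = 0.1) -> str:
    """Hypothetical tool call; the sleep stands in for network or disk I/O."""
    await asyncio.sleep(latency)
    return f"{name}({query})"


async def run_batch(requests: list[tuple[str, str]], max_concurrency: int = 8) -> list[str]:
    """Issue all tool calls concurrently, capped at max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(name: str, query: str) -> str:
        async with semaphore:
            return await call_tool(name, query)

    # gather returns results in the same order as the requests
    return await asyncio.gather(*(guarded(n, q) for n, q in requests))


def demo() -> tuple[list[str], float]:
    """Run 8 simulated calls and report results plus wall-clock time."""
    requests = [("search", f"q{i}") for i in range(8)]
    start = time.perf_counter()
    results = asyncio.run(run_batch(requests))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = demo()
    print(f"{len(results)} calls finished in {elapsed:.2f} s")
```

Run sequentially, the eight simulated calls would take about 0.8 s; run concurrently they complete in close to the latency of a single call. The semaphore is one simple way to avoid overwhelming a tool server during large rollouts.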

For future progression, we will continue to prioritize "easy" and "efficient".
+ Easier: Use WebUI to process data, define tool & environment, adjust training configuration, and manage project. (The WebUI is under rapid development.)
+ More efficient: Continuously iterating and improving the training framework (such as AsyncLLMEngine) and RL training algorithms.

Release Log

We’ll keep a fast release cycle to quickly deliver and polish the upcoming features.
+ Version 0.1
  + Environment decoupling: define your tool-use environment easily (tool setup and reward function definition)
  + Qwen3 model support: quickly train your agent model using Qwen3 (much better than Qwen2.5 at tool calling)
  + Efficient training: 2x faster than existing frameworks for rapid model iteration (mainly through async tool use)
+ Version 0.2 (within 2 weeks)
  - WebUI: build a WebUI for data processing, tool & environment definition, training configuration, and project management
  - More efficient training: support the AsyncLLMEngine for more efficient rollout
  - More models: test more models (such as Deepseek, Llama, etc.) and add corresponding support configurations
  - More applications: help create more demos (such as TravelPlanner) to adapt to more benchmarks

User Instructions

Dependencies (Key)

    CUDA: >=12.0 (Recommended: 12.4)
    Python: >=3.10 (Recommended: 3.10)
    # For Qwen3 model support
    vllm: >=0.8.3 (Recommended: 0.8.5)

Install Requirements

    # Specifiers containing > or [ ] are quoted so the shell does not treat them as redirection or globs
    pip3 install accelerate bitsandbytes datasets deepspeed==0.16.4 einops flash-attn==2.7.0.post2 isort jsonlines loralib optimum packaging peft "pynvml>=12.0.0" "ray[default]==2.42.0" tensorboard torch torchmetrics tqdm transformers==4.48.3 transformers_stream_generator wandb wheel
    pip3 install vllm==0.8.5      # Mainly for Qwen3 model support
    pip3 install "qwen-agent[code_interpreter]"
    pip3 install llama_index bs4 pymilvus infinity_client codetiming tensordict==0.6 omegaconf torchdata==0.10.0 hydra-core easydict dill python-multipart mcp
    pip3 install -e . --no-deps
    pip3 install faiss-gpu-cu12   # Optional, needed for end-to-end search model training with rag_server

Note: Currently, only Qwen models are tested.
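Before launching a long training run, it can help to confirm that the environment meets the version floors listed under Dependencies. The following is a small sketch using only the standard library; the helper names are made up for this example, and the simplified version parsing ignores pre-release and post-release suffixes (the `packaging` library handles those properly).

```python
import sys
from importlib import metadata
from itertools import takewhile


def parse_version(version: str) -> tuple[int, ...]:
    """Keep the leading numeric components, e.g. '2.7.0.post2' -> (2, 7, 0)."""
    parts = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings by their numeric components."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment(min_python=(3, 10), min_vllm="0.8.3") -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(f"Python {sys.version_info[0]}.{sys.version_info[1]} is too old")
    try:
        vllm_version = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        problems.append("vllm is not installed")
    else:
        if not meets_minimum(vllm_version, min_vllm):
            problems.append(f"vllm {vllm_version} is older than {min_vllm}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

The CUDA requirement is not checked here, since detecting it reliably depends on how the driver and toolkit were installed.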

What do you need to provide?
An environment is enough! See the minimal tutorial in docs/rl_factory/main_tutorial.md
Training Command

    # Before running, modify MODEL_PATH, REWARD_MODEL_PATH, and several actor_rollout_ref.env parameters as needed
    bash main_grpo.sh

Demo in DeepSearch Training

+ In docs/rl_factory/main_tutorial.md, we provide an RLFactory reproduction example of Search-R1. We use Qwen3-4B and Qwen3-8B as the base models for RL training.
+ Easy: Start with Qwen3 and MCP tools to quickly train your own DeepSearch Agent Model.
  + Provide only one tool configuration and one reward function to start training!
  + Qwen3 demonstrates significant advantages in Agent Learning. It can accurately call tools even without SFT, and it also supports the MCP protocol.
+ Efficient: Enjoy the efficient training enabled by asynchronous parallel tool calls.
  + Compared to Search-R1 based on the original verl, training is 1.5 to 2 times faster, and the efficiency gain is even greater if a model judge is involved.
  + After 100 steps of training (about 5 hours on 8 × A100), Qwen3-4B achieves a score of 0.458 and Qwen3-8B achieves a score of 0.463.
+ The table below presents our training results under identical computational resources, software, and verl versions.
  + RLFactory trains markedly faster than Search-R1: with the same Qwen3-4B base model, 100 steps take 5.30 h instead of 7.95 h (about 1.5 times faster).
  + Qwen3 as the base model outperforms Qwen2.5, enabling domain-specific tool-calling via RL post-training without SFT.

Model Name	Test Score (NQ)	Total Training Time (100 steps)	Seconds per Step	Training Resources
Search-R1-Qwen2.5-3B-Instruct-GRPO	0.356	7.39 h	266 s	A100 × 8
Search-R1-Qwen2.5-7B-Instruct-GRPO	0.451	9.25 h	333 s	A100 × 8
Search-R1-Qwen3-4B-GRPO	0.420	7.95 h	286 s	A100 × 8
RLFactory-Qwen3-4B-GRPO	0.458	5.30 h	190 s	A100 × 8
RLFactory-Qwen3-8B-GRPO	0.463	5.76 h	207 s	A100 × 8
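The per-step times and relative speedups in the table follow directly from the total training times. The short sketch below reproduces that arithmetic; the hours are copied from the table, while the helper names are just for this example.

```python
# Total training time in hours for 100 steps, copied from the table above.
TRAINING_HOURS = {
    "Search-R1-Qwen2.5-3B-Instruct-GRPO": 7.39,
    "Search-R1-Qwen2.5-7B-Instruct-GRPO": 9.25,
    "Search-R1-Qwen3-4B-GRPO": 7.95,
    "RLFactory-Qwen3-4B-GRPO": 5.30,
    "RLFactory-Qwen3-8B-GRPO": 5.76,
}
STEPS = 100


def seconds_per_step(model: str) -> float:
    """Convert total hours into average seconds per training step."""
    return TRAINING_HOURS[model] * 3600 / STEPS


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster the candidate is than the baseline."""
    return TRAINING_HOURS[baseline] / TRAINING_HOURS[candidate]


if __name__ == "__main__":
    for model in TRAINING_HOURS:
        print(f"{model}: {seconds_per_step(model):.0f} s/step")
    same_model = speedup("Search-R1-Qwen3-4B-GRPO", "RLFactory-Qwen3-4B-GRPO")
    print(f"Qwen3-4B, RLFactory vs. Search-R1: {same_model:.2f}x")
```

The like-for-like comparison is the Qwen3-4B pair, since both rows use the same base model; the other rows also differ in model family and size.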

How to contribute?

We welcome all users and developers to contribute code to RLFactory. If you have any questions, encounter bugs, or would like to collaborate on development, please feel free to contact us!

Submit an issue directly on GitHub
Contact us via email at chaijiajun@meituan.com
Join our WeChat group and become a pioneer in Agent model training!

Acknowledgement

This repo benefits from verl, Search-R1, and Qwen-Agent. Thanks for their wonderful work. We will also introduce TRL in the future to further expand the applicability of our framework.

Core symbols most depended-on inside this repo

get

called by 321

verl/utils/memory_buffer.py

verl/utils/debug/performance.py

verl/trainer/ppo/core_algos.py

print_rank_0

called by 84

verl/utils/megatron_utils.py

named_parameters

called by 76

verl/utils/memory_buffer.py

pop

called by 65

verl/protocol.py

Shape

Method 1,366

Function 1,123

Class 347

Route 66

Languages

Python 100%

Modules by API surface

verl/workers/fsdp_workers.py · 51 symbols

verl/single_controller/ray/base.py · 48 symbols

verl/protocol.py · 46 symbols

verl/workers/megatron_workers.py · 40 symbols

verl/models/qwen2/megatron/modeling_qwen2_megatron.py · 37 symbols

verl/utils/torch_functional.py · 36 symbols

verl/single_controller/base/decorator.py · 36 symbols

verl/models/llama/megatron/modeling_llama_megatron.py · 35 symbols

verl/third_party/vllm/vllm_v_0_3_1/config.py · 32 symbols

verl/third_party/vllm/vllm_v_0_3_1/llm_engine_sp.py · 31 symbols

verl/workers/sharding_manager/megatron_vllm.py · 30 symbols

scripts/model_merger.py · 30 symbols

Dependencies from manifests, versioned

fastapi 0.109.0 · 1×

gradio 4.19.2 · 1×

packaging 20.0 · 1×

pyarrow 19.0.0 · 1×

pydantic 2.6.1 · 1×

python-multipart 0.0.9 · 1×

tensordict 0.6.2 · 1×

tokenizers 0.19.1 · 1×

torch-memory-saver 0.0.5 · 1×

uvicorn 0.27.0 · 1×

For agents

$ claude mcp add RL-Factory \
  -- python -m otcore.mcp_server <graph>
