📘Tutorial | 🛠️Installation | 🎨Framework
RLFactory is an easy and efficient RL post-training framework for Agentic Learning.
RL-Factory decouples the environment from RL post-training, enabling training with just a tool config and reward function while supporting async tool-calling to make RL post-training 2x faster.
The current version natively supports one-click DeepSearch training and features multi-turn tool-calling, model-judge rewards, and training of multiple models, including Qwen3. More modules for easy and efficient agentic learning will be added in upcoming releases.
Now, everyone can easily and quickly train an Agent model with Qwen3 (as base models) and MCP tools!
Our goal is to let users focus on reward logic and tool setup for fast agentic learning with minimal code, while hardcore developers can focus on improving training efficiency and model performance.
For ease of use, we decouple the environment from RL-based post-training, which brings several advantages.
+ Easy-to-design reward function: Calculate rewards through rules, a model judge, or even tools to meet all your reward function requirements.
+ Seamless tool setup: Simply provide the configuration file for your MCP tools and custom tools to integrate them into RL learning.
+ Multi-Agent extension: Convert your agent to the MCP format for easy Multi-Agent interaction. LLM chat simulation will also be added in the future to improve multi-turn dialogue capabilities.
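As a rough illustration of the rule-based reward idea above, here is a minimal sketch. The function name `compute_score`, its arguments, the `<answer>` tag convention, and the 0.1 format bonus are all hypothetical and are not taken from RLFactory's actual interface; see the tutorial for the real reward hook.

```python
import re

# Hypothetical rule-based reward; this is NOT RLFactory's actual API.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def compute_score(response: str, ground_truth: str) -> float:
    """Return 1.0 for a normalized exact match, 0.1 for a well-formed but
    wrong answer, and 0.0 when no <answer> tag is present."""
    matches = ANSWER_PATTERN.findall(response)
    if not matches:
        return 0.0
    predicted = normalize(matches[-1])  # score only the final answer
    return 1.0 if predicted == normalize(ground_truth) else 0.1
```

A model-judge or tool-based reward would replace the string comparison with a call to the judge or tool while keeping the same "response in, scalar out" shape.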
For efficient learning, we developed several essential modules within the RL post-training framework, making training 2x faster.
+ Efficient tool-call: Improve online RL training efficiency through batch processing and asynchronous parallel tool calls.
+ Efficient reward calculation: Deploy an LRM (such as QwQ-32B) in a distributed manner for efficient model judging, and use asynchronous parallelism to speed up reward calculation.
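To show why asynchronous parallel tool calls help, here is a minimal sketch using Python's standard `asyncio`. The `call_tool` coroutine is a stand-in that only sleeps to simulate I/O latency; it, `call_tools_concurrently`, and `run_batch` are hypothetical names and do not reflect RLFactory's internal implementation.

```python
import asyncio
import time


async def call_tool(name: str, query: str, delay: float = 0.1) -> str:
    """Stand-in for a real tool call; asyncio.sleep simulates I/O latency."""
    await asyncio.sleep(delay)
    return f"{name}:{query}"


async def call_tools_concurrently(requests: list[tuple[str, str]]) -> list[str]:
    """Start every tool call at once; results keep the order of `requests`."""
    tasks = [call_tool(name, query) for name, query in requests]
    return await asyncio.gather(*tasks)


def run_batch(requests: list[tuple[str, str]]) -> tuple[list[str], float]:
    """Run a batch of tool calls and report the wall-clock time taken."""
    start = time.perf_counter()
    results = asyncio.run(call_tools_concurrently(requests))
    return results, time.perf_counter() - start
```

Because the calls overlap while waiting on I/O, a batch takes roughly as long as its slowest call rather than the sum of all calls.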
For future progression, we will continue to prioritize "easy" and "efficient".
+ Easier: Use a WebUI to process data, define tools and environments, adjust training configuration, and manage projects. (The WebUI is under rapid development.)
+ More efficient: Continuously iterate on and improve the training framework (such as AsyncLLMEngine) and RL training algorithms.
We’ll keep a fast release cycle to quickly deliver and polish upcoming features.
+ Version 0.1
  + Environment decoupling: define your tool-use environment easily (tool setup and reward function definition)
  + Qwen3 model support: quickly train your agent model using Qwen3 (much better than Qwen2.5 at tool calling)
  + Efficient training: 2x faster than existing frameworks for rapid model iteration (mainly through async tool use)
+ Version 0.2 (within 2 weeks)
  - WebUI: build a WebUI for data processing, tool and environment definition, training configuration, and project management
  - More efficient training: support the AsyncLLMEngine for more efficient rollout
  - More models: test more models (such as DeepSeek, Llama, etc.) and add corresponding support configurations
  - More applications: help create more demos (such as TravelPlanner) to adapt to more benchmarks
```yaml
CUDA: >=12.0 (Recommended: 12.4)
Python: >=3.10 (Recommended: 3.10)
# For Qwen3 model support
vllm: >=0.8.3 (Recommended: 0.8.5)
```
```bash
# Version specifiers containing > or [] are quoted so the shell does not treat them as redirection or globs
pip3 install accelerate bitsandbytes datasets deepspeed==0.16.4 einops flash-attn==2.7.0.post2 isort jsonlines loralib optimum packaging peft "pynvml>=12.0.0" "ray[default]==2.42.0" tensorboard torch torchmetrics tqdm transformers==4.48.3 transformers_stream_generator wandb wheel
pip3 install vllm==0.8.5 # Mainly for Qwen3 model support
pip3 install "qwen-agent[code_interpreter]"
pip3 install llama_index bs4 pymilvus infinity_client codetiming tensordict==0.6 omegaconf torchdata==0.10.0 hydra-core easydict dill python-multipart mcp
pip3 install -e . --no-deps
pip3 install faiss-gpu-cu12 # Optional, needed for end-to-end search model training with rag_server
```

Note: Currently, only Qwen models are tested.
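The version requirements listed above can be checked before launching training. The following is a minimal sketch using only the standard library; `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helpers (not part of RLFactory), and the simplified parser only compares leading numeric components rather than implementing full PEP 440 ordering.

```python
import re
import sys
from importlib import metadata

# Minimum versions copied from the requirements listed above.
MINIMUMS = {"vllm": "0.8.3"}


def parse_version(version: str) -> tuple[int, ...]:
    """Keep leading numeric components, e.g. '2.7.0.post2' -> (2, 7, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
        if match.group() != piece:  # suffix such as '5+cu124' ends the numeric part
            break
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    return parse_version(installed) >= parse_version(minimum)


def check_environment() -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append(f"Python {sys.version.split()[0]} is older than 3.10")
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{package} {installed} is older than {minimum}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

The CUDA version is not checked here because it is not exposed through Python package metadata.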
Tutorial: `docs/rl_factory/main_tutorial.md`

```bash
# Before running, modify MODEL_PATH, REWARD_MODEL_PATH, and several actor_rollout_ref.env parameters as needed
bash main_grpo.sh
```

In `docs/rl_factory/main_tutorial.md`, we provide an RLFactory reproduction example of Search-R1. We use Qwen3-4B and Qwen3-8B as the base models for RL training. Qwen3 demonstrates significant advantages in agent learning: it can accurately call tools even without SFT, and it also supports the MCP protocol.
Efficient: Enjoy the efficient training enabled by asynchronous parallel tool-call.
Qwen3-4B achieves a score of 0.458 and Qwen3-8B achieves a score of 0.463.

| Model Name | Test Score (NQ) | Total Training Time (100 steps) | Seconds per Step | Training Resources |
|---|---|---|---|---|
| Search-R1-Qwen2.5-3B-Instruct-GRPO | 0.356 | 7.39 h | 266 s | A100 × 8 |
| Search-R1-Qwen2.5-7B-Instruct-GRPO | 0.451 | 9.25 h | 333 s | A100 × 8 |
| Search-R1-Qwen3-4B-GRPO | 0.420 | 7.95 h | 286 s | A100 × 8 |
| RLFactory-Qwen3-4B-GRPO | 0.458 | 5.30 h | 190 s | A100 × 8 |
| RLFactory-Qwen3-8B-GRPO | 0.463 | 5.76 h | 207 s | A100 × 8 |
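
The relationship between the "seconds per step" and "total training time" columns, and the relative per-step speed for the same Qwen3-4B base model, can be reproduced with a short calculation. This is only a sketch over the numbers in the table above; the helper names are hypothetical.

```python
# Seconds per step, copied from the table above (both rows use Qwen3-4B).
SECONDS_PER_STEP = {
    "Search-R1-Qwen3-4B-GRPO": 286,
    "RLFactory-Qwen3-4B-GRPO": 190,
}
STEPS = 100


def hours_for(seconds_per_step: float, steps: int = STEPS) -> float:
    """Convert a per-step time into total hours spent stepping."""
    return seconds_per_step * steps / 3600


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Ratio of per-step times; values above 1 mean the new run is faster."""
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    for name, seconds in SECONDS_PER_STEP.items():
        print(f"{name}: {hours_for(seconds):.2f} h for {STEPS} steps")
    ratio = speedup(
        SECONDS_PER_STEP["Search-R1-Qwen3-4B-GRPO"],
        SECONDS_PER_STEP["RLFactory-Qwen3-4B-GRPO"],
    )
    print(f"Per-step speedup on Qwen3-4B: {ratio:.2f}x")
```

For these two rows the per-step ratio is about 1.5; the hours computed this way can differ slightly from the table's totals because of rounding in the reported per-step times.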
We welcome all users and developers to contribute code to RLFactory. If you have any questions, encounter bugs, or would like to collaborate on development, please feel free to contact us!
This repo benefits from verl, Search-R1, and Qwen-Agent. Thanks for their wonderful work. We will also introduce TRL in the future to further expand the applicability of our framework.
$ claude mcp add RL-Factory \
-- python -m otcore.mcp_server <graph>