github.com/OpenManus/OpenManus-RL @main sqlite

1,653 symbols 5,305 edges 190 files 459 documented · 28%

README

OpenManus-RL

OpenManus-RL is an open-source initiative collaboratively led by Ulab-UIUC and MetaGPT.

This project is an extended version of the original @OpenManus initiative. Inspired by successful RL tuning of reasoning LLMs such as DeepSeek-R1 and QwQ-32B, we will explore new paradigms for RL-based LLM agent tuning, particularly building upon these foundations.

We are committed to regularly updating our exploration directions and results in a dynamic, live-streaming fashion. All progress, including rigorous testing on agent benchmarks such as GAIA, AgentBench, WebShop, and OSWorld, and tuned models, will be openly shared and continuously updated.

We warmly welcome contributions from the broader community—join us in pushing the boundaries of agent reasoning and tool integration!

Code and dataset are now available! The verl submodule has been integrated for enhanced RL training capabilities.

<img src="https://github.com/OpenManus/OpenManus-RL/raw/main/assets/manus.jpg" style="width: 100%;" alt="marble">

📖 Table of Contents

OpenManus-RL
🔔 News
Current Team Members
How to Contribute
Roadmap
Method
Dataset
- Dataset Overview
- Data Instances
Running
Related Work
Agent tuning
Tool using
Agent tuning instruction dataset
RL tuning
Benchmark
Similar Code
Acknowledgement
Community Group
Citation
Documentation

🔔 News

[2025-03-09] 🍺 We collected and open-sourced our Agent SFT dataset on Hugging Face. Go try it!
[2025-03-08] 🎉 We are collaborating with @OpenManus from MetaGPT to work on this project together!
[2025-03-06] 🥳 We (UIUC-Ulab) are announcing our live-streaming project, OpenManus-RL.

Current Team Members

@Kunlun Zhu (Ulab-UIUC), @Muxin Tian, @Zijia Liu (Ulab-UIUC), @Yingxuan Yang, @Jiayi Zhang (MetaGPT), @Xinbing Liang, @Weijia Zhang, @Haofei Yu (Ulab-UIUC), @Cheng Qian, @Bowen Jin

How to Contribute

We wholeheartedly welcome suggestions, feedback, and contributions from the community! Feel free to:

Contribute to the fine-tuning codebase, tuning datasets, environment setup, or computing resources.
Create issues for feature requests, bug reports, or ideas.
Submit pull requests to help improve OpenManus-RL.
Reach out to us for direct collaboration.

Important contributors will be listed as co-authors on our paper.

Roadmap

Agent Environment Support: Set up LLM agent environments for online RL tuning.
Agent Trajectories Data Collection: Connect to specialized reasoning models such as DeepSeek-R1 and QwQ-32B for more complex inference tasks to collect comprehensive agent trajectories.
RL-Tuning Model Paradigm: Provide an RL fine-tuning approach for customizing the agent's behavior in our agent environment.
Test on Agent Benchmarks: Evaluate our framework on agentic benchmarks such as WebShop, GAIA, OSWorld, and AgentBench.

Method

Our method proposes an advanced reinforcement learning (RL)-based agent tuning framework designed to significantly enhance reasoning and decision-making capabilities of large language models (LLMs). Drawing inspiration from RAGEN's Reasoning-Interaction Chain Optimization (RICO), our approach further explores novel algorithmic structures, diverse reasoning paradigms, sophisticated reward strategies, and extensive benchmark environments.

Reasoning Models Exploration

To benchmark the reasoning capabilities effectively, we evaluate multiple state-of-the-art reasoning models:

- GPT-O1
- Deepseek-R1
- QwQ-32B

Each model provides unique reasoning capabilities that inform downstream optimization and training strategies.

Alternative Rollout Strategies

We experiment with a variety of rollout strategies to enhance agent planning efficiency and reasoning robustness, including:

Tree-of-Thoughts (ToT): Employs tree-based reasoning paths, enabling agents to explore branching possibilities systematically.
Graph-of-Thoughts (GoT): Utilizes graph structures to represent complex reasoning dependencies effectively.
DFSDT (Depth-First Search Decision Trees): Optimizes action selection through depth-first search, enhancing long-horizon planning.
Monte Carlo Tree Search (MCTS): Explores reasoning and decision paths probabilistically, balancing exploration and exploitation effectively.

These methods help identify optimal rollout techniques for various reasoning tasks.
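The exploration/exploitation balance mentioned for MCTS is commonly implemented with the UCB1 selection rule. The sketch below is a generic illustration, not code from this repository: the function name `ucb1_select`, the `(visit_count, total_value)` statistics layout, and the exploration constant are all assumptions chosen for the example.

```python
import math


def ucb1_select(children, exploration=1.4):
    """Pick the action with the highest UCB1 score.

    `children` maps an action name to (visit_count, total_value).
    This layout is hypothetical and only serves the illustration.
    """
    total_visits = sum(visits for visits, _ in children.values())

    def score(stats):
        visits, total_value = stats
        if visits == 0:
            return math.inf  # unvisited actions are always tried first
        mean_value = total_value / visits  # exploitation term
        bonus = exploration * math.sqrt(math.log(total_visits) / visits)  # exploration term
        return mean_value + bonus

    return max(children, key=lambda action: score(children[action]))


# One unvisited action: it is selected regardless of the other statistics.
print(ucb1_select({"search": (10, 6.0), "click": (2, 1.5), "buy": (0, 0.0)}))
```

Setting `exploration=0` reduces the rule to picking the action with the best mean value so far, which makes the effect of the bonus term easy to inspect.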

Diverse Reasoning Formats

We specifically analyze and compare several reasoning output formats, notably:

ReAct: Integrates reasoning and action explicitly, encouraging structured decision-making.
Outcome-based Reasoning: Optimizes toward explicit outcome predictions, driving focused goal alignment.

These formats are rigorously compared to derive the most effective reasoning representation for various tasks.

Post-Training Strategies

We investigate multiple post-training methodologies to fine-tune agent reasoning effectively:

Supervised Fine-Tuning (SFT): Initializes reasoning capabilities using human-annotated instructions.
Group Relative Policy Optimization (GRPO): Incorporates:
- Format-based Rewards: Rewards adherence to specified reasoning structures.
- Outcome-based Rewards: Rewards accurate task completion and goal attainment.
Proximal Policy Optimization (PPO): Enhances agent stability through proximal updates.
Direct Preference Optimization (DPO): Leverages explicit human preferences to optimize agent outputs directly.
Preference-based Reward Modeling (PRM): Uses learned reward functions derived from human preference data.
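As a rough illustration of how format-based and outcome-based rewards could feed a GRPO-style update, the sketch below combines the two signals into one scalar and then normalizes rewards within a group of rollouts sampled for the same task. The weighting scheme, the helper names (`combined_reward`, `group_relative_advantages`), and the use of the population standard deviation are assumptions made for this example; the repository's actual reward design may differ.

```python
from statistics import mean, pstdev


def combined_reward(follows_format, task_succeeded, format_weight=0.2):
    """Blend a format-based and an outcome-based reward (weights are hypothetical)."""
    return format_weight * float(follows_format) + (1.0 - format_weight) * float(task_succeeded)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations vary on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for the same prompt: (follows_format, task_succeeded)
rollouts = [(True, True), (True, False), (False, False), (True, True)]
rewards = [combined_reward(f, s) for f, s in rollouts]
advantages = group_relative_advantages(rewards)

for outcome, adv in zip(rollouts, advantages):
    print(outcome, round(adv, 3))
```

Rollouts that beat the group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to form a baseline.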

Training of Agent Reward Model

We train specialized agent reward models using annotated data to accurately quantify nuanced reward signals. These models are then leveraged to guide agent trajectory selection during both training and evaluation phases.
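One simple way a reward model can guide trajectory selection is best-of-N: score every candidate trajectory and keep the highest-scoring one. The sketch below assumes a trajectory is a list of `{"role", "content"}` turns like the dataset instances in this README; `select_best_trajectory` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for a learned reward model.

```python
def select_best_trajectory(candidates, score_fn):
    """Return the candidate trajectory with the highest score."""
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(candidates, key=score_fn)


def toy_score(trajectory):
    """Stand-in for a learned reward model (purely illustrative heuristic).

    Rewards trajectories whose last turn commits to an answer and
    applies a small penalty per turn to favor shorter solutions.
    """
    final_turn = trajectory[-1]["content"]
    finished = 1.0 if "answer(" in final_turn else 0.0
    return finished - 0.01 * len(trajectory)


candidates = [
    [{"role": "assistant", "content": "Think: unsure\nAct: bash"}],
    [
        {"role": "assistant", "content": "Think: count files\nAct: bash"},
        {"role": "assistant", "content": "Think: done\nAct: answer(220)"},
    ],
]
best = select_best_trajectory(candidates, toy_score)
print(best[-1]["content"])
```

In practice `score_fn` would wrap a call to the trained reward model; keeping it as a plain callable makes the selection logic easy to test in isolation.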

Test-time Scaling of Trajectories

During the inference phase, trajectory scaling methods are implemented, allowing agents to flexibly adapt to varying task complexities, thus enhancing robustness and performance in real-world scenarios.

Action Space Awareness and Strategic Exploration

Agents are equipped with action-space awareness, employing systematic exploration strategies designed to navigate complex action spaces effectively, ultimately maximizing expected rewards.
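A minimal sketch of what action-space awareness might look like in code, assuming the environment exposes a list of currently admissible actions and the agent keeps per-action value estimates. The epsilon-greedy rule shown here is a swapped-in textbook technique rather than the project's exploration strategy, and `choose_action` is a hypothetical helper.

```python
import random


def choose_action(admissible, value_estimates, epsilon=0.1, rng=random):
    """Epsilon-greedy choice restricted to the admissible actions.

    With probability `epsilon` a random admissible action is explored;
    otherwise the admissible action with the highest estimate is exploited.
    Actions without an estimate default to 0.0.
    """
    if not admissible:
        raise ValueError("no admissible actions available")
    if rng.random() < epsilon:
        return rng.choice(list(admissible))
    return max(admissible, key=lambda action: value_estimates.get(action, 0.0))


estimates = {"open drawer": 0.4, "go to desk": 0.7, "fly away": 9.9}
admissible = ["open drawer", "go to desk"]  # "fly away" is not available here

rng = random.Random(0)  # seeded for reproducibility
print(choose_action(admissible, estimates, epsilon=0.0, rng=rng))
```

Because the choice is always drawn from `admissible`, a high estimate for an unavailable action (such as `"fly away"` above) can never be selected.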

Integration with RL Tuning Frameworks

We integrate insights and methodologies from leading RL tuning frameworks, including:

Verl - Integrated as Git Submodule - Our primary RL framework, providing advanced training capabilities for agent optimization
TinyZero
OpenR1
Trlx

Verl Integration

The verl submodule is fully integrated into OpenManus-RL, providing:

- Advanced RL Algorithms: PPO, DPO, and custom reward modeling
- Efficient Training: Optimized for large language model fine-tuning
- Flexible Configuration: Easy customization of training parameters
- Production Ready: Battle-tested framework from Bytedance

Through these frameworks, agents can effectively balance exploration and exploitation, optimize reasoning processes, and adapt dynamically to novel environments.

In summary, our method systematically integrates advanced reasoning paradigms, diverse rollout strategies, sophisticated reward modeling, and robust RL frameworks, significantly advancing the capability and adaptability of reasoning-enhanced LLM agents.

<img src="https://github.com/OpenManus/OpenManus-RL/raw/main/assets/method_overview.png" style="width: 100%;" alt="marble">

Dataset

OpenManusRL-Dataset combines agent trajectories from AgentInstruct, Agent-FLAN, and AgentTraj-L (AgentGym), with the following features:

🔍 ReAct Framework - Reasoning-Acting integration
🧠 Structured Training - Separate format/reasoning learning
🚫 Anti-Hallucination - Negative samples + environment grounding
🌐 6 Domains - OS, DB, Web, KG, Household, E-commerce

Dataset Overview

Source	Trajectories	Avg Turns	Key Features
AgentInstruct	1,866	5.24	Multi-task QA, CoT reasoning
Agent-FLAN	34,442	3-35	Error recovery patterns, diverse real-world tasks
AgentTraj-L	14,485	3-35	Interactive environments and tasks
Combined	50,793	4-20	Enhanced generalization, uniform format with broader task coverage
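
The combined trajectory count can be checked against the per-source rows with a few lines of arithmetic. The snippet below uses only the numbers from the table; the variable names are illustrative.

```python
# Trajectory counts per source, copied from the overview table.
trajectories = {
    "AgentInstruct": 1866,
    "Agent-FLAN": 34442,
    "AgentTraj-L": 14485,
}

total = sum(trajectories.values())
shares = {name: count / total for name, count in trajectories.items()}

print(total)  # 50793, matching the "Combined" row
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Agent-FLAN contributes roughly two thirds of the combined trajectories, which is worth keeping in mind when sampling balanced batches across sources.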

Supported Tasks

text-generation: ReAct-style instruction following
conversational-ai: Tool-augmented dialogues

Languages

English

Data Instances

ReAct Pattern Example:

{
  "id": "os_0",
  "conversations": [
    {"role": "user", "content": "Count files in /etc"},
    {"role": "assistant", "content": "Think: Need reliable counting method\nAct: bash\n```bash\nls -1 /etc | wc -l\n```"},
    {"role": "user", "content": "OS Output: 220"},
    {"role": "assistant", "content": "Think: Verified through execution\nAct: answer(220)"}
  ]
}
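
To make the Think/Act layout of the instance above concrete, here is a small parser sketch for assistant turns. It assumes every assistant turn starts with `Think:` followed by an `Act:` line and an optional fenced command, as in the example; the function name `parse_react_turn` and the output keys `thought`, `action`, and `code` are hypothetical and not part of the dataset schema.

```python
import re

FENCE = "`" * 3  # code-fence marker, built at runtime to keep this snippet readable

TURN_PATTERN = re.compile(
    r"Think:\s*(?P<thought>.*?)\nAct:\s*(?P<action>[^\n]*)(?:\n(?P<rest>.*))?$",
    re.DOTALL,
)


def parse_react_turn(content):
    """Split an assistant turn into thought, action, and optional code payload."""
    match = TURN_PATTERN.match(content)
    if match is None:
        raise ValueError("turn does not follow the Think/Act layout")
    rest = match.group("rest") or ""
    code = None
    if rest.startswith(FENCE):
        # Drop the opening fence line, then cut at the closing fence.
        body = rest.split("\n", 1)[1] if "\n" in rest else ""
        code = body.rsplit(FENCE, 1)[0].strip()
    return {
        "thought": match.group("thought").strip(),
        "action": match.group("action").strip(),
        "code": code,
    }


first_turn = (
    "Think: Need reliable counting method\nAct: bash\n"
    + FENCE + "bash\nls -1 /etc | wc -l\n" + FENCE
)
print(parse_react_turn(first_turn))
print(parse_react_turn("Think: Verified through execution\nAct: answer(220)"))
```

A parser like this is also a natural place to compute a format-based reward: turns that raise `ValueError` do not follow the expected structure.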

Running

OpenManus-RL

A simplified library for Supervised Fine-Tuning (SFT) and GRPO tuning of language models for agentic systems, built on Verl from Bytedance. This part is still under active development, and feedback is welcome.

Installation

Prerequisites

This project uses git submodules. After cloning the repository, make sure to initialize and update the submodules:

# Clone the repository with submodules
git clone --recursive https://github.com/OpenManus/OpenManus-RL.git

# Or if already cloned, initialize and update submodules
git submodule update --init --recursive

Environment Setup

First, create a conda environment and activate it:

# Create a new conda environment
conda create -n openmanus-rl python=3.10 -y
conda activate openmanus-rl

Then, install the required dependencies:

# Install PyTorch with CUDA support
pip3 install torch torchvision

# Install the main package together with the vllm extra for efficient inference
# (quoting the extra keeps shells such as zsh from expanding the brackets)
pip install -e ".[vllm]"

# flash attention 2
pip3 install flash-attn --no-build-isolation
pip install wandb
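
After running the installation commands, a quick import check can confirm that the packages are visible to the active interpreter. This is a sketch, not a script shipped with the repository: the module list simply mirrors the packages installed above (with `flash_attn` assumed as the import name of `flash-attn`), and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names of the packages installed in the steps above (assumed mapping).
REQUIRED_MODULES = ("torch", "torchvision", "vllm", "flash_attn", "wandb")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Run it inside the `openmanus-rl` conda environment; `find_spec` only locates the modules and does not import them, so the check stays fast even for heavy packages.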

Environment Setup

1. Webshop

To set up the WebShop environment for evaluation:

# Change to the agentenv-webshop directory
cd openmanus_rl/environments/env_package/webshop/webshop/

# Create a new conda environment for WebShop
conda create -n agentenv_webshop python==3.10 -y
conda activate agentenv_webshop

# Setup the environment
bash ./setup.sh -d all

2. ALFWorld

conda activate openmanus-rl
pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0
pip install alfworld

Download PDDL & Game files and the pre-trained MaskRCNN detector (will be stored in ~/.cache/alfworld/):

alfworld-download -f

Use --extra to download pre-trained checkpoints and seq2seq data.

Quick Start

1. Environment Setup

Make sure you have the required environments set up (see Environment Setup section above).

2. Data Preparation

Download the OpenManus-RL dataset from Hugging Face.

3. Training Examples

ALFWorld RL Training (PPO)

conda activate openmanus-rl
bash scripts/ppo_train/train_alfworld.sh

Related Work

Agent tuning

Offline Training of Language Model Agents with Functions as Learnable Weights. [paper]
FireAct: Toward Language Agent Fine-tuning. [paper]
AgentTuning: Enabling Generalized Agent Abilities for LLMs. [[paper](https://ar

Core symbols most depended-on inside this repo

called by 629

openmanus_rl/environments/env_package/alfworld/alfworld/agents/detector/utils.py

get

called by 186

openmanus_rl/environments/env_package/alfworld/alfworld/agents/agent/base_agent.py

get

called by 76

openmanus_rl/environments/env_package/webshop/webshop/web_agent_site/envs/web_agent_text_env.py

encode

called by 63

openmanus_rl/environments/env_package/alfworld/alfworld/agents/agent/base_agent.py

sum

called by 58

openmanus_rl/environments/env_package/alfworld/alfworld/agents/modules/segment_tree.py

get_object

called by 39

openmanus_rl/environments/env_package/alfworld/alfworld/agents/controller/base.py

called by 33

openmanus_rl/environments/env_package/webshop/webshop/web_agent_site/envs/web_agent_text_env.py

step

called by 30

openmanus_rl/environments/env_package/alfworld/alfworld/env/thor_env.py

Shape

Method 1,046

Function 353

Class 238

Route 16

Languages

Python 100%

Modules by API surface

openmanus_rl/environments/env_package/alfworld/alfworld/agents/modules/layers.py · 147 symbols

openmanus_rl/environments/env_package/webshop/webshop/baseline_models/logger.py · 68 symbols

openmanus_rl/environments/env_package/alfworld/alfworld/agents/expert/handcoded_expert.py · 52 symbols

openmanus_rl/environments/env_package/alfworld/alfworld/env/tasks.py · 46 symbols

openmanus_rl/environments/env_package/alfworld/alfworld/agents/modules/generic.py · 35 symbols

openmanus_rl/environments/env_package/webshop/webshop/web_agent_site/envs/web_agent_text_env.py · 33 symbols

openmanus_rl/environments/env_package/alfworld/alfworld/agents/agent/base_agent.py · 33 symbols

openmanus_rl/environments/env_package/alfworld/alfworld/agents/detector/utils.py · 32 symbols

openmanus_rl/environments/env_package/alfworld/alfworld/gen/game_states/task_game_state.py · 28 symbols

openmanus_rl/environments/env_package/alfworld/alfworld/env/thor_env.py · 28 symbols

test/test_rollout_mock.py · 25 symbols

test/test_rollout_env.py · 25 symbols

Dependencies from manifests, versioned

Flask 2.1.2 · 1×

PyYAML 6.0.2 · 1×

Werkzeug 2.1.0 · 1×

accelerate · 1×

beautifulsoup4 4.11.1 · 1×

cleantext 1.1.4 · 1×

codetiming · 1×

datasets · 1×

dill · 1×

env 0.1.0 · 1×

gradio 4.26.0 · 1×

gym 0.24.0 · 1×

For agents

$ claude mcp add OpenManus-RL \
  -- python -m otcore.mcp_server <graph>