MCPcopy
hub / github.com/Gen-Verse/OpenClaw-RL

github.com/Gen-Verse/OpenClaw-RL @main sqlite

repository ↗ · DeepWiki ↗
12,110 symbols 54,575 edges 1,293 files 6,073 documented · 50%
README

OpenClaw-RL Claw-RL logo

Empowering OpenClaw with RL — Train a personalized agent simply by talking to it.

Scalable RL in real-world settings — Agentic RL for terminal, GUI, SWE, and tool-call settings.

Fully Async Zero API or Zero GPU Personalized Auto Language Feedback Hybrid RL General Agentic RL

Tech Report OpenClaw-RL Blog OpenClaw Plugin Slime Based Tinker Supported License Apache 2.0

📰 News

  • [2026/4/15] 🙌 We sincerely thank Fireworks AI for its generous support of this project, which has enabled more experiments and faster iteration.
  • [2026/4/11] ✨ Qwen3.5-4B/9B/27B is supported now, both text and multi-modal!
  • [2026/4/4] 👨‍👦‍👦 We support optimizing a single model based on feedback from a group of people.
  • [2026/3/25] 🙌 We sincerely thank Tinker for its generous support of this project, which has enabled more experiments and faster iteration.
  • [2026/3/20] 💻 You can use your own openclaw now, simply install this extension.
  • [2026/3/13] ☁️ OpenClaw-RL now supports both local GPU and cloud (Tinker) deployment. Launch with one line of code — Hybrid RL, OPD, and Binary RL all supported!
  • [2026/3/12] ⚡ We support LoRA training now!
  • [2026/3/10] 📃 We have released our Technical Report! 🏆 Ranked #1 on HuggingFace Daily Papers!
  • [2026/3/10] 🔥 Huge updates today! We released a new combination method, along with an interesting evaluation of these OpenClaw-RL methods. Track 2 is released too, featuring scalable RL implementations for general agent settings across terminal, GUI, SWE, and tool-call scenarios. We only focus on real-world settings!
  • [2026/3/3] 🙌 Working with the authors of SDFT and SDPO, we have integrated their methods into openclaw-opd. We welcome the integration of novel and effective methods!
  • [2026/3/3] 📺 Check out these community tutorial videos on OpenClaw-RL: Video 1 | Video 2
  • [2026/2/26] 🔥 We release OpenClaw-RL v1 — a fully asynchronous RL framework for training personalized AI agents from natural conversation feedback.

💡 TL;DR

OpenClaw-RL is a fully asynchronous reinforcement learning framework that turns everyday conversations into training signals for personalized AI agents, and supports training general agents with large-scale environment parallelization.

Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it wraps your self-hosted model in OpenClaw as an OpenAI-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background — all without interrupting your usage.

Overview

Highlights: Fully async 4-component loop · Self-hosted & private · Zero manual labeling · Three learning paradigms (Binary RL / OPD / Combine) · Personal + General agent support

🌈 Features

Fully Asynchronous 4-Component Architecture

OpenClaw-RL decouples agent serving, rollout collection, PRM/judge evaluation, and policy training into independent async loops. None of them block one another: the model continues serving requests while training runs in the background, and judging happens concurrently with new interactions.

Self-Hosted & Private by Design

The entire stack, including the policy model, judge/PRM, and trainer, runs on your own infrastructure. Conversation data stays within your system, and no third-party model API is required.

From Feedback to Gradient — Automatically

You do not need to manually label data. The system automatically: - Organizes multi-turn interactions into session-aware training trajectories - Classifies API messages into main-line (trainable) vs. side (non-trainable) turns - Uses the next user, environment, or tool feedback as a natural "next-state" signal - Runs PRM/judge evaluation asynchronously, with majority voting when needed for more robust scoring - Submits ready samples to the trainer as they become available

Three Optimization Methods in One Framework

Binary RL (GRPO): A Process Reward Model scores each turn based on next-state feedback. The scalar reward is then used with GRPO advantage estimation and a PPO-style clipped surrogate loss.

On-Policy Distillation (OPD): When the next state reveals useful hindsight, a judge model extracts a textual hint. This hint augments the original prompt to create an enhanced teacher, whose token-level log-probability gap with the student becomes a directional advantage signal richer than any scalar reward.

Hybrid Method: OpenClaw-RL further combines Binary RL and OPD in a unified training recipe, leveraging the dense scalar supervision of Binary RL together with the richer token-level directional signal from OPD. This combination achieves stronger and more robust optimization than either method alone.

From Personal Agents to Real-World Agentic RL

The same framework supports both personalized OpenClaw optimization and scalable RL for terminal, GUI, SWE, and tool-call agents in real-world settings.


🎯 Roadmap

Our long-term goal is to advance personalized, practically useful agents with reinforcement learning. The roadmap has two tracks:

Track 1 — Personal Agent Optimization (Small-Scale but Personal)

Release Track 1: Fully async OpenClaw-RL framework with Binary RL + OPD
✅ Best recipe discovery via demonstration experiments
✅ Support LoRA Training
✅ Deploy training on Tinker
✅ Deploy training on Fireworks AI

Track 2 — General Agents Optimization (Scalable Infra)

Release Track 2: Scalable agentic RL infra for general agents
✅ Support Qwen3.5
⬜ Support more cloud services

📝 Contents


🔧 Personal Agent Optimization Quick Start

1. Deployment Requirements

  • Hardware: 8× GPUs (default; configurable via NUM_GPUS, ACTOR_GPUS, ROLLOUT_GPUS, PRM_GPUS)
  • Software: CUDA 12.9, Python 3.12
  • Framework: Slime (our base RL framework)

For detailed environment setup, see Slime or ./instructions/README.md.

<!--

Don't have a GPU?

**Optio

Core symbols most depended-on inside this repo

append
called by 1939
slime/slime_plugins/rollout_buffer/buffer.py
get
called by 1519
slime/slime/utils/types.py
items
called by 573
Megatron-LM/megatron/core/optimizer/optimizer.py
split
called by 462
Megatron-LM/tools/preprocess_data.py
format
called by 445
Megatron-LM/megatron/rl/inference/chat_templates.py
size
called by 439
Megatron-LM/megatron/core/datasets/indexed_dataset.py
pop
called by 400
Megatron-LM/megatron/core/pipeline_parallel/fine_grained_activation_offload.py
set
called by 303
swe-rl/mini-swe-agent/src/minisweagent/run/extra/config.py

Shape

Method 6,126
Function 4,435
Class 1,405
Route 144

Languages

Python100%
TypeScript1%

Modules by API surface

Megatron-LM/megatron/training/tokenizer/tokenizer.py133 symbols
Megatron-LM/megatron/core/utils.py120 symbols
Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py119 symbols
Megatron-LM/megatron/core/parallel_state.py90 symbols
Megatron-LM/megatron/core/optimizer/optimizer.py90 symbols
Megatron-LM/megatron/core/extensions/transformer_engine.py76 symbols
Megatron-LM/megatron/core/transformer/moe/token_dispatcher.py63 symbols
Megatron-LM/megatron/core/transformer/cuda_graphs.py62 symbols
gui-rl/desktop_env/server/main.py60 symbols
Megatron-LM/examples/multimodal/evaluation/evaluation_datasets.py60 symbols
Megatron-LM/megatron/core/tensor_parallel/mappings.py58 symbols
Megatron-LM/megatron/core/datasets/indexed_dataset.py57 symbols

Dependencies from manifests, versioned

GitPython3.1.46 · 1×
Jinja23.1.6 · 1×
Markdown3.10.2 · 1×
MarkupSafe3.0.3 · 1×
Pillow10.1.0 · 1×
PuLP3.3.0 · 1×
PyAutoGUI0.9.54 · 1×
PyJWT2.11.0 · 1×
PyYAML6.0.3 · 1×
Pygments2.19.2 · 1×
StrEnum0.4.15 · 1×
Werkzeug3.1.6 · 1×

For agents

$ claude mcp add OpenClaw-RL \
  -- python -m otcore.mcp_server <graph>

⬇ download graph artifact