hub / github.com/Gen-Verse/OpenClaw-RL

github.com/Gen-Verse/OpenClaw-RL @main sqlite

repository ↗ · DeepWiki ↗

12,110 symbols 54,575 edges 1,293 files 6,073 documented · 50%

README

OpenClaw-RL

Empowering OpenClaw with RL — Train a personalized agent simply by talking to it.

Scalable RL in real-world settings — Agentic RL for terminal, GUI, SWE, and tool-call settings.

📰 News

[2026/4/15] 🙌 We sincerely thank Fireworks AI for its generous support of this project, which has enabled more experiments and faster iteration.
[2026/4/11] ✨ Qwen3.5-4B/9B/27B is supported now, both text and multi-modal!
[2026/4/4] 👨‍👦‍👦 We support optimizing a single model based on feedback from a group of people.
[2026/3/25] 🙌 We sincerely thank Tinker for its generous support of this project, which has enabled more experiments and faster iteration.
[2026/3/20] 💻 You can use your own openclaw now, simply install this extension.
[2026/3/13] ☁️ OpenClaw-RL now supports both local GPU and cloud (Tinker) deployment. Launch with one line of code — Hybrid RL, OPD, and Binary RL all supported!
[2026/3/12] ⚡ We support LoRA training now!
[2026/3/10] 📃 We have released our Technical Report! 🏆 Ranked #1 on HuggingFace Daily Papers!
[2026/3/10] 🔥 Huge updates today! We released a new combination method, along with an interesting evaluation of these OpenClaw-RL methods. Track 2 is released too, featuring scalable RL implementations for general agent settings across terminal, GUI, SWE, and tool-call scenarios. We only focus on real-world settings!
[2026/3/3] 🙌 Working with the authors of SDFT and SDPO, we have integrated their methods into openclaw-opd. We welcome the integration of novel and effective methods!
[2026/3/3] 📺 Check out these community tutorial videos on OpenClaw-RL: Video 1 | Video 2
[2026/2/26] 🔥 We release OpenClaw-RL v1 — a fully asynchronous RL framework for training personalized AI agents from natural conversation feedback.

💡 TL;DR

OpenClaw-RL is a fully asynchronous reinforcement learning framework that turns everyday conversations into training signals for personalized AI agents, and supports training general agents with large-scale environment parallelization.

Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it wraps your self-hosted model in OpenClaw as an OpenAI-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background — all without interrupting your usage.

Overview

Highlights: Fully async 4-component loop · Self-hosted & private · Zero manual labeling · Three learning paradigms (Binary RL / OPD / Combine) · Personal + General agent support

🌈 Features

Fully Asynchronous 4-Component Architecture

OpenClaw-RL decouples agent serving, rollout collection, PRM/judge evaluation, and policy training into independent async loops. None of them block one another: the model continues serving requests while training runs in the background, and judging happens concurrently with new interactions.

Self-Hosted & Private by Design

The entire stack, including the policy model, judge/PRM, and trainer, runs on your own infrastructure. Conversation data stays within your system, and no third-party model API is required.

From Feedback to Gradient — Automatically

You do not need to manually label data. The system automatically: - Organizes multi-turn interactions into session-aware training trajectories - Classifies API messages into main-line (trainable) vs. side (non-trainable) turns - Uses the next user, environment, or tool feedback as a natural "next-state" signal - Runs PRM/judge evaluation asynchronously, with majority voting when needed for more robust scoring - Submits ready samples to the trainer as they become available

Three Optimization Methods in One Framework

Binary RL (GRPO): A Process Reward Model scores each turn based on next-state feedback. The scalar reward is then used with GRPO advantage estimation and a PPO-style clipped surrogate loss.

On-Policy Distillation (OPD): When the next state reveals useful hindsight, a judge model extracts a textual hint. This hint augments the original prompt to create an enhanced teacher, whose token-level log-probability gap with the student becomes a directional advantage signal richer than any scalar reward.

Hybrid Method: OpenClaw-RL further combines Binary RL and OPD in a unified training recipe, leveraging the dense scalar supervision of Binary RL together with the richer token-level directional signal from OPD. This combination achieves stronger and more robust optimization than either method alone.

From Personal Agents to Real-World Agentic RL

The same framework supports both personalized OpenClaw optimization and scalable RL for terminal, GUI, SWE, and tool-call agents in real-world settings.

🎯 Roadmap

Our long-term goal is to advance personalized, practically useful agents with reinforcement learning. The roadmap has two tracks:

Track 1 — Personal Agent Optimization (Small-Scale but Personal)

✅ Release Track 1: Fully async OpenClaw-RL framework with Binary RL + OPD
✅ Best recipe discovery via demonstration experiments
✅ Support LoRA Training
✅ Deploy training on Tinker
✅ Deploy training on Fireworks AI

Track 2 — General Agents Optimization (Scalable Infra)

✅ Release Track 2: Scalable agentic RL infra for general agents
✅ Support Qwen3.5
⬜ Support more cloud services

🔧 Personal Agent Optimization Quick Start

1. Deployment Requirements

Hardware: 8× GPUs (default; configurable via NUM_GPUS, ACTOR_GPUS, ROLLOUT_GPUS, PRM_GPUS)
Software: CUDA 12.9, Python 3.12
Framework: Slime (our base RL framework)

For detailed environment setup, see Slime or ./instructions/README.md.

<!--

Don't have a GPU?

**Optio

Core symbols most depended-on inside this repo

append

called by 1939

slime/slime_plugins/rollout_buffer/buffer.py

get

called by 1519

slime/slime/utils/types.py

items

called by 573

Megatron-LM/megatron/core/optimizer/optimizer.py

split

called by 462

Megatron-LM/tools/preprocess_data.py

format

called by 445

Megatron-LM/megatron/rl/inference/chat_templates.py

size

called by 439

Megatron-LM/megatron/core/datasets/indexed_dataset.py

pop

called by 400

Megatron-LM/megatron/core/pipeline_parallel/fine_grained_activation_offload.py

set

called by 303

swe-rl/mini-swe-agent/src/minisweagent/run/extra/config.py

Shape

Method 6,126

Function 4,435

Class 1,405

Route 144

Languages

Python100%

TypeScript1%

Modules by API surface

Megatron-LM/megatron/training/tokenizer/tokenizer.py133 symbols

Megatron-LM/megatron/core/utils.py120 symbols

Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py119 symbols

Megatron-LM/megatron/core/parallel_state.py90 symbols

Megatron-LM/megatron/core/optimizer/optimizer.py90 symbols

Megatron-LM/megatron/core/extensions/transformer_engine.py76 symbols

Megatron-LM/megatron/core/transformer/moe/token_dispatcher.py63 symbols

Megatron-LM/megatron/core/transformer/cuda_graphs.py62 symbols

gui-rl/desktop_env/server/main.py60 symbols

Megatron-LM/examples/multimodal/evaluation/evaluation_datasets.py60 symbols

Megatron-LM/megatron/core/tensor_parallel/mappings.py58 symbols

Megatron-LM/megatron/core/datasets/indexed_dataset.py57 symbols

Dependencies from manifests, versioned

GitPython3.1.46 · 1×

Jinja23.1.6 · 1×

Markdown3.10.2 · 1×

MarkupSafe3.0.3 · 1×

Pillow10.1.0 · 1×

PuLP3.3.0 · 1×

PyAutoGUI0.9.54 · 1×

PyJWT2.11.0 · 1×

PyYAML6.0.3 · 1×

Pygments2.19.2 · 1×

StrEnum0.4.15 · 1×

Werkzeug3.1.6 · 1×

For agents

$ claude mcp add OpenClaw-RL \
  -- python -m otcore.mcp_server <graph>

⬇ download graph artifact