Empowering OpenClaw with RL — Train a personalized agent simply by talking to it.
Scalable RL in real-world settings — Agentic RL for terminal, GUI, SWE, and tool-call settings.
OpenClaw-RL is a fully asynchronous reinforcement learning framework that turns everyday conversations into training signals for personalized AI agents, and supports training general agents with large-scale environment parallelization.
Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it wraps your self-hosted model in OpenClaw as an OpenAI-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background — all without interrupting your usage.

Highlights: Fully async 4-component loop · Self-hosted & private · Zero manual labeling · Three learning paradigms (Binary RL / OPD / Combine) · Personal + General agent support
🌈 Features
OpenClaw-RL decouples agent serving, rollout collection, PRM/judge evaluation, and policy training into independent async loops. None of them block one another: the model continues serving requests while training runs in the background, and judging happens concurrently with new interactions.
The entire stack, including the policy model, judge/PRM, and trainer, runs on your own infrastructure. Conversation data stays within your system, and no third-party model API is required.
You do not need to manually label data. The system automatically: - Organizes multi-turn interactions into session-aware training trajectories - Classifies API messages into main-line (trainable) vs. side (non-trainable) turns - Uses the next user, environment, or tool feedback as a natural "next-state" signal - Runs PRM/judge evaluation asynchronously, with majority voting when needed for more robust scoring - Submits ready samples to the trainer as they become available
Binary RL (GRPO): A Process Reward Model scores each turn based on next-state feedback. The scalar reward is then used with GRPO advantage estimation and a PPO-style clipped surrogate loss.
On-Policy Distillation (OPD): When the next state reveals useful hindsight, a judge model extracts a textual hint. This hint augments the original prompt to create an enhanced teacher, whose token-level log-probability gap with the student becomes a directional advantage signal richer than any scalar reward.
Hybrid Method: OpenClaw-RL further combines Binary RL and OPD in a unified training recipe, leveraging the dense scalar supervision of Binary RL together with the richer token-level directional signal from OPD. This combination achieves stronger and more robust optimization than either method alone.
The same framework supports both personalized OpenClaw optimization and scalable RL for terminal, GUI, SWE, and tool-call agents in real-world settings.
Our long-term goal is to advance personalized, practically useful agents with reinforcement learning. The roadmap has two tracks:
✅ Release Track 1: Fully async OpenClaw-RL framework with Binary RL + OPD
✅ Best recipe discovery via demonstration experiments
✅ Support LoRA Training
✅ Deploy training on Tinker
✅ Deploy training on Fireworks AI
✅ Release Track 2: Scalable agentic RL infra for general agents
✅ Support Qwen3.5
⬜ Support more cloud services
NUM_GPUS, ACTOR_GPUS, ROLLOUT_GPUS, PRM_GPUS)For detailed environment setup, see Slime or ./instructions/README.md.
<!--
**Optio
$ claude mcp add OpenClaw-RL \
-- python -m otcore.mcp_server <graph>