
SimpleVLA-RL is an efficient RL framework for vision-language-action (VLA) models that improves long-horizon planning under data scarcity. RL training substantially outperforms SFT in simulation and real-world tasks, reveals a "pushcut" new-action phenomenon, and strengthens spatial, object, and goal generalization.
https://github.com/user-attachments/assets/45fca289-39d4-4a42-8014-1ef7eff2d806
SimpleVLA-RL extends veRL with VLA-specific components across the following modules:
verl/trainer/main_ppo.py
- Main entry point with ray initialization
- RobRewardManager for reward distribution
verl/trainer/ppo/ray_trainer.py
- Main RL training loop: data loading, VLA rollout, model updates, evaluation, checkpointing
- RL algorithm-specific advantage computation
verl/workers/fsdp_workers.py
- Source of core functions called in ray_trainer.py
- VLA model/optimizer initialization, generate_sequences, compute_entropy, update_actor
verl/workers/actor/dp_rob.py
- Specific implementation of functions in fsdp_workers.py
- RL loss computation, policy updates, compute_log_prob, compute_entropy
verl/workers/rollout/rob_rollout.py
- VLA rollout implementation: environment creation, multi-environment parallel rendering, VLA action generation, environment interaction, video saving, trajectory and 0/1 reward collection
verl/utils/dataset/rob_dataset.py
- Dataset construction for training/testing across benchmarks
verl/utils/vla_utils/
- VLA model implementations (OpenVLA-OFT/OpenVLA from official code)
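The two RL-specific steps named in the list above, 0/1 reward collection in the rollout and advantage computation in the trainer, can be sketched in a few lines. This is a minimal illustration, not the code in verl: the README does not say which advantage estimator is used, so the sketch assumes a group-normalized (GRPO-style) estimator, and the function names `binary_outcome_rewards` and `group_normalized_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def binary_outcome_rewards(successes):
    """Map per-trajectory success flags to 0/1 outcome rewards."""
    return [1.0 if s else 0.0 for s in successes]


def group_normalized_advantages(rewards, group_size, eps=1e-6):
    """Normalize rewards within consecutive groups of `group_size` rollouts.

    Assumption (not stated in the README): rollouts are laid out so that
    each consecutive block of `group_size` entries shares the same task.
    """
    if group_size <= 0 or len(rewards) % group_size != 0:
        raise ValueError("len(rewards) must be a multiple of a positive group_size")
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mu = mean(group)
        sigma = pstdev(group)  # population standard deviation of the group
        # eps keeps the division finite when every rollout in a group agrees
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two tasks, four rollouts each: the first group is mixed, the second all succeed.
successes = [True, False, False, True, True, True, True, True]
rewards = binary_outcome_rewards(successes)
advantages = group_normalized_advantages(rewards, group_size=4)
```

Under this assumed estimator, a group in which every rollout succeeds (or every rollout fails) produces zero advantages and therefore contributes no learning signal.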
See SETUP.md for detailed instructions on setting up the conda environment.
An SFT (Supervised Fine-Tuning) VLA model is required for RL training. Below are the available options:
- libero-10 traj1/trajall SFT
- libero-goal/object/spatial traj1 SFT
- Robotwin2.0 tasks traj1000 SFT
OpenVLA SFT Models
Download from here.
Other Models
For other models, you may need to fine-tune them yourself.
Before running the training script, ensure the following configurations are properly set:
Set Your Weights and Biases (WandB) API Key
Replace the WANDB_API_KEY field in SimpleVLA-RL/align.json with your own WandB API key.
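If you would rather script this edit than change the file by hand, the following is a minimal sketch. The layout of align.json is not described here, so the helper makes no assumption about nesting and updates every `WANDB_API_KEY` field it finds; the function names `set_json_key` and `update_wandb_key` are hypothetical and not part of SimpleVLA-RL.

```python
import json
from pathlib import Path


def set_json_key(node, key, value):
    """Recursively set every occurrence of `key` to `value`; return the count."""
    count = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value  # overwriting an existing key is safe while iterating
                count += 1
            else:
                count += set_json_key(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            count += set_json_key(item, key, value)
    return count


def update_wandb_key(align_path, api_key):
    """Rewrite the JSON file at `align_path` with the given WandB API key."""
    path = Path(align_path)
    config = json.loads(path.read_text())
    replaced = set_json_key(config, "WANDB_API_KEY", api_key)
    if replaced == 0:
        raise KeyError(f"WANDB_API_KEY not found in {path}")
    path.write_text(json.dumps(config, indent=2) + "\n")
    return replaced


# Example call (path taken from the step above; the key is a placeholder):
# update_wandb_key("SimpleVLA-RL/align.json", "<your-wandb-api-key>")
```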
Modify Key Variables
Update the following variables in examples/run_openvla_oft_rl_libero.sh or examples/run_openvla_oft_rl_twin2.sh as needed:
- WANDB_API_KEY: Your WandB API key.
- EXPERIMENT_NAME: The name of your experiment. You can choose any name.
- SFT_MODEL_PATH: Path to your SFT model.
- CKPT_PATH: Path where your checkpoints will be saved.
- DATASET_NAME: For detailed options, refer to examples/run_openvla_oft_rl_libero.sh or examples/run_openvla_oft_rl_twin2.sh.
- ALIGN_PATH: Path to the SimpleVLA-RL/align.json file.
- NUM_GPUS: Number of GPUs available per node (e.g., 8).
- NUM_NODES: Number of nodes used for RL training (e.g., 1).
Note:
- The script has been tested on the following configurations:
  - Single-node setup: NUM_NODES=1, NUM_GPUS=8 (1 node with 8 NVIDIA A800 GPUs, each having 80GB memory).
  - Multi-node setup: NUM_NODES=2, NUM_GPUS=8 (2 nodes with 16 NVIDIA A800 GPUs in total, each having 80GB memory).
- The driver version used is 470.161.03, and the CUDA version is 12.4. (These exact versions are not required.)
bash examples/run_openvla_oft_rl_libero.sh
or
bash examples/run_openvla_oft_rl_twin2.sh
To evaluate the performance of your model, enable evaluation mode by setting trainer.val_only=True in examples/run_openvla_oft_rl_libero.sh or examples/run_openvla_oft_rl_twin2.sh. Then, execute the same script:
bash examples/run_openvla_oft_rl_libero.sh
or
bash examples/run_openvla_oft_rl_twin2.sh
We evaluate SimpleVLA-RL on LIBERO using OpenVLA-OFT. SimpleVLA-RL improves the performance of OpenVLA-OFT to 97.6 points on LIBERO-Long, setting a new state of the art. Remarkably, using only one trajectory per task for cold-start SFT, SimpleVLA-RL raises the performance of OpenVLA-OFT from 17.3 to 91.7, an improvement of 74.4 points (430.1% relative to the SFT baseline).
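The absolute and relative gains quoted for the one-trajectory setting follow directly from the two reported scores. A short check, using only the numbers above (the variable names are illustrative):

```python
# Scores reported above for the one-trajectory cold-start setting.
sft_score = 17.3   # OpenVLA-OFT after SFT with one trajectory per task
rl_score = 91.7    # the same model after SimpleVLA-RL training

absolute_gain = rl_score - sft_score
relative_gain_pct = 100.0 * absolute_gain / sft_score

print(f"{absolute_gain:.1f} points, {relative_gain_pct:.1f}%")
# prints "74.4 points, 430.1%"
```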


We develop this preview version of the code based on veRL, OpenVLA-OFT, RoboTwin2.0, and PRIME. We acknowledge their significant contributions! For further details and updates, please refer to the official documentation and repositories of the respective projects.
If you find SimpleVLA-RL helpful, please cite us:
@article{li2025simplevla,
title={SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning},
author={Li, Haozhan and Zuo, Yuxin and Yu, Jiale and Zhang, Yuhao and Yang, Zhaohui and Zhang, Kaiyan and Zhu, Xuekai and Zhang, Yuchen and Chen, Tianxing and Cui, Ganqu and others},
journal={arXiv preprint arXiv:2509.09674},
year={2025}
}