Background. Vision-Language-Action (VLA) models are typically trained with imitation learning,
which works well in static environments. However, they struggle under real-world changes:
small deviations push the policy away from the expert demonstrations,
errors accumulate, and robustness degrades.
In this work. We introduce VLA-RFT, a reinforcement fine-tuning framework that
leverages a data-driven world model as a controllable simulator. Trained from real
interaction data, the simulator predicts future visual observations conditioned on
actions, allowing policy rollouts with dense, trajectory-level rewards derived from
goal-achieving references. This design delivers an efficient and action-aligned
learning signal, lowering sample requirements.
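To make the rollout mechanism concrete, here is a minimal sketch of a policy being rolled out inside a learned world model and scored against a goal-achieving reference. All names (WorldModel, VLAPolicy, goal_reward) and the dense negative-distance reward are illustrative assumptions, not the released VLA-RFT implementation.

# Minimal sketch of rolling a policy out inside a learned world model and
# scoring the trajectory against a goal-achieving reference. All names
# (WorldModel, VLAPolicy, goal_reward) are illustrative assumptions.
import numpy as np

ACTION_DIM = 7          # 7-dimensional action, matching the format described above
CHUNK_LEN = 8           # length of one action chunk (assumed)

class WorldModel:
    """Stand-in for the data-driven simulator: predicts the next observation."""
    def predict(self, obs: np.ndarray, action: np.ndarray) -> np.ndarray:
        # A real model would run a learned video/dynamics predictor here.
        return obs + 0.01 * np.random.randn(*obs.shape)

class VLAPolicy:
    """Stand-in for the VLA: maps (observation, instruction) to an action chunk."""
    def act(self, obs: np.ndarray, instruction: str) -> np.ndarray:
        return np.random.randn(CHUNK_LEN, ACTION_DIM)

def goal_reward(pred_obs: np.ndarray, goal_obs: np.ndarray) -> float:
    # Dense reward: negative distance between predicted and reference state.
    return -float(np.linalg.norm(pred_obs - goal_obs))

def rollout(policy, wm, obs, goal_obs, instruction, steps=4):
    """Roll the policy out in the world model, collecting dense per-step rewards."""
    rewards = []
    for _ in range(steps):
        chunk = policy.act(obs, instruction)
        for action in chunk:                      # execute the chunk action by action
            obs = wm.predict(obs, action)
            rewards.append(goal_reward(obs, goal_obs))
    return rewards

if __name__ == "__main__":
    obs = np.zeros((64, 64, 3))                   # dummy initial frame
    goal = np.ones((64, 64, 3)) * 0.1             # dummy goal-achieving reference
    rs = rollout(VLAPolicy(), WorldModel(), obs, goal,
                 "put the black bowl in the bottom drawer")
    print(f"trajectory return: {sum(rs):.3f}")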
Performance. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL.
Brief Description. In the pre-training stage, both the world model and the VLA policy are pre-trained; the world model takes a 7-dimensional action input whose format matches the VLA's action output. In the reinforcement fine-tuning stage, the VLA generates action chunks from an initial frame and a language instruction, and these chunks are rolled out in the world model to predict future states. Verified rewards are then computed from the predicted states and used to optimize the VLA with GRPO.
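As a concrete illustration of the shared action format, the toy sketch below decomposes one 7-dimensional action. The particular split into translation, rotation, and gripper components is a common robotics convention and an assumption here, not something specified by the paper.

# Sketch of the shared action format: the VLA emits chunks of 7-dimensional
# actions and the world model consumes the same format. The split below
# (xyz delta, rpy delta, gripper) is an assumed, common convention.
import numpy as np

ACTION_DIM = 7
CHUNK_LEN = 8  # assumed chunk length

def split_action(a: np.ndarray):
    """Decompose one 7-D action into assumed translation / rotation / gripper parts."""
    assert a.shape == (ACTION_DIM,)
    return {"delta_pos": a[:3], "delta_rot": a[3:6], "gripper": a[6]}

chunk = np.random.randn(CHUNK_LEN, ACTION_DIM)   # one action chunk from the VLA
print(split_action(chunk[0]))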
Stage I: WM and Policy Pre-Training. In the first stage, we pretrain the world model on offline datasets so that it can capture environment dynamics. In parallel, we pretrain the VLA policy to produce stable action chunks, which serve as a reliable initialization for subsequent optimization.
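A minimal sketch of the two Stage-I objectives, assuming a next-frame (dynamics) prediction loss for the world model and a behavior-cloning loss for the policy on offline demonstrations; the toy model classes, tensor shapes, and hyperparameters are placeholders, not the paper's architectures.

# Minimal sketch of Stage-I pre-training on offline data: a dynamics-prediction
# loss for the world model and a behavior-cloning loss for the VLA policy.
# Model classes and tensor shapes are illustrative assumptions.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    """Toy dynamics model: predicts the next (flattened) frame from frame + action."""
    def __init__(self, obs_dim=256, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 512),
                                 nn.ReLU(), nn.Linear(512, obs_dim))
    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class TinyVLA(nn.Module):
    """Toy policy head: predicts a 7-D action from an observation embedding."""
    def __init__(self, obs_dim=256, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 512),
                                 nn.ReLU(), nn.Linear(512, act_dim))
    def forward(self, obs):
        return self.net(obs)

wm, policy = TinyWorldModel(), TinyVLA()
opt = torch.optim.Adam(list(wm.parameters()) + list(policy.parameters()), lr=1e-4)

# One offline batch: (obs_t, action_t, obs_{t+1}) triples from demonstrations.
obs, act, next_obs = torch.randn(32, 256), torch.randn(32, 7), torch.randn(32, 256)

wm_loss = nn.functional.mse_loss(wm(obs, act), next_obs)      # dynamics prediction
bc_loss = nn.functional.mse_loss(policy(obs), act)            # behavior cloning
(wm_loss + bc_loss).backward()
opt.step()
print(float(wm_loss), float(bc_loss))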
Stage II: VLA Optimization through WM Interaction. In the second stage, given an initial frame and a language instruction, the VLA rolls out n action chunks. The world model then interactively generates trajectories conditioned on these actions and provides verified rewards. Using these feedback signals, the VLA is fine-tuned with GRPO to progressively improve policy performance.
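A minimal sketch of the GRPO-style update for one (initial frame, instruction) pair, assuming group-normalized verified rewards as advantages and a plain policy-gradient surrogate; the rewards and log-probabilities are faked here, and clipping and KL regularization are omitted.

# Minimal sketch of a GRPO-style update: roll out a group of n action chunks
# in the world model, score each with the verified reward, normalize rewards
# within the group to get advantages, and weight the policy's log-probabilities
# by those advantages. Values below are placeholders.
import torch

n = 8                                            # group size (number of rollouts)
rewards = torch.randn(n)                         # verified rewards from the world model
logprobs = torch.randn(n, requires_grad=True)    # summed log-probs of each action chunk

# Group-relative advantage: how much better each rollout is than its peers.
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Policy-gradient surrogate: push up the probability of above-average rollouts.
loss = -(adv.detach() * logprobs).mean()
loss.backward()
print(f"GRPO surrogate loss: {loss.item():.4f}")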
Under scene perturbations, VLA-RFT quickly detects the change and, through interaction with the world model, corrects its actions and completes the task successfully.
Instruction: "Put the black bowl in the bottom drawer of the cabinet and close it."
Citation.
@misc{li2025vlarftvisionlanguageactionreinforcementfinetuning,
title={VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators},
author={Hengtao Li and Pengxiang Ding and Runze Suo and Yihao Wang and Zirui Ge and Dongyuan Zang and Kexian Yu and Mingyang Sun and Hongyin Zhang and Donglin Wang and Weihua Su},
year={2025},
eprint={2510.00406},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2510.00406},
}