VLA-RFT: Vision-Language-Action Reinforcement Fine-Tuning with
Verified Rewards in World Simulators

Hengtao Li1,3,7,*,♢ Pengxiang Ding1,2,3,*,† Runze Suo1,3,4,* Yihao Wang1,3,6,*
Zirui Ge1,2,3 Dongyuan Zang7 Kexian Yu5 Mingyang Sun2,3,4
Hongyin Zhang1,2 Donglin Wang1,✉ Weihua Su7,✉
1 Westlake University 2 Zhejiang University 3 OpenHelix Team 4 Fudan University
5 Zhengzhou University 6 BUPT 7 Hebei University of Technology
* Equal Contribution   ✉ Corresponding Authors   † Project Lead
♢ Work done during an internship at Westlake University

Why Propose VLA-RFT?

Background.   Vision-Language-Action (VLA) models are typically trained with imitation learning, which works well in static environments. They perform poorly under real-world changes, however: slight deviations gradually push the policy away from the expert demonstrations, causing error accumulation and degrading robustness.

In this work.   We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, lowering sample requirements.
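
A minimal sketch of this rollout-and-reward loop is given below. All object names and the reward form are illustrative assumptions, not the exact implementation: the policy proposes actions, the learned world model predicts the next observation, and a dense reward is scored against a goal-achieving reference trajectory.

import numpy as np

def rollout_with_world_model(policy, world_model, obs, instruction, reference, horizon=8):
    """Roll the policy out inside the world model and score it against a reference trajectory."""
    rewards, frames = [], [obs]
    for t in range(horizon):
        action = policy.act(obs, instruction)      # 7-D action predicted by the VLA
        obs = world_model.predict(obs, action)     # predicted next visual observation
        # Dense, trajectory-level signal: closeness to the goal-achieving reference frame.
        rewards.append(-float(np.linalg.norm(obs - reference[t])))
        frames.append(obs)
    return frames, float(np.sum(rewards))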

Figure 1. The Framework of VLA-RFT.

Performance.   With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL.

Figure 2. Performance under General Settings and Perturbation Settings (LIBERO). "Base" denotes the VLA-Adapter model, which achieves state-of-the-art performance.

1. Training Paradigm

Brief Description.   In the pre-training stage, both the world model and the VLA policy are initialized; the world model takes a 7-dimensional action input whose format matches the VLA's action output. In the reinforcement fine-tuning stage, the VLA generates action chunks conditioned on an initial frame and a language instruction, and these chunks are rolled out in the world model to predict future states. Verified rewards are then computed from the predicted states and used to optimize the VLA via GRPO.
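
As a sketch of the shared action interface this implies (the shapes and the 7-D layout split are assumptions used for illustration), the VLA emits chunks of 7-dimensional actions and the world model consumes the same format one step at a time.

from dataclasses import dataclass
import numpy as np

ACTION_DIM = 7   # e.g., 6-DoF end-effector delta + gripper command (assumed layout)

@dataclass
class ActionChunk:
    """One chunk of low-level actions; the same 7-D format is produced by the VLA and consumed by the world model."""
    actions: np.ndarray          # shape (chunk_len, ACTION_DIM)

    def __post_init__(self):
        assert self.actions.ndim == 2 and self.actions.shape[1] == ACTION_DIM

chunk = ActionChunk(actions=np.zeros((8, ACTION_DIM)))   # an 8-step chunk of placeholder actions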

Figure 3. The Training Paradigm of VLA-RFT.

Stage I: WM and Policy Pre-Training.   In the first stage, we pretrain the world model on offline datasets so that it can capture environment dynamics. In parallel, we pretrain the VLA policy to produce stable action chunks, which serve as a reliable initialization for subsequent optimization.
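
As a hedged sketch of the Stage-I objectives (the loss forms below are common choices assumed for illustration, not necessarily the exact ones used), the world model learns next-observation prediction on offline triples, while the VLA policy is pretrained by behavior cloning on expert action chunks.

import torch.nn.functional as F

def world_model_loss(world_model, obs, action, next_obs):
    """Next-observation prediction loss on offline (obs, action, next_obs) triples."""
    pred = world_model(obs, action)
    return F.mse_loss(pred, next_obs)

def bc_loss(policy, obs, instruction, expert_chunk):
    """Behavior cloning: regress the VLA's predicted action chunk onto the expert chunk."""
    pred_chunk = policy(obs, instruction)     # shape (chunk_len, 7)
    return F.l1_loss(pred_chunk, expert_chunk)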

Stage II: VLA Optimization through WM Interaction.   In the second stage, given an initial frame and a language instruction, the VLA rolls out n action chunks. The world model then interactively generates trajectories conditioned on these actions and provides verified rewards. Using these reward signals, the VLA is fine-tuned with GRPO to progressively improve policy performance.
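
A minimal sketch of a GRPO-style update under these rewards follows; the group normalization and clipping details mirror common GRPO practice and are assumptions about the implementation.

import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward within its group of n rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped policy-gradient objective over the n rollouts of one (frame, instruction) prompt."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()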

2. Settings & Results

2.1 Task Settings

Figure 4. Illustration of Perturbed Task Settings in LIBERO. We consider four perturbation types to evaluate out-of-distribution robustness: (Object Position) shifting the initial (x, y) coordinates of the manipulated object; (Goal Position) displacing the target object in the (x, y) plane; (Robot State) modifying the gripper’s vertical height and horizontal offset; and (Combination) applying all perturbations together. Each row shows the original setting (Origin), the perturbed variant (Disturb), and a side-by-side comparison (Contrast).
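
These perturbations can be viewed as small offsets applied to the initial scene state before evaluation. The sketch below is illustrative only; the field names and magnitudes are placeholders, not the benchmark's exact values.

import numpy as np

def perturb_scene(scene, kind, mag_cm=2.0, rng=None):
    """Return a perturbed copy of a scene description for out-of-distribution evaluation."""
    if rng is None:
        rng = np.random.default_rng(0)
    scene = dict(scene)
    cm = mag_cm / 100.0                                   # centimeters -> meters
    if kind in ("object_position", "combination"):        # shift the manipulated object in (x, y)
        scene["object_xy"] = scene["object_xy"] + rng.uniform(-cm, cm, size=2)
    if kind in ("goal_position", "combination"):          # displace the target object in (x, y)
        scene["goal_xy"] = scene["goal_xy"] + rng.uniform(-cm, cm, size=2)
    if kind in ("robot_state", "combination"):            # change gripper height and horizontal offset
        scene["gripper_height"] = scene["gripper_height"] + rng.uniform(-cm, cm)
        scene["gripper_offset"] = scene["gripper_offset"] + rng.uniform(-cm, cm)
    return scene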

2.2 Comparison Results

Table 1. Performance under perturbation settings. All perturbation magnitudes are in centimeters.
Table 2. Performance under General Settings of LIBERO. We report SR (%) across the four suites (Spatial, Object, Goal, and Long) and their average. The radar plot on the right provides a visual comparison of different model stages across tasks.

2.3 Execution Examples of SFT and RFT in Perturbation Scenes

In perturbed scenes, VLA-RFT is sensitive to the change in the environment and, through interaction with the world model, corrects its actions to complete the task successfully.

Instruction:
Put the black bowl in the bottom drawer of the cabinet and close it

SFT: Failure
VLA-RFT: Success

BibTeX

@misc{li2025vlarftvisionlanguageactionreinforcementfinetuning,
  title={VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators},
  author={Hengtao Li and Pengxiang Ding and Runze Suo and Yihao Wang and Zirui Ge and Dongyuan Zang and Kexian Yu and Mingyang Sun and Hongyin Zhang and Donglin Wang and Weihua Su},
  year={2025},
  eprint={2510.00406},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2510.00406},
}