Background. Vision-Language-Action (VLA) models are typically trained with imitation learning,
which works well in static environments. However, they struggle under real-world changes:
small deviations push the policy away from the expert demonstrations,
errors accumulate, and robustness degrades.
In this work. We introduce VLA-RFT, a reinforcement fine-tuning framework that
leverages a data-driven world model as a controllable simulator. Trained from real
interaction data, the simulator predicts future visual observations conditioned on
actions, allowing policy rollouts with dense, trajectory-level rewards derived from
goal-achieving references. This design delivers an efficient and action-aligned
learning signal, lowering sample requirements.
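To make the rollout mechanism concrete, here is a minimal sketch of a policy being rolled out inside a learned world model and scored against a goal-achieving reference. All names (WorldModel, VLAPolicy, goal_reward) and the dense negative-distance reward are illustrative assumptions, not the released VLA-RFT implementation.

# Minimal sketch of rolling a policy out inside a learned world model and
# scoring the trajectory against a goal-achieving reference. All names
# (WorldModel, VLAPolicy, goal_reward) are illustrative assumptions.
import numpy as np

ACTION_DIM = 7          # 7-dimensional action, matching the format described above
CHUNK_LEN = 8           # length of one action chunk (assumed)

class WorldModel:
    """Stand-in for the data-driven simulator: predicts the next observation."""
    def predict(self, obs: np.ndarray, action: np.ndarray) -> np.ndarray:
        # A real model would run a learned video/dynamics predictor here.
        return obs + 0.01 * np.random.randn(*obs.shape)

class VLAPolicy:
    """Stand-in for the VLA: maps (observation, instruction) to an action chunk."""
    def act(self, obs: np.ndarray, instruction: str) -> np.ndarray:
        return np.random.randn(CHUNK_LEN, ACTION_DIM)

def goal_reward(pred_obs: np.ndarray, goal_obs: np.ndarray) -> float:
    # Dense reward: negative distance between predicted and reference state.
    return -float(np.linalg.norm(pred_obs - goal_obs))

def rollout(policy, wm, obs, goal_obs, instruction, steps=4):
    """Roll the policy out in the world model, collecting dense per-step rewards."""
    rewards = []
    for _ in range(steps):
        chunk = policy.act(obs, instruction)
        for action in chunk:                      # execute the chunk action by action
            obs = wm.predict(obs, action)
            rewards.append(goal_reward(obs, goal_obs))
    return rewards

if __name__ == "__main__":
    obs = np.zeros((64, 64, 3))                   # dummy initial frame
    goal = np.ones((64, 64, 3)) * 0.1             # dummy goal-achieving reference
    rs = rollout(VLAPolicy(), WorldModel(), obs, goal,
                 "put the black bowl in the bottom drawer")
    print(f"trajectory return: {sum(rs):.3f}")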
Performance. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL.
Brief Description. In the pre-training stage, both the world model and the VLA policy are pre-trained; the world model takes a 7-dimensional action input whose format matches the VLA's action output. In the reinforcement fine-tuning stage, the VLA generates action chunks from an initial frame and a language instruction, and these chunks are rolled out in the world model to predict future states. Verified rewards are then computed from the predicted states and used to optimize the VLA with GRPO.
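As a concrete illustration of the shared action format, the toy sketch below decomposes one 7-dimensional action. The particular split into translation, rotation, and gripper components is a common robotics convention and an assumption here, not something specified by the paper.

# Sketch of the shared action format: the VLA emits chunks of 7-dimensional
# actions and the world model consumes the same format. The split below
# (xyz delta, rpy delta, gripper) is an assumed, common convention.
import numpy as np

ACTION_DIM = 7
CHUNK_LEN = 8  # assumed chunk length

def split_action(a: np.ndarray):
    """Decompose one 7-D action into assumed translation / rotation / gripper parts."""
    assert a.shape == (ACTION_DIM,)
    return {"delta_pos": a[:3], "delta_rot": a[3:6], "gripper": a[6]}

chunk = np.random.randn(CHUNK_LEN, ACTION_DIM)   # one action chunk from the VLA
print(split_action(chunk[0]))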
Stage I: WM and Policy Pre-Training. In the first stage, we pretrain the world model on offline datasets so that it can capture environment dynamics. In parallel, we pretrain the VLA policy to produce stable action chunks, which serve as a reliable initialization for subsequent optimization.
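A minimal sketch of the two Stage-I objectives, assuming a next-frame (dynamics) prediction loss for the world model and a behavior-cloning loss for the policy on offline demonstrations; the toy model classes, tensor shapes, and hyperparameters are placeholders, not the paper's architectures.

# Minimal sketch of Stage-I pre-training on offline data: a dynamics-prediction
# loss for the world model and a behavior-cloning loss for the VLA policy.
# Model classes and tensor shapes are illustrative assumptions.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    """Toy dynamics model: predicts the next (flattened) frame from frame + action."""
    def __init__(self, obs_dim=256, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 512),
                                 nn.ReLU(), nn.Linear(512, obs_dim))
    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class TinyVLA(nn.Module):
    """Toy policy head: predicts a 7-D action from an observation embedding."""
    def __init__(self, obs_dim=256, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 512),
                                 nn.ReLU(), nn.Linear(512, act_dim))
    def forward(self, obs):
        return self.net(obs)

wm, policy = TinyWorldModel(), TinyVLA()
opt = torch.optim.Adam(list(wm.parameters()) + list(policy.parameters()), lr=1e-4)

# One offline batch: (obs_t, action_t, obs_{t+1}) triples from demonstrations.
obs, act, next_obs = torch.randn(32, 256), torch.randn(32, 7), torch.randn(32, 256)

wm_loss = nn.functional.mse_loss(wm(obs, act), next_obs)      # dynamics prediction
bc_loss = nn.functional.mse_loss(policy(obs), act)            # behavior cloning
(wm_loss + bc_loss).backward()
opt.step()
print(float(wm_loss), float(bc_loss))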
Stage II: VLA Optimization through WM Interaction. In the second stage, given an initial frame and a language instruction, the VLA rolls out n action chunks. The world model then interactively generates trajectories conditioned on these actions and provides verified rewards. Using these feedback signals, the VLA is fine-tuned with GRPO to progressively improve policy performance.
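A minimal sketch of the GRPO-style update for one (initial frame, instruction) pair, assuming group-normalized verified rewards as advantages and a plain policy-gradient surrogate; the rewards and log-probabilities are faked here, and clipping and KL regularization are omitted.

# Minimal sketch of a GRPO-style update: roll out a group of n action chunks
# in the world model, score each with the verified reward, normalize rewards
# within the group to get advantages, and weight the policy's log-probabilities
# by those advantages. Values below are placeholders.
import torch

n = 8                                            # group size (number of rollouts)
rewards = torch.randn(n)                         # verified rewards from the world model
logprobs = torch.randn(n, requires_grad=True)    # summed log-probs of each action chunk

# Group-relative advantage: how much better each rollout is than its peers.
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Policy-gradient surrogate: push up the probability of above-average rollouts.
loss = -(adv.detach() * logprobs).mean()
loss.backward()
print(f"GRPO surrogate loss: {loss.item():.4f}")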
Under scene perturbations, VLA-RFT quickly detects the change and, through interaction with the world model, corrects its actions and completes the task successfully.
Instruction: "Put the black bowl in the bottom drawer of the cabinet and close it."
Citation.
@misc{li2025vlarftvisionlanguageactionreinforcementfinetuning,
title={VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators},
author={Hengtao Li and Pengxiang Ding and Runze Suo and Yihao Wang and Zirui Ge and Dongyuan Zang and Kexian Yu and Mingyang Sun and Hongyin Zhang and Donglin Wang and Weihua Su},
year={2025},
eprint={2510.00406},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2510.00406},
}