Abstract
Vision-Language-Action (VLA) policies excel at aligning language, perception, and robot control. However, most VLAs are trained purely by imitation, which overfits to demonstrations and is brittle under distribution shift. Reinforcement learning (RL) directly optimizes task reward and thus addresses this misalignment, but real-robot interaction is expensive and conventional simulators are hard to engineer and transfer.
We address both data efficiency and optimization stability in VLA post-training via a learned world model and an RL procedure tailored to flow-based action heads. Specifically, we introduce Prophet, a unified action-to-video robot actuation model pretrained on large-scale, heterogeneous robot data to learn reusable action-outcome dynamics. Prophet can be few-shot adapted to new robots, objects, and environments, yielding a rollout-ready simulator.
Building on Prophet, we reinforce action policies with Flow-action-GRPO (FA-GRPO), which adapts Flow-GRPO to operate on VLA actions, and with FlowScale, a stepwise reweighting that rescales per-step gradients in the flow head. Together, Prophet, FA-GRPO, and FlowScale constitute ProphRL, a practical, data- and compute-efficient path to VLA post-training. Experiments show 5–17% success gains on public benchmarks and 24–30% gains on real robots across different VLA variants.
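At a high level, FA-GRPO computes group-relative advantages from rewards obtained by rolling out sampled action chunks inside Prophet, then applies a clipped policy-gradient update to the per-flow-step log-likelihoods of the action head, with FlowScale rescaling each step's contribution to the gradient. The sketch below is a minimal illustration of that structure; the function names, tensor layout, fixed `flowscale_weights`, and clipping scheme are our assumptions, not the paper's exact formulation.

```python
# Illustrative FA-GRPO-style loss with a FlowScale-like per-step reweighting.
# Names, shapes, and the fixed `flowscale_weights` are assumptions for this sketch.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize rewards within one rollout group.
    rewards: [G] task rewards, e.g. success scores from Prophet rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def fa_grpo_loss(step_logps: torch.Tensor,        # [G, K] per-flow-step log-probs (current policy)
                 old_step_logps: torch.Tensor,    # [G, K] per-flow-step log-probs (behavior policy)
                 advantages: torch.Tensor,        # [G]    group-relative advantages
                 flowscale_weights: torch.Tensor, # [K]    per-step weights (FlowScale-style rescaling)
                 clip_eps: float = 0.2) -> torch.Tensor:
    ratio = (step_logps - old_step_logps).exp()                   # importance ratios per flow step
    adv = advantages.unsqueeze(-1)                                # broadcast over the K flow steps
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_step = torch.minimum(unclipped, clipped)                  # PPO-style clipped objective
    per_step = per_step * flowscale_weights.unsqueeze(0)          # rescale each flow step's gradient
    return -per_step.sum(dim=-1).mean()                           # negate to maximize the objective

# Example call (placeholder tensors): G = 8 rollouts, K = 10 flow steps.
# loss = fa_grpo_loss(torch.randn(8, 10), torch.randn(8, 10),
#                     grpo_advantages(torch.rand(8)), torch.ones(10))
```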
GT Action Rollouts Across Tasks
Interactive Demo: 3-step precise control
From top to bottom, pick the first, second, and third move, then the movement distance; the three moves form one control sequence.
Interactive Demo: BRIDGE multi-task
All six options are action-edited rollouts from the same scene. Pick one button to play its task.
Interactive Demo: 2-step dual-arm control
Try 2-step translations of each arm or 2-step roll / pitch / yaw rotations of both arms.
Action Editing with Prophet
Prophet replays the original dual-arm motion under GT actions (left), and follows an edited sequence where the left gripper is frozen while the right arm keeps moving to the right (right).
With GT actions (left), Prophet rolls out the nominal trajectory. When we edit the action chunk to push the gripper further left (right), the rollout smoothly adjusts end-effector and object motion over time.
Under GT actions, Prophet produces a successful pick-and-place (left). When we edit the actions to keep the gripper open throughout (right), the object is never grasped and the rollout depicts a realistic failure.
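The action-editing demos above boil down to modifying a ground-truth action chunk and re-rendering the scene with the world model. A minimal sketch of that workflow is shown below; the 7-DoF chunk layout, the gripper open/closed convention, and the `prophet.rollout(...)` call are illustrative assumptions rather than the released interface.

```python
# Hypothetical action-editing example: 7-DoF chunks laid out as
# [dx, dy, dz, droll, dpitch, dyaw, gripper], gripper 0.0 = closed, 1.0 = open (assumed).
import numpy as np

T = 16
actions = np.zeros((T, 7), dtype=np.float32)   # stand-in for a ground-truth action chunk
actions[:, 1] = -0.01                          # nominal motion: small lateral translation per step
actions[:, 6] = 0.0                            # gripper closed during the nominal grasp

# Edit 1: keep the gripper open throughout, so the grasp never happens.
open_gripper = actions.copy()
open_gripper[:, 6] = 1.0

# Edit 2: push the end-effector further left by scaling the lateral translation.
push_left = actions.copy()
push_left[:, 1] *= 1.5

# Each edited chunk would then be re-rendered by the world model, e.g. (hypothetical API):
# video = prophet.rollout(context_frames, push_left)
```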
Real-world PlaceBowl: RL discovers new behavior
- Training demos for PlaceBowl only contain left-side grasps.
- The SFT policy occasionally tries a right-side approach, but these cases are rare and unstable.
- Successful right-side trials receive positive reward, and RL with FA-GRPO + FlowScale amplifies this weak mode into a consistent right-side strategy.
- In contrast to SFT imitation, RL can discover and reinforce behaviors only weakly present in the data.
Real-world PulloutTissue
- The SFT policy often drifts sideways, makes weak contact with the tissue edge, and misses the pull.
- ProphRL shows a straighter approach, firmer edge contact, and more successful pulls.
- RL training yields more stable and reliable policies in the real world.
RL post-training results
SimplerEnv (WidowX) on BRIDGE
Single-image VLA policies are first fine-tuned (SFT) on BRIDGE and then post-trained with FA-GRPO and FlowScale inside Prophet. We summarize overall success rates and gains over the SFT baseline.
| Backbone | Training stage | Overall success [%] | Gain vs. SFT |
|---|---|---|---|
| VLA-Adapter-0.5B | SFT only | 23.3 ± 2.2 | – |
| | + FA-GRPO | 38.2 ± 2.4 | +14.9 |
| | + FA-GRPO & FlowScale | 41.0 ± 2.4 | +17.7 |
| Pi0.5-3B | SFT only | 38.9 ± 2.6 | – |
| | + FA-GRPO | 46.9 ± 3.0 | +8.0 |
| | + FA-GRPO & FlowScale | 51.0 ± 1.2 | +12.1 |
| OpenVLA-OFT-7B | SFT only | 25.0 ± 1.8 | – |
| | + FA-GRPO | 29.2 ± 1.8 | +4.2 |
| | + FA-GRPO & FlowScale | 30.9 ± 0.6 | +5.9 |
Real-robot evaluation on UR30e
All policies are first fine-tuned (SFT) on real-robot data, then post-trained with FA-GRPO + FlowScale inside Prophet, and finally deployed on the UR30e arm. We report overall real-world success and absolute gains over SFT.
| Backbone | Training stage | Overall success [%] | Gain vs. SFT |
|---|---|---|---|
| VLA-Adapter-0.5B | SFT only | 35.8 ± 3.1 | – |
| | + FA-GRPO & FlowScale | 60.4 ± 0.7 | +24.6 |
| Pi0.5-3B | SFT only | 52.1 ± 3.8 | – |
| | + FA-GRPO & FlowScale | 82.1 ± 0.7 | +30.0 |
| OpenVLA-OFT-7B | SFT only | 35.4 ± 0.7 | – |
| | + FA-GRPO & FlowScale | 62.9 ± 0.7 | +27.5 |
Bibtex
If you find this project or dataset helpful, please consider citing our paper:
@article{zhang2025prophrl,
  title={Reinforcing Action Policies by Prophesying},
  author={Zhang, Jiahui and Huang, Ze and Gu, Chun and Ma, Zipei and Zhang, Li},
  year={2025},
  journal={arXiv preprint arXiv:2511.20633},
}