📄 Paper Summary
Vision–Language–Action (VLA) policies excel at aligning language, perception, and robot control. However, most VLAs are trained purely by imitation, which overfits to demonstrations and is brittle under distribution shift. Reinforcement learning (RL) directly optimizes task reward and thus addresses this misalignment, but real-robot interaction is expensive and conventional simulators are hard to engineer and transfer. We address both data efficiency and optimization stability in VLA post-training via a learned world model and an RL procedure tailored to flow-based action heads. Specifically, we introduce Prophet, a unified action-to-video robot world model pretrained on large-scale, heterogeneous robot data to learn reusable action–outcome dynamics. It few-shot adapts to new robots, objects, and environments, yielding a rollout-ready simulator. On top of Prophet, we reinforce action policies with Flow-action-GRPO (FA-GRPO), which adapts Flow-GRPO to operate on VLA actions, and with FlowScale, a stepwise reweighting that rescales per-step gradients in the flow head. Together, Prophet, FA-GRPO, and FlowScale constitute ProphRL, a practical, data- and compute-efficient path to VLA post-training. Experiments show 5-17% success gains on public benchmarks and 24-30% gains on real robots across different VLA variants.
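To give a concrete picture of the post-training update, the snippet below is a minimal sketch, assuming the standard GRPO group-normalized advantage, a PPO-style clipped ratio over per-flow-step log-probabilities, and a generic per-step weight vector standing in for FlowScale. The exact FA-GRPO and FlowScale formulations are those in the paper; the names, shapes, and weights here (`fa_grpo_loss`, `step_weights`, the `[G, T]` layout) are purely illustrative.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within a group of rollouts."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def fa_grpo_loss(logp_new, logp_old, advantages, step_weights, clip_eps=0.2):
    """Clipped surrogate over per-rollout, per-flow-step log-probabilities.

    logp_new, logp_old: [G, T] log-probs of the sampled action chunk under the
        current / behavior flow policy at each of T denoising steps.
    advantages: [G] group-relative advantages, one per rollout.
    step_weights: [T] FlowScale-style weights rescaling each step's gradient.
    """
    ratio = np.exp(logp_new - logp_old)                        # [G, T]
    adv = advantages[:, None]                                  # broadcast over steps
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_step = np.minimum(unclipped, clipped)                  # [G, T]
    return -(per_step * step_weights[None, :]).mean()

# Toy usage: 4 rollouts scored by a reward model, 5 flow (denoising) steps.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
G, T = 4, 5
rng = np.random.default_rng(0)
logp_old = rng.normal(size=(G, T))
logp_new = logp_old + 0.05 * rng.normal(size=(G, T))
step_weights = np.linspace(1.0, 0.5, T)   # illustrative per-step rescaling
print(fa_grpo_loss(logp_new, logp_old, adv, step_weights))
```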
GT Action Rollouts Across Tasks
Interactive Demo: 3-step precise control
Interactive Demo: BRIDGE multi-task
Action Editing with Prophet
Prophet replays the original dual-arm motion under GT actions (left), and follows an edited sequence where the left gripper is frozen while the right arm keeps moving to the right (right).
With GT actions (left), Prophet rolls out the nominal trajectory. When we edit the action chunk to push the gripper further left (right), the rollout smoothly adjusts end-effector and object motion over time.
Under GT actions, Prophet produces a successful pick-and-place (left). When we edit the actions to keep the gripper open throughout (right), the object is never grasped and the rollout depicts a realistic failure.
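To make the editing interface concrete, here is a minimal sketch of how an action chunk might be modified before being re-rolled through the world model. The `[T, 7]` layout (xyz translation, rpy rotation, gripper), the helper names, and the `prophet.rollout` call are hypothetical stand-ins, not the released API.

```python
import numpy as np

def freeze_gripper(actions, open_value=1.0, gripper_dim=-1):
    """Hold the gripper open for the whole chunk (the 'never grasp' edit)."""
    edited = actions.copy()
    edited[:, gripper_dim] = open_value
    return edited

def shift_translation(actions, axis=0, delta=-0.02):
    """Push every end-effector translation command further along one axis."""
    edited = actions.copy()
    edited[:, axis] += delta
    return edited

# Hypothetical usage: edit a [T, 7] action chunk, then re-roll the world model.
chunk = np.zeros((16, 7))
edited = shift_translation(freeze_gripper(chunk), axis=0, delta=-0.02)
# frames = prophet.rollout(initial_frame, edited)   # hypothetical world-model call
```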
Real-world PlaceBowl: RL discovers new behavior
- Training demonstrations for PlaceBowl only show left-side grasps.
- The SFT policy occasionally samples a right-side approach, but these rare trajectories are unstable and unreliable.
- Whenever a right-side attempt succeeds, the reward model assigns a positive signal.
- RL with FA-GRPO + FlowScale amplifies this weak mode, turning the right-side strategy into a consistent and reliable behavior.
- This highlights a key difference: SFT imitates the dataset, whereas RL can discover and reinforce behaviors that are only weakly expressed in demonstrations, as sketched below.
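As a toy illustration of this amplification effect (not the paper's actual rollouts or numbers), group-normalized advantages assign a rare successful right-side attempt a large positive weight, so the policy gradient pushes probability mass toward it:

```python
import numpy as np

# 8 sampled rollouts for PlaceBowl: seven left-side attempts (one succeeds)
# and a single right-side attempt that happens to succeed.
rewards = np.array([0., 0., 1., 0., 0., 0., 0., 1.])   # last entry = right-side success
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(adv[-1])   # ~ +1.73: the rare successful mode receives a strong positive advantage
```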
Real-world PulloutTissue
RL post-training results
SimplerEnv (WidowX) on BRIDGE
Single-image VLA policies are first fine-tuned with SFT on BRIDGE and then post-trained with FA-GRPO and FlowScale inside Prophet. We report overall success and the gain over the SFT-only baseline.
| Backbone | Training stage | Overall success [%] | Gain vs. SFT [pp] |
|---|---|---|---|
| VLA-Adapter-0.5B | SFT only | 23.3 ± 2.2 | — |
| | + FA-GRPO | 38.2 ± 2.4 | +14.9 |
| | + FA-GRPO & FlowScale | 41.0 ± 2.4 | +17.7 |
| Pi0.5-3B | SFT only | 38.9 ± 2.6 | — |
| | + FA-GRPO | 46.9 ± 3.0 | +8.0 |
| | + FA-GRPO & FlowScale | 51.0 ± 1.2 | +12.1 |
| OpenVLA-OFT-7B | SFT only | 25.0 ± 1.8 | — |
| | + FA-GRPO | 29.2 ± 1.8 | +4.2 |
| | + FA-GRPO & FlowScale | 30.9 ± 0.6 | +5.9 |
Real-robot evaluation on UR30e
All policies are first fine-tuned with SFT on real-robot data, post-trained with FA-GRPO + FlowScale inside Prophet, and finally deployed on the UR30e arm. We report overall real-world success and absolute gains over the SFT-only baseline.
| Backbone | Training stage | Overall success [%] | Gain vs. SFT [pp] |
|---|---|---|---|
| VLA-Adapter-0.5B | SFT only | 35.8 ± 3.1 | — |
| | + FA-GRPO & FlowScale | 60.4 ± 0.7 | +24.6 |
| Pi0.5-3B | SFT only | 52.1 ± 3.8 | — |
| | + FA-GRPO & FlowScale | 82.1 ± 0.7 | +30.0 |
| OpenVLA-OFT-7B | SFT only | 35.4 ± 0.7 | — |
| | + FA-GRPO & FlowScale | 62.9 ± 0.7 | +27.5 |
BibTeX
If you find this project or dataset helpful, please consider citing our paper:
@article{zhang2025prophrl,
  title={Reinforcing Action Policies by Prophesying},
  author={Zhang, Jiahui and Huang, Ze and Gu, Chun and Ma, Zipei and Zhang, Li},
  year={2025},
  journal={arXiv preprint arXiv:2511.20633},
}