Teaser Figure

📄 Paper Summary

Vision–Language–Action (VLA) policies excel at aligning language, perception, and robot control. However, most VLAs are trained purely by imitation, which overfits to demonstrations and is brittle under distribution shift. Reinforcement learning (RL) directly optimizes task reward and thus addresses this misalignment, but real-robot interaction is expensive and conventional simulators are hard to engineer and transfer. We address both data efficiency and optimization stability in VLA post-training via a learned world model and an RL procedure tailored to flow-based action heads. Specifically, we introduce Prophet, a unified action-to-video robot world model pretrained on large-scale, heterogeneous robot data to learn reusable action–outcome dynamics. It few-shot adapts to new robots, objects, and environments, yielding a rollout-ready simulator. On top of Prophet, we reinforce action policies with Flow-action-GRPO (FA-GRPO), which adapts Flow-GRPO to operate on VLA actions, and with FlowScale, a stepwise reweighting that rescales per-step gradients in the flow head. Together, Prophet, FA-GRPO, and FlowScale constitute ProphRL, a practical, data- and compute-efficient path to VLA post-training. Experiments show 5-17% success-rate gains on public benchmarks and 24-30% gains on real robots across different VLA variants.
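
As a rough illustration of how FA-GRPO and FlowScale fit together, the sketch below applies a GRPO-style clipped surrogate to sampled action chunks and rescales each denoising step of the flow head with a per-step weight. This is a minimal sketch under assumed names and tensor shapes, not the paper's implementation; the actual FlowScale weighting scheme is not specified here and the weights are left as an input.

      import torch

      def fa_grpo_flowscale_loss(logp_new, logp_old, advantages, step_weights, clip_eps=0.2):
          # logp_new, logp_old: [G, T] per-denoising-step log-probs of the sampled
          # action chunks under the current and behaviour policy (assumed shapes).
          # advantages: [G] group-relative advantages, one per rollout.
          # step_weights: [T] FlowScale-style weights rescaling each step's gradient.
          ratio = torch.exp(logp_new - logp_old)                      # per-step importance ratios
          adv = advantages.unsqueeze(-1)                              # broadcast to [G, T]
          unclipped = ratio * adv
          clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
          surrogate = -torch.minimum(unclipped, clipped)              # clipped policy-gradient surrogate
          return (surrogate * step_weights).mean()                    # stepwise reweighting of the flow head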

GT Action Rollouts Across Tasks

Six tasks are shown: (1) Put yellow cube, (2) Move carrot, (3) Pick up spoon, (4) Take lid, (5) Close drawer, and (6) Fold cloths. For each task, rollouts driven by the same ground-truth actions are shown side by side for the real video, Cosmos, Genie-Envisioner, and Prophet (ours).

Interactive Demo: 3-step precise control

Viewer panels: initial frame, first move, second move, third move, and distance.

Interactive Demo: BRIDGE multi-task

Viewer panels: initial frame and sub-task selection.

Action Editing with Prophet

AgiBot: dual-arm manipulation
GT actions
Edited actions

Prophet replays the original dual-arm motion under GT actions (left), and follows an edited sequence where the left gripper is frozen while the right arm keeps moving to the right (right).

Open-X: precise lateral motion
GT actions
Edited actions

With GT actions (left), Prophet rolls out the nominal trajectory. When we edit the action chunk to push the gripper further left (right), the rollout smoothly adjusts end-effector and object motion over time.

DROID: grasping under gripper edits
GT actions
Edited actions

Under GT actions, Prophet produces a successful pick-and-place (left). When we edit the actions to keep the gripper open throughout (right), the object is never grasped and the rollout depicts a realistic failure.
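
The editing demos above amount to modifying an action chunk and re-rolling it through the world model. The minimal sketch below illustrates the gripper-open edit from the DROID example; the Prophet call and the action layout (gripper command in the last dimension) are placeholders, not the released interface.

      import numpy as np

      def keep_gripper_open(actions, open_value=1.0):
          # actions: [T, D] action chunk; the gripper command is assumed to be the
          # last dimension (an illustrative convention, not the actual layout).
          edited = actions.copy()
          edited[:, -1] = open_value        # force the gripper to stay open for all steps
          return edited

      # Compare nominal and edited rollouts (prophet.rollout is a stand-in name):
      # video_gt     = prophet.rollout(initial_frame, actions)
      # video_edited = prophet.rollout(initial_frame, keep_gripper_open(actions))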

Real-world PlaceBowl: RL discovers new behavior

Training demonstrations
SFT policy (Pi0.5)
ProphRL (Ours)
  • Training demonstrations for PlaceBowl only show left-side grasps.
  • The SFT policy occasionally samples a right-side approach, but these rare trajectories are unstable and unreliable.
  • Whenever a right-side attempt succeeds, the reward model assigns a positive signal.
  • RL with FA-GRPO + FlowScale amplifies this weak mode, turning the right-side strategy into a consistent and reliable behavior (see the sketch after this list).
  • This highlights a key difference: SFT imitates the dataset, whereas RL can discover and reinforce behaviors that are only weakly expressed in demonstrations.
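
A toy calculation of group-relative advantages, the GRPO ingredient that FA-GRPO inherits, shows why a single successful right-side attempt is enough to be reinforced; the group size and binary reward below are illustrative assumptions.

      import numpy as np

      # Suppose 1 of 8 sampled rollouts tries the rare right-side grasp and succeeds.
      rewards = np.array([0., 0., 0., 0., 0., 0., 0., 1.])
      advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
      print(advantages)
      # The lone success receives a large positive advantage (about +2.65) while the
      # failures receive mildly negative ones, so each update pushes probability mass
      # toward the rare right-side mode relative to the rest of the group.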

Real-world PulloutTissue

SFT policy – trial 1
SFT policy – trial 2
ProphRL (Ours) – trial 1
ProphRL (Ours) – trial 2

RL post-training results

SimplerEnv (WidowX) on BRIDGE

Single-image VLA policies are first fine-tuned with SFT on BRIDGE and then post-trained with FA-GRPO and FlowScale inside Prophet. We report overall success rates and absolute gains over the SFT baseline.

Backbone            Training stage           Overall success [%]   Gain vs. SFT
VLA-Adapter-0.5B    SFT only                 23.3 ± 2.2            –
VLA-Adapter-0.5B    + FA-GRPO                38.2 ± 2.4            +14.9
VLA-Adapter-0.5B    + FA-GRPO & FlowScale    41.0 ± 2.4            +17.7
Pi0.5-3B            SFT only                 38.9 ± 2.6            –
Pi0.5-3B            + FA-GRPO                46.9 ± 3.0            +8.0
Pi0.5-3B            + FA-GRPO & FlowScale    51.0 ± 1.2            +12.1
OpenVLA-OFT-7B      SFT only                 25.0 ± 1.8            –
OpenVLA-OFT-7B      + FA-GRPO                29.2 ± 1.8            +4.2
OpenVLA-OFT-7B      + FA-GRPO & FlowScale    30.9 ± 0.6            +5.9
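
For context, the post-training behind these numbers can be pictured as sampling groups of action chunks from the SFT policy, rolling them out inside Prophet, scoring the imagined videos, and applying the FA-GRPO + FlowScale update. The pseudocode below is a hedged sketch; every name (policy.sample, prophet.rollout, reward_model, fa_grpo_update) is a placeholder rather than a released API.

      def post_train(policy, prophet, reward_model, prompts, group_size=8, iters=1000):
          # Hedged sketch of world-model RL post-training; all callables are placeholders.
          for _ in range(iters):
              for frame, instruction in prompts:
                  group = []
                  for _ in range(group_size):
                      actions = policy.sample(frame, instruction)   # flow-based action head
                      video = prophet.rollout(frame, actions)       # imagined outcome in the world model
                      group.append((actions, reward_model(video, instruction)))
                  fa_grpo_update(policy, group)                     # FA-GRPO + FlowScale update
          return policy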

Real-robot evaluation on UR30e

All policies are first fine-tuned with SFT on real-robot data, then post-trained with FA-GRPO + FlowScale inside Prophet, and finally deployed on a UR30e arm. We report overall real-world success rates and absolute gains over the SFT baseline.

Backbone            Training stage           Overall success [%]   Gain vs. SFT
VLA-Adapter-0.5B    SFT only                 35.8 ± 3.1            –
VLA-Adapter-0.5B    + FA-GRPO & FlowScale    60.4 ± 0.7            +24.6
Pi0.5-3B            SFT only                 52.1 ± 3.8            –
Pi0.5-3B            + FA-GRPO & FlowScale    82.1 ± 0.7            +30.0
OpenVLA-OFT-7B      SFT only                 35.4 ± 0.7            –
OpenVLA-OFT-7B      + FA-GRPO & FlowScale    62.9 ± 0.7            +27.5

Bibtex

If you find this project or dataset helpful, please consider citing our paper:


      @article{zhang2025prophrl,
        title={Reinforcing Action Policies by Prophesying},
        author={Zhang, Jiahui and Huang, Ze and Gu, Chun and Ma, Zipei and Zhang, Li},
        journal={arXiv preprint arXiv:2511.20633},
        year={2025}
      }