Teaser Figure

📄 Abstract

Vision–Language–Action (VLA) policies excel at aligning language, perception, and robot control. However, most VLAs are trained purely by imitation, which overfits to demonstrations and is brittle under distribution shift. Reinforcement learning (RL) directly optimizes task reward and thus addresses this mismatch between the imitation objective and task success, but real-robot interaction is expensive and conventional simulators are hard to engineer and transfer.

We address both data efficiency and optimization stability in VLA post-training via a learned world model and an RL procedure tailored to flow-based action heads. Specifically, we introduce Prophet, a unified action-to-video robot actuation model pretrained on large-scale, heterogeneous robot data to learn reusable action–outcome dynamics. Prophet can be few-shot adapted to new robots, objects, and environments, yielding a rollout-ready simulator.
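To make "rollout-ready simulator" concrete, the sketch below shows the kind of interface an action-conditioned world model can expose to an RL loop: reset with an initial frame and instruction, then step with action chunks to obtain predicted frames and rewards. The class and method names (ProphetWorldModel, RolloutStep, reset, step) are illustrative placeholders, not the released Prophet API.

```python
# Minimal sketch of an action-conditioned world model used as a rollout
# simulator. All names (ProphetWorldModel, RolloutStep, reset, step) are
# hypothetical placeholders, not the released Prophet API.
from dataclasses import dataclass
import numpy as np


@dataclass
class RolloutStep:
    frame: np.ndarray   # predicted RGB observation, (H, W, 3)
    reward: float       # task reward from a learned or scripted reward fn
    done: bool


class ProphetWorldModel:
    """Action-to-video dynamics: maps (history, action chunk) -> future frames."""

    def __init__(self, horizon: int = 16):
        self.horizon = horizon
        self._t = 0

    def reset(self, init_frame: np.ndarray, instruction: str) -> np.ndarray:
        self._t = 0
        self._frame = init_frame
        return init_frame

    def step(self, action_chunk: np.ndarray) -> RolloutStep:
        # A real model would decode future video frames conditioned on the
        # action chunk; here we return the current frame as a stand-in.
        self._t += 1
        done = self._t >= self.horizon
        return RolloutStep(frame=self._frame, reward=0.0, done=done)


# Usage: roll a policy's action chunks through the learned simulator.
wm = ProphetWorldModel(horizon=8)
obs = wm.reset(np.zeros((224, 224, 3), dtype=np.uint8), "pick up the yellow cube")
while True:
    action_chunk = np.zeros((8, 7))  # e.g. 8 steps x 7-DoF actions
    out = wm.step(action_chunk)
    if out.done:
        break
```

Few-shot adaptation would correspond to fine-tuning such a model on a small amount of data from the new robot or scene before using it for rollouts.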

On top of Prophet, we reinforce action policies with Flow-action-GRPO (FA-GRPO), which adapts Flow-GRPO to operate on VLA actions, and with FlowScale, a stepwise reweighting that rescales per-step gradients in the flow head. Together, Prophet, FA-GRPO, and FlowScale constitute ProphRL, a practical, data- and compute-efficient path to VLA post-training. Experiments show 5–17% success gains on public benchmarks and 24–30% gains on real robots across different VLA variants.
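The sketch below illustrates the two RL ingredients named above: FA-GRPO's group-relative advantages, computed over rewards from world-model rollouts, and a FlowScale-style per-step reweighting applied across the flow head's denoising steps. The weighting schedule and the surrogate objective are simplified placeholders, not the paper's exact formulation.

```python
# Sketch of the two RL ingredients described above, in plain NumPy.
# Group-relative advantages follow the GRPO recipe; the per-step weights
# stand in for FlowScale, and their schedule here is an illustrative
# placeholder rather than the paper's exact formula.
import numpy as np


def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantage: normalize each rollout's reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def flowscale_weights(num_flow_steps: int) -> np.ndarray:
    """Per-denoising-step weights for the flow head (placeholder schedule)."""
    t = np.linspace(0.0, 1.0, num_flow_steps)
    w = 1.0 / (1.0 + t)                  # down-weight later steps, for illustration
    return w / w.sum() * num_flow_steps  # keep the average weight at 1


def fa_grpo_objective(logps: np.ndarray, rewards: np.ndarray) -> float:
    """Surrogate objective for one group of rollouts.

    logps: (G, K) log-prob terms of the flow head, G rollouts x K flow steps.
    rewards: (G,) scalar task rewards from world-model rollouts.
    """
    adv = group_relative_advantages(rewards)  # (G,)
    w = flowscale_weights(logps.shape[1])     # (K,)
    per_step = logps * w[None, :]             # rescale each flow step's contribution
    return float((adv[:, None] * per_step).sum(axis=1).mean())


# Toy usage: 8 rollouts, 10 flow steps, one success in the group.
rng = np.random.default_rng(0)
logps = rng.normal(size=(8, 10))
rewards = np.array([1.0] + [0.0] * 7)
print(fa_grpo_objective(logps, rewards))
```

In an actual training loop, the log-prob terms would come from the policy's flow-based action head and the surrogate would be differentiated with an autograd framework rather than evaluated in NumPy.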

GT Action Rollouts Across Tasks

Task 1: Pick up the yellow cube
Real video
Cosmos
Genie-Envisioner
Prophet (ours)
Task 2: Move the carrot
Real video
Cosmos
Genie-Envisioner
Prophet (ours)
Task 3: Pick up the spoon
Real video
Cosmos
Genie-Envisioner
Prophet (ours)
Task 4: Take the lid
Real video
Cosmos
Genie-Envisioner
Prophet (ours)
Task 5: Close the drawer
Real video
Cosmos
Genie-Envisioner
Prophet (ours)
Task 6: Fold the cloths
Real video
Cosmos
Genie-Envisioner
Prophet (ours)

Interactive Demo: 3-step precise control

From top to bottom, pick the 1st, 2nd, and 3rd moves, then the movement distance. Together, the three moves form one control sequence.

Initial frame
Choose a control sequence

Interactive Demo: BRIDGE multi task

All six options are action-edited rollouts from the same scene. Click a button to play the corresponding task.

Initial frame
Choose a sub-task

Interactive Demo: 2-step dual-arm control

Try 2-step translations of each arm or 2-step roll / pitch / yaw rotations of both arms.

Initial frame
Control the left arm only
Control the right arm only
Rotate both arms (roll / pitch / yaw)

Action Editing with Prophet

AgiBot: dual-arm manipulation
GT actions
Edited actions

Prophet replays the original dual-arm motion under GT actions (left), and follows an edited sequence where the left gripper is frozen while the right arm keeps moving to the right (right).

Open-X: precise lateral motion
GT actions
Edited actions

With GT actions (left), Prophet rolls out the nominal trajectory. When we edit the action chunk to push the gripper further left (right), the rollout smoothly adjusts end-effector and object motion over time.

DROID: grasping under gripper edits
GT actions
Edited actions

Under GT actions, Prophet produces a successful pick-and-place (left). When we edit the actions to keep the gripper open throughout (right), the object is never grasped and the rollout depicts a realistic failure.
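The sketch below mirrors the edits used in these examples: one helper keeps the gripper open for the whole chunk, another exaggerates the lateral translation. The per-step layout [dx, dy, dz, droll, dpitch, dyaw, gripper] and the index constants are assumptions for illustration; the actual action convention depends on the robot and dataset.

```python
# Sketch of the action-chunk edits shown above. The per-step layout
# [dx, dy, dz, droll, dpitch, dyaw, gripper] is an assumed convention for
# illustration; the real action space depends on the robot and dataset.
import numpy as np

GRIPPER, DY = 6, 1  # assumed indices: gripper command and lateral translation


def keep_gripper_open(chunk: np.ndarray, open_value: float = 1.0) -> np.ndarray:
    """Edit used in the DROID example: never close the gripper."""
    edited = chunk.copy()
    edited[:, GRIPPER] = open_value
    return edited


def push_further_left(chunk: np.ndarray, extra: float = 0.02) -> np.ndarray:
    """Edit used in the Open-X example: exaggerate the lateral motion."""
    edited = chunk.copy()
    edited[:, DY] -= extra  # assume -y is "left" in the camera frame
    return edited


# Usage: edit a ground-truth chunk, then roll both versions through the
# world model (rollout call omitted here) and compare the predicted videos.
gt_chunk = np.zeros((16, 7))
open_chunk = keep_gripper_open(gt_chunk)
left_chunk = push_further_left(gt_chunk)
```

Feeding the GT and edited chunks through the same world model and comparing the predicted videos is exactly the comparison shown in the paired clips above.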

Real world PlaceBowl: RL discovers new behavior

Training demonstrations
SFT policy (Pi0.5)
ProphRL (Ours)
  • The training demos for PlaceBowl contain only left-side grasps.
  • The SFT policy occasionally tries a right-side approach, but these cases are rare and unstable.
  • Successful right-side trials receive positive reward, and RL with FA-GRPO + FlowScale amplifies this weak mode into a consistent right-side strategy.
  • In contrast to SFT imitation, RL can discover and reinforce behaviors that are only weakly present in the data, as illustrated by the toy advantage calculation below.
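A toy calculation of group-relative advantages (numbers invented for illustration) shows why a single successful right-side rollout receives a large positive weight within its group:

```python
# Toy illustration of why a rare successful mode gets amplified: with
# group-relative advantages, one success among mostly-failed rollouts
# receives a large positive weight. Numbers are made up for illustration.
import numpy as np

rewards = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # 1 of 8 succeeds
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(adv.round(2))
# -> [ 2.65 -0.38 -0.38 -0.38 -0.38 -0.38 -0.38 -0.38]
# The lone right-side success is pushed up hard; the common failures are
# pushed down only slightly, so the weak mode is steadily reinforced.
```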

Real world PulloutTissue

SFT policy – trial 1
SFT policy – trial 2
ProphRL (Ours) – trial 1
ProphRL (Ours) – trial 2
  • The SFT policy often drifts sideways, makes weak contact with the tissue edge, and misses the pull.
  • ProphRL takes a straighter approach, makes firmer edge contact, and completes the pull more often.
  • RL training yields more stable and reliable policies in the real world.

RL post-training results

SimplerEnv (WidowX) on BRIDGE

Single-image VLA policies are first fine-tuned with SFT on BRIDGE and then post-trained with FA-GRPO and FlowScale inside Prophet. We report overall success rates and absolute gains over SFT.

| Backbone | Training stage | Overall success [%] | Gain vs. SFT |
|---|---|---|---|
| VLA-Adapter-0.5B | SFT only | 23.3 ± 2.2 | — |
| VLA-Adapter-0.5B | + FA-GRPO | 38.2 ± 2.4 | +14.9 |
| VLA-Adapter-0.5B | + FA-GRPO & FlowScale | 41.0 ± 2.4 | +17.7 |
| Pi0.5-3B | SFT only | 38.9 ± 2.6 | — |
| Pi0.5-3B | + FA-GRPO | 46.9 ± 3.0 | +8.0 |
| Pi0.5-3B | + FA-GRPO & FlowScale | 51.0 ± 1.2 | +12.1 |
| OpenVLA-OFT-7B | SFT only | 25.0 ± 1.8 | — |
| OpenVLA-OFT-7B | + FA-GRPO | 29.2 ± 1.8 | +4.2 |
| OpenVLA-OFT-7B | + FA-GRPO & FlowScale | 30.9 ± 0.6 | +5.9 |

Real-robot evaluation on UR30e

All policies are first fine-tuned with SFT on real-robot data, then post-trained with FA-GRPO + FlowScale inside Prophet, and finally deployed on the UR30e arm. We report overall real-world success rates and absolute gains over SFT.

| Backbone | Training stage | Overall success [%] | Gain vs. SFT |
|---|---|---|---|
| VLA-Adapter-0.5B | SFT only | 35.8 ± 3.1 | — |
| VLA-Adapter-0.5B | + FA-GRPO & FlowScale | 60.4 ± 0.7 | +24.6 |
| Pi0.5-3B | SFT only | 52.1 ± 3.8 | — |
| Pi0.5-3B | + FA-GRPO & FlowScale | 82.1 ± 0.7 | +30.0 |
| OpenVLA-OFT-7B | SFT only | 35.4 ± 0.7 | — |
| OpenVLA-OFT-7B | + FA-GRPO & FlowScale | 62.9 ± 0.7 | +27.5 |

Bibtex

If you find this project or dataset helpful, please consider citing our paper:


      @article{zhang2025prophrl,
        title={Reinforcing Action Policies by Prophesying},
        author={Zhang, Jiahui and Huang, Ze and Gu, Chun and Ma, Zipei and Zhang, Li},
        journal={arXiv preprint arXiv:2511.20633},
        year={2025}
      }