Abstract
Vision-Language-Action (VLA) policies excel at aligning language, perception, and robot control. However, most VLAs are trained purely by imitation, which overfits to demonstrations and is brittle under distribution shift. Reinforcement learning (RL) directly optimizes task reward and thus addresses this misalignment, but real-robot interaction is expensive and conventional simulators are hard to engineer and transfer.
We address both data efficiency and optimization stability in VLA post-training via a learned world model and an RL procedure tailored to flow-based action heads. Specifically, we introduce Prophet, a unified action-to-video robot actuation model pretrained on large-scale, heterogeneous robot data to learn reusable action-outcome dynamics. Prophet can be few-shot adapted to new robots, objects, and environments, yielding a rollout-ready simulator.
Building on Prophet, we reinforce action policies with Flow-action-GRPO (FA-GRPO), which adapts Flow-GRPO to operate on VLA actions, and with FlowScale, a stepwise reweighting that rescales per-step gradients in the flow head. Together, Prophet, FA-GRPO, and FlowScale constitute ProphRL, a practical, data- and compute-efficient path to VLA post-training. Experiments show 5–17% success gains on public benchmarks and 24–30% gains on real robots across different VLA variants.
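At a high level, FA-GRPO computes group-relative advantages from rewards obtained by rolling out sampled action chunks inside Prophet, then applies a clipped policy-gradient update to the per-flow-step log-likelihoods of the action head, with FlowScale rescaling each step's contribution to the gradient. The sketch below is a minimal illustration of that structure; the function names, tensor layout, fixed `flowscale_weights`, and clipping scheme are our assumptions, not the paper's exact formulation.

```python
# Illustrative FA-GRPO-style loss with a FlowScale-like per-step reweighting.
# Names, shapes, and the fixed `flowscale_weights` are assumptions for this sketch.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize rewards within one rollout group.
    rewards: [G] task rewards, e.g. success scores from Prophet rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def fa_grpo_loss(step_logps: torch.Tensor,        # [G, K] per-flow-step log-probs (current policy)
                 old_step_logps: torch.Tensor,    # [G, K] per-flow-step log-probs (behavior policy)
                 advantages: torch.Tensor,        # [G]    group-relative advantages
                 flowscale_weights: torch.Tensor, # [K]    per-step weights (FlowScale-style rescaling)
                 clip_eps: float = 0.2) -> torch.Tensor:
    ratio = (step_logps - old_step_logps).exp()                   # importance ratios per flow step
    adv = advantages.unsqueeze(-1)                                # broadcast over the K flow steps
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_step = torch.minimum(unclipped, clipped)                  # PPO-style clipped objective
    per_step = per_step * flowscale_weights.unsqueeze(0)          # rescale each flow step's gradient
    return -per_step.sum(dim=-1).mean()                           # negate to maximize the objective

# Example call (placeholder tensors): G = 8 rollouts, K = 10 flow steps.
# loss = fa_grpo_loss(torch.randn(8, 10), torch.randn(8, 10),
#                     grpo_advantages(torch.rand(8)), torch.ones(10))
```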
GT Action Rollouts Across Tasks
Interactive Demo: 3-step precise control
From top to bottom, pick the first, second, and third move, then the movement distance; the three moves form one control sequence.
Interactive Demo: BRIDGE multi-task
All six options are action-edited rollouts from the same scene. Pick one button to play its task.
Interactive Demo: 2-step dual-arm control
Try 2-step translations of each arm or 2-step roll / pitch / yaw rotations of both arms.
Action Editing with Prophet
Prophet replays the original dual-arm motion under GT actions (left), and follows an edited sequence where the left gripper is frozen while the right arm keeps moving to the right (right).
With GT actions (left), Prophet rolls out the nominal trajectory. When we edit the action chunk to push the gripper further left (right), the rollout smoothly adjusts end-effector and object motion over time.
Under GT actions, Prophet produces a successful pick-and-place (left). When we edit the actions to keep the gripper open throughout (right), the object is never grasped and the rollout depicts a realistic failure.
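The action-editing demos above boil down to modifying a ground-truth action chunk and re-rendering the scene with the world model. A minimal sketch of that workflow is shown below; the 7-DoF chunk layout, the gripper open/closed convention, and the `prophet.rollout(...)` call are illustrative assumptions rather than the released interface.

```python
# Hypothetical action-editing example: 7-DoF chunks laid out as
# [dx, dy, dz, droll, dpitch, dyaw, gripper], gripper 0.0 = closed, 1.0 = open (assumed).
import numpy as np

T = 16
actions = np.zeros((T, 7), dtype=np.float32)   # stand-in for a ground-truth action chunk
actions[:, 1] = -0.01                          # nominal motion: small lateral translation per step
actions[:, 6] = 0.0                            # gripper closed during the nominal grasp

# Edit 1: keep the gripper open throughout, so the grasp never happens.
open_gripper = actions.copy()
open_gripper[:, 6] = 1.0

# Edit 2: push the end-effector further left by scaling the lateral translation.
push_left = actions.copy()
push_left[:, 1] *= 1.5

# Each edited chunk would then be re-rendered by the world model, e.g. (hypothetical API):
# video = prophet.rollout(context_frames, push_left)
```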
Real-world PlaceBowl: RL discovers new behavior
- Training demos for PlaceBowl only contain left-side grasps.
- The SFT policy occasionally tries a right-side approach, but these cases are rare and unstable.
- Successful right-side trials receive positive reward, and RL with FA-GRPO + FlowScale amplifies this weak mode into a consistent right-side strategy.
- In contrast to SFT imitation, RL can discover and reinforce behaviors only weakly present in the data.
Real-world PulloutTissue
- The SFT policy often drifts sideways, makes weak contact with the tissue edge, and misses the pull.
- ProphRL shows a straighter approach, firmer edge contact, and more successful pulls.
- RL training yields more stable and reliable policies in the real world.
RL post-training results
SimplerEnv (WidowX) on BRIDGE
Single-image VLA policies are first fine-tuned (SFT) on BRIDGE and then post-trained with FA-GRPO and FlowScale inside Prophet. We summarize overall success rates and gains over the SFT baseline.
| Backbone | Training stage | Overall success [%] | Gain vs. SFT |
|---|---|---|---|
| VLA-Adapter-0.5B | SFT only | 23.3 ± 2.2 | – |
| | + FA-GRPO | 38.2 ± 2.4 | +14.9 |
| | + FA-GRPO & FlowScale | 41.0 ± 2.4 | +17.7 |
| Pi0.5-3B | SFT only | 38.9 ± 2.6 | – |
| | + FA-GRPO | 46.9 ± 3.0 | +8.0 |
| | + FA-GRPO & FlowScale | 51.0 ± 1.2 | +12.1 |
| OpenVLA-OFT-7B | SFT only | 25.0 ± 1.8 | – |
| | + FA-GRPO | 29.2 ± 1.8 | +4.2 |
| | + FA-GRPO & FlowScale | 30.9 ± 0.6 | +5.9 |
Real-robot evaluation on UR30e
All policies are first fine-tuned (SFT) on real-robot data, then post-trained with FA-GRPO + FlowScale inside Prophet, and finally deployed on the UR30e arm. We report overall real-world success and absolute gains over SFT.
| Backbone | Training stage | Overall success [%] | Gain vs. SFT |
|---|---|---|---|
| VLA-Adapter-0.5B | SFT only | 35.8 ± 3.1 | – |
| | + FA-GRPO & FlowScale | 60.4 ± 0.7 | +24.6 |
| Pi0.5-3B | SFT only | 52.1 ± 3.8 | – |
| | + FA-GRPO & FlowScale | 82.1 ± 0.7 | +30.0 |
| OpenVLA-OFT-7B | SFT only | 35.4 ± 0.7 | – |
| | + FA-GRPO & FlowScale | 62.9 ± 0.7 | +27.5 |
Bibtex
If you find this project or dataset helpful, please consider citing our paper:
@article{zhang2025prophrl,
  title={Reinforcing Action Policies by Prophesying},
  author={Zhang, Jiahui and Huang, Ze and Gu, Chun and Ma, Zipei and Zhang, Li},
  year={2025},
  journal={arXiv preprint arXiv:2511.20633},
}