MoLA | From Imagined Futures to Executable Actions

Abstract

Action-space grounding for imagined futures.

Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, and both suffer from a mismatch between visual realism and control relevance: predicted observations emphasize perceptual fidelity rather than the action-centric causes of state transitions, which leads to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.

Method

From video imagination to executable latent actions.

MoLA converts predicted visual futures into action-centric representations with a mixture of modality-aware inverse dynamics models, then decodes the inferred latent actions into robot controls.

Comparison among VLA models, Video-Action models, and MoLA. — **Figure 1.** MoLA differs from direct VLA policies and Video-Action pipelines by inserting a latent-action interface between video imagination and execution. The mixture of inverse dynamics models infers actions implied by generated visual transitions, providing a control-oriented bridge to the action head.

Overview of MoLA pretraining and fine-tuning. — **Figure 2.** During pretraining, flow-aware, semantic-aware, and depth-aware inverse dynamics models learn discrete latent actions from pairs of current and future frames. During fine-tuning, the frozen video generation model imagines future rollouts, the mixture of inverse dynamics models (MoIDM) extracts a mixture of latent actions from them, and the action head generates executable action sequences.
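The data flow described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the authors' implementation: frames are flat lists of floats, `ToyIDM` stands in for a pretrained modality-aware inverse dynamics model, the codebook is hand-written instead of learned, and `toy_action_head` is a placeholder for the real action head. It only shows how per-modality discrete latent actions could be read off consecutive imagined frames and handed to a decoder.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def nearest_code(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to `feature` in squared L2 distance."""
    def sq_dist(code: Vector) -> float:
        return sum((f - c) ** 2 for f, c in zip(feature, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


class ToyIDM:
    """Hypothetical single-modality inverse dynamics model (illustration only)."""

    def __init__(self, extract: Callable[[Vector], Vector], codebook: Sequence[Vector]):
        self.extract = extract    # stand-in for a flow / semantic / depth encoder
        self.codebook = codebook  # stand-in for a learned discrete codebook

    def infer(self, current: Vector, future: Vector) -> int:
        # Quantize the change in modality features between the two frames.
        cur, fut = self.extract(current), self.extract(future)
        delta = [b - a for a, b in zip(cur, fut)]
        return nearest_code(delta, self.codebook)


def mixture_of_latent_actions(
    idms: Dict[str, ToyIDM], rollout: Sequence[Vector]
) -> List[Dict[str, int]]:
    """One dict of per-modality latent-action codes per consecutive frame pair."""
    return [
        {name: idm.infer(cur, fut) for name, idm in idms.items()}
        for cur, fut in zip(rollout, rollout[1:])
    ]


def toy_action_head(
    idms: Dict[str, ToyIDM], latents: Sequence[Dict[str, int]]
) -> List[Vector]:
    """Placeholder decoder: average the selected code vectors into one action per step."""
    actions = []
    for step in latents:
        codes = [idms[name].codebook[idx] for name, idx in step.items()]
        actions.append([sum(vals) / len(codes) for vals in zip(*codes)])
    return actions


# Hand-written codebook shared by all toy IDMs: no motion, +x, +y, -x.
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Each toy "modality" simply looks at a different slice of the frame vector.
idms = {
    "flow": ToyIDM(lambda f: f[0:2], CODEBOOK),
    "semantic": ToyIDM(lambda f: f[1:3], CODEBOOK),
    "depth": ToyIDM(lambda f: f[2:4], CODEBOOK),
}

# Stand-in for three frames imagined by a frozen video generation model.
rollout = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
]

latents = mixture_of_latent_actions(idms, rollout)
actions = toy_action_head(idms, latents)
print(latents)
print(actions)
```

In this toy setup each modality reports a different code for the same transition, which is the point of using a mixture: the decoder receives several complementary readings of what changed between frames and never sees the raw pixels.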

Visualization Videos

Rollouts on real-world and simulated tasks.

The videos below are loaded from this repository and cover the real robot setup plus CALVIN, LIBERO, and LIBERO-Plus benchmark examples.

Real-world UR5e manipulation rollout.

CALVIN

Push the pink block left.

CALVIN

Pull the handle to open the drawer.

LIBERO

Put the alphabet soup and cream cheese box in the basket.

LIBERO

Turn on the stove and put the moka pot on it.

LIBERO-Plus

Put the black bowl in the bottom drawer and close it.

LIBERO-Plus

Turn on the stove and put the moka pot on it.

Benchmark Results

Consistent gains across benchmarks and real robots.

MoLA achieves the best reported average performance on CALVIN ABC-D, LIBERO, and LIBERO-Plus. In the real-world UR5e setting it reaches 73.0% average success, improving over Diffusion Policy, OpenVLA, and VPP, while pi_0.5 remains higher at 77.5%.

Experimental environments for LIBERO, CALVIN, LIBERO-Plus, and real-world UR5e robot. — **Figure 3.** Experiments cover CALVIN ABC-D, LIBERO, and LIBERO-Plus in simulation, plus a real-world UR5e setup with tabletop manipulation tasks and distribution-shift evaluations.

CALVIN ABC-D: 4.55 average length

LIBERO: 97.0% average success

LIBERO-Plus: 92.7% average success

Real-world UR5e: 73.0% average success

CALVIN ABC-D

Success rate (%) for completing 1 to 5 tasks in a row, and average length; higher is better

Method	1	2	3	4	5	Avg. Len.
3D Diffuser Actor	92.2	78.7	63.9	51.2	41.2	3.27
OpenVLA	91.3	77.8	62.0	52.1	43.5	3.27
UniVLA	95.5	85.8	75.4	66.9	56.5	3.80
pi_0	93.8	85.0	76.7	68.1	59.9	3.84
pi_0.5	94.8	87.4	78.2	71.7	64.3	3.97
GR00T N1	94.2	86.1	79.6	73.9	66.8	4.01
CLOVER	96.0	83.5	70.8	57.5	45.4	3.53
UP-VLA	92.8	86.5	81.5	76.9	69.9	4.08
Seer	96.3	91.6	86.1	80.3	74.0	4.28
VPP	96.5	90.9	86.6	82.0	76.9	4.33
DreamVLA	98.2	94.6	89.5	83.4	78.1	4.44
MoLA (Ours)	98.5	95.0	91.1	88.1	82.6	4.55
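The Avg. Len. column can be checked against the five chained success rates. Assuming column k reports the percentage of evaluation sequences in which at least k tasks were completed in a row, the expected number of completed tasks is the sum of those five probabilities. The sketch below applies this to two rows copied from the table; the function name `average_length` is only illustrative.

```python
def average_length(chain_success_pct):
    """Expected number of tasks completed in a row.

    `chain_success_pct[k-1]` is the percentage of sequences that complete
    at least k consecutive tasks, so the expectation is the sum of the
    corresponding probabilities.
    """
    return sum(chain_success_pct) / 100.0


# Rows copied from the CALVIN ABC-D table above.
mola = [98.5, 95.0, 91.1, 88.1, 82.6]
vpp = [96.5, 90.9, 86.6, 82.0, 76.9]

mola_len = average_length(mola)
vpp_len = average_length(vpp)
print(f"MoLA: {mola_len:.2f}, VPP: {vpp_len:.2f}")
```

Both values agree with the reported 4.55 and 4.33 to two decimals. The gap between the rows widens from 2.0 points at one task to 5.7 points at five tasks, so most of the difference in average length comes from the later steps of the chain.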

LIBERO

Success rate (%)

Method	Spatial	Object	Goal	Long	Avg.
Octo	78.9	85.7	84.6	51.1	75.1
OpenVLA	84.7	88.4	79.2	53.7	76.5
SpatialVLA	88.2	89.9	78.6	55.5	78.1
CoT-VLA	87.5	91.6	87.6	69.0	83.9
VPP	85.0	95.0	92.0	91.5	90.9
MoLA (Ours)	93.0	99.5	99.5	96.0	97.0

LIBERO-Plus

Success rate (%)

Method	Spatial	Object	Goal	Long	Avg.
OpenVLA	19.4	14.0	15.1	14.3	15.6
WorldVLA	32.5	28.6	31.8	8.2	25.0
NORA	47.6	34.4	38.8	36.3	39.0
UniVLA	55.5	36.7	40.7	39.9	42.9
pi_0	60.7	61.4	44.9	48.4	53.6
pi_0-Fast	74.4	72.7	57.5	43.4	61.6
RIPT-VLA	85.8	64.3	58.0	67.5	68.4
OpenVLA-OFT	84.0	66.5	63.0	66.4	69.6
OpenVLA-OFT+	86.1	84.5	70.7	77.7	79.5
MoLA (Ours)	97.5	96.3	85.1	91.8	92.7

Real-world UR5e Robot

Success rate (%)

Method	Place bottle (in-distribution)	Grab bowl (in-distribution)	Distracting objects (out-of-distribution)	Lighting changes (out-of-distribution)	Avg.
Diffusion Policy	44.0	56.0	28.0	38.0	41.5
OpenVLA	52.0	46.0	24.0	30.0	38.0
pi_0.5	82.0	90.0	64.0	74.0	77.5
VPP	66.0	72.0	52.0	58.0	62.0
MoLA (Ours)	76.0	92.0	60.0	64.0	73.0
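To separate in-distribution from out-of-distribution behavior, the rows of the real-world table can be averaged per group. This is simple arithmetic on the numbers above, not an additional result reported by the authors; the helper name `split_means` is illustrative.

```python
# Real-world UR5e success rates (%) copied from the table above.
# Order: place bottle, grab bowl (in-distribution);
#        distracting objects, lighting changes (out-of-distribution).
results = {
    "Diffusion Policy": [44.0, 56.0, 28.0, 38.0],
    "OpenVLA": [52.0, 46.0, 24.0, 30.0],
    "pi_0.5": [82.0, 90.0, 64.0, 74.0],
    "VPP": [66.0, 72.0, 52.0, 58.0],
    "MoLA (Ours)": [76.0, 92.0, 60.0, 64.0],
}


def split_means(row):
    """Return (in-distribution mean, out-of-distribution mean, overall mean)."""
    in_dist = sum(row[:2]) / 2
    out_dist = sum(row[2:]) / 2
    overall = sum(row) / len(row)
    return in_dist, out_dist, overall


summary = {name: split_means(row) for name, row in results.items()}
for name, (in_dist, out_dist, overall) in summary.items():
    print(f"{name:17s} ID {in_dist:5.1f}  OOD {out_dist:5.1f}  Avg {overall:5.1f}")
```

The overall means reproduce the Avg. column. Under this split MoLA averages 84.0% in-distribution and 62.0% out-of-distribution, compared with 69.0% and 55.0% for VPP and 86.0% and 69.0% for pi_0.5.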

Data efficiency comparison between VPP and MoLA on CALVIN ABC-D. — **Figure 4.** MoLA remains stronger than VPP across 10%, 20%, 50%, and 100% CALVIN fine-tuning data regimes, with the largest gains in low-data settings.

Ablation studies for MoIDM pretraining and fine-tuning. — **Figure 5.** MoIDM pretraining and downstream fine-tuning both improve performance, supporting the latent-action interface as the key bridge from imagined futures to execution.

BibTeX

Citation

@inproceedings{li2026imaginedfutures,
  title = {From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation},
  author = {Li, Yajie and Zhang, Bozhou and Gu, Chun and Ma, Zipei and Zhang, Jiahui and Deng, Jiankang and Zhu, Xiatian and Zhang, Li},
  booktitle = {Forty-third International Conference on Machine Learning},
  year = {2026}
}