Accepted to ICML 2026

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

Yajie Li* (1,2), Bozhou Zhang* (1,2), Chun Gu (1,2), Zipei Ma (1,2), Jiahui Zhang (1,2), Jiankang Deng (3), Xiatian Zhu (4), Li Zhang (1,2)

(1) School of Data Science, Fudan University   (2) Shanghai Innovation Institute   (3) Imperial College London   (4) University of Surrey
* Equal contribution

Abstract

Action-space grounding for imagined futures.

Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions; both suffer from a mismatch between visual realism and control relevance. Predicted observations emphasize perceptual fidelity rather than the action-centric causes of state transitions, which leads to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA uses a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.

Method

From video imagination to executable latent actions.

MoLA converts predicted visual futures into action-centric representations with a mixture of modality-aware inverse dynamics models, then decodes the inferred latent actions into robot controls.
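
The data flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three inverse dynamics models are stand-in random linear maps, and the feature sizes, the names `infer_latent` and `imagined_to_actions`, and the linear action head are all hypothetical. It only shows how per-modality latent actions might be concatenated into a mixture and decoded into controls.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 32    # assumed size of a flattened frame feature
LATENT_DIM = 8    # assumed size of one latent action
ACTION_DIM = 7    # assumed robot action size (e.g. 6-DoF pose + gripper)
MODALITIES = ("semantic", "depth", "flow")

# One hypothetical inverse dynamics model (IDM) per modality: a random linear
# map from a (current, future) frame pair to a latent action. A real system
# would feed each IDM modality-specific features; here they share the input.
idm_weights = {m: rng.normal(size=(2 * FRAME_DIM, LATENT_DIM)) for m in MODALITIES}
# Hypothetical action head: a linear map from the mixed latents to a control.
head_weights = rng.normal(size=(len(MODALITIES) * LATENT_DIM, ACTION_DIM))


def infer_latent(modality, frame_t, frame_next):
    """Infer the latent action implied by one visual transition."""
    pair = np.concatenate([frame_t, frame_next])
    return pair @ idm_weights[modality]


def imagined_to_actions(frames):
    """Map T+1 imagined frames to T actions via a mixture of latent actions."""
    actions = []
    for frame_t, frame_next in zip(frames[:-1], frames[1:]):
        mixture = np.concatenate(
            [infer_latent(m, frame_t, frame_next) for m in MODALITIES]
        )
        actions.append(mixture @ head_weights)
    return np.stack(actions)


# Stand-in for a video model's rollout: 5 imagined frames.
imagined = rng.normal(size=(5, FRAME_DIM))
actions = imagined_to_actions(imagined)
print(actions.shape)  # (4, 7)
```

Five imagined frames yield four transitions, so the sketch produces four actions. The policy never sees the frames directly, only the latent actions inferred from them.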

Comparison among VLA models, Video-Action models, and MoLA.
Figure 1. MoLA differs from direct VLA policies and Video-Action pipelines by inserting a latent-action interface between video imagination and execution. The mixture of inverse dynamics models infers actions implied by generated visual transitions, providing a control-oriented bridge to the action head.

Overview of MoLA pretraining and fine-tuning.
Figure 2. During pretraining, flow-aware, semantic-aware, and depth-aware inverse dynamics models learn discrete latent actions from current and future frames. During fine-tuning, the frozen video generation model imagines future rollouts, the mixture of inverse dynamics models (MoIDM) extracts a mixture of latent actions, and the action head generates executable action sequences.
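
One common way to obtain discrete latent actions from frame pairs is vector quantization: encode the transition, then snap it to the nearest entry of a learned codebook. The sketch below assumes that mechanism purely for illustration. The source does not specify the quantizer, and the codebook size, feature dimension, and the name `quantize_transition` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 16        # assumed per-frame feature size
CODEBOOK_SIZE = 32   # assumed number of discrete latent actions

# Hypothetical learned codebook; random here for illustration.
codebook = rng.normal(size=(CODEBOOK_SIZE, FEAT_DIM))


def quantize_transition(feat_t, feat_future):
    """Return (index, code) of the codebook entry nearest to the frame delta."""
    delta = feat_future - feat_t                      # continuous "what changed"
    dists = np.linalg.norm(codebook - delta, axis=1)  # distance to every code
    index = int(np.argmin(dists))
    return index, codebook[index]


feat_t = rng.normal(size=FEAT_DIM)
# Build a future feature whose change is close to code 5.
feat_future = feat_t + codebook[5] + 0.01 * rng.normal(size=FEAT_DIM)
index, code = quantize_transition(feat_t, feat_future)
print(index)  # 5
```

The returned index is the discrete latent action. In a trained model the encoder and codebook would be learned jointly rather than fixed at random.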

Visualization Videos

Rollouts on real-world and simulated tasks.

The videos cover the real-robot setup and examples from the CALVIN, LIBERO, and LIBERO-Plus benchmarks.

Real-world: UR5e manipulation rollout.
CALVIN: Push the pink block left.
CALVIN: Pull the handle to open the drawer.
LIBERO: Put the alphabet soup and cream cheese box in the basket.
LIBERO: Turn on the stove and put the moka pot on it.
LIBERO-Plus: Put the black bowl in the bottom drawer and close it.
LIBERO-Plus: Turn on the stove and put the moka pot on it.

Benchmark Results

Consistent gains across benchmarks and real robots.

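MoLA achieves the best reported average performance on CALVIN ABC-D (4.55 average length), LIBERO (97.0% average success), and LIBERO-Plus (92.7% average success). In the real-world UR5e setting it reaches 73.0% average success, improving over OpenVLA (38.0%) and VPP (62.0%), while pi_0.5 scores higher at 77.5%.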

Experimental environments for LIBERO, CALVIN, LIBERO-Plus, and real-world UR5e robot.
Figure 3. Experiments cover CALVIN ABC-D, LIBERO, and LIBERO-Plus in simulation, plus a real-world UR5e setup with tabletop manipulation tasks and distribution-shift evaluations.

CALVIN ABC-D: 4.55 average length
LIBERO: 97.0% average success
LIBERO-Plus: 92.7% average success
Real-world UR5e: 73.0% average success

CALVIN ABC-D

Tasks completed in a row (success rate in % for chains of 1 to 5 tasks); higher is better
Method               1     2     3     4     5     Avg. Len.
3D Diffuser Actor    92.2  78.7  63.9  51.2  41.2  3.27
OpenVLA              91.3  77.8  62.0  52.1  43.5  3.27
UniVLA               95.5  85.8  75.4  66.9  56.5  3.80
pi_0                 93.8  85.0  76.7  68.1  59.9  3.84
pi_0.5               94.8  87.4  78.2  71.7  64.3  3.97
GR00T N1             94.2  86.1  79.6  73.9  66.8  4.01
CLOVER               96.0  83.5  70.8  57.5  45.4  3.53
UP-VLA               92.8  86.5  81.5  76.9  69.9  4.08
Seer                 96.3  91.6  86.1  80.3  74.0  4.28
VPP                  96.5  90.9  86.6  82.0  76.9  4.33
DreamVLA             98.2  94.6  89.5  83.4  78.1  4.44
MoLA (Ours)          98.5  95.0  91.1  88.1  82.6  4.55
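
The Avg. Len. column is consistent, up to rounding, with summing the five per-position success rates: the expected number of tasks completed in a row equals the sum over k of the probability of completing at least k tasks. The short check below uses three rows copied from the table; the variable and function names are illustrative.

```python
# Success rates (%) for completing at least 1..5 tasks in a row, and the
# reported Avg. Len., copied from the CALVIN ABC-D table.
rows = {
    "VPP": ([96.5, 90.9, 86.6, 82.0, 76.9], 4.33),
    "DreamVLA": ([98.2, 94.6, 89.5, 83.4, 78.1], 4.44),
    "MoLA (Ours)": ([98.5, 95.0, 91.1, 88.1, 82.6], 4.55),
}


def average_length(success_rates_percent):
    """Expected chain length = sum of P(at least k tasks completed)."""
    return sum(success_rates_percent) / 100.0


for method, (rates, reported) in rows.items():
    computed = average_length(rates)
    print(f"{method}: computed {computed:.2f}, reported {reported:.2f}")
```

Because the average is a sum over chain positions, gains at the later positions (tasks 4 and 5) count as much as gains at the first task.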

LIBERO

Success rate (%)
Method         Spatial  Object  Goal   Long   Avg.
Octo           78.9     85.7    84.6   51.1   75.1
OpenVLA        84.7     88.4    79.2   53.7   76.5
SpatialVLA     88.2     89.9    78.6   55.5   78.1
CoT-VLA        87.5     91.6    87.6   69.0   83.9
VPP            85.0     95.0    92.0   91.5   90.9
MoLA (Ours)    93.0     99.5    99.5   96.0   97.0
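
In this table the Avg. column matches the unweighted mean of the four suite scores to one decimal place. The snippet below checks that for two rows copied from the table and computes MoLA's per-suite margin over VPP; the names are illustrative.

```python
suites = ["Spatial", "Object", "Goal", "Long"]
# Per-suite success rates (%) and reported averages, copied from the table.
vpp = {"scores": [85.0, 95.0, 92.0, 91.5], "avg": 90.9}
mola = {"scores": [93.0, 99.5, 99.5, 96.0], "avg": 97.0}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


for name, row in (("VPP", vpp), ("MoLA", mola)):
    print(f"{name}: mean {mean(row['scores']):.1f}, reported {row['avg']:.1f}")

# Absolute gain of MoLA over VPP on each suite, in percentage points.
gains = {
    suite: round(m - v, 1)
    for suite, m, v in zip(suites, mola["scores"], vpp["scores"])
}
print(gains)
```

The margin over VPP is largest on the Spatial and Goal suites and smallest on Object and Long, where VPP is already above 90%.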

LIBERO-Plus

Success rate (%)
Method         Spatial  Object  Goal   Long   Avg.
OpenVLA        19.4     14.0    15.1   14.3   15.6
WorldVLA       32.5     28.6    31.8   8.2    25.0
NORA           47.6     34.4    38.8   36.3   39.0
UniVLA         55.5     36.7    40.7   39.9   42.9
pi_0           60.7     61.4    44.9   48.4   53.6
pi_0-Fast      74.4     72.7    57.5   43.4   61.6
RIPT-VLA       85.8     64.3    58.0   67.5   68.4
OpenVLA-OFT    84.0     66.5    63.0   66.4   69.6
OpenVLA-OFT+   86.1     84.5    70.7   77.7   79.5
MoLA (Ours)    97.5     96.3    85.1   91.8   92.7

Real-world UR5e Robot

Success rate (%)
                    In-distribution             Out-of-distribution
Method              Place bottle  Grab bowl     Distracting objects  Lighting changes  Avg.
Diffusion Policy    44.0          56.0          28.0                 38.0              41.5
OpenVLA             52.0          46.0          24.0                 30.0              38.0
pi_0.5              82.0          90.0          64.0                 74.0              77.5
VPP                 66.0          72.0          52.0                 58.0              62.0
MoLA (Ours)         76.0          92.0          60.0                 64.0              73.0
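
Splitting the Avg. column into its in-distribution and out-of-distribution halves makes the drop under distribution shift explicit. The sketch below does this for the rows copied from the table. The grouping of two tasks per setting follows the table header, and the names are illustrative.

```python
# (place bottle, grab bowl, distracting objects, lighting changes), in %.
results = {
    "Diffusion Policy": (44.0, 56.0, 28.0, 38.0),
    "OpenVLA": (52.0, 46.0, 24.0, 30.0),
    "pi_0.5": (82.0, 90.0, 64.0, 74.0),
    "VPP": (66.0, 72.0, 52.0, 58.0),
    "MoLA (Ours)": (76.0, 92.0, 60.0, 64.0),
}


def split_means(scores):
    """Return (in-distribution mean, out-of-distribution mean, overall mean)."""
    in_dist = sum(scores[:2]) / 2
    out_dist = sum(scores[2:]) / 2
    return in_dist, out_dist, sum(scores) / 4


summary = {method: split_means(scores) for method, scores in results.items()}
for method, (in_dist, out_dist, overall) in summary.items():
    print(f"{method}: ID {in_dist:.1f}, OOD {out_dist:.1f}, avg {overall:.1f}")
```

Every method loses accuracy under distracting objects and lighting changes; the overall mean of each row reproduces the reported Avg. value.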

Data efficiency comparison between VPP and MoLA on CALVIN ABC-D.
Figure 4. MoLA remains stronger than VPP across 10%, 20%, 50%, and 100% CALVIN fine-tuning data regimes, with the largest gains in low-data settings.

Ablation studies for MoIDM pretraining and fine-tuning.
Figure 5. MoIDM pretraining and downstream fine-tuning both improve performance, supporting the latent-action interface as the key bridge from imagined futures to execution.

BibTeX

Citation

@inproceedings{li2026imaginedfutures,
  title = {From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation},
  author = {Li, Yajie and Zhang, Bozhou and Gu, Chun and Ma, Zipei and Zhang, Jiahui and Deng, Jiankang and Zhu, Xiatian and Zhang, Li},
  booktitle = {Forty-third International Conference on Machine Learning},
  year = {2026}
}