Structured 4D Latent Predictive Model for Robot Planning

Abstract

Video predictive models are emerging as a powerful paradigm in robotics, offering a promising path toward task generalization, long-horizon planning, and flexible decision-making. However, prevailing approaches often operate on 2D video sequences, inherently lacking the 3D geometric understanding necessary for precise spatial reasoning and physical consistency.

We introduce a Structured 4D Latent Predictive Model, which predicts the evolution of a scene’s 3D structure in a structured latent space, conditioned on observations and textual instructions. Our representation encodes the scene holistically and can be decoded into diverse 3D formats, enabling a more complete and 3D-consistent scene understanding. The model serves as a planner: it generates future scenes, which a goal-conditioned inverse dynamics module translates into executable actions.

Experiments demonstrate that our model generates futures with strong visual quality and with substantially better 3D consistency and multi-view coherence than state-of-the-art video-based planners. Consequently, our full planning pipeline achieves superior performance on complex manipulation tasks, generalizes robustly to novel visual conditions, and proves effective on real-world robotic platforms.

Method overview

Our 4D latent predictive model integrates multi-view images and text instructions to forecast future 3D dynamics, enabling robots to plan and execute tasks that require precise 3D understanding.
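
The predict-then-act structure described above can be sketched as a simple loop: the predictive model proposes a sequence of future scene latents as subgoals, and an inverse dynamics module maps each (current state, subgoal) pair to an action. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `plan_and_execute`, `predict_futures`, `inverse_dynamics`, and `step`, as well as the toy one-dimensional "scene", are hypothetical stand-ins.

```python
from typing import Callable, List, Sequence

Latent = Sequence[float]  # hypothetical stand-in for a structured 4D scene latent
Action = Sequence[float]  # hypothetical stand-in for a robot action


def plan_and_execute(
    observation: Latent,
    instruction: str,
    predict_futures: Callable[[Latent, str], List[Latent]],
    inverse_dynamics: Callable[[Latent, Latent], Action],
    step: Callable[[Latent, Action], Latent],
) -> List[Latent]:
    """Predict future scene latents, then track them one subgoal at a time."""
    subgoals = predict_futures(observation, instruction)
    state = observation
    visited = [state]
    for goal in subgoals:
        # Goal-conditioned inverse dynamics: which action moves state toward goal?
        action = inverse_dynamics(state, goal)
        state = step(state, action)
        visited.append(state)
    return visited


# Toy stand-ins so the sketch runs end to end (assumptions, not the real model).
def toy_predict(obs, instruction, horizon=3):
    # Pretend the instruction asks to move the scalar scene state toward 1.0.
    return [[obs[0] + (1.0 - obs[0]) * (k + 1) / horizon] for k in range(horizon)]


def toy_inverse_dynamics(state, goal):
    return [goal[0] - state[0]]


def toy_step(state, action):
    return [state[0] + action[0]]


trajectory = plan_and_execute(
    [0.0], "move to target", toy_predict, toy_inverse_dynamics, toy_step
)
```

Passing the model, the inverse dynamics module, and the environment step as plain callables keeps the planner agnostic to how futures are represented, which mirrors the separation between prediction and action recovery described in the abstract.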

Planning results

We visualize generated 3D futures across manipulation scenes. Each example shows the observed input point cloud followed by its predicted rollout.

Real world experiments

For the real-world experiments, the model learns from action-free demonstrations. After reconstructing the 3D point cloud, we register the predicted gripper geometry to recover end-effector poses and use motion planning to execute each subgoal, which provides a learning-free inverse dynamics approach.
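
To illustrate how registering gripper geometry can yield an end-effector pose, the sketch below fits a rigid transform between a canonical gripper point set and its predicted counterpart using the Kabsch algorithm. The page does not say which registration method is used, so this is an assumption chosen for clarity. It also assumes known point-to-point correspondences, which a real pipeline would have to estimate (for example with an ICP-style procedure). The names `kabsch`, `gripper_model`, and `predicted_gripper` are hypothetical.

```python
import numpy as np


def kabsch(source: np.ndarray, target: np.ndarray):
    """Return rotation R and translation t minimizing sum ||R @ s_i + t - t_i||^2.

    Both inputs are (N, 3) arrays; row i of `source` is matched to row i of `target`.
    """
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (source - src_centroid).T @ (target - tgt_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t


# Synthetic check: move a random "gripper" by a known pose and recover that pose.
rng = np.random.default_rng(0)
gripper_model = rng.normal(size=(50, 3))  # hypothetical canonical gripper points
angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.4, -0.1, 0.25])
# Stand-in for the gripper geometry extracted from a predicted future scene.
predicted_gripper = gripper_model @ R_true.T + t_true

R_est, t_est = kabsch(gripper_model, predicted_gripper)
```

The recovered `(R_est, t_est)` pair plays the role of the end-effector pose for one subgoal, which a motion planner could then take as its target.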

Novel view generalization

All models were trained on fixed global views but tested on a novel local viewpoint. Our model generates a consistent 3D scene from the unseen view and significantly outperforms the baselines.
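
One common way to quantify how closely a predicted 3D scene matches a reference is the Chamfer distance between point clouds. The page does not state which metric the comparison uses, so the sketch below is only an illustrative assumption. The function `chamfer_distance` and the tiny example clouds are hypothetical.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared distances via broadcasting, shape (N, M).
    diff = a[:, None, :] - b[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Mean nearest-neighbor squared distance in each direction.
    return float(sq_dist.min(axis=1).mean() + sq_dist.min(axis=0).mean())


# Tiny example: one of two points is displaced by one unit along z.
reference_cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
predicted_cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
score = chamfer_distance(predicted_cloud, reference_cloud)
```

A lower score means the predicted geometry lies closer to the reference. The brute-force pairwise computation is quadratic in the number of points, so large clouds would call for a nearest-neighbor index instead.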

BibTeX

@article{li2026structured,
      title={Structured 4D Latent Predictive Model for Robot Planning}, 
      author={Zhiyi Li and Peilin Wu and Xiaoshen Han and Ruojin Cai and Yilun Du},
      year={2026},
      eprint={2607.01166},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2607.01166}, 
}