Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.
1. Data engine for static data. For each static scene, we sample image pairs according to their overlap ratio and calculate the meta spatial information used to construct QA pairs (see the pair-sampling sketch after this list).
2. Rigid body segmentation. We use 4D datasets (4D object tracking datasets) to construct data for object movement perception. To ensure diversity, we develop a rigid body segmentation method to segment the rigid bodies in dynamic scenes (see the segmentation sketch after this list). More data generation modules can be found in the paper.
3. MultiSPA dataset and benchmark. Based on our data engine, we collect a large-scale dataset called MultiSPA with more than 27 million samples and develop the MultiSPA benchmark focusing on multi-frame spatial understanding. The benchmark uses a unified accuracy metric, where each task defines its own criterion for counting a prediction as a true positive (see the metric sketch below).
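To make the pair-sampling step of the static-data engine (item 1) concrete, here is a minimal sketch that estimates the overlap ratio between two posed RGB-D frames by reprojecting one frame's pixels into the other, then keeps pairs whose overlap falls inside a target band. The function names, the overlap band, and the lack of an occlusion check are illustrative assumptions, not the exact MultiSPA pipeline.

```python
import numpy as np

def overlap_ratio(depth_a, K, pose_a, pose_b):
    """Fraction of frame A's pixels that reproject inside frame B.
    Assumes metric depth, shared intrinsics K, 4x4 cam-to-world poses,
    and ignores occlusion (an approximation for this sketch)."""
    H, W = depth_a.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    z = depth_a.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) / K[0, 0] * z
    y = (v.reshape(-1) - K[1, 2]) / K[1, 1] * z
    pts_a = np.stack([x, y, z, np.ones_like(z)], axis=-1)      # (H*W, 4) in cam A
    pts_b = (np.linalg.inv(pose_b) @ pose_a @ pts_a.T).T       # into cam B
    z_b = np.clip(pts_b[:, 2], 1e-6, None)
    u_b = K[0, 0] * pts_b[:, 0] / z_b + K[0, 2]
    v_b = K[1, 1] * pts_b[:, 1] / z_b + K[1, 2]
    inside = (pts_b[:, 2] > 0) & (u_b >= 0) & (u_b < W) & (v_b >= 0) & (v_b < H)
    return inside.mean()

def sample_pairs(frames, lo=0.3, hi=0.7, max_pairs=100, seed=0):
    """Keep frame pairs whose overlap lies in [lo, hi]: enough shared content
    to relate the two views, but not so much that questions become trivial."""
    rng = np.random.default_rng(seed)
    candidates = [(i, j) for i in range(len(frames)) for j in range(i + 1, len(frames))]
    rng.shuffle(candidates)
    pairs = []
    for i, j in candidates:
        r = overlap_ratio(frames[i]["depth"], frames[i]["K"],
                          frames[i]["pose"], frames[j]["pose"])
        if lo <= r <= hi:
            pairs.append((i, j, r))
        if len(pairs) >= max_pairs:
            break
    return pairs
```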
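The rigid body segmentation of item 2 can be illustrated with a greedy RANSAC procedure over tracked 3D points: repeatedly fit a rigid transform (Kabsch) to random point triplets between two timestamps, take all points that move consistently with it as one rigid body, remove them, and repeat. This is a generic sketch of the idea; the thresholds, the Kabsch fit, and the greedy loop are assumptions rather than the exact method in the paper.

```python
import numpy as np

def fit_rigid(P, Q):
    """Kabsch: least-squares rigid transform (R, t) with Q ≈ P @ R.T + t."""
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cq - R @ cp

def segment_rigid_bodies(pts_t0, pts_t1, thresh=0.02, min_pts=50, iters=200, seed=0):
    """Greedily peel off rigid clusters from tracked 3D points with RANSAC.
    pts_t0, pts_t1: (N, 3) positions of the same tracked points at two times."""
    rng = np.random.default_rng(seed)
    labels = np.full(len(pts_t0), -1)              # -1 = unassigned / non-rigid
    remaining = np.arange(len(pts_t0))
    body_id = 0
    while len(remaining) >= min_pts:
        best = None
        for _ in range(iters):
            idx = rng.choice(remaining, size=3, replace=False)
            R, t = fit_rigid(pts_t0[idx], pts_t1[idx])
            res = np.linalg.norm(pts_t0[remaining] @ R.T + t - pts_t1[remaining], axis=1)
            inliers = remaining[res < thresh]
            if best is None or len(inliers) > len(best):
                best = inliers
        if best is None or len(best) < min_pts:
            break                                  # leftovers treated as noise
        labels[best] = body_id                     # one rigid body segmented
        remaining = np.setdiff1d(remaining, best)
        body_id += 1
    return labels
```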
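Item 3's unified accuracy metric can be pictured as a single accuracy score whose per-sample correctness test depends on the task. The sketch below uses two illustrative rules, exact match for multiple-choice style tasks and a relative-error tolerance for quantitative ones; the task names and the 25% tolerance are placeholders, and the actual criteria are defined per task in the paper.

```python
import re

def is_correct(task, pred, gt, rel_tol=0.25):
    """Per-task true-positive rule feeding a single accuracy metric.
    Task names and tolerances here are illustrative placeholders."""
    if task in {"depth_comparison", "camera_direction"}:
        # Multiple-choice style: compare normalized answer labels.
        return pred.strip().lower() == str(gt).strip().lower()
    if task in {"object_distance", "camera_displacement", "object_movement"}:
        # Quantitative style: correct if within a relative-error tolerance.
        m = re.search(r"-?\d+\.?\d*", pred)
        if m is None:
            return False
        return abs(float(m.group()) - gt) <= rel_tol * max(abs(gt), 1e-6)
    raise ValueError(f"unknown task: {task}")

def accuracy(samples, predictions):
    """samples: list of (task, ground_truth); predictions: model outputs."""
    hits = [is_correct(t, p, g) for (t, g), p in zip(samples, predictions)]
    return sum(hits) / len(hits)
```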
Our model can potentially serve as a multi-frame reward annotator for robot learning: given two frames, it predicts the moving distance of a target object. A hedged sketch of this usage is shown below.
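As a concrete example of the reward-annotator usage, the sketch below queries the model for an object's moving distance between two frames and converts the prediction into a scalar reward. The prompt wording, the `model.chat` call, and the reward shaping are hypothetical placeholders for whatever inference interface and reward design a robot-learning pipeline actually uses.

```python
import re

def annotate_reward(model, frame_before, frame_after, target_object, goal_distance_m):
    """Turn the MLLM's object-movement estimate into a scalar reward.
    `model.chat` is a placeholder for the actual inference API."""
    prompt = (
        f"Given these two frames, how many meters did the {target_object} move? "
        "Answer with a single number."
    )
    reply = model.chat(images=[frame_before, frame_after], prompt=prompt)
    m = re.search(r"-?\d+\.?\d*", reply)
    predicted = float(m.group()) if m else 0.0
    # Reward is higher when the estimated displacement approaches the goal distance.
    return -abs(predicted - goal_distance_m)
```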
@article{xu2025multi,
  title={Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models},
  author={Xu, Runsen and Wang, Weiyao and Tang, Hao and Chen, Xingyu and Wang, Xiaodong and Chu, Fu-Jen and Lin, Dahua and Feiszli, Matt and Liang, Kevin J.},
  journal={arXiv preprint arXiv:2505.17015},
  year={2025}
}