Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

FAIR, Meta¹   The Chinese University of Hong Kong²
Teaser image

Multi-SpatialMLLM: an MLLM for multi-frame spatial understanding. It can perceive camera and object movement, depth, visual correspondence, object sizes, etc., supporting different types of input referencing and outputs for these tasks.

Abstract

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe multi-task benefits and early indications of emergent capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

Data Engine and Benchmark

1. Data engine for static data. For each static scene, we sample image pairs according to their overlap ratio and calculate the meta spatial information used to construct QA pairs (a pair-sampling sketch follows the figure below).

Static Data Generation
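
A minimal sketch of the pair-sampling step, given as an illustration under assumptions rather than the released data engine: it assumes posed RGB-D frames with shared intrinsics (the depth, K, and T_w2c fields are hypothetical names), estimates the overlap of a pair by reprojecting one frame's valid depth pixels into the other, and keeps pairs whose overlap ratio falls in a target range.

import numpy as np

def overlap_ratio(depth_a, K, T_a2b, hw_b):
    # Fraction of frame A's valid depth pixels that reproject inside frame B.
    H, W = depth_a.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    valid = depth_a > 0
    z = depth_a[valid]
    # Back-project frame A pixels into its camera coordinates.
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_a = np.stack([x, y, z, np.ones_like(z)])          # (4, N) homogeneous points
    # Move the points into frame B's camera and project (shared intrinsics K assumed).
    pts_b = (T_a2b @ pts_a)[:3]
    z_b = np.maximum(pts_b[2], 1e-6)
    u_b = K[0, 0] * pts_b[0] / z_b + K[0, 2]
    v_b = K[1, 1] * pts_b[1] / z_b + K[1, 2]
    H_b, W_b = hw_b
    visible = (pts_b[2] > 1e-6) & (u_b >= 0) & (u_b < W_b) & (v_b >= 0) & (v_b < H_b)
    return visible.sum() / max(valid.sum(), 1)

def sample_pairs(frames, lo=0.3, hi=0.7):
    # Keep pairs whose overlap is neither trivial (near 1) nor hopeless (near 0).
    pairs = []
    for i in range(len(frames)):
        for j in range(i + 1, len(frames)):
            fa, fb = frames[i], frames[j]
            # Relative transform from camera A to camera B (world-to-camera poses assumed).
            T_a2b = fb["T_w2c"] @ np.linalg.inv(fa["T_w2c"])
            r = overlap_ratio(fa["depth"], fa["K"], T_a2b, fb["depth"].shape)
            if lo <= r <= hi:
                pairs.append((i, j, r))
    return pairs

Selected pairs can then be annotated with meta spatial information such as relative camera pose, depth values, and pixel correspondences, which are slotted into QA templates.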

2. Rigid body segmentation. We use 4D datasets (4D object tracking datasets) to construct data for object movement perception. To ensure diversity, we develop a rigid body segmentation method to segment the rigid bodies in dynamic scenes (a simplified sketch follows the figure below). More data generation modules can be found in the paper.

Dynamic Data Generation
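
The rigid body segmentation idea can be sketched as follows; this is a simplified illustration under assumptions, not the paper's exact pipeline. It relies on the fact that points on the same rigid body keep nearly constant pairwise distances over time, so tracked 3D points are linked when their distance variation stays below a tolerance and connected components of that graph are treated as rigid bodies.

import numpy as np

def rigid_body_segmentation(tracks, tol=0.02):
    # tracks: (T, N, 3) array of N 3D points tracked over T timesteps.
    # Returns an (N,) array of integer rigid-body labels.
    T, N, _ = tracks.shape
    # Pairwise point distances at every timestep: shape (T, N, N).
    d = np.linalg.norm(tracks[:, :, None, :] - tracks[:, None, :, :], axis=-1)
    # Two points are rigidly linked if their distance varies by less than `tol` (meters).
    rigid = (d.max(axis=0) - d.min(axis=0)) < tol
    # Union-find over the rigidity graph; connected components are rigid bodies.
    parent = list(range(N))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(N):
        for j in range(i + 1, N):
            if rigid[i, j]:
                parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(N)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels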

3. MultiSPA dataset and benchmark. Based on our data engine, we collect a large-scale dataset, MultiSPA, with more than 27 million samples and build the MultiSPA benchmark focused on multi-frame spatial understanding. The benchmark uses a unified accuracy metric, where each task defines its own criterion for what counts as a true positive (illustrated by the sketch after the figure below).

MultiSPA dataset and benchmark
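
Conceptually, the unified metric scores every sample as correct or incorrect while letting each task supply its own true-positive rule, as in the sketch below. The task names and thresholds are illustrative placeholders, not the benchmark's official settings.

def is_correct(task, pred, gt):
    # Each task defines its own rule for what counts as a true positive.
    if task == "multiple_choice":              # qualitative tasks: exact option match
        return str(pred).strip().upper() == str(gt).strip().upper()
    if task == "camera_translation_m":         # quantitative tasks: within a relative tolerance
        return abs(pred - gt) <= 0.25 * max(abs(gt), 1e-6)
    if task == "visual_correspondence_px":     # predicted pixel close enough to the GT pixel
        (px, py), (gx, gy) = pred, gt
        return ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5 <= 10.0
    raise ValueError(f"unknown task: {task}")

def accuracy(samples):
    # samples: iterable of (task, pred, gt) triples; returns the unified accuracy.
    results = [is_correct(task, pred, gt) for task, pred, gt in samples]
    return sum(results) / len(results)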

Experiment Results

  1. Performance of Multi-SpatialMLLM on the MultiSPA benchmark. Multi-SpatialMLLM significantly outperforms baselines across both qualitative and quantitative subtasks, achieving an average gain of 36% and surpassing even larger proprietary models.

  MultiSPA benchmark

  2. Generalization performance. Multi-SpatialMLLM shows strong generalization on the held-out BLINK dataset and on general VQA benchmarks.

  Generalization performance

  3. Scalable performance. Scaling up the trainable parameters and the training data consistently improves Multi-SpatialMLLM.

  Multi-task performance

  4. Emergent capability. On some hard spatial understanding tasks, such as visual correspondence where distractors lie close to the correct point, only large models achieve meaningful gains through fine-tuning. More results can be found in the paper.

  Emergent capabilities

Robotics Applications

Robotics applications

Our model can potentially serve as a multi-frame reward annotator for robot learning: given two frames, it predicts the moving distance of a target object, which can be converted into a reward signal (a hypothetical sketch is shown below).
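
The sketch below illustrates the reward-annotation idea. The query_mllm function is a stand-in for whatever inference interface the model is served behind (an assumed placeholder, not a released API); the prompt wording and scaling are likewise illustrative.

import re

def query_mllm(frames, prompt):
    # Placeholder for the actual multi-frame MLLM inference call.
    raise NotImplementedError("plug in your model's inference interface here")

def movement_reward(frame_t, frame_t1, target="the target object", scale=1.0):
    # Reward proportional to how far the target object moved between the two frames.
    prompt = (f"Given these two frames, how many centimeters did {target} move? "
              "Answer with a single number.")
    answer = query_mllm([frame_t, frame_t1], prompt)
    match = re.search(r"[-+]?\d*\.?\d+", answer)   # parse the first number in the reply
    if match is None:
        return 0.0                                 # unparsable answer -> neutral reward
    return scale * float(match.group())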

BibTeX

@article{xu2025multi,
  title={Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models},
  author={Xu, Runsen and Wang, Weiyao and Tang, Hao and Chen, Xingyu and Wang, Xiaodong and Chu, Fu-Jen and Lin, Dahua and Feiszli, Matt and Liang, Kevin J.},
  journal={arXiv preprint arXiv:2505.17015},
  year={2025}
}