MMSI-Bench is a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours crafting 1,000 challenging multiple-choice questions from over 120,000 images; each question is paired with carefully designed distractors and a step-by-step reasoning annotation.
We conduct extensive experiments and evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy, OpenAI's o3 model reaches 41%, while humans score 97%. These results highlight the challenge and the substantial headroom for future research. We also provide an automated error analysis pipeline that diagnoses four dominant failure modes, offering insights for advancing multi-image spatial intelligence.
1. Multi-image. We target multi-image spatial reasoning: each of the ten fundamental tasks involves two images, while the multi-step reasoning tasks use more.
2. High quality. Every question is fully human-designed: researchers select the images, craft the question, design the distractors, and annotate the step-by-step reasoning process.
3. Aligned with real-world scenarios. All images depict real-world scenes from domains such as autonomous driving, robotic manipulation, and scene scanning, and every question demands real-world scene understanding and reasoning. We do not use any synthetic data.
4. Comprehensive and challenging. We benchmark 34 MLLMs—nearly all leading proprietary and open-source models—and observe a large gap between model and human performance. Most open-source models perform at roughly random-choice level. To the best of our knowledge, our benchmark shows the largest reported model-human gap.
5. Reasoning processes. Each sample is annotated with a step-by-step reasoning trace that justifies the correct answer and helps diagnose model errors.
MMSI-Bench categorizes tasks around three core spatial elements: camera, object, and region, focusing on their positional relationships, attributes, and motion. There are six types of positional relationships: camera-camera, camera-object, camera-region, object-object, object-region, and region-region. The benchmark also includes two types of attributes (measurement and appearance), two types of motion (camera and object), and one multi-step reasoning category. All questions require reasoning across multiple images and cannot be answered from a single image alone.
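To make the taxonomy concrete, below is a minimal sketch of how a benchmark record could be organized and tallied by category. The field names, file-free example values, and category labels are illustrative assumptions, not the official MMSI-Bench schema.

```python
# Illustrative sketch only: the field names and values are assumptions about
# how a benchmark record could be organized, not the official MMSI-Bench schema.
from collections import Counter

# Ten fundamental categories plus the multi-step reasoning category,
# mirroring the taxonomy described above.
CATEGORIES = [
    "camera-camera", "camera-object", "camera-region",   # positional: camera
    "object-object", "object-region", "region-region",   # positional: object/region
    "attribute-measurement", "attribute-appearance",     # attributes
    "motion-camera", "motion-object",                     # motion
    "multi-step-reasoning",
]

# A hypothetical sample record: several images, one question, four options,
# the correct option letter, a category label, and a reasoning annotation.
example_sample = {
    "images": ["frame_000.jpg", "frame_001.jpg"],
    "question": "From the first view to the second, did the camera move left or right?",
    "options": {"A": "Left", "B": "Right", "C": "It did not move", "D": "Cannot be determined"},
    "answer": "A",
    "category": "motion-camera",
    "reasoning": "Landmarks shift rightward across the two views, so the camera moved left.",
}

def category_counts(samples):
    """Tally how many questions fall into each category."""
    return Counter(s["category"] for s in samples)

print(category_counts([example_sample]))  # Counter({'motion-camera': 1})
```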
Representative MMSI-Bench samples from each category. Please zoom in to inspect image details. Questions and rationales are simplified for brevity; the complete versions appear in the section below. Correct answers are highlighted in green.
Illustration of the MMSI-Bench construction pipeline: images are collected from diverse real-world datasets, relevant image sets are carefully selected, complex QA tasks and detailed reasoning processes are manually annotated, and all data undergo rigorous quality control.
Model | Avg. (%) | Type |
---|---|---|
🥇 Human Level | 97.2 | Baseline |
🥈 o3 | 41.0 | Proprietary |
🥉 GPT-4.5 | 40.3 | Proprietary |
Gemini-2.5-Pro--Thinking | 37.0 | Proprietary |
Gemini-2.5-Pro | 36.9 | Proprietary |
Doubao-1.5-pro | 33.0 | Proprietary |
GPT-4.1 | 30.9 | Proprietary |
Qwen2.5-VL-72B | 30.7 | Open-source |
NVILA-15B | 30.5 | Open-source |
GPT-4o | 30.3 | Proprietary |
Claude-3.7-Sonnet--Thinking | 30.2 | Proprietary |
Seed1.5-VL | 29.7 | Proprietary |
InternVL2.5-2B | 29.0 | Open-source |
InternVL2.5-8B | 28.7 | Open-source |
DeepSeek-VL2-Small | 28.6 | Open-source |
InternVL3-78B | 28.5 | Open-source |
InternVL2.5-78B | 28.5 | Open-source |
LLaVA-OneVision-72B | 28.4 | Open-source |
NVILA-8B | 28.1 | Open-source |
InternVL2.5-26B | 28.0 | Open-source |
DeepSeek-VL2 | 27.1 | Open-source |
InternVL3-1B | 27.0 | Open-source |
InternVL3-9B | 26.7 | Open-source |
Qwen2.5-VL-3B | 26.5 | Open-source |
InternVL2.5-4B | 26.3 | Open-source |
InternVL2.5-1B | 26.1 | Open-source |
Qwen2.5-VL-7B | 25.9 | Open-source |
InternVL3-8B | 25.7 | Open-source |
Llama-3.2-11B-Vision | 25.4 | Open-source |
InternVL3-2B | 25.3 | Open-source |
🃏 Random Guessing | 25.0 | Baseline |
LLaVA-OneVision-7B | 24.5 | Open-source |
DeepSeek-VL2-Tiny | 24.0 | Open-source |
Blind GPT-4o | 22.7 | Baseline |
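All questions are four-way multiple choice, which is why random guessing sits at 25.0% in the table above; accuracy then reduces to exact matching of the chosen option letter against the ground truth. The snippet below is a minimal scoring sketch under that assumption; the answer-extraction regex and the record fields are placeholders, not the official MMSI-Bench evaluation code.

```python
# Minimal scoring sketch: option extraction and accuracy. The regex and the
# record fields ("prediction", "answer") are assumptions, not the official
# MMSI-Bench evaluation script.
import re

def extract_choice(model_output: str) -> str | None:
    """Pull the first standalone option letter (A-D) from a model response."""
    match = re.search(r"\b([A-D])\b", model_output.strip())
    return match.group(1) if match else None

def accuracy(records) -> float:
    """Fraction of records whose extracted choice matches the ground truth."""
    correct = sum(
        extract_choice(r["prediction"]) == r["answer"] for r in records
    )
    return correct / len(records)

if __name__ == "__main__":
    demo = [
        {"prediction": "The answer is B.", "answer": "B"},
        {"prediction": "C", "answer": "A"},
    ]
    print(f"accuracy = {accuracy(demo):.1%}")  # 50.0%
```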
Illustration of four error types identified in MLLM spatial reasoning on MMSI-Bench.
Distribution of correct answers and error types across three representative MLLMs, analyzed with the automated error-analysis pipeline on all MMSI-Bench questions.
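As a rough illustration of how such an automated pipeline could be wired up, the sketch below asks an LLM judge to assign one of the identified failure modes to each incorrect response. The `judge` callable, prompt wording, and interface are assumptions for illustration, not the pipeline actually used for MMSI-Bench.

```python
# Hedged sketch of an automated error-analysis step: classify each incorrect
# answer into one of a fixed set of failure modes using an LLM judge.
# `judge` is a placeholder callable (prompt in, label out); the prompt wording
# is an assumption, not the authors' actual pipeline.
from collections import Counter
from typing import Callable, Iterable

def classify_errors(
    failures: Iterable[dict],
    error_types: list[str],
    judge: Callable[[str], str],
) -> Counter:
    """Assign each failed sample to one of `error_types` and tally the result."""
    tallies: Counter = Counter()
    for sample in failures:
        prompt = (
            "Question: " + sample["question"] + "\n"
            "Reference reasoning: " + sample["reasoning"] + "\n"
            "Model response: " + sample["prediction"] + "\n"
            "Pick the single best-fitting error type from: "
            + ", ".join(error_types)
        )
        label = judge(prompt)
        tallies[label if label in error_types else "unclassified"] += 1
    return tallies
```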
MMSI-Bench makes use of data from ScanNet, nuScenes, Matterport3D, Ego4D, AgiBot-World, DTU, DAVIS-2017, and Waymo. We thank these teams for their open-source contributions.
@article{yang2025mmsi,
title={MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence},
author={Yang, Sihan and Xu, Runsen and Xie, Yiman and Yang, Sizhe and Li, Mo and Lin, Jingli and Zhu, Chenming and Chen, Xiaochen and Duan, Haodong and Yue, Xiangyu and Lin, Dahua and Wang, Tai and Pang, Jiangmiao},
journal={arXiv preprint arXiv:2505.23764},
year={2025}
}