MMSI-Bench is a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours crafting 1,000 challenging multiple-choice questions from over 120,000 images; each question is paired with carefully designed distractors and a step-by-step reasoning annotation.
We conduct extensive experiments and evaluate 34 open-source and proprietary MLLMs, observing a wide performance gap: the strongest open-source model attains roughly 30% accuracy, OpenAI’s o3 reaches 41%, and humans score 97%. These results highlight the difficulty of the benchmark and the substantial headroom for future research. We also provide an automated error analysis pipeline that diagnoses four dominant failure modes, offering insights for advancing multi-image spatial intelligence.
MMSI-Bench organizes its tasks around three core spatial elements: camera, object, and region, focusing on their positional relationships, attributes, and motion. There are six types of positional relationships: camera-camera, camera-object, camera-region, object-object, object-region, and region-region. The benchmark also includes two types of attributes (measurement and appearance), two types of motion (camera and object), and one multi-step reasoning category. All questions require reasoning across multiple images and cannot be answered from a single image alone.
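For concreteness, here is a minimal sketch of how one might load a local copy of the benchmark and tally questions per category. The file name, JSON layout, and `category` field are assumptions for illustration rather than the benchmark's official format; the category strings simply mirror the taxonomy above.

```python
import json
from collections import Counter

# The ten fundamental task types plus multi-step reasoning, mirroring the taxonomy above.
CATEGORIES = [
    "camera-camera", "camera-object", "camera-region",
    "object-object", "object-region", "region-region",
    "attribute-measurement", "attribute-appearance",
    "motion-camera", "motion-object",
    "multi-step-reasoning",
]

def per_category_counts(path: str = "mmsi_bench.json") -> dict[str, int]:
    """Count questions per category in a local copy of the benchmark (assumed: a JSON list of dicts)."""
    with open(path) as f:
        samples = json.load(f)
    counts = Counter(s["category"] for s in samples)  # "category" field name is an assumption
    return {c: counts.get(c, 0) for c in CATEGORIES}

if __name__ == "__main__":
    for cat, n in per_category_counts().items():
        print(f"{cat:>22}: {n}")
```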
Representative MMSI-Bench samples from each category. Please zoom in to inspect image details. Questions and rationales are simplified for brevity; the complete versions appear in the section below. Correct answers are highlighted in green.
Illustration of the MMSI-Bench construction pipeline: images are collected from diverse real-world datasets, relevant image sets are carefully selected, complex QA tasks and detailed reasoning processes are manually annotated, and all data undergo rigorous quality control.
| Model | Avg. (%) | Type |
|---|---|---|
| 🥇 Human Level | 97.2 | Baseline |
| 🥈 o3 | 41.0 | Proprietary |
| 🥉 GPT-4.5 | 40.3 | Proprietary |
| Gemini-2.5-Pro--Thinking | 37.0 | Proprietary |
| Gemini-2.5-Pro | 36.9 | Proprietary |
| Doubao-1.5-pro | 33.0 | Proprietary |
| GPT-4.1 | 30.9 | Proprietary |
| Qwen2.5-VL-72B | 30.7 | Open-source |
| NVILA-15B | 30.5 | Open-source |
| GPT-4o | 30.3 | Proprietary |
| Claude-3.7-Sonnet--Thinking | 30.2 | Proprietary |
| Seed1.5-VL | 29.7 | Proprietary |
| InternVL2.5-2B | 29.0 | Open-source |
| InternVL2.5-8B | 28.7 | Open-source |
| DeepSeek-VL2-Small | 28.6 | Open-source |
| InternVL3-78B | 28.5 | Open-source |
| InternVL2.5-78B | 28.5 | Open-source |
| LLaVA-OneVision-72B | 28.4 | Open-source |
| NVILA-8B | 28.1 | Open-source |
| InternVL2.5-26B | 28.0 | Open-source |
| DeepSeek-VL2 | 27.1 | Open-source |
| InternVL3-1B | 27.0 | Open-source |
| InternVL3-9B | 26.7 | Open-source |
| Qwen2.5-VL-3B | 26.5 | Open-source |
| InternVL2.5-4B | 26.3 | Open-source |
| InternVL2.5-1B | 26.1 | Open-source |
| Qwen2.5-VL-7B | 25.9 | Open-source |
| InternVL3-8B | 25.7 | Open-source |
| Llama-3.2-11B-Vision | 25.4 | Open-source |
| InternVL3-2B | 25.3 | Open-source |
| 🃏 Random Guessing | 25.0 | Baseline |
| LLaVA-OneVision-7B | 24.5 | Open-source |
| DeepSeek-VL2-Tiny | 24.0 | Open-source |
| Blind GPT-4o | 22.7 | Baseline |
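The Avg. (%) column is standard accuracy over four-option multiple-choice questions, which is why random guessing sits at 25.0. The sketch below shows one plausible way to score model responses by extracting the predicted option letter; the official evaluation script may use a different answer-matching rule.

```python
import re

def extract_choice(response: str) -> str | None:
    """Return the first standalone option letter A-D found in a model response, if any."""
    match = re.search(r"\b([A-D])\b", response.strip().upper())
    return match.group(1) if match else None

def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of responses whose extracted letter matches the ground-truth letter."""
    correct = sum(
        extract_choice(pred) == gold.strip().upper()
        for pred, gold in zip(predictions, answers)
    )
    return correct / len(answers)

# Toy example: one correct and one incorrect response -> 50% accuracy.
preds = ["The answer is B.", "C, because the camera moved to the left."]
gold = ["B", "A"]
print(f"Accuracy: {accuracy(preds, gold):.1%}")
```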
Illustration of four error types identified in MLLM spatial reasoning on MMSI-Bench.
Distribution of correct answers and error types across three representative MLLMs, analyzed with the automated error analysis pipeline on all MMSI-Bench questions.
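The sketch below illustrates, under stated assumptions, how an automated error-analysis step like the one described above could be built: a judge model is shown the question, the reference reasoning, and the evaluated model's reasoning, and is asked to pick the most likely failure mode. The judge model, prompt wording, and error labels here are placeholders rather than the benchmark's actual pipeline; the call is written against the OpenAI Python client's `chat.completions.create` interface.

```python
# A minimal sketch, assuming the OpenAI Python client; the judge model, prompt,
# and error labels below are placeholders, not the benchmark's actual pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ERROR_LABELS = [  # placeholder names for the four failure modes
    "grounding error",
    "scene-reconstruction error",
    "situation-transformation error",
    "spatial-logic error",
]

def classify_error(question: str, reference_reasoning: str, model_reasoning: str) -> str:
    """Ask a judge model which failure mode best explains an incorrect answer."""
    prompt = (
        "A model answered the following multi-image spatial question incorrectly.\n"
        f"Question: {question}\n"
        f"Reference reasoning: {reference_reasoning}\n"
        f"Model reasoning: {model_reasoning}\n"
        f"Pick the single best-matching error type from: {', '.join(ERROR_LABELS)}.\n"
        "Reply with the error type only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # judge model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```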
MMSI-Bench builds on data from ScanNet, nuScenes, Matterport3D, Ego4D, AgiBot-World, DTU, DAVIS-2017, and Waymo. We thank these teams for their open-source contributions.
@article{yang2025mmsi,
title={MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence},
author={Yang, Sihan and Xu, Runsen and Xie, Yiman and Yang, Sizhe and Li, Mo and Lin, Jingli and Zhu, Chenming and Chen, Xiaochen and Duan, Haodong and Yue, Xiangyu and Lin, Dahua and Wang, Tai and Pang, Jiangmiao},
journal={arXiv preprint arXiv:2505.23764},
year={2025}
}