MMSI-Bench is a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours crafting 1,000 challenging multiple-choice questions from over 120,000 images; each question is paired with carefully designed distractors and a step-by-step reasoning annotation.
We conduct extensive experiments and evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy, OpenAI's o3 model reaches 41%, while humans score 97%. These results highlight the challenge and the substantial headroom for future research. We also provide an automated error analysis pipeline that diagnoses four dominant failure modes, offering insights for advancing multi-image spatial intelligence.
1. Multi-image. We target multi-image spatial reasoning: each of the ten fundamental tasks involves two images, while the multi-step reasoning tasks use more.
2. High quality. Every question is fully human-designed: researchers select the images, craft the question, design the distractors, and annotate the step-by-step reasoning process.
3. Aligned with real-world scenarios. All images depict real-world scenes from domains such as autonomous driving, robotic manipulation, and scene scanning, and every question demands real-world scene understanding and reasoning. We do not use any synthetic data.
4. Comprehensive and challenging. We benchmark 34 MLLMs—nearly all leading proprietary and open-source models—and observe a large gap between model and human performance. Most open-source models perform at roughly random-choice level. To the best of our knowledge, our benchmark shows the largest reported model-human gap.
5. Reasoning processes. Each sample is annotated with a step-by-step reasoning trace that justifies the correct answer and helps diagnose model errors.
MMSI-Bench categorizes tasks around three core spatial elements: camera, object, and region, focusing on their positional relationships, attributes, and motion. There are six types of positional relationships: camera-camera, camera-object, camera-region, object-object, object-region, and region-region. The benchmark also includes two types of attributes (measurement and appearance), two types of motion (camera and object), and one multi-step reasoning category. All questions require reasoning across multiple images and cannot be answered from a single image alone.
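To make the taxonomy concrete, below is a minimal sketch of how a benchmark record could be organized and tallied by category. The field names, file-free example values, and category labels are illustrative assumptions, not the official MMSI-Bench schema.

```python
# Illustrative sketch only: the field names and values are assumptions about
# how a benchmark record could be organized, not the official MMSI-Bench schema.
from collections import Counter

# Ten fundamental categories plus the multi-step reasoning category,
# mirroring the taxonomy described above.
CATEGORIES = [
    "camera-camera", "camera-object", "camera-region",   # positional: camera
    "object-object", "object-region", "region-region",   # positional: object/region
    "attribute-measurement", "attribute-appearance",     # attributes
    "motion-camera", "motion-object",                     # motion
    "multi-step-reasoning",
]

# A hypothetical sample record: several images, one question, four options,
# the correct option letter, a category label, and a reasoning annotation.
example_sample = {
    "images": ["frame_000.jpg", "frame_001.jpg"],
    "question": "From the first view to the second, did the camera move left or right?",
    "options": {"A": "Left", "B": "Right", "C": "It did not move", "D": "Cannot be determined"},
    "answer": "A",
    "category": "motion-camera",
    "reasoning": "Landmarks shift rightward across the two views, so the camera moved left.",
}

def category_counts(samples):
    """Tally how many questions fall into each category."""
    return Counter(s["category"] for s in samples)

print(category_counts([example_sample]))  # Counter({'motion-camera': 1})
```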
Representative MMSI-Bench samples from each category. Please zoom in to inspect image details. Questions and rationales are simplified for brevity; the complete versions appear in the section below. Correct answers are highlighted in green.
Illustration of the MMSI-Bench construction pipeline: images are collected from diverse real-world datasets, relevant image sets are carefully selected, complex QA tasks and detailed reasoning processes are manually annotated, and all data undergo rigorous quality control.
Model | Avg. (%) | Type |
---|---|---|
🥇 Human Level | 97.2 | Baseline |
🥈 o3 | 41.0 | Proprietary |
🥉 GPT-4.5 | 40.3 | Proprietary |
Gemini-2.5-Pro--Thinking | 37.0 | Proprietary |
Gemini-2.5-Pro | 36.9 | Proprietary |
Doubao-1.5-pro | 33.0 | Proprietary |
GPT-4.1 | 30.9 | Proprietary |
Qwen2.5-VL-72B | 30.7 | Open-source |
NVILA-15B | 30.5 | Open-source |
GPT-4o | 30.3 | Proprietary |
Claude-3.7-Sonnet--Thinking | 30.2 | Proprietary |
Seed1.5-VL | 29.7 | Proprietary |
InternVL2.5-2B | 29.0 | Open-source |
InternVL2.5-8B | 28.7 | Open-source |
DeepSeek-VL2-Small | 28.6 | Open-source |
InternVL3-78B | 28.5 | Open-source |
InternVL2.5-78B | 28.5 | Open-source |
LLaVA-OneVision-72B | 28.4 | Open-source |
NVILA-8B | 28.1 | Open-source |
InternVL2.5-26B | 28.0 | Open-source |
DeepSeek-VL2 | 27.1 | Open-source |
InternVL3-1B | 27.0 | Open-source |
InternVL3-9B | 26.7 | Open-source |
Qwen2.5-VL-3B | 26.5 | Open-source |
InternVL2.5-4B | 26.3 | Open-source |
InternVL2.5-1B | 26.1 | Open-source |
Qwen2.5-VL-7B | 25.9 | Open-source |
InternVL3-8B | 25.7 | Open-source |
Llama-3.2-11B-Vision | 25.4 | Open-source |
InternVL3-2B | 25.3 | Open-source |
🃏 Random Guessing | 25.0 | Baseline |
LLaVA-OneVision-7B | 24.5 | Open-source |
DeepSeek-VL2-Tiny | 24.0 | Open-source |
Blind GPT-4o | 22.7 | Baseline |
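All questions are four-way multiple choice, which is why random guessing sits at 25.0% in the table above; accuracy then reduces to exact matching of the chosen option letter against the ground truth. The snippet below is a minimal scoring sketch under that assumption; the answer-extraction regex and the record fields are placeholders, not the official MMSI-Bench evaluation code.

```python
# Minimal scoring sketch: option extraction and accuracy. The regex and the
# record fields ("prediction", "answer") are assumptions, not the official
# MMSI-Bench evaluation script.
import re

def extract_choice(model_output: str) -> str | None:
    """Pull the first standalone option letter (A-D) from a model response."""
    match = re.search(r"\b([A-D])\b", model_output.strip())
    return match.group(1) if match else None

def accuracy(records) -> float:
    """Fraction of records whose extracted choice matches the ground truth."""
    correct = sum(
        extract_choice(r["prediction"]) == r["answer"] for r in records
    )
    return correct / len(records)

if __name__ == "__main__":
    demo = [
        {"prediction": "The answer is B.", "answer": "B"},
        {"prediction": "C", "answer": "A"},
    ]
    print(f"accuracy = {accuracy(demo):.1%}")  # 50.0%
```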
Illustration of four error types identified in MLLM spatial reasoning on MMSI-Bench.
Distribution of correct answers and error types across three representative MLLMs, analyzed with the automated error-analysis pipeline on all MMSI-Bench questions.
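As a rough illustration of how such an automated pipeline could be wired up, the sketch below asks an LLM judge to assign one of the identified failure modes to each incorrect response. The `judge` callable, prompt wording, and interface are assumptions for illustration, not the pipeline actually used for MMSI-Bench.

```python
# Hedged sketch of an automated error-analysis step: classify each incorrect
# answer into one of a fixed set of failure modes using an LLM judge.
# `judge` is a placeholder callable (prompt in, label out); the prompt wording
# is an assumption, not the authors' actual pipeline.
from collections import Counter
from typing import Callable, Iterable

def classify_errors(
    failures: Iterable[dict],
    error_types: list[str],
    judge: Callable[[str], str],
) -> Counter:
    """Assign each failed sample to one of `error_types` and tally the result."""
    tallies: Counter = Counter()
    for sample in failures:
        prompt = (
            "Question: " + sample["question"] + "\n"
            "Reference reasoning: " + sample["reasoning"] + "\n"
            "Model response: " + sample["prediction"] + "\n"
            "Pick the single best-fitting error type from: "
            + ", ".join(error_types)
        )
        label = judge(prompt)
        tallies[label if label in error_types else "unclassified"] += 1
    return tallies
```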
MMSI-Bench makes use of data from ScanNet, nuScenes, Matterport3D, Ego4D, AgiBot-World, DTU, DAVIS-2017, and Waymo. We thank these teams for their open-source contributions.
@article{yang2025mmsi,
title={MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence},
author={Yang, Sihan and Xu, Runsen and Xie, Yiman and Yang, Sizhe and Li, Mo and Lin, Jingli and Zhu, Chenming and Chen, Xiaochen and Duan, Haodong and Yue, Xiangyu and Lin, Dahua and Wang, Tai and Pang, Jiangmiao},
journal={arXiv preprint arXiv:2505.23764},
year={2025}
}