MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Shanghai AI Laboratory, The Chinese University of Hong Kong, Zhejiang University, Tsinghua University, Shanghai Jiao Tong University, The University of Hong Kong, Beijing Normal University

MMSI-Bench is a VQA benchmark dedicated to multi-image spatial intelligence, featuring 1,000 challenging questions that six 3D-vision researchers annotated from diverse real-world scene images.

Abstract

MMSI-Bench is a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours crafting 1,000 challenging multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and step-by-step reasoning annotations.

We conduct extensive experiments and evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy, OpenAI’s o3 model reaches 41%, while humans score 97%. These results highlight the challenge and the substantial headroom for future research. We also provide an automated error analysis pipeline that diagnoses four dominant failure modes, offering insights for advancing multi-image spatial intelligence.

Task Categories

MMSI-Bench categorizes tasks around three core spatial elements: camera, object, and region, focusing on their positional relationships, attributes, and motion. There are six types of positional relationships: camera-camera, camera-object, camera-region, object-object, object-region, and region-region. The benchmark also includes two types of attributes (measurement and appearance), two types of motion (camera and object), and one multi-step reasoning category. All questions require reasoning across multiple images and cannot be answered from a single image alone.

Figure: distribution of MMSI-Bench questions across the task categories.
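As a compact reference, the taxonomy above can be written down as a simple lookup structure. The sketch below is our own illustration, not an official schema; only the category names come from the benchmark description.

```python
# MMSI-Bench task taxonomy as described above. The dictionary layout and
# key names are illustrative, not a released data format.
MMSI_TASKS = {
    "positional_relationship": [
        "camera-camera", "camera-object", "camera-region",
        "object-object", "object-region", "region-region",
    ],
    "attribute": ["measurement", "appearance"],
    "motion": ["camera_motion", "object_motion"],
    "reasoning": ["multi-step"],
}
```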

Representative Task Samples


Representative MMSI-Bench samples from each category. Please zoom in to inspect image details. Questions and rationales are simplified for brevity; the complete versions appear in the section below. Correct answers are highlighted in green.

Benchmark Construction Pipeline


Illustration of the MMSI-Bench construction pipeline: images are collected from diverse real-world datasets, relevant image sets are carefully selected, complex QA tasks and detailed reasoning processes are manually annotated, and all data undergo rigorous quality control.
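Concretely, one item emerging from this pipeline can be modeled along the following lines. The field names are assumptions for illustration, not the released file format.

```python
from dataclasses import dataclass

@dataclass
class MMSIItem:
    """Illustrative shape of a single MMSI-Bench item (field names assumed)."""
    images: list[str]        # the multiple input images (paths or URLs)
    question: str            # multiple-choice question text
    choices: dict[str, str]  # option letter -> text, incl. crafted distractors
    answer: str              # correct option letter, e.g. "B"
    reasoning: str           # annotated step-by-step rationale
    category: str            # one of the task categories above
```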

Leaderboard (2025-05)

| Model | Avg. (%) | Type |
|---|---|---|
| 🥇 Human Level | 97.2 | Baseline |
| 🥈 o3 | 41.0 | Proprietary |
| 🥉 GPT-4.5 | 40.3 | Proprietary |
| Gemini-2.5-Pro (Thinking) | 37.0 | Proprietary |
| Gemini-2.5-Pro | 36.9 | Proprietary |
| Doubao-1.5-pro | 33.0 | Proprietary |
| GPT-4.1 | 30.9 | Proprietary |
| Qwen2.5-VL-72B | 30.7 | Open-source |
| NVILA-15B | 30.5 | Open-source |
| GPT-4o | 30.3 | Proprietary |
| Claude-3.7-Sonnet (Thinking) | 30.2 | Proprietary |
| Seed1.5-VL | 29.7 | Proprietary |
| InternVL2.5-2B | 29.0 | Open-source |
| InternVL2.5-8B | 28.7 | Open-source |
| DeepSeek-VL2-Small | 28.6 | Open-source |
| InternVL3-78B | 28.5 | Open-source |
| InternVL2.5-78B | 28.5 | Open-source |
| LLaVA-OneVision-72B | 28.4 | Open-source |
| NVILA-8B | 28.1 | Open-source |
| InternVL2.5-26B | 28.0 | Open-source |
| DeepSeek-VL2 | 27.1 | Open-source |
| InternVL3-1B | 27.0 | Open-source |
| InternVL3-9B | 26.7 | Open-source |
| Qwen2.5-VL-3B | 26.5 | Open-source |
| InternVL2.5-4B | 26.3 | Open-source |
| InternVL2.5-1B | 26.1 | Open-source |
| Qwen2.5-VL-7B | 25.9 | Open-source |
| InternVL3-8B | 25.7 | Open-source |
| Llama-3.2-11B-Vision | 25.4 | Open-source |
| InternVL3-2B | 25.3 | Open-source |
| 🃏 Random Guessing | 25.0 | Baseline |
| LLaVA-OneVision-7B | 24.5 | Open-source |
| DeepSeek-VL2-Tiny | 24.0 | Open-source |
| Blind GPT-4o | 22.7 | Baseline |
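The scores above are plain multiple-choice accuracy (the 25.0% random-guessing baseline implies four options per question). Below is a minimal scoring sketch, assuming each record carries its ground-truth letter under a hypothetical `answer` field and that the model reply contains a single standalone option letter; neither assumption is prescribed by MMSI-Bench itself.

```python
import re

def extract_choice(reply: str) -> str | None:
    """Pull the first standalone option letter (A-D) out of a model reply."""
    match = re.search(r"\b([A-D])\b", reply)
    return match.group(1) if match else None

def accuracy(records: list[dict], replies: list[str]) -> float:
    """Fraction of replies whose extracted letter matches the ground truth."""
    correct = sum(
        extract_choice(reply) == rec["answer"]  # "answer": assumed field name
        for rec, reply in zip(records, replies)
    )
    return correct / len(records)
```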

Error Types in Spatial Reasoning


Illustration of four error types identified in MLLM spatial reasoning on MMSI-Bench.

Automated Error Analysis


Distribution of correct answers and error types across three representative MLLMs, analyzed with the automated error-analysis pipeline on all MMSI-Bench questions.
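For intuition, a pipeline of this kind can be sketched as an LLM-as-judge pass over each model response. Everything below (the judge prompt, the `call_judge` callable, and the placeholder failure-mode labels) is hypothetical scaffolding, not the authors' implementation.

```python
# Hypothetical sketch of an automated error-analysis pass. The placeholder
# labels stand in for the four failure modes shown in the figure above.
ERROR_TYPES = ["failure_mode_1", "failure_mode_2",
               "failure_mode_3", "failure_mode_4"]

JUDGE_PROMPT = (
    "Compare the model response with the reference reasoning and answer.\n"
    "Output exactly one label: 'correct' or one of {labels}.\n\n"
    "Question: {question}\n"
    "Reference reasoning: {reasoning}\n"
    "Model response: {response}\n"
    "Label:"
)

def classify(item: dict, response: str, call_judge) -> str:
    """Label one response via a judge model (call_judge: str -> str, assumed)."""
    prompt = JUDGE_PROMPT.format(
        labels=", ".join(ERROR_TYPES),
        question=item["question"],      # field names assumed, as above
        reasoning=item["reasoning"],
        response=response,
    )
    label = call_judge(prompt).strip().lower()
    return label if label in set(ERROR_TYPES) | {"correct"} else "unparsed"
```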

Acknowledgment

MMSI-Bench makes use of data from: ScanNet, nuScenes, Matterport3D, Ego4D, AgiBot-World, DTU, DAVIS-2017, and Waymo. We thank these teams for their open-source contributions.

BibTeX

@article{yang2025mmsi,
  title={MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence},
  author={Yang, Sihan and Xu, Runsen and Xie, Yiman and Yang, Sizhe and Li, Mo and Lin, Jingli and Zhu, Chenming and Chen, Xiaochen and Duan, Haodong and Yue, Xiangyu and Lin, Dahua and Wang, Tai and Pang, Jiangmiao},
  journal={arXiv preprint arXiv:2505.23764},
  year={2025}
}