MMComposition

Revisiting the Compositionality of Pre-trained Vision-Language Models

University of Rochester, Apple, Microsoft

Introduction

The development of large Vision-Language Models (VLMs) has greatly improved multimodal tasks such as image captioning, visual question answering, and cross-modal retrieval by better integrating visual and textual information. Despite this strong performance, researchers still lack a full understanding of VLMs' compositionality — the ability to understand and generate novel combinations of known visual and textual elements. Existing benchmarks assess compositionality mainly through objects, relations, and attributes, while overlooking deeper aspects such as object interactions, counting, and complex compositions.

To address these limitations, we introduce MMComposition, a novel benchmark designed to comprehensively evaluate the compositionality of pre-trained VLMs along three main dimensions: vision-language (VL) compositional perception, reasoning, and probing. Unlike previous benchmarks that focus primarily on text-to-image retrieval or single-choice questions, MMComposition offers a diverse set of 4,342 tasks spanning single-image and multi-image scenarios as well as single-choice and indefinite-choice questions, ensuring a thorough evaluation of models' ability to handle complex compositional tasks across modalities. Our findings reveal that even state-of-the-art models like GPT-4o struggle with fine-grained compositional reasoning, highlighting the need for further advances in VLMs' compositional capabilities. Our key contributions are:
  1. Introducing MMComposition, a novel, high-quality benchmark for evaluating the compositionality of pre-trained VLMs across perception, reasoning, and probing.
  2. Providing a comprehensive experimental evaluation of 54 state-of-the-art VLMs, demonstrating the challenging nature of MMComposition and exposing significant gaps between model and human performance.
  3. Analyzing the critical factors in VLM architecture that influence compositionality, offering insights for future improvements in model design and training.
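As noted above, MMComposition mixes single-choice questions (exactly one correct option) with indefinite-choice questions (one or more correct options). The snippet below is a minimal illustrative sketch of how such items can be scored, assuming indefinite-choice answers are credited only on an exact match of the predicted option set; the authoritative metric definitions are in the paper, and the function and field names here are our own, not the benchmark's official API.

```python
def score_answer(predicted, gold, question_type):
    """Score one MMComposition-style item (illustrative sketch only).

    predicted, gold: sets of option letters, e.g. {"A"} or {"B", "D"}.
    question_type: "single-choice" or "indefinite-choice".
    Assumption: indefinite-choice items count as correct only when the
    predicted option set exactly matches the gold set.
    """
    if question_type == "single-choice":
        return 1.0 if predicted == gold else 0.0
    # Indefinite-choice: exact set match (an assumption, not the benchmark's stated rule).
    return 1.0 if set(predicted) == set(gold) else 0.0


# Hypothetical examples:
print(score_answer({"C"}, {"C"}, "single-choice"))                     # 1.0
print(score_answer({"A", "D"}, {"A", "B", "D"}, "indefinite-choice"))  # 0.0 (missed option B)
```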

🏆 Leaderboard

By default, the leaderboard is sorted by the overall score. On the interactive project page, clicking the corresponding cell re-sorts the table by that metric, and colored rows indicate closed-source models/APIs. A small offline sorting sketch follows the table.

| # | Model | Organization | LLM Params | Date | Overall (%) | Perception (%) | Reasoning (%) | Probing (%) |
|---|-------|--------------|------------|------|-------------|-----------------|----------------|--------------|
| 1 | InternVL2-40B | Shanghai AI Lab | 40B | 2024/10/01 | 67.95 | 65.44 | 73.99 | 59.59 |
| 2 | InternVL2-76B | Shanghai AI Lab | 76B | 2024/10/01 | 67.28 | 63.41 | 75.44 | 58.46 |
| 3 | Qwen2-VL-72B | Alibaba | 72B | 2024/10/01 | 65.24 | 56.53 | 76.39 | 70.26 |
| 4 | InternVL-Chat-V1.2-Plus | Shanghai AI Lab | 40B | 2024/10/01 | 64.94 | 60.73 | 70.78 | 65.80 |
| 5 | InternVL2-26B | Shanghai AI Lab | 26B | 2024/10/01 | 63.08 | 60.40 | 70.03 | 52.43 |
| 6 | VILA1.5-40B | NVIDIA & MIT | 40B | 2024/10/01 | 63.08 | 60.40 | 70.03 | 52.43 |
| 7 | GPT-4o | OpenAI | - | 2024/10/01 | 59.71 | 57.63 | 64.17 | 54.65 |
| 8 | InternVL-Chat-V1.2 | Shanghai AI Lab | 40B | 2024/10/01 | 59.61 | 56.49 | 63.79 | 60.71 |
| 9 | InternVL-Chat-V1.5 | Shanghai AI Lab | 26B | 2024/10/01 | 59.40 | 53.68 | 68.20 | 57.01 |
| 10 | InternVL2-8B | Shanghai AI Lab | 8B | 2024/10/01 | 58.47 | 53.44 | 67.00 | 54.10 |
| 11 | LLaVA-1.6-34B | NTU & UW Madison & ByteDance | 34B | 2024/10/01 | 58.25 | 57.82 | 58.88 | 58.17 |
| 12 | MiniCPM-V2.6 | Tsinghua University | 8B | 2024/10/01 | 57.01 | 55.36 | 60.14 | 54.43 |
| 13 | InternLM-XComposer2-4KHD-7B | Shanghai AI Lab & CUHK & SenseTime | 7B | 2024/10/01 | 56.69 | 52.55 | 61.71 | 60.02 |
| 14 | Qwen-VL-Max | Alibaba | - | 2024/10/01 | 55.18 | 50.36 | 59.63 | 63.87 |
| 15 | InternLM-XComposer2.5-7B | Shanghai AI Lab & CUHK & SenseTime | 7B | 2024/10/01 | 55.10 | 50.61 | 63.16 | 49.64 |
| 16 | Hunyuan-Vision | Tencent | - | 2024/10/01 | 54.64 | 54.80 | 57.18 | 45.03 |
| 17 | InternLM-XComposer2-VL | Shanghai AI Lab & CUHK & SenseTime | 7B | 2024/10/01 | 54.62 | 51.25 | 58.75 | 57.15 |
| 18 | Gemini-1.5-Pro | Google | - | 2024/10/01 | 53.27 | 50.64 | 58.12 | 49.60 |
| 19 | Mini-Gemini-34B | CUHK & SmartMore | 34B | 2024/10/01 | 53.06 | 51.25 | 58.94 | 41.79 |
| 20 | InternVL2-4B | Shanghai AI Lab | 4B | 2024/10/01 | 52.03 | 46.94 | 62.53 | 41.18 |
| 21 | LLaMA-3.2-11B-Vision-Instruct | Meta | 11B | 2024/10/01 | 52.01 | 50.88 | 54.47 | 49.17 |
| 22 | MiniCPM-Llama3-V2.5 | Tsinghua University | 8B | 2024/10/01 | 51.54 | 45.68 | 62.85 | 41.79 |
| 23 | Mini-Gemini-34B-HD | CUHK & SmartMore | 34B | 2024/10/01 | 51.48 | 47.73 | 61.40 | 35.91 |
| 24 | Bunny-LLaMA-3-V | BAAI | 8B | 2024/10/01 | 50.81 | 47.81 | 52.64 | 59.44 |
| 25 | Mini-Monkey | HUST | 2B | 2024/10/01 | 50.41 | 47.81 | 56.49 | 42.37 |
| 26 | Phi3.5-Vision-Instruct | Microsoft | 4.2B | 2024/10/01 | 50.02 | 45.97 | 54.53 | 54.65 |
| 27 | CogVLM2-Llama3-Chat-19B | Zhipu AI | 19B | 2024/10/01 | 49.84 | 50.34 | 48.87 | 50.69 |
| 28 | Phi3-Vision-Instruct | Microsoft | 4.2B | 2024/10/01 | 48.52 | 45.55 | 50.44 | 56.75 |
| 29 | Yi-VL-34B | 01.AI | 34B | 2024/10/01 | 47.86 | 42.99 | 53.15 | 53.88 |
| 30 | Step-1V-32K | Stepfun | - | 2024/10/01 | 47.64 | 41.25 | 57.49 | 45.46 |
| 31 | ConvLLaVA-1024-7B | Alibaba & Tsinghua University | 7B | 2024/10/01 | 47.32 | 43.70 | 54.41 | 40.89 |
| 32 | Yi-VL-6B | 01.AI | 6B | 2024/10/01 | 46.87 | 43.80 | 50.76 | 48.76 |
| 33 | Bunny-3B | BAAI | 3B | 2024/10/01 | 46.32 | 43.42 | 47.98 | 55.08 |
| 34 | Bunny-4B-V1.0 | BAAI | 3B | 2024/10/01 | 46.07 | 43.68 | 50.50 | 42.66 |
| 35 | LLaVA-HR-13B | Xiamen University | 13B | 2024/10/01 | 46.02 | 41.83 | 51.26 | 48.80 |
| 36 | ConvLLaVA-1536-7B | Alibaba & Tsinghua University | 7B | 2024/10/01 | 45.52 | 41.84 | 54.09 | 34.20 |
| 37 | InternVL2-2B | Shanghai AI Lab | 2B | 2024/10/01 | 45.11 | 42.37 | 51.07 | 38.16 |
| 38 | Monkey-Chat | HUST | 7.7B | 2024/10/01 | 44.90 | 41.79 | 48.24 | 48.91 |
| 39 | Mini-Gemini-13B | CUHK & SmartMore | 13B | 2024/10/01 | 43.74 | 38.51 | 54.60 | 32.28 |
| 40 | SliME-7B | UCAS & Squirrel AI & Alibaba & Meta | 7B | 2024/10/01 | 43.45 | 40.56 | 51.51 | 30.03 |
| 41 | INF-LLaVA* | Xiamen University | 8B | 2024/10/01 | 43.32 | 40.13 | 51.39 | 31.41 |
| 42 | SliME-8B | UCAS & Squirrel AI & Alibaba & Meta | 8B | 2024/10/01 | 43.29 | 40.44 | 51.26 | 29.96 |
| 43 | INF-LLaVA | Xiamen University | 8B | 2024/10/01 | 43.04 | 41.80 | 46.98 | 35.58 |
| 44 | LLaVA-HR-7B | Xiamen University | 7B | 2024/10/01 | 42.73 | 39.38 | 50.38 | 33.04 |
| 45 | SliME-13B | UCAS & Squirrel AI & Alibaba & Meta | 13B | 2024/10/01 | 42.63 | 39.30 | 50.06 | 33.55 |
| 46 | ConvLLaVA-768-7B | Alibaba & Tsinghua University | 7B | 2024/10/01 | 42.40 | 36.51 | 52.46 | 37.11 |
| 47 | InternVL2-1B | Shanghai AI Lab | 1B | 2024/10/01 | 42.06 | 39.65 | 49.62 | 27.89 |
| 48 | Mini-Gemini-13B-HD | CUHK & SmartMore | 13B | 2024/10/01 | 41.99 | 37.24 | 51.07 | 34.28 |
| 49 | Qwen-VL-Chat | Alibaba | 13B | 2024/10/01 | 41.64 | 36.10 | 49.69 | 41.54 |
| 50 | DeepStack-L-HD-Vicuna-7B | Fudan University & Microsoft | 7B | 2024/10/01 | 40.26 | 35.19 | 48.87 | 35.88 |
| 51 | DeepStack-L-Vicuna-7B | Fudan University & Microsoft | 7B | 2024/10/01 | 39.75 | 36.92 | 46.60 | 30.21 |
| 52 | LLaVA-1.6-Vicuna-13B | NTU & UW Madison & ByteDance | 13B | 2024/10/01 | 38.03 | 31.15 | 47.92 | 38.16 |
| 53 | LLaVA-1.6-Mistral-7B | NTU & UW Madison & ByteDance | 7B | 2024/10/01 | 37.18 | 33.64 | 42.00 | 38.24 |
| 54 | LLaVA-1.5-13B | NTU & UW Madison & ByteDance | 13B | 2024/10/01 | 36.07 | 29.91 | 43.45 | 41.39 |
| 55 | Random Choice | - | - | 2024/10/01 | 30.15 | 24.88 | 38.22 | 28.61 |
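For offline analysis, the interactive re-sorting described above can be reproduced in a few lines. The sketch below uses only values transcribed from the table (first three rows shown); the dict keys are our own naming, not an official data format.

```python
# Re-sort leaderboard rows by any metric column (values from the table above).
rows = [
    {"model": "InternVL2-40B", "overall": 67.95, "perception": 65.44, "reasoning": 73.99, "probing": 59.59},
    {"model": "InternVL2-76B", "overall": 67.28, "perception": 63.41, "reasoning": 75.44, "probing": 58.46},
    {"model": "Qwen2-VL-72B",  "overall": 65.24, "perception": 56.53, "reasoning": 76.39, "probing": 70.26},
]

# Default view sorts by "overall"; swap in "perception", "reasoning", or "probing" to re-sort.
metric = "reasoning"
for row in sorted(rows, key=lambda r: r[metric], reverse=True):
    print(f'{row["model"]:<16} {row[metric]:.2f}')
```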

Benchmark

📊 Statistics & Analysis

Question Category Hierarchy: question types in the MMComposition benchmark for evaluating vision-language models.

🧪 Experiments





🔭 Visualization Results





Citation


@article{hua2024mmcomposition,
    title={MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models},
    author={Hua, Hang and Tang, Yunlong and Zeng, Ziyun and Cao, Liangliang and Yang, Zhengyuan and He, Hangfeng and Xu, Chenliang and Luo, Jiebo},
    journal={arXiv preprint arXiv:2410.09733},
    year={2024}
}