MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models

* Equal contribution
¹University of Rochester, ²Purdue University, ³NVIDIA

Abstract

Recent advances in multi-modal image generation models have led to remarkable progress in both text-to-image (T2I) and personalized generation tasks. However, existing benchmarks are fragmented: each covers only a subset of these capabilities or lacks compositional, explainable evaluation. To fill this gap, we propose the Multi-Modal Image Generation Benchmark (MMIG-Bench), a comprehensive benchmark for evaluating multi-modal image generation models. MMIG-Bench unifies compositional evaluation across T2I and customized generation, introduces explainable aspect-level metrics, and provides extensive human and automatic evaluations. Our results offer a thorough analysis of state-of-the-art diffusion, autoregressive, and API-based models, highlighting their strengths and limitations and pointing to future research directions for robust, explainable multi-modal generation.

Contributions

  • Unified Benchmark: MMIG-Bench offers the first unified, compositional benchmark covering both T2I and personalized (customization) image generation models.
  • Aspect-Level Explainability: MMIG-Bench introduces explainable aspect-level metrics (object, relation, attribute, counting) to evaluate fine-grained compositional capabilities.
  • Comprehensive Evaluation: MMIG-Bench provides extensive comparisons of 18 state-of-the-art models, combining human studies with automated metrics.
  • Open-Source Platform: The benchmark, code, datasets, and evaluation scripts are publicly released to promote transparent and reproducible research.
Overview of MMIG-Bench

Overview of MMIG-Bench. We present a unified multi-modal benchmark containing 1,750 multi-view reference images and 4,850 richly annotated text prompts, covering both text-only and image-text-conditioned generation. We also propose a comprehensive three-level evaluation framework: low-level metrics for visual artifacts and identity preservation, mid-level VQA-based Aspect Matching Scores, and high-level metrics for aesthetics and human preference. Together, these levels deliver holistic and interpretable scores.
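
To make the mid-level evaluation concrete, the sketch below scores a generated image against aspect-level yes/no questions (object, attribute, relation, counting) with an off-the-shelf VQA model, in the spirit of the Aspect Matching Score. The Salesforce/blip-vqa-base backbone, the question templates, and the simple yes-counting aggregation are illustrative assumptions, not the benchmark's exact implementation.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Illustrative VQA backbone; MMIG-Bench's actual model and templates may differ.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").eval()

def aspect_matching_score(image_path: str, aspect_questions: list[str]) -> float:
    """Fraction of aspect-level yes/no questions the VQA model answers 'yes'."""
    image = Image.open(image_path).convert("RGB")
    hits = 0
    for question in aspect_questions:
        inputs = processor(image, question, return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=5)
        answer = processor.decode(out[0], skip_special_tokens=True).strip().lower()
        hits += int(answer.startswith("yes"))
    return hits / len(aspect_questions)

# Hypothetical aspect decomposition of the prompt "two red apples on a wooden table":
questions = [
    "Are there apples in the image?",         # object
    "Are the apples red?",                    # attribute
    "Are the apples on a wooden table?",      # relation
    "Are there exactly two apples?",          # counting
]
# score = aspect_matching_score("generated.png", questions)
```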

Qualitative Examples of Different Models


Representative qualitative results on MMIG-Bench. Our benchmark enables interpretable, compositional analysis of generation outputs at object, relation, attribute, and counting levels.

Comparison of Models (Text-to-Image)

Method CLIP-T ↑ PAL4VST ↓ AMS ↑ Human ↑ Aesthetic ↑ HPSv2 ↑ PickScore ↑
Diffusion Models
SDXL 33.529 14.340 79.08 72.29 6.337 0.277 0.120
Photon-v1 33.296 2.947 77.12 69.49 6.391 0.284 0.088
Lumina-2 33.281 15.531 84.11 73.18 6.048 0.287 0.116
HunyuanDiT-v1.2 33.701 8.024 83.61 74.89 6.379 0.300 0.144
PixArt-Sigma-XL2 33.682 9.283 83.18 76.65 6.409 0.304 0.165
Flux.1-dev 33.017 2.171 84.44 76.44 6.433 0.307 0.210
SD 3.5-large 33.873 6.359 85.33 77.04 6.318 0.294 0.157
HiDream-I1-Full 33.876 1.522 89.65 83.18 6.457 0.321 0.450
Autoregressive Models
JanusFlow 31.498 365.663 70.25 75.69 5.221 0.209 0.031
Janus-Pro-7B 33.358 31.954 85.35 80.36 6.038 0.275 0.129
API-based Models
Gemini-2.0-Flash 32.433 11.053 85.35 81.98 6.102 0.275 0.110
GPT-4o 32.380 3.497 82.57 81.02 6.719 0.279 0.263

Table 1. Quantitative comparison of 12 text-to-image models on 2,100 prompts. Bold indicates the best result in each column; underline indicates the second best.
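
For reference, CLIP-T measures text-image alignment as the cosine similarity between CLIP embeddings of the prompt and the generated image. A minimal sketch is given below, assuming the openai/clip-vit-large-patch14 checkpoint and a ×100 scaling to match the magnitude of the numbers reported above; the benchmark's exact CLIP variant may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed backbone; the benchmark's exact CLIP variant may differ.
MODEL_NAME = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clip_t(image_path: str, prompt: str) -> float:
    """Cosine similarity between image and prompt embeddings, scaled by 100."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item() * 100.0
```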

Comparison of Multi-Modal Generation Models (Customization Task)

Method CLIP-T ↑ CLIP-I ↑ DINOv2 ↑ CUTE ↑ PAL4VST ↓ BLIPVQA ↑ AMS ↑ Aesthetic ↑ HPSv2 ↑ PickScore ↑
Diffusion Models
BLIP Diffusion 26.137 80.286 26.232 69.681 56.780 0.247 41.59 5.830 0.213 0.032
DreamBooth 24.227 88.758 38.961 79.780 43.535 0.108 28.00 5.368 0.179 0.019
Emu2 28.410 79.026 31.831 71.132 10.461 0.378 53.13 5.639 0.243 0.066
IP-Adapter-XL 28.577 85.297 34.177 74.995 8.531 0.290 51.10 5.840 0.233 0.073
MS Diffusion 31.446 77.827 23.600 71.306 4.748 0.496 71.40 5.979 0.271 0.143
API-based Models
GPT-4o 33.527 75.152 25.174 64.776 1.973 0.672 90.90 6.368 0.289 0.550

Table 2. Quantitative comparison of 6 multi-modal image generation models (1,690 samples). Bold indicates the best result in each column; underline indicates the second best.
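
For the identity-preservation columns (CLIP-I, DINOv2), a common recipe is to embed the reference and generated images with the same vision encoder and report their cosine similarity. The sketch below follows that recipe with a DINOv2 backbone; the facebook/dinov2-base checkpoint, the use of the CLS token, and the ×100 scaling are assumptions for illustration, and the benchmark's exact feature extractor may differ.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed backbone; the benchmark's exact DINOv2 variant may differ.
MODEL_NAME = "facebook/dinov2-base"
processor = AutoImageProcessor.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def dino_similarity(ref_path: str, gen_path: str) -> float:
    """Cosine similarity (x100) between DINOv2 CLS embeddings of two images."""
    feats = []
    for path in (ref_path, gen_path):
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            cls = model(**inputs).last_hidden_state[:, 0]  # CLS token embedding
        feats.append(cls / cls.norm(dim=-1, keepdim=True))
    return (feats[0] @ feats[1].T).item() * 100.0
```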

BibTeX


@article{hua2025mmig,
  title={MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models},
  author={Hua, Hang and Zeng, Ziyun and Song, Yizhi and Tang, Yunlong and He, Liu and Aliaga, Daniel and Xiong, Wei and Luo, Jiebo},
  journal={arXiv preprint arXiv:2505.19415},
  year={2025}
}