Recent advances in multi-modal image generation models have led to remarkable progress in both text-to-image (T2I) and personalized generation tasks. However, existing benchmarks are fragmented, covering only a subset of capabilities or lacking compositional, explainable evaluation. To fill this gap, we propose the Multi-Modal Image Generation Benchmark (MMIG-Bench), a comprehensive benchmark for evaluating multi-modal image generation models. MMIG-Bench unifies compositional evaluation across T2I and customized generation, introduces explainable aspect-level metrics, and provides extensive human and automatic evaluations. Our results offer a thorough analysis of state-of-the-art diffusion, autoregressive, and API-based models, highlighting their strengths and limitations and pointing to future research directions for robust, explainable multi-modal generation.
Overview of MMIG-Bench. We present a unified multi-modal benchmark containing 1,750 multi-view reference images and 4,850 richly annotated text prompts, covering both text-only and image-text-conditioned generation. We also propose a comprehensive three-level evaluation framework that delivers holistic and interpretable scores: low-level metrics for visual artifacts and identity preservation, a mid-level VQA-based Aspect Matching Score (AMS), and high-level metrics for aesthetics and human preference.
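As a concrete illustration of the mid-level metric, the sketch below computes a VQA-based aspect matching score by asking binary, aspect-level questions about a generated image and averaging the confidence a VQA model assigns to "yes". The BLIP-VQA backbone, the example questions, and the yes-probability scoring rule are illustrative assumptions, not the exact MMIG-Bench implementation.

```python
# Minimal sketch of a VQA-based aspect matching score (mid-level AMS).
# Assumptions: BLIP-VQA as the scorer, hand-written yes/no aspect questions,
# and "probability of answering yes" as the per-aspect score.
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")

def aspect_matching_score(image_path: str, aspect_questions: list[str]) -> float:
    """Average the confidence that the VQA model answers 'yes' to each
    aspect-level question about the generated image (scaled to 0-100)."""
    image = Image.open(image_path).convert("RGB")
    scores = []
    for question in aspect_questions:
        answers = vqa(image=image, question=question, top_k=5)
        # Confidence assigned to a 'yes' answer; 0 if 'yes' is not among the top answers.
        yes_prob = next((a["score"] for a in answers if a["answer"].lower() == "yes"), 0.0)
        scores.append(yes_prob)
    return 100.0 * sum(scores) / max(len(scores), 1)

# Aspect questions derived from the prompt "two red apples on a wooden table".
questions = [
    "Are there apples in the image?",     # object
    "Are the apples red?",                # attribute
    "Are there exactly two apples?",      # counting
    "Are the apples on a wooden table?",  # relation
]
print(aspect_matching_score("generated.png", questions))
```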
Representative qualitative results on MMIG-Bench. Our benchmark enables interpretable, compositional analysis of generation outputs at object, relation, attribute, and counting levels.
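To make the aspect levels concrete, the hypothetical snippet below shows one way a prompt could be annotated with object, attribute, relation, and counting aspects and grouped for per-aspect reporting. The data structure and the example annotations are our own illustration, not the benchmark's annotation format.

```python
# Hypothetical per-aspect annotation of a prompt, grouped by aspect type so that
# AMS-style scores can be reported separately for objects, attributes,
# relations, and counting.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Aspect:
    kind: str      # "object" | "attribute" | "relation" | "counting"
    phrase: str    # span of the prompt the aspect refers to
    question: str  # binary question for the VQA scorer

prompt = "two red apples on a wooden table"
aspects = [
    Aspect("object",    "apples",            "Are there apples in the image?"),
    Aspect("attribute", "red apples",        "Are the apples red?"),
    Aspect("counting",  "two apples",        "Are there exactly two apples?"),
    Aspect("relation",  "apples on a table", "Are the apples on a wooden table?"),
]

# Group questions by aspect type for a per-aspect score breakdown.
by_kind = defaultdict(list)
for a in aspects:
    by_kind[a.kind].append(a.question)
print(dict(by_kind))
```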
Method | CLIP-T ↑ | PAL4VST ↓ | AMS ↑ | Human ↑ | Aesthetic ↑ | HPSv2 ↑ | PickScore ↑ |
---|---|---|---|---|---|---|---|
Diffusion Models | |||||||
SDXL | 33.529 | 14.340 | 79.08 | 72.29 | 6.337 | 0.277 | 0.120 |
Photon-v1 | 33.296 | 2.947 | 77.12 | 69.49 | 6.391 | 0.284 | 0.088 |
Lumina-2 | 33.281 | 15.531 | 84.11 | 73.18 | 6.048 | 0.287 | 0.116 |
HunyuanDiT-v1.2 | 33.701 | 8.024 | 83.61 | 74.89 | 6.379 | 0.300 | 0.144 |
PixArt-Sigma-XL2 | 33.682 | 9.283 | 83.18 | 76.65 | 6.409 | 0.304 | 0.165 |
Flux.1-dev | 33.017 | 2.171 | 84.44 | 76.44 | 6.433 | 0.307 | 0.210 |
SD 3.5-large | 33.873 | 6.359 | 85.33 | 77.04 | 6.318 | 0.294 | 0.157 |
HiDream-I1-Full | 33.876 | 1.522 | 89.65 | 83.18 | 6.457 | 0.321 | 0.450 |
Autoregressive Models | |||||||
JanusFlow | 31.498 | 365.663 | 70.25 | 75.69 | 5.221 | 0.209 | 0.031 |
Janus-Pro-7B | 33.358 | 31.954 | 85.35 | 80.36 | 6.038 | 0.275 | 0.129 |
API-based Models | |||||||
Gemini-2.0-Flash | 32.433 | 11.053 | 85.35 | 81.98 | 6.102 | 0.275 | 0.110 |
GPT-4o | 32.380 | 3.497 | 82.57 | 81.02 | 6.719 | 0.279 | 0.263 |
Table 1. Quantitative comparison across 12 text-to-image models using 2,100 prompts. Bold indicates the best result in each column; underline indicates the second best.
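For reference, the CLIP-T column measures prompt-image alignment. A minimal sketch of such a score is shown below, assuming an off-the-shelf ViT-L/14 CLIP checkpoint and a cosine similarity scaled by 100; the benchmark's exact backbone and scaling are not specified in this table and may differ.

```python
# Sketch of a CLIP-T style prompt-image alignment score.
# Assumptions: openai/clip-vit-large-patch14 backbone, cosine similarity x100.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_t(image_path: str, prompt: str) -> float:
    inputs = processor(
        text=[prompt],
        images=Image.open(image_path).convert("RGB"),
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    # Cosine similarity between L2-normalized image and text embeddings.
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return 100.0 * (img * txt).sum().item()

print(clip_t("generated.png", "two red apples on a wooden table"))
```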
Method | CLIP-T ↑ | CLIP-I ↑ | DINOv2 ↑ | CUTE ↑ | PAL4VST ↓ | BLIPVQA ↑ | AMS ↑ | Aesthetic ↑ | HPSv2 ↑ | PickScore ↑ |
---|---|---|---|---|---|---|---|---|---|---|
Diffusion Models | ||||||||||
BLIP Diffusion | 26.137 | 80.286 | 26.232 | 69.681 | 56.780 | 0.247 | 41.59 | 5.830 | 0.213 | 0.032 |
DreamBooth | 24.227 | 88.758 | 38.961 | 79.780 | 43.535 | 0.108 | 28.00 | 5.368 | 0.179 | 0.019 |
Emu2 | 28.410 | 79.026 | 31.831 | 71.132 | 10.461 | 0.378 | 53.13 | 5.639 | 0.243 | 0.066 |
IP-Adapter-XL | 28.577 | 85.297 | 34.177 | 74.995 | 8.531 | 0.290 | 51.10 | 5.840 | 0.233 | 0.073 |
MS Diffusion | 31.446 | 77.827 | 23.600 | 71.306 | 4.748 | 0.496 | 71.40 | 5.979 | 0.271 | 0.143 |
API-based Models | ||||||||||
GPT-4o | 33.527 | 75.152 | 25.174 | 64.776 | 1.973 | 0.672 | 90.90 | 6.368 | 0.289 | 0.550 |
Table 2. Quantitative comparison across 6 multi-modal image generation models (1,690 samples). Bold indicates the best result in each column; underline indicates the second best.
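Similarly, the CLIP-I and DINOv2 columns measure identity preservation between the reference image and the generated image. The sketch below computes a DINOv2 feature similarity, assuming the dinov2-base checkpoint, CLS-token pooling, and a score scaled by 100; these are illustrative choices, not necessarily the benchmark's exact setup.

```python
# Sketch of a DINOv2-based identity-preservation score between a reference
# image and a generated image. Assumptions: facebook/dinov2-base checkpoint,
# CLS-token pooling, cosine similarity x100.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def dino_similarity(reference_path: str, generated_path: str) -> float:
    images = [Image.open(p).convert("RGB") for p in (reference_path, generated_path)]
    inputs = processor(images=images, return_tensors="pt")
    # Use the CLS token of the last hidden state as a global image descriptor.
    feats = model(**inputs).last_hidden_state[:, 0]
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return 100.0 * (feats[0] @ feats[1]).item()

print(dino_similarity("reference.png", "generated.png"))
```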
@article{hua2025mmig,
title={MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models},
author={Hua, Hang and Zeng, Ziyun and Song, Yizhi and Tang, Yunlong and He, Liu and Aliaga, Daniel and Xiong, Wei and Luo, Jiebo},
journal={arXiv preprint arXiv:2505.19415},
year={2025}
}