Recent advances in multi-modal image generation models have led to remarkable progress in both text-to-image (T2I) and personalized generation tasks. However, existing benchmarks are fragmented, covering only a subset of capabilities or lacking compositional, explainable evaluation. To fill this gap, we propose the Multi-Modal Image Generation Benchmark (MMIG-Bench), a comprehensive benchmark for evaluating multi-modal image generation models. MMIG-Bench unifies compositional evaluation across T2I and customized generation, introduces explainable aspect-level metrics, and provides extensive human and automatic evaluations. Our results offer a thorough analysis of state-of-the-art diffusion, autoregressive, and API-based models, highlighting their strengths and limitations and pointing to future research directions for robust, explainable multi-modal generation.
Overview of MMIG-Bench. We present a unified multi-modal benchmark containing 1,750 multi-view reference images and 4,850 richly annotated text prompts, covering both text-only and image-text-conditioned generation. We also propose a comprehensive three-level evaluation framework that delivers holistic and interpretable scores: low-level metrics for visual artifacts and identity preservation, a mid-level VQA-based Aspect Matching Score (AMS), and high-level metrics for aesthetics and human preference.
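As a concrete illustration of the mid-level metric, the sketch below computes a VQA-based aspect matching score by asking binary, aspect-level questions about a generated image and averaging the confidence a VQA model assigns to "yes". The BLIP-VQA backbone, the example questions, and the yes-probability scoring rule are illustrative assumptions, not the exact MMIG-Bench implementation.

```python
# Minimal sketch of a VQA-based aspect matching score (mid-level AMS).
# Assumptions: BLIP-VQA as the scorer, hand-written yes/no aspect questions,
# and "probability of answering yes" as the per-aspect score.
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")

def aspect_matching_score(image_path: str, aspect_questions: list[str]) -> float:
    """Average the confidence that the VQA model answers 'yes' to each
    aspect-level question about the generated image (scaled to 0-100)."""
    image = Image.open(image_path).convert("RGB")
    scores = []
    for question in aspect_questions:
        answers = vqa(image=image, question=question, top_k=5)
        # Confidence assigned to a 'yes' answer; 0 if 'yes' is not among the top answers.
        yes_prob = next((a["score"] for a in answers if a["answer"].lower() == "yes"), 0.0)
        scores.append(yes_prob)
    return 100.0 * sum(scores) / max(len(scores), 1)

# Aspect questions derived from the prompt "two red apples on a wooden table".
questions = [
    "Are there apples in the image?",     # object
    "Are the apples red?",                # attribute
    "Are there exactly two apples?",      # counting
    "Are the apples on a wooden table?",  # relation
]
print(aspect_matching_score("generated.png", questions))
```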
Representative qualitative results on MMIG-Bench. Our benchmark enables interpretable, compositional analysis of generation outputs at object, relation, attribute, and counting levels.
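To make the aspect levels concrete, the hypothetical snippet below shows one way a prompt could be annotated with object, attribute, relation, and counting aspects and grouped for per-aspect reporting. The data structure and the example annotations are our own illustration, not the benchmark's annotation format.

```python
# Hypothetical per-aspect annotation of a prompt, grouped by aspect type so that
# AMS-style scores can be reported separately for objects, attributes,
# relations, and counting.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Aspect:
    kind: str      # "object" | "attribute" | "relation" | "counting"
    phrase: str    # span of the prompt the aspect refers to
    question: str  # binary question for the VQA scorer

prompt = "two red apples on a wooden table"
aspects = [
    Aspect("object",    "apples",            "Are there apples in the image?"),
    Aspect("attribute", "red apples",        "Are the apples red?"),
    Aspect("counting",  "two apples",        "Are there exactly two apples?"),
    Aspect("relation",  "apples on a table", "Are the apples on a wooden table?"),
]

# Group questions by aspect type for a per-aspect score breakdown.
by_kind = defaultdict(list)
for a in aspects:
    by_kind[a.kind].append(a.question)
print(dict(by_kind))
```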
Method | CLIP-T ↑ | PAL4VST ↓ | AMS ↑ | Human ↑ | Aesthetic ↑ | HPSv2 ↑ | PickScore ↑ |
---|---|---|---|---|---|---|---|
Diffusion Models | |||||||
SDXL | 33.529 | 14.340 | 79.08 | 72.29 | 6.337 | 0.277 | 0.120 |
Photon-v1 | 33.296 | 2.947 | 77.12 | 69.49 | 6.391 | 0.284 | 0.088 |
Lumina-2 | 33.281 | 15.531 | 84.11 | 73.18 | 6.048 | 0.287 | 0.116 |
HunyuanDiT-v1.2 | 33.701 | 8.024 | 83.61 | 74.89 | 6.379 | 0.300 | 0.144 |
PixArt-Sigma-XL2 | 33.682 | 9.283 | 83.18 | 76.65 | 6.409 | 0.304 | 0.165 |
Flux.1-dev | 33.017 | 2.171 | 84.44 | 76.44 | 6.433 | 0.307 | 0.210 |
SD 3.5-large | 33.873 | 6.359 | 85.33 | 77.04 | 6.318 | 0.294 | 0.157 |
HiDream-I1-Full | 33.876 | 1.522 | 89.65 | 83.18 | 6.457 | 0.321 | 0.450 |
Autoregressive Models | |||||||
JanusFlow | 31.498 | 365.663 | 70.25 | 75.69 | 5.221 | 0.209 | 0.031 |
Janus-Pro-7B | 33.358 | 31.954 | 85.35 | 80.36 | 6.038 | 0.275 | 0.129 |
API-based Models | |||||||
Gemini-2.0-Flash | 32.433 | 11.053 | 85.35 | 81.98 | 6.102 | 0.275 | 0.110 |
GPT-4o | 32.380 | 3.497 | 82.57 | 81.02 | 6.719 | 0.279 | 0.263 |
Table 1. Quantitative comparison across 12 text-to-image models using 2,100 prompts. Bold indicates the best result in each column; underline indicates the second best.
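For reference, the CLIP-T column measures prompt-image alignment. A minimal sketch of such a score is shown below, assuming an off-the-shelf ViT-L/14 CLIP checkpoint and a cosine similarity scaled by 100; the benchmark's exact backbone and scaling are not specified in this table and may differ.

```python
# Sketch of a CLIP-T style prompt-image alignment score.
# Assumptions: openai/clip-vit-large-patch14 backbone, cosine similarity x100.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_t(image_path: str, prompt: str) -> float:
    inputs = processor(
        text=[prompt],
        images=Image.open(image_path).convert("RGB"),
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    # Cosine similarity between L2-normalized image and text embeddings.
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return 100.0 * (img * txt).sum().item()

print(clip_t("generated.png", "two red apples on a wooden table"))
```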
Method | CLIP-T ↑ | CLIP-I ↑ | DINOv2 ↑ | CUTE ↑ | PAL4VST ↓ | BLIPVQA ↑ | AMS ↑ | Aesthetic ↑ | HPSv2 ↑ | PickScore ↑ |
---|---|---|---|---|---|---|---|---|---|---|
Diffusion Models | ||||||||||
BLIP Diffusion | 26.137 | 80.286 | 26.232 | 69.681 | 56.780 | 0.247 | 41.59 | 5.830 | 0.213 | 0.032 |
DreamBooth | 24.227 | 88.758 | 38.961 | 79.780 | 43.535 | 0.108 | 28.00 | 5.368 | 0.179 | 0.019 |
Emu2 | 28.410 | 79.026 | 31.831 | 71.132 | 10.461 | 0.378 | 53.13 | 5.639 | 0.243 | 0.066 |
IP-Adapter-XL | 28.577 | 85.297 | 34.177 | 74.995 | 8.531 | 0.290 | 51.10 | 5.840 | 0.233 | 0.073 |
MS Diffusion | 31.446 | 77.827 | 23.600 | 71.306 | 4.748 | 0.496 | 71.40 | 5.979 | 0.271 | 0.143 |
API-based Models | ||||||||||
GPT-4o | 33.527 | 75.152 | 25.174 | 64.776 | 1.973 | 0.672 | 90.90 | 6.368 | 0.289 | 0.550 |
Table 2. Quantitative comparison across 6 multi-modal image generation models (1,690 samples). Bold indicates the best result in each column; underline indicates the second best.
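Similarly, the CLIP-I and DINOv2 columns measure identity preservation between the reference image and the generated image. The sketch below computes a DINOv2 feature similarity, assuming the dinov2-base checkpoint, CLS-token pooling, and a score scaled by 100; these are illustrative choices, not necessarily the benchmark's exact setup.

```python
# Sketch of a DINOv2-based identity-preservation score between a reference
# image and a generated image. Assumptions: facebook/dinov2-base checkpoint,
# CLS-token pooling, cosine similarity x100.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def dino_similarity(reference_path: str, generated_path: str) -> float:
    images = [Image.open(p).convert("RGB") for p in (reference_path, generated_path)]
    inputs = processor(images=images, return_tensors="pt")
    # Use the CLS token of the last hidden state as a global image descriptor.
    feats = model(**inputs).last_hidden_state[:, 0]
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return 100.0 * (feats[0] @ feats[1]).item()

print(dino_similarity("reference.png", "generated.png"))
```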
@article{hua2025mmig,
title={MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models},
author={Hua, Hang and Zeng, Ziyun and Song, Yizhi and Tang, Yunlong and He, Liu and Aliaga, Daniel and Xiong, Wei and Luo, Jiebo},
journal={arXiv preprint arXiv:2505.19415},
year={2025}
}