The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate reasoning across various applications, including image and video captioning, visual question answering, and cross-modal retrieval. Despite their superior capabilities, VLMs struggle with fine-grained image regional composition information perception. Specifically, they have difficulty accurately aligning the segmentation masks with the corresponding semantics and precisely de- scribing the compositional aspects of the referred regions. However, compositionality – the ability to understand and generate novel combinations of known visual and textual components – is critical for facilitating coherent reasoning and understanding across modalities by VLMs. To address this issue, we propose FINECAPTION, a novel VLM that can recognize arbitrary masks as referential inputs and pro- cess high-resolution images for compositional image cap- tioning at different granularity levels. To support this endeavor, we introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning, which introduces the task of compositional attribute-aware regional image captioning. Empirical results demonstrate the effectiveness of our proposed model compared to other state-of-the-art VLMs. Additionally, we analyze the capabilities of current VLMs in recognizing various visual prompts for compositional region image captioning, highlighting areas for improvement in VLM design and training.
We propose FINECAPTION, a novel Vision-Language model with the improved capabilities of Attribute-Aware Regional Captioning, Regional Dense Captioning, and Comprehensive Global Image Captioning. FINECAPTION can recognize arbitrary masks as referential inputs and process high-resolution images. Moreover, models trained using the traditional bounding boxes as region reference are inadequate to precisely describe the region of interest.
Overview of FINECAPTION: The model incorporates a mask-aware visual encoder and two high-resolution encoders (ConvNext and SAM), enabling precise recognition of mask references and the perception of detailed compositional and spatial information for images.
FINECAPTION provides accurate and concise descriptions focused on specified attributes and regions, while GPT-4o often misses f ine-grained references and includes irrelevant information.
@article{hua2024finecaption,
title={FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity},
author={Hua, Hang and Liu, Qing and Zhang, Lingzhi and Shi, Jing and Zhang, Zhifei and Wang, Yilin and Zhang, Jianming and Luo, Jiebo},
journal={arXiv preprint arXiv:2411.15411},
year={2024}
}