Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

State Key Lab of CAD&CG, Zhejiang University
ECCV 2024

We identify the "Text Inertia" phenomenon in LVLMs: some hallucinations are caused by the text context rather than by misinterpretation of image information. We propose a method called PAI to alleviate this phenomenon.

Abstract

Large Vision-Language Models (LVLMs) align image features to the input of Large Language Models (LLMs), enhancing multi-modal reasoning and knowledge utilization capabilities. However, the disparity in scale between models of different modalities has resulted in LLMs assuming a predominant role in multimodal comprehension. This imbalance in model integration can lead to hallucinatory outputs. In particular, LVLMs may generate descriptions that persist in the absence of visual input, suggesting that these narratives are disproportionately influenced by the textual context. We refer to this phenomenon as "text inertia." To counteract this issue, we introduce a training-free algorithm designed to find an equilibrium between image comprehension and language inference. Specifically, we adjust and amplify the attention weights assigned to image tokens, thereby granting greater prominence to visual elements. Meanwhile, we subtract the logits of the pure text input from the logits of the multimodal input, which keeps the model from being biased towards the LLM alone. By enhancing image tokens and reducing the stubborn output of the LLM, we let the LVLM pay more attention to images, alleviating text inertia and reducing hallucination in LVLMs. Our extensive experiments show that this method substantially reduces the frequency of hallucinatory outputs in various LVLMs across different metrics.

Method Overview


Our method primarily consists of two components: (1) inference intervention: scaling up the attention weights assigned to image tokens, and (2) logits refinement: mitigating the influence of pure text context by subtracting pure text logits.
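The two components above can be illustrated with a small NumPy sketch. It is not the authors' implementation. The function names, the `alpha` and `gamma` parameters, and the exact update rules (adding `alpha * |score|` to pre-softmax scores at image positions, and a weighted difference of the two logit vectors) are assumptions chosen only to show the idea.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def amplify_image_attention(scores, image_mask, alpha=0.5):
    """Inference intervention (illustrative form, assumed).

    scores:     pre-softmax attention scores, shape (..., num_keys)
    image_mask: boolean array of shape (num_keys,), True at image-token positions
    alpha:      hypothetical amplification strength (>= 0)
    """
    # Raise the scores at image positions only; text positions are untouched.
    boosted = np.where(image_mask, scores + alpha * np.abs(scores), scores)
    return softmax(boosted, axis=-1)


def refine_logits(logits_multimodal, logits_text_only, gamma=1.1):
    """Logits refinement (illustrative form, assumed).

    Subtracts a scaled copy of the pure-text logits from the multimodal
    logits. gamma is a hypothetical weight; gamma = 1 leaves the
    multimodal logits unchanged.
    """
    return gamma * logits_multimodal - (gamma - 1.0) * logits_text_only


if __name__ == "__main__":
    scores = np.array([1.0, 2.0, 0.5, -1.0])
    image_mask = np.array([False, True, True, False])

    before = softmax(scores)[image_mask].sum()
    after = amplify_image_attention(scores, image_mask)[image_mask].sum()
    print(f"attention mass on image tokens: {before:.3f} -> {after:.3f}")

    mm = np.array([2.0, 1.0])    # logits with image + text input
    txt = np.array([2.0, 0.0])   # logits with text input only
    print("refined logits:", refine_logits(mm, txt, gamma=2.0))
```

In this toy setting, a token that scores equally high with and without the image gains nothing from the refinement, while a token whose score depends on the image is pushed up relative to it.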

BibTeX

@misc{liu2024payingattentionimagetrainingfree,
  title={Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs},
  author={Shi Liu and Kecheng Zheng and Wei Chen},
  year={2024},
  eprint={2407.21771},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.21771},
}