Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

Abstract

Large Vision-Language Models (LVLMs) align image features to the input of Large Language Models (LLMs), enhancing multimodal reasoning and knowledge utilization capabilities. However, the disparity in scale between models of different modalities has resulted in LLMs assuming a predominant role in multimodal comprehension. This imbalance in model integration can lead to hallucinatory outputs. In particular, LVLMs may generate descriptions that persist even when the visual input is removed, suggesting that these narratives are disproportionately influenced by the textual context. We refer to this phenomenon as "text inertia." To counteract this issue, we introduce a training-free algorithm designed to find an equilibrium between image comprehension and language inference. Specifically, we adjust and amplify the attention weights assigned to image tokens, thereby granting greater prominence to visual elements. Meanwhile, we subtract the logits of pure text input from the logits of multimodal input, which keeps the model from being biased towards the LLM alone. By enhancing image tokens and reducing the stubborn output of the LLM, we let the LVLM pay more attention to images, alleviating text inertia and reducing hallucination in LVLMs. Our extensive experiments show that this method substantially reduces the frequency of hallucinatory outputs in various LVLMs across different metrics.

Method Overview

Our method primarily consists of two components: (1) inference intervention: scaling up the attention weights assigned to image tokens, and (2) logits refinement: mitigating the influence of pure text context by subtracting pure text logits.
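The two components can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the `alpha` and `gamma` parameters, the magnitude-proportional boost applied to pre-softmax attention scores, and the contrastive form of the logit combination are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def amplify_image_attention(scores, image_mask, alpha=0.5):
    """Component (1): raise the attention given to image tokens.

    scores:     pre-softmax attention scores for one query, shape (seq_len,)
    image_mask: booleans marking which key positions are image tokens
    alpha:      hypothetical amplification strength (assumed, not from the source)
    """
    scores = np.asarray(scores, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    boosted = scores.copy()
    # Increase each image-token score in proportion to its magnitude.
    boosted[image_mask] += alpha * np.abs(scores[image_mask])
    return softmax(boosted)


def refine_logits(multimodal_logits, text_only_logits, gamma=1.1):
    """Component (2): subtract a share of the pure-text logits.

    gamma is a hypothetical weight; gamma == 1 leaves the multimodal
    logits unchanged, larger values penalise tokens that the text-only
    pass already favours.
    """
    multimodal_logits = np.asarray(multimodal_logits, dtype=float)
    text_only_logits = np.asarray(text_only_logits, dtype=float)
    return gamma * multimodal_logits - (gamma - 1.0) * text_only_logits
```

In this sketch, the text-only logits would come from a second forward pass on the same prompt with the image removed, so a token that scores highly without any visual evidence is pushed down before sampling.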

Main Results

Evaluation results on CHAIR, which evaluates hallucinations in a long-sequence generation paradigm.
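For readers unfamiliar with CHAIR, the sketch below shows how its two standard scores are usually defined: the instance-level score is the fraction of mentioned objects that are not in the image, and the sentence-level score is the fraction of captions containing at least one such object. The function name and input format are hypothetical, and the sketch assumes object mentions have already been extracted from each caption and mapped to ground-truth categories; it does not reproduce the exact pipeline behind the reported results.

```python
def chair_scores(mentioned_objects, ground_truth_objects):
    """Compute instance-level and sentence-level CHAIR scores.

    mentioned_objects:    list of lists, objects named in each caption
    ground_truth_objects: list of sets, objects actually present in each image
    Returns (chair_i, chair_s); lower is better for both.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0

    for mentioned, truth in zip(mentioned_objects, ground_truth_objects):
        # A mention is hallucinated if the object is not in the image.
        hallucinated = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1

    num_captions = len(mentioned_objects)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / num_captions if num_captions else 0.0
    return chair_i, chair_s
```

As a toy example, a caption that mentions a woman, an umbrella, and a handbag for an image containing only the first two contributes one hallucinated mention out of three.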

Evaluation results on POPE, which evaluates hallucinations in a VQA paradigm. We split the evaluation into single-turn and multi-turn settings.
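POPE poses yes/no questions about whether an object is present in the image, so results are typically summarised with binary classification metrics. The sketch below computes them for a list of answers; the function name and the plain "yes"/"no" string format are assumptions, and parsing free-form model answers into these labels is left out.

```python
def pope_metrics(predictions, labels):
    """Binary metrics for yes/no object-presence questions.

    predictions, labels: equal-length lists of "yes" / "no" strings,
    where "yes" is treated as the positive class.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == "yes" and l == "yes")
    fp = sum(1 for p, l in pairs if p == "yes" and l == "no")
    tn = sum(1 for p, l in pairs if p == "no" and l == "no")
    fn = sum(1 for p, l in pairs if p == "no" and l == "yes")

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Share of "yes" answers; a high value suggests a bias towards agreeing.
    yes_ratio = (tp + fp) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "yes_ratio": yes_ratio}
```

A model that hallucinates objects tends to answer "yes" too often, which shows up as low precision and a high yes ratio.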

More Showcases

Input Image:

Instruction:

Please describe the image in detail.

LLaVA-1.5(Greedy)

The image features a woman wearing a floral dress, walking through a garden with a parasol in her hand. She appears to be enjoying her time in the garden, possibly posing for a picture. A handbag can also be seen in the scene, placed close to the woman, possibly belonging to her. The overall atmosphere of the image is pleasant and inviting, showcasing the beauty of the garden.

LLaVA-1.5(PAI)

The image features a smiling woman wearing a floral dress, walking through a garden. She is holding an umbrella in her hand, and appears to be enjoying her time in the garden. The woman is walking between two large white pillars, which are part of a garden wall. The scene is set in a garden, with a potted plant nearby. The woman's smile and the pleasant atmosphere of the scene suggest a pleasant and relaxing environment.

Input Image:

Instruction:

Please describe the image in detail.

LLaVA-1.5(Greedy)

The image features a train with graffiti on its side, traveling down the tracks. The train is positioned on the left side of the image, and the graffiti is visible on the side of the train, covering a significant portion of the surface. There is a person standing near the train, possibly observing or waiting for it to stop. The scene also includes a bench located on the right side of the image, providing a place for people to sit and wait for the train.

LLaVA-1.5(PAI)

The image features a train with graffiti on its side, which is parked at a train station. The train is blue and white in color, and it is parked on the tracks. The graffiti on the train is quite elaborate, covering a significant portion of the train's side. The train is parked on the tracks, and it appears to be a passenger train, as it is not a freight train. The train is parked at a station, and it is likely waiting for passengers to board. The graffiti on the train adds a unique and artistic touch to the otherwise ordinary scene.

Input Image:

Instruction:

Please describe the image in detail.

LLaVA-1.5(Greedy)

The image depicts a snowy hillside with a herd of sheep grazing on the grass. There are at least 14 sheep scattered across the field, with some closer to the foreground and others further away. The sheep are of various sizes, indicating a mix of adult and younger animals. The scene captures the beauty of the snow-covered landscape and the peacefulness of the animals as they graze in the wintery environment.

LLaVA-1.5(PAI)

The image features a large herd of sheep standing in a snowy field. There are at least 14 sheep in the scene, with some of them standing close to each other, while others are spread out across the field. The sheep are standing in the snow, and the scene appears to be a peaceful, snowy landscape. The image captures the beauty of the winter landscape and the calmness of the sheep in their natural environment.

Input Image:

Instruction:

Please describe the image in detail.

LLaVA-1.5(Greedy)

The image features a wooden bench situated in a serene garden setting. The bench is positioned near a tree, providing a shaded and relaxing spot for visitors. The garden is adorned with several potted plants, creating a lush and green atmosphere. In addition to the bench, there are two people in the scene, one standing near the left side of the bench and the other closer to the right side. The presence of these individuals adds a sense of life and activity to the otherwise peaceful garden setting.

LLaVA-1.5(PAI)

The image features a wooden park bench situated in a park-like setting, surrounded by a variety of trees and plants. The bench is located in the middle of the scene, with a few trees and bushes in the background. The bench is surrounded by a few bushes and trees, creating a serene and peaceful atmosphere. The bench is situated in the middle of the scene, with a few trees and bushes in the background. The bench is made of wood and is placed on a gravel surface, adding to the overall charm of the scene.

PAI's performance on reducing hallucinations of LLaVA-1.5-7B.

Input Image:

Instruction:

Please describe the image in detail.

Minigpt4(Greedy)

This image shows a busy street with cars parked on both sides of the road. The buildings on either side of the street are tall and white, with large billboards on the sides. The sky is cloudy and dark, with a few dark clouds visible in the distance. The street is lined with trees and streetlights.

Minigpt4(PAI)

This image shows a busy city street with cars parked on the side of the road. The building in the background is a tall white building with a large billboard on the side. The billboard has a blue and white design on it. There are also several motorcycles parked on the side of the road.

Input Image:

Instruction:

Please describe the image in detail.

Minigpt4(Greedy)

This image shows a pizza with various toppings, including peppers, onions, and mushrooms, on a piece of parchment paper. The pizza appears to be cut into slices, with some slices missing. There is a knife and fork next to the pizza, and a bottle of tomato sauce on the table in the background.

Minigpt4(PAI)

This image shows a pizza that has been cut into slices and is sitting on a piece of parchment paper. The pizza has a variety of toppings, including peppers, onions, and mushrooms, and is covered in a thick layer of cheese. A knife is visible in the foreground, next to the pizza.

Input Image:

Instruction:

Please describe the image in detail.

Minigpt4(Greedy)

The image shows a motorcycle parked in a field next to a tent. The tent is set up on a wooden platform and has a green tarp covering it. The motorcycle is parked on the grass and has a helmet on the handlebars. There are trees in the background and the sun is setting behind them. The sky is orange and pink, and there are clouds in the sky.

Minigpt4(PAI)

This image shows a motorcycle parked in a field next to a tent. The tent is set up on a wooden platform and has a green tarp covering it. There are trees in the background and the sun is setting behind them. The motorcycle is parked on the ground next to the tent, and there are no other objects in the image.

Input Image:

Instruction:

Please describe the image in detail.

Minigpt4(Greedy)

This image shows a stop sign on the side of a road, next to a brick wall. The stop sign has a red background with white letters that read "stop." There is a gate in the wall, with a black iron fence and a wooden door. The gate is closed, and there are no people or cars in the image. The trees on the other side of the road are in full bloom, with pink flowers covering the branches. The sky is clear and blue, with a few clouds in the distance.

Minigpt4(PAI)

This image shows a stop sign on the side of a road, with a gate in the background. The stop sign has a red background with white letters that read "stop." The gate in the background is made of metal and has a black finish. There are some leaves on the ground next to the stop sign, and there are trees in the background.

PAI's performance on reducing hallucinations of Minigpt4.

Input Image:

Instruction:

Please describe the image in detail.

Shikra(Greedy)

The image features a green and yellow train traveling down the tracks, passing by a station. The train appears to be a trolley-type train that is either leaving the station or pulling into it. There are several people around the train and the station, possibly waiting for the train or just going about their day.
In addition to the train and people, there are a few traffic lights scattered around the scene. One traffic light is located near the top right corner, another one is positioned further to the right, and the third traffic light is on the left side of the image. The presence of these traffic lights suggests that the train is operating in an urban environment with well-organized transportation infrastructure.

Shikra(PAI)

The image depicts a green and yellow train traveling down the train tracks, passing through a small town. The train is moving along the tracks that are situated above a grassy field. The scene also features a tree-lined sidewalk adjacent to the train tracks.
In the background, there are some pink flowers in bloom, adding a touch of color to the scene. The train tracks can be seen extending across the image, with the train positioned in the middle of the frame. The train appears to be a local commuter train, possibly heading towards a city or town.

Input Image:

Instruction:

Please describe the image in detail.

Shikra(Greedy)

The image features a large elephant standing on a lush green field, eating grass using its trunk. The elephant is positioned in the center of the scene, and it appears to be enjoying a meal.
In the background, there are several chairs scattered around the field, possibly for people to sit and observe the elephant. Additionally, there are two people in the scene, one near the left edge of the image and another closer to the center, both of them likely watching the elephant as it feeds.

Shikra(PAI)

The image features a large elephant standing on a dirt field, eating a mouthful of green grass. The elephant is enjoying a meal consisting mainly of grass, as it is eating the grass in its mouth. The elephant is standing under a bamboo hut, which provides shade and shelter.
The image shows a large elephant standing in a field, eating grass by placing it in its mouth using its trunk. The scene captures the unique moment of the elephant enjoying its meal.

Input Image:

Instruction:

Please describe the image in detail.

Shikra(Greedy)

The image features a small boat sailing across a large body of water, likely an ocean or a sea, under a cloudy sky. The boat is speeding through the water, cutting through the waves and creating a wake behind it.
There are a few people on the boat, with one person visible on the left side and two other people on the right side of the boat. They seem to be enjoying their time on the water, possibly engaged in recreational activities or sightseeing.

Shikra(PAI)

The image features a small red boat, resembling a red and white ship, speeding across the water with a large body of waves around it. The boat is moving quickly through the water, creating a wake in the ocean. The scene captures the excitement of an adventurous boat ride on the open ocean, providing a thrilling experience for those on board.

PAI's performance on reducing hallucinations of Shikra.

@article{liu2024paying,
  title={Paying more attention to image: A training-free method for alleviating hallucination in lvlms},
  author={Liu, Shi and Zheng, Kecheng and Chen, Wei},
  journal={arXiv preprint arXiv:2407.21771},
  year={2024}
}

We identify the "text inertia" phenomenon in LVLMs: some hallucinations are caused by the text context rather than by misinterpretation of image information. We propose a method called PAI to alleviate it.
