Despite their remarkable potential, Large Vision-Language Models (LVLMs) still struggle with object hallucination, where generated outputs mention objects that do not actually exist in the image. While most prior work addresses this issue within the language-model backbone, we shift the focus to the image input, investigating how specific image tokens contribute to hallucinations. Our analysis reveals a striking finding: a small subset of image tokens with high attention scores is the primary driver of object hallucination. By removing these hallucinatory image tokens (only 1.5% of all image tokens), the issue can be effectively mitigated, and this finding holds consistently across different models and datasets. Building on this insight, we introduce EAZY, a novel, training-free method that automatically identifies and Eliminates hAllucinations by Zeroing out hallucinatorY image tokens. Applied to unsupervised object hallucination detection, EAZY achieves a 15% improvement over previous methods. It also mitigates hallucinations effectively while preserving model utility and adapts seamlessly to various LVLM architectures.
Figure 1: Removing three image tokens results in the elimination of the hallucinated objects, "apples" and "oranges", and reveals the real object "kiwis".
EAZY identifies and removes hallucinatory image tokens from the image input, effectively mitigating hallucinations in LVLMs. The method is training-free and applies to a wide range of LVLM architectures. In the example above, zeroing out just three image tokens eliminates the hallucinated "apples" and "oranges" and reveals the real object, "kiwis".
We first investigated an important question: How do Large Vision-Language Models (LVLMs) extract visual information and generate text tokens accordingly?
Figure 2: Token-wise attention heatmaps at Layer-32 and Layer-15 of LLaVA-1.5.
Through a comprehensive analysis of attention patterns, we found that Large Vision-Language Models (LVLMs) primarily extract and process object-related information in their middle-to-late layers, not in the initial or final layers as one might expect. We introduced a new metric, the Object Localization Visual Attention Ratio (OL-VAR), to quantify this. The results show that for a model like LLaVA-1.5, layers 10 through 25 are critical for linking generated object tokens to their corresponding regions in the image. This discovery provides a precise tool to understand where the model "looks" when generating text about objects.
Figure 3: OL-VAR scores peak in the middle layers and reach their maximum at Layer-15, consistent with the token-wise attention heatmap pattern observed in Figure 2.
Figure 4: Using the Layer-15 attention, we can accurately localize the image region corresponding to "dog".
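To make the metric concrete, the snippet below sketches one plausible reading of OL-VAR for a single generated object token at a single layer: the fraction of that token's visual attention mass that falls inside the object's annotated image region. The exact definition, the head aggregation, and the names (`ol_var`, `region_mask`) are our illustrative assumptions, not the paper's specification.

```python
import torch

def ol_var(attn_layer, obj_token_idx, image_positions, region_mask):
    """One plausible per-layer Object Localization Visual Attention Ratio (sketch).

    attn_layer:      (num_heads, seq_len, seq_len) attention weights of one decoder layer
    obj_token_idx:   position of the generated object token in the full sequence
    image_positions: indices of the image tokens in the sequence (576 tokens for LLaVA-1.5)
    region_mask:     bool tensor over those image tokens, True inside the object's annotated region
    """
    # head-averaged attention from the object token to every image token
    attn_to_image = attn_layer.mean(dim=0)[obj_token_idx, image_positions]  # (n_img,)
    # share of the visual attention mass that lands inside the object's region
    return (attn_to_image[region_mask].sum() / attn_to_image.sum()).item()
```

Averaging this ratio over objects and images, layer by layer, would produce a curve like the one shown in Figure 3.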
We explore the root causes of object hallucination (OH) from a visual perspective. Building on our previous findings that attention in middle layers aligns object tokens with their visual anchors, we investigate whether a small subset of image tokens plays a decisive role in hallucinated object generation.
Through case studies and systematic analysis, we identify that hallucinated objects consistently attend to a few image tokens with the highest attention scores, which we term Hallucinatory Image Tokens (HITs). Remarkably, we find that zeroing out as few as three of these HITs can eliminate hallucinated objects from the generated text, without affecting real object descriptions.
To further validate the semantic interpretation of HITs, we apply the Logit Lens technique and confirm that these tokens are indeed decoded as hallucinated object labels (e.g., "apple", "orange"). This reveals that hallucinations are not random artifacts but stem from biased visual evidence in the input image encoding.
Figure 5: Logit Lens interpretation of HITs. We display the top-5 projected words of the image tokens related to the hallucinated objects (HOs). The HITs are decoded as the hallucinatory objects "apple" and "orange".
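The Logit Lens step can be reproduced with a few lines of PyTorch: project an intermediate hidden state of each image token through the language model's unembedding matrix and read off the nearest vocabulary entries. The sketch below assumes a HuggingFace-style LVLM run with `output_hidden_states=True` whose language model exposes `lm_head`; the layer index and helper names are illustrative.

```python
import torch

@torch.no_grad()
def logit_lens_topk(hidden_states, lm_head, tokenizer, image_positions, layer=15, k=5):
    """Decode image-token hidden states at an intermediate layer into vocabulary words (sketch)."""
    # hidden_states: tuple of (batch, seq_len, d_model) tensors, one per layer
    h = hidden_states[layer][0, image_positions]          # (n_img, d_model)
    # (applying the model's final layer norm before lm_head is a common refinement, omitted here)
    logits = lm_head(h)                                   # (n_img, vocab_size)
    top = logits.topk(k, dim=-1).indices                  # (n_img, k)
    return [tokenizer.convert_ids_to_tokens(row.tolist()) for row in top]
```

For a HIT, the returned top-k words would contain the hallucinated label (e.g., "apple"), mirroring Figure 5.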
We then quantify the hallucination mitigation performance by evaluating how zeroing out the top-K HITs affects real and hallucinated objects. Our experiments on the curated Hall-COCO dataset show that hallucinated objects are highly sensitive to HIT removal, whereas real objects remain robust, thus confirming the discriminative power of our method.
Figure 6: Reduction ratio of OHs on Hall-COCO after zeroing out HITs. We report changes in three ratios: images w/o improvement (images whose HOs remain unchanged after zeroing out), images w/ OH (images whose responses still contain OHs), and HO (the ratio of HOs across all text responses).
Figure 7: Hallucinated object removal rate significantly increases after zeroing out top-K HITs, with minimal impact on real objects.
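As a rough illustration of the quantities reported in Figures 6 and 7, the helper below computes removal-style ratios from per-image sets of hallucinated objects before and after zeroing out HITs. The function and field names are ours; the paper's exact bookkeeping may differ.

```python
def removal_stats(before, after):
    """Sketch of the ratios behind Figures 6-7.

    `before` / `after` map each image id to the set of hallucinated objects
    found in its response before and after zeroing out HITs.
    """
    n_images = max(len(before), 1)
    total_ho = sum(len(v) for v in before.values())
    removed = sum(len(before[i] - after.get(i, set())) for i in before)
    imgs_with_oh = sum(1 for i in before if after.get(i, set()))
    no_improvement = sum(1 for i in before if before[i] and before[i] == after.get(i, set()))
    return {
        "HO removal rate": removed / max(total_ho, 1),
        "images w/ OH": imgs_with_oh / n_images,
        "images w/o improvement": no_improvement / n_images,
    }
```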
The EAZY method identifies object hallucinations by leveraging the insight that real and hallucinated objects respond differently to the removal of high-attention image tokens. The process begins by generating an initial response from the LVLM for a given image. From this response, potential object tokens are extracted using NLP techniques like POS tagging. For each identified object token, the method pinpoints the top-K image tokens that received the highest attention scores from it. These top-K tokens are considered candidate Hallucinatory Image Tokens (HITs).
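The object-extraction step can be approximated with off-the-shelf POS tagging; the exact filtering rules are not spelled out here, so the spaCy snippet below is only a minimal sketch.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_object_tokens(response: str):
    """Pull candidate object mentions (nouns) out of an LVLM response via POS tagging."""
    doc = nlp(response)
    # keep plain nouns; a real pipeline may add filtering, e.g. against a COCO object vocabulary
    return [tok.text.lower() for tok in doc if tok.pos_ == "NOUN"]

print(extract_object_tokens("There are apples, oranges, and a bowl of kiwis on the table."))
# -> ['apples', 'oranges', 'bowl', 'kiwis', 'table']
```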
Next, the model's input is modified by "zeroing out" these candidate HITs, replacing their embeddings with zero vectors. A new response is then generated using this modified input. If an object token from the original response disappears in the new response, EAZY classifies it as a hallucination. Conversely, if the object token remains, it is classified as a real object. This differential behavior serves as a robust, training-free mechanism for detecting hallucinations, showing that real objects are largely unaffected by this zeroing-out process, while hallucinated ones are effectively eliminated.
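The two core operations, picking each object token's top-K attended image tokens and zeroing out their embeddings, can be sketched as follows. The layer index, the value of K, and the function names are illustrative assumptions based on the description above, not the paper's exact configuration; the sketch also assumes full-sequence attention maps are available from the model.

```python
import torch

@torch.no_grad()
def candidate_hits(attentions, obj_token_pos, image_positions, layer=15, k=3):
    """Select candidate Hallucinatory Image Tokens (HITs) for one generated object token (sketch).

    attentions: tuple of per-layer attention tensors shaped (batch, heads, seq, seq)
    """
    attn = attentions[layer][0].mean(dim=0)               # head-averaged (seq, seq)
    to_image = attn[obj_token_pos, image_positions]       # object-token -> image-token scores
    topk = to_image.topk(k).indices
    return [image_positions[i] for i in topk.tolist()]    # positions in the full input sequence

def zero_out(inputs_embeds, positions):
    """Replace the embeddings of the selected image tokens with zero vectors."""
    masked = inputs_embeds.clone()
    masked[0, positions] = 0.0
    return masked
```

An object whose mention disappears when the model is re-run on the masked embeddings is flagged as a hallucination; one that survives is kept as real.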
The mitigation process in EAZY follows a similar logic to detection and is designed to correct the initial, potentially flawed output of an LVLM. First, an initial inference is performed to get the model's description of the image. All object tokens, both real and hallucinated, are extracted from this initial response. For all identified objects, the method aggregates the top-K candidate HITs into a single set.
These aggregated candidate HITs are then zeroed out from the image token sequence to perform a second, "estimation" inference. By comparing the initial response with the second one, EAZY identifies the objects that disappeared, flagging them as hallucinations. Finally, the candidate HITs corresponding only to these confirmed hallucinated objects are collected to form a final zero-out list. This final set of tokens is zeroed out from the original input, and a final inference is run to produce a corrected, mitigated response that is free of the identified hallucinations while preserving the accurately described objects.
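Putting the pieces together, a high-level sketch of this three-pass mitigation loop might look like the following. `generate(...)` is a hypothetical wrapper around the LVLM that returns the response text, the positions of the generated object tokens, and the attention maps; the `candidate_hits` and `zero_out` helpers are the sketches from above.

```python
def eazy_mitigate(model, inputs_embeds, image_positions, k=3):
    """High-level sketch of the three-pass mitigation loop described above (not the exact implementation)."""
    # Pass 1: initial inference on the unmodified input
    resp0, obj_positions, attn = generate(model, inputs_embeds)

    # Aggregate the top-K candidate HITs of every object in the initial response
    all_hits = set()
    for obj, pos in obj_positions.items():
        all_hits |= set(candidate_hits(attn, pos, image_positions, k=k))

    # Pass 2: "estimation" inference with every candidate HIT zeroed out
    resp1, _, _ = generate(model, zero_out(inputs_embeds, sorted(all_hits)))
    # simple substring check; a real implementation would match lemmatized object words
    hallucinated = [obj for obj in obj_positions if obj not in resp1]

    # Pass 3: zero out only the HITs of confirmed hallucinations, then re-run
    final_hits = set()
    for obj in hallucinated:
        final_hits |= set(candidate_hits(attn, obj_positions[obj], image_positions, k=k))
    resp_final, _, _ = generate(model, zero_out(inputs_embeds, sorted(final_hits)))
    return resp_final
```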
Figure 8: Overview of the proposed EAZY method. The process starts with an image input encoded into image tokens. The LVLM generates an initial response containing hallucinated objects (apples, oranges). EAZY estimates HITs from the text-to-image token-wise attention distribution. With the HITs zeroed out, the hallucinations disappear from the final response, revealing the correct objects (kiwis).
We conduct extensive experiments to evaluate the effectiveness of EAZY in detecting and mitigating object hallucinations across various LVLMs and datasets. Our results demonstrate that EAZY significantly outperforms existing methods in both detection accuracy and mitigation effectiveness, while maintaining model utility.
CHAIR Results
On the CHAIR benchmark, EAZY delivers a significant reduction in hallucination rates across multiple LVLMs, demonstrating its robustness and adaptability.
POPE Results
Evaluation results on POPE. We report the average accuracy and F1 score over the random, popular, and adversarial settings.
Hallucination Detection Curves Comparison
Object Hallucination Detection Curves on Hall-COCO. We present the Precision-Recall and ROC curves of the proposed OH detection method and baselines.
Detection Performance
OH Detection Results on Hall-COCO. PR(RO) represents the precision of real objects (positive instances), while PR(OH) represents the precision of object hallucination (negative instances).
Evaluation on MLLM Benchmarks
EAZY maintains the general capabilities of LVLMs.
We present qualitative case studies demonstrating EAZY's effectiveness in detecting and mitigating object hallucinations across diverse scenarios. These examples showcase how our method identifies problematic image tokens and successfully eliminates hallucinated objects while preserving the accuracy of real object detection.
@misc{che2025hallucinatoryimagetokenstrainingfree,
title={Hallucinatory Image Tokens: A Training-free EAZY Approach on Detecting and Mitigating Object Hallucinations in LVLMs},
author={Liwei Che and Tony Qingze Liu and Jing Jia and Weiyi Qin and Ruixiang Tang and Vladimir Pavlovic},
year={2025},
eprint={2503.07772},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.07772},
}