|
Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT), a novel method for mitigating hallucinations in VLMs. The proposed method tracks changes in visual attention, referred to as gaze shifts, during the processing of information-rich query tokens. These gaze shifts are then used to create a visual saliency map that guides cross-modal fusion, enhancing both visual and query attention during decoding. The paper demonstrates that GIFT effectively reduces hallucination in VLMs across various tasks and datasets, providing significant improvements in hallucination mitigation while maintaining general performance with low computational overhead.
1. The idea of using gaze shifts to dynamically adjust visual attention in VLMs is a novel and promising approach. It effectively addresses key challenges of imbalanced cross-modal fusion and visual attention misallocation (the visual attention sink), both critical issues for VLM performance.
2. The paper provides extensive experiments that show GIFT achieves up to 20.7% improvement in hallucination mitigation, outperforming existing methods across several vision-language datasets and models of varying architectures.
3. GIFT demonstrates that it can improve hallucination mitigation without introducing significant computational overhead, making it a practical solution for inference-time interventions in VLMs.
1. Some formulas are missing concluding punctuation (e.g., periods at the end of equations). Sections 5 and 6 could be merged: both discuss experimental results and analyses, so their separation feels redundant, and combining them into a single cohesive section would improve the flow and clarity of the paper.
2. The experiments mainly focus on the LLaVA model family, which limits the generalizability of the results. Although the authors show promising results for LLaVA, there is no comprehensive evaluation on other popular VLMs or tasks, which raises concerns about the method's applicability to a wider range of models and real-world scenarios. This lack of broader experimental comparison is the primary reason I rate the paper 6/10 rather than 8/10: it is difficult to assess whether GIFT is universally applicable or specific to certain architectures.
3. While the paper emphasizes that GIFT maintains a low computational cost compared to some baselines, it still incurs a slight increase in latency (1.13x compared to greedy decoding). However, this is not a major issue.
See Weakness. |
Moderately AI-edited |
|
Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses the critical issue of hallucination in VLMs. The authors propose GIFT (Gaze Shift-Guided Cross-Modal Fusion Enhancement), a lightweight inference-time method inspired by human visual gaze dynamics, to tackle: (1) misallocated attention to irrelevant visual regions, (2) over-reliance on linguistic priors, and (3) imbalanced cross-modal fusion.
1. GIFT introduces a human-inspired "gaze shift" tracking approach that addresses a critical gap in existing work: static attention averaging (used by baselines like VAF) often misallocates attention to irrelevant regions.
2. It integrates into existing VLMs without retraining, unlike training-based methods that incur high computational costs.
3. It consistently improves performance across diverse VLMs (LLaVA-1.5 7B/13B, Qwen2-VL 7B) and tasks (object detection, captioning, VQA), demonstrating its versatility.
1. GIFT heavily relies on "information-rich query tokens" (identified via POS tagging) to compute accurate saliency maps. The authors acknowledge that vague, ambiguous, or visually irrelevant queries (e.g., "Describe this image" without specific cues) may lead to inaccurate maps and reduced hallucination mitigation. However, they do not provide concrete strategies for such cases, e.g., an analysis of performance on low-specificity queries or a fallback mechanism for query-scarce scenarios.
2. While the authors tune the key hyper-parameters, the paper lacks a deeper analysis of how these choices generalize across models and datasets.
3. GIFT is compared to three baselines (VAF, Rel-Attn, VAR) but not to recent state-of-the-art contrastive decoding methods [1, 2]. These methods reduce hallucination by contrasting outputs produced with original versus perturbed visual inputs and have shown strong performance on VLM hallucination tasks. Omitting this comparison limits the paper's ability to position GIFT against the broader landscape of mitigation strategies (a minimal sketch of this decoding scheme is given after the references below).
[1] Mitigating Object Hallucinations in Large Vision-Language Models Through Visual Contrastive Decoding.
[2] Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models.
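For context, here is a minimal sketch of the contrastive-decoding idea used in [1]; the function name, hyper-parameter values, and the plausibility threshold are illustrative assumptions on my part, not the papers' exact settings.
```python
import torch

def contrastive_decode_step(logits_orig, logits_dist, alpha=1.0, beta=0.1):
    """Sketch of a VCD-style step: contrast next-token logits computed with the
    original image (logits_orig) against logits computed with a perturbed image
    (logits_dist, e.g., noise-added) to suppress tokens driven mostly by
    language priors. alpha and beta are illustrative, not tuned values.
    """
    contrasted = (1 + alpha) * logits_orig - alpha * logits_dist
    # Adaptive plausibility constraint: only keep tokens that are reasonably
    # probable under the original visual input.
    probs = torch.softmax(logits_orig, dim=-1)
    keep = probs >= beta * probs.max()
    contrasted = contrasted.masked_fill(~keep, float("-inf"))
    return contrasted.argmax(dim=-1)  # greedy choice, for the sketch
```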
Please refer to the weaknesses. |
Moderately AI-edited |
|
Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses object hallucination, a phenomenon prevalent in existing VLMs that is typically caused by over-reliance on language priors. The authors propose Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT), a simple training-free method that pre-computes a visual saliency map by tracking positive changes in visual attention, which is then leveraged to amplify attention at each decoding step. They evaluate the method on multiple hallucination benchmarks and demonstrate its effectiveness.
- Presentation. The overall presentation of this paper is clear.
- Clarity. The overall idea is straightforward and easy to follow.
- Comparison to Attention Modification Approaches. There is a series of studies [A, B, C, D] that focus on adjusting attention patterns to address object hallucination, but the authors do not discuss them. Such a discussion should be included, covering both related work and performance comparisons.
[A] Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding. CVPR 2025.
[B] Mitigating Object Hallucination via Concentric Causal Attention. NeurIPS 2024.
[C] Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention. CVPR 2025.
- Performance on General Benchmarks. What are the performance gains on general benchmarks rather than hallucination-specific ones, for example MMStar or MMBench?
See weakness 2. |
Fully human-written |
|
Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes GIFT, an inference-time hallucination mitigation method for VLMs. The key novelty lies in creating a visual saliency map by tracking positive changes in visual attention during comprehension of information-rich query tokens. Unlike previous approaches that only enhance visual attention, GIFT also proportionally adjusts query token attention to preserve cross-modal fusion balance, reducing the risk of visual attention sink and low visual contribution. Evaluations on multiple hallucination benchmarks (CHAIR, POPE, MMHal-Bench) and general VLM benchmarks (MME, SEED-Bench) show a significant decrease in hallucination rates with minimal impact on overall reasoning capabilities. The method is lightweight, training-free, and generalizes across different VLM architectures.
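To make my reading of the mechanism concrete, here is a minimal sketch of the core idea as I understand it; function names, tensor shapes, and the alpha scaling are my own illustrative assumptions, not the authors' implementation.
```python
import numpy as np

def gaze_shift_saliency(attn, rich_idx):
    """Build a visual saliency map from "gaze shifts": positive changes in the
    attention that consecutive information-rich query tokens pay to image tokens.

    attn: (num_query_tokens, num_image_tokens) visual attention per query token
          (e.g., averaged over heads/layers).
    rich_idx: ordered indices of information-rich query tokens (e.g., selected
              via POS tagging).
    """
    gaze = attn[rich_idx]                      # visual attention of rich tokens only
    shifts = np.diff(gaze, axis=0)             # change between consecutive rich tokens
    positive = np.clip(shifts, 0.0, None)      # keep only increases ("gaze shifts")
    saliency = positive.sum(axis=0)
    return saliency / (saliency.sum() + 1e-8)  # normalize to a distribution

def reweight_attention(attn_row, img_slice, qry_slice, saliency, alpha=0.5):
    """At a decoding step, amplify visual attention with the saliency map and
    scale query attention by the same overall gain to keep cross-modal balance."""
    out = attn_row.copy()
    out[img_slice] = attn_row[img_slice] * (1.0 + alpha * saliency)
    gain = out[img_slice].sum() / (attn_row[img_slice].sum() + 1e-8)
    out[qry_slice] = attn_row[qry_slice] * gain   # proportional query enhancement
    return out / out.sum()                        # renormalize the attention row
```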
1. Clear and intuitive idea: The gaze shift concept is easy to understand and well-motivated by human visual attention behavior.
2. Addresses multiple issues simultaneously: Tackles visual attention sink, low visual contribution, and imbalanced cross-modal fusion, which existing methods often address in isolation.
3. Low computational overhead: Achieves improvements with modest runtime increase compared to greedy decoding.
4. Comprehensive evaluation: Experiments span multiple benchmarks, models, and hallucination types, with ablation studies to validate design choices.
1. Experimental setting is somewhat outdated: The chosen base VLMs (LLaVA-1.5 series and Qwen2-VL) were released over a year ago. More recent models, such as LLaVA-NeXT and InternVL, use dynamic high-resolution image processing, which could affect saliency computation. Testing the method on these architectures would strengthen claims about generality.
2. Limited hallucination benchmarks: Evaluation could include newer datasets such as HallusionBench or other recent challenging hallucination tasks to better measure robustness.
3. Interpretability validation missing: Since the method relies heavily on saliency maps, adding segmentation-based experiments from classic interpretability literature could reveal whether human-perceived semantic enhancement indeed contributes to hallucination reduction [1, 2].
4. Hyperparameters vary per model without explicit robustness test: While tuning is explained, an ablation on sensitivity to hyperparameter changes would reinforce the robustness claim.
[1] Optimizing Relevance Maps of Vision Transformers Improves Robustness
[2] Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
See weaknesses. |
Moderately AI-edited |