DVLA-RL: Dual-Level Vision–Language Alignment with Reinforcement Learning Gating for Few-Shot Learning
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
The paper proposes DVLA-RL, a framework for few-shot learning (FSL) that integrates large language models (LLMs) for hierarchical vision–language alignment. It introduces two main components:
1. Dual-level Semantic Construction: generates low-level attributes and high-level class descriptions from class names and support samples, using progressive Top-k filtering to avoid irrelevant or hallucinated attributes.
2. Adaptive RL-Gated Attention: formulates cross-modal fusion as a reinforcement learning (RL) process that dynamically balances self-attention and cross-attention between visual and textual tokens across layers.
DVLA-RL achieves state-of-the-art results across nine benchmarks in three FSL scenarios.
1. Clear motivation and design: The dual-level semantic structure effectively bridges fine-grained and global representations, addressing the limitations of single-level methods like SemFew or ECER.
2. Novel adaptive mechanism: The RL-gated attention introduces a lightweight yet dynamic policy to control fusion between self- and cross-attention, enabling layer-wise semantic alignment.
3. Strong results: DVLA-RL achieves SOTA across all nine datasets, with significant gains.
1. The paper describes RLA as containing both “image-guided” and “text-guided” paths, yet both equations use text-based queries and differ only in their key and value sources. The notion of “image-guided” does not match the presented formulation, leaving the two paths conceptually indistinguishable.
2. The output fusion step is written as adding the fused representation to the image features and then concatenating, but the dimensional correspondence between the two is not specified. If the text and image features differ in sequence length or channel dimension, the operation is not well-defined.
3. Minor writing inconsistencies: mixed use of the terms “RLA/ARL” and unclear notation (e.g., U, t*) reduce readability.
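To make weakness 1 concrete, here is a toy sketch (all values and the exact key/value assignments are assumptions, not taken from the paper's Eq. 5–6) showing that when both paths take text-side queries and differ only in key/value sources, both outputs are indexed by text tokens, so neither path is "image-guided" in the sense of image queries attending over text:

```python
import math

def attend(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Each output row is a weighted sum of value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

T = [[1.0, 0.0], [0.0, 1.0]]    # toy text tokens
I = [[0.5, 0.5], [1.0, -1.0]]   # toy image tokens

# "Text-guided" path: text queries over image keys/values.
path_a = attend(T, I, I)
# "Image-guided" path as formulated: still text queries; only the
# key/value source changes (assumed here: text keys, image values).
path_b = attend(T, T, I)
# Both outputs have one row per TEXT token: structurally, both paths
# are text-query attention, which is the ambiguity raised above.
```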
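Similarly, weakness 2 can be illustrated with assumed toy shapes (not taken from the paper): a cross-attention output driven by text queries has one row per text token, so element-wise addition to the image token sequence is undefined whenever the token counts differ, unless some pooling or projection step is specified:

```python
# Assumed toy shapes: L_t text tokens, L_v image patches, d channels.
L_t, L_v, d = 5, 49, 64

fused = [[0.0] * d for _ in range(L_t)]   # cross-attention output: one row per text query
image = [[1.0] * d for _ in range(L_v)]   # visual token sequence

def add_tokenwise(a, b):
    """Element-wise addition of two token sequences; fails on mismatched shapes."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError(f"shape mismatch: {len(a)}x{len(a[0])} vs {len(b)}x{len(b[0])}")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

try:
    add_tokenwise(fused, image)
except ValueError as e:
    print(e)  # the paper would need to specify a pooling or projection step here
```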
1. How are the two attention paths in RLA (Eq. 5–6) distinct if both use text queries? Could the authors clarify this?
2. What are the values and tuning ranges for the Beta concentration κ and RL weights λ?
3. Why does ChestX show minimal improvement? Are there examples of failed attribute generation or hallucinated semantics?
Heavily AI-edited
DVLA-RL: Dual-Level Vision–Language Alignment with Reinforcement Learning Gating for Few-Shot Learning
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes DVLA-RL, a few-shot learning (FSL) method that conducts controlled cross-modal alignment between vision and language from low-level to high-level semantics. In detail, Dual-level Semantic Construction (DSC) generates detailed descriptions of visual attributes via LLMs, and Adaptive RL-gated Attention (RLA) balances self-attention and cross-attention between visual and text tokens. The proposed method achieves strong performance on general FSL, fine-grained FSL, and cross-domain FSL tasks.
- The idea of conducting both low-level and high-level semantics alignment is well-motivated, and the paper is clearly written.
- The proposed method achieves strong results on several FSL tasks and is supported by several analytical experiments.
- Training efficiency could be a major concern due to costly LLM inference and reinforcement learning.
- In recent years, another line of work has focused on enhancing few-shot learning performance with the CLIP vision encoder [R1, R2]. Can the proposed method also be applied to the CLIP vision encoder? Additionally, it would be helpful if the authors could discuss the differences between the two few-shot learning settings.
Refs:
R1. Learning to Prompt for Vision-Language Models. IJCV 2022.
R2. Logits DeConfusion with CLIP for Few-Shot Learning. CVPR 2025.
- In Table 6, the authors compare the alternative of using Qwen2.5-VL-32B for direct classification. Could the authors elaborate on the implementation details of this experiment?
Fully human-written
DVLA-RL: Dual-Level Vision–Language Alignment with Reinforcement Learning Gating for Few-Shot Learning
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper presents Dual-level Vision–Language Alignment with Reinforcement Learning gating (DVLA-RL), which comprises Dual-level Semantic Construction (DSC) and RL-gated Attention (RLA). This framework addresses the issue of neglecting progressive and adaptive alignment between vision and language, ranging from low-level to high-level semantics.
1. Integrating reinforcement learning and cross-attention mechanisms to enhance few-shot performance is a promising direction.
2. Experiments have demonstrated that using reinforcement learning can improve few-shot performance.
1. The motivation for the proposed method requires further clarification.
Limitations of Prior Work: The paper does not sufficiently explain why existing methods for vision-language alignment are limited. Given that visual and textual data are inherently different modalities with a fundamental modality gap, the justification for why simply adding more textual attribute descriptions effectively bridges this gap is unclear.
Reliance on LLM Correctness: A critical concern is the dependence on the correctness of attribute descriptions generated by the LLM. The paper lacks a discussion of how the accuracy of these descriptions is ensured. What is the impact on the model's performance if the LLM generates hallucinated, biased, or incorrect attributes? Without addressing robustness to potential errors in the semantic input, the claim of obtaining high-quality advanced feature representations may be undermined.
2. Ambiguity in Experimental Setup and Questions on Visualization
The experimental details for the t-SNE visualization are unclear, leading to questions about the validity of the results.
Missing Experimental Conditions: The specific baseline model and the number of samples used to generate the t-SNE plots in Figure 3 are not specified. This omission makes it difficult to interpret the visualization accurately and assess the comparative improvement.
Inconsistent Performance Portrayal: The quantitative results in Table 1 show a performance improvement of approximately 2% over the strong baseline SemFew. However, the t-SNE visualization appears to depict a near-perfect (100%) separation of classes, which suggests an improvement far exceeding what the quantitative results indicate. This significant discrepancy between the quantitative metrics and the qualitative visualization raises questions about the authenticity and representativeness of the presented t-SNE results.
3. The definition of the reward function R_t in Equation (7) is somewhat ambiguous. The term "episodic accuracy improvement" requires a more precise explanation. Does it refer to the accuracy improvement over a baseline model, or to the change in accuracy during training within the same episode? A clearer definition is necessary to understand how the reinforcement learning signal is constructed.
4. For the comparative experiments presented in Table 5, could the authors confirm that the baseline comparisons with methods such as SP, SemFew, and ECER were conducted under identical experimental settings? This includes using the same visual backbone, pre-training data, and large language model (LLM). If DVLA-RL employs a more powerful LLM (e.g., Qwen2.5-VL-32B) or a different pre-training strategy, a portion of the performance gain might be attributable to these factors rather than to the core proposed modules. Ensuring fair, controlled comparisons would strengthen the claims.
5. While the paper emphasizes performance improvements, it does not discuss the computational overhead introduced by DVLA-RL. The Dual-level Semantic Construction (DSC) module requires LLM calls to generate attributes and descriptions, and the RL-gated Attention (RLA) module involves reinforcement learning training. It would be valuable to analyze whether these components significantly increase training and inference time compared to baseline methods. A discussion of parameter efficiency and computational cost would give a more comprehensive view of the method's practicality.
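On point 3, the two plausible readings of "episodic accuracy improvement" yield different reward signals; a minimal sketch with assumed, purely illustrative accuracy values:

```python
# Assumed per-step episode accuracies and a fixed baseline (illustrative values only).
accs = [0.52, 0.55, 0.61, 0.60]
baseline_acc = 0.50

# Reading 1: change in accuracy between consecutive steps of the same episode.
r_within = [accs[t] - accs[t - 1] for t in range(1, len(accs))]

# Reading 2: improvement over a fixed baseline model.
r_vs_baseline = [a - baseline_acc for a in accs]

# The two signals differ in sign and scale (e.g., the last step is negative
# under reading 1 but positive under reading 2), so the definition matters
# for how the policy gradient is shaped.
```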
See Weaknesses.
Moderately AI-edited |