Single-Sample Test-Time Reinforcement Learning for Vision-Language Models
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper presents VR-TTRL, a test-time RL framework for visual reasoning tasks. It adapts the core idea of TTRL while extending its reward function to visual reasoning tasks, where there are no closed-form answers from which a majority-vote reward can easily be computed. Specifically, the authors design a majority-vote method suited to structured outputs (e.g., JSON) based on pairwise IoU computation. Experiments on two visual reasoning tasks show the effectiveness of the proposed method.
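For concreteness, the consensus step as I understand it could look like the minimal sketch below. This is my own illustration, not the authors' code: the greedy best-match aggregation over boxes and the "highest average agreement wins" rule are my assumptions.

```python
# Reviewer's sketch of IoU-based consensus selection over rollouts.
# The box matching and aggregation rules are assumptions, not the paper's code.

def box_area(b):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def agreement(r1, r2):
    """Mean best-match IoU between two rollouts (each a list of boxes)."""
    if not r1 or not r2:
        return 0.0
    return sum(max(iou(b1, b2) for b2 in r2) for b1 in r1) / len(r1)

def select_consensus(rollouts):
    """Pick the rollout with the highest average agreement with all others."""
    scores = [sum(agreement(r, o) for j, o in enumerate(rollouts) if j != i)
              for i, r in enumerate(rollouts)]
    return rollouts[max(range(len(rollouts)), key=scores.__getitem__)]
```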
1. Interesting idea for using TTRL for visual reasoning tasks.
The paper presents an interesting idea: adapt a VLM on a single unlabeled test image by sampling multiple rollouts, selecting a consensus pseudo-label via IoU/L1 computation, and running RL with the proposed reward function. This modification of the prior TTRL framework to support the structured outputs of visual reasoning tasks is interesting and of practical interest to researchers in the field.
2. Solid results on visual reasoning tasks.
Table 2 reports consistent and solid improvements for two different VLM backbones after applying the VR-TTRL framework on visual segmentation and counting tasks. The performance is generally strong and makes the conclusions convincing.
3. Good analysis about compute/efficiency.
The paper includes timing curves and an extensive set of hyperparameter ablation studies, for example update count vs. performance and the effect of different numbers of rollouts. This is very useful for deploying the VR-TTRL framework in practical settings.
1. Need to discuss the risk of consensus reinforcing the model's mistakes.
The reward function pulls the model toward the most "average" answer among the rollouts. Why doesn't this drive the model toward a mean-valued, low-variance solution and lead to mode collapse? In particular, under iterative updates, why does this not reinforce early biases in the model and propagate errors? If this happens, it would be helpful to show some failure cases; if it generally does not, some analysis and a clear explanation of why would strengthen the paper.
2. Task scope is narrow.
Experiments focus only on bounding boxes, points, and counts as output formats. It is unclear how well the approach extends to open-ended VQA or free-form reasoning tasks, where majority voting may not work as well or where there is no geometric agreement signal.
3. Need to explain why test-time learning specifically benefits from the framework vs. simply using training data.
I'm curious where the gains really come from. Would the same adaptation method applied offline on a large training dataset yield similar or better gains? In other words, does the improvement come primarily from the objective itself or from being applied to the test distribution?
Please refer to the weakness section.
Fully human-written
Single-Sample Test-Time Reinforcement Learning for Vision-Language Models
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper introduces VR-TTRL, the first framework to apply test-time reinforcement learning (TTRL) to vision-language models for visual reasoning tasks like segmentation and counting. Unlike prior TTRL methods that require multiple samples or ground truth labels, VR-TTRL adapts a model using just a single unlabeled test image by generating multiple reasoning rollouts, using majority voting based on geometric similarity (IoU, L1 distance) to create pseudo-labels, and then refining the model via reinforcement learning. It achieves significant performance gains—especially on segmentation tasks for base models like Qwen2.5-VL—without any external supervision, demonstrating that models can improve their reasoning at inference time through self-supervised iteration. The approach is computationally efficient, with optimal performance reached in just 10 updates and a small rollout size of 8, making it practical for real-world deployment.
- VR-TTRL achieves meaningful adaptation using only a single unlabeled test sample without ground truth, eliminating reliance on batched data or external supervision.
- The majority-voting mechanism is cleverly adapted to structured outputs (bounding boxes, points) using geometric similarity metrics (IoU, L1 distance), enabling effective pseudo-label generation where exact string matching fails.
- Strong empirical results show consistent performance gains across multiple benchmarks.
- The framework relies on a majority-voting mechanism that assumes sufficient diversity in stochastic rollouts, yet ablations show minimal gains from increasing rollout size beyond 8, suggesting the consensus signal may be weak or redundant.
- Counting performance improvements are inconsistent: Qwen2.5-VL-7B sees no meaningful gain or even degradation, indicating that VR-TTRL’s reward structure may not generalize well to tasks requiring precise object enumeration.
- The approach assumes structured output formats (bounding boxes, point coordinates) are reliably parseable and comparable, but real-world deployment risks failure when models generate malformed or non-uniform JSON.
- Computational cost per sample remains high (up to ~200s/image), and while ablations suggest optimal settings, the method is still impractical for latency-sensitive or edge deployments.
- The evaluation omits comparison against recent single-sample TTA baselines (e.g., entropy minimization or contrastive adaptation) that don’t require generative rollouts.
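For reference, the kind of single-sample entropy-minimization baseline meant here (in the spirit of TENT) could be sketched as below. The `model(pixel_values)` interface returning classification logits and the choice to update only LayerNorm affine parameters are illustrative assumptions; a VLM variant would instead minimize entropy over the generated token distributions.

```python
# Illustrative entropy-minimization TTA step (TENT-style); the model interface
# and the LayerNorm-only update are assumptions, not details from the paper.
import torch
import torch.nn.functional as F

def entropy_minimization_step(model, pixel_values, lr=1e-4):
    # Adapt only normalization-layer affine parameters, as in TENT.
    params = [p for m in model.modules()
              if isinstance(m, torch.nn.LayerNorm)
              for p in m.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=lr)

    logits = model(pixel_values)                    # assumed: [B, num_classes]
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()
```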
See Weaknesses.
Fully AI-generated
Single-Sample Test-Time Reinforcement Learning for Vision-Language Models
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
The paper proposes VR‑TTRL, a single‑sample test‑time reinforcement learning (TTRL) procedure for VLMs. For each test image–query pair, the model samples multiple stochastic rollouts, computes a majority pseudo‑label by measuring agreement across structured predictions via IoU/L1, and then performs GRPO updates with an accuracy reward (to the consensus) plus format rewards. The method is evaluated on referring‑style segmentation datasets and counting datasets using Qwen2.5‑VL‑7B‑Instruct and VisionReasoner‑7B. Reported results show moderate segmentation gains and mixed counting outcomes.
The paper positions itself as applying TTRL (as introduced for LLMs recently) to VLMs, drawing on recent RL‑for‑vision reasoning frameworks.
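To make the reward structure described above concrete, a sketch of the decomposition (a format reward for well-formed structured output plus an accuracy-to-consensus reward) might look as follows. The JSON key name, the weights, and the mean best-match IoU term are my assumptions, not details taken from the paper.

```python
# Reviewer's sketch of a format + accuracy-to-consensus reward; the key name
# "bbox_2d", the weights, and the mean best-match IoU term are assumptions.
import json

def format_reward(rollout_text):
    """1.0 if the rollout parses as a JSON list of box dicts, else 0.0."""
    try:
        preds = json.loads(rollout_text)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    ok = isinstance(preds, list) and all(isinstance(p, dict) and "bbox_2d" in p for p in preds)
    return 1.0 if ok else 0.0

def accuracy_reward(pred_boxes, consensus_boxes, iou_fn):
    """Mean best-match IoU of predicted boxes against the consensus pseudo-label."""
    if not pred_boxes or not consensus_boxes:
        return 0.0
    return sum(max(iou_fn(p, c) for c in consensus_boxes) for p in pred_boxes) / len(pred_boxes)

def total_reward(rollout_text, pred_boxes, consensus_boxes, iou_fn, w_fmt=0.5, w_acc=1.0):
    """Scalar reward for one rollout, fed to the GRPO update."""
    return w_fmt * format_reward(rollout_text) + w_acc * accuracy_reward(pred_boxes, consensus_boxes, iou_fn)
```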
- Clear, implementable mechanism. Extends majority‑vote rewards to structured outputs; reward decomposition (format + accuracy‑to‑consensus) is well described.
- Fig. 2–3 give a sense of adaptation dynamics and cost.
1) Limited novelty. The method is a straightforward combination of TTRL's majority-vote reward with VisionReasoner-style GRPO for boxes/points on Qwen2.5-VL-7B. This is far from a substantial conceptual advance.
2) Missing vote-only control. No baseline performs the identical sampling/consensus procedure without the RL updates. Given that majority-vote/self-consistency is already powerful (and is the reward signal used by TTRL), the paper cannot attribute the gains to RL.
3) Counting by list length (not detection‑aware) can overstate improvements; the SAM2 dependency for segmentation is not isolated.
4) Recent results show that both TTRL [1] and 1-shot-RLVR [2] can yield large gains without reliable rewards—e.g., with random, incorrect, or 1‑shot rewards—especially on Qwen Series models, and often fail to transfer to other families (e.g., Llama3/OLMo2). This raises concerns that the proposed consensus reward could be **spurious** [3] and model‑family specific, which the current experiments do not address.
5) For a single-sample test-time RL method, showing how performance evolves with training steps is preferable.
[1] TTRL: Test-Time Reinforcement Learning (arXiv:2504.16084)
[2] Reinforcement Learning for Reasoning in Large Language Models with One Training Example (arXiv:2504.20571)
[3] Spurious Rewards: Rethinking Training Signals in RLVR (arXiv:2506.10947)
1. Vote‑only baseline. Please add “rollouts + consensus”; also report Best‑of‑N. This isolates the value of RL over selection.
2. Show how IoU/L1 thresholds (0.5/10px/30px) affect consensus and reward.
3. In light of One‑Shot RLVR and Spurious Rewards, run a check on more non‑Qwen VLMs to test for model‑family effects.
4. Given that even incorrect or random rewards can move Qwen models, analyze failure cases where consensus is wrong.
5. How does VR-TTRL differ fundamentally from TTRL and One-Shot RLVR beyond the input modality?
Lightly AI-edited
Single-Sample Test-Time Reinforcement Learning for Vision-Language Models
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper adapts Test-Time Reinforcement Learning (TTRL) to vision-language models (VLMs) for visual reasoning tasks, in particular segmentation and counting. Given a single unlabeled test image, the model generates multiple rollouts, computes pairwise similarity between candidates, selects a consensus candidate as a pseudo-label, and uses GRPO with format and accuracy rewards to fine-tune the model on that single sample. The proposed method improves both VisionReasoner-7B and Qwen2.5-VL-7B on segmentation and counting benchmarks.
1. The first work to extend test-time reinforcement learning (TTRL) to multimodal structured outputs.
2. The reward components are clearly described, which makes them immediately applicable to other related visual reasoning tasks.
3. Improved quantitative and qualitative results.
1. Some claims/statements are contradictory. In line 52, the authors argue that "existing TTRL approaches have been primarily applied to language models, with no exploration in vision-language tasks, and they typically require multiple samples or **known ground truth** answers for effective optimization." However, in L59, it says "...TTRL Zuo et al. (2025) leverages majority voting across model rollouts to generate **reliable pseudo-labels**".
2. An analysis of performance (accuracy) versus reasoning steps is missing.
3. The experiments deliberately limit evaluation to 200 images per dataset for "computational efficiency". While this seems a plausible reason, such a small sample size raises concerns about statistical robustness, variance of reported gains, and possible cherry-picking of test samples.
4. More baseline methods. Several strong baselines for test-time adaptation may be needed, e.g., (1) simple test-time adaptation (TTA) methods and (2) supervised fine-tuning with pseudo-labels from a single greedy rollout. Right now, the paper only compares against the base VLMs.
Please refer to the weakness section.
Fully human-written |