ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 6.00 | 4.00 | 4248 |
| Heavily AI-edited | 1 (25%) | 8.00 | 4.00 | 3863 |
| Moderately AI-edited | 1 (25%) | 6.00 | 4.00 | 2575 |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 1 (25%) | 4.00 | 3.00 | 2490 |
| Total | 4 (100%) | 6.00 | 3.75 | 3294 |
Individual Reviews
ProxyThinker: Test-Time Guidance through Small Visual Reasoners

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

The paper proposes PROXYTHINKER, a training-free decoding method that transfers “slow-thinking” visual-reasoning behaviors from a small RFT-trained expert VLM to a larger base VLM by adding a logit delta at every generation step. The approach consistently improves Pass@1 on visual math and multi-disciplinary benchmarks and sometimes approaches or even surpasses a fully RFT-trained large model. The authors further implement a multi-model scheduling scheme in vLLM to reduce the overhead of running three models at inference.

# Strengths
1. Practical inference-time approach: The method requires no additional training of the large model, addressing the high cost of RFT for VLMs. It is simple to implement (just logit arithmetic) and can leverage existing small RFT models.
2. Empirical gains on visual reasoning tasks: ProxyThinker consistently improves accuracy on spatial and math reasoning benchmarks. In many cases the base VLM closes most of the gap to a fully RFT-trained model. For example, applying the small visual expert “OpenVLThinker-7B” to Qwen2.5-32B raises MathVision accuracy from 38.4% to 40.8%, slightly exceeding the 40.5% achieved by the full-scale RFT model.

# Weaknesses
1. The novelty of the proposed method is limited. ProxyThinker closely mirrors existing logit-guidance techniques; given prior methods such as DExperts, Proxy-Tuning, and DoLa, the contribution is largely an application of known ideas.
2. Weak justification for the VLM-specific focus. The authors motivate the work by the expense of RFT on large VLMs, but they do not identify any modality-specific challenge that makes ProxyThinker inherently necessary for vision. The technique appears generic and could equally apply to language-only reasoning tasks.
3. Although the gains are statistically significant, they are small (often only a few percentage points). Crucially, ProxyThinker requires simultaneous inference of multiple models (base + expert + amateur), which increases serving complexity, latency, and cost; in many settings this overhead may not be justified by the marginal improvements over fully RFT-trained baselines.

# Questions
1. Did you try to apply the proposed method to language-only reasoning tasks? What are the challenges and how does it perform?
2. Did you observe any negative side effects (e.g., increased repetition, verbosity, or unnatural reasoning) from adding the logit shift?
3. On MMMU/MMMU-Pro the gains are negligible. Could you provide a detailed diagnosis?

EditLens Prediction: Fully human-written
ProxyThinker: Test-Time Guidance through Small Visual Reasoners

Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

This paper introduces ProxyThinker, a test-time decoding method that transfers visual reasoning abilities from small RFT models to LVLMs without any additional training. By applying the token-level logit difference between a small RFT "expert" and its non-RFT "amateur" counterpart to the large base model's output distribution, ProxyThinker effectively elicits "slow-thinking" behaviors such as self-verification and backtracking. The method achieves consistent improvements on challenging benchmarks, often matching or surpassing fully RFT-trained models. Analyses show that ProxyThinker not only enhances quantitative performance but also induces interpretable reasoning behaviors.

# Strengths
1. The paper presents a clear motivation and method. The method is simple yet effective, and easy to follow.
2. The system implementation is carefully engineered on top of vLLM, with tensor-parallel scheduling and asynchronous multi-model inference that achieve up to a 38× speedup.
3. Experiments are thorough across five visual reasoning benchmarks and two model scales (32B/72B). The results are consistent and reported with clear baselines and ablations.

# Weaknesses
1. The paper begins by arguing that RFT does not teach new knowledge but merely amplifies existing reasoning patterns. However, ProxyThinker itself inherits exactly those amplified patterns without introducing new reasoning capabilities. The issue the paper raises is therefore not well resolved.
2. The related-work section omits some concurrent or near-concurrent works on multimodal reasoning with RL or decoding-time control, such as:
   - Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning (Wei et al., 2025)
   - R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO (Yao et al., 2025)
   - Semi-Off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning (Shen et al., 2025)
3. While ProxyThinker improves on math- and spatial-reasoning benchmarks, gains on knowledge-intensive datasets such as MMMU-Pro are marginal or even negative with weaker experts. This suggests the transferred signal mainly benefits structural reasoning rather than factual grounding.
4. ProxyThinker requires simultaneous execution of three models: the base model plus two auxiliary 7B-scale models (expert and amateur). This setup substantially increases both memory footprint and compute parallelism requirements, which could be prohibitive for typical deployment environments where even a single 32B LVLM may require multiple GPUs.

# Questions
See the above weaknesses.

EditLens Prediction: Moderately AI-edited
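For concreteness, the guided decoding step described in the two reviews above amounts to a single logit-space operation per generated token. The following is a minimal sketch, not the paper's implementation; the function name and the guidance weight `alpha` are assumptions (the reviews describe a plain additive delta, i.e. `alpha = 1`).

```python
import torch

def proxythinker_step(z_base: torch.Tensor,
                      z_expert: torch.Tensor,
                      z_amateur: torch.Tensor,
                      alpha: float = 1.0) -> torch.Tensor:
    """One guided decoding step: shift the base model's next-token logits by
    the expert-minus-amateur delta, then renormalize with a softmax.
    `alpha` is a hypothetical guidance weight; the reviews describe a plain
    additive delta (alpha = 1)."""
    guided = z_base + alpha * (z_expert - z_amateur)
    return torch.softmax(guided, dim=-1)

# Toy usage with random logits over a tiny vocabulary; in practice z_base,
# z_expert, and z_amateur would come from three separately served models at
# the same decoding position.
vocab_size = 8
z_base, z_expert, z_amateur = (torch.randn(vocab_size) for _ in range(3))
probs = proxythinker_step(z_base, z_expert, z_amateur)
print(probs, probs.argmax().item())
```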
ProxyThinker: Test-Time Guidance through Small Visual Reasoners

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

This paper proposes PROXYTHINKER, a novel and training-free decoding strategy designed to elicit multi-step reasoning from large vision-language models without expensive training. The method operates at inference time, steering a large base model's generation by applying a guidance signal derived from the logit difference between a small RFT expert model and its non-RFT amateur counterpart. Experiments compellingly demonstrate that this approach effectively transfers the reasoning skills, enabling a 32B model to achieve performance on complex visual reasoning benchmarks that is comparable to, and in some cases surpasses, its fully RFT-trained equivalent, showcasing a highly efficient path to enhancing model reasoning.

# Strengths
1. The paper is exceptionally well-written, clearly articulated, and easy to follow. The motivation is strong and well-grounded, directly addressing the significant and timely challenge of improving the scalability of reinforcement learning for large-scale VLMs.
2. The core idea is highly intuitive and logically sound. It builds upon the key insight that RFT methods like GRPO often do not introduce new external knowledge but rather reshape the model's output distribution to elicit a step-by-step reasoning process. The authors astutely hypothesize that if this capability is already latent within a large instruction-tuned model, it can be unlocked via a direct, inference-time decoding control mechanism. This work provides a feasible and effective solution that validates this hypothesis with impressive results.
3. This research could pioneer a new and more scalable paradigm for building powerful reasoning systems. It suggests a practical "division of labor," where intensive and costly reinforcement learning is focused on creating compact, specialized "expert" models. Larger, more general models could then be trained primarily with SFT and efficiently endowed with these specialized reasoning skills through inference-time interaction, paving a more sustainable path for model development.

# Weaknesses
1. Lack of Principled Analysis on Expert Model Selection: The paper's primary contribution relies on guidance from a small "expert" model, yet the criteria for selecting this expert seem somewhat ad hoc. While the authors experiment with three public models chosen based on "differing training paradigms and data selection strategies" (line 259), the paper falls short of addressing a crucial question: what properties define an optimal expert for the PROXYTHINKER framework? A deeper investigation is needed into the essential characteristics an expert must possess. For instance, is it absolute accuracy, the diversity of reasoning paths, or the magnitude of the logit shift post-RFT? A more solid contribution would involve discussing these factors and perhaps even exploring how one might intentionally train a more effective, targeted expert model specifically for this guidance role. Such an analysis would significantly enhance the paper's impact and provide more concrete guidance for the community.
2. Limited Generalizability of Base Models: The empirical validation exclusively uses models from the Qwen2.5-VL series as the base model. While these are strong models, this narrow selection raises concerns about the generalizability of the proposed method. It is plausible that the success of PROXYTHINKER is partially contingent on the specific latent reasoning capabilities inherent in the Qwen architecture and its instruction-tuning process. The paper's claims would be substantially more robust if the authors demonstrated effectiveness on other diverse and widely used LVLM families (e.g., InternVL / LLaVA-OV). Without such experiments, it remains an open question whether this is a universally applicable decoding technique or one that works particularly well for a specific class of models.

# Questions
See Weaknesses.

EditLens Prediction: Heavily AI-edited
ProxyThinker: Test-Time Guidance through Small Visual Reasoners

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

The paper proposes a training-free method to improve the visual reasoning of large vision-language models (LVLMs). The core problem it addresses is the immense computational cost of full-scale Reinforcement Fine-Tuning (RFT) on large models. The paper proposes to use a logit delta to transfer the "slow-thinking" reasoning behaviors (like self-correction) from the small expert to the large base model.

# Strengths
1. The paper targets a valuable problem: the training cost of RL.
2. The paper delivers a surprising and useful finding that reasoning behavior can be transferred from a small expert model to a large base model, alleviating the burden of training cost.
3. The authors have thoroughly addressed the practical viability of using three models at inference.

# Weaknesses
1. A significant trade-off in reasoning diversity (Pass@k): The paper's claim of "striking a balance" in reasoning exploration (Sec 4.2) is an oversimplification. The data in Figure 5 clearly shows that while Pass@1 performance (greedy decoding) is improved, the Pass@k performance for $k>4$ drops below that of the unguided base model. This suggests the guidance narrows the large model's reasoning diversity, forcing it down the single "slow-thinking" path favored by the small expert. This trade-off, improving the single best answer at the cost of exploratory reasoning capacity, is a major weakness that is not sufficiently discussed.
2. Unexplored mechanism of action: The paper's central hypothesis, that the logit delta $(z_{\text{expert}} - z_{\text{amateur}})$ isolates a scalable, model-agnostic "reasoning vector", is demonstrated to work but never explained. The ablation in Figure 4 proves that subtracting the amateur model is critical, but the paper provides no deeper analysis into the nature of this delta. Is it merely up-weighting a shared vocabulary of "reasoning tokens" (e.g., "Wait," "let's check"), or is it manipulating a deeper, abstract representation of the reasoning process? The simplicity of the mechanism makes this lack of explanation a significant gap.
3. The method's effectiveness is strictly bounded by the small expert's capabilities.
   - As admitted in Appendix A.4, the method fails on knowledge-intensive tasks (like MMMU) where the small expert lacks the requisite knowledge. It cannot transfer knowledge, only reasoning patterns, and a correct pattern applied to incorrect knowledge still fails.
   - As shown in Appendix B.2, a low-quality expert results in negative transfer, actively harming the large model's performance. This implies the method is not a universal booster; it requires a small expert that is both knowledge-sufficient for the domain and RFT-trained to a high-quality "floor."

# Questions
1. Could you elaborate on the Pass@k trade-off? Do you view the loss of reasoning diversity (worse Pass@k performance) as an acceptable price for the gain in Pass@1 performance? Does this imply RFT-style reasoning (at least as captured by this method) inherently prunes other valid reasoning paths?
2. Could you elaborate on why a simple logit delta works so well, and what the preconditions for its success are? Following on Weakness #2, what is in the delta $(z_{\text{expert}} - z_{\text{amateur}})$? Have you analyzed its composition? For instance, does the delta vector's direction change over a reasoning trace, perhaps pushing toward "sub-goal" tokens at the start and "verification" tokens near the end? A temporal analysis of the delta's top-k components would be fascinating (a minimal sketch of such an analysis follows this review).
3. The knowledge bottleneck (Appendix A.4) is a key limitation. How do you see this method performing on tasks that are a tight mix of reasoning and knowledge? Is there a risk that the small, knowledge-poor expert could misguide the large, knowledge-rich base model into a factually incorrect but "well-reasoned" answer?
4. Given that a poor expert causes "negative transfer" (Table 5), how should one practically determine whether a small expert is "good enough" to be a ProxyThinker? Its own standalone accuracy seems unreliable (e.g., OpenVLThinker-7B had only 25.3% on MathVision but gave a positive boost). Is there a "floor" of capability or a metric (beyond final accuracy) that predicts positive transfer?

EditLens Prediction: Fully AI-generated
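The temporal analysis suggested in Question 2 above could be prototyped in a few lines. The sketch below is a hypothetical illustration, not an experiment from the paper: the `topk_delta_tokens` helper, the toy vocabulary, and the random logits are placeholders for real per-step expert/amateur logits and a real tokenizer.

```python
import torch

def topk_delta_tokens(z_expert: torch.Tensor,
                      z_amateur: torch.Tensor,
                      id_to_token: dict,
                      k: int = 5) -> list:
    """Return the k tokens most up-weighted by the expert-minus-amateur
    logit delta at a single decoding step."""
    delta = z_expert - z_amateur
    values, indices = torch.topk(delta, k)
    return [(id_to_token[i.item()], round(v.item(), 3))
            for i, v in zip(indices, values)]

# Hypothetical trace: with real models, z_expert / z_amateur would be the
# logits of the RFT expert and its non-RFT counterpart at each step of a
# reasoning trace; here random logits over a toy vocabulary stand in.
vocab = {0: "Wait", 1: "Let's", 2: "check", 3: "therefore",
         4: "answer", 5: "the", 6: "verify", 7: "so"}
for step in range(3):
    z_expert, z_amateur = torch.randn(len(vocab)), torch.randn(len(vocab))
    print(f"step {step}:", topk_delta_tokens(z_expert, z_amateur, vocab, k=3))
```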