ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction  | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated   | 1 (25%)  | 6.00       | 3.00           | 2394
Heavily AI-edited    | 1 (25%)  | 8.00       | 4.00           | 3730
Moderately AI-edited | 0 (0%)   | N/A        | N/A            | N/A
Lightly AI-edited    | 2 (50%)  | 5.00       | 2.50           | 2822
Fully human-written  | 0 (0%)   | N/A        | N/A            | N/A
Total                | 4 (100%) | 6.00       | 3.00           | 2942
Title Ratings Review Text EditLens Prediction
Title: Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward
Ratings: Soundness: 3: good | Presentation: 3: good | Contribution: 2: fair | Rating: 6: marginally above the acceptance threshold | Confidence: 1: You are unable to assess this paper and have alerted the ACs to seek an opinion from different reviewers.

Summary: This paper proposes Perception-R1, a method to enhance multimodal reasoning in Multimodal Large Language Models (MLLMs) by introducing a visual perception reward alongside the standard accuracy reward in Reinforcement Learning with Verifiable Rewards (RLVR).

Key contributions:
1. Problem identification: Through McNemar's test, the authors demonstrate that existing accuracy-only RLVR fails to improve MLLMs' multimodal perception capabilities, which they identify as a major bottleneck.
2. Method: They introduce a visual perception reward that extracts textual visual annotations from CoT trajectories and uses a judging LLM to assess consistency between these annotations and model responses, providing an additional training signal beyond answer correctness.
3. Results: Using only 1,442 training samples, Perception-R1 achieves SOTA performance across multiple benchmarks, outperforming Vision-R1 (which uses 200K samples) and other baselines.

Strengths:
1. The proposed method provides a denser reward for the reinforcement learning process, which directly addresses the problem identified by the authors.
2. The experiments are extensive and thorough.
3. Using only 1,442 training samples, Perception-R1 achieves SOTA performance across multiple benchmarks, outperforming Vision-R1 (which uses 200K samples) and other baselines.

Weaknesses:
Concern 1: Marginal improvement beyond the GRPO baseline. According to Table 2, the performance improvements appear to be primarily driven by GRPO rather than the proposed visual perception reward. The reviewer notes that GRPO is also used in Vision-R1, making it unclear how much of the improvement is attributable to the novel contribution versus the baseline RL algorithm.
Concern 2: Dependence on judging-LLM quality. Figure 3b shows that with smaller judging LLMs (e.g., Qwen2.5-7B or 14B), performance sometimes drops below even the base model (e.g., MathVerse: 46.1% vs. 47.4% baseline; MathVision: 24.2% vs. 25.1% baseline). This raises questions about the robustness and practical applicability of the method when high-quality judging models are unavailable.

Suggestions: The reviewer suggests additional experiments in which (1) Vision-R1 is trained on the same 1,442 samples used in this paper, and (2) the same model is used to generate the CoT trajectories, to enable a fair comparison.

EditLens Prediction: Lightly AI-edited
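For context on the statistical analysis this review refers to, here is a minimal sketch of McNemar's test applied to paired perception outcomes on the same items before and after RLVR training. The data and variable names are hypothetical, not taken from the paper.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical paired perception outcomes on the same evaluation items:
# 1 = perception judged correct, 0 = incorrect.
before_rlvr = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
after_rlvr  = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 1])

# Discordant pairs: b = correct only before training, c = correct only after.
b = int(np.sum((before_rlvr == 1) & (after_rlvr == 0)))
c = int(np.sum((before_rlvr == 0) & (after_rlvr == 1)))

# McNemar's statistic (no continuity correction), chi-square with 1 df.
stat = (b - c) ** 2 / (b + c) if (b + c) > 0 else 0.0
p_value = chi2.sf(stat, df=1)

print(f"b={b}, c={c}, statistic={stat:.3f}, p={p_value:.3f}")
```

A non-significant p-value under this test would indicate no systematic shift in perception correctness, which is the kind of evidence the reviewer describes the authors using.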
Title: Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward
Ratings: Soundness: 4: excellent | Presentation: 3: good | Contribution: 3: good | Rating: 6: marginally above the acceptance threshold | Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: This paper introduces Perception-R1, a reinforcement learning framework that enhances multimodal reasoning in MLLMs by explicitly improving their visual perception. Specifically, the authors (1) extract visual annotations from correct chain-of-thought trajectories as ground-truth perceptual references, (2) employ a judging LLM to evaluate the consistency between these annotations and the model's generated reasoning, and (3) aggregate this feedback with accuracy and format rewards under the GRPO optimization scheme.

Strengths:
- The idea of augmenting RL with a verifiable visual perception signal represents a clear conceptual advance over prior RLVR frameworks (e.g., Vision-R1, MM-Eureka) that focus solely on final answer correctness.
- The authors conduct extensive evaluations on multiple multimodal benchmarks, demonstrating the method's effectiveness and robustness.
- The paper is well-structured and clearly written.

Weaknesses:
- The paper lacks systematic exploration of critical parameters such as the perception reward weight (γ) and judgment thresholds, leaving robustness questions unanswered.
- Although data-efficient, the additional judging and reward assignment stages may increase computational overhead, which is not quantitatively discussed.
- The paper would benefit from more qualitative evidence demonstrating how the model's perception improves, e.g., visual attention maps, step-by-step perception-reasoning examples, or case studies showing corrected misperceptions. Such analyses would strengthen interpretability and directly connect the proposed reward to perceptual behavior.
- The method's success relies heavily on the quality and alignment of the judging LLM used to evaluate perceptual consistency. As shown in Figure 3(b–c), weaker judges introduce reward hacking and degrade performance, but the paper stops short of analyzing why this happens or proposing safeguards (e.g., calibration, ensemble judgment, or confidence filtering). Further discussion or mitigation strategies would make the approach more robust and reproducible across settings.
- The perception reward weight (γ) and the number/quality of visual annotations are central to the method, yet their interactions are not fully studied. Figure 3(a) provides only coarse exploration. More systematic experiments varying γ and annotation noise would clarify stability and guide practitioners in tuning the method.

EditLens Prediction: Fully AI-generated
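For reference, here is a minimal sketch of the kind of reward aggregation this review describes, assuming a weighted sum of accuracy, format, and perception terms with weight γ. The verifier, format check, and weighting form are illustrative assumptions, not the paper's exact implementation.

```python
import re

def accuracy_reward(response: str, gold_answer: str) -> float:
    """Assumed verifier: 1 if the boxed final answer matches the gold answer."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == gold_answer.strip() else 0.0

def format_reward(response: str) -> float:
    """Assumed format check: reasoning wrapped in the expected think tags."""
    return 1.0 if "<think>" in response and "</think>" in response else 0.0

def combined_reward(response: str, gold_answer: str,
                    judge_consistency: float, gamma: float = 0.5) -> float:
    """Hypothetical per-rollout reward: accuracy + format + gamma * perception.

    judge_consistency is assumed to be a [0, 1] score from a judging LLM that
    compares the response against visual annotations extracted from correct
    CoT trajectories.
    """
    return (accuracy_reward(response, gold_answer)
            + format_reward(response)
            + gamma * judge_consistency)
```

Under such a formulation, the reviewer's point about γ and judgment thresholds amounts to asking how sensitive training is to the relative scale of the third term.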
Title: Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward
Ratings: Soundness: 3: good | Presentation: 3: good | Contribution: 3: good | Rating: 8: accept, good paper | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper introduces Perception-R1, a method to improve multimodal reasoning by fixing the key bottleneck of poor visual perception. The authors find that standard accuracy-only reinforcement learning (RLVR) fails to correct these perception errors, as models can guess the right answer despite flawed visual understanding. Perception-R1 addresses this by adding a visual perception reward. This reward is calculated by a judging LLM that compares the model's response to pre-extracted "visual annotations" (key visual facts) from correct solutions. Experiments show this method achieves superior performance on multiple benchmarks with high data efficiency, using only 1,442 training samples.

Strengths:
1. The paper provides a clear and compelling statistical analysis (using McNemar's test) of accuracy-only RLVR-trained MLLMs. This builds a strong case that a significant bottleneck for current models is indeed multimodal perception, not just high-level reasoning.
2. The proposed visual perception reward is intuitive and cleverly designed. By having an LLM judge responses against verifiable, extracted annotations rather than training a holistic reward model, the method directly targets the identified bottleneck. This approach appears significantly more robust to the reward hacking that can harm end-to-end MLLM-as-reward-model RLVR.
3. The performance gains achieved using only 1,442 training samples are impressive. This strongly suggests that a higher-quality, more targeted reward signal (i.e., combining perception and accuracy) can be far more sample-efficient than simply scaling up data for a sparser, accuracy-only reward.
4. The method delivers substantial performance improvements not only on its training domain (math/geometry) but also, surprisingly, across several general-domain benchmarks, outperforming baselines that used orders of magnitude more data.

Weaknesses:
1. Limited analysis of generalization: The model's strong generalization from geometry-only training (Geometry3K) to general-domain benchmarks (like MMMU and MMStar) is a key result, but it is not fully explained. The authors hypothesize that they are improving a foundational perception capability, but the link between 'perceiving geometry diagrams' and 'perceiving real-world images' could be strengthened. To make this claim more concrete, the authors could include:
- Qualitative analysis on general benchmarks: Provide qualitative examples from MMMU or MMStar. Does the Perception-R1 model now exhibit the same "describe-then-solve" behavior on these general-domain images? Where do the baseline models fail on perception in these tasks? Is Perception-R1 delivering more accurate visual perception in these examples?
- Error breakdown on general benchmarks: Conduct a small-scale error analysis on a subset of a general benchmark (like MMMU-Pro, where they show strong results). What percentage of the baseline's failures on these tasks are due to perception errors, and what percentage of those specific errors does Perception-R1 fix? This would directly support the claim of foundational perception improvement.
2. Dependence on a single training data domain: The reliance on Geometry3K, while clearly effective, is a potential limitation. The data curation pipeline itself seems general, but its effectiveness has only been demonstrated on this one domain. An ablation study training on a different domain (e.g., general textbook diagrams, or even a VQA dataset) using the same pipeline would be highly valuable. This would help demonstrate the general applicability of the Perception-R1 framework, distinguishing its contribution from the (clearly very effective) choice of geometry data as a training source.

Questions: Please see the weaknesses above.

EditLens Prediction: Heavily AI-edited
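For reference, here is a minimal sketch of how the judging step described in this review could be implemented: a judge model scores whether each pre-extracted visual annotation is consistent with the policy model's response. The OpenAI-compatible client, prompt wording, and judge model name are illustrative assumptions, not the paper's actual interface.

```python
from openai import OpenAI  # any chat-completion client would do; assumed here

client = OpenAI()  # assumes an API key is configured in the environment

JUDGE_PROMPT = """You are given a model response to a visual reasoning problem
and one ground-truth visual annotation extracted from a correct solution.
Answer "yes" if the response is consistent with the annotation, otherwise "no".

Annotation: {annotation}
Response: {response}"""

def perception_consistency(response: str, annotations: list[str],
                           judge_model: str = "qwen2.5-72b-instruct") -> float:
    """Fraction of annotations the judge deems consistent with the response."""
    hits = 0
    for annotation in annotations:
        reply = client.chat.completions.create(
            model=judge_model,  # hypothetical served judge model
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(annotation=annotation,
                                                      response=response)}],
            temperature=0.0,
        )
        if reply.choices[0].message.content.strip().lower().startswith("yes"):
            hits += 1
    return hits / len(annotations) if annotations else 0.0
```

The reviewer's concern about weaker judges maps directly onto this step: a noisy yes/no judgment here propagates straight into the perception reward.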
Title: Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward
Ratings: Soundness: 2: fair | Presentation: 3: good | Contribution: 3: good | Rating: 4: marginally below the acceptance threshold | Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper tackles a key but often neglected limitation in reinforcement learning for Multimodal Large Language Models (MLLMs): existing Reinforcement Learning with Verifiable Rewards (RLVR) methods focus solely on final answer correctness, overlooking the accuracy of visual perception during reasoning. The authors show that such outcome-only rewards allow models to guess correct answers despite severe perception errors. To address this, they propose Perception-R1, which introduces a verifiable visual perception reward into RLVR. This reward is derived from textual visual annotations extracted from high-quality CoT trajectories and evaluated by a judging LLM that measures consistency between model outputs and these annotations.

Contributions:
1. Empirically and statistically demonstrate that accuracy-only RLVR fails to enhance multimodal perception.
2. Introduce a novel, verifiable visual perception reward that alleviates reward sparsity and improves perceptual grounding.
3. Achieve state-of-the-art performance on multiple multimodal reasoning benchmarks using only 1,442 training samples, showing exceptional data efficiency.

Strengths:
1. This paper reveals the impact of poor perception on reasoning performance. Current RLVR methods fail to enhance multimodal perception, which fundamentally limits the reasoning performance of MLLMs.
2. The introduced Perception-R1 framework incorporates a novel visual perception reward that significantly strengthens the visual understanding and reasoning capabilities of MLLMs, particularly in mathematical reasoning tasks.
3. Extensive experiments across multiple benchmarks verify that Perception-R1 substantially improves both perception and reasoning performance, achieving superior results even with highly limited training data.

Weaknesses:
1. The paper claims that it enhances the multimodal reasoning capabilities of MLLMs through improved perception. However, the presented results do not provide direct evidence that the observed performance gains stem specifically from enhanced perception. I suggest including an analysis or ablation that directly links the perception improvement to the reasoning gains.
2. While the paper reports significant improvements on multimodal math benchmarks, these results primarily reflect reasoning performance rather than perception itself. To convincingly demonstrate perception enhancement, it would be helpful to include evaluations on dedicated perception-level benchmarks (e.g., BLINK, MMBench, MME, or similar datasets).
3. The method employs Gemini-2.5 Pro to generate CoT trajectories and uses an LLM to extract visual annotations, followed by GRPO training on these annotations. This pipeline closely resembles a distillation process from Gemini-2.5 Pro, which may primarily transfer reasoning knowledge rather than genuinely improve perception. It would strengthen the paper to disentangle and clarify whether the observed gains truly originate from improved perception rather than implicit reasoning distillation.

Questions:
1. After distilling the CoT trajectories from Gemini 2.5 Pro, could you clarify why an LLM is used to transform these trajectories into atomic statements? In particular, how does this approach differ from directly inputting the trajectory data into the LLM to evaluate the atomic statements?

EditLens Prediction: Lightly AI-edited
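As a point of reference for the question above, here is a sketch of what the trajectory-to-atomic-statement extraction step might look like, assuming a generic LLM call. The prompt wording and parsing are illustrative assumptions, not the paper's actual pipeline.

```python
EXTRACTION_PROMPT = """Below is a correct chain-of-thought solution to a visual
math problem. List every visual fact it states about the image (lengths, angles,
labels, relationships) as short, self-contained statements, one per line.

Solution:
{trajectory}"""

def extract_visual_annotations(trajectory: str, call_llm) -> list[str]:
    """call_llm is assumed to be any function mapping a prompt string to a reply.

    Each returned line is treated as one atomic visual annotation. Splitting the
    trajectory into atomic statements lets a judge verify each visual fact
    separately, rather than scoring the whole trajectory against a response
    in a single pass.
    """
    reply = call_llm(EXTRACTION_PROMPT.format(trajectory=trajectory))
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]
```

The reviewer's question essentially asks what this intermediate decomposition buys over handing the raw trajectory straight to the judging LLM.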