ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction    Count      Avg Rating   Avg Confidence   Avg Length (chars)
Fully AI-generated     0 (0%)     N/A          N/A              N/A
Heavily AI-edited      0 (0%)     N/A          N/A              N/A
Moderately AI-edited   2 (50%)    4.00         4.50             2968
Lightly AI-edited      0 (0%)     N/A          N/A              N/A
Fully human-written    2 (50%)    3.00         4.00             1914
Total                  4 (100%)   3.50         4.25             2441
Title: Group Pattern Selection Optimal: Let LRMs Pick the Right Pattern for Reasoning

Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
The authors present a method that incorporates a set of handcrafted reasoning patterns when fine-tuning LRMs with RLVR. To do so, they prompt the model with the pattern in the prompt and keep the best pattern according to success rate for the advantage calculation. They additionally mask out the pattern suffix so that it does not contribute to the updates.

Strengths:
- Evaluates the performance of the approach on a variety of math reasoning domains.
- Performs an ablation study of each component of the method to show its contribution.

Weaknesses:
- GRPO with a sample-equivalent N (e.g., N = num_patterns x num_samples per pattern) is not compared against as a baseline.
- Patterns are handcrafted per task and can be viewed as a basic form of prompt optimization [1,2,3,4,5], which has a rich history of approaches and is not compared against. The novelty of the proposed approach is therefore unclear.

References
[1] DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
[2] RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning
[3] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
[4] A Systematic Survey of Automatic Prompt Optimization Techniques
[5] Prefix-Tuning: Optimizing Continuous Prompts for Generation

Questions:
See Weaknesses above.

EditLens Prediction: Fully human-written
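To make the first weakness concrete, here is a minimal sketch of the budget-matched comparison the reviewer asks for; the pattern and sample counts are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (illustrative numbers only): matching the rollout budget of a
# plain-GRPO baseline to GPSO's multi-pattern rollouts.
num_patterns = 3          # handcrafted reasoning-pattern suffixes per question (assumed)
rollouts_per_pattern = 8  # samples drawn under each pattern (assumed)

gpso_rollouts = num_patterns * rollouts_per_pattern  # what GPSO spends per question
grpo_group_size = gpso_rollouts                      # N for a sample-equivalent GRPO baseline

print(f"GPSO rollouts per question: {gpso_rollouts}")         # 24
print(f"Budget-matched GRPO group size N: {grpo_group_size}")  # 24
```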
Title: Group Pattern Selection Optimal: Let LRMs Pick the Right Pattern for Reasoning

Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper identifies that Large Reasoning Models (LRMs) often default to dominant, sub-optimal reasoning patterns. It introduces Group Pattern Selection Optimization (GPSO), a reinforcement learning framework extending GRPO. The method operates by performing multi-pattern rollouts, where a single problem is prompted with several different "reasoning pattern suffixes". A verifier signal is used to identify the most effective pattern for that specific problem, and the model's policy is then updated using only the trajectories from that optimal pattern group. GPSO employs attention masking to prevent the model from overfitting to the explicit pattern suffixes, thereby forcing it to learn an intrinsic mapping from the problem to the optimal reasoning strategy. Experiments on math and science benchmarks demonstrate consistent, though often modest, performance gains.

Strengths:
- The motivation is clear. Figure 1 demonstrates that different reasoning patterns yield different accuracies for the same problem and that current RLVR pipelines can bias models toward a single, sub-optimal dominant pattern.
- The proposed training mechanism, GPSO, is general and well-designed. It builds on GRPO by adding multi-pattern rollouts, best-pattern selection, and attention masking, and could potentially be integrated into various RLVR-style pipelines.
- The reasoning-pattern optimization, which uses masking for pattern-prompt invariance and a best-group update to force the model to learn an internal reasoning-pattern policy, is a valuable idea for improving the quality of the intermediate reasoning tokens generated by the model.
- The method shows consistent performance gains, particularly on weaker to midsize models (e.g., ~2–3 point average gains on 1.5B–8B parameter models) on difficult math benchmarks.

Weaknesses:
1. The pipeline's absolute effectiveness remains unclear because the trained model still trails well behind the oracle "Best" per-question pattern upper bound from the analysis in Figure 1. The gap (e.g., the oracle reaches 90% on AIME2024 with Qwen3 thinking, while the trained model achieves 77.5%, less than a 1-point gain over the 76.7% baseline and behind even some reasoning-pattern prompting methods) suggests the bottleneck is only slightly addressed.
2. The training objective relies on a "hard" best-pattern selection. The method always picks the single best pattern group (by highest verified accuracy) and completely ignores all other patterns. This discards potentially useful training signals from other rollouts. An ablation exploring "soft" inclusion (e.g., weighting each pattern group by its verifier score) would be useful to support the hard best-only design.
3. There is a lack of statistical significance reporting. Evaluation relies on Pass@1 averaged over 4 samples per problem, decoded at a non-zero temperature (T=0.6). Results are reported as single numbers without variance or confidence intervals. This reduces the impact of the findings, especially since several gains on stronger models are less than 1 point (e.g., +0.8 on Qwen3-8B) and could be statistical noise.
4. Despite discussing patterns like "employing tools", the experiments are confined to math/science QA with textual reasoning. There is no evaluation on tasks that require external tools (e.g., code execution, retrieval, calculators), task decomposition, or search, so it is unclear whether pattern effects and GPSO's gains persist in those domains.

Questions:
1. Can you report variance or confidence intervals for the results in Table 1?
2. How does GPSO's performance compare to a simpler inference-time selection baseline, for example prompting the base model with all n (3) patterns, sampling m times, and using majority voting to select the final answer? This would help isolate the gains from the RL training itself versus the multi-pattern ensemble.
3. What is your intuition for why a "soft inclusion" of other reasoning patterns (e.g., a weighted mix based on verifier scores) would not outperform the "best-only" hard selection strategy?
4. Have you measured the intrinsic problem -> pattern mapping accuracy (e.g., the probability of generating the correct pattern without an explicit prompt) before and after GPSO training?

EditLens Prediction: Moderately AI-edited
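For readers unfamiliar with the mechanism the reviews describe, the following is a minimal sketch of GPSO's selection step: roll out each pattern suffix, score rollouts with a verifier, keep only the best pattern's group for GRPO-style advantages, and exclude the suffix tokens from the update. All names (`rollout`, `verify`, the trajectory dict keys) are hypothetical stand-ins, not the paper's actual interface, and the loss mask is a simplified stand-in for the paper's attention masking.

```python
from statistics import mean, pstdev

PATTERNS = ["direct solution", "reflect and verify", "explore multiple solutions"]

def gpso_training_signal(problem, rollout, verify, samples_per_pattern=8):
    # 1) Multi-pattern rollouts: several trajectories per pattern suffix.
    #    Each trajectory is assumed to be a dict: {"tokens": [...], "token_is_suffix": [...]}.
    groups = {
        p: [rollout(problem, pattern_suffix=p) for _ in range(samples_per_pattern)]
        for p in PATTERNS
    }

    # 2) Verifier-based selection: keep the pattern with the highest empirical accuracy.
    rewards = {p: [float(verify(problem, t)) for t in trajs] for p, trajs in groups.items()}
    best = max(PATTERNS, key=lambda p: mean(rewards[p]))

    # 3) GRPO-style group-normalized advantages, computed only on the best group.
    r = rewards[best]
    mu, sigma = mean(r), pstdev(r) or 1.0  # guard against zero variance
    advantages = [(x - mu) / sigma for x in r]

    # 4) Suffix masking (simplified): pattern-suffix tokens are excluded from the update,
    #    so the model must learn the problem -> pattern mapping rather than copy the prompt.
    loss_masks = [[not is_suffix for is_suffix in t["token_is_suffix"]] for t in groups[best]]

    return groups[best], advantages, loss_masks
```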
Title: Group Pattern Selection Optimal: Let LRMs Pick the Right Pattern for Reasoning

Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes Group Pattern Selection Optimization (GPSO), an RL framework for training reasoning models to adaptively select the right reasoning pattern for a problem. The paper considers three reasoning patterns: direct solution, reflect-and-verify, and explore multiple solutions, each of which is a 1-2 sentence addition to the prompt template (App. B.2). GPSO rolls out multiple responses for each pattern, selects the pattern with the highest empirical accuracy, and updates the policy only on rollouts from the optimal pattern. The authors mask attention to the suffix tokens during gradient computation to prevent overfitting to explicit prompts. Their experiments show consistent gains across the four reasoning benchmarks.

Strengths:
The paper is generally clearly written. It has clear motivation and a good visual presentation.

Weaknesses:
- The abstract claims "GPSO learns the intrinsic mapping from problem to pattern", which led me to believe that the model is autonomously discovering reasoning patterns during training. In reality, GPSO selects one of three (or four?) pre-defined prompt templates during training. I think the framing of the paper overstates the contribution, and that it's more accurate to view GPSO as an improvement to rollout sampling that prioritizes supervision from high-reward patterns.
- Generally, I think the fact that the method relies on the diversity of a small number of hand-crafted prompt templates is a limiting factor.
- In the intro (lines 82-89), you claim that Figure 1 "demonstrates that if LLMs were capable of dynamically selecting the most suitable pattern, their overall performance could be enhanced by a substantial margin". This is a core motivation for your paper. If I understand the Best bar in Figure 1 correctly, you took a question-wise maximum over the different patterns and averaged that value. This sampling procedure is an unfair comparison, similar to pass@k vs pass@1.
- The captions are overall quite sparse and don't contain enough self-contained context.

Questions:
- Could you comment on the computational cost? If I understand correctly, you'd have to roll out three times more examples at each training iteration. If that is correct, perhaps it's reasonable to consider regular GRPO with 3x more iterations as another point of comparison?
- In a few places, you describe your method name as Group Pattern Selection *Optimal* (GPSO). This is a typo, right? I'm assuming you mean Optimization.
- Very minor, but in line 283 you say that your metric is Pass@k with k=1. This is just accuracy; I don't see why you're mentioning Pass@k at all.

EditLens Prediction: Fully human-written
Title: Group Pattern Selection Optimal: Let LRMs Pick the Right Pattern for Reasoning

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
GPSO is a reinforcement-learning framework that lets a model try several high-level reasoning patterns on the same problem and then pick the best one using verifier signals. It extends GRPO with multi-pattern rollouts and an attention-masking trick so the model doesn't just memorize pattern suffixes but actually learns when to use which pattern. The authors' analysis shows that current models often choose a suboptimal pattern, and GPSO learns a problem→pattern mapping that reduces this mismatch. The authors also conduct experiments to show the performance of their method.

Strengths:
1. The method explicitly models diverse reasoning patterns instead of forcing one dominant pattern.
2. The method uses verifier signals to select the best pattern per problem, improving sample efficiency and final accuracy.

Weaknesses:
1. The paper should include a baseline that uses the original GRPO but with a comparable (i.e., larger) number of rollouts/samples per batch, to show that GPSO's gains are not just from extra sampling.
2. The procedure for constructing the pattern set needs to be spelled out: how are the patterns obtained, filtered, and validated, and what evidence do we have that this set is sufficiently diverse for the target domains?
3. The motivation for "selecting" a single best pattern is unclear; please clarify why optimizing all correct patterns isn't preferable, since learning to solve a problem via multiple patterns may improve robustness and generalization.
4. Please discuss applicability to stronger/larger models: do they still benefit from explicit pattern conditioning, or do their naturally diverse trajectories make GPSO less useful at scale?

Questions:
Please refer to the weaknesses above.

EditLens Prediction: Moderately AI-edited
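The third weakness above, like the "soft inclusion" question in an earlier review, asks why signal from non-best patterns is discarded. The sketch below illustrates one such alternative: weighting every pattern group's advantages by its verifier accuracy instead of keeping only the single best group. Names and numbers are hypothetical; this is not the paper's method.

```python
from statistics import mean, pstdev

def soft_pattern_advantages(rewards_by_pattern):
    """rewards_by_pattern maps a pattern name to a list of 0/1 verifier rewards."""
    weighted = {}
    for pattern, rewards in rewards_by_pattern.items():
        acc = mean(rewards)             # verifier score of this pattern group
        sigma = pstdev(rewards) or 1.0  # guard against zero variance
        # Group-normalized advantages, down-weighted by the group's accuracy rather
        # than discarded when the pattern is not the single best one.
        weighted[pattern] = [acc * (r - acc) / sigma for r in rewards]
    return weighted

# Example: the weaker "reflect and verify" group still contributes (scaled) signal.
print(soft_pattern_advantages({
    "direct solution": [1, 1, 0, 1],
    "reflect and verify": [1, 0, 0, 0],
}))
```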