ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (100%) | 5.33 | 3.33 | 2271 |
| Total | 3 (100%) | 5.33 | 3.33 | 2271 |
QuRL: Rubrics As Judge For Open-Ended Question Answering

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The authors propose a novel approach to improving the performance of LLMs on open-ended QA. The paper introduces QuRL (Open-Ended Question Answering with Rubric-guided Reinforcement Learning), a framework that transforms human-authored web sources into case-wise rubrics. The key contributions are:
• QuRL, a framework that leverages internet text to construct case-wise rubrics as reward signals for open-ended question answering.
• Two new datasets created with the assistance of QuRL: a QuRL-Train dataset consisting of 800 question–rubric pairs, and a QuRL-Test dataset of 400 entries that underwent human verification.

Strengths:
The main strength of the paper is its novel method for obtaining rubrics on a per-case basis and using these rubrics to train LLMs to give answers that are more human-like. The authors have conducted extensive experiments with a variety of LLMs, comparing their approach to existing benchmarks. Thorough implementation details and results are a notable strength. The results indicate that QuRL achieves strong correlation with human judgments, outperforming existing approaches.

Weaknesses:
Nothing major; however, the practical impact of the work is difficult to assess from a use-case point of view. The authors do not appear to have studied the case where the LLM misrepresents facts in an answer, and how the rubrics penalize such behavior.

Questions:
1. The paper would be easier to follow if a couple of examples of generated rubrics were shown in the main paper.
2. A minor typo in line 90: "Deepseek".

EditLens Prediction: Fully human-written
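This review (and the ones below) describe the core mechanism as case-wise rubrics acting as a reward signal for open-ended answers. A minimal sketch of one way such a reward could be computed is shown here; `judge_llm`, the prompt wording, and the list-of-strings rubric format are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of case-wise rubrics used as a reward:
# each rubric criterion is checked by a judge model, and the normalized count of
# satisfied criteria becomes the scalar reward for the sampled answer.
# `judge_llm` and the rubric schema are hypothetical stand-ins.
from typing import Callable, List

def rubric_reward(question: str,
                  answer: str,
                  rubric: List[str],
                  judge_llm: Callable[[str], str]) -> float:
    """Score an open-ended answer against its case-wise rubric criteria."""
    satisfied = 0
    for criterion in rubric:
        prompt = (
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            f"Criterion: {criterion}\n"
            "Does the answer satisfy this criterion? Reply YES or NO."
        )
        if judge_llm(prompt).strip().upper().startswith("YES"):
            satisfied += 1
    # Normalize so the reward lies in [0, 1] regardless of rubric length.
    return satisfied / max(len(rubric), 1)
```

Binary per-criterion checks keep the reward bounded and easy to inspect; the paper may instead use graded or weighted criteria.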
QuRL: Rubrics As Judge For Open-Ended Question Answering

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper examines the automatic creation of rubrics for a given open-ended question $q$. The idea is to induce a set of useful rubrics to score generation output, so as to allow an RLVR-like mechanism to apply to such questions. The approach proxies search engine API ranking results as a means of implicitly creating a preferential document listing that applies to $q$. Finally, with the stable of metrics derived from a limited training set (also provided by the authors), the authors apply GRPO as the RL algorithm to tune the output, resulting in visible improvements on 3 benchmarks.

Strengths:
* Contributes a training set for RL tuning of such open-ended questions (albeit quite small: 800 train / 400 test).
* Substantial baselining against a range of models provides useful input.
* Conducts a human evaluation to help assess consistency, showing that the correlation is not high (although this is not discussed much).
* Contributes a useful introspection of the aligned / created metrics in the **4.5 Case Study** section.

Weaknesses:
* The submission is generally OK, but suffers from a lack of proofreading: the arguments are not crisp and precise.
* In the face of prior work (see **Questions**), I find the work incremental, as there has been directly relevant work on inferring rubrics from questions.
* The work implies that the metrics are created _per question_ and, as described, over _multiple runs_ (but I may be mistaken, since the training seems to be done over the entire set of rubrics for the training set). This seems overly expensive, yet the inference cost of creating such metrics is not discussed.
* Some main-text figures reference key metadata only present in the Appendices. This makes the paper reliant on the supplemental materials, and hence I would judge it as not fitting within the length requirements for ICLR.

Other minor problems:
* The "contributions" final paragraph of the intro overlaps significantly with the rest of the work; it would be better to leave it out or at least de-duplicate the text with the rest of the intro.
* 090: Deepseek is misspelled.
* 121-124: your claim should mention on what data; the gains cannot be properly scoped otherwise.
* It seems you use $\LaTeX$ for typesetting. Please use the appropriate quote structure (e.g., the 301-303 caption).
* The ethics statement is misused. The ethics block should not be used to describe annotation guidelines and data capture; that should be part of the normal space requirements.
* References should be checked for correct formatting. Many titles have capitalisation protection errors (e.g., "llms" vs. "LLMs").

Questions:
* There are a few relevant works on inferring metrics for tasks that don't have specific metrics, which are relevant means of inducing guidelines/metrics and seem to be missing from the related work. See:
  * Minzhi Li, Zhengyuan Liu, Shumin Deng, Shafiq Joty, Nancy Chen, and Min-Yen Kan (2025). DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2277–2290, Abu Dhabi, UAE. Association for Computational Linguistics.
  * Do Xuan Long, Duong Ngoc Yen, Do Xuan Trong, Luu Anh Tuan, Kenji Kawaguchi, Shafiq Joty, Min-Yen Kan, and Nancy F. Chen (2025). Beyond In-Context Learning: Aligning Long-form Generation of Large Language Models via Task-Inherent Attribute Guidelines. In Findings of the Association for Computational Linguistics (ACL '25), Vienna, Austria.
* 330: how important is the cold-start SFT to the work? While not a contribution, it is important to know.
* 296: is this meant to say `w/o RLHF`? Also, I am not sure whether these are additive ablations or single ablations. Notation might help.

EditLens Prediction: Fully human-written
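The summary above notes that GRPO is the RL algorithm applied on top of the rubric-derived rewards. The sketch below illustrates only the group-relative normalization step that gives GRPO its name, under the assumption that each sampled answer for a question receives a scalar rubric score (e.g., from a judge like the one sketched earlier); it omits the policy-gradient and KL-regularization terms of the full objective.

```python
# A minimal, illustrative sketch of GRPO-style group-relative advantages:
# sample a group of answers for one question, score each with the rubric reward,
# then normalize within the group so better-than-average answers get positive
# advantages. Not the authors' implementation.
import statistics
from typing import List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize per-sample rubric rewards within a sampled group (GRPO-style)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1e-6  # avoid division by zero when all rewards tie
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers for one question, scored by the rubric judge.
print(group_relative_advantages([0.75, 0.25, 0.5, 1.0]))
```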
QuRL: Rubrics As Judge For Open-Ended Question Answering

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes a rubric-guided, RL-based framework for open-ended QA. The authors claim that the GRPO-based approach with case-wise rubrics significantly improves QA performance on multiple evaluation benchmarks. The main contributions of this paper are as follows:
1. QuRL, a framework that leverages internet text to build case-wise rubrics as reward signals for open-ended QA.
2. A new QuRL-Train dataset of question–rubric pairs, along with a human-verified test set for evaluation.

Strengths:
1. The question–rubric train and test sets may be useful for researchers.
2. Rubric-based reward modeling seems to have some novelty and is interesting.
3. Detailed experiments and comparisons with SOTA LLMs support the claim that QuRL improves average performance on multiple benchmarks.
4. Human verification of the dataset improves trust and increases quality.
5. The consistency analysis between the automatic evaluation method and human judgments is interesting.

Weaknesses:
1. The proposed approach relies on search-engine results for its initial evidence.
2. The LLMs used for meta-description generation may introduce their own errors and biases.
3. The design of the rubrics may be subjective and may have generalizability issues.

Questions:
1. How are the rubric parameters decided? Are they domain-specific?
2. How strong is the reward signal coming from the rubrics?
3. How do agentic QA methods like ReAct compare against QuRL?

EditLens Prediction: Fully human-written
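Several of the reviews point to the consistency analysis between the automatic rubric-based evaluation and human judgments. A rank-correlation check of that kind could look like the sketch below; the score lists are made-up placeholders, not values from the paper.

```python
# Illustrative consistency check: rank-correlate rubric-judge scores with human
# ratings for the same answers. All numbers here are hypothetical placeholders.
from scipy.stats import spearmanr

judge_scores = [0.8, 0.4, 0.9, 0.5, 0.7]   # hypothetical rubric-judge rewards
human_ratings = [4, 2, 5, 3, 4]            # hypothetical human Likert ratings

rho, p_value = spearmanr(judge_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```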