JudgeLRM: Large Reasoning Models as a Judge
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces JudgeLRM, a family of reinforcement learning–trained models designed for judgment tasks that demand complex reasoning. The central hypothesis is that many judgment and reward-scoring tasks inherently require reasoning, which limits the effectiveness of standard SFT-based judges and reward models. To address this, the authors propose an RL framework that trains models to reason and evaluate by rewarding them based on the accuracy of their predicted scores relative to gold human or frontier model judgments. Experiments demonstrate that JudgeLRM models outperform both SFT and baselines such as BT and DPO, highlighting the importance of using reasoning and RL for developing better judges.
* The motivation is clear: judgment tasks are inherently reasoning-intensive, and this work provides an interesting attempt to align reward learning with reasoning ability.
* The empirical results are strong, showing consistent improvements over SFT and existing baselines (e.g., BT, DPO).
* **Restricted applicability**: The framework seems limited to settings where ground truth scores are available. It’s unclear how it extends to preference-based or binary feedback data.
* **Writing**: The paper is poorly written: the method is not introduced properly, the introduction is overloaded with experimental detail, and the results are not presented cleanly. The baselines and experimental design could be explained significantly better.
* **Heuristic reward design**: The absolute reward formulation and its hyperparameters appear heuristic and tuned to a specific score distribution, raising concerns about generalizability (see the sketch after this list).
* **Confidence reward is poorly motivated**: It’s unclear why a non-continuous confidence reward is used, and it seems to specifically encourage overconfidence rather than calibrated predictions.
* **Pairwise-only framework**: Since the model is trained pairwise, it may struggle to produce meaningful single-response scores, limiting its inference-time utility.
* **Weak case study**: The case study relies purely on qualitative analysis, without quantitative evidence to substantiate claims about emergent complex reasoning behaviors (e.g., verification, subgoal setting).
* **Limited related work**: The related work section is underdeveloped. A few other works also train reasoning reward models; even if they are concurrent, some more discussion is necessary.
* **Poor presentation choices**: Table 1 placement is suboptimal—this section should feature a main figure summarizing the method or key results.
* **Cluttered introduction**: The introduction includes excessive experimental discussion. These details would fit better in the Motivation or Background section.
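To make the scale-sensitivity concern (also raised in Question 3 below) concrete, here is a minimal, purely hypothetical sketch; the reward function and its tolerance bands are illustrative assumptions, not the paper's actual formulation:

```python
# Hypothetical illustration of the scale-sensitivity concern (not the paper's
# actual reward): an absolute-score reward with fixed tolerance bands only
# behaves sensibly for the score range it was tuned on.
import random

def absolute_score_reward(pred: float, gold: float) -> float:
    """Toy reward: full credit within +/-1 point, partial within +/-2, else 0."""
    err = abs(pred - gold)
    if err <= 1.0:
        return 1.0
    if err <= 2.0:
        return 0.5
    return 0.0

def mean_reward(scale: float, noise: float, n: int = 10_000) -> float:
    """Average reward for a judge whose predictions equal gold plus Gaussian noise,
    with gold scores drawn uniformly from [0, scale]."""
    total = 0.0
    for _ in range(n):
        gold = random.uniform(0, scale)
        pred = gold + random.gauss(0, noise)
        total += absolute_score_reward(pred, gold)
    return total / n

random.seed(0)
# On a 0-10 scale, a judge that is off by ~1 point still receives a useful signal.
print("0-10 scale, ~1-point error: ", mean_reward(scale=10, noise=1.0))
# On a 0-100 scale, the same *relative* accuracy (off by ~10 points) earns ~0 reward.
print("0-100 scale, ~10-point error:", mean_reward(scale=100, noise=10.0))
```

The same relative prediction error yields a useful training signal on the [0, 10] scale but nearly zero reward on [0, 100], so the tolerance hyperparameters would need to be retuned for every score distribution.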
1. How would the framework handle preference data (e.g., pairwise comparisons without numeric scores)?
2. Can you clarify the motivation and formulation of the confidence reward? Why is it not continuous, and why does it not encourage calibration?
3. I am highly doubtful about the generalizability of the method to different score distributions. For example, if the ground-truth scores are distributed between 0 and 100, the reward hyperparameters would need to be retuned accordingly.
4. The claim that these rewards encourage calibrated confidence is probably incorrect, as no proper scoring rule [2] was used in the reward function (see the note after the questions). Can the authors argue why this should be true?
5. How were the reward hyperparameters chosen, and how sensitive are results to these values?
6. Since the framework is pairwise by design, how is it adapted for single-response evaluation, and how reliable is that setup? If single-response evaluation is possible, it would be useful to have a baseline trained specifically for that setting to test if relational reasoning is truly necessary.
7. Could you provide explicit equations for the baselines, especially DPO (see the note after the questions)? Is the DPO implementation just the standard version? It is fine to put these in the appendix if space is a constraint.
8. Could you compare against pairwise preference models trained explicitly for this setting (e.g., [1] but also used in multiple other works)?
9. In the “length reward” experiment, why was a threshold-based reward (120 tokens) chosen instead of a more standard continuous reward formulation? Why was 120 tokens picked as a threshold?
10. Can the authors formally define the training setting and data assumptions—specifically, the use of continuous scores between [0,10]—and discuss how this constrains generalizability?
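For reference on Questions 4 and 7: if the DPO baseline follows the standard formulation from the original DPO paper, its objective is

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right],$$

and an example of a strictly proper scoring rule [2] that would actually reward calibration is the Brier score, $\text{BS}(p, o) = (p - o)^2$, for a predicted probability $p$ that one response is better and an observed outcome $o \in \{0, 1\}$; its expectation is minimized only by the true probability, unlike a threshold-style confidence bonus.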
[1]: Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., ... & Piot, B. (2024). Nash learning from human feedback. In Forty-first International Conference on Machine Learning.
[2]: Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359-378.
Lightly AI-edited

---

JudgeLRM: Large Reasoning Models as a Judge
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This work addresses one of the problems of existing LLM-as-a-judge approaches: their poor performance on tasks requiring complex reasoning. The authors show a negative correlation between the performance gains from SFT and the proportion of reasoning-demanding samples in a given domain. To close this gap, they propose a family of models, JudgeLRM, trained with GRPO using a "judge-wise, outcome-driven" reward function designed to optimize structural correctness, relational accuracy, absolute score accuracy, and judgment confidence. Their results show that the 3B JudgeLRM model is more accurate than GPT-4 on the human-annotated PandaLM benchmark, and the 7B model outperforms DeepSeek-R1. Additional analysis shows that JudgeLRM gains most on the reasoning-demanding tasks where SFT models fail.
- Clear motivation: The authors show that SFT is insufficient for judging tasks that require some degree of reasoning.
- Technical contribution: The core contribution, the judge-wise, outcome-driven reward function, is novel and yields significant results.
- Strong results and thorough analysis: The authors show that JudgeLRM works better than SFT on reasoning-demanding tasks, provide ablation studies to justify the reward-function design, and give qualitative examples of improved reasoning.
- Contradictory statistics in the analysis: In the caption of Figure 3, the authors mention a negative linear trend, but both the plot and the fitted equation (y = 0.2x − 1.05) indicate a positive slope. In Section 4.3 the authors also state that "we observe a correlation coefficient of 0.20 between relative improvement and reasoning rate", which implies $R^2 = 0.04$, yet the Figure 3 caption reports $R^2 = 0.95$ (see the note after this list).
- "Reasoning" Labeling Methodology: In the two most important figures (Figure 1 and Figure 3) the authors use "Proportion Requiring Reasoning to Judge (%)" metric. The process of creating this metric is only revealed in the appendix. While authors did validate this method against human annotators it still creates a potential circularity. The paper essentially uses GPT-4's own definition of "reasoning" to motivate the need for a new model. The paper then uses this new model to claim it has surpassed the performance of GPT-4 itself. Given that this metric is so central to the paper's argument, the use of an LLM to generate it should be discussed transparently in the main paper, not just in the appendix.
- Overstated claims about surpassing GPT-4: The JudgeLM dataset used to train JudgeLRM was made entirely from GPT-4 answers, and the reward function was designed to match the final outcomes of GPT-4 labels. Taking that into consideration, claiming that JudgeLRM "surpasses GPT-4" may be too strong. A more precise framing is that this specialized 3B model was able to outperform its general-purpose teacher (GPT-4) on a different test (the human-annotated PandaLM benchmark). This is still a very impressive result, but the more careful description would be more accurate.
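For reference, in a one-predictor linear regression the coefficient of determination equals the squared Pearson correlation between the two variables, so the two reported numbers cannot describe the same fit:

$$R^2 = r^2 = 0.20^2 = 0.04 \neq 0.95.$$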
I don't have specific questions, but I would be more than happy to see the issues raised in the weaknesses section addressed.
Fully human-written

---

JudgeLRM: Large Reasoning Models as a Judge
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper establishes a relationship between reasoning ability (enhanced through reinforcement learning) and judge quality. The authors first find a negative correlation between SFT performance and judge quality on reasoning-heavy tasks. They then propose RL-based rewards and train the judge model using RLVR. Experimental results show that JudgeLRMs yield significant improvements.
1. The paper introduces a new dimension of judge quality, i.e., its relationship with reasoning ability.
2. The paper adapts RLVR to pairwise response comparisons and introduces three types of content rewards.
3. The paper conducts experiments using models of various sizes and from different families to validate the effectiveness of the RL method.
1. The authors regard SFT as the opposite of reasoning, which is not convincing and makes the overall claim somewhat confusing.
- Q1: SFT models distilled from GPT-4 should also be capable of generating CoT. As shown in Figure 18, the response lengths of JudgeLRM-3B and JudgeLRM-7B are not particularly long. What, then, is the major difference between the responses produced by the SFT and RL models?
- Q2: What if trajectories from LRMs such as DeepSeek-R1 or the Qwen3 series are used to fine-tune the model via SFT?
2. The ablation study is not comprehensive, so the necessity of the proposed rewards cannot be fully verified. The authors introduce three types of content rewards, but the ablation results only report *w/o r_absolution + r_confidence*.
- Q3: Could you provide more ablation studies to demonstrate that all three rewards are indispensable? I am interested in whether the policy model can learn such complex relationships from a sparse scalar reward.
3. The writing quality is relatively low. In Table 2, there is a nonsensical phrase "yaobuyao," and the table also mixes up *Instruct* and *Ins*. Lines 315–323 lack indices (4) and (5). Additionally, "Qwen3 Base" in Table 3 does not clarify whether it refers to the Qwen3-Base series or a model based on Qwen3 (the reasoning version).
Q4: What are the results of Qwen3-4B and Qwen3-8B without any training? Since they are reasoning models, are they better than their non-reasoning counterparts?
Q5: Is JudgeLRM-8B based on Qwen3-8B-Base or Qwen3-8B? If it is based on Qwen3-8B-Base, why not use Qwen3-8B, given that the focus is on LRMs? If it is based on Qwen3-8B, could you explain why it performs worse than JudgeLRM-7B on PandaLM, considering that Qwen3-8B has been shown to outperform Qwen2.5 in reasoning ability?
Lightly AI-edited |