Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:

The work proposes a simple yet effective method to address logical inconsistencies in RLAIF. The authors introduce the Conflict Detection Rate (CDR) metric to capture the proportion of logically inconsistent pairwise judge results among candidate responses. They further propose Deconflicted Graph Rewards, which use graph algorithms to prune inconsistent judgments and provide a more reliable scoring mechanism.
Experiments show performance improvements over pointwise, listwise, and other baseline approaches across both the GRPO and GSPO optimizers, as well as across different models and prompt variations. The work also yields several interesting findings, such as the observation that more accurate judge prompts tend to elicit more logically inconsistent judgments.
The study lacks an in-depth analysis of learning dynamics, such as how the purifier affects the number of rollouts in a group and the number of responses maintained during GRPO, and how the group mean, standard deviation, and resulting advantage signals evolve over training. The paper also does not quantify the additional cost introduced by the purification process, which may become significant when the number of rollouts is large.
Overall, the results appear consistently better, although it would be more reassuring to rule out the influence of other confounding factors.
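For concreteness, here is a minimal sketch of how such a conflict rate could be computed over a rollout group. The triad-based definition and the `judge(a, b)` callable are my own illustrative assumptions, not necessarily the paper's exact formulation.

```python
from itertools import combinations

def conflict_detection_rate(responses, judge):
    """Fraction of response triples whose pairwise judgments form a cycle
    (A > B, B > C, C > A). `judge(a, b)` is assumed to return True iff the
    judge prefers `a` over `b`; ties are ignored for simplicity."""
    n = len(responses)
    # Query the judge once per unordered pair and cache the outcome.
    beats = {(i, j): judge(responses[i], responses[j])
             for i, j in combinations(range(n), 2)}

    def wins(i, j):
        return beats[(i, j)] if i < j else not beats[(j, i)]

    triads = list(combinations(range(n), 3))
    # A triad is transitive iff some response beats both of the others;
    # otherwise its three judgments form a preference cycle.
    cyclic = sum(
        not any(wins(x, y) and wins(x, z)
                for x, y, z in ((i, j, k), (j, i, k), (k, i, j)))
        for i, j, k in triads
    )
    return cyclic / len(triads) if triads else 0.0
```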
Strengths:

1. The paper tackles a highly relevant and important problem: logical inconsistencies in pairwise judge signals for RLAIF. This issue becomes especially critical in subjective tasks, where the reliability of the reward signal directly affects model performance.
2. The proposed method is simple yet effective, leveraging a graph-based approach to prune inconsistent responses. The results demonstrate consistent improvements over pointwise, listwise, and other baseline approaches in RL training.
3. The ablation findings are interesting and thought-provoking, such as the observation that more accurate prompts tend to exhibit greater logical inconsistency. The paper is also well organized, with a clear presentation and a coherent narrative.
Weaknesses:

1. There is no comparison of the computational cost associated with the proposed approach. In particular, the purification step is executed on the fly during rollout generation, which likely increases the payload sent to the judge server and introduces additional latency. Quantifying this overhead would provide a clearer picture of the practical trade-offs.
2. The evaluation protocol may be biased. It appears that the authors select the best checkpoint based on benchmark performance rather than using a separate validation set or validation metrics for checkpoint selection. This risks cherry-picking peak results and may overestimate the true performance.
3. The study lacks an in-depth analysis of learning dynamics. For example, the paper does not report how many rollouts are removed by purification throughout training, nor how this evolves as later checkpoints tend to generate responses of increasingly similar quality. Understanding these dynamics would help demonstrate that the proposed method is the primary driver of performance improvement, which is especially important for subjective tasks.
Questions:

• Have you analyzed how many rollouts or responses remain after the purifier step? This reduction could affect the effective sample size per group and therefore alter the rollout statistics, potentially influencing the learning dynamics.
• Have you measured the Conflict Detection Rate (CDR) for the pointwise vs. listwise approaches within the RL rollout groups, to verify whether a higher CDR correlates with smaller improvements?
• The main results are reported using prompt p1, while the ablations use prompt p5. What is the rationale behind this difference in prompt selection?
• As training progresses, one would expect the model to generate rollouts of increasingly similar quality, which could lead to more rollouts being removed. Did you track how the number of retained rollouts and their statistical properties evolve throughout training?
Lightly AI-edited

Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
Soundness: 3: good
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:

This paper addresses the problem of logical inconsistencies, specifically preference cycles (A>B, B>C, C>A), in AI-generated feedback for Reinforcement Learning from AI Feedback (RLAIF). The authors propose a two-part framework: a Conflict Detection Rate (CDR) metric to quantify these inconsistencies, and Deconflicted Graph Rewards (DGR), a graph-theoretic method that transforms raw, conflicted pairwise judgments into a logically consistent reward signal by removing a minimal set of edges to break cycles. Experiments on benchmarks like Arena-Hard and MT-Bench show that DGR, when integrated with policy optimizers like GRPO and GSPO, outperforms baseline methods (Pointwise, Listwise, Pairwise, ELO) in training stability and final model performance.
Strengths:

1. The paper correctly identifies a significant and often overlooked issue in RLAIF.
2. The paper is well written and easy to understand.
Weaknesses:

**1. Lack of theoretical justification for the core methodology.**
The paper's central contribution—resolving preference cycles by deleting edges to create a Directed Acyclic Graph (DAG)—lacks sufficient theoretical grounding. The authors employ a minimum feedback arc set (FAS) approach but fail to justify why this particular graph transformation preserves the underlying "true" preference ordering. Crucially, there is no discussion of:
- Edge deletion criteria: The paper does not explain how the algorithm chooses which specific edges to remove, nor does it provide theoretical or empirical evidence that the removed edges correspond to "noisy" or incorrect judgments rather than legitimate preference expressions (see the brute-force sketch after this list, which illustrates that even a minimum deletion set need not be unique).
- Preference preservation: The authors assume that the deconflicted DAG accurately reflects the original preference system, but no analysis demonstrates that the transformation minimizes distortion of the judge's intent. Alternative conflict-resolution strategies (e.g., edge reversal or probabilistic approaches) are not sufficiently explored or theoretically compared.
- Transitivity assumptions: The method implicitly assumes that acyclic preferences are inherently superior, but real-world human preferences can exhibit legitimate non-transitivities. The work does not address when cycles might reflect nuanced judgment rather than error.
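To make the edge-deletion concern concrete, the brute-force sketch below (my own illustration, not the authors' algorithm) enumerates every minimum feedback arc set of a small preference graph. Even for a single 3-cycle the minimum set is not unique, so the method needs an explicit criterion for which judgment to discard, and the paper should state that criterion and justify it.

```python
from itertools import permutations

def minimum_feedback_arc_sets(n, beats):
    """Enumerate all minimum feedback arc sets of a small preference graph.

    `beats` is a set of directed edges (i, j) meaning "i is judged better
    than j". Every minimum FAS is the set of backward edges of some linear
    order, so checking all n! orders finds them all; this is only viable
    for the small candidate groups used in RLAIF-style training.
    """
    best_size, best_sets = None, []
    for order in permutations(range(n)):
        pos = {node: idx for idx, node in enumerate(order)}
        removed = frozenset((i, j) for (i, j) in beats if pos[i] > pos[j])
        if best_size is None or len(removed) < best_size:
            best_size, best_sets = len(removed), [removed]
        elif len(removed) == best_size and removed not in best_sets:
            best_sets.append(removed)
    return best_size, best_sets

# A single cycle A>B, B>C, C>A: deleting any one of the three judgments is
# optimal, so the "deconflicted" graph depends on an arbitrary tie-break.
size, sets = minimum_feedback_arc_sets(3, {(0, 1), (1, 2), (2, 0)})
print(size, [sorted(s) for s in sets])  # 1 [[(2, 0)], [(0, 1)], [(1, 2)]]
```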
**2. Incomplete and insufficient baseline comparisons.**
Missing learned reward model baseline: The authors compare only against rule-based scoring methods (Pointwise, Listwise, Pairwise, ELO). The most relevant baseline, a learned reward model (e.g., a Bradley-Terry model trained on the pairwise data), is absent. Such models naturally aggregate noisy preferences and can handle inconsistencies through probabilistic modeling, making them a fundamental benchmark for any preference-purification method.
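For reference, the kind of learned baseline I have in mind can be sketched in a few lines; the toy maximum-likelihood Bradley-Terry fit below is my own illustration (hyperparameters and the cyclic example included), not a prescription for how such a baseline should be implemented at scale.

```python
import numpy as np

def fit_bradley_terry(n, comparisons, lr=0.1, steps=2000, l2=1e-3):
    """Fit Bradley-Terry scores to (winner, loser) pairs by gradient descent
    on the negative log-likelihood. Cyclic preferences are absorbed
    probabilistically instead of being deleted."""
    theta = np.zeros(n)
    for _ in range(steps):
        grad = l2 * theta                      # small L2 term pins the scale
        for i, j in comparisons:               # i was preferred to j
            p_ij = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))
            grad[i] -= 1.0 - p_ij              # gradient of -log p_ij
            grad[j] += 1.0 - p_ij
        theta -= lr * grad
    return theta - theta.mean()                # centre for readability

# A cyclic triad A>B, B>C, C>A: the fit treats the three responses as
# essentially equally good rather than imposing an arbitrary strict order.
print(fit_bradley_terry(3, [(0, 1), (1, 2), (2, 0)]))
```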
**3. The paper neglects recent work that addresses cyclic preferences.**
The paper does not engage with closely related work; for example, the following papers all discuss the issue of non-transitivity:
[1] A Minimaximalist Approach to Reinforcement Learning from Human Feedback
[2] Self-Play Preference Optimization for Language Model Alignment
[3] Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF
Questions:

Same as the weaknesses above.
Fully AI-generated

Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
Soundness: 1: poor
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Summary:

The paper introduces CDR, a simple measure of preference non-transitivity/circularity in a pairwise judge model, and DGR, a method that de-cycles a set of pairwise preferences and uses them for RL training by converting them into pointwise rewards. The paper uses the former to compare judge models and judge prompts, and uses the latter to train models on pairwise preferences labeled by a judge model, obtaining higher scores than several baseline methods on most of the tested domains.
Strengths:

**Clarity**: The paper is well-written and easy to follow. Method and experiment details are clearly and concisely specified; the motivation is also clear.
**Significance**: Non-transitivity is an important problem, especially in the case of pluralistic preferences in a population, although the authors did not study this setup specifically.
Key weaknesses:
1. **Strong baselines missing (DPO and RM-based preference optimization)**: There is reason to believe that classical RMs (trained with contrastive loss on pairwise preferences), as well as DPO, are a good solution to preference non-transitivity/circularity, but they are not compared against in the experiments.
- RMs enforce a linear order over responses (as they assign cardinal rewards to them), and their training minimizes disagreement (measured with a logistic function) with the possibly inconsistent preferences. In other words, a trained RM de-conflicts the preferences and puts them into a linear order.
- The same goes for DPO, which optimizes an implicit RM.
- Both RMs and DPO nominally rely on the B-T model, but as I have outlined above, in the regime with inconsistent preferences they are no less theoretically sound than DGR. DGR also enforces a linear order against inconsistent preferences; the only difference is that DGR uses an ordinal punishment for disagreement (every feedback arc in the graph counts the same), while RMs/DPO use a cardinal punishment (a feedback arc is punished more if the reward difference between its endpoints is larger, with the exact amount defined by a logistic function). A toy numerical illustration of this distinction is sketched after these key weaknesses.
- There is thus no theoretical reason suggesting the DGR-based method will work better than DPO or classical RMs, and, at the same time, there is no experiment empirically comparing them either.
2. **Strong baselines missing (Nash equilibrium approaches)**: Starting in 2023, solutions have been proposed to handle non-transitivity and other problems with preference optimization methods based on the B-T model [1]. There have, since then, been many follow-up works [2,3,4]. They are neither mentioned in the paper nor compared against in the experiments.
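To make the ordinal-vs-cardinal distinction in point 1 concrete, here is a small numerical illustration (my own toy example under the standard Bradley-Terry convention, not code from the paper). Two score assignments that induce the same order violate the same single judgment, so a feedback-arc count cannot distinguish them, while the logistic loss penalizes the larger-margin violation more heavily.

```python
import numpy as np

def feedback_arc_count(scores, prefs):
    """Ordinal penalty: number of judged preferences (i preferred to j)
    that the scores contradict, i.e. the backward arcs of the induced order."""
    return int(sum(scores[i] <= scores[j] for i, j in prefs))

def bt_negative_log_likelihood(scores, prefs):
    """Cardinal penalty: Bradley-Terry / DPO-style logistic loss, which
    punishes a violated preference more when the score gap is larger."""
    return float(sum(np.log1p(np.exp(-(scores[i] - scores[j]))) for i, j in prefs))

# Cyclic judgments over three responses: A>B, B>C, C>A.
prefs = [(0, 1), (1, 2), (2, 0)]

# Both score vectors induce the order A > B > C and violate only C>A,
# but the second does so with a much larger margin.
for scores in (np.array([2.0, 1.0, 0.0]),
               np.array([5.0, 1.0, 0.0])):
    print(feedback_arc_count(scores, prefs),
          round(bt_negative_log_likelihood(scores, prefs), 3))
# -> 1 2.753
# -> 1 5.338
```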
Other weaknesses:
1. From my experience, many of the WildChat questions (e.g., a user asking a text-only model to generate an image) don't have much to do with model capability. This is consistent with the observation that the trained models see limited performance improvement on the capability-focused evaluations in Table 1.
2. Statistical significance is not reported, which is especially important given the small effect sizes of training.
3. Re "peak score achieved across all training step": the rigorous approach would be to use a validation set to select a best step for each training approach, then compare the single best step of each training approach on a test set. Without this split, we introduces maximization bias.
[1] Nash Learning from Human Feedback
[2] Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
[3] Online Iterative Reinforcement Learning from Human Feedback with General Preference Model
[4] Improving LLM General Preference Alignment via Optimistic Online Mirror Descent
Questions:

1. I'd appreciate results addressing the weaknesses outlined above.
2. On Figure 1(a): How much does the ranking change across prompts? To what extent can the ranking be an artifact of the specific prompts you choose? Also, adding confidence intervals would be helpful.
3. On Table 1: What is the sample size for evaluation? Are the differences statistically significant? I'd be keen to see confidence intervals and/or pairwise t-tests, including baseline vs DGR, and other methods vs DGR.
4. Re "two independent experimental runs": Do the performance rankings differ significantly between these two runs? If yes, it would make sense to add more runs until the aggregate stabilizes.
Minor: It seems that you use "judge model" and "reward model" interchangeably in the paper, e.g., in the section heading of 4.4.2. I suggest referring to them simply as judge models, since "reward model" typically refers to models that give pointwise rewards, especially those trained to give logit-based rewards.
Fully human-written

Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Summary:

The paper addresses inconsistency in LLM-as-a-judge feedback for preference-based RL (cycles such as $A \succ B \succ C \succ A$). It introduces **Conflict Detection Rate (CDR)** to quantify cyclic judgments and **Deconflicted Graph Rewards (DGR)**: construct a pairwise preference graph over a rollout group of size $G$, remove a (near-)minimum feedback arc set to obtain a DAG, and compute net-win scores that serve as advantages for GRPO/GSPO. Main claims: DGR yields more stable training and better results than pointwise/listwise/pairwise/ELO scoring. Evidence: Fig. 1 shows non-trivial conflict rates; Table 1 reports gains on Arena-Hard, MT-Bench, and WritingBench; Tables 2–3 and Fig. 2 present ablations across cycle resolution, prompts, judges, and $n$.
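To summarize my reading of the DGR pipeline, here is a group-level sketch; the brute-force FAS search, the net-win scoring, and the GRPO-style standardization reflect my understanding of the paper's description, while the tie-breaking, the epsilon, and the example group are my own assumptions for illustration.

```python
from itertools import permutations
import numpy as np

def dgr_advantages(prefs, group_size):
    """Sketch of Deconflicted Graph Rewards for one rollout group.

    `prefs` is a set of directed judgments (i, j) meaning response i beat j.
    1) Drop a minimum feedback arc set (exact brute force here; ties broken
       by enumeration order), leaving an acyclic judgment graph.
    2) Score each response by its net wins (out-degree minus in-degree).
    3) Standardize the scores into GRPO-style group advantages.
    """
    best_kept, best_removed = None, None
    for order in permutations(range(group_size)):
        pos = {node: idx for idx, node in enumerate(order)}
        removed = {(i, j) for (i, j) in prefs if pos[i] > pos[j]}
        if best_removed is None or len(removed) < best_removed:
            best_removed, best_kept = len(removed), prefs - removed

    net_wins = np.zeros(group_size)
    for i, j in best_kept:
        net_wins[i] += 1.0
        net_wins[j] -= 1.0
    return (net_wins - net_wins.mean()) / (net_wins.std() + 1e-8)

# A 4-response group whose judgments contain the cycle 0>1, 1>2, 2>0.
prefs = {(0, 1), (1, 2), (2, 0), (0, 3), (1, 3), (2, 3)}
print(dgr_advantages(prefs, 4))
```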
Strengths:

* **Clear problem diagnosis:** Fig. 1 quantifies non-trivial preference cycles; motivates going beyond accuracy.
* **Modular method:** DGR is optimizer-agnostic and easy to integrate (Alg. 1).
* **Ablations:** Table 2 shows optimal cycle resolution > random/reversal; Table 3/Fig. 2 suggest robustness across prompts, judges, and $n$.
* **Practical relevance:** Improves GRPO/GSPO on multiple benchmarks (Table 1).
Weaknesses:

* **Missing principled baselines:** No Rank Centrality/BTL/Hodge/Kemeny comparisons [1–5]; this is central to scientific credibility.
* **Robustness under-reported:** Peak-of-training from two runs; no mean±SD/CI or budget-normalized curves.
* **Judge bias risk:** Sole reliance on GPT-4.1/Claude; no human study or multi-judge aggregation [7,8].
* **Scalability unclear:** FAS step’s overhead vs. $G$/batch size not quantified; solver/heuristic choices not profiled [5,6].
* **Potential leakage:** No explicit de-dup between training sources and evaluation sets; needs clarification.
Questions:

1. Could you include Rank Centrality, BTL/Plackett–Luce, HodgeRank, and Kemeny/Minimum-FAS baselines under the same protocol, or explain why they are infeasible [1–5]?
2. Could you report means±SD over at least five seeds (or 95% CIs) and provide full training curves, rather than peak-only metrics?
3. What is the wall-clock overhead and throughput (tokens/s) of DGR as the group size $G$ varies (e.g., 4, 6, 8, 10, 12, 16), and how do exact vs. heuristic FAS methods compare [5,6]?
4. Were the baselines tuned fairly under identical budgets, and can you provide the exact hyperparameters and sweeps used?
5. Did you perform de-duplication between training corpora and Arena-Hard/MT-Bench/WritingBench, and if so, what were the procedures and counts?
6. Can you add a small human evaluation or a multi-judge ensemble to assess and mitigate judge bias [7,8]?
### References
[1] Negahban, Sahand, Sewoong Oh, and Devavrat Shah. "Iterative ranking from pair-wise comparisons." Advances in neural information processing systems 25 (2012).
[2] Hunter, David R. "MM algorithms for generalized Bradley-Terry models." The annals of statistics 32.1 (2004): 384-406.
[3] Jiang, Xiaoye, et al. "Statistical ranking and combinatorial Hodge theory." Mathematical Programming 127.1 (2011): 203-244.
[4] Ailon, Nir, Moses Charikar, and Alantha Newman. "Aggregating inconsistent information: ranking and clustering." Journal of the ACM (JACM) 55.5 (2008): 1-27.
[5] Kenyon-Mathieu, Claire, and Warren Schudy. "How to rank with few errors." Proceedings of the thirty-ninth annual ACM Symposium on Theory of Computing. 2007.
[6] Eades, Peter, Xuemin Lin, and William F. Smyth. "A fast and effective heuristic for the feedback arc set problem." Information processing letters 47.6 (1993): 319-323.
[7] Zheng, Lianmin, et al. "Judging llm-as-a-judge with mt-bench and chatbot arena." Advances in neural information processing systems 36 (2023): 46595-46623.
[8] Chiang, Wei-Lin, et al. "Chatbot arena: An open platform for evaluating llms by human preference." Forty-first International Conference on Machine Learning. 2024.
Fully AI-generated |