ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 1 (20%) | 4.00 | 4.00 | 2143 |
| Moderately AI-edited | 1 (20%) | 4.00 | 3.00 | 2177 |
| Lightly AI-edited | 1 (20%) | 8.00 | 3.00 | 1665 |
| Fully human-written | 2 (40%) | 2.00 | 3.00 | 1728 |
| Total | 5 (100%) | 4.00 | 3.20 | 1888 |
Review 1
Title: Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper introduces MACA (multi-agent consensus alignment), a reinforcement learning framework that trains LLMs to be more self-consistent. MACA uses debate-derived consensus trajectories and trains LLMs via preference learning (DPO/KTO) or other objectives (GRPO, SFT). Experiments show significant performance gains on math reasoning datasets and strong generalization to unseen reasoning domains, validating that MACA can improve self-alignment and elevate reasoning capabilities.

Strengths:
- The work is well-motivated, addressing LLMs' inconsistency when sampled multiple times. The proposed method is simple yet effective, using majority voting from multi-agent debates as a weak supervision signal to construct preference pairs for training.
- Thorough experiments and ablations yield valuable insights, such as "Multi-agent debate produces more informative training signals than single-round majority voting" and "Addressing consensus alignment through preference learning improves over GRPO and SFT".

Weaknesses:
No major weaknesses. A few questions are listed below.

Questions:
- Regarding the scaling of debate settings, will **more agents**, **more rounds**, or **heterogeneous agents** yield better consensus? Would scaling these dimensions lead to higher-quality pairwise training data?
- Cross-model transfer: If debate trajectories generated by a more capable LLM (e.g., Llama-8B) are used to train a smaller model (e.g., Llama-3B), how would this impact the smaller model’s **self-consistency** and accuracy?
- Presentation: In Figure 2, which post-training methods (DPO, KTO, or SFT) are being illustrated? In Table 4, what does “Debate” refer to in the single-agent setting?

EditLens Prediction: Lightly AI-edited
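To make the preference-construction step described in this review concrete, the sketch below shows one way debate trajectories could be split into consensus and dissent by majority vote and paired for a DPO-style objective. The function names and data layout (`extract_answer`, `debate_trajectories`) are illustrative assumptions, not the paper's actual implementation.

```python
from collections import Counter
from itertools import product

def extract_answer(trajectory: str) -> str:
    """Illustrative placeholder: pull the final answer out of a reasoning trace.
    The paper's actual answer-parsing logic is not specified here."""
    return trajectory.strip().splitlines()[-1]

def build_preference_pairs(debate_trajectories: list[str]):
    """Turn one debate round's trajectories into (preferred, dispreferred) pairs.

    Trajectories whose final answer matches the majority vote are treated as
    consensus ('chosen'); the rest are dissent ('rejected'). Ties are skipped.
    """
    if not debate_trajectories:
        return []
    answers = [extract_answer(t) for t in debate_trajectories]
    counts = Counter(answers)
    (top_answer, top_count), *rest = counts.most_common()
    if rest and rest[0][1] == top_count:  # no clear majority -> no training signal
        return []
    chosen = [t for t, a in zip(debate_trajectories, answers) if a == top_answer]
    rejected = [t for t, a in zip(debate_trajectories, answers) if a != top_answer]
    return list(product(chosen, rejected))  # pairs consumed by a DPO-style loss
```

A pairwise construction like this is what DPO consumes; an unpaired objective such as KTO could instead take the chosen and rejected sets directly as positives and negatives.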
Review 2
Title: Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes MACA (Multi-Agent Consensus Alignment), a post-training framework that “internalizes” self-consistency by using multi-agent debate to generate majority (consensus) and minority (dissent) trajectories, then optimizing the model with MV-DPO/MV-KTO/MV-GRPO or MV-SFT on those signals. The paper formalizes self-consistency via the majority probability over sampled reasoning paths and measures agreement in multi-agent settings. Experiments on small LLMs (2B–8B) across math-heavy benchmarks report sizeable gains in sampling consistency and accuracy.

Strengths:
1. The paper explicitly defines single-agent sampling consistency and multi-agent agreement, making the target capability measurable, and uses them consistently in the analysis.
2. MACA reuses standard post-training objectives (DPO/KTO/GRPO/SFT) with debate-derived preference signals, and shows that MV-DPO/KTO tend to outperform SFT and scalar-reward baselines across several model/dataset pairs.
3. The paper compares debate-majority supervision to ground-truth labels and finds similar performance, and also tests training with and without peer context during debate; both are helpful for understanding what drives the gains.

Weaknesses:
1. Since self-consistency prompting (Wang et al., 2022) is a main comparator, I expected a training-time baseline that (i) uses self-consistency/majority vote to curate a dataset (e.g., majority-consistent rollouts), then (ii) finetunes/SFTs on that curated set, without multi-agent debate. This would test whether debate-derived signals truly add value over classical self-consistency data augmentation.
2. The training/inference cost versus gains isn’t quantified (GPU hours, wall-clock, debate throughput).
3. Most training is math-centric; generalization to GPQA/CSQA is interesting but still limited in breadth. Also, the “formalization of self-consistency” largely re-casts majority-probability/consensus ideas already known from the self-consistency and majority-vote literature; the novelty is primarily in using debate-derived preference pairs, which would be stronger if contrasted directly against the SC-curation+FT baseline noted above.

Questions:
As in weaknesses.

EditLens Prediction: Heavily AI-edited
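The “majority probability” formalization this review refers to can be sketched roughly as follows; the notation below is illustrative and may not match the paper's exact definitions.

```latex
% Self-consistency as majority probability (illustrative notation):
% sample reasoning trajectories y from the policy, extract final answers
% ans(y), and measure the probability mass on the modal answer.
\[
  \hat{a}(x) \;=\; \arg\max_{a}\; \Pr_{y \sim \pi_\theta(\cdot \mid x)}\!\big[\mathrm{ans}(y) = a\big],
  \qquad
  \mathrm{SC}(x) \;=\; \Pr_{y \sim \pi_\theta(\cdot \mid x)}\!\big[\mathrm{ans}(y) = \hat{a}(x)\big].
\]
```

Under this reading, a perfectly self-consistent model has SC(x) = 1, and majority voting at inference time approximates the modal answer with finitely many samples.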
Review 3
Title: Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper addresses language models’ reasoning inconsistencies stemming from probabilistic sampling by introducing MACA, a reinforcement learning framework. MACA utilizes multi-agent debate to self-generate preference data, designating trajectories aligned with the majority consensus as 'preferred' and dissenting trajectories as 'not preferred'. The model is then optimized on these self-supervised signals, using methods like DPO, to favor consensus-forming reasoning paths. The method yields improvements in self-consistency and single-agent reasoning, and demonstrates generalization to unseen domains.

Strengths:
1. This paper tries to address an important problem: the reasoning inconsistency of LMs.
2. The MACA framework is conceptually simple.
3. The paper provides a very detailed appendix.

Weaknesses:
1. The core mechanism fails in the face of “correct but minority” reasoning and actively rewards incorrect consensus.
2. The evidence for the central novelty claim is weak. As shown in Table 6, gains are marginal in 5 of 8 cases when MV(C) is compared fairly with DMV(C).
3. The method essentially trains the model to agree more strongly with what it already agrees on, creating a self-reinforcing echo chamber that may amplify the model's inherent biases.

Questions:
See weaknesses.

EditLens Prediction: Fully human-written
Review 4
Title: Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
In this work, the authors argue that self-consistency should be a desired property of well-aligned reasoning models, and they address this by introducing MACA, a reinforcement learning framework that post-trains models to prefer consensus-supporting trajectories using majority/minority outcomes from multi-agent debate.

Strengths:
The proposed MACA framework relaxes the need for ground-truth labels by using majority outcomes as the learning signal.

Weaknesses:
1. Learning from multi-agent debate trajectories is explored in prior work [1], as is using the majority answer as the learning signal [2, 3]. In [1], the authors also show different levels of learning from consensus-supporting and dissenting trajectories. This limits the novelty of this work.
2. Post-training methods like GRPO naturally sharpen the distribution and improve pass@1 performance substantially. This is similar to this work's motivation of internalizing consistency and has been proven to be very effective. The authors should compare with ScPO [2] and TTRL [3] to show how much of the improvement comes from multi-agent debate, or whether the proposed method even outperforms these baselines.
3. The training is conducted on the base model instead of the instruction-tuned version. Comparing with instruction-tuned models is necessary, and it would be more convincing if the model were trained from instruction-tuned checkpoints.
4. Limiting responses to 256 tokens does not make sense to me. This budget is not sufficient for tasks like MATH and GPQA. Although the budget increases to 512 in the appendix, it is still too small.
5. The finding that "unsupervised majority-vote signal is comparable to ground-truth", shown in the ablation study, is already established in [2, 3], again limiting the novelty of this work.
6. The improvements are overclaimed, since the trained model's performance is compared only against the base model. The improvements should be compared with external baselines such as [2] and [3].
7. On the note of "how much of the improvement comes from multi-agent debate", it is also important to compare with [1].

[1] https://arxiv.org/abs/2402.01620
[2] https://arxiv.org/abs/2411.04109
[3] https://arxiv.org/abs/2504.16084

Questions:
Please see weaknesses.

EditLens Prediction: Fully human-written
Review 5
Title: Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper introduces MACA, a self-supervised reinforcement learning framework for post-training LLMs to enhance self-consistency in reasoning. The MACA approach formalizes self-consistency as an intrinsic property of LLMs and utilizes multi-agent debate, where multiple clones of an LLM generate, critique, and revise reasoning trajectories. The framework derives training signals from these deliberative exchanges and optimizes LLMs using several objectives to internalize consensus rather than simple aggregation. Substantial empirical improvements are demonstrated over baseline models and strong post-training baselines on mathematical, science, and commonsense reasoning benchmarks, with gains in accuracy and self-consistency, generalization to unseen tasks, and inference efficiency.

Strengths:
1. The MACA framework is not limited to GRPO; it can also be integrated with multi-agent RL and preference-based alignment objectives, suggesting broader applicability to LLM self-alignment settings.
2. The improvements in self-consistency observed on mathematical reasoning tasks transfer to science and commonsense benchmarks, indicating that the method generalizes beyond a single domain.

Weaknesses:
1. The core learning loop is closely aligned with recent test-time reinforcement learning approaches [1], in which multiple sampled reasoning trajectories are compared and the consensus outcome is used as a self-supervised preference signal to update the model. The main distinction in this paper is that the consensus signal is generated via multi-agent debate rather than independent sampling. However, this conceptual similarity should be made more explicit, and a direct comparison to test-time RL methods is necessary to clarify what is genuinely novel in the proposed contribution.
2. While the improvements over SFT baselines and previous post-training paradigms are clear, the paper would benefit from a direct, quantitative comparison to more diverse non-MACA multi-agent aggregation schemes. It is unclear how much of the gain stems from the debate protocol itself versus simply using more samples at training time.

[1] Zuo et al., "TTRL: Test-Time Reinforcement Learning." 2025.

Questions:
See weaknesses.

EditLens Prediction: Moderately AI-edited
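For readers unfamiliar with the generate-critique-revise protocol this summary describes, a minimal sketch of one possible debate loop is below. The prompting format and the `generate` helper are assumptions for illustration, not the paper's actual implementation.

```python
def debate_round(generate, question: str, n_agents: int = 3, n_rounds: int = 2):
    """Illustrative multi-agent debate loop (not the paper's exact protocol).

    `generate(prompt) -> str` stands in for sampling one reasoning trajectory
    from the model; each "agent" is the same model prompted independently.
    """
    # Round 0: each agent answers the question independently.
    trajectories = [generate(question) for _ in range(n_agents)]

    # Later rounds: each agent sees its peers' previous answers and may revise.
    for _ in range(1, n_rounds):
        revised = []
        for i in range(n_agents):
            peers = [t for j, t in enumerate(trajectories) if j != i]
            prompt = (
                f"{question}\n\nOther agents answered:\n"
                + "\n---\n".join(peers)
                + "\n\nReconsider and give your final reasoning and answer."
            )
            revised.append(generate(prompt))
        trajectories = revised

    # Final-round trajectories would then feed the consensus/dissent split.
    return trajectories
```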