ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction     Count      Avg Rating   Avg Confidence   Avg Length (chars)
Fully AI-generated      2 (50%)    4.00         3.50             2854
Heavily AI-edited       0 (0%)     N/A          N/A              N/A
Moderately AI-edited    0 (0%)     N/A          N/A              N/A
Lightly AI-edited       1 (25%)    4.00         3.00             2216
Fully human-written     1 (25%)    6.00         4.00             3415
Total                   4 (100%)   4.50         3.50             2835

Review 1
Title: AdaBoN: Adaptive Best-of-$N$ Alignment
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes a method for adaptively allocating compute in parallel, scaled inference-time alignment. A popular way to scale inference-time compute is the Best-of-N algorithm, where for each prompt $N$ responses are generated and scored and the best is returned. Typically the same $N$ is used for all prompts in a given batch, but this can be wasteful in settings where there is significant heterogeneity in the difficulty of answering a question. This paper proposes a way to adaptively select the number of responses used for each prompt, allocating a higher $N$ to prompts whose responses potentially have higher reward. The proposed method first samples some number of responses for each prompt and uses these to form an estimate of the reward distribution via a kernel density estimator with a Gaussian kernel. It then estimates the expected maximum reward under this density estimate and greedily assigns the allocation. The authors conclude with an empirical study on several datasets and open-weight models demonstrating improvement relative to the naive uniform allocation strategy.

Strengths:
This paper is timely in the sense that scaling inference-time compute remains an important paradigm for improving LM performance, and the question of how best to use such compute is essential. The paper presents an interesting solution to the problem of how to allocate responses per prompt and demonstrates empirically that their solution nontrivially separates from the naive uniform baseline.

Weaknesses:
First, I think it would be helpful for the paper to address the fact that in many important problems where Best-of-N is used, the true reward we care about is binary; e.g., in math the ground-truth reward is 1 if the correct answer is returned and 0 otherwise. Similar phenomena occur in rule-based rewards for safety and alignment purposes. Including a discussion of this fact and how the paper relates to this setting seems important, especially as I suspect that the studied framework does not generalize to that setting.

Second, I think it would be useful to discuss the fact that learned rewards are often imperfect at the tails, and thus Best-of-N is susceptible to reward hacking. To the extent that this method generates responses with higher estimated reward that are thus highly atypical, it would be good to discuss this issue.

Third, I am somewhat confused by the evaluation metric that the authors consider empirically, namely the batch win rate. In most settings, I think people are more concerned with the per-prompt win rate marginalized over prompts in a batch, not the aggregate reward over the batch. This would manifest as $\frac{1}{K} \sum_{i=1}^{K} \mathbb{P}\left(\max_{j \in [A_i]} R_{i,j} \geq \max_{j \in [B]} R_{i,j}\right)$ (in the simpler formulation that treats ties as wins; see the sketch after this review). I am concerned that the BWR as defined in the paper is susceptible to a single prompt producing some very high rewards and thereby having outsized influence. Could the authors please show some subset of their empirical results with this more standard notion of win rate to better demonstrate empirical efficacy?

Fourth, a minor point: I think the authors are missing a reference to another paper that proposes an adaptive Best-of-N approach with theoretical savings; see, e.g., Appendix D of *Self-Improvement in Language Models: The Sharpening Mechanism*.

Questions:
See weaknesses.

EditLens Prediction: Fully human-written
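
For concreteness, the per-prompt win rate proposed in the third weakness above could be computed as in the following minimal sketch; the array names and the one-sample indicator estimate of the inner probability are illustrative assumptions, not anything specified in the paper.

```python
import numpy as np

def per_prompt_win_rate(adaptive_rewards, baseline_rewards):
    """Macro-averaged win rate: for each prompt i, check whether the best
    reward under the adaptive allocation A_i is at least the best reward
    under the uniform baseline's B samples (ties count as wins), then
    average the indicator over prompts.  With one realization per prompt,
    the indicator is a one-sample estimate of the probability in the
    review's formula."""
    wins = [
        float(np.max(a) >= np.max(b))
        for a, b in zip(adaptive_rewards, baseline_rewards)
    ]
    return float(np.mean(wins))

# Hypothetical usage: one reward array per prompt for each method.
# adaptive_rewards = [np.array([0.2, 0.9]), np.array([0.1, 0.4, 0.7])]
# baseline_rewards = [np.array([0.3, 0.5]), np.array([0.6, 0.6])]
# per_prompt_win_rate(adaptive_rewards, baseline_rewards)  # -> 1.0
```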

Review 2
Title: AdaBoN: Adaptive Best-of-$N$ Alignment
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes AdaBoN, a two-stage adaptive allocation strategy to improve the computational efficiency of Best-of-N (BoN) sampling. It addresses the key limitation of standard BoN, which inefficiently applies a fixed, uniform sampling budget to all prompts regardless of their alignment difficulty. AdaBoN's first stage uses a small exploration budget ($d$) to sample each prompt and model its reward distribution using Gaussian kernel density estimation (KDE). The second stage uses these distribution estimates to calculate the expected marginal gain for each prompt and then greedily allocates the remaining inference budget to the prompts most likely to benefit from additional sampling (a rough sketch of this procedure appears after this review). The method is model-agnostic and requires no auxiliary model training. Experiments on the AlpacaEval, HH-RLHF, and PKU-SafeRLHF datasets show that AdaBoN consistently outperforms the uniform allocation baseline and remains competitive against uniform methods using 20% larger budgets.

Strengths:
The paper's primary strength is its simple, practical, and well-motivated solution to the clear inefficiency of standard Best-of-N (BoN) sampling. The proposed AdaBoN method is an elegant two-stage algorithm that is model-agnostic, requires no auxiliary model training, minimizes latency through parallelizable calls, and needs minimal hyperparameter tuning. These claims are substantiated by empirical evaluation across 3 datasets and 12 LM-RM pairs. The results consistently show that AdaBoN outperforms the uniform allocation baseline and remains competitive even against a uniform baseline given a 20% larger inference budget, demonstrating clear computational savings.

Weaknesses:
1. **Lack of Comparison to Relevant Adaptive Methods**: There is no empirical comparison against other adaptive allocation methods; the paper's baseline is limited exclusively to "uniform allocation" (i.e., standard Best-of-N sampling). The authors themselves identify Damani et al. (2024) as "the most closely related work", as it addresses the exact same problem of inference budget allocation. The authors justify omitting this crucial comparison by citing the lack of a public implementation and the "computationally prohibitive" cost of reproducing the method. While these practical hurdles are understandable, this leaves the paper's core contribution unevaluated against any relevant competition. By only outperforming a simple, non-adaptive baseline, the paper demonstrates an improvement but fails to show that this specific approach is a state-of-the-art or even a competitive solution. The claims of efficiency and effectiveness would be substantially stronger if AdaBoN were benchmarked against at least one other adaptive method from the related work section.
2. **Limited Scope of Tasks and Reward Distributions**: The paper's empirical validation is confined to a single class of problems: open-ended alignment tasks (AlpacaEval, HH-RLHF, PKU-SafeRLHF) evaluated by real-valued reward models. This narrow scope makes it difficult to assess the generalizability of the proposed method. It is unclear how this method would perform on tasks with fundamentally different reward structures, such as math reasoning (e.g., GSM8K) or coding, where reward distributions are more likely to be sparse.

Questions:
1. Given that reproducing Damani et al. (2024) is "computationally prohibitive", could you provide a comparison against a simpler adaptive heuristic (e.g., a greedy allocator that uses the variance of the initial $d$ samples) to better situate your method's performance against something other than uniform allocation?
2. How robust is the Gaussian KDE estimator to the sparse or highly bimodal reward distributions found in tasks like math reasoning (e.g., GSM8K)? Does this estimation method risk poor allocation if the distribution assumption is violated?

EditLens Prediction: Fully AI-generated
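
To make the two-stage procedure described in the reviews concrete, here is a minimal sketch under stated assumptions: rewards are scalar, the exploration-stage rewards for each prompt are already available, and the marginal gain of extra samples is estimated by Monte Carlo resampling from a per-prompt Gaussian KDE (scipy's `gaussian_kde`). The function names and the gain estimator are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def expected_max_reward(rewards, n_extra, n_mc=1000):
    """Monte Carlo estimate of the expected best reward after drawing
    `n_extra` additional samples from a Gaussian KDE fit to the observed
    exploration-stage rewards (assumes >= 2 non-identical rewards)."""
    best_so_far = float(np.max(rewards))
    if n_extra == 0:
        return best_so_far
    kde = gaussian_kde(rewards)
    draws = kde.resample(size=n_extra * n_mc).reshape(n_mc, n_extra)
    return float(np.mean(np.maximum(best_so_far, draws.max(axis=1))))

def greedy_allocate(exploration_rewards, remaining_budget, n_mc=1000):
    """Assign the remaining budget one sample at a time to the prompt with
    the largest estimated marginal gain in expected best reward.  Returns
    the number of *additional* samples per prompt, beyond exploration.
    The KDE is refit on every call purely for clarity; a practical
    implementation would cache it per prompt."""
    alloc = [0] * len(exploration_rewards)
    for _ in range(remaining_budget):
        gains = [
            expected_max_reward(r, alloc[i] + 1, n_mc)
            - expected_max_reward(r, alloc[i], n_mc)
            for i, r in enumerate(exploration_rewards)
        ]
        alloc[int(np.argmax(gains))] += 1
    return alloc

# Hypothetical usage: d = 3 exploration rewards for each of two prompts,
# with 4 additional samples left to distribute.
# exploration_rewards = [np.array([0.10, 0.15, 0.20]),
#                        np.array([0.10, 0.50, 0.90])]
# greedy_allocate(exploration_rewards, remaining_budget=4)
```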

Review 3
Title: AdaBoN: Adaptive Best-of-$N$ Alignment
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper studies how to allocate a fixed Best-of-N (BoN) sampling budget across a batch of prompts using a reward model. It proposes AdaBoN, a two-stage, prompt-adaptive allocation scheme: a small exploration budget is first used to estimate per-prompt reward distributions via Gaussian KDE, then the remaining budget is greedily assigned to prompts with the largest estimated marginal gain in reward. The method is model-agnostic, requires no extra training, and operates with only two LM calls per batch. Experiments on AlpacaEval, HH-RLHF, and PKU-SafeRLHF with multiple LM-RM pairs show that AdaBoN consistently improves over uniform BoN with the same budget and can match or beat uniform BoN that uses a considerably larger budget.

Strengths:
1. The method is simple, test-time only, and easy to plug into existing BoN pipelines without retraining or heavy tuning.
2. Empirical evaluation spans several datasets and LM-RM pairs, with ablations over batch size, budget, and exploration fraction, supporting the main claims.
3. The proposed BWR and EST metrics are well aligned with practice and make the compute-quality trade-offs interpretable for BoN-style alignment.

Weaknesses:
1. Comparisons are restricted to uniform BoN; there is no empirical baseline from other adaptive allocation or difficulty-aware methods discussed in related work.
2. The paper does not quantify wall-clock overhead or throughput, leaving the practical cost of KDE fitting and Monte Carlo estimation somewhat unclear at larger scales.
3. The method's practicality at scale remains unclear, as its two-stage design and synchronous batching may struggle with heterogeneous prompts, and the paper offers limited evidence on how well it handles such real-world variability.

Questions:
1. Can the authors include simple adaptive baselines (e.g., variance- or entropy-based allocators) to better contextualize AdaBoN beyond uniform BoN? (A sketch of such a variance-based allocator follows this review.)

EditLens Prediction: Fully AI-generated
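
The simple baseline suggested in the question above could look something like the following sketch, which splits the remaining budget in proportion to the variance of each prompt's exploration-stage rewards. This is an illustration of the reviewer's suggestion, not anything taken from the paper.

```python
import numpy as np

def variance_based_allocation(exploration_rewards, remaining_budget):
    """Allocate the remaining samples across prompts in proportion to the
    variance of their exploration-stage rewards, falling back to an even
    split when all variances are zero."""
    variances = np.array([np.var(r) for r in exploration_rewards], dtype=float)
    if variances.sum() == 0.0:
        weights = np.full(len(variances), 1.0 / len(variances))
    else:
        weights = variances / variances.sum()
    # Largest-remainder rounding so the integer allocation sums to the budget.
    raw = weights * remaining_budget
    alloc = np.floor(raw).astype(int)
    leftover = remaining_budget - int(alloc.sum())
    if leftover > 0:
        alloc[np.argsort(raw - alloc)[::-1][:leftover]] += 1
    return alloc.tolist()
```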

Review 4
Title: AdaBoN: Adaptive Best-of-$N$ Alignment
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
Recent test-time alignment methods, such as Best-of-N sampling, can steer language model behavior using reward models but become computationally intensive when a uniform budget is applied across prompts. This work introduces a prompt-adaptive strategy with a two-stage algorithm that efficiently allocates inference-time compute, outperforming uniform allocation and remaining competitive even against larger budgets.

Strengths:
1. The authors propose a simple yet effective two-stage Adaptive Best-of-N (AdaBoN) allocation scheme: in the first stage, a small exploration budget is used to estimate reward distributions for each prompt, and in the second stage, these estimates are used to compute the marginal value of additional samples, with a greedy algorithm assigning the remaining budget accordingly.
2. Two new evaluation metrics are introduced, the Batch Win Rate (BWR) and Expected Survival Time (EST), to assess the performance of the alignment strategy, which can help the community assess performance in similar settings.
3. Empirical findings reveal that reward distributions are mostly smooth, though possibly skewed, and can be captured by kernel density estimation (KDE).

Weaknesses:
1. The proposed method may be impractical for batches containing prompts of varying difficulty, as simple prompts requiring short responses must wait for more complex prompts that need longer, reasoning-intensive responses, leading to inefficiency; a token-based allocation approach could be more practical.
2. The paper lacks a clear definition and discussion of the variables V and Z in equation (2), which undermines its rigor.
3. The two-stage method can be costly, as it requires two rounds of parallel language model calls, which can double the time cost compared to a fixed-budget approach; a single-stage method that estimates prompt difficulty and allocates the budget up front might be more effective.
4. The current reward distribution is one-dimensional, focusing on a single property such as helpfulness, which may be smooth; however, using a multi-property reward model could require a larger budget to estimate higher-dimensional distributions, potentially leading to the curse of dimensionality.

Questions:
See the weaknesses.

EditLens Prediction: Lightly AI-edited