ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (20%) | 4.00 | 4.00 | 5420 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (20%) | 6.00 | 3.00 | 2346 |
| Fully human-written | 3 (60%) | 2.67 | 3.00 | 4486 |
| Total | 5 (100%) | 3.60 | 3.20 | 4245 |
Title Ratings Review Text EditLens Prediction
LLM Probability Concentration: How Alignment Shrinks the Generative Horizon Soundness: 1: poor Presentation: 1: poor Contribution: 3: good Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper investigates the diversity of language model outputs—i.e., the effective “number” of possible continuations of $p(\mathbf{Y}\_{\geq t} \mid \mathbf{x} \circ \mathbf{y}\_{< t})$—for different prompts $\mathbf{x}$ and different timesteps $\mathbf{y}\_{< t}$. To this end, the paper proposes the branching factor, which (if I understand correctly) is the normalised per-token perplexity of the model’s full continuations: $$B(\mathbf{x} \circ \mathbf{y}\_{< t}) = \exp\Bigg( E\_{\mathbf{y}\_{\geq t} \sim p(\mathbf{Y}\_{\geq t} \mid \mathbf{x} \circ \mathbf{y}\_{< t})} \bigg[ \frac{1}{|\mathbf{y}\_{\geq t}|} \sum\_{t'=1}^{|\mathbf{y}\_{\geq t}|} H(p(Y\_{t+t'} \mid \mathbf{x} \circ \mathbf{y}\_{< t + t'}))\bigg]\Bigg).$$ Note that the entropy in this equation is taken for a single symbol at a time. The paper then uses an Asymptotic Equipartition Property (AEP) for language models to estimate this value. It shows that the branching factor decreases with $t$, meaning that later tokens have a smaller branching factor than earlier ones. It also shows that instruction-tuned models have an order of magnitude lower branching factor than “base” models.

Investigating how "concentrated" the probability distribution of language models is under different situations is an interesting research question. The experiment that "teacher-forces" the base model's generation with a prefix generated by an instruction-tuned model is quite interesting.

Personally, I found this paper a bit hard to read and understand due to certain imprecisions.

**Definition of Branching Factor.** To define the `branching factor`, the paper first states there exists an “effective tree” $\mathcal{T}$ with high-probability sequences $\mathbf{y}\_{\geq t}$. It, however, never defines exactly what it means by “high probability”. In section 4.1, the paper then re-defines $\mathcal{T}$ as the perplexity $\exp\Big( H(p(Y\_{\geq t} \mid \mathbf{x} \circ \mathbf{y}\_{< t})) \Big)$, as this is the effective number of equally probable outcomes with the same total uncertainty. I appreciate this definition, and I think the authors could have started directly with it, instead of first informally introducing an “effective tree”, which in my opinion makes the definitions less precise and more confusing.

**Role of end-of-sequence; and fixing $N$ in definitions, but not in experiments.** The definition of the branching factor relies on fixing the length of all strings to $N$. However, the experiments suggest an end-of-sequence token is used—as some plots seem to run longer than others. This creates an important mismatch between theory and experiments, in my opinion.
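Returning to the definition above: for concreteness, here is a small sketch of how I understand this quantity would be estimated in practice—sample continuations from the model, average the per-token entropies along each sampled path, and exponentiate. This is my own reconstruction, not the authors' code; the toy Markov model below merely stands in for a real LM forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the LM: a fixed random next-token distribution per preceding token.
# In practice this would be one forward pass of the actual model.
VOCAB, EOS_ID = 50, 0
TRANSITIONS = rng.dirichlet(np.ones(VOCAB), size=VOCAB)

def next_token_probs(prefix_tokens):
    """Return p(Y | prefix) over the vocabulary; here a toy Markov model, not a real LM."""
    last = prefix_tokens[-1] if prefix_tokens else 0
    return TRANSITIONS[last]

def sample_continuation_entropies(prefix_tokens, max_len=256):
    """Sample one continuation, recording the next-token entropy (in nats) at each step."""
    entropies, generated = [], []
    for _ in range(max_len):
        probs = next_token_probs(prefix_tokens + generated)
        entropies.append(float(-np.sum(probs * np.log(probs + 1e-12))))
        tok = int(rng.choice(VOCAB, p=probs))
        generated.append(tok)
        if tok == EOS_ID:
            break
    return entropies

def branching_factor(prefix_tokens, num_samples=50, max_len=256):
    """Monte Carlo estimate of BF = exp(E[(1/|y|) * sum_t' H_t']) over sampled continuations."""
    means = [np.mean(sample_continuation_entropies(prefix_tokens, max_len))
             for _ in range(num_samples)]
    return float(np.exp(np.mean(means)))

print(branching_factor([1, 2, 3]))
```

On this reading, the 50-continuation procedure used in the experiments would simply correspond to `num_samples=50` above, which is part of why I ask below whether the estimator is best described as plain Monte Carlo.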
**Incorrect use of AEP for LLMs.** In equation 3, the authors write: $$ \lim\_{N \to \infty} P\Bigg( \bigg| - \frac{1}{N} \log p(\mathbf{y}\_{\geq t} \mid \mathbf{x} \circ \mathbf{y}\_{< t}) - H(p(\mathbf{Y}\_{\geq t} \mid \mathbf{x} \circ \mathbf{y}\_{< t})) \bigg| < \epsilon \Bigg) = 0 $$ However, unless I’m mistaken, in the original cited paper (Mudireddy et al., 2024) the definition is: $$ \lim\_{N \to \infty} P\Bigg( \frac{1}{N} \bigg| - \log p(\mathbf{y}\_{\geq t} \mid \mathbf{x} \circ \mathbf{y}\_{< t}) - \sum\_{t'=1}^{N} H(p(Y\_{t+t'} \mid \mathbf{x} \circ \mathbf{y}\_{< t + t'})) \bigg| < \epsilon \Bigg) = 1 $$ with the conditional entropies being computed “locally”, one token at a time. The AEP that the authors write down only holds in certain cases—most importantly, for ergodic processes. Note also that this difference is smaller than epsilon with probability 1, meaning that it holds for *any* string $\mathbf{y}\_{\geq t}$ with non-zero probability mass. Note as well that, as the limit is taken with $N \to \infty$, these strings are infinite, meaning that: $$ \frac{1}{N} \log p(\mathbf{y}\_{\geq t} \mid \mathbf{x} \circ \mathbf{y}\_{< t}) = \frac{1}{N} \log p(\mathbf{y}\_{\geq t + 1} \mid \mathbf{x} \circ \mathbf{y}\_{< t + 1}) $$ as the log-probability of a single symbol vanishes in the infinite limit. I believe this suggests that, if this AEP were correct, all plots should be constant (as appending an extra token $y\_{t + 1}$ to the conditioning prompt shouldn't change results). In turn, I believe this suggests an issue in the theory.

Could you expand on the role of end-of-sequence in the theory put forward by this paper, and in the experiments you ran? Is an end-of-sequence token used in the experiments? And, if so, how do you choose an $N$ when estimating the branching factor? Could you confirm that there is an issue with your use of the AEP? Or did I make any mistakes in my interpretation of your, or Mudireddy et al.’s (2024), work?

As I see it, the AEP results are not actually used in practice here: (i) you use end-of-sequence tokens, thus not simulating infinite length in any way; (ii) you use 50 string samples—instead of one very long sample—to estimate log-probabilities. Would it be fair to instead characterise your estimation procedure as a simple Monte Carlo estimator?

Fully human-written
LLM Probability Concentration: How Alignment Shrinks the Generative Horizon Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

The paper introduces a tool, called Branching Factor (BF), that can be used to study how LMs distribute probability over generations. For example, if we obtain multiple conditionally independent samples (given a prompt, that is) and observe many distinct strings, this corresponds to high BF. Conversely, if we observe only a few distinct strings, this corresponds to low BF. The paper uses this tool to study the impact of model size and instruction tuning on the distributions over responses induced by different LMs. This kind of analysis can be informative when deciding which model to use, which decoding algorithm to use, and, by extension, other things that affect decoding (e.g., prompt, few-shot examples). The paper motivates the proposed tool using the theory of stochastic processes, extended (by prior work) to LMs. It also motivates the tool intuitively by claiming that, unlike token-level statistics, it is token-invariant. The tool was used to analyse a few models across a few tasks, supporting some insightful observations, but most practical implications of the use of this tool are left as discussion points in section 9.

1. The technique is simple and reasonably well-motivated.
2. The paper is mostly clear (though I do find it to suffer from mathiness, which, in my reading, adds little).
3. The technique can power interesting decisions regarding decoding algorithms and/or models and/or prompting techniques.

I find the positioning of the work unclear and, as a result, I perceive some mismatch between what it is claiming (or what it might be claiming) and what it delivers empirically. Part of the motivation for the BF technique is 'token-invariance', but this, I believe, stayed at the level of argumentation only, with no empirical validation against techniques that do not deliver token-invariance. For example, would some 'not-token-invariant' technique lead to essentially different and/or misleading conclusions? (I am not claiming I know whether it will go one way or the other; it just looks like this should have been explored, but it wasn't.)

If I understand it correctly, BF is an exponentiated estimate of entropy (of the distribution over sequences, given a prompt), possibly aggregated across prompts depending on the analysis. Maybe the paper is claiming insight into analysing entropy (albeit exponentiated), or maybe it's claiming something else around the AEP result. Analysing entropy estimates does not come across as too surprising. On the other hand, the AEP result comes from prior work. So I am not too sure how to position this paper. Perhaps it is really about exploiting the AEP result in analysis, which would be fine as far as I am concerned, but somehow the presentation isn't clearly and transparently about 'just' that. The only issue then is that the AEP result is not contributing anything (if I understand it and its use here correctly) beyond 'licensing' us to interpret exponentiated entropy estimates (though entropy estimates are routinely interpreted, aren't they?).

I would appreciate clarifications on the two points in the weaknesses.
Also, I have some comments for clarity:
1. $H(Y \mid x)$ is not conditional entropy; it's the entropy of a conditional r.v. Conditional entropy would be taken in expectation under the joint distribution, with $x$ given random treatment, right?
2. I find the explanation in terms of perplexity rather confusing. Perplexity is a property of the model which we estimate on a data sample (like a dataset of human-generated text). What you have is more like entropy, which you happen to exponentiate (for reasons that were clear without talking about perplexity), isn't it? And the point of connecting to an AEP result is to make a connection to a stochastic process' entropy rate (not perplexity), isn't it?

Fully human-written
LLM Probability Concentration: How Alignment Shrinks the Generative Horizon Soundness: 1: poor Presentation: 2: fair Contribution: 3: good Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This work explores the behavior of LLMs during generation, focusing on the contrast between base models and aligned models that have undergone post-training. The authors propose a metric named "branching factor" (BF) to quantify the model's probability distribution during generation, where BF represents how wide or narrow the space of plausible paths for continuing the generation is, and is closely related to perplexity measures. In a set of experiments over several models and datasets, they show that aligned models have a much lower BF, reflecting a narrower space of high-probability outputs, and connect this to the observation that aligned LLMs are less sensitive to different decoding methods. They also present some analyses on the effect of intervening in the generation process at different stages.

1. A lot of the experimental results here are not trivial, and help shed some light on what the behavior of aligned vs. non-aligned models looks like.
2. Extensive experimental results in the paper, including detailed appendices.
3. The subject matter would likely appeal to a broad community.

1. In a few places I felt there is a bit of conflation between hypotheses/conjecture and things that are actually demonstrated empirically in the experiments. Specifically:
 - In Table 2, the results show that the aligned models have lower standard deviation in task accuracy. This result does not in itself "confirm that BF is a reliable predictor of sampling consistency" (l. 377). What we see is a very anecdotal pattern of correlation between BF and accuracy STD; we do not know that this pattern is reliable and consistent (especially given the very small absolute differences in BF), and we certainly don't know that there is a causal connection between these factors (as opposed to, say, a general connection between higher task performance and lower STD).
 - In Fig. 5, we see the effect of the resampling intervention. This is interesting in itself, but again I don't see how it can be used to directly infer that the BF metric "reflects a deeper commitment to specific generative paths", just because BF and the output token index are correlated.
 - In §7, and also in the intro and abstract, there is talk about the effect of stylistic tokens, but I did not see this demonstrated anywhere in the paper.

2. The main focus of the paper is on the BF metric, and in their motivation the authors state that "token-level metrics such as entropy or log-likelihood… offer only a narrow lens on model behavior: they capture local properties but miss the global structure of the output space… this motivates our proposal of the BF". As the BF is a major contribution, I would have expected the paper to demonstrate this statement in some way - directly comparing how the conclusions on model behavior drawn from BF in this work differ (if at all) from the conclusions reached in prior works that supposedly focused on more "local" properties. Moreover, in practice the calculation is very similar to mean token entropy, and it wasn't entirely clear to me that BF really measures something different.

3. I found some later parts of the paper harder to follow and less self-contained - I was missing a more technical description of how exactly the resampling in §6 was performed, and I felt the concept of "nudging" (§7) and the motivation behind it were not sufficiently clear from the text.

4. It is worth mentioning that the conclusions in §3 have largely been shown by prior work, most explicitly in Shi et al. 2024, "A Thorough Examination of Decoding Methods in the Era of LLMs" (https://aclanthology.org/2024.emnlp-main.489/).

1. Is there a reason for the mix between 70B and 8B models in §7? This conflates the factors of large/small and aligned/not-aligned, which may make the results more difficult to interpret.
2. Why does the text reference an average of "ten times higher BF" (line 335)? As far as I can see, in most of the panels in Fig. 3 / Fig. 8 the ratio is much lower.

Minor suggestions:
* I did not feel the extra y-axis (cumulative impact) in Fig. 4 adds information for the reader; same for the 80% threshold (and it is unclear from the text why this threshold is important).
* The related work section focuses a lot on semantic entropy, which IMO is a bit further removed from this work; in contrast, I think elaborating more on some works mentioned in the intro could provide more context for this work - specifically the various works mentioned under Hypotheses 1-3 (l. 183-189).

Typos:
* l. 360: model size M -> model size S
* l. 362: gains the use -> gains from the use
* l. 450: - which applies

Fully human-written
LLM Probability Concentration: How Alignment Shrinks the Generative Horizon Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

This paper tackles the phenomenon that aligned LLMs produce outputs that are significantly less diverse than their base model counterparts. The authors introduce a metric they term the Branching Factor (BF) to formalize and measure this "probability concentration". The central thesis is that alignment tuning acts as a "shrinking" mechanism on the model's generative horizon, and the paper provides a strong, unified framework that connects this concentration to several other observed behaviors, such as decoding insensitivity and the stability of CoT reasoning. The paper's core contribution is the formalization and application of the Branching Factor. Unlike standard perplexity, BF measures the perplexity of the space the model chooses to explore on its own. With this, the authors find that alignment is the dominant factor, with BF dropping from ~12 to ~1.2 when going from base to instruction-tuned models. Also, generation locks into a specific topic or line of reasoning and becomes more predictable as more tokens are generated, which is verified with the resampling and nudging experiments.

1. **Unifying Metric:** The paper's primary strength is providing a single, intuitive metric, BF, that explains and unifies several disparate, known phenomena: low diversity, insensitivity to decoding parameters, and the stability of CoT reasoning.
2. **Strong Experimental Design:** The resampling and nudging experiments are clever and highly effective ways to demonstrate the consequence of BF reduction, showing that models become "brittle" and locked into their chosen path.
3. **Robustness and Clarity:** The findings are shown to be consistent across multiple model families and various tasks. The Pareto analysis is particularly effective at isolating alignment as the key variable.

1. **Metric Novelty:** The paper proposes the "Branching Factor," but it is almost a specific measurement of perplexity.
2. **Complexity in Alignment:** The authors mention that they did not disentangle which stage of alignment contributes most to BF reduction; ideally, some experiment on this should be in the paper.

1. How do you see beam search fitting into this picture? I.e., given that "off-path" trajectories in aligned models are not just low-probability but low-quality, are alternative beams likely to be "garbage" paths?

Lightly AI-edited
LLM Probability Concentration: How Alignment Shrinks the Generative Horizon Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

The paper proposes the Branching Factor (BF), a scalar metric grounded in information theory that measures how many plausible next tokens a language model entertains at each generation step. By expressing BF in terms of per-token entropy and invoking the Asymptotic Equipartition Property, the authors derive an estimator that scales to realistic sequence lengths without exhaustive enumeration. They evaluate BF across base, instruction-tuned, and Chain-of-Thought variants of contemporary LLMs on tasks ranging from summarization to multi-step reasoning, observing that BF declines steadily as generation proceeds and that alignment tuning collapses BF by almost an order of magnitude from the very first token. Nudging experiments indicate that alignment does not fundamentally rewrite model parameters but rather surfaces latent low-entropy trajectories present in the unaligned model. Overall, the work offers both theoretical insight and practical diagnostics for understanding the reduction in output diversity that often accompanies model alignment.

The paper introduces a clear, distribution-level lens on why aligned LLMs tend to be more deterministic, formalizing “probability concentration” via a task-agnostic Branching Factor (BF) instead of surface diversity metrics. The BF is grounded in information theory—defined as the exponentiated entropy rate over continuations—and connected to a balanced-tree abstraction of the effective output space, which makes the idea intuitive and comparable across settings. The authors also provide two practical estimators: a token-entropy aggregation for short outputs and an AEP-based estimator for long outputs that leverages length-averaged NLL, with empirical plots showing NLL closely tracks entropy and stabilizes with length. This framework unlocks several cohesive empirical findings: BF typically shrinks as generation proceeds; aligned models sit near BF≈1.2 (roughly an order of magnitude lower than base models), helping explain reduced decoding sensitivity; and a Pareto-style analysis highlights alignment as the dominant driver relative to model size, generation, and prompt complexity. Beyond diagnostics, the work ties BF to behavior: majority-vote variance drops with lower BF, resampling late in a sequence degrades performance more than early resampling, and CoT’s longer chains naturally push inference into low-BF regions that stabilize answers. Together, the formalization, efficient estimation, and multi-angle evidence (decoding study, variance analyses, resampling, and nudging) create a persuasive, unified account of alignment’s impact on LLM outputs.

Some claims hinge on estimator assumptions and experimental choices that invite further stress-testing. The AEP-based estimator inherits conditions (e.g., long sequences, autoregressive generation, finite precision) and approximations; while these are argued to be mild and empirically supported, deviations (short outputs, atypical decoding, domain shift) could bias BF estimates, and Monte Carlo underestimation issues remain salient for short generations.
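To make the estimator under discussion concrete, the following is a rough sketch of how I read the AEP-based, length-averaged NLL computation: sample continuations from the model itself, compute each continuation's negative log-likelihood divided by its length under teacher forcing, and exponentiate the mean. This is my own reconstruction under stated assumptions, not the authors' implementation; the toy Markov model below merely stands in for a real LM forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the LM: a fixed random next-token distribution per preceding token.
VOCAB, EOS_ID = 50, 0
TRANSITIONS = rng.dirichlet(np.ones(VOCAB), size=VOCAB)

def step_probs(tokens):
    """Toy next-token distribution p(Y | tokens); in practice, one LM forward pass."""
    return TRANSITIONS[tokens[-1] if tokens else 0]

def sample_continuation(prompt, max_len=128):
    """Sample a continuation y ~ p(. | prompt) from the toy model."""
    out = []
    while len(out) < max_len:
        tok = int(rng.choice(VOCAB, p=step_probs(prompt + out)))
        out.append(tok)
        if tok == EOS_ID:
            break
    return out

def length_averaged_nll(prompt, continuation):
    """-log p(continuation | prompt) / |continuation|, accumulated token by token (teacher forcing)."""
    nll = 0.0
    for i, tok in enumerate(continuation):
        nll -= np.log(step_probs(prompt + continuation[:i])[tok] + 1e-12)
    return nll / max(len(continuation), 1)

def bf_aep_estimate(prompt, num_samples=50):
    """AEP-style BF estimate: exp of the mean length-averaged NLL of sampled continuations."""
    nlls = [length_averaged_nll(prompt, sample_continuation(prompt)) for _ in range(num_samples)]
    return float(np.exp(np.mean(nlls)))

print(bf_aep_estimate([1, 2, 3]))
```

On this reading, the short-output token-entropy aggregation and the NLL-based route should agree when per-token NLL tracks per-token entropy, which is precisely the empirical correspondence the paper's plots are meant to support.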
Causal attributions around alignment are suggestive rather than surgical: alignment dominates in the Pareto analysis, but the study does not disentangle which alignment stage (SFT vs. reward modeling vs. RL) drives BF reductions; the authors themselves flag this and hypothesize RL as the main culprit, leaving an important gap for checkpoint-level ablations. Some demonstrations (e.g., nudging with a different “instruct” model for prefixes) risk confounds from model mixing; similarly, results are concentrated on specific open-weight families and tasks, so generalization to other architectures, languages, and safety/creative settings deserves replication. Finally, while BF is positioned as deeper than sample-level diversity, its practical relationship to user-perceived variety is complex and sometimes uncorrelated with Distinct-N; practitioners may still need guidance on how BF should inform decoding or training interventions without sacrificing quality. These caveats don’t undercut the paper’s core insight, but they do mark clear avenues for more rigorous causal analyses, broader model/task coverage, and tooling that translates BF diagnostics into actionable training or deployment knobs.

The manuscript would be strengthened by deeper engagement with related and contrasting work. To name a few examples: the recent paper by Rodemann et al. (https://arxiv.org/abs/2502.14581) exploring implicit statistical biases introduced via alignment deserves explicit discussion. Contra-alignment perspectives such as the Overton Pluralism framework proposed by Lake et al. (https://arxiv.org/abs/2406.17692) are mentioned, but could be discussed in more detail. By omitting these debates, the paper risks overstating novelty and understating the broader scholarly context. On the methodological side, the assumptions of ergodicity and stationarity required for the AEP may be violated for highly variable prompts - open-domain dialogue or code synthesis, for example - and quantitative error bounds under such conditions would bolster the argument.

While I cannot vote for acceptance of the paper in its current form, I am confident that a thorough revision of the currently very brief related work section (Sec. 8) and properly addressing my concerns w.r.t. ergodicity and stationarity can substantially improve the paper and lift it over the bar.

Fully AI-generated