ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 2.00 | 4.00 | 2239 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 3 (75%) | 3.33 | 3.33 | 1988 |
| Total | 4 (100%) | 3.00 | 3.50 | 2050 |
Submission: REAR: Scalable Test-time Preference Realignment through Reward Decomposition
Review 1
Soundness: 2 (fair) | Presentation: 1 (poor) | Contribution: 2 (fair)
Rating: 2 (reject) | Confidence: 4 (confident, but not absolutely certain)
EditLens Prediction: Fully human-written

Summary: This paper proposes a reward-decomposition idea used at test-time inference for personalized preference tasks. The idea is simple: decompose the whole reward into a question-related reward and a preference-related reward. The authors then use the policy probability $\pi(a|s)$ as a proxy for the reward $r(s,a)$. Based on this reward decomposition, they apply the resulting score in two test-time algorithms, Best-of-N and DVTS. Experiments demonstrate the effectiveness and efficiency of their reward design.

Strengths: The reward decomposition strategy is simple and easy to understand, and its effectiveness is demonstrated by the experimental results.

Weaknesses:
1. The authors try to use mathematical derivations to show the depth of their approach, but these components are not well stated. For example, from (5) to (7) it feels like nothing substantial is explained. Lemma 3.1 also appears quite abrupt: it introduces a seemingly fancy formula whose meaning is unclear, and in the end everything circles back to (7). Note that Lemma 3.1 is merely an intermediate result and does not offer any theoretical guarantee. It only shows that your policy is an optimizer of a certain expression, which makes the interpretation of this relation crucial; however, I do not find it particularly insightful or helpful for understanding your algorithm's design. In fact, it ultimately leads to equation (7), which is equivalent to (5).
2. What truly matters is Lemma 3.2, as it tells the reader how to compute the reward. However, the authors merely state that the reward can somehow be replaced by the policy probability. I believe this is a critical step in the algorithm's design, yet the paper provides no intuition or explanation for this substitution. Although proofs are included in the appendix, they also fail to offer any meaningful intuition.
3. The presentation could be improved for clarity and coherence. For example: (a) What is the REAR score, and how is it defined? (I can roughly infer its meaning, but the paper should state it explicitly.) (b) What is $\hat{r}_{\text{REAR}}$ in Lemma 3.2? (c) In Proposition 2.1 and several other places, you should distinguish between '=' (equality) and ':=' or '$\triangleq$' (definition). This is especially important when your notation deviates from standard conventions. For instance, in Equation (3), your Q-function is not the conventional one; if you want to define a new Q-function (or the soft Q-function, as you call it), you should use a definition symbol rather than the equality sign.
4. After analyzing the paper, I find the novelty quite limited. The method decomposes the reward into two terms, replaces those terms with two policy-probability terms, and uses the result as guidance in two common test-time inference methods. The mathematical derivations make little sense and sometimes disrupt the reading flow of the paper.

Questions: See weaknesses.
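The decomposition this review describes — proxying a question-related and a preference-related reward with policy log-probabilities — can be illustrated with a minimal sketch. The code below assumes a hypothetical `policy` object exposing `token_logprobs(context, response)`; it follows the reviewers' description only and is not necessarily the paper's exact formulation.

```python
def sequence_logprob(policy, context: str, response: str) -> float:
    # Hypothetical API: per-token log-probabilities of `response` given `context`,
    # summed to give the sequence log-probability log pi(response | context).
    return sum(policy.token_logprobs(context, response))


def rear_style_score(policy, question: str, preference: str,
                     response: str, lam: float = 1.0) -> float:
    # One way to realize the score the reviews describe (a sketch, not the
    # paper's stated formula): the question-related reward is proxied by
    # log pi(a | q), the preference-related reward by the contrast
    # log pi(a | q, p) - log pi(a | q), and `lam` rescales the preference term.
    logp_q = sequence_logprob(policy, question, response)
    logp_qp = sequence_logprob(policy, question + "\n" + preference, response)
    return logp_q + lam * (logp_qp - logp_q)
```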
Review 2
Soundness: 2 (fair) | Presentation: 2 (fair) | Contribution: 2 (fair)
Rating: 4 (marginally below the acceptance threshold) | Confidence: 3 (fairly confident; math/other details not carefully checked)
EditLens Prediction: Fully human-written

Summary: This paper presents a test-time scaling method for preference alignment in LLMs in the case of non-verifiable rewards, under the assumption that the preference is specified in-context. The authors decompose rewards into question-related and preference-related components, then derive a score based on policy probabilities that can be integrated with best-of-N sampling and DVTS. While the core idea has merit, the paper suffers from significant theoretical gaps, questionable experimental design, and overclaimed contributions.

Strengths:
1. The "realignment" framing is intuitive: base models have implicit preferences from training that may not match specific user needs.
2. The paper correctly identifies that test-time scaling (TTS) has been limited to verifiable domains (math, coding), and extending it to subjective preference alignment is a worthwhile research direction.

Weaknesses:
1. What is α? Is it a task-specific constant, or a property of how the model was trained?
2. This paper would have been much easier to understand if it were presented as "Test-Time Preference Alignment via Policy Interpolation". Overcomplicating a simple method does not add value to the paper.
3. The experimental design lacks statistical rigor: statistical significance of the results is not reported.

Questions: Please see weaknesses.
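The "policy interpolation" framing this review suggests can be made explicit under the log-probability proxy described in the other reviews; the α the reviewer asks about would then play the role of the interpolation weight. This is a sketch of that reading, not the paper's stated derivation:

```latex
\log \pi_{\mathrm{REAR}}(a \mid q, p)
  \;\propto\; (1-\lambda)\,\log \pi(a \mid q) \;+\; \lambda\,\log \pi(a \mid q, p)
\quad\Longleftrightarrow\quad
\pi_{\mathrm{REAR}}(a \mid q, p)
  \;\propto\; \pi(a \mid q)^{\,1-\lambda}\,\pi(a \mid q, p)^{\,\lambda}
```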
Review 3
Soundness: 2 (fair) | Presentation: 3 (good) | Contribution: 2 (fair)
Rating: 2 (reject) | Confidence: 4 (confident, but not absolutely certain)
EditLens Prediction: Fully AI-generated

Summary: This paper proposes REAR, a test-time method for aligning large language models with user preferences without further training. The key idea is to decompose the implicit reward into question-related and preference-related components and to rescale the preference term at inference time. The authors show that REAR can be computed as a linear combination of log-probabilities and integrated into existing test-time scaling (TTS) algorithms such as best-of-N sampling and diverse verifier tree search (DVTS). Experiments on several preference-alignment benchmarks (PrefEval, Multifaceted, PingPong) demonstrate modest gains over existing TTS and test-time alignment baselines.

Strengths:
- The paper is well written and clearly structured, with theoretical derivations and a reasonable experimental design.
- The proposed formulation is efficient and lightweight, requiring no extra training or reward models.

Weaknesses:
- The idea of using log-probabilities or policy scores for implicit reward shaping at test time has been explored in prior work; the contribution mainly reformulates known ideas in a slightly different analytical framing. The proposed reward decomposition essentially feeds the reward model (or policy probability) different segments of the same input (question vs. question + preference) to derive a rescaled score. This approach, while intuitive, lacks genuine novelty or theoretical depth.
- In the second paragraph of the introduction, the authors state that "existing TTS research has predominantly focused on domains such as mathematics and coding". However, test-time alignment has long been studied, even before the prevalence of TTS, in papers such as: (1) ARGS: Alignment as Reward-Guided Search; (2) Inference-Time Language Model Alignment via Integrated Value Guidance; (3) Fast Best-of-N Decoding via Speculative Rejection.
- Performance gains over baselines are modest, often within small margins and without strong qualitative differentiation.
- The method remains heuristic: while mathematically presented as a decomposition, it does not provide clear theoretical or empirical evidence that the "preference" and "question" components can be distinctly separated in practice.

Questions: See weaknesses.
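As a concrete illustration of the TTS integration this review mentions, a score of this kind can be dropped into best-of-N selection as follows, reusing the `rear_style_score` sketch above. `policy.sample` is a hypothetical generation call, and DVTS would apply the same score to rank partial continuations at each tree-expansion step.

```python
def best_of_n_rear(policy, question: str, preference: str,
                   n: int = 8, lam: float = 1.0) -> str:
    # Sample n candidate responses and keep the one with the highest
    # REAR-style score (see the earlier sketch of `rear_style_score`).
    prompt = question + "\n" + preference
    candidates = [policy.sample(prompt) for _ in range(n)]
    return max(candidates,
               key=lambda c: rear_style_score(policy, question, preference, c, lam))
```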
Review 4
Soundness: 2 (fair) | Presentation: 3 (good) | Contribution: 2 (fair)
Rating: 4 (marginally below the acceptance threshold) | Confidence: 3 (fairly confident; math/other details not carefully checked)
EditLens Prediction: Fully human-written

Summary: This paper introduces REAR, a framework for aligning LLMs with user preferences at test time, specifically targeting subjective and open-ended tasks. REAR decomposes the reward function into question-related and preference-related components, enabling dynamic re-weighting of user preference without retraining. The authors show that REAR can be efficiently formulated as a linear combination of policy probabilities, making it tractable and compatible with TTS approaches such as best-of-N (BoN) sampling and Diverse Verifier Tree Search (DVTS).

Strengths:
1. REAR is plug-and-play and requires no external models or retraining, making it attractive for deployment.
2. REAR outperforms both token-level preference-alignment baselines (e.g., Amulet, Linear Alignment) and TTS methods.
3. The method is grounded in a reinforcement learning formulation and provides detailed proofs.

Weaknesses:
1. The performance of REAR depends on the choice of the λ parameter, which may require tuning for different questions or preferences.
2. If user preferences are not expressed in words (for example, if they are only implicit, behavioral, or external), the REAR method as proposed would not function without modification.
3. Heavy reliance on LLM-based evaluation for some tasks raises concerns about the robustness and objectivity of the evaluation.

Questions:
1. Hyperparameter selection of λ: How robust is the method to the choice of λ across different tasks and datasets? Is there a principled or automated way to select λ at inference time, possibly in the absence of validation data?
2. In the section "Analysis on Generated Responses", it is insufficient to use only the "helpfulness" score to measure general response quality; moreover, "helpfulness" can itself be one of the preferences to consider.
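The question about λ selection raised here could be probed with a simple sweep over a small held-out set, as in the sketch below; `judge_score` is a hypothetical stand-in for whatever preference-following metric a benchmark provides, and the helpers reuse the earlier sketches.

```python
def sweep_lambda(policy, eval_pairs, judge_score,
                 lambdas=(0.0, 0.5, 1.0, 1.5, 2.0), n: int = 8) -> dict:
    # Run best-of-N with several lambda values on a few (question, preference)
    # pairs and report the mean judge score per lambda.
    results = {}
    for lam in lambdas:
        outputs = [best_of_n_rear(policy, q, p, n=n, lam=lam)
                   for q, p in eval_pairs]
        results[lam] = sum(judge_score(q, p, o)
                           for (q, p), o in zip(eval_pairs, outputs)) / len(eval_pairs)
    return results
```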