ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 4.00 | 4.00 | 3224 |
| Fully human-written | 3 (75%) | 4.00 | 4.00 | 1948 |
| Total | 4 (100%) | 4.00 | 4.00 | 2267 |
Multi-Sample Preference Optimization for Generative Model Alignment

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes multi-sample variants of direct preference optimization methods (called mDPO and mIPO) that replace single-response comparisons with group-wise comparisons aimed at aligning distributional properties (e.g., diversity, bias). In particular, the authors lift DPO/IPO from singletons to sets by using the (geometric-mean) product likelihood of a response group and optimizing the same surrogate with expectations over group samples (a sketch of this objective follows this review). They derive an unbiased mini-batch estimator for mIPO (a squared-loss objective over mean implicit rewards) and discuss a biased but lower-variance estimator for mDPO. Finally, they add an NLL auxiliary term to stabilize finetuning. Experiments cover LLM random number generation, creative fiction, and diffusion debiasing, plus a synthetic-label robustness study in which the multi-sample variants win more often under label noise.

Strengths:
- The paper argues that many properties, such as diversity, are distributional and not captured by single-sample comparisons, and the group-wise formulation is intuitive.
- Extending DPO/IPO by averaging implicit rewards over sets keeps training compatible with existing implementations.

Weaknesses:
- Technically, mDPO/mIPO mostly reuse the same surrogates with a group average of implicit rewards and a straightforward mini-batch estimator. The constrained-optimization view for adding NLL is already common in practice. Relative to DPO/IPO, the step from singletons to sets reads as expected algebra rather than a new learning principle.
- There is prior work on distributional preference alignment that directly targets set-level objectives. The paper cites some of it, but the stated differences amount mainly to different experiments and applications, which remain vague to me.

Questions:
- Can you provide a theoretical justification that mDPO/mIPO are proper surrogates for a target distributional objective? Is there any consistency result under a Bradley–Terry-style group model?
- How does mIPO compare in principle to work on distributional preference alignment (i.e., what objective is being optimized)?
- Is there a specific reason for using the geometric mean as the aggregation over a group?

EditLens Prediction: Fully human-written
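For concreteness, below is a minimal PyTorch-style sketch of the kind of group-level objective described in the summary above, assuming the group "reward" is the average per-response log-ratio between policy and reference model (equivalent to the geometric-mean product likelihood mentioned by the reviewer). The function names, the β scaling, and the IPO target constant follow the original DPO/IPO papers and are illustrative; they are not taken from the submission's implementation.

```python
import torch
import torch.nn.functional as F

def group_implicit_reward(policy_logps, ref_logps):
    """Average implicit reward over a group of k responses.

    policy_logps, ref_logps: tensors of shape (k,) with the total
    log-likelihood of each response under the policy / reference model.
    Averaging log-ratios corresponds to the geometric mean of the
    per-response likelihood ratios.
    """
    return (policy_logps - ref_logps).mean()

def mipo_style_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1):
    """IPO-style squared loss on the group-averaged implicit-reward margin."""
    margin = group_implicit_reward(pol_w, ref_w) - group_implicit_reward(pol_l, ref_l)
    return (margin - 1.0 / (2.0 * beta)) ** 2

def mdpo_style_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1):
    """DPO-style logistic loss on the same group-level margin."""
    margin = group_implicit_reward(pol_w, ref_w) - group_implicit_reward(pol_l, ref_l)
    return -F.logsigmoid(beta * margin)
```

With group size k = 1 both functions reduce to the standard per-pair DPO/IPO losses, which is the sense in which the reviewer describes the extension as reusing the same surrogates.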
Multi-Sample Preference Optimization for Generative Model Alignment

Soundness: 3: good
Presentation: 4: excellent
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
RLHF and DPO are popular post-training methodologies for LLM alignment. The standard method uses preference pairs consisting of single samples to align LLMs, which increases the probability of generating preferred samples when the reward is computed at a per-sample level. However, in several cases, such as increasing the diversity of responses, the reward cannot be computed at a per-sample level. This paper extends DPO and IPO to such cases. The authors generate a set of responses for each prompt; analogous to preferred and dis-preferred samples, the multi-sample formulation has a preferred and a dis-preferred set. The authors provide an unbiased estimator for the multi-sample formulation of IPO through theoretical analysis. Through five empirical studies, they show that multi-sample DPO and IPO perform better than DPO and IPO when the reward is computed at a distribution level.

Strengths:
1. Computing rewards at a sample level is an important limitation of DPO and IPO. Extending the standard preference optimization framework to distribution-level rewards addresses this limitation.
2. The theoretical analysis showing that the variance of the multi-sample estimator decreases with group size is novel (see the note after this review).
3. The range of empirical studies shows that multi-sample DPO/IPO improves upon standard DPO/IPO when the rewards are formulated at a distribution level, across a range of modalities.

Weaknesses:
1. The contribution of this work seems limited. The primary difference between the standard DPO setting and the multi-sample DPO setting lies in how the samples are separated into preferred and dis-preferred groups: instead of being separated at a sample level, they are first grouped into sets and separated at a set level. Given that the reward is computed at a distributional level, this seems like the natural application of DPO to such a problem.
2. There is a strong overlap between this paper and Li et al. [2024]. The authors have not clearly stated which contributions are original relative to Li et al. [lines 118-124].
3. This work could benefit from stronger baselines, e.g., extending the RLHF framework to distributional reward problems. Furthermore, I would encourage the authors to compare these results with Zhong et al. [2023] and Melnyk et al. [2024].

Questions:
Please see the weaknesses.

EditLens Prediction: Fully human-written
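As a brief aside on strength 2 above: the basic intuition for why variance shrinks with group size can be written down with the standard i.i.d. argument (this is not claimed to reproduce the paper's exact analysis). If the group-level implicit reward is the average of $k$ per-response log-ratios,

$$ \hat r_k(x, Y) \;=\; \frac{1}{k}\sum_{i=1}^{k} \beta \log \frac{\pi_\theta(y_i \mid x)}{\pi_{\mathrm{ref}}(y_i \mid x)}, \qquad Y=\{y_1,\dots,y_k\}, $$

and the per-response terms are i.i.d. with variance $\sigma^2$, then $\operatorname{Var}[\hat r_k] = \sigma^2 / k$, i.e., the estimator's variance decreases linearly in the group size $k$.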
Multi-Sample Preference Optimization for Generative Model Alignment

Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper extends the direct preference alignment frameworks DPO and IPO to handle preferences over groups of items rather than binary preferences over pairs of items. The authors provide intuitive modifications to the DPO/IPO objectives to incorporate group-wise comparisons, derive mini-batch estimators for these objective functions, and show the validity of the method with various experiments.

Strengths:
The problem of needing to compare groups of items instead of individual items in some cases is natural and of practical relevance. The experimental case studies (random number generation, image debiasing, improving the quality of creative fiction generation, and training with label noise) are all quite interesting; they both validate the approach and give real-world examples where one might want to compare distributions instead of pairs of items. The paper is also well written and easy to follow.

Weaknesses:
1. The proposed methodology is quite straightforward, and the novelty of the proposed solution is moderate.
2. The objective estimates can have bias/variance that presumably depends on the batch size, but the experiments do not explore this angle sufficiently. For someone trying to deploy this method, how would they deal with the bias/variance issue, and how does it change with the batch size? (A toy illustration follows this review.)

Questions:
See weaknesses.

EditLens Prediction: Fully human-written
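A generic numerical illustration of the kind of dependence the reviewer alludes to: plugging a mini-batch average into a nonlinear loss such as the log-sigmoid incurs a Jensen-gap bias that shrinks as more samples are averaged. This toy simulation assumes Gaussian noise on the per-response margins and is not the submission's estimator; it only shows how such a bias scales with the number of samples averaged per estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def logsigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -np.logaddexp(0.0, -z)

# Toy setup: the "true" group-level margin is mu, but each mini-batch
# only observes the average of m noisy per-response margin samples.
mu, sigma, trials = 1.0, 2.0, 200_000
true_loss = -logsigmoid(mu)  # loss evaluated at the exact margin

for m in (1, 2, 4, 8, 16):
    # Distribution of the mini-batch mean: std shrinks as sigma / sqrt(m).
    margins = rng.normal(mu, sigma / np.sqrt(m), size=trials)
    est_loss = np.mean(-logsigmoid(margins))  # expected plug-in estimate
    print(f"m={m:2d}  E[plug-in loss]={est_loss:.4f}  bias={est_loss - true_loss:+.4f}")
```

Because the negative log-sigmoid is convex, the plug-in estimate is biased upward, and the printout shows the bias decaying as m grows, which is one way a practitioner could empirically probe the batch-size sensitivity the reviewer asks about.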
Multi-Sample Preference Optimization for Generative Model Alignment

Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces **multi-sample extensions of preference optimization methods (mDPO and mIPO)** for alignment. Whereas standard approaches such as RLHF and DPO/IPO rely on **single-sample pairwise comparisons**, this work proposes to instead optimize over distributions of responses. This allows the methods to capture distributional characteristics such as **diversity and bias**, and to better handle cases where preferences may not exist between individual samples but emerge clearly when comparing groups. The authors highlight challenges in unbiased estimation and provide empirical evidence that the proposed approaches can improve robustness against label noise and enhance diversity in generated outputs.

Strengths:
- Tackles an **important and underexplored problem**: extending preference optimization from single-sample to multi-sample comparisons.
- Provides a **novel formulation** (mDPO, mIPO) that enables aligning distributions rather than instances.
- Shows **promising empirical results**, especially in improving **diversity** and reducing **bias** in outputs.
- Highlights the robustness of mDPO against label noise, which is valuable for real-world preference datasets.
- Overall **presentation** is good, with several intuitive illustrations of the advantages of multi-sample formulations.

Weaknesses:
While I liked the paper overall and I believe it tackles an important problem, some key weaknesses should be addressed:

1. **Insufficient experimental comparisons**:
   - The paper does not compare against **other multi-sample methods** such as DFT (Guo et al., 2025).
   - No experiments with **naive multi-sample baselines**, e.g., running DPO/IPO over all pairwise comparisons between the positive and negative sets in the mini-batch (a sketch of this baseline follows this review), i.e.
     $$ \frac{1}{k^2}\sum_{i=1}^k\sum_{j=1}^k l(x, y_{w,i}, y_{l,j}) $$
2. **Overstatements in claims**: the statement on page 9, line 449, that "both mDPO and mIPO significantly outperform the baselines" is too strong given the figure for mIPO with $k=5$; it should be further supported or weakened.
3. **Experimental detail gaps**:
   - The paper acknowledges that obtaining an unbiased estimator of mDPO is challenging, yet still reports experiments with mDPO. It is unclear whether an unbiased estimator or a biased version is used.
   - Figure 5 and Table 3: why does mIPO with $k=3$ outperform $k=5$? One would expect larger $k$ to monotonically improve performance (even if with diminishing returns).
   - Figure 6: it is not clear whether $k=5$ is an outlier or whether performance saturates; results for $k=6$ would help clarify this.
   - Iterative improvement experiments (page 9): the baseline is not clearly stated; for fairness it should be iterative DPO/IPO.

Guo et al. "Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data." 2025. arXiv:2502.18679.

**Minor issues that have not affected the rating**
- Page 3 line 121: "foci" → "focus".
- Page 7 line 341: the Figure 5 caption should have more space from the surrounding text.
- See the weakness section for suggested additional experiments; more baseline comparisons would greatly benefit the paper.
- See the weakness section for some discussion and experimental details that the paper would benefit from addressing.

EditLens Prediction: Lightly AI-edited
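The naive multi-sample baseline requested in weakness 1 above could be implemented roughly as follows. This is a minimal sketch using a standard DPO-style per-pair loss for $l$; the function and variable names are illustrative assumptions, not code from the submission.

```python
import torch
import torch.nn.functional as F

def naive_all_pairs_dpo_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1):
    """Average a standard per-pair DPO loss over all k x k pairings of
    preferred and dis-preferred responses, i.e. (1/k^2) * sum_i sum_j l(x, y_{w,i}, y_{l,j}).

    pol_w, ref_w: log-likelihoods of the k preferred responses (shape (k,))
    pol_l, ref_l: log-likelihoods of the k dis-preferred responses (shape (k,))
    """
    rw = pol_w - ref_w                                  # implicit rewards, preferred set
    rl = pol_l - ref_l                                  # implicit rewards, dis-preferred set
    margins = rw.unsqueeze(1) - rl.unsqueeze(0)         # (k, k) pairwise margins
    return (-F.logsigmoid(beta * margins)).mean()       # mean over all k^2 pairs
```

Note that this baseline still scores responses individually within each pair, so on its own it would not capture set-level properties such as diversity; comparing against it would help isolate what the group-level formulation actually adds.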