ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 1 (25%) | 6.00 | 3.00 | 1610 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 2 (50%) | 6.00 | 3.50 | 1634 |
| Fully human-written | 1 (25%) | 6.00 | 4.00 | 1765 |
| Total | 4 (100%) | 6.00 | 3.50 | 1661 |
Individual Reviews
Reinforced Preference Optimization for Recommendation

Ratings: Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 3 (good) | Rating: 6 (marginally above the acceptance threshold) | Confidence: 3 (fairly confident; some parts of the submission or related work may not have been fully understood; math/other details were not carefully checked)

Review: The paper proposes Reinforced Preference Optimization (RPO), a reinforcement learning–based extension of Direct Preference Optimization (DPO) for recommendation tasks. Unlike standard DPO, which aligns models to short-term revealed preferences, RPO introduces a utility critic that estimates long-term user satisfaction, combining it with reward modeling and policy updates. The framework alternates between preference optimization (based on pairwise comparisons) and utility reinforcement (based on delayed feedback).

The overall idea of integrating a reinforcement objective into DPO is reasonable and grounded in established RL principles. However, the derivation of the "utility critic" and the way it interacts with the preference policy is somewhat heuristic, and the connection to standard RL formulations (e.g., Q-learning or advantage functions) is not rigorously developed. The paper claims that reinforcement learning with verifiable rewards (RLVR) "naturally" addresses implicit reward issues in LLM-based recommenders, but this claim is not rigorously justified. I am wondering: what is a verifiable reward, and how does it differ from a traditional reward in DL/RL? The core of ReRe (beam search for sampling and ranking-based auxiliary rewards) is an extension of existing DPO and GRPO concepts; there is not much novelty in the optimization algorithm or the reward modeling principle. My main concern is that the proposed method primarily combines existing ideas (beam search, ranking loss, constrained decoding) with limited algorithmic innovation.

Questions: Please refer to the weakness part.

EditLens Prediction: Fully AI-generated
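To make the reviewer's question about verifiable rewards concrete, the minimal Python sketch below contrasts a rule-based (verifiable) reward, which can be checked directly against held-out interaction data, with the implicit reward that DPO induces from policy and reference likelihoods. The function names, the exact-match rule, and the numeric values are illustrative assumptions, not taken from the paper.

```python
# Illustrative contrast between a "verifiable" (rule-based) reward and the
# implicit reward induced by DPO. Names and signatures are hypothetical,
# not taken from the ReRe paper or its code release.


def verifiable_reward(generated_title: str, ground_truth_title: str) -> float:
    """Rule-based reward: checkable from data alone, with no learned reward model.
    Here the rule is exact title match against the held-out next item."""
    return 1.0 if generated_title.strip().lower() == ground_truth_title.strip().lower() else 0.0


def dpo_implicit_reward(logp_policy: float, logp_reference: float, beta: float = 0.1) -> float:
    """DPO's implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)).
    It is defined by model likelihoods rather than by a verifiable rule."""
    return beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # A correct recommendation earns the full rule-based reward ...
    print(verifiable_reward("LEGO Star Wars X-Wing", "Lego Star Wars X-Wing"))  # 1.0
    # ... while the DPO-style reward depends on the two models' log-likelihoods.
    print(round(dpo_implicit_reward(logp_policy=-2.1, logp_reference=-2.9), 3))  # 0.08
```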
Reinforced Preference Optimization for Recommendation

Ratings: Soundness: 4 (excellent) | Presentation: 3 (good) | Contribution: 3 (good) | Rating: 6 (marginally above the acceptance threshold) | Confidence: 4 (confident, but not absolutely certain)

Review: The authors address reinforcement learning for generative recommenders. They observe that existing works often rely on implicit rewards and ignore high-quality negative modeling. To address these two problems, this paper proposes to integrate RLVR into the post-training of LLM-based recommenders. For adaptation, the paper proposes a constrained beam search and an augmenting rule-based accuracy reward. Extensive experiments validate the effectiveness of the proposed method.

Strengths:
+ S1. This paper is well-organized and well-written, making it easy to follow.
+ S2. This paper is well-motivated.
+ S3. Extensive experiments have been conducted.
+ S4. The code is released, making it easy to reproduce.

Weaknesses:
- W1. Some up-to-date papers are not discussed. For example, LatentR3 [1] also adopted the GRPO algorithm. What is the difference between ReRe and previous works?
- W2. It would be better to further investigate the generality of the proposed ReRe. This paper studies an LLM-based recommender that takes item titles as input, but how about recommenders that use item IDs, such as E4SRec [2] or LLaRA [3]?

[1] Zhang, Yang, et al. "Reinforced Latent Reasoning for LLM-based Recommendation." *arXiv preprint [arXiv:2505.19092](https://arxiv.org/abs/2505.19092)* (2025).
[2] Li, Xinhang, et al. "E4SRec: An Elegant Effective Efficient Extensible Solution of Large Language Models for Sequential Recommendation." *arXiv preprint [arXiv:2312.02443](https://arxiv.org/abs/2312.02443)* (2023).
[3] Liao, Jiayi, et al. "LLaRA: Large Language-Recommendation Assistant." *Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval*. 2024.

Questions: All my questions have been included in the weakness section.

EditLens Prediction: Fully human-written
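Since this review compares ReRe with GRPO-based work such as LatentR3, here is a minimal sketch of the group-relative advantage computation that GRPO-style methods build on: rewards of candidates sampled for the same prompt are normalized by the group mean and standard deviation. This is a generic illustration under assumed names; the paper's exact objective (clipping, KL terms, etc.) may differ.

```python
# Group-relative advantage as used by GRPO-style methods: each candidate's
# reward is standardized against the other candidates sampled for the same
# prompt. Purely illustrative; variable names are hypothetical.
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each candidate's reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Rewards for, say, four beam-searched recommendations for one user prompt.
    print([round(a, 2) for a in group_relative_advantages([1.0, 0.0, 0.0, 0.5])])
```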
Reinforced Preference Optimization for Recommendation

Ratings: Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 3 (good) | Rating: 6 (marginally above the acceptance threshold) | Confidence: 4 (confident, but not absolutely certain)

Review: This paper proposes ReRe to fix two flaws of LLM-based generative recommenders: poor high-quality negative modeling and reliance on implicit rewards. ReRe uses constrained beam search to improve sampling efficiency and diversify hard negatives, and combines rule-based rewards with ranking rewards for finer supervision.

Strengths:
1. ReRe effectively addresses the two key flaws of LLM-based generative recommenders (insufficient high-quality negative modeling and reliance on implicit rewards) by integrating constrained beam search (to improve sampling efficiency and diversify hard negatives) with a combined reward (rule-based accuracy plus auxiliary ranking rewards), directly tackling the unique generation space and sparse supervision challenges of RLVR adaptation.
2. The study uses three real-world datasets (Amazon Toys, Amazon Industrial, Yelp) and compares ReRe with diverse baselines.
3. ReRe maintains robust performance across different backbone models (Qwen2.5-1.5B, Gemma-2B, Qwen2.5-7B) and initialization methods (Base, SFT).

Weaknesses:
1. There is a lack of training-time efficiency comparisons with other generative recommendation methods (e.g., those that do not use reinforcement learning) as well as with traditional methods.
2. The dataset information in Table 5 is not clearly described, and the experiments rely exclusively on relatively small-scale datasets.

Questions:
1. In Table 5, do the numbers for "Tran" refer to the number of interactions or the number of users?
2. Can this method be combined with generative recommendation approaches (e.g., TIGER)? A semantic ID–based generative paradigm appears more practically viable, whereas relying on text prompts may constrain inference efficiency.

EditLens Prediction: Lightly AI-edited
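As a rough illustration of the combined reward this review describes (rule-based accuracy plus an auxiliary ranking reward over beam candidates), the sketch below assigns a binary hit reward and adds a reciprocal-rank bonus when the hit appears in the beam. The weighting and the reciprocal-rank form are assumptions for illustration, not the paper's exact formula.

```python
# Sketch of combining a rule-based accuracy reward with a ranking-aware
# auxiliary reward over beam-searched candidates. The 0.5 weight and the
# reciprocal-rank bonus are illustrative assumptions.


def combined_rewards(candidates: list[str], target: str, rank_weight: float = 0.5) -> list[float]:
    """Each candidate gets a binary accuracy reward; the hit additionally
    receives a bonus that decays with its position in the ranked beam, so a
    correct item ranked 1st is worth more than one ranked 5th."""
    rewards = []
    for rank, item in enumerate(candidates, start=1):
        accuracy = 1.0 if item == target else 0.0
        ranking = (1.0 / rank) if accuracy > 0 else 0.0  # reciprocal-rank bonus
        rewards.append(accuracy + rank_weight * ranking)
    return rewards


if __name__ == "__main__":
    beam = ["Item B", "Item A", "Item C"]  # candidates in beam order
    print(combined_rewards(beam, target="Item A"))  # [0.0, 1.25, 0.0]
```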
Reinforced Preference Optimization for Recommendation

Ratings: Soundness: 3 (good) | Presentation: 3 (good) | Contribution: 3 (good) | Rating: 6 (marginally above the acceptance threshold) | Confidence: 3 (fairly confident; some parts of the submission or related work may not have been fully understood; math/other details were not carefully checked)

Review: This paper addresses two major limitations of Reinforcement Learning with Verifiable Rewards (RLVR) in generative recommendation models. First, the unique generative space often produces invalid or duplicate items, reducing sampling efficiency. Second, since most items receive identical zero rewards, the ranking supervision signals become sparse. To overcome these issues, the authors propose Reinforced Preference Optimization for Recommendation (ReRe). Specifically, ReRe introduces a constrained beam search mechanism to improve sampling efficiency and increase the diversity of hard negative samples. In addition, it supplements rule-based accuracy rewards with auxiliary ranking rewards, enabling finer-grained supervision.

Strengths:
1. ReRe effectively addresses the challenge of hard negative sampling by introducing constrained beam search.
2. The use of ranking rewards alleviates the limitations of binary rule-based supervision.
3. The experimental results demonstrate solid performance and validate the method's effectiveness.
4. The overall structure and logic of the paper are clear and well-organized.

Weaknesses:
1. For LLM-based recommender systems, prompt design is a crucial component, yet the paper does not discuss it.
2. Figure 2 fails to clearly illustrate ReRe's contribution to hard negative sampling; further clarification or revision is needed.
3. Although constrained beam search mitigates the generation of invalid or duplicate items, ReRe may still be biased toward popular items due to the long-tail distribution, potentially overlooking less popular ones.

Questions: See weaknesses.

EditLens Prediction: Lightly AI-edited
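For intuition about the constrained beam search that several reviews mention, the sketch below shows the catalog-trie idea in isolation: at each decoding step, only tokens that extend some valid item title are allowed, which prevents invalid items by construction. Tokenization, beam scoring, and model integration are omitted; names and data are hypothetical and this is not the paper's implementation.

```python
# Minimal sketch of constraining generation to valid item titles with a
# prefix trie: the allowed next tokens are exactly those that keep the
# partial generation inside the catalog. Illustrative only.


def build_trie(titles: list[list[str]]) -> dict:
    """Build a nested-dict trie over tokenized item titles; None marks a title end."""
    root: dict = {}
    for tokens in titles:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[None] = {}  # end-of-title marker
    return root


def allowed_next_tokens(trie: dict, prefix: list[str]) -> set:
    """Tokens that keep the partial generation inside the catalog (None = stop)."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()  # prefix has already left the catalog
        node = node[tok]
    return set(node.keys())


if __name__ == "__main__":
    catalog = [["lego", "star", "wars"], ["lego", "city"], ["yelp", "gift", "card"]]
    trie = build_trie(catalog)
    print(allowed_next_tokens(trie, ["lego"]))          # {'star', 'city'}
    print(allowed_next_tokens(trie, ["lego", "city"]))  # {None} -> title complete
```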