ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction   | Count    | Avg Rating | Avg Confidence | Avg Length (chars)
Fully AI-generated    | 1 (33%)  | 8.00       | 4.00           | 1819
Heavily AI-edited     | 0 (0%)   | N/A        | N/A            | N/A
Moderately AI-edited  | 0 (0%)   | N/A        | N/A            | N/A
Lightly AI-edited     | 2 (67%)  | 7.00       | 4.50           | 2994
Fully human-written   | 0 (0%)   | N/A        | N/A            | N/A
Total                 | 3 (100%) | 7.33       | 4.33           | 2602
Title: Safe Exploration via Policy Priors

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.

Summary:
This paper introduces SOOPER, a model-based reinforcement learning algorithm designed for safe exploration in constrained Markov decision processes. The method leverages conservative policy priors, derived from offline data or simulations, to ensure safety while exploring optimistically to improve performance. Theoretical guarantees are established for both safety and sublinear cumulative regret. The experiments demonstrate strong empirical performance on several benchmarks and real-world robotic systems.

Strengths:
1. The paper is easy to follow, and the motivations behind the proposed algorithm are clearly presented. It provides a well-motivated discussion of the need for safe exploration in reinforcement learning, emphasizing the trade-off between conservatism and exploration.
2. The relevant literature is well covered, and I did not find missing references.
3. The proposed algorithm is technically sound.
4. The theoretical results are rigorous, providing both safety guarantees and a sublinear cumulative regret bound. As far as I know, these results are novel.
5. The experimental section is comprehensive, including evaluations on RWRL, SafetyGym, and a real hardware setup. The proposed method consistently outperforms or matches strong baselines such as SAILR, CRPO, and Primal-Dual.
6. The implementation based on MBPO demonstrates that the proposed algorithm can scale to high-dimensional continuous control tasks. MBPO is known to be highly scalable, which makes it a good foundation for a practical version of SOOPER.

Weaknesses:
1. The theoretical analysis relies on several strong assumptions (e.g., Gaussian noise, Lipschitz continuity, bounded RKHS norms, and the existence of a pessimistic policy prior that satisfies safety for all plausible dynamics). While such assumptions appear elsewhere in the safe RL literature, they may limit the practical applicability of the theoretical results.
2. Although the paper claims that SOOPER can be implemented on top of standard model-based methods, the overall system involves several non-trivial components (uncertainty estimation, pessimistic value computation, termination logic). A short discussion of the computational overhead or design trade-offs would be helpful.

Questions:
1. I consider the assumption of treating state transitions in the form of equation (1), together with Assumptions 1–3, to be very strong. Could the authors share their views on the strength of this set of assumptions and any plans to relax them in future research?
2. To what extent do the assumptions made in the paper hold in the experimental setting? While the paper provides a sound theoretical framework, it is somewhat unclear whether all of these assumptions are necessary in practice. It would strengthen the paper to discuss or empirically examine how SOOPER behaves when some of them do not hold. Since the implementation builds on MBPO to learn the dynamics model, the method should, in practice, already be capable of handling moderately stochastic transitions; confirming this empirically, by evaluating SOOPER under relaxed or partially violated assumptions, would provide valuable evidence for the robustness of the approach.
I do not intend to criticize the fact that some assumptions may not hold; rather, if the proposed method still functions effectively as a kind of meta-algorithm even when certain assumptions are violated, that would further enhance the value of this paper.

EditLens Prediction: Lightly AI-edited
Title: Safe Exploration via Policy Priors

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper investigates the use of a conservative policy prior to enhance safe exploration in reinforcement learning (RL). The proposed method, SOOPER, performs optimistic exploration when safety can be guaranteed with high probability, and switches to the conservative policy prior when exploration becomes potentially unsafe (a minimal illustrative sketch of this switching rule follows the review). The authors provide theoretical guarantees for both high-probability constraint satisfaction and optimality. Extensive experiments demonstrate that SOOPER outperforms state-of-the-art baselines across various benchmark tasks.

Strengths:
1. While the combination of offline training and online exploration is not entirely new, the paper demonstrates novelty through its theoretical development, particularly in providing safety and optimality guarantees.
2. The empirical evaluations are comprehensive and effectively support the theoretical findings, illustrating the method's applicability to practical scenarios.
3. The paper is well written and easy to follow, even in the theoretical sections, which are presented with clarity and coherence.

Weaknesses:
1. In the Introduction, the authors state that their theoretical results hold under regularity assumptions, and in the "Optimality" subsection of Related Work, they claim to relax some assumptions from prior studies. However, in Section 3 (Problem Setting), Assumption 1 regarding Gaussian noise appears rather restrictive. Moreover, the assumption that the transition dynamics follow \( s_{t+1} = f(s_t, a_t) + \omega_t \) is quite strong; a more general formulation such as \( s_{t+1} = f(s_t, a_t, \omega_t) \) might be preferable. As far as I understand, these assumptions are stronger than those used in [1].
2. The proposed simulated exploration strategy is model-based, which may limit its applicability to environments with large or complex state spaces. It would be valuable for the authors to discuss these limitations and possible extensions to model-free settings.

[1] Akifumi Wachi et al., "Safe exploration in reinforcement learning: A generalized formulation and algorithms," NeurIPS 2023.

Questions:
1. The proposed method relies heavily on the set of plausible models. In general, as the amount of data increases, the model can be refined and improved. How does such model improvement affect the theoretical guarantees presented in this paper? I did not find a clear discussion of this point.
2. Could the authors clarify why CRPO fails in the safe online learning experiment on real hardware?

EditLens Prediction: Lightly AI-edited
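As a reading aid, here is a minimal sketch of the switching behavior described in the summary above. It assumes a pessimistic (worst-case over the plausible-model set) cost estimate and a remaining safety budget are available; all names (`optimistic_policy`, `prior_policy`, `pessimistic_cost_to_go`, `budget_remaining`) are hypothetical and do not reflect the paper's actual implementation or API.

```python
def select_action(state, optimistic_policy, prior_policy,
                  pessimistic_cost_to_go, budget_remaining):
    """Illustrative switching rule (not the paper's implementation):
    explore optimistically only while a pessimistic cost estimate
    certifies that the safety budget cannot be exceeded; otherwise
    fall back to the conservative policy prior."""
    a_explore = optimistic_policy(state)
    # Worst-case cumulative cost of the exploratory action under all
    # plausible dynamics models (hypothetical helper).
    worst_case_cost = pessimistic_cost_to_go(state, a_explore)
    if worst_case_cost <= budget_remaining:
        return a_explore        # safety certified: explore optimistically
    return prior_policy(state)  # otherwise: execute the safe policy prior
```

The sketch only illustrates the optimism-with-pessimistic-fallback pattern the reviewer describes; whether SOOPER certifies safety per action, per rollout, or per episode is a design detail of the paper itself.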
Title: Safe Exploration via Policy Priors

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper proposes SOOPER (Safe Online Optimism for Pessimistic Expansion in RL), a model-based algorithm for safe exploration in continuous CMDPs. It leverages pessimistic policy priors, which are safe but suboptimal policies learned offline or in simulation, to ensure constraint satisfaction, while using probabilistic world models for optimistic exploration. The authors prove that SOOPER satisfies safety constraints at all times (Theorem 1) and achieves sublinear cumulative regret (Theorem 2), unlike prior methods that only guarantee asymptotic safety or simple-regret bounds. The analysis couples online cost-tracking with a "planning MDP with pessimistic termination reward," allowing standard RL methods to be used while maintaining guarantees (a rough sketch of this construction is given after the review). Empirical validation on SafetyGym, RWRL, and real robots shows consistent safety and performance gains.

Strengths:
+ Strong analytical results, combining always-safe learning with sublinear cumulative regret.
+ The pessimistic-termination MDP approach converts a constrained problem into a standard RL one.
+ Unification of optimism, pessimism, and expansion through a single intrinsic-reward objective.
+ Empirically validated on diverse continuous-control tasks and real hardware, not only simulators.

Weaknesses:
- The proof seems to implicitly assume Lipschitz continuity of the uncertainty estimate $\sigma_n$, which is not formally stated. I think this assumption is required for several bounds.
- Assumption 4 requires the prior policy to be at least as safe as the optimal one for all states, which seems very strong.

Questions:
1. How sensitive is the safety guarantee to imperfect calibration of $\sigma_n$?
2. Could Assumption 4 be relaxed to hold in expectation over $\rho_0$?
3. How would the analysis change if the policy prior were stochastic, or only approximately safe in a state-dependent sense?

EditLens Prediction: Fully AI-generated
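To make the "planning MDP with pessimistic termination reward" idea the reviewer highlights more concrete, below is a rough sketch, assuming that planning rollouts in the learned model are cut short once a pessimistic safety check fails and are then credited with a pessimistic value of the safe policy prior. The function names (`model`, `safety_ok`, `prior_value_pessimistic`) are illustrative assumptions, not the paper's API.

```python
def planning_return(model, policy, safety_ok, prior_value_pessimistic,
                    s0, horizon, gamma=0.99):
    """Illustrative rollout objective for a planning MDP with pessimistic
    termination: a standard discounted return, except that a failed
    pessimistic safety check ends the rollout with the (pessimistic)
    value of the safe policy prior as a terminal reward."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        if not safety_ok(s, a):  # pessimistic safety check fails
            total += discount * prior_value_pessimistic(s)
            break                # terminate: hand control back to the prior
        s, r = model(s, a)       # learned dynamics and reward
        total += discount * r
        discount *= gamma
    return total
```

Because the termination reward is itself pessimistic, maximizing this return with a standard RL planner never over-credits unsafe exploration, which is presumably what lets the constrained problem be treated as an unconstrained one in the analysis.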