ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (67%) | 4.00 | 3.00 | 4001 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 1 (33%) | 4.00 | 4.00 | 2497 |
| Total | 3 (100%) | 4.00 | 3.33 | 3500 |
Review 1

Title: Symmetric Behavior Policy Optimization
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: The paper studies symmetric behavior regularization for offline RL within a BRPO-style framework and asks whether symmetric $f$-divergences admit an analytic optimal policy $\pi^*$ suitable for target matching. The authors prove that for common symmetric divergences (Jeffreys, Jensen–Shannon, GAN) no closed-form $\pi^*$ exists, and that naively using symmetric losses can be numerically unstable with finite-support policies. As a remedy, they Taylor-expand the $f$-divergence into a $\chi^n$ series and truncate at small order ($N < 5$), which yields an analytic surrogate policy (closed form for $N = 2$) and a practical algorithm, Sf-AC: advantage regression plus a truncated conditional-symmetry term with ratio clipping and a truncation-error bound (a small numerical sketch of this truncation follows the review). Experiments on a Mixture-of-Gaussians fit and 9 D4RL MuJoCo tasks show competitive and failure-robust performance; the JS/Jeffreys variants are frequently top-3 by AUC.

Strengths:
- Clear theory: explains why standard symmetric divergences do not yield an analytic $\pi^*$ and exposes the resulting instability.
- Principled workaround: the $\chi^n$ expansion recovers a usable surrogate for $N \le 4$, with a simple closed form for $N = 2$.
- Practical loss: decomposes into advantage regression plus a symmetry expansion, with clipped ratios and an error bound.
- Empirical robustness: strong results on D4RL with ablations over series order and clipping; resilient under failure cases.

Weaknesses:
- Approximation vs. exact symmetry: truncation introduces bias; the bias/variance/stability trade-off could be analyzed more deeply.
- Benchmark scope: limited to MuJoCo; harder OOD domains (AntMaze, Kitchen, Adroit-relocate) would stress-test the method.
- Behavior-policy access: guidance is light on what to do when $\pi_D$ (or density ratios) are poorly estimated.
- No end-to-end convergence or error-propagation guarantees under function approximation for the combined objective.

Questions:
- Sensitivity to the series order $N_{\text{loss}}$ and the analytic-policy order $N$ (e.g., 2 vs. 3–4)? Any task-dependent guidance?
- Wall-clock time and memory cost as functions of $N_{\text{loss}}$ and the clipping threshold $\epsilon$?
- Robustness when behavior coverage is narrow or multimodal; are there diagnostics for ratio misestimation?
- Could uncertainty signals (critic ensembles) adapt $\epsilon$ or the series coefficients to avoid rare failures?
- Any caveats for discrete/bounded actions beyond clipping (e.g., projection effects)?

EditLens Prediction: Fully AI-generated
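The $\chi^n$-series truncation summarized in this review can be illustrated numerically. The following is a minimal sketch written for this page (not code from the paper): it Taylor-expands a Jensen–Shannon-style generator $f$ around $t = 1$ and compares the truncated series $\sum_{n=2}^{N} \frac{f^{(n)}(1)}{n!}\,\mathbb{E}_{\pi_D}\!\big[(\pi/\pi_D - 1)^n\big]$ with the exact divergence on a toy discrete example; the generator convention and the toy distributions are assumptions made here for illustration.

```python
import numpy as np
import sympy as sp

# Jensen-Shannon written as an f-divergence, D_f(p || q) = E_q[f(p / q)]
# (a common convention; assumed here, not taken from the paper).
t = sp.symbols("t", positive=True)
f_js = sp.Rational(1, 2) * (t * sp.log(2 * t / (t + 1)) + sp.log(2 / (t + 1)))

# chi^n series: Taylor coefficients of f around t = 1 (f(1) = f'(1) = 0 here).
N = 4  # truncation order for this sketch
coeffs = [sp.diff(f_js, t, n).subs(t, 1) / sp.factorial(n) for n in range(2, N + 1)]

# Toy discrete stand-ins for the policy pi (p) and the behavior policy pi_D (q).
p = np.array([0.50, 0.30, 0.20])
q = np.array([0.40, 0.35, 0.25])
r = p / q  # density ratio pi / pi_D

exact = sum(float(qi) * float(f_js.evalf(subs={t: float(ri)})) for qi, ri in zip(q, r))
truncated = sum(
    float(c) * float(np.sum(q * (r - 1.0) ** n))  # coefficient times chi^n term
    for n, c in zip(range(2, N + 1), coeffs)
)

print(f"exact Jensen-Shannon value     : {exact:.6f}")      # ~0.005138
print(f"chi-series truncated at N = {N} : {truncated:.6f}")  # ~0.005143
```

On this toy example the order-4 truncation already matches the exact symmetric divergence to about 1e-5, which is the kind of behavior the review's "truncation introduces bias" remark is weighing against exact symmetry.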
Review 2

Title: Symmetric Behavior Policy Optimization
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary: This paper addresses a clear theoretical and practical gap in behavior-regularized offline reinforcement learning (BRPO). While existing approaches rely almost exclusively on asymmetric divergences (e.g., KL or $\chi^2$) that induce strong mode-seeking bias, this work systematically investigates the use of symmetric $f$-divergences as regularizers, an area that has been largely unexplored in prior literature. The authors prove that symmetric BRPO generally lacks analytic optimal policies because $f'(t)$ is not affine in $\ln t$, and they propose a principled approximation based on finite $\chi$-series truncation. This yields the Symmetric f-Actor-Critic (Sf-AC) algorithm, which preserves convexity, maintains bounded approximation error, and provides a balanced compromise between mode-seeking and mode-covering behaviors. Empirical results on D4RL MuJoCo benchmarks show consistent robustness and per-environment stability improvements over strong baselines (IQL, SQL, AWAC, XQL). However, the empirical evaluation lacks sufficient quantitative evidence to clearly demonstrate the effectiveness of the proposed symmetric regularization. While the results suggest improved robustness, the paper does not include concrete numerical comparisons or diagnostic analyses against existing offline RL approaches that would clarify why symmetric regularization helps. In particular, additional metrics illustrating the robustness deficiencies of conventional mode-seeking or mode-covering methods would make the contribution more complete and convincing. If such quantitative analyses or diagnostic metrics are provided during the rebuttal period, I would be highly willing to raise my score.

Strengths:
* **Theoretical novelty**: The paper fills a clear theoretical gap in behavior-regularized offline RL by systematically exploring the use of symmetric $f$-divergences. This is a novel perspective that extends the well-established BRPO framework beyond its asymmetric KL- or $\chi^2$-based formulations and highlights an under-examined dimension of regularization geometry.
* **Practical workaround through principled approximation**: The authors introduce a mathematically grounded yet practical approximation via finite $\chi$-series truncation and Taylor expansion of the conditional symmetry term. This approach preserves convexity and boundedness while providing a tractable implementation, making the otherwise intractable symmetric regularization feasible without major instability.

Weaknesses:
* **Lack of rigorous comparative analysis**: The paper mainly presents performance plots (e.g., Figure 3) showing only average returns across environments, without quantitative summaries such as mean ± confidence intervals. This makes it difficult to verify whether the improvements are **statistically significant** or whether the baseline implementations are correctly reproduced. In particular, the reported baseline performances on MuJoCo tasks appear unexpectedly low, raising concerns about possible implementation discrepancies or hyperparameter mismatches. Providing detailed numerical tables with variance statistics would make the comparisons more reliable and transparent.
* **Insufficient diagnostic metrics for robustness**: The claim of “robust performance” remains qualitative. In offline RL, robustness is often better captured by additional statistics such as CVaR or worst-case return rather than mean performance alone. Including such risk-sensitive or tail-distribution metrics would substantially strengthen the claim that symmetric regularization mitigates the brittleness of conventional mode-seeking methods.
* **Need for deeper analysis contrasting mode-seeking and symmetric behaviors**: While prior works have demonstrated the empirical benefits of using symmetric divergences in $D_{opt}$, this paper claims that applying symmetry directly to the regularizer $D_{reg}$ offers additional advantages. Although the paper clearly articulates the motivation for symmetric regularization, it does not sufficiently highlight the limitations of existing mode-seeking regularization that the proposed method intends to address. In offline RL, purely mode-seeking approaches are often regarded as sufficient for ensuring stability and avoiding out-of-distribution (OOD) actions. Introducing more mode-covering behavior through symmetry could, in principle, encourage exploration into unsupported regions of the dataset, potentially harming performance. Therefore, a deeper analysis or empirical evidence clarifying when and why symmetric regularization provides tangible benefits over purely mode-seeking objectives would make the paper more convincing and theoretically grounded.

Questions:
* **Quantitative Comparison**: Could the authors provide detailed numerical results (e.g., mean ± confidence intervals) for each baseline, especially on the MuJoCo tasks where baseline performance appears unusually low, to confirm that the reported improvements are statistically significant and reproducible?
* **Robustness Evaluation**: Beyond average returns, can the authors include additional robustness-oriented metrics (e.g., CVaR, percentile, or worst-case returns) to substantiate the claim of “robust performance” and demonstrate that symmetric regularization indeed improves stability over mode-seeking methods? (A minimal sketch of such metrics follows this review.)
* **Clarification on $D_{reg}$ vs. $D_{opt}$ Effectiveness**: Since prior works have already shown benefits of symmetric divergence when used in $D_{opt}$, could the authors isolate or ablate the contribution of using symmetry in $D_{reg}$ to clarify when and why this leads to tangible gains over purely mode-seeking regularization?

EditLens Prediction: Fully AI-generated
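As a concrete illustration of the robustness metrics requested in this review, here is a minimal sketch written for this page (not from the paper) of how lower-tail CVaR and worst-case return could be reported alongside the mean; the per-seed returns and the alpha = 0.25 level are hypothetical placeholders.

```python
import numpy as np

def lower_tail_cvar(returns: np.ndarray, alpha: float = 0.25) -> float:
    """Mean of the worst alpha-fraction of returns (lower-tail CVaR)."""
    k = max(1, int(np.ceil(alpha * returns.size)))
    return float(np.sort(returns)[:k].mean())

# Hypothetical per-seed normalized returns for one method on one task;
# a single failed seed (12.3) is masked by the mean but exposed by the tail metrics.
returns = np.array([78.1, 80.4, 75.9, 12.3, 79.0, 77.5, 81.2, 74.8])

print(f"mean return       : {returns.mean():.1f}")           # 69.9
print(f"worst-case return : {returns.min():.1f}")             # 12.3
print(f"CVaR (alpha=0.25) : {lower_tail_cvar(returns):.1f}")  # mean of the worst 2 seeds
```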
Review 3

Title: Symmetric Behavior Policy Optimization
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary: This paper presents an interesting perspective on behavior-regularized policy optimization: using symmetric divergences, instead of asymmetric divergences, as the implementation of the regularization. The difficulty of using a symmetric divergence is twofold: 1) common symmetric divergences do not admit analytical solutions for the optimal policy; and 2) numerical issues can occur when dealing with a finite-support distribution. To tackle these challenges, the paper introduces two techniques, both based on Taylor expansion: first, to derive a closed-form optimal policy, the divergence in the RL objective is truncated to second order; second, the divergence used in the policy improvement step is truncated to improve numerical stability. Some other techniques are also included in the proposed Sf-AC algorithm. The empirical evaluations are conducted on D4RL, and Sf-AC does demonstrate improvement over previous methods such as SQL, IQL, and XQL.

Strengths:
1. The presentation of the overall idea is clear to me.
2. The literature review is comprehensive. I especially appreciate the discussion in the Appendix, which covers both policy regularization and distribution matching and elucidates why the problem studied in this paper is unique and has not been addressed by previous literature.
3. The proposed algorithm appears robust and does not require per-task hyperparameter tuning (Tables 4, 5, 6).

Weaknesses: Although the paper briefly discusses the issue with asymmetric divergence in the introduction, it would be more appropriate to formally define the problems of using asymmetric divergence somewhere between Sections 2 and 3. In lines 51-53, it is unclear to me how a symmetric divergence solves the issue of multiple minimum points caused by the capacity of the policy function class. In short, the current version of the draft does not provide sufficient motivation for me to switch from an asymmetric divergence to a symmetric one. I would consider raising my score if the authors can make this clear in the rebuttal.

Questions:
1. In equation (8), the first term uses weights proportional to the exponential of the advantage, while in line 310 the optimal policy is defined using $[1 + A/\tau]_+$. What causes the mismatch? (The standard KL vs. $\chi^2$ closed forms are sketched after this review for context.)
2. The value of $\epsilon$ varies from 0.2 to extremely large values like 100. Is it really necessary to include this hyperparameter?
3. Copy-edits: In Algorithm 1, when defining the advantage regression, the subscript of $V$ should be $\phi$.

EditLens Prediction: Fully human-written
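For context on the first question (exponential-of-advantage weights vs. $[1 + A/\tau]_+$): these are the standard BRPO closed forms under two different regularizers, sketched below from the usual per-state first-order conditions. This is background written for this page, not a derivation from the paper, and the constants depend on the chosen generator scaling.

```latex
% Stationarity of  max_pi E_pi[A] - tau * E_{pi_D}[ f(pi/pi_D) ]  gives
% f'(pi/pi_D) = A/tau + const; A is understood to absorb a state-dependent
% shift enforcing normalization, and [x]_+ = max(x, 0).
\begin{align*}
  \text{KL } \big(f(t) = t\ln t\big):\;
    f'(t) = 1 + \ln t \;\;(\text{affine in } \ln t)
    &\;\Rightarrow\;
    \pi^*(a\mid s) \propto \pi_D(a\mid s)\,\exp\!\big(A(s,a)/\tau\big), \\
  \chi^2 \big(f(t) = (t-1)^2\big):\;
    f'(t) = 2(t-1) \;\;(\text{affine in } t)
    &\;\Rightarrow\;
    \pi^*(a\mid s) = \pi_D(a\mid s)\,\big[1 + A(s,a)/(2\tau)\big]_+ .
\end{align*}
```

One common source of such a mismatch is pairing a KL-derived (AWR-style) exponential weight with a $\chi^2$-derived target policy; whether that is what happens between equation (8) and line 310 of the paper is for the authors to clarify.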