ICLR 2026 - Reviews



Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (33%) | 6.00 | 2.00 | 1692 |
| Fully human-written | 2 (67%) | 3.00 | 3.50 | 2922 |
| Total | 3 (100%) | 4.00 | 3.00 | 2512 |
Reviews
Title: Reward Shaping Control Variates for Off-Policy Evaluation Under Sparse Rewards

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper addresses the critical failure of standard off-policy evaluation (OPE) estimators, such as Importance Sampling (IS) and Doubly Robust (DR), which suffer from prohibitively high variance in sparse-reward environments. The authors introduce Reward-Shaping Control Variates (RSCV), a new class of unbiased estimators. The core idea is to leverage policy-invariant potential-based reward shaping (PBRS) to construct a new, provably zero-mean control variate. The authors provide a practical algorithm to learn the potential function $\Phi$ and the optimal weight $\lambda$ directly from data by maximizing variance reduction. Experiments on a chain MDP, a cancer simulator, and the ICU-Sepsis benchmark show that this method reduces variance and Mean Squared Error (MSE) significantly compared to standard baselines.

Strengths:
- The method directly targets a well-known and significant limitation of OPE, high variance in sparse-reward settings, which is a common feature of real-world applications such as healthcare.
- The experimental results demonstrate orders-of-magnitude improvements in variance and MSE over PDIS, DR, and MRDR in sparse-reward tasks. The method is also shown to be more robust to reward noise.

Weaknesses:
- The method's success depends on learning a good potential function $\Phi$. The experiments are on tabular or low-dimensional (4-47 features) state spaces. How well does the learning algorithm for $\Phi$ (Algorithm 1) scale to high-dimensional problems (e.g., images)?
- There is a contradiction in the ICU-Sepsis setup. Section 5.3 states $\pi_b$ is a generated PPO policy, while Appendix F.2 says it is the empirical clinician policy (Line 895). Which was it?

Questions:
See Weaknesses.

EditLens Prediction: Lightly AI-edited
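To make the RSCV construction summarized in the review above concrete, here is a minimal sketch assuming tabular policies and a fixed potential table. It is not the authors' implementation, and the exact form of the paper's zero-mean control variate may differ; the sketch only illustrates the general shape of a per-decision importance-sampling estimate corrected by a shaping-based term with weight $\lambda$.

```python
# Minimal sketch (illustrative assumptions, not the authors' code): a PDIS
# estimate corrected by a potential-based shaping term used as a control
# variate with weight `lam`. `pi_e`, `pi_b` are action-probability tables
# indexed as pi[s][a]; `phi` is a tabular potential. The paper learns phi and
# lam from data; here they are simply given.
import numpy as np

def rscv_estimate(trajectories, pi_e, pi_b, phi, gamma=0.99, lam=1.0):
    """trajectories: list of episodes, each a list of (s, a, r, s_next) tuples."""
    per_episode = []
    for episode in trajectories:
        w = 1.0       # cumulative importance weight W_t
        pdis = 0.0    # per-decision importance-sampling return
        cv = 0.0      # shaping-based control-variate term
        for t, (s, a, r, s_next) in enumerate(episode):
            w *= pi_e[s][a] / pi_b[s][a]
            pdis += (gamma ** t) * w * r
            # PBRS increment F_t = gamma * phi(s') - phi(s); the paper proves
            # a zero-mean property for its variate, which is not shown here.
            cv += (gamma ** t) * w * (gamma * phi[s_next] - phi[s])
        per_episode.append(pdis - lam * cv)
    return float(np.mean(per_episode))
```

With `lam = 0` this reduces to plain PDIS, which makes it easy to compare variances of the corrected and uncorrected estimates on the same data.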
Title: Reward Shaping Control Variates for Off-Policy Evaluation Under Sparse Rewards

Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper tackles the challenge of performing off-policy evaluation for sparse-reward tasks by proposing reward-shaping control variates (RSCV). The method uses a potential-based reward shaping technique so that the optimal policy is preserved under the shaped MDP. It defines a learnable additional random variable that has zero mean under the behavior-policy distribution, and learns this random variable via a potential network for variance reduction. The work is evaluated on medical treatment benchmarks.

Strengths:
- The method is supported by theoretical proofs, showing that the modification added to the reward does not change the optimal action distribution (Theorem 1) and remains unbiased (Theorem 2). Theorem 3 and Corollary 1 bound the variance.
- The paper discusses the difference between the proposed method and related works, DR and MRDR (Section 4.3). The discussion highlights that RSCV mainly relies on learning the potential function, while the related works focus on learning the action-value estimate.
- The paper performs a sanity check in a clear tabular environment. A simple tabular environment provides a clear demonstration of RSCV's effectiveness as the sample size increases.
- The paper considers multiple evaluation metrics, including bias, variance, mean squared error, and effective sample size, supporting the theoretical results on bias and variance.

Weaknesses:
- The method is limited to discrete action spaces. For datasets with continuous action spaces, the paper discretizes the space (indicated in F.2). Discretization introduces information loss because different actions may be mapped to the same discrete representation. In addition, the choice of discretization parameters, such as the number of bins, can significantly affect performance and stability.
- Fitted Q evaluation (FQE) is used as the action-value estimator in the DR baseline; however, it is not a strong baseline choice, especially when dataset coverage is limited. FQE does not properly handle out-of-distribution action samples in bootstrapping, which can cause inaccurate value estimates even after many training iterations.
- The paper discusses marginalized estimators such as DualDICE and GenDICE in the related-work section but does not include them in the empirical comparison. It may be worth checking their performance as well.
- Experimental validation is conducted on only three datasets, one of which serves as a sanity check. Evaluating the method on a broader range of benchmarks would make the empirical results more convincing.
- The method introduces a weighting parameter ($\lambda$) for the learnable control-variate term but lacks a sensitivity analysis on it. It would be good to discuss how the weight affects performance.

Questions:
I have the following two questions:
1. To optimize the model, the paper sets up training with k-fold validation. K-fold is an effective way to prevent overfitting, but at the cost of longer computation time and larger computational resources. Could the authors please further explain the computational overhead introduced by k-fold here? With a large dataset, is it recommended to further reduce K? In the current experiments, is the difference large when running with different K's?
2. Could the authors please explain what the error bars in the figures represent? Is it the standard deviation?

I would be happy to discuss further if I have misunderstood any part of the method or experimental setup.

EditLens Prediction: Fully human-written
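For context on the weighting parameter $\lambda$ raised in the review above: with an unbiased estimator $X$ and a zero-mean control variate $C$, the classical variance-minimizing coefficient follows from a one-line calculation. Whether the paper's learned $\lambda$ targets exactly this quantity is an assumption here, not something confirmed by the reviews.

```latex
% Classical control-variate result (standard; its correspondence to the
% paper's \lambda^* is an assumption, not confirmed by the reviews).
\operatorname{Var}(X - \lambda C)
  = \operatorname{Var}(X) - 2\lambda\,\operatorname{Cov}(X, C) + \lambda^{2}\operatorname{Var}(C),
\qquad
\lambda^{*} = \frac{\operatorname{Cov}(X, C)}{\operatorname{Var}(C)},
\qquad
\operatorname{Var}(X - \lambda^{*} C) = \bigl(1 - \rho_{X,C}^{2}\bigr)\operatorname{Var}(X).
```

One consequence relevant to the sensitivity question: assuming $\operatorname{Cov}(X, C) > 0$, any $\lambda \in (0, 2\lambda^{*})$ still yields $\operatorname{Var}(X - \lambda C) < \operatorname{Var}(X)$, so moderate misestimation of the weight degrades the benefit gradually rather than breaking unbiasedness.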
Title: Reward Shaping Control Variates for Off-Policy Evaluation Under Sparse Rewards

Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper proposes an unbiased, low-variance off-policy evaluation estimator based on reward-shaping control variates. The authors leverage the property that state-based shaping potential functions do not alter the optimal policy to inject knowledge into sparse-reward environments, where conventional methods such as DR and MRDR can fail due to extreme sparsity. The authors prove that their RSCV estimator strictly reduces variance with an optimal $\lambda^*$.

Strengths:
- The paper nicely combines reward shaping and off-policy evaluation and clearly leverages the idea that potential-based shaping does not shift the optimal policy of an MDP.
- The paper surveys existing approaches such as DR/MRDR on this problem to better position the proposed RSCV.
- The presentation is clear and understandable. The math is not deep, but it is sufficient to illustrate the advantage of RSCV.
- Sepsis and cancer as testbeds are standard, and the results seem convincing.

Weaknesses:
Disclaimer: I am not an expert in off-policy evaluation. My concerns are mainly technical.
- Definition 1 defines potential-based reward shaping as $F(s, a, s') = \gamma \phi(s') - \phi(s)$. However, no action appears on the RHS. Do the authors intend an extended, action-based PBRS? In line 244 the authors write: "$\Phi$ captures progress towards reward or risk of failure, so that the differences $\gamma\Phi(s_{t+1}) - \Phi(s_t)$ absorb predictable structure in the return", so the potential seems to depend solely on states. It is unclear to me why a state-based potential could capture structure of a reward that depends on actions.
- Line 183: I don't understand why the two sides are equal under $W_t = \prod_k^t \frac{\pi_e(a_k|s_k)}{\pi_b(a_k|s_k)}$. Can you show more details?
- Line 48 claims that marginalized estimators like DualDICE remain brittle due to the reward support. But don't DICE methods estimate state marginals that do not depend on actions?
- Line 160 claims that PBRS guarantees the evaluation policy's value is estimated consistently with the reward structure. But have the authors verified that this holds (approximately) with a parametrized $\Phi_{\beta}$? How do you guarantee the assumption $\mathbb{E}_{\pi_b}[C^{\Phi}] = 0$ given a learned $\Phi_{\beta}$? If it holds only approximately, do you have any insight into how that affects the results?

Questions:
Please refer to Weaknesses.

EditLens Prediction: Fully human-written
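Two standard identities may help frame the first two weaknesses in the review above; they are textbook facts, not statements about the paper's specific construction. The discounted, state-based PBRS increments telescope (the basis of the Ng et al. (1999) policy-invariance argument), and cumulative importance weights change the measure from $\pi_b$ to $\pi_e$.

```latex
% Telescoping of discounted, state-based PBRS increments:
\sum_{t=0}^{T-1} \gamma^{t}\bigl(\gamma\,\Phi(s_{t+1}) - \Phi(s_{t})\bigr)
  \;=\; \gamma^{T}\Phi(s_{T}) - \Phi(s_{0}).

% Change of measure under cumulative importance weights
% W_t = \prod_{k=0}^{t} \pi_e(a_k \mid s_k) / \pi_b(a_k \mid s_k),
% valid whenever \pi_b covers the support of \pi_e:
\mathbb{E}_{\pi_b}\!\bigl[\, W_t\, f(s_0, a_0, \dots, s_t, a_t) \,\bigr]
  \;=\; \mathbb{E}_{\pi_e}\!\bigl[\, f(s_0, a_0, \dots, s_t, a_t) \,\bigr].
```

For a fixed horizon, the weighted shaping sum therefore has an expectation determined only by the boundary terms $\Phi(s_0)$ and $\gamma^{T}\Phi(s_T)$; how the paper turns this into an exactly zero-mean variate is not reproduced here.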