ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (33%) | 6.00 | 3.00 | 1390 |
| Fully human-written | 2 (67%) | 5.00 | 3.50 | 3812 |
| Total | 3 (100%) | 5.33 | 3.33 | 3005 |
Review 1

Title: Efficient and Sharp Off-Policy Learning under Unobserved Confounding
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
EditLens Prediction: Lightly AI-edited

The paper derives closed-form sharp bounds on the policy value under the MSM, together with a semiparametrically efficient estimator. The approach avoids unstable minimax/IPW formulations and establishes guarantees for learning the optimal confounding-robust policy.

Strengths:
- Closed-form sharp bound on the policy value under the MSM
- One-step bias-corrected estimator that attains the efficiency bound
- Learning guarantees with respect to the optimal confounding-robust policy

Weaknesses:
- The EIF and one-step estimator rely on quantiles $F_{x,a}^{-1}(\alpha_{\pm})$. You do not state standard conditions ensuring pathwise differentiability (see the sketch after this review for the MSM form I have in mind).
- You claim the estimator “is semi-parametrically efficient” and point to D.2, which provides an influence function expression and cites a chain-rule lemma. But you never identify the canonical gradient in the nonparametric model, nor verify that your influence function equals it.
- Theorem 4.4 needs a uniform bound, but the current version is pointwise in $\pi$.
- The nuisance $\eta$ in (14) includes the quantiles, but your EIF contains terms like $(\Delta-\alpha)F^{-1}$ that rely on differentiability of these nuisances. Please clarify.
- Before Theorem 4.4, you write “parametric policy classes (e.g., neural networks) have vanishing $R_n(\Pi)\in O(n^{-1/2})$”. For neural networks, this needs norm constraints; otherwise $R_n$ need not decay at the root-$n$ rate.
- In Algorithm 1, Step 6 says “Estimate $V^{+,*}$ as in (2)”, but (2) just defines the propensity.
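For concreteness, the MSM constraint I have in mind is the standard Tan (2006) form, written in my own notation rather than quoted from the paper. It bounds the odds ratio between the nominal propensity $e_a(x)$ and the complete propensity $e_a(x,u)$:

$$
\Gamma^{-1} \;\le\; \frac{e_a(x)\,\bigl(1 - e_a(x,u)\bigr)}{\bigl(1 - e_a(x)\bigr)\, e_a(x,u)} \;\le\; \Gamma .
$$

In this literature the induced sharp bounds typically evaluate conditional outcome quantiles at levels such as $\alpha_- = 1/(1+\Gamma)$ and $\alpha_+ = \Gamma/(1+\Gamma)$, which I presume is the role of $F_{x,a}^{-1}(\alpha_{\pm})$ above; conditions for pathwise differentiability of such quantile functionals should be stated explicitly.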
Review 2

Title: Efficient and Sharp Off-Policy Learning under Unobserved Confounding
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
EditLens Prediction: Fully human-written

This paper considers the problem of unobserved confounding in offline policy learning. The authors assume that the unobserved confounding satisfies the marginal sensitivity model (Tan, 2006), which is often used in the sensitivity analysis literature (Aronow and Lee 2013, Miratrix et al. 2018, Zhao et al. 2019, Yadlowsky et al. 2018, Kallus et al. 2018, Kallus and Zhou 2020).

Strengths:
- The paper is clearly written and easy to understand.
- The main contribution of this paper is a semiparametrically efficient estimator for the offline robust policy learning problem, arguing that the approach of Kallus and Zhou 2020 may be unstable due to its dependence on inverse propensity weights. Instability of inverse propensity weights is a known problem that can lead to unstable estimators (see the sketch following this review).
- They propose a naive plug-in estimator for the optimal robust policy but note that it will suffer from first-order bias. They then derive a semiparametrically efficient estimator that does not suffer from the first-order bias arising from the estimation of nuisance components.
- The theoretical contributions are sound.

Weaknesses:
- The problem of robust offline policy learning under the marginal sensitivity model and the Rosenbaum selection model is quite well studied, e.g., Aronow and Lee 2013, Miratrix et al. 2018, Zhao et al. 2019, Yadlowsky et al. 2018, Kallus et al. 2018, Kallus and Zhou 2020. Furthermore, other works such as Bruns-Smith and Zhou, 2023 consider dynamic policy learning. So the problem that the authors aim to solve has limited novelty. Nevertheless, this paper does cite and reference many of the relevant works in this area, and I believe they do make a technical contribution (in terms of the semiparametric efficiency of their estimator) relative to Kallus and Zhou 2020.
- The new estimator appears to provide only modest improvements over the naive plug-in estimator.

Questions: N/A
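To make the instability point above concrete, here is a minimal self-contained simulation (my own illustration with an invented data-generating process, not code from the paper): as the logging policy's propensities are allowed to approach 0, the variance of a plain IPW value estimate grows sharply even though it remains unbiased.

```python
# Minimal illustration of IPW instability (invented DGP, not from the paper):
# the spread of the Horvitz-Thompson estimate of E[Y(1)] grows as
# propensities are allowed closer to 0.
import numpy as np

rng = np.random.default_rng(0)

def ipw_estimate(n, floor):
    x = rng.uniform(-3, 3, size=n)
    # Logging propensity P(A=1 | X), clipped away from {0, 1} by `floor`.
    e = np.clip(1.0 / (1.0 + np.exp(-4.0 * x)), floor, 1.0 - floor)
    a = rng.binomial(1, e)
    y = x + rng.normal(size=n)       # outcome; true E[Y(1)] = 0 here
    return np.mean(a * y / e)        # plain IPW / Horvitz-Thompson estimate

for floor in (0.10, 0.01, 0.001):
    sd = np.std([ipw_estimate(2000, floor) for _ in range(200)])
    print(f"propensity floor {floor:.3f}: sd of IPW estimate = {sd:.3f}")
```

Bias-corrected estimators of the kind the authors pursue are the standard remedy for exactly this failure mode, since the outcome regression absorbs most of the weight variance.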
Review 3

Title: Efficient and Sharp Off-Policy Learning under Unobserved Confounding
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
EditLens Prediction: Fully human-written

This work offers a method for off-policy learning under unmeasured confounding, whereby unmeasured covariates can jointly affect treatment decisions and the outcome. Concretely, the authors propose a one-step bias-corrected estimator that estimates “sharp” (i.e., tightest possible) bounds around the true policy value under the marginal sensitivity model. These estimated bounds are then used for downstream policy learning. The authors show that their bias-corrected estimation approach is efficient (i.e., it attains the lowest variance among unbiased estimators) and relies upon a simplified minimization objective, as compared to the minimax-style objective studied in prior work. The authors validate the approach theoretically and via experiments on synthetic and real-world data.

Policy learning under unmeasured confounding is an important problem with broad applications. The authors identify a key gap in the literature and address it via appropriate methods. Presentation of the work is effective: I especially appreciate Figure 2 illustrating how the concept of sharpness connects to regret. The theoretical results, including the identification bounds (4.1), the bias-corrected estimator (4.2), and the learning guarantees (4.3), are also well suited to this problem setting. The synthetic and real-world empirical validation is likewise well suited to the goals of the work.

## Connection to prior work & significance

In general, the authors provide solid coverage of prior work and appropriately situate the contribution in the literature. However, this work can be viewed as a targeted improvement on top of the basic framework established in Kallus & Zhou (2018a; 2021). While I still believe such work is valuable and worthy of publication, this somewhat limits the significance of the results. More specifically, it would be helpful if the authors could provide a more detailed technical discussion of the differences with Kallus & Zhou (2018a; 2021). How does instability in IPW weights propagate up to the estimated policy value/regret, and how is this solved by the proposed approach? Similarly, Kallus & Zhou (2018a; 2021) also show that the regret interval obtained under their proposed approach is sharp. While lines 211-226 offer helpful initial discussion, adding additional technical clarity would strengthen the discussion for the reader.

Further, Rambachan, Coston, and Kennedy (2022) [1] derive sharp bounds for the policy value under a related Mean Outcome Sensitivity Model (MOSM), which they then estimate via a doubly robust method. They further show that bounds under the MOSM imply MSM bounds. It would be helpful to outline similarities and differences with this approach. To start, I think the method proposed in this work generalizes to non-binary actions, and the bounds in Rambachan, Coston, and Kennedy (2022) may not remain sharp w.r.t. the MSM after converting from the MOSM framework used in that work.
[1] Ashesh Rambachan, Amanda Coston, and Edward Kennedy. Robust Design and Evaluation of Predictive Algorithms under Unobserved Confounding. https://arxiv.org/abs/2212.09844

## Empirical validation

The general empirical validation of the work is sound, and the authors demonstrate that the proposed approach yields a benefit over relevant baselines. However, I do have several questions about the empirical validation.

Related to my point above, could the authors report experiments that illustrate the mechanism by which the proposed approach obtains improved bounds over Kallus & Zhou (2018a; 2021)? For instance, can the authors illustrate how the efficiency of the estimator yields tighter finite-sample bounds and, in turn, improves downstream policy learning? Or, similarly, that error in IPW weights propagates down to learned policies? Evidence along these dimensions would help the reader understand why the proposed method is necessary over Kallus & Zhou (2018a; 2021).

Further, it appears in several of the empirical results that the proposed method exhibits high variance across runs. I find this surprising given that the estimation approach should in principle reduce the variance of the estimated policy value bounds. For example, confidence intervals are quite wide for "Efficient + sharp" as compared to baselines. We also see similar behavior in Figure 3. Could the authors explain why this is the case, and also include Kallus & Zhou (2018a; 2021) in Figure 3 for a clear comparison of variance across runs?

Additionally, while comparing regret against a fully randomized policy is a reasonable starting point, this seems overly simplified for a real-world experiment, especially because non-randomized baseline policies may yield more challenging distribution shift. Can the authors also report comparisons against other baseline policies?

Finally, given that the proposed estimator depends on learned nuisance functions, can the authors report details of the procedure used to fit and select the nuisance functions used to construct the doubly robust estimates?

Overall, I see this work as providing a valuable contribution, but I do have significant concerns. I am open to reconsidering my score if these concerns regarding significance and empirical validation are appropriately addressed.

## Questions

See questions raised above. Additionally:
- Kallus and Zhou also require "strong overlap" to hold w.r.t. the true propensity, i.e., there exists $\nu > 0$ such that $e_a(x, u) \ge \nu$ for all $a \in A$. Is such an assumption also needed here, given the use of the MSM?
- Algorithm 2: Can cross-fitting be performed to improve sample efficiency? (A generic sketch of what I mean follows this review.)
- Figure 5: Why are the other approaches not sensitive to the choice of sensitivity parameter? Can you show results for the full range, starting at $\Gamma=1$?
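To clarify the cross-fitting question, this is the standard K-fold recipe I have in mind (a generic sketch with a hypothetical function name, not the authors' Algorithm 2): each nuisance is fit on K-1 folds and evaluated only on the held-out fold, so every sample contributes to the final estimate without being used in its own nuisance fit.

```python
# Generic K-fold cross-fitting for a nuisance function (standard recipe,
# not the authors' Algorithm 2). Shown for the propensity score; the same
# pattern applies to outcome regressions and conditional quantiles.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

def cross_fitted_propensities(X, A, n_splits=5, seed=0):
    e_hat = np.empty(len(A))
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        # Fit on the other folds, predict only on the held-out fold.
        model = GradientBoostingClassifier().fit(X[train_idx], A[train_idx])
        e_hat[test_idx] = model.predict_proba(X[test_idx])[:, 1]
    return e_hat  # out-of-fold estimates for use inside a one-step estimator
```

Cross-fitting typically lets one replace Donsker-class conditions on the nuisance estimators with mild rate conditions, which is why I suspect it would also help here.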