ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 1 (33%) | 6.00 | 4.00 | 5277 |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (33%) | 2.00 | 4.00 | 2162 |
| Fully human-written | 1 (33%) | 4.00 | 4.00 | 7566 |
| Total | 3 (100%) | 4.00 | 4.00 | 5002 |
Provably Efficient Policy-Reward Co-Pretraining for Adversarial Imitation Learning

Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper investigates the theoretical foundations of pretraining in adversarial imitation learning (AIL). The authors first reveal theoretically the fundamental reason why policy-only pretraining fails: reward error becomes the bottleneck. Based on this insight, they propose CoPT-AIL, which leverages reward shaping theory to jointly pretrain both policy and reward through a single BC procedure. Theoretically, they prove that when the number of expert demonstrations satisfies N ≳ C√(|S||A|H²/K), the method achieves a tighter imitation gap bound than standard AIL. Experiments validate the effectiveness of the approach on 6 DMControl tasks.

Strengths:
1. Clear motivation and important problem: The paper addresses a practical pain point in AIL (the requirement for extensive online interactions) and provides a theoretical explanation of why existing policy-only pretraining approaches have limited effectiveness. The problem formulation is valuable.
2. Solid theoretical contributions: The error decomposition in Proposition 1 clearly reveals the bottleneck role of reward error. The use of reward shaping theory to circumvent the reward ambiguity issue is clever. Theorem 1 provides the first theoretical guarantee for the benefits of pretraining in AIL, filling an important theoretical gap.
3. Simple and elegant method: The derivation of r̃*(s,a) = log π^E(a|s) from the expert's soft-optimality property is natural, enabling joint pretraining of policy and reward through a single BC procedure. The method is simple to implement and computationally efficient.
4. Reasonable experimental design: The paper includes comparisons with multiple SOTA methods, and ablation studies adequately validate the necessity of joint pretraining.

Weaknesses:
The theoretical analysis is explicitly limited to tabular MDPs (using |S|, |A| notation), which the authors acknowledge in the conclusion: "this work focuses on the standard tabular setup". However, the experiments use continuous state-action spaces in DMControl tasks, which do not match the theoretical setting, and there is no discussion of how to extend the theory to function approximation and continuous spaces.

Moreover, the constant C = max d^{πBC}(s)/d^{πE}(s) can be large when BC quality is poor, potentially weakening the practical utility of the theoretical results. Setting r¹(s,a) = log π^BC(a|s) depends on BC quality, which may be poor when demonstrations are limited, and for continuous action spaces Algorithm 3 does not specify how to compute log π^BC(a|s) (see the sketch below for one possible implementation). If π^BC deviates significantly from π^E, the pretrained reward may mislead subsequent training.

The paper only tests on 6 DMControl tasks, lacking environmental diversity, and it is missing a sensitivity analysis with respect to the number of expert demonstrations N (which the theoretical condition depends on). Eq. 14 introduces the regularization term β exp(-r), but Table 1 does not report the value of β, and there are no ablation experiments to validate its importance. Additionally, there is no analysis of whether the condition N ≳ C√(|S||A|H²/K) is satisfied in the experiments.
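To make the continuous-action concern concrete: a minimal sketch of how r¹(s,a) = log π^BC(a|s) could be computed if the BC policy is modeled as a diagonal Gaussian, a common choice for DMControl. The class, method names, and architecture below are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn


class GaussianBCPolicy(nn.Module):
    """Hypothetical diagonal-Gaussian BC policy; the paper may parameterize pi_BC differently."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent std for simplicity

    def log_prob(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """log pi_BC(a|s); with r^1(s,a) = log pi_BC(a|s), this doubles as the pretrained reward."""
        mean = self.mean_head(self.trunk(state))
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        # Sum the per-dimension log-densities of the diagonal Gaussian.
        # A tanh-squashed policy would additionally subtract the log-det Jacobian of tanh.
        return dist.log_prob(action).sum(dim=-1)


# Usage (shapes assumed): r1 = bc_policy.log_prob(expert_states, expert_actions)
```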
The derivation from r*(s,a) to r̃*(s,a) in Section 4.1 is somewhat abrupt; the choice of shaping function Φ could be explained more explicitly (a plausible reconstruction is sketched at the end of this review). Implementation details for the baselines in the experimental section are insufficient, such as the specific hyperparameter settings for each method. While Proposition 2 is based on the classic result of Ng et al. (1999), its application to the current problem is reasonable.

Questions:
1. Your theory explicitly targets tabular MDPs (original text: "standard tabular setup"), but the experiments use continuous state-action spaces. Can you (a) discuss how the theory extends to function approximation settings, or (b) at least provide a heuristic analysis or experimental validation of the theoretical predictions in continuous spaces?
2. For continuous actions, how do you compute r¹(s,a) = log π^BC(a|s)? Do you assume π^BC has a specific parametric form (e.g., a Gaussian policy)? Algorithm 3 should explicitly clarify this implementation detail.
3. Practical impact of the constant C: In your experimental tasks, what are typical values of C = max d^{πBC}/d^{πE}? How does this affect the tightness of the theoretical results?
4. Regarding the regularization coefficient β in Eq. 14: (a) Why is the value of β not listed in Table 1? (b) Can you provide ablation experiments to validate the importance of this regularization term?
5. The theory requires N ≳ C√(|S||A|H²/K). Can you verify whether this condition is satisfied in your experiments, or show a sensitivity analysis of algorithm performance with respect to N?
6. Can you provide experimental comparisons with Jena et al. 2021, "Augmenting GAIL with BC for Sample Efficient Imitation Learning"? This method also combines BC and AIL, and a comparison would more clearly demonstrate the advantages of CoPT-AIL.
7. Can you discuss the method's dependence on the soft-optimality assumption for the expert policy (Eq. 1)? If the expert is suboptimal, does the derivation of r̃*(s,a) = log π^E(a|s) still hold?

Note: Satisfactory answers to questions 1-5 would strengthen the paper significantly and could raise my recommendation to Accept. Questions 6-7 are less critical but would further enhance the contribution.

EditLens Prediction: Heavily AI-edited
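For reference on the derivation questioned in this review's weaknesses and question 7: a plausible reconstruction of the shaping step, assuming the standard maximum-entropy soft-optimality identity with unit temperature. The paper's actual finite-horizon argument may differ in detail.

```latex
% Assumed soft-optimality form of the expert (presumably the content of Eq. 1):
%   \pi^E(a \mid s) = \exp\!\big(Q^*_{\mathrm{soft}}(s,a) - V^*_{\mathrm{soft}}(s)\big),
%   \qquad V^*_{\mathrm{soft}}(s) = \log \textstyle\sum_{a'} \exp Q^*_{\mathrm{soft}}(s,a').
% With the soft Bellman equation
%   Q^*_{\mathrm{soft}}(s,a) = r^*(s,a) + \gamma\,\mathbb{E}_{s'}\big[V^*_{\mathrm{soft}}(s')\big],
% it follows that
\log \pi^E(a \mid s)
  = r^*(s,a) + \gamma\,\mathbb{E}_{s'}\!\big[V^*_{\mathrm{soft}}(s')\big] - V^*_{\mathrm{soft}}(s)
  =: \tilde r^*(s,a).
```

Under this reading, the shaping potential is Φ(s) = V*_soft(s); by Ng et al. (1999), potential-based shaping preserves the optimal policy, which is presumably why log π^E can stand in for the true reward.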
Provably Efficient Policy-Reward Co-Pretraining for Adversarial Imitation Learning

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper proposes a theoretical analysis of the effects of pre-training both the policy and the reward function before running AIL.

Strengths:
The paper proposes an interesting theoretical analysis of AIL with policy pretraining alone. Furthermore, it quantifies the regret arising from not pre-training the reward function (Proposition 1, Eq. (5)) and provides an improved bound to showcase the effects of pre-training the reward function.

Weaknesses:
1. The paper's contributions are mainly incremental and overlap with existing analyses in the AIL literature.
2. Previous theoretical work on AIL has already highlighted its reliance on reward updates (e.g., Lemma 3.2 and Theorem 4.5 in [1]).
3. The reward pretraining contribution is not entirely novel, as the connection between policy pretraining and reward learning has already been explored in [2].
4. The choice of experimental baselines is not fully justified. Given the close connection to OLLIE in [2], which similarly combines pretraining with AIL, including OLLIE as a baseline would have been more appropriate. Since CoPT-AIL emphasizes stronger theoretical analysis, a comparison against OLLIE would have been particularly interesting to demonstrate the practical benefits of the added theory.
5. The results in Fig. 2 are not fully convincing and do not clearly substantiate the paper's claims regarding CoPT-AIL's superiority over prior AIL approaches.

**References**:
[1] Chen Y, Giammarino V, Queeney J, Paschalidis IC. Provably efficient off-policy adversarial imitation learning with convergence guarantees. arXiv preprint arXiv:2405.16668, 2024.
[2] Yue S, Hua X, Ren J, Lin S, Zhang J, Zhang Y. OLLIE: Imitation learning from offline pretraining to online finetuning. arXiv preprint arXiv:2405.17477, 2024.

Questions:
1. Setting aside the theoretical analysis, what are the main differences between OLLIE in [2] and CoPT-AIL? Are there reasons why one approach should be preferred over the other?
2. Can you make concrete claims about the quality and quantity of data required for CoPT-AIL to succeed? In particular, how does it perform with suboptimal demonstrations or with only a small number of expert trajectories?

EditLens Prediction: Lightly AI-edited
Provably Efficient Policy-Reward Co-Pretraining for Adversarial Imitation Learning

Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper tackles the failure mode of BC pre-training in adversarial imitation learning (AIL), where the AIL agent barely benefits from BC pretraining, if at all. This phenomenon has been observed and reported in prior works, but without identifying an actionable cause or carrying out a rigorous theoretical study. The paper addresses this overlooked gap in the literature and identifies the culprit as the lack of reward pre-training in AIL: the reward starts from random initialization. The authors augment the AIL algorithm with a reward pre-training step, such that the policy and reward are co-pre-trained, hence naming their approach AIL with Policy-Reward Co-Pretraining (CoPT-AIL). Theoretical results show a tighter imitation gap. Empirical results show that CoPT-AIL can outperform standard AIL methods.

Strengths:
* The paper targets a crucial desideratum of AIL: taking advantage of potentially limited expert data to squeeze out as much performance as possible while remaining offline.
* The detailed contribution rundown in the introduction is appreciated. This could be improved by indicating, for each point at least, where in the paper it is treated.
* The regret bounds the authors end up with are easy to interpret. For example, the new "relative policy evaluation error" bound decreases as 1/N as the number of expert trajectories N increases.
* Despite the weaknesses of the experimental setup, the proposed "fix" to BC-pretrained AIL is so simple that there is little reason for any practitioner not to use it.

Weaknesses:
* The authors mention "the telescoping argument" as a discovered property; it is not an incidental property but the reason reward shaping is constructed as such. It might be worth writing about the design principle behind the definition of reward shaping earlier in the main text or with more emphasis, especially considering how central it is to the solution developed in the paper.
* Experiments: going beyond DMC would be a big plus, but even in DMC there are tasks like humanoid-walk/run, quadruped, and dog that could all be valuable additions. While the authors treat 6 environments, those are 6 tasks from 4 different environments (i.e. only 4 different underlying simulated robotics models). The environment with the highest degrees of freedom (leading to the highest state- and action-space dimensionalities) is the bipedal walker. A quadruped and/or a humanoid would be more convincing.
* When it comes to the practical use of the algorithm, it would be useful to see whether there is some "unlearning" happening in the reward, e.g. by showing how the reward gradients are behaving, or how aligned the BC and AIL gradients are, which would give empirical evidence that, with CoPT-AIL, BC and AIL are no longer working against each other. In Walker-Run, for example (the most complex task treated in the paper), there is a ~50k-step initial phase where CoPT-AIL is unable to accumulate any reward while 3 baselines can.
How I interpret this initial plateau at near-zero return where nothing happens: reward pre-training with the log-probability of the BC-pre-trained policy as reward leads the reward to be peaked at its maximal value on the expert samples, and near zero anywhere outside the support of the expert distribution, with very abrupt changes in value as we go in and out of the support. This is a priori a tedious (sparse) signal to optimize via RL. The typical AIL agent (especially in the off-policy setting, cf. "Lipschitzness is all you need" by Blondé et al., 2022) would employ a gradient penalty regularizer to smooth out the reward signal and make learning possible. The authors seem to do something akin to such penalization in their practical implementation (page 22). In L1159-1165, the authors introduce a reward regularizer to "improve the stability of reward training". What the exponential-based regularizer does: the term strongly penalizes very low or negative reward values, and its influence decays for larger ones, so it smooths the gradient and pushes the reward toward a small positive range. It would therefore be useful to show and discuss the reward landscape that the co-pre-training creates, which would at least answer the question of why there is a flat performance plateau at the very start of the learning curve for Walker-Run, only for CoPT-AIL. Note: using the reward smoothing regularizer also in the reward pre-training phase might solve the issue already, since it would allow AIL to start with a non-random yet smooth reward landscape (see the sketch after this review for one possible form).

Questions:
* With CoPT-AIL, the policy and reward are co-pre-trained (first the policy, then the reward). By the end of the pre-training phases, Q is still randomly initialized. Would it not make even further progress toward your goal of avoiding any negating effect to also pre-train Q, for example with a method akin to fitted Q iteration? That, however, assumes that the expert data is available in triplets (state, action, next state), since the next state is required for the Q target. This condition is a fortiori satisfied if the expert trajectories are available whole and ordered, without subsampling.
* L95-96: aligning with how Ross and Bagnell 2010 ("Efficient Reductions in IL") defined their non-stationary policies, I think it would be clearer for the authors to qualify the policies, dynamics, and reward structures introduced as non-stationary in the text, grounding the exposition in the pre-existing literature and improving precision. Do the concepts introduced by the authors not align with those in Ross and Bagnell 2010?
* The point made by the authors about the presence of a value difference under an untrained reward in the bound of Prop. 1 is sound. That being said, would it not be best to leave the second term of what the authors call the "reward error" out of the curly brace? The same goes for the very last term of the bound.
* How many expert demonstrations were in use? (In the appendix the authors write how many expert demonstrations were collected from the experts, but it is unclear how many were used for the experiments shown in the plots.)
* How does the method fare w.r.t. various numbers of available demonstrations?
* Is the relative policy gap increasingly reduced compared to the baselines, as the developed theory would dictate?
* It would be interesting to see how CoPT-AIL reacts to different policy and reward architectures, considering how sensitive RL agents generally are to their reward landscape.
  For example, a reward designed from the error of a powerful diffusion model would, by its representational capacity, probably overfit the reward signal to the pre-training reward learning signal CoPT-AIL introduces, compared to a weaker model. If that process makes the reward landscape particularly non-smooth and peaked, AIL would have to deal with an initially sparse reward function, which would probably be more tedious to handle than a randomly initialized reward.

Style, typos, suggestions:
* The shorthand "relative policy evaluation error" is rather confusing, as it is part of the reward error, which is distinct from the policy error. It could simply be called the reward error if, in Prop. 1, the authors would call the first term of the bound the (scaled) reward error and the second term of the bound the "approximation error", since the latter goes to 0 as the dataset size tends to infinity.
* The authors should state in the plot legends what the dotted horizontal lines represent.
* [minor] The exposition is very clear; the first part of Section 4.1, however, is unnecessarily long given how well-known its contents are.

EditLens Prediction: Fully human-written
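To illustrate the smoothing suggestion in the review above: a minimal sketch of a reward pre-training loss that fits r_θ(s,a) to the BC log-probability target while adding an exponential penalty of the β·exp(-r) form attributed to Eq. 14 and a gradient penalty in the spirit of Blondé et al. (2022). The regression form of the fitting loss, the function and hyperparameter names, and the one-sided gradient penalty are assumptions of this sketch, not the paper's implementation.

```python
import torch


def reward_pretrain_loss(reward_net, states, actions, bc_log_probs,
                         beta: float = 0.1, gp_coef: float = 10.0):
    """Hypothetical reward pre-training objective: fit r_theta(s,a) to log pi_BC(a|s)
    while regularizing the reward landscape (not the paper's exact objective)."""
    r = reward_net(states, actions)                # predicted rewards r_theta(s,a)
    fit_loss = ((r - bc_log_probs) ** 2).mean()    # regression onto the BC log-prob target
    exp_reg = beta * torch.exp(-r).mean()          # exponential penalty, akin to beta*exp(-r) in Eq. 14

    # Gradient penalty on interpolated inputs, encouraging a smoother reward landscape.
    eps = torch.rand(states.size(0), 1, device=states.device)
    perm = torch.randperm(states.size(0), device=states.device)
    s_mix = (eps * states + (1 - eps) * states[perm]).requires_grad_(True)
    a_mix = (eps * actions + (1 - eps) * actions[perm]).requires_grad_(True)
    r_mix = reward_net(s_mix, a_mix)
    grads = torch.autograd.grad(r_mix.sum(), (s_mix, a_mix), create_graph=True)
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(dim=1)
    grad_pen = gp_coef * ((grad_norm - 1.0).clamp(min=0.0) ** 2).mean()

    return fit_loss + exp_reg + grad_pen
```

Applying the same regularizers during pre-training as during the adversarial phase would, as the reviewer suggests, let AIL start from a non-random yet smooth reward landscape.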