Noise-Guided Transport for Imitation Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper presents Noise-Guided Transport, an off-policy imitation learning (IL) method that uses an optimal transport formulation of IL and adversarial learning for optimisation. One of the main components of the paper is a frozen random prior network that acts as a “regulariser” by moving the optimal solution away from agent data. The primary focus of the paper is to develop a method for IL in data-sparse settings, and its main claims are that it achieves high data and sample efficiency.
The paper presents an interesting reformulation of the imitation learning problem from the lens of optimal transport. The authors formulate an objective similar to several prior works in imitation learning (matching expert and agent distributions), and approach this as the optimisation of the earth mover's distance using a histogram loss. While this is similar to WGANs (and WGAN-based IL), the paper presents some interesting theoretical results and a new learning strategy that uses a frozen random prior network to regularise the predicted rewards. The proposed ideas seem novel and worth exploring.
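For concreteness, I read the paper's formulation as a member of the usual Kantorovich–Rubinstein dual family that WGAN-based IL also optimises (notation mine, not the paper's):

$$
W_1(P_{expert}, P_{agent}) \;=\; \sup_{\|f\|_{L} \leq 1} \; \mathbb{E}_{x \sim P_{expert}}[f(x)] \;-\; \mathbb{E}_{x \sim P_{agent}}[f(x)],
$$

with the novelty lying mainly in how the 1-Lipschitz constraint on the critic is (approximately) enforced via the frozen random prior.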
While the idea of the paper is interesting, I believe that the execution is lacking in presentation and experimentation. Overall, I find the paper poorly written and often unnecessarily hard to digest. Several sections are very long-winded, and explicit mathematical notation is often missing. A clearly written background would have significantly improved the clarity and impact of the paper. Additionally, I find some claims overly bold and the experimental section to be insufficiently rigorous. I address some specific concerns below:
1. Section 4.1 is very unclear. The paper does not motivate or introduce “learning from random priors”. A short background on this would greatly improve the clarity and impact of the paper. On line 133, “By the properties of neural networks,” is also very vague reasoning. If I understand correctly, this is true only in the infinite-data limit and if the function is trained to global optimality. This assumption is clearly violated in the data-sparse settings that this paper analyses. Please correct me if I am mistaken.
2. On line 168, “Methods based on GANs optimise a JS-divergence between distributions $P_{expert}$ and $P_{agent}$, which suffers from mode collapse and vanishing gradients” is again an unsupported and potentially incorrect statement. If I am not mistaken, optimising the Jensen-Shannon divergence is not in itself the root cause of GAN instabilities; the issue more likely arises from perfect discrimination and poorly aligned supports [1, appendix E in 2] (see the sketch after this list).
3. Could you please clarify how Eq 2 differs from the apprenticeship learning setup (Eq 6 in [3]; see the sketch after this list)? To my understanding, your objective of “distinguishing the expert and agent distributions” is closely aligned with several prior works in IRL. If this is the case, I request that the authors rewrite their claims to state that their method is a reformulation of the distribution-matching idea (but from an OT point of view).
4. The core functionality of this paper is quite close to Adversarial Motion Priors [4], a method that uses a WGAN-style loss for adversarial IL. I believe that this is important prior work that should at least be mentioned in the paper's analysis (and potentially used as a baseline).
5. The experiments in this paper only use 4 random seeds per task. In my opinion, this is quite low and not rigorous enough to rule out the possibility that the reported results are due to random fluctuations or chance rather than consistent underlying performance differences. A larger number of seeds (10+) would help to claim statistically significant improvements.
6. On line 460, the authors state that “we did not carry out per-task tuning for any method.” I believe this methodology is flawed. Each method should be tuned individually for each task, or alternatively, all methods should be tuned to achieve the best average performance across all tasks. Using the same hyperparameters across tasks that differ significantly in dynamics, reward formulations, and optimal expert policies leads to an unfair comparison and potentially misleading conclusions, since different environments naturally require distinct exploration strategies and hyperparameter settings. Applying a fixed hyperparameter set across all tasks may therefore cause some baselines (such as DiffAIL) to underperform relative to their true potential. I request that the authors clarify their hyperparameter tuning strategy: specifically, for which environments the methods were tuned and for which they were kept fixed.
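For reference on points 2 and 3 above (these are standard results recalled from memory, so please double-check them against the cited papers rather than treating them as statements about this submission): the original GAN value function

$$
\min_G \max_D \; \mathbb{E}_{x \sim P_{data}}[\log D(x)] \;+\; \mathbb{E}_{x \sim P_G}[\log(1 - D(x))]
$$

recovers $2\,\mathrm{JSD}(P_{data}\,\|\,P_G) - \log 4$ only at the optimal discriminator, and the instabilities analysed in [1] are attributed to (near-)perfect discrimination under poorly aligned supports rather than to the JS divergence itself. The apprenticeship learning setup of [3] is, if I recall correctly, the max-margin feature-expectation matching problem

$$
\max_{\|w\|_2 \leq 1} \; \min_{j} \; w^\top \big( \mu(\pi_E) - \mu(\pi_j) \big),
$$

where $\mu(\cdot)$ denotes discounted feature expectations. This is structurally very close to “find a function that separates expert from agent data”, hence my request for a clearer positioning of Eq 2.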
**Clarity Concerns:**
1. In section 3 (lines 110-117), could you please include the full expressions for the transition dynamics $P(s_{t+1} | s_t, a_t)$ and the reward function $r(s,a)$? The line “policy π(a|s) maps states s ∈ S to distributions over actions a ∈ A, aiming to maximize cumulative rewards” is imprecise: the policy *is* a distribution over actions conditioned on a given state, and the RL objective is to maximise the *expected* cumulative reward (unless your work sets up the problem differently). Additionally, could you please clarify the line “We work in the episodic setting, where γ resets to 0 upon episode termination”? A sketch of the notation I have in mind is given after this list.
2. The beginning of section 4 (lines 121-137) is phrased very awkwardly in my opinion. I would prefer if the terms were defined explicitly in mathematical language (e.g., $r_{\zeta}: \mathbb{X} \rightarrow \mathbb{R}$ where $\mathbb{X}$ is either $\mathbb{S} \times \mathbb{S}$ …). Please also explicitly define $P_{expert}$ and $P_{agent}$; the definitions will change depending on $\mathbb{X}$ (e.g., $P_{agent}(s,a) = \pi(a|s)\, d_{agent}(s)$ with $d_{agent}(s) \propto \sum_{t} \gamma^t \Pr(s_t = s)$; see the sketch after this list). I believe that, unless necessary for the rest of the paper, the explanation of actor-critic methods can be relegated to the appendix.
3. I don’t find Figure 1 particularly informative. From what I understand, it is an aggregation of all the curves in Figure 2. However, such aggregation across different tasks loses any nuance/meaning. I also don’t understand how the normalization is done in this figure (see the note after this list). Do you consider mean expert performance across tasks? If so, how do you account for the fact that all environments have different max expert performances? Further, the dotted lines in Figure 1 aren’t labelled, nor is Figure 1 referenced in the text. Could you also explain why the humanoid results were omitted?
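To illustrate points 1 and 2 concretely (my notation, to be adapted to whichever space $\mathbb{X}$ the paper actually uses): the RL objective and the agent distribution I would expect are

$$
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t \geq 0} \gamma^t r(s_t, a_t)\Big],
\qquad
P_{agent}(s,a) \;=\; \pi(a \mid s)\, d_{\pi}(s),
\quad
d_{\pi}(s) \;\propto\; \textstyle\sum_{t \geq 0} \gamma^t \Pr(s_t = s \mid \pi),
$$

with $P_{expert}$ defined analogously from the expert's occupancy measure. On point 3, my guess is that the intended normalisation is per task, e.g. $(\text{return} - \text{return}_{random}) / (\text{return}_{expert} - \text{return}_{random})$, averaged over tasks afterwards; if that is the case, please state it explicitly and say how the expert reference for each task is chosen.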
**References**
[1] Arjovsky, M. and Bottou, L. Towards Principled Methods for Training Generative Adversarial Networks. arXiv:1701.04862, 2017.
[2] Diwan, A. A., Urain, J., Kober, J., and Peters, J. Noise-conditioned Energy-based Annealed Rewards (NEAR): A Generative Framework for Imitation Learning from Observation. arXiv:2501.14856, 2025.
[3] Abbeel, P. and Ng, A. Y. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), 2004.
[4] Peng, X. B., Ma, Z., Abbeel, P., Levine, S., and Kanazawa, A. AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control. ACM Transactions on Graphics, 40(4):1–20, 2021.
Included alongside weaknesses
Fully human-written |
Noise-Guided Transport for Imitation Learning |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This work proposes Noise-Guided Transport (NGT), a lightweight off-policy imitation learning method for low-data regimes. NGT formulates imitation as an optimal transport problem solved via adversarial training, requiring no pretraining or specialized architectures. It naturally models uncertainty, is simple to implement, and achieves strong performance on challenging continuous control tasks with as few as 20 expert transitions.
- The paper proposes a novel approach to enforce the Lipschitz condition and measure distributional distance under the Wasserstein-1 metric without relying on a gradient penalty.
- Empirical results demonstrate that the proposed method outperforms state-of-the-art adversarial imitation learning (AIL) approaches in terms of episode rewards.
- The ablation studies are comprehensive, and the paper provides theoretical justifications supporting the proposed method.
- The writing quality of the paper needs improvement. There are numerous unnecessarily bolded words and missing hyperlinks (e.g., line 706 in the appendix).
- Although the authors claim improved computational efficiency by avoiding gradient penalties for enforcing the Lipschitz condition, Table 3 shows that the proposed method does not demonstrate a clear advantage in computation time compared to existing approaches.
- Could the authors provide a more detailed explanation or empirical justification for the claimed computational efficiency of the proposed method compared to approaches that use gradient penalties?
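For reference, the comparison I have in mind is against the standard WGAN-GP regulariser of Gulrajani et al. (2017),

$$
\lambda \; \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\Big[ \big( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \big)^2 \Big],
$$

whose extra backward pass through the critic is usually cited as the main source of overhead. A wall-clock or FLOP breakdown isolating this term (rather than total training time alone) would make the efficiency claim concrete.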
Moderately AI-edited |
Noise-Guided Transport for Imitation Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper studies the imitation learning problem in the low-data regime and introduces an off-policy approach called Noise-Guided Transport (NGT). The method addresses the problem using adversarial training. The authors evaluate the performance of their approach in several continuous control environments, such as Ant, HalfCheetah, etc.
The paper is overall well written with some room for improvement. The problem it addresses is both interesting and relevant. The experimental results demonstrate that the proposed method performs well compared to state-of-the-art approaches.
There is some concern regarding the novelty of the proposed method and its positioning within the existing literature. The idea of using generative adversarial models for imitation learning, i.e., employing a loss function similar to Eq. (2), has been explored in prior work, including studies that establish its connection to optimal transport theory, e.g., [Xiao et al., 2019]. The authors should more clearly emphasize the novelty and specific contributions of their approach to distinguish it from existing methods in the literature.
The Lipschitz constant of the potential function plays an important role in the analysis of this method. However, this constant can become arbitrarily large for functions with sharp transitions, such as ReLU-type functions. How would such large Lipschitz constants affect the performance of the proposed method?
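As a concrete reference point (a standard bound, not a claim about this paper): for a network $h(x) = W_L\,\sigma(W_{L-1}\cdots\sigma(W_1 x))$ with 1-Lipschitz activations such as ReLU, the Lipschitz constant satisfies only the upper bound

$$
\mathrm{Lip}(h) \;\leq\; \prod_{l=1}^{L} \|W_l\|_2,
$$

which can grow without control during training unless the weights are explicitly constrained. It would help if the authors explained how the proposed construction keeps this quantity (or an estimate of it) bounded in practice.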
In the experimental results (Figure 2), it appears that the performance of NGT does not improve as the number of demonstrations increases, for instance, in the HalfCheetah, Humanoid, or Walker environments. This raises the question of whether the proposed method is consistent.
Please see the Weaknesses.
Lightly AI-edited |
Noise-Guided Transport for Imitation Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper is set to address imitation learning in low-data regimes. To this end, the authors define and learn a “potential”, $h$, and a trainable predictor, $f$. They then use the difference of the potential's expectations over the expert vs. agent distributions as an adversarial reward, and argue this is equivalent to optimizing an optimal transport objective provided that the potential is 1-Lipschitz. Large parts of the methodology are focused on providing a practical recipe for the 1-Lipschitz approximation. Empirically, results on MuJoCo Gymnasium tasks, including a more challenging humanoid control task, show strong performance in low-data regimes.
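In symbols, my reading of the training signal is a Kantorovich-dual-style objective (my notation, not necessarily the paper's),

$$
\max_{h:\;\|h\|_{L} \leq 1} \; \mathbb{E}_{x \sim P_{expert}}[h(x)] \;-\; \mathbb{E}_{x \sim P_{agent}}[h(x)],
$$

with the learned potential then reused as the reward for the policy, and the main technical content being how the 1-Lipschitz requirement is approximately enforced through the frozen random prior and the trainable predictor $f$.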
I appreciate the authors' effort in presenting a complete and thorough work. I believe that the motivation is strong: the authors properly identify the problem with vanishing gradients in GANs and try to tackle this with a novel formulation, especially under low-data regimes. The literature is sufficiently covered, and the appendix is rather thorough. The paper explains key concepts and design choices in great detail, and attempts to support them with proofs. The authors also compare their method against multiple baselines, some re-implemented, and release the codebase, which helps the credibility and reproducibility of the empirical results.
**Introduction**
1. The statement “IRL … is shielded from compounding errors” is not accurate and overstates IRL's capabilities. IRL helps mitigate compounding errors, but it doesn’t guarantee avoidance. If the learned reward is misspecified/ambiguous, if the discriminator overfits, or if optimization stalls, the learned policy can still drift into regions where its behavior is poor and errors cascade.
2. The method, NGT, is defined in the intro, but the logic behind the chosen name is not explained until multiple pages into the manuscript, and remains unclear to the reader until then. Please consider aligning the explanation with the method's name.
3. The last paragraph of the introduction is not cohesive. The text keeps referring to Figure 3 and tries to justify or explain the results. The introduction is better off focusing on the high-level picture; explaining specific results can be done in the experiments section. It could also benefit from being split into two paragraphs. Maybe consider a more structured, cohesive, and to-the-point version of this paragraph?
**Related work**
4. The authors perhaps misuse LaTeX's \cite and \citep commands, which makes the related work section harder to follow.
5. Some abbreviations are defined multiple times (like OT), and some are defined and never used again, like MaxEnt IRL.
6. The related work section could also use some structure, like categorizing the previous work based on their main perspective or application.
**Background**
7. What does it mean to reset $\gamma$? Isn't the discount factor fixed while learning?
**Methodology**
8. The definition of actor-critic methods belongs more to the background than the methodology; the same goes for the replay buffer and SAC.
9. Equation (1) is far from any density estimator, not only because it doesn't integrate to 1, but because it varies based on different seeds, scaling, etc. Given this, is calling it a pseudo-density still justified?
10. In Equation (2), the authors basically reparameterize the discriminator as a frozen random network + predictor, but it’s still optimizing a divergence between expert and agent distributions. There’s no clear reason why this avoids vanishing gradients. Don't gradient magnitudes depend on how the predictor generalizes, not on its target’s randomness? A clarification on this is indeed needed here.
11. Why choose an exponential function of the potential for the reward (other functions can provide positivity and a suitable range)? Exponentiating the potential makes the system exponentially sensitive to small changes in the potential, which can happen due to noise or initialization (see the sketch at the end of this list).
12. The reward is a derived quantity, not a general potential. There’s no guarantee this mapping can represent all 1-Lipschitz functions. It’s actually a highly restricted subset of functions shaped by the architecture, the loss, and the initialization of the networks. So how is this an effective search within the class of 1-Lipschitz functions?
13. Upon reading the proof of Theorem 4.1, I believe there is an ambiguous logical step in the last line, where the comparison switches from an inequality (less than or equal) to an equality without any justification. Can you elaborate on this?
14. In addition to the previous concern, I think the basic properties of Lipschitz compositions are well known, and Theorem 4.1 merely restates them (see the sketch at the end of this list)? It’s useful to state, but the framing makes it read as a stronger result than it is, more like part of the paper's contributions. Can you justify this?
**Experiments**
15. Running for only 4 random seeds could be insufficient, especially for low-data regimes with higher uncertainty. Is the low number of seeds due to the computational cost of the method / baselines?
16. On the previous note, I expected to see larger error bands for fewer demonstrations, but the plots in the main text do not follow this trend. Why is this the case? And why does less data not cause higher uncertainty in the reported results, not only for NGT but also across the other baselines?
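To make points 11 and 14 precise (standard facts, stated in my own notation): for compositions one has $\mathrm{Lip}(u \circ v) \leq \mathrm{Lip}(u)\,\mathrm{Lip}(v)$, which is the textbook property that Theorem 4.1 appears to instantiate. And for an exponential reward $r = \exp(h)$, the mean value theorem gives

$$
\big| e^{h(x)} - e^{h(y)} \big| \;\leq\; e^{\sup h}\, \big| h(x) - h(y) \big|,
$$

so both the effective Lipschitz constant and the sensitivity of the reward scale with $e^{\sup h}$; this is the amplification of small changes in the potential that point 11 is concerned about.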
Please refer to the weaknesses, and justify the comments / answer the questions. I am open to discussion and will indeed be willing to raise the score if these points are sufficiently addressed.
Again, I thank the authors for their time and dedication.
Fully human-written |