ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 2 (50%) | 3.00 | 2.00 | 3082 |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 1 (25%) | 6.00 | 4.00 | 4227 |
| Fully human-written | 1 (25%) | 4.00 | 2.00 | 2305 |
| Total | 4 (100%) | 4.00 | 2.50 | 3174 |
PRISM: Performer RS-IMLE for Single-pass Multisensory Imitation Learning

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
This paper presents PRISM, a single-pass generative policy for multisensory imitation learning. PRISM integrates a Performer-based linear-attention generator with Rejection Sampling Implicit Maximum Likelihood Estimation (RS-IMLE) to achieve real-time inference while handling multisensory observations and multimodal action distributions. Unlike diffusion or flow-based policies, the proposed method performs one forward pass to produce full action trajectories. Across MetaWorld, CALVIN, Robomimic, and real-robot loco-manipulation tasks, PRISM achieves 10–25% higher success rates than baselines while maintaining real-time inference.

Strengths:
- The authors tested their method both in simulation and in real-robot experiments.
- Outperforms diffusion and flow policies in both simulation and real-world experiments by 10–25%.
- Ablation studies in the appendix help to understand how the different components of the method affect performance.

Weaknesses:
- One of the main motivations for this work is multimodal action distributions, yet the authors did not show that their method is actually capable of learning these multimodal behaviours.
- The method has many hyperparameters, such as $K'$, $\epsilon$, and the Top-K weight.
- Demonstrations are limited to small-scale tasks (e.g., pick-and-place, insertion); scalability to larger problems is untested.
- The proposed loss combines many heuristics. There are cleaner probabilistic formulations with the same goal of avoiding mode averaging and with convergence guarantees, based e.g. on mixtures of experts; see for example [1]. Did the authors try something along those lines?

[1] Information Maximizing Curriculum: A Curriculum-Based Approach for Imitating Diverse Skills

Questions:
- Did the authors observe diverse behaviors using their method?
- How robust is PRISM to variations in batch size for $\epsilon$-calibration?
- What is the speedup from using linear attention? I assume there is no big difference compared to standard attention, since the authors use a small number of tokens.
- Why did the authors not compare their method with the results in the official CALVIN benchmark (see http://calvin.cs.uni-freiburg.de/)? Outperforming existing methods could show that the different sensor modalities help to improve performance.
- How does PRISM scale to more complex vision-language-action (VLA) architectures or larger datasets?

EditLens Prediction: Fully human-written
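The RS-IMLE objective referenced in this review admits a compact illustration. Below is a minimal PyTorch sketch of an RS-IMLE-style training step, assuming an L2 distance over flattened action sequences, a fixed rejection threshold `eps`, and a placeholder generator `g`; the paper's robust sequence distance, batch-global rejection, and top-$K'$ weighting are not reproduced here.

```python
import torch

def rs_imle_loss(g, obs_feat, expert_actions, num_samples=16, eps=0.1):
    """One RS-IMLE-style step (illustrative only, not the paper's exact loss).

    g:              placeholder generator mapping (obs_feat, z) -> actions [S, H, A]
    obs_feat:       encoded observations, shape [B, D]
    expert_actions: expert action sequences, shape [B, H, A]
    """
    B = expert_actions.shape[0]
    target = expert_actions.flatten(1)                      # [B, H*A]
    losses = []
    for i in range(B):
        # Draw several candidate futures for this observation.
        z = torch.randn(num_samples, 64, device=obs_feat.device)
        cand = g(obs_feat[i:i + 1].expand(num_samples, -1), z).flatten(1)  # [S, H*A]
        d = torch.norm(cand - target[i], dim=1)             # distance to the expert trajectory
        # Rejection step: drop candidates already within eps of the expert
        # trajectory, so the gradient only pulls in not-yet-covered samples.
        keep = d > eps
        if keep.any():
            losses.append(d[keep].min())
    return torch.stack(losses).mean() if losses else torch.zeros((), device=obs_feat.device)
```

The per-datapoint nearest-neighbour pull is what yields the "every expert trajectory is covered by some sample" behaviour the reviews refer to; the hyperparameters the reviewer lists ($K'$, $\epsilon$, Top-K weight) all live in how candidates are drawn, rejected, and weighted in this step.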
PRISM: Performer RS-IMLE for Single-pass Multisensory Imitation Learning

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
The paper aims to unify real-time control, multimodal sensor fusion, and multimodal action generation in a single-pass policy for imitation learning. It eliminates iterative diffusion/flow sampling while maintaining multimodal expressivity and temporal smoothness, and it reframes diffusion and flow models as approximate stochastic regularizers of an underlying implicit mapping. The idea is to combine a temporal multisensory encoder with a Performer-based generator. The implicit generative framework ensures coverage of multimodal expert behaviors without requiring adversarial or iterative processes.

Strengths:
- The central claim is that multimodal diversity can be achieved without iterative sampling, which is conceptually reasonable and empirically validated. The method approximates the data distribution by ensuring that every expert trajectory is covered by at least one generated sample, which is a principled alternative to adversarial or diffusion-based modeling and avoids training instability and heavy computation.
- The model's bidirectional attention inherently enforces motion continuity and temporal consistency. The IMLE-based formulation is mathematically consistent with prior work and is extended to batch-global coverage.
- The method is evaluated across different tasks and environments, including low-dimensional and high-dimensional sensory inputs, showing versatility. Real-world deployment at a reasonable frequency shows that the design is practical.

Weaknesses:
- RS-IMLE matches samples implicitly but does not yield a tractable log-likelihood or uncertainty measure, limiting interpretability and making it less suitable for planning or risk-sensitive control compared to probabilistic diffusion policies.
- The threshold scheduling is empirical. Performance can vary significantly with the choice of $\epsilon_{RS}$, and the paper lacks theoretical guidance for selecting it.
- Tasks are short or moderately long. It is unclear whether the pipeline retains temporal consistency over hundreds of steps.
- The paper notes that the method can collapse toward the most frequent modes in the dataset and proposes to handle this by introducing a soft coverage term, but the justification seems empirical.

Questions:
- Would adaptive thresholds based on local manifold density improve robustness across heterogeneous datasets?
- How does RS-IMLE behave when the dataset contains multimodal clusters that never co-occur in a minibatch?
- What happens if $\epsilon_{RS}$ or K is too aggressive? Does the policy overfit to frequent modes or drift to outliers?

EditLens Prediction: Fully AI-generated
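To make the threshold-scheduling concern concrete, one plausible calibration rule (a hypothetical illustration, not the schedule used in the paper) sets $\epsilon_{RS}$ as a low quantile of candidate-to-expert distances within the current batch. Smaller batches then yield noisier quantile estimates, which is exactly the batch-size sensitivity raised in the first review.

```python
import torch

def calibrate_eps(candidates, experts, quantile=0.10):
    """Hypothetical rule: rejection threshold = low quantile of within-batch
    candidate-to-expert distances (the 10% quantile is an arbitrary choice).

    candidates: generated action sequences, shape [B, S, H, A]
    experts:    expert action sequences,    shape [B, H, A]
    """
    d = torch.norm(candidates.flatten(2) - experts.flatten(1).unsqueeze(1), dim=-1)  # [B, S]
    return torch.quantile(d.flatten(), quantile).item()
```

Under such a rule, an adaptive or density-aware threshold (as asked in the first question above) would amount to replacing the global quantile with a per-cluster or per-sample statistic.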
PRISM: Performer RS-IMLE for Single-pass Multisensory Imitation Learning

Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Summary:
PRISM introduces a fast, single-pass imitation learning policy that handles rich sensory streams (vision, depth, tactile, proprioception, and audio when available) and produces full action sequences in one forward computation. Instead of relying on slow iterative sampling like diffusion or flow-matching models, PRISM combines a temporal multisensory encoder with a Performer-based linear-attention generator and augments IMLE training with a batch-global rejection mechanism to maintain multimodal action diversity and trajectory smoothness. The policy generates multiple candidate futures in parallel, selects a coherent action sequence in a receding-horizon manner, and achieves real-time control rates with significantly smoother motions. Experiments across MetaWorld, CALVIN, Robomimic, and Unitree GO2 hardware show that PRISM consistently outperforms recent diffusion, flow-matching, and IMLE baselines in success rates, robustness to missing modalities, and motion quality, especially in low-data settings.

Strengths:
- The paper provides strong empirical validation through both extensive simulation benchmarks (MetaWorld, CALVIN, Robomimic) and real-world deployment on a Unitree GO2 manipulation platform.
- RS-IMLE enables efficient parallel candidate generation and selection, allowing single-pass inference with low latency and avoiding the costly iterative denoising loops of diffusion and flow-matching methods.
- The Performer-based architecture supports real-time multisensory control, scaling effectively to visual, proprioceptive, depth, tactile, and audio inputs while maintaining smooth and coherent action sequences.
- Sensor ablation experiments reveal robustness to partial observability and provide useful insights into the relative contribution of each sensing modality.

Weaknesses:
- The flow in the methodology section is sometimes difficult to follow, particularly around Section 4.3, where the introduction of the robust sequence distance and the RS-IMLE steps feels abrupt. Providing clearer preliminaries, unified notation, and a more gradual build-up to the rejection-sampling formulation (e.g., by first introducing the standard IMLE objective before the proposed batch-global extension) would improve clarity and overall conceptual continuity.
- The paper lacks explicit reporting of model parameter counts and computational footprint across architectures. Since one of the core claims is real-time efficiency, additional information on model size would strengthen the evaluation and help readers contextualize the single-pass efficiency relative to diffusion, flow, and prior IMLE baselines.
- Although trajectory smoothness and jerk metrics are reported, the paper would benefit from a more detailed analysis of failure cases, especially in scenarios where multiple candidate trajectories might still produce suboptimal or unstable executions. Understanding when and why the method fails would improve transparency.
- L220: The latent variable $z$ appears suddenly; consider motivating its role in enabling multimodal action generation before introducing it formally.
- L260: The notation $\Phi$ is used without explanation; please clarify that it refers to the Performer random feature map or provide a pointer to its definition.
- L142: The notation "top-K'" is used without explicit justification, and its difference from the standard top-K criterion may appear counter-intuitive, even though it is later listed in the notation table. A short explanation at first mention would make the intent clearer.
- L1093: The appendix section titled "F Appendix" is empty and looks like a placeholder; consider removing or completing this subsection.

EditLens Prediction: Fully AI-generated
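For context on the efficiency claims and the $\Phi$ notation flagged above: a Performer replaces softmax attention with positive random features so that keys and values are aggregated once, giving cost linear in sequence length. A generic single-head FAVOR+-style sketch is shown below; it is not the paper's exact implementation (which may use multi-head, batched attention and periodic redrawing of features).

```python
import torch

def performer_attention(Q, K, V, num_features=256):
    """Softmax attention approximated with positive random features (FAVOR+ style).

    Q, K: [T, d] queries and keys; V: [T, d_v] values (single head, no batch).
    Cost is O(T * num_features * d_v) instead of O(T^2 * d_v) for exact softmax.
    """
    d = Q.shape[-1]
    W = torch.randn(num_features, d, device=Q.device)        # random projections w_i ~ N(0, I)

    def phi(X):
        # Positive features whose inner products approximate exp(q.k / sqrt(d)).
        Xs = X / d ** 0.25
        return torch.exp(Xs @ W.t() - (Xs ** 2).sum(-1, keepdim=True) / 2) / num_features ** 0.5

    Qf, Kf = phi(Q), phi(K)                                   # [T, m]
    KV = Kf.t() @ V                                           # [m, d_v], aggregated once
    normalizer = Qf @ Kf.sum(dim=0)                           # [T], approximate softmax denominator
    return (Qf @ KV) / normalizer.unsqueeze(-1)               # [T, d_v]
```

As the first review notes, the practical speedup over exact attention depends on the number of tokens; with short observation/action sequences the asymptotic advantage may be modest.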
PRISM: Performer RS-IMLE for Single-pass Multisensory Imitation Learning

Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper introduces **PRISM**, a novel imitation learning framework built upon a batch-global rejection-sampling variant of IMLE. PRISM integrates inputs from multiple sensors—including RGB, depth, tactile, audio, and proprioception modalities—and employs a Performer-based architecture that enables efficient inference through a single forward pass. Extensive experiments on the MetaWorld, CALVIN, RoboMimic, and real-world loco-manipulation benchmarks demonstrate that PRISM consistently outperforms existing transformer- and diffusion-based imitation learning methods. These results highlight PRISM as an efficient and high-performing framework capable of modeling complex multimodal action distributions.

Strengths:
1. The paper presents extensive simulation experiments across diverse tasks from **MetaWorld**, **RoboMimic**, and **CALVIN**, covering a wide range of control scenarios with varying numbers of available modalities.
2. The paper includes **real-robot loco-manipulation experiments**, which validate the effectiveness of the proposed **PRISM** method in real-world settings.
3. The paper is **well-written** and provides comprehensive details.

Weaknesses:
1. The proposed **PRISM** method is compared against different sets of baselines across benchmarks. This appears to result from directly adopting results from the original papers, which may not be an ideal practice (please correct me if I am mistaken). It would strengthen the evaluation if the authors could include a consistent set of baselines, or at least make some major baselines available across all benchmarks. Additionally, the naming of baselines varies between benchmarks, which introduces potential ambiguity.
2. The paper lacks detailed descriptions of **baseline implementations**. For example, it is unclear which modalities are used by each baseline and how they are modified to incorporate multiple modalities. A dedicated section detailing the implementation and modality integration of baselines is highly recommended.
3. It is noted that **AdaFlow** outperforms PRISM on the **RoboMimic** benchmark, despite a slightly higher NFE. The authors are encouraged to discuss this result and, if possible, provide additional comparisons across other benchmarks.
4. The authors should consider adding a more **comprehensive breakdown** of PRISM's performance under different modality combinations. While the current paper includes breakdowns on **CALVIN** with occlusion, depth dropout, and tactile dropout, it would be valuable to explore more modality combinations. Furthermore, the authors should analyze the effect of each modality and discuss whether some modalities are redundant or whether PRISM fails to effectively leverage them.

Questions:
1. (Related to Weakness 1) Do **“DP” in MetaWorld**, **“Diffusion Policy” in CALVIN and the real-robot suite**, and **“DDiM” in RoboMimic** refer to the same algorithm? Likewise, do **“FlowPolicy” in MetaWorld**, **“Flow Matching Policy” in CALVIN**, and **“Flow Matching” in the real-robot suite** represent the same method? If so, the same algorithm should be consistently named across different benchmarks to avoid confusion.
2. (Related to Weakness 2) I am unclear about which modalities are used by the baselines and how they are integrated. For example, **3D Diffusion Policy (DP3)** relies on point cloud inputs, which are typically derived from depth maps. However, as stated in **Appendix C.1**, **MetaWorld** does not provide depth information. This discrepancy suggests that the authors may have directly taken results from the original papers (please correct me if I am mistaken).
3. Can the authors explain in detail how action selection is done during inference without action target labels? I am confused about **ProxyScore** and the **deterministic tie-break** in Algorithm 2.
4. As shown in **Figure 1**, text is listed as one of the modalities supported by the proposed method. However, I could not find any experiments involving text inputs. Could the authors clarify why such experiments were not included?
5. Did the authors use any **pre-trained encoders** for sensor feature extraction? If not, it is recommended that they consider incorporating pre-trained encoders in future work.

EditLens Prediction: Lightly AI-edited
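Regarding the ProxyScore question, a label-free selection rule in a receding-horizon loop can look like the sketch below. It uses a hypothetical continuity-based score and an explicit index-based tie-break purely for illustration; the paper's actual ProxyScore and tie-break in Algorithm 2 may be defined differently.

```python
import torch

def select_candidate(candidates, prev_tail, overlap=4):
    """Pick one of several candidate action chunks at inference time (illustrative).

    candidates: [S, H, A] candidate action sequences from a single forward pass
    prev_tail:  [overlap, A] final actions of the previously committed chunk, or None
    """
    if prev_tail is None:
        return candidates[0]                                   # nothing to match yet
    # Hypothetical proxy score: prefer the chunk whose opening actions continue
    # the previously executed motion most smoothly (lower is better).
    diffs = (candidates[:, :overlap] - prev_tail).flatten(1)   # [S, overlap*A]
    scores = torch.norm(diffs, dim=1)                          # [S]
    # Deterministic tie-break: among equal scores, take the lowest candidate index.
    best = min(range(len(scores)), key=lambda i: (scores[i].item(), i))
    return candidates[best]
```

The selected chunk is then partially executed and the loop repeats, which is the receding-horizon behaviour described in the third review's summary.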