ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Title | Ratings | Review Text | EditLens Prediction
Projected Coupled Diffusion for Test-Time Constrained Joint Generation Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The submission proposes a test-time framework to sample from multiple pre-trained diffusion models while (1) promoting some notion of joint "correlation" across variables and (2) enforcing constraints on the individual outputs of each model. PCD augments the usual Langevin / DDPM updates with (1) a user-specified coupling cost between variables and (2) projections at every diffusion step to guarantee individual constraint satisfaction. Their framework recovers several existing methods (e.g., classifier guidance, projected diffusion, and some forms of compositional diffusion) as special cases. Empirical demonstrations cover three domains: multi-robot navigation (collision avoidance coupling cost, velocity constraints), robot manipulation on PushT (diverse, non-intersecting trajectories as a coupling cost with velocity constraints), and "paired" face generation (age-contrast coupling with gender/attribute constraints). The paper is well structured and clearly written. The problem of composing several trained diffusion models at test time under constraints is indeed relevant, and prior work has aimed to tackle similar problems when sampling a single variable (e.g., a single image). The main novelty, in my understanding, is the ability to compose diffusion models potentially defined over different variables. The approach is widely applicable and requires minimal modifications of standard sampling algorithms. The use case in multi-robot systems is well motivated, since constraints arise naturally in physical systems and training joint distributions can be computationally costly as the number of agents grows. The toy example on images -- although artificial -- showcases the applicability of the framework in a totally different setting. I think that the notion of "correlated" variables, which is emphasized throughout the paper, could be better defined/explained/motivated. Since the framework is flexible enough to accommodate arbitrary coupling costs, correlations are in my view an understatement with respect to the practical utility and applicability of the proposed method and the underlying problem it is tackling. In PushT, trajectory dissimilarity appears to me to be a contrived objective/task; there might be manipulation examples (e.g., bimanual manipulation over multiple objects) that lend themselves more naturally/directly to the framework. Can you scale the approach beyond two variables/agents? Can you numerically verify that the two coupling limits match known methods (classifier guidance, projection), at least in a toy experiment? Fully human-written
Projected Coupled Diffusion for Test-Time Constrained Joint Generation Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a method to jointly sample two independent variables coupled using a joint cost function. Practically, given the independent denoising score functions of each variable, the goal is to couple their denoising processes directly at test time to sample them jointly. The paper introduces some variants: (1) standard LMC/DDPM-based: perform one-step denoising, apply classifier guidance using the gradient of the cost function, and then perform projection to map the noisy latents to the safe set. (2) use Tweedie clean estimates to calculate the gradient of the cost function. Overall, the paper claims that the proposed method generalizes the formulations of classifier guidance, projected diffusion, compositional diffusion, and joint diffusion. I am familiar with a concurrent work in this line of research: https://arxiv.org/abs/2509.08775. While my concerns are based on the concurrent work, my judgment of this paper will be independent of it. 1. The paper's unified formulation of classifier guidance, projected diffusion, compositional diffusion, and joint diffusion is very impactful and timely. The authors have provided exhaustive results to analyze the performance of their algorithm on a toy domain, a planar robotics task, and an image generation task as well. 2. PCD can impose hard constraints in addition to joint diffusion while being computationally efficient. 1. Data fidelity vs. projection: Since the cost function and projection operation hold for the clean distribution, data fidelity is of primary importance here. The more realistic the clean estimates are, the better evaluation and guidance can be done. However, as the authors acknowledge as well, the projection operation hurts data fidelity when applied to low-quality clean estimates (particularly at higher noise levels), as also observed by the concurrent work. 2. Differentiability of the cost function: This is a rather strong assumption (which is also used by many prior works like MPD, DPCC). For example, the signed-distance-based indicator function in SHD might not be differentiable everywhere. This is the case especially for collision-checking objectives in real robot executions. I agree that effective engineering design can mitigate this, but it limits the scalability of the approach. 3. Convexity of constraints: Since two experiments in the paper deal with navigation and manipulation, it is worth noting that non-convex safe sets are pretty common in these two settings, most commonly arising from obstacle avoidance constraints. For example, in the highway task, if the trajectories are trained without the rectangle in between and forced to avoid it at test time, the resulting safe set becomes non-convex. This again limits the scalability. 1. How feasible is designing a projection operator for every task? How sensitive is the overall quality of samples to projection hyperparameters? 2. How is the cost function in general defined for noisy latents?
It seems from the algorithms that the method always uses the clean estimates for projection, and using the same for the cost function also empirically results in the best performance. Fully human-written
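For concreteness, the per-step update the two reviews above describe (own-score denoising, classifier-guidance-style coupling via the cost gradient, then projection onto the safe set) can be sketched as follows. The standard-normal scores, separation cost, and box projection are toy stand-ins, not the paper's actual operators:

```python
import torch

def pcd_step(x, y, score_x, score_y, coupling_cost, project, t,
             step_size=0.01, lam=1.0):
    """One projected-coupled Langevin update: (1) follow each variable's own
    score, (2) add classifier-guidance-style gradients of the coupling cost,
    (3) project the proposals back onto the feasible set."""
    x = x.detach().requires_grad_(True)
    y = y.detach().requires_grad_(True)
    gx, gy = torch.autograd.grad(coupling_cost(x, y), (x, y))
    with torch.no_grad():
        x_new = (x + step_size * (score_x(x, t) - lam * gx)
                 + (2 * step_size) ** 0.5 * torch.randn_like(x))
        y_new = (y + step_size * (score_y(y, t) - lam * gy)
                 + (2 * step_size) ** 0.5 * torch.randn_like(y))
    return project(x_new), project(y_new)

# Toy instantiation: standard-normal scores, a coupling cost whose descent
# pushes x and y apart, and projection onto the box [-1, 1]^d.
score = lambda z, t: -z
cost = lambda x, y: -((x - y) ** 2).sum()
proj = lambda z: z.clamp(-1.0, 1.0)

x, y = torch.randn(8), torch.randn(8)
for t in reversed(range(50)):
    x, y = pcd_step(x, y, score, score, cost, proj, t)
```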
Projected Coupled Diffusion for Test-Time Constrained Joint Generation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces Projected Coupled Diffusion (PCD), a novel test-time framework for generating jointly correlated samples from multiple pre-trained diffusion models while simultaneously enforcing hard, task-specific constraints. The core problem is that generating from a joint distribution $p(x, y)$ is difficult, especially when $x$ and $y$ must be correlated in a specific way and satisfy some given hard constraints. PCD addresses this by modifying the reverse diffusion sampling process. At each step, the update for each variable is guided by three components: 1. The score from its own pre-trained model (e.g., $s_X^\theta(x_t, t)$). 2. A gradient from a coupling cost function $c(x, y)$ that encourages the desired correlation between variables. 3. A projection operator $\Pi_{\mathcal{K}}$ that forces the updated sample back into the feasible set of hard constraints. This approach requires no retraining and unifies compositional generation with hard-constraint enforcement at test time. The authors demonstrate PCD's effectiveness across three distinct domains: multi-robot motion planning, diverse robot manipulation, and constrained image-pair generation. 1. This paper unifies two important aspects of diffusion sampling, coupled generation and constrained generation, in a clear and effective way. The proposed PCD provides a general, test-time-only framework to address this. PCD operates over multiple pre-trained models and costs (analytic or learned), requiring no retraining of the base diffusions, which makes the method applicable to a wide variety of settings, for example, where paired data and costs are scarce or proprietary. 2. This paper is well written, easy to follow, and conducts extensive studies across various domains, including multi-robot planning, Push-T trajectory pairs, and paired face generation. 1. PCD relies on projection and gradient-based updates to enforce constraints. How does it handle test-time constraints that are non-differentiable, for instance, a logic-based rule where a sample is accepted only if it passes some non-differentiable verification? A concrete example would help. 2. The performance of the proposed method might require the estimated Tweedie terms to be of high quality. Otherwise, the guidance term will likely be inaccurate or may even compromise the overall sampling process. 3. This paper introduces gradient-based guidance and per-step projections, which can increase wall-clock latency compared with non-gradient-based baselines. In Appendix C.1, the authors also mention "PCD is approximately 4 ∼ 7× slower than vanilla diffusion mainly due to the per-step projection operation." Are there any potential ways to enable faster sampling with PCD? 4. Could the authors provide a curve of performance vs. the number of sampling steps to illustrate how the quality of the estimated Tweedie term affects performance? An intuition is that with more sampling steps, the Tweedie estimate will be more accurate, which could facilitate better guidance. 5. I would encourage the authors to include a discussion of the limitations of PCD along with failure analysis and qualitative examples. See the Weaknesses above. Fully human-written
DGNet: Self-Supervised Delta2Gamma Multi-Band EEG Representation Learning for Dementia Classification Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper presents a model for dementia classification. They extract 5 bands of the EEG signal and assign an individual encoder to each band for representation learning. The SimCLR framework is used for self-supervised pre-training. The method is evaluated on the ADFTD dataset using Leave-One-Subject-Out (LOSO) cross-validation. N/A 1) **Lack of novelty.** The overall idea lacks novelty. Extracting canonical EEG frequency bands and applying CNN-based classification is a well-established approach that has been widely explored for nearly 10 years. Similarly, the use of SimCLR for self-supervised pre-training is a classical framework that has been popular for several years. Therefore, the methodological contribution of this paper appears limited. 2) **Incorrect evaluation protocol and misunderstanding of basic concepts.** The paper demonstrates fundamental misunderstandings of the EEG domain and deep learning training protocols. In a leave-one-subject-out (LOSO) setting, one subject is held out for testing, and all remaining subjects are used for training. There is no validation set in LOSO; thus, early stopping must not be applied, as it effectively tunes on the test set, leading to severe data leakage and inflated performance. Moreover, the authors use self-supervised pre-training on all subjects, including the "left-out" subjects in the LOSO loop, which again results in information leakage. This reminds me of a funny paper several years ago called "Pretraining on the test set is all you need" [1]. Such a trick renders the reported results unreliable due to strong performance inflation. 3) **Improper preprocessing for the ADFTD dataset.** The preprocessing pipeline for ADFTD is inappropriate for an end-to-end deep-learning framework. The authors segment the EEG into 30-second windows for training, drastically reducing the number of training samples and severely disadvantaging deep models for the EEG-based dementia detection task. Such long segments are typically used only when hand-crafted features are extracted before learning (e.g., DICE-Net [2]). In contemporary end-to-end deep learning EEG models for dementia detection, segment lengths of 1-4 seconds (e.g., 200Hz sampling → 1s segments) are standard, as the ultimate goal is not EEG-sample classification but per-subject detection. For example, LaBraM fine-tuned with proper 1-second segments and without early stopping achieves at least 80% LOSO accuracy on ADFTD. The current setup artificially lowers baseline performance and does not provide a fair comparison. 4) **Misleading title and incomplete problem formulation.** Although the title claims dementia classification, the experiments only include Alzheimer's disease (AD) and healthy controls (HC) from the ADFTD dataset, completely excluding frontotemporal dementia (FTD) subjects. This mismatch between title and experimental design undermines the scope and contribution of the work. [1] Schaeffer, Rylan. "Pretraining on the test set is all you need." arXiv preprint. [2] Miltiadous, Andreas. "DICE-net: a novel convolution-transformer architecture for Alzheimer detection in EEG signals." IEEE Access. See weakness. Lightly AI-edited
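As context for the leakage argument above, a leakage-free LOSO protocol keeps the held-out subject out of both self-supervised pretraining and any early-stopping or model selection. A minimal sketch, assuming numpy arrays and hypothetical pretrain/finetune/evaluate helpers:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_evaluate(X, y, groups, pretrain, finetune, evaluate):
    """Leakage-free LOSO: the held-out subject is excluded from SSL
    pretraining AND from any early-stopping / hyperparameter selection,
    which must use only the remaining N-1 subjects."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        encoder = pretrain(X[train_idx])                  # SSL on N-1 subjects only
        model = finetune(encoder, X[train_idx], y[train_idx])
        scores.append(evaluate(model, X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```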
DGNet: Self-Supervised Delta2Gamma Multi-Band EEG Representation Learning for Dementia Classification Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes DGNet, a self-supervised learning model based on SimCLR for classifying dementia from EEG signals. The core architectural proposal consists of a "Multi-Band Head" approach, where the EEG signal is decomposed into five standard frequency bands (delta, theta, alpha, beta, gamma). Each band is then processed by an independent CNN encoder and projection head before the resulting features are fused for the downstream classification task - The goal of progressing EEG for disease classification such as dementia is highly important given its severity and high prevalence. - **Significant lack of novelty**: The paper's core components are standard practice and it fails to position itself within the existing literature. Analyzing the five standard EEG frequency bands is a fundamental preprocessing step in a vast majority of EEG-based studies, not a novel contribution specific to dementia. The application of SimCLR-style contrastive learning with data augmentations to EEG signals has been explored previously (e.g., [1,2,3]), but the authors do not cite or compare against any of this work. - **Poorly motivated architecture**: The central architectural "contribution" is a simple late-fusion strategy. The paper provides no neurophysiological or machine learning justification for why processing each frequency band with a separate, independent CNN encoder is superior to standard, more efficient models that process all bands jointly (i.e., early or mid-fusion). This choice seems arbitrary and is not supported by analysis. - **Insufficient and confusing experimental validation**: The evaluation is too weak to support the paper's claims. The entire study is based on a single, small dataset (n=88 participants). This is insufficient to make broad claims about state-of-the-art performance or the generalizability of the method. The reporting of results is confusing and inconsistent. Table 2 is specified as using Leave-One-Subject-Out (LOSO) validation, but the validation method for Table 1 is unstated. Furthermore, the sets of baseline models in Table 1 and Table 2 are inexplicably different. - **Poor paper quality and structure**: The paper contains several critical structural flaws. The introduction is overly general and fails to provide any meaningful review of the specific field (self-supervised learning for EEG). Most egregiously, the conclusion introduces a new method: "Adaptive Multi-head Contrastive Learning (AMCL) strategy (Wang et al., 2024)" that I don't think is ever mentioned, implemented, or evaluated in the main body of the paper. Unless I've missed something, this suggests a very low-effort or rushed submission. 1. Mohsenvand MN, Izadi MR, Maes P. Contrastive representation learning for electroencephalogram classification. In Machine Learning for Health 2020 Nov 23 (pp. 238-253). PMLR. 2. Yang, C., Xiao, C., Westover, M.B. and Sun, J., 2023. Self-supervised electroencephalogram representation learning for automatic sleep staging: model development and evaluation study. JMIR AI, 2(1), p.e46769. 3. Gijsen S, Ritter K.
Self-supervised Learning for Encoding Between-Subject Information in Clinical EEG. In Learning Meaningful Representations of Life (LMRL) Workshop at ICLR 2025 Mar 6. 1. Could you please explain Tables 1 and 2? How are the models in Table 1 evaluated? Why are the models different between these two tables? 2. Are you able to provide (extensive/significant) additional evaluations to support the claims about the progress of your methods? 3. Would you be able to explain what I've misunderstood about your method/architecture and how it progresses the field of EEG classification in any way? Fully human-written
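Since both DGNet reviews hinge on the standard SimCLR objective, here is the usual NT-Xent loss for reference; the adaptive-temperature variant the paper describes would presumably replace the fixed `tau` below (an assumption on my part):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """Standard NT-Xent loss over a batch of positive pairs (z1[i], z2[i]);
    all other samples in the 2N-sized batch act as negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x d
    sim = z @ z.t() / tau                                # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```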
DGNet: Self-Supervised Delta2Gamma Multi-Band EEG Representation Learning for Dementia Classification Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes DGNet, a self-supervised learning (SSL) framework for dementia classification using EEG signals. The core contribution is the multi-head SimCLR architecture that is specifically designed to leverage the known neurophysiology of dementia. The key idea is to decompose the EEG signal into its five primary frequency bands. The DGNet architecture consists of two stages: 1. Pre-training (SSL): An independent CNN encoder and projection head is used for each of the five frequency bands. The model is pre-trained on unlabeled data using a contrastive, adaptive Normalized Temperature-scaled cross-entropy (NT-Xent) loss. 2. Linear Evaluation: The pre-trained encoders for all five bands are frozen. Their output feature vectors are concatenated and fed into a simple, trainable MLP classifier for the downstream task of classifying Alzheimer's Disease (AD) vs. Cognitively Normal (CN) subjects. 1. Clear Motivation: The paper's core strength is its foundation in established neurophysiology. The architecture is explicitly designed to model the "slowing" of brain oscillations (i.e., differential changes across $\delta, \theta, \alpha, \beta, \gamma$ bands) that is a known biomarker for dementia. 1. Methodological Ambiguity: The paper lacks clarity on several key methodological details. - Training Objective: For instance, two distinct objectives are described for the pre-training stage (Section 2.2), but the connection between them is unclear, where one is using the "worst-case" negative sample while the other uses all negatives. Additionally, the regularization term mentioned in Equation 3 is only used in one of the two objectives. - Multi-head Architecture: The paper describes at least 2 different multi-head architectures (Section 4.1, 4.2, 4.3), namely "adaptive 5 band heads" and "Multi-head (5 heads)". It is not clear how these differ, and the attribution of each to the final results is not well explained. 2. Single Dataset Evaluation: The model is developed and validated on a single, relatively modest-sized dataset (88 subjects total). The generalizability of a model is best confirmed by testing it on an external, out-of-distribution dataset (e.g., from a different hospital, using different EEG hardware). The current experimental setup does not provide evidence of generalizability beyond the chosen dataset. 3. Untapped Data: The chosen dataset also contains data for 23 patients with Frontotemporal Dementia (FTD). The paper's experiments are limited to the binary AD vs. CN classification. Given the paper's title "Dementia Classification", it's a missed opportunity not to test the model's ability to perform the more challenging (and clinically relevant) 3-class (AD vs. FTD vs. CN) classification. This would be a valuable extension. 1. Clarification on Frequency Band Extraction: This is my main point of confusion. Since this is a critical architectural detail, could you please clarify the exact procedure in Section 2.1 and Figure 2? - Option A: Are traditional (e.g., Butterworth) bandpass filters first applied to the raw signal to create 5 separate band-limited signals? 
And then each of these 5 signals is passed to its own 1D depthwise conv encoder? - Option B: Is the raw signal (or a single broadband-filtered signal) fed into the "frequency band extractor" module, where the "five parallel 1-dimensional depthwise convolution layers" are expected to learn band-specific features from the full signal? - Figure 2 seems to imply Option A, but the text is ambiguous. - Figure 3, on the other hand, seems to suggest Option B, since the embeddings after passing through the encoder still appear to have distinct frequency content, as "(In the) spectrogram visualization of embeddings from the encoder ..., y-axis denotes the frequency ranges corresponding to each band". 2. Clarification on Training Objective: In Section 2.2, two different training objectives are described (Equations 1 and 3). I assume that only the first objective (Equation 1) is used for pre-training the model, while the second objective (Equation 3) is only used for ablation purposes. Is that correct? If so, please clarify this in the text. 3. Use of more Data: The dataset also includes 23 subjects with Frontotemporal Dementia (FTD). Did you perform any experiments on the 3-class (AD vs. FTD vs. CN) classification task? Showing how DGNet performs on this more complex task would significantly strengthen the paper's claim of being a general "dementia classification" model. Also, evaluating your method on more datasets would significantly strengthen the evidence for the claim of generalizability. 4. Training Protocol: For the SSL pre-training stage (Fig 1a), was the model pre-trained once on the entire unlabeled dataset (i.e., data from all 88 subjects)? Or was the pre-training performed inside each LOSO fold, using only the (N-1) subjects' data for that specific fold? Although the former is common practice in SSL, it would cause data leakage here, since the current evaluation relies on the same dataset. Using the same set for validation and testing together with early stopping would also lead to indirect data leakage, so please clarify. 5. Clarification on Experiment Results: Sections 4.1 and 4.2 describe multiple sets of experiments with different architectures. However, the distinctions between these comparisons are not very clear. For example, do the results in Table 1 use the LOSO setting? Moreover, how do the experiments in Section 4.2 (Table 2) differ from those in Section 4.1 (Table 1)? Please clarify the differences between these experiments and their results. 6. Epoch Length: Supplementary Figure 5b shows that performance increases significantly with epoch length, with 30s being the best. This makes sense, as longer epochs are needed to capture low-frequency (delta) oscillations. Did you test any epochs longer than 30s? Is it possible that 60s (a common length for sleep EEG) would be even better? Fully human-written
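To pin down what the reviewer's "Option A" would look like in code, a sketch of classical Butterworth band decomposition before per-band encoding; the 256 Hz sampling rate and exact band edges are assumptions, and the paper may instead learn the filters (Option B):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def decompose_bands(eeg, fs=256, order=4):
    """Split a (channels, time) EEG array into the five canonical bands with
    zero-phase Butterworth bandpass filters (the reviewer's 'Option A')."""
    out = {}
    for name, (lo, hi) in BANDS.items():
        sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
        out[name] = sosfiltfilt(sos, eeg, axis=-1)
    return out  # each band-limited signal would then feed its own encoder

x = np.random.randn(19, 30 * 256)   # 19 channels, one 30-second epoch
bands = decompose_bands(x)
```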
DGNet: Self-Supervised Delta2Gamma Multi-Band EEG Representation Learning for Dementia Classification Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. DGNet introduces a self-supervised, multi-band SimCLR approach for EEG-based dementia classification, where separate encoders learn frequency-specific representations with an adaptive contrastive loss. While the approach seems to be conceptually sound and tailored to EEG physiology, the evaluation on only one small dataset and the extremely high reported accuracy raise concerns about generalization and methodological robustness. - Proposes a domain-informed self-supervised framework that integrates frequency-band–specific encoders and an adaptive-temperature contrastive loss, creatively adapting SimCLR to EEG data. - Targets an important and clinically relevant problem: EEG-based dementia classification. - The paper reports identical results (93% accuracy, 93% F1) in Tables 1 and 2 despite claiming different validation strategies (standard classification (it is not clear what was done here) vs. LOSO cross-validation). This inconsistency raises serious concerns about methodological correctness and potential data leakage. The authors should clearly describe and verify their evaluation protocol. - The reported gains (over 20-30% improvement on a dataset of only 88 subjects) appear implausibly large compared to prior EEG-based dementia studies. Statistical significance testing, multiple-dataset validation, or external replication would strengthen credibility. - While the adaptive multi-band SimCLR design is well-motivated, it represents an incremental adaptation of existing self-supervised learning methods rather than a fundamental methodological advance. - The paper devotes excessive space to dementia background and EEG basics while providing insufficient technical detail on, e.g., pretraining splits. In addition, it tends to overstate the role of EEG in dementia classification, presenting it as a definitive diagnostic tool rather than a complementary modality with ongoing methodological challenges. - Could you explicitly clarify how each evaluation was conducted? Were both tables based on LOSO cross-validation, or was a different split used in Table 1? How did you ensure no data leakage between pretraining and evaluation subjects, given the small dataset? - Given the small dataset size, what measures were taken to prevent overfitting? - Was the self-supervised pretraining performed on all subjects, including those used in the LOSO test folds? - Please evaluate your method on additional (external) datasets. - Were all baseline models re-implemented and trained under identical conditions, or were results taken from prior literature? - How were hyperparameters tuned for each baseline? If tuned on the same dataset, how was data leakage prevented? - Could you further justify why independent encoders per frequency band outperform joint processing approaches? Fully AI-generated
DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces DriftLite, a training-free method for steering diffusion models at inference time using an improved sequential Monte Carlo (SMC) scheme. To mitigate the path degeneracy problem common in SMC, the authors derive an optimal, variance-minimizing drift for the governing Feynman-Kac PDE and propose two practical approximations: Variance-Controlling Guidance (VCG) and Energy-Controlling Guidance (ECG). These approximations can be obtained by solving a linear system at each sampling step, adding only a small computational overhead. The methods are analyzed on Gaussian mixture models, where they are empirically shown to reduce variance and improve sample quality. They are also tested on practical benchmarks (molecular systems and protein-ligand co-folding), demonstrating improved inference-time steering capability over the standard SMC baseline. 1. The paper is clearly written, and the core mathematical arguments are generally well-explained. 2. The work tackles a significant but often-neglected challenge of the weight degeneracy (or path degeneracy) in SMC-based diffusion steering. 3. The paper translates a mathematical insight into novel and practical (training-free) algorithms. 4. Their experiments are well-designed to support their arguments. The proposed methods clearly outperform the SMC baseline on real-world benchmarks like molecular systems and protein-ligand co-folding. 1. The method's practical limitations could be discussed in more detail. It introduces a significant computational overhead, with experiments showing up to a 6x increase in runtime over the SMC baseline. Furthermore, the reward-tilting framework fundamentally relies on access to the gradient of the reward, which is inaccessible in many black-box applications, limiting its practical scope. 2. The paper would be strengthened by adding empirical analysis of the approximation error from VCG and ECG. 3. No source code is provided. 1. I have little expertise in functional analysis and found the formal proof of Proposition A.5 difficult to follow. Could you provide a more intuitive explanation for why solving the Poisson equation (Eq. 3.2) yields the optimal control? 2. Line 269, "... where reweighting can be unstable": Aren't VCG and ECG without resampling still weighted with path-level weights? Those weights should still have high variance. I don't understand the rationale behind this (if you want to mitigate the path degeneracy from resampling, then you can consider using tempered weights for resampling, e.g., [1]). 3. Why were different $\gamma$ values used in Figures 1, 2, and 3? 4. What is the main computational bottleneck of the algorithm in practice? (Hutchinson's estimator, or solving the linear equation?) 5. If $r$ is given by a large neural network, can this method still be used (in terms of memory and time complexity)? 6. Why didn't you consider the ALDP experiment (which is more multi-modal compared to LJ-13, to my knowledge)? 7. (suggestion) Line 417, "from $T=1.0$ to $0.4$": it might be better to say "with $\gamma = 2.5$" to avoid any confusion.
--- References [1] Choi, Sanghyeok, et al. "Reinforced sequential Monte Carlo for amortised sampling." arXiv preprint arXiv:2510.11711 (2025). --- LLM usage disclosure: I used an LLM to check the grammar and make each sentence clearer. Fully human-written
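For background on the weight-degeneracy issue this review raises, a bare-bones SMC loop with accumulated log-weights, the effective-sample-size (ESS) diagnostic, and multinomial resampling; the proposal and incremental weights below are toy stand-ins, not DriftLite's guided dynamics:

```python
import numpy as np

def smc(n_particles, n_steps, propose, incremental_logw, ess_threshold=0.5):
    """Generic SMC loop: propagate particles, accumulate Feynman-Kac-style
    log-weights, and resample when the normalized ESS drops too low."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal(n_particles)
    logw = np.zeros(n_particles)
    for t in range(n_steps):
        x = propose(x, t, rng)
        logw += incremental_logw(x, t)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        ess = 1.0 / np.sum(w ** 2)        # collapses toward 1 under degeneracy
        if ess < ess_threshold * n_particles:
            idx = rng.choice(n_particles, size=n_particles, p=w)
            x, logw = x[idx], np.zeros(n_particles)
    return x, logw

# Toy run: diffusive proposal, reward-tilting toward x ~ 2.
samples, _ = smc(512, 100,
                 propose=lambda x, t, rng: x + 0.1 * rng.standard_normal(x.shape),
                 incremental_logw=lambda x, t: -0.05 * (x - 2.0) ** 2)
```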
DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper extends previous work on adapting diffusion models at inference time to new tasks using Sequential Monte Carlo (SMC) methods, focusing on reward-tilting and sampling from an annealed version of the base model distribution. In particular, they target the problem of sampling from the sequence of distributions $q_t(x) = p_t(x)^\gamma \exp(r_t(x))$, where $p_t(x)$ is the original diffusion model distribution and $r_t(x)$ is some time-varying reward function, such that at $t=0$ we approach some predefined clean data distribution $p_0(x)^\gamma \exp(r(x))$. Following previous work, $N$ particles are sampled concurrently for each $t$, and the diffusion model is used to move these particles towards lower $t$ in the $q_t(x)$ chain. The resulting proposal particles are reweighted using Feynman-Kac-type equations and periodically resampled, enabling the particle ensemble to approximate the sequence of target distributions. The new idea in the paper is to analyze the Feynman-Kac-type equations and notice that there is additional freedom in choosing the proposal distributions, such that they can add an additional term to the drift in the Feynman-Kac equations if they correspondingly change the time evolution of the weights. The authors then show that there exists an optimal added drift such that it is the gradient of a scalar potential that follows a particular PDE. With this added design space, they propose two choices for the added drift: (1) minimizing the variance in the change of the weights, and (2) minimizing a Lagrangian of the aforementioned PDE such that, at the minimum, the optimal drift is obtained. In practice, they parameterize the added drift as a linear combination of $\nabla r_t(x)$, $\nabla \log p_t(x)$, and the original diffusion drift $u_t(x)$ for the variance-minimizing version, and a similar choice for the PDE-optimizing version. This yields a simple $3 \times 3$ system of equations per particle per step and hence a relatively lightweight correction, although evaluating the required matrices requires backpropagation through the diffusion denoiser. The authors evaluate the method on a 30-dimensional Gaussian Mixture Model, two toy particle systems, and a protein-ligand co-folding task. They notice that the method consistently improves effective sample size and reduces weight variance, outperforming the naive SMC baseline. On the protein-ligand task, they demonstrate that using physical energies as the rewards, the method can improve the physical validity of the generated samples. - The core idea of defining a correction to the proposal distribution that directly optimizes the variance or consistency with the target distributions is principled and seems to be novel, at least in the diffusion context. It seems to be quite simple to implement, does not cause a huge overhead, and seems to legitimately improve the performance of the SMC method considerably on the tasks chosen.
- The experiments on the GMM, DW-4 and LJ13 are quite thorough and well-presented, and the protein-ligand task is an example of a realistic problem that the method could help with; the method clearly works better than more naive SMC approaches. - Overall, the paper is well written and not too difficult to follow, with the caveat that the mathematical exposition may be difficult to follow for readers who are not familiar with this way of describing SMC methods in advance. Disclaimer: I am not an expert on SMC or SMC-based methods for diffusion models, and as such it is possible that I am missing some details, wider context or some previous literature. As such, I am open to being corrected. - The method incurs significant cost on the wall-clock time compared to the baselines, 2.3x-6x in the experiments for which the wall-clock time was reported. This is not mentioned in the limitations or in the main paper, however, and the conclusion instead states that the method causes minimal computational overhead, potentially misleading the reader or a reviewer who does not have the time to read through the Appendix. Backpropagation through the denoiser at each step requires much more memory than the forward pass, and my assumption is that it may limit the batch size in practice. - As such, it might be more fair to compare to the baselines by adjusting the inference hyperparameters such that the wall-clock time is equal. E.g., I would expect that the G-SMC method improves with more particles, and potentially with more diffusion steps. - Following that, I would be interested in seeing the ESS and wall-clock time comparisons for the protein-ligand task as well, especially since this seems to be an example of a higher-dimensional case. As an alternative, it would also be interesting to see how the method scales, e.g., to the LJ55 task or some other task that is more than about 30-dimensional. - Do the authors agree that the method could be described as a "twisted" proposal for SMC? And further, even the initial choice of using the modified diffusion reverse process would be also simply another choice of a twisted proposal. My understanding is that this is a standard idea in the Sequential Monte Carlo literature, although the basis function scheme and the variance-minimization is novel at least in the diffusion context to my knowledge. The problem is that this seems to make the first contribution of 'identifying a fundamental degree of freedom in the Feynman-Kac type FP equation' less novel than it sounds, but I am open to being corrected on this. In any case, I think that positioning the method in its proper wider SMC context would be helpful for correctly contextualizing the paper. - In the protein-ligand task, would another simple baseline be to take the base model samples and optimize them directly using the gradient of the energy function? Does the method outperform this? - The paper [1] seems to be a relevant prior work on annealed distributions with diffusion models with SMC (and is the first paper to do this?). It may be that the proposed method is better than this, but I think it should be compared to at least in some context to clearly show this. - Further, the paper [2] does reward-tilted sampling with SMC by using a different twisted proposal based on the $p(y|x_0(x_t))$ gradient. It would be good to compare to that as well in the reward-guided experiments. References [1] Thornton, James, et al.
"Composition and Control with Distilled Energy Diffusion Models and Sequential Monte Carlo." The 28th International Conference on Artificial Intelligence and Statistics (2025). [2] Wu, Luhuan, et al., "Practical and Asymptotically Exact Conditional Sampling in Diffusion Models", 37th Conference on Neural Information Processing Systems (NeurIPS 2023). - Although not considered in previous literature on SMC, it seems that another simple baseline would be to not use SMC and instead interleave Langevin dynamics steps after each generative step, similarly to [1], such that the wall-clock time is matched with the proposed method. I would be curious to see how well this could work on some task, although this is not a priority. Overall, I will start out with a marginal reject due to the concerns raised in the weaknesses, but am open to changing the score with the rebuttal. I think the core idea in the paper is interesting and mostly seems to work well and potentially could be extended further. As such I am not initially strongly against accepting the paper with the assumption that the weaknesses are cleared out or I am incorrect about them. References [1] Du, Yilun, et al. "Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc." Fully human-written
DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes DriftLite, a lightweight alternative to reward-driven particle dynamics derived from the Feynman–Kac formulation of the Fokker–Planck equation. While the principled dynamics (Prop. 2.1) yield unbiased sampling by reweighting trajectories according to a reward function, they suffer from severe weight degeneracy and are impractical for high-dimensional problems. DriftLite replaces explicit weighting with a Variational Coefficient Generator (VCG) that learns a low-rank, three-basis decomposition of the drift field, achieving a balance between theoretical faithfulness and numerical stability. The method demonstrates stable performance on challenging high-dimensional tasks, including protein–ligand systems. ## Strong theoretical grounding. Proposition 2.1 provides a clean derivation from the Feynman–Kac representation, clarifying why naive guidance or reward-based drift correction leads to unnormalized density propagation. In addition, Proposition 3.1 theoretically provides design space within the Fokker–Planck equation. ## Elegant practical solution. The transition from weighted to unweighted dynamics via the VCG formulation is conceptually neat and computationally efficient. Representing drift using three physically motivated basis components—potential, diffusive, and reward—yields a low-rank parameterization. Directly minimizing the variance of the potential term under a simple parameterization reduces the problem to a linear system, making the overall procedure remarkably simple. ## Empirical validation on high-D systems. Although the approach projects the control drift $b$ onto a three-dimensional subspace, it scales to complex molecular and protein–ligand environments without losing stability, indicating that this design effectively captures the dominant drift modes. ## Partial theoretical exposition. While Prop. 2.1 and 3.1 are elegant, the paper omits intermediate derivations linking the reward term to the weighted dynamics; some readers may struggle to follow the jump from theory to implementation. Could the authors clarify whether the choice of three basis functions in the VCG has a theoretical grounding? In particular, does this low-rank representation guarantee that the dominant drift modes of the reward-driven dynamics are captured, or is it mainly an empirical observation? Moderately AI-edited
DriftLite: Lightweight Drift Control for Inference-Time Scaling of Diffusion Models Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors propose a method for learning to sample from a combination of tempered and/or tilted distribution, where the exponent of a reward function describes the tilting factor. Importantly, they note that the Feynman Kac PDE can be manipulated by adding an additional drift term which is compensated in the re-weighting factor, reducing the problem of sampling from the new target density to finding an optimal drift. It is an important direction of research since re-weighting typically relies on self-normalized importance sampling which can lead to importance weights blowing up. Thus, finding a good additional drift function can reduce this problem by essentially reducing the variance of the importance weights. The authors further demonstrate the improved performance of their framework, which relies on efficiently finding the optimal control through solving of a linear system of equations. - The authors provide a rigorous and correct assessment of the setting, where existing methods rely on self-normalized importance sampling which suffer from the problem of large variance of importance weights and the sensitivity to the number of particles used for integration. - The experiments indicate that the proposed method leads to improved performance across a wide array of settings, substantiating the generalizability of the proposed method. - A lot of the theory proposed in this work is actually already well known and not new, and the authors should have introduced it as such. In particular, Prop 3.1 is the basis of [1-3] which highlights that the Feynman Kac PDE can be manipulated with additional drifts as long as they are similarly compensated for in re-weighting. - Is there a reason why the authors did not consider the more difficult setting of LJ-55 to evaluate their framework? - I strongly urge the authors to provide comparisons to [2-3] which follows the same idea, learning a neural network model as an additional drift to reduce the variance of importance weights. A clear comparative analysis between modeling the drift as a linear coefficient of basis functions or a neural network would make this manuscript a lot better. - In continuation of the previous point, it is further important to highlight the training and inference costs of the proposed method against [2,3]. While I agree that the proposed method only requires solving a system of linear equations, it does require this at all time-points every time during inference. On the other hand, learning a time-conditioned neural network requires an upfront cost of training but then at inference only requires an additional forward pass at each step. Is solving this system of linear equations at every step cheaper than a forward pass at each step? Such an analysis has been missing from this work. [1] Skreta, Marta, et al. "Feynman-kac correctors in diffusion: Annealing, guidance, and product of experts." arXiv preprint arXiv:2503.02819 (2025). [2] Albergo, Michael S., and Eric Vanden-Eijnden. "Nets: A non-equilibrium transport sampler." arXiv preprint arXiv:2410.02711 (2024). [3] Vargas, Francisco, et al. 
"Transport meets variational inference: Controlled monte carlo diffusions." arXiv preprint arXiv:2307.01050 (2023). - How do the authors evaluate their method on LJ-13 and the other benchmarks? How are the ground-truth samples obtained to compute metrics such as MMD? - It would be good to provide some of the standard metrics like 2-Wasserstein distance. - What is the reward-tilting used in LJ-13, and what is the motivation behind the specific choice of tilting applied? - Why do the authors not compare against reward-based diffusion finetuning tasks and methods, e.g. [4-6]? [4] Venkatraman, Siddarth, et al. "Amortizing intractable inference in diffusion models for vision, language, and control." Advances in neural information processing systems 37 (2024): 76080-76114. [5] Domingo-Enrich, Carles, et al. "Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control." arXiv preprint arXiv:2409.08861 (2024). [6] Fan, Ying, et al. "Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models." Advances in Neural Information Processing Systems 36 (2023): 79858-79885. Fully human-written
Confounding Robust Meta-Reinforcement Learning: A Causal Approach Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper tackles unobserved confounders in Meta-RL via partial identification methods that generate counterfactual trajectories from candidate environments consistent with the confounded observations. This paper addresses an important challenge in reinforcement learning, the presence of unobserved confounders, approached from a causal inference perspective. Numerical experiments are conducted to demonstrate the effectiveness of the proposed method. 1. The confounding environment appears to be overly simplified. In many sequential settings, the transition dynamics are modeled as a function $f: \mathcal{S} \times \mathcal{X} \times \mathcal{U} \rightarrow \mathcal{S} \times \mathcal{U}$, whereas the paper assumes that the current unobserved state is not influenced by the history, which does not reflect the challenge of the sequential setting. This assumption restricts the applicability of the proposed method. I am wondering whether the method extends to this more general setting, and if so, what additional assumptions or modifications (e.g., on the evolution of $\mathcal{U}$, identifiability, or estimation strategy) are required. 2. There appear to be important missing components in the paper. In particular, no formal identification results are established for the counterfactual trajectories. It remains unclear what additional assumptions and regularity conditions are required to ensure identification, and what the theoretical guarantees are regarding the quality of the estimated counterfactual trajectories. Furthermore, in practical applications, it is not evident how to determine the dimension of the unobserved states. 3. The paper claims that the solution minimizes the generalization error. However, the theoretical results only establish convergence to a first-order stationary point. More argument and discussion are needed for this claim. See above. Lightly AI-edited
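To illustrate the history-dependent confounding the reviewer asks about, a toy transition of the form $f: \mathcal{S} \times \mathcal{X} \times \mathcal{U} \rightarrow \mathcal{S} \times \mathcal{U}$ where the unobserved $u$ evolves with the state; the dynamics are invented for illustration, not taken from the paper:

```python
import numpy as np

def confounded_step(s, a, u, rng):
    """Transition where the unobserved confounder u both biases the action's
    effect and evolves as a function of the current state, i.e., the
    history-dependent case the review says the paper excludes."""
    s_next = s + a + 0.5 * u + 0.1 * rng.standard_normal()
    u_next = 0.9 * u + 0.1 * np.tanh(s)   # confounder depends on the state history
    return s_next, u_next

rng = np.random.default_rng(0)
s, u = 0.0, 1.0
for _ in range(10):
    a = -0.5 * s                           # policy observes only s, never u
    s, u = confounded_step(s, a, u, rng)
```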
Confounding Robust Meta-Reinforcement Learning: A Causal Approach Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This work aims to address the problem of unobserved confounders in meta-reinforcement learning environments. By leveraging the method of partial counterfactual identification, the authors propose a causal MAML framework, which utilizes counterfactual trajectories to find a policy initialization that exhibits strong generalization performance in target domains. 1. The paper addresses an important issue in meta-reinforcement learning, confounding bias, and introduces a causal perspective to tackle it. 2. The proposed algorithm is rigorously derived through formal theoretical development, including the definitions of CMDPs and canonical causal models, as well as a convergence proof under bounded-gradient assumptions. 1. The entire study is conducted under discrete and finite CMDP settings. Consequently, both the theoretical formulation and the empirical validation are confined to simplified, low-dimensional environments. While the framework is theoretically sound, its applicability and scalability to high-dimensional continuous control tasks remain unverified. 2. The literature review could be more comprehensive. While the paper states that research on handling unobserved confounders in meta-reinforcement learning is still missing, several existing works have already investigated this direction using causal approaches. 1. Not disclosing significant LLM usage. 2. In lines 086–090, the paper identifies the lack of "a systematic approach for performing meta-learning across heterogeneous domains with the presence of unmeasured confounding." However, several recent studies [1,2] have already explored causal approaches to address unobserved confounders in meta-reinforcement learning. Therefore, the gap and motivation would be stronger if the authors explicitly acknowledged these prior works and clarified how their method fundamentally differs from, or advances beyond, existing causal meta-RL approaches. 1. 'The shorter purple route' should be 'The … orange …' in line 209. 2. The paper does not include comparisons with existing causal meta-reinforcement learning methods. [1] Dasgupta I, Wang J, Chiappa S, Mitrovic J, Ortega P, Raposo D, Hughes E, Battaglia P, Botvinick M, Kurth-Nelson Z. Causal reasoning from meta-reinforcement learning. arXiv preprint arXiv:1901.08162. 2019 Jan 23. [2] Dasgupta I, Kurth-Nelson Z, Chiappa S, Mitrovic J, Ortega P, Hughes E, Botvinick M, Wang J. Meta-reinforcement learning of causal strategies. In The Meta-Learning Workshop at the Neural Information Processing Systems (NeurIPS) 2019. Lightly AI-edited
Confounding Robust Meta-Reinforcement Learning: A Causal Approach Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose a robust meta-learning method that can perform effectively in environments with unmeasurable confounding factors that affect the environment dynamics. The key idea is to use causal inference and partial identification of confounding variables to overcome this. By repeatedly augmenting counterfactual trajectories from an environment model consistent with the observed data, the proposed method de-biases the meta-learner from the effects of confounding variables. The authors present in-depth theoretical proofs, and empirical experiments in the "Windy Gridworld" environment (unobserved wind patterns as the confounding factor) show that the proposed method outperforms vanilla MAML and other RL-based methods. The paper is well structured and easy to read. The authors motivate the problem well by clearly identifying gaps in existing meta-RL algorithms. Bringing ideas from causal inference into meta-learning applications is quite novel, and the results are promising compared to vanilla meta-learning methods. The detailed theoretical analysis, well-outlined pseudo-algorithms, description of the experiments conducted in the grid world, and the performance achieved are very interesting. A key concern is the choice of baselines - while the proposed method is clearly better than vanilla MAML, I think a comparison to other state-of-the-art distributionally robust [1] or Bayesian [2] meta-learning methods would put the strength of the proposed algorithm in better perspective. Quantitative comparison against these more relevant baselines would have made the contributions much more impactful. The fact that Causal PPO, in Appendix C, matches or beats the proposed approach on all the tasks supports the need for stronger baselines. Another approach to tackle confounding variables could be to formulate the problem as a partially observed Markov decision process (POMDP) and leverage methods like RL^2 [3]. What are some advantages of using a causal inference approach over this? Minor typo: In Appendix B.2, "log" appears twice in the equation. This is most probably a typo. Minor: The plot colors in the main paper are not consistent with the appendix; it would be great if they were consistent. PPO is "green" in the appendix but "orange" in the main paper. [1] A Simple Yet Effective Strategy to Robustify the Meta Learning Paradigm https://arxiv.org/abs/2310.00708 [2] https://arxiv.org/abs/1806.03836 Bayesian model-agnostic meta-learning [3] RL^2 https://arxiv.org/abs/1611.02779 Causal PPO outperforming the proposed causal MAML approach raises the question of why we need meta-learning at all. The key seems to be having counterfactual data augmentation. Do the authors have some thoughts on tasks where causal MAML would hold an advantage over Causal PPO? Fully human-written
Confounding Robust Meta-Reinforcement Learning: A Causal Approach Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper aims to address a critical and under-explored issue in reinforcement learning: the presence of unobserved confounders, which it tackles in the meta-RL setting. The core question is how to learn a policy that can robustly and quickly adapt to new tasks when the expert data used for meta-learning is contaminated by unobserved confounding variables. The authors propose a framework called Causal MAML. This framework employs variational inference to learn a causal generative model for inferring the posterior distribution of confounding variables from biased observational data. Then, this distribution is used to generate counterfactual trajectories that "purify" the data used to train MAML. The paper provides theoretical analysis to support the unbiasedness of its meta-gradient estimation and validates the effectiveness of the method through experiments in two custom-designed confounding environments (Windy Gridworld and Causal Pendulum). 1. The paper's primary strength is its rigorous formulation of the problem. It moves beyond heuristics to prove that the core issue is a biased meta-gradient resulting from confounding. The central theoretical result—that a gradient computed on ideal counterfactual data is an unbiased estimator of the true, unconfounded gradient (Theorem 4.1)—provides a strong guarantee on the correctness of the optimization objective. This establishes a clear and principled target for the algorithm. 2. This paper focuses on the challenge of confounding robustness in Meta-RL, a highly relevant yet often overlooked issue in real-world applications. The introduction of causal inference into Meta-RL sounds interesting. 1. The proposed practical algorithm (Algorithm 1) is built on a foundation that is not scalable. The core "Counterfactual Bootstrap" step requires MCMC sampling from the posterior over all possible MDPs, $\rho_b(M \mid D^i_{\text{obs}})$. This is computationally infeasible for any non-trivial environment. The method's success in the paper is an artifact of using toy domains where this step is barely possible. This reliance on an unscalable sampling procedure makes the proposed algorithm impractical for real-world application. 2. The Windy Gridworld and Causal Pendulum used in the paper are essentially "toy problems", characterized by low-dimensional state spaces and simple dynamics models. While these environments help illustrate the concept of "confounding" intuitively, they are far removed from the complexity of real-world problems. 3. The most concerning shortcoming of the paper is the complete absence of any direct evidence demonstrating that its causal inference module actually learns meaningful information about the confounder. The paper's core claim is that it "infers and utilizes the confounding variable $U$," yet the experimental section only reports final task rewards without any analysis or visualization of the learned latent variable $U$. In a controlled environment where the ground truth of the confounding variable is known, such validation is straightforward and necessary. For instance, the authors should visualize the relationship between $U$ and the true confounder.
In the absence of such validation, the causal inference module remains an uninterpretable "black box". We cannot determine whether the performance improvement stems from successful causal inference or merely from the VAE structure happening to learn a useful but causally irrelevant abstract representation in these simple tasks. This significantly undermines confidence in the paper's core methodological contribution. 4. The comparisons in the paper are limited to standard MAML and a simple pre-training baseline. This overlooks a substantial body of related work in unsupervised/self-supervised RL that focuses on learning latent representations or skills from reward-free interactions (e.g., SMM, DIAYN). These methods are typically evaluated on more complex and widely adopted benchmarks (e.g., DeepMind Control Suite, MuJoCo). The failure to compare Causal MAML against such methods, or at least to test it in environments of comparable complexity, makes the contribution appear somewhat isolated and raises serious doubts about its scalability. 1. Could you provide a qualitative analysis demonstrating that the learned latent variable $U$ indeed captures the true confounding information, for instance by visualizing the relationship between $U$ and the actual wind force or spring stiffness? 2. How would your method handle confounding variables that change dynamically (non-stationarity) or affect the system dynamics in non-additive ways? Is the current framework sufficient to address these cases, or would it require significant modifications? 3. Have you considered evaluating your method on more challenging benchmarks (such as MuJoCo task sets with introduced confounding variables) to demonstrate its scalability? Compared to self-supervised RL methods designed to learn latent environmental factors, what advantages does your approach offer? Moderately AI-edited
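To make the scalability concern in the two reviews above concrete, here is a minimal sketch of the counterfactual-bootstrap step as the reviews describe it; `posterior_sample`, `simulate`, and `meta_grad` are hypothetical stand-ins for the paper's actual components, not its API:

```python
import numpy as np

def counterfactual_meta_gradient(obs_data, posterior_sample, simulate, meta_grad, n_boot=32):
    """Average meta-gradients over counterfactual rollouts.

    posterior_sample(obs_data): draws an MDP/confounder consistent with the
        biased observations (the MCMC step the second review argues is unscalable).
    simulate(mdp): rolls out counterfactual trajectories in that MDP.
    meta_grad(trajs): returns the meta-gradient estimate (as an array).
    """
    grads = []
    for _ in range(n_boot):
        mdp = posterior_sample(obs_data)   # one posterior draw
        trajs = simulate(mdp)              # counterfactual trajectories
        grads.append(meta_grad(trajs))
    # Averaging over posterior draws is what de-biases the estimate
    # (cf. Theorem 4.1 as summarized in the reviews).
    return np.mean(grads, axis=0)
```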
Parameter-Efficient Reinforcement Learning using Prefix Optimization Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates parameter-efficient reinforcement learning (RL) for math reasoning by optimizing only the first k generated tokens (the “prefix”) and letting a frozen, larger target model complete the solution. Two approaches are explored: (1) Prefix-Clustering, which selects a single fixed prefix by clustering candidate prefixes from the reference model and choosing the best one on a training set; and (2) Prefix-RL, which RL-finetunes a small 1B adapter model to generate the first k tokens while the large target model remains frozen. Using verifiable rewards (answer correctness), the authors show consistent gains on math benchmarks (e.g., MATH-500, AIME, Minerva) across Qwen and Llama families, including FP8-quantized ones, with significantly lower training compute than full-model RL. The empirical results suggest that a substantial share of RL gains arises from steering toward effective formats/strategies rather than from improving token-by-token reasoning across the entire sequence. - This paper introduces a compute-lean adapter-based RL setup where only the k initial tokens are learned, separating strategy choice from long-horizon generation. - The pipeline is well illustrated (the adapter emits a prefix; the target completes; the reward is computed on the final answer). 1. The paper implicitly assumes that early tokens determine the solution strategy the model will keep following. This may not hold for reflective/iterative solvers (e.g., o3-like, DeepSeek-R1, Qwen-Thinking) that backtrack, revise, or branch mid-solution. The generality of prefix steering under multi-pass reflection remains untested. 2. The Prefix-Clustering protocol seems train-set-dependent and of unclear inference-time value. The method traverses MATH-train to choose a single best fixed prefix, which may not be practically meaningful at inference time (and risks over-selection on the training set). 3. To support the efficiency claim, add a direct baseline: “1B Prefix-RL + large target” vs. “full RL on the large target” under matched or budget-normalized compute and matched data. Without this, the efficiency-performance trade-off curve is hard to judge. 4. Some figure narratives (e.g., the 1B self-completion plot analogous to Fig. 3) could better articulate what hypothesis each figure specifically tests (e.g., how much of the full-RL gain is recovered by prefix control?). 5. The paper states that this is the first demonstration of RL finetuning applied to quantized models, yet the method does not RL-update the quantized target’s weights: only the small adapter is updated, while the FP8 model is used for inference-only completion. As worded, this can be misleading. See weaknesses above. Fully AI-generated
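For reference, a minimal sketch of what the Prefix-Clustering selection step critiqued above might look like; the inputs `prefixes`, `embeddings`, and `train_score` are hypothetical stand-ins, not the paper's code:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_fixed_prefix(prefixes, embeddings, train_score, n_clusters=8, seed=0):
    """Pick one fixed opening: cluster candidate prefixes, take each cluster's
    medoid, and keep the medoid that scores best on the training split."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    reps = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        reps.append(prefixes[members[np.argmin(dists)]])
    # Train-set selection is the step the review flags as risking over-selection.
    return max(reps, key=train_score)
```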
Parameter-Efficient Reinforcement Learning using Prefix Optimization Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a method for performing RL with low computational resources. The approach optimizes a small model to generate the beginning portion of responses, after which a large model completes the remaining decoding. The authors experiment with Llama 3.1 and Qwen 2.5 series models on several mathematical reasoning tasks. Results show that Prefix-RL can achieve most of the performance gains of standard RL using relatively little computational resources. 1. This paper proposes a method for RL under low computational resources that can achieve most of the performance gains at far less computational cost than conventional RL. 2. The proposed method does not require full access to the target model; it only needs inference access. It is therefore applicable not only to open-source models but also to closed-source models. 3. The paper designs a Prefix-Clustering experiment to verify the importance of the beginning portion of the response for performance gains, and further proposes the main method of this work, Prefix-RL. 1. The method is limited by the need to use models from the same family, which may cause it to perform poorly or even fail in settings involving closed-source models. 2. Experiments are conducted only on mathematical reasoning tasks, so the method's applicability to other RLVR tasks remains unknown. 1. The upper-right subplot of Figure 4 shows anomalous behavior of Prefix Clustering on the Qwen model; in L375–L376 the paper explains this as "Qwen’s preferred openings are more input-dependent". Could a clearer example be provided to substantiate this point? 2. If the target model were used directly as the adapter model, what performance could be expected? This approach seems to eliminate the need for an additional model and avoid the restriction that the method requires a smaller model from the same family. Theoretically, such performance should lie between the current method and full RL, and it would also allow a clearer comparison of the stylistic and performance differences between the "prefixes" obtained this way and those obtained in the paper. 3. Does Prefix-RL show gains on OOD tasks? For example, does the adapter model produce more guiding responses on tasks other than mathematical reasoning? 4. In L460–L461 it is mentioned that "cross-family configurations lead to performance degradation". Is there any data that quantifies the extent of these performance drops? Also note that pairing an open-source model with a closed-source model would also count as "cross-family". Lightly AI-edited
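To make the setup the reviews describe concrete, a minimal sketch of the adapter-to-target handoff, assuming HuggingFace-style APIs, placeholder model names, and a shared tokenizer (same model family, as the paper requires); the real pipeline additionally batches rollouts for PPO updates of the adapter:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("small-adapter-model")            # placeholder
adapter = AutoModelForCausalLM.from_pretrained("small-adapter-model")  # trainable
target = AutoModelForCausalLM.from_pretrained("large-target-model")    # frozen

@torch.no_grad()
def answer(question: str, k: int = 32) -> str:
    ids = tok(question, return_tensors="pt").input_ids
    # The small adapter emits only the first k tokens of the solution...
    prefix = adapter.generate(ids, max_new_tokens=k, do_sample=True)
    # ...and the frozen target completes from that prefix. The verifiable
    # reward is computed on the final answer, so gradients never touch the target.
    full = target.generate(prefix, max_new_tokens=512)
    return tok.decode(full[0], skip_special_tokens=True)
```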
Parameter-Efficient Reinforcement Learning using Prefix Optimization Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a methodology called _Prefix-RL_ to algorithmically identify ways for an LLM to start its response to a given user input so that it is more likely to correctly answer math questions. The work uses this methodology to fine-tune both Llama and Qwen models on the mathematical reasoning benchmarks MATH, AIME, AMC23, and Minerva. The results show gains similar to direct RL finetuning. * The paper's primary strength is the ~3000x reduction in FLOPs and the 4x reduction in GPU requirements during training. This approach to RL fine-tuning is much more accessible to research labs than tuning the full model. * The core idea is based on an interesting insight and is also practical to implement. The insight that prefixes index into parts of the training data that are useful for answering certain math questions could inspire further work. * The method is demonstrated on FP8-quantized models, which, as noted in Section 3.2, was previously difficult to do. It seems to make progress on the performance gap between the quantized Llama-8B and its full-precision counterpart, which is impressive and useful. *Weaknesses* * The work is posed as a form of parameter-efficient RL but only compares against a standard RL baseline. A fairer comparison would consider other techniques for parameter-efficient RL such as LoRA [Hu et al., 2022] (or QLoRA [Dettmers et al., 2023] for quantized models) or Adapters [Houlsby et al., 2019]. It would also be nice to compare against prefix-tuning, as mentioned in the related work, given that it can be directly trained on the same labels generated for RL as a supervised signal. * It is unclear why the baseline method of prefix clustering was evaluated with k=16, whereas the experimental method was tested at k=32 and k=64. I believe this is a somewhat unfair comparison, as it is possible that k=16 is simply not enough tokens to meaningfully index into the parts of the LLM's training data that are “good” for solving math questions, which is the core insight of this work. It would be better to compare equal k values across all approaches. * Additionally, selecting k does not seem to be clear-cut. In Table 1, k=32 appears to work well for some models/benchmarks (e.g., the Qwen-72B model or the Minerva benchmark), whereas k=64 works better for others. It seems difficult to know ahead of time what value of k will achieve good performance. This mitigates some of the efficiency benefits, because implementing the approach now requires an engineer to search over the k values that work best. * The paper appears to report results from a single training run for each experiment. The training curves (e.g., in Figure 4) seem quite noisy, with high variance between steps. Selecting the "best checkpoint" from a single, noisy run is not robust to initial conditions.
The paper should report the mean and standard deviation over multiple runs (e.g., 3-5) with different random seeds to establish statistical significance, as is standard in the RL literature. * This approach relies on having an objective way of calculating the “correctness” of an answer. It is not clear how well it performs under the kinds of label noise that are common in RLHF. Having a “correct answer” is a strong assumption for problems that LLMs are typically used for, such as creative writing, open-ended dialogue, or exploratory information retrieval. *Minor Edits* * The section on prefix clustering is a bit confusing. Is it one prefix for all evaluation examples, or the nearest cluster's prefix? It seems to be the former, but the latter would be a more reasonable baseline. * It is not immediately clear why setting $g_\theta$ to the same size and architecture as the reference model implies that “improvement from RL upweights existing strategies”, as claimed in Sec. 2 under the subheading “Prefix-RL”. This could be elaborated on. * In Section 3.1, the authors state that the MATH training split has 7,500 examples. This dataset appears to have 12,500 examples. The subsequent filtered count of 8,888 examples also suggests the starting number was larger than 7.5k. Please double-check and clarify this dataset statistic. References: Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. "LoRA: Low-rank adaptation of large language models." ICLR, 2022. Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. "QLoRA: Efficient finetuning of quantized LLMs." Advances in Neural Information Processing Systems 36 (2023): 10088-10115. Houlsby, Neil, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. "Parameter-efficient transfer learning for NLP." In International Conference on Machine Learning, pp. 2790-2799. PMLR, 2019. The questions below correspond, in order, to the bullet points of the weaknesses above. * 1.1: How does this technique compare to other parameter-efficient RL techniques such as LoRA [Hu et al., 2022] (or QLoRA [Dettmers et al., 2023] for quantized models) or Adapters [Houlsby et al., 2019]? * 1.2: What are the benefits of this approach of using RL with automatically calculated labels vs. supervised fine-tuning with automatic labels? * 2.1: How does prefix clustering at k=32 and k=64 compare to the proposed approach? * 3.1: How can one determine a k value for their problem? * 3.2: What is the worst-case number of k values to evaluate? * 4.1: How consistent are these results across different training runs? * 5.1: How sensitive is this approach to noise in labels? * 5.2: How applicable is this approach to other common uses of LLMs? Fully human-written
Parameter-Efficient Reinforcement Learning using Prefix Optimization Soundness: 3: good Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper investigates whether the performance gains observed in RLVR for mathematical reasoning are due to genuine improvements in reasoning ability or primarily to shifting the model toward high-accuracy solution strategies already present in the base distribution. To answer this, the authors propose prefix optimization: only the first k tokens of a generated solution are optimized, while the remainder is completed by a frozen reference model. Two methods are evaluated: 1. Prefix Clustering, which selects a fixed prefix via k-means clustering of sampled candidate prefixes and uses it for all inputs. 2. Prefix-RL, which trains a small adapter using PPO to generate an input-conditional prefix given the question. Despite modifying only a tiny fraction (the first 16–64 tokens) of the sequence, both methods yield substantial accuracy improvements on math benchmarks such as MATH-500, AIME, AMC, Minerva, and OlympiadBench, often recovering a large share of full-RL gains. Prefix-RL is compute-efficient, works with quantized models, avoids catastrophic forgetting, and requires inference-only access to the main model. Improvements are most pronounced when the adapter and target share a model family. Overall, the work argues that strategy selection and formatting, not deep reasoning skill, may explain a substantial portion of RL gains. 1. A simple method, especially Prefix Clustering, not only brings significant improvements on downstream tasks but also reveals that high-quality solution strategies are already present in the pre-training distribution, offering a profound insight into the origin of RL gains. 2. Highly compute-efficient. 3. Can work with closed-weight models. 1. Lack of direct comparison to full RL at a large scale. 2. Lack of comparison to other parameter-efficient RL methods. 3. Generalization beyond math remains uncertain. 4. Prefix clustering harms Qwen but helps Llama, suggesting architectural or data-distribution differences worth deeper investigation; more analysis of why Qwen behaves differently should be conducted. Same as weaknesses. Lightly AI-edited
SafeDec: Constrained Decoding for Safe Autoregressive Generalist Robot Policies Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. SafeDec introduces a constrained decoding framework that enforces formal safety specifications during inference for large transformer-based robot policies such as SPOC, Flare, and PoliFormer. Instead of retraining these models to internalize safety rules, SafeDec intervenes only at decoding time to ensure that generated action sequences comply with Signal Temporal Logic (STL) constraints. - **Model-agnostic generality without retraining.** The framework operates at test time and is agnostic to the underlying policy; it is demonstrated on SPOC, Flare, and PoliFormer without any additional training or fine-tuning. - **Clear empirical trade-offs between variants.** The simulation study contrasts HCD (strict safety satisfaction) and RCD (safety-performance trade-off), making variant-specific strengths and weaknesses evident. - **Clear writing and easy to understand.** - **Scope mismatch (navigation vs. manipulation).** While the introduction references both navigation and manipulation, all experiments are limited to navigation tasks in AI2-THOR, creating a gap between the paper’s claimed and demonstrated scope. - **Problem framing conflated with solution design.** The claim of “introducing the novel problem of constrained decoding for transformer-based policies under STL specifications” blends the problem statement with specific solution choices (transformers and STL). The work would be more precise if it framed the broader problem as *online safety enforcement during inference* and justified these components as its methodological instantiation. - **Limited evaluation.** Although the motivation centers on real-world safety, results are confined to a single simulated domain with two constraint types (geofencing and avoidance). This limits the generality of the conclusions. - **No comparison with relevant baselines** (see questions). - **Ambiguity around “post-hoc manipulation”.** The term is used to critique existing methods but is not clearly defined; it is unclear whether it refers to post-training safety alignment or post-generation filtering at inference. Moreover, no empirical evidence from robotic settings supports the assertion that post-hoc interventions necessarily lead to “degenerate or brittle behaviors”. - **Deviation from the base model’s distribution.** The method seeks to preserve the policy’s original action preferences while ensuring safety, yet both HCD (masking unsafe actions) and RCD (reweighting logits) modify the underlying probability distribution. This conflicts with the stated goal of maintaining distributional faithfulness to the base model (lines 195–199). - **Limited expressivity of tested constraints.** The evaluated constraints are simple invariants (“always avoid”). Demonstrating performance on temporally scoped constraints, such as “avoid room A between time 1 and time 2” or “avoid A until visiting B”, would better motivate the need for STL’s expressiveness.
- **Omission of baselines.** The paper only compares against bound-to-fail baselines and would be strengthened by comparisons against (a) multi-sample rollout selection (choosing the trajectory that satisfies safety), (b) corrective optimization of predicted actions to resolve violations, (c) the described “post-hoc manipulation” methods, and (d) recent guided-sampling approaches (e.g., DynaGuide). - **Writing and presentation issues.** The reference style is occasionally inconsistent, and acronyms such as HCD and RCD should be expanded upon first mention in the abstract or introduction for clarity. - **Missing experimental and implementation details.** It is unclear whether all rollouts were feasible under the specified constraints and how infeasible cases were treated; key implementation details such as the action vocabulary, decoding horizon, sampling strategy, and the mapping from STL predicates to action tokens are also omitted. Including these would improve clarity and reproducibility. - **Missing discussion of failure modes.** Moderately AI-edited
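To ground the HCD/RCD distinction discussed in the review, a minimal sketch of one constrained decoding step, assuming a discrete action vocabulary, a known one-step dynamics model, and an STL robustness oracle (all hypothetical stand-ins for the paper's components):

```python
import torch

def constrained_step(logits, next_states, robustness, beta=1.0, hard=True):
    """One decoding step over a discrete action vocabulary.

    next_states[a]: successor state under the (assumed known) dynamics if
        action token a is taken.
    robustness(s): STL robustness of the trajectory extended by s
        (negative means the invariant is violated).
    """
    rho = torch.tensor([robustness(s) for s in next_states])
    if hard:
        # HCD: forbid actions whose predicted successor violates the invariant,
        # then sample from the policy's remaining (renormalized) distribution.
        logits = logits.masked_fill(rho < 0, float("-inf"))
        return int(torch.distributions.Categorical(logits=logits).sample())
    # RCD: soft trade-off between the policy's preference and robustness.
    return int(torch.distributions.Categorical(logits=logits + beta * rho).sample())
```

As both this review and the next one note, either branch changes the action distribution, which is why "without distorting the model's distribution" claims deserve scrutiny.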
SafeDec: Constrained Decoding for Safe Autoregressive Generalist Robot Policies Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors present an approach for satisfying invariant properties in plans generated by transformer models. Their approach uses an output filter that leverages the underlying distribution from the model and masks predicted tokens that violate the invariant specification, based on either a hard constraint or a robustness-based weighting. The latter provides a trade-off between satisfying the constraints and satisfying the original task, whereas the hard-constraint formulation sacrifices performance to satisfy the invariant property. The authors present a sound, reasoned approach to safety with transformer models. The paper is well written and easy to follow. The authors present the work well in the context of existing approaches, and it represents a nice first step toward this type of online safety. This paper overstates its coverage of STL. Invariant properties are a useful but small subset of STL, and I think the abstract and paper need to be much clearer about this. The work is a worthwhile step forward, but it handles a very limited fragment of STL. After the abstract, invariants are not mentioned until Section 4. Reading the paper, it sounds as though the authors will satisfy general STL specifications, which their proposed approach cannot do. It is very important to be clear about this, as it represents a concrete limitation of the approach. It would be useful to see a comparison of the results against something like SafeVLA. That model requires fine-tuning with safety constraints, as the authors note. However, SafeVLA or a similar model could help the reader understand how close SafeDec comes to an optimal safe execution. That is, if SafeVLA outperforms SafeDec in terms of task satisfaction, it helps the reader identify the trade-off between fine-tuning for safety and run-time safety enforcement. See weaknesses above. Fully human-written
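For reference, the robustness semantics of the invariant fragment this review says the paper actually handles, under standard STL semantics (with $O$ an avoid region and $r$ a safety margin):

```latex
\rho\bigl(\square_{[0,T]}\,(d(x_t, O) \ge r),\; x\bigr)
  \;=\; \min_{t \in [0,T]} \bigl(d(x_t, O) - r\bigr)
```

Satisfaction holds iff the robustness is positive, which is why such specifications reduce to per-timestep state constraints and are much easier to enforce during decoding than general STL.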
SafeDec: Constrained Decoding for Safe Autoregressive Generalist Robot Policies Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes SafeDec, a constrained decoding framework for robot policies. The robot policies are modelled as pre-trained multi-modal causal transformers. The proposed framework controls the transformer’s output at inference time to ensure that robot behaviors satisfy safety requirements specified in Signal Temporal Logic (STL). Under this framework, the authors propose two techniques, termed Hard Constrained Decoding (HCD) and Robustness Constrained Decoding (RCD). Assuming the system dynamics are available, HCD sets the logits of actions that lead to future failures to $-\infty$ to rule out unsafe behaviors, whereas RCD reweights the logits of actions based on their contribution to the robustness score as well as their original logits, balancing the trade-off between safe behavior and task achievement. The authors conduct experiments on AI2-THOR indoor scenes using three pretrained robot policies (SPOC, PoliFormer, and Flare) across two STL requirements. Compared with the unrestricted transformer baseline and a filtering-based baseline, the proposed framework achieves a better trade-off between safety and goal-directed behavior. Ablation studies show that SafeDec can also handle inaccurate dynamics to some extent, and show how different balancing coefficients for RCD lead to different performance. 1. The problem this paper aims to tackle is indeed valuable. Many robot policies now use transformer backbones, and one concern is the lack of safety guarantees. An efficient solution to enforce task constraints is of utmost importance. 2. The proposed framework is easy to understand and appears flexible across different robot systems. Since it does not require the system to be differentiable, the proposed framework can handle broader types of safety constraints and system dynamics. 1. The idea is a bit trivial, given that the “constrained decoding” idea for transformers has already been discussed in [1], and the experimental findings are not significant (the filtering baseline also achieves relatively high performance). 2. The experiment section lacks essential baselines; consider gradient-based methods [2,3], “beam-search” methods, and sampling-based approaches such as CEM or CMA-ES [4], since the authors assume the system dynamics are available. 3. The proposed framework appears to work only on a discrete action space. 4. The framework is hard to generalize to STL specifications other than invariance properties. 5. The writing quality needs to be improved (misuse of citation types in L033-035; invalid citations in L084-085 and L301-303; wrong superscripts for “t’” in L633-636; improper format for the robustness score computation in L249-250 and L253-254, as the robustness score is defined on trajectories, not on states; in L228-229 it should be $\hat{a}_{t+k-1}^{(i)}$). Some implementation details are missing (the planning horizon T, the computation time, etc.). References: 1. Willard, Brandon T., and Rémi Louf. "Efficient guided generation for large language models." arXiv preprint arXiv:2307.09702 (2023). 2. Leung, Karen, and Marco Pavone.
"Semi-supervised trajectory-feedback controller synthesis for signal temporal logic specifications." 2022 American Control Conference (ACC). IEEE, 2022. 3. Zhong, Ziyuan, et al. "Guided conditional diffusion for controllable traffic simulation." arXiv preprint arXiv:2210.17366 (2022). 4. Kapoor, Parv, Anand Balakrishnan, and Jyotirmoy V. Deshmukh. "Model-based reinforcement learning from signal temporal logic specifications." arXiv preprint arXiv:2011.04950 (2020). 1. I got confused for HCD introduced in L222-238. What is the k here, is it a hyperparameter defined for "look-ahead steps", or is it an index enumerated from 1 to T-t progressively? How do you get $x_{t+k-1}$ when you try to get $\hat{x}\_{t+k}$, I assume you only have $x_0,...x_t$ when you say "t is the current decision step"? Is the whole HCD process like, first from $x_t$, find possible actions $\hat{a}\_t$ that lead to STL-violation based on $\hat{x}\_{t+1}$, and then make these $z_t=-\inf$ so you never sample from them, and then sample from the rest buckets to get real $a_t$, and then use system dynamics to get $x_{t+1}$, then do the same thing to get real $a_{t+1}$ and then get $x_{t+2}$, ... till finding the final $x_{t+k}$? In this process, do you need to call the transformer forward pass multiple times? Do you need to update the transformer visual input? I guess the RCD procedure is somewhat similar. A pseudo-code algorithm or a simple example will be nice to illustrate the whole process. 2. In L213-215 the authors mentioned "... we propose SafeDec: A constrained decoding strategy that integrates STL specifications into the foundation model action selection process itself, ensuring satisfaction without distorting the model’s distribution." I am a bit confused why SafeDec doesn't distort the model's distribution. It seems like HCD will filter out some actions, and RCD will also change the action distribution by reweighting. 3. The reason for saying "Hard to generalize to STLs other than invariance properties" is that for the invariance properties without explicit time intervals, one just needs to treat them as state constraints per timestep, which is exactly how the current HCD and RCD handle constraints right now. For this type of constraint, one does not even need to use the concept of robustness score to reweight the logits. But for more complex STLs (like "Eventually (Always ...)"), it is no longer Markovian, and the agent needs to use extra bits to keep track of the history (as in some RL works[1,2], but only for two-temporal-layer STLs.) And for general STLs, one needs to first decompose them into reachability and invariant properties [3,4], then decides how to activate the associated subtasks as time progresses. The proposed framework does not seem to adapt to these settings. References 1. Aksaray, Derya, et al. "Q-learning for robust satisfaction of signal temporal logic specifications." 2016 IEEE 55th Conference on Decision and Control (CDC). IEEE, 2016. 2. Venkataraman, Harish, Derya Aksaray, and Peter Seiler. "Tractable reinforcement learning of signal temporal logic objectives." Learning for dynamics and control. PMLR, 2020. 3. Kapoor, Parv, Eunsuk Kang, and Rômulo Meira-Góes. "Safe planning through incremental decomposition of signal temporal logic specifications." NASA Formal Methods Symposium. Cham: Springer Nature Switzerland, 2024. 4. Liu, Ruijia, et al. "Zero-Shot Trajectory Planning for Signal Temporal Logic Tasks." arXiv preprint arXiv:2501.13457 (2025). Fully human-written
SafeDec: Constrained Decoding for Safe Autoregressive Generalist Robot Policies Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper presents SafeDec, a constrained decoding framework that enforces formal safety specifications in Signal Temporal Logic (STL) within transformer-based robot navigation policies at inference time. The method is an inference-time technique that reweights or masks candidate actions using STL satisfaction scores, i.e., it decodes through either hard masking (HCD) or robustness-based reweighting (RCD) of logits, ensuring that selected actions do not violate safety constraints under a simple dynamics model. SafeDec is evaluated on AI2-THOR environments using three generalist navigation policies (SPOC, Flare, PoliFormer), demonstrating substantial improvements in STL satisfaction with limited impact on task success. The paper also examines performance under noisy dynamics and the effect of the weighting $\beta$ that trades off robustness against the base logits in RCD. - SafeDec adapts constrained decoding from natural language processing to low-level action generation in robotic policies. - The evaluation convincingly shows that SafeDec enforces simple invariants (avoidance, geofencing) with a modest performance loss across different state-of-the-art robot navigation policies. - The paper demonstrates adaptation to small noise in the dynamics and shows how performance changes with the relative weighting of the safety-performance trade-off. - The paper does not compare performance to the other similar methods it mentions (SafeVLA and SELP); instead, comparisons are limited to simple filtering baselines. - Implementation details are unclear, such as the setup of the Simplex default actions and the decoding-interface modification of the generalist policies. This affects reproducibility. - The paper has limited STL specification diversity, with only the Always operator (invariants) tested, for "always avoid" and "stay within bounds" conditions. - The writing has stray citations (line 084 on page 3 and lines 302-302 on page 6) and needs another proofreading pass. The explanation/caption of Figure 1 could be expanded to detail the architecture, the robot's initial and final positions, and the graph, if the authors choose to keep it at the beginning of the problem formulation section. Trajectory visualizations (Figure 2) have low contrast against the background and lack clear legends. 1. How are actions represented in SPOC, Flare, and PoliFormer? Can you elaborate on how you access and manipulate the last-layer logits of these policies? 2. Can the paper include comparisons to other baselines, such as SafeVLA or SELP? Both explicitly address safety constraints in transformer-based robotics or planning. 3. Can the authors elaborate on the mechanism of the default actions in the Simplex/filtering baseline and how they were determined? 4. Can SafeDec experiments demonstrate performance with other STL specifications that use the eventually or until operators, or with larger compound specifications? 5. Have you tried experiments where the error in the dynamics model was large due to unmodeled effects? Fully human-written
Last-iterate Convergence of ADMM on Multi-affine Quadratic Equality Constrained Problem Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper studies the convergence of ADMM on multi-affine quadratic equality-constrained problems. Under certain assumptions, the main results include: * a sublinear convergence rate of ADMM with general multi-affine quadratic constraints; * a linear convergence rate of ADMM when the constraints are close to linear. Both results are stated in terms of last-iterate convergence. Moreover, further results include experiments exploring the effect of multi-affine quadratic constraints on the convergence rate, comparisons with other optimization methods, and applications to locomotion problems in robotics. * The results extend previous work on linear constraints to affine quadratic constraints, which is more practical than the earlier setting where mainly linear constraints were discussed. * The convergence results are in terms of last-iterate convergence, which is more interesting from a theoretical perspective. * The discussion of the results on the locomotion problem in robotics is interesting and shows the practical use of the theoretical results. My main concern lies in the use of **Assumption 2.3**, which requires that the functions $f(x)$ and $\phi(x)$ are both strongly convex, so the objective function discussed in the current work is a sum of two strongly convex functions and several indicator functions of convex, closed sets. This greatly constrains the degree of non-convexity of the objective functions considered in this paper. Moreover, the authors do not discuss how these convexity assumptions on $f(x)$ compare with those in previous works. For example, the table at the top of page 3 suggests that none of the related works needs any convexity assumption on $f(x)$; does this mean that the other works do not require the same convexity assumptions as those used in the current paper? Other problems: * In line 97 and the table at the top of page 3, should "KL" be "PL", as used in Definition 2.5? * In the right part of Figure 5, the labels for each curve are missing. * Do the related works, especially the four presented in the table at the top of page 3, also require the strong convexity condition on $f(x)$? * If they do not, what is the main difficulty in removing this assumption from the current work? * As the authors state in the abstract, "Although these problems are generally non-convex, they exhibit convexity or related properties when all variables except one are fixed"; could the authors provide further discussion of this point? For example, could you provide important real-world examples where such objective functions are used to model problems? Fully human-written
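For reference, the Polyak-Łojasiewicz (PL) condition the typo comment above refers to (a special case of the KL property with exponent $1/2$):

```latex
\tfrac{1}{2}\,\bigl\|\nabla f(x)\bigr\|^{2} \;\ge\; \mu\,\bigl(f(x) - f^{\star}\bigr)
\qquad \text{for some } \mu > 0 \text{ and all } x,
```

where $f^{\star}$ denotes the minimum value; PL implies linear convergence of gradient methods without requiring convexity, which is why the KL/PL distinction matters here.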
Last-iterate Convergence of ADMM on Multi-affine Quadratic Equality Constrained Problem Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper shows that the classical ADMM procedure for optimization under equality constraints converges to a local minimum when the constraints are multi-affine quadratic and the objective function satisfies some additional properties (including strong convexity plus some indicator functions). It further establishes a linear convergence rate when additional assumptions are satisfied, and shows applications to robotic locomotion. The results obtained appear to be new: in particular, the convergence of ADMM under certain assumptions was proven in Guo et al. (2020), but without a convergence-rate analysis. The simulation experiments with robots are limited but rather convincing. I am not an optimization specialist, but the paper is interesting and well written, and a cursory look at the proofs indicates that they are reasonable (e.g., they go further than noting that the sequences are decreasing and bounded below, or that the difference between iterates converges to zero, which would not be sufficient). Given the wide use of ADMM, I think the paper is of interest to the ICLR community. Although the robotic experiments validate the assumptions made in the paper, it would be nice to discuss these further, for example Assumption 2.3 on the objective function, which seems rather restrictive, as well as their importance in practice. See weaknesses above. Fully human-written
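For readers, a generic sketch of the scheme under discussion, with the notation assumed from the reviews (multi-affine constraint $A(x) + Qz = 0$ over blocks $x = (x_1, \dots, x_n)$); the paper's exact variant may differ:

```latex
% Augmented Lagrangian (the \rho/2 penalty term is the one the next review
% flags as a conditioning concern for large \rho):
\mathcal{L}_{\rho}(x, z, w) \;=\; f(x) + \phi(z)
  + \langle w,\, A(x) + Qz \rangle + \tfrac{\rho}{2}\,\bigl\| A(x) + Qz \bigr\|^{2}
% Block updates and dual ascent:
x_i^{k+1} \in \arg\min_{x_i}\ \mathcal{L}_{\rho}\bigl(x_{<i}^{k+1}, x_i, x_{>i}^{k}, z^{k}, w^{k}\bigr),
\quad
z^{k+1} \in \arg\min_{z}\ \mathcal{L}_{\rho}\bigl(x^{k+1}, z, w^{k}\bigr),
\quad
w^{k+1} = w^{k} + \rho\,\bigl( A(x^{k+1}) + Qz^{k+1} \bigr)
```

Each $x_i$ subproblem is convex because $A$ is affine in $x_i$ once the other blocks are fixed, which is the structural property the reviews highlight.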
Last-iterate Convergence of ADMM on Multi-affine Quadratic Equality Constrained Problem Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper analyses the convergence properties of the Alternating Direction Method of Multipliers (ADMM) when applied to multi-affine, quadratic, equality-constrained optimisation problems. Under assumptions including $L$-smooth and strongly convex objectives, as well as full-rank constraint matrices, the authors prove that ADMM converges to a stationary point at a rate that is at least sublinear. When the nonlinearity in the constraints is sufficiently small relative to the linear components, they also prove linear convergence. These theoretical findings are applied to robotic locomotion trajectory optimisation, where centroidal dynamics lead to multi-affine constraints, and are experimentally validated. - This paper clearly demonstrates the real-world relevance of centroidal dynamics in locomotion by highlighting how they give rise to multi-affine quadratic constraints. - It provides comprehensive theoretical results and several meaningful extensions. - The analysis makes novel contributions by establishing explicit conditions for linear convergence when the nonlinear constraint coefficients are sufficiently small relative to the linear components. The authors need to make a major revision to improve the quality. Some detailed comments are provided below. - Although multi-affine quadratic constraints are generally challenging, the problem becomes affine when all but one variable block is fixed. ADMM can naturally exploit this structure, which appears to require only negligible changes to the analysis framework of classic ADMM. The convergence rate (Theorem 3.1) also largely follows the convergence analysis in [1] and the KL framework in [2] (here the PL property is a special case of KL). - Several extensions of the main results appear incomplete: - Although the main motivation is robotic applications, Corollaries 4.1 and 4.2 only guarantee convergence when certain assumptions about the functions are met. This means that, although problem (6) can be reformulated as (1), theoretical convergence is not fully assured, thereby limiting the practical application. - The analysis of the approximated ADMM (Algorithm 2) does not explicitly address the effect of approximation errors on convergence. Furthermore, in Theorem D.3, conditions P2 and P3 are assumed rather than proven, which weakens the theoretical support. - Although each ADMM subproblem is strongly convex and smooth, a sufficiently large penalty parameter $\rho$ causes the term $\frac{\rho}{2}\|A(x)+Qz\|^2$ in the augmented Lagrangian to dominate. This increases the condition number of the Hessian of the subproblem, which can slow convergence or require high-precision solvers. - The experimental evaluation lacks rigour and comprehensive baseline comparisons. Comparisons with existing methods (PADMM and IPDS-ADMM are designed for non-convex objectives) are limited to a few scenarios and omit standard benchmarks. Furthermore, since Algorithm 1 is only a classical ADMM, it should also be compared with ADMM methods designed for convex objectives [3,4].
- The paper provides incomplete practical guidance on parameter selection. While a "sufficiently large $\rho$" is theoretically required, the paper provides no sensitivity analysis or practical tuning strategies, which hinders reproducibility. - Some of the statements are inaccurate. For example, on page 3, the description of [5] incorrectly states that $\phi$ is not smooth, and the objective in [6] does not match the original reference. - The authors seem to be missing some key references, including [7,8], which also address non-convex optimization with nonlinear equality constraints using ADMM. Since the constraints studied here are a special case of nonlinear equalities, it would be good to contextualize the present work with respect to these important papers. - The writing suffers from typographical and organizational issues: - The cross-references to expressions are inconsistent (for example, "Equation (13)" on page 24, line 1244 vs. "equation 34" on line 1260). - The table on page 3 lacks a caption. - On page 3, line 155, "Assumption" should be pluralized as "Assumptions". # References [1] Gao, W., Goldfarb, D., & Curtis, F. E. (2020). ADMM for multiaffine constrained optimization. Optimization Methods and Software, 35(2), 257-303. [2] Guo, K., Han, D. R., & Wu, T. T. (2017). Convergence of alternating direction method for minimizing sum of two nonconvex functions with linear constraints. International Journal of Computer Mathematics, 94(8), 1653-1669. [3] Cai, X., Han, D., & Yuan, X. (2017). On the convergence of the direct extension of ADMM for three-block separable convex minimization models with one strongly convex function. Computational Optimization and Applications, 66(1), 39-73. [4] Tang, T., & Toh, K. C. (2024). Self-adaptive ADMM for semi-strongly convex problems. Mathematical Programming Computation, 16(1), 113-150. [5] Li, J., Ma, S., & Srivastava, T. (2024). A Riemannian alternating direction method of multipliers. Mathematics of Operations Research. [6] Yuan, G. (2025). ADMM for nonconvex optimization under minimal continuity assumption. ICLR. [7] El Bourkhissi, L., & Necoara, I. (2025). Convergence rates for an inexact linearized ADMM for nonsmooth nonconvex optimization with nonlinear equality constraints. Computational Optimization and Applications, 1-39. [8] Li, B., & Yuan, Y. X. (2025). Convergent Proximal Multiblock ADMM for Nonconvex Dynamics-Constrained Optimization. arXiv preprint arXiv:2506.17405. Please see **Weaknesses**. Fully human-written
Last-iterate Convergence of ADMM on Multi-affine Quadratic Equality Constrained Problem Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proves the convergence of a variant of the alternating direction method of multipliers (ADMM) for multi-affine quadratic equality-constrained problems, which are non-convex. Assumptions are less restrictive than in prior work. Linear convergence rates are also proven. The ADMM scheme is evaluated on robotic locomotion problems. The paper tackles an important and non-trivial problem: designing solvers with convergence guarantees for non-convex problems, with many applications. The paper is well written throughout. Assumptions and results are clearly stated, discussed, and compared to others in the literature, which makes the contribution clear. Examples are instructive and show the necessity of the assumptions. The ADMM scheme is tested on a non-trivial locomotion problem, and results validate the derived convergence rates. The following limitations and suggestions are minor: 1) Application to locomotion: the proposed method can only handle the case with pre-defined contact sequences and timings. This limitation should be stated. 2) Section 5, baselines: - Adding a sentence describing the baselines and their differences from the proposed ADMM scheme would strengthen the comparison. - Computation times for solving the locomotion problem are not reported. It is unclear whether the proposed method is faster and converges more robustly in practice than other tailored solvers for such problems. 3) Mathematical clarifications: - On line 223, the dual variable $w$ has the wrong dimensions and should be in $\mathbb{R}^{n_c}$. - In Definition 2.1, "such" => "such that". Also, "Moreover," should be replaced with an "and" for the definition to make sense. It could also be worth noting that quadratics ($C_i=I$) do not satisfy this assumption, so the assumption implies that the diagonal elements $(C_i)_{jj}$ are zero. - Typo: in (18), instances of $\nabla A$ should be $\nabla_i A$. - The first step of the proof of Theorem 3.2 states that second-order differentiability of the Lagrangian at $(x^\star,z^\star,w^\star)$ implies strict feasibility in a neighborhood ("indicator functions are all zero for the points in that neighborhood"), which is correct. This assumption implies that $(x^\star,z^\star,w^\star)$ is in the strict interior of the set $\cap_i X_i$. This assumption can be strong, and it would be worth discussing how to potentially relax it, e.g., with a refined assumption and analysis accounting for active constraints on the boundary of the $X_i$'s. - On line 1199, $\alpha=1/2$ should be $\alpha=2$. Please clarify Definition 2.1 and the assumption of second-order differentiability of the Lagrangian in Theorem 3.2 (see my previous comment). Fully human-written
Perturbations Matter: Sensitivity-Guided Hallucination Detection in LLMs Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a novel approach to detecting hallucinations in LLMs by analyzing the sensitivity of internal representations to prompt-induced perturbations. The key finding is that truthful responses exhibit greater sensitivity to perturbations than hallucinatory ones, with theoretical guarantees of high separability. The proposed method, Sample-Specific Prompting (SSP), dynamically constructs prompts per sample, measures representation shifts via cosine distance, and trains lightweight encoders using a contrastive loss. Experiments on QA datasets such as TruthfulQA demonstrate superior AUROC over baselines, along with strong efficiency and generalization. **Novel perspective on hallucination detection**: This work is the first to explicitly leverage perturbation sensitivity as a detection signal, shifting the focus from static embeddings to dynamic representation shifts. **Solid theoretical foundation**: Theorems 1 and 2 provide probabilistic guarantees on separability, supported by Cantelli's inequality and empirical validation on real datasets. This grounds the method in theory rather than relying solely on heuristics. **Comprehensive experiments**: Evaluations across multiple models and datasets demonstrate the robustness of the approach, which outperforms baselines on these benchmarks. The comparison in Figure 1 may be problematic. K-means is an unsupervised method, while the proposed perturbation sensitivity is supervised. The poor performance of K-means might not indicate weak inherent separation but could result from its inability to find the optimal separating hyperplane, making the comparison potentially unfair. Therefore, the claim that "internal representations of LLMs frequently fail to provide a clear separation between truthful and hallucinatory content" requires further discussion. Several recent studies [1,2] have demonstrated good separation using internal representations directly, raising questions about the validity of this separation bottleneck. Although the method beats traditional methods such as Linear Probe or SAPLMA, it introduces complexity by requiring prompt optimization on the training data, which may raise concerns about its robustness and practicality. [1] Bürger et al. Truth is Universal: Robust Detection of Lies in LLMs. NeurIPS 2024 [2] Liu et al. On the Universal Truthfulness Hyperplane Inside LLMs. EMNLP 2024 The main questions concern the limitation described in the introduction and Figure 1. Beyond the concerns mentioned in the weaknesses, the paper uses three labeling criteria: ROUGE-L (R), BLEURT (B), and DeepSeek-V3 (D). Given that using ROUGE-L to measure correctness can be inaccurate, I suggest displaying only the DeepSeek-V3 scores, as there are too many scores in Table 1. Lightly AI-edited
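A minimal sketch of the sensitivity signal at issue, assuming HuggingFace-style APIs; note that the probe prompt here is fixed, whereas SSP optimizes it per sample and additionally trains lightweight encoders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perturbation_sensitivity(model, tok, question, answer, probe, layer=-1):
    """Cosine-distance shift of a last-token hidden state when a probe
    prompt is prepended; under the paper's hypothesis, larger shifts
    indicate truthful content."""
    def embed(text):
        ids = tok(text, return_tensors="pt").input_ids
        hidden = model(ids, output_hidden_states=True).hidden_states[layer]
        return hidden[0, -1]  # last-token embedding at the chosen layer
    h = embed(f"{question} {answer}")
    h_probed = embed(f"{probe} {question} {answer}")
    return 1.0 - F.cosine_similarity(h, h_probed, dim=0).item()
```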
Perturbations Matter: Sensitivity-Guided Hallucination Detection in LLMs Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper studies hallucination detection in large language models by measuring the perturbation sensitivity of intermediate embeddings under sample-specific prompts. The method is motivated by a theoretical oracle construction suggesting that prompts can be designed to maximize sensitivity for truthful outputs while minimizing it for hallucinations. Accordingly, the authors propose generating a noise prompt to probe the sensitivity of the model's internal representations. The paper evaluates the approach across multiple benchmarks, demonstrating improved detection performance over several baselines and showing robustness across different datasets. The paper includes thorough evaluations, covering multiple datasets and models, and performs valuable ablation studies. In particular, the method demonstrates promising generalization performance, indicating practical applicability. 1. The paper should better situate itself among prior perturbation-based hallucination detection methods (e.g., SPUQ [1]), which often find hallucinated outputs to be more sensitive. While the current work focuses on a different aspect (intermediate embeddings rather than outputs), it nonetheless creates an apparent tension with prior work that would benefit from clarification. [1] Gao, Xiang, Jiaxin Zhang, Lalla Mouatadid, and Kamalika Das. "SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models." In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2336-2346. 2024. 2. It would be beneficial to decode and display the perturbed prompts and report model performance under perturbation. If performance degrades significantly, the method requires two inference passes per sample: one for generation and one for decoding. In that case, comparisons to simpler baselines, such as linear probes, may no longer be fair given the added computational cost. 3. The proof in Section 3 appears to formalize the intuition rather than justify it. It assumes the existence of a per-sample prompt that maximizes sensitivity for non-hallucinated outputs and minimizes it for hallucinated outputs. However, a symmetric argument could be made in the reverse direction if the prompt is assumed to minimize sensitivity for hallucinated outputs. This does not resolve the tension noted above and can make the theoretical claim more confusing. 4. The motivation for defining a separate "separation" metric is unclear. It seems closely related to AUROC and does not appear in prior literature; the added value should be clarified. 5. It would be valuable to see the method's performance on smaller and newer Llama models, such as Llama 3.2 1B. See weaknesses. Lightly AI-edited
Perturbations Matter: Sensitivity-Guided Hallucination Detection in LLMs Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a sensitivity-based approach for hallucination detection in large language models (LLMs), arguing that internal representations alone fail to separate truthful and hallucinatory responses effectively. The authors introduce the idea of prompt-induced perturbation sensitivity (measuring how internal representations change when a prompt is slightly altered) and develop a theoretical claim that truthful responses are more sensitive to such perturbations. Building on this observation, they propose a new method, Sample-Specific Prompting (SSP), which dynamically generates prompts to estimate this sensitivity. Experiments across several datasets and LLM architectures show that SSP outperforms existing baselines on AUROC-based metrics. The paper addresses a relevant and timely problem, detecting hallucinations in LLMs, where most prior approaches focus either on self-assessment or static internal representations. The authors attempt to provide a fresh angle by investigating sensitivity rather than absolute embedding values. The theoretical framework, although somewhat abstract, is clearly presented, and the authors support their arguments with multiple datasets and models. The inclusion of both empirical and analytical sections demonstrates thoroughness, and the work's emphasis on sample-specific perturbations aligns with growing interest in adaptive and interpretable LLM evaluation methods. Despite its ambitious scope, the paper's contribution remains limited and somewhat overstated. The central claim, that prompt-induced sensitivity can differentiate truthful from hallucinatory outputs, appears largely empirical and circular, since prompts are tuned per sample in a way that trivially enforces separability, if I am not missing something. The "theoretical" results are notational restatements of empirical intuition rather than proofs grounded in meaningful probabilistic assumptions. The dependence on oracle-style prompt optimization raises questions about scalability and real-world feasibility; how such per-sample prompt adjustments can be achieved during inference is unclear. Moreover, the methodology is excessively complex for what amounts to a perturbation-based scoring heuristic. The theoretical bounds presented (Theorems 1 and 2) are neither practically verifiable nor rigorously linked to model architecture or data distributions. The authors claim near-perfect separability (~99%) but later concede that this is merely an "oracle lower bound", which significantly weakens the practical impact. In the experiments, improvements in AUROC are modest (a few percentage points) once fair baselines and realistic labeling conditions are considered. There is also no statistical significance testing or error analysis to justify the claimed robustness. Conceptually, the idea of "sensitivity-guided detection" overlaps with well-known approaches in adversarial robustness and influence functions, but these connections are not acknowledged. The paper's novelty claim therefore feels overstated.
The writing also tends toward heavy mathematical formalism without clear intuition, and the overall contribution risks being perceived as incremental. 1. How can the proposed Sample-Specific Prompting (SSP) be efficiently implemented at inference time, given that per-sample optimization is computationally expensive? 2. What are the actual computational costs (training and inference) compared to strong baselines such as HaloScope or EGH? 3. Can the authors provide an ablation showing how much improvement comes from sensitivity computation versus prompt tuning? 4. How sensitive are the reported results to the choice of similarity metric (cosine vs. Euclidean) or embedding layer? 5. What guarantees exist that perturbation sensitivity reflects factual correctness rather than syntactic or stylistic variance? 6. The paper claims "theoretical separability." Could the authors clarify what assumptions make the theorems valid and whether these hold in practical LLMs? 7. Why are real human-labeled factuality datasets not used for calibration, given the reliance on DeepSeek-V3 pseudo-labels? 8. Is there any evidence that the proposed metric generalizes to open-ended generation tasks, rather than structured QA datasets? Fully AI-generated
Perturbations Matter: Sensitivity-Guided Hallucination Detection in LLMs Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a hallucination detection technique that uses expertly chosen prompts (adapted for each question) to perturb the distribution of true and hallucinated outputs, creating a bigger separation between them than without these additional prompts. The aim is to use these perturbations to increase the separability between true outputs and hallucinations. The paper first proposes a theoretical discussion on how a specific prompt can be crafted for each question that would almost perfectly separate the hallucinated and true outputs. The paper then supports its claim by comparing its hallucination detection technique with other SOTA techniques, showing definite improvements. 1. The paper provides both a theoretical analysis as well as the empirical results to support their hypothesis. To my understanding, they both appear to be on solid ground. 2. A large plethora of techniques are compared, across several different datasets, and two different models (a third one in the appendix). Overall, the experiments are robust enough to support the final claim. 3. The paper is well written and easy to read. I enjoyed reading this work. I don't have any weaknesses in the soundness or contributions of this paper. I really enjoyed reading this work. I do, however, have a big objection to papers that move the Related Work section to the appendix. The appendix is to provide additional results/analysis for readers who might be interested in learning more about the paper. It is NOT simply an extension of the main paper, and in my opinion, a lack of related works discussion in the main paper really hurts the readability of the work. I understand problems of limited space, but related work should not be the section that gets axed because of it. I don't like it when reviewers suggest adding new parts to the paper without also suggesting what should be removed. Just a suggestion, I believe details about Theorem 1 and the results of Figure 2 can be compressed, with the rest moved to the appendix, to make space for related work discussion in the main paper (if the authors prefer, they can even have a 'shorter' related work section in the main paper to situate their work, and a 'longer' related work section in the appendix). Feel free not to take this suggestion, using your own way of finding space, but I strongly recommend having a brief discussion of related work in the main paper. My assumption is that the final 'training data' actually used to train the detector contains a set of 100 questions, one answer per question, and a label for whether that answer is a hallucination or the truth. This is what I believe is used in other detection methods, and so I assume the same is done here. If the above is correct, the detector never really has access to the 'hallucinated output' and 'true output' pairs for the same question. While the objective is to push all hallucinated answers to low sensitivity and true answers to high sensitivity, which automatically creates a separation, I wonder if having access to actual pairs would help with the separability even more? 
Curious to hear if the authors think a small set of carefully labeled data with actual pairs could be beneficial, or if just pushing the answers to two extremes can achieve that implicitly? Fully human-written
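To ground the pairing question above, here is a minimal sketch of the detector-training setup the reviewer assumes: one labeled answer per question, with a simple classifier fit on scalar sensitivity scores. All scores and labels below are synthetic placeholders; a real pipeline would obtain them from the LLM.

```python
# Minimal sketch: fit a detector on scalar sensitivity scores (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical assumption: truthful answers (label 1) skew toward higher sensitivity.
truthful = rng.normal(0.7, 0.15, size=50)
halluc = rng.normal(0.4, 0.15, size=50)
X = np.concatenate([truthful, halluc]).reshape(-1, 1)
y = np.concatenate([np.ones(50), np.zeros(50)])

clf = LogisticRegression().fit(X, y)
print("AUROC:", roc_auc_score(y, clf.predict_proba(X)[:, 1]))
```

With actual truthful/hallucinated pairs per question, the same detector could instead be fit on within-question score differences, which is what the reviewer's question is probing.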
EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a framework for exploration and sample filtering in RLVR. The experimental results show that it can enhance RL reasoning performance. - This article focuses on an important issue, the significance of exploration for RL. - The idea is very simple and extensible to prior methods. - The experimental design is relatively weak, with too few baselines — aside from the fundamental algorithm GRPO, the comparison includes only one method of the same type (DAPO). The experimental analysis is also insufficient. - The paper does not verify scalability, such as testing across different model architectures or sizes. - All chosen benchmarks are standard math tasks, without any out-of-distribution (OOD) tasks to demonstrate the effectiveness of exploration. As noted in the weaknesses, the paper needs to add more experiments and analyses: - Provide clearer differentiation from concurrent work with detailed comparisons. - Expand experiments to include more baselines, models, and diverse domains. - Conduct more thorough case analysis. If the authors can address the above concerns in a revision, I would be willing to reconsider my assessment. Lightly AI-edited
EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Vanilla Group Relative Policy Optimization (GRPO) suffers from limited exploration and training instability. To address these, the authors introduce EFRame, which augments GRPO via three components: (1) additional rollouts to promote exploration, (2) online filtering to remove low-quality samples and stabilize gradients, and (3) experience replay for stable convergence (a schematic sketch of this loop appears after this review). Through these mechanisms, EFRame balances exploration, efficiency, and stability, ultimately achieving a 4.6% improvement on Geometry3K over vanilla GRPO. 1. The authors provide a recipe for stable RL training that includes additional rollouts with higher temperature, online filtering, and experience replay. I believe it's a promising research direction. 2. This paper provides a detailed analysis of each introduced mechanism based on the current challenges of GRPO, which is well motivated and reasonable. 3. This paper is well organized and easy to follow. I discuss weaknesses concerning originality and experiments. Weaknesses marked with **W** are key concerns that might affect the final rating, while weaknesses marked with **M** may have minor impact on my rating. ### Originality **[M1]** The core ideas used in this work, *i.e.*, adaptive sampling for hard problems [1][2], online filtering [2][3], and experience replay [4][5], have been explored in prior literature. This work combines these existing ideas well, but it's not very inspiring to me. ### Experiment **[W1] Limited evaluation domains.** I notice that the training dataset includes the math domain (DAPO-17k) and other general domains (*e.g.*, ScienceQA, ChartQA). However, the evaluation only focuses on mathematical reasoning domains, which raises concerns about generalization. I suggest adding general reasoning tasks like MMLU-STEM, MMLU-Pro, GPQA, MMMU, and DocVQA to further validate the effectiveness of the proposed method. **[W2] Limited model backbones.** This paper only uses the Qwen2.5-7B series (*i.e.*, Qwen2.5-math-7b, Qwen2.5-VL-7B-Instruct) to conduct the experiments. However, recent studies reveal that there may be potential data contamination in the Qwen model [6]. Consequently, conclusions derived from contaminated benchmarks (MATH-500, GSM8K) on the Qwen2.5 series may be unreliable. The transferability of the proposed method to different models like Llama or Gemma warrants a more in-depth investigation. **[M2] Concerns on Performance Gains.** I notice that EFRame improves ~1.0% over baselines on most benchmarks (Table 1). Does this suggest a possible limit to the power of this method to discover new reasoning patterns? **[W3] Concerns on Experimental Settings.** I notice the maximum response length is set to 2,048 in RL training (Appendix A). But in DAPO, the default maximum response length is 20,480. So I wonder if the settings unintentionally impaired the exploration capabilities of baselines like GRPO and DAPO. **[M3] Missing hyperparameters.** I could not find any information about the evaluation hyperparameters. What are the maximum response length, temperature, and number of samples used in evaluation?
Besides, I don't see the prompt template used in training. **[W4] Missing computational costs discussion.** In the vanilla GRPO baseline, what is the number of responses sampled for each question? It seems that the introduction of additional rollouts and experience replay may bring more computational overhead. I suggest reporting the relevant computational costs clearly. --- [1] Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL. NeurIPS, 2025 [2] DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv preprint arXiv:2503.14476 [3] MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning. arXiv preprint arXiv:2503.07365 [4] RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning. arXiv preprint arXiv:2507.07451 [5] Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945 [6] Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination. arXiv preprint arXiv:2507.10532 1. What is the setting of GRPO-1 and GRPO-2 in the Introduction? Are there any differences? 2. How many old responses are used for replay? Can the authors provide more details on the replay process? 3. How will EFRame handle samples without positive signals after the additional rollouts? Fully human-written
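As referenced in the summary of the review above, here is a schematic, self-contained sketch of one exploration-filter-replay step. Rollouts are simulated with random rewards; the names, thresholds, and replay policy are this reviewer's assumptions, not the authors' implementation.

```python
# Schematic sketch of one exploration-filter-replay step (all names assumed).
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    prompt: str
    reward: float  # 1.0 if the verifier marks the answer correct, else 0.0

@dataclass
class ReplayBuffer:
    items: list = field(default_factory=list)
    def add(self, trajs): self.items.extend(trajs)
    def sample(self, k=4): return random.sample(self.items, min(k, len(self.items)))

def rollout(prompt, n, temperature=1.0):
    # Stand-in for LLM sampling; higher temperature raises success variance here.
    p = 0.1 * temperature
    return [Trajectory(prompt, float(random.random() < p)) for _ in range(n)]

def efr_step(prompts, buffer, g1=5, high_temp=1.2):
    batch = []
    for prompt in prompts:
        group = rollout(prompt, g1)
        if len({t.reward for t in group}) == 1:   # zero advantage: explore harder
            group += rollout(prompt, g1, temperature=high_temp)
        if len({t.reward for t in group}) == 1:   # still uninformative: filter out
            continue
        buffer.add([t for t in group if t.reward > 0])  # keep rare successes
        batch.extend(group)
    batch.extend(buffer.sample())  # replay amplifies informative trajectories
    return batch  # this batch would feed a GRPO-style policy update

batch = efr_step(["prompt-a", "prompt-b"], ReplayBuffer())
print(len(batch))
```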
EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework Soundness: 3: good Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes EFRame, a framework that enhances the reasoning ability of LLMs through an exploration–filter–replay mechanism built on top of GRPO. The idea is to generate more diverse samples, filter low-quality responses online, and replay high-quality trajectories to improve stability and exploration efficiency. Experiments are conducted on Qwen models across several reasoning benchmarks. 1. The proposed Exploration–Filter–Replay framework is conceptually clear and easy to follow. 2. The method improves training stability and reasoning accuracy compared to GRPO baselines. 3. The ablation experiments provide useful insight into the contribution of each component. 1. Limited novelty: Similar mechanisms have already been explored in RLEP [1], RePO [2], and VL-Rethinker [3], which all employ replay-based or filtering strategies to stabilize reinforcement learning for reasoning tasks. 2. Baseline insufficiency: The paper does not compare against these closely related works [1–3], making it unclear how much gain is attributable to EFRame itself. 3. Lack of exploration metrics: The claimed improvement in exploration is not supported by pass@k, a standard evaluation metric for reasoning diversity. 4. Model limitation: Experiments are restricted to two Qwen models, with no tests on other LLM families (e.g., Llama, DeepSeek). 5. Benchmark limitation: The paper omits newer reasoning benchmarks such as AIME25, MMT-Feb24, HMMT-Feb25, and CMIMC25. 6. Data contamination risk: Recent research shows that Qwen2.5 is susceptible to data leakage on certain reasoning benchmarks, raising doubts about evaluation improvements [4]. 7. Lack of theoretical explanation: The paper provides no deeper analysis of why the combined exploration–filter–replay design leads to consistent improvement beyond empirical evidence. 1. How does EFRame differ algorithmically from RLEP [1], RePO [2], and VL-Rethinker [3]? 2. What are the results on pass@k? (The standard unbiased estimator is recalled after this review.) 3. Have the authors tested EFRame on other model families to confirm generality? 4. Would the conclusions hold on newer benchmarks such as AIME25 or HMMT-Feb25? [1] Zhang et al., RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning, arXiv:2507.07451 [2] Li et al., RePO: Replay-Enhanced Policy Optimization, arXiv:2506.09340 [3] Wang et al., VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning, arXiv:2504.08837 [4] Wu et al., Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination, arXiv:2507.10532 Moderately AI-edited
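For reference, the standard unbiased pass@k estimator mentioned in the questions above (introduced with the Codex evaluation) computes, for each problem, from $n$ sampled responses of which $c$ are correct:

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right].$$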
EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework Soundness: 1: poor Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes EFRame, an Exploration–Filter–Replay reinforcement learning framework designed to enhance the reasoning capability and stability of large language models (LLMs) during post-training. Building upon Group Relative Policy Optimization (GRPO), EFRame introduces three complementary modules: additional rollouts to promote targeted exploration for difficult prompts, online filtering to remove low-quality or zero-advantage samples (stabilizing gradients and improving efficiency), and experience replay to amplify the influence of rare but informative trajectories (mitigating entropy explosion and ensuring stable convergence). - The paper is well written and easy to follow. - The experiments are conducted on three diverse datasets and the gains are strong. - The framework has three distinct parts, and the authors conduct ablations isolating the effect of each component. - While well-engineered, the framework primarily combines known components (resampling, filtering, replay buffer) on top of the existing GRPO framework rather than introducing a fundamentally new optimization principle. - The paper lacks theoretical justifications, and some claims are poorly supported: - In lines 243-248, "low-quality samples are significantly more numerous than high-quality ones, ... the informative signal from high-quality samples may be drowned out by chaotic updates from low-quality ones". I am not sure this is correct. Although there are more low-quality samples, the absolute value of their advantages is much closer to 0 than that of the high-quality ones. In other words, we assign larger weights to the high-quality samples when updating than to the low-quality ones. As you stated in Theorem 3.4, the sum of their advantages should be 0 (a short derivation appears after this review), which means that we put the same total "weight" onto low- and high-quality responses. - Theorem 3.5 invokes a result proved in the tabular setting with NPG; I am not sure it can be applied directly here. - The experimental settings are a bit "toy" and not very realistic: the paper uses Qwen 2.5 (which is a bit dated), a relatively short context of 2k tokens, and only 5 rollouts ($G_1$), while DAPO and other recent recipes set this to 16. The scale of the experiments is not large enough to showcase the effectiveness of the method. - Inconsistent experimental results. - Could the authors clarify how evaluation is performed on AIME and MATH? Is it based on pass@1 only? The AIME results suggest this may be the case, since the reported numbers are multiples of 1/30 (AIME has 30 questions). If so, the variance of pass@1 is quite high, and it would be more robust to report pass@32 or a similar metric. - For MATH, if the metric is indeed pass@1, it is unclear how results such as 65.5 and 78.3 were obtained. The test set contains 500 questions, so the accuracy should be a multiple of 1/500, and fractional correctness (e.g., half a question) is not meaningful. You cannot answer a question half correctly. - From Figure 5, we can see that, at step 100, the reward of the orange line is much higher than the blue line, while the accuracy is the other way around.
I wonder if it is also due to the large variance of the evaluation. - DAPO takes more time for each step as it keeps resampling. However, under the same number of steps, the performance of DAPO should be higher than GRPO since, for each batch, it contains more gradient signals as zero-advantage prompts are all filtered out. Could the authors provide an explanation for why DAPO performs worse than GRPO in their experiments? - Small typos: - Figure 3 (b) and (c), EFR should be EFRame. Please refer to the weaknesses. Fully human-written
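As referenced in the weaknesses above, the group-relative advantages in GRPO sum to zero by construction, so positive and negative samples carry equal total weight in the update. With group rewards $r_1,\dots,r_G$, mean $\bar{r}=\frac{1}{G}\sum_{i=1}^{G} r_i$, and standard deviation $\sigma_r$:

$$A_i = \frac{r_i - \bar{r}}{\sigma_r}, \qquad \sum_{i=1}^{G} A_i = \frac{1}{\sigma_r}\left(\sum_{i=1}^{G} r_i - G\,\bar{r}\right) = 0 .$$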
OptAgent: Optimizing Query Rewriting for E-commerce via Multi-Agent Simulation Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. - The paper introduces OptAgent, a framework designed to optimize e-commerce query rewriting through the use of multi-agent simulations and genetic algorithms. - OptAgent employs a population of LLM-based agents to simulate user evaluations by assessing the relevance of retrieved products. The ensemble's average rating functions as a fitness score for an evolutionary algorithm, which incrementally refines the user’s query using crossover and mutation operators executed by LLMs. - Experiments conducted on a newly developed dataset demonstrate significant improvements over all baselines. Comprehensive analyses of agent behaviors, task categories, and ablation studies on OptAgent components enhance the paper's completeness. - Query rewriting for e-commerce is a practical and underexplored setting for multi-agent systems. The framing of "multi-agent simulation + evolutionary algorithm" is creative and relevant. - The paper introduces a curated dataset of real-world e-commerce queries, which can support future research in this area. - The paper is well-written, with intuitive figures and detailed algorithmic breakdowns, making the method easy to follow. - The proposed evaluation function does not directly measure how well the rewritten query satisfies the user’s latent intent. Instead, it measures product–query relevance as judged by LLMs. This proxy may not reflect the actual goal of e-commerce query rewriting, which is user satisfaction or purchase fulfillment. - The same evaluation metric (multi-agent fitness) is used for both optimization and final evaluation, which is conceptually similar to evaluating on the training signal. This may inflate reported improvements and is not a standard practice in ML evaluation. - The baselines (LLM-Rewrite, BoN-Rewrite) are relatively weak. The absence of comparisons to stronger reasoning or optimization approaches (e.g., CoT-SC, MAD, or RL-based fine-tuning methods like SFT or RLVR) makes it hard to gauge the true competitiveness of OptAgent. - The reported metric improvements are entirely simulation-based. There is no direct user or behavioral validation (e.g., click or purchase data). The link between simulated multi-agent evaluation and real-world e-commerce outcomes remains unverified. - The current evaluation emphasizes product relevance but does not explicitly account for users’ latent intent. How can you ensure that the multi-agent scoring meaningfully reflects actual user satisfaction? A more robust evaluation might include fine-grained annotation of latent intent or user studies involving real participants. - Employing the same fitness function for both optimization and evaluation introduces a risk of overfitting. Could you justify this strategy by referencing previous works that have successfully adopted a similar approach, or alternatively, incorporate an independent evaluation metric, such as human assessments or retrieval-based evaluations, to better demonstrate generalization? - In some cases, the LLM baseline underperforms compared to the original query. 
Could you provide qualitative analysis or illustrative examples highlighting the scenarios where direct rewriting results in degraded performance, especially considering that an LLM could theoretically opt to retain the original query if it is already optimal? - To more thoroughly demonstrate OptAgent's strengths, it would be beneficial to compare with stronger baselines such as CoT-SC and MAD. Additionally, incorporating baselines that use trained models, such as SFT or RLVR on smaller model sizes, would further reinforce the rigor of your evaluation. - The current approach ensembles multiple agents by varying the sampling temperature, rather than using different underlying models or roles. As this is not a conventional ensembling strategy, it would be helpful to include an ablation study or further analysis to motivate and support this design choice (a schematic sketch of the temperature-ensemble fitness and evolutionary loop follows this review). Lightly AI-edited
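A minimal, runnable sketch of the temperature-ensemble fitness and evolutionary loop as this review understands the summary above; the scoring, crossover, and mutation helpers are random/string stand-ins for what would be LLM calls, and all hyperparameters are hypothetical.

```python
# Minimal sketch of the evolutionary query-rewriting loop (helpers are stand-ins).
import random
import statistics

def agent_score(query: str, temperature: float) -> float:
    """Stand-in for one simulated-shopper LLM rating the retrieval for `query`."""
    random.seed(hash((query, temperature)))  # deterministic toy scores
    return random.random()

def fitness(query: str, temps=(0.0, 0.25, 0.5, 0.75, 1.0)) -> float:
    """Average rating across the temperature-diverse agent ensemble."""
    return statistics.mean(agent_score(query, t) for t in temps)

def crossover(a: str, b: str) -> str:
    return f"{a} {b}"  # a real system would ask an LLM to merge the rewrites

def mutate(q: str) -> str:
    # A real system would ask an LLM to paraphrase or refine the query.
    return f"{q} {random.choice(['handmade', 'vintage', 'rustic', 'minimalist'])}"

def optimize(user_query: str, pop_size=6, generations=3) -> str:
    population = [user_query] + [mutate(user_query) for _ in range(pop_size - 1)]
    for _ in range(generations):
        survivors = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = [mutate(crossover(*random.sample(survivors, 2)))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)

print(optimize("vintage brass lamp"))
```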
OptAgent: Optimizing Query Rewriting for E-commerce via Multi-Agent Simulation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a framework called OptAgent to optimize e-commerce query rewriting. It is a multi-agent framework that applies genetic algorithms. On a dataset of 1,000 queries, it achieves better performance than the baselines. 1. This paper applies the genetic algorithm, which is an interesting approach. 2. It uses multiple agents to simulate user behaviors. 3. This paper focuses on query rewriting in e-commerce, which can benefit both customers and sellers. 1. The introduction cites information from unofficial websites. It is recommended to validate these sources with official documents such as financial statements to enhance credibility. 2. The repository link is not provided, which hinders reproducibility. 3. The dataset is relatively small (only 1,000 queries), which may limit the generalizability of the results. Consider expanding the dataset or using other datasets. 4. Efficiency concerns. Running multiple LLM agents over multiple generations is expensive. The paper defers cost details to the appendix, but a comparative analysis of the compute-vs.-gain tradeoff is lacking. 1. What is the average per-query efficiency of your method (e.g., runtime or number of LLM calls)? 2. Would you consider adding an RL-based query-rewriting method as a baseline? Lightly AI-edited
OptAgent: Optimizing Query Rewriting for E-commerce via Multi-Agent Simulation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses the challenge of e-commerce query rewriting (QR): user inputs are often ambiguous, and evaluation is subjective with no clear standards. It uses multiple LLM agents to simulate customers (enhancing diversity through temperature sampling) to generate dynamic reward signals, combined with a genetic algorithm (crossover and mutation in natural language) to iteratively optimize queries without relying on historical data, particularly helping tail queries capture user intent accurately. Key contributions include: - A multi-agent simulation serving as a robust fitness function for evaluation; - The OptAgent framework, which replaces static rewards with dynamic scoring; - On 1,000 Etsy queries, relevance improved by 21.98% over the original queries and 3.36% over the BoN baseline (4.5% for tail queries). The paper's key strength lies in its multi-agent simulation for query rewriting evaluation. Using LLM agents with varied temperatures (0.00–1.00), it creates dynamic reward signals that better approximate human preferences, reducing biases and enabling robust fitness scores through aggregated semantic relevance and purchase decisions. Another strength is its genetic algorithm optimization, which refines queries via natural language crossover and mutation. This population-based approach enhances exploration in subjective domains. (1) The multi-agent simulation uses isolated agents that differ only in temperature, lacking the interaction that could better mimic real shopper dynamics. Please explain the reasons for not using multi-agent interaction. (2) The inference-time cost is not adequately discussed. The high computational demand of ~1,525 LLM calls per generation casts doubt on the method's feasibility for high-volume use. (3) Comparisons are limited to basic baselines like BoN, and the reliance on a single LLM (Gemini) for both optimization and evaluation raises concerns about the generalizability of the findings. Please refer to the above. Moderately AI-edited
OptAgent: Optimizing Query Rewriting for E-commerce via Multi-Agent Simulation Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes OPTAGENT, which couples a multi-LLM “shopper” simulation with a genetic algorithm to optimize e-commerce query rewriting. The ensemble’s averaged scores serve as the fitness for evolution; reported gains are +21.98% over original queries and +3.36% over a Best-of-N LLM baseline. 1. Clear problem framing for subjective, label-scarce QR; replaces a single judge with a simulated crowd. 2. Simple, reusable mechanics: temperature-diverse evaluators; crossover/mutation directly in natural language. 1. Evaluator–optimizer coupling. Training and testing both rely on the same agentic evaluator; human correlation is only moderate (Pearson r=0.552), so overfitting to the evaluator remains likely. Needs independent metrics or online A/B evidence. 2. Baselines are weak. Comparisons cover user query, single LLM rewrite, and BoN only; missing stronger QR baselines common in practice. 3. External bias not controlled. Authors show strong position bias in purchases toward early-ranked items (Fig. 5) yet do not correct for it in fitness or evaluation (e.g., randomization, PBM/DBN). 4. Reproducibility and generalization risk. Dataset is from a single platform (Etsy). Results depend on live search pages and site behavior; claims of code release do not mitigate the moving-target evaluation environment. No evidence on other domains. 1. Baselines and Generalization: The paper mentions "LLM-as-a-Judge" similar to RLAIF and criticizes single LLM judges for biases, while noting Zuo et al.'s heavy reliance on "vast amounts of historical user interaction data." Beyond BoN, why not compare with RLAIF or session-graph-based QR methods? How does the framework generalize to other subjective tasks (e.g., personalization in recommendation systems)? What are the details of translation handling for multilingual queries (e.g., using LLM translation)? 2. Human Evaluation Details: What are the sample size, annotator diversity (e.g., cultural backgrounds) in the human study in Appendix D? Are the agent bias cases (Appendix C.1) mitigated during the optimization process? If budget allows, is there a plan for A/B testing on real Etsy data? Lightly AI-edited
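For context on the position-bias correction suggested in the weaknesses above: under a position-based model (PBM), the observed purchase/click probability factorizes into a position-dependent examination term and a position-independent relevance term, so relevance estimates can be debiased by inverse-propensity weighting over a document's impressions:

$$P(\text{click}\mid d, p) \;=\; P(\text{examine}\mid p)\cdot P(\text{relevant}\mid d) \quad\Rightarrow\quad \hat{R}(d) \;=\; \frac{1}{N}\sum_{\text{impressions of } d}\frac{\text{click}}{\hat{P}(\text{examine}\mid p)} .$$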
OptAgent: Optimizing Query Rewriting for E-commerce via Multi-Agent Simulation Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces OptAgent, a multi-agent framework for e-commerce query rewriting that (i) aggregates judgments from diverse LLM agents into a modular fitness signal and (ii) applies genetic search to optimize candidate rewrites at test time. On an Etsy query set spanning head/torso/tail, fandom, and multilingual segments, OptAgent outperforms the original user queries and Best-of-N baselines. 1. The paper introduces a practical and empirically effective approach that couples a multi-agent, policy-guided evaluator with genetic search over query rewrites, delivering clear gains. Incorporating genetic search into the multi-agent optimization workflow offers a measure of novelty. 2. The paper offers a comprehensive evaluation spanning multiple query segments (head/torso/tail, fandom, multilingual). 1. Using the same multi-agent system to both grade and guide the search can make offline results look better than they really are, because the optimizer may learn the judge’s preference instead of what real users want; if you swap in a different judge (e.g., another model or prompt), the gains may shrink, suggesting overfitting to the original judge. 2. The baselines could be stronger, because gains over the user’s original query and a Best-of-N sampler alone don’t establish progress; beyond BoN, the paper should include baselines from more recent works, e.g., context-aware query rewriting and RL-optimized QR. Besides, the improvements over BoN are not that impressive (less than 5%). 1. The authors should show results under an independent evaluator to rule out judge-specific optimization. 2. The authors should add broader baselines beyond BoN (e.g., context-aware QR, RL-optimized QR) to demonstrate improvements over stronger baselines. Moderately AI-edited
Resolving the Security-Auditability Dilemma with Auditable Latent Chain-of-Thought Soundness: 1: poor Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. - The paper highlights a "Security-Auditability Dilemma" that exists with reasoning models, where exposing reasoning traces can be useful for transparency but can create vulnerabilities and information leakages. - The paper first performs experiments to provide evidence of this dilemma. They show that reasoning can improve safety against non-adaptive attacks but that reasoning is still vulnerable to adaptive attacks. They also show that masked reasoning methods greatly outperform non-reasoning, highlighting the value of maintaining reasoning (despite vulnerabilities in vanilla reasoning). - ALCA is proposed as a solution to this dilemma. ALCA works as follows: (1) Trains a probe to identify when a future reasoning step may be harmful to reveal. (2) Trains the LLM to generate reasoning in a latent space when the probe triggers. (3) Trains the model to decode its latent reasoning into text when a special token is inserted. - Experimental results are presented which provide evidence that ALCA maintains the model's capabilities while reducing attack success rates versus baselines. They also show that the decoded latents have semantic similarity to ground-truth texts. Combined, these results are evidence of ALCA producing an improvement in the security-auditability Pareto frontier. - The paper highlights the Security-Auditability Dilemma. This appears to be a novel contribution and an important dilemma worth noting and addressing. They also provide empirical results to validate the existence of this dilemma. - The explanation of the proposed method and solution is mostly clear. - They provide evidence that their ALCA method functions as intended, and could be a solution to the Security-Auditability Dilemma: (i) It reduces attack success rates (ASR), demonstrating mostly reduced ASR versus the presented baselines, (ii) they present evidence that the decoding method works, meaning auditability can be maintained, (iii) they provide evidence that ALCA models maintain good performance on capabilities benchmarks. **Motivating the utility of providing user-facing reasoning traces** The authors could perhaps do a better job of motivating the utility of presenting CoT reasoning traces to users (who could be potential attackers). A solution to the security issue is to hide the reasoning completely from users. However, there may be reasons we would still like to show users as much reasoning as possible. The paper does not seem to describe these reasons very well -- the reasons it does touch on, such as "transparent reasoning traces as supervision target in training", do not seem relevant for user-facing applications. **ALCA seems a convoluted solution, simpler baselines may exist** ALCA seems to be an overly convoluted and complex solution to the security-auditability dilemma. It does not seem to be properly baselined against simpler, potentially more natural solutions. The main goal of ALCA appears to be to provide a method that shows the user all harmless reasoning while hiding any reasoning that could create potential vulnerabilities or information leakages.
A more natural and simpler solution here, for example, is to simply have another LLM redact sensitive parts of the reasoning before providing it to the user. A simplification of ALCA would be to simply mask reasoning tokens from the user where the probe fires (a minimal sketch of this baseline appears after this review). These are methods that do not alter the model's actual generations and so will not impact the "auditability" axis. The paper uses a relevant masked reasoning method in Section 2, but does not seem to baseline ALCA against masked reasoning in the main results, which seems a problematic omission. Moreover, none of the existing baselines used in the paper are described, motivated, or contextualized. It is unclear if they are meaningful baselines for a security-auditability evaluation (they may only be good baselines for the "security" component). **Lack of empirical focus on auditability** In general, the paper seems to heavily focus on the "security" component, and neglects the "auditability" component. Namely, the auditability of ALCA is not compared to any baseline methods, and the metrics used for measuring auditability in Table 3 seem unconvincing (these metrics are proxies for auditability, not direct measures). The paper does not seem to include any model generations -- it would be useful to qualitatively compare decoded latents to the ground-truth text. **Presentation issues** I have some concerns regarding the care that went into the preparation of the paper. All text in the paper is in bold from page 5 onwards. There are a few other formatting issues throughout (e.g., line 447). The conclusion is very minimal and there is no discussion of limitations. Table 2 seems to, on multiple occasions, highlight the ALCA result as the best performing when it appears a different method was the best performing (e.g., for Llama-3, ALCA GCG is worse than STAIR GCG?). Citations are sometimes missing, e.g., for the TAP method in line 132. **Other** Previous papers have proposed methods for latent reasoning. This paper does not ground its latent generation approach in existing literature. - Can you confirm you are the first paper to introduce the "security-auditability dilemma" in this context? - Is there a reason you did not try the simpler baseline approaches I touched on above? Could you run these baselines? - Do you agree that the "auditability" axis is neglected in your experiments? Could you run additional experiments to better validate the auditability of ALCA? - In Section 2, how exactly is the 'masking' in the masked reasoning performed? - What dataset do you use for training the probe? - Why did you choose the decoding method you used? Did you try other approaches? Why did you not just decode the latents directly through lm_head? - Can you include some example generations from the model? In particular, generations decoded from latents would be interesting. - For the "downstream %" results in Table 2, was the model generating its reasoning in latent mode? - In Section 4.3, the plots mention L_decode, but the text mentions L_causal -- is this a mistake? Fully human-written
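A minimal sketch of the masking baseline proposed in the review above: show the user the reasoning trace with probe-flagged steps redacted, leaving the model's generation untouched. The probe and the trace here are hypothetical stand-ins.

```python
# Minimal sketch of the proposed masking baseline (probe is a stand-in).
from typing import Callable, List

def mask_reasoning(steps: List[str], probe: Callable[[str], bool],
                   redaction: str = "[REDACTED]") -> List[str]:
    """Replace probe-flagged reasoning steps with a redaction marker."""
    return [redaction if probe(step) else step for step in steps]

# Hypothetical probe: flags steps that mention refusal policies.
toy_probe = lambda s: "refusal policy" in s.lower()
trace = ["User asks about chemistry.",
         "Check refusal policy for hazardous synthesis.",
         "Answer the benign part of the question."]
print(mask_reasoning(trace, toy_probe))
```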
Resolving the Security-Auditability Dilemma with Auditable Latent Chain-of-Thought Soundness: 3: good Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents Auditable Latent CoT Alignment (ALCA) to address a key vulnerability in CoT-based safety alignment: when explicit safety reasoning is visible to attackers, jailbreaks can exploit it. ALCA (1) uses a probe to detect safety-relevant reasoning steps, (2) executes those steps as latent autoregressive deliberation in hidden states, and (3) supports self-decoding for auditability. Experiments show reduced attack success rate (ASR drops from ~65% → ~9%) without harming helpfulness. Hidden CoT significantly improves resilience vs. explicit safety-CoT baselines. This work is timely and provides a practical path to improve alignment robustness under adversarial prompting. Clear motivation: The paper articulates the security–auditability dilemma clearly and supports it empirically (Table 1). Well-designed architecture: The three-component alignment strategy — probing → latent reasoning → self-decoding — is conceptually coherent and technically implementable. Superior robustness against jailbreak attacks: Attack Success Rate significantly drops compared to all explicit-CoT safety baselines (Table 2). Although ALCA improves robustness against jailbreak attacks, it remains limited to security-related CoT. Many real-world failures involve broader hallucinations (e.g., fabricated facts, URLs, or numbers), and it is unclear whether the approach generalizes to these cases. Clarifying this applicability would strengthen the practical impact. Can ALCA handle hallucinations unrelated to safety refusals, such as fabricated URLs, incorrect medical facts, or misleading numeric claims? If not, how do the authors envision extending ALCA to these scenarios? How does the probe distinguish between "security reasoning" and other forms of critical reasoning (e.g., factual validation)? Could misclassification lead to harmful latent reasoning being output without scrutiny? Moderately AI-edited
Resolving the Security-Auditability Dilemma with Auditable Latent Chain-of-Thought Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper addresses the "Security-Auditability Dilemma" in LLM safety alignment: exposing chain-of-thought reasoning for auditability creates vulnerabilities to adaptive attacks. The authors propose ALCA (Auditable Latent CoT Alignment), which performs safety reasoning in continuous latent space (invisible to adversaries) while maintaining auditability through self-decoding. Experiments across three models show 54% reduction in adaptive attack success rates compared to baselines while preserving downstream performance. 1. Novel problem formulation: The Security-Auditability Dilemma identifies a real tension in current safety alignment approaches that deserves attention. 2. Comprehensive empirical evaluation: Testing across multiple models (Llama-3, Mistral-7B, Qwen2) and attack methods provides breadth. 3. Creative technical approach: Moving reasoning to latent space while maintaining decodability through self-decoding is innovative. 4. Strong motivating experiments: Section 2 effectively demonstrates the dilemma through controlled experiments. 1. Circular evaluation methodology: Using the model itself to evaluate reconstruction fidelity (Table 3, semantic similarity 0.96) is methodologically flawed. Independent human evaluation or external metrics are essential for trustworthy assessment. 2. Missing theoretical foundations: The "equivalent conditions" (Section 3.1) assume idealized scenarios. No formal analysis proves latent reasoning preserves safety properties or that self-decoding is faithful. 3. No adversarial analysis of ALCA: The paper doesn't consider attacks targeting the probe classifier or attempting to manipulate mode selection. For a security paper, this is a critical omission. Adversaries could learn to trigger incorrect mode selection. 4. Training instability: Figure 4 shows latent-only training catastrophically collapses mid-training. This suggests the method is fragile and may be difficult to reproduce reliably. 5. Insufficient ablation studies: Only N (latent steps) is studied. What about probe architecture, trigger mechanisms, loss weights? The selection of layer 28 for probing appears arbitrary. 1. How do you handle probe misclassification? What's the false positive/negative rate under adversarial pressure specifically targeting the probe? 2. Can you provide independent evaluation of self-decoding fidelity using human judges rather than the model itself? 3. What happens when the 4-10% semantic information lost during reconstruction includes safety-critical details? 4. How does ALCA perform against adversaries aware of its architecture who specifically try to exploit the probe or latent mechanism? 5. Why choose layer 28 for probing? Did you experiment with other layers or adaptive layer selection? Fully AI-generated
Resolving the Security-Auditability Dilemma with Auditable Latent Chain-of-Thought Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes **Auditable Latent Chain-of-Thought Alignment (ALCA)**, a framework that aims to balance *security* and *auditability* in safety-aligned large language models. The authors argue that exposing explicit reasoning traces (Chain-of-Thoughts) improves transparency but simultaneously enables jailbreaks and prompt-injection attacks. ALCA attempts to solve this by encoding reasoning in a **latent space**, inaccessible to adversaries, while providing an **auditing mechanism** that can decode latent representations into interpretable rationales for safety verification. Experiments are conducted on several LLMs (LLaMA-3-8B, Mistral-7B, Qwen2-7B) using multiple jailbreak benchmarks (GCG, AutoDAN, PAIR). - The paper raises an important and underexplored issue in model alignment: the inherent tradeoff between **security** and **auditability**. - The idea of performing safety reasoning in a **latent space** is conceptually appealing and may inspire future research. - Evaluation includes several modern jailbreak methods (GCG, AutoDAN, PAIR), which shows awareness of the current security landscape. - The ablation experiments (latent-only vs. causal-only vs. hybrid) offer some insight into how different supervision components contribute to robustness. - **Lack of experimental rigor:** Attack success rate results are not averaged across runs or accompanied by standard deviations. - **Unclear evaluation metrics:** GPT-4-based judgments are used without assessing consistency or inter-run variability. - **Incremental novelty:** The method builds upon prior latent reasoning and safe decoding techniques without introducing a fundamentally new idea. - **Missing cost analysis:** There is little discussion of computational overhead or latency introduced by latent decoding and probing. - **Presentation issues:** Figures are difficult to interpret, and **approximately half of the paper’s text is rendered in bold font**, which significantly reduces readability and suggests formatting errors in the submission. - **Reproducibility gaps:** Experimental details (training hyperparameters, dataset splits, and attack configurations) are missing, making replication difficult. 1. How many independent runs were performed for each Attack Success Rate (ASR) in Table 2, and were statistical confidence intervals reported? 2. Since GPT-4 is used for evaluation, did the authors validate the consistency of its jailbreak-judgment outcomes across random seeds or prompt rephrasings? 3. Could the authors provide quantitative measurements (e.g., GPU hours, latency) comparing ALCA with baseline safe-decoding methods such as STAIR or COCONUT? 4. How does ALCA fundamentally differ from frameworks like **CoIn**, which already achieve auditability of hidden reasoning through token-level verification and cryptographic attestation? Specifically, what additional capability does latent-space reasoning provide beyond CoIn’s measurable and verifiable auditing approach? Fully AI-generated
USDPnet: An Unsupervised Symmetric Deep Framework for Robust Parcellation of Infant Subcortical Nuclei Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This work, `USDPnet`, proposed an unsupervised surface-based parcellation pipeline for subcortical nuclei, focusing on infant brain MRI. Leveraging a divergence measure (namely, the generalized Cauchy-Schwarz Divergence, `GCSD`) and a symmetry constraint, a deep neural network was trained to cluster each vertex of the surface mesh. The proposed framework was evaluated using longitudinal BCP datasets from infants aged 0-24 months. The main contributions are to utilize unsupervised clustering leveraging GCSD and symmetry regularization to parcellate subcortical nuclei on a surface mesh. This work has some merits, such as providing averaged results from thirty runs under different settings and a sensitivity analysis, while it also exhibits several weaknesses and has some confusing points. Please refer to my review below. If my concerns can be adequately addressed, I'd be happy to revise my rating. 1. Some cluster/subregion counts for each nucleus are similar to a published work [1] in Nature Neuroscience about subcortical parcellation utilizing fMRI. 2. This work evaluated the proposed and comparison methods under various settings (different cluster numbers) in 30 runs. The average results confirm the better performance of the proposed work. 3. Parameter sensitivity and ablation analyses were conducted. 4. Open-source contribution. [1]: Tian, Ye, et al. "Topographic organization of the human subcortex unveiled with functional connectivity gradients." Nature Neuroscience 23.11 (2020): 1421-1432. 1. Reproducibility is not evaluated. That is, with the optimal subregion/cluster counts, if the unsupervised clustering is repeated several (3-10) times, what is the adjusted Rand index when one run (e.g., the current result) is chosen as the ground truth? This is very important but missing in the current manuscript (a sketch of this check appears after this review). If the adjusted Rand index is low, meaning the result is irreproducible, then even if the other metrics indicate superior performance, it still significantly undermines the value of this work. This is the main reason that impacts my rating. 2. There is a significant sensitivity to the parameters. This should be elaborated more in the manuscript and expressed as a major limitation. 3. The current manuscript is not so clearly presented. Please see the questions below. 4. There are limited contributions to representation learning or unsupervised clustering, as the GCSD is an existing and published work, and symmetry regularization is an incremental change. It brings more significant contributions to neuroscience than to representation learning or unsupervised clustering. 5. The reviewer suggests refraining from using words like "anatomically plausible", "biologically plausible", or any phrasing implying the cluster is physiologically sound and correct. This is NOT supported by any evidence in the current manuscript. - The higher SC/CH/RE/FH does `NOT` indicate any plausibility in the physiology and neuroscience world. They are technical metrics evaluating a clustering algorithm. - Visual comparison with [1] in Fig. 3 does not directly imply the plausibility.
There is a visual discrepancy between [1] and USDPnet, particularly in the X view of the Putamen. - To claim plausibility, many more experiments and statistical analyses should be conducted, beyond some clustering metrics and a visualization comparison with [1]. 6. In Lines 432-435, it is better to provide some quantitative metrics to indicate agreement. Qualitative visualization is not enough to support those arguments. [1]: Tian, Ye, et al. "Topographic organization of the human subcortex unveiled with functional connectivity gradients." Nature Neuroscience 23.11 (2020): 1421-1432. 1. In Appendix `D2`, the 4D atlas construction was included as part of the work. The reviewer is curious about the rationale for redoing this 4D atlas construction given that BCP has already constructed one. Moreover, that 4D atlas is peer-reviewed and publicly accessible [1]. Why reinvent the wheel? Similarly, in `D3`, the segmentation process was described in detail. However, the anonymized repository points directly to the public BCP atlas, which confused the reviewer. If the BCP atlas is used, why is the publication [1] not cited in the manuscript? The way it currently reads implies that the atlas construction and segmentation are part of the contributions of this work, which is incorrect. The BCP atlas is already established, peer-reviewed, and released ([1] and [link](https://www.nitrc.org/projects/uncbcp_4d_atlas/)). It cannot be reclaimed as a contribution in a new work. Doing so could be an integrity issue. I raise an ethical concern regarding this point. 2. Do the results in Table 2 correspond to the experiment mentioned in Appendix `E`? 3. Do the results in Table 1 correspond to the optimal setting mentioned in lines 916-917? 4. Is the feature encoder extracting features from the atlas at a single time point or from multiple time points? 5. Why is the symmetry consistency loss based on MSE, not cross-entropy? [1] Chen, Liangjun, et al. "A 4D infant brain volumetric atlas based on the UNC/UMN baby connectome project (BCP) cohort." NeuroImage 253 (2022): 119097. Fully human-written
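A minimal sketch of the reproducibility check proposed in the first weakness above: repeat the clustering with different seeds and compute the adjusted Rand index against a reference run. The labels below are synthetic placeholders; real ones would come from repeated USDPnet runs.

```python
# Minimal sketch of the proposed reproducibility check (labels are synthetic).
import numpy as np
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n_vertices, n_clusters, n_runs = 1000, 4, 5

# Stand-in for repeated clustering runs; real labels would come from the model.
runs = [rng.integers(0, n_clusters, size=n_vertices) for _ in range(n_runs)]

reference = runs[0]  # e.g., the published parcellation as ground truth
aris = [adjusted_rand_score(reference, labels) for labels in runs[1:]]
print("mean ARI vs. reference:", float(np.mean(aris)))  # near 1.0 => reproducible
```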
USDPnet: An Unsupervised Symmetric Deep Framework for Robust Parcellation of Infant Subcortical Nuclei Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper presents USDPnet, an unsupervised network for deformable medical image registration. Instead of relying on ground-truth deformation fields, the method introduces a dual-path framework that aligns source and target images through both intensity-based and structural similarity losses. The model incorporates a pyramid-level deformation strategy and an uncertainty-guided regularization term to stabilize training and improve anatomical alignment. Experiments on multiple 3D medical datasets show that USDPnet achieves accuracy on par with or better than supervised approaches while maintaining fast inference. The paper addresses a core challenge in medical image registration: learning accurate deformation fields without supervision through a well-thought-out architecture. The dual-path design combining global and local cues is elegant and grounded in practical clinical needs. The inclusion of uncertainty-guided regularization is a nice touch that helps balance smoothness and precision in difficult regions. Results across datasets are solid, showing clear improvements in dice scores and alignment metrics compared to VoxelMorph and TransMorph. The paper is also clear, with figures that make the deformation behavior interpretable. The novelty is somewhat modest; many elements (e.g., pyramid strategy, dual losses) build upon existing unsupervised registration methods. The paper could better clarify what makes its dual-path design fundamentally different, rather than a refined combination of prior ideas. The evaluation is also limited to standard benchmarks; there’s little exploration of how USDPnet generalizes to unseen modalities or pathological scans. The uncertainty term, while useful, is described heuristically with little theoretical justification. Finally, runtime and memory costs aren’t reported, leaving open how scalable the model is for large 3D volumes. - How sensitive is USDPnet to hyperparameter settings in the uncertainty weighting term? - Could the model adapt to multi-modal registration (e.g., CT–MRI) without retraining? - How does USDPnet handle large deformations compared to transformer-based approaches like TransMorph? Fully AI-generated
USDPnet: An Unsupervised Symmetric Deep Framework for Robust Parcellation of Infant Subcortical Nuclei Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The study imposes mirror symmetry on the parcellation of infant subcortical nuclei, which are known to be structurally symmetrical but are typically processed with methods designed for adult brain parcellation. The method depends on an autoencoder-based architecture that processes the left and right hemisphere vertices with a symmetry-aware clustering mechanism that utilizes the Generalized Cauchy-Schwarz divergence on surface meshes. The clustering loss favors bilateral consistency, higher inter-cluster sample distances, lower intra-cluster sample distances, and sparse cluster assignment vectors. The results include a thorough ablation and comparisons with various baseline clustering methods, showing that the proposed method outperforms the baselines. - Bilateral symmetry-aware segmentation/parcellation is a recent topic [1, 2] and the study suggests a sound method for the problem. - The Generalized Cauchy-Schwarz Divergence measure introduces computational improvements over average-pairwise divergence measures. - The study shows strong performance against alternative clustering baselines. - Lack of Ablation on Core Component: The paper's claim of a novel divergence function is insufficiently supported, as the ablation study does not compare against alternative divergence measures (e.g., KL-divergence, JS-divergence). Without this comparison, it is impossible to assess whether the proposed function is truly responsible for the performance gains or if other architectural choices are the primary driver. - References to other bilateral symmetry-aware segmentation/parcellation: The study could cite alternatives from other research areas, e.g., [1, 2]. [1] Sanket Wathore, Subrahmanyam Gorthi, Bilateral symmetry-based augmentation method for improved tooth segmentation in panoramic X-rays, Pattern Recognition Letters, Volume 188, 2025, Pages 1-7, ISSN 0167-8655, https://doi.org/10.1016/j.patrec.2024.11.023. [2] Raina, K., Yahorau, U. and Schmah, T. Exploiting Bilateral Symmetry in Brain Lesion Segmentation with Reflective Registration. DOI: 10.5220/0008912101160122, In Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2020) - Volume 2: BIOIMAGING, pages 116-122, ISBN: 978-989-758-398-8; ISSN: 2184-4305 Typo: Line 456-457 vactors -> vectors Could the authors comment on the performance of the framework when using divergence measure alternatives to $D_{GCS}$, other than GJRD, beyond the computational advantages? Specifically, if the components A and Q are kept intact, how do alternative measures (average-pairwise divergences) compare in terms of both performance and the computational advantages highlighted in the paper? [1] Mingfei Lu, Lei Xing, Badong Chen, Measuring generalized divergence for multiple distributions with application to deep clustering, Pattern Recognition, Volume 157, 2025, 110864, ISSN 0031-3203, https://doi.org/10.1016/j.patcog.2024.110864. Fully human-written
USDPnet: An Unsupervised Symmetric Deep Framework for Robust Parcellation of Infant Subcortical Nuclei Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes USDPnet, an unsupervised deep clustering framework incorporating anatomical symmetry constraints for fine-grained infant subcortical parcellation. The method leverages surface-mesh vertex area trajectories, a latent representation encoder, and a Generalized Cauchy–Schwarz Divergence (GCSD) objective, along with a hemisphere-pairing MSE symmetry loss. Experiments on the Baby Connectome Project (BCP) dataset demonstrate improvements over several conventional and deep clustering baselines, accompanied by ablation studies and statistical significance testing. Visualizations indicate anatomically reasonable results and improved bilateral consistency. 1. The paper tackles the important and challenging problem of infant subcortical parcellation, which is clinically relevant and understudied in the unsupervised setting. 2. The approach is label-free and has potential value for large-scale infant neurodevelopment studies where manual annotations are difficult or costly to obtain. 3. The manuscript is generally well-organized, with clear presentation of the model design, loss components, and visual examples. 1. Lack of external validation with anatomical ground truth: no Dice, ARI, or NMI comparisons to expert labels or standard infant atlases. Reliance on internal clustering metrics limits biological interpretability. 2. Risk of suppressing true biological asymmetry: the symmetry constraint may over-regularize regions with known asymmetries (e.g., amygdala, thalamus). No analysis is provided to quantify the impact or demonstrate robustness. 3. Scalability concerns: the full-batch GCSD objective may not scale to larger datasets or higher-resolution surfaces. No discussion of computational efficiency or potential approximations is given. 4. Limited feature modalities: only vertex-area trajectories are used. Incorporating curvature, thickness, deformation tensors, or multimodal T1/T2 contrast might provide more stable clustering. 5. FH metric insufficiently defined: the proposed Feature Homogeneity metric is not formally introduced in the main text, limiting reproducibility and interpretation. 6. Recent literature coverage is insufficient: related work on deep clustering and infant brain parcellation from the past three years is under-represented. 1. The authors cite Lu et al. (2025b) for the GCSD estimator. Please clarify the precise differences between that prior work and the current implementation; specifically, have you introduced any modifications or simplifications relative to the original estimator? 2. You note that small asymmetries may lead to misleading conclusions. The manuscript also demonstrates to some extent that imposing a symmetry constraint aids segmentation. However, what happens if the input data include small but true asymmetries? Can the proposed method maintain robustness under such conditions, or does the symmetry constraint unduly suppress biologically meaningful asymmetry? 3. The use of MSE as the loss for the symmetry constraint raises a concern: might this penalty unintentionally penalize genuine developmental asymmetries?
Have you considered employing a weighted or soft symmetry constraint (for example, applying it only to high-confidence regions) to avoid suppressing valid anatomical differences? 4. The metric "Feature Homogeneity" (FH) appears in your results, but I could not find its formal definition in the main text. Please provide the exact formula for FH in the rebuttal and clarify its physical/biological interpretation. 5. In the figures and tables, please ensure that appropriate legends and annotations are included so that readers can understand the visualized results without excessive ambiguity. 6. Regarding the GCSD computation, Equation (2) involves logarithms and matrix products, which may be prone to numerical instability. Could you clarify how potential underflow or negative values are handled to ensure safe log computation? 7. The current reference list seems to under-represent the past three years of literature in deep clustering and infant brain parcellation. Please consider incorporating more recent studies to demonstrate how your work builds upon and differs from the state-of-the-art. Fully AI-generated
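On question 6, one common guard is to clamp the kernel Gram quantities away from zero before taking the logarithm. A minimal sketch of the pairwise Cauchy-Schwarz case, which the generalized multi-distribution estimator would clamp the same way; function names and the Gaussian kernel choice are illustrative assumptions, not the paper's estimator:

```python
import torch

EPS = 1e-12

def safe_log(x, eps=EPS):
    # Clamp below by eps so log never sees zero or (numerically) negative
    # values produced by round-off in the Gram-matrix products.
    return torch.log(torch.clamp(x, min=eps))

def cs_divergence(p, q, sigma=1.0):
    """Cauchy-Schwarz divergence between sample sets p, q of shape (N, d),
    estimated with Gaussian kernels."""
    def gram_mean(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2)).mean()
    # D_CS = -log<p,q> + 0.5*log<p,p> + 0.5*log<q,q>, each term clamped.
    return (-safe_log(gram_mean(p, q))
            + 0.5 * safe_log(gram_mean(p, p))
            + 0.5 * safe_log(gram_mean(q, q)))
```

Confirming whether the authors use such clamping (or a log-sum-exp rearrangement) would directly answer the numerical-stability question.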
DriveE2E: Closed-Loop Benchmark for End-to-End Autonomous Driving through Real-to-Simulation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents a real-to-sim digital-twin pipeline that ports real-world intersections into the CARLA simulator for closed-loop evaluation of autonomous driving systems. The authors reconstruct static environment assets (geometry/layout) to create digital twins of 15 real intersections, and use these scenes to evaluate multiple end-to-end driving baselines in fully closed-loop simulations. The goal is to enable repeatable, scenario-faithful testing in simulation while preserving key characteristics of real-world locations. - `Well-motivated digital-twin evaluation:` Building digital twins for closed-loop assessment of end-to-end autonomous driving is timely and reasonable. The pipeline is clearly presented, and the multi-source asset creation, combining Blender tooling with OpenStreetMap, HD maps, and satellite imagery, is thoughtfully executed. The engineering effort is evident and appreciated. - `Benchmarking value:` The paper provides extensive, apples-to-apples comparisons of popular end-to-end driving methods within the same digital-twin environments. This offers a strong reference baseline for the community and supports reproducibility and fair comparison. - `Overlap with prior digital-twin platforms (ScenarioNet/MetaDrive):` The core idea of building digital twins in a simulator for evaluation appears closely related to prior work such as ScenarioNet/MetaDrive [1]. Please clarify the advantages over this prior work. Reference: [1] ScenarioNet: Open-Source Platform for Large-Scale Traffic Scenario Simulation and Modeling - `Visual domain gap between real and simulated scenes:` While digital twins enable controllable evaluation, the appearance gap between real images and simulated renderings (cf. Fig. 2) is noticeable and may bias conclusions for on-vehicle deployment. Please quantify and discuss the impact. - `Unexpected baseline ranking (UniAD outperforming MomAD in Table 3):` The result contrasts with MomAD's strong closed-loop performance on Bench2Drive. Plausible contributing factors include differences in sensor suites/resolution, control frequency/latency, action spaces and low-level controllers, training data/domain, metric definitions (e.g., Driving Score variants), and route/weather distributions. I would like to hear the authors' explanation for this discrepancy. N/A Lightly AI-edited
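To quantify the visual domain gap rather than judge Fig. 2 by eye, a distributional image metric such as FID between real camera frames and their digital-twin renderings would suffice. A minimal sketch with torchmetrics; the tooling choice and the stand-in batches are assumptions, not something the paper reports:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID between real camera frames and their digital-twin renderings.
fid = FrechetInceptionDistance(feature=2048)

# Stand-in uint8 batches; in practice these would be matched real/sim frames.
real_imgs = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
sim_imgs = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_imgs, real=True)
fid.update(sim_imgs, real=False)
print(f"sim-to-real FID: {fid.compute():.2f}")  # lower means a smaller gap
```

Reporting such a number per intersection would let readers judge whether conclusions drawn in the twin are likely to transfer to on-vehicle deployment.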
DriveE2E: Closed-Loop Benchmark for End-to-End Autonomous Driving through Real-to-Simulation Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces DriveE2E, a closed-loop end-to-end autonomous driving benchmark that integrates real-world driving scenarios into the CARLA simulator. The authors extract 800 dynamic traffic scenarios from real-world data collected by infrastructure-mounted sensors and create digital-twin assets for 15 real-world intersections. The authors then analyze the data distribution and evaluate several state-of-the-art end-to-end autonomous driving methods on the benchmark. 1. The paper proposes a practical real-to-simulation workflow for building driving scenarios in the CARLA simulator from real-world data. 2. The proposed infrastructure cooperation is effective for collecting extra information for the simulator. 1. The scale of the digital twin built by the benchmark is insufficient. Containing only 15 intersections, the digital twin is much smaller than the towns already in CARLA, and the diversity of surroundings is limited. Additionally, the benchmark does not contain other road structures such as T-junctions or roundabouts. 2. The sim-to-real gap is not evaluated qualitatively. Side-by-side figures or videos of the same location from the same viewpoint in the real world and the simulator would show the gap clearly. 3. The simulation is only partly "closed-loop". The background agents merely replay the logs without interacting with the ego vehicle, which hurts the realism of the simulation. 4. How the digital models of traffic agents are created is not mentioned. CARLA itself ships with a relatively small number of vehicle models, and mapping real-world traffic agents onto existing models may lead to a large sim-to-real gap for the perception component of autonomous driving. 1. Will the data-processing code used to build the digital twins be open-sourced? These tools would help the community build more simulation scenarios from real-world data. 2. The static intersection asset construction appears to involve substantial manual work, and the high cost may make digital-twin construction unscalable. What is the typical time to build the digital twin of a single intersection manually? Are there methods to reduce the manual construction time? Fully human-written
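The distinction behind weakness 3 is easy to state in code: a log-replay agent is a function of time only, while a reactive agent is a function of the live world state. A toy sketch in plain Python, with no CARLA dependency; all names are hypothetical and this is not DriveE2E's implementation:

```python
from dataclasses import dataclass

@dataclass
class State:
    x: float   # longitudinal position (m)
    v: float   # speed (m/s)

def replay_agent(log: list[State], t: int, ego: State) -> State:
    # Log replay: the ego argument is ignored, so traffic never reacts;
    # the loop is closed for the ego vehicle only.
    return log[min(t, len(log) - 1)]

def reactive_agent(prev: State, ego: State, dt: float = 0.1) -> State:
    # Reactive stand-in: brake when the ego is within 10 m ahead,
    # otherwise keep cruising at the previous speed.
    gap = ego.x - prev.x
    v = 0.0 if 0.0 < gap < 10.0 else prev.v
    return State(prev.x + v * dt, v)
```

Because the replayed agents never condition on the ego, an ego policy that deviates from the logged trajectory can trigger collisions no real driver would allow, which is the realism cost the review flags.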