|
TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces TAMER, a tri-modal self-supervised framework for zero-shot ECG diagnosis that jointly learns from raw ECG signals, their spectrograms, and associated clinical reports. The main contribution is a hierarchical alignment strategy that first enforces consistency between temporal and spectral ECG representations via a global-local alignment module (GLTSA), and then aligns these representations with the text reports through a report-aware alignment and refinement (RAAR) module. Extensive experiments demonstrate that TAMER achieves good performance on zero-shot classification.
The originality lies in the tri-modal formulation and, more importantly, the dual-level alignment with clinical reports, which moves beyond simple global matching to a more fine-grained report-guided refinement of waveform representations.
The overall framework is not novel, and prior work has explored self-supervision in the frequency domain with raw signals (e.g., using wavelet transforms). I am not sure why using a spectrogram is a good idea in this case, given the low sampling rate of ECG signals. Likewise, the central claim of requiring a tri-modal setup is not rigorously substantiated by the ablation studies. The experiments fail to include a crucial bi-modal (ECG+Report) baseline that uses the proposed RAAR module, making it impossible to disentangle the gains from adding the spectrogram versus the improved alignment strategy. Furthermore, the reliance on static attention weights from a frozen text encoder in the RGWR module is a strong, potentially flawed assumption, as these weights may not be a valid proxy for diagnostic importance. The paper also completely omits any analysis of computational overhead compared to baselines like MERL and the more recent DBETA [1], which is a critical consideration given the increased complexity of the third modality and its associated encoders and alignment modules.
[1] Pham et al. "Boosting Masked ECG-Text Auto-Encoders as Discriminative Learners." Forty-second International Conference on Machine Learning.
The RGWR module's reliance on attention weights $w^r$ from a frozen text encoder is questionable. These weights are pre-determined by the text encoder's original objective and are not adapted to the ECG alignment task. Why should we assume these static weights are a meaningful proxy for the diagnostic importance of specific report tokens when aligning with waveform features? Does this not introduce a strong, potentially incorrect, inductive bias?
The ablation study in Table 4 is insufficient to justify the tri-modal design. The primary comparison should be against a bi-modal (ECG+Report) baseline that uses the same RAAR module but omits the spectrogram and GLTSA. Without this experiment, it is impossible to determine if the gains stem from the novel tri-modal fusion or simply from a superior RAAR alignment. Is it not possible that the entire performance improvement over MERL is attributable to RAAR alone, making the spectrogram modality an unnecessary complication?
The domain shift evaluation protocol (Tables 1-3) constitutes an apples-to-oranges comparison. Uni-modal models benefit from full supervised fine-tuning on source domain labels, while multi-modal models are evaluated zero-shot. This confuses architectural superiority with the training paradigm. A fairer comparison would involve also fine-tuning the full multi-modal model on the source domain labels.
The paper completely omits the computational cost, a critical factor for clinical applicability. Adding a third modality with its own encoder and the GLTSA module must introduce significant overhead. Can you provide a detailed comparison of pre-training time, memory consumption, and, crucially, inference latency against the bi-modal MERL baseline, to justify whether the reported AUC gains warrant the increase in architectural complexity?
The CKEPE prompt dictionary is used for zero-shot evaluation. Was this dictionary developed completely independently of the MIMIC-ECG dataset? Please confirm that the selection of prompt phrases was not informed by analyzing the common terminology or structure of the reports used in your pre-training data, as this would constitute a subtle form of data leakage into the evaluation protocol.
What is the reason for omitting [1]? |
Lightly AI-edited |
|
TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces "TAMER," a self-supervised tri-modal learning framework for zero-shot ECG diagnosis, aiming to overcome limitations of existing uni-modal and bi-modal approaches. TAMER jointly models 12-lead ECG signals, their spectrograms, and clinical diagnostic reports. It comprises three key modules: TFEP for feature encoding, GLTSA for global-local temporal-spectral alignment, and RAAR for report-aware alignment and refinement, which collectively enrich ECG representations with multi-scale and cross-modal semantics. The authors report that TAMER achieves state-of-the-art performance in zero-shot classification and cross-domain generalization across three public ECG datasets.
* **New Tri-modal Approach and Performance Gains:** The attempt to integrate three distinct information sources—ECG time-series, spectrograms, and clinical reports—within a self-supervised learning framework is a novel approach that differentiates it from prior work. The demonstrated superior performance in zero-shot and cross-domain settings, outperforming existing SOTA models, is highly encouraging. This suggests that the complementary information from the three modalities can significantly contribute to learning robust ECG representations.
1. **Lack of Direct Evidence for "Localized Wave Feature and Semantic Diagnosis Alignment":**
* The paper states that addressing "local correspondences between waveform anomalies and diagnostic phrases" is a key objective, with the RGWR module in RAAR being responsible for this. However, while the ablation study shows RGWR contributes to overall performance, there is **no direct visual or qualitative evidence** (e.g., attention map visualizations, highlighting of specific waveform anomalies matched with corresponding report phrases) presented in the experiments to convincingly demonstrate *how* this local alignment between ECG wave features and specific diagnostic semantics is achieved or how accurately it performs. This diminishes the persuasiveness of this core claim.
2. **Lack of Distinctiveness and Novelty in Modules:**
* Many modules are proposed (TFEP, GLTSA with RLCA, WLAI, UECR; RAAR with RADA, RGWR), but the paper struggles to clearly articulate the **fundamental conceptual originality or technical innovation** of each module beyond simply "adding" another component. For instance, it's not clear what makes WLAI's "two-stage residual attention mechanism" or RGWR's "dual cross-attention" truly unique or specifically tailored to the ECG-report domain, or what advantages they offer over standard attention mechanisms. The core contributions of these modules need to be more explicitly highlighted.
3. **Fundamental Question Regarding the Definition of "Tri-modal":**
* The spectrogram is generated **deterministically** from the ECG time-series data using a Short-Time Fourier Transform (STFT). It can be argued that the spectrogram is merely another perspective or representation of the same underlying ECG signal, rather than an **independent modality** or a distinct source of information, unlike a patient's clinical metadata (e.g., blood pressure, lab results) or independent medical imaging (e.g., echocardiogram images). From this perspective, claiming "tri-modal" might be an overstatement or misleading.
4. **Excessive and Non-intuitive Acronyms with Insufficient Explanation:**
* The paper uses a high number of acronyms (TFEP, GLTSA, RLCA, WLAI, UECR, RAAR, RADA, RGWR) throughout. The meaning of each acronym is not immediately intuitive in the flow of the text, and coupled with the complexity of Figure 1, this makes it difficult for the reader to grasp the entire framework quickly. More clear and intuitive explanations for each module upon its introduction, or a reduction in the overall number of acronyms, would greatly improve readability and reduce cognitive load for the reader.
1. **Request for Localized Alignment Evidence:**
* Please provide **qualitative analysis** to directly demonstrate how the Report-Guided Wave-Level Refinement (RGWR) module within RAAR achieves local semantic alignment between fine-grained ECG waveform features and specific diagnostic phrases from clinical reports. For example, present attention map visualizations or matching examples of particular ECG waveform abnormalities (e.g., QRS complex changes, ST segment elevation) with corresponding text in diagnostic reports (e.g., "ST elevation") to substantiate this claim.
2. **Clarification of Module Originality and Theoretical Contributions:**
* Please more clearly explain **what novel ideas, unique adaptations, or theoretical contributions** the individual modules within GLTSA and RAAR (e.g., WLAI, RGWR) offer, beyond merely combining existing deep learning components (attention, contrastive learning). Emphasize the **key differentiators and specific advantages** of these modules in addressing the unique characteristics of the ECG domain and the goal of tri-modal integration.
3. **Reconsideration or Justification of "Multi-modal" Definition:**
* Please **reconsider the conceptual validity** of treating ECG time-series signals and their deterministically generated spectrograms as two independent "modalities," or provide a **stronger justification** for this claim. Explain whether spectrograms genuinely offer unique information (e.g., specific abnormal frequency band patterns) that is not directly evident in the time-series, effectively acting as an independent physiological signal. Otherwise, the "tri-modal" claim should perhaps be more accurately reframed, for example, as "bi-modal (ECG-report) with enhanced time-frequency feature representation."
4. **Improvement of Acronym Usage:**
* To enhance reader comprehension, please **reduce the number of acronyms** used throughout the paper or ensure that each acronym is thoroughly explained upon its first appearance. Additionally, consider refining diagrams like Figure 1 to make the functional roles of each module more intuitive and immediately understandable. |
Fully AI-generated |
|
An Efficient Rubric-based Generative Verifier for Search-augmented LLMs |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Search-Gen-V, a rubric-based generative verifier for search-augmented LLMs. The key idea is to represent factual “nuggets” as structured rubrics that provide verifiable supervision for both short-form and long-form search tasks. Through an automated rubric-generation pipeline and a two-stage SFT + RL distillation process, a compact 4B-parameter verifier achieves performance comparable to much larger models on TREC, DeepResearchBench, and HotpotQA.
- Clear motivation and practical relevance: Addresses a genuine bottleneck in search-augmented LLMs—how to construct verifiable yet robust rewards for reinforcement learning with retrieval-based systems.
- The nugget-as-rubric formulation elegantly bridges short-form and long-form search workloads under a single paradigm, improving consistency across RL reward modeling.
- Search-Gen-V-4B is efficient as it achieves near-parity with 200B-scale models at significantly lower computational cost.
- Several core components (e.g., rubric aggregation, DAPO optimization schedule, interaction between SFT and RL stages) are insufficiently detailed for replication.
- The contribution mainly integrates existing ideas—rubric-based verification, nugget extraction, and reward distillation—into one pipeline rather than introducing a fundamentally new principle. The advantage of “nugget-as-rubric” over prior rubric or preference-based reward models (e.g., standard LLM judges) is not sharply articulated.
- The verifier is not yet used in an RL loop to show downstream improvements. It would be better to provide some end-to-end demonstration of reward effectiveness.
- No systematic study of the quality of the rubrics.
See above. Some additional questions:
- What is the runtime and cost of rubric generation per instance, and can it scale efficiently to large corpora?
- How do you detect and filter erroneous or hallucinated rubrics during automatic construction?
- How does the rubric weighting scheme influence performance? Have any learned aggregations been attempted? |
Fully AI-generated |
|
An Efficient Rubric-based Generative Verifier for Search-augmented LLMs |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a unified "nugget-as-rubric" framework for reward modeling in search-augmented LLMs. It proposes an automatic pipeline to build rubrics from retrieved passages and trains an efficient 4B generative verifier, Search-Gen-V. Experiments show this 4B verifier achieves accuracy comparable to a 235B teacher model at a much lower computational cost.
1. The topic of this paper is crucial. The "nugget-as-rubric" approach provides a single, verifiable formulation that works across both short-form and long-form tasks.
2. The automatic rubric construction pipeline reduces the need for costly manual annotation and helps to mitigate the "pool bias" found in traditional passage-labeling methods.
3. The 4B Search-Gen-V model is highly efficient, addressing the high computational cost of generative rewards. It maintains strong performance, closely matching a 235B teacher model's judgments after a two-stage training strategy.
1. The reliability of the proposed method needs further demonstration.
1. The correctness of the automatically generated rubrics is not independently verified. The pipeline's heavy reliance on an LLM-based Judge ($\Psi$) means any bias or errors from this Judge are propagated into the "ground truth" rubrics.
2. The "golden" verification labels are derived from a teacher model (Gemini-2.5-Flash) whose own accuracy is not rigorously validated. Although the appendix includes a small human preference comparison showing a slight advantage over Qwen, this is insufficient to establish that the teacher has adequate labeling capability. Consequently, the reported F1 scores primarily reflect the student model's high *fidelity* to a potentially flawed teacher, rather than true factual accuracy.
2. The paper lacks comparative experiments with more rule-based metrics, such as F1-score or ROUGE on short-form tasks, which are more robust baselines than Exact Match (a minimal sketch of such a token-level F1 baseline is given after this list). Furthermore, the paper does not compare the reward accuracy against other powerful reward modeling approaches, nor does it include an end-to-end RL training comparison to validate the improvement over other reward modeling approaches.
3. While the research topic is critical, the method is only suitable for knowledge verification for search-augmented LLMs, and it is validated on only one dataset.
4. The performance improvement from RL is limited, and the difference between Search-Gen-V-1.7B and Search-Gen-V-4B is 0.06 on average. Do these results indicate that the task is not very difficult?
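To make the rule-based baseline suggested in point 2 concrete, below is a minimal sketch of SQuAD-style token-level F1 for short-form answers. This illustrates the kind of metric the review has in mind rather than code from the paper; the normalization rules and example strings are assumptions.

```python
# Token-level F1 vs. Exact Match for short-form answers (illustrative sketch only).
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A paraphrased answer earns partial credit under F1 but zero under Exact Match.
print(exact_match("the Eiffel Tower in Paris", "Eiffel Tower"))  # 0.0
print(token_f1("the Eiffel Tower in Paris", "Eiffel Tower"))     # ~0.67
```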
Please see Weakness. |
Lightly AI-edited |
|
An Efficient Rubric-based Generative Verifier for Search-augmented LLMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a unified and verifiable paradigm, namely "nugget-as-rubric", which treats atomic information points as structured evaluation criteria for different search-augmentation workloads. Short-form tasks correspond to a single rubric, whereas long-form tasks expand to multiple rubrics aligned with the question’s information needs. To support long-form settings, this paper designs an automatic rubric construction pipeline based on query rewriting, which can automatically retrieve passages relevant to each question and extract rubrics from them, both from static corpora and from dynamic online web content. Experimental results show that the proposed method and the trained model achieve strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.
1. The paper proposes "nugget-as-rubric," a unified paradigm that treats atomic information points (nuggets) as structured evaluation criteria (rubrics). This approach successfully unifies the reward modeling for both short-form tasks (seen as a single rubric) and long-form tasks (seen as multiple rubrics). The method is designed to overcome the flaws of current reward models. It solves the "fragility" of rule-based rewards (like Exact Match), which perform poorly with variations in expression and cannot scale to long-form tasks. It also addresses the issues of generative rewards, which are often non-verifiable, unstable, and computationally expensive for long-form workloads.
2. The paper introduces an automatic rubric construction pipeline. This pipeline uses query rewriting to retrieve relevant passages and extract nuggets from both static corpora and dynamic web content. This automated process replaces traditional manual annotation, which is costly, labor-intensive, and prone to bias.
3. Experiments show that Search-Gen-V-4B achieves strong verification accuracy across different workloads. Notably, its performance is comparable to a much larger 200B+ parameter verifier model (Qwen3-235B-A22B-Instruct-2507), making it a scalable, robust, and efficient verifiable reward constructor.
1. While the automated rubric construction pipeline eliminates manual annotation, its iterative nature and reliance on an LLM-based judge result in slow convergence. The authors state that constructing rubrics for a single question takes, on average, one to two hours.
2. The experiments for each workload (short-form and long-form) were conducted on only one representative dataset. The authors acknowledge that other datasets may have different characteristics, and future research should expand the evaluation to a wider range of datasets.
None |
Lightly AI-edited |
|
An Efficient Rubric-based Generative Verifier for Search-augmented LLMs |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a **nugget-as-rubric** reward paradigm to deliver verifiable rewards for search-augmented models and trains a 4B generative verifier, Search-Gen-V, to assign verifiable scores for both short-form and long-form tasks (short/long text), which can be used as RL rewards or evaluation signals. The training follows two distillation-style stages: SFT → RL.
1. Proposes nugget-as-rubric to uniformly model short-form and long-form tasks, enabling a consistent, verifiable reward across settings.
2. Trains a 4B Search-Gen-V via two stages (SFT → RL) whose effectiveness approaches Qwen3-235B-A22B-Instruct-2507.
### Method
1. The key notion of **“atomic golden information points (nuggets)”** within nugget-as-rubric is not explained rigorously and lacks a precise, formal definition. If this concept is derived from prior work, the manuscript lacks **explicit citations** to those sources.
2. In the RL training of Search-Gen-V, the format reward weight reaches 30%, which diverges from some mainstream setups (e.g., DeepSeek-Math). This might bias the model toward learning the format reward. It is recommended to provide reward curves to make the training dynamics clearer.
### Baselines
1. The evaluation datasets are limited: each of the short-form and long-form settings is validated on only 1 dataset, so generalization is not convincingly demonstrated.
2. There is a lack of comparisons with other evaluation metrics. In Figure 4, the short-form workloads are compared against EM, but for long-form tasks there is no comparison to the original metrics of DeepResearch Bench (or other long-form benchmarks).
3. Baseline coverage is insufficient. The method is mainly compared to other base models; it should also be compared to the generative reward model or the scalar reward model mentioned around line 159.
### Experiments
1. The experiments focus only on the reward verification stage. The paper does not demonstrate using Search-Gen-V rewards to actually train a search-augmented LLM, making it hard to validate the real effectiveness of Search-Gen-V. It is suggested to conduct RL experiments that compare Search-Gen-V against rule-based or reward-model-based rewards in practice.
1. At line 293, the paper states that Gemini-Flash aligns better with human inspection. Why, then, is Qwen3-235B used to generate the rubrics?
2. At line 333, the overlength penalty is introduced, but the manuscript lacks concrete details about how it is computed and applied. Could the authors clarify this component? |
Lightly AI-edited |
|
VidEEG-Gen: A Dataset and Diffusion Framework for Video-Conditioned Privacy-Preserving EEG Generation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 0:
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents a dataset for privacy-preserving EEG synthesis.
Synthetic data generation may be important for increasing the data available for downstream applications, but I do not understand why we need a dataset for this task. Normally, we’d like to augment data from a particular recording setup and task.
- None
- There is no description of the data or of the recording from subjects.
- This is not a dataset contribution.
- Concept-controllable EEG data is not considered valid. There is some evidence of neural correlates that differ across semantic stimuli, but generally this mapping does not exist, as EEG is very noisy and encodes attention rather than semantics.
- The paper claims to generate diverse synthetic responses, but I think this is an overclaim.
- If the approach is synthetic data generation, then why is the paper written as a dataset contribution?
See weaknesses |
Fully human-written |
|
VidEEG-Gen: A Dataset and Diffusion Framework for Video-Conditioned Privacy-Preserving EEG Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces VidEEG-Gen, a dataset and framework for video-conditioned, privacy-preserving EEG generation. The authors define a new task—stimulus- and subject-conditioned EEG synthesis using naturalistic video inputs—and propose the Self-Play Graph Network (SPGN), a graph-based diffusion model designed to more faithfully capture the spatial, temporal, and semantic relationships in EEG. The work includes the release of a 1007-sample synthetic EEG dataset aligned to video stimuli, reference implementations, and comparative/ablation studies benchmarking the approach against recent alternatives.
The paper tackles the important issue of data scarcity and privacy in EEG research, proposing synthetic generation as a practical solution for applications in brain-computer interfaces and emotion analysis.
The SPGN framework judiciously combines graph neural networks (for capturing inter-electrode spatial dependencies) with denoising diffusion probabilistic models and cross-modal alignment/fusion mechanisms.
The work includes quantitative comparisons across multiple recent generative baselines, highlighting SPGN's solid performance in terms of signal fidelity and computational efficiency.
Explorations of cross-modal conditioning or alternative fusion architectures are not exhaustively presented—raising questions about generality.
The proposed dataset, while meticulously constructed, is generated using only SEED-DV video stimuli and is of modest size.
The generalizability of the approach to more diverse or non-Chinese video/subject populations, or other brain recording scenarios, is acknowledged as a limitation, but no quantitative cross-dataset evidence is provided.
Could you detail how "spatial-graph attention" in the SPGN is formulated and how it interacts with the denoising diffusion process at each step? Is it applied as a preprocessing step, or jointly with diffusion iterations?
What is the practical impact of using solely synthetic EEG for both training and evaluation? Are there risks of model feedback loops or degraded transfer performance to real data scenarios? |
Fully AI-generated |
|
VidEEG-Gen: A Dataset and Diffusion Framework for Video-Conditioned Privacy-Preserving EEG Generation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper defines a new task of stimulus- and subject-conditioned EEG generation under naturalistic video stimulation. The paper also introduces VidEEG-Gen, a unified dataset and framework to study it. VidEEG-Gen contains 1,007 samples that align video clips (drawn from the SEED-DV corpus for semantic diversity) with synthetic EEG trajectories. The method centres on SPGN, a graph-enhanced diffusion model that fuses video features, subject metadata and optional EEG priors via a dedicated alignment and fusion pipeline. The approach then models inter-electrode dependencies with electrode and signal graphs while using diffusion for temporally coherent synthesis. The paper also establishes an evaluation protocol (including spectral band similarity, correlation, stability and composite scores) and reports that SPGN outperforms several recent EEG generative baselines.
- The paper introduces the task of stimulus- and subject-conditioned EEG generation under naturalistic video stimulation, which is an interesting contribution. This redefines EEG synthesis as a multimodal mapping from visual input to biologically plausible neural dynamics, rather than data-driven signal reconstruction. The accompanying VidEEG-Gen dataset establishes the first benchmark for this setting.
- The SPGN architecture represents a synthesis of graph neural networks (for electrode-level spatial structure) and diffusion models (for temporal consistency). This integration addresses the limitations of earlier GAN- or VAE-based EEG generators, which lack spatiotemporal coherence and stimulus-response alignment.
- The paper presents the architecture and preprocessing pipeline in a structured and detailed way, including the temporal alignment, multimodal fusion and spatial graph construction processes. It is clear how EEG, video and text features interact across modules, which supports reproducibility and ease of understanding.
- While the proposed task and dataset are conceptually novel, the work’s significance remains constrained by the narrow empirical scope. All experiments are based on video stimuli from SEED-DV, which covers only 15 participants and 40 concepts. The authors acknowledge that cross-subject generalisation degrades by 12% in mean squared error and that downstream utility (e.g., whether the generated EEG improves classifier performance) is untested.
- The same SPGN model both defines and evaluates the dataset’s structure. The biological plausibility and realism of the synthetic signals are assessed through internal metrics (e.g., frequency-band similarity, stability index) but not through expert or empirical comparison with real EEG traces. This limits confidence in the dataset’s physiological credibility.
- The paper presents multiple interacting modules, including CLIP-based video encoders, text embeddings, graph convolutions, adversarial self-play and diffusion steps, but the ablation study explores only a small subset of these components (spatial attention and diffusion step count). As a result, it is unclear which design elements contribute most to the observed improvements.
- The paper mentions that the EEG signals in VidEEG-Gen are generated entirely by the SPGN model and not derived from real EEG recordings. Could the authors clarify how they prevent circularity between dataset construction and model evaluation? Specifically, is the same trained SPGN model used both to create and to benchmark the dataset?
- In Section 4.1, the paper describes a fusion pipeline that aligns CLIP-extracted video features, demographic text embeddings and optional EEG priors using cross-attention. Could the authors clarify the role of the EEG prior in this process? For instance, is the prior always available during training, or is it optional to simulate unseen-subject conditions? |
Fully AI-generated |
|
VidEEG-Gen: A Dataset and Diffusion Framework for Video-Conditioned Privacy-Preserving EEG Generation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces VidEEG-Gen, a framework addressing the scarcity and privacy concerns associated with EEG data. The authors propose a new task: generating personalized, synthetic EEG signals conditioned on naturalistic video stimuli. To support this, they present a synthetic dataset derived from the SEED-DV video stimuli and a novel generative model called Self-Play Graph Network (SPGN). SPGN is described as a graph-enhanced diffusion model designed to capture the spatiotemporal dependencies in EEG signals while being conditioned on video features and subject metadata (e.g., demographics). The goal is to produce biologically plausible, stimulus-aligned EEG data that preserves privacy by avoiding the use of real subject recordings. The authors evaluate SPGN against several other EEG generation models, claiming superior performance in signal fidelity, stability, and spectral characteristics.
The paper tackles the critical issues of data scarcity and privacy in EEG research, which are significant barriers in the field.
Proposing the specific task of video-conditioned, personalized EEG generation is potentially valuable for advancing stimulus-response modeling.
The SPGN model attempts to explicitly model both spatial (graph) and temporal (diffusion) aspects of EEG, which is methodologically relevant.
The core contribution, the VidEEG-Gen dataset, is generated by the proposed model (SPGN) itself. The entire evaluation framework appears to operate within this synthetic domain, lacking grounding in real EEG data distributions.
There is no evidence presented showing that the synthetic EEG generated by SPGN accurately reflects the characteristics (dynamics, spectral properties, spatial patterns) of real EEG recorded in response to the SEED-DV videos. Claims of "biological plausibility" are entirely unsubstantiated.
The paper does not clearly explain how the initial "ground truth" synthetic EEG signals (used for training SPGN and evaluating all models) were created or validated.
Please clarify precisely how the "ground truth" synthetic EEG signals used for training the SPGN model and for evaluating all methods in Table 1 were generated. What ensures their fidelity to real EEG responses elicited by the SEED-DV videos?
Why was the evaluation not performed by training models on real SEED-DV EEG data (or another suitable real dataset) and evaluating their ability to generate plausible signals conditioned on video, perhaps assessing quality via downstream tasks or established distribution metrics (like FID adapted for EEG)?
What specific mechanism allows SPGN to generate personalized EEG based on metadata? How was this personalization capability validated?
What is the definition and validation for the "signal stability index" and "comprehensive performance index" metrics used to claim SOTA performance? |
Fully AI-generated |
|
MultiCFV: Detecting Control Flow Vulnerabilities in Smart Contracts Leveraging Multimodal Deep Learning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces MultiCFV, a multimodal deep learning framework for detecting control flow vulnerabilities and code clones in smart contracts. The proposed approach integrates three complementary feature types, Control Flow Graphs (CFGs) extracted from bytecode, Abstract Syntax Trees (ASTs) from source code, and comment-based semantic embeddings, to capture structural, syntactic, and contextual information. The authors employ GRU-GCN for graph embedding, CNN with attention for comment feature extraction, and a fusion network for final classification. Extensive experiments are conducted on four benchmark datasets, showing that MultiCFV outperforms existing static analysis tools such as Slither and Mythril in both accuracy and generalization.
1. About design. Combining CFG, AST, and comment information is original and addresses the limitations of unimodal vulnerability detectors.
2. About experiments. The work includes comparisons with several baselines, ablation experiments, and cross-dataset evaluation, establishing strong empirical support.
3. High performance. The model achieves good accuracy and generalization to unseen vulnerabilities, including unprotected Ether withdrawal cases.
1. Incremental novelty. While the multimodal fusion is valuable, it mainly combines known feature extraction techniques rather than introducing a fundamentally new learning paradigm.
2. Limited theoretical justification. The paper lacks a formal explanation of why multimodal integration improves detection robustness beyond empirical evidence.
3. Dataset dependence. The evaluation relies heavily on public datasets; no large-scale or real-world deployment test is included.
4. Scalability and runtime cost. Although mentioned briefly, there is no quantitative analysis of inference time or computational overhead on large-scale contracts.
1. How does MultiCFV handle unseen vulnerability types not represented in the training set?
2. Could you provide runtime benchmarks or scalability analysis compared with Slither or Mythril?
3. How sensitive is the model to the quality or availability of comments? If comments are sparse or missing, does performance degrade significantly?
4. Were any measures taken to mitigate overfitting given the relatively small vulnerability datasets?
5. Can MultiCFV be adapted for on-chain real-time contract auditing or incremental analysis during contract updates? |
Fully AI-generated |
|
MultiCFV: Detecting Control Flow Vulnerabilities in Smart Contracts Leveraging Multimodal Deep Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces MultiCFV, a deep learning framework for detecting control-flow-related vulnerabilities and code clones in smart contracts. It combines control-flow graphs (CFG) extracted from bytecode and abstract syntax trees (AST) from source code, and exploits a GRU-GCN and another independent network to process them. Moreover, comment embeddings encoded by fine-tuned BERT models are also used. The three input features are concatenated for final prediction. Experiments on four public datasets show the proposed model outperforming existing static tools.
- This work focuses on a practical problem and addresses real-world vulnerabilities.
- The combination of structural (CFG), syntactic (AST), and semantic (comments) information in a framework is a contribution.
- Implementation details and source code are provided.
- The proposed architecture integrates three components (BERT, GCN, and another network) to process three different features (comments, CFG, and AST). However, all these techniques have already been well explored and widely used in existing approaches. This work represents an incremental extension of earlier multi-encoder frameworks, rather than aligning with the current frontier of LLM-driven contract analysis.
- Some implementation details are missing. For example, the AST feature is extracted by a deep learning model but the architecture is not clearly provided.
- The paper does not include any baseline or discussion involving modern LLM-based approaches.
- The paper claims utilizing deep learning techniques can enable faster and more efficient detection, but there are no measurements of inference time or computational cost on the vulnerability detection task.
- The writing needs improvement. There are several grammatical errors and typos (e.g., "To overcome the time-consuming and labor-intensive,").
- Please improve the writing quality.
- Could the method generalize to function-level or statement-level vulnerability detection instead of contract-level?
- Please clarify the computational cost of MultiCFV compared to the compared baselines on the vulnerability detection task.
- It would strengthen the paper to include a comparison with modern LLMs and explain why a finetuned GCN+BERT architecture remains necessary in the current LLM era.
- It would benefit the paper if an evaluation on real-world deployed contracts were provided. |
Fully human-written |
|
MultiCFV: Detecting Control Flow Vulnerabilities in Smart Contracts Leveraging Multimodal Deep Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 1: You are unable to assess this paper and have alerted the ACs to seek an opinion from different reviewers. |
The paper introduces a multimodal deep-learning framework to detect erroneous control-flow vulnerabilities in Ethereum smart contracts and to perform contract-level clone detection. It fuses three complementary views into a single contract representation used for verification and similarity search. The authors claim the first application of multimodal deep learning to this class of smart-contract vulnerabilities, outline dataset usage, and note that resources will be open-sourced.
- The paper’s multimodal design yields a clearly superior representation, with the full fusion outperforming all single and dual modalities.
- It delivers large, consistent gains over strong baselines across multiple vulnerability types and also shows good transfer to a new dataset, evidencing robustness and generalization.
- Beyond detection, the system adds a practical clone-detection pipeline using the unified contract embedding with an RBF–cosine similarity, broadening utility for auditing and analysis workflows.
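As a point of reference for this similarity, below is a minimal sketch of one plausible reading of an "RBF–cosine" score: an RBF kernel applied to the cosine distance between two fused contract embeddings. The exact formulation, embedding dimension, gamma, and threshold in MultiCFV may differ; everything here is illustrative.

```python
# Hypothetical RBF-over-cosine similarity for contract clone detection (sketch only).
import numpy as np

def rbf_cosine_similarity(e1: np.ndarray, e2: np.ndarray, gamma: float = 5.0) -> float:
    """Map cosine distance through an RBF kernel so identical embeddings score 1.0."""
    cos_sim = float(e1 @ e2) / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-12)
    cos_dist = 1.0 - cos_sim
    return float(np.exp(-gamma * cos_dist ** 2))

# Clone detection would then flag contract pairs whose fused embeddings
# (CFG + AST + comment features) exceed a chosen similarity threshold.
emb_a = np.random.rand(384)                        # hypothetical fused embedding
emb_b = emb_a + 0.01 * np.random.rand(384)         # a near-duplicate contract
print(rbf_cosine_similarity(emb_a, emb_b) > 0.95)  # likely True for near-clones
```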
- The ablation exploration is narrow. It focused mainly on learning-rate sweeps and modality combinations, without probing other impactful choices.
- Baseline coverage is thin and clone-detection evaluation hinges on a single dataset and heuristic similarity thresholds.
- The paper provides no theoretical analysis to complement its empirical results.
- The authors classify a contract as vulnerable when the sigmoid probability exceeds 0.95. Why 0.95? How sensitive are the results to that choice?
- Tables report point metrics. Please add variance estimates.
- Could you also probe hidden sizes, dropout, training epochs/early-stopping, and fusion strategies to assess robustness and design choices?
- Please expand to include more learning-based detectors or recent multimodal methods. |
Moderately AI-edited |
|
MultiCFV: Detecting Control Flow Vulnerabilities in Smart Contracts Leveraging Multimodal Deep Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes MultiCFV, a multimodal deep learning method for detecting control-flow-related vulnerabilities and code clones in smart contracts. The approach integrates three modalities: Control Flow Graphs (CFG) for structural features, Abstract Syntax Trees (AST) for syntactic features, and code comments for semantic information. The features from these modalities are fused to train a model for vulnerability detection and to build a feature database for clone detection.
- The paper addresses the critical and high-impact problem of smart contract security. It maintains a clear focus on a specific, challenging class of bugs: erroneous control flow vulnerabilities (e.g., reentrancy, unsafe external calls, and delegatecall).
- The ablation study (Table 2) is a strong point of the paper. It clearly demonstrates that the proposed method is effective.
- The core idea of combining structural (CFG), syntactic (AST), and human-semantic (Comments) information is logical and provides an intuitive, holistic view for understanding complex code vulnerabilities.
- The novelty of this paper is limited. The proposed approach is largely an application of existing, standard components (BERT, GCN, GRU, CNN). The fusion mechanism appears to be simple feature concatenation ("vertically stacked"). More discussion is required to highlight its specific novelty over contemporaneous multimodal vulnerability detection work (e.g., Jie et al., 2023; Qian et al., 2023) cited in its own related work section.
- There is no experimental comparison with other learning-based SOTA methods for vulnerability detection (e.g., Peculiar, or the other GNN/multimodal approaches mentioned in the related work).
- Some experimental results require deeper discussions. E.g., in Table 3, the reported 0% accuracy for both Slither and Mythril on "Access Control" vulnerabilities is puzzling, as these tools are industry standards specifically designed to find such flaws. The authors did not clarify how analysis failures (e.g., contracts that Slither or Mythril failed to parse) were handled in the metrics. Were they excluded, or counted as False Negatives?
- The reported 99.13% accuracy (Table 2) seems high and may indicate overfitting. The paper mentions using SMOTE (Section 4.2) to balance the dataset; it is critical to clarify that SMOTE was applied only to the training split. If synthetic samples from the test set's distribution were included in training (a common data leakage pitfall), the validation and test results would be artificially inflated.
The paper exhibits several presentation issues that affect clarity and precision. In Section 3.2.1, the authors refer to 256-dimensional vectors from BERT while also describing 128-dimensional node feature vectors in Equation (3), but the relationship between the two is unclear. Moreover, Sections 3.3 and 3.4 reuse the same variable ($F_{ast}$) to represent feature vectors for both the AST and the comments, which may cause confusion. |
Lightly AI-edited |
|
RT-Remover: A Real-Time Video Object Removal by Composing Tracking and Removal in Auto-Regressive Diffusion Transformers |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents RT-Remover, a real-time video object removal system that merges tracking and inpainting into a single auto-regressive diffusion model. It requires only a starting mask for the first frame and uses causal attention with a key-value cache for efficient sequential generation. Through step distillation (distribution matching distillation) and a lightweight VAE, the method reduces sampling steps to two, achieving 33 FPS and 0.12s latency while maintaining strong visual quality and temporal consistency.
1) The paper introduces a single model that jointly performs tracking and inpainting, removing the need for separate stages and simplifying the pipeline significantly.
2) By combining auto-regressive diffusion with key-value caching and applying a tailored distillation strategy, the method reduces sampling steps from 25 to 2 while maintaining quality.
3) The approach achieves 33 FPS and 0.12s latency, making real-time interactive video editing feasible.
1) Lack of qualitative results: it would be great to see the examples of edited videos (not rolled out frames) to assess the quality of removal by the model (especially temporal consistency).
2) Lack of qualitative and quantitative comparisons: in Table 3 and Table 4, why not compare to Minimax-Remover?
3) Minor grammatical errors: for instance, in line 138 "simplies" -> "simplifies", in lines 280-281 "togather" -> "together", and so on.
1) How does the model handle inaccurate initial masks or cases with multiple objects? Is there any quantitative or qualitative analysis on robustness?
2) Can the authors share video examples of the removal results? Static images are insufficient to judge temporal consistency and overall quality.
3) Table 7 evaluates fixed window sizes, but have you considered adaptive strategies where the KV cache length changes based on motion or chunk complexity? Could this further optimize efficiency while preserving quality? |
Lightly AI-edited |
|
RT-Remover: A Real-Time Video Object Removal by Composing Tracking and Removal in Auto-Regressive Diffusion Transformers |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces RT-Remover, a lightweight, low-latency, and autoregressive diffusion-based video object removal model. A key feature of RT-Remover is its ability to remove masked objects from an entire video by only requiring a mask for the first frame.
The authors propose a two-stage training strategy to realize this model:
1. Stage 1: A pre-trained Wan2.1 generative model is fine-tuned into an auto-regressive general inpainting model.
2. Stage 2: The model undergoes distillation using a distribution matching distillation method. This process leverages object removal data generated by Minimax-Remover to transform the model into a lightweight object removal solution that requires only 2 inference steps and can simultaneously track and remove objects based solely on the first-frame mask.
Furthermore, the authors replace the VAE with LeanVAE to achieve further acceleration.
1. This is the first real-time video object removal model.
2. The proposed RT-Remover in this paper achieves extremely fast inference speed, which is 14× faster than state-of-the-art models.
3. The authors integrate both mask tracking and object removal functionalities into a single model, eliminating the need for an additional model to handle mask tracking. This not only enhances user-friendliness but also reduces the model latency.
1. Table 2 is not referenced or discussed anywhere in the main body of the paper. Additionally, the 'SAM2 Latency (s)' column in Table 2 appears incomplete, with several entries missing.
2. The quantitative and qualitative comparative baselines are all generative models, but it’s unclear if they were trained on object removal datasets, making the fairness of comparison questionable. Additionally, no metrics are compared with mainstream video object removal models. To ensure fair comparison, these mainstream non-autoregressive models could be tested under an autoregressive inference setup (i.e., only inputting causal temporal frames) to align with RT-Remover’s inference logic.
3. While RT-Remover is designed to track and remove objects using only the first-frame mask, this input setting is unfair when comparing with other video object removal models, as those baselines lack the design to track objects across frames with just the first-frame mask.
4. Visualized results are limited, with no demos on challenging cases (e.g., target exiting and re-entering the frame). Such edge-case results would better validate robustness.
See weaknesses. |
Lightly AI-edited |
|
RT-Remover: A Real-Time Video Object Removal by Composing Tracking and Removal in Auto-Regressive Diffusion Transformers |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a real-time approach for video object removal by introducing three key innovations:
Joint tracking and inpainting: The method integrates object tracking and video inpainting into a unified process to improve temporal consistency and efficiency.
Auto-regressive diffusion with distillation: It leverages an auto-regressive diffusion model enhanced through a distillation technique to achieve high-quality, temporally coherent results in real time.
Fixed-length key-value cache: A fixed-length cache mechanism is employed to manage memory and computation effectively, enabling fast inference across video frames.
Below are the strong points of this paper:
**1. Comprehensive and well-engineered approach:**
The paper demonstrates remarkable research and engineering effort toward achieving real-time video object removal. The authors systematically present a complete methodology that focuses on optimizing both efficiency and speed in the training and inference pipelines.
**2. Extensive experimental validation:**
The proposed method is thoroughly evaluated through a wide range of experiments, comparing both performance and efficiency against existing approaches.
**3. Multi-perspective evaluation:**
The paper provides convincing evidence of the method’s effectiveness through diverse evaluation metrics—including efficiency benchmarks, quantitative performance measures, GPT-5 assessments, and user studies, offering a well-rounded understanding of RT-Remover’s strengths.
Below are the weak points of this paper:
**1. Incomplete performance comparison:**
The paper lacks a comprehensive experiment table comparing the proposed RT-Remover with other video object removal models in terms of model performance. Table 2 would be more informative if it combined both efficiency and performance metrics to provide a unified view of trade-offs.
**2. Insufficient methodological details:**
- The process of fine-tuning LeanVAE to align with the latent space of Wan2.1 VAE is not clearly explained and should be made self-contained for reproducibility.
- The fixed-length key-value cache mechanism and its impact on performance are not sufficiently detailed; currently, the only reference is Figure 6, which lacks quantitative or descriptive depth.
- The notation $N$ mentioned around Lines 194–195 is undefined or ambiguous and should be clarified in the text.
**3. Lack of failure case analysis:**
It would be valuable for the authors to include examples of failure cases. For instance, scenarios where the target object disappears and reappears within the video could help illustrate the model’s limitations and potential areas for improvement.
Please check the questions listed in the Weaknesses section above. |
Moderately AI-edited |
|
RT-Remover: A Real-Time Video Object Removal by Composing Tracking and Removal in Auto-Regressive Diffusion Transformers |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
RT-Remover is a real-time video object removal system that unifies object tracking and inpainting into a single streamlined process. It employs an auto-regressive diffusion model with distribution-matching distillation to reduce sampling steps from 25 to 2, achieving 0.12s latency and 33 FPS on a 5090 GPU. This approach significantly simplifies the pipeline while maintaining competitive visual quality and achieving the lowest latency among existing methods.
[1] The paper addresses an important problem, as video editing speed is a critical factor for enabling deployment on mobile platforms.
[2] The presented experimental results demonstrate excellent latency and efficiency.
[1] The paper reads as if various existing methods were simply combined for faster performance, giving the impression of a technical report describing a series of empirical design choices rather than a research paper presenting novel insights.
[2] What exactly is the problem being addressed, and what are the underlying causes of this problem?
[3] What is the proposed contribution for fast distillation? Is the main contribution merely the adoption of DMD2, or is there an additional methodological innovation?
[4] The current writing structure mostly follows a pattern of “problem → apply existing method”, which raises the question of whether this work truly qualifies as a research paper rather than an implementation summary.
[5] Are there any video results?
My questions are listed in the Weaknesses section above. |
Moderately AI-edited |
|
C-Flat Turbo: A Faster Path to Continual Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
C-Flat Turbo is a practical advancement in continual-learning optimization that achieves ~2× speedup by caching the slowly-evolving orthogonal component \(g_{vf}\) of the flatness gradient, eliminating redundant backward passes while preserving flat-minima properties.
The method uses stage-wise scheduling and adaptive triggering to balance efficiency with stability, representing a data-driven refinement that integrates curvature-aware theory with engineering pragmatism for scalable continual learning.
However, the theoretical foundation remains incomplete: \(g_{vf}\) reuse is empirically motivated without formal convergence proofs or approximation-error bounds, leaving open questions about how gradient drift affects optimization trajectories.
In summary, C-Flat Turbo is a promising, empirically validated tool that demonstrates gradient-reuse can maintain stability and generalization, though future work connecting these heuristics to rigorous optimization theory would strengthen its theoretical standing.
1. Practical Efficiency Improvement
C-Flat Turbo achieves roughly \(2\times\) speedup over the original C-Flat while maintaining comparable or slightly better accuracy across diverse CL benchmarks (CIFAR-100, CUB-200, IN-R, ObjectNet).
The reuse of the first-order flatness gradient’s orthogonal component \(g_{vf}\) effectively removes redundant backward passes, reducing training overhead without altering the optimization target.
2. Empirical Robustness and Plug-and-Play Design
Demonstrates consistent improvement on both from-scratch (ResNet-18/34) and PTM-based (ViT-B/16) continual learning setups.
Can be seamlessly integrated into existing methods (iCaRL, MEMO, L2P, Ranpac, EASE) via a simple optimizer-level replacement, confirming its general plug-in compatibility.
3. Dynamic Adaptation Mechanisms
Introduces a stage-wise scheduler \(k_t = k_0 + 10\cdot n/N\) to reduce flatness updates as tasks stabilize, and an adaptive trigger based on the EMA of \(\|g_0\|^2\) to activate regularization only when curvature increases.
These heuristics yield further efficiency gains while preventing over-regularization.
4. Empirical Observation-Driven Insight
The paper provides empirical evidence that the orthogonal component of the flatness gradient \(g_{vf}\) changes slowly: \[ g_{vf} = g_f \sin(\phi_f), \quad \text{with small temporal variation over iterations.} \] This observation underlies the caching mechanism and is supported by training-curve analysis.
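To make the mechanism summarized in points 3 and 4 concrete, below is a minimal sketch of how the described caching, stage-wise scheduling, and EMA-based triggering could fit together. This is my own reconstruction for illustration only; the function names, the proxy-perturbation form, the hyperparameter values, and the toy quadratic loss are assumptions, not the authors' implementation.

```python
import numpy as np

def cflat_turbo_sketch(loss_grad, theta, n_task, N_tasks, steps,
                       lr=0.1, rho=0.05, beta=0.5, k0=2,
                       ema_decay=0.9, trigger_ratio=1.2, seed=0):
    """Illustrative only: cache the component of the proxy-point gradient g1
    that is orthogonal ("vertical") to g0, reuse it for up to k_t steps, and
    refresh early if the squared norm of g0 spikes above its EMA."""
    rng = np.random.default_rng(seed)
    k_t = int(k0 + 10 * n_task / N_tasks)   # stage-wise schedule of the reuse horizon
    cached_g_vf = np.zeros_like(theta)
    ema, since_refresh = None, k_t          # force a refresh on the first step
    for _ in range(steps):
        g0 = loss_grad(theta)
        sq = float(np.dot(g0, g0))
        ema = sq if ema is None else ema_decay * ema + (1 - ema_decay) * sq
        triggered = sq > trigger_ratio * ema                                # adaptive trigger
        if since_refresh >= k_t or triggered:
            g1 = loss_grad(theta + rho * rng.standard_normal(theta.shape))  # extra backward pass
            cached_g_vf = g1 - (np.dot(g1, g0) / (sq + 1e-12)) * g0         # orthogonal component
            since_refresh = 0
        else:
            since_refresh += 1                                              # reuse cached direction
        theta = theta - lr * (g0 + beta * cached_g_vf)
    return theta

# Toy usage on a quadratic loss L(theta) = 0.5 * theta^T A theta
A = np.diag([10.0, 1.0, 0.1])
print(cflat_turbo_sketch(lambda t: A @ t, np.ones(3), n_task=2, N_tasks=10, steps=50))
```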
1. Lack of Theoretical Justification
The claimed invariance of \(g_{vf}\) and its safe reuse is empirically observed but not theoretically proven.
No convergence or approximation-error bound is provided for the caching mechanism or the adaptive trigger.
In contrast, the original C-Flat offered a formal connection between \(R^{(1)}_\rho(\theta)\) and the largest eigenvalue of the Hessian \(\lambda_{\max}(H)\): \[ R^{(1)}_\rho(\theta^*) = \rho^2 \lambda_{\max}(H), \] whereas C-Flat Turbo does not extend this analysis to its modified update rule.
2. Heuristic Nature of Performance Gains
The improvement in accuracy is attributed to “horizontal and vertical components of the oracle gradient” and “progressive update of sharpness gradients,” which are qualitative heuristics rather than rigorously derived results.
3. Limited Statistical Reliability
All experiments use a single fixed seed (1993) for class-order shuffling, without reporting variance or confidence intervals.
This makes it difficult to assess the statistical robustness of the observed accuracy differences (\(\approx 1{-}3\%\)).
4. No Formal Analysis of Reuse Error
The paper does not analyze how long the cached \(g_{vf}\) remains valid before its direction drifts, nor how this affects convergence.
Without a formal bound on the approximation error of reused gradients, stability guarantees remain absent.
5. Empirical-Only Validation
All justifications for the “better performance” rest on visualization (loss surface flattening) and ablations, with no analytical or theoretical explanation for why the heuristic accelerations lead to higher accuracy rather than degradation.
6. Comparative Scope
While the method is evaluated across several baselines, the dataset and model diversity (4 datasets, 2 architectures) is moderate compared to recent CL papers reporting 8–11 datasets and broader ablations.
1. On Theoretical Justification of Gradient Reuse
The paper’s main efficiency gain comes from reusing the first-order flatness gradient’s orthogonal component \( g_{vf} \) across several iterations.
Could you elaborate on whether there exists — or could be derived — a formal bound on the deviation between cached and true gradients over time?
In particular, how do you ensure that the accumulation of approximation error from reused \( g_{vf} \) does not destabilize convergence in longer continual learning sequences?
2. On Empirical versus Theoretical Balance
The empirical results show consistent accuracy improvements despite the heuristic modifications to C-Flat.
Do you attribute this performance gain primarily to reduced noise in optimization (e.g., smoother gradient trajectories) or to a fundamentally different convergence behavior induced by the caching mechanism?
If the latter, can you provide theoretical or experimental evidence that the modified update dynamics reach flatter minima than C-Flat’s full recomputation?
3. On Reproducibility and Statistical Robustness
The experiments were conducted using a single random seed (1993) for class-order shuffling.
Could you clarify whether multi-seed or multi-run evaluations were attempted during development, and if so, how consistent were the performance trends?
Given that continual learning tasks are sensitive to seed variation, do you expect the observed improvements (≈1–3%) to hold under different random task orders or stochastic initialization conditions? |
Fully AI-generated |
|
C-Flat Turbo: A Faster Path to Continual Learning |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes C-Flat Turbo as a faster and more effective version of C-Flat for CL problems. Instead of recomputing the first-order flatness term from scratch at each step, the main idea is to decompose it into a "vertical" component $g_{vf} = g_1 \sin(\theta_f)$ with respect to $g_0$, defined as the gradient with respect to the proxy model $\theta+\epsilon_1$. By caching $g_{vf}$ for the next steps, it can be added to $g_0$ using a coefficient $\beta$, hence avoiding the recomputation of the whole $g_1$. On top of that, the authors proposed both a stage-wise scheduler for the step size and an adaptive trigger, which uses EMAs to trigger the sharpness regularization only when needed. Complete experiments have been performed, showing the practical speed and efficiency of the proposed method with respect to C-Flat and other sharpness-aware optimizers.
- The positioning of the paper is very interesting, as it describes a method for CL that takes into consideration the sharpness/curvature of the loss landscape. This research direction is extremely relevant as it can be impactful for the efficiency of real-world pretrained models.
- The idea of the paper is simple and straightforward, enabling its use with a wide array of other CL proposals
- The experiments are complete and show the very strong empirical capabilities of C-Flat Turbo, which is able to improve over C-Flat both the training speed and the final accuracy.
- The strongest weakness, in my opinion, is the presentation of the paper. I arrived at the end of the C-Flat Turbo section without having a clear idea of what C-Flat Turbo does. Actually, the only place where C-Flat Turbo seems to be explicitly defined is the description of Figure 2. I also found the notation to be quite confusing and sometimes not well-defined (I will provide examples in the Questions), and many concepts are used before any actual definition or explanation, making the paper hard to follow.
- While the empirical results of the paper are very strong, they are the only real contribution of this work. In my opinion, the complete lack of theoretical results and explanations strongly weakens the proposal. Moreover, the proposed C-Flat Turbo appears to be only a small tweak on the already published C-Flat; a smart tweak, but scientifically not very impactful. Finally, as the only strong contribution is the empirical results, it seems strange to me not to share the code.
Here are some questions and notes that I hope the authors can answer/improve:
- There are some small writing issues. Some examples: "C-Flat features a robust convergence that can converge" (I do not think that the convergence can converge), "More notations are provided in Appendix..." (I think that the notation should be univocal throughout the whole paper), "the empirical loss term : g=..." (I do not understand why a gradient is called an empirical loss term). I hope the paper can be revised to be clearer.
- In the context of the paper, I do not think it is true that "...Adam reduces the loss function along gradient directions": in the original Adam paper, it is well explained that curvature information is considered through an approximation of the Fisher information. What do you mean?
- In line 264 a ratio is defined which is never used again in the main text. In the following lines some parentheses are used, but without recalling the ratio function.
- In line 258 the authors affirm that $g_{vs}$ facilitates the exploration of flatter regions, but no reference or theoretical results are provided. What am I missing?
- I would really appreciate a deeper explanation of the main concept of the paper, which is briefly presented in line 286: "$g_f$ embodies an invariant direction toward flatness...". Why? Invariant to what?
- I would like a definition of "vertical component" as I am not sure of the meaning of "vertical" in this context.
- In Section 3.2.2 the term "turbo steps" is used without previously defining what these turbo steps are. What are they?
- Finally, many of the results of the paper are based on some observations in Figures 2 and 3. I personally believe that this is a weak way of presenting the results, and I would appreciate a deeper explanation of Figure 2 (which I think is not clear). A paper fully based on a couple of observations about some figures can be very weak. If possible, more theoretical results in the main paper would be greatly appreciated.
I appreciate the work of the authors, but I think it is still not complete. My final rating can easily change (increasing or decreasing) based on the authors' answers. |
Fully human-written |
|
C-Flat Turbo: A Faster Path to Continual Learning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents a sharpness-aware optimization method for continual learning, extending the C-Flat framework. The authors propose alternative strategies to explore flatter regions of the loss landscape while significantly reducing the computational overhead introduced by C-Flat’s additional gradient computations.
After exploring both zeroth-order sharpness and first-order flatness, the authors identify an invariant direction that effectively promotes flatness. They leverage this directional invariance to enable the reuse of previously computed components through a caching mechanism and an adaptive triggering strategy.
The proposed method is evaluated against both pretrained and non-pretrained baselines, as well as existing sharpness-aware optimizers. Experimental results show that C-Flat Turbo outperforms prior methods, achieving comparable or superior continual learning performance while substantially lowering computational cost compared to the original C-Flat.
- The paper introduces the core concepts in a logically coherent sequence, effectively linking empirical findings on loss landscape dynamics to the formal derivation of the proposed regularization and optimization mechanisms
- The results appear to substantiate the main hypothesis, demonstrating consistent improvements in stability and supporting the intuition behind the proposed method. Moreover, the proposed C-Flat Turbo significantly reduces computational cost compared to C-Flat while maintaining (or improving) performance.
- While the paper provides a clear rationale for leveraging invariant directions in the optimization of flatness, there remains an asymmetry between how sharpness and flatness are treated. Although the intuition is understandable, this design choice would benefit from a clearer justification, as the underlying reasoning for not reusing $g_{vs}$ remains implicit (see question).
- Since the paper is presented as an improvement over C-Flat and builds upon the stabilization of sharpness and flatness during C-Flat optimization, one would expect a more direct comparison with the baselines used in the original C-Flat paper (e.g., Replay, WA, PODNet, DER, FOSTER, etc.). However, these baselines are missing, and the experimental setup also changes (e.g., number of tasks, class-per-task scheduling, or initial task size), which makes it difficult to assess the consistency and fairness of the reported gains.
- Although the proposed approach clearly reduces the overhead of C-Flat, its overall computational cost still appears substantial. It is therefore not entirely clear whether the observed gains justify the use of C-Flat (and its Turbo variant) over standard optimizers in continual learning settings.
Minor
- The current use of in-text citations makes the paper difficult to read, especially in the introduction and related work sections. Please use parentheses consistently and ensure correct usage of `\cite`, `\citet`, and `\citep`.
- There are a few potential typos: for instance, line 308 should be _Turbo-kt_, and line 373 should read _scheduler_ instead of _schedule_. Also lines 206-207 seem to be out of context.
- The notation for speedup is somewhat confusing. Typically, “1×” denotes no improvement (same speed). Consider rephrasing using “2×” (for double speed) or stating explicitly “100–125% improvement” to avoid ambiguity.
- Some details appear to be missing or unclear. For example, in Table 2 it is not specified which dataset is used for training. Please verify that all experimental setups are fully described. The metrics are not clearly presented. Since continual learning metrics are not universally standardized, consider adding a brief description or a supplementary section defining the metrics used (especially “Avg” and “Last”). The nomenclature can be confusing without explicit clarification.
1. The paper mentions that both the sharpness and flatness components exhibit slowly varying, direction-invariant behavior. However, only the flatness component is reused via caching in C-Flat Turbo. Could you elaborate on why this asymmetry exists? Why is the caching mechanism applied only to the flatness component, even though both sharpness and flatness are described as direction-invariant?
2. Could the authors clarify or quantify how many gradient computations are skipped at most when both the _scheduler_ and the _adaptive trigger_ are active? In other words, what is the maximum reduction in backward passes per iteration that the method achieves under the full optimization scheme?
3. How does C-Flat Turbo perform on the baselines reported in the original C-Flat paper but omitted here (e.g., Replay, WA, PODNet, DER, FOSTER)? If these experiments were not possible, how do the authors justify their absence?
4. The same clarification would be useful for the changes in experimental settings (e.g., number of tasks, class-per-task schedule).
5. How were the confidence intervals in Tables 2 and 3 computed? Would it be possible to report them also for Table 1?
6. Were any statistical significance tests conducted to validate the reported improvements?
7. Could the authors formally define the metrics used for evaluation? |
Lightly AI-edited |
|
C-Flat Turbo: A Faster Path to Continual Learning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
Motivated by the observed stabilization of sharpness-aware gradients in continual learning, this paper introduces C-Flat Turbo, a variant that selectively shortcuts along stable directions toward flatter regions. It further proposes a stage-wise linear scheduler, coupled with an adaptive triggering mechanism, to dynamically regulate C-Flat’s behavior during training. And compared with the C-Flat method, the approach delivers better performance and faster speed.
1. This paper’s motivation is clear, and it offers a concrete critique of C-Flat’s limitations, most notably the heavy computational overhead of the first-order flatness term, which requires computing g and g1 at a proxy point (i.e., two extra backpropagations per step).
2. The writing is fluent and the structure is sound. The paper presents its motivation, method, and results in a clear, logical order, with figures and notation that generally support comprehension. Section transitions are smooth, and the key assumptions and design choices are stated explicitly.
1. Limited practical impact. Reported speedups are only relative to C-Flat; the method remains slower than original training, while accuracy gains are very small (often <0.5%). The paper lacks a compelling compute–accuracy trade-off to justify adoption in realistic CL settings.
2. Narrow evidence and weak theoretical grounding. The claim that $g_f$ is a subtler correction than $g_s - g$ is drawn mainly from Fig. 3 (EASE, ~5 epochs). There is no systematic validation or theoretical account of why this should hold.
3. Incomplete compute reporting. Experiments emphasize throughput/wall-clock time but omit hardware-agnostic metrics, peak memory, and energy (e.g., J/sample). Sensitivity to k, β, λ, ρ and the resulting complexity Pareto curves are also missing.
1. From the experimental results, the claimed speedup is only relative to C-Flat. It is still slower than the vanilla methods, and the accuracy gains are very marginal. Thus, I question the practical significance of this improvement over C-Flat. It is necessary to strengthen the explanation of whether such plug-in methods are worth using in continual learning.
2. The proposed method appears to rely solely on the phenomenon visualized in Figure 3, obtained from experiments in EASE over just 5 epochs, to infer that $g_f$ serves as a smaller rectification upon SAM, even subtler than the correction $g_s - g$ induced by SAM itself. Is this observation accidental? It seems to lack evidence of generality or any supporting theoretical justification.
3. On the experimental side, there are the following issues:
a) Across the four datasets, the improvements are generally within 0.5%. Is this because these datasets are not suitable to showcase the method, or is the method’s effect inherently limited?
b) The abstract states that C-Flat results in up to 4× computational overhead, yet this paper only reports speed comparisons. It lacks comparisons of computational resources (e.g., FLOPs, forward/backward counts, peak memory, energy). |
Lightly AI-edited |
|
C-Flat Turbo: A Faster Path to Continual Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper develops an optimizer, C-Flat Turbo, which improves upon a previously developed one named C-Flat. This class of optimizers targets continual learning, where the goal is to train a network on a stream of tasks, ensuring the learning of new tasks while avoiding catastrophic forgetting on old ones. These optimizers achieve this by directing model weights towards regions of "flat minima". In C-Flat, this necessitates backpropagating w.r.t. multiple different weights, costing 4x the computational overhead of vanilla GD due to the introduction of two additional loss terms -- zeroth-order sharpness (change in loss when weights are perturbed) and first-order flatness (gradient magnitude when weights are perturbed). The latter itself requires computing two separate gradients $g_0$ and $g_1$, the former requires one ($g_s$), in addition to the standard SGD gradient $g$. In their newly proposed optimizer, the authors improve upon the efficiency of naive C-Flat. This is achieved via (1) caching a component of the gradient $g_1$ to be reused for $k$ steps, where $k$ is determined by an initial hyperparameter $k_0$ and gradually increases as the number of tasks increases, and (2) adaptively triggering the computation of $g_1$ based on deviations of $g_0$ from its EMA. A similar process is also done for $g_s$, based on the previous works LookSAM and AE-SAM respectively. Combined, this results in an optimizer that is "about 2x the speed of C-Flat, and 0.6x that of SGD" (L377), and experiments show that C-Flat Turbo can achieve slightly better accuracy than C-Flat.
- The authors conducted in-depth empirical analysis and visualizations to motivate the proposed method, which greatly adds to the clarity and presentation of the method. In particular, each component of the method is well-motivated by empirical analysis, and parallels to prior work in sharpness-aware optimization (SAM) are clearly drawn to provide better intuition for why they are used for flatness gradients.
- Comprehensive experiments across multiple datasets, and with multiple CL methods, with clear improvement over the latency of C-Flat while not only maintaining, but even slightly improving, accuracy scores.
- The overall improvement from CL optimizers does not seem very significant, as opposed to improvements from applying different CL methods. For instance, comparing within the "Typical" and "PTM-based" baseline (no C-Flat) methods in Table 1, we see much larger variations among different methods. E.g. MEMO is almost 2x faster in Img/s compared to iCaRL while achieving large improvements across all datasets (e.g. almost 5 points on CIFAR100). In contrast, applying C-Flat Turbo often results in much smaller gains across all datasets, while slowing training by 2x. Despite the latency optimizations over C-Flat, this seems like a hefty training-time trade-off for a proportionately much smaller improvement.
- Scheduler requires knowing the number of tasks beforehand (knowing $N$), which is often not the case in real-world continual learning scenarios.
- Usage of math notations need to be more consistent, especially in equations (1)-(5) -- the paper commonly alternates between usage of $g$ and $\nabla {\mathcal L}(\theta)$, $g_s$ and $\nabla \mathcal{L}(\theta + \epsilon_0^\ast)$, etc.
- Some of the gains in latency actually come from LookSAM and AE-SAM, rather than from this work. From what I can tell, the novel components proposed here can only improve the latency of computing $g_1$, which means the speedup attributable to the methods proposed here is upper-bounded by 1.33x (4 backprops/step → 3+ backprops/step).
- More ablations of C-Flat Turbo are important for assessing which component of C-Flat Turbo contributes most to the accuracy improvements -- is it the changes to the computation of the sharpness gradients, or those to the flatness gradients? I.e. there should be some additional rows between +C-Flat and +C-Flat Turbo in Table 3, e.g. (+C-Flat + LookSAM, C-Flat Turbo - LookSAM, etc.)
- Overall I do believe that this paper is valuable in offering useful empirical insights for reducing the computation time of C-Flat, but I feel that its contributions are limited by the relatively weak performance of C-Flat itself.
Since C-Flat Turbo is an optimized version of C-Flat using a series of approximations for caching / avoiding gradient computations, do the authors have an explanation for why C-Flat Turbo here can achieve better results than the actual optimizer that it is attempting to approximate?
- Minor: Several usages of "... 1x speedup ..." in the paper, this is confusing, especially given that the paper also uses "... 2x speedup" in other areas. |
Fully human-written |
|
Neurocircuitry-Inspired Hierarchical Graph Causal Attention Networks for Explainable Depression Identification |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents NH-GCAT, a neurocircuitry-inspired model designed for explainable depression identification using fMRI data. The model integrates brain knowledge with graph neural networks through three main modules: RG-Fusion, HC-Pooling, and VLCA, which together capture hierarchical and causal relationships among brain regions. When evaluated on the REST-meta-MDD dataset, NH-GCAT achieves 73.8% accuracy and 78.5% AUC, outperforming previous methods while revealing biologically meaningful patterns in key brain networks associated with depression.
This paper introduces a multi-level modeling framework that includes three hierarchical layers: the region level, the circuit level, and the network level, which together help capture brain functional dynamics from local to global scales. It is also validated on the large-scale REST-meta-MDD dataset, which contains more than 1,600 subjects from 16 research centers.
This paper contains many critical methodological and conceptual flaws, as well as unclear details.
1. The overall framework of the paper is outdated. Many existing studies have already proposed similar approaches. Please refer to related works in IEEE TMI, IEEE JBHI, and MICCAI.
2. Several fundamental assumptions in the paper are problematic, particularly regarding the causal inference in the VLCA module. The variational conditional probability assumptions are incorrectly formulated, and the paper completely ignores the prior and posterior distributions.
3. The authors designed Equation 21, but no ablation experiments on the parameter $\lambda$ are presented.
4. According to the description of Counterfactual Reasoning on page 19, the authors set $A^{cf} = \mathbf{I}_{C}$, the identity matrix.
No |
Lightly AI-edited |
|
Neurocircuitry-Inspired Hierarchical Graph Causal Attention Networks for Explainable Depression Identification |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The author proposed a hierarchical graph neural network for major depressive disorder (MDD) analysis. A residual gated fusion module was proposed to aggregate BOLD signals at the temporal level. The authors also conducted extensive experiments to show that the model performs better than baselines, that every design is useful, and that the model provides sufficient interpretability.
- Figure 4 includes the ROC and PR curves for better performance evaluation
- Table 2 includes weighted average values, which makes the performance difference clearer.
- The analysis is comprehensive. While the datasets are somewhat limited, the author discussed them in the future works section.
- A complete ablation is done in table 3 that details the contribution of each component.
- This seems to be a resubmission of a previously reviewed work, where the authors promised to discuss how the work differentiates itself from related approaches like https://arxiv.org/pdf/2410.18103, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10230606 in the related works section. As of the current draft, I don’t see this being done. The current related works section is largely the same as the previous draft. The authors seem to briefly touch upon this in Section. A.3. However, there is not enough comparison with specific works, and no citations were added in the entire section of A.3. Furthermore, no comparison was done against the works that the authors promised to do.
- While the ROC curve is useful, the implications are limited as the curves for other baselines are not reported. It would be useful to replicate one or two baselines and see how the curves compare.
See weaknesses. |
Fully human-written |
|
Neurocircuitry-Inspired Hierarchical Graph Causal Attention Networks for Explainable Depression Identification |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a novel framework for diagnosing major depressive disorder (MDD) from fMRI data. The method integrates three hierarchical levels of brain information: node (cortical region) level, neural circuit level (for example, default mode and salience networks), and whole-brain network level, within a unified graph neural network (GNN) architecture. The effectiveness of the framework is evaluated on a single fMRI dataset.
The integration of three levels of information makes sense to me, and I also appreciate the general idea of leveraging neural circuits as prior information. However, this prior knowledge does not seem to be fully utilized or to effectively reflect existing neuroscience evidence.
The experimental evaluation is too weak. First, only a single dataset is used. Why not evaluate on other MDD datasets such as SRPBS, OpenNeuro, or even the UK Biobank? It would also be more convincing to train on one dataset (for example, REST-meta-MDD) and test on another (for example, SRPBS) to assess the generalization ability of the proposed approach.
In addition, the comparisons with prior work are neither rigorous nor fair. The results of several state-of-the-art methods appear to be directly copied from the original papers, even though the experimental setups differ substantially. For instance, BrainIB used a 10-fold cross-validation scheme, while the current work adopts 5-fold cross-validation, which makes the comparison unreliable. You should rerun these methods in your own environment for a fair evaluation, especially since many of them have released official implementations. Furthermore, even the baseline results reported here differ from those in other published replications, which raises concerns about reproducibility and evaluation consistency.
The network design appears overly complicated and seems to contradict the stated motivation for interpretability. From a neuroscience perspective, researchers generally prefer architectures that are simple, easy to use, and supported by clear clinical evidence or interpretability. Although you attempt to incorporate circuit-level priors (which might be the only clinically grounded component), the overall network design (especially with several components insufficiently explained) undermines the interpretability of the entire framework.
1. Although the integration of three levels of information is conceptually reasonable, the current ablation study, which incrementally adds one module at a time, does not clearly reveal which component contributes most to the final decision. My question is: among the node-level, circuit-level, and network-level information, which source or combination of information plays the dominant role in the diagnostic performance?
2. The network design is not clearly explained. For the RG-Fusion module, there are two inputs, $X^1$ and $X^2$, but it is unclear why they are fused in such a complicated manner. It appears that $X^1$ and $X^2$ are first fused, and then another branch fuses information from $X^1$ again. The motivation for this structure should be clarified. In addition, is the feature dimension $d$ consistent between Equations (1) and (5)?
3. The loss function contains three regularization terms. How are the different $\lambda$ values tuned in practice? It is unclear whether a single set of $\lambda$ values can generalize across different datasets, and I suspect that the optimal configuration might be highly dataset-dependent.
4. I acknowledge that other approaches do not explicitly consider low-frequency oscillatory patterns in BOLD signals. However, it is unclear why you claim that your method captures such information. From my understanding, you simply add a new input (the raw BOLD signal) and apply a basic Transformer architecture. Do you attribute the ability to capture low-frequency oscillations solely to the Transformer design?
5. It is difficult to clearly understand the source of the reported performance gain. The proposed model takes two inputs, the functional connectivity (FC) matrix $X^1$ and the BOLD signal $X^2$, which effectively makes the framework a multi-view learning system. If I understand correctly, most existing GNN-based approaches use only $X^1$. Since multi-view learning has recently gained attention in neuroscience, it is important to clarify whether the performance improvement primarily stems from introducing an additional input modality rather than from the proposed complex network design itself? |
Lightly AI-edited |
|
Neurocircuitry-Inspired Hierarchical Graph Causal Attention Networks for Explainable Depression Identification |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes NH-GCAT, a neurocircuitry-inspired hierarchical graph causal attention network designed for explainable MDD identification. The model introduces three major modules aiming to incorporate neuroscientific priors into graph-based learning. The authors report advanced performance on the REST-meta-MDD dataset and provide multi-level interpretability analyses demonstrating biologically meaningful findings.
1. The paper tackles an important topic — enhancing both accuracy and interpretability of GNNs for MDD classification — and makes a solid attempt to integrate biological priors (depression-related circuits) with deep learning.
2. The interpretability analyses (frequency-specific validation, hierarchical circuit visualization, causal inter-circuit analysis) are thorough and align well with known MDD mechanisms.
3. The paper is clearly written and provides extensive quantitative results, including LOSO-CV analysis across 16 sites, supporting generalizability.
1. Unclear module motivation and mapping between equations and architecture. It is difficult to align the mathematical formulations in Section 3 (Equations 1–21) with the modules illustrated in Figure 2. The description of RG-Fusion, HC-Pooling, and VLCA lacks explicit motivation for each design component — for example, why certain fusion mechanisms, Gumbel-Softmax hierarchical assignments, or causal attention structures were chosen. The rationale for these designs should be better explained or visualized in connection with the biological circuits they represent.
2. Ambiguity in ROI-to-circuit mapping. The paper uses AAL116 for ROI definition, yet defines five circuits based on functional organization. It remains unclear how the authors aligned AAL ROIs to these five circuits. This mapping is problematic because AAL includes cerebellar regions, which are not part of these circuits. The paper should clarify how such ROIs were handled or reassigned — were cerebellar nodes excluded, or mapped to the nearest cortical network based on spatial proximity?
3. Lack of comparison with related literature. The authors cite several interpretable GNNs but omit discussion or comparison with relevant recent studies that also integrate community structure [1-3] or causal learning [4] in brain graphs.
4. Limited experimental scope. Experiments are only conducted on a single dataset. Given the claim that the model is neurocircuitry-inspired and generalizable, evaluation on at least one other psychiatric or neurological condition (e.g., ASD, AD, schizophrenia, bipolar disorder) would better demonstrate the adaptive ability and robustness of the proposed framework.
[1] Community-Aware Transformer for Autism Prediction in fMRI Connectome. MICCAI 2023
[2] Biologically Plausible Brain Graph Transformer. ICLR 2025
[3] BrainGT: Multifunctional Brain Graph Transformer for Brain Disorder Diagnosis
[4] BrainOOD: Out-of-distribution Generalizable Brain Network Analysis. ICLR 2025
See Weaknesses. |
Lightly AI-edited |
|
Low Rank Transformer for Multivariate Time Series Anomaly Detection and Localization |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper advances multivariate time series (MTS) anomaly detection in three distinct aspects. First, it provides theoretical insights into how the Transformer encoder represents and learns from MTS data, revealing how its representations relate to classical time series models. For instance, the authors equate the embedding process to Vector Moving Average (VMA) filtering, and the self-attention mechanism to the Space-Time Autoregressive (STAR) structure. Second, the authors propose the Attention Low-Rank Transformer (ALoRa-T), which consists of a LightMTS-Embed module, Attention Low-Rank (ALoRa) layers, and a decoder. Lastly, given this new architecture, the authors propose a novel detection score and localization method: the ALoRa-T score and the ALoRa-Loc method.
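For context, the generic textbook forms of these two classical models are sketched below; these are standard definitions and not necessarily the exact formulation used in the paper. A VMA($q$) process filters a noise sequence through lagged coefficient matrices, \[ y_t = \varepsilon_t + \sum_{k=1}^{q} \Theta_k \varepsilon_{t-k}, \] while a STAR structure lets lagged observations interact across series through spatial (here, cross-variable) weight matrices $W^{(l)}$ before entering the autoregression, \[ y_t = \sum_{k=1}^{p} \sum_{l=0}^{\lambda_k} \phi_{kl}\, W^{(l)} y_{t-k} + \varepsilon_t. \]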
- The authors theoretically relate the Transformer architecture back to the techniques from classical time series modeling. Based on this insight, they propose technically sound and well-motivated modifications to the Transformer architecture, further specializing it for the task of MTS anomaly detection.
- The authors propose novel detection and localization frameworks that are more reliable than previously used metrics.
- Together, the proposed method and detection/localization methods successfully outperform other baselines. The experimental results are quite comprehensive, and the authors have included code and sufficient experimental details to reproduce the results.
- According to Table 1, it appears that ALoRa-Det is more effective on some datasets (e.g., HAI or SMD) than on others (SWaT, MSL). What causes such a discrepancy in the results? Is ALoRa-Det more effective at detecting certain anomaly types than others?
- The majority of the baselines are drawn from Transformer-backed anomaly detection methods (for a good reason). Yet, it would be helpful to add some baselines from other families of MTS anomaly detection methods, such as reconstruction or contrastive learning-based methods.
- Do the authors expect their method to stay functional in application scenarios where anomalies and distributional shifts (concept drifts) appear mixed together? If so, how could ALoRa-T be extended or modified to such cases?
- Although the authors present ablation studies in Section D, I believe a more thorough ablative study that investigates the effectiveness of each proposed technical component separately to assess its contribution is necessary.
- Just a minor comment on paper formatting: I understand that the authors have chosen to move many of the experimental results due to the page constraint, but I personally think key results and analyses should still remain as a part of the main manuscript. I suggest that the authors truncate some of the materials in the introduction/related works to make room for the results section.
Please refer to the weaknesses above. |
Fully human-written |
|
Low Rank Transformer for Multivariate Time Series Anomaly Detection and Localization |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes ALoRa, a Transformer-based framework for multivariate time-series (MTS) anomaly detection and localization grounded in a theoretical analysis of Transformer encoders on MTS. The authors show that the encoder’s latent representations can be expressed as linear combinations of Space-Time Autoregressive (STAR) processes, which motivates (i) ALoRa-T—a Transformer with low-rank regularization on self-attention—and (ii) a detection score that counts significant singular values of the final attention matrix. They further derive contribution weights from inputs → latent → outputs to trace anomaly propagation and attribute anomalies to variables (ALoRa-Loc).
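As a concrete illustration of the rank-based score described above, the snippet below counts the singular values of an attention matrix that exceed a threshold. This is a minimal sketch under my own assumptions (the threshold name $h_1$ and the bare SVD count are mine); the paper's full score reportedly also involves reconstruction error and the thresholds discussed in the weaknesses below.

```python
import torch

def attention_rank_score(attn: torch.Tensor, h1: float) -> int:
    """Count singular values of an attention matrix above a threshold h1.
    Intuition reviewed above: anomalous windows tend to produce
    higher-rank attention, so this count rises when an anomaly is present."""
    s = torch.linalg.svdvals(attn)           # singular values in descending order
    return int((s > h1).sum().item())

# Toy usage: uniform (rank-1) attention vs. a noisier attention-like matrix
torch.manual_seed(0)
uniform_attn = torch.full((16, 16), 1.0 / 16)             # rank 1, single singular value
noisy_attn = torch.softmax(torch.randn(16, 16), dim=-1)   # row-stochastic, higher rank
print(attention_rank_score(uniform_attn, h1=0.1))          # -> 1
print(attention_rank_score(noisy_attn, h1=0.1))            # typically > 1
```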
(1) The paper provides a coherent spectral perspective on attention that is simple to compute conceptually and ties to an interpretable diagnostic.
(2) The authors diagnose that point-adjustment inflates results—sometimes making them indistinguishable from random scoring—and therefore pivot to range-aware/affiliation-based metrics, improving evaluation validity.
(3) The localization section explicitly models propagation via contribution weights (E, C), which is more principled than per-dimension reconstruction heuristics.
(4) The training objective is compact and implementable; the regularizer integrates cleanly with standard reconstruction losses.
(1) The detection pipeline uses two thresholds ($h_1, h_2$). Appendix A provides a data-driven approach for choosing the threshold $h_1$, but this is still a per-dataset manual step, introducing hyperparameter sensitivity. Also, neither an ablation on the selection of $h_2$ nor a heuristic for choosing it is provided.
(2) The paper’s central intuition—\textit{anomalous windows yield higher attention rank}—is supported empirically (plots/observations) but lacks a formal guarantee. No theoretical background specifies conditions under which anomalies must raise rank (or non-anomalies must not).
(3) ALoRa-Loc traces propagated influence, but ranking metrics like HR/NDCG/IPS do not distinguish origin variable from downstream affected variables; without per-segment confusion analyses, it’s unclear whether the method finds causes or merely effects.
(1) How sensitive is the final detection F1-score (which relies on the combined $AS(x_t)$) to this choice? For instance, what is the performance impact if $h_1$ is set 10x larger or 10x smaller than the value chosen via the eigenvalue distribution analysis?
(2) Please provide explanation on how $h_2$ value was selected and why.
(3) Do different anomaly types (point vs collective vs contextual) induce distinct singular-value patterns? Any class-wise analysis of detection latencies?
(4) For segments where the anomaly propagates widely, how often does top-k ALoRa-Loc identify the true origin vs “most affected” variables? Could you report per-segment confusion analyses?
(5) Some important ablations are missing: rank-only score vs error-only vs multiplicative combo; head-wise vs averaged penalty; all-pair vs top-K embeddings; FFN on/off at matched params. Could you please provide ablations on these?
* A minor typo in line 228; "throught" $\rightarrow$ "thought" |
Lightly AI-edited |
|
Low Rank Transformer for Multivariate Time Series Anomaly Detection and Localization |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a transformer-based framework for time-series anomaly detection that leverages attention rank analysis to interpret and localize anomalies. The key idea is that the rank of self-attention matrices increases when anomalies occur, providing a new signal for both detection and localization.
1. The idea of detecting anomalies by analyzing the transformer’s learning behavior is original and insightful. It opens a new direction for understanding model-internal representations in time-series anomaly detection.
2. The focus on anomaly localization is meaningful and practically valuable.
1. The paper uses Spearman correlation to estimate dependencies among sequence pairs but does not justify why this choice is preferred over Pearson correlation or Cosine Similarity. Furthermore, the paper states that only the top-K correlated pairs are retained, yet the criterion for determining K is not specified or experimentally analyzed.
2. The central claim that “the rank of SA-matrices increases in the presence of anomalies” is only supported by empirical observation on a few datasets. The paper does not provide a theoretical explanation or evidence that this phenomenon holds consistently across diverse anomaly types and domains.
3. The definitions of variables are inconsistent—sometimes the input sequence is denoted as x, other times as y, making the mathematical expressions difficult to follow.
4. The inference process depends critically on the threshold h_2. Although the paper mentions that Appendix A describes its selection, the appendix does not include such details yet.
5. Localization evaluation requires ground-truth information about the precise anomalous series. However, the datasets used in the experiments typically provide only record-level anomaly labels (anomalous or normal per timestamp) without explicit localization annotations. Could the authors clarify how the localization ground truth is obtained?
6. The main text experiments are overly concise and lack detailed analysis. Although the appendix includes an ablation study, it only evaluates the embedding module. A more critical ablation, particularly on the ALoRa loss function, is missing and should be included to support the claimed effectiveness of the proposed loss.
7. The paper does not provide the source code, and the methodological descriptions are not detailed enough to reproduce the reported results reliably.
See the weaknesses section |
Moderately AI-edited |
|
Low Rank Transformer for Multivariate Time Series Anomaly Detection and Localization |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper tackles multivariate time series anomaly diagnosis, covering both detection and localization. It analyzes the learning behavior of Transformers from a theoretical perspective and connects it to classical statistical time-series analysis. Based on these insights, the authors propose the Attention Low-Rank Transformer (ALoRa-T) with low-rank regularization to better capture temporal anomaly patterns, and introduce ALoRa-Loc for variable-level anomaly localization. Experiments on real and synthetic datasets show that the proposed approach outperforms existing methods in both detection and localization tasks.
1. The paper offers valuable theoretical insights by linking the Transformer’s self-attention mechanism to established statistical time-series principles, providing a more interpretable foundation for deep anomaly detection models.
2. Unlike many prior works focusing only on detection, the introduction of ALoRa-Loc enables variable-level anomaly attribution, advancing the underexplored area of multivariate anomaly localization.
1. The distinction between “time series” and “variable” is not consistently maintained throughout the paper. Since each variable corresponds to a univariate time series, the terminology should be clarified to avoid conceptual confusion.
2. The paper states that each kernel learns representations from only two time series, but the motivation for selecting exactly two is not explained.
3. The metrics used to evaluate anomaly localization ability — Hit Rate, Normalized Discounted Cumulative Gain (NDCG), and Interpretation Score — are not well-suited for this task. Hit Rate and NDCG are designed for ranking or recommendation settings, while Interpretation Score lacks a clear definition in the context of anomaly localization.
4. The paper does not compare with established approaches for anomaly localization or root cause identification, such as “Root Cause Analysis of Anomalies in Multivariate Time Series through Granger Causal Discovery.”
5. In Table 2, several numeric values use commas instead of decimal points.
Please see the weaknesses. |
Moderately AI-edited |
|
LRIM: a Physics-Based Benchmark for Provably Evaluating Long-Range Capabilities in Graph Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a physics-based dataset, named the Long-Range Ising Model (LRIM) Graph Benchmark, to measure long-range modeling capabilities in graph neural networks. The benchmark utilizes the Ising model with power-law interactions, where the target task provably depends on long-range dependencies. The paper provides 10 datasets ranging from 256 to 65k nodes, with difficulty controlled through a tunable parameter that inversely controls the interaction strength between nodes. The analysis shows that local information is insufficient, a theoretical study of long-rangeness measures is given, and empirical evaluations demonstrate that both message-passing architectures and graph transformers fall short. The entire dataset is synthetically generated, and the graphs are 4-regular and 2D grid-like.
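To make the task concrete, a minimal sketch of the per-node regression target is given below, assuming the standard long-range Ising energy $H = -\sum_{i<j} J_{ij} s_i s_j$ with power-law couplings $J_{ij} \propto d_{ij}^{-\alpha}$; the exact coupling normalization and sign conventions used in LRIM may differ.

```python
import numpy as np

def delta_energy_per_node(spins: np.ndarray, coords: np.ndarray, alpha: float) -> np.ndarray:
    """Energy change of flipping each spin s_i under H = -sum_{i<j} J_ij s_i s_j
    with J_ij = d_ij^(-alpha):  Delta E_i = 2 * s_i * sum_{j != i} J_ij * s_j.
    Smaller alpha means slower decay, hence stronger long-range dependence."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-coupling (J_ii = 0)
    J = d ** (-alpha)                    # power-law couplings
    return 2.0 * spins * (J @ spins)     # 2 s_i times the local field at node i

# Toy usage on a 4x4 grid with random +/-1 spins
n = 4
coords = np.array([(x, y) for x in range(n) for y in range(n)], dtype=float)
spins = np.random.default_rng(0).choice([-1.0, 1.0], size=n * n)
print(delta_energy_per_node(spins, coords, alpha=3.0))
```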
- For the datasets, the use of the Ising model provides a physics-based foundation where long-range dependencies are mathematically guaranteed and controllable. This is unlike some prior long-range graph benchmarks, such as superpixels, where long-rangeness is not mathematically guaranteed.
- Compared to previous benchmarks, which demonstrate the long-rangeness of tasks via the performance of different model classes, this work provides an elaborate analysis of the proposed dataset with an oracle predictor, theoretical lower bounds, and a long-rangeness metric.
- The task difficulty can be tuned, and this is also demonstrated with examples in Figure 3. In addition, there are clear performance gaps between message-passing networks and full-neighborhood graph transformers, as shown in Tables 2 and 3.
- The proposed collection of datasets with sizes and difficulty can be used for developing long range graph networks, alongside other recent works/datasets which study this topic.
- As acknowledged by the paper, the benchmark is limited to regular lattice structures. This is significant since real-world graphs rarely have such regular topology, and message-passing GNNs may not be the best architecture for it; the grid structure may favor certain architectural choices.
- In addition, methods designed specifically for grid-like data are excluded; however, including them could clarify whether graph-specific networks are necessary in such settings.
- A major limitation is the real-world applicability of the datasets which the paper acknowledges.
One observation: GPS shows OOM on LRIM-256 in Table 3. Would it be possible to include a more approximate alternative to GPS, for instance, specifically to fill in the missing scores here? |
Fully human-written |
|
LRIM: a Physics-Based Benchmark for Provably Evaluating Long-Range Capabilities in Graph Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper constructs a graph regression dataset based on the well-known Ising model as a benchmark for long-range dependencies.
They show that indeed long-range interactions are needed in order to solve the problem and that deeper networks tend to work a lot better than shallow ones. They also provide an oracle value for $k$-hop GNNs which is a lot better than the trained networks, indicating sufficient complexity to be challenging.
The main strength of the suggested dataset is that it addresses one of the key problems in graph machine learning: long-range interactions. Existing datasets are often purely empirical (maybe except for the road networks from Liang et al 2025) instead of being principled.
The dataset satisfies a number of desirable properties such as coming in varying complexities and sizes (including large graphs with 65k nodes each) while leaving a significant performance gap between a simple restricted oracle and existing GNNs.
Overall, I am not yet convinced that the dataset is really what we are looking for, mostly because the graphs are extremely simple and, as far as I understood the task, it is not that much about interactions influencing spin patterns, but rather about aggregating information from far away, in a way that is mostly independent of information that has already been digested.
Concretely, there are a few aspects about the dataset that I consider not too strong:
- It could have been a lot clearer how exactly the LRIM graphs are generated from the background that has been described before, especially for graph-learning experts who have not worked with the Ising model before. Apart from that, the paper is well-written and easy to follow.
- The graphs are extremely simple (just lattices, even simpler than the road networks from Liang et al). Thus the task is really an oversimplified edge case for graph learning.
- There is not that much interaction going on, especially since on a regular lattice all $J_{ij}$ are the same.
- I am not convinced that the construction really tests interaction rather than just global aggregation (see questions).
- The provided lower bound states that there exists a solution that is "very different", but says nothing about the distribution of such solutions. In particular, I believe it is possible to "cheat" using global statistics.
Concrete (small) things:
- Since LRIM uses lattice graphs, positional embeddings should make looking at the edges irrelevant (e.g. using a PE that is made for images and is able to encode an x and y position).
- The task is really about very exact computations, which tends not to be the strongest suit of machine learning models. And at some point floating-point precision will start becoming problematic (probably way before -20, where the trivial accuracy boundary is).
- In the experiments it looked to me like much larger MPGNNs would have been possible without exceeding the computational demands of the tested graph transformers. Is there a reason why only "small" models have been tested? How does a 50M GatedGCN model perform?
- 422: Maybe I was misreading the plot, but the numbers of the oracle and the learned method are not far apart for up to 12 layers (Fig 4). I do not agree with the conclusion made here.
- 475: As soon as we know the distances, the graph itself becomes highly unimportant. So it is only partially about graph learning, I'd say.
1. Is it really "interaction" or rather "aggregation" that is happening in the Ising model? Especially when it is about energy prediction, as in LRIM?
2. In the Monte Carlo simulation that is used to simulate the system, I do not really understand how deterministic this is and how exactly it is used for LRIM.
3. When going for more complex graphs than regular lattices, would the construction and its long-range guarantees still carry over?
4. How do you rate the possibility of a model "cheating" based on, e.g., global statistics and thus outperforming the oracle, which has limited information (but uses that information optimally)? And in that context, how helpful is the provided lower bound?
5. Do you have an intuition for why LapPE did not perform at all, and why it produces this odd curve in Fig 4? |
Fully human-written |
|
LRIM: a Physics-Based Benchmark for Provably Evaluating Long-Range Capabilities in Graph Learning |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces LRIM, a physics-based benchmark built on the Ising model that provably depends on long-range interactions, addressing gaps in existing graph-learning benchmarks. Concretely, it proposes a node-regression task to predict per-node energy changes on 2D grid graphs that model different spin configurations of an Ising model. A theoretical analysis shows how the dependence on long-range patterns can be directly controlled in this setting. The provided empirical results report that graph transformers do significantly outperform local MPNNs in this setting, although at a significantly increased computational cost.
I think the suggested task is a valuable addition to the existing set of graph learning benchmarks to study long-range information in a controlled setting. In particular:
1. Obtaining a provably long-range graph learning task from the Ising model is an original idea and addresses the main problem of prior "long-range" benchmarks for which the justification of long-rangedness was purely empirical.
2. The reported results do show a clear separation between local and global architectures.
3. The large graph sizes of up to 65k nodes are challenging for standard graph transformers and seem like a good test bed for developing more efficient long-range architectures.
4. The presentation is clear and key details like hyperparameter budgets are fully provided.
The restriction to regular 2D grids is a weakness in the context of graph learning, as the main feature of GNNs is their ability to process arbitrary graph structures. I think this is an acceptable weakness for a benchmark that intends to be complementary to "real-world" datasets, but a broader range of graph structures would ultimately be more convincing.
The set of provided baselines also misses MPNNs with virtual nodes [1], a standard trick to propagate global information in graphs. It would be very interesting to see how such architectures perform on this dataset, as VNs allow for global information aggregation but lack the pairwise global interactions of transformers that seem to align well with the suggested task.
[1] Gilmer, Justin, et al. "Neural message passing for quantum chemistry." International Conference on Machine Learning. PMLR, 2017.
1. What is the numerical range of the regression target $\Delta E_i$? Do these need to be normalized for training?
2. Given that the graphs are currently regular 2D grids, would it be reasonable to use the same task for benchmarking vision models like CNNs or Vision Transformers? |
Fully human-written |
|
Heteroscedastic Variational Bayesian Last Layers: Modeling Input-Dependent Noise in Sparse-Data Regression |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
Aiming to address the drawbacks of the Variational Bayesian Last Layer (VBLL), i.e., its assumptions of homoscedastic noise and sufficient data, the authors propose to model heteroscedastic noise within VBLL and suggest a clustering-based initialization of the prior noise variance for robust performance. Experimental results on toy datasets and several real-world datasets from UCI and ERA5 demonstrate the benefits of the proposed idea compared with six baselines.
- The proposed idea addresses a relevant problem: modeling input-dependent/heteroscedastic aleatoric uncertainty within VBLL, a promising approach for efficient uncertainty estimation.
- Technically, the authors recognize the issue of a misspecified noise prior when modeling heteroscedastic noise and suggest a clustering-based initialization to mitigate this issue (a rough sketch of my reading of this step follows this list).
- The ablation study on the effects of a misspecified noise initialization is insightful and convincing.
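To make my reading of this step concrete, here is a minimal sketch of a clustering-based noise-prior estimate, assuming k-means over the inputs and within-cluster target variances averaged into a single prior noise level; the function name, the choice of k-means, and the cluster count are my assumptions, and the paper's actual procedure may differ in its details.

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_noise_prior(X: np.ndarray, y: np.ndarray, n_clusters: int = 5) -> float:
    """Cluster the inputs, treat the target variance inside each cluster as a
    local noise estimate, and average these into one prior noise level."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    local_vars = [np.var(y[labels == k]) for k in range(n_clusters)
                  if np.sum(labels == k) > 1]  # skip degenerate clusters
    return float(np.mean(local_vars))
```

If this reading is roughly right, the estimate inherits the usual sensitivities of k-means (cluster count, input scaling), which is why I would like the initialization-data requirement discussed more explicitly.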
Presentation:
- The presentation flow of the draft is a bit hard to follow. Especially in the preliminaries section, the motivation and the connections between the subsections seem to be missing.
- It would also be more readable if the part about prior work were separated from the introduction and used to position this work in the literature more clearly.
Method:
- The proposed extension of modeling heteroscedastic noise in VBLL is technically incremental, and the cases of classification and generative classification have not been discussed or considered.
- The clustering-based approach requires an additional dataset for the initialization. Does it add more hyperparameters or complexity to the approach?
- There is a lack of discussion of the roles of aleatoric and epistemic uncertainty in the given formulation.
- It seems that VBLL can be connected to Gaussian processes (GPs) in the function-space view [1]; it would be inspiring for general readers to add a discussion of the commonalities and differences between them.
Experiments:
- A relevant baseline is missing: Gaussian processes via the neural tangent kernel of the last layer, which [1] implemented for a robot inverse-dynamics regression task.
- It would be clearer to also add results obtained using all the data points in the real-world regression tasks.
- It would be more convincing to add the variance across different runs/sample sizes to the result tables.
[1] Lee, J., Feng, J., Humt, M., Müller, M. G., & Triebel, R. (2022, January). Trust your robots! predictive uncertainty estimation of neural networks with sparse gaussian processes. In Conference on Robot Learning (pp. 1168-1179). PMLR.
see above |
Fully human-written |
|
Heteroscedastic Variational Bayesian Last Layers: Modeling Input-Dependent Noise in Sparse-Data Regression |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a method based on the idea of the Variational Bayesian Last Layer (VBLL) to estimate heteroscedastic noise in regression-type problems in a sparse-data regime. The paper proposes a clustering-based noise-level estimation in this framework and demonstrates the performance of the method on several synthetic and real-world data sets. In terms of methods, the comparison is done with respect to Monte Carlo dropout. One of the main conclusions is that the prior on the noise has a significant effect on the quality of uncertainty quantification.
Generally, the problem is well posed, and there is a lack of computationally efficient methods for uncertainty quantification in the neural-network context, especially when it comes to the estimation of heteroscedastic noise.
- The setup of the paper is not convincing. In particular, the proposed method outperforms other methods (such as MC dropout) in small-data regimes, but I am not sure why neural networks would be chosen for modeling in this regime at all. As the introduction states, Gaussian processes provide well-calibrated uncertainty estimates, although they are computationally expensive. It seems that they would be a much more desirable modeling framework for such applications.
- The results suggest that the choice of the prior noise level has a strong effect on the performance of the proposed method. Clustering-based noise estimation is proposed to set the prior; however, it does not seem to be rigorously analyzed. Essentially, the method is not sufficiently developed to be used in diverse scenarios.
- The literature review for sparse-data regimes could be stronger. In particular, it would then make sense to compare against Gaussian processes. It is not clear from the paper why a neural-network setup is chosen for such cases, as it does not seem to be the optimal framework. This could perhaps be motivated by critical applications that require fast real-time inference, but nothing of that kind is considered in the experiments.
- How well does the method perform compared to more conventional methods for sparse data regimes?
- What are the failure modes of the proposed method for setting the prior hyperparameters?
Overall, the performance of the method seems good in cases where neural networks would not be the first choice for modeling. Since the performance of the method depends so strongly on the choice of the hyperparameters, this part of the approach needs to be developed further to account for diverse scenarios, including more complex data sets. |
Fully human-written |
|
Heteroscedastic Variational Bayesian Last Layers: Modeling Input-Dependent Noise in Sparse-Data Regression |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a new extension of the Variational Bayesian Last Layer (VBLL) framework called Heteroscedastic VBLL (HVBLL) that can explicitly model heteroscedastic (input-dependent) noise while retaining VBLL’s computational efficiency and sampling-free properties. The main idea of the proposed HVBLL is to replace the constant-homoscedastic noise assumption in VBLL with an input-dependent Gaussian distribution, with the variance $\sigma(x)^2$ parameterized by a neural network. Moreover, the authors demonstrate the sensitivity of both VBLL and their proposed HVBLL to noise priors in sparse-data scenarios, and design a clustering-based noise-level estimation algorithm to infer a more reasonable noise prior. Empirically, the proposed method shows strong performance (e.g., captures heteroscedastic noise, models noise prior) across both synthetic and real-world datasets (UCI, ERA5, and a composite structure failure dataset). The HVBLL consistently outperforms baselines such as VBLL, MC-Dropout, SWAG, BLL, DVI, and PNN in both accuracy and uncertainty metrics (NLL, MAE, CRPS).
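To fix ideas, here is a minimal sketch of the kind of input-dependent noise model the summary describes, reduced to a plain heteroscedastic Gaussian-likelihood head; the class and function names, the Tanh activation, and the single hidden layer with 32 units (the width the questions below mention for $g_{\beta}$) are my assumptions, and the sketch deliberately omits the Bayesian last-layer posterior and the KL terms that the actual HVBLL objective contains.

```python
import torch
import torch.nn as nn

class HeteroscedasticHead(nn.Module):
    """Predicts a mean and an input-dependent log-variance from shared features."""
    def __init__(self, in_dim: int, hidden: int = 32):
        super().__init__()
        self.mean = nn.Linear(in_dim, 1)
        self.log_var = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, feats: torch.Tensor):
        return self.mean(feats), self.log_var(feats)

def gaussian_nll(mu, log_var, y):
    # Negative log N(y | mu, exp(log_var)), up to an additive constant;
    # mu, log_var, y are assumed to have matching shapes.
    return 0.5 * (log_var + (y - mu) ** 2 / log_var.exp()).mean()
```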
1. The proposed HVBLL generalizes VBLL to heteroscedastic noise settings with a flexible parametrization of the input-dependent noise variance (modeled as a neural network) while retaining deterministic, sampling-free training.
2. The proposed clustering-based noise prior estimation is simple ($Mean(v_i)$) yet powerful and directly addresses a key sensitivity issue in sparse-data BNNs.
3. Empirically, the experimental evaluation involves multiple tasks (heteroscedastic noise capture, noise priors, benchmark UCI regression) to illustrate the effectiveness of the proposed method. HVBLL achieves consistently lower NLL and CRPS than VBLL and other baselines, especially in heteroscedastic and sparse-data cases (Tables 1–2 and 14–19).
1. While parameterizing heteroscedasticity with a neural network $g_{\beta}$ offers flexibility, it increases the computational burden. The paper does not analyze the computational efficiency of the proposed method. Moreover, there is no ablation study for this $g_{\beta}$ to demonstrate its effect on the performance of HVBLL, even though it is the key difference between the proposed HVBLL and vanilla VBLL.
2. For uncertainty estimation, the paper reports NLL, MAE, and CRPS but does not provide calibration metrics such as ECE, which is an important uncertainty calibration metric.
3. Although the paper includes real-world datasets (UCI), it is unclear whether the method could scale to large real-world inputs (e.g., image classification tasks for CIFAR10/100, etc).
4. While Algorithm 1 performs well empirically, there is no theoretical analysis of its convergence or bias relative to ground-truth variance estimation. Moreover, the choice of the cluster size directly affects the estimation of $E_{noise}$; however, the paper provides no sensitivity or ablation study for this parameter.
5. There are some typos that require correction, e.g., in Eq. (9) the variance $g_{\beta}$ should be $\exp g_{\beta}$; in Fig. 3, the equation references in the figure are incorrect (Eqs. 24-27) and should be Eqs. 28-31.
1. I believe the main distinction between your proposed HVBLL loss function and the VBLL loss function lies in modeling the noise variance: $q(\epsilon) \sim N(0, \sigma^2)$ in vanilla VBLL and $q_{\beta}(\epsilon)\sim N(0, g_{\beta}(x))$ in your method. In your loss function (10), there is a KL term between the noise prior and the approximate posterior; however, Eq. (4) for vanilla VBLL lacks such a KL term, containing only $\log p(\sigma^2)$. Why is this KL term omitted? What does $p(\sigma^2)$ represent?
2. In Eq. (5), $\mathcal L(\theta, \eta, \sigma^2)$ is presented as the ELBO; however, I believe this term is just the expected log-likelihood (because there is no KL term in this $\mathcal L$). What is the difference between the two?
3. Can you provide a theoretical justification or bias analysis of Algorithm 1’s estimation of $Mean(v_{i})$ relative to the true $E_{noise}$? I think this approximation accuracy should be highly correlated with the number of clusters. Can you provide a sensitivity analysis for this hyperparameter? Due to time constraints, demonstrating it on a toy experiment would suffice.
4. How does HVBLL scale in complexity and stability to higher-dimensional image classification tasks (e.g., CIFAR10/100, ≥ $10^3$ features)? All networks in your experiments are MLPs. Can HVBLL be efficiently integrated into other network architectures, such as CNNs?
5. You used only a one-hidden-layer neural network for $g_{\beta}$, with 32 hidden units for the real-world datasets. Is this simple architecture sufficient for high-dimensional inputs? How significantly does it impact the final performance of HVBLL? Could you provide the corresponding ablation results?
6. What's the computational complexity of the HVBLL? How much computational overhead does this parameterized neural network $g_{\beta}$ introduce? Can you provide runtime comparisons with other methods?
7. It's unclear how AI and WR are calculated in Table 1-2. Can you provide the specific computational definitions? |
Fully human-written |
|
Team, Then Trim: An Assembly-Line LLM Framework for High-Quality Tabular Data Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a framework, “Team, then Trim”, which consists of aggregated workers (LLMs) specialized to (conditionally) generate specific (subsets of) columns, thus capturing inter-feature logic and domain dependencies. Once the synthetic dataset has been generated, it undergoes a quality-check (QC) process, namely a sanity check, an objective-based cost assessment (comparing synthetic samples with bootstrapped original data), and diversity-based monitoring (to make sure the samples are not skewed), to ensure that the synthetic dataset is of high quality.
- The overall idea and the assembly-line-worker analogy are intuitive enough.
- The paper is easy to understand and well presented.
- [**Experiment on recent baselines**] Adding more recent baselines, especially ones that explored the use of LLMs for tabular generation [1, 2, 3], will strengthen the paper. Moreover, ‘Team-then-Trim’ has some similarities with [1] in terms of using specialized model components per column/subset of columns (MoEs in [1], worker LLMs here), so it is also important to compare and contrast the pros and cons with related work.
- [**Experiment on model sizes**] Varying the model sizes would be interesting for understanding their effect on data quality. Questions around design choices such as “larger/smaller task manager + smaller/larger role specialists”, i.e., do we need a more capable task manager with average workers or an average task manager with highly capable workers, would be interesting to understand.
- Questions like “how does varying the model size affect data quality” also fall under this experimental design choice.
- [**Experiment on model families**] Related to the above, a follow-up would be to look at different families of LLMs for the task manager and the role specialists. Will any bias arise from collaboration among LLMs from different families?
- E.g., in the LLM-as-a-judge literature, there is a bias associated with a model preferring responses generated by its own family [4], i.e., self-preference bias. So any analysis and observation in that direction would be interesting.
- Appendix E is an interesting starting point for addressing this kind of follow-up questions.
- [**Discussion on time complexity**] Please add a time-complexity analysis for the proposed framework. The sizes of the task manager and the role-specialist workers (if they differ), the number of columns and rows to be generated, the number of LLM workers used, the costs associated with the data quality checks (clustering), etc., all contribute to the overall time complexity. As some of these are data-specific (columns) or task-manager-specific decisions (how many workers to assign), it is important to understand the time complexity from a practical standpoint beforehand.
- [Question, L804] Please add more details in this section on how many samples were generated for each dataset.
- [**Discussion/Experiment on additional metrics**] Inclusion of additional metrics such as MLE (Machine Learning Efficacy) [5], DCR (Distance to Closest Record) [6], and Discrimination [7] is important for the discussions on privacy preservation and synthetic-vs-real data quality validation; a minimal DCR sketch is given after the reference list below.
- This will complement some of the discussions in Sec 2.2, especially for objective and diversity cost assessments.
- Consider adding descriptions of the various metrics in the appendix (including AUC, accuracy, F1, precision, recall, etc.), complementing Sec. 3.2.2.
- [**Discussion/Experiment on construction and evaluation of `G`**] From Fig. 5 (L760-765), I see that the task-manager LLM is responsible for forming the relationships among the data (i.e., the construction of `G`, Eq. 1), and I would like to know how it fares against manual human graph construction and assignment of worker LLMs. And how can one evaluate the quality of `G`, i.e., decide whether to discard it or regenerate the work assignments?
- [**Discussion/Possible Experiment**] How can one extend the framework to use-case-specific requirements for which LLMs do not have enough domain knowledge, say rare data that LLMs did not see during training? For example, generating UUIDs or distinct IDs, which are rare and might come out spurious given the training process. Is it possible to do some fine-tuning within the current framework to get reliable predictions?
- [**Discussion/Experiment on column dependencies**] Following up on the previous point, how does the conditional order of data generation matter in scenarios where columns have a bidirectional relationship, i.e., where there are different choices to resolve a scenario such as:
- Generate column A, then column B vs
- Generate column B, then column A or
- Generate both A and B together. So understanding how the task manager (LLM) and a human might resolve such role conflicts would be interesting.
- A quick experiment would be to pick a dataset, obtain both task-manager-generated and human-generated roles, and compare the performance differences with respect to role conflicts and worker-assignment differences.
- Consider adding a discussion of the different scenarios, i.e., “independent columns, unidirectionally causal columns, and bidirectionally causal columns”.
1. Tabby: Tabular Data Synthesis with Language Models: https://arxiv.org/abs/2503.02152
2. Language Models are Realistic Tabular Data Generators: https://arxiv.org/abs/2210.06280
3. HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection: https://arxiv.org/abs/2408.02927v1
4. Self-preference bias in LLM-as-a-judge: https://arxiv.org/abs/2404.13076, https://arxiv.org/abs/2410.21819
5. A Multi-Dimensional Evaluation of Synthetic Data Generators: https://ieeexplore.ieee.org/document/9686689
6. SynthEval: A Framework for Detailed Utility and Privacy Evaluation of Tabular Synthetic Data: https://arxiv.org/abs/2404.15821
7. Synthcity: facilitating innovative use cases of synthetic data in different data modalities: https://arxiv.org/abs/2301.07573
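As a small concrete reference for the DCR suggestion in the metrics bullet above, here is a minimal sketch; the function name is mine, and it assumes numeric features that are already scaled and categorical ones one-hot encoded, which the cited toolkits [6, 7] handle more carefully.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Distance to Closest Record: for each synthetic row, the distance to its
    nearest real row. Very small values hint at copying/memorisation."""
    index = NearestNeighbors(n_neighbors=1).fit(real)
    distances, _ = index.kneighbors(synthetic)
    return distances.ravel()
```

Reporting the distribution (e.g., the 5th percentile) rather than only the mean would make the privacy discussion more informative.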
- L462-463: Is there a hypothesis on why some metrics are not better for the proposed framework?
- L338: Is there a sentence continuing after “generated data”? I did not get that part.
- Consider adding a section describing the datasets used along with representative examples. |
Fully human-written |
|
Team, Then Trim: An Assembly-Line LLM Framework for High-Quality Tabular Data Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Team-then-Trim, a framework for synthetic tabular data generation using coordinated large language models. A task-manager LLM partitions the feature space into semantically aligned components and schedules specialized worker LLMs to generate each subset sequentially based on dependency structure. The resulting partial outputs are concatenated into full samples, which then pass through a three-stage quality control process assessing validity, task utility, and diversity preservation. Across simulated and real-world datasets, the proposed method yields synthetic data that improves downstream model performance and maintains distributional fidelity compared to both traditional oversampling and single-LLM baseline.
- The team-then-trim structure separates generation from post-hoc quality control, providing robustness against LLM hallucination.
- The three-stage quality control pipeline (sanity, objective-driven filtering, diversity enforcement) is systematic and targets well-known challenges in synthetic data generation, including invalid entries, distributional bias, and limited incremental information.
- The use of model-based scoring and information-gain comparison to filter batches offers a principled framework beyond heuristic rejection rules that previous work used.
- The method demonstrates downstream performance better than existing tabular data generation baselines.
- The quality control pipeline assumes access to a reasonably performant base model and sufficient initial real data to bootstrap quality signals, which can limit applicability in low-data or scarce-label settings (including simulated data incompleteness setting in the paper).
- The method incurs non-trivial computational overhead due to repeated generation, batch scoring, and rejection loops. The generation resource trade-offs are not fully addressed.
- The reliance on a single trained classifier for qualifying the cost of synthetic data raises the possibility that the QC process overfits to the specific classifier used, rather than reflecting true data utility. It would be valuable to evaluate whether the selected batches remain consistent when multiple different classifiers are used for the scoring stage.
- The evaluation reports performance using 500 generated and original samples. How does downstream performance scale as the number of synthetic samples increases? Specifically, does performance continue to improve with additional synthetic data, or does it plateau or degrade?
- In scenarios where the number of original samples is limited, can the synthetic data still recover or cover the full cluster structure that would be observed if the complete real dataset were available? In other words, does the proposed method retain the ability to approximate the true distributional clusters when starting from a partially observed dataset?
- Which LLM was used for the curated generation process in CLLM? The original CuratedLLM paper reports that stronger LLMs exhibit better performance, particularly on under-represented samples. Therefore, it would be helpful to clarify the specific model used in your reproduction to understand the reported results.
- The proposed pipeline appears to rely on data-specific prompt construction for effective synthetic sample generation. Could the authors evaluate the robustness of the method with respect to prompt variations? Such an analysis would strengthen the novelty claim by demonstrating that performance is not overly reliant on manually curated prompt engineering. |
Fully AI-generated |
|
Team, Then Trim: An Assembly-Line LLM Framework for High-Quality Tabular Data Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes using an agentic AI approach for generating tabular data. A coordination LLM splits the generation problem into K parts assigned to different LLMs, each generating a subset of the tabular features in a way coordinated by the coordination LLM. Here, the prompt requires this LLM to handle dependencies between features for improved quality. The generated data is then passed into a three-stage quality check pipeline ensuring: 1) a sanity check for data types and values; 2) the provided learning potential for a given downstream model, and 3) a good level of diversity. The generated data is evaluated on the downstream task utility against baselines from related work.
- Leverages structural knowledge of the data during generation
- Incorporates multi-level quality checks to ensure high-quality data from different points of view: sanity, utility, and diversity
- Allows for the recovery of data subgroups missing in the original data
- The evaluation against related work misses typical tabular generators, e.g., GReaT [1] and Tabula [2], and in particular also other agentic LLM approaches, e.g., [3], or diffusion-based ones, e.g., [4].
- All LLMs in the evaluation seem to be of the same type, i.e., Llama 3.3 70B Instruct, but the power of this method could also be to use more targeted LLMs for the different roles, coordinator vs worker, or for specific features. No evaluation in this direction has been done.
- Following on from that (the same LLM is used for all roles), the paper should stress more what the advantage of this approach is with respect to some kind of chain-of-thought/in-context-learning guidance of a single LLM during data generation.
- Only full or no quality control is considered in the ablation study. It would be interesting to see how much each of the three QC steps contributes.
Minor:
- The type of data noise (label flip) has not been specified. Did you use symmetric or class-specific flipping? (See the note after this list.)
- Tables 1, 2, 4, 5: the font is much larger than the surrounding text.
- Figure 3 is placed before the text referencing it.
- Figure 3: to ease comparison, I suggest using the same y range on all subfigures.
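To make the label-noise question above precise, the two options I have in mind differ in the noise transition matrix $T_{ij} = P(\tilde{y}=j \mid y=i)$ over $K$ classes (notation mine):

$$T^{\text{sym}}_{ij} = \begin{cases} 1-\rho, & i = j,\\ \rho/(K-1), & i \neq j,\end{cases}$$

versus a class-specific matrix whose off-diagonal mass is non-uniform; the two can lead to quite different conclusions about robustness, so the paper should state which one was used.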
[1] Borisov, V., Seßler, K., Leemann, T., Pawelczyk, M., Kasneci, G.: Language models are realistic tabular data generators. arXiv preprint arXiv:2210.06280 (2022)
[2] Zhao, Z., Birke, R., & Chen, L. Y. (2025, June). Tabula: Harnessing language models for tabular data synthesis. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 247-259).
[3] Benoît Ronval, Pierre Dupont, Siegfried Nijssen. TAGAL: Tabular Data Generation using Agentic LLM Methods. arXiv preprint arXiv:2509.04152 (2025)
[4] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, Artem Babenko. TabDDPM: Modelling Tabular Data with Diffusion Models. ICML 2023: 17564-17579
- How does Team-then-Trim perform against other baselines from related work, such as the ones referenced under weaknesses?
- What is the noise transition matrix used for label flipping?
- What is the benefit of the different QC steps? |
Fully human-written |
|
Team, Then Trim: An Assembly-Line LLM Framework for High-Quality Tabular Data Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper proposes the team-then-trim framework for tabular data generation in low-data regimes (n<100).
It has two parts: (1) multi-agent synthetic data generation and (2) a QC pipeline.
The framework is assessed in different tabular data settings against other generative and LLM-based synthetic data generators.
Core contribution: LLM decomposition via multi-agent workers and a different curation mechanism (not the idea of the generation+curation pipeline itself, which prior work proposed)
- Significance: Tackles an important and well-studied problem with high impact in many domains
- Originality: Proposes an interesting idea of multi-agent synthetic data generation + a multi-step QC pipeline.
The idea of feature decomposition and the generation to respect dependencies is great.
- Quality: Good set of experiments in many scenarios: (imbalance, incompleteness, noise, scarcity) + multiple downstream models.
Seems to outperform existing methods on the settings tested (albeit minimally)
- Clarity: clearly written paper
(1) Limited novelty: basically the same idea as CLLM, just with a multi-agent approach + different QC approach. Better positioning is needed to understand the gain because there are more LLM calls (via more agents)
(2) Inconsistent results: while Team-then-Trim mostly outperforms other approaches, it is not universally the case. Understanding when it helps, when it does not, and why is important.
(3) Missing computational cost: given the extra LLM calls via multi-agent, it is important to understand the performance cost vs performance gain trade-off
(4) Source of gain: it is useful to understand whether the source of gain is the multi-agent generation or the QC approach. What if the CLLM curation mechanism were applied to the multi-agent generation, and how would the CLLM generation perform with the new QC mechanism? I.e., important ablations are Team-then-Trim generation + CLLM curation, and CLLM generation + the new QC mechanism.
(5) Analysis of teaming: more in-depth analysis of when teaming provides value is needed. What types of datasets, what dataset sizes, when should it be used, and when does it provide minimal value?
(6) Additional LLMs: the paper only uses Llama as the backbone LLM. It is important to try different LLM architectures, different sizes, and more recent LLMs, so that we know, as of today, whether the multi-agent setup is needed and whether this holds for all LLMs.
- Please can you add the computational costs and the number of LLM calls.
- Please can you add analysis with other LLM backbones and sizes/parameter counts (and generally more recent models).
- Please can you add this ablation to understand which component (generation vs curation) drives the improvement: Team-then-Trim generation + CLLM's learning-dynamics curation vs CLLM's single-LLM generation + Team-then-Trim's 3-stage QC.
- Please can you add std dev for all results over the 10 seeds
- The datasets used are public and might be known to the LLM. Some analysis needs to be done on newer/private datasets to assess generalisation |
Fully human-written |
|
TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes TriSpec, a ternary speculative decoding (SD) framework that adds a lightweight proxy verifier between the drafter and the target model to reduce verification cost. A margin-based routing rule decides when the proxy's verification is "trusted". Several experiments on Qwen3 and DeepSeek-R1-Distill-Qwen demonstrate the effectiveness of the method.
- This paper is overall well-written.
- The idea of applying a proxy verification model offers a new angle on SD efficiency to reduce the verification time
- Several experiments demonstrate the effectiveness of TriSpec.
- Results are limited to two families (Qwen3, DSQ). It’s unclear how well the “same-family small proxy” assumption holds for other backbones, including Llama 2, Llama 3, and the Vicuna series.
- Accuracy is measured via pass@1 on math/code; there’s little analysis of generation fidelity for open-ended text or long-form reasoning where subtle proxy deviations could matter.
- TriSpec itself cannot strictly accelerate LLM inference losslessly. Even minor output differences are unacceptable in some fields, such as medicine and law.
- TriSpec adds a proxy model, which also brings additional deployment and memory overhead.
Please refer to weakness. |
Fully human-written |
|
TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes **TriSpec**, a ternary speculative decoding (SD) pipeline that inserts a **lightweight proxy verifier** between the usual drafter and the target LLM. The proxy is a smaller, same‑family model (e.g., Qwen3‑1.7B for Qwen3‑32B) that pre‑verifies drafted tokens and locally corrects the first rejection; the expensive target model is called only when the proxy’s **margin test** (top‑1 minus top‑2 probability) marks positions as untrusted. The authors extend EAGLE‑style single‑layer drafters with a small **adapter** so the drafter can be seeded by proxy or target features. Algorithm 1 and Fig. 1–3 describe the flow; two regimes are covered: proxy completes the round without target, or the target verifies the untrusted suffix with **token pruning** of proxy‑trusted tokens. Experiments on Qwen3 and DeepSeek‑R1‑Distill‑Qwen show up to ~30% higher speedup than standard SD pipelines (HASS/EAGLE‑3) at ≤1% average accuracy loss, with >50% fewer target invocations.
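For concreteness, here is a minimal sketch of how I read the margin test; the function name, the threshold variable `tau`, and the batch-free shapes are my assumptions, and the actual Algorithm 1 additionally handles the local correction at the first rejection and the pruning of proxy-trusted tokens.

```python
import torch

def untrusted_positions(proxy_logits: torch.Tensor, tau: float = 0.3) -> torch.Tensor:
    """Flag drafted positions whose proxy verification should be escalated.

    proxy_logits: (seq_len, vocab_size) logits from the proxy verifier.
    Returns a boolean mask; True means 'send this position to the target model'.
    """
    probs = proxy_logits.softmax(dim=-1)
    top2 = probs.topk(2, dim=-1).values        # (seq_len, 2)
    margin = top2[:, 0] - top2[:, 1]           # top-1 minus top-2 probability
    return margin < tau                        # low margin: untrusted
```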
1. The paper identifies verification time as a first-order bottleneck in modern SD stacks and operationalizes a clear, reproducible fix: insert a same-family proxy and gate escalation with a top-1 vs. top-2 margin. The algorithm is simple to implement atop EAGLE-family drafters.
2. The presentation and the figures are intuitive and easy to understand.
3. The experiments show large reductions in target-invocation ratio and lower per-round verification time, while keeping acceptance length stable.
1. Novelty is limited versus recent verification-side work. While the motivation to reduce target calls with a cheaper verifier is straightforward, the idea of introducing a mid-level LLM between the draft model and the target model is already well explored.
2. TriSpec achieves a better speedup ratio at the cost of losing the theoretically lossless property of speculative decoding, which is especially important in real-world applications. The method can accept proxy-approved tokens that differ from the target's. While **Appendix B** argues these are usually acceptable, there is no stress test on open-ended generation, multilingual prompts, or safety-sensitive settings where small token changes may carry large semantic shifts. Meanwhile, the paper should quantify quality shifts under temperature and across domains, not only average accuracy.
3. Missing related work. Some related works [1, 2] have already explored the idea of multi-level speculative decoding. The lack of these baselines weakens the novelty and the evidence.
4. Evaluation is narrow and controlled. The experiments are only conducted on two Qwen 32B-series models. The effectiveness of TriSpec on large-scale LLMs (>= 70B) and other LLM backbones (e.g., Llama and GLM) remains unknown.
[1] Bachmann, Gregor, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, and Jonas Kohler. "Judge decoding: Faster speculative sampling requires going beyond model alignment." *arXiv preprint arXiv:2501.19309* (2025).
[2] Narasimhan, Harikrishna, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, and Sanjiv Kumar. "Faster cascades via speculative decoding." *arXiv preprint arXiv:2405.19261* (2024).
1. Could you please specify the detailed training cost of the draft model?
2. Could you please provide more experiments on some extremely difficult tasks? Will TriSpec significantly decrease model performance? If the user's query is out of the domain of the proxy model's training data, might the proxy model give low-quality judgments?
3. Do the proxy and the target model run on the same GPU? Is the KV cache shared?
Lightly AI-edited |
|
TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
TriSpec is a speculative decoding framework that uses a small model of the same family as an approximate proxy of the target model for use in verification. Unlike classic speculative decoding, not every token is verified by the target model. Drafted tokens are first verified by the proxy model. Only when the proxy is unable to make a confident verification (indicated by low margin between top-1 and top-2 token probability) is the target model used for verification.
On math and code reasoning benchmarks, the authors show that TriSpec achieves larger speedups than baseline speculative decoding methods while seeing negligible performance drop despite the target model never validating the full output.
- TriSpec is a simple and effective idea, using small models as verifiers for a fast single-layer drafter, similar to model cascades but for verification.
- Across all domains presented in the paper, TriSpec demonstrates higher speedups compared to baselines while showing negligible performance loss compared to the target model. These results show that with the right proxy, the loss of the losslessness guarantee from classical speculative decoding will not adversely affect output quality.
- The paper only examines two model families, both based on Qwen: Qwen3 and DeepSeek-R1-Distilled-Qwen. Experiments on model families from other providers would strengthen the paper. In the paper’s current state, it is unclear whether the effectiveness of smaller model variants as proxy verifiers is particular to Qwen as a model provider.
- The paper only examines two settings: math and code reasoning. These settings may be much more structured than more general domains, better suiting proxy models. Evaluations on other domains like question answering (e.g., HotpotQA) or instruction following would make the paper stronger. It could also be interesting to see results in domains where there is a much larger performance gap between the proxy and target models.
- The evaluation set sizes are small, only 100 questions per benchmark. This, along with the lack of error bars and confidence intervals in the paper, makes it difficult to fully contextualize the results.
- Did you investigate the impact of a mis-aligned proxy (e.g., from a different model family) on accuracy/latency?
- How did you select the proxy model size? In particular, why not Qwen3 0.6B?
- Have you experimented with more layers of proxies between the draft and target model, and if so, why did you decide to have only one proxy? |
Fully human-written |
|
TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents TriSpec, a novel speculative decoding (SD) framework that introduces a proxy verifier to reduce verification cost—an often-overlooked bottleneck in SD pipelines. Unlike previous work (e.g., Medusa, EAGLE, SpecDec++) that primarily optimized the drafting phase, TriSpec focuses on verification efficiency by employing a lightweight, same-family small model to pre-verify tokens before escalating uncertain ones to the full target model. A margin-based routing criterion determines when to trust the proxy versus when to defer to the target.
The writing is very clear and easy to follow. I particularly appreciate that the authors clearly illustrate the bottlenecks that current speculative decoding systems suffer from, as shown in Figure 2. The proposed approach—based on introducing a lightweight proxy verifier to reduce verification cost—is both reasonable and well motivated. In terms of experiments, the authors conduct comprehensive evaluations on five benchmarks across two metrics (accuracy and speedup), demonstrating consistent improvements and strong empirical support for the proposed framework.
The hierarchical framework seems not entirely new; previous work such as TriForce [1] also employs a similar hierarchical framework. I understand there are some differences, but the authors should provide some discussion comparing the two. In addition, I find the preliminary observation in Figure 2(b) particularly interesting. However, I wonder whether this phenomenon persists under varied temperature settings. Intuitively, when the temperature is higher, the output distribution becomes smoother, which might weaken the reliability of the margin-based routing criterion (a tiny worked example follows the reference below). In such cases, the proposed approach may not perform as well. I hope the authors can clarify this. Moreover, the main experiments are conducted only under a fixed temperature = 0 setting. I recommend the authors evaluate their approach under more diverse temperature conditions to better assess its robustness.
[1] Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding
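As a tiny worked example of the temperature concern: for two competing logits $(2.0, 1.0)$, softmax at $T=1$ gives probabilities of roughly $(0.73, 0.27)$, i.e., a top-1 minus top-2 margin of about $0.46$, whereas at $T=2$ the scaled logits $(1.0, 0.5)$ give roughly $(0.62, 0.38)$ and a margin of only about $0.25$. A margin threshold tuned at temperature $0$ could therefore route very differently (or trust the proxy far less often) once sampling at higher temperatures is used.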
See weakness. |
Fully human-written |
|
A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents the first theoretical framework analyzing the convergence of adaptive optimizers like Adam and Muon under floating-point quantization of gradients, weights, and optimizer states. It shows that both can maintain near full-precision convergence rates if mantissa precision scales logarithmically with iterations, with Muon proving more robust to quantization errors than Adam.
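For reference, I read the quantization model as essentially the standard relative rounding-error model (my notation),

$$\widehat{x} = x\,(1+\delta), \qquad |\delta| \le 2^{-M},$$

where $M$ is the mantissa length. Requiring the relative error to be $O(1/T)$ is then the same as requiring $M = \Omega(\log T)$ mantissa bits, which is where the logarithmic precision requirement in the claimed results comes from.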
A complete quantization-error analysis under certain settings, for both Adam and Muon.
The experiments are limited, and the theory is not very informative in establishing practicality either. The novelty and contribution are limited.
- Line 402, "the second moment ($q_V$) is stricter than for the first moment ($q_M$)": this is a well-known fact from existing work, e.g., https://arxiv.org/abs/2405.03637.
- Many connections to stochastic rounding (SR) work are missing, where rounding gives unbiased estimates at the cost of higher variance. In modern low-bit training, e.g., 4-bit, SR is widely adopted; for 8-bit, MX and NVIDIA formats lead to minimal quantization errors when computing gradients. Moreover, Theorem 3 of https://arxiv.org/pdf/2502.20566 gives a somewhat simpler Adam analysis under a quantization error on $q_V$; a comparison would be better.
- The experiments are very limited and do not cover practical scenarios such as LLM training; without these, the results do not establish practicality.
- A more thorough theoretical analysis is needed; otherwise, this is just a naive extension. For example, for the condition $\beta_1^2(1+q_M)^2 < \beta_2(1-q_V)$, is the effect of $q_M$ and $q_V$ real? Is there any toy example to test it?
- For "ensuring the relative quantization errors satisfy $q_G, q_M = O(1/T)$", more explanation is needed: although this is theoretically understandable, $q_G$ is not controllable as a function of $T$ in practice, and there is also a dependency on $W$ and $W_Q$.
- Which quantization matters most among the weights, the gradients, the first moment, and the second moment? Is this supported by the theory and aligned with the empirical studies?
See weakness |
Fully human-written |
|
A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents a theoretical analysis of the convergence properties of Adam-type optimization algorithms. The authors propose a unified framework to analyze convergence rates and provide theoretical guarantees for both convex and non-convex settings. The work includes detailed proofs in the appendix and experiments on synthetic datasets and CIFAR-10 to validate the theoretical claims. The paper also discusses practical implications for hyperparameter selection and algorithm design.
S1: The paper provides the first convergence guarantee for Adam under a practical floating-point quantization model, addressing a significant gap in the literature.
S2: The theoretical analysis is rigorous and well-structured, with careful handling of quantization errors and their impact on convergence.
S3: The paper establishes concrete hyperparameter settings that ensure convergence at the same rate as full-precision Adam, making the theoretical results practically applicable.
S4: The empirical results (Figure 4) provide strong validation of the theoretical findings, showing that increased precision (larger mantissa bit-lengths) leads to smaller converged gradient norms.
S5: The paper carefully justifies the theoretical framework by establishing two foundational equivalences, demonstrating that the analysis of quantizing weighted-sum states is directly applicable to practical quantization scenarios.
W1: The paper does not sufficiently discuss the practical implications of the theoretical results for real-world applications, particularly regarding the trade-off between precision (bit-length) and computational efficiency.
W2: The convergence analysis assumes certain conditions that might be restrictive in practice, but the paper doesn't fully explore how these conditions affect real-world implementation.
W3: The empirical evaluation appears limited to a single dataset (Rosenbrock) in Figure 4, which may not be sufficient to generalize the findings across different optimization problems and model architectures.
W4: The paper doesn't provide a comprehensive comparison with other quantization techniques for optimization algorithms, making it difficult to assess how Quantized Adam compares to alternative approaches.
W5: The connection between the theoretical convergence rate and practical training performance (e.g., final model accuracy) is not explicitly established, which would strengthen the practical relevance of the results.
Q1: Could you provide more detailed analysis of the practical trade-offs between precision (bit-length) and computational efficiency in Quantized Adam? Specifically, how does the convergence rate $O(T^{-1/4})$ translate to practical training time and memory usage?
Q2: The empirical evaluation appears limited to a single dataset (Rosenbrock). Could you expand the experiments to include more diverse optimization problems and model architectures to better validate the theoretical results?
Q3: How does Quantized Adam compare to other quantization techniques for optimization algorithms (e.g., error-feedback mechanisms, stochastic rounding) in terms of convergence rate and practical performance? A comparative study would strengthen the paper's contribution.
Q4: Could you elaborate on how the theoretical convergence rate $O(T^{-1/4})$ translates to practical training performance (e.g., final model accuracy) for different quantization levels? This would help bridge the gap between theory and practice.
Q6: The paper mentions "we use a small constant $\epsilon>0$ for numerical stability" but doesn't discuss the impact of $\epsilon$ on convergence. Could you analyze how the choice of $\epsilon$ affects the convergence rate and practical performance? |
Fully AI-generated |
|
A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper provides theoretical convergence results for Adam and Muon under floating-point quantization of gradients, weights, and optimizer states (e.g., the EMA moment estimates in Adam). The authors show that both Adam and Muon achieve convergence rates of $O(1/T^{1/4})$ on smooth non-convex objectives, even under quantization.
(a). To the best of my knowledge, this seems to be the first theoretical convergence result for Adam under quantization of the two EMA moment estimates, and for Muon.
(b). The results reflect, to some extent, the effects of quantization on the convergence rate. In addition, they seem to provide the insight that Adam is sensitive to weight and second-moment quantization, while Muon is potentially more robust, as Muon allows weaker quantization-error control than Adam.
My major concerns lie in the theoretical part.
- The term $\tilde{Q}(T)$ in the convergence bound of Theorem 4.5 is not very clear. Although the authors provide a detailed expression in Eq. A.43, it is still very complicated and lacks a detailed discussion of the dependency on $T$. The dependency on $T$ is crucial since it is the dominating order in the convergence rate. I suggest providing a detailed calculation of the order of $\tilde{Q}(T)$, particularly when $\eta,\beta_2$ and terms like $q_G,q_W$ are set as in Line 372.
- Assumption 3.1 requires the compression coefficient to be $2^{-M}$, where $M$ is the mantissa length of the target floating-point format. It seems that $M$ is an important parameter for quantifying the accuracy of the quantization, yet it does not appear in the convergence bound.
- The convergence results rely heavily on sufficiently small $q_G,q_W,q_M,q_V$ (relative quantization errors), which resembles the non-quantized case. Given the existing convergence results for Adam and Muon without quantization, the results in this paper do not seem very novel. In addition, the main body of the paper lacks clear definitions of these terms.
- The convergence results require the relative quantization errors to be of order $O(1/T)$ or $O(1/T^2)$. However, under the quantization error model of Assumption 3.1, is it possible to achieve a sufficiently small relative quantization error, such as order $O(1/T)$?
- The literature, such as [1] and [2], usually considers a constant compression coefficient. Could the convergence results be extended to a broader range of compression coefficients, such as any constant within $(0,1)$? |
Fully human-written |
|
A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In summary, the paper fills an important gap: it builds on existing convergence results for adaptive optimizers but in a new setting, floating-point errors. The authors establish new convergence guarantees for Muon and Adam, showing that both methods retain rates close to their full-precision counterparts. The key assumption, the boundedness of the relative quantization error, is quite realistic and achievable in low-precision-aware architectural designs. To the best of my knowledge, no previous paper has offered similar full-precision-vs-quantization convergence rates for adaptive methods.
**Rigorous analysis and theoretical insights that align with recent practice.** Providing clear statements (Th. 4.5 and Th. 4.6), the work explains why Muon tolerates quantization better than Adam --- mostly due to an important assumption of $\beta_2\to1$ in the Adam analysis. This theoretical insight matches practitioners’ observations [1], narrowing the theory–empirical gap.
**Empirical validation confirms the theory.** Experiments on the Rosenbrock function and small fully connected models confirm the theory. For instance, Figs. 3–4 show that increasing the number of mantissa bits reduces the final gradient norms, consistent with the $\Omega(\log T)$-bit requirement.
**Framework.** The idea of jointly analyzing quantized gradients, weights, and optimizer states, whereas prior work focused mainly on gradient-only quantization, sounds promising for future research.
[1] "Beyond Outliers: A Study of Optimizers Under Quantization", 2025
**Missing discussion of convergence results for matrix-based optimizers, leaving room for improvement.** Unlike Kovalev et al. [2], who handle constrained/composite and star-convex settings, or Shen et al. [3], who exploit Hessian structure in several assumptions, the presented theory covers only unconstrained smooth non-convex functions. A discussion of the results on constrained/unconstrained LMO optimization [4] (resulting in the Scion optimizer) would also benefit the theoretical flavor of the work. Please offer any ideas on how to extend your findings to the assumptions in the mentioned works.
Additionally, can you provide an idea of how to extend your results to another promising setting, non-smooth convex functions, as this has been demonstrated to be a setup that explains LLM training fairly well [5,6]?
**Mantissa growth requirement.** For Muon, to retain the rate of the full-precision training, you assume the logarithmic grows of mantissa $M = \Omega(\log{T})$. However, in practice, the bit-width is typically fixed (e.g., 8-bit). This means convergence is to a neighborhood in fixed precision. The paper does show empirically that moderate precision suffices, but the theory only covers the increasing-bit regime. So there is a mismatch that can be left for future research. If this issue is not solved, I recommend to live it as a limitation. Otherwise, it would be nice to offer some discussion regarding this topic in the paper.
**Missing discussion of optimizers trained in low-precision formats.** State-of-the-art LLM training is typically performed in mixed precision --- optimizer states, softmax, and normalization layers in float32, with the remaining parameters in bfloat16. Recent work [1] studies optimizer behavior in quantization-aware training, running models in precisions down to 4 bits. A notable takeaway is that Shampoo consistently yields the lowest accuracy drop. I believe this is relevant because studying the convergence (in the low-precision setup) of other matrix-based optimizers that emerge as steepest descent under the spectral norm would be a direct consequence of your research. Moreover, two concurrent works [7,8] have benchmarked a zoo of optimizers at scale, showing that matrix-whitening methods are highly performant.
Naturally, extending your theoretical findings to other "matrix" optimizers would be very useful for the community. Can you give a couple of comments on how your framework could be extended to the optimizers validated in the works above?
[1] "Beyond Outliers: A Study of Optimizers Under Quantization", 2025
[2] "Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization", 2025
[3] "On the Convergence Analysis of Muon", 2025
[4] "Training Deep Learning Models with Norm-Constrained LMOs", 2025
[5] "The Road Less Scheduled", 2024
[6] "Prodigy: An Expeditiously Adaptive Parameter-Free Learner", 2023
[7] "Benchmarking Optimizers for Large Language Model Pretraining", 2025
[8] "Fantastic Pretraining Optimizers and Where to Find Them", 2025
See the **Weaknesses** part. |
Fully human-written |
|
A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents a theoretical framework analyzing the convergence of adaptive optimizers (Adam and Muon) under quantization of gradients, weights, and optimizer states, extending prior work that studied only partial components. It quantifies how these errors influence convergence and when performance remains close to full precision. Adam is shown to be more sensitive to quantization, while Muon is more robust. Results are mainly supported by synthetic experiments.
- The paper extends prior work with a rigorous convergence analysis of adaptive optimizers under quantization of gradients, weights, and momentum terms.
- It proposes a quantization schedule that aligns the behavior of quantized optimizers with their full-precision counterparts, offering insights into the sensitivity of different components to quantization error.
- The inclusion of the Muon optimizer broadens the analysis and enhances the paper’s practical relevance.
The paper omits key references and makes inaccurate claims about prior work. For instance,
- [Hou et al. 2019] analyzed not only SGD but also adaptive optimizers such as Adam under weight and gradient quantization (without the first-order momentum term, i.e., $\beta_1=0$). As shown by [Défossez et al. 2022], omitting momentum only introduces a multiplicative slowdown term, which should be acknowledged unless the new quantization error model changes this relationship.
- Another closely related but uncited work is [Ozkara et al., 2025], which studies the convergence rate of Adam under weight quantization (again without first-order momentum) and analyzes the effect of stochastic rounding, an increasingly important direction. It remains unclear whether the proposed framework can naturally incorporate stochastic rounding into the analysis (a minimal sketch of stochastic rounding is included after this list for concreteness).
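For readers unfamiliar with it, here is a minimal sketch of stochastic rounding to $M$ explicit mantissa bits; the helper name and the numpy-based implementation are my own illustration, not code from either paper. The key property is unbiasedness, $\mathbb{E}[Q(x)] = x$, which is what makes its interaction with convergence analyses different from round-to-nearest.

```python
import numpy as np

def stochastic_round(x, M, rng):
    # Stochastically round x to M explicit mantissa bits: snap to one of the two
    # nearest representable values with probability proportional to proximity,
    # so the quantizer is unbiased in expectation (E[Q(x)] = x).
    x = np.asarray(x, dtype=float)
    m, e = np.frexp(x)                      # x = m * 2**e with 0.5 <= |m| < 1
    scaled = m * 2.0**(M + 1)
    low = np.floor(scaled)
    p_up = scaled - low                     # in [0, 1): chance of rounding up
    m_q = (low + (rng.random(x.shape) < p_up)) / 2.0**(M + 1)
    return np.ldexp(m_q, e)

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
print(stochastic_round(x, 3, rng).mean())   # ~0.3 on average, despite only 3 bits
```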
Some theoretical setups are impractical. For example,
- Assuming $\beta_2 \to 1$ is unrealistic; in practice a fixed $\beta_2 < 1$ is used, in which case the term $T \log(\beta_2)$ diverges (see the short numerical illustration after this list). Besides, [Ozkara et al., 2025] already emphasizes the reliance of convergence on $\beta_2 \to 1$.
- Even accepting the assumption, the proposed quantization-error schedule that makes the error term vanish is infeasible in practice, since precision cannot be controlled dynamically without effectively reverting to full-precision training. In reality, the quantization error is set by the machine epsilon of the floating-point format and cannot be scheduled downward unless the format itself is changed during training.
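To make the first bullet concrete, a quick numerical illustration (the values of $\beta_2$ and $T$ are my own, chosen only to show the scaling) of how the $-T\log(\beta_2)$ term behaves when $\beta_2$ is held fixed, as it is in practice:

```python
import math

# With a fixed beta2 < 1, the -T * log(beta2) term grows linearly in T
# instead of vanishing; beta2 values below are typical in practice.
for b2 in (0.999, 0.95):
    for T in (10**3, 10**4, 10**5):
        print(f"beta2 = {b2}: T = {T:>6}: -T * log(beta2) = {-T * math.log(b2):10.1f}")
```

Only by sending $\beta_2 \to 1$ as $T$ grows does this term stay controlled, which is exactly the assumption questioned above.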
The experimental validation is limited and basic. There are no studies examining the influence of $\beta_2 \to 1$ or verifying how the proposed quantization-error schedule aligns with theoretical predictions.
Lu Hou et al., Analysis of quantized models. In International Conference on Learning Representations, 2019.
Ozkara et al., Stochastic rounding for LLM training: Theory and practice. In The 28th International Conference on Artificial Intelligence and Statistics, 2025.
- Around assumption 3.1, isn’t the definition of relative error and quantization error (via mantissa length) essentially the definition of machine epsilon for the underlying floating-point format? If not, what distinguishes it?
- [Ozkara et al., 2025] derives a bound involving the error term $T\sqrt{\log(1/\beta_2)}$, while this paper presents one with $-T\log(\beta_2)$. What causes this discrepancy, and could the authors provide a direct comparison? (For $\beta_2$ close to 1, $-\log\beta_2 \approx 1-\beta_2$, so the two terms scale as $T\sqrt{1-\beta_2}$ versus $T(1-\beta_2)$.) The difference implies distinct requirements for the quantization-error schedule.
- In real mixed-precision training, gradients are produced end-to-end, so quantization and matrix multiplications across multiple layers compound. How does this aggregate effect reduce to the single-quantization relative errors (machine epsilons) $q_W$ and $q_G$ assumed in the analysis? They appear to be different quantities. |
Lightly AI-edited |