ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Title Ratings Review Text EditLens Prediction
Navigating the Latent Space Dynamics of Neural Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes viewing autoencoder models as latent dynamical systems, where iterating the mapping $f=E\circ D$ defines a latent vector field and reveals attractors that capture the model’s memorization and generalization behavior. The authors connect the local contractivity of this mapping to the emergence of attractors and use them for practical analyses such as (1) distinguishing memorization vs. generalization regimes, and (2) performing data-free probing and out-of-distribution (OOD) detection by analyzing trajectories toward these attractors. In addition to theoretical connections and remarks, empirical results are shown using autoencoders, a diffusion-model autoencoder, and a large-scale vision model utilizing masked autoencoders. - The idea of treating $E\circ D$ as a dynamical system in latent space is novel and intuitive, providing a unifying perspective on autoencoders. - The framework is validated across various architectures, including autoencoders, a pretrained diffusion AE, and a large-scale vision model, showing that attractors can indeed be identified even in large-scale, complex models. - Using attractors derived from noise, without requiring access to source training data, to reconstruct meaningful images is an interesting way to explore what information the model stores in its weights, opening up interpretability and compression directions. - The paper provides clear and well-motivated definitions of concepts such as contractivity and attractors, and links them intuitively to properties of the model’s Jacobian, which helps make the overall framework more interpretable and understandable. - As the authors acknowledged in the discussion, it remains uncertain whether the approach applies to widely used forecasting or next-token prediction models, or encoder-only architectures. Since such models dominate modern representation learning, discussing whether attractors exist or can be defined meaningfully in these settings would strengthen the impact. To examine this, would training a lightweight decoder on top of a frozen encoder (without modifying the encoder weights) help reveal similar attractor dynamics? This could clarify whether the attractor framework extends beyond autoencoder-based models. - While the authors show that attractors inferred from noise can reconstruct the actual inputs, it is not entirely clear whether these attractors correspond to actual training examples, or how one can infer generalization capacity without explicitly comparing attractors to the training data. - The theoretical analysis assumes local contractivity, which potentially does not hold globally. Empirically, as long as stable attractors can be identified, the proposed approach appears to remain valid. Nevertheless, it would be good to quantify how much of the latent space exhibits convergent dynamics, characterize the stability of these attractors, and report how long or how many iterations are typically required to discover them beyond the MNIST example.
- Is the KNN analysis performed using latent embeddings of training data or the attractors identified by training data? How sensitive are the KNN results in Figure 5a to the choice of the number of neighbors K? Would similar results hold for smaller values of K or with adaptive neighborhood sizes? Also, for the proposed attractor trajectory-based scoring, is the distance computed with the nearest training attractor's trajectory or averaged across all training attractors? - Following up on the previous item, how does the proposed attractor trajectory-based scoring compare with other standard OOD detection metrics, such as the Mahalanobis distance in latent space or the reconstruction loss (MSE) in input space? - Similarly, would OOD detection performance remain similar if attractors were computed from Gaussian noise? - Please provide the definition of FPR95 where it first appears. Also, the definition in L415 can be improved by stating that it uses a threshold such that 95% of ID samples are correctly classified. - The caption of Figure 5 needs improvement; it is currently unclear which histogram corresponds to which method. Similar clarification should be added for Figure 2 for the attractor reconstructions, by specifying whether those are latent attractors or decoded outputs. - For a pretrained model, is it possible to generate attractors from random noise and gain intuition on whether the model is operating in the generalization or memorization regime? - Do the reconstructions of attractors for the pretrained models carry interpretable or semantically meaningful information? Lightly AI-edited
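For readers unfamiliar with the construction the review above refers to, the following is a minimal illustrative sketch (not the authors' code) of how a latent attractor of the map $f = E \circ D$ could be located by fixed-point iteration from a noise sample; the `encoder`/`decoder` callables, the tolerance, and the iteration budget are assumptions made for exposition.

```python
import torch

@torch.no_grad()
def find_attractor(encoder, decoder, z0, max_iters=500, tol=1e-5):
    """Iterate the latent map f(z) = E(D(z)) until it (approximately) reaches a
    fixed point, i.e., a candidate attractor. `encoder`/`decoder` are assumed
    callables mapping image <-> latent; z0 is an initial latent, e.g., the
    encoding of a Gaussian-noise image."""
    z = z0
    trajectory = [z0]
    for _ in range(max_iters):
        z_next = encoder(decoder(z))
        trajectory.append(z_next)
        if torch.norm(z_next - z) < tol:
            break
        z = z_next
    return torch.stack(trajectory), z
```

Points where the iteration stalls are candidate attractors; inspecting the Jacobian of $f$ at those points would probe the local contractivity the review asks to quantify.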
Navigating the Latent Space Dynamics of Neural Models Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work introduces a method to interpret autoencoder neural networks as dynamical systems defined by a latent vector field on their manifold. This vector field is derived by iteratively applying their encoding-decoding map. The paper claims that inductive biases introduced by standard training can be seen as emerging attractor points in these latent vector fields and proposes to leverage these vector fields as representations of the neural network for downstream tasks such as (i) the analysis of the neural network with respect to generalization and memorization, (ii) the extraction of knowledge encoded in the weights of the neural network, and (iii) as a tool to identify out-of-distribution samples. The paper presents three experiments. The first experiment investigates the relationship between generalization and memorization and the role of regularization on 30 convolutional AEs trained on small-scale datasets such as CIFAR10, MNIST, and FashionMNIST. The second experiment aims to investigate vision foundation models and probe the recovery of information about the data encoded in the models' weights. This is done on the Stable Diffusion AE and vision transformer masked AEs. The third experiment aims to demonstrate the method's expressiveness in detecting distribution shifts of input data from the latent trajectories of the vector field. I like the idea and think that this paper does well in motivating the proposed method, providing a theoretical foundation for it, and demonstrating the method's utility through the set of downstream tasks. Unfortunately, this approach is limited to encoder-decoder models, which is mentioned in the limitation section of the paper. There is one open question that I would appreciate getting answered by the authors. There exists another method that learns a lower-dimensional manifold of neural network models using an autoencoder architecture. The embeddings on this manifold are then used for several downstream tasks, revealing the encoded information of the neural network's weights. This paper and the method I have mentioned sound similar, and I would like to make sure that they are different, as I understand it. I provide more details in the questions section. - **(S1)**: I appreciate the paper's motivation and theoretical foundation. It provides an interesting view of (AE) neural networks and provides a novel tool for analysis. - **(S2)**: I think that the experimental section is honestly aiming to demonstrate the method's utility with respect to different downstream tasks and different datasets. I also appreciate the details and additional results listed in the appendix. - **(W1)**: The proposed method is limited to reconstruction-based autoencoder neural networks. The authors are aware of this, as they do mention it in the limitation section. - **(Q1)**: As mentioned above, I would like to make sure I understand the presented approach properly and do not confuse it with another method.
In [1], a lower-dimensional manifold of neural network weights is learned by an encoder-decoder setup using a reconstruction loss and a self-supervised loss. This encoder-decoder bottleneck is interpreted as the representation of a neural network, which itself can be used to reveal encoded information of the neural network. I think that this submission is different from the work mentioned, but I am not sure since the methods are similar in their terminology and ideas. [1] Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction Schuerholt et al, NeurIPS, 2021 Fully human-written
Planning at Inference: MCTS Test-Time Scaling for Long Video Generation Soundness: 3: good Presentation: 3: good Contribution: 4: excellent Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes using MCTS for planning-based long video generation, which expands an important direction in the TTT field. Through this approach, the paper even achieves long video generation results that surpass closed-source SOTA models, demonstrating the potential of TTT in long video generation. - The work has a certain degree of novelty and community value. The paper is the first to apply MCTS-based TTT to long video generation, showcasing the value of classical methods in the video domain. - The experimental results are impressive. The proposed method enables Cosmos-Predict2 to surpass or tie with closed-source SOTA models (Sora/Kling), which demonstrates the strong potential of TTT. - Tab. 5 should include a comparison of the computational cost. - Regarding the long-video baselines, the paper would be more sound if a more comprehensive set could be included [1,2] [1] FIFO-Diffusion: Generating Infinite Videos from Text without Training [2] Skyreels-v2: Infinite-length film generative model - The paper lacks discussion and comparison with several recently accepted works on long-video generation. [1] Zhao et al., Riflex: A Free Lunch for Length Extrapolation in Video Diffusion Transformers (ICML 2025). [2] Tan et al., FreePCA: Integrating Consistency Information Across Long-Short Frames in Training-Free Long Video Generation via Principal Component Analysis (CVPR 2025). [3] Lu et al., FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention (NeurIPS 2024). [4] Cai et al., DitCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation (CVPR 2025). See the Weaknesses section. Lightly AI-edited
Planning at Inference: MCTS Test-Time Scaling for Long Video Generation Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The author introduces a Multi-Tree MCTS variant that improves exploration in continuous generation spaces. 1. The paper is well written. 2. The author introduces a Multi-Tree MCTS variant that improves exploration in continuous generation spaces. It is interesting. 1. I would like to know the time it takes to generate a 1-minute video with and without using your MCTS, and provide a quantitative comparison of the results. 2. The biggest issue with video generation is the excessive time consumption. This MCTS could make generating a long video take 24 hours, potentially requiring 20 times more time. 3. It is difficult to implement. The biggest challenge of this model is the accurate training of the Process Reward Model and Outcome Reward Model. As we know, video quality is hard to evaluate (the error rate of evaluation is high). Any slight error in the evaluation of these two models could lead to a massive search error. 4. MCTS does not have good robustness for the Process Reward Model and Outcome Reward Model. 5. I believe the author should focus on reinforcing the video model with reinforcement learning instead of using TTS, as it is a more efficient and practical solution. see weakness Lightly AI-edited
Planning at Inference: MCTS Test-Time Scaling for Long Video Generation Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper frames long video generation as sequential decision making and proposes test time search with Monte Carlo Tree Search to plan over chunked continuations, guided by a process reward model for local chunk quality and an outcome reward model that aggregates scores over the full sequence. The method is model agnostic, sits on top of existing backbones without retraining, and introduces a multi tree variant to widen exploration in continuous spaces. Across several generators, the approach improves temporal consistency and object permanence relative to autoregressive decoding, Best of N, greedy, and beam search, and reports longer, competitive quality videos when compared qualitatively and with automated metrics to recent long video systems. The paper provides algorithmic details, ablations on compute budget, and comparisons of single tree versus multi tree search, while also acknowledging dependencies on the underlying generator and verifier quality. 1. Clear formulation of long video generation as planning with Monte Carlo Tree Search, including a walk through of selection, expansion, rollout, and backpropagation plus an explicit UCB objective. 2. Multi tree search broadens exploration under a fixed branching factor and empirically outperforms single tree for the same budget. 3. Practical recipe that is plug in and does not require retraining, which increases utility for current systems constrained by backbone quality. 1. Heavy reliance on automated reward signals for both search guidance and evaluation, with outcome reward defined as a simple sum over chunks, risks overfitting to verifier idiosyncrasies rather than human preference on long horizon coherence. A controlled human study is missing. 2. The exploration constant, branching factor, rollout policy, and beam initialization depth can strongly affect MCTS behavior. Sensitivity analysis is not comprehensive. 1. How sensitive are results to the weighting of VideoScore, CLIP alignment, and the LAION perceptual model in the process reward, and to the definition of the outcome reward as a sum rather than a learned temporal model 2. Under a fixed wall clock and identical hardware, how does the method compare to beam and greedy tuned for the same final runtime, including beam initialization time and rollout parallelism Heavily AI-edited
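As a concrete reference point for the selection step and the "explicit UCB objective" the review above mentions, here is a generic UCB1 selection rule of the kind used in MCTS; the dictionary-based node layout and the exploration constant are illustrative assumptions, not the paper's implementation or reward weighting.

```python
import math

def ucb_select(children, c_explore=1.4):
    """Pick the child (e.g., a candidate chunk continuation) with the highest
    UCB1 score. Each child is a dict holding its cumulative reward `value`
    and visit count `visits`."""
    parent_visits = sum(ch["visits"] for ch in children)

    def ucb(ch):
        if ch["visits"] == 0:
            return float("inf")  # always try unvisited continuations first
        exploit = ch["value"] / ch["visits"]
        explore = c_explore * math.sqrt(math.log(parent_visits) / ch["visits"])
        return exploit + explore

    return max(children, key=ucb)
```

In a multi-tree variant of the kind the paper describes, this selection would simply be run independently within each tree under a shared compute budget.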
Identity-Preserving Human Reconstruction from a Single Image via Explicit 3D Reasoning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents the Identity-Preserving Large Human Reconstruction Model (IPRM), a feed-forward framework that reconstructs clothed 3D humans from a single in-the-wild image. IPRM anchors monocular 3D reasoning for human reconstruction by constructing a human-based 3D feature space and explicitly preserves human identity and details through these 3D features. Specifically, it introduces an SMPL-based sparse voxel representation to transform 2D identity features into 3D space, categorizing them as 3D visible identity tokens and invisible tokens to be reasoned. Using these 3D tokens, an identity-aware 3D reasoning module is proposed to propagate projected 3D identity features from visible to invisible tokens. Then, IPRM introduces an encoder-decoder structure to decode SMPL-based 3D features into 3DGS and mesh representations, and designs a 3D ID Adapter for identity preservation. Experiments on existing benchmarks and in-the-wild data show that IPRM outperforms state-of-the-art methods. - This paper introduces a method for directly reconstructing 3D humans while preserving 3D identity features via 3D token reasoning on an SMPL-based 3D sparse voxel representation. - It proposes an identity-aware 3D reasoning module, which includes visibility mask-based self-attention blocks to maintain the consistency of human 3D identity features during the 3D reasoning process, and a 3D Human Feature for further refinement with human-specific knowledge. - IPRM supports decoding into diverse 3D representations, including 3DGS and mesh. Additionally, it introduces a 3D ID Adapter as critical 3D guidance to mitigate identity drift at the 3D token level, enhancing identity consistency throughout this process. - IPRM achieves efficient inference of 3D human representations from image features in approximately 0.6 seconds. Qualitative and quantitative evaluations validate the framework’s effectiveness over existing methods. - This method relies on the sparse voxel representation for feature projection and 3D reasoning. However, the paper does not specify the chosen voxel grid resolution nor provide a comprehensive ablation study on how this critical hyper-parameter affects reconstruction quality, memory consumption, and inference speed. - The submission lacks essential validation in the form of multi-view rendering videos (e.g., 360-degree rotations). While static novel view images are provided, they are insufficient to conclusively demonstrate the robustness. This makes me feel less confident about the effectiveness of the method. - The primary contribution of this work is stated as improving identity preservation. However, the qualitative comparisons presented in the supplementary material (e.g., Figure 8, 2nd row) suggest that existing methods like PSHuman and LHM appear visually superior or more accurate in preserving facial identity than the proposed IPRM. - In the identity-aware 3D reasoning module, instead of using self-attention with a mask, how about using cross-attention to query features from the visible tokens for the invisible tokens?
- The ablation study in Table 5 indicates that the inclusion of the dedicated 3D ID Adapter provides only marginal improvements in standard reconstruction metrics (PSNR: 28.66 vs. 28.96; SSIM: 0.953 vs. 0.954) over the baseline. Please clarify. 1. Clarity on Voxel Representation and Efficiency 2. Validation of 3D Plausibility 3. Justification of Visibility Mask-based Self-Attention 4. Addressing Identity Preservation Discrepancy 5. Alternative 3D Reasoning Architectures 6. Justifying the 3D ID Adapter See weaknesses for details. Fully human-written
Identity-Preserving Human Reconstruction from a Single Image via Explicit 3D Reasoning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces IPRM (Identity-Preserving Human Reconstruction Model), a feed-forward framework that reconstructs clothed 3D humans from a single in-the-wild image while aiming to preserve identity. Unlike prior approaches that mainly rely on 2D features, IPRM uses a SMPL-based sparse voxel representation to project 2D identity cues into 3D space. It distinguishes between visible tokens (identity-preserving) and invisible tokens (to be reasoned), and applies an identity-aware reasoning module together with a 3D ID Adapter to prevent identity drift during decoding. Experiments on benchmarks such as THuman2.1 and CustomHuman demonstrate improvements over baselines like PSHuman, LHM, and Trellis, reporting stronger identity preservation and higher efficiency The design of visible/invisible token separation and the 3D ID Adapter provides a clear mechanism to address identity drift, which is a common problem in this area. 1. Unclear Robustness to SMPL Errors The method heavily depends on SMPL estimation, but the robustness to inaccurate SMPL poses is not systematically studied. It is also unclear in the experiments whether SMPL ground truth or estimated poses were used at test time. 2. Invisible Token Dependency on SMPL Geometry The visible/invisible token split is derived from SMPL geometry. This could fail for subjects with loose or complex clothing that deviates substantially from SMPL, raising doubts about generalization. It is also unclear whether the proposed system would work if the input image is truncated or occluded by an object or other humans. 3. Limited Qualitative Evidence Qualitative comparisons are shown with very small image sizes, without zoom-ins on faces. This makes it hard to judge whether identity is truly preserved or whether artifacts remain. No video results are provided, so multi-view or 360° consistency cannot be assessed. 4. Lack of Animation Capability Competing methods like LHM support animation of reconstructed avatars, while IPRM is limited to static reconstructions, restricting its applicability. Please see Weaknesses. Lightly AI-edited
Identity-Preserving Human Reconstruction from a Single Image via Explicit 3D Reasoning Soundness: 3: good Presentation: 2: fair Contribution: 1: poor Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper presents IPRM, a feed-forward framework that reconstructs photorealistic 3D clothed humans from a single image in ~0.6 seconds while preserving identity consistency. IPRM performs 3D token reasoning directly in SMPL-based sparse voxel space: it projects 2D identity features into 3D voxel space, classifies voxels as visible (identity tokens) or invisible (to be reasoned), and uses visible tokens to infer invisible regions while explicitly preserving identity, which seems reasonable. Good performance is shown in the evaluation. 1. The identity-aware 3D reasoning module (although I don't think it is reasoning) with visibility mask-based self-attention explicitly preserves visible identity tokens. 2. The 3D ID Adapter provides token-level guidance to prevent identity drift during decoding. 3. The paper includes extensive quantitative and qualitative comparisons on multiple datasets (THuman2.1, Synthetic Data, CustomHuman) with both 3DGS and mesh reconstruction. 1. "REASONING" is overclaimed. I don't clearly see the reasoning part. While the overall framework is reasonable, individual components (sparse voxels, cross-attention for conditioning, SMPL priors) are adaptations of existing techniques. The main contribution is the integration rather than fundamentally new methods. 2. The method heavily relies on accurate SMPL estimation from the input image. The paper doesn't thoroughly analyze failure cases when SMPL estimation is poor or discuss robustness to SMPL errors. 3. The authors acknowledge that the sparse voxel representation limits fine detail reconstruction. This is a significant limitation for applications requiring high-fidelity details (e.g., facial wrinkles, clothing textures). 4. Most quantitative evaluations are on controlled datasets with ground truth. More extensive evaluation on truly in-the-wild images would strengthen the claims. Generalization: How well does IPRM generalize to: Extreme poses not well-represented in SMPL? Very loose clothing that significantly deviates from body shape? Occluded body parts? Computational Breakdown: Can you provide a breakdown of inference time across different components (voxel projection, reasoning module, decoder)? Qualitative Failure Analysis: Can you show and discuss failure cases to better understand the method's limitations? Fully AI-generated
Identity-Preserving Human Reconstruction from a Single Image via Explicit 3D Reasoning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Reconstructing 3D digital humans from single-view images is a hot topic. The common approach in existing works is to directly use 2D features for 3D reasoning. This work argues that this makes it challenging to preserve 3D identity. Thus, a novel method is presented that first projects 2D features into an SMPL-guided 3D space to construct a sparse voxel representation, and then a 3D reasoning module is designed to propagate features from visible to invisible regions. Experiments verify that the proposed method outperforms existing methods. - the motivation is very clear and the proposed design is also reasonable. - the visual results of the proposed method, as shown in Fig 4, are obviously better than the others, especially for the identity. Although the method seems reasonable to me, I have several concerns about the results: - Among all examples in Figs. 4, 8, 9, and 10, many of the input images look like they were rendered from 3D assets. So, why not just use in-the-wild images? This makes me doubt the generalization ability of the proposed model. - For some examples, like the middle one of Fig 4, although IPRM produces a better face, some details are missed. For example, the ropes of the hat are missing, while both LHM and PSHuman can produce those details. What are the possible reasons? - For some examples, such as the second one in Fig 8, the color is also changed by IPRM (as seen in the upper-body region). What are the reasons? It seems from the results that only the face region produced by the proposed method shows obviously better quality. I am curious whether putting more attention on the face region during the training of previous methods would also work (for example, adding an extra loss function on the face part). No. Fully human-written
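To make the "visibility mask-based self-attention" discussed in the IPRM reviews above more concrete, here is a small illustrative sketch of one plausible masking scheme, in which visible identity tokens are shielded from attending to the invisible tokens being reasoned; the masking direction, the shared projections, and the single-head simplification are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def visibility_masked_attention(tokens, visible):
    """Single-head self-attention over voxel tokens.

    tokens:  (N, D) voxel token features
    visible: (N,) boolean mask, True for identity-preserving (visible) tokens

    Invisible tokens may gather information from all tokens, while visible
    tokens are restricted to attend only among themselves, so their identity
    features are not overwritten by still-uncertain invisible tokens.
    """
    q = k = v = tokens                                  # learned projections omitted for brevity
    scores = q @ k.T / tokens.shape[-1] ** 0.5          # (N, N) attention logits
    block = visible[:, None] & ~visible[None, :]        # visible query -> invisible key: blocked
    scores = scores.masked_fill(block, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

The cross-attention alternative raised in the first IPRM review would instead use only the invisible tokens as queries and only the visible tokens as keys/values.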
Rethinking CLIP for Long-Tailed Class-Incremental Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper aims to address the challenging problem of exemplar-free, long-tailed class-incremental learning with CLIP model. Specifically, this paper proposes a suite of synergistic techniques: (1) Two-Stage Hybrid Augmentation for robust tail-class feature augmentation; (2)Tail-aware Semantic Shrinkage to correct tail-class statistics; (3) Adaptive Margin Hinge Loss to refine decision boundaries; (4)Mode-Connectivity Spherical Merge for stable integration of knowledge into the single adapter. Experiments on the challenging ImageNetSubset-LT benchmark validate the proposed approach. 1. It is an interesting idea to fully utilize CLIP’s pretrained feature space to prevent class interference for class incremental learning. 2. Experiments on the challenging ImageNetSubset-LT benchmark validate the proposed approach. 1. The manuscript can be further improved. First, in section 3.1, although the author argues that RAPF fails in Figure1, it seems that figure 1(a) and 1(b) are quite similar. Why the author argues that RAPF fails in Figure1 needs further clarification. Similarly, Figure 2 doesn't clearly support the failure of Inter-Task Preservation and further discussion is needed. Second, while the paper conducted experiments on two datasets, only the results for one dataset are included in the main paper, which makes the empirical validation less convincing. Third, the discussion of related work is not sufficient. There is a vast amount of literature of class incremental learning, but the paper only discusses a few of them. 2. The empirical validation is not sufficient. The paper conducted experiments on ImageNetSubset-LT and CIFAR100. Although the performance on ImageNetSubset dataset is better than baselines, the proposed method doesn’t outperform all the baselines on CIFAR100. Further empirical validation on more datasets is needed to support the advantage of the proposed method. Refer to [1] for more datasets (and/or exemplar-free CIL baselines) in CIL validation. [1] External Knowledge Injection for CLIP-Based Class-Incremental Learning 1. Duplicated paragraph(line 137), typo(figure 1 caption: class 38 or class 36), texts and legends need to be larger in figure 5(c), 1(a), 1(b) 2. In section 3.2.2, why does the paper shrink the tail classes toward their semantically related head-class neighbors? How does the paper define and find the neighbors? Will the shrink lead to confusion in classification, as shrinking mixes up class statistics? 3. What is the explicit definition of $f_{edge}$? What is the procedure to get it ? 4. In Table 4, what is the main difference between “Baseline(Adapter + SG)” and “Baseline w/ MC-SMeRG”? Does “Baseline(Adapter + SG)” have a merge process? Fully human-written
Rethinking CLIP for Long-Tailed Class-Incremental Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses long-tailed class incremental learning (CIL) with pre-trained vision-language models like CLIP. The authors propose a unified framework with three stages: Intra-Task Stabilization (Two-Stage Hybrid Augmentation), Inter-Task Preservation (Tail-Aware Semantic Shrinkage and Adaptive Margin Hinge Loss), and Knowledge Consolidation (Mode-Connectivity Spherical Merge). Using only a single lightweight adapter, the method efficiently leverages CLIP’s priors to handle data imbalance and forgetting. Experiments on long-tailed ImageNetSubset and CIFAR-100 show clear improvements over prior CLIP-based CIL methods. 1. Clear three-stage design: Provides a structured and logical approach to long-tailed CIL. 2. Lightweight and efficient: Uses a single adapter while maintaining strong performance. 3. Extensive experiments: Demonstrates consistent gains and robustness across multiple long-tailed benchmarks. 1. The paper does not clearly justify the necessity of studying long-tailed class-incremental learning (CIL) in the context of large-scale pre-trained vision-language models like CLIP. The proposed problem setting appears more like a forced combination of two popular research topics without sufficient motivation or evidence that such integration introduces new challenges or insights. 2. The paragraphs on lines 104 and 135 in Section 2.2 are exactly repeated. 3. The two t-SNE visualizations in Figure 1 do not show clear qualitative differences. In fact, the class boundaries appear reasonably well separated, which contradicts the claim that they “appear diffuse and weakly separated.” The numerical differences reported in the accompanying table are also small, and the paper does not explain how these numbers are computed. Without this clarification, the visual and quantitative evidence does not convincingly support the authors’ argument. 4. Most of the proposed techniques rely on the assumption that data features follow a Gaussian distribution. This is an idealized and often unrealistic assumption for real-world visual data. The paper lacks discussion, justification, or empirical verification of this assumption, which weakens the theoretical soundness of the approach. 5. In Table 4, it is unclear why the “Last” metric is chosen as the main indicator for ablation studies. The authors should provide a stronger explanation for this choice. Moreover, the reported results show that the Two-Stage Hybrid Augmentation (TSS) brings only marginal improvements, which raises questions about its effectiveness and contribution. 6. The paper does not include any analysis or discussion of hyperparameters. Since the proposed framework contains several key parameters , a sensitivity study would be necessary to evaluate robustness and reproducibility. See weakness. Lightly AI-edited
Rethinking CLIP for Long-Tailed Class-Incremental Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses long-tailed class incremental learning (CIL) with pre-trained vision-language models like CLIP. The authors propose a unified framework with three stages: Intra-Task Stabilization (Two-Stage Hybrid Augmentation), Inter-Task Preservation (Tail-Aware Semantic Shrinkage and Adaptive Margin Hinge Loss), and Knowledge Consolidation (Mode-Connectivity Spherical Merge). Using only a single lightweight adapter, the method efficiently leverages CLIP’s priors to handle data imbalance and forgetting. Experiments on long-tailed ImageNetSubset and CIFAR-100 show clear improvements over prior CLIP-based CIL methods. 1. Clear three-stage design: Provides a structured and logical approach to long-tailed CIL. 2. Lightweight and efficient: Uses a single adapter while maintaining strong performance. 3. Extensive experiments: Demonstrates consistent gains and robustness across multiple long-tailed benchmarks. - Limited Generalization to Other Architectures: The method is exclusively evaluated on CLIP ViT-B/16; its compatibility and performance with other vision-language models or backbone architectures (e.g., ViT-L/14, Flamingo) remain untested. - Sensitivity to Hyperparameters in Complex Scenarios: Although the ablation study shows stability for key hyperparameters (K=3 for TSS, τ=9 for MC-SMERG), the performance under extreme parameter variations or cross-dataset hyperparameter transfer is not thoroughly discussed. - Lack of Analysis on Computational Overhead: While the adapter is lightweight, the two-stage augmentation and virtual validation set construction for MC-SMERG may introduce additional computational costs, which are not quantified or compared with baselines. - Insufficient Discussion on Semantic Similarity Metrics: The paper uses CLIP text embeddings to measure class similarity for TSS and AMHL, but alternative metrics (e.g., visual feature similarity) are not explored, leaving room for further optimization. - The figures and tables in the paper are confusing. There are some figures showing no difference between the proposed approaches and baselines. - Please provide a more comprehensive analysis of intra-class variance, such as in Figure 1; it should be a statistic over multiple classes rather than an analysis of a few categories. - Can the results be compared in the exemplar-based setting, since some other methods are designed to use exemplars? Fully AI-generated
Rethinking CLIP for Long-Tailed Class-Incremental Learning Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This submitted paper addresses the performance degradation of existing CLIP-based methods in long-tailed class-incremental learning (long-tailed CIL) and proposes an exemplar-free framework relying solely on a single lightweight adapter. First, the authors conduct the first in-depth study on the limitations of modern CLIP-based CIL methods in realistic long-tailed scenarios, highlighting the need to maximize the utilization of pre-trained knowledge rather than relying on complex add-on modules. Second, the framework solves core challenges through a synergistic three-stage design: in the Intra-Task Stabilization stage, Two-Stage Hybrid Feature Augmentation (initialization based on text priors and k-NN data-driven refinement) enhances the robustness of tail-class representations; in the Inter-Task Preservation stage, Tail-aware Semantic Shrinkage (correcting tail-class statistical bias using head classes) and Adaptive Margin Hinge Loss (dynamically adjusting boundaries to protect tail classes) alleviate forgetting and inter-class confusion; in the Knowledge Consolidation stage, Mode-Connectivity Spherical Merge (SLERP interpolation and virtual validation set optimization) realizes stable unification of old and new adapters. Finally, experiments on ImageNetSubset-LT (100/200/300 classes) and CIFAR100-LT show that the method consistently outperforms existing SOTA approaches (including exemplar-free and exemplar-based methods), with the performance advantage expanding as the number of classes increases (more severe long-tail conditions), verifying its scalability and efficient learning ability for tail classes, and providing an effective path for practical incremental learning systems. 1. The paper proposes a three-stage framework that systematically addresses instability and forgetting issues in long-tailed class-incremental learning. 2. The authors demonstrate that using a single shared lightweight adapter is sufficient to achieve significant performance gains, showing better practicality and scalability. 3. The two key components, TSHA and TSS, are specifically designed to tackle long-tailed scenarios, effectively mitigating the degradation of tail-class representations. 1. Although the overall framework is well-structured, most components are combinations or refinements of existing ideas, lacking strong novelty or justification. 2. The paper does not provide an explanation for why interpolation along a low-loss path in the weight space used in MC-SMERG can preserve old knowledge and prevent forgetting. 3. While the authors emphasize that the method is lightweight and efficient, the paper only reports the adapter fusion time (8.3 ms per task) and lacks a detailed comparison of total training time, GPU memory usage, and parameter size against baselines. How sensitive is the approach to the imbalance ratio ρ and task ordering? Other questions see weakness. Lightly AI-edited
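For context on the spherical merge (SLERP) step referenced in the reviews above, the following is a generic spherical-interpolation sketch over flattened adapter weight vectors; it illustrates only the textbook operation and omits the paper's mode-connectivity analysis and the virtual-validation-set tuning of the interpolation coefficient, so it should not be read as the actual MC-SMeRG procedure.

```python
import torch

def slerp_merge(theta_old, theta_new, t):
    """Spherical linear interpolation between two flattened adapter weight
    vectors, with t in [0, 1] controlling how far to move toward the new
    task's solution."""
    a = theta_old / theta_old.norm()
    b = theta_new / theta_new.norm()
    omega = torch.arccos(torch.clamp(a @ b, -1.0, 1.0))   # angle between the two solutions
    if omega.abs() < 1e-6:                                 # nearly parallel: fall back to lerp
        return (1 - t) * theta_old + t * theta_new
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * theta_old + (torch.sin(t * omega) / so) * theta_new
```

Compared with plain linear averaging, interpolating along the sphere keeps the merged weights at a comparable norm, which is one common motivation for SLERP-style merges.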
PISA: A Pragmatic Psych-Inspired Unified Memory System for Enhanced AI Agency Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces PISA, a unified memory system to improve an agent’s adaptability to diverse tasks and enhance task-oriented decision-making. Inspired by cognitive psychology, the authors propose a schema-based memory representation that enables agents to actively construct, refine, and retrieve relevant knowledge. The framework demonstrates improved performance in memory-dependent agent tasks, and the authors further design and evaluate the AggQA benchmark to assess task-oriented memory management. The results indicate that PISA offers advantages in organizing and utilizing memory for effective task execution. - **S1**. The authors propose a schema engine that supports a constructive and adaptive hybrid memory system, allowing more structured and flexible memory management. - **S2**. The design of the AggQA task effectively demonstrates the task-oriented properties and practical advantages of the proposed PISA framework. - **S3**. Including ablation studies and evolutionary threshold analyses provides valuable insights into the contributions and behavior of different model components. - **W1**. The criteria and methodology for defining the Meta, Element, and Element-Value categories remain ambiguous. - **W2**. The ablation study results show that initialization is critical, and the system depends heavily on it. However, there is insufficient explanation regarding how initialization is performed using prior knowledge and how the schema configuration may vary across different domains. - **W3**. In the PISA framework, temporal relationships and identity tracking for similar entities (e.g., Dog1 vs. Dog2) are not adequately addressed, limiting clarity on how memory maintains continuity over time. - **W4**. The adaptive processing modules introduce schema-level similarity scores for schema matching and element-level compatibility scores for element matching, yet the operational details of these computations are not fully described. More elaboration is needed as these components appear essential for memory management within the PISA framework. - **Q1**. How is the boundary between Meta and Element determined, and what prevents overly broad or excessively fragmented schema definitions? - **Q2**. Following Q1, are the keywords used for Meta, Element, and Value guaranteed to be unique? If so, how does the system handle cases where a keyword that appears as a Meta category should also appear as an Element or Value in a different context? - **Q3**. How does the framework address temporal dependencies—such as ordering and causality—during memory retrieval and decision-making? - **Q4**. What mechanisms ensure unique identity tracking when multiple instances share the same Meta category (e.g., two different entities both categorized under “Dog”)? Lightly AI-edited
PISA: A Pragmatic Psych-Inspired Unified Memory System for Enhanced AI Agency Soundness: 2: fair Presentation: 4: excellent Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors of this paper propose a unified memory system for AI agents. The system is based on the concept of *schemas*, from the Piaget's schema theory. A schema encodes a structured knowledge, containing a meta (drink, music, ...), an element (coffee, jazz...), the value linked to the element (e.g. coffee details and preferences) and an experience identification. The system itself proceeds in three steps: in a first step, the schemas are initialized based on existing data; then the schemas are updated over time continuously, with the possibility of assimilation (new values of an element), evolution (new element of a meta) or creation (new meta). At that stage, a conflict analysis is also performed. Finally, the system contains a retrieval module, that can query the memory pool with different tools (RAG, SQL, calculator). The system is evaluated on LOCOMO and AggQA. AggQA is a new benchmark proposed by the authors, containing 134 questions from financial and medical domains (example of question: "What was the average closing price of Pinterest stock from April 1 to April 30, 2024?"). The evaluation is performed with gpt-5-mini against existing competitors (Mem0, LangMem, etc.) and direct answer by the model using full context. On LOCOMO, PISA shows better results compared to existing competitors, but is not better than full context. On AggQA, PISA outperforms both competitors and direct full context prompting. - the concept of schemas is well explained and gives a structured way to represent the memory history, with some flexibility in creating and adapting memories - consistent improvement against other memory retrieval methods on locomo (excluding the adversarial category) - I checked the availability of both source code and data - schema management seems simplistic for real cases (see questions below) - the introduction of schemas does not seem to be the main component in the good performance - the experiments do not clearly show the improvement compared to full context ## Justification that the use of schemas is flexible and provides a significant improvement The use of schemas is the central component of the system and of the paper. - Q. Can you justify that the improved performance with the PISA system is because of the use of schemas instead of the retrieval module? In particular, how similar are the retrieval tools of the competitor methods? The schema modelization seems simplistic. For formulating this issue objectively, can you explain how the mechanisms interact in those cases: - Q. From example in Fig. 6. What about "The cat of my neighbor has 3 legs"? This may be true for this particular cat. How it should be handled ideally (the "cat" schema is updated? a new specific schema is created?) - Q. From example in Fig. 1. What about "I like coffee with milk only when the music is jazz". How this is integrated? - Q. Is it possible to update multiple schemas at once? 
## Comparison with full context retrieval Authors identify two challenges using AI agents: "risk of exceeding context window" and "irrelevant information may interfere with the model judgment". While I agree with the authors, the experiments against the full context size are not convincing on this aspect: comparing Tab. 2 and Tab. 3 on the AggQA financial dataset, without the retrieval module (fair comparison with full context), the results show that PISA (without the retrieval module) is not doing better than full context. - Q. Can you confirm that in the full context experiment, no tool can be used - Q. PISA is using both memory management and retrieval tools. Is it possible to compare in a fair way whether the proposed memory management is useful w.r.t. full context? ## Other questions - Q. I don't understand clearly the initialization stage. Can you explain this phase in your experiment? What is the provided data at that stage? - Q. why removing the adversarial category of locomo in Table 1? - Q. what is the list of the memory pool buckets? In Fig. 1: user trait, user event, relationship; then in Fig. 3: user events, agent events. Is it fixed at the initialization stage? - Q. How important is the conflict detection module in the experiment? Fully human-written
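As an illustration of the schema structure and the tri-modal update (assimilation / evolution / creation) that the PISA reviews above describe, here is a toy sketch; the field names and the flat meta-to-schema pool are assumptions made for exposition, not PISA's actual data model or API.

```python
from dataclasses import dataclass, field

@dataclass
class Schema:
    """Illustrative schema record following the meta/element/value layout the
    reviews describe (e.g., meta='Drink', element='Coffee')."""
    meta: str
    elements: dict = field(default_factory=dict)       # element -> list of values
    experience_ids: list = field(default_factory=list) # links back to source conversations

def update_memory(pool, meta, element, value, exp_id):
    """Toy tri-modal update: assimilate into an existing element, evolve an
    existing schema with a new element, or create a schema for an unseen meta."""
    schema = pool.get(meta)
    if schema is None:
        pool[meta] = Schema(meta, {element: [value]}, [exp_id])   # creation
        return "creation"
    if element in schema.elements:
        schema.elements[element].append(value)                    # assimilation
        action = "assimilation"
    else:
        schema.elements[element] = [value]                        # evolution
        action = "evolution"
    schema.experience_ids.append(exp_id)
    return action
```

A statement like "I like coffee with milk only when the music is jazz" (the case raised above) does not fit cleanly into a single schema in this flat layout, which is exactly the flexibility question the reviewer is asking the authors to address.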
PISA: A Pragmatic Psych-Inspired Unified Memory System for Enhanced AI Agency Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a new memory system designed to enhance AI agents in long-term conversational and data analysis tasks. Drawing inspiration from Piaget's cognitive development theory, PISA introduces a three-module architecture: an Initialization Module that constructs task-oriented memory structures through schema creation, an Adaptation Module that implements a tri-modal mechanism (assimilation, accommodation via schema evolution, and accommodation via schema creation) for updating memories, and a Retrieval Module designed as a ReAct AI agent (Yao et al., 2023). The memory is organized hierarchically into schemas containing meta (category), element (specific instance), element-value (attributes), and experience ID (linking to original conversation). The proposed method is evaluated on the LOCOMO benchmark and a newly introduced AggQA dataset using an LLM-as-a-judge metric, demonstrating performance improvements over current state-of-the-art systems. **Novel memory architecture**: The authors introduce a psychologically-inspired approach to memory storage for AI agents that exploits hierarchical schema structures to create, access, and update memories more effectively. **Comprehensive system description**: The paper provides detailed explanations of each architectural component and presents algorithms necessary for system functioning, with a commitment to make code publicly available. **New evaluation benchmark**: The authors introduce AggQA dataset, to evaluate models on data analysis tasks across medical and finance domains with varying difficulty levels. **Presentation clarity**: Section 2 suffers from verbose, nested explanations that would benefit from more formal mathematical notation. Concepts introduced in the "Notation and Symbols" table (line 220) are insufficiently explained in the main text and Figure 3, such as multiple elements belonging to the same schema. **Unclear attribution of contributions**: For the Retrieval Module (Section 2.4 and Figure 4), the distinction between components taken from Yao et al. (2023) and those newly introduced or adapted for PISA is not clearly delineated. **Ad-hoc design concerns**: Throughout the paper, extensive notation is introduced (Memory Pool, Buckets, Schema, Element, etc.) without clarifying whether these are created specifically for this work or inspired by existing literature. The categorization of queries in Section 2.4 appears constructed to target specific task types found in the evaluation datasets (e.g., Regional Fact Retrieval -> LOCOMO's single-hop, Multi-Fragment Reasoning -> LOCOMO's multi-hop and temporal, Aggregation -> LOCOMO's open domain and AggQA). **Evaluation metric inconsistencies**: Original publications of compared models (Section 3.2) for which LOCOMO evaluations exist consistently report F1 and BLEU scores, while this paper relies solely on LLM-as-a-judge evaluation. Only Chhikara et al. (2025) reported LLM-as-a-Judge scores. Notably, relative LLM-as-a-judge scores between models show near-zero correlation with scores reported by Chhikara et al. 
(e.g., for open-domain: Chhikara reported Zep=77, A-mem=54; this paper reports Zep=39, A-mem=50; for multi-hop: Chhikara reported Zep=41, LangMem=48; this paper reports Zep=14, LangMem=32). This suggests the metric is unreliable or should be complemented by additional metrics such as F1 or BLEU. **Questionable importance of core mechanism**: Figure 5 demonstrates that changing $\theta_{meta}$ from 0.85 to 0.95 causes substantial behavioral changes (shift from assimilation to schema creation in Figure 5b), yet evaluation scores remain relatively stable (Figure 5a). This suggests the schema/bucket organization may not be critical for task performance. **Missing computational efficiency analysis:** The authors never report computational performance metrics (e.g., inference time, memory consumption, number of LLM calls, retrieval latency) of PISA compared to baseline methods. **Incomplete experimental analysis**: The $\theta_{meta}$ hyperparameter sweep is only performed on the LOCOMO dataset. Ablation studies are incomplete: PISA without Initialization is only evaluated on AggQA's medical domain, and PISA without Retrieval only on AggQA's finance domain. **Minor**: - Code is not currently accessible from the provided link. - LOCOMO's adversarial questions are introduced in the main text but not used for evaluation. - Table 1 shows that providing full context to the model outperforms all memory systems on the LOCOMO benchmark. How do the authors reconcile this result with the paper's central premise that memory systems should address the limitations of full-context approaches (line 42: "excessive historical information [...] risks exceeding the model's context window; [...] amount of irrelevant information contained may interfere with the model's judgment and decision-making")? - What accounts for the observation that both PISA and the full-context baseline achieve superior performance on AggQA's medium difficulty level compared to the easy difficulty level? Lightly AI-edited
PISA: A Pragmatic Psych-Inspired Unified Memory System for Enhanced AI Agency Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes PISA, a task-oriented memory system for LLM agents, which is an active research subject. The paper lends an air of theoretically principled design (by pompously recalling Piaget & Cook -- already erroneously cited as Piaget, Cook *et al.*, as if there were a third unknown coauthor) to a quite practical implementation of a memory structured around schemas, which can be assimilated/accommodated (adding values into existing ones) or created ex novo (for new concepts). The paper shows relatively simple experiments on one public (LOCOMO) and one purposely built benchmark. In short, the paper is lightweight in contribution and evaluation, and the overall contribution seems quite thin. Jean Piaget and Margaret Cook. The Origins of Intelligence in Children, volume 8. International Universities Press, New York, 1952. - memory in LLMs is an active topic - the paper is clearly written (details of the weaknesses are expanded below in the questions section) - the paper has a lightweight contribution - evaluation is lightweight and quite superficial - evaluation benchmark results appear lower than what is publicly reported ## Details about weaknesses - the paper has a lightweight contribution: In PISA, the initialization/adaptation and retrieval modules manage memories as schemas. While the paper compares PISA to mem0, AMem and LangMem, the comparison remains superficial -- i.e., it is difficult to appreciate the "Piaget-inspired adaptive memory evolution mechanism," as well as what exactly is taken from "[...] Piaget’s cognitive theory, that constructs task-oriented memory structure". The explanation is conducted with pedagogic examples that remain excessively simplistic and are not remotely connected to the proposed benchmarks. - evaluation is lightweight and quite superficial: The paper shows relatively simple experiments on one public (LOCOMO) and one purposely built benchmark. The evaluation is conducted with either classic aggregated values (on which there may be a problem, see next) or, again, with pedagogic examples (in the appendix). Yet, even digging into the appendix, the examples remain excessively simplistic and do not probe to any extent the usefulness of the proposed schemas -- or their adaptive evolutionary status. For instance, Record Reliability Scoring depends on numerous weighting parameters whose tuning in real-life settings may be involved; in the paper this is barely mentioned and not even discussed (so for sure no ablation is carried out on it) -- but this is just one example. - evaluation benchmark results appear lower than what is publicly reported: The results of the compared baselines appear lower than the expectations already publicly reported (e.g., see https://www.memobase.io/blog/ai-memory-benchmark ), and the published benchmark numbers often exceed PISA on LOCOMO.
| Method | Single | Multi | OpenDomain | Temporal | Overall |
|---|---|---|---|---|---|
| mem0 | 67.13 | 51.15 | 72.93 | 55.51 | 66.88 |
| LangMem | 62.23 | 47.92 | 71.12 | 23.43 | 58.10 |
| Zep | 74.11 | 66.04 | 67.71 | 79.79 | 75.14 |
| OpenAI | 63.79 | 42.92 | 62.29 | 21.71 | 52.90 |
| Memobase | 70.92 | 46.88 | 77.17 | 85.05 | 75.78 |

## Minor/Language/Style typos: "Reproducability statement" -> reproducibility (p. 10) ### overloaded styles The paper mixes the syntactical use of “” with overloaded semantic implications: quotes should be used only for examples, not for implementation identifiers - implementation: “JSONB” (2.1, schema engine) - examples: topic “Drink” and specific Element “Pure Milk” (same section) The paper again mixes the syntactical use of “” with overloaded semantic implications, increasing confusion -- either use “” for precise definitions, or for approximations and intuitive metaphors - precise definition: [...] structured unit of knowledge within the system, also called a “Schema.” [...] “Assimilation" is the process of - approximation/metaphor: This makes PISA’s schemas consistent with Piaget’s definition: they are executable “mini-programs” [...] - examples: different breed of dog into the “dog” category ### mixed styles The paper mixes styling -- choose one; mixing the two is not recommended: {\texttt RULER} vs RULER, {\texttt LOCOMO} vs LOCOMO Fully human-written
SparseCodeQ: Extreme Sparse Coding Quantization for Large Vision-Language Models Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces SparseCodeQ, a 2-bit quantization framework for MLLMs. It solves the problem of conventional methods which assign a fixed number of codewords to all weights, ignoring their different importance. SparseCodeQ, instead, flexibly assigns an optimal sparse linear combination of codewords to each weight based on its salience, which is evaluated using second-order information. It uses a hierarchical search to efficiently find the best codeword combination. Experiments show this method achieves a 5.58x reduction in model size while outperforming state-of-the-art methods. 1. Adaptive Quantization Based on Salience. A primary strength is its novel approach. Unlike conventional codebook methods that assign the same number of codewords to all weights , SparseCodeQ is the first framework to use sparse coding principles for model compression. It adaptively assigns an optimal _sparse linear combination_ of codewords to each weight based on its salience. This allows for a more fine-grained representation of important parameters, which significantly mitigates performance degradation. 2. Superior Performance at Extreme Compression. The method demonstrates state-of-the-art results, especially at very low bitrates (e.g., 2.2 bits). Experiments on the 13B LLaVA model show SparseCodeQ achieves a 5.58x reduction in model size while outperforming state-of-the-art quantization methods by 2.78 in performance. It also shows practical efficiency gains, achieving a 3.6x memory reduction and 1.3x inference acceleration on a 7B model. Evaluation is focused on short-output tasks. The paper's experiments, while extensive, are heavily focused on VQA-style benchmarks (e.g., ScienceQA, VQA-v2, GQA). These tasks typically require short, often single-word or option-based, answers. This is a limitation because the impact of extreme quantization on long-form generative tasks (like detailed image description or open-ended multimodal dialogue) is not evaluated. Quantization errors can accumulate differently and cause more significant quality degradation in longer, coherent text sequences. The model's performance in these more open-ended scenarios remains unverified. How would SparseCodeQ perform on even larger-scale MLLMs, and does its effectiveness scale? Fully AI-generated
SparseCodeQ: Extreme Sparse Coding Quantization for Large Vision-Language Models Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a sparse-coding quantization scheme for low-bit LVLMs. It allocates variable codewords per group using Hessian-based salience, performs hierarchical codeword selection, and regularizes the vision encoder with a Hessian-entropy term to concentrate salience. Results show accuracy gains at 2–3 bits and modest speedups. - Targets extreme low-bit settings where many PTQ baselines fail. - Methodically designed selection stages reduce naive combinatorial search. - Attempts model-side distribution shaping rather than only post-hoc mapping. 1. Proxy–objective gap. Allocation is driven by diagonal Hessian of layer reconstruction on calibration data, while evaluation is end-task VQA accuracy. The causal link between concentrated Hessian mass in the vision encoder and reduced end-to-end quantization error through the language stack is asserted, not established. 2. Hidden compute and bandwidth overhead. Multiple codewords per group imply sparse linear combinations at inference, increasing index fetches and accumulations. Despite large memory savings, the reported latency gain is small, indicating arithmetic intensity and cache residency may be worse than stated. No kernel-level analysis (index width, packing, FLOP/byte, cache hit rates) is provided. 3. Calibration sensitivity and scaling risk. Hessian estimates and hierarchical selection depend on limited calibration data. The approach is likely sample- and distribution-sensitive, but there are no stress tests under prompt-length scaling, domain shift, or smaller calibration budgets. Codebook and index streams also introduce rate overhead that is not fully accounted for in bytes-per-parameter. N/A. Fully AI-generated
SparseCodeQ: Extreme Sparse Coding Quantization for Large Vision-Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes a novel extreme sparse coding quantization framework for ultra-low-bit large vision-language models targeting multimodal reasoning. The authors design a flexible codeword combination for each weight based on weight salience, which involves selecting the number of codewords from a salience evaluation and then performing hierarchical codeword selection to search for an appropriate codeword combination with minimal quantization error while reducing the large codebook. Experimental results demonstrate the effectiveness of the proposed method, mainly on LLaVA models. 1. The writing and logic are clear and reasonable, making the paper easy to follow. 2. Pushing quantized LVLMs to 2-bit is a meaningful contribution, and the experimental results also demonstrate the feasibility of applying the proposed method in industry or in further academic research. 3. Figures are clear and informative enough for me to follow. 1. Section 3.2 is derived from and widely used in prior work, such as HAWQ and BRECQ, and could be omitted or moved to the supplementary material for better writing flow. Also, this part should not be treated as a contribution. 2. In "High-level candidate search" of Section 3.3, how is the size of the potential codeword subsets defined, and how does this size affect the final performance and quantization efficiency of the proposed method? The authors should discuss this further with experimental or theoretical analysis. 3. The condition and design of codewords can differ between the text embedding and the vision encoder, since LVLMs take both text and image tokens as inputs, and these two modalities can be very different in both distribution range and outliers. How does the proposed method address this phenomenon? In other words, the authors do not discuss or address the challenges raised by the modality gap. See weaknesses. Fully human-written
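To make the salience-driven allocation summarized in the review above concrete, here is a minimal illustrative sketch, not the authors' implementation: it assumes a diagonal-Hessian salience proxy (`h_diag * w**2`), a scalar codebook, and a greedy nearest-codeword residual search standing in for the paper's hierarchical selection; all names and thresholds are hypothetical.

```python
import numpy as np

def sparse_code_quantize(w, h_diag, codebook, max_codewords=3):
    """Hypothetical sketch: more salient weights receive more codewords.

    w: (n,) weight group; h_diag: (n,) diagonal Hessian estimate;
    codebook: (K,) scalar codewords. All names/thresholds are illustrative.
    """
    salience = h_diag * w ** 2                       # second-order salience proxy
    cuts = np.quantile(salience, [0.5, 0.9])         # split the group into 3 tiers
    budget = 1 + np.searchsorted(cuts, salience)     # 1, 2, or 3 codewords per weight
    w_hat = np.zeros_like(w, dtype=float)
    for i, wi in enumerate(w):
        residual, approx = float(wi), 0.0
        for _ in range(min(int(budget[i]), max_codewords)):
            c = codebook[np.argmin(np.abs(codebook - residual))]  # greedy pick
            approx += c
            residual -= c
        w_hat[i] = approx
    return w_hat
```

In this toy version the "hierarchical search" collapses to a greedy residual pick; the point is only to show how a per-weight codeword budget can be tied to salience, which is the mechanism the reviews debate.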
SparseCodeQ: Extreme Sparse Coding Quantization for Large Vision-Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes SparseCodeQ. To address the discretization error problem in 2-bit quantization of LVLMs, the method dynamically allocates the number of codebooks based on weight salience and optimizes the visual encoder to concentrate the salience distribution. The authors claim that the method achieves a 5.58× reduction in the size of the 13B LLaVA model. The paper innovatively integrates sparse coding with quantization, dynamically allocating codebooks to address the issue that traditional methods overlook the variance in weight salience. It features clear writing and conducts extensive experiments covering multiple datasets and model architectures. The paper has insufficient theoretical analysis on the cross-image similarity of weight salience, merely mentions multimodal extension without providing validation, and the theoretical analysis of the method is relatively simplistic. Additionally, it lacks hardware-level validation and fails to evaluate the storage and computational overhead during actual deployment. 1. Increase the analysis of different weight salience evaluation methods: what methods can be used for salience-based codebook allocation, what impacts do these methods have on performance, and why do these impacts occur? Sufficient analysis is required to reveal the reasons for performance improvement, rather than merely increasing the number of codebooks. 2. What is the specific computational overhead of hierarchical codeword selection? It is necessary to analyze the additional computational load introduced by the method. Additionally, attempts can be made to propose methods for further optimizing its efficiency to provide insights for future work. 3. I do not fully understand the basis for setting the weights of the entropy minimization objective in visual encoder optimization. It is necessary to add ablation experiments on hyperparameters. Moderately AI-edited
Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes MISP-DPO, a listwise preference-optimization framework for multimodal LLMs that combines (i) a Plackett–Luce (PL) objective over a *winner + multiple negatives* candidate list, (ii) importance sampling with a learned proposal $q_\phi$ to reduce training cost and variance, and (iii) a semantically diverse negative-construction pipeline built in CLIP embedding space using sparse autoencoders (SAE) and feature-level perturbations. The method is applied to both *visual-preference* (image-conditioned) and *textual-preference* (response-conditioned) settings. Experiments report consistent gains over DPO/mDPO/CHiP-style baselines on datasets such as MMHal-Bench and HallusionBench. - **Clear problem framing & practical motivation.** The paper demonstrates clear problem formulation and presentation. The PL-based listwise loss directly optimizes rankings rather than isolated pairs. - **Good objective with scalable estimation.** Using **importance sampling** and a learned proposal $q_\phi$ to approximate the PL gradient is a sound strategy to keep many-negative training tractable while emphasizing informative ("hard") negatives. - **Semantically diverse negatives grounded in CLIP space.** The SAE-based feature editing and "mix-and-match" construction pipeline plausibly increases negative diversity without requiring extra human labels. - **Broad evaluation.** The method is tested across multiple hallucination/factuality benchmarks frequently used for MLLMs (MMHal-Bench, HallusionBench, POPE, WildVision, MMVP) and with a modern evaluation toolkit (VLMEvalKit). - **Novelty concern.** Beyond pairwise (1 chosen, 1 rejected) multimodal DPO, listwise optimization already exists in the context of multimodal DPO, such as LPOI [1]. Therefore, there is a certain novelty concern, **especially when the authors claim** their paper is - "the **first** framework to incorporate *multiple*, semantically *diverse* negative images in multimodal DPO" (line 16, in abstract) - "the **first** framework to incorporate multi-negative supervision into multimodal DPO" (line 89), - "However, such techniques **remain underexplored** in vision-language models" (line 117). **So please double-check your claims in the submission.** Likewise, the use of a Plackett–Luce objective is not really a novelty for listwise DPO, as there is already prior work such as PLPO [2]. - **Insufficient experiments against prior baselines.** As there are already many works published on multimodal DPO, it is not enough to include only mDPO and CHiP as baselines apart from Random and basic DPO in the experiments. Please at least incorporate and run experiments for the recent methods OPA-DPO [3] and SymMPO [4]. Any additional baselines are also welcome. - **Estimator properties insufficiently analyzed.** The text would benefit from formal statements or empirical diagnostics of *bias/variance* under finite negative sampling, any **weight clipping** or self-normalization, and the stability of $q_\phi$ training (e.g., divergence from the target leading to high-variance importance weights).
(I did not see explicit guarantees/ablation in the provided pages.) - **Ablation depth.** While the framework has several moving parts (PL listwise loss, IS with (q_\phi), SAE-based negatives, textual-preference branch), the paper would benefit from *systematic ablations* that isolate each contribution and report uncertainty (std/CI over seeds). References: [1] Fatemeh Pesaran zadeh et al. "LPOI: Listwise Preference Optimization for Vision Language Models" In ACL 2025 Main Conference. [2] "Plackett–Luce Preference Optimization (PLPO): Listwise Ranking for Preference Optimization" Preprint 2024. [3] Yang et al. "Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key" In CVPR 2025. [4] Liu et al. "Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization" In NeurIPS 2025. 1. **Cost accounting:** Please add wall-clock/GPU-hour comparisons vs. DPO, mDPO, CHiP to demonstrate the promised efficiency gains of importance sampling. 2. **Negative construction controls:** How do you ensure that CLIP-SAE-driven negatives are not trivially separable (e.g., distributional artifacts), and that they **stress visual grounding** rather than language priors? Any human spot-checks? 3. **Unbiasedness & variance:** Is the IS gradient strictly unbiased under your training scheme? Do you apply **weight clipping** or **self-normalized IS**? Please report effective sample sizes or variance diagnostics across training. Fully AI-generated
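For readers unfamiliar with the objects discussed in this review, here is a minimal, hedged sketch of a Plackett–Luce listwise DPO-style loss over one chosen response and K negatives, plus self-normalized importance weights; the tensor shapes, the ranking convention, and all names are illustrative assumptions, not MISP-DPO's actual implementation.

```python
import torch

def plackett_luce_dpo_loss(logp_policy, logp_ref, beta=0.1):
    """Listwise PL negative log-likelihood over one chosen + K negatives.

    logp_policy, logp_ref: (B, K+1) sequence log-probs; column 0 is assumed to
    be the chosen candidate and the remaining columns the negatives in some
    fixed preference order (an assumption made only for this sketch).
    """
    rewards = beta * (logp_policy - logp_ref)      # implicit DPO-style rewards
    loss = 0.0
    for k in range(rewards.shape[1] - 1):          # PL: rank-k item beats the rest
        remaining = rewards[:, k:]
        loss = loss - (remaining[:, 0] - torch.logsumexp(remaining, dim=-1))
    return loss.mean()

def self_normalized_is_weights(log_target, log_proposal):
    """Self-normalized importance weights (consistent but slightly biased),
    one common answer to the clipping/normalization question above."""
    return torch.softmax(log_target - log_proposal, dim=-1)
```

Self-normalization trades strict unbiasedness for bounded weights, which is why the reviewer's question about effective sample size and variance diagnostics is relevant.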
Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work introduces MISP-DPO, a novel framework to leverage a sparse auto encoder to identify diverse negative images for multi-negative preference optimization, building on multimodal DPO. The authors demonstrate the efficacy of their method on multiple multimodal models across a range of benchmarks. * The authors report non-trivial improvements over respective baselines across benchmarks and models. They re-implement DPO, mDPO, and CHiP for a direct and fair comparison (matching data and base models). * The main contributions are the introduction of multi-negative preference optimization and the method proposed for selecting counterfactual images, building on CLIP retrieval, a sparse autoencoder, and a greedy algorithm to achieve diverse negatives. * Section 4.2, describing one of the main contributions of the work, is perhaps a bit limited in detail. For example, the training (data, recipe) for the SAE is not described. And while the math presented in the negative selection may be sufficient, some discussion behind the intuition of sampling for reconstruction error and activation may make the paper more accessible, particularly to casual readers less familiar with using SAEs for interpretability. * Another recent work [S-VCO] also argues for negative images that are substantially similar to the request image under alignment. In this work, the authors acknowledge this work and argue that this method is expensive (as [S-VCO] relies on image generation method to generate counterfactuals) while the proposed method is more efficient. However, the respective efficacy is not further discussed. An ablation comparing the retrieval + SAE based approach directly to the generative approach proposed by [S-VCO] would further enhance the contributions of this paper. ([S-VCO]’s data (MVC) appears to have been made available.) * In table 2, the caption implies that the main difference is how negative samples are chosen, but another difference seems to be the number of negative examples being used as per the description in the text. Perhaps this could be clarified? * For the ablations in table 2, “diffusion” and “crop+diffusion” have two or one negative images still selected by the proposed method as described in the text. I believe this may make comparison a bit more difficult?
I understand that multiple negatives based on diffusion or cropping may not have enough diversity, but if that is the concern, perhaps an ablation with 1 negative sample for all methods could be made fairly, further separating the improvements achieved from the targeted selection method from the multi-negative proposal? * Minor notes for table 2: typo “mdpo” (instead of “mDPO”); Missing average improvement as in table 1 for easier comparisons. [S-VCO] Wu, Shengguang, et al. "Symmetrical visual contrastive optimization: Aligning vision-language models with minimal contrastive images." arXiv preprint arXiv:2502.13928 (2025). * Considering that $d_i$ is dependent on $m_p$, $x$ and $m_n$, how scalable is the retrieval of negatives at training time, if one assumes potentially scaling up the distractor pool and the data used for alignment? * In the reproduction of mDPO, section 5.4 mentions “mDPO, which relies on a single diffusion-generated negative”. But mDPO constructs the negative image via random cropping (0-20%). Is this a typo? * It is not clear to me why selecting more than 3 negatives would be detrimental to performance as presented in figure 2 and briefly discussed in 5.4. The authors propose this may be “due to noise introduced by redundant or low-quality samples”, but then redundancy may be directly addressed through the diversity-promoting selection and COCO may not have substantial amounts of “low-quality samples”? * The reported numbers for MMHalBench for at least LLaVA 1.5 7B seem surprisingly strong, even for reported baselines? Earlier works such as [MDPO] has baseline LLaVA 1.5 7B at 2.19 (in this paper: 2.78) and with their method they achieve “only” 2.39, whereas the “mDPO” reproduction in this paper reports 2.99. Are the evaluation protocols comparable? [MDPO] Wang, Fei, et al. "mdpo: Conditional preference optimization for multimodal large language models." arXiv preprint arXiv:2406.11839 (2024). Fully human-written
Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents MISP-DPO, a framework for multimodal Direct Preference Optimization that uses multiple semantically diverse negative samples instead of a single one. It combines a Plackett–Luce ranking objective with importance sampling guided by a Sparse Autoencoder trained in CLIP space to select informative negatives. Experiments on several benchmarks show improved multimodal alignment and reduced hallucination compared to existing DPO methods. The paper clearly identifies a weakness in current multimodal DPO frameworks—the oversimplified single-negative setup—and proposes a principled multi-negative formulation to address it. Experimental evaluation is extensive, including comparisons across multiple models and benchmarks, with consistent quantitative gains in hallucination reduction. The novelty may be moderate: it mainly leverages existing models to extract multiple negative samples, without introducing substantial theoretical or methodological innovation. The paper does not deeply analyze computational overhead or training stability when incorporating multiple negatives, which could affect scalability for larger datasets. It remains unclear whether the improvements generalize beyond hallucination-oriented tasks (e.g., to reasoning or instruction following). 1. How does the proposed multi-negative sampling strategy affect training efficiency and scalability when applied to larger datasets? 2. Could the authors provide a more detailed analysis of computational overhead introduced by the sparse autoencoder and importance sampling modules? 3. Beyond hallucination reduction, has the method been evaluated on reasoning or instruction-following tasks to assess generalization across multimodal objectives? I will adjust my score based on the authors’ response. Fully AI-generated
Expressive and Invariant Graph Learning via Canonical Tree Cover Neural Networks Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces Canonical Tree Cover Neural Networks (CTNNs), a new framework for graph representation learning that generalizes canonical sequence models by replacing a single canonical representation with a set of canonical spanning trees (MSTs). Each tree is processed with a recurrent tree encoder, and message passing is applied over residual (non-tree) edges to capture local connectivity missed by individual trees. The authors further provide: (1) Theoretical results on probabilistic invariance and universality of CTNNs, (2) Distortion and expressivity bounds comparing CTNNs to sequence-based canonicalization, (3) Empirical evaluations on molecular and protein benchmarks, demonstrating improved performance. (1) The paper presents a structured extension of canonical graph neural networks by introducing Canonical Tree Cover Neural Networks (CTNNs), which replace a single canonical ordering with a collection of spanning trees. The proposed method alleviates the structural distortion and expressivity limitations commonly observed in sequence-based canonicalization approaches. (2) The paper provides a series of theoretical analyses, including probabilistic invariance, expected distortion bounds, and expressivity results, demonstrating a well-grounded theoretical foundation. (1) Although the paper presents CTNN as an innovation grounded in canonicalization, its underlying modeling paradigm bears conceptual similarity to prior tree-structured graph neural networks (GNNs), such as Neural Trees for Learning on Graphs (Talak et al., 2021). Both approaches transform a graph into a hierarchy of trees for recursive message aggregation. As a result, the conceptual novelty of CTNN appears limited. (2) Dependence on root and tree structure without theoretical guarantees. The model’s expressive capacity and stability appear highly sensitive to the root selection and tree shape. Since the recursive encoder (e.g., Tree-LSTM) is order- and hierarchy-dependent, different rootings or unbalanced spanning trees may yield substantially different representations. The paper currently provides no theoretical or empirical analysis of this effect. (3) Dependency on the Initial Labeler ${\pi}_{V}$: The paper criticizes sequence methods (Prop 3.3) for being limited by ${\pi}_{V}$'s expressivity. However, CTNN's own tree generation (Alg. 1) is also initialized by ${\pi}_{V}$. If ${\pi}_{V}$ is weak (e.g., cannot distinguish 1-WL-equivalent nodes), the initial MST selection will also be "blind. (4) Missing/Failed Key Baselines: A core claim is surpassing MPNN expressivity. A fair comparison requires stronger baselines like k-WL GNNs or Graph Transformers (GT). The paper includes GT, but it timed out (OOT) on all protein datasets where CTNN excelled. The lack of results from this key high-expressivity baseline makes CTNN's victory less convincing. 
(1) Considering the conceptual resemblance between CTNN and Neural Trees for Learning on Graphs (Talak et al., 2021), both of which convert graphs into hierarchical tree representations for recursive message aggregation, it would be valuable for the authors to include a direct conceptual or empirical comparison in the paper. (2) Why were GNNs with proven higher expressivity than 1-WL (e.g., k-WL GNNs like GSN, or subgraph GNNs like PPGN) not included as baselines? These seem like the most relevant competitors for a method claiming to surpass 1-WL limitations. (3) Given that the paper's core motivation is to overcome the expressivity limits of 1-WL, why were synthetic benchmarks (like CSL or EXP) completely avoided? These datasets are specifically designed to demonstrate 1-WL failures and would be the clearest way to validate the superior expressivity of CTNN. (4) Given that recursive encoders (e.g., Tree-LSTM) are inherently order- and hierarchy-sensitive, how robust is CTNN to different root node selections or unbalanced tree shapes? Do the authors have any empirical results on performance variance under random root permutations or tree rebalancing? Lightly AI-edited
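As a point of reference for the tree-cover construction discussed in this review, the following is an illustrative sketch only, assuming a degree-based initial labeler and an additive penalty on already-covered edges; it is not the paper's Algorithm 1, and the jitter term stands in for the random tie-breaking the reviewers question.

```python
import random
import networkx as nx

def coverage_aware_tree_cover(G, K=4, penalty=10.0, seed=0):
    """Illustrative coverage-aware spanning-tree cover for a connected
    undirected graph (hypothetical names and weighting).

    Edge weights start from a simple degree-based labeler; edges already
    covered by earlier trees are up-weighted so later MSTs prefer uncovered
    edges, and a tiny random jitter breaks ties between identical weights.
    """
    rng = random.Random(seed)
    covered, trees = set(), []
    for _ in range(K):
        H = G.copy()
        for u, v in H.edges():
            base = H.degree(u) + H.degree(v)                   # stand-in labeler
            bump = penalty if (u, v) in covered or (v, u) in covered else 0.0
            H[u][v]["w"] = base + bump + 1e-6 * rng.random()   # tie-breaker
        T = nx.minimum_spanning_tree(H, weight="w")
        covered.update(T.edges())
        trees.append(T)
    return trees
```

Note that on a regular graph all base weights coincide, so the jitter alone decides each MST; this is exactly the "canonical structure vs. structured random decomposition" concern raised in the reviews.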
Expressive and Invariant Graph Learning via Canonical Tree Cover Neural Networks Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper identifies limitations in existing graph canonicalization methods, particularly those that flatten graphs into sequences, arguing they introduce significant distance distortion and are bottlenecked by the expressivity of their node labelers. To address this, the authors propose Canonical Tree Cover Neural Networks (CTNNs), an invariant framework that represents each graph as a small set of canonical spanning trees. This tree cover is generated using an iterative, coverage-aware Minimum Spanning Tree (MST) algorithm. Each tree is then processed by an expressive tree encoder, and the results are aggregated. The authors provide theoretical contributions showing their method is (probabilistically) invariant, better preserves graph distances than sequence-based methods, and is strictly more expressive than both MPNNs and sequence canonicalization. Empirically, the paper demonstrates that CTNNs outperform standard GNNs, sampling approaches, and canonical sequence baselines on several graph classification benchmarks. 1. The paper is well-organized and clearly articulates the limitations of existing methods. 2. The proposed CTNN framework is a novel and intuitive solution that aims to tackle the issues by replacing the single-sequence representation with a more structurally tree cover. 3. The claims are supported by a combination of theoretical analysis and robust empirical results across diverse benchmarks. 1. The term "canonicalization" is misleading. The method is not deterministic; it is a "probabilistically invariant sampling" algorithm that relies on random tie-breaking, making it conceptually closer to the sampling-based (e.g., RWNN) paradigms it critiques. 2. There is a gap between the theoretical requirement for invariance (taking an expectation $\mathbb{E}[\cdot]$ over all possible random choices) and the practical implementation (aggregating a small sample of $K=4$ or $K=8$ trees). This small $K$ may be insufficient to approximate the expectation, leading to unstable representations for the same graph across different runs. 3. A key theoretical justification (Thm 5.2) for the method's low distance distortion is based on Uniform Spanning Trees (USTs). However, the proposed algorithm (Alg 1) generates Minimum Spanning Trees (MSTs) from a different, non-uniform distribution, meaning the core distortion theory does not actually apply to the method as practiced. 4. For highly regular graphs (e.g., where all nodes have the same degree $\pi_V$, resulting in identical initial edge weights ${\pi}_{E}^{(0)}$), the selection of the MST in Algorithm 1 will be determined entirely by the random tie-breaker $\zeta$. This suggests the process is not learning a "canonical" structure but rather performing a structured random decomposition of the graph. See in Weaknesses. Lightly AI-edited
Expressive and Invariant Graph Learning via Canonical Tree Cover Neural Networks Soundness: 2: fair Presentation: 4: excellent Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes Canonical Tree Cover Neural Networks (CTNNs) as a new approach to graph canonicalization. Instead of flattening graphs into a single sequence (which causes distortion and expressivity loss), CTNNs construct a canonical spanning tree cover and process each tree with expressive tree encoders, aggregating their outputs into an invariant representation. The authors provide theoretical results (distance preservation, expressivity beyond 1-WL, universality under certain conditions) and empirical evaluation on molecular and protein benchmarks, showing improvements over message-passing GNNs, sampling approaches, and sequence canonicalization baselines. 1. The paper clearly explains the limitations of sequence canonicalization, illustrating them with intuitive examples such as star graphs. 2. It introduces a tree cover method that better preserves structural information and invariance, supported by formal analyses of distortion, expressivity, and coverage guarantees. 3. On multiple benchmarks, CTNNs consistently surpass strong baselines, achieving notable gains on molecular tasks and competitive results on protein tasks. 1. The key idea (use a set of trees instead of one sequence) feels like an incremental extension rather than a major conceptual advance. Many theoretical results (universality, expressivity boost by multiple views, distortion comparisons) are natural consequences of existing work. Frequent use of terms like “strictly more expressive,” “provably invariant,” and “universal” gives an impression of overselling. Some proofs (e.g., universality of CTNNs) rely on standard functional approximation arguments, which are not truly new. 2. The guarantees, such as using only a logarithmic number of trees and achieving low distortion, mainly hold for sparse graphs. Dense graphs, such as proteins, show weaker improvements and less favorable theoretical bounds. Although the paper claims MST construction is efficient, it does not provide runtime or memory benchmarks. For dense graphs, the $O(Km \log n)$ cost could be significant, and the preprocessing advantage is asserted but not supported with quantitative evidence. 3. Improvements on molecular benchmarks are often only 1–2 AUC points. On protein tasks, performance is inconsistent, sometimes close to baselines. No large-scale or real-world datasets beyond the standard benchmarks are tested. The ablation studies show predictable drops (removing key components hurts), but provide little insight into why certain parts matter. No analysis of failure cases or adversarial graph structures. 4. While molecular and protein benchmarks are reasonable, the method is not shown on social networks, knowledge graphs, or large heterogeneous graphs. This raises questions about generality beyond biochemical datasets. I will raise my score if the authors address W2. Fully AI-generated
Expressive and Invariant Graph Learning via Canonical Tree Cover Neural Networks Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper critiques single-sequence canonicalization methods in graph learning for causing distance distortion and having limited expressivity. To address this, it introduces Canonical Tree Cover Neural Networks (CTNNs), a framework that represents graphs using a small set of canonical spanning trees that cover all edges. Each tree is processed by a tree encoder, and the results are aggregated. The authors provide theoretical guarantees that CTNNs are probabilistically invariant, better preserve distances, and are more expressive than sequence-based methods. Empirically, CTNNs are shown to outperform standard GNNs, sampling approaches, and canonical sequence baselines on graph classification benchmarks. 1. The paper addresses a fundamental problem in graph learning: the trade-off between expressivity and isomorphism invariance. 2. It proposes a framework (CTNN) that uses a tree cover to represent graph structure, addressing the identified high distortion of sequence-based methods. 3. Theoretical analysis and empirical results across multiple benchmarks provide support. 1. The experimental setup (e.g., using $\tau=1$) likely fails to meet the theoretical condition required by Lemma 5.3 for guaranteed logarithmic edge coverage. This gap between theory and practice undermines the paper's claims about efficient coverage and the universality (Thm 5.5) that depends on it. 2. The empirical evaluation is missing the most critical baselines. As a method based on subgraph representations, CTNN must be compared against other state-of-the-art, subgraph-based GNNs (like GSN or ESAN) that also achieve high expressivity, not just standard 1-WL models. 3. The universality claim (Thm 5.5) is trivial and not a unique advantage. This property holds for any model that completely decomposes a graph and uses universal encoders and aggregators. The paper does not sufficiently prove the *efficiency* of CTNN's universality (i.e., that it can be achieved with a small $K$). See in Weaknesses. Lightly AI-edited
Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces Omni-Weather, a unified multimodal foundation model that brings weather generation and understanding into the same architecture. The authors also create a chain-of-thought dataset tailored for causal reasoning in generation and use it for finetuning and "thinking" inference. They show strong (often SOTA) results across nowcasting, radar inversion, and radar understanding, and provide evidence that training generation and understanding together lets the two enhance one another. Ablations further indicate that mixing scientific and general data boosts performance, especially on deterministic and perceptual metrics. (1) This paper introduces a multimodal foundation model that unifies weather generation and understanding within one architecture, using modality-specific encoders, and takes a step toward reasoning-capable unified foundation models for weather. (2) They present experiments and ablations with useful insights, showing how generation and understanding tasks can mutually enhance each other. (3) They demonstrate strong results across nowcasting, radar inversion, and radar understanding, often matching or exceeding state-of-the-art models. (1) As mentioned in the limitation section by the authors, the model cannot yet adapt to general-domain VAEs. (2) It would strengthen the paper to include a small human-validation study with weather experts. In particular, having domain experts rate the generated reports/explanations, and comparing those ratings to the LLM-based judge. (3) Results are centered on SEVIR-style radar nowcasting, satellite-to-radar inversion, and RadarQA understanding, and generalization to other weather tasks is not demonstrated. (1) It is mentioned that there is a quality verification step to produce the final CoT dataset, including causal alignment, structure checks, etc. Is there human/expert validation at any point during the dataset generation or evaluation? Fully human-written
Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes OmniWeather, a weather understanding and forecasting model that aims to perform three main tasks: (i) radar inversion (ii) radar understanding (iii) radar nowcasting. The authors finetune the multimodal Bagel on their CoT dataset, and benchmark their method against unimodal models for these three tasks. 1. The authors train a single unified model that can reason across both images and text, and effectively produce interpretable forecasts. As far as I know, this is the first work that combines weather generation and understanding in the same model. 2. The proposed model achieves strong results on all three considered tasks. 3. The ablations are interesting, and shed light on the different steps of the model pipeline. 1. My primary concern is that the paper overstates its scope and significance. The definition for a foundation model from [1] is "A foundation model is any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks;". While the paper certainly achieves impressive results in unifying different modalities, labeling the model a foundation model feels premature given that it is fine-tuned for only three task types on a limited data regime. I would recommend the authors to soften the claims in the introduction and abstract. I would also recommend a title change that better reflects the scope of the tackled problem. For example, a title with some combination of the words "unified multi-task model for short-range weather understanding and generation". 2. The work is missing several important citations and discussions related to short-range/medium-range weather forecasting, e.g. GenCast [2], Stormer [3], Pangu-Weather [4], Aurora [5], Prithvi WxC [6]. These are the canonical exemplars readers associate with large-scale weather pretraining/foundation model claims. Even if the focus is nowcasting, the paper should explicitly contrast goals, data scope, and evaluation scales with these systems. The authors should also compare their model against the important now-casting work [7] to better situate progress within the nowcasting literature. 3. From my understanding, the authors use GPT-4o (Appendix A.4) to annotate radar data and identify important phenomenon from the images. I am concerned that this process might be error-prone and introduce mistakes that might propagate into the training process. Do the authors benchmark 4o annotations against a gold standard (for example, expert human)? How reliable is this data annotation process? The manuscript also needs a precise description of the quality-control (QC) stages—currently “Structure Check, Causal Alignment, and Terminology” are named but not operationalized. ### References [1] Bommasani, Rishi. "On the opportunities and risks of foundation models." arXiv preprint arXiv:2108.07258 (2021). [2] Price, Ilan, et al. "Gencast: Diffusion-based ensemble forecasting for medium-range weather." arXiv preprint arXiv:2312.15796 (2023). [3] Nguyen, Tung, et al. 
"Scaling transformer neural networks for skillful and reliable medium-range weather forecasting." Advances in Neural Information Processing Systems 37 (2024): 68740-68771. [4] Bi, Kaifeng, et al. "Pangu-weather: A 3d high-resolution model for fast and accurate global weather forecast." arXiv preprint arXiv:2211.02556 (2022). [5] Bodnar, Cristian, et al. "A foundation model for the Earth system." Nature (2025): 1-8. [6] Schmude, Johannes, et al. "Prithvi wxc: Foundation model for weather and climate." arXiv preprint arXiv:2409.13598 (2024). [7] Ravuri, Suman, et al. "Skilful precipitation nowcasting using deep generative models of radar." Nature 597.7878 (2021): 672-677. Apart from the main issues flagged in the Weaknesses section, I have other minor comments/questions/suggestions. 1. The current description of CoT data annotation and Figure 4 are cluttered and hard to follow. The authors should consider simplifying it, or replacing it with a figure that reads top-to-bottom. 2. Why do the authors use the word "causal"? For example, the prompt in Appendix A.4 asks the model to extract "Temporal causal factor, perceptual causal factor" without any sufficient explanation of what this means. How do we trust that the model knows the true "causal" factors for explaining these weather phenomenon? 3. Lines 193-197 do not add any substantive value in explaining the problem setup and should either be replaced by a more complete mathematical description of the problem setup or omitted entirely. 4. There are insufficient architectural details about the VAE used in the radar inversion task, and these details should be added to the manuscript. 5. The clarity in Figure 3 could be improved. In particular, it is unclear how the tokens from the different modalities are combined in the model architecture. 6. Line 214: modal -> model 7. The authors need to add more details about how many data samples are used for training. While the authors mention that they generate 4000 CoT samples for radar nowcasting and 4,000 CoT annotations for radar inversion, the authors should also clarify the number of samples used from RadarQA, and the general metaquery data. Overall, I think this is a substantial and promising paper marred by some fixable issues. I would be willing to raise my score if the authors can satisfactorily address my concerns. Fully human-written
Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces Omni-Weather, a multimodal foundation model designed to address a significant gap in radar modeling: the separation of generation (numerical prediction) and understanding (textual interpretation). The authors propose a single architecture that unifies these two capabilities, arguing that they are mutually beneficial. The model's core contributions are its unified architecture, the introduction of a novel Chain-of-Thought (CoT) dataset for causal reasoning in weather, and its demonstration of strong performance on both task categories. - A multimodal model is a great direction towards bridging the gap between numerical prediction tasks and high-level textual interpretations/analyses. - The framework is well motivated and is at the forefront of such multimodal models in this weather/radar domain. - Clear writing. - Evidence that joint training/multimodality provides complementary supervision signals and better scores in some areas than a single modality alone. 1. The considered data is exclusively radar. "Weather" in the title makes it sound overly general. As the authors point out, there would be significant challenges in even just extending this framework to more general weather-related tasks/datasets. Thus, I suggest writing OMNI-Radar and replacing most occurrences of "weather" with "radar" in the text. Similarly, the term "foundation model" in the title feels premature; this urgently needs to be renamed and the text revised to accurately reflect the true contributions of the work. 2. Lack of clarity/details in some places. For example: - Unclear how encoders are trained and what their specific designs are (beyond high-level descriptions like "VAE decoder"). - What is high-value retaining/matching? - Eq. 3.4 feels very abrupt... did some related sentences go missing? - $\lambda_t$ is poorly explained/introduced. Multiplying both loss terms in Eq. 3.4 by $\lambda_t$ doesn't make sense. Please correct. Also, please explain how it was tuned (same for $n_t$). - This claim should be toned down: *"On the radar inversion task, Omni-Weather consistently surpasses both specialized... and generalist... models, achieving higher CSI scores across all thresholds, with gains up to 20% at high-value levels."* given that it's not true for the RMSE metric. - How is the CRPS computed? How many ensemble members are used? - Fig. 3: Full prompts should be included in the appendix. Same for the exact versioning of the GPT models used. - I'm confused by the "CFG Setting" (classifier-free guidance) paragraph. There's no reference to CFG, not even diffusion, anywhere else... did the authors use it but forget to mention it in the main text? 3. No discussion of the complexity of the model, especially when compared to the "generation-only" baselines. 4. More comprehensive evaluations would be useful. E.g.: - Human expert evaluation of "understanding" outputs would be really useful and a strong contribution. Are the explanations at the level of a meteorology expert? How useful are they actually?
Are the textual outputs given by the model consistent with the numerical nowcasts (e.g., in Fig. 4)? With the current results, it's hard to judge how scientifically useful the "understanding" part of the model actually is. - How is the RMSE in Table 2 computed? A more comprehensive ablation (e.g., like the part of Table 1 that covers radar nowcasting) would be more useful. 5. While the paper is at the forefront of multimodal modeling for weather/radar, it is not there entirely on its own. The paper misses some important references and contextualization. In particular, 1) this paper is only *one* of the first multimodal models in this weather/radar domain [1]; 2) there has been a benchmark proposed in this space which includes SEVIR (the only weather dataset used in this paper) [2]. It would have been nice to use it here, but at least it should be discussed. Minor: - VIL should be explained before its abbreviation is used. - CasCast is misspelled in Fig. 5. [1] Aquilon: Towards Building Multimodal Weather LLMs; Varambally et al. 2025 (https://openreview.net/forum?id=KVxOwEYAF4) [2] CLLMate: A Multimodal Benchmark for Weather and Climate Events Forecasting; Li et al. EMNLP 2025 (https://arxiv.org/abs/2409.19058) - Why are so many different encoders/decoders used? E.g., why are separate single-frame radar and multi-frame radar encoders needed? - *"In the radar nowcasting task, forecasts exhibit fine-grained storm details with improved spatial coherence"*... I'm not sure how the authors identify "improved spatial coherence" in Fig. 5? - Is there anything special about the extra Metaquery data that would make it particularly useful, or do you think your model would benefit from any other extra, non-radar-specific data? Table 5 seems to explore this a bit, but it's unclear what the two possible "gen" datasets are and why adding the 2nd "gen" dataset is so detrimental to performance. - Why not report CRPS for the radar inversion task? Fully human-written
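Since two of the questions above hinge on how CRPS and CSI are computed, here are the standard estimators as a hedged reference point; the paper may use different conventions, ensemble sizes, or VIL thresholds.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS for one grid point: E|X - y| - 0.5 * E|X - X'|.

    members: (M,) ensemble forecasts; obs: scalar observation.
    """
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

def csi(forecast, observed, threshold):
    """Critical Success Index at a threshold: hits / (hits + misses + false alarms)."""
    f = np.asarray(forecast) >= threshold
    o = np.asarray(observed) >= threshold
    hits = np.sum(f & o)
    misses = np.sum(~f & o)
    false_alarms = np.sum(f & ~o)
    return hits / max(hits + misses + false_alarms, 1)
```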
Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents Omni-Weather, a multimodal foundation model that unifies weather generation and understanding within a single architecture. Unlike existing models that separately address forecasting or diagnostic reasoning, Omni-Weather integrates radar and text modalities through a shared self-attention backbone and a Chain-of-Thought (CoT) dataset to enable causal reasoning in weather modeling. The model achieves state-of-the-art results on both weather generation (e.g., nowcasting, radar inversion) and understanding (e.g., RadarQA tasks), demonstrating that generative and interpretive capabilities can reinforce each other. The contributions are: 1. Introduction of the first unified multimodal foundation model for weather that jointly handles generation (forecasting, inversion) and understanding (diagnostic reasoning, QA) tasks within a single framework. 2. Construction of a weather-specific Chain-of-Thought (CoT) dataset for causal reasoning in generation, improving interpretability and perceptual quality of outputs. 3. Empirical results showing Omni-Weather surpasses strong baselines (e.g., CasCast, DiffCast, WeatherGFM, RadarQA) in both pixel-level and perceptual metrics, with reasoning further enhancing visual fidelity and explainability. 1. The paper introduces the first unified multimodal foundation model for weather generation and understanding, representing a novel and impactful problem formulation. 2. The Chain-of-Thought dataset for causal reasoning in weather generation is promising, enabling interpretable forecasting and bridging the gap between prediction and explanation. 3. The experiments are comprehensive, covering both pixel-level and perceptual evaluations with clear comparisons to strong baselines. 4. The paper is well-written and clearly structured. 5. The demonstrated mutual benefit between generation and understanding tasks highlights significant scientific insight with implications for broader multimodal foundation model research. 1. The claim of a “foundation model for weather” seems overstated, as the model’s scope is limited to a single variable (radar VIL precipitation) rather than encompassing multiple atmospheric variables such as temperature, pressure, and wind. 2. The proposed model only addresses short-range nowcasting (approximately one hour ahead) and is restricted to the SEVIR dataset covering the continental US, limiting its generalization and global applicability. 3. The Chain-of-Thought (CoT) dataset used for training is entirely LLM-generated, with no human expert validation or meteorological review to ensure physical correctness, as GPT-series models are not fine-tuned as meteorology experts. 4. The CoT generation pipeline relies on GPT-4o for attribute annotation and GPT-o3 for reasoning synthesis, producing synthetic causal narratives that may not reflect authentic meteorological reasoning.
An ablation or qualitative comparison between LLM-generated CoT reasoning and reasoning written by human meteorologists would help disentangle whether the improvements stem from genuine interpretability or from stylistic mimicry of GPT. Please refer to weaknesses. Heavily AI-edited
A Bootstrap Perspective on Stochastic Gradient Descent Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper studies SGD's impact on generalization for machine learning models. Based on the provided analyses, it proposes two regularization schemes, which are shown to benefit generalization on a few toy datasets. The question raised in the paper is important, and the paper tests a new regularization method based on the analyses and shows that it might benefit generalization. The theoretical contribution appears to be incremental, as, to my understanding, the main insights came from Smith et al. (2021). The empirical evaluation is very limited, as the results are tested only on a very specific synthetic dataset with a sparse prior and FashionMNIST. 1) I did not understand how the analyses are specific to SGD as opposed to non-stochastic GD. As the opening sentence of the abstract mentions the difference between the generalization of GD and SGD as a motivation, I would like to ask the authors to elaborate more on this. How can we see from the bounds derived in the paper that SGD might outperform GD? 2) As for the regularizers, what are the novel insights made in the paper compared to Smith et al. (2021)? Fully human-written
A Bootstrap Perspective on Stochastic Gradient Descent Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper tries to understand SGD from the view of bootstrapping: SGD favors minima with a smaller variance of the stochastic gradient. 1. The top example in Section 2 is attractive and illustrative. 1. The presentation of the theoretical part is a bit confusing. - The theoretical results are listed as Lemmas 1 and 2 as well as Proposition 1, without a main theorem that would usually serve as the center of the discussion. This makes me confused about what the main theoretical contribution of the paper is. - The discussions after Lemmas 1 and 2 mainly explain why the lemmas hold, and do not actually help with the understanding of the theoretical results (especially for Lemma 2, whose right-hand side has a lot of terms). 2. My understanding is that the core of the theoretical analysis is the correspondence of Equations (6) and (7) with Equations (10) and (11), which provides a viewpoint on the implicit regularization of SGD via "bootstrapping" the gradients. However, this part lacks a comparison against GD or noisy GD. 3. According to my understanding, the technical contribution is minor. Lemmas 1 and 2 are basically Taylor expansions, and Proposition 1 is basically the strong law of large numbers. I would honestly confess that I do not understand all the details of the paper, and would be happy to discuss with the authors, other reviewers, and the AC. My score of 2 currently represents my unconfident understanding. I think the intuition of the paper is good, but the theoretical part may need improvements. 1. Can the authors show more details of the algorithm SGDwReg2, especially how to estimate the term Reg2? - If Reg2 is estimated in an exact way, then SGDwReg2 requires knowledge of the entire dataset at each minibatch update. In this case, is it possible to design an adaptation of SGD that incorporates the idea of SGDwReg2 but without the requirement of the entire dataset? - If Reg2 is approximated, can the authors show the details of the approximation? 2. How does the bootstrapping view compare with the idea of variance-reduction techniques like SVRG? Fully human-written
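To make the variance-reduction comparison in the last question concrete, here is a minimal SVRG sketch on least squares (purely illustrative, not drawn from the paper under review): SVRG cancels minibatch noise with a control variate anchored at a periodically refreshed full-batch gradient, a mechanism distinct from the bootstrap-style variability penalty the paper studies.

```python
import numpy as np

def svrg_least_squares(X, y, w, lr=0.1, epochs=5, seed=0):
    """Minimal SVRG inner/outer loop for 0.5 * ||Xw - y||^2 / n (illustrative)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    grad = lambda w_, idx: X[idx].T @ (X[idx] @ w_ - y[idx]) / len(idx)
    for _ in range(epochs):
        w_snap = w.copy()
        full_g = grad(w_snap, np.arange(n))      # one full-batch gradient per epoch
        for _ in range(n):
            i = rng.integers(n, size=1)
            # control variate: g_i(w) - g_i(w_snap) + full_g, unbiased for the full gradient
            w = w - lr * (grad(w, i) - grad(w_snap, i) + full_g)
    return w
```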
A Bootstrap Perspective on Stochastic Gradient Descent Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper presents a theoretical framework for understanding the generalization properties of Stochastic Gradient Descent (SGD). The authors decompose the generalization gap and introduce the concept of "algorithmic variability", which they analyze through the lens of statistical bootstrapping. Based on this decomposition, the authors construct two novel regularizers and empirically validate that their inclusion can lead to improved generalization performance on tasks including sparse regression and neural network training. However, I still have some concerns. Therefore, I lean toward rejection for the time being. Specifically, I am not sure whether the idea in this paper has significant differences from algorithmic stability, and whether the derivation of this paper is meaningful. See below for more details. 1. The paper posits that SGD uses the gradient variability (caused by mini-batch sampling) as a "bootstrap estimate". 2. This paper proves that the expected generalization gap is determined by the trace of the product of the solution's Hessian matrix and the "algorithmic variability" matrix. 3. This paper designs a new regularizer based on the theoretical findings. 4. The authors further provide empirical evidence on this regularizer. 1. [Major Concern] It seems that Assumption 2 directly leads to a small variability (Eqn 3). However, the authors did not discuss it much. If so, I cannot be convinced that Eqn (3) is the dominant term compared to Eqn (4), where Eqn (4) also contains the epsilon[2, T] term. 2. [Major Concern] I am not convinced that this paper has significant differences from the line of algorithmic stability. The authors claim in Line 466 that "this paper considers "Hessian-weighted and evaluated at the solutions"". It seems that algorithmic stability can include this case with pretty minor changes. For simplicity, algorithmic stability analyses just bound the Hessian via smoothness and use the iterations to reach the solution, but starting from the definition of algorithmic stability, these choices are not necessary. The authors should provide more evidence on how this paper differs from algorithmic stability. [Minor] 1. The authors claim that "we prove rigorously that by implicitly regularizing the trace of the gradient covariance matrix, SGD controls the algorithmic variability." According to the paper's derivation, the algorithmic variability is bounded by two components (corresponding to the latter term in Eq. 6 and Eq. 7). While the authors convincingly connect the implicit regularization of SGD, as identified by Smith et al. (2021), to the first component (Eq. 6), they do not provide evidence or argumentation that SGD also implicitly regularizes the second component (Eq. 7). Consequently, the claim that SGD "controls the algorithmic variability" in its entirety appears to be an overstatement. This significantly limits their theoretical contribution, as the work seems to demonstrate that vanilla SGD only addresses a part of the problem identified by the authors. 2. The paper's analysis of the proposed regularizers, Reg1 and Reg2, lacks sufficient depth regarding their interplay and individual utility. 
For instance, given that the authors identify Reg1 as an existing *implicit* regularizer of SGD, a crucial discussion is missing on the utility of its *explicit* inclusion. What is the tangible difference between applying Reg1 explicitly versus relying on its implicit effect? Would applying only Reg2, which is the component not addressed by vanilla SGD, be a more practical and principled approach? The paper would be substantially strengthened by ablation studies that dissect the individual contributions of Reg1 and Reg2 and clarify their roles in guiding SGD towards better-generalizing solutions. 3. The practical significance of this work is severely hampered by the unaddressed computational overhead of the proposed regularizers. Both Reg1 and Reg2, as defined, require the computation of the full-batch gradient at each training step. This is a prohibitive cost for large-scale datasets and fundamentally contradicts the core philosophy of SGD, which is designed precisely to avoid such computations. The absence of any discussion on this issue, or on potential efficient approximations, makes it difficult to assess the empirical value of the proposed method. As it stands, the practical guidance offered by the paper appears limited. See above. Fully human-written
A Bootstrap Perspective on Stochastic Gradient Descent Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper aims to provide a novel explanation for the superior generalization property of SGD compared with GD, from a bootstrap perspective. Specifically, under certain assumptions, the authors show that the generalization error can be decomposed into a dominant Hessian-preconditioned algorithmic variability term and several small terms. They further argue that the algorithmic variability is strongly correlated with the accumulated empirical covariance of gradients. As a consequence, they empirically establish that SGD regularizes algorithmic variability as a bootstrap estimate, and hence improves generalization through this correlation. This paper is clearly written and has a nice structure. Although the authors provide an upper bound on the generalization error via algorithmic stability, the paper does not explicitly establish how SGD regularizes this term theoretically. Moreover, there is no theoretical characterization of the generalization gap between SGD and GD. Another concern arises from the assumptions: while Assumption 1 appears standard, Assumption 2 is rather demanding and may not hold in many scenarios: existing theoretical results generally suggest that the upper bound on uniform algorithmic stability grows with the number of iterations. This implies that the bias term, rather than variance, often dominates the generalization error. From this perspective, the argument that "SGD generalizes better because it regularizes the gradient variance" may not be entirely convincing. No further questions. Fully human-written
Tokenizing Single-Channel EEG with Time-Frequency Motif Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This submission proposes TFM-Tokenizer, a single-channel EEG tokenization framework that learns a discrete vocabulary of time-frequency motifs via a dual-path encoder. The tokenizer is used to produce per-channel token sequences which are then fed to a lightweight transformer (or plugged into existing foundation models) for downstream tasks. Experiments over four datasets report performance gains; an ear-EEG sleep-staging dataset is used to argue scalability. 1/ This paper is related to a very timely topic. EEG tokenization for heterogeneous devices and non-stationary signals is an important and current problem. 2/ This paper provides rich token analyses with motif visualizations. It contains multiple analyses (such as class-token uniqueness and class-wise token consistency) and visual examples of learned motifs in Section 4.7, which enhance the insight. 1/ The novelty may be overclaimed, especially for the third contribution, the tokenization learning objective, which argues that "Relying solely on capturing time-based motifs into discrete tokens risks losing important spectral structure" in L81-82. However, most existing EEG tokenization methods are already frequency- or time-frequency-oriented: e.g., LaBraM reconstructs the frequency domain (with a TimeConv module on raw signals), and NeuroLM includes both time and frequency domain reconstruction. The proposed change mainly shifts from FFT to STFT, which feels incremental and weakens the claim of being "the first to encode diverse time-frequency motifs" in L142-143. 2/ The single-channel design fully discards inter-channel topology, which seems questionable. Many EEG tasks, especially localization or differential montages in epilepsy or sleep staging, depend critically on spatial relationships and cross-channel synchrony. The relatively lower performance on CHB-MIT, compared to BIOT, may partly reflect this limitation. Moreover, such a setup implicitly assumes that the downstream backbone needs to reintroduce spatial structure (as EEGPT or LaBraM do with hard-coded topographic embeddings), so the claim of being model-agnostic is also overstated. 3/ The baselines are outdated and inconsistent. Recent models such as CBraMod (ICLR'25) [1] and EEG2Rep (KDD'24) [2] are not compared against, and NeuroLM is included for only two of four datasets. This selective evaluation raises fairness concerns and weakens empirical credibility. 4/ This paper shows partial data leakage in its experimental setup, which weakens the claim of dataset generalization. Both the single- and multi-dataset experiments pretrain and evaluate on the same set of datasets. This design still leads to partial data leakage, as the tokenizer indirectly sees the target data distribution. 5/ The writing could be more structured. The term motifs, central to the paper and appearing in the title, is not defined until the Related Work section. As this term is uncommon in EEG representation learning, it should be briefly introduced earlier to avoid confusion. [1] Wang J, Zhao S, Luo Z, et al. 
CBraMod: A Criss-Cross Brain Foundation Model for EEG Decoding[C]//The Thirteenth International Conference on Learning Representations, 2025. [2] Mohammadi Foumani N, Mackellar G, Ghane S, et al. EEG2Rep: Enhancing self-supervised EEG representation through informative masked inputs[C]//Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024: 5544-5555. 1/ Please include more recent baselines such as CBraMod (ICLR'25) and evaluate the proposed single-channel design on unseen EEG datasets or channel configurations that were not used during pretraining. This would help verify the claimed channel-invariant generalization and rule out potential data leakage. 2/ Could you clarify your fine-tuning strategy? 3/ This submission fixes the patch size at 1 s with 0.5 s overlap, but the reason for this choice is unclear. How does this fixed window align with the paper's motivation of token resolution? 4/ The paper claims interpretability and performs motif case studies. Could you offer a quantitative measure of token-motif correspondence, for instance, the proportion of tokens aligning with known EEG events? 5/ As the model is based on VQ-VAE, which can suffer from instability or codebook collapse, please provide evidence that training remains stable, e.g., by reporting gradient norms, especially given the small datasets and large codebook size (8192). Fully human-written
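As a concrete reference for the FFT-versus-STFT point and for question 3 above: a minimal sketch, under assumed parameters, of turning a single-channel EEG segment into localized time-frequency patches with 1 s windows and 0.5 s overlap. The sampling rate, band width, and synthetic signal are illustrative and not the actual TFM-Tokenizer settings.

```python
import numpy as np
from scipy.signal import stft

fs = 200                                   # assumed sampling rate (Hz)
x = np.random.randn(10 * fs)               # 10 s of synthetic single-channel EEG

# STFT with 1 s windows and 0.5 s hop, matching the fixed patching questioned above.
f, t, Z = stft(x, fs=fs, nperseg=fs, noverlap=fs // 2)
spec = np.abs(Z)                           # magnitude spectrogram, shape (freqs, frames)

# Split the spectrogram into local frequency bands of width delta_f (here 4 bins),
# so each (band, frame) cell becomes one candidate time-frequency patch.
delta_f = 4
n_bands = spec.shape[0] // delta_f
patches = spec[: n_bands * delta_f].reshape(n_bands, delta_f, -1)   # (bands, delta_f, frames)
patches = patches.transpose(2, 0, 1)       # (frames, bands, delta_f): one row per time step

print("spectrogram:", spec.shape, "-> patches:", patches.shape)
```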
Tokenizing Single-Channel EEG with Time-Frequency Motif Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes TFM-Tokenizer, a framework for tokenizing single-channel EEG signals by learning a vocabulary of time-frequency motifs using a dual-path architecture with frequency and temporal masking. The tokenizer is designed to be model-agnostic, integrating with existing EEG foundation models like BIOT and LaBraM, and claims improvements in accuracy (up to 17% in Cohen’s Kappa), generalization, and scalability (e.g., to ear-EEG). Experiments are conducted on four EEG datasets under single- and multi-dataset pretraining settings, with additional analysis on token quality. The paper addresses an interesting problem in EEG foundation models: effective tokenization of signals to improve representation learning, which is underexplored compared to NLP or vision domains. The single-channel approach is a reasonable design choice for device-agnostic scalability, and the integration with existing models (e.g., BIOT and LaBraM) as a plug-and-play component shows practical potential. The experiments include a range of datasets (TUEV, TUAB, CHB-MIT, IIIC-Seizure) and settings (single- vs. multi-dataset pretraining), with some ablation studies on token quality (e.g., class-discriminative analysis). The inclusion of a scalability test on ear-EEG is a nice touch for real-world applicability. Overall, the work is clearly motivated by challenges like motif capturing and frequency entanglement in EEG signals. The core contribution lacks sufficient novelty: the proposed TFM-Tokenizer heavily builds on existing VQ-based tokenization (e.g., from LaBraM) and time-frequency representations common in EEG analysis (e.g., spectrograms with masking, as in BIOT or related works like Yang et al., 2024). The "time-frequency motif learning" is essentially a combination of spectral patching, transformers, and VQ quantization, but it doesn't introduce fundamentally new mechanisms—e.g., the localized spectral window encoder is similar to patch-based processing in vision transformers, and the masking strategy mirrors BERT-like objectives without EEG-specific innovations. Claims of up to 17% improvement in Cohen’s Kappa are overstated, as they are relative to baselines on specific datasets (e.g., TUEV), but absolute gains are modest (e.g., 0.5273 to 0.6189), and statistical significance is only reported sporadically (e.g., p=1.5e-4 on IIIC-Seizure). Experiments are limited: no comparisons to more recent EEG models (e.g., BRANT or MMM beyond superficial mentions), insufficient ablation on key hyperparameters (e.g., codebook size K, masking ratios), and the multi-dataset setting uses only four datasets, which may not capture broader diversity in EEG corpora. Scalability to ear-EEG is promising but under-evaluated—only a 14% gain is claimed without details on transfer learning adaptations or failure cases. Interpretability analysis (e.g., token consistency) is superficial and lacks quantitative metrics like mutual information or visualization of failure modes. 1. Could the authors provide more ablation studies on the vocabulary size K and masking strategies (e.g., frequency band size δf)? How sensitive is performance to these, and do they generalize across datasets? 2. 
The paper claims the tokenizer is "model-agnostic," but integration details with BIOT/LaBraM are brief—e.g., how exactly are token embeddings fused, and what modifications were needed? A response with pseudocode or specifics could clarify. 3. For the ear-EEG scalability experiment, what adaptations (if any) were made for differences in sampling rate or noise profiles? Baseline comparisons here seem weak; adding results from non-tokenized transfers could strengthen the claim. I will consider raising my score if all my concerns are solved or clarified. Fully AI-generated
Tokenizing Single-Channel EEG with Time-Frequency Motif Learning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes TFM-Tokenizer, a model-agnostic tokenization framework that learns discrete time–frequency motifs from single-channel EEG signals. The tokenizer produces interpretable and compact tokens via dual-path time–frequency masking, enabling integration with various EEG foundation models such as BIOT and LaBraM. Extensive experiments on four EEG datasets show consistent improvements in both single- and multi-dataset pretraining settings, as well as strong scalability to ear-EEG data. The authors also provide comprehensive analyses of token quality, distinctiveness, and interpretability. 1. The proposed TFM-Tokenizer is a model-agnostic and reusable component that can enhance a wide range of EEG foundation models. 2. The authors conduct detailed analyses of token quality (e.g., class-specificity, frequency awareness, consistency, and utilization), lending strong support to the claim that the tokens are both informative and interpretable. 3. The paper is technically sound and presents a well-motivated formulation of single-channel EEG tokenization, which addresses an underexplored yet important problem in EEG representation learning. 1. In Section 4.3, the authors only test replacing the neural tokenizer in LaBraM with TFM-Tokenizer. It would strengthen the claim of generalizability if the authors also tested using TFM-Tokenizer's token embeddings as direct inputs for masked EEG modeling. 2. The token utilization score decreases with larger vocabulary size (Appendix C.8). Could the authors explore some ways to improve utilization? 3. The embedding dimension of tokens is fixed in experiments. A discussion or ablation on how this dimension affects performance would improve clarity. 4. The paper could better articulate computational costs — for example, how much training overhead or inference latency is introduced by TFM-Tokenizer compared to standard segment-based tokenization. 5. There is a typo "Abnoral" in Table 4. How sensitive is the model to the choice of STFT parameters (e.g., window length, hop size)? Is the tokenizer robust to different preprocessing pipelines? Lightly AI-edited
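As a reference point for the codebook-utilization question above: a minimal sketch of two standard VQ diagnostics, the fraction of codes ever used and the usage perplexity, computed from a stream of token ids. These generic measures are assumptions for illustration and not necessarily the exact score reported in Appendix C.8; the skewed token distribution is synthetic.

```python
import numpy as np

def codebook_stats(token_ids, vocab_size):
    """Standard VQ diagnostics: fraction of codes ever used and usage perplexity."""
    counts = np.bincount(token_ids, minlength=vocab_size).astype(float)
    utilization = (counts > 0).mean()                  # fraction of the vocabulary in use
    probs = counts / counts.sum()
    entropy = -(probs[probs > 0] * np.log(probs[probs > 0])).sum()
    perplexity = np.exp(entropy)                       # effective number of codes
    return utilization, perplexity

# Illustrative token stream drawn from a skewed distribution over an 8192-code vocabulary.
rng = np.random.default_rng(0)
vocab = 8192
tokens = rng.zipf(1.5, size=100_000) % vocab
print(codebook_stats(tokens, vocab))
```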
Tokenizing Single-Channel EEG with Time-Frequency Motif Learning Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces TFM-Tokenizer, a novel framework that learns discrete tokens from single-channel EEG signals by capturing time-frequency motifs. Unlike existing methods that use raw or continuous embeddings, TFM-Tokenizer builds a learnable vocabulary of meaningful EEG patterns, enabling plug-and-play integration with any foundation model. It uses a dual-path encoder to jointly model time and frequency domains and is trained with a mask-and-reconstruct strategy. Evaluated on four EEG datasets and an ear-EEG sleep staging task, TFM-Tokenizer consistently outperforms strong baselines, improves existing models like BIOT and LaBraM, and offers cross-device generalization with fewer parameters and better interpretability. 1. First to introduce single-channel EEG tokenization using time-frequency motifs, filling a critical gap. 2. Outperforms SOTA by up to 17% with fewer parameters. 3. Plug-and-play enhancement for existing foundation models like BIOT and LaBraM. 4. Cross-device generalization (e.g., ear-EEG) demonstrates robust transferability. 5. Learned tokens show clear class discriminability and frequency awareness, aiding clinical understanding. Downstream tasks focus on seizure and hospital datasets. The diversity of task types is relatively weak. How long does it take to train the tokenizer? Lightly AI-edited
IncentRL: Bayesian Adaptation of Preference Gaps in Reinforcement Learning Soundness: 1: poor Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents a framework to augment the task rewards in reinforcement learning with an intrinsic reward based on the difference between observed and desired outcomes in the environment dynamics. The framework is general enough that it leaves freedom in how the preference model is designed, but only hand-crafted preferences are used throughout the paper instead of learned ones. Additionally, the paper presents ideas to autotune the parameter $\beta$ which controls the magnitude of the intrinsic incentive compared to the task rewards. In a few small MDP examples, the presented method achieves better performance than naive RL without any intrinsic incentive. The paper is clearly written and easy to follow, and the ideas are well communicated. Each of the components of the presented framework is described in detail and in a well-structured manner. The presented framework is general, since the KL between the desired and expected outcomes is a general idea that allows many different variants in the way both the dynamics and preference models are learned or used. The discussion around different ways of encoding the preference model is sound, and the links to the cognitive motivation and free-energy principle are sound. I find this paper to be very poorly contextualized. There are very few citations to other work in intrinsically-motivated RL [1,2,3,4,5,6] (just to list a few), or adaptations of the free energy principle and active inference theory to the RL framework [7,8]. There is a large body of work in these fields published over the last decade which the paper omits. The related work section is very brief and skips progress in these directions: not only empirical progress in the form of recently proposed intrinsically-motivated RL algorithms, but also the discussion and practices which make intrinsic rewards work in RL [6]. I wouldn't refer to the idea of adapting $\beta$ dynamically during training as the "central novelty" of this work in the Abstract, since the runs with adaptive $\beta$ are not even shown in the paper, but are said to achieve a similar performance to the fixed $\beta$ ones. I find the idea of using the KL divergence between predicted and desired outcomes in the environment to be more sound. The authors introduce the distribution $p(o|s,a)$ right after having defined fully-observed MDPs. To be correct, you should either specify that you are predicting the transition dynamics over states $p(s'|s,a)$ (which I believe is the case because the toy MDP used, MountainCar and MiniGrid are all fully-observed MDPs) or otherwise you should introduce partially-observed MDPs (POMDPs) making an explicit separation of the state space $\mathcal{S}$ and observation space $\mathcal{O}$. I find that Section 3.3 is not needed in the main paper since the discussion on the role of $\beta$ is straightforward and can be made more concise. I have a similar opinion of Sections 4.2 (the discussion and the equation are repeated in 4.1 and 4.2) and 4.3 (discussing latent representations for encoding the preference model, but not used anywhere in the paper). 
Instead, the authors should allocate more space to covering related work in the area and extending their evaluation. I wouldn't call the toy MDP presented in 5.2 a "sparse-reward" problem, since with a single (state, action) pair, the agent can experience a reward in the environment with 0.3 probability, offering enough supervision for training any vanilla RL agent to solve the task. Crucially, I don't think the provided empirical evidence supports the claims of the paper. I think the paper is missing a much broader evaluation of the method in more complex and recent benchmarks used for exploration (e.g., a subset of tasks from MiniGrid, ProcGen, Atari, Crafter, etc.) and, importantly, comparisons with existing methods designed for improved intrinsic exploration. The paper does not cover, cite, or explicitly state the differences of its method from prior work in the related work section, nor does it evaluate and compare against existing methods that are similar in design, have been used in the same environments, and hence are relevant baselines. [1] Pathak, Deepak, Dhiraj Gandhi, and Abhinav Gupta. "Self-supervised exploration via disagreement." International conference on machine learning. PMLR, 2019. [2] Guo, Zhaohan, et al. "Byol-explore: Exploration by bootstrapped prediction." Advances in neural information processing systems 35 (2022): 31855-31870. [3] Sekar, Ramanan, et al. "Planning to explore via self-supervised world models." International conference on machine learning. PMLR, 2020. [4] Kapturowski, Steven, et al. "Unlocking the Power of Representations in Long-term Novelty-based Exploration." Second Agent Learning in Open-Endedness Workshop. 2024. [5] Badia, Adrià Puigdomènech, et al. "Never give up: Learning directed exploration strategies." arXiv preprint arXiv:2002.06038 (2020). [6] Yuan, Mingqi, et al. "Rlexplore: Accelerating research in intrinsically-motivated reinforcement learning." arXiv preprint arXiv:2405.19548 (2024). [7] Berseth, Glen, et al. "SMiRL: Surprise minimizing RL in dynamic environments." arXiv preprint arXiv:1912.05510 (2019). [8] Hugessen, Adriana, et al. "Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning." arXiv preprint arXiv:2405.17243 (2024). How can the preferences be encoded in environments that are not fully-observed MDPs? Concretely, how can LLMs help with that? (since that is mentioned in the paper, but I don't understand how that would work). Fully human-written
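For concreteness, the reward shaping discussed in these reviews can be written as $r = r_{\text{ext}} - \beta\,\mathrm{KL}(q(o|s)\,\|\,p(o|s,a))$; the sketch below evaluates it for discrete outcome distributions. The outcome space, the example distributions, and $\beta$ are placeholder assumptions, and how $p(o|s,a)$ is obtained (a learned forward model versus the true dynamics) is exactly the ambiguity raised above.

```python
import numpy as np

def kl(q, p, eps=1e-12):
    """KL(q || p) for discrete distributions over the same outcome set."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

def shaped_reward(r_ext, q_pref, p_pred, beta=0.1):
    """Extrinsic reward plus the KL-based intrinsic incentive: r_ext - beta * KL(q || p)."""
    return r_ext - beta * kl(q_pref, p_pred)

# Two candidate actions in some state: one whose predicted outcomes match the preference,
# one that mostly leads elsewhere. Outcomes here are three abstract next states.
q_pref = [0.8, 0.1, 0.1]          # what the agent would like to happen
p_good = [0.7, 0.2, 0.1]          # predicted outcomes of action A
p_bad  = [0.1, 0.1, 0.8]          # predicted outcomes of action B
print(shaped_reward(0.0, q_pref, p_good), shaped_reward(0.0, q_pref, p_bad))
```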
IncentRL: Bayesian Adaptation of Preference Gaps in Reinforcement Learning Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The work proposes a novel reward shaping approach for reinforcement learning (RL) that computes intrinsic rewards as the negative KL divergence between a target distribution of outcomes and predicted outcome distributions. The outcome distributions are pre-defined and provide additional signal towards the learning of RL agents. The approach is evaluated in a toy MDP environment for illustrative purposes, as well as MountainCar and a MiniGrid environment as sparse-reward exploration tasks. Substantial gains in efficiency can be observed in the toy MDP, while minor benefits are observed in MountainCar and MiniGrid. Efficient learning under sparse rewards is a relevant and impactful problem to tackle. The idea of incorporating additional information from preferred outcomes is conceptually interesting, albeit not clearly executed (see weaknesses below). Overall, I am afraid that the work is clearly not of sufficient quality to be considered for acceptance at ICLR. Below, I try to provide key weaknesses that I believe should be addressed and would substantially strengthen the work. ## Originality and Prior Work 1. To start with, the work only loosely discusses prior work on intrinsic motivation and reward shaping in Section 2.1. There is a rich space of literature that ranges from future predictions (ICM and RND being cited), state visitation counts [4], density functions of states [3], and combinations of several schemes [1, 2], just to mention a few -- all propose ways to determine "novelty" or interestingness of states for exploration. I would advise the authors to look at the literature in this space in more detail. 2. In addition to a rich space of literature on defining intrinsic rewards for sample efficient learning, there also exists prior literature on balancing intrinsic and extrinsic rewards, similar to the proposed Bayesian approach of adapting $\beta$. Some examples are [5, 6]. 3. The work continually makes connections to the free energy principle and dopamine frameworks, but these are merely described as loose connections. It would be helpful if the authors would provide citations, definitions, and clearly outline any connections that they believe add to their work. [1] Raileanu, Roberta, and Tim Rocktäschel. "Ride: Rewarding impact-driven exploration for procedurally-generated environments." _arXiv preprint arXiv:2002.12292_ (2020). [2] Zhang, Tianjun, Huazhe Xu, Xiaolong Wang, Yi Wu, Kurt Keutzer, Joseph E. Gonzalez, and Yuandong Tian. "Noveld: A simple yet effective exploration criterion." _Advances in Neural Information Processing Systems_ 34 (2021): 25217-25230. [3] Bellemare, Marc, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. "Unifying count-based exploration and intrinsic motivation." _Advances in neural information processing systems_ 29 (2016). [4] Tang, Haoran, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. "# exploration: A study of count-based exploration for deep reinforcement learning." _Advances in neural information processing systems_ 30 (2017). 
[5] Schäfer, Lukas, Filippos Christianos, Josiah P. Hanna, and Stefano V. Albrecht. "Decoupled reinforcement learning to stabilise intrinsically-motivated exploration." _Autonomous agents and multi-agent systems (AAMAS) conference_ (2022). [6] Chen, Eric, Zhang-Wei Hong, Joni Pajarinen and Pulkit Agrawal. "Redeeming Intrinsic Rewards via Constrained Optimization." _Advances in neural information processing systems_ (2022). ## Clarity of Methodology and Experiments Below, I list further concerns I have regarding a lack of clarity in the defined method and experiments conducted as part of this work. ### Methodology 4. The introduction and Section 3, which define the proposed IncentRL approach, make use of "outcomes" in the form of predicted and preferred outcome distributions. However, what these outcomes are is never defined and not quite clear to me. Are outcomes future states that might be predicted, or some specific quantity of states? 1. Related to what outcomes are, it is not clear to me where the outcome distributions are coming from either and what assumptions are being made. It appears that the work assumes that preferred outcome distributions are specified to determine a preference over outcomes. But how do you specify the predicted outcome distribution $p(o | s, a)$? Section 3.1 mentions that this quantity "[...] may be obtained from a forward model or from environment dynamics" but it is unclear to me which of these is being done throughout the experiments. Do you train a forward model from experience tuples or do you assume access to the environment transition function? 5. In Section 5.4 of the experimental section, a Bayesian adaptation of $\beta$ is evaluated; however, no such adaptation scheme of $\beta$ is described when the method is introduced in Section 3. 6. The theoretical contributions in Propositions 1 and 2 both lack formal proofs. I would expect these to be at least provided as part of the Appendix. Also, both of these propositions do not appear to be particularly insightful since they merely talk about the extreme cases of extremely small and large $\beta$ values to draw fairly obvious conclusions. ### Experiments 7. In Section 5.1, the "Algorithm" paragraph mentions that the agent estimates its belief about possible next outcomes, but it is never defined how these predictions are obtained (connected to weakness 4.1). 8. The predicted and preferred outcome distributions for most experiments are incompletely defined or entirely undefined. Given that these are a central part of the main contribution of this work, this appears to be a lack of critical detail. 1. Experiment 1 (Section 5.2) only defines the predicted and preferred outcome distributions of the 2-state MDP for state $s_0$. The rest of the distributions appear undefined. 2. For Experiment 2 (Section 5.3) in MountainCar, the distribution is described as "preferred outcome assigning all probability to the goal ($x \geq 0.5$), and $p(o | s, a)$ is the predicted outcome distribution" but those are still vague to me. The described preferred outcome distribution appears to now be defined as a property of states $x \geq 0.5$ rather than full states as before (connected to the lack of definition of what outcomes are, see weakness 4), and the predicted outcome distribution is fully undefined. 3. For Experiment 3 (Section 5.4), these quantities appear entirely undefined. 9. 
According to the description of training details in Section 5.4 on Experiment 3, the results are averaged over 3 independent seeds but Figure 2 shows no indication of dispersion/deviation. I would expect a visualisation of the mean and shading to indicate dispersion (e.g., standard deviation, standard error, min/max). Similarly, the caption of Table 2 states "$\pm$ std over 3 seeds" but then no such deviations are shown in the table. 10. It appears that in Experiments 2 and 3, IncentRL provides no clear benefits. According to Table 2, only for one value of $\beta$ did IncentRL slightly outperform the base algorithm ($\beta = 0$), and it is not clear that these gains are significant without indication of deviation. For other values of $\beta$, performance dropped significantly with IncentRL compared to the baseline. In Figure 2, results also don't seem to show a very significant change from the base algorithm to IncentRL. 11. Figure 3 visualizes the (undefined; see weakness 5) Bayesian adaptation of $\beta$ and claims that "the posterior mean quickly concentrates near the effective region ($\beta \sim 0.1$)" but the Figure more so suggests that the posterior mean moves away from $0.1$ over several rounds and might converge closer to $0.02$, which I would not consider near the effective region of $0.1$. ## Significance 12. From a high-level perspective, this work states that, when given preferred outcomes, the algorithm can leverage this information under (somewhat unclear) assumptions to shape rewards and then learn more efficiently. This does not appear to be a very novel or significant contribution but rather a typical instance of reward shaping. This calls into question whether this work makes significant contributions that warrant publication. 1. How would you define "outcomes" that the predicted and preferred outcome distributions are defined over? (Weakness 4 for more details) 2. How do you obtain the predicted outcome distributions? Are these assumed to be given (like the preferred outcome distributions) or learned? (Weakness 4.1 and 7 for more details) 3. Could you describe the Bayesian adaptation process of $\beta$ that is being evaluated in Section 5.4? (Weakness 5 for more details) 4. How are the predicted and preferred outcome distributions defined for each of the experiments? (Weakness 8 for more details) 5. Experiments are stated to be repeated for 3 independent seeds but no metric of dispersion is being provided in Table 2 or Figure 2. Would the authors be able to provide standard deviation or any other indication of dispersion for these experiments to judge the significance of provided results? Fully human-written
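Because the Bayesian adaptation of $\beta$ is never specified, the sketch below shows one hypothetical reading only: Thompson sampling over a discrete grid of $\beta$ values, with an independent Gaussian posterior over each value's mean episodic return. This is an assumption made purely for illustration, not the authors' scheme, and `run_episode` is a stand-in for an actual RL training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

betas = np.array([0.0, 0.02, 0.05, 0.1, 0.5])   # hypothetical candidate grid
mu = np.zeros_like(betas)        # posterior mean of episodic return for each beta
n = np.zeros_like(betas)         # number of episodes run with each beta
noise_var = 0.01                 # assumed observation noise of episodic returns

def run_episode(beta):
    """Placeholder environment: episodic return happens to peak near beta = 0.1."""
    return float(-(beta - 0.1) ** 2 + 0.1 * rng.normal())

for _ in range(300):
    # Sample one plausible mean return per candidate from its posterior, pick the best.
    samples = rng.normal(mu, np.sqrt(noise_var / (n + 1)))   # n+1 keeps the std finite at n=0
    k = int(np.argmax(samples))
    r = run_episode(betas[k])
    n[k] += 1
    mu[k] += (r - mu[k]) / n[k]  # running mean = posterior mean under a flat prior

print("most-selected beta:", betas[int(np.argmax(n))], "selection counts:", n)
```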
IncentRL: Bayesian Adaptation of Preference Gaps in Reinforcement Learning Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper studies model-free RL and proposes to add a particular type of intrinsic reward to the standard extrinsic reward in order to boost performance. Specifically, they propose to be given a distribution over preferred outcomes conditioned on state and then to subtract the KL-divergence between this preferred outcome distribution and a predicted outcome distribution. Experiments on toy RL environments show that the approach potentially leads to more data efficient learning. - The method proposed is novel to the best of my knowledge. - I really like how the authors are connecting to other frameworks for intelligent behavior such as the free energy principle and ideas in cognitive science. I think the direction is interesting, despite the concerns I raise below about this particular instantiation of writing up the work. - The clarity of the paper could be significantly improved. For example, it would be helpful to specify formally what outcomes are. I wasn't sure if they were resulting next states or something else. A Bayesian adaptation scheme is mentioned multiple times (including the abstract), but never defined. - In several places it seems like there are bullet points embedded in paragraphs, suggesting that the paper was written last minute and still needs careful editing. - Motivation: it is unclear what the preferred outcome distribution is (formally) and why it is reasonable to expect a learning agent to have such a distribution. - Main method confusion: it was not clear if and how the agent's predicted outcome distribution was updated during learning. - In the empirical study, a small number of trials are reported and confidence intervals are wide (but type unspecified). Please see "Empirical Design in Reinforcement Learning" for great discussion on why these details matter in empirical RL research. Discussion of the weaknesses given above would be the most productive use of the rebuttal. Fully human-written
IncentRL: Bayesian Adaptation of Preference Gaps in Reinforcement Learning Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 0: Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes a method to provide intrinsic rewards to an RL agent during training, based on a KL divergence between the predicted outcome of actions and the preferred outcome. A Bayesian modulation of the KL parameter is claimed but not detailed in the paper. The method is evaluated in three toy environments, demonstrating improved learning efficiency. - Reinforcement learning under sparse rewards is a difficult challenge. - The proposed method is intuitively interesting. - The paper attempts a formal analysis in addition to empirical results, which is appreciated. Unfortunately, the paper has some significant flaws: 1) A Bayesian modulation of the beta-parameter is claimed, but no details are provided in the paper about the method. 2) The prediction $p(o|s,a)$ and preference $q(o|s)$ are not clearly defined. What is the domain of "o"? For "q", the paper describes it as "what the agent would like to happen in state s"; happen after what? Should "a" be included here? 3) Proposition 1 is a main theoretical property claimed in the paper, but no full proof is provided (only a proof sketch). I am also not sure what the authors mean by "If the external reward $r_{\text{ext}}$ admits an optimal policy $\pi^*$", because any MDP admits an optimal policy. 4) There are a number of other methods for intrinsic rewards, but the empirical evaluation does not compare to any other prior methods. For example, it would be interesting to include DeRL (see below) as a baseline. 5) Related work: a core motivation of the paper is that prior methods require careful modulation of the hyper-parameter that combines the intrinsic and extrinsic rewards. This problem is tackled in DeRL [1] (https://arxiv.org/abs/2107.08966), which is not mentioned in this paper. DeRL eliminates the hyper-parameter by decoupling policy training for intrinsic and extrinsic rewards. The paper is also quite repetitive; for example, the combined reward and return are defined repeatedly in several places. I also could not see a full specification of the hyper-parameters used in the evaluated RL algorithms (reproducibility). I think this work could have potential and I hope my comments will be useful in the next version. How does the proposed method compare to a method like DeRL? Fully human-written
AFMCC: Asynchronous Federated Multi-modal Constrained Clustering Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes AFMCC for federated multimodal unsupervised clustering with (i) a Class-Correlation Matrix (CCM) constraint that projects features into a pseudo-probability space and penalizes deviations from a relaxed target; (ii) a client-specific weighted aggregation; and (iii) asynchronous training to tolerate heterogeneous compute. Experiments on several benchmarks report ACC/NMI/ARI gains. 1. Broad problem surface: the method tries to address degeneration in contrastive clustering, missing modalities, and client asynchrony in one framework. 2. The experimental results show excellent performance, leading on multiple metrics across various datasets. 1. Limited novelty; heavy rebranding. The "particle-dynamics three-force" story largely repackages standard contrastive attraction/repulsion plus a global regularizer, and relies on strong approximations. 2. Unrealistic core assumptions. The CCM derivation assumes equal class sizes and a known K (see the definition of Q and text around Eq. 4), with no robustness analysis under heavy class imbalance or unknown K—both common in FL. 3. There is no reporting of wall-clock time, communication rounds, bandwidth, or staleness-vs-accuracy curves—so the claimed training-time reduction remains unsubstantiated. 1. What are the mathematical properties of the aggregated weights? How are the weights of Equation (6) constructed from the deviation quantities? 2. How does the CCM behave under long-tail and cross-client imbalance? Can you couple AFMCC with non-parametric clustering when K is unknown? 3. Why is this algorithm effective, and can the authors provide a more convincing theoretical proof? Lightly AI-edited
AFMCC: Asynchronous Federated Multi-modal Constrained Clustering Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses the issues of representation degradation, modality absence, and imbalance in federated multimodal clustering by proposing an asynchronous federated multimodal constrained clustering method, referred to as AFMCC. This method prevents representation degradation and enhances the separability of multimodal data clustering by calculating the Class-Correlation Matrix $Q$ between different categories and integrating it into a loss function with the target matrix $Q_{tgt}$. Additionally, the article designs a client-specific weighted aggregation approach to effectively handle the problem of modality absence. Experimental results from various benchmark tests demonstrate that AFMCC outperforms other methods. 1. This paper addresses the issues of modality absence and representation degradation in multimodal clustering, which holds significant research value. 2. It also provides open-source code and datasets, offering strong support for community development. The innovation of this method is relatively limited. While the design of the Class-Correlation Matrix (CCM) is interesting, the calculation details of the target matrix are unclear. What is the specific process for calculating $Q_{tgt}$? What is the difference between $P_{ai}$ and $P$? Does the calculation of $Q$ require that all categories of data be present in each client or batch? Additionally, the design of the weighted aggregation has a high computational complexity, as it requires all other clients' models to compute features locally. Assigning higher weights to clients with poor cross-modal feature alignment appears unreasonable and requires further explanation. Furthermore, the design for asynchronous training appears lacking, making it difficult to effectively resolve issues arising from asynchronous client communication or clients going offline. 1. The details of the formulas in the article need further modification to improve readability. For example, the matrix $A$ in line 207, the calculation logic of $Q$ in line 248, and the definitions of $Q_{ab}$ and $I_K$ in line 253 are not sufficiently clear. These formulas require further explanation. 2. This paper requires further clarification on how it addresses asynchronous aggregation and alleviates computational imbalance issues. Lightly AI-edited
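One plausible construction consistent with what these reviews describe (assumed here purely for illustration, since the paper's exact definitions of $Q$, $Q_{tgt}$, and $P_{ai}$ are what the reviews ask to be clarified) correlates soft cluster assignments across classes and penalizes deviation from a relaxed identity target, which implicitly presumes roughly balanced clusters and a known $K$:

```python
import torch
import torch.nn.functional as F

def ccm_loss(logits, lam=0.1):
    """Illustrative class-correlation-matrix penalty (not AFMCC's exact definition):
    soft assignments are correlated across classes and pushed toward a relaxed identity."""
    P = F.softmax(logits, dim=1)                     # (N, K) pseudo-probabilities
    Q = P.t() @ P                                    # (K, K) class-correlation matrix
    Q = Q / Q.sum()                                  # normalize to a joint distribution
    K = logits.shape[1]
    Q_tgt = (1 - lam) * torch.eye(K) / K + lam * torch.ones(K, K) / K**2   # relaxed identity
    return ((Q - Q_tgt) ** 2).sum()

logits = torch.randn(256, 10, requires_grad=True)    # one client's batch of 256 samples, K = 10
loss = ccm_loss(logits)
loss.backward()
print(float(loss), logits.grad.shape)
```

Under such a construction, the questions above about unequal cluster sizes and categories missing from a client or batch become concrete: both directly bias $Q$ away from the balanced target.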
AFMCC: Asynchronous Federated Multi-modal Constrained Clustering Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper presents an asynchronous federated multi-modal constrained clustering method, which adapts to scenarios with arbitrary missing modalities. This method directly fuses multimodal embeddings into a shared embedding by weighted aggregation. By introducing a class-correlation matrix, it alleviates the degradation of contrastive learning in multimodal clustering. Extensive experiments are performed and detailed theoretical analyses are provided. 1. The paper is well-written, and the motivations of the work are clear; 2. Theoretical analyses are solid; 3. The overall design is reasonable. 1. The introduction of the class-correlation matrix seems to have been adopted by several works, which limits the novelty; 2. The main solution to the arbitrary modality missing problem is to aggregate view-specific embeddings with a calculated weight, which seems somewhat trivial. 3. Experiments seem insufficient. 4. Some text in the figures is small. 1. Do the experiments consider a non-IID distribution of data? 2. Does the method (Section 3.2 on page 5) require an equal distribution of samples in each cluster? In practice, the data distribution of each client might be highly different (i.e., the non-IID issue). As a result, even with a loose constraint based on the class-correlation matrix, whether the equal distribution is reasonable in practice should be discussed. 3. The problem statement sets the number of samples of each client to $N$, which might not be appropriate and should be corrected. Fully human-written
AFMCC: Asynchronous Federated Multi-modal Constrained Clustering Soundness: 1: poor Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper explores a new insight into why contrastive learning fails in federated clustering, especially when different clients observe different modality subsets. To address this challenge, the authors introduce a new constraint mechanism to avoid clustering degeneration over time, as well as a new client-specific aggregation method. * Explaining contrastive learning from both a probabilistic and a particle view is interesting. * Extensive baselines and benchmark datasets. * Confusing writing. The introduction and related work fail to clarify the motivation and the problem. The distinction between federated multimodal learning and federated clustering is unclear from the authors' writing. * Lack of motivation. It is unclear why we need to solve this problem. What is the difference between multimodal clustering in centralized and federated settings? What are the benefits of federated clustering? * Unclear problem formulation. What is $K$ in line 160, and how is this number determined? Why are these clusters enforced to be balanced, when in standard settings clusters can vary in size based on the data distribution? * Lack of literature review. Contrastive-learning-based regularization is commonly used in multimodal learning with different variants [1,2], which are theoretically guaranteed. Why can these regularizations not handle the clustering tasks, since they are designed to implicitly cluster modalities for downstream tasks? The authors should expand their literature review to highlight their contributions. * Lack of contributions. The proposed method, while adding explanation and new insights about federated clustering, seems to be an improvement of FMCSC [3]. Empirically, Figure 3b shows that the constraint loss – one of the main contributions – does not affect the performance significantly. [1] Nguyen et al., Learning Reconfigurable Representations for Multimodal Federated Learning with Missing Data, NeurIPS'25 [2] Nguyen et al., Fedmac: Tackling partial-modality missing in federated learning with cross modal aggregation and contrastive regularization. NCA'24 [3] Chen et al., Bridging Gaps: Federated Multi-View Clustering in Heterogeneous Hybrid Views, NeurIPS'24 See Weaknesses Fully human-written
A Neuro-symbolic Approach to Epistemic Deep Learning for Hierarchical Image Classification Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper studies uncertainty-aware learning with neuro-symbolic models. In particular, it suggests applying subjective logic to the high-level predictions of a hierarchical classification setup. Combining focal set reasoning and differentiable fuzzy logic, the paper arrives at a new loss function that can be dropped into a feed-forward prediction pipeline. The goal is to improve calibration in a more interpretable way than existing methods. The suggested approach has been evaluated on a transformer variant, the Swin transformer, and tested on two standard hierarchical classification benchmarks. * The drop-in property of the approach makes it generically applicable. * The studied topic is important for the safe use of deep learning technologies. * There exists abundant prior work on uncertainty calibration in deep neural nets. However, the paper does not provide a comparison against the state of the art in the field. The authors can find a sizeable list of alternative methods even in a paper from almost half a decade ago [1]. The paper claims to have a comparison against the old Guo et al. baseline, but it is not available anywhere in the paper. * The paper exhibits a convoluted and unstructured presentation practice. It starts from an abstract that lacks a meaningful progression of arguments and continues with a similar introduction. For example, the second sentence says that deep neural nets are miscalibrated and logically inconsistent. These two are different problems. Which one is our focus? The third sentence says these problems are problematic in structured classification tasks. What does this mean, and why is uncertainty calibration particularly problematic in structured tasks? The paper introduces the studied data sets in the methodology section and does not really introduce a concrete methodology anywhere. * The suggested combination of techniques such as differentiable fuzzy logic and focal set reasoning has not been justified anywhere. The added value over the alternative tracks of uncertainty calibration has not been pointed out. The related comparisons to the state of the art are also missing. * The logical trail of the suggested solution doesn't follow a clear rationale. Section 4 introduces the architectural elements of a standard hierarchical classifier. It then jumps to introducing some basic elements from fuzzy logic in Section 5 and an existing application of it to probabilistic deep learning called RS-CNN. However, it does not explain what this prior work is doing, which aspects of it are relevant for the problem at hand, and which limitation of it will be overcome. Then Section 6 admits to following the ROAD-R approach without explaining or motivating it, followed by some performance score definitions. These pieces do not really come together to make a concrete scientific hypothesis. As I point out in the questions section below, all this endeavour is also missing a clearly stated purpose. * Section 9 doesn't specify an experiment plan. It is not possible to see the big picture from the way the results are presented. 
Tables 2 and 3 in the appendix give further details, and the only takeaway I can extract from these tables is that all compared models perform comparably. [1] Minderer et al., Revisiting the Calibration of Modern Neural Networks, NeurIPS, 2021 * How generalizable are the proposed findings across different neural architectures? The Swin transformer is a very specific architecture. Why should it be the only considered backbone architecture? Why does it have to be so central to the storyline? Which property of this architecture makes it representative? * Having read the whole paper, I am left a bit confused about the end goal of the paper. Is it to improve the calibration of the uncertainty predictions of deep learning algorithms as studied in the experiments or to improve the structural consistency of the calibration methods as claimed in the first sentence of the abstract? This is not a merely aesthetic concern. If the only measurable effect of the suggested improvement is improved calibration scores, I am missing why we need all the complications introduced by the subjective logic concepts. Furthermore, I then also wonder why the state of the art in post-hoc uncertainty calibration is sidestepped. I would at least expect to see a comparison against temperature scaling. If the goal is to improve explainability, physical consistency, or interpretability, where is the related experiment and the result demonstrating that the suggested approach solves the problem better than what is known? Fully human-written
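For reference, the post-hoc baseline requested above is cheap to add; below is a minimal sketch, using synthetic placeholder validation logits and labels, of fitting a single temperature by minimizing NLL (Guo et al.) and reporting the expected calibration error before and after scaling. The binning scheme is one common choice rather than the submission's protocol.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import softmax

def nll(T, logits, labels):
    """Negative log-likelihood of temperature-scaled logits."""
    probs = softmax(logits / T, axis=1)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def temperature_scale(logits, labels):
    """Fit a single temperature on validation logits by minimizing NLL."""
    res = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels), method="bounded")
    return res.x

def ece(probs, labels, n_bins=15):
    """Expected calibration error with equal-width confidence bins."""
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    bins, err = np.linspace(0.0, 1.0, n_bins + 1), 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            err += mask.mean() * abs((pred[mask] == labels[mask]).mean() - conf[mask].mean())
    return err

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=1000)
logits = 3.0 * rng.normal(size=(1000, 5))
logits[np.arange(1000), labels] += 2.0               # make the synthetic classifier overconfident-ish
T = temperature_scale(logits, labels)
print("T =", round(float(T), 2),
      "ECE before:", round(ece(softmax(logits, axis=1), labels), 3),
      "ECE after:", round(ece(softmax(logits / T, axis=1), labels), 3))
```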
A Neuro-symbolic Approach to Epistemic Deep Learning for Hierarchical Image Classification Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors introduce a new neuro-symbolic architecture for hierarchical classification tasks. This architecture combines a pre-trained Swin transformer with a rather complex combination of belief functions (taking inspiration from random-set neural networks) and fuzzy logic (taking inspiration from other fuzzy-logic-based NeSy approaches). The aim is that of obtaining calibrated (low ECE) predictions that satisfy the hierarchical constraints with high probability. Experiments are carried out on two datasets and against two competitors (MultiPlexNet and RS-NN). **Originality**: This is the first time I see NeSy, fuzzy logic and belief functions all combined in the same package. While the different pieces already exist, their combination is novel. No complaints on my end. **Quality**: The overall architecture is generally sensible. **Significance**: Combining calibration and rule satisfaction is a good idea. **Clarity**: The structure of the paper is a bit odd and many important details are not explained in an intuitive manner. - For instance, the datasets used for evaluation are introduced in Section 3.1, before the method and far away from the experiments; it would be best to move the description to the experiments. - I found Sections 5-8 unnecessarily complicated. The authors assume the reader is familiar with belief functions, focal sets and other hyper-specialized concepts (like the architectures of RS-NN and ROAD-R). This is not necessarily the case. Equations are provided without any intuition as to what they are supposed to do. It *is* possible to make out what the authors mean, but the text doesn't make it easy. I strongly recommend that the authors provide clear intuitions for each and every equation. Adding a figure depicting the intended information flow in the model would also help. Moreover, standard quantities (like the definition of the Gaussian distribution, which appears twice) can be removed. Overall, clarity is impaired by these issues and the paper as a whole feels unpolished. It is also shorter than 9 pages (although this is just a symptom, not a problem by itself). **Quality**: The experiments are not convincing, for several reasons. - They only consider two datasets. The original works by Giunchiglia (cited by the authors as an inspiration -- Coherent hierarchical multi-label classification networks; NeurIPS and its journal version) provide **twenty** already-implemented hierarchical classification tasks that could be used for evaluation. It's not clear why the authors focus on just two. - The choice of competitors is not ideal. Giunchiglia's own approach is not compared against. More recent follow-ups, such as semantic probabilistic layers [1], are not compared against. In a nutshell, the experiments do not consider the state of the art in NeSy hierarchical classification. - The authors also neglect NeSy approaches specifically designed for calibration, such as BEARS [2] and NeSy diffusion [3]. (Admittedly, the last one might be *too* recent, feel free to disregard it if so; BEARS, however, is not.) - The choice of evaluation metrics is also not well motivated. 
Why top-1 accuracy? Why not use the same metrics used by Giunchiglia in their work and follow-ups? Given the above, it is difficult to gauge the relative effectiveness and generality of the proposed approach. This limitation by itself is sufficient to make me lean toward rejection. **Significance**: very difficult to assess, given how limited the experiments are. [1] Ahmed et al., Semantic Probabilistic Layers for Neuro-Symbolic Learning, NeurIPS 2022. [2] Marconato et al., BEARS Make Neuro-Symbolic Models Aware of their Reasoning Shortcuts, NeurIPS 2024. [3] van Krieken et al., Neurosymbolic Diffusion Models, arXiv 2025. Feel free to comment on any of the weaknesses I pointed out. Fully human-written
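To ground what a differentiable logical constraint buys in this setting, the sketch below encodes 'child implies parent' edges of a toy hierarchy with the Lukasiewicz relaxation, so the penalty max(0, P(child) - P(parent)) is zero exactly when predictions respect subsumption. The hierarchy, t-norm choice, and probabilities are assumptions for illustration, not the submission's actual constraint set or loss.

```python
import torch

def hierarchy_violation(probs, edges):
    """Differentiable penalty for 'child implies parent' constraints.
    Lukasiewicz relaxation: the implication c -> p is violated by max(0, P(c) - P(p)).
    `edges` is a list of (child_index, parent_index) pairs; the hierarchy is assumed."""
    loss = 0.0
    for child, parent in edges:
        loss = loss + torch.clamp(probs[:, child] - probs[:, parent], min=0).mean()
    return loss

# Toy 2-level hierarchy over 5 classes: classes 0 and 1 are children of 3, class 2 of 4.
edges = [(0, 3), (1, 3), (2, 4)]
logits = torch.randn(8, 5, requires_grad=True)
probs = torch.sigmoid(logits)                # multi-label probabilities per class
penalty = hierarchy_violation(probs, edges)
penalty.backward()                           # gradients push violating children below parents
print(float(penalty), logits.grad.shape)
```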