|
Network of Patterns: Time Series Forecasting with Pattern Passing |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a novel time series forecasting framework, Network of Patterns (NoP). It breaks through the limitations of traditional chain and tree-based pattern aggregation by measuring pattern similarity in the frequency domain (Spectrum KL Divergence) and organizing multi-scale time segments into a network structure. Furthermore, the paper designs a Pattern Passing mechanism, which enables flexible transmission and fusion of cross-scale information between network nodes, thereby achieving efficient modeling of complex cycles and multi-scale dependencies.
The paper proposes to organize the multi-scale patterns of a time series as a "Network-of-Patterns (NoP)" rather than the traditional chain or tree structures. The logical structure of Introduction, Related Work, Methodology, Experiments, Ablation Studies, and Appendix is rigorous. The ablation experiments (w/o FFT, w/o Network, w/o PPB, w/o Virtual Pattern) convincingly verify the effectiveness of each module.
How does the time complexity of SKL computation and network construction scale with sequence length or the number of nodes? Can it be extended to ultra-long sequences (e.g., 336, 720)? A sketch of my reading of the cost follows below.
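To make the complexity question concrete, here is a minimal sketch of Spectrum KL Divergence as I understand it from the paper (my reconstruction, assuming equal-length segments; not the authors' code). Each pair costs O(L log L) for the FFTs, and building the pattern network over N segments requires O(N^2) such comparisons:

```python
import numpy as np

def spectrum_kl(seg_a: np.ndarray, seg_b: np.ndarray, eps: float = 1e-12) -> float:
    """KL divergence between the normalized FFT magnitude spectra of two segments."""
    p = np.abs(np.fft.rfft(seg_a)) + eps   # O(L log L) per segment
    q = np.abs(np.fft.rfft(seg_b)) + eps
    p, q = p / p.sum(), q / q.sum()        # treat spectra as probability distributions
    return float(np.sum(p * np.log(p / q)))

# Network construction compares all segment pairs: O(N^2) SKL calls.
segments = [np.random.randn(96) for _ in range(8)]
adjacency = np.array([[spectrum_kl(a, b) for b in segments] for a in segments])
```

If this reading is correct, the quadratic pairwise term is what I would expect to dominate for long inputs.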
In the long-term forecasting task, according to Table 8 in TimeFilter [1] and Table 8 in your manuscript, under the same settings, the performance of the proposed NoP model is clearly weaker than that of TimeFilter.
In the short-term forecasting task, Table 4 in TimeMixer [2] shows that the model significantly outperforms TimesNet [3]. Why does TimeMixer instead perform worse than TimesNet in your manuscript? Besides, according to Table 4 in TimeMixer and Table 9 in your manuscript, under the same settings, the performance of the proposed NoP model is clearly weaker than that of both TimeMixer and TimesNet. Were the baselines down-tuned here?
[1] Hu Y, Zhang G, Liu P, et al. TimeFilter: Patch-specific spatial-temporal graph filtration for time series forecasting[J]. arXiv preprint arXiv:2501.13041, 2025.
[2] Wang S, Wu H, Shi X, et al. TimeMixer: Decomposable multiscale mixing for time series forecasting[J]. arXiv preprint arXiv:2405.14616, 2024.
[3] Wu H, Hu T, Liu Y, et al. TimesNet: Temporal 2D-variation modeling for general time series analysis[J]. arXiv preprint arXiv:2210.02186, 2022.
As in Weaknesses. |
Fully human-written |
|
Network of Patterns: Time Series Forecasting with Pattern Passing |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes a time series forecasting method called NoP, which decomposes the sequence into multi-scale pattern segments, constructs a network structure among the segments using frequency-domain metrics, and aggregates information through a pattern propagation mechanism.
S1: All the modules used in this paper are relatively mature, and the overall model design is fairly reasonable.
S2: The description in the paper is clear.
W1: Although the authors consider some pattern-segment-based methods, they overlook several important baseline models based on periodic modeling, such as the linear-based CycleNet [1] and the RNN-based PGN [2]. Moreover, these methods are not included in the experimental comparisons, which limits the comprehensiveness of the model performance evaluation.
[1] CycleNet: Enhancing Time Series Forecasting through Modeling Periodic Patterns. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
[2] PGN: The RNN's New Successor is Effective for Long-Range Time Series Forecasting. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
W2: The experimental section of the paper has significant flaws due to the lack of clear description of the hyperparameter search space. It remains unclear whether the authors conducted a standard hyperparameter tuning procedure for all baseline methods (i.e., optimizing hyperparameters on the validation set and reporting final results on the test set). To ensure reproducibility and fairness, the authors should provide the following:
(a) If hyperparameter search was conducted, please provide the complete search space and the final selected parameter values for each task across different datasets;
(b) If no systematic hyperparameter tuning was performed, the credibility of the current experimental results is questionable. Since different hardware platforms can affect model performance, to ensure a rigorous comparison, all baseline methods should be rerun on the authors’ unified experimental platform, tuned using the same broad hyperparameter search space, and the best parameters selected on the validation set should then be evaluated on the test set. Without these supplementary experiments, the reliability of the current conclusions cannot be established.
W3: The efficiency experiments presented by the authors are also clearly insufficient. Although the paper claims that NoP achieves lower MSE with fewer parameters, it is unclear whether this conclusion holds across all tasks. Additionally, it is not stated whether key structural hyperparameters of all baseline models (e.g., hidden dimensions, number of layers) were unified during the efficiency comparison. To perform a fair efficiency analysis, the impact of differences in model scale must be controlled, which differs from the logic of experiments aimed solely at validating model performance. Otherwise, deliberately choosing extremely small parameter configurations for certain tasks to claim “high efficiency” would seriously compromise the rigor of the experiments and the fairness of the conclusions.
See Weaknesses. |
Lightly AI-edited |
|
From Minutes to Days: Scaling Intracranial Speech Decoding with Supervised Pretraining |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose leveraging ambient audio data from long intracranial studies in a contrastive supervised pre-training stage. In turn, this enables learning from intracranial signals over the length of a study, vastly increasing the amount of training data available. The authors show that pre-training a contrastive model with this data, allows it to generalise, with some fine-tuning, to downstream speech comprehension / audio listening tasks. The results also indicate that the pre-training scales log-linearly, suggesting further data could continue to improve generalisation performance.
- Interesting idea to leverage ambient audio for a supervised pre-training stage
- Fine-tuning the pre-trained model seems to convincingly beat the baseline
- Error bars and statistical tests included show that improvements are significant
- Performance appears to scale log-linearly with pre-training data between 0-100 hours
- Missing baselines: Please include (1) an end-to-end baseline where you train your full architecture directly on the supervised data and (2) a baseline where you train a linear layer directly on the raw iEEG of the downstream data. Without these, it’s hard to determine whether the pre-training was necessary at all.
- Minor: Line 126-128: Özdogan et al. 2025 quotes some of the work from [A] so this should also be cited here. Similarly, line 441/442 discusses unsupervised models, for which you may also wish to cite [B] and [C] for intracranial unsupervised foundation models.
I am open to moving towards recommending acceptance if the authors can address the above concerns satisfactorily.
[A] Jayalath, D., Landau, G. and Jones, O.P., 2025. Unlocking non-invasive brain-to-text. arXiv preprint arXiv:2505.13446.
[B] Wang, C., Subramaniam, V., Yaari, A.U., Kreiman, G., Katz, B., Cases, I. and Barbu, A., 2023. BrainBERT: Self-supervised representation learning for intracranial recordings. arXiv preprint arXiv:2302.14367.
[C] Zhang, D., Yuan, Z., Yang, Y., Chen, J., Wang, J. and Li, Y., 2023. Brant: Foundation model for intracranial neural signal. Advances in Neural Information Processing Systems, 36, pp.26304-26321.
- Why resample the brain data to 40Hz for the architecture? Intracranial recordings often pick up gamma and high-gamma band frequencies that may be relevant for speech perception [D] and could improve results. The Defossez et al. (2023) architecture was designed for non-invasive (MEG) where these frequencies are often low-signal or noise, but in intracranial recordings they are likely to be useful.
- Why use the ambient data as a pre-training stage at all? What happens when you jointly train with the ambient data as well as the true audiobook data?
[D] Mugler, E.M., Patton, J.L., Flint, R.D., Wright, Z.A., Schuele, S.U., Rosenow, J., Shih, J.J., Krusienski, D.J. and Slutzky, M.W., 2014. Direct classification of all American English phonemes using signals from functional speech motor cortex. Journal of neural engineering, 11(3), p.035015. |
Fully human-written |
|
From Minutes to Days: Scaling Intracranial Speech Decoding with Supervised Pretraining |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces a supervised pretraining framework for intracranial EEG (iEEG)-based speech decoding, leveraging week-long ambient and task-based brain-audio recordings from epilepsy patients. Using a contrastive learning approach, the authors align neural signals with representations from a pretrained speech model (wav2vec 2.0), scaling dataset sizes by orders of magnitude compared to traditional short, controlled experiments. The work demonstrates that pretraining on large-scale, ambient recordings significantly improves downstream decoding performance with robust log-linear gains as data expands, while detailed representational analyses reveal substantial cross-day drift in neural embeddings.
1. Real-world relevance: The authors effectively leverage week-long clinical iEEG recordings paired with ambient audio—data typically discarded—to scale training data by over two orders of magnitude. This represents a meaningful step toward real-world, scalable brain-speech decoding and is clearly motivated and illustrated (Figure 1).
2. Rigorous and comprehensive experimental validation: The pretraining framework consistently improves downstream speech decoding across all three subjects, with statistically significant gains (Figure 2A). The log-linear scaling with pretraining data quantity (Figure 2B) and sensitivity analyses (e.g., finetuning data ablation in Figure 4A) further strengthen the claims.
3. Representational and distribution shift analysis: The paper provides a clear analysis of the distribution shift between ambient and true audiobook sounds (Figure 3) and demonstrates the necessity of finetuning. The comparison between wav2vec 2.0 and melspectrogram features (Figure 5) offers valuable insights into which acoustic representations align better with neural activity.
4. Neurophysiologically informative embedding analysis: The UMAP visualizations and linear decoding analyses (Figures 6, 10) reveal meaningful structure in the learned embeddings, particularly the day-to-day drift in neural representations—a finding with important implications for future model design and clinical translation.
1. Limited comparison to recent state-of-the-art baselines: The paper does not adequately situate itself within the rapidly evolving literature on neural decoding. Key recent works—such as self-supervised pretraining on iEEG [1,2] and cross-subject or cross-session transfer learning [3]—are not discussed or compared. This omission weakens the claim of methodological novelty.
2. **Incomplete coverage of pretraining innovations in brain decoding:** While this paper emphasizes supervised pre-training on environmental data, it lacks a detailed overview of the results from related foundation models [4,5] that also utilize large-scale neural data. Therefore, a deeper exploration is needed regarding the connections between this work and these methods, and in what ways they represent breakthroughs.
3. Lack of neural-level interpretability and spatial ablation: The embedding analyses are informative but do not directly link to neural anatomy or functional localization. Ablations over electrode groups (e.g., auditory vs. non-auditory cortex) or analysis of how different brain regions contribute to the learned representations would strengthen the interpretability and biological plausibility of the model.
4. Superficial handling of temporal non-stationarity: Although the paper identifies day-to-day drift as a key challenge, the proposed model does not explicitly account for it. Incorporating temporal adaptation mechanisms—such as domain-adversarial training, sliding-window normalization, or time-aware embeddings—could improve robustness and generalization, and should be explored or at least discussed as a future direction.
**References:**
[1] Wu, D., Li, S., Feng, C., Cao, L., Zhang, Y., Yang, J., & Sawan, M. (2024). Towards Homogeneous Lexical Tone Decoding from Heterogeneous Intracranial Recordings. *arXiv preprint arXiv*:2410.12866.
[2] Zheng, H., Wang, H., Jiang, W., Chen, Z., He, L., Lin, P., ... & Liu, Y. (2024). Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals. *Advances in Neural Information Processing Systems, 37*, 79996-80033.
[3] Singh, A., Thomas, T., Li, J., Hickok, G., Pitkow, X., & Tandon, N. (2025). Transfer learning via distributed brain recordings enables reliable speech decoding. *Nature Communications, 16*(1), 8749.
[4] Zhang, D., Yuan, Z., Yang, Y., Chen, J., Wang, J., & Li, Y. (2023). Brant: Foundation model for intracranial neural signal. *Advances in Neural Information Processing Systems, 36*, 26304-26321.
[5] Chau, G., Wang, C., Talukder, S., Subramaniam, V., Soedarmadji, S., Yue, Y., ... & Barbu, A. (2025). Population transformer: Learning population-level representations of neural activity. *arXiv preprint arXiv*:2406.03044.
See Weaknesses. |
Fully AI-generated |
|
From Minutes to Days: Scaling Intracranial Speech Decoding with Supervised Pretraining |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper describes week-long intracranial and audio recordings used to train a contrastive learning model. Learned representations seem to suggest that brain activity represents speech features, but that its global structure shifts across days, which identifies the practical problem that this shift ought to be explicitly accounted for.
- It is a strength that large amounts of data (over the course of a week) can be effectively used, apparently scalably. It is hard to assess the "over two orders of magnitude" claim (L17), though. This also reveals one of the main insights, regarding the cross-day neural drift and the need to correct for it.
- It is only the most minor of complaints, but the format of the Introduction is not quite typical of a scientific publication. It is suggested to omit the boldfaced headings, or to add a more narrative opening. Some claims are mentioned 'loosely' (e.g., "patients...typically spend about a week", "about 100X more neural data") or without citation. The writing generally can be tightened up and improved.
- Although references and related work are distributed throughout the paper, these tend to be isolated to specific decisions (e.g., the wav2vec2 model used). It may have been easier to identify the apparent novelty of the work were it couched in a full, contextualized background section.
- The core of the work is a standard CLIP(-like?) contrastive alignment with typical objectives -- there is no novel architectural, objective, or analytical contribution.
- The experiments are within-subject for a relatively small collection of patients. An ongoing problem in this community is how to either build thinker-independent models from scratch, or how to use foundation models that are generalizable, so such small-N data (in terms of patients) can be leveraged. At least for generalizability, the empirical results are narrow. Additional ablations or modifications of adjustable parameters would also be expected.
- L42: Are you suggesting that there is a tradeoff between EEG and MEG in time-v-spatial resolution?
- L46: the moment participants perform an overt speech task can be disastrous for EEG. Is overt speech in EEG not included in 'typically'?
- L122: In your loss, is it the case that the objective is to pick the right V for a given U? This makes sense, but is CLIP typically symmetric? (A sketch of what I mean follows after these questions.)
- IBID: Negatives appear in batch only? When batch is ~128, is the number or variety of negatives modest? |
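To make the last two questions concrete, here is a minimal sketch of the symmetric CLIP-style objective with in-batch negatives that I believe is being used (my reconstruction with hypothetical shapes, not the authors' code):

```python
import torch
import torch.nn.functional as F

def clip_loss(u: torch.Tensor, v: torch.Tensor, temp: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE; u = brain embeddings, v = audio embeddings, both (B, D)."""
    u, v = F.normalize(u, dim=-1), F.normalize(v, dim=-1)
    logits = u @ v.t() / temp                 # (B, B); diagonal entries are positives
    labels = torch.arange(u.size(0))          # negatives come from the batch only
    return 0.5 * (F.cross_entropy(logits, labels)          # pick the right V for each U
                  + F.cross_entropy(logits.t(), labels))   # and the right U for each V

loss = clip_loss(torch.randn(128, 256), torch.randn(128, 256))  # ~127 negatives per pair
```

If only the first cross-entropy term is used, the loss is the asymmetric variant my question refers to.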
Fully human-written |
|
From Minutes to Days: Scaling Intracranial Speech Decoding with Supervised Pretraining |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a framework to scale intracranial (iEEG) speech decoding by leveraging week-long, continuous brain and audio recordings as supervised pretraining data. Using a contrastive learning model that aligns brain activity with pretrained audio representations (wav2vec 2.0), the approach demonstrates significant gains over models trained solely on short, controlled datasets. The study reveals that pretraining performance improves log-linearly with the amount of data and that downstream performance on controlled tasks benefits robustly from large-scale pretraining followed by supervised finetuning. Analysis of the learned embedding spaces highlights issues of cross-day neural drift and distributional shift between ambient and experimental audio.
1. The methodology is clearly formalized, with careful and transparent documentation of preprocessing, architecture, and experimental protocols.
2. The empirical evaluation is rigorous: the impact of pretraining is shown clearly in Figure 2, and the log-linear relationship between data amount and downstream performance is rare in the brain decoding literature.
3. The analysis of representation drift (Figure 6 and 9) is a valuable, often neglected aspect, revealing new neuroscientific challenges that arise with longer time windows.
1. Both the model architecture and training paradigm are directly adopted from [1] (Line 133). The dataset was not collected by the authors either; they directly selected 3 of the 46 subjects in [2], which is not publicly available. The code link does not belong to this project but was copied from [1].
2. The experimental evaluation included only 3 subjects from [2] and lacked comparison with advanced sEEG decoding baselines [3-6], which makes it difficult to position the contribution of this article.
3. Although the idea of using sEEG-audio signal pairs from the non-task phase to improve decoding performance during the task phase is interesting, the experimental design exists precisely to ensure that subjects focus on carefully designed cognitive tasks and that the recorded sEEG signals contain information about language perception; the ambient recordings offer no such guarantee, which makes the neuroscientific basis and reproducibility of this work questionable.
**References**:
[1] Défossez A, Caucheteux C, Rapin J, et al. Decoding speech perception from non-invasive brain recordings[J]. Nature Machine Intelligence, 2023, 5(10): 1097-1107.
[2] Linnea Evanson, Christine Bulteau, Mathilde Chipaux, Georg Dorfmüller, Sarah Ferrand-Sorbets, Emmanuel Raffo, Sarah Rosenberg, Pierre Bourdillon, and Jean-Rémi King. Emergence of Language in the Developing Brain. Manuscript online, May 2025. URL
https://ai.meta.com/research/publications/emergence-of-language-in-the-developing-brain/. (Accessed 10/09/2025).
[3] Wang C, Subramaniam V, Yaari A U, et al. BrainBERT: Self-supervised representation learning for intracranial recordings[J]. arXiv preprint arXiv:2302.14367, 2023.
[4] Chau G, Wang C, Talukder S, et al. Population transformer: Learning population-level representations of neural activity[J]. arXiv preprint arXiv:2406.03044, 2025.
[5] Zhang D, Yuan Z, Yang Y, et al. Brant: Foundation model for intracranial neural signal[J]. Advances in Neural Information Processing Systems, 2023, 36: 26304-26321.
[6] Yuan Z, Shen F, Li M, et al. Brainwave: A brain signal foundation model for clinical applications[J]. arXiv preprint arXiv:2402.10251, 2024.
See the above weaknesses. |
Fully human-written |
|
Scalable and Generalizable Autonomous Driving Scene Synthesis |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper focuses on multi-view generation in driving scenes. Previous works use image-level latent representations, relying on cross-view attention to maintain cross-view consistency. This work proposes encoding multi-view images into a unified and compact BEV-latent. This explicit latent representation directly guarantees cross-view consistency. The proposed method can be trained across datasets (with different camera layouts) and demonstrates strong cross-dataset generalization capability and high image quality.
The motivation and idea of this work are solid. The BEV latent representation not only explicitly ensures cross-view consistency, as the paper emphasizes, but I also suspect it can largely mitigate the view-dependence issues of generative models (for example, cases where the consistency and motion/changes of the 3D content are reasonable only within a single camera view). The authors could consider validating this point.
The designs of the BEV latent encoder, decoder, discriminator, and training pipeline are reasonable and well-founded. The writing is clear.
Experiments are extensive and solid. The model's capability for cross-dataset training and its few-shot generalization ability are impressive. Visualization results show that the model achieves high accuracy in reconstructing views under the highly compressed BEV latent representation.
Recommend defining F_stt in Section 3.1.2.
The title "Scalable and Generalizable Autonomous Driving Scene Synthesis" doesn't fully capture the paper's key feature (BEV latent representation / multi-view synthesis). The experiments primarily demonstrate the method's cross-dataset training capability (which is good) rather than its scalability. Consider adjusting the title to make it more distinctive?
The current method does not seem to involve temporal modeling. Will it be explored in future work?
The paper discusses the proposed method's relatively weaker FID scores (which, given the difficulty of latent representation in BEV space compared to image-level representation, I find understandable). Could increasing the BEV spatial resolution or the number of CFV sampled rays improve the view resolution/realism?
Are there any plans to open-source the code? |
Fully human-written |
|
Scalable and Generalizable Autonomous Driving Scene Synthesis |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces BEV-VAE, a novel variational autoencoder framework designed for autonomous driving scene synthesis. The core contribution is a model that unifies multi-view images into a compact and camera-agnostic Bird's-Eye-View (BEV) latent representation. This approach decouples the scene's 3D structure and semantics from the specific camera configurations, enabling the model to be trained on diverse datasets with varying camera layouts and to generalize to arbitrary new viewpoints. For generative tasks, a Diffusion Transformer (DiT) is trained within the learned BEV latent space, conditioned on 3D object layouts represented as occupancy grids. The authors demonstrate the model's effectiveness through multi-view reconstruction, novel view synthesis, and cross-dataset generalization. While the proposed method achieves state-of-the-art multi-view spatial consistency, it shows a trade-off in per-image generative fidelity (gFID) compared to prior work. Finally, the practical utility of the synthesized data is validated by improving the performance of a downstream perception model, BEVFormer.
1. **Generalization:** The paper provides compelling evidence for the model's ability to generalize across different datasets (nuScenes, AV2, nuPlan) and camera setups. The experiments showing successful reconstruction of scenes from one dataset (e.g., nuPlan) using the camera intrinsics and extrinsics of another (e.g., AV2) are particularly impressive and strongly support the claims of generalizability. The demonstrated performance gains from training on a large, mixed dataset (PAS) validate the model's scalability.
2. **Superior Multi-View Spatial Consistency (MVSC):** The model achieves a state-of-the-art MVSC score. This is a crucial metric for autonomous driving applications, where maintaining the correct 3D geometry and spatial relationships between objects across views is often more important than perfect photorealism. The architectural design, which generates all views from a single, coherent 3D representation, naturally leads to this strength.
3. **Demonstrated Downstream Task Improvement:** The experiment in Section 4.8, showing that data augmentation using the proposed method improves the NDS score of BEVFormer, is a very strong point. It demonstrates that the synthesized data is not just visually plausible but also practically useful for training and improving perception models, closing the loop between generation and perception.
1. **Lower Generative Fidelity (gFID):** The most apparent weakness, which the authors acknowledge, is the relatively high (worse) gFID score compared to state-of-the-art methods like MagicDrive and DriveWM. While the paper frames this as a trade-off for better spatial consistency, the gap is substantial (20.7 vs. ~13-16). This indicates that the generated images may lack the fine-grained texture and realism of other methods, which could limit their utility in certain applications.
2. **Low Image Resolution:** All experiments are conducted at a 256x256 resolution, which is quite low for modern autonomous driving datasets and applications. While the authors suggest using super-resolution models as a post-processing step, this feels like an external fix rather than an integrated solution. The paper would be stronger if it discussed the challenges and potential architectural changes required to scale BEV-VAE to higher resolutions (e.g., 512x512 or higher).
3. **Overstated "Zero-Shot" Capability:** The term "zero-shot" in Section 4.6 seems too strong given the quantitative results in Table 3. The zero-shot performance on WS101 is very poor (PSNR 16.6, rFID 56.7). The real strength demonstrated here is in *fast adaptation* or *efficient fine-tuning*, where the pre-trained model provides a strong prior that allows for rapid convergence on a new dataset. The terminology should be more precise to reflect this.
4. **Static Scene Limitation:** The current framework operates on static scenes. The real world is dynamic, and the ability to model temporal evolution and generate coherent video sequences is a key direction in this field. While mentioned as future work, this is a significant limitation compared to the broader goals of full-world simulation.
5. **Mismatched Framing of Contribution and Lack of Efficiency Analysis:** The title "SCALABLE...SCENE SYNTHESIS" may be slightly overstated, as the paper's core innovation lies not in the generative model itself—which is a standard Diffusion Transformer—but in the preceding VAE architecture for learning a unified BEV representation. A significant, yet underexplored, benefit of this design is its potential for computational efficiency; by compressing the multi-view scene into a compact latent space, the subsequent diffusion process should be substantially less demanding in terms of memory and latency. To truly validate the "Scalable" claim and better frame the work's practical contribution, the paper would be significantly strengthened by a quantitative comparison of GPU memory usage and inference times against other leading methods.
Regarding the FID/MVSC Trade-off: Could you elaborate on why you believe there is this trade-off? Is the lower FID an inherent consequence of the VAE's information bottleneck regularizing the latent space, potentially smoothing over high-frequency details? Have you experimented with alternative VAE formulations, such as a VQ-VAE, which might allow for sharper reconstructions while maintaining the unified BEV structure? |
Fully AI-generated |
|
Scalable and Generalizable Autonomous Driving Scene Synthesis |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper focuses on autonomous driving scene synthesis and presents BEV-VAE, a variational autoencoder that unifies multi-view driving images into a compact bird’s-eye-view (BEV) latent representation, allowing flexible encoding and decoding across arbitrary camera configurations. By incorporating a Diffusion Transformer conditioned on 3D object layouts, the method achieves spatially consistent and generalizable multi-view synthesis on nuScenes and AV2. The synthesized data further enhances BEVFormer’s perception performance, highlighting the value of scalable and generalizable scene synthesis from a training data perspective.
1. The paper presents a clear motivation and is generally well written.
2. In the autonomous driving domain, due to the inherent need for multi-view perception, a BEV-based VAE offers greater practical value than image-space VAEs.
3. Using BEV representations makes it easier to transform different camera layouts, and simulate training data for different vehicle configurations.
1. The paper does not clearly articulate the advantages of BEV-VAE over Image-VAE. In terms of generation quality, both rFID and gFID are inferior to those of Image-VAE. Moreover, recent image-based multi-view generation methods also achieve strong spatial consistency. The potential benefits of BEV-VAE, in my view, may lie in two aspects—information compression and better compatibility with 3D editing—but the paper does not appear to emphasize either of these points. The results in Table 6 also raise questions — if the primary application value lies in train data generation, the improvement in detection performance appears comparable to that achieved by BEVGen, making it difficult to identify a clear advantage of BEV-VAE in this aspect.
2. The technical novelty of the paper is weak, as the BEV-VAE architecture largely follows that of BEVFormer. The use of BEV representations is also quite similar to BEVWorld, yet the paper lacks a detailed discussion of their differences. In addition, the rendering process to images resembles existing approaches such as self-occ. It would be beneficial for the authors to more clearly articulate the technical innovations, as the method section currently appears to primarily combine components from prior works.
1. From Table 3, doesn’t the comparison with SD-VAE suggest that BEV-VAE has weaker zero-shot generalization ability than image-based latent representations?
2. Why does BEV-VAE use only 256×256 image resolution? Would scaling up the resolution introduce any potential issues or challenges?
3. How much improvement in generation quality does using DiT with BEV-VAE provide compared to using BEV-VAE alone?
4. During the model training process, which modules, if any, use pre-trained parameters, and which are trained entirely from scratch?
5. GAN losses are usually sensitive to hyperparameter settings. Could the authors comment on potential issues regarding hyperparameter sensitivity and training stability in their setup? |
Lightly AI-edited |
|
Scalable and Generalizable Autonomous Driving Scene Synthesis |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses the problem of novel-view synthesis (NVS) in driving scenes, where the novel views are camera viewpoints around the original cameras.
The authors propose to address this problem through a 3D-aware bird's-eye-view VAE.
This VAE encodes multiple images together into 3D BEV latents and then decodes them back to images. After training this VAE with MSE, perceptual, and GAN losses over multiple datasets, the authors show that NVS can be done by using different camera extrinsics and intrinsics when decoding, which is a very neat idea.
The authors also show that this 3D-aware VAE has relatively OK reconstruction PSNR compared with the vanilla image-space VAE used by Stable Diffusion.
The authors further show that the proposed method can be used to generate augmented data when training a perception model (BEVFormer), increasing its performance, which is quite impressive.
1. The idea is so interesting. 3D aware VAE can be used for NVS by adjusting the camera parameters used in decoding process. Quite cool!
2. The evaluation is quite comprehensive (even though a proxy metric is used for NVS), and I understand that the PSNR is lower than that of image-based VAEs, e.g., SD-VAE.
3. Using the proposed methods for data-augmentation is also very interesting!
I think the major weakness I have is about runtime efficiency. It seems that the deformable attention would be very slow compared with flash-attention style attention implementations. Can the authors provide more details about it? I understand that it might be slow without a well-optimized flash-attention style kernel.
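To illustrate the concern, a toy single-level sketch of deformable-attention-style sampling (my own construction with hypothetical shapes): the cost is O(Q·K) scattered gathers rather than O(Q²) dense dot products, but the `grid_sample` reads are memory-bound and, to my knowledge, lack a fused flash-attention-style kernel:

```python
import torch
import torch.nn.functional as F

B, C, H, W, Q, K = 2, 64, 32, 32, 100, 8    # batch, channels, BEV grid, queries, points/query
value = torch.randn(B, C, H, W)
loc = torch.rand(B, Q, K, 2) * 2 - 1        # per-query sampling locations in [-1, 1]
w = torch.softmax(torch.randn(B, Q, K), -1) # per-point attention weights

sampled = F.grid_sample(value, loc, align_corners=False)  # (B, C, Q, K) scattered gather
out = (sampled * w.unsqueeze(1)).sum(-1)                  # (B, C, Q): O(Q*K), not O(Q^2)
```

Wall-clock numbers on the authors' hardware would settle whether this matters in practice.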
1. Is it possible to evaluate NVS without using proxy metrics, as done in the paper (which uses reconstruction metrics as proxies for NVS)?
2. Would dropping input images during training improve the NVS performance? It seems that if you drop one image during the encoding stage but still compute the reconstruction loss on that image, the process will resemble NVS training.
Fully human-written |
|
TAVAE: A VAE with Adaptable Priors Explains Contextual Modulation in the Visual Cortex |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors present a VAE framework that explicitly describes a task-dependent prior and test its performance alongside neural data from mouse V1. The authors present their model, describe data collection, and present qualitative similarities between the activations of latent variables in their model and the spiking activity of V1 neurons measured by calcium imaging.
The paper presents an interesting and, to my knowledge, novel accounting of neural tuning properties in the face of changing stimulus statistics using the model presented in Section 2. They present this alongside an approach for learning context-specific priors in the variational framework. There do appear to be some qualitative similarities between neural data and model latents, but validating these results is required before claims can be made about how their model maps mechanistically onto neural representations of stimuli.
I think there are three main dimensions on which this paper falls short of acceptance: 1) validation of model structure, 2) statistical rigor, and 3) clarity of question.
1) The authors make claims about the qualitative properties of the latents of their model and how they match those of the real data. However, I'm not sure it's possible to attribute these features (even if they are statistically valid) to the prior structure of the model exclusively. Specifically, no ablation analysis of the model was conducted to determine which of their modeling choices was essential to their findings. For example, how critical was it that the latent responses were sparse? How important was the scaling latent? Neither of these choices were evaluated in any way and it is not clear they are germane to the properties they intend to model.
2) There is virtually no statistical analysis beyond Figure 2. Error bars and shaded regions around population tuning curves are not defined. Data points in tuning curve plots (e.g., the red and blue dots in Figure 2a,b) are not defined. Moreover, if these really are data points and the shaded regions are supposed to be 95% confidence intervals, then I suspect their inference is over-confident.
3) It's not obvious what the authors are testing when they examine neural activity alongside latent activations. This seems to be an unreasonably coarse level of analysis, and I would not expect a clear correspondence to exist beyond something incidental. Perhaps the authors meant to examine the posterior distribution over the stimulus? This would have real cognitive meaning in the context of a shift in prior probabilities.
The authors should clarify their mechanistic claims about why their model matches the data in the ways they describe, justify the modeling choices, strengthen the statistical inference, and accompany all claims with statistical tests.
Fully human-written |
|
TAVAE: A VAE with Adaptable Priors Explains Contextual Modulation in the Visual Cortex |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper suggests a modified image autoencoder to account for task adaptations in the visual cortex.
The authors use a two-stage procedure in which the latent representations are first trained on natural images and then adjusted with respect to the task prior (which is optimized).
In this particular paper they use mice performing go/no-go tasks and show that the latent space of the visual autoencoder, after introducing a task prior, shows the same qualitative phenomena as the actual responses in V1 when the task changes, supporting the claim that the brain performs probabilistic inference under a prior.
1. **Elegant framework**. Beautiful idea - fixing the likelihood $p(x|z)$ and only learning a new task-specific prior $p_T(z)$ is both elegant and powerful. The paper makes a clear hypothesis: systematic biases in V1 during task performance are the result of probabilistic inference under a learned, task-specific contextual prior. The model provides a concrete implementation of this hypothesis and generates specific, falsifiable predictions that are then confirmed by the experimental data.
2. **Qualitative comparisons**. The model reproduces the qualitative phenomena, e.g., the distribution splitting from unimodal to bimodal when there is a mismatch between the train and test data (e.g., Fig 3).
3. **Reproducibility**. The code is provided in the supplementary materials.
4. **Statistical rigor**. All the plots and tables report error-bars.
1. **Clarity**. The paper might benefit from a more clear high-level framework introduction, before getting to the formalism. If I get it correctly, then the autoencoder model is trained on images only and neural responses are used for validation only.
2. **Lack of quantification of qualitative results**. While Fig 3 generates nice qualitative insights, some statistical tests might support the claims, e.g., Hartigan's dip test to quantify when the red line stops being unimodal (and whether this happens faster in real mice or in the model), a one-sample t-test to verify that the peaks of the response are significantly shifted away from the actual stimulus orientation, and a Pearson correlation to check how well the model predictions fit the actual neuronal responses (see the sketch after the minor points below).
3. **Representations alignment is not considered**. Lines 212-214 make an assumption that the autoencoder latent space $z$ is assumed to correspond to neural activities in V1, however, this correspondence is clearly violated by the fact that $z$ could be negative. Hence, this raises questions about the validity of this assumption and how aligned the representations are in general.
4. **Limited direct applicability**. While this is a beautiful hypothesis-testing framework, applying it to different stimuli can be very complicated. Specifically, eq (11) is nice and tractable as *"in a typical gratings dataset we expect a symmetry in z around zero"* (247-252). However, it is not clear how to set up a meaningful prior in the case of other tasks and out-of-distribution designs (like distinguishing images by color, or the primary direction of random moving-dot stimuli, etc.).
Minor:
1. Inconsistent font sizes in the plots (see Fig 1 panel D and H, or Fig 2 panel A and E).
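A sketch of the tests I have in mind, using SciPy plus the third-party `diptest` package for Hartigan's test (the input arrays are hypothetical placeholders for the per-trial responses and tuning curves):

```python
import numpy as np
from scipy import stats
import diptest  # third-party package implementing Hartigan's dip test

rng = np.random.default_rng(0)
responses = rng.normal(45.0, 5.0, 200)      # hypothetical per-trial decoded orientations
true_ori = 45.0
model_curve, neural_curve = rng.random(16), rng.random(16)  # hypothetical tuning curves

dip, p_unimodal = diptest.diptest(responses)           # does the red curve stay unimodal?
t, p_shift = stats.ttest_1samp(responses, true_ori)    # are peaks shifted off the stimulus?
r, p_corr = stats.pearsonr(model_curve, neural_curve)  # model-neural tuning agreement
```

Running these for both the mice and the model would also answer which one becomes bimodal first.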
1. If I understand correctly, the autoencoder model is trained on images only, and neural responses are used for validation only. Is that right? Also, you first train an autoencoder using eq (9) as the loss function to get $q(z|x)$ and $p(x|z)$, and then you only train $\underline{\sigma}_{T}$ (line 253)?
And this adjusts $q(z|x)$ to $q_{T}(z|x)$? Are there any other parts retrained?
2. Why is the trained baseline activity in Fig 2H negative? I thought you were taking the absolute values (lines 237-240).
3. Why exactly does the Laplace prior give us localized, oriented receptive fields? (lines 226-228)
4. Lines 237-240 identify that $z$ could be negative, which clearly misaligns the autoencoder latent space with the neuronal responses. Have you tried to restrict the $z$ to be strictly non-negative during training?
5. Connected to the previous question - lines 212-214 say that *"activations of latent variables, z, of the generative model were assumed to correspond to activations of individual neurons in V1"*. How adequate is this assumption? Have you tried to fit a linear regression from $z$ to the actual neuronal responses (like "neural predictivity" in [1]) and see how well it performs, or to do something like a CKA analysis [2-4]?
6. How exactly was the neuronal data pooled across sessions? Did you select the neurons that were sufficiently orientation-selective and then average them across all sessions to make the lines in Fig 2e,f, for example? And for the autoencoder - you always used a single model (e.g., there were no separate autoencoders to match the latent space to the number of neurons per session)?
7. I would appreciate your thoughts on weakness 4.
Minor:
1. What are the blue and green lines in Fig 1h?
References:
[1] Nayebi, Aran, et al. "Mouse visual cortex as a limited resource system that self-learns an ecologically-general representation." PLOS Computational Biology 19.10 (2023): e1011506.
[2] Murphy, Alex, Joel Zylberberg, and Alona Fyshe. "Correcting biased centered kernel alignment measures in biological and artificial neural networks." arXiv preprint arXiv:2405.01012 (2024).
[3] Williams, Alex H., et al. "Generalized shape metrics on neural representations." Advances in neural information processing systems 34 (2021): 4738-4750.
[4] Chun, Chanwoo, et al. "Estimating Neural Representation Alignment from Sparsely Sampled Inputs and Features." arXiv preprint arXiv:2502.15104 (2025). |
Fully human-written |
|
TAVAE: A VAE with Adaptable Priors Explains Contextual Modulation in the Visual Cortex |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes TAVAE, a task-adapted VAE framework that modifies only the latent prior (not the encoder or decoder) to account for contextual modulation effects observed in mouse V1 during a visual discrimination task. By adapting the prior learned from natural images to task-specific contingencies, the model reproduces several well-known effects: sharpening, baseline suppression, and multimodal responses under stimulus-prior mismatch. The VAE is strongly constrained: linear decoder, Laplace prior, and overcomplete latent space, mirroring classic sparse-coding models rather than deep nonlinear architectures.
1. The model is minimal, with biologically inspired constraints (linear decoder, sparse Laplace prior, overcomplete latent space, and GSM-style gain modulation) that mirror classic models of V1 (e.g., Olshausen & Field).
2. The model qualitatively reproduces several experimentally observed phenomena using a single mechanism (prior variance reweighting).
1. The paper claims that adaptation in the prior alone is sufficient to account for several task-induced changes in neural population statistics, but the lack of comparison to single-neuron activity leaves this claim speculative.
2. Figure 3a: I really cannot see a "drastic" difference between the red and blue curves. There needs to be a metric to quantify how they differ.
3. Figure 4a: The curves are visually nearly identical in shape, except for slightly lower side peaks and a slightly higher center as γ increases. If all that happens is that one peak increases slightly, calling it "updating the inference toward the new context" feels like a strong claim for a weak effect.
1. Is it possible to extend this model to decode neural activity, as in, e.g., Maheswaranathan et al., Neuron, 2023?
2. The encoder is linear, with overcomplete latent dimensions, and trained under a Laplace prior. How close is it to ICA or sparse coding, rather than a deep encoder followed by variational sampling?
3. Would you expect the same latent prior adaptation mechanism to work in tasks involving richer stimuli or additional visual features (e.g., natural scenes/motion)? Why or why not? |
Fully human-written |
|
Memorizing Long-tail Data Can Help Generalization Through Composition |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work considers an interesting angle: how memorizing long-tail samples can be helpful for generalization. Theoretically, it proves that for a linear classifier (whether the underlying distribution is noiseless or noisy) memorization can help generalization both in and out of distribution under some assumptions on the distributions. Empirically, it shows that memorization can help on an interesting construction of a task: computing the sum of MNIST digits. With a proper choice of model architecture, memorization can improve the results when a certain digit is significantly under-represented in the data set.
The presentation of this work is outstanding: theoretical claims are clearly defined and the underlying intuition of the proof well explained. I also appreciate the authors' effort of motivating the problem as well as presenting the related work with precise and succinct language. The theoretical claims are sound. I don't find any apparent problem in the proof, either. The design of the three-digit-sum problem is new to me. Despite some limitation of the design, which I will come to later, the idea is intriguing. Overall, this work is solid technically.
Compared to the outstanding presentation and rigorous theoretical formulation, this paper is slightly weaker on the potential impact. Specifically:
1) The linear case is slightly simplistic, as composition is natural there. If two features both contribute positively to the prediction score, i.e., have positive corresponding entries in $\beta$, then observing one of them at a time in training examples should suffice for good test results. In the nonlinear case, e.g., an XOR, observing one feature at a time may not be sufficient for telling the outcome when both features are present (nonzero). In fact, the sum-of-digits task is somewhat linear with respect to the digits. I wonder if the phenomenon of composition can be observed in more general tasks.
2) The notion of memorization here is slightly different from the literature I'm familiar with. I'm more used to influence score based criteria for memorization, e.g., removing a training example will significantly impact the prediction of another example. I believe this work assumes that an overparametrized model with unregularized training will memorize. Is this assumption common in literature?
3) Following 2), MNIST is a fairly 'simple' dataset for which a small sample can already lead to a good model, with or without memorization. For stronger impact, the authors may want to consider some more complex tasks.
My questions are mainly on the potential impact of the work.
1) Could you provide some more real world examples of tasks where composition is natural?
2) If time allows, could you quickly check the influence score of a training example on itself or on the test samples? Either a simple leave-one-out test (retraining the model with a training set differing by one entry) or the estimation in Feldman and Zhang would do (a sketch of the former follows the reference below).
3) What could be the future extension of the result in this work?
Feldman, Vitaly, and Chiyuan Zhang. "What neural networks memorize and why: Discovering the long tail via influence estimation." Advances in Neural Information Processing Systems 33 (2020): 2881-2891. |
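To clarify question 2, here is a minimal leave-one-out sketch on synthetic data (hypothetical setup; for anything larger, the subsampling estimator of Feldman and Zhang above would replace the naive retraining):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 20)), rng.integers(0, 2, 500)
i = 0  # index of a candidate long-tail training example

full = LogisticRegression(max_iter=1000).fit(X, y)
loo = LogisticRegression(max_iter=1000).fit(np.delete(X, i, axis=0), np.delete(y, i))

# Self-influence: drop in the example's own correct-label probability once it is removed.
self_influence = (full.predict_proba(X[i:i+1])[0, y[i]]
                  - loo.predict_proba(X[i:i+1])[0, y[i]])
```

Even a handful of such checks on the under-represented digit examples would substantiate the memorization claim.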
Fully human-written |
|
Memorizing Long-tail Data Can Help Generalization Through Composition |
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work builds on the line of work by Feldman and Zhang that has studied how long-tail memorization can help with generalization in deep learning.
There is a key conceptual shift that this paper makes on top of Feldman. Feldman argued that memorization helps because test examples are similar to memorized training examples, which allows the model to recall them directly. This paper adds a new dimension to this discourse: Memorization helps not only because it reproduces similar examples, but also because it enables composition. This means that combining multiple memorized rare examples can lead to generalization into new configurations.
To demonstrate this idea of compositionality, the authors move away from the singleton tasks in past work to new tasks such as (i) a “sum of three MNIST digits” setup and (ii) an MNIST–Omniglot mixture testing one-shot memorization.
The authors develop a theoretical model in which different data features follow a power-law frequency distribution.
They prove that the minimum-norm solution, which memorizes training data, can correctly predict on test examples composed of multiple long-tail features that never co-occurred during training. The theoretical argument is supported by the experimental results on the newly created synthetic datasets.
Results show that networks capable of processing input components modularly (e.g., per-digit ResNets with additive aggregation) generalize compositionally, whereas architectures that entangle inputs early (“cross-channel” ResNets) fail.
The paper also shows that an attempt to mitigate memorization (such as a weight decay penalty) leads to a loss in model generalization on such compositional tasks.
Disclosure: I have not reviewed the theory carefully.
1. Conceptual Extension: I quite like the extension this paper attempts to make over the singleton memorization argument made in Feldman et al. The bridge is quite intellectually appealing and can connect various ideas like one-shot generalization and memorization in overparametrized models.
2. The power-law-based feature setup seems quite simple yet expressive. I believe this is sufficient to motivate the empirical underpinnings of the work.
3. The paper has a good mix of toy tasks: from linear regression to a controlled MNIST and Omniglot task. I like how they are able to connect the architectural dependence here as we visualize the transition from memorization to composition.
1. The main weakness of the work is its experimental scope. I admit that this will in general remain a hard task, but I would like to challenge the authors to find meaningful ways to extend these setups to ones of more practical relevance:
i. this requires identifying where in the real world one-shot composition of memorized instances naturally happens;
ii. then running controlled experiments that ablate away that capability;
iii. if memorized composition is indeed a mechanism by which models generalize, I actually think it is quite a useful exercise to show that this happens in real tasks; if not, why is this phenomenon of interest? I am writing this as motivation rather than actually questioning the value of this line of work, which I quite like.
2. I believe this paper also needs a discussion of when memorization hurts composition. This is especially true for scenarios such as spurious correlations. How would the theory and/or experiments intersect with this?
1. The task of single-example memorization in big models is hard. I wonder if some efforts around experimentation with PEFT or in-context examples can somehow connect here. In-context learning is an example of single-example generalization with high information recall (which is what you intend the word memorization to mean, anyway). This is just one thought to aid experimentation.
Fully human-written |
|
Memorizing Long-tail Data Can Help Generalization Through Composition |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper explores how memorization of rare, long-tail examples can improve generalization when combined with a model’s ability to compose known features in new ways. Through theoretical analysis in linear settings and small-scale experiments on compositional MNIST tasks, the authors show that memorization enables correct predictions on unseen combinations of rare features.
The paper provides a clear theoretical formulation connecting memorization and compositional generalization, an underexplored relationship in deep learning theory. Its synthetic and modified MNIST experiments effectively illustrate how architectural structure influences compositional ability. Finally, it contributes a valuable conceptual shift, framing memorization not purely as overfitting, but as a potentially beneficial mechanism for learning from long-tail data.
**Oversimplified Definition of Memorization**
The paper treats memorization as a binary property, i.e., models either memorize or do not. This definition ignores the nuanced ways sample-level memorization actually behaves. For example, memorization scores can vary from 0 (perfect generalization) to 1 (perfect memorization). By treating the property as binary, the authors ignore the entire range of values between 0 and 1.
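For reference, the standard per-example memorization score from Feldman (2020) is continuous rather than binary. With $\mathcal{A}$ the training algorithm and $S$ the training set:

$\mathrm{mem}(\mathcal{A}, S, i) = \Pr_{h \sim \mathcal{A}(S)}[h(x_i) = y_i] - \Pr_{h \sim \mathcal{A}(S \setminus \{i\})}[h(x_i) = y_i]$

i.e., the drop in the probability of predicting $y_i$ correctly when example $i$ is removed from the training set. Binarizing collapses this entire range, which is exactly the nuance this weakness points to.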
**No Empirical Verification of Memorization**
Despite repeatedly claiming that rare examples (like the digit “9”) were memorized, the authors never test this directly. They infer memorization from improved performance on rare-digit test cases but do not apply any established measurement technique (e.g., Feldman et al.'s self-influence) to verify that the model had indeed memorized those samples. Without this validation, the central claim that memorization enables composition remains speculative.
**Reliance on Indirect Behavioral Evidence**
The experimental support for memorization is limited to behavioral trends: test loss decreases as the frequency of the rare digit increases and increases when weight decay is applied. While suggestive, these results can also be explained by better statistical coverage or regularization effects rather than genuine memorization. The lack of causal evidence weakens the argument of this work.
See above |
Moderately AI-edited |
|
SafeRBench: A Comprehensive Benchmark for Safety Assessment of Large Reasoning Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
SafeRBench introduces the first comprehensive benchmark designed to evaluate the safety of Large Reasoning Models (LRMs) throughout their full reasoning process, spanning inputs, intermediate reasoning traces, and final outputs. Unlike prior LLM safety benchmarks that focus only on surface-level harms, SafeRBench captures process-level risks unique to LRMs, such as harmful rationales or late-stage toxic reasoning. It provides a three-layer evaluation framework: (1) a stratified dataset of 1,128 harmful queries categorized across six risk domains and three risk levels; (2) micro-thought chunking that segments reasoning traces into fine-grained cognitive intents for detailed risk analysis; and (3) ten safety dimensions grouped into Risk Exposure and Safety Awareness scores. Experiments on 19 LRMs show that reasoning traces strongly predict safety outcomes: models with higher Intention Awareness and Defense Density produce safer responses. Medium-sized “thinking” models perform best, while very large models exhibit an “always-help” bias that can reintroduce risk. SafeRBench thus establishes a scalable, human-aligned framework for diagnosing and improving LRM safety across reasoning dynamics.
- SafeRBench uniquely assesses model safety across the entire reasoning pipeline including from input prompts to intermediate reasoning traces to final outputs, capturing risks that traditional output-only benchmarks miss.
- By analyzing reasoning traces, SafeRBench detects latent and evolving safety failures, such as rationale laundering or late-stage harmful reasoning.
- It introduces a compact yet representative dataset of 1,128 harmful queries, systematically balanced across six harm categories and three risk tiers, enabling precise and reproducible evaluations.
- The introduced micro-thought chunking mechanism is thoughtful, which segments long reasoning traces into semantically coherent units labeled with cognitive intents, allowing fine-grained, interpretable analysis of reasoning safety.
- SafeRBench aligns LLM-based safety judgments with human annotations.
- Model evaluation is comprehensive and insightful.
Overall this work provides useful work for advancing reasoning model safety.
- The fine-grained segmentation and evaluation of long reasoning traces may be computationally expensive, restricting scalability to larger datasets or more models without significant resources. It would be good to include a cost analysis showing the computational overhead.
- The benchmark's results hinge on the choice of evaluator models.
See Weaknesses. In addition, how might different evaluator models change the benchmarking results? |
Moderately AI-edited |
|
SafeRBench: A Comprehensive Benchmark for Safety Assessment of Large Reasoning Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces a benchmark, SafeRBench, that assesses large reasoning model safety from inputs and intermediate reasoning through to outputs, via structured input characterization, micro-thought chunking for fine-grained analysis, and human-aligned safety evaluation.
Existing benchmarks have many limitations. SafeRBench categorizes queries by risk level, accounting for affected groups and severity of impact, and constructs a balanced benchmark dataset that reflects diverse harm gradients. It also introduces trace evaluation for fine-grained analysis of risk propagation.
LLMs and reasoning models evolve rapidly, so it’s unclear whether the benchmark’s scope will remain broad or generalizable for emerging models. The paper also lacks theoretical and mechanistic insight—it reads more like a collection and categorization of GPT-generated data. While it provides useful diagnostic metrics such as scores and rankings, it does not explain why models fail or exhibit unsafe behaviors.
1. Could you elaborate on the motivation and intuition behind the different risk categories, and explain how these risks are defined and categorized?
2. The **segmentation of reasoning traces** appears to be a crucial step, yet it remains unclear how the **BERT-based** and **LLM-based** methods are applied in this process. Since segmentation can significantly influence the final outcomes, could you clarify how the **granularity of segmentation** is determined?
3. Is the **segmentation** applied **only to long text inputs**, or does it also affect shorter reasoning traces? |
Lightly AI-edited |
|
SafeRBench: A Comprehensive Benchmark for Safety Assessment of Large Reasoning Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes SafeRBench, a benchmark and framework for evaluating the safety of large reasoning models (LRMs). On the input level, risk is categorized into 6 classes with 3 possible risk levels. Outputs are evaluated based on refusal, risk level (4 levels), and execution level. Reasoning traces are evaluated by risk density, defense density, intent awareness, and safe strategy conversion. Using this evaluation framework, the authors evaluated 19 LRMs and compared them to draw insights on how different model and reasoning features impact safety.
This work evaluates the safety of reasoning traces, which haven't been thoroughly evaluated before. It proposes comprehensive metrics to quantify the safety of LRMs, providing a more fine-grained, multi-dimensional understanding of how and why LRMs are unsafe. The paper is clearly written.
I have some concerns about the robustness of some of the proposed metrics and evaluation. I think the manuscript in its current state does not make a sufficiently substantial contribution to be accepted by ICLR without major revision and additional experimentation.
1. One of the main claims of novelty is that this work evaluates reasoning trace safety. However, based on this paper, it's unclear why it's insufficient to evaluate the input + output without reasoning traces. It would strengthen the paper if the authors could provide quantification of how much reasoning traces matter. For example, use an LLM judge to evaluate how safe (input + output) is vs. (input + reasoning trace + output).
2. Some of the evaluation metrics need justification and clarification. See my questions in the section below
3. Lines 52-53: "Existing benchmarks mainly annotate the risk category of outputs, such as Safety-Bench (Zhang et al., 2024b) and HarmBench (Mazeika et al., 2024)" - I believe HarmBench only provides inputs, so the risk categories should be on inputs. SafetyBench seems to be the same. Many existing benchmarks label risk categories on inputs and not outputs, such as WildGuard and AIR-Bench. This contradicts the claim that input risk categories are a novelty of the current work.
4. Fig. 2: It's a stretch to call a low-medium-high scale a spectrum when it only adds an intermediate level to binary categorization. nvidia/Aegis-AI-Content-Safety-Dataset-1.0 contains labels categorized into safe, needs caution, and unsafe (with specific risk categories); this is similar to the stratified risk levels of the present work, which undermines the claimed novelty.
5. Fig. 3: Please find alternative ways to illustrate the evaluation results. Currently, it is very hard to recognize which line corresponds to which model.
1. In lines 42-43, can you explain what "incremental capability scaffolding, rationale laundering, or late-stage revelation" are more clearly and how these vulnerabilities might arise (e.g., are they emergent or subject to external attacks)?
2. Line 53: "This limits their effectiveness for LRMs, where long reasoning traces introduce layered risks" - why and how?
3. Line 155-158: Some risk categories inherently target groups rather than individuals (e.g., environmental & global threats, social safety). are distributions of risk levels balanced for each risk category?
4. Line 160: How were the queries generated?
5. Why is Response Complexity a safety metric? It doesn't seem to distinguish models well based on Figure 3.
6. Lines 252-258: Are there any patterns in how the risk score evolves over micro-chunks in the reasoning traces? E.g., plot s over t (see the sketch after this list).
7. Why does trajectory coherence measure safety?
8. Fig. 4: some metrics seem highly correlated - do we need all of them or can they be condensed into more independent metrics? |
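To make question 6 concrete, here is a minimal sketch of the suggested s-over-t plot. The per-chunk scores are illustrative placeholders of my own, not values from the paper:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative per-chunk risk scores s_t for two hypothetical models;
# in practice these would come from the benchmark's chunk-level evaluator.
traces = {
    "model_A": [0.10, 0.15, 0.40, 0.65, 0.55],  # late-stage risk escalation
    "model_B": [0.45, 0.30, 0.20, 0.10, 0.05],  # risk defused over the trace
}
for name, s in traces.items():
    plt.plot(np.arange(len(s)), s, marker="o", label=name)
plt.xlabel("micro-chunk index t")
plt.ylabel("risk score s")
plt.legend()
plt.show()
```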
Fully human-written |
|
SafeRBench: A Comprehensive Benchmark for Safety Assessment of Large Reasoning Models |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a new benchmark and multiple new metrics to evaluate reasoning models.
Specifically, it proposes a taxonomy for creating a set of evaluation prompts. The evaluation then applies a chunking approach to the reasoning chains and evaluates each reasoning chunk. The authors also compare human annotations with AI annotations. The benchmark evaluates safety across 10 dimensions, including intent awareness and risk level.
The strengths of the paper include the following:
- interesting conclusions: The finding that for small models the thinking setup increases risk, while for medium ones it does not, and for bigger ones the risk is increased again, is very interesting. I also found the discussion on stronger tail controls insightful.
- Chunking: the use of chunking is interesting and could have been expanded upon more.
- The paper provides a thorough correlation analysis.
The main weaknesses of the paper are the following:
- Novelty: Many published papers already provide a finer-grained analysis of input risk levels, such as Li, Jing-Jing, et al., "Safetyanalyst: Interpretable, transparent, and steerable safety moderation for ai behavior," or Zhang, Yuyou, et al., "Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety." It would be good to get a better understanding of what makes this paper different and the benchmark more suitable than other approaches.
- Clarity: The paper could improve the clarity of writing, see questions below.
- Chunking: It seems the chunking ended up being done by GPT-5 via prompting. Lines 184-186 say this is because GPT-5 is better than previous chunking models, but the paper does not provide any numbers to support that claim. How did you evaluate it? Also, does it mean one needs to pay for GPT to evaluate one's models on the benchmark? Wouldn't it be more efficient to host a smaller model for that purpose? In general, I am a bit concerned that running this benchmark requires so much prompting of proprietary models.
- How did you come up with the taxonomy?
- 184-186: Did you actually evaluate this? Do you have numbers for that?
- Fig. 3: very hard to read.
- 248: Confusing metric: why are words per sentence a good indicator of density?
- 252: Why does the chunk index matter?
- 369: Does the ability to infer user intent also prevent over-refusal?
- 431: It would be good to have an example, as it is very hard to understand what a high-risk query is |
Fully human-written |
|
Deep Global-sense Hard-negative Discriminative Generation Hashing for Cross-modal Retrieval |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
DGHDGH presents a technically elegant yet computationally efficient solution for enhancing cross-modal hashing. The paper integrates a lightweight graph propagation (RGP) and an adaptive interpolation module (DGS) into a CLIP-based framework, showing measurable improvements with minimal resource overhead. From an implementation viewpoint, the proposed pipeline is easy to reproduce and could serve as a plug-and-play enhancement to existing retrieval systems.
1. The RGP–DGS pipeline is an elegant architectural contribution combining graph-based correlation learning with adaptive synthesis.
2. Provides a principled treatment of difficulty adaptation, moving beyond heuristic sampling.
3. The experiments are extensive, statistically robust, and demonstrate consistent gains over diverse baselines.
4. The approach is efficient (no extra generator) and generalizable to existing hashing frameworks.
1. The paper introduces λ as a channel-wise coefficient but does not explicitly state whether it is a fixed hyperparameter or a learned variable. Clarifying whether λ is shared between modalities would aid implementation (see the sketch after this list).
2. While the paper claims not to use additional generators, RGP is still a GNN-based component. It would be useful to show FLOPs or parameter comparisons between DGHDGH and baselines to substantiate the claim of efficiency.
3. The RGP module resembles self-attention. Could the authors comment on whether it could be replaced by a lightweight transformer encoder?
4. The paper alternates between the terms “Global-sense” and “Global correlation”, which may confuse readers. Unifying the terminology at the beginning of Section 3 would make the exposition cleaner.
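To clarify what I mean in weakness 1, here is a minimal sketch of one plausible reading of a channel-wise λ: a learned parameter, shared across modalities, squashed into (0, 1) per feature channel. Whether this matches the paper's DGS module is exactly what needs to be stated:

```python
import torch
import torch.nn as nn

class ChannelwiseMix(nn.Module):
    """One plausible reading (an assumption, not the paper's spec): lambda is
    a learned per-channel coefficient squashed into (0, 1) via a sigmoid."""
    def __init__(self, dim: int):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(dim))

    def forward(self, anchor: torch.Tensor, negative: torch.Tensor) -> torch.Tensor:
        lam = torch.sigmoid(self.logit)               # (dim,), channel-wise
        return lam * anchor + (1.0 - lam) * negative  # broadcasts over batch

mix = ChannelwiseMix(dim=128)
hard_neg = mix(torch.randn(32, 128), torch.randn(32, 128))  # synthetic negatives
```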
1. Is λ initialized randomly or via prior heuristics?
2. Could the authors quantify the overhead (Params / FLOPs) of RGP relative to a standard self-attention layer?
3. For deployment, have the authors explored quantizing RGP parameters to further reduce inference cost? Some discussion of how this fits the hashing-retrieval setting would be welcome. |
Fully AI-generated |
|
Deep Global-sense Hard-negative Discriminative Generation Hashing for Cross-modal Retrieval |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes DGHDGH, a new framework introducing hard negative generation into cross-modal hashing retrieval. The key idea is to model global semantic correlations among heterogeneous samples via a Relevance Global Propagation graph transformer, and synthesize channel-wise adaptive hard negatives using the Discriminative Global-sense Synthesis module. The method avoids relying solely on local pairwise interpolation, thereby maintaining semantic consistency in Hamming space. Extensive experiments across MIRFLICKR-25K, NUS-WIDE, and MS-COCO show state-of-the-art results.
1. The RGP–DGS pipeline is an elegant architectural contribution combining graph-based correlation learning with adaptive synthesis.
2. Provides a principled treatment of difficulty adaptation, moving beyond heuristic sampling.
3. The experiments are extensive, statistically robust, and demonstrate consistent gains over diverse baselines.
4. The approach is efficient and generalizable to existing hashing frameworks.
1. The three loss components (L_sp, L_is, L_cd) are optimized in parallel, yet the paper does not clarify their relative weights or potential gradient interactions. A short sensitivity analysis would strengthen the presentation.
2. The experiments mainly use CLIP-ViT backbones; limited tests with other vision–language models (e.g., BLIP, SigLIP, ALIGN) make it difficult to judge generalization across architectures.
3. The radar plot visualizing parameter sensitivity (Fig. 7) is not clearly described — axis meaning, normalization range, and metric selection should be elaborated to help readers interpret the results.
1. Are the loss weights fixed throughout training or tuned per dataset? Could the authors report whether the optimization of three loss terms exhibits any instability during early training stages?
2. How are the radar-plot axes normalized, by relative gain or absolute metric value? |
Fully AI-generated |
|
Deep Global-sense Hard-negative Discriminative Generation Hashing for Cross-modal Retrieval |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose DGHDGH, a cross-modal hashing framework coupling a global propagation network (RGP) with an adaptive negative synthesis module (DGS). The paper delivers strong quantitative results and clear empirical validation, but certain visualizations and terminological inconsistencies slightly hinder comprehension.
1. Demonstrates strong performance on multiple datasets and provides meaningful ablation studies.
2. The framework is innovative in combining semantic propagation with synthetic negative mining. The method is technically coherent and easily interpretable when fully understood.
3. The framework appears extendable to other multi-modal applications.
1. Some statistical figures, such as the radar chart summarizing multiple metrics, lack sufficient description. It is unclear what normalization or metrics were used for each axis.
2. How effectively do the Fisher Ratio and PH2 verify the discrimination of Hamming spaces? It is necessary to add more experimental analysis in Section 4.
3. It would be beneficial to test whether the global propagation remains stable under noisy or partially corrupted modalities—for instance, when the embedding graphs contain random noise—to verify robustness.
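A minimal sketch of the robustness probe suggested in point 3: corrupt one modality's embeddings with Gaussian noise of increasing magnitude and re-run retrieval. `evaluate_map` is a hypothetical placeholder for the paper's own mAP evaluation routine:

```python
import numpy as np

def perturb(emb: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Simulate a partially corrupted modality with isotropic Gaussian noise."""
    rng = np.random.default_rng(seed)
    return emb + sigma * rng.standard_normal(emb.shape)

img_emb = np.random.randn(1000, 64)  # placeholder image-side embeddings
for sigma in (0.0, 0.05, 0.1, 0.2):
    noisy = perturb(img_emb, sigma)
    # evaluate_map(noisy, txt_emb)   # hypothetical: the paper's mAP evaluation
```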
1. Future work might explore coupling DGHDGH with pre-trained large multi-modal models (e.g., BLIP-2) to test transferability.
2. It might also be fruitful to explore hybrid discrete–continuous codes instead of pure binary hashing, leveraging the same hard-negative generation principle.
3. The figures could use larger fonts and more contrast; are the authors planning visual revisions for the camera-ready version? |
Heavily AI-edited |
|
Deep Global-sense Hard-negative Discriminative Generation Hashing for Cross-modal Retrieval |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the challenge of discriminative cross-modal hashing through a global-sense perspective, constructing adaptive hard negatives that reflect global semantics rather than local feature proximity. The approach builds upon two main components: a graph propagation network (RGP) for capturing higher-order semantic dependencies and a discriminative synthesis unit (DGS) that regulates interpolation difficulty via channel-wise weighting. Conceptually, this work reframes hard-negative generation as an optimization over a semantic manifold.
1. Introduces a coherent, theoretically inspired motivation for rethinking negative sampling as a global consistency problem.
2. The loss design reflects an interesting interplay between semantic preservation, interpolation similarity, and coefficient diversity.
3. Empirical results validate the conceptual claims with strong mAP improvements.
1. In Figure 3, the comparison between DGHDGH and DHaPH shows large performance gains, but it is unclear whether both models use identical backbones and training setups. A controlled experiment would be necessary to ensure fairness.
2. The RGP module remains largely intuitive, while its empirical benefits are evident, there is no theoretical analysis of how the propagation maintains information fidelity or prevents over-smoothing.
3. The DGS module’s channel-wise λ weights perform well, but their dynamics are not visualized. A λ-distribution plot or feature-space interpolation visualization would clarify how “difficulty” is being modulated.
4. The method is validated only for image-text retrieval. Can DGHDGH extend to audio/video modalities?
1. Could the authors mathematically relate the RGP operation to spectral diffusion or Laplacian smoothing?
2. How sensitive is the overall model to λ initialization?
3. Discussing the future work like audio/video retrieval in Conclusion section would strengthen the impact. |
Fully AI-generated |
|
PANDORA: Diffusion-based Protein Conformation Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper has strong motivation. The main challenge of this research is sampling diverse conformations rather than a single point estimate; the authors propose diffusion with noise injection across reverse steps to address this challenge. They use MAE and Wasserstein distance to evaluate their method. However, no energy-based validity, steric-clash rates, or experimental comparisons are reported.
The paper’s strengths are clear problem framing and a technically coherent choice to model stereochemistry in ξ-space (bond lengths/angles/dihedrals) with step-wise regulation, which naturally supports diversity while maintaining plausible geometry. Empirically it delivers solid wins: lower MAEs on L/Θ/X recovery, distributional alignment via Wasserstein distances, and ablations that link Gaussian-smeared geometric features to accuracy; the generalization check on unseen proteins (e.g., Trp-cage, BBA) is a nice touch.
Validation leans on geometry statistics without independent physics/quality checks (clashscore, Ramachandran, energies), so “biophysical plausibility” isn't fully established. The repeated cutoff/clamping during sampling could bias the stationary distribution but isn't analyzed; the data scale is narrow (small proteins), and head-to-head comparisons against the strongest coordinate-space diffusion baselines with structure-quality metrics are limited. There is also no runtime/efficiency accounting or downstream utility test after reconstructing 3D (e.g., RMSD/Rg or functional tasks).
1. Node/edge embeddings use Gaussian smearing with fixed grids (see the sketch below). Did you tune K, \mu_min, \mu_max per protein class, and how does performance change with learned (e.g., radial basis) centers vs. fixed ones?
2. Beyond MAE/Wasserstein, did you compute steric clash rates, Ramachandran distributions, or energy (e.g., with a force field) to corroborate “physically plausible”? If not, can you add these checks?
3. You note a small right-shift in L (~0.025 Å). Does this accumulate along chains (e.g., drift in end-to-end distance) or vanish after reconstruction to 3D? Any impact on Rg/RMSD beyond the shown density maps?
4. How robust are results across sequence-diverse and longer proteins (>80 aa), and what happens under domain shifts (membrane proteins, IDPs)? |
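For concreteness on question 1, a minimal sketch of fixed-grid Gaussian smearing in the SchNet style; the grid values and the choice of gamma here are illustrative, and a learned variant would simply make the centers trainable:

```python
import numpy as np

def gaussian_smearing(d, mu_min=0.0, mu_max=5.0, K=50):
    """Expand scalar distances into K radial-basis features on a fixed grid."""
    centers = np.linspace(mu_min, mu_max, K)
    gamma = 1.0 / (centers[1] - centers[0]) ** 2   # width tied to grid spacing
    d = np.asarray(d, dtype=float)[..., None]      # (..., 1)
    return np.exp(-gamma * (d - centers) ** 2)     # (..., K)

feats = gaussian_smearing([1.2, 2.7])  # shape (2, 50)
```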
Fully AI-generated |
|
PANDORA: Diffusion-based Protein Conformation Generation |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Pandora, a diffusion-based generative framework that operates in protein internal coordinates—bond lengths, bond angles, and dihedral angles—to produce diverse protein conformations. Experiments across multiple MD datasets, with comparisons against several baselines, show that Pandora achieves superior conformation reconstruction and distributional fidelity.
1. The paper is well organized and clearly stated, especially Section 1 & 2.
2. By modeling in angular space, Pandora preserves structural accuracy while enabling more flexible architectures, removing the constraint of equivariance.
1. Using internal coordinates for molecular conformation generation is highly efficient for small molecules, but it poses challenges for large proteins:
a). The dimensionality of internal coordinates grows rapidly and can far exceed that of Cartesian coordinates in large proteins.
b). Errors accumulate during coordinate transforms: when reconstructing Cartesian coordinates from internal angles, small angular errors propagate along the chain, resulting in large deviations in terminal atoms (see the sketch after this list).
c). Internal coordinates capture only local geometry, so constructing residues that are distant along the sequence but spatially adjacent is likely to cause steric clashes.
Do the authors consider these problems?
2. It appears the authors directly cut off variables to a suitable range, which may introduce discontinuities and unstable boundary derivatives. Would a diffusion defined on an appropriate Riemannian manifold be a more suitable choice? Have the authors considered this alternative?
3. As the paper notes, the baselines operate at the residue level, implicitly or explicitly fixing subsets of bond lengths and bond angles. By contrast, Pandora is an atomic-level model. Consequently, many of the baseline comparison metrics are not directly meaningful. Could the authors report Pandora's structure-quality metrics (e.g., TM-score, RMSD, GDT-TS, lDDT, Cα clash rate, and peptide-bond break frequency) to provide an overall assessment?
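To illustrate point 1b, a toy 2D sketch of internal-to-Cartesian reconstruction: a small per-step angular error (one degree here, an illustrative choice) produces a terminal displacement that grows with chain length:

```python
import numpy as np

def build_chain(n, bond=1.5, turn=0.3, angle_noise=0.0, seed=0):
    """Place n points in 2D from internal coordinates (bond lengths + turn
    angles); small per-step angular errors compound along the chain."""
    rng = np.random.default_rng(seed)
    pos, theta = [np.zeros(2)], 0.0
    for _ in range(n - 1):
        theta += turn + angle_noise * rng.standard_normal()
        pos.append(pos[-1] + bond * np.array([np.cos(theta), np.sin(theta)]))
    return np.stack(pos)

clean = build_chain(200)
noisy = build_chain(200, angle_noise=np.deg2rad(1.0))
print(np.linalg.norm(clean[-1] - noisy[-1]))  # terminal drift grows with n
```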
The same as weaknesses. If the authors can satisfactorily address the concerns above, I would be inclined to increase my score. |
Lightly AI-edited |
|
PANDORA: Diffusion-based Protein Conformation Generation |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes Pandora, a diffusion-based model to generate diverse, plausible native and non-native protein conformations, filling gaps in existing methods. It uses a conditional transformer and integrates structural info (bond lengths, angles) for validity. Experiments show it outperforms baselines and generalizes to unseen proteins.
- This study investigates the usage of diffusion models to generate the protein conformations and the experiments show its effectiveness.
- The introduction of related knowledge is comprehensive.
- The novelty of the method is limited. I do not think denoising bond lengths/angles with diffusion models is a very innovative idea.
- There is substantial room for improvement in the writing. I think the authors spend too many words discussing the most basic material.
- In Section 1.1, I do not think the **motivation** is really the `motivation' of this study, since it does not motivate any design choice; instead, it discusses why researchers study the problem of conformation generation at all.
- In Section 1.1, line 90, does this study solve the problem of folding pathways? A folding pathway may be something quite different from the conformations.
- In Section 1.2, the authors discuss the **challenges** and **contributions**. The challenge looks more like a basic introduction of the problem (conformation generation), and the contribution is too general to see what is special in this study.
- In Section 1.2, the first contribution contains some factual error. There have been several methods investigating utilizing AF3 to generate all-atom conformations, (although they may be not very successful), e.g., MSA-subsampling.
- In Section 3, the manuscript spends many words introducing the basic knowledge of diffusion models, attention mechanisms, network architectures, and inference algorithms. I think the authors could use more words to discuss what makes this study different from baselines and other methods.
- I would suggest the authors reduce the first 6 pages to fewer than 4 pages. There is too much basic background; we should assume readers have a basic understanding of this area. |
Fully human-written |
|
PANDORA: Diffusion-based Protein Conformation Generation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents PANDORA, a diffusion-based generative model that operates in the internal coordinate space of protein backbones (bond lengths, bond angles, and dihedral torsions). The goal is to generate protein conformations with more physically realistic local geometry and improved alignment with molecular dynamics (MD) ensemble distributions. Pandora uses a graph-based transformer architecture to encode geometric information at both node and edge levels, and performs diffusion directly in this internal coordinate parameterization. Experiments on fast-folding proteins demonstrate that generating conformations in internal coordinate space leads to more accurate recovery of bond lengths, angles, and backbone torsions, both in terms of prediction error and distributional similarity to MD ensembles. These results suggest that explicitly modeling internal degrees of freedom can enhance local geometric fidelity compared to residue-level coordinate generation.
- The paper makes a valid and important observation that many residue-level generative models overlook internal backbone degrees of freedom, which are crucial for accurately modeling protein conformations.
- Modeling in internal coordinates (bond lengths, bond angles, and dihedral torsions) is a reasonable approach to improving local geometric accuracy and preserving physically meaningful conformations.
- Empirical results, evaluated using the authors’ proposed metrics, consistently show that explicitly incorporating internal coordinates leads to measurable improvements in local geometry and conformational ensemble quality.
1. Although the paper aims to explore non-native conformation generation, specifically for protein folding, it presents limited architectural, modeling, or analytical innovation compared to existing protein conformation generation models.
2. While the method claims atomic-level modeling by diffusing over bond lengths, angles, and dihedral torsions, it only models backbone atoms and does not include side-chain degrees of freedom. This makes the "full atomic freedom" claim less convincing, especially given that models like AlphaFold3 and Boltz explicitly model both backbone and side chains in full degree of freedom.
3. The experimental evaluation is restricted to a subset of existing benchmarks (5 training + 2 test proteins from fast-folding datasets), which is already present in Str2Str, BioEmu, ConfDiff, and EquiJump. The authors' own analysis focuses on recovery MSE and distributional accuracy in internal coordinates; the results are limited and do not strongly demonstrate that Pandora can model folding/unfolding pathways, generalize to new proteins, or handle larger systems.
4. Clarity: The paper would benefit from substantial revision for clarity. The methodology section is dense and difficult to follow, and common components such as diffusion training and sampling could be presented in a more standard and simplified manner.
### Model
1. The paper uses an $f_{cut}$ function to clip angle and torsion values within defined ranges. This seems a bit ad hoc. Have the authors considered defining the diffusion process directly in the angle or torsional space (e.g., https://arxiv.org/abs/2206.01729), where the periodicity is naturally handled? (See the sketch after this list.)
2. Several method details are unclear:
- The paper states that the model "smooths discrete data (bond length, bond angle, dihedral angle) into continuous distributions." Why are these values described as discrete?
- What is meant by the "distance relation" in line 282?
- In Equation (6), should there be a softmax or normalization?
- In line 299, what does the variable *h* represent?
3. Since the baselines (Str2Str, ConfDiff, ConfRover) were re-trained, could the authors provide more details on their reproduction? For example: specific model variant, number of parameters, training epochs, optimizer, learning rate, and compute resources.
- For Str2Str, what forward noising cutoff was used?
- For ConfDiff, was classifier-free guidance used as in the original paper?
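To make question 1 concrete, a minimal sketch contrasting a hard cutoff with periodic wrapping onto [-pi, pi), which is how torus-based (torsional) diffusion avoids boundary discontinuities:

```python
import numpy as np

def wrap(angle):
    """Map angles onto the periodic domain [-pi, pi); -pi and pi stay
    identified, so no discontinuity appears at the boundary."""
    return np.mod(angle + np.pi, 2.0 * np.pi) - np.pi

def clamp(angle):
    """Hard cutoff (as the paper appears to do): values pile up at the
    boundary and the derivative is discontinuous there."""
    return np.clip(angle, -np.pi, np.pi)

x = np.pi + 0.15          # a noised torsion just past the boundary
print(wrap(x), clamp(x))  # wrap: ~ -pi + 0.15; clamp: stuck at pi
```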
### Experiments and results
1. $\xi_0$ Recovery Task Definition: Could the authors explain the task in more detail? How are samples generated for each model? Which values are used as the reference for MAE computation? How many samples are generated per protein?
2. The paper states that a transferable setup is used (five proteins for training, two for testing) in Experimental setup. However, primary results in Sections 4.2–4.5 are on the training split, with only Section 4.5 showing held-out test results. This difference should be more clearly stated for an accurate interpretation of the results.
3. BioEmu (https://www.science.org/doi/10.1126/science.adv9817) is a common diffusion-based protein conformation baseline that has been used for fast-folding proteins. Boltz (https://www.biorxiv.org/content/10.1101/2025.06.14.659707v1) is another diffusion-based model; while originally used for protein folding, its all-atom diffusion architecture could serve as another strong baseline for this task.
4. The free energy surfaces (e.g., Figure 4) are shown in low-dimensional projections of RMSD and Rg, whereas standard practice for fast-folding datasets is to project coordinates onto TICA (time-lagged independent components), which captures the slow collective motions (e.g., folding/unfolding). Could the authors provide results based on TICA coordinates or a similar analysis as in BioEmu and EquiJump (https://arxiv.org/abs/2410.09667)? (See the sketch after this list.)
5. Evaluation is limited to bond/torsion-space metrics, on which Pandora is specifically optimized, but lacks coordinate-based assessments (e.g., clashes, residue-residue contacts) needed to evaluate full 3D structural validity and ensemble fidelity.
6. In Table 6: where do the reported “number of conformations” come from, and why does the WW domain contain far fewer conformations than the other proteins?
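For question 4, a minimal sketch of the suggested TICA projection, assuming the `deeptime` package; the feature matrix here is a random placeholder for per-frame MD descriptors:

```python
import numpy as np
from deeptime.decomposition import TICA  # assumes the `deeptime` package

# Placeholder for per-frame descriptors of an MD trajectory,
# e.g. pairwise C-alpha distances: shape (n_frames, n_features).
features = np.random.randn(10_000, 45)

tica = TICA(lagtime=100, dim=2)            # two slow collective coordinates
model = tica.fit(features).fetch_model()
proj = model.transform(features)           # (n_frames, 2), ready for a 2D FES
```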
### Other comments
1. The introduction can be more concise and focused. For example, broad statements such as "Beyond structural prediction… sparking a surge of innovative research… showcase the transformative potential of deep learning in addressing complex biological problems" feel generic and distract from the main topic. In the abstract and introduction, protein design models are mentioned but not clearly discussed in the context of protein structure/conformation generation.
2. Line 229: The notation $\{\xi\}_{t=0}^T$ is typically used to denote a discrete set, which may be inappropriate when referring to a continuous-time variable.
3. Line 304: typo $x_i^e \to x_i$.
4. Table 9: labeled as "inference set", but the listed proteins are the same five used during training. |
Fully human-written |
|
NAVI: Inductive Alignment for Generalizable Table Representation Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes NAVI, a model that treats each *header–value segment* (header:value) as the atomic unit for table representation. NAVI integrates (a) a global header encoder, (b) Structure-aware Masked Segment Modeling (SMSM) for balanced masking of headers, values, and tokens, and (c) Entropy-driven Segment Alignment (ESA) for contrastive routing between low-entropy header-centric and high-entropy row-centric representations. The goal is to achieve fidelity (schema–value preservation and row distinctiveness) and consistency (robustness to schema, lexical, and structural variation). The paper provides theoretical analyses, extensive ablations, and evaluations on two WDC WebTables domains (Movie and Product).
- Well-motivated segment-level design. Treating (header:value) pairs as set elements with local positional encoding nicely combines permutation invariance and context awareness. The connection to DeepSets provides theoretical clarity.
- Entropy-based contrastive routing. The distinction between low- and high-entropy columns for header vs. row alignment is intuitive and empirically supported. The alignment/uniformity analysis is a good step toward geometric justification.
- Comprehensive evaluation axes. The paper evaluates discriminative (classification, clustering), generative (header prediction, value imputation), and invariance-based (PSI, header clustering) tasks, providing a holistic empirical picture.
- Limited domain diversity and possible bias.
Experiments are restricted to two domains (Movies, Products) with the largest 100 tables per domain. This selection favors clean, high-quality schemas. The model’s robustness to smaller or noisier tables, numeric-heavy domains (e.g., finance), and unseen domains is unclear. Cross-domain and noisy-table experiments are needed.
- Under-specified training regimen.
All models are trained for only 2 epochs with batch size 32 on datasets up to 3.9M rows. Such limited training may hinder convergence, confounding architectural effects with optimization noise. The paper should report learning curves, seed variance, and matched compute comparisons for baselines.
- Missing negative sampling details.
InfoNCE-based contrastive results are sensitive to negative sample quality. Clarify negative sampling strategy (same/different table, same domain, batch size, memory bank use) and ablate negative set size.
- No comparison to classical tabular methods.
The paper reports XGBoost only on top of embeddings. End-to-end baselines (e.g., XGBoost on raw features, TabPFN, TabNet) are missing, making it difficult to gauge absolute improvements.
- Are header encoder parameters updated during training? Please include an ablation comparing frozen, partially fine-tuned, and lightweight alternatives.
- How sensitive is entropy-based routing to threshold settings? Compare fixed, percentile, and soft routing schemes.
- Could the author(s) ablate negative set size and temperature parameters (τ_dom, τ_ent) to analyze their effect on alignment/imputation?
- How does NAVI perform on numeric-heavy domains? Compare against numeric-specialized models (e.g., TP-BERTa, TabPFN).
- Can author(s) demonstrate cross-domain transfer (e.g. Product→Movie, Movie→Product) to substantiate domain-invariant anchor learning?
- How robust is NAVI under schema noise (renaming, typos, column swaps)?
I would consider raising my score if the authors can adequately address these questions. |
Fully AI-generated |
|
NAVI: Inductive Alignment for Generalizable Table Representation Learning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes NAVI: Entropy-aware Alignment via Header–Value Induction. NAVI captures the structural properties of tables through schema-aware segment induction and modeling. In addition, NAVI employs entropy-driven alignment of segments to selectively incorporate domain knowledge shared among in-domain tables. Through various experiments, the paper shows effectiveness of NAVI on various downstream tasks.
In general, the paper is easy to follow, and it provides theoretical grounds for the proposed method.
- The figures do not help the reader understand the proposed method.
- In Figure 1, readers cannot see what they are supposed to understand. Also, a clear explanation with concrete examples of distinctiveness (fidelity) and robustness (consistency) is required.
- In figure 2, the paper states there are trade-offs, but it is really hard to visualize what the trade-offs are.
- It would be great to have explanations of the concepts with the examples shown in the figures.
- As the paper addresses the importance of table representation for downstream tasks, it would be interesting to see how NAVI compares to simple heuristics for encoding tables (e.g., TableVectorizer or TextEncoder in the skrub package) combined with tabular learning methods such as TabPFN, XGB, and LR.
- What characterizes the distinctiveness (fidelity) and robustness (consistency)? What are some concrete examples?
- How does NAVI deal with numerical values?
- Would there be more datasets to compare the performance of NAVI?
- What are the grounds for choosing a BERT-style model? Could NAVI benefit from a more sophisticated architecture? |
Fully human-written |
|
NAVI: Inductive Alignment for Generalizable Table Representation Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses the trade-off between fidelity (preserving specific schema-value semantics) and consistency (robustness to schema variations) in transformer-based models for in-domain table representation. The authors propose NAVI (Entropy-aware Alignment via Header-Value Induction), a framework that introduces the "header-value segment" as the atomic unit of table representation. NAVI employs three core mechanisms: (1) Schema-aware Segment Induction (SSI), which uses a global, context-free header encoder to anchor segment semantics; (2) Masked Segment Modeling (MSM), which enforces schema-value dependencies through balanced masking of header and value tokens; and (3) Entropy-driven Segment Alignment (ESA), a novel contrastive learning objective that categorizes columns by value entropy. ESA aligns low-entropy (domain-coherent) columns with their global header embeddings to promote consistency, while aligning high-entropy (entity-discriminative) columns with their row-specific value embeddings to preserve fidelity. Extensive experiments show that NAVI outperforms baselines on generative (imputation) and discriminative (classification, clustering) tasks, successfully balancing the two desiderata.
S1. The paper provides a valuable conceptual contribution by formalizing the core challenge of in-domain table representation as a trade-off between "fidelity" and "consistency"—a key unsolved problem for creating generalizable tabular deep models. The paper further breaks this down into structural and domain-specific components, providing a clear and principled lens for evaluating and developing models in this space.
S2. The core mechanism, Entropy-driven Segment Alignment (ESA), is a novel and highly intuitive solution to the fidelity-consistency dilemma. Using column value entropy to dynamically determine the contrastive learning target (a stable global header for domain concepts vs. a specific local value for discriminative entities) is a clever and effective method for explicitly balancing these two competing objectives within a single model.
S3. The experimental evaluation is comprehensive and well-aligned with the paper's conceptual framework. The use of the Permutation Sensitivity Index (PSI) as a direct measure of structural consistency is particularly effective, and the reported near-zero PSI for NAVI is an impressive result.
W1. The Entropy-driven Segment Alignment (ESA) mechanism relies on the InfoNCE loss, which inherently assumes that for any query there is only one positive sample and all other samples are true negatives. This assumption is frequently violated in real-world tabular data. For the domain consistency loss ($L_{dom}$), correlated columns or synonyms (e.g., `director` and `auteur`) are all treated as distinct negative samples, creating a "false negative" problem. While the Global Header Encoder is intended to mitigate this by mapping synonyms to close embeddings, this creates a conflicting objective with the InfoNCE loss, which is forced to push them apart. Similarly, for the entity fidelity loss ($L_{ent}$), two different rows that share the same value (e.g., two different products with the color `red`) would be incorrectly treated as negative pairs. The paper does not analyze the impact of this "false negative" discrepancy, which could degrade the quality of the learned embedding space (see the sketch after this weaknesses list).
W2. The paper's positioning against prior work, particularly "consistency-oriented" models like HAETAE, could be stronger. HAETAE also utilizes a context-free header anchoring mechanism, and the paper's claim that it suffers from "header-value misalignment" is asserted rather than deeply investigated. A more direct comparison of how NAVI's segment-based induction and alignment mechanistically differs from and improves upon HAETAE's header-anchoring would strengthen the paper's novelty claim.
W3. The experimental evaluation is missing a helpful baseline comparison. While the paper compares NAVI's embeddings against other transformer-based embeddings (e.g., BERT, TAPAS), it does not include a comparison against a traditional model like XGBoost trained directly on the raw, pre-processed features. NAVI's primary strength is handling schema variation, which GBDTs cannot. However, including a baseline on a "clean" version of the dataset would be valuable to quantify the performance gap that still exists between complex neural models and top-tier GBDTs on standard classification tasks. This would help position the work in the broader context of the 'NNs vs. GBDTs' debate for tabular data.
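To spell out the single-positive assumption behind W1, a minimal InfoNCE sketch: every row of `negatives` sits in the denominator and is pushed away from the query, whether or not it is semantically a true negative (a synonym header, a shared value):

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, tau=0.1):
    """Standard InfoNCE: exactly one positive per query; all K `negatives`
    are treated as true negatives, which is where false negatives hurt."""
    q = F.normalize(query, dim=-1)                     # (B, d)
    p = F.normalize(positive, dim=-1)                  # (B, d)
    n = F.normalize(negatives, dim=-1)                 # (B, K, d)
    l_pos = (q * p).sum(-1, keepdim=True) / tau        # (B, 1)
    l_neg = torch.einsum("bd,bkd->bk", q, n) / tau     # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1)
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 16, 64))
```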
**Comments**
C1. The paper's core concepts of "fidelity" and "consistency" are introduced in the motivation, but their explanation remains somewhat abstract. The accompanying Figures 1 and 2, which are intended to visually clarify these concepts and their trade-offs, are dense and difficult to interpret, making it challenging to build a concrete intuition for the problem before the methodology is presented.
C2. The paper should clarify the exact mechanism for obtaining $H_{ctx}$ and $V_{ctx}$. It is stated that they are "extracted by pooling the contextualized token embeddings," but the connection to the initial $z_j^k$ (input) and the final $e_t$ (output) is implicit. An example would be very helpful: for the segment "director: danny", are the "header" tokens just director or do they include the :? This precise operational detail is important for understanding the model's architecture. Additionally, in Figure 4, it should be made clearer which plot corresponds to BERT, as the small titles are easy to miss.
Q1. Regarding the Entropy-driven Segment Alignment, the categorization is based on quartiles (Q1 and Q3), which seems to create a "dead zone" for all medium-entropy columns between Q1 and Q3. These columns apparently do not contribute to the $\mathcal{L}_{align}$ loss at all. What is the ratio of columns that fall into this dead zone, and what is the theoretical or empirical impact of ignoring them during alignment? Does this not risk creating a representation where domain-coherent and entity-discriminative columns are well-structured, but the "average" columns are left in a poorly structured part of the embedding space? (A small routing sketch follows these questions.)
Q2. The framework's reliance on a context-free global header encoder for domain consistency is a key contribution. How does this mechanism handle headers that were not seen during pretraining (i.e., OOV headers, synonyms, or typos)? Does the model's consistency and fidelity degrade gracefully? A robustness evaluation against OOV headers seems essential for a method that so heavily relies on them for anchoring domain semantics.
Q3. In the $L_{msm}$ objective function (Section 2.2), the denominator of the softmax is written as $\sum_{v\in V}exp(We_{v}+b)$. The variable $e_v$ is not defined, whereas the numerator uses $e_t$ (the contextualized output token). Is $e_v$ a typo and intended to be something else, for instance, a non-contextualized embedding for each word $v$ in the vocabulary $V$? Please clarify the exact formulation of this loss function. |
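A minimal sketch of the quartile routing behind Q1, with hypothetical column entropies: by construction, roughly half of the columns fall between Q1 and Q3 and receive no alignment signal:

```python
import numpy as np

def route_columns(col_entropy: dict) -> dict:
    """Quartile routing: below Q1 -> align to global header; above Q3 ->
    align to row-specific value; the middle half is the 'dead zone'."""
    vals = np.array(list(col_entropy.values()))
    q1, q3 = np.quantile(vals, [0.25, 0.75])
    return {c: ("header" if h < q1 else "value" if h > q3 else "ignored")
            for c, h in col_entropy.items()}

print(route_columns({"genre": 0.8, "runtime": 2.1, "year": 2.8, "title": 4.5}))
# {'genre': 'header', 'runtime': 'ignored', 'year': 'ignored', 'title': 'value'}
```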
Fully human-written |
|
NAVI: Inductive Alignment for Generalizable Table Representation Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces NAVI, a new framework for tabular representation learning that jointly optimizes for fidelity and consistency. Its core idea is to represent rows as unordered sets of header: value segments and employ a novel entropy-driven contrastive alignment. This mechanism aligns low-entropy (domain-coherent) columns to ensure consistency, while separating high-entropy (entity-specific) columns to maintain fidelity. Experiments demonstrate that NAVI significantly outperforms strong baselines across a range of downstream tasks.
1. The proposed "Fidelity" and "Consistency" framework provides a highly useful and insightful lens for evaluating and designing table representation learning models.
2. The concept of the Header-Value Segment is simple yet effective. The Entropy-driven Alignment is a brilliant idea that ingeniously connects statistical properties to semantic objectives.
3. The experimental setup is sound, the evaluation is multi-faceted, and the results are significant. The ablation studies and qualitative analyses are highly persuasive.
1. Comparison with Graph Neural Network (GNN) Approaches: A brief mention and comparison with GNN-based methods in the Related Work section could make the literature review more comprehensive.
2. Scalability: For wide tables with a very large number of columns, the input sequence can become excessively long. It would be beneficial to discuss the model's potential bottlenecks with such tables and possible solutions.
3. Entropy estimation based on empirical distributions might be unstable for columns with long-tail distributions or sparse data. I suggest the authors briefly discuss this potential limitation.
4. A significant limitation of the NAVI framework lies in its handling of numerical data, a critical weakness given that numerical values are arguably the most prevalent and foundational data type in real-world tables. The entropy-driven mechanism will likely misclassify numerical columns (e.g., price, quantity) as high-entropy, "entity-discriminative" attributes due to their high cardinality. Consequently, the contrastive learning objective pushes their representations apart, ignoring the inherent ordinal and metric semantics between values (e.g., the model fails to learn that '10' is semantically closer to '11' than to '100'). This fundamentally undermines the model's ability to perform numerical reasoning, severely restricting its applicability for tasks like regression or range-based queries and confining its value to a minority of use cases dominated by categorical and textual data.
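To substantiate this concern, a quick sketch of empirical column entropy: a high-cardinality numerical column saturates near log2(n) bits and would be routed as entity-discriminative, while a categorical column stays low. The data here is synthetic and illustrative:

```python
import numpy as np

def empirical_entropy(values) -> float:
    """Shannon entropy (bits) of a column's empirical value distribution."""
    _, counts = np.unique(np.asarray(values), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
price = np.round(rng.lognormal(3.0, 1.0, 1000), 2)  # numeric, nearly all unique
color = rng.choice(["red", "blue", "green"], 1000)  # low-cardinality categorical
print(empirical_entropy(price))  # close to log2(1000) ~ 10 bits
print(empirical_entropy(color))  # close to log2(3) ~ 1.6 bits
```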
My main question, which is central to my evaluation, concerns the treatment of numerical columns. The entropy-driven alignment mechanism appears to classify numerical columns (e.g., price, age, measurements) as high-entropy, thereby treating them as entity-discriminative. The contrastive objective would then push the representations of different numerical values (e.g., "10.5" and "10.6") apart, just as it would for distinct movie titles. This approach seems to neglect the crucial ordinal and metric relationships inherent in numerical data.
Could you clarify if the current NAVI framework has any mechanism to preserve these numerical semantics?
If not, how do you see this impacting the model's utility for common, numerically-grounded tasks like regression or range-based queries? Could you elaborate on how the framework might be extended to incorporate a type-aware objective that respects the unique properties of numerical values? |
Fully AI-generated |
|
Active Learning for Molecular Conformation Optimization with a Domain-Agnostic Neural Surrogate Oracle |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work proposes GOLF-Neural Oracle as a follow-up to the GOLF active learning framework, aiming to perform molecular conformational optimization in a more data-efficient way compared with traditional DFT-based approaches. The method achieves state-of-the-art performance in the data-efficient regime.
* The motivation to develop the Neural Oracle is clearly presented.
* The method demonstrates superior performance on conformational optimization tasks.
* In the Introduction, the authors state that methods like MVE, which provide uncertainty estimation, require architectural modifications and retraining, which is considered a burden. However, uncertainty estimation methods such as dropout uncertainty require only minor modifications with no obvious training cost. This raises questions about the motivation for introducing the Neural Oracle.
* The authors claim that the Polyak-averaged Neural Oracle provides more stable potential energy estimates, yet this stability is not well supported by experiments or analytical evidence.
* The paper argues that removing the need to select a surrogate oracle improves applicability to complex domains. However, results are only shown on the SPICE2.0 dataset, without comparison to GOLF on the same data or evaluation on other complex domains.
* The main mechanism of improvement appears to be the introduction of a mistake budget to reduce over-sensitivity to local-minima oscillations. This modification is relatively minor and limits the overall novelty of the method.
* Please clearly explain why uncertainty-based approaches (e.g., MVE, dropout) are not chosen, and justify the use of exponential moving averaging with either analytical reasoning or experimental evidence.
* Please justify why the Neural Oracle is expected to generalize well to broader chemical domains beyond those tested.
* Kindly add an LLM usage disclosure section in the appendix, as required by ICLR policy. |
Lightly AI-edited |
|
Active Learning for Molecular Conformation Optimization with a Domain-Agnostic Neural Surrogate Oracle |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes GOLF–Neural Oracle, a new active learning framework for training neural network potentials (NNPs) in molecular conformation optimization.
The method removes the dependency on empirical surrogate force fields (such as MMFF94 used in GOLF) by introducing a trainable surrogate oracle — a Neural Oracle updated as an exponential moving average (EMA) of the online NNP.
The authors benchmark their approach on the $\nabla^2$DFT and SPICE2.0 datasets, showing consistent improvements over baselines (Ensemble, MVE, GOLF-RDKit) in both standard and data-efficient regimes.
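For concreteness, the EMA update referred to above admits a minimal sketch (a standard BYOL-style Polyak step; the function and the default `tau`, which corresponds to the $\tau$ asked about in the questions below, are our placeholders rather than the authors' code):

```python
import torch

@torch.no_grad()
def ema_update(online_nnp: torch.nn.Module, oracle_nnp: torch.nn.Module, tau: float = 0.99):
    # Neural Oracle <- tau * Neural Oracle + (1 - tau) * online NNP,
    # applied after each gradient step on the online NNP.
    for p_oracle, p_online in zip(oracle_nnp.parameters(), online_nnp.parameters()):
        p_oracle.mul_(tau).add_(p_online, alpha=1.0 - tau)
```

Because the oracle lags the online network, its energy estimates change slowly between queries, which is presumably what the authors mean by "more stable potential energy estimates".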
* Clear motivation and problem formulation.
* Using an EMA-updated Neural Oracle is an appealing and lightweight idea inspired by BYOL. It avoids ensemble training, uncertainty prediction heads, or architectural modifications, making it broadly applicable to existing NNP pipelines.
* The results (Tables 1–4) consistently show that GOLF-Neural Oracle achieves the best or comparable performance with fewer additional conformations.
* The method can be plugged into any molecular NNP training pipeline to improve data efficiency, which may be valuable for high-cost DFT or ab initio workflows.
* The analogy to BYOL is intuitive, but the paper never explains why EMA-averaged weights provide a meaningful uncertainty signal or how this leads to improved sampling.
A theoretical or empirical calibration study (e.g., correlation between oracle energy error and true uncertainty) is missing.
* The experiments focus solely on end-metrics but do not analyze uncertainty quality or convergence behavior during active learning cycles. It is difficult to verify whether improvements stem from better uncertainty estimation or simply additional training dynamics.
* The “domain-agnostic” claim is overstated to some extent. All benchmarks involve isolated molecules in vacuum or small solvated systems.
No results are shown for periodic, condensed-phase, or large biomolecular systems — the domains where empirical force fields indeed fail.
* No runtime or computational-cost evaluation.
* How does the method scale to larger molecular systems?
* How sensitive is performance to $\tau$ and $M$ across different datasets, and are these hyperparameters transferable? |
Fully AI-generated |
|
Active Learning for Molecular Conformation Optimization with a Domain-Agnostic Neural Surrogate Oracle |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper investigates active learning for molecular conformation optimization, a setting where querying the oracle is time-consuming and the surrogate model trained on collected data may fail to generalize effectively during optimization. The authors propose an active learning framework that maintains two neural network potential (NNP) surrogate models: one learned directly from data, and another updated based on the first model. This approach integrates both conformation optimization and active learning principles. The proposed method is evaluated on publicly available molecular conformation optimization benchmarks.
This paper addresses an important problem — active learning for conformation optimization — which is highly relevant to the AI4Science community. The authors present the background and review existing methods clearly, providing readers with a solid foundation to understand the field.
Although the paper provides a comprehensive introduction, the research question and proposed method are not well-motivated. The rationale for employing two surrogate models is unclear, making it difficult to understand the underlying motivation. Furthermore, the reported improvement appears modest, and the results are presented without error bars. Please refer to my further questions below for more details.
1. I do not understand the necessity of maintaining two surrogate models — the online NNP and the target NNP — with the target NNP defined as an exponential moving average of the online NNP. This is not clearly explained in Section 4.
2. In Section 4, is the neural oracle equivalent to the target NNP mentioned in the abstract?
3. Could you please include error bars for each experimental result to better assess the statistical significance of the reported improvements? |
Lightly AI-edited |
|
Active Learning for Molecular Conformation Optimization with a Domain-Agnostic Neural Surrogate Oracle |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces GOLF-Neural Oracle, an active learning approach for molecular conformation optimization. Building on the GOLF framework, the authors train an NNP to replace the traditional force field in GOLF. During training, queries are sent to the DFT oracle to obtain high-quality datapoints for training the model. Results show that the proposed approach outperforms GOLF with RDKit force fields, as well as a set of other baselines both with and without active learning.
1. The motivation for introducing an NNP to replace an empirical force field is quite reasonable, since, as the authors note, there are many interesting systems for which traditional force fields are not accurate enough (especially in materials science)
2. The proposed framework is quite well thought-out, and addresses the problem. The target network weighted average technique is interesting and seems to perform well
3. The results comparing GOLF-Neural Oracle with baselines, especially in Tables 3 and 4, are convincing and quite strong
The main weakness in my opinion is the novelty compared to the GOLF framework. The main difference is the Neural Oracle, which while interesting, is not necessarily a major methodological novelty. The other modification, the mistake budget M, is well-motivated but I'm not sure how much its use makes a difference. Looking at the tables that compare GOLF-RDKit with GOLF-Neural Oracle (Tables 1 and 2), I don't really see a major improvement in the introduced method vs GOLF-RDKit. While there does seem to be a small benefit, it's usually only by 1 or a few percentage points. Given the simplicity of the modifications, I would only find this paper very interesting if the modifications led to very significant improvements over GOLF, which I'm not sure I see.
While the comparisons to other methods in Tables 3 and 4 are strong, I think the major baseline is GOLF-RDKit, and all the comparisons with that method yield very limited benefit for GOLF-Neural Oracle.
N/A |
Fully human-written |
|
Active Learning for Molecular Conformation Optimization with a Domain-Agnostic Neural Surrogate Oracle |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a new active learning method for molecular conformation optimization. Instead of using a fixed empirical surrogate oracle like in prior work (e.g., GOLF with MMFF94), it uses a learnable neural oracle updated via exponential moving average (EMA). The method is model-agnostic, simple to implement, and avoids the cost of training ensembles or changing model architecture.
1. The main idea of replacing the empirical surrogate with an EMA-based neural oracle is simple and general. It avoids the need for domain-specific tuning.
2. The method is compatible with existing NNP models and does not require any changes to their architecture.
3. The empirical results are strong, and the method performs well even when only 1000 additional samples are used.
1. The paper does not provide a detailed explanation of why EMA leads to stable uncertainty estimation. The justification appears to be mainly empirical.
2. The criterion used for selecting conformations, which depends on counting negative energy changes, feels somewhat heuristic.
3. In the SPICE experiments, the proposed method benefits from finetuning, while the baseline models are used as-is. This may affect the fairness of the comparison.
1. What motivated the choice of EMA over other methods for estimating uncertainty, such as dropout-based approaches or evidential models?
2. Can this approach be extended to highly flexible systems, such as large biomolecules, where energy estimates may become less reliable due to structural noise? |
Lightly AI-edited |
|
Discourse-Aware Retrieval-Augmented Generation via Rhetorical Structure Modeling |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Discourse-RAG, a novel training-free framework designed to address a key limitation in standard Retrieval-Augmented Generation (RAG): the tendency to treat retrieved documents as a flat, unstructured "bag of facts." This "flat structure" problem leads to intra-chunk structural blindness and inter-chunk coherence gaps, hindering the model's ability to synthesize evidence and reason.
S1. The paper identifies a clear and important limitation of standard RAG (its "flat structure") and proposes a novel, linguistically-grounded solution that directly addresses it.
S2. The method achieves state-of-the-art performance on multiple, diverse benchmarks (long-doc QA, ambiguous QA, summarization), demonstrating its effectiveness and generalization ability.
S3. The paper is clear, well-illustrated, and reproducible thanks to the detailed appendices.
W1. This method is computationally expensive. The proposed pipeline requires an enormous number of LLM inference calls per query. As described in the methodology, for a top-k retrieval, the framework makes k calls for intra-chunk RST tree construction, k * (k - 1) calls for inter-chunk rhetorical graph construction, and 1 call for planning, i.e., k^2 + 1 calls in total (at k = 10, already 101 LLM calls per query).
W2. The entire framework's success is predicated on the LLM's ability to function as a high-quality, zero-shot RST parser. This capability is assumed, not proven.
Q1: Why did the authors not include an intrinsic evaluation of the RST parser against a gold-standard dataset? How can we be confident that the generated structures are faithful and not just plausible-sounding hallucinations that happen to guide the LLM?
Q2: Did the authors compare the full, complex RST parsing against simpler structural signals? For example, what is the performance if only explicit discourse markers (e.g., "however", "because", "in contrast") are used to build the inter-chunk graph, without any RST tree parsing? |
Lightly AI-edited |
|
Discourse-Aware Retrieval-Augmented Generation via Rhetorical Structure Modeling |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Discourse-RAG, a retrieval-augmented generation pipeline that makes the model explicitly use discourse structure. It first parses each retrieved chunk into an RST-like tree and then links chunks with rhetorical relations that capture support, elaboration, or conflict. A final planning stage guides generation using this structure. In evaluation, it outperforms standard RAG and other structure-aware baselines on long-context QA, ambiguous QA, and scientific summarization.
- It uses one pipeline where chunk-level discourse trees feed into a cross-chunk graph, and both are used to guide generation.
- The method is tested on three tasks, with two Llama models, in both open and closed settings, and it beats 2025 RAG baselines, including on ASQA.
- The ablations, noise tests, and chunk size tests show that the method improves across different settings.
- While training-free, the pipeline requires multiple LLM calls per query (RST parsing per chunk, O(k²) pairwise relation inference, planning, generation). Cost grows quickly with larger k and the paper should discuss the latency and token counts.
- I noticed that all LLM benchmarks use Llama 3.x models. Since Qwen was already used for embeddings, why not include Qwen models in the main comparisons as well?
- Discourse quality is not validated. All trees and relations come from an LLM prompt rather than a parser with known accuracy. If the LLM segments poorly, the whole pipeline can degrade.
- Relation set may be too big. The method uses many fine grained discourse labels but does not show which ones actually help. A smaller set might work the same.
- Evaluation scope is narrow. Results are mostly on English, long context, clean inputs. It is unclear how well this works on multilingual or noisy data.
- Did you evaluate the accuracy of your LLM-generated RST trees against gold-standard annotations (e.g., on RST-DT)?
- Have you considered integrating neural discourse parsers or non-RST frameworks (e.g., entity grids, coherence models)?
- In cases where Discourse-RAG underperforms standard RAG (if any), what are the failure modes? Are they due to incorrect rhetorical parsing, poor planning, or something else?
- Beyond automatic metrics (ROUGE, LLM Score), was there any human assessment of coherence, faithfulness, or readability? LLM-based scoring can be biased. |
Heavily AI-edited |
|
Discourse-Aware Retrieval-Augmented Generation via Rhetorical Structure Modeling |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Discourse-RAG, a retrieval-augmented generation framework that explicitly models intra- and inter-chunk rhetorical structures using Rhetorical Structure Theory (RST) and rhetorical planning to improve coherence and factual consistency in long-context reasoning. It demonstrates strong empirical results on multiple benchmarks (Loong, ASQA, and SciNews) across varying document lengths, outperforming several state-of-the-art RAG baselines. While the approach is somewhat heavy and empirically oriented, its clear performance gains and conceptual novelty justify acceptance, provided reproducibility and efficiency details are strengthened.
1. The idea of introducing rhetorical trees and inter-chunk discourse graphs into RAG is original and conceptually well-motivated, bridging discourse analysis and generative reasoning.
2. The method is tested on diverse, long-context benchmarks with detailed ablations and robustness studies (chunk size, Top-k, noise, and structure perturbations), giving credibility to the empirical claims.
3. Discourse-RAG outperforms strong baselines (StructRAG, MAIN-RAG, RQ-RAG) in both accuracy (LLM Score, EM) and factuality (SummaC, SARI), particularly on large-context and noisy retrieval scenarios.
1. Both intra- and inter-chunk discourse structures rely on LLM prompting for RST parsing, raising concerns about reproducibility, cost, and stability.
2. While results show improvements, the paper doesn’t deeply explore why rhetorical modeling helps or how structural cues propagate through the generator beyond surface correlations.
3. The multi-agent setup (parsing, graphing, planning) introduces significant preprocessing latency and complexity, which may limit real-time or large-scale deployment; no efficiency analysis is reported.
1. Provide a quantitative evaluation of RST parsing accuracy and its impact on final performance (e.g., noise sensitivity to incorrect discourse trees).
2. Include a runtime and cost comparison versus other RAG systems (e.g., StructRAG, MAIN-RAG) to demonstrate practical feasibility. |
Fully AI-generated |
|
Discourse-Aware Retrieval-Augmented Generation via Rhetorical Structure Modeling |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents Discourse-RAG, a RAG framework that explicitly models discourse structures. It works via a three-stage pipeline: 1) constructing intra-chunk RST trees to identify core vs. supporting information, 2) building inter-chunk rhetorical graphs to model relationships between chunks, and 3) using a discourse-aware planning module to generate a blueprint for the final answer. Experiments on the ASQA, Loong, and SciNews benchmarks show that Discourse-RAG achieves strong performance.
The paper is well written and achieves strong performance against solid baselines.
The method can be plugged into any setup without any fine-tuning.
The components are ablated.
There is no analysis of how the method scales, in terms of cost (tokens) and latency, with higher top-k settings.
What are the tradeoffs of offline indexing and pre-computing the RST trees for the whole dataset?
How does the method scale, in latency and token usage, with higher top-k, different chunk sizes, and bigger documents? |
Fully human-written |
|
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
In this paper, the authors propose to use a pretrained LLM to select high-quality, reasoning-intensive pretraining data. Specifically, they first identify the retrieval heads of a small model, and then compute the gap between its loss with these retrieval heads masked and its loss with them intact. A higher loss gap indicates higher reasoning intensity of the data. Experiments demonstrate the method's effectiveness.
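A runnable sketch of the loss-gap computation as we read it (only the scoring logic is taken from the summary; GPT-2 and its `head_mask` argument stand in for the paper's Llama2-like 1.3B model and masking mechanism):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")

def attention_influence_score(text, retrieval_heads):
    # retrieval_heads: iterable of (layer, head) pairs to mask out.
    enc = tok(text, return_tensors="pt")
    labels = enc["input_ids"]
    with torch.no_grad():
        base = model(**enc, labels=labels).loss.item()  # all heads active
        head_mask = torch.ones(model.config.n_layer, model.config.n_head)
        for layer, head in retrieval_heads:
            head_mask[layer, head] = 0.0                # zero out retrieval heads
        masked = model(**enc, labels=labels, head_mask=head_mask).loss.item()
    # Larger gap => the example relies more on retrieval heads,
    # which the method takes as a proxy for reasoning intensity.
    return masked - base
```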
1. The paper has a clear structure and is easy to understand.
2. The proposed method has good practical application scenarios.
1. The experimental design may not be entirely reasonable. Compared to the baseline, the training data is mixed with an additional 73B tokens of screened data. Should the baseline data also include 73B tokens of randomly sampled data for a fair comparison?
2. Lack of further experimental analysis. To further validate the practical value of the proposed method, the following analyses may be necessary:
2-1. Are the identified retrieval heads consistent across different corpora? If not, is it necessary to re-identify them for each target corpus?
2-2. Do the screening model and the training model need to come from the same model family? For example, can data filtered by a Llama model be used to train a Qwen model?
2-3. In practical applications, data filtering for continued pretraining (CPT) may be a more common scenario. How effective is the proposed method there? For example, in CPT aimed at enhancing reasoning ability, suppose the baseline model is trained on a 400B-token corpus while the comparison model is trained on a high-quality 100B-token subset filtered from it. If the comparison model matches or even exceeds the baseline, that would demonstrate substantial practical value.
2-4. Performance and efficiency analysis of different screening models.
Please see the weaknesses. |
Lightly AI-edited |
|
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes AttentionInfluence, a new method for efficient pre-training data selection by leveraging the retrieval heads. AttentionInfluence identifies the important attention heads in a small LLM for retrievals and selects pre-training data examples based on the loss difference over examples between keeping and masking out such attention heads. Experiments show that AttentionInfluence selects data that improves downstream performance on knowledge-intensive and reasoning-intensive tasks, and is more efficient than other data selection baselines, as a small LLM is employed as the data selector.
1. The paper proposes a new pre-training data selection method with a focus on the efficiency of data selection and weak-to-strong generalization. Such new perspectives on pre-training data selection contribute to the literature beyond language modeling.
2. The proposed method is well grounded in the interpretability literature, and experiments across multiple benchmarks provide empirical support.
3. The paper presents comprehensive analyses of different design choices associated with the proposed method.
1. There exists a mismatch between the functionality of retrieval heads (long-context retrieval and reasoning) and the downstream task of the paper (pre-training data selection), and this leads to my concern about whether the proposed method is appropriate and well-motivated. In the literature, the retrieval heads are shown to be important for long-context retrieval, understanding, and reasoning tasks (e.g., needle-in-the-haystack), but their influences on short-context tasks are much less strong. In the pre-training literature, retrieval heads are also discussed more in the context of long-context pre-training or context extension. However, this paper does not specifically target long-context pre-training, and all the downstream tasks being evaluated (e.g., those in Table 1 and Table 2) are short-context tasks. Therefore, in my opinion, there is a mismatch between the methodology and the downstream task in this paper. While the author might have been aware of the effect of context length, as Section 6 shows that AttentionInfluence selects longer data examples, the discussion is rather limited; this paper needs to be better motivated by including more discussions/experiments on the effects of context length.
2. The empirical result is relatively weak compared to the baselines. For example, in Table 1, AttentionInfluence-1.3B is worse on average compared with the FineWeb-Edu Classifier baseline, and < 1% better than the simple PPL filter baseline. In a sense, this is intuitive because of the mismatch in W1: most of the evaluation tasks are short-context, and data selected by leveraging retrieval heads might not show large enough effects for such tasks. I would expect AttentionInfluence to outperform other baselines more on long-context tasks.
3. The analyses depend heavily on loosely defined metrics. Several analyses in Section 6 use the metrics of Education Score and Reasoning score to emphasize the strength of the proposed method. While I appreciate the in-depth analyses present, the two metrics are loosely defined: (1) They are not commonly used metrics in the literature, as I did not find references provided in this paper or relevant papers in the literature that use these metrics, especially the education score; (2) They are not well-defined in the LLM-as-a-judge prompt. As the prompt in Appendix J, there is no definition for the term "educational value" and the definition for the term "reasoning-intensive" is also slightly vague. Given the vague definitions, it is unclear if the LLM-as-a-judge scores accurately capture the desired features of selected data. For example, the education scores in Table 10 and Table 11 clearly saturate and cannot differentiate between different methods at all.
Please refer to the weakness part. |
Fully human-written |
|
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes AttentionInfluence, a training-free and unsupervised method for pretraining data selection. The key idea is that data activating more retrieval heads are high-quality and encode reasoning-related behaviors. Using a 1.3B model to select the top ~20% (73B tokens) from the SmolLM corpus (241B tokens) based on the AttentionInfluence score, and mixing them to train a 7B model with 1T total training tokens, the approach outperforms both unsupervised and supervised baselines on reasoning and knowledge benchmarks.
1. The paper introduces a new perspective by leveraging mechanistic interpretability (retrieval head behavior) for pretraining data selection.
2. It provides detailed ablations and qualitative analyses.
3. The method is effective as demonstrated by the pretraining experiments while being entirely training-free and unsupervised.
Since only one pretraining corpus (SmolLM) and one pretrained model (a 7B model) are used, the evidence for the robustness and generalizability of the method is limited. Considering the high cost of pretraining and the theoretical generality of the AttentionInfluence method, it should be possible to further verify its effectiveness through post-training experiments.
1. The full SmolLM corpus contains 241B tokens, and the selected subset adds another 73B tokens, while the total training uses 1T tokens. How many of these 1T tokens come from the selected subset (for both AttentionInfluence and FineWeb-Edu Classifier), and how does this proportion differ from the baseline?
2. Is AttentionInfluence applicable to the mid- or post-training stage? Could you provide results using a smaller and different corpus and a different model at the mid- or post-training stage to verify the robustness of this method? |
Lightly AI-edited |
|
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper aims to improve data selection for pre-training by leveraging signals from attention heads. Specifically, the authors analyze how large language models allocate attention during reasoning and generation, and introduce a new metric, the AttentionInfluence Score, which quantifies the relative importance of tokens and data based on their attention contributions. The proposed approach first identifies attention heads that are critical for reasoning, masks them in a reference model, and then computes the loss difference between the base and reference models to measure the influence of the data. The authors use a 1.3B model for data selection on the SmolLM corpus, and then pretrain a 7B model on the combined corpus of SmolLM and the selected data instances, showing that the proposed data selection strategy outperforms relevant baselines.
* The design of the proposed metric (AttentionInfluence Score) is convincing for data selection.
* The proposed data selection process outperforms relevant baselines.
* The samples used to identify the important attention heads are crucial for the later data selection, as the pre-training data instances are selected mostly from their signals. In Section 4.1, the authors mention that these heads are derived from 800 synthetic samples, and it is questionable whether the selected data instances are simply very similar to those synthetic samples. Also, more details on constructing those samples and on their quality should be provided. Lastly, it would be great if the authors could justify why only the top 5% of the attention heads are selected for data selection.
* It is not intuitive that the authors select the pre-training data from the SmolLM corpus and then use the SmolLM corpus + the selected data instances from the same SmolLM corpus. In other words, the selected data instances for pre-training are just a subset of the SmolLM corpus (which is also used for pre-training), and it seems this setting just upsamples the existing data rather than demonstrating true data selection benefits.
* For pre-training research, it would be great to show the scaling law as a function of the number of parameters, in addition to the number of tokens provided.
* It is unclear why the authors report the results without learning rate decay settings in the main tables.
* Recent training strategies of modern foundation models include mid- and post-training. It is questionable whether the pre-trained model with the proposed data selection strategy can still be effective after mid- and post-training. In addition to this, it would also be interesting to see whether the proposed data selection strategy can be beneficial for the mid- and post-training stages, where the data selection process is typically more rigorous than the pre-training stage.
* The term Llama2-like-1.3B model is unclear. Is this not the Llama2 model?
* In Line 204, two references are broken.
Please see Weaknesses above. |
Fully human-written |
|
Causal Structure Learning in Hawkes Processes with Complex Latent Confounder Networks |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper studies structure learning (causal discovery) for partially observed multivariate Hawkes processes (PO-MHP) and provides the first principled framework that identifies latent sub-processes and recovers causal structure in continuous-time event sequences without prior knowledge.
The authors make a key theoretical contribution (Theorem 4.1) by showing that a continuous-time multivariate Hawkes process can be represented by a discrete-time linear causal model when the event-count data is appropriately binned. They further prove that the low-rank constraints on the cross-covariance matrices induced by the linear representation can be used to (1) detect the presence of latent confounder subprocesses, and (2) identify parent–cause sets and causal edges under explicit path-based conditions (Definition 4.4, Propositions 4.3/4.5, Theorems 4.7/4.8).
Based on this theoretical foundation, the paper proposes a novel two-phase iterative algorithm, where Phase I identifies causal relationships among the currently known (observed and inferred) subprocesses and Phase II discovers new latent confounders via rank tests. The authors also prove that this method guarantees the identifiability of the causal graph. Experiments on both synthetic and real-world datasets show that the proposed method effectively recovers the ground-truth causal graphs, outperforming existing baselines, especially in settings with complex latent structures.
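Schematically, and in our own notation rather than the paper's: if $X_t \in \mathbb{R}^d$ collects the event counts of all subprocesses binned over intervals of width $\Delta$, the representation described above takes, for small $\Delta$, the linear autoregressive form
$$X_t = \sum_{\tau \geq 1} A_\tau X_{t-\tau} + \varepsilon_t,$$
where the sparsity pattern of the coefficient matrices $A_\tau$ mirrors the excitation structure of the Hawkes process, so that rank deficiencies of cross-covariance blocks such as $\mathrm{Cov}(X_t, X_{t-\tau})$ become observable signatures of latent subprocesses.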
1. One of the paper's main strengths is its strong theoretical foundation. Theorem 4.1 is a powerful result, which innovatively establishes a connection between continuous-time Hawkes processes and a discrete-time linear autoregressive representation. Moreover, the combination of Definition 4.4, Proposition 4.5, and Theorems 4.7/4.8, which links symmetric path structures to observable rank deficiencies, is also original and enables finding latent confounders without prior knowledge of the existence or number of latent subprocesses.
2. The paper addresses a critical challenge, where many previous causal discovery algorithms assume that all relevant variables are observed. This paper instead studies a more realistic and difficult scenario under partial observability, where they propose a novel framework to uncover causal structure with unknown latent subprocesses. It is scientifically important.
3. The proposed two-phase iterative algorithm is a direct and elegant consequence of the theoretical results. The experiments, while concise, are well-designed to validate the paper's core claims. Specifically, the synthetic experiments include multiple graph families, sample sizes, and sensitivity checks.
1. Strong structural assumptions. Definition 4.4 formalizes the Symmetric Acyclic Path Situation (the observed effects being connected to the latent via paths of equal length and acyclic intermediate latents), which is a somewhat special topology. However, in complex systems intermediate latents can have varying path lengths or additional cross-links, which would break the condition and make that latent unidentifiable by the method.
2. While the paper is theoretically rigorous, it is also extremely dense. The writing could be polished by introducing a motivating real-world example in Figure 1 and then walking through the theoretical results step by step on that example. Moreover, some intuition could be given before stating the theorems and proofs; for example, the transition from Section 4.2.1 to Section 4.2.2 would benefit from it.
3. Evaluation on only one real-world dataset is limited. This small dataset (a five-alarm subgraph) cannot support claims of effectiveness on large, noisy real-world systems. Meanwhile, the results reported in Tables 1-4 do not include variance.
1. How sensitive is the performance of your algorithm to the choice of the discretization interval Δ? Is there a principled way to select an optimal Δ, or is it purely an empirical choice? How does data sparsity affect this choice?
2. Some additional experiments could be added. For example, what if some latent confounders are removed (which violates the condition in Definition 4.4)? |
Fully human-written |
|
Causal Structure Learning in Hawkes Processes with Complex Latent Confounder Networks |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a causal discovery method for identifying latent subprocesses and their causal structures in partially observed multivariate Hawkes processes. The authors introduce a causal model for partially observed multivariate Hawkes processes to represent continuous-time event sequences. Based on this model, they leverage rank constraints on the covariance matrix to identify causal influences without prior knowledge of the existence or number of latent subprocesses.
1. The authors discretize the Hawkes process, transforming the multivariate Hawkes process causal model into a linear autoregressive model, and theoretically prove this conclusion.
2. It proposes a method for identifying latent subprocesses and causal structures solely by leveraging rank constraints on second-order statistics (a numerical sketch of this kind of rank check is given below).
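A minimal numerical illustration of such a second-order rank check (our own sketch; the paper presumably uses a proper statistical rank test rather than this naive SVD threshold):

```python
import numpy as np

def cross_cov_rank(X, lag=1, tol=1e-8):
    # X: (T, d) array of binned event counts for the observed subprocesses.
    # Returns the numerical rank of the cross-covariance Cov(X_t, X_{t-lag});
    # rank < d is the kind of rank deficiency used to flag latent confounders.
    d = X.shape[1]
    Xt, Xlag = X[lag:], X[:-lag]
    C = np.cov(Xt, Xlag, rowvar=False)[:d, d:]  # off-diagonal (cross) block
    s = np.linalg.svd(C, compute_uv=False)
    return int(np.sum(s > tol * s[0]))
```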
1. In Proposition 4.5, should it be that the rank constraint is both necessary and sufficient for the corresponding local independence in the graph, under the structure defined in Definition 4.4 and the data generated accordingly?
2. The results mentioned in the original SHP paper differ significantly from those presented in this paper. Additionally, could results for different time intervals be provided, similar to the approach in the SHP paper?
3. When the paper transforms the model into a linear autoregressive model, does it require the noise to be constrained as Gaussian?
4. The proposed two-phase iterative algorithm suffers from severe scalability limitations, which restrict its practical applicability.
Typo:
There is an extra } at line 300.
What parameters were used for the proposed method in the real-world data experiments? |
Lightly AI-edited |
|
Causal Structure Learning in Hawkes Processes with Complex Latent Confounder Networks |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the problem of causal structure discovery in multivariate Hawkes processes under partial observability. Authors consider Hawkes processes with elaborate latent structure, and derive the conditions under which the latent variables and the relationship between them is identifiable. Similar results exist in the context of linear autoregressive processes and as acknowledged by the authors, they inspire the latent structure discovery results in this manuscript (Theorem 4.7 and 4.8). By transforming the Hawkes process inference problem into a discrete-time linear autoregressive formulation (Theorem 4.1), the authors establish the results for Hawkes processes.
The paper proposes a two-phase iterative algorithm that alternates between (i) discovering causal relations among existing subprocesses and (ii) inferring new latent subprocesses based on rank constraints of cross-covariance matrices. Necessary and sufficient conditions are derived for identifiability, including the introduction of path-based conditions (Definition 4.4) ensuring one-to-one correspondence between latent confounder structures and observable rank deficiencies. Empirical results on synthetic and real-world data show that the proposed method successfully recovers causal structures even when latent confounders exist.
Compared to the previous NeurIPS 2025 submission that I had reviewed, the manuscript has substantially improved in clarity of the claims and structure of the paper. Most importantly, the iterative algorithm is now explicitly defined, assumptions and identifiability conditions are more carefully motivated, and the connection to prior work—including an additional LPCMCI baseline in experiments, rank-based latent structure discovery methods, and INAR processes—has been expanded.
1. Novelty of Theorem 4.1 can still be debated. While the authors’ expanded discussion distinguishes their formulation from prior binning-based estimation approaches, the contribution could still benefit from more explicit formal comparison (e.g., showing in what sense their linear representation differs from INAR-based or EM-based formulations beyond the absence of likelihood modeling).
2. Motivation benefits from more discussion. In much of the classical literature on the broader causal discovery problem, the structure among latent confounders is not discussed, as they are often treated as root nodes affecting the observables. The work is indeed interesting in a theoretical sense, yet I would like to question the motivation for latent structure discovery: since these variables are not observed, it is hard to imagine interventions on them, so why is it of interest to practitioners to identify the structure of the latent variables?
3. Assumptions and their implications. The identifiability results depend on assumptions about the Hawkes process and the structure of the latent confounders. In which practical scenarios are these assumptions justifiable? Conversely, in misspecified cases, how badly does latent structure discovery degrade? A short illustrative example, even synthetic, where the assumptions are valid, and one where they fail, would greatly benefit the reader.
4. Missing acknowledgement/comparison with recent work on causal discovery in Hawkes processes via compression schemes, e.g., [1,2]
[1] Hlaváčková-Schindler, K., Melnykova, A., & Tubikanec, I. (2024). “Granger causal inference in multivariate Hawkes processes by minimum message length.” JMLR 25(133): 1–26
[2] Jalaldoust, A., Hlaváčková-Schindler, K., & Plant, C. (2022). “Causal Discovery in Hawkes Processes by Minimum Description Length.” AAAI 36(6): 6978–87.
Please address the weaknesses mentioned above. |
Fully human-written |
|
Causal Structure Learning in Hawkes Processes with Complex Latent Confounder Networks |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies the problem of causal discovery in multivariate Hawkes processes (MHPs) with latent confounders. The idea is to represent an MHP with a specific form of excitation function as a linear autoregressive model over discretised variables. The authors then introduce a set of conditions under which the causal structure is identifiable using rank tests on covariance matrices of the observed discretised variables.
The paper addresses an important and relevant problem, i.e., causal discovery in multivariate Hawkes processes (MHPs) with hidden confounders.
The theoretical contributions provide valuable insights into causal discovery without assuming causal sufficiency in MHPs and represent an important step toward advancing research in this area.
The main result builds on representing an MHP as a linear autoregressive model through discretization. However, according to Theorem 4.1, this result holds only when the discretization parameter $\Delta$ tends to zero. In practice, for small but finite $\Delta$, this leads to model mismatch, which can also be observed in the sensitivity analysis with respect to $\Delta$ in Table 1.
Moreover, all identifiability results are derived under the assumption that the linear representation holds, i.e., as $\Delta \to 0$. However, no guidance is provided on how to choose $\Delta$ in practice to ensure consistent results.
The identifiability results further rely on an additional assumption that the excitation functions take the form $a_{i,j}w(s)$, for example, the exponential decay function $a_{i,j}\exp(-\beta s)$. While this is a common assumption in the MHP literature, it is often extended to cases where the decay rate $\beta$ is also an unknown, node-specific parameter, i.e., $a_{i,j}\exp(-\beta_i s)$. Although this may appear to be a minor modification, it is non-trivial to see how the results of this work extend to such more general excitation functions.
The proposed algorithm has exponential complexity, which limits its scalability. As discussed above, its performance is also sensitive to the choice of $\Delta$. Furthermore, the method relies on rank tests, which typically require large amounts of observational data. This raises a question regarding Figure 4: assuming the experimental setting is favorable to all baseline methods as well as the proposed approach (i.e., with no latent confounders), how would these methods perform with substantially fewer observations, e.g., significantly fewer than 30,000?
Please see above comments. |
Fully human-written |
|
Projected Coupled Diffusion for Test-Time Constrained Joint Generation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper studies the problem of combining pretrained diffusion models while enforcing task-specific constraints. The authors present a projected coupled diffusion framework for constrained joint generation. The framework couples two generative dynamics through guidance terms and uses projection to impose hard constraints. Several experiments demonstrate its effectiveness.
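As a schematic of the framework the summary describes (entirely our reconstruction; the function names, the step form, and the scalar weight `lam` are placeholders, not the authors' algorithm):

```python
import torch

def projected_coupled_step(x, y, denoise_x, denoise_y, coupled_cost, project, lam=1.0):
    # One reverse step of two coupled diffusion dynamics.
    x = x.detach().requires_grad_(True)
    y = y.detach().requires_grad_(True)
    gx, gy = torch.autograd.grad(coupled_cost(x, y), (x, y))  # coupling guidance
    with torch.no_grad():
        x_next = denoise_x(x) - lam * gx  # pretrained dynamics + guidance
        y_next = denoise_y(y) - lam * gy
    return project(x_next, y_next)        # projection onto the hard-constraint set
```

Here `project` is assumed to be a Euclidean projection onto the constraint set; as noted in the weaknesses below, computing it may itself be hard when the set is non-convex.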
- This paper studies the problem of generating samples from pretrained diffusion models while satisfying task-specific constraints. This problem is important because pretrained diffusion models are often available, but sample constraints are typically not enforced during training.
- The authors formulate a new problem of generating correlated samples under hard constraints.
- They propose a generation method based on coupled dynamics, combining a coupled cost with projection onto hard constraints. This approach generalizes projected diffusion to coupled dynamics.
- The authors further show that several existing methods can be viewed as special cases of the proposed coupled dynamics framework.
- Several experiments show benefits from both projected diffusion and cost guidance.
- The projected coupled dynamics are intuitively designed by combining cost guidance and projection. However, neither the conditions under which the method converges nor the limit it converges to have been studied.
- The effect of coupled costs has not been analyzed, and costs may not always be differentiable.
- The projection step is not discussed in detail. It can be intractable when the constraints are non-convex, and it is unclear how to perform the projection for latent diffusion models.
- All special cases correspond to degenerate forms of the projected coupled dynamics. New application scenarios for the projected coupled dynamics have not been explored.
- Experiments demonstrate the effectiveness of combining cost and projection, which is expected since it benefits from both projected diffusion and reward guidance. However, it remains unclear to what extent the experiments reveal the advantages of the coupled dynamics.
See comments in Weaknesses. |
Fully human-written |