ICLR 2026 - Reviews


Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 15899 (21%) 4.43 3.58 3687
Heavily AI-edited 3233 (4%) 4.22 3.59 2990
Moderately AI-edited 7082 (9%) 4.20 3.61 2722
Lightly AI-edited 16648 (22%) 4.15 3.68 2746
Fully human-written 32938 (43%) 4.13 3.62 2917
Total 75800 (100%) 4.21 3.62 3026
Title Ratings Review Text EditLens Prediction
Layer-Based 3D Gaussian Splatting for Sparse-View CT Reconstruction Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a hierarchical, layer-based 3D Gaussian Splatting (3DGS) framework for sparse-view CT. Instead of a one-shot dense initialization, the method adds new layers of smaller Gaussians in a coarse-to-fine manner, where placement is guided by a 3D volumetric error map reconstructed via back-projecting 2D residuals with CGLS. Positive-error regions trigger densification (adding Gaussians); negative-error regions trigger sparsification (fusing Gaussians). The system starts from a classical reconstruction (SART-TV) to derive a soft Otsu mask and to seed the first layer. Experiments on synthetic and real datasets show improved 3D PSNR/SSIM over traditional solvers and prior explicit/implicit baselines (notably R$^2$-Gaussian), especially at very sparse views (5–15). 1. The layerwise residual-fitting idea is well-motivated and implemented end-to-end in CT with explicit 3D error maps driving where capacity is allocated. 2. The experiments in Table 1 show improvements vs. strong baselines at 5–15 views on both real and synthetic sets (e.g., Real/10 views: PSNR 33.04 vs. 31.90 for R$^2$-Gaussian). Qualitative figures support crisper geometry with fewer view artifacts. 3. The paper varies the number of layers, sparsification radius/centers, masking, and layer-optimization strategies; the 20-layer configuration emerges as the best trade-off and uses fewer final Gaussians with competitive time. 1. R$^2$-Gaussian has been misspelled as R2-Gaussian in all places in this paper, which is very unprofessional. 2. The core idea of hierarchically allocating Gaussian capacity closely parallels existing hierarchical/level-of-detail 3DGS schemes and explicit voxel/atom allocation strategies. The paper does not clearly articulate a substantive technical advance beyond adapting these known capacity-allocation ideas to CT, nor does it provide head-to-head analyses that convincingly demarcate what is genuinely new. 3. The method critically depends on the CGLS-reconstructed 3D error map. The discussion admits noise/streaks in highly sparse regimes and uses mask+Gaussian blur to denoise, but quantitative sensitivity to solver iterations, regularization, and noise level is limited. 4. The paper regularizes R$^2$-Gaussian for sparse views (higher TV, minimum scale, densification threshold). While this avoids needle artifacts, it may underplay R$^2$-Gaussian’s potential at moderate views and shifts the comparison space. Please refer to the weaknesses part, especially weaknesses 2 & 3. I would like to read the rebuttal and improve my ratings if my concerns are adequately addressed. Fully AI-generated
Layer-Based 3D Gaussian Splatting for Sparse-View CT Reconstruction Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces a hierarchical layer-based 3D Gaussian Splatting (3DGS) computed tomography reconstruction framework. The reconstructed objects are iteratively refined by correcting the volumetric errors of previous layers. The core technical contribution is the 3D error-driven strategy to guide densification and sparsification. This strategy estimates a volumetric error map from back-projected 2D residuals, providing direct structural guidance for adding Gaussians in underrepresented regions and fusing them in over-represented regions. (i) The idea of hierarchical layer-based 3D Gaussian Splatting is novel, interesting, and insightful, since the basic shape of the scanned object can be easily reconstructed while the fine-grained details are hard to capture. Existing 3DGS-based methods usually neglect this, and this work fills that research gap. The designed technique, 3D error-guided importance sampling, is also very reasonable: it adds Gaussians in the under-represented regions estimated from the positive error maps and fuses them in the over-represented regions estimated from the negative error maps. (ii) The performance on 3D CT reconstruction is solid. This work is based on the NeurIPS 2024 work R2-Gaussian. By applying the new densification and sparsification strategy to the baseline method, the performance is improved significantly by large margins on both real and synthetic datasets, as shown in Table 1. These results suggest the effectiveness of the proposed method. The visual comparisons in the figures also show that the proposed method reconstructs clearer structural details. (iii) The overall writing is clear, especially the method part from line 187 to line 288. The presentation is also well polished. The workflow of the pipeline is clearly shown in Figure 2. I like the style, as almost all the technical contributions are reflected in the figure. (iv) The ablation study is pretty comprehensive. All the modifications are investigated, including the layered densification, layer sparsification, masking, layer selection, and so on. The results in Tables 2, 3, 4, 5, and 6 clearly demonstrate the effectiveness of the proposed technical modifications. (v) Code has been submitted. The reproducibility can be checked. (i) The motivation is not very clear. As described in Lines 038–042, why regular 3DGS provides only indirect and incomplete information about the true 3D structure is not well discussed. From my point of view, this paper mainly modifies the densification of the Gaussian point clouds and the initialization has not been improved. So why also mention the one-time initialization here? It is a little weird. (ii) The differences between the proposed densification and sparsification strategies and the regular ones should be highlighted and comprehensively compared. Currently the authors just plainly describe their methods. I suggest the authors draw some figures and mention the differences in the method section. Meanwhile, in the teaser figure, the authors only show the changes of the Gaussian point clouds of the proposed method. How about the regular one? There is no further comparison to show the advantages of the proposed method.
(iii) The main results are not very convincing. The authors claim their method beats the state-of-the-art methods, but they do not use the public benchmark X3D and do not compare with the best recent neural radiance field (NeRF) method, SAX-NeRF, which was published at CVPR 2024. Instead, they compare with an older method, NAF, which was accepted at MICCAI 2022. (iv) The main visual results also look odd because the color is somewhat red, which is significantly different from the visual results shown in previous works such as NAF, SAX-NeRF, X-Gaussian, R2-Gaussian, etc. There is no explanation for this. (i) Could you please explain why the TV loss in Eq. (6) is used? Did you do an ablation study of this loss function? Fully human-written
Layer-Based 3D Gaussian Splatting for Sparse-View CT Reconstruction Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a hierarchical approach to 3D Gaussian splatting for sparse-view CT reconstruction, first introducing large-scale Gaussians, and then refining in later steps. Refinement choices are based on reconstruction of the residual error. - The paper addresses an important challenge in CT, and the proposed method seems reasonable for the challenge. - Experimental results are included for both synthetic and real-world datasets. - The main contributions of the paper are not clearly described. Both Gaussian splatting and hierarchical reconstruction approaches are quite well established in the CT community (as the authors state), so it is important to clearly state how exactly the paper contributes beyond what is already known. - It is not clear to me why a negative reconstructed error indicates areas where overfitting or redundancy is present. For example, if a sample has a small hole somewhere, and the algorithm represents that part with a single large Gaussian, a negative error would occur in the error map, but the model is actually underfitting. The specific choice of interpreting negative as overfitting and positive as underfitting should be clearly motivated in the paper, and ideally evidence should be given that this is indeed a valid choice. - It is not clear how specific hyperparameter settings are chosen by the authors, and how results are affected by different choices for the hyperparameters. This is also true for the comparison methods -- how are the hyperparameters chosen for these? - The terminology used, especially the use of 'layer', is unclear. 'Layer' has a very specific meaning in deep learning, so using it for a completely different concept is confusing. I suggest using a different term. - What are the main contributions of the paper? - Why does a negative error indicate overfitting? - How were hyperparameters chosen, and how do hyperparameters affect results? Fully human-written
Layer-Based 3D Gaussian Splatting for Sparse-View CT Reconstruction Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This work introduces a hierarchical, layer-based framework for 3D Gaussian Splatting (3DGS) tailored for sparse-view CT reconstruction. The core contribution is a 3D error-guided refinement strategy, where 2D projection residuals are back-projected using a tomographic solver to create a 3D volumetric error map. This 3D error map directly guides a coarse-to-fine process, acting as an adaptive importance sampling mechanism for both adding new, smaller Gaussians (densification) in under-represented regions (positive error). The error map also guides the merging of existing Gaussians (sparsification) in over-represented regions (negative error), effectively regularizing the model against overfitting. 1. The method directly addresses a key failure mode of standard 3DGS in sparse-view settings: overfitting to 2D projections. Guiding densification and sparsification with a 3D error map (from back-projected 2D residuals) is a novel and more geometrically sound approach than relying on 2D gradient-based density control alone. 2. The proposed layered, coarse-to-fine refinement strategy is well-motivated and empirically effective. Ablation studies (Table 3) clearly show that this layered approach outperforms a single-stage, dense initialization, while often converging to a more compact model (fewer Gaussians) and reducing training time. 3. The paper demonstrates state-of-the-art results, consistently outperforming strong baselines (including classical methods, implicit fields like NAF, and 3DGS methods like R2-Gaussian) on both synthetic and real-world datasets, especially in highly sparse (5-15 view) scenarios. 1. The quality of the 3D error map, which is central to the method, is dependent on the CGLS solver and the quality of the 2D residuals. As acknowledged by the authors, this map can become noisy in extremely sparse settings, potentially leading to error amplification where noise is densified. While denoising is applied, the robustness of this feedback loop could be analyzed further. 2. The normalization term for initializing new Gaussian density, $\alpha_{i}^{(l)}=C_{\alpha}\frac{e_{i}^{(l)}}{\sqrt[3]{N^{(l-1)}}}$ (Equation 4), is heuristic. It is "motivated by the physical process" but relies on a "quasi-uniform distribution" assumption. A more principled derivation or analysis of this scaling factor would strengthen the method's technical foundation. 3. The layer-building process seems to be on a fixed schedule (2500 Gaussians every 500 iterations for 20 layers). An adaptive strategy, where the number of new Gaussians and the timing of new layers are determined by the 3D error map's magnitude or distribution, would be a more elegant and efficient extension (as noted in the discussion). Please refer to the weakness part. Fully AI-generated
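The error-guided capacity-allocation rule described in the reviews above (positive error adds Gaussians via importance sampling, negative error marks Gaussians for fusion, with the Eq. (4) normalization quoted in the last review) can be sketched roughly as follows. This is a schematic illustration only, assuming the behavior the reviews describe; all function and variable names are hypothetical and not taken from the paper's code.

```python
import numpy as np

def allocate_layer(error_volume, centers, C_alpha=1.0, n_new=2500, rng=None):
    """Schematic sketch of the error-guided allocation rule described in the reviews.

    error_volume : 3D array, the CGLS back-projection of the 2D residuals.
    centers      : (N, 3) array of existing Gaussian centers in voxel coordinates.
    Positive voxels drive densification (importance sampling of new centers);
    negative voxels flag existing Gaussians as candidates for fusion.
    All names here are hypothetical; this is not the authors' implementation.
    """
    rng = np.random.default_rng() if rng is None else rng
    pos = np.clip(error_volume, 0.0, None)
    neg = np.clip(-error_volume, 0.0, None)

    # Importance-sample new Gaussian centers proportionally to the positive error.
    probs = pos.ravel() + 1e-12
    probs /= probs.sum()
    idx = rng.choice(pos.size, size=n_new, p=probs)
    new_centers = np.column_stack(np.unravel_index(idx, pos.shape)).astype(float)

    # Eq. (4) as quoted in the review: alpha_i = C_alpha * e_i / cbrt(N_prev).
    new_alphas = C_alpha * pos.ravel()[idx] / np.cbrt(max(len(centers), 1))

    # Existing Gaussians sitting in negative-error regions are flagged for fusion.
    cidx = np.clip(np.round(centers).astype(int), 0, np.array(error_volume.shape) - 1)
    fuse_mask = neg[cidx[:, 0], cidx[:, 1], cidx[:, 2]] > 0.0

    return new_centers, new_alphas, fuse_mask
```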
On the Existence of Universal Simulators of Attention Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper reports a RASP construction for simulating one attention layer on an input string. The main result is that we can construct a RASP program U that, on input T, X, simulates T(X). Here, T is a single multi-head attention layer. Since RASP is a model of transformer computation, this result serves as a kind of "universal simulator result": a transformer U that can simulate other transformers, similar in spirit to a universal Turing machine. However, the theorem statements and proofs of this claim are not entirely rigorous. 1. The goal of the paper (constructing a transformer that simulates transformers) is quite cool and would make an interesting addition to the analysis of transformer expressivity. 2. The high-level argument looks reasonable, though the detailed formal analysis should be made more rigorous. ### Rigor Issues with Theoretical Results My major concern with this paper is that there are many rigor issues in the theorem statements and proofs. As a first minor point, Lemma 2 onwards make statements about a "transformer", but really they are about the existence of RASP programs, which correspond to a specific transformer model. More importantly, in Proposition 1 and Lemma 2 onwards, it's unclear how the input and output are represented in these constructions, as transformer inputs, outputs, and representations are sequences. Clarifying the details of the representations is important since you will be composing these operations in Theorem 5 to simulate U, and it's important that they fit together properly. The paper should have more rigorous lemma statements that clarify how all these operations can be appropriately composed within a transformer. Algorithms 1-3 are pseudocode, which is very far from both transformers and RASP. E.g., in Algorithm 1, line 3, it doesn't say how you construct an attention head that maps from [i, j] to [j, i]. Clarifying the representational details mentioned above and adjusting the algorithms to refer to them would make the algorithms more rigorous. Even better, you should actually write Algorithms 1-3 in RASP so that it's clear what version of RASP (i.e., what primitives) is required. Theorems 5 and 8 are confusingly worded: there exists a transformer U that, on input X, simulates any one-layer transformer T. This is vacuously true: just set T = U. You actually mean U takes input <T, X> and simulates T(X). But, related to the issues above, it's not made clear how T should be encoded in the input to U. What does it mean to implement the operation softmax in Lemma 3? The output of softmax is irrational, so you are approximating it. But it's not clear in what sense the approximation holds. The results about MaxMin and Lipschitz continuous functions are hard to verify. The authors should clarify how the input and output are represented (are they real numbers?), make rigorous the notion of approximation that can be achieved, and analyze the size of the transformer that is needed (e.g., as a function of the Lipschitz constant).
In addition to clarifying these details, it would be helpful to discuss how the universal approximation result sits with the known result that any poly-size transformer encoder with poly-precision can only compute functions (indicator functions for formal languages) in TC0. ### Clarify the Version of RASP Used The original RASP is not a well-defined computational model. It's important that RASP constructions don't hide arbitrary computation in their elementwise operations, which is why later works have taken pains to define different fully defined RASP variants like B-RASP, C-RASP, etc. It's not very clear what model of RASP is assumed for the results in this paper, but clarifying that it's some simple flavor of RASP would strengthen the appeal of the results. ### Claim of Match2 Novelty is Overstated > As a consequence, we can construct an AHAT for problems such as Match2, known until now to be only learnable using single-layer single-head SMATs. Without going through your construction, it's fairly straightforward to implement match2 (e.g., mod 10) with AHAT: each x_i attends to its left to see if it can find x_j = k, where k is the unique value -x_i mod 10. This is a direct one-layer AHAT construction that doesn't require going through your simulation. More generally, AHAT constructions are typically easier than SMAT constructions, so I'm not sure if there are clear examples where simulating SMAT with AHAT will let us solve something we couldn't otherwise solve. > has largely been data-driven, offering only probabilistic rather than deterministic guarantees. Data-driven suggests no guarantees at all? > These models have demonstrated the ability to learn from tasks and function as simulators of a broad range of computational architectures. What exactly do you mean by simulators here? Unclear what value the discussion of parity in the introduction provides. First, you should point out that the different results about parity are due to different models of transformers (hard vs. soft attention). Second, you should clarify why you're bringing up parity, match2, etc., and how they relate to your research question and results. Split the discussion of RASP in Section 3 into its own subsection, since this is a core part of the paper. Fully human-written
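The one-layer hard-attention construction sketched in the Match2 comment above can be checked with a few lines of code. This is only an emulation of the reviewer's attention predicate (attend left for the complementary residue), not an actual transformer; the function name and the choice of modulus are illustrative.

```python
def match2_ahat_sketch(tokens, modulus=10):
    """Sketch of the reviewer's one-layer hard-attention construction for Match2.

    Position i "attends" to earlier positions j and fires iff some x_j equals
    the unique residue k = (-x_i) mod modulus, i.e. (x_i + x_j) % modulus == 0.
    This emulates the attention predicate directly; it is an illustration of
    the argument, not a transformer implementation.
    """
    out = []
    for i, x_i in enumerate(tokens):
        k = (-x_i) % modulus                      # the unique complementary residue
        out.append(any(x_j % modulus == k for x_j in tokens[:i]))
    return out

# Example: positions whose value completes a pair summing to 0 mod 10.
print(match2_ahat_sketch([3, 7, 5, 5, 1]))  # [False, True, False, True, False]
```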
On the Existence of Universal Simulators of Attention Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper’s writing requires significant improvement. I found it difficult to even identify the stated contributions. To my understanding, the paper demonstrates that, given access to RASP commands, specifically any point-wise function, aggregation, and selection, one can reconstruct any attention layer, provided the model’s input and the weight matrices of the targeted layer are given in-context. The authors further claim that since each of these RASP commands can be translated into transformer layers, a transformer can therefore emulate any single attention layer. I think the scope of this work is meaningful and interesting: finding the minimal set of operations needed to reconstruct attention. This work takes a step in this direction by showing that RASP, a programming language designed such that each of its operations can be translated to transformer layers, is actually sufficient to reconstruct one layer of encoder-based attention. Furthermore, the authors show how RASP can also reconstruct other types of attention mechanisms like Linformer and Linear Attention. 1. The main contribution of the paper is not clear. According to the statements made in the paper, it seems that the main contribution is the construction of the universal simulator $U$, composed of transformer encoder layers, that is able to simulate any attention layer in-context. However, this is straightforward using the results of [1], which show that any matrix multiplication, transposition, and of course softmax are feasible with encoder-based architectures. While the authors claim that they improve upon [1], it is unclear what this improvement is. Note that in lines 065-066, [1] requires a $d \times d^2$ input, which I believe is consistent with the requirements in this work. I suggest the authors make a table that clearly states the number of layers, width, etc. that their final model (or each individual model) has and how this improves upon [1]. Thus, if the authors consider the main contribution of the paper to be the in-context simulation of attention with an encoder-based model, I think the novelty here is limited. As mentioned before, I find the emulation of attention using only RASP more interesting. However, I think the paper would be improved by adding an explicit justification of the choice of RASP and of **how many** layers and how much width the construction requires, meaning how many commands the RASP program has when the input matrix is $d\times n$. The reported Table 1 does not clearly state what the cost is. 2. The flow of the paper should be significantly improved. I would suggest the authors clearly state what the contributions of the paper are and the main points they want to make. In its current form this paper is confusing. For example, even in the abstract the sentence "can transformer architectures exactly simulate an arbitrary attention mechanism, or in particular, the underlying operations?" reads as trivial. A transformer can simulate attention, since it contains an attention module within its architecture. One way to improve the phrasing would be to ask whether it can simulate attention in-context. The paper makes the claim of exact construction throughout.
I think this claim is incorrect. As an example, notice that in Listing 2 the operation 2.73^tokens_float is used. Even though RASP allows any point-wise operation, what is actually meant is any operation that the non-linear layers can **approximate**. Notice that this means that exponentials would introduce some error when translated to transformer layers. This has also been analyzed in [1]. 3. Related to the previous point, I think the authors should actually construct the transformer layers using RASP to show that the simulation is exact, in case this is possible. Then show that the output of the attention layer is indeed identical to that of the constructed model. [1]: Giannou, A., Rajput, S., Sohn, J., Lee, K., Lee, J.D. & Papailiopoulos, D. (2023). Looped Transformers as Programmable Computers. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:11398-11442. Available from https://proceedings.mlr.press/v202/giannou23a.html. 1. Could the authors clearly state the contributions of the paper compared to previous work? 2. Could the authors comment on the exact construction claims made throughout the paper? Not only the example mentioned above. 3. I would suggest also improving the flow of the text and creating a small experiment (as described above) to support their claims. Fully human-written
On the Existence of Universal Simulators of Attention Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents a data-agnostic construction of a universal simulator that replicates the behavior of single-layer transformer encoders. The authors provide a mathematical analysis of this claim. This paper focuses on an interesting direction of understanding the learnability and expressivity of transformers. * The theoretical results are hard to parse. It would be better if the authors provided a graphical illustration of what they proved and why it makes sense. * No empirical results supporting the theory. None Fully human-written
On the Existence of Universal Simulators of Attention Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper under review attempts to show that there exists a ``universal transformer network’’ for transformer encoders. What the result seems to say is that there exists a transformer that, given a sequence of vectors x_1, …, x_n \in \mathbb{R}^d, and matrices K, Q, V, performs the attention layer on input (x_1, .., x_n) with K, Q, V as key, query, and value matrices. The construction uses a programming language RASP that is known (?) to be simulatable by an average hard-attention transformer. As a consequence the paper claims to solve an open problem by showing that softmax transformers can be simulated by average-hard transformers. The paper attempts to resolve an open question of whether softmax transformers can be simulated by AHAT. I believe that the main results are neither stated nor proved with the necessary rigor. In Proposition 1, what does ``represented’’ mean? In Lemmas 2, 3, 4, how are the input matrices given to the transformer and how are they split into tokens? What is the input length? What are the quantifiers over r, k? In Theorem 5, what does it formally mean to simulate? Are the Key, Query and Value matrices of T part of the input to U? The proofs of Lemmas 2, 3, 4 are given in the form of RASP code without any comments. Pseudocode can be used to illustrate a proof, not to replace it. The RASP language has not been defined in the paper, which makes it inconvenient to verify the proofs. Is there actually any result in the literature that RASP can be simulated by AHAT? It is introduced in Weiss, Goldberg and Yahav, but I have not found any formal results. As I understand, Yang and Chiang (Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers) show this result for a modification of RASP called C-RASP, and for softmax transformers. Finally, the paper is written in confusing language. Some examples: ``By dimension of a multi-dimensional array M, we signify the number of axes referred in M’’ – signify <- mean? By ``the dimension’’? I guess, if you are trying to define the number of dimensions of an array, you cannot use the word ``axes’’, which is essentially the same thing. ``the induced (n−1)-dimensional array hosted from the index of the introductory axis in M’’ – I don’t understand this phrase. Line 173: ``where |*| implies’’ – denotes? Line 181: ``formalizes the same’’ – ``formalizes the above’’, etc. No questions. Fully human-written
MotifScreen: Generalizing Virtual Screening through Learning Protein-Ligand Interaction Principles Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper presents MotifScreen, a multi-task deep learning framework for structure-based virtual screening that aims to improve generalization by modeling protein–ligand interaction principles. It integrates motif prediction, structure prediction, and affinity scoring modules trained jointly. The authors also introduce a new benchmark, ChEMBL-LR, designed to reduce ligand bias and target leakage compared to datasets such as DUD-E and DEKOIS 2.0. Experiments show that MotifScreen achieves 0.68 AUROC on ChEMBL-LR and exhibits improved robustness and smaller performance degradation across benchmarks. Ablation studies indicate that combining motif and structure learning contributes to generalization. The paper addresses an important problem concerning data leakage and bias in existing datasets. The proposed multi-task learning framework, which incorporates various forms of external structural knowledge, represents an effective approach to better utilizing available data. First, regarding bias, it is important to reconsider how it should be viewed. Similar binding pockets tend to bind similar molecules, and similar molecules tend to interact with similar pockets — this assumption underlies all machine learning–based models in this field. Based on this, the actives in any test set will naturally have higher similarity to the reference ligands. Therefore, this so-called ligand bias reflects an inherent relationship within the data itself, and its presence has a certain scientific justification. Second, in terms of model design, the proposed architecture lacks novelty. Components such as the SE(3) Transformer, EGNN, and grid-based representations have already been widely used in related molecular representation and virtual screening models. Third, regarding the dataset, the use of random decoys makes the task overly easy. The real challenge lies in distinguishing structurally or physicochemically similar compounds; performance on average or dissimilar decoys is less meaningful. BEDROC should be reported to show the top-ranking performance in Table 1. Finally, as a new benchmark, ChEMBL-LR should include a comprehensive evaluation across a wide range of models to ensure fairness and demonstrate general applicability. 1. The formula for AVE in lines 125–126 appears incorrect — it currently shows (IT IV − IT IV), which seems to be a typographical error. 2. The drop results in Table 2 are not directly comparable. First, the absolute value of EF1% is influenced by the ratio of actives to decoys in each dataset. Moreover, different datasets (such as DUD-E and DEKOIS 2.0) were used to calculate the drop for different models, making the comparisons inconsistent. In addition, the paper does not report results on more recent virtual screening models but only docking-based methods. 3. The key challenge of this task lies in the enrichment of active compounds at the top of the ranking list. Therefore, metrics such as BEDROC, which assign higher weights to top-ranked molecules, are more appropriate than AUROC. Similarly, Table 1 should also report BEDROC and EF values, rather than only AUC. Lightly AI-edited
MotifScreen: Generalizing Virtual Screening through Learning Protein-Ligand Interaction Principles Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses the issue of inflated performance reporting in deep learning-based structure-based virtual screening (SBVS), which the authors attribute to systemic biases and data leakage in commonly used benchmarks. The authors make a two-fold contribution: first, they introduce ChEMBL-LR, a new leakage-resistant benchmark designed to provide a more realistic evaluation of model generalization. Second, they propose MotifScreen, a novel end-to-end SBVS model. MotifScreen uses a principle-guided, multi-task learning framework that reasons about protein-ligand interactions by predicting binding pocket motifs, ligand-pocket compatibility, and final binding probability. 1. The paper's most significant contribution is its critical analysis of the systemic flaws in existing SBVS benchmarks. The development of the ChEMBL-LR dataset, which explicitly controls for target leakage and ligand bias, is a valuable service to the community and helps establish a more rigorous standard for future research. 2. The work is well-motivated, and the paper is clearly written and structured. The analysis in Section 4.1, which uses a Random Forest model to quantify the extent of leakage in benchmarks like DUD-E and LIT-PCBA, provides strong evidence for the authors' claims. 3. The multi-task learning architecture of MotifScreen is conceptually sound. Forcing the model to learn intermediate, physically-grounded tasks like motif identification and key atom positioning is a promising strategy to improve generalization and move beyond simple classification shortcuts. 1. Important baselines are missing: DrugCLIP [1] and EquiScore [2]. 2. Deep-learning methods tend to overfit on the benchmark. However, AutoDock-Vina just adopts a simple linear scoring function. In Table 2 and Table D5, a significant decrease for AutoDock-Vina is also observed. More analysis of this should be performed. 3. Training data is important for deep-learning methods. MotifScreen employs a different and larger training set than the previous baselines. More analysis and an ablation study of the effect of training data are important for evaluating this paper. [1] DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening, NeurIPS, 2023 [2] Generic protein–ligand interaction scoring by integrating physical prior knowledge and data augmentation modelling, Nature Machine Intelligence, 2024 1. The citation of MotifGen in line 205 is wrong. Fully human-written
MotifScreen: Generalizing Virtual Screening through Learning Protein-Ligand Interaction Principles Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper presents MotifScreen, a new model for protein-ligand binding affinity prediction, along with a new virtual screening benchmark named ChEMBL-RL. The authors claim their method and benchmark are more robust to data leakage; however, significant concerns remain regarding the fairness of the comparative study against baseline models. Furthermore, the comparison with the previous LIT-PCBA benchmark requires more thorough discussion. --- The usage of LLM: I wrote the entire review myself and only used the LLM to correct the grammar and improve readability. - The motivation for preparing the new benchmark is clear. - The proposed benchmark, ChEMBL-RL, exhibits less bias compared to the decoy-based test sets DUD-E and DEKOIS 2.0. ## ChEMBL-RL **1. Lack of a robust strategy to avoid false negatives.** The authors construct their decoy set by sampling actives from other targets in ChEMBL. However, I question why the authors only use Tanimoto similarity to filter these decoys, without employing other computational tools (e.g., docking, cofolding tools) to prevent false negatives. For DEKOIS 2.0 or DUD-E, decoys are drawn from large ligand libraries like ZINC, which contain many inactive molecules, posing a lower risk of including false negatives. In contrast, ChEMBL is a library of bioactive compounds, and it is one of the most popular libraries for identifying initial hits. It is plausible that compounds from this library could exhibit activity against a target, even if they are structurally dissimilar to known actives. **2. Is the constructed decoy set truly better than LIT-PCBA's inactive set?** In Table 1, the authors claim that ChEMBL-RL achieves better bias control than LIT-PCBA due to a lack of protein-side data leakage. However, this is not a fair comparison, given that LIT-PCBA uses **experimentally validated inactives**, while ChEMBL-RL uses **putative inactives** (i.e., cross-decoys). Removing bias from a limited set of _experimental_ data is arguably more difficult than drawing a decoy set from a large pool of _assumed_ inactives while minimizing bias. Moreover, the AVE values in the original LIT-PCBA paper appear lower than the values reported here. The authors should justify why they report the AVE for only 4 targets from LIT-PCBA. **3. Flawed EF1% comparison.** The EF1% values in Table 2 cannot be directly compared across different benchmarks. This metric is highly dependent on the ratio of actives to decoys (i.e., the size of the decoy set), which differs between benchmarks. **4. Unfair comparative study.** MotifScreen is trained on three datasets (PDBbind, BioLip, and ChEMBL), creating a training set that is significantly larger (reportedly ~6x) than those used for the baseline models (PDBbind). For a fair comparison, the authors should either report MotifScreen's performance when trained only on PDBbind or retrain the baseline models using the same extended training set. For models like KarmaDock, which has separate structure and affinity modules, it seems feasible to retrain its affinity module using the binding affinity-only data in ChEMBL.
**5. Poor performance on DUD-E.** In Table D5, MotifScreen's EF1% on the DUD-E benchmark is only 5.94, which is substantially lower than the baseline models (9-16) and other state-of-the-art methods (e.g., GLIDE [1], RTMScore [2], GenScore [3], PIGNet2 [4]) that report EF1% > 20 (see the values in the GenScore paper). Given that MotifScreen uses an extended training dataset, this performance is insufficient to support the claim of robustness. The authors state they filtered the training set to avoid leakage; they should also report performance _without_ this filtering to clarify if this is the cause. While target leakage is a valid concern, many drug development campaigns focus on known targets. Therefore, evaluating performance on targets similar to the training set is still a necessary and practical assessment. **6. Missing evaluation on DEKOIS 2.0.** In Table D5, results for MotifScreen on the DEKOIS 2.0 benchmark are absent. The authors should evaluate the model on DEKOIS 2.0. --- **Reference:** 1. Halgren, Thomas A., et al. "Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening." Journal of Medicinal Chemistry 47.7 (2004): 1750-1759. 2. Shen, Chao, et al. "Boosting protein–ligand binding pose prediction and virtual screening based on residue–atom distance likelihood potential and graph transformer." Journal of Medicinal Chemistry 65.15 (2022): 10691-10706. 3. Shen, Chao, et al. "A generalized protein–ligand scoring framework with balanced scoring, docking, ranking and screening powers." Chemical Science 14.30 (2023): 8129-8146. 4. Moon, Seokhyun, et al. "PIGNet2: a versatile deep learning-based protein–ligand interaction prediction model for binding affinity scoring and virtual screening." Digital Discovery 3.2 (2024): 287-299. - Please report the number of similar complexes (by protein sequence and ligand similarity) in PDBbind for each benchmark set. Also, report the number of data points excluded from the training set for the DUD-E evaluation (Line 381). - It is well known that AUROC is not an ideal metric for evaluating virtual screening performance. Please report **BEDROC** in all main benchmark tables (e.g., Table 2). - The notation **EF1** is incorrect. This metric is typically denoted as EF1\% or EF_{1\%}. Please correct this throughout the manuscript. Consequently, the (\%) in the header of Table 2 ('EF1 (%)') should be removed. - Please report the specific PDB IDs used, the number of actives, and the number of decoys for each target in the ChEMBL-RL benchmark in the Appendix. - Is the maximum EF1\% value 31 in ChEMBL-RL? (30 decoys per active) Fully human-written
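For reference, the EF1% comparability concern (weakness 3 above) and the question about the maximum EF1% value both follow from the standard enrichment-factor definition; the notation below is generic and not taken from the paper:

$$
\mathrm{EF}_{x\%} \;=\; \frac{n_{\mathrm{act}}^{\mathrm{top}\,x\%}/N^{\mathrm{top}\,x\%}}{n_{\mathrm{act}}/N}
\;\le\; \min\!\left(\frac{100}{x},\; \frac{N}{n_{\mathrm{act}}}\right),
$$

where $N$ is the library size and $n_{\mathrm{act}}$ the number of actives. With 30 decoys per active the active fraction is $1/31$, so a perfect ranking caps EF1% at about 31, and raw EF1% values are therefore not comparable across benchmarks with different active-to-decoy ratios.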
MotifScreen: Generalizing Virtual Screening through Learning Protein-Ligand Interaction Principles Soundness: 3: good Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper argues that widely used SBVS benchmarks suffer from target leakage and ligand bias, which inflate reported DL performance. It proposes ChEMBL-LR, a leakage-resistant benchmark (60 targets; near-zero mean AVE bias 0.033) and introduces MotifScreen, a multi-task, structure-based screening model with three heads: (1) pocket motif prediction, (2) fragment/key-atom structure compatibility, and (3) binding score prediction. The method is trained on PDBbind+BioLip+ChEMBL with strictly removing leakage and reports results on ChEMBL-LR and DUD-E. - **Clear diagnosis of benchmark pitfalls** with concrete analyses of target leakage and ligand-only shortcuts (AVE). The paper emphasizes high overlap between common benchmarks and PDBbind and quantifies ligand bias. - **New benchmark (ChEMBL-LR)** with principled curation: strict target-wise separation, cross-decoys, removal of non-drug-like molecules, and near-zero mean AVE bias (0.033). - **Principle-guided multi-task design** (motif, structure/key-atom, affinity) that attempts to force learning of interaction physics rather than shortcut signals. - **Efficiency**: forward pass timing ($\sim$ 0.03 s/compound) suggests practical scalability for large libraries. - **Mismatched training regimes undermine the comparison.** MotifScreen is evaluated on DUD-E after removing all training entries similar to its targets (from PDBbind/BioLip/ChEMBL), while most baselines appear to use their original training with likely target overlap. This setup likely depresses MotifScreen's DUD-E EF1% (5.94) and weakens any one-to-one comparisons between the models. A fair test would retrain baselines under the proposed training dataset or evaluate all methods on a single leakage-controlled split. - **Use of $\Delta$ (performance drop) without harmonizing metric ranges or training regimes.** The manuscript subtracts EF1%/AUROC values between external benchmarks and ChEMBL-LR to argue smaller degradation for MotifScreen. However, EF1% ranges and training conditions differ across benchmarks and methods, making raw subtraction potentially misleading. A consolidated table exists (Table D5), but $\Delta$ remains hard to interpret as "generalization" without a common training/eval protocol. - **Early-enrichment evidence is mixed versus the strongest baseline.** EF1% gains over SurfDock are not statistically significant (p = 0.161), which matters because early enrichment drives SBVS utility. The manuscript foregrounds AUROC, potentially obscuring this point. - **Ablation study's design choices look ad-hoc.** Ablations use a reduced dataset and report epoch 31 snapshots. Figure D2 indicates similar validation set's AUROC trajectories between "aff+motif", "aff+motif+str" configurations. Without multi-seed runs or later-epoch checks, conclusions about hierarchical synergy risk over-interpretation. - **ChEMBL-LR vs. LIT-PCBA: incremental benefit unclear.** The paper argues LIT-PCBA's low AVE is limited (reported on only 4 targets) and leakage under external training data source (e.g., PDBbind). 
However, LIT-PCBA already uses experimentally measured inactives and, by the authors' own RF results, remains difficult across all 15 targets (including the 11 with potential leakage), yielding low AUROC even when leakage could help. This can be treated as evidence that LIT-PCBA already probes generalization. Absent a uniform, leakage-controlled training/evaluation of all methods, it is unclear what ChEMBL-LR contributes beyond more targets with target-wise separation, rather than fundamentally stronger bias control. 1. **Normalization of $\Delta$ metrics.** Since EF1% ranges and dataset compositions differ across benchmarks, how do the authors justify interpreting raw $\Delta$EF1% as generalization? Would the authors consider relative EF1% retention or standardized effect sizes with a common training corpus (Table D5 suggests this is possible)? 2. **Ablations: epoch choice and variance.** Why did the authors choose epoch 31 for reporting? How many seeds were run for Table 3/Figure D2? Could the authors report later-epoch or full-training ablations (or confidence intervals) to substantiate the hierarchical-synergy claim? 3. **Cross-docking criterion.** Sequence identity >95% is a strong criterion, but it may be an indirect one for SBVS, where we generally know the binding pocket. Did the authors consider pocket-level similarity (e.g., local alignment, cavity overlap)? 4. **Inference score.** Please confirm that ŷ (the final scalar binding prediction) is the ranking score used in all screening experiments, and note where this is specified in the main text. 5. **EF1% and BEDROC reporting.** Since early enrichment is crucial in VS, why is BEDROC only in the appendix rather than in the main comparison tables alongside EF1%/AUROC? Could the authors include per-target EF1%/BEDROC distributions with CIs? Also, what $\alpha$ value was used in the BEDROC computation? ## Typos - AVE formula text (line 125) - "MotifGen (Anonymous, 2025)" placeholder (line 205) - Incomplete line in Table D1's paragraph (lines 1359-1360) Fully AI-generated
Diffusion Alignment as Variational Expectation-Maximization Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose DAV, an alignment algorithm for diffusion models via variational expectation maximization. In the E-step, DAV employs test-time search algorithms to generate samples from the reward-weighted posterior distribution. In the M-step, the diffusion model is fine-tuned using the samples from the E-step. The authors demonstrate the effectiveness of DAV on both the continuous text-to-image generation task and discrete DNA sequence design. - The idea to formulate diffusion alignment as a variational expectation-maximization problem is interesting. - The paper is well-written, and the theory and method are well-motivated. - Experiments showcase the effectiveness of DAV. - The search algorithms lead to computational overhead. Therefore, a fairer comparison with the baselines should also take the computational cost into account. For example, it would be helpful to compare model performance under the same computational budget and the performance scaling curve of TR2-D2 as the computation increases. - The value of 3-mer correlation for DNA sequence design is significantly lower than those reported by baselines; e.g., in the DRAKES paper the value is 0.887, much higher than DAV's (0.397), while in Table 2 and Figure 5 it is only 0.229. Also, the target and naturalness are two competing properties, and one can get a higher value of one property by sacrificing the other via hyperparam tuning or using different training epochs. Does DAV have Pareto-optimal performance compared to the baselines? - The E-step can lead to an inaccurate estimation of the posterior distribution, due to the limited sample size and the value estimation error in the test-time algorithms. How does this affect the M-step optimization, and is the model robust to a suboptimal posterior distribution (e.g., with fewer samples or inaccurate samples)? Please refer to the **weaknesses** section. Fully human-written
Diffusion Alignment as Variational Expectation-Maximization Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper develops an approach to “Diffusion Alignment as Variational Expectation-Maximization” (DAV) that alternates between two complementary phases: the E-step, which is essentially an exploration step that aims to discover diverse and high-reward samples from the variational posterior; and the M-step, “amortization”, which refines the diffusion model using samples identified in the E-step. The DAV approach is built on solid technical foundations as outlined in \S4. For instance, the E-step uses gradient-based guidance and importance sampling to enhance the exploration. The M-step minimizes a mode-covering objective that incentivizes the covering of all diverse modes generated in the E-step. The combined E-M steps iteratively refine the model towards a multi-modal aligned distribution, and this overcomes problems like over-optimization and mode collapse that often arise in RL. The M-step involves two variations: in addition to the standard DAV objective in (10), there is a variant, DAV-KL, in (11). From the experimental results in Table 1, it’s not clear which one to use and when, except perhaps by ad hoc trial and error. Can the author(s) shed some light on how to choose the value of the KL-coefficient \lambda in (11), which is meant to control “the trade-off between aligning with the expert policy and preserving the pretrained model”? In particular, is the value of \lambda robust or not with respect to downstream applications? Fully human-written
Diffusion Alignment as Variational Expectation-Maximization Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents a diffusion alignment method (DAV) that alternates between test-time searches as Expectation steps and online refinement of the diffusion model as Maximization steps. Specifically, the diffusion alignment problem is formulated as a soft RL objective, whose evidence lower bound is optimized via an EM algorithm. In the E-steps, DAV draws posterior samples from a reward-tilted distribution, while in the M-steps, DAV distills the sampled trajectories into the diffusion model. Experimental results show the effectiveness of DAV compared to existing RL and direct preference optimization methods for both continuous and discrete diffusion models. - The paper offers a fresh perspective by aligning diffusion models with the EM algorithm. I especially appreciate this idea because the multi-round iterative alignment could potentially help in settings where the reward is costly or intractable to evaluate—for example, when it requires human evaluation or expensive wet-lab experiments. - The proposed method is accompanied by rigorous derivations and theoretical guarantees. - Experiments only include one example for continuous diffusion and one for discrete diffusion. The case for generality would be stronger with additional tasks (e.g., compressibility or prompt alignment as in DDPO). Moreover, some recent methods, such as DSPO [1] and DanceGRPO [2], are not included or discussed. [1] Zhu et al. "DSPO: Direct Score Preference Optimization for Diffusion Model Alignment", ICLR 2025. [2] Xue et al. "DanceGRPO: Unleashing GRPO on Visual Generation", arXiv: 2505.07818. - For the discrete diffusion model alignment, I am curious how DAV compares to test-time sampling algorithms such as [3,4], which also consider the same DNA enhancer task, and also to alignment methods designed for discrete diffusion models (e.g., [5,6]). [3] Li et al. "Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding", arXiv: 2408.08252. [4] Chu et al. "Split Gibbs Discrete Diffusion Posterior Sampling", NeurIPS 2025. [5] Borso et al. "Preference-Based Alignment of Discrete Diffusion Models", ICLR 2025 Bi-Align Workshop. [6] Zhu et al. "LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models", arXiv: 2505.19223. Fully human-written
Diffusion Alignment as Variational Expectation-Maximization Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces DAV, a novel framework for fine-tuning pre-trained diffusion models. The authors motivate the work by claiming to address reward over-optimization. 1. This work formulates diffusion alignment as an iterative Variational Expectation-Maximization, which appears to be a new and interesting theoretical lens for diffusion fine-tuning. 2. The proposed DAV framework enjoys broad applicability. It can accommodate both continuous and discrete settings. 3. The presentation of this work is easy to follow. 4. The empirical results show DAV and DAV-KL enjoy superiority over multiple strong baselines, such as DDPO and DRaFT. 1. The main comparison in Figure 3, which plots aesthetic reward against diversity/naturalness, is confusing and potentially incomplete. The reported performance of the RL-based baselines might be better if they are properly trained with a "suitable" KL penalty (which might be non-trivial to choose). This raises questions about the optimal tuning of these baselines. 2. Furthermore, the analysis in Figure 3 omits purely inference-time methods, which are often competitive in image experiments. 3. As noted by the authors, this method has non-negligible computation costs. The E-step involves substantial "additional test-time computation" through gradient-guided search. In large-scale diffusion model finetuning, DDPO already takes much time to converge (compared to the fastest direct propagation). It is important to quantitatively present the added training-time overhead of DAV relative to DDPO. 1. For the results presented in Figure 3, do the images for each algorithm come from a single training run, or are they gathered over multiple runs? If from a single run, please report the standard deviation rather than only the mean. 2. For discrete finetuning, usually it's straightforward to test both DNA and RNA sequences. Can the authors explain why only the DNA enhancer is tested? 3. See the other questions above. Fully human-written
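The E-step/M-step structure described in these DAV reviews corresponds to a standard variational EM treatment of reward alignment; the notation below is generic, and the paper's exact objectives (Eqs. 10-11 referenced above) may differ in detail:

$$
p^{*}(x)\;\propto\;p_{\mathrm{pre}}(x)\,e^{r(x)/\beta},\qquad
\log Z \;\ge\; \mathbb{E}_{q}\!\left[\tfrac{1}{\beta}\,r(x)\right]\;-\;\mathrm{KL}\!\left(q \,\|\, p_{\mathrm{pre}}\right),
$$

with the E-step tightening the bound by (approximately) sampling $q \approx p^{*}$ via test-time search, and the M-step distilling $q$ into the diffusion model by minimizing the mode-covering $\mathrm{KL}(q \,\|\, p_{\theta})$; the DAV-KL variant additionally adds a $\lambda$-weighted KL term toward the pretrained model, in the direction specified by the paper's Eq. (11).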
A theory of parameter identifiability in data-constrained recurrent neural networks Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper seeks to characterize when the parameters of an RNN can be reliably recovered from neural data, with applications to data-constrained modeling of neural dynamics. To do so, they establish conditions under which parameter subspaces are constrained by data, and when they might not be, supported by theoretical results and numerics. In particular, they argue that parameter combinations aligned with top components of the Gram matrix of observed neural activity are identifiable, whereas those associated with near-zero eigenmodes are not. They then show that FORCE learning fails to constrain parameters to identifiable subspaces, and then argue that non-identifiable parameters are optimally recovered with perturbations aligned with modes of the Gram matrix associated with small eigenvalues. The writing is clear and organized. The numerics in this paper and appendix are quite extensive, touching on many of the more realistic conditions that neuroscientists may be concerned with (correlated noise, unobserved neurons, etc.). The observations made about FORCE necessarily learning non-identifiable parameters should be of interest to those working on or fitting data-constrained models. Many of the claims feel overstated. Overall, it feels there is a large gap between what is actually proven in the theorems/propositions, and what is then concluded about identifiability of models from data as a result. For example, Theorem 1, along with most of the following results, characterize parameter identifiability in terms of a projection matrix derived from the concatenated observed neural activity/external inputs, yet this framing strictly holds only for the noiseless dynamics setting, with matched student/teacher architecture and no unobserved influences. The fact that the identifiability criteria of Theorem 1 is identical to that of noiseless LTI systems/linear RNNs speaks to how restrictive of a setting this is. Throughout this paper, stated conditions relating to parameter identifiability seem to be primarily of the necessary kind, whereas the abstract/title/discussion would lead one to think the contents also substantially cover realistic sufficient conditions for identifiability. I acknowledge that extensions of theorems to more realistic settings are considered, but they still feel underexplored. For example, Proposition S1 regarding recovery under partial observations seems like merely a restatement of Theorem 1, with an addendum to restrict to the case where recovery of parameters associated with unobserved neurons can be safely ignored. I don't see how that result meaningfully characterizes how partial observations can corrupt identifiability. I am still leaning towards an accept due to the extensive empirics, which I think would be useful in itself to the neuroscience community, but feel the paper would be much stronger if the theory spoke to the empirics better. Finally, I find it strange that this paper develops a framework around parameter subspaces defined by the top eigenvectors of $X^\top X$, but makes no reference to PCA. 
Isn't this identical to (or at least closely related to) projecting parameters onto the top PCs of the activity+external inputs? Other comments: 1. A minor point: throughout, $P \in \mathbb{R}^{N_X \times N_X}$ is stated to project to the column space of $X \in \mathbb{R}^{TM \times N_X}$, but by shape, this must be referring to the row space projector (projects to the subspace of $\mathbb{R}^{N_X}$ spanned by the rows of $X$). 2. The dynamics noise scale used in the empirics pertaining to estimation with noise---the more relevant/realistic case---feels absurdly small. For example, in Fig. S3, $\epsilon_{in} \sim \mathcal{N}(0,10^{-6})$. Since this noise is precisely the noise that corrupts the input $\theta x(t)$ to the nonlinearity, which is the relevant part of the dynamics invoked for Theorem 1, this would seem to be the stress point that should be tested most thoroughly. 1. Can the learning of non-identifiable parameter combinations by certain algorithms like FORCE be rectified by simply post-hoc projecting learned parameters to the identifiable subspace, as estimated by the empirical Gram matrix? 2. Very minor: In Fig. 2, presumably, the top modes of the Gram matrix change/wiggle as longer neural trajectories are observed, beyond just overall increases in rank. Consequently, the parameter subspaces evaluated need not be exactly comparable across curves in A, B, and C, no? 3. How is accuracy of parameter recovery defined, e.g. in Fig. 2 and Fig. 3? I'm assuming this is something like 1 - (relative error in Frobenius norm), but this should be stated somewhere. 4. Regarding effects of partial observations on identifiability: the paragraph starting at line 1345 seems to state that recovery of parameters associated with top spectral components is poor for partially observed systems, even under very long recording sessions, yet the following paragraph seems to contradict this paragraph in spirit, and instead spins an optimistic tone. A very minor point, but I think this disconnect should be clarified, as I feel this discussion point will be of interest to the neuroscience community. Fully human-written
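To make the discussion of the Gram matrix $X^\top X$ and its identifiable/non-identifiable eigenmodes concrete, here is a minimal numpy sketch (all shapes and data are synthetic stand-ins, not the paper's actual setup) of splitting a parameter vector into the component constrained by the observed activity/inputs and the component that is invisible to the data; up to mean-centering, the retained eigenvectors are exactly the top principal components the review alludes to.

```python
import numpy as np

# Synthetic stand-in: rows of X are concatenated [activity, external input] vectors
# observed over T*M time steps; the data is deliberately rank-deficient.
rng = np.random.default_rng(0)
T_M, N_X, rank = 200, 50, 20
X = rng.standard_normal((T_M, rank)) @ rng.standard_normal((rank, N_X))

# Gram matrix of the observations and its eigendecomposition.
G = X.T @ X
evals, evecs = np.linalg.eigh(G)
keep = evals > 1e-8 * evals.max()       # modes with (numerically) non-zero eigenvalue
U = evecs[:, keep]

P = U @ U.T                             # projector onto the identifiable subspace
theta = rng.standard_normal(N_X)        # e.g. one row of the recurrent weight matrix
theta_id = P @ theta                    # component constrained by the data
theta_null = theta - theta_id           # component no refitting of the data can recover

print(np.linalg.norm(X @ theta_null))   # ~0: the null component never touches the data
```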
A theory of parameter identifiability in data-constrained recurrent neural networks Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This work studies the identifiability of parameters of RNNs from observations of their inputs and dynamics, with motivation from the use of these models in neuroscience. Their main theorem states that components of the parameters outside the subspace spanned by the observations - which is equivalent to the union of eigenspaces with non-zero eigenvalue of the empirical observability Gramian - are not identifiable. As the authors clearly explain, this paper addresses an important topic. The main theoretical results of the paper are intuitive, though perhaps unsurprising. The range of experiments presented is interesting and mostly convincing. The authors present their work in very sweeping terms, but as a reader I found that the results didn't live up to the expectations set by the title and abstract. First, I do not find the main result (Thm. 1) to be particularly surprising; see my comment below about the study of LTI systems in control theory. Second, I am confused by the fact that the authors defer to the appendix the extension of their main results to the partially-observed case. That seems central to the point they're trying to make, so though I guess it's a pretty deflationary result, it's odd not to mention it clearly in the main text. I'd suggest bringing this into the main text, and moderating the tone of the paper overall. I also have a series of more specific concerns, questions, and suggestions, which I list under **Questions**. - There's a long line of work in control theory on problems of identifiability in system identification for linear dynamical systems; see for instance recent works by [Simchowitz et al. (2018)](https://proceedings.mlr.press/v75/simchowitz18a.html) or [Geadah et al. (2024)](https://ieeexplore.ieee.org/abstract/document/10886179) and references they cite. It would be useful to make contact with this literature, as my impression is that some analog of their Thm. 1 is folklore there. Also, the discussion of Gram matrices coincides with the classical observability Gramian; it'd be useful to make this connection. - It'd be interesting to extend these results to nonlinearities that are not strictly monotone. For instance, the paper you cite by Biswas and Fitzgerald focuses on some of the degeneracies that arise from using a threshold-linear function. - Thm. 2 is (up to the error term) a consequence of the fact that the parameter updates induced by gradient descent lie within the span of the observed covariates. This seems well-known to me, so I think you might consider citing a reference, or at least commenting on the conceptual content. - Related to this aspect of gradient-descent-type algorithms, I don't think it's surprising that FORCE would retain initialization-dependence, and thus non-identifiable components. Is your aim here primarily to show that CORNN compares favorably to FORCE in this regard? - I think Cor. 2 could be made more mathematically precise; when is this quadratic approximation reliable? In the appendix you don't consider the remainder term. - The authors discuss regularization at length.
However, thinking naively about neuroscience experiments, it's unclear to me how one should choose a regularization parameter in a principled way, as the data is presumably almost always nonstationary. Can you elaborate on this? For instance, you show in Figure S10 an example where you can get good estimation by choosing a good $\ell_2$ penalty; how do you choose this? - The results on interventions are interesting, but could you at least speculate on the informativeness of experimentally-feasible interventions? I suppose those would be limited by partial observation. - I'm curious about the generality of the claim in Figure S5 - this correspondence between variance and task-relevance should depend on the nature of the solution found by the RNN, right? For example, some solutions based on feedforward amplification - like Karel Svoboda's group has recently suggested are at play in ALM, in [Daie et al. (2023)](https://www.biorxiv.org/content/10.1101/2023.08.04.552026v2.full) - would seem to violate this. - There is a long discussion in the appendix (starting around line 1360) about long-term recordings, but it's not clear to me whether that will necessarily resolve any of the challenges documented in the paper, because on those timescales there can clearly be substantial changes due to plasticity and other factors. I'm confused about why you'd cite Driscoll et al. (2017) here and not comment on the fact that that paper shows substantial drift in responses over time, which seems contrary to the idea that such long recordings would help constrain an RNN model with fixed weights. Fully human-written
A theory of parameter identifiability in data-constrained recurrent neural networks Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors consider the problem of fitting RNNs to data, something often done in neuroscience. They characterise identifiability in the cleanest case: within model class, one-step prediction on fully observed data, with monotonic nonlinearities, and noiseless; and find an intuitive result - the weights are only identifiable in the span of the `input datapoints' (meaning the span of the set of concatenations of the previous timestep's activity vector and the current timestep's input vector). They show some theoretical and empirical results regarding which estimators will and won't recover these identifiable parameters, and present empirical results on how FORCE learning does not set non-identifiable weights to zero. Finally, they show how to design interventions to enlarge the set of identifiable weights, and that, if the activity stays within the identifiable span, the dynamics will generalise. - The main theorem, Theorem 1, was intuitive and interesting - The writing was clear, barring some AI-like verbosity - The question was interesting, and well-framed - The comparison to another common training method, FORCE, was cool. - I liked the analysis of low-rank networks in S2. Overall, the framing of the paper got me very excited, but I found that technical concerns led me to see the contributions as smaller than I initially thought. I will list them here, and the authors can likely correct me on some of my mistaken understandings. - First, since the RNN is trained on one-step prediction, and assuming complete observability, the problem becomes a zero-layer feed-forward neural network, or a generalised linear model [linear regression followed by a nonlinear link function]. Then the result is simply: if the nonlinearity is monotonic, identifiability becomes the same as for linear regression, which is identifiable only on the span of the input data. (A) Emphasising this simplicity seems good? (B) Surely this is already well-known? Googling identifiability of generalised linear models provides many results. In this particular setting it may be new, but it seems very related to existing ideas? - Then I had some concerns with the unobserved data case. Firstly, unless I'm confused, it is wrongly signposted in the text (it's at the end of Appendix B.2, not Appendix C as advertised?). Then, for some reason, $\phi$ becomes arbitrary rather than monotonic? Finally, and more importantly, the result's framing seemed weird. It showed that, even if you know the parameters relating to the unobserved data, you keep the nonidentifiability of the parameters outside the span of the observed data. So far so good. It did not show that, "even if $P=I$, RNN parameters may remain non-identifiable due to the hidden influence of unobserved neurons or redundancy introduced by non-monotonic activation functions". It just showed, exactly as in the original observed case, there exists a class of non-identifiable parameters living outside the span of the observed data. By assuming that the parameters relating to the unobserved data are known, you remove all the interesting parts of the problem?
And you don't discuss the role of the nonlinearity at all? So why should I draw the conclusion you suggest from the theorem? - Next, I was concerned by the claims about l2 regularisation. In the simplest case (no noise), finding the estimator that fits the data with minimal l2 norm will clearly select the weights that have no projection in the nullspace of the data matrix, solving the identifiability problem. Yet the authors claim that this is insufficient. They justify this claim by showing that FORCE learning recovers non-zero non-identifiable weights - certainly an interesting result on a shortcoming of FORCE learning. But I don't see the link - they claim that FORCE learning effectively performs l2 regularisation, but I don't see how. I read through the original algorithm and couldn't see the link to the claim that it is minimising error + l2 regularisation, and in fact I view the authors' results as evidence in the opposite direction. If it were minimising such a loss, it would not have these non-identifiable components upon convergence! - Theorem 2 seemed solid [did not check this proof], showing that if you start in the identifiable subspace and effectively regularise the weights, you will stay there. I was surprised that Corollary 1 is a local claim, about the loss near a minimum where second-order Taylor expansions are relevant. This is a severe restriction for a general loss, and should be acknowledged as such (for example, perhaps name the corollary "local identifiability in nonlinear regression"). My take-away was still that l2 regularisation saves the day. - I found the discussion of noise confusing. The model is introduced as noisy, but none of the analytics address that setting. The only discussion of noise is in 4.3, where suddenly the data have noise added to them. I did not get why the important quantity is the span of the noiseless component - that's only true if both your real dynamics and your fitting are applied to the noiseless data, which is not the case? (Unless the added noise is just observation noise, not the input or conversion noise introduced in Section 2.) I agree that estimating the rank of data with noise is interesting, and likely relevant, but (a) surely this stuff is well studied? (b) the link to the rest of the work is very unclear to me. Overall, despite thinking this was an interesting question, I felt like the more interesting parts weren't tackled. In the noiseless case with full observability it comes down to a very simple result, the same as in linear regression, about the span of the data. Further, the writing made this simplicity hard to see. A large part of the surprises in this problem seem to come from unobserved parts of the model/noise, and model mismatch; none of these were robustly tackled. Add to this the additional confusions listed above, and I'm afraid I am currently leaning reject. Further, stylistically, there was just a lot of material, which made it hard to digest. Appendix B was basically a continuation of the paper (8 pages!), including B.3, one of my favourite bits. Some more digesting by the authors, and much punchier writing (it's very verbose), will likely help the presentation. But this is vague advice, so not something I can reasonably request changed in a rebuttal. Clear from the above I think. Fully human-written
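As a toy check of the reviewer's point about l2 regularisation in the noiseless case, the following numpy sketch (synthetic data, hypothetical sizes) shows that the minimum-norm least-squares fit recovers exactly the projection of the true weights onto the span of the data and puts zero mass in the null space; for a monotonic, invertible link the same argument goes through after mapping the targets through the inverse link.

```python
import numpy as np

# Synthetic rank-deficient regression: 100 observations, 30 weights, data rank 10.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 30))
w_true = rng.standard_normal(30)
y = X @ w_true                                   # noiseless one-step targets

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # lstsq returns the minimum-norm solution

P_row = np.linalg.pinv(X) @ X                    # projector onto the row space of X
print(np.allclose(w_hat, P_row @ w_true))        # True: the null-space part is exactly zero
```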
A theory of parameter identifiability in data-constrained recurrent neural networks Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The manuscript presents a mostly theoretical treatment of the issue of parameter identifiability in the fits of recurrent neural networks (RNNs) to neural data. RNNs are increasingly fitted to experimentally measured neural data as a way to extract the relevant dynamics and potentially gain insights into the underlying mechanisms. Yet it is not fully understood to what extent the parameters of an RNN are in principle identifiable based on finite, noisy data. The authors present several theoretical insights into this question and propose experimental approaches that might mitigate issues of identifiability. The premise of the paper is clearly outlined. Several interesting connections to past work (even outside neuroscience) are presented and discussed. The focus on relatively simple, tractable settings allows the authors to gain precise insights into which parameters of their models are identifiable and which are not (in the form of various theorems and corresponding proofs). The work appears technically sound. While the paper provides some interesting insights into which parameters of data-constrained RNNs are identifiable or not, the practical relevance of these insights is not clear. One prominent application of data-constrained RNNs in neuroscience is to obtain smooth/denoised estimates of low-dimensional, latent dynamics from high-dimensional, noisy observations. For such applications, presumably it does not matter if multiple RNN weight matrices exist that can explain the dynamics equally well. Likewise, many mechanistic insights into the function of the fitted RNNs (like the topology of fixed points) are probably possible even if the RNN weight matrix cannot be identified uniquely. In fact, it is known that the same type of dynamics can be implemented even by different classes of RNNs (work on "Universality" by Sussillo and colleagues, 2019). The authors should also clarify the connection of their work to past studies that have found a close relationship between dimensions of the weight matrix and dimensions of the dynamics. This relationship has been described in detail in low-rank networks (work by Ostojic et al.) and even nominally high-rank RNNs have been found to be functionally low-rank (Krause et al., 2022), whereby only a low-d subspace of the weight matrix is sufficient to explain the corresponding low-d dynamics. These lines of work seem closely related to those in this manuscript, which also finds that the subspace of the identifiable weights is closely linked to the subspace explored by the dynamics. I found some sections of the paper rather dense and difficult to read. It would help if the authors could at times provide more intuitions about the insights gained from their theorems. What types of insight are affected by the non-identifiability presented by the authors? If the goal of fitting an RNN to data is to infer latent dynamics, or generate hypotheses about the underlying topology of the dynamics, does it matter that the RNN parameters are not fully identifiable?
The causal interventions proposed by the authors to alleviate non-identifiability seem to be focused on characterizing the components of the weight matrix that are not identifiable, as they are not sufficiently constrained by the measured data. But why would it even be desirable to constrain dynamics that are not explored in the “natural” operation of a neural circuit? Fully human-written
Identifying Truthful Inheritance in Family Models and Enhancing Truthfulness Soundness: 3: good Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The previous work ITI shows that some attention heads in LLMs are highly associated with truthfulness, as they can distinguish the truthfulness of an input with high accuracy under linear probing. Following ITI, this paper finds that these truthfulness-associated heads are inherited in the MLLMs fine-tuned from a base LLM. They then propose a representation steering method to enhance the truthfulness of the MLLMs. 1. The authors pose the research questions clearly in the paper. 1. It seems that the first stated contribution (Line 092) is not a contribution of this paper, as this is achieved in ITI. The paper simply applies ITI to the Halu dataset and clarifies that the identified heads are context-truthful heads, not truthful heads. 2. One contribution of this paper is the finding on inheritance of truthfulness. However, how to derive the correlation scores that validate this finding is not explicitly clarified in Section 2.3. That is, only scores are given in Lines 236-240, while the definition or derivation of these scores is not clarified. 3. Overall, I feel this work is an application of ITI to the Halu dataset and MLLMs. 1. In the experiment section, the authors show the effectiveness of the proposed TruthProbe on the fine-tuned MLLMs. Since the proposed method is designed to improve truthfulness for general LLMs/MLLMs, I wonder if there is any result showing the effectiveness on the base LLMs, especially compared with ITI? 2. This paper shows that the truthfulness inheritance happens in fine-tuned MLLMs. I am curious if this also happens in fine-tuned LLMs. Fully human-written
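For readers unfamiliar with the head-level probing and correlation scores discussed in this review, here is a minimal sketch with entirely synthetic data; the shapes, split sizes, and the use of Pearson correlation are assumptions for illustration, not the paper's exact procedure. A logistic probe is fit per attention head, and "inheritance" would show up as a high correlation between the per-head probe accuracies of the base LLM and its fine-tuned MLLM.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n, d_head, n_layers, n_heads = 1000, 16, 4, 8
labels = rng.integers(0, 2, size=n)                   # truthful vs. untruthful inputs

def head_probe_accuracies(acts, labels, split=800):
    # acts: (layers, heads, n, d_head) activations cached at the final answer token
    accs = np.zeros(acts.shape[:2])
    for l in range(acts.shape[0]):
        for h in range(acts.shape[1]):
            clf = LogisticRegression(max_iter=1000)
            clf.fit(acts[l, h, :split], labels[:split])
            accs[l, h] = clf.score(acts[l, h, split:], labels[split:])
    return accs

# Synthetic stand-ins for cached activations of a base LLM and its fine-tuned MLLM.
llm_acts = rng.standard_normal((n_layers, n_heads, n, d_head))
mllm_acts = llm_acts + 0.1 * rng.standard_normal((n_layers, n_heads, n, d_head))

# A correlation score of the kind the review asks about: Pearson correlation between
# the flattened per-head probe accuracies of the two models.
r, p = pearsonr(head_probe_accuracies(llm_acts, labels).ravel(),
                head_probe_accuracies(mllm_acts, labels).ravel())
```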
Identifying Truthful Inheritance in Family Models and Enhancing Truthfulness Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper investigates (1) whether the truthfulness of an LLM can be inherited by the MLLM, and (2) whether it is possible to design a structure that leverages the truthfulness mechanisms to improve the MLLM's downstream performance on the POPE benchmark. The experiments provided affirmative answers to both questions. - The finding that truthfulness inherits from the LM to the VLM is insightful. - The proposed TruthProbe improves the Acc, F1, Rec scores on the three POPE subsets (MSCOCO, A-OKVQA, GQA). It's an important advance for the interpretability field to show improvements in the model performance. - The Introduction asks "Do these models inherit traits like truthfulness", which indicates other traits might also be studied, but only truthfulness is studied in this paper. I recommend updating the questions in the abstract and the introduction correspondingly. - The TruthProbe method would benefit from a clearer explanation of details. The number of trainable parameters in the introduced soft gate could be stated more clearly. From my current understanding of Section 3, $S$, $\lambda$, and $g_l^h$ are the parameters. But how many of them are trained and how many of them are hyperparameters? I'd appreciate it if more details were provided. Similarly, is this method applied at inference time? - Additionally, there are a lot of typos, and I think the paper could perhaps benefit from another round of proofreading: - Line 193, "1": is it Table 1 or Figure 1? - Line 233, "by" -> "By". - Line 236, the LaTeX formatting is problematic here. - Line 274, "equation 2" -> "Equation 2". - It'd be great if the notation were unified. E.g., "Equation" vs "eq", "Figure" vs "Fig.", "Sec" vs "Sec." Please refer to my reviews in the above sections. Especially, clarifications about the details would be appreciated. Fully human-written
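As an illustration of the kind of soft gate this review is asking about, here is a minimal torch sketch; the functional form, the normalization, and the names (s, lam, g) are assumptions for illustration, not the paper's actual equation. The gate rescales each attention head's residual-stream contribution according to a normalized truthfulness score.

```python
import torch

def gate_heads(head_outputs, s, lam=2.0):
    # head_outputs: (layers, heads, d_model) per-head contributions at one position
    # s: (layers, heads) probe-derived truthfulness scores, lam: scalar gate strength
    s_norm = (s - s.mean()) / (s.std() + 1e-6)       # normalize scores across heads
    g = torch.sigmoid(lam * s_norm)                  # soft gate in (0, 1) per head
    return head_outputs * (1.0 + g.unsqueeze(-1))    # amplify high-truthfulness heads

heads = torch.randn(32, 32, 4096)                    # hypothetical model dimensions
scores = torch.rand(32, 32)
gated = gate_heads(heads, scores)
```

Under this reading, lam would be a single hyperparameter and the gates g would be deterministic functions of the probe scores s, which is exactly the trained-vs-hyperparameter distinction the review wants spelled out.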
Identifying Truthful Inheritance in Family Models and Enhancing Truthfulness Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. In this paper, the authors investigate whether truthfulness mechanisms in large language models (LLMs) are inherited by their multimodal counterparts (MLLMs) and propose an intervention technique to enhance truthfulness performance. The authors employ linear probing at the attention head level to obtain truthfulness scores that indicate each head's responsiveness to truthful outputs. The authors then compute correlation coefficients between the truthfulness scores (probe accuracies) of base LLMs and their fine-tuned MLLMs to demonstrate inheritance. Building on these findings, they propose TruthProbe, a soft gating mechanism that amplifies contributions from high-truthfulness heads during inference by scaling their outputs in the residual stream according to normalized truthfulness scores. In the experiments, the authors evaluate the intervention method on the HaluEval benchmark for LLMs and the POPE benchmark for MLLMs. - The paper presents an innovative perspective on truthfulness mechanisms and intervention techniques. The inheritance framework provides a novel angle to understand truthful behavior, and the findings from interpretability results inform the design of the intervention mechanism. - The method is evaluated across multiple benchmarks and experimental settings, providing a comprehensive view of the proposed approach. - The presentation of the paper could be improved to enhance clarity and include necessary methodological details. - The paper lacks clarity on the dataset splitting strategy for Table 1 evaluation. While the authors mention a 4:1 train/validation split for probe training and use a held-out validation set (20%) for hyperparameter tuning, it remains unclear whether Table 1 evaluates on truly independent test data or data potentially overlapping with the probe training/validation sets. - In Figure 3, the evaluation results are obtained by aggregating across three benchmarks, but the paper does not specify how the aggregation is performed or provide breakdown scores for each individual benchmark. - Figure 2 presents truthfulness score breakdowns for each attention head across all layers, but the heatmap format offers a low information-to-noise ratio. - The rationale and evidence should be provided to support the probing applied at the final answer token position, as it may not be the place where untruthfulness manifests. - The methodology for evaluating truthfulness inheritance in MLLMs has limitations that weaken the paper's core claims. - Evaluating MLLMs' truthfulness mechanisms primarily through text-based hallucination benchmarks (HaluEval) with blank images does not provide meaningful insights into multimodal truthfulness. This evaluation setting does not reflect typical MLLM use cases, where models process informative visual content alongside text. The approach conflates text-only reasoning capabilities inherited from the base LLM with genuine multimodal grounding abilities. - The observed decrease in similarity when transitioning from text-only to multimodal settings (Figure 1(b) vs. 
Figure 3) may actually suggest that MLLMs develop new or altered truthfulness mechanisms distinct from their LLM counterparts, contradicting the inheritance hypothesis rather than supporting it. - The cross-modal similarity comparison does not constitute strong evidence for inheritance. The authors compare models trained from scratch (Mistral) with fine-tuned derivatives to demonstrate family-specific patterns, but this contrast is insufficient. Different base models trained on different datasets with different procedures naturally exhibit different internal representations. The low correlation with Mistral (≈0.02) could simply reflect differences in pre-training rather than validating that fine-tuning preserves specific functional properties. - The experimental results show marginal performance differences between intervened and non-intervened models (Table 2 and Figure 5), raising questions about the method's practical effectiveness. To establish that the intervention produces statistically significant improvements rather than noise-level fluctuations, the authors should report results from multiple runs with different random seeds and conduct appropriate statistical significance tests. Please see the above section. Lightly AI-edited
Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models Soundness: 1: poor Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a prompt engineering strategy for improving safety alignment. The LLM is prompted with instructions modeled on Asimov's Three Laws of Robotics. In addition, the reasoning output is monitored for misalignment. If misalignment is detected by another LLM, the generation is rolled back to a point identified by inspecting attention scores. Additional guidance is injected into the chain of thought and the generation is resumed. Experiments show that this approach gives a modest boost on safety and social reasoning benchmarks. 1. The paper's attempts to incorporate lessons from other fields like psychology are interesting and creative, even if I don't think that they are successful. 2. The proposed technique is simple to implement. 1. A major concern is the reliance on language from psychology that seems misleading about what is really happening. The method is essentially a particular prompt with an LLM-as-judge that monitors the results. I don't think it is reasonable to call this "cognition". The paper has many instances of this provocative terminology that seems out of place and overstates the capabilities of LLMs. Another example is "cognitive states," which are just vectors in a subset of {-1, 0, 1}^3. The first claimed contribution says that the paper "formalizes cognitive perception." I respectfully disagree that this is a reasonable way to describe writing prompts that describe a model of human cognition. 2. The paper does not engage very much with whether the prompting strategy is having the desired effect, in the sense that the model is actually following the intended schema. A few case studies are presented in the appendix without synthesizing the findings into an overall evaluation. This paper would be much stronger if it evaluated the ability of LLMs to follow the intended rules and reason in the described state space. Currently the focus of the presentation is on small improvements on existing benchmarks, rather than investigating the limits of LLMs' ability to reason about these topics. I would be much more enthusiastic about such a paper and I encourage the authors to consider such a direction. 3. The paper does not measure the variance across generations. The results are presented without something like standard errors, but the differences in scores between methods seem small on some metrics. 1. Does CooT incur any penalty in accuracy on harmless requests? How often does the LLM judge incorrectly flag harmless situations? 2. In what sense is the "causal rollback" causal? The position to roll back to is determined by larger attention scores. How do attention scores prove that a particular place in the CoT caused the result to be harmful (or any other aspect of the result)? Fully human-written
Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces Cognition-of-Thought (CooT), an inference-time alignment framework that couples a Generator with a cognitive Perceiver to monitor and correct LLM outputs during decoding. The Perceiver uses Asimov's Three Laws (Safety > Altruism > Egoism) to detect violations, triggers rollback to error origins, and injects universal + contextual guidance to steer generation toward safer outputs. The authors compare their method with a range of baselines on AIR-bench (measures safety) and on SocialEval (measures social intelligence), showing that CooT achieves higher or competitive performance. - The decoding-time cognitive architecture is an interesting and creative idea, and this is the first work to propose inference-time alignment as an explicit Generator-Perceiver loop with structured state monitoring. - The proposed framework is compared thoroughly with baselines across multiple model families. Results show consistent improvements. - The ablation studies are informative and validate each component (rollback, guidelines, Perceiver size, precedence). All components contribute meaningfully, which verifies the design of the framework. - The paper is well written and easy to follow. 1. The state cognition model (Section 3.1) lacks theoretical validation. - The use of Asimov's laws seems underjustified. Why did you choose these three laws specifically? They were designed for science fiction robots, not real AI safety. - Why not use established moral psychology frameworks (e.g. Haidt's Moral Foundations Theory) or empirically grounded safety taxonomies? 2. The use of "cognition" seems inaccurate, as you are merely describing a pattern-matching mechanism. - You claim the Perceiver provides "cognitive self-monitoring," but isn't it just doing classification with a specially prompted LLM? I do not find justification that this is genuinely cognitive rather than sophisticated pattern matching. - The term "cognition" should be used more sparingly or with more evidence (e.g., if there is true reasoning and understanding beyond the behavioral output). 3. Rollback mechanism seems heuristic. - The attention-based sharpness score (Eq. with max-norm + entropy) lacks principled justification. Why should peaked attention indicate causal error origins? There is no analysis of failure modes: what if attention is diffuse or the error spans multiple positions? - The threshold τ requires tuning (Appendix A.4), but how should it be set for new domains? 1. My understanding is that your state space is {-1, 0, 1}³, giving only 27 possible states (and you restrict to a "feasible" subset F). But real-world ethical dilemmas are much more nuanced. How can this state space capture the complexity of social alignment? For example, how would your system handle cases requiring trade-offs between different types of harm? 2. You run two models in tandem (Generator + Perceiver), with potential rollback and regeneration. What's the computational overhead? CooT can be very computationally expensive on models beyond the ones you tested (> 32B). 3.
The Generator and Perceiver share weights; how can the same biased parameters reliably audit themselves? Fully human-written
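A toy sketch of the kind of attention-based sharpness score questioned in this review; the exact combination of max-norm and entropy, and the threshold value, are assumptions for illustration rather than the paper's equation. The idea is that peaked attention over past tokens marks a candidate rollback position.

```python
import numpy as np

def sharpness(attn_row, eps=1e-12):
    # attn_row: attention weights from the current token over all past positions
    p = attn_row / (attn_row.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum()
    return p.max() - entropy / np.log(len(p))   # high max, low (normalized) entropy => "sharp"

attn = np.random.dirichlet(np.ones(20) * 0.3)   # toy attention distribution over 20 tokens
tau = 0.5                                       # placeholder threshold; tuning it is the issue raised
rollback_candidate = sharpness(attn) > tau
```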
Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a cognitive alignment framework for LLMs, enabling them to self-monitor their own outputs. Specifically, the authors introduce an additional cognitive perceiver for LLMs to continuously monitor the generation process. The perceiver uses specialized prompts to determine whether the generated text complies with Asimov's Three Laws of Robotics. When it detects violations, the model rolls back the generation process to the erroneous position by aggregating the generator's attention maps to identify the positions most strongly affecting the current prediction. Then, the authors introduce corrective guidelines (sentence priors) to guide re-generation from that position. Experiments show that the framework improves safety and social reasoning performance across multiple models. $\bullet$ The paper presents a self-monitoring safety alignment framework that incorporates human cognition, and validates its effectiveness across different model families and safety scenarios. The framework appears conceptually simple and practical to implement. $\bullet$ It is great to introduce guidelines and priors. The cognitive perceiver assesses whether the generated text satisfies Asimov's Three Laws, i.e., Safety, Altruism, and Egoism, and uses normative corpora such as the Behavioral, Emotional, and Social Skills Inventory to guide regeneration. $\bullet$ The overall framework of this paper appears clear and intuitive: detecting inappropriate generation, localizing it, and re-generating the text. However, in terms of specific technical choices, some ideas seem unclear or may involve additional options. For example, why does the perceiver use an LLM to generate classification results instead of a standard supervised classifier (the latter seemingly faster)? In the localization step, why can the maximum value of the aggregated attention map (lines 251-255) be considered the localization result? A more thorough discussion comparing common localization techniques would help justify this choice. $\bullet$ The paper should report the computational overhead introduced by the framework. Methods that embed safety mechanisms into model weights typically do not introduce additional inference overhead. It appears that using the perceiver (along with the other steps) will introduce extra inference overhead. The authors should compare the average inference time between the base model and the model using the proposed framework (as in Figure 2) on the same tasks. Additionally, they should compare the runtime increase incurred by common safety methods when performing the same inference tasks (as shown in Table 2). $\bullet$ The paper should include several examples of outputs before and after using this framework, both in the main text and in the appendix. It would also be valuable to show failed correction cases and analyze the underlying reasons. **Minor:** In Table 2, in the Cooperation column, the AGRS method (54.26) appears to outperform the proposed method (54.12), yet the proposed method is highlighted in bold. See weaknesses. Fully human-written
Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces Cognition-of-Thought (CooT), a novel decoding-time framework that equips LLMs with an explicit cognitive self-monitoring loop to improve safety and social reasoning. CooT couples a standard Generator with a cognitive Perceiver that continuously monitors generation using a precedence-based hierarchy. When violations are detected, the system performs causal rollback and regenerates with injected guidance combining universal social priors (BESSI framework) and context-specific warnings. Experiments on AIR-Bench 2024 and SocialEval show improvements over baselines, with comprehensive ablations demonstrating each component's contribution. The paper addresses an important problem of making alignment explicit and dynamic rather than baked into model weights. The core idea of coupling a Generator with a Perceiver for real-time cognitive monitoring is interesting and well-motivated by psychological research on metacognition. The experimental methodology is thorough, with evaluations across multiple benchmarks (AIR-Bench, SocialEval, HarmBench) and model families (Gemma, Llama, Qwen, GPT), demonstrating generalizability. The ablation studies (Table 3) systematically validate each component's contribution, showing that rollback, guideline injection, and precedence-aware states are all necessary. The qualitative case studies in Appendix D provide valuable insights into when and how the system intervenes. I think the primary weaknesses come in practical deployment of the framework. For instance, the "universal social schema" (BESSI) may have cultural limitations, as the BESSI framework is derived primarily from Western psychological research and may not generalize well across cultures with different social norms and values. The paper evaluates on English and Chinese tasks but doesn't discuss whether the social skills taxonomy (e.g., "Social Warmth," "Ethical Competence") translates appropriately across these contexts or whether culture-specific adaptations are needed. Furthermore, the paper doesn't report inference latency or throughput. Given that CooT requires: (1) running the Perceiver at each step, (2) potentially multiple rollback-and-regenerate cycles, and (3) encoding contextual guidelines, the computational cost could be substantial. This is critical for practical deployment. Also, the cognitive state representation is quite limited. Using just three ternary values (Safety, Altruism, Egoism) ∈ {-1, 0, 1}³ to represent the model's "cognitive state" seems overly simplistic for capturing the complexity of ethical reasoning. Real ethical dilemmas often involve trade-offs that don't fit neatly into such a scoring hierarchy (e.g., the trolley problem). Though I would rank this weakness as relatively minor since I couldn't think of a benchmark for stress-testing CooT. - What happens when the Perceiver itself makes errors in state classification? Is there any confidence calibration or uncertainty quantification in the cognitive state predictions? - How does the Perceiver handle ambiguous cases where safety, altruism, and egoism are genuinely in tension?
For instance, in a scenario where refusing to help (preserving safety) versus helping (altruism) both seem reasonable, how does the system make the judgment call? Fully AI-generated
EXPLOR: Extrapolatory Pseudo-Label Matching for OOD Uncertainty Based Rejection Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces EXPLOR, a method for single-source out-of-distribution (OOD) generalization using pseudo-label matching across multiple heads. The core idea is to expand latent representations beyond the in-distribution manifold via simple scaling, generate pseudo-labels from diverse models (e.g., XGBoost, D-BAT), and train a multi-head network to match each pseudo-labeler's predictions. The approach aims to improve high-confidence OOD predictions and rejection accuracy. Experiments on chemical (ChEMBL, DrugOOD) and tabular (Tableshift) datasets show consistent improvements over baseline and semi-supervised methods. 1. Clear problem formulation. The paper targets the single-source OOD setting, which is realistic but rarely addressed. The motivation from drug screening and risk prediction is well justified. 2. Simple yet effective design. EXPLOR combines latent-space expansion and per-head pseudo-label matching into a lightweight framework that requires no unlabeled OOD data or domain annotations. 3. Strong empirical performance. Across more than ten datasets, EXPLOR achieves stable gains, particularly in the high-confidence regime (AUPRC@R < 0.2). 4. New evaluation metric. The introduction of AUPRC@R < τ provides a practical way to assess selective prediction quality, relevant to safety-critical tasks. 1. Limited novelty. The method essentially combines self-training, ensemble averaging, and multi-task learning in a new context. No fundamentally new theoretical idea or architecture is introduced. 2. Empirical generality overstated. Although the paper claims to be modality-agnostic, all experiments are confined to tabular data. There is no evidence that the approach extends to images, text, or graphs. 3. Dependence on pseudo-label quality. The performance gain scales with the accuracy of pseudo-labelers. Poor labelers could degrade the overall results, and this dependency is only briefly mentioned in the appendix. 4. Weak theoretical justification. The variance-reduction derivation (Eq. 9–10) is heuristic; there is no formal analysis showing that latent scaling approximates the true OOD distribution. 5. Overstated terminology. Terms like "extrapolatory" or "modality-agnostic" may mislead readers, given the limited scope of experiments. Please carefully read the weaknesses above and address all five concerns. Fully AI-generated
EXPLOR: Extrapolatory Pseudo-Label Matching for OOD Uncertainty Based Rejection Soundness: 4: excellent Presentation: 4: excellent Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces EXPLOR, a framework for single-source out-of-distribution (OOD) generalization and uncertainty-based rejection. The method integrates three elements: (1) latent-space extrapolation to synthetically expand training support, (2) supervision from a diverse set of pseudo-labelers, and (3) a multi-head student trained with per-head and mean-matching losses. A bias–variance decomposition is provided as an intuitive justification. Experiments on seven chemical and several tabular datasets demonstrate consistent improvements in AUPRC and AUPRC@R<τ over relevant baselines. - **Relevance and clarity** The paper addresses an important problem in OOD generalization under a single-source assumption. The motivation is clear, the method is concise, and the presentation is easy to follow. - **Methodological consistency** The combination of per-head matching and latent extrapolation provides a coherent multi-task formulation that plausibly enhances robustness. - **Reproducibility and efficiency** The implementation is simple, cost-efficient, and reproducible with standard hardware. - **Redundancy at inference** EXPLOR retains all pseudo-labelers at test time (Eq. 8), forming a hybrid ensemble of teachers and student heads. This design questions whether the student truly distills ensemble knowledge or merely supplements it. The claimed variance-reduction interpretation (Eq. 9–10) remains heuristic without quantitative analysis. - **Limited conceptual novelty** Each component of EXPLOR—ensemble pseudo-labeling, latent-space augmentation, and mean-matching regularization—has clear precedents in prior work (e.g., Hydra 2021, PixMix 2022, ACET 2019). The paper’s main contribution lies in system-level integration within a single-source OOD setup rather than in a new theoretical insight. - **Dataset and representation scope** The use of ChEMBL, TDC, and DrugOOD benchmarks is reasonable, but all experiments employ fixed ECFP4 fingerprints. This limits the generality of the “modality-agnostic” claim. Results on learned embeddings, such as graph neural networks or visual features, would better demonstrate transferability. - **Ablation and statistical support** Per-head matching and bottleneck ablations show small absolute gains (around 1 percent) and lack significance testing. The uncertainty claims would be stronger with calibration or variance metrics (for example expected calibration error). - **Heuristic theoretical argument** The bias–variance decomposition provides intuition but lacks empirical verification. No analysis is presented to confirm that predictive variance actually decreases as claimed. **Inference overhead** What is the actual computational overhead in terms of time and memory when retaining up to 1024 pseudo-labelers during inference compared with using the student alone? A quantitative comparison would clarify the method’s practicality. **Dependence on pseudo-labelers** How would EXPLOR perform if pseudo-labelers were unavailable at test time? Would the student alone preserve similar accuracy and uncertainty behavior? 
**Generality across learned embeddings** Does the proposed framework extend beyond fixed handcrafted features to learned embeddings such as those obtained from graph neural networks or visual backbones? Showing this would strengthen the claim of modality-agnostic generalization. **Notation clarification for Equation 8** Equation 8 appears to omit parentheses in the summation term. Fully AI-generated
EXPLOR: Extrapolatory Pseudo-Label Matching for OOD Uncertainty Based Rejection Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a new method for single-source domain generalization. The method trains multiple pseudo-labelers on different data subsets, expands the training data through latent-space augmentations, and then trains a multi-headed network in which each head matches a different pseudo-labeler. The approach works with any vector data and different base models. Experiments show strong performance on high-confidence predictions. 1. Single-source domain generalization is realistic for drug discovery applications, where many real scenarios only have one labeled dataset. The paper targets this practical setting. 2. The method works with tree models and neural networks and with any vector data; unlike image-specific methods, it is general purpose and can be applied to many domains. 3. The evaluation covers diverse datasets including chemical and tabular data, compares with multiple baselines, and shows consistent improvements over the pseudo-labelers across datasets. 1. The expansion pushes points away from the origin by z' = (1 + |ε|)z, but there is no validation that the expanded points are realistic. They might just be random noise, and there is no check that they represent plausible OOD samples. 2. The method relies heavily on pseudo-labels from diverse labelers, but there is no check that the pseudo-labels are reliable on OOD data. If all pseudo-labelers are wrong, the student learns from bad supervision, and there is no quality-control mechanism. 3. Diverse ensembles are standard ensemble practice, and pseudo-labeling is well known in semi-supervised learning. The main contribution is combining them with latent expansion; the individual components are not new. 1. How do you verify expanded points are realistic? Can you show they resemble real OOD samples? 2. How do you ensure pseudo-labels are reliable? What if all pseudo-labelers are wrong? 3. Why 1024 models? How did you choose this number? What happens with 64 or 256? Fully AI-generated
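For concreteness, here is a minimal torch sketch of the two mechanisms described in this review, with hypothetical shapes and an assumed loss form: the latent expansion z' = (1 + |ε|)z, and a per-head matching loss in which head k of the student is trained toward pseudo-labeler k's predictions on the expanded points.

```python
import torch
import torch.nn.functional as F

def expand_latents(z, scale=0.5):
    # z' = (1 + |eps|) z pushes in-distribution latents away from the origin
    eps = scale * torch.randn(z.shape[0], 1)
    return (1.0 + eps.abs()) * z

def per_head_matching_loss(head_logits, pseudo_probs):
    # head_logits: (batch, n_heads) student head outputs
    # pseudo_probs: (batch, n_heads) soft targets, one column per pseudo-labeler
    return F.binary_cross_entropy_with_logits(head_logits, pseudo_probs)

z = torch.randn(64, 128)                       # in-distribution latents (hypothetical dims)
z_exp = expand_latents(z)
student_heads = torch.nn.Linear(128, 16)       # stand-in for a shared trunk with 16 heads
pseudo_targets = torch.rand(64, 16)            # stand-in for the 16 teachers' predictions
loss = per_head_matching_loss(student_heads(z_exp), pseudo_targets)
```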
Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper presents EnConda-Bench, a new benchmark for evaluating LLM code agents. The authors collected various repositories that include the necessary README files, which can be used as a guide to correctly set up the repository environments. The authors synthetically introduce errors into the READMEs using LLMs and create the benchmark to evaluate how LLM code agents can repair the issues in the README as well as generate a correct bash script to set up the environment. The paper evaluates several LLM agent baselines and provides a detailed analysis of benchmark performance. - tackles a very important and currently understudied problem of using LLM agents to build software development environments - the benchmark is constructed nicely with detailed descriptions of the pipeline and manual examination, which can be used by future work - the authors also evaluate the benchmark already on several important baselines together with detailed analysis Missing key evaluation category: - As the authors demonstrated in the paper and as shown in prior work, developing a script that can successfully build an environment is non-trivial. - In the paper the authors focus on the task of repairing a README with errors - However, we can also easily use the benchmark without injected README errors to evaluate, given a correct README, how well LLM agents can generate a correct environment. - I think this is an interesting scenario that would allow the authors to further compare and contrast with the repair results Unclear error classification category: - From reading the paper it is unclear how the error classification is done. - For example, how does an agent know what the different error categories are? If it does not know, do the authors still use an LLM-as-judge to determine that? - I think this is a very inefficient evaluation category as it only shows how good the LLM agents are at identifying the error type Minor issues: - Some of the figures are extremely small, with small text that is difficult to see (e.g., Figure 4) 1. Did the authors evaluate the base error-free results from the benchmark? 2. How does the agent perform the error classification during evaluation? Fully human-written
Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a benchmark, EnConda-Bench, for evaluating LLM-based agents on environment configuration tasks in software engineering. The key innovation is providing process-level trajectory assessment across four capabilities: environment setup planning, perception-driven error diagnosis, feedback-driven repair, and action execution. This work constructs 4,201 tasks from 323 GitHub repositories. Their evaluation across multiple LLMs and agent frameworks reveals that, while agents can localize errors reasonably well (F1 ~60%), they struggle to translate feedback into effective corrections. 1. This paper proposes a new benchmark with process-level evaluation for environment configuration. 2. From repository selection to filtering and validation, the multi-stage dataset construction pipeline demonstrates rigor. 3. The paper clearly articulates the problem (environment configuration bottleneck), motivation (limitations of end-to-end metrics), and solution (process-level evaluation with synthetic errors). **1. Limitations In Language Coverage**: The benchmark focuses exclusively on Python repositories. Given that environment configuration challenges exist across all programming languages, this can limit the generalizability of the evaluation and conclusions. **2. Limitations In Synthetic Error Validity**: While the authors validate that injected errors cause failures, there's insufficient evidence that these errors represent the *distribution* of real-world configuration problems. The difficulty comparison (Table 2) shows similar mean scores among different benchmarks, but doesn't validate whether the *types* of errors match real-world distributions. **3. Limitations In Evaluation Metrics:** The Pass@1 metric doesn't account for partial progress (e.g., fixing 1 of 2 errors). **4. Limitations In Analysis:** The specialized environment agents (e.g., Repo2Run) are evaluated but not deeply analyzed for why they perform better. There are also no ablation studies examining which agent design choices matter most. **5. Limitations In Data Construction Transparency**: The repository selection criteria (10+ stars, 1000+ commits, 10+ issues) seem arbitrary, and no justification is provided. Also, the process of "manual checking" is mentioned but not detailed (e.g., how many annotators? What was the failure rate?). **6. Limitations In Paper Presentation:** Some figures are hard to read. For example, the scatter points, x-axis labels, and titles in Figure 7 are very hard to read, as are the model names in Figure 5. My questions follow several aspects mentioned in the weaknesses: - How did you validate that your synthetic error distribution matches real-world configuration problems? - How did you take partial credit into consideration, as it cannot be effectively assessed by Pass@1? - Can you provide more detailed analysis of *why* Repo2Run performs better? Is it the dual-environment architecture, the rollback mechanism, or something else? - From case studies, what are the common failure modes or patterns?
What are their distributions? What are the implications for future improvements? - Can you give a more detailed explanation of the selection, filtering, and validation criteria used in dataset construction? - Can you revise the figures to make them clearer and more readable? Lightly AI-edited
Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. In this paper, the authors propose a framework for analyzing process-level trajectories of LLM agents in environment configuration. This moves evaluation beyond simple pass/fail build outcomes to diagnose where and why an agent fails. The dataset is created by injecting errors (six defined categories) into valid README files followed by automatic Docker-based validation. However, this methodological strength also possibly limits the benchmark's real-world applicability (discussed more in the Weaknesses section). 1. This work addresses a critical yet unexplored bottleneck, moving from end-to-end pass/fail metrics to process-level trajectory assessment. This makes it possible to extract actionable feedback for agent designers. 2. The decomposition of evaluation into perception, feedback, and action provides fine-grained diagnostics beyond aggregate success metrics. 1. The six error types chosen are said to be "guided by failure modes frequently encountered in practice", but no citation or prior empirical study is provided to substantiate this taxonomy. Without grounding in developer-observed data, it is unclear whether these six types cover real-world failure modes, or simply reflect intuitive assumptions. 2. Each erroneous README is created by injecting two errors per file. This raises two concerns: (i) since both synthesis and evaluation rely on LLM behavior, the resulting benchmark may reflect model-specific phrasing or error styles, rather than human-authored configuration errors; (ii) generating only two errors per category constrains diversity; a stronger design would sample multiple error candidates and retain a stratified subset verified by humans. 3. The results lack ablations on error difficulty levels, the number of injected errors, and the impact of Docker environment variations. These would strengthen the claim that process-level evaluation provides deeper insight than end-to-end metrics. 1. Were the six error types derived from any grounded empirical study or developer survey of configuration failures? 2. Did you experiment with multiple variants per error type to check whether the evaluation metrics remain stable? 3. How does EnConda-Bench handle repositories with pre-existing errors or ambiguous READMEs? Fully AI-generated
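A minimal sketch of the Docker-based validation step mentioned in this review, with hypothetical image names, paths, and timeout (the benchmark's actual harness may differ): an injected-error README is only worth keeping if the setup script derived from it actually fails inside a clean container.

```python
import pathlib
import subprocess

def setup_fails_in_container(repo_dir: str, setup_script: str) -> bool:
    # Write the candidate setup script into the repository checkout.
    script = pathlib.Path(repo_dir) / "setup_candidate.sh"
    script.write_text(setup_script)
    # Run it inside a clean container; a non-zero exit code means the injected error bites.
    result = subprocess.run(
        ["docker", "run", "--rm", "-v", f"{repo_dir}:/repo", "-w", "/repo",
         "python:3.10", "bash", "setup_candidate.sh"],
        capture_output=True, text=True, timeout=1800,
    )
    return result.returncode != 0
```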
Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a new dataset to advance agentic software engineering. This dataset focuses on measuring environment configuration, which is identified as a common weakness for current agents. It consists of 100 problem instances and includes process-level annotations. The dataset is constructed using LLM-assisted filtering as well as human validation. * This paper identifies an important issue in agentic coding and proposes a targeted dataset and benchmark to enable future research. * While there are no technical contributions beyond the dataset and some of the analysis, the evaluation suite does seem to offer some key benefits over previous work, in particular when it comes to the "process-level" evaluation. * The dataset creation procedure is thorough and well-explained; I think this will be a useful resource for the community. * There are some notable omissions in the evaluations, such as GPT-5-Codex and Claude 4.5, both of which are considered SOTA base models for coding. Furthermore, for the coding agents, why not include Codex CLI, Gemini CLI, Jules, and Claude Code? These are specifically optimized to handle novel codebases and deal with configuration issues. * I'm not sure that ICLR is the best venue for this work. Perhaps a dedicated dataset/benchmark track would be better suited. * Some of the figures are too small to be useful, such as Figure 5 and Figure 6. It would be better to focus on a subset of the results in the main text (relegate others to the appendix) to better highlight differences. Why not include Codex CLI, Gemini CLI, Jules, and Claude Code? These are specifically optimized to handle novel codebases and deal with configuration issues. Fully human-written
Pre-training under infinite compute Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates language model pre-training in a fixed-data, infinite-compute setting. To address overfitting in the data-constrained scenario, the authors propose a recipe with 30x larger weight decay and compare recipes via the asymptote of the scaling law. The final joint scaling recipe yields a 5x data efficiency gain over the baseline. The authors then propose and demonstrate that model ensembles can be trained and distilled into a single student model while preserving most of the loss improvement. Lastly, the authors show that their metrics generalize well to downstream benchmarks. 1. The motivation for the paper is practical: a future where internet data is exhausted. 2. The empirical studies are thorough and clear. The authors produced a recipe that achieves 5x higher data efficiency and demonstrates the benefit of a much larger weight decay. 3. The level of attention to experiment details is commendable, and experiment setups are described in detail. 4. The settings the authors take into account are comprehensive, with results spanning pretraining, model ensembling, and distillation. 1. Due to the shallow setup for the 1.4B model, it underperforms the 600M version without the proposed weight decay. While the authors show in Figure 3 how the regularized recipe fixes this, it still casts doubt on the following conclusions, given how sensitive power law fits are, and that 1.4B is the largest model variant that the authors use for curve fitting. For instance, in Figure 9, 1.4B's downstream eval results also break the trend, most likely due to model architecture configuration. Since the authors only train on 200M tokens, this is not prohibitively expensive to fix. 2. The ensemble scaling part requires a heuristic that breaks in the over-parametrized setting. The heuristic itself, when and why it works well or not, could use more explanation to make the ensemble tuning recipe simpler and more principled. 1. Is there any intuition behind distilling from a single model working better than distilling the ensemble? Fully human-written
Pre-training under infinite compute Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work studies how to pre-train when compute is abundant but data is limited. The authors show that standard approaches of increasing epochs or model size overfit, and that stronger regularization, especially much larger weight decay, helps sustain gains. They further show that ensembling independently trained models improves the loss asymptote, and that their best setup, combining regularization, epoching, and ensembling, achieves the same performance with 5× less data. Finally, they find that most of their ensembling recipe's benefit can be retained through distillation into a smaller model, with gains also reflected on downstream tasks. The problem addressed is timely. The main ideas, like stronger regularization, evaluating by scaling-law asymptotes, and using simple ensembling, are easy to follow and reproduce. The experiments consistently show clear gains and improved data efficiency, and the distillation results make the approach more useful. The paper is well-written, with clear experiments and a natural, convincing link between the sections. See Questions section below. 1. The novelty feels incremental in places. Ensembling and stronger regularization are well‑known tools; the main new piece seems to be "optimize for and compare asymptotes under fixed data." Could you sharpen the claim of conceptual novelty beyond careful tuning + ensembling, and clarify what, if anything, is surprising in the results relative to prior data‑constrained scaling work? 2. Compute‑matching across alternatives could be clearer. For a fixed training FLOP budget, how do (i) a single larger model, (ii) K‑member ensembling, and (iii) spending compute to synthesize more training text compare? 3. Weight decay is pushed high and clearly helps, but other classic knobs (dropout, data noising, etc.) are not explored. I would be interested in your thoughts on whether similar monotone‑scaling and asymptote reductions hold with these (or other) alternatives? Heavily AI-edited
Pre-training under infinite compute Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper explores a forward-looking problem: how to design pre-training strategies when compute grows faster than data availability. The authors simulate a future where web text becomes the limiting resource and propose several methods to make pre-training data-efficient under fixed data but infinite compute. 1. The paper identifies an under-explored but realistic future scenario where compute scales faster than data. This is a fresh take beyond the traditional "compute-optimal" (Chinchilla-style) scaling laws. 2. The authors construct a controlled 200M-token benchmark and systematically vary parameters, epochs, and regularization. The coordinate-descent tuning of weight decay, LR, and epochs shows strong empirical rigor. 3. The evaluation results demonstrate the effectiveness of the proposed approach. 1. The question raised in this paper is quite interesting. However, I am curious whether the conclusion would still hold if the dataset size were fixed between 30B and 100B tokens while training a 1B-parameter model. According to the Chinchilla paper, the optimal number of training tokens is approximately 20 x N, and over-training is quite common nowadays. For instance, the Qwen3 model was pretrained on about 36T tokens. It would be interesting to investigate, when training with more than 20 x N tokens, how to appropriately adjust the learning rate, weight decay, and number of epochs. 2. Could the authors briefly clarify how this paper differs from Tensor Programs V: Tuning Large Neural Networks via Zero‑Shot Hyperparameter Transfer (https://arxiv.org/pdf/2203.03466)? Additionally, the μP paper shows that hyperparameters are closely tied to the model architecture, and the more recent Scaling Inference‑Efficient Language Models (https://arxiv.org/pdf/2501.18107) builds a scaling law that explicitly incorporates architecture. Could the authors elaborate on the impact of model architecture in their study? 3. The evaluation over three downstream tasks is limited. Is it possible for the authors to report evaluations on ARC-Challenge, LAMBADA, HellaSwag, Winogrande, etc.? I would be willing to raise my score if the authors can address the above questions. See weaknesses Moderately AI-edited
Pre-training under infinite compute Soundness: 3: good Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents a compelling and timely investigation into language model pre-training for a future where high-quality data is the primary bottleneck, not compute. The authors argue that as compute continues to scale faster than data generation, the community must shift from compute-efficient to data-efficient training paradigms. The work first demonstrates that standard recipes, which involve increasing epochs or model parameters, inevitably lead to overfitting and performance degradation. As a solution, the authors propose a "regularized recipe" using aggressive weight decay (up to 30x standard practice) to achieve stable, monotonic performance scaling. Building on this, they show that ensembling multiple smaller models is a more data-efficient use of compute than scaling a single large model. By combining these methods, they claim a significant 5.17x improvement in data efficiency. Finally, the paper grounds these compute-intensive findings in practice, showing that the gains from ensembling can be largely preserved by distilling them into a single, smaller model, and that improvements in validation loss translate to meaningful gains on downstream tasks. This is a strong paper that could become a foundational empirical work for the emerging data-constrained training regime. The experiments are thorough, and the insights are valuable for both theoretical and practical applications. Specifically, I appreciate the following contributions of this paper. 1. **Extensive Experiments:** The claims are backed by an impressive scale of experimentation, reportedly over 2000 training runs. I like the scientific approach of diagnosing the failure of a baseline, introducing a robust regularized recipe, and then demonstrating the superiority of ensembling, which builds a convincing, step-by-step argument. 2. **Practical and Generalizable Insights:** The work successfully connects the "infinite compute" thought experiment back to practical applications. The demonstration that an 8-member ensemble can be distilled into a single student model that retains 83% of the performance gain is a crucial result, proving these methods are not purely theoretical. Furthermore, the validation on downstream benchmarks confirms that the observed improvements in pre-training loss correspond to genuine enhancements in model capabilities. 1. **Architectural Specificity:** The empirical results are derived exclusively from Llama-style decoder-only architectures. It remains unclear whether the core findings, particularly the superiority of ensembling over scaling a single model, generalize to different architectures such as Mixture-of-Experts (MoE). More critically, under the infinite-compute regime, transformer architectures may not represent the optimal choice. The paper would benefit from discussion on architectural considerations within data-constrained training regimes. 2. **Reliability of Power-Law Extrapolation:** The paper's headline quantitative claims (e.g., 5.17× data efficiency) depend on extrapolating power laws fitted to limited data points (typically four model sizes). 
While the authors provide a sensitivity analysis and appropriately advise caution, the absence of formal statistical goodness-of-fit metrics (such as R² or confidence intervals) makes it difficult to assess the true reliability of these asymptotic estimates. The broader literature demonstrates that simple power laws can break down or exhibit sub-scaling behavior at larger scales, making long-range extrapolation risky. This concern is compounded by the non-standard width/depth configuration of the 1.4B model, which introduces a potential confounder that could affect the quality of the fit. The paper convincingly shows that tuned regularization eliminates the initial overfitting peak seen in standard recipes. However, the double descent literature suggests that with sufficient over-parameterization, an unregularized model's performance might enter a "second descent" and improve again. Could you comment on whether this second descent could, in the infinite limit, outperform the regularized recipe? Is there a theoretical or empirical reason to believe the regularized performance curve is strictly superior at all points in the over-parameterized regime? Moderately AI-edited
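To illustrate the kind of goodness-of-fit reporting the review asks for, here is a minimal sketch of fitting a saturating power law L(N) = E + A·N^(-alpha) and reporting R² plus a rough uncertainty on the asymptote E. The loss values below are invented for illustration and are not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(n, E, A, alpha):
    # E is the asymptote: the loss as parameter count n -> infinity.
    return E + A * n ** (-alpha)

# Hypothetical (parameter count, validation loss) pairs -- four points, as in the paper's fits.
n_params = np.array([150e6, 300e6, 600e6, 1.4e9])
losses = np.array([3.52, 3.41, 3.33, 3.28])

popt, pcov = curve_fit(saturating_power_law, n_params, losses, p0=[3.0, 50.0, 0.3], maxfev=10000)
pred = saturating_power_law(n_params, *popt)
r2 = 1.0 - np.sum((losses - pred) ** 2) / np.sum((losses - losses.mean()) ** 2)
stderr = np.sqrt(np.diag(pcov))  # rough standard errors on (E, A, alpha)
print(f"asymptote E = {popt[0]:.3f} +/- {stderr[0]:.3f}, R^2 = {r2:.4f}")
```

With four points and three free parameters the fit is nearly saturated, which is exactly why confidence intervals on E (and sensitivity to dropping the non-standard 1.4B point) would strengthen the extrapolated data-efficiency claims.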
DLGNet: Hyperedge Classification via a Directed Line Graph for Chemical Reactions Soundness: 3: good Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes DLGNet as a novel approach to solving the reaction classification task. The key idea is to represent a set of chemical reactions as a hypergraph H, then convert it into a directed line graph (DLG), where each vertex corresponds to a hyperedge in H, and edges are drawn between vertices if their corresponding hyperedges share at least one node in H. DLGNet processes this directed line graph using a spectral GNN. Experimental results show that DLGNet outperforms various baseline methods designed for hypergraph processing across three reaction classification datasets. - The paper clearly presents the often-confusing concepts of hypergraphs and their transformation into line graphs. - It proposes an extension of the hypergraph incidence matrix from binary {1, 0} values to ternary {1, –i, 0}, and introduces a corresponding Laplacian for the line graph, providing a theoretical foundation for the relationship between the hypergraph and its line graph. - Using the Laplacian defined above, the authors design a spectral GNN (i.e. DLGNet) and demonstrate its empirical superiority over other hypergraph-based methods on the reaction classification task. - If the goal is reaction classification, the selection of baseline methods seems unfair. The paper only compares against hypergraph-based approaches, many of which show very low F1 scores, making the proposed method appear more effective than it might actually be. Reaction classification can also be addressed using traditional chemoinformatics methods and various GNN- or Transformer-based approaches. At the very least, the method should be compared against typical baselines such as a classic ReactionFP [1] and something like [2]-[4]. Without such comparisons, it's hard to assess the real value of the proposed approach. - There are multiple ways to convert a hypergraph into a graph, not just the line graph approach, but also methods like clique expansion. As shown in cited Zhou et al., NIPS 2006 “Learning with Hypergraphs,” it's also possible to define an adjacency matrix directly from the hypergraph's incidence matrix. This would also be consistent with hypergraph Laplacian theory, and it’s not entirely clear how or why the proposed method is better compared to any potential alternatives. - I think it would be helpful to include a more concrete explanation of how the hypergraph-based methods being compared differ from the proposed approach. Rather than reducing a hypergraph to a regular graph, it's possible to define message passing directly on the hypergraph and perform prediction by applying global pooling over each hyperedge. In fact, this may allow for more flexible network designs (for example, by incorporating graph transformer layers). However, the paper lacks a clear discussion of how the proposed method compares to or improves upon such approaches, and what specific advantages it offers. [1] ReactionFP https://doi.org/10.1021/ci5006614 [2] Mapping the space of chemical reactions using attention-based neural networks. 
(Nat Mach Intell, 2021) https://doi.org/10.1038/s42256-020-00284-w [3] Chemical-Reaction-Aware Molecule Representation Learning (ICLR 2022) https://openreview.net/forum?id=6sh3pIzKS- [4] Rxn Hypergraph: a Hypergraph Attention Model for Chemical Reaction Representation https://arxiv.org/abs/2201.01196 - It's unclear whether the primary goal of the paper is to advance ML methods for hypergraphs, to improve reaction classification performance, or both. However, the manuscript frames the work around the reaction classification task and evaluates it solely in that context. If that's the case, why were the comparisons limited to hypergraph-based methods? Why not include standard approaches commonly used for reaction classification? - Assuming the transformation into a line graph is a key component, the ternary {1, –i, 0} incidence matrix introduced in Eq. (7) is a very interesting idea. Is this incidence matrix itself the novel contribution of the paper? Or does this idea already exist, with the novelty lying instead in how it's used to analyze the relationship between hypergraphs and directed line graphs, or in the design of the corresponding Laplacian and spectral GNN? Clarifying this would help position the paper's contribution more clearly. - In the end, it's not entirely clear why reducing a hypergraph to a line graph and applying specialized spectral message passing would offer practical advantages over directly designing message passing on the hypergraph itself. Could you provide more clarification on the pros and cons of this line graph reduction approach, especially in comparison to more typical hypergraph-based methods used in your comparisons? Fully human-written
DLGNet: Hyperedge Classification via a Directed Line Graph for Chemical Reactions Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents a graph neural network (GNN) for classifying chemical reactions. The method models reactions as directed hypergraphs and then transforms them into a directed line graph where each vertex represents an entire reaction. A complex-valued Laplacian matrix captures the reaction's directionality, enabling a convolution-like operator for the resulting GNN. Experiments demonstrate superior performance and confirm that encoding this directionality is key to the model's success. 1. The paper provides mathematical proofs for the key properties of the proposed Laplacian, i.e., that it is Hermitian and positive semidefinite. This rigour ensures that the method is not an ad-hoc heuristic but a well-defined operator. 2. By isolating the effect of directionality ("DLGNet w/o dir" in Table 2), the paper provides conclusive evidence for its central claim. The ablation study confirms that encoding direction is not just helpful but critical for success on this task. 1. The experiments do not provide evidence that a (hyper)graph-based representation is inherently superior to a simpler, non-structural approach for this specific task. Modern architectures like transformers have shown great success on set-based data. The paper could be strengthened by including a comparison against a strong baseline that does not rely on an explicit graph structure. For example, a Deep Sets model or a Transformer encoder that treats the reactants and products as two distinct sets of molecular fingerprints. Note that the AllDeepSets and AllSetTransformer models use message-passing-style aggregators typical of graph neural networks [1]. 2. All tasks are reaction-type classification. There's no evidence the operator helps on other directed-hypergraph problems in non-chemical domains, e.g., a co-authorship network with authors as nodes and citation links between author collaborations (papers) as directed hyperedges. [1] You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks, ICLR 2022. 1. Would an expanded ablation on architectural hyperparameters (number of convolutional layers, channel widths, classifier depth) clarify sensitivity and stability? 2. For the merged Dataset-2, what exact deduplication and leakage checks were performed across train/validation/test? 3. In addition to the standardised feature-transfer operator, can baselines be reported with their strongest native hyperedge readouts (e.g., attention-based or learned set pooling) to ensure fair comparison? 4. Beyond reaction-type classification, can the operator be demonstrated on an additional directed-hypergraph task (e.g., hyperedge link prediction) and/or a non-chemical dataset to establish broader applicability? 5. The aggregation of molecular features into reaction features is a critical step. The current method uses a weighted summation. Could the authors elaborate on the rationale for choosing summation over other aggregation functions like mean or max pooling? A brief justification or an ablation study on this design choice would be insightful. 6. 
Appendix G valuably distinguishes the proposed Laplacian from the Magnetic Laplacian, noting that the DLG Laplacian avoids 'sign-pattern inconsistency.' Could the authors expand on how this theoretical advantage translates into practical benefits, such as improved model stability, expressivity, or performance, specifically for the task of chemical reaction classification? Heavily AI-edited
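For readers unfamiliar with the construction both reviews discuss, here is a toy numerical check of the Hermitian / positive-semidefinite claims. This is my own sketch under the assumption that reactants receive entry -i and products entry 1 in the {1, -i, 0} incidence matrix, and that the line-graph Laplacian is built as L = B^H B; the paper's exact normalization may differ.

```python
import numpy as np

# Toy directed hypergraph: 4 molecules (rows), 2 reactions (hyperedges, columns).
# Assumed convention: -1j marks a reactant, 1 marks a product, 0 otherwise.
B = np.array([
    [-1j,  0 ],  # molecule 0: reactant of reaction 0
    [-1j,  1 ],  # molecule 1: reactant of reaction 0, product of reaction 1
    [ 1 , -1j],  # molecule 2: product of reaction 0, reactant of reaction 1
    [ 0 ,  1 ],  # molecule 3: product of reaction 1
])

L = B.conj().T @ B  # 2x2 operator on reactions, i.e., on the vertices of the line graph

print(np.allclose(L, L.conj().T))               # Hermitian: True
print(np.all(np.linalg.eigvalsh(L) >= -1e-12))  # positive semidefinite: True
```

Any operator of the form B^H B is Hermitian and PSD by construction, so the substantive question raised above is what the imaginary reactant/product phases buy in practice compared with alternatives such as clique expansion or a Magnetic Laplacian.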
DLGNet: Hyperedge Classification via a Directed Line Graph for Chemical Reactions Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper focuses on predicting chemical reactions and introduces a novel method that encodes reaction graphs into a Directed Line Graph (DLG). In this data structure, the triplet set in the raw graph is transformed into a new vertex set. The authors provide theorems for the Signless Directed Line-Graph and the Directed Line Graph Laplacian of the DLG. Additionally, the authors build a spectral-based Graph Neural Network to capture features. Experimental results on three real-world chemical reaction datasets demonstrate that DLGNet consistently outperforms all baseline competitors. 1. The idea of the transformation from a standard graph to a hypergraph is interesting and reasonable for reaction prediction. 2. The theorems for the Signless Directed Line-Graph and the Directed Line Graph Laplacian of the DLG appear to be comprehensive. 1. Please note that the anonymous repository has expired. 2. In the experiments, can the classes that DLGNet easily confuses reveal potential reaction patterns, such as similar regions representing the dominant structures in reactions, reflecting the main areas where bonds are formed and broken? 3. I think an important contribution of this paper is using the Laplacian matrix as prior knowledge for the adjacency matrix in GCN. One thing I'm curious about is whether the Laplacian matrix, after being extracted, can be transformed into the adjacency matrix of the original reaction graph, or integrated as a signal into the original adjacency matrix. For example, $B L B^{-1}$. This is because I feel that DLG compresses too much information, which may affect the upper limit of the neural network's performance. See weaknesses. Fully human-written
Non-Additive Time-Series Forecasting via Cross-Decomposition and Linear Attention Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces TIM (Time-series Interaction Machine), a fully-MLP time series forecasting model based on ANOVA/Hoeffding decomposition. It enhances long-term multivariate forecasting performance by explicitly modeling main effects and non-additive interaction effects. The core innovation is explicit separation of forecasting signals into three components: (1) main temporal effects via Time Fusion, (2) non-additive multivariate interactions via Feat Fusion using DCN-style cross stacks, and (3) residual corrections via Res Fusion. Experiments on 8 benchmarks show TIM achieves 25/48 first places in MSE and 21/48 in MAE, outperforming DLinear, PatchTST, TimeMixer, and iTransformer. - This paper addresses a genuine limitation in existing time series forecasting: most models implicitly emphasize additive effects while inadequately modeling interaction effects (cross-variable and cross-temporal dependencies). - The study is grounded in a rigorous mathematical framework (e.g., ANOVA/Hoeffding decomposition). It provides comprehensive theoretical guarantees through a series of theorems and corollaries, including error bounds, interaction purity, and coefficient formulations. - The experiments cover 8 benchmark datasets across diverse domains and evaluate 4 prediction horizons (short-term and long-term). The model competes against 7 SOTA methods and achieves strong performance, with 25/48 first places in MSE and 21/48 in MAE metrics. - The theoretical framework places significant emphasis on exogenous variables. However, the experimental section does not clarify whether the datasets used incorporate these variables or how they are accounted for. Moreover, no evaluation is provided regarding the actual impact of these variables on the results. - While the paper claims to contribute to model interpretability on component-wise attribution and cross-term explanations, it offers no empirical evidence or illustrative examples in the experiments to substantiate these claims. - A more thorough and in-depth analysis of the experimental results should be conducted. This could include, for example, error analysis, results on different settings, or an examination of how well the results align with theoretical expectations. Please refer to weakness. Moderately AI-edited
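For readers unfamiliar with the framework this review invokes, the standard functional ANOVA/Hoeffding decomposition (textbook form under a product reference measure, not copied from the paper) writes a square-integrable target as a constant plus mutually orthogonal main-effect and interaction terms:

```latex
f(x_1,\dots,x_d) = f_0 + \sum_{i=1}^{d} f_i(x_i) + \sum_{i<j} f_{ij}(x_i,x_j) + \cdots + f_{1\dots d}(x_1,\dots,x_d),
\qquad \mathbb{E}\!\left[ f_S \mid x_T \right] = 0 \quad \text{for all } T \subsetneq S .
```

As I read the paper's claim, the Feat Fusion branch is meant to target the second- and higher-order terms that purely additive forecasters drop, which is what the "non-additive interaction effects" in the summary above refer to.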
Non-Additive Time-Series Forecasting via Cross-Decomposition and Linear Attention Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes TIM, which employs the decomposition of features, time, and residuals to effectively capture both additive and interaction effects with minimal computational overhead. The paper's experiments demonstrate that TIM outperforms classical time series forecasting baselines in terms of both accuracy and efficiency across multiple datasets. 1. The description of the model architecture in the paper is disorganized, and it is unclear how each module actually works. There is a lack of a complete mathematical formulation, and the architectural description is not clear enough. 2. The paper claims to use Feat Fusion for capturing the non-additive (interaction) component, Time Fusion for capturing temporal shifts and main additive effects, and Res Fusion for capturing residual structure not explained by main or interaction effects. However, there are no corresponding designs to ensure these defined functions. 3. The paper has poor readability, with many terms not clearly defined. For example, DCN-style and CP rank. 4. The paper's innovation is limited. AXIS-WISE LINEAR SELF-ATTENTION seems to be directly adapted from linear attention, yet there are no relevant citations. 5. The paper's experiments are not convincing, as the baselines are outdated. 6. The paper's layout is disorganized, with chaotic table formatting. The ablation experiment (Table 4) is in the appendix, but there is no proper reference to the appendix. 7. The method description in the paper is very unclear, and the source code is not provided, making both the results and the methods irreproducible. Please refer to the weaknesses. Moderately AI-edited
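For context on point 4 above, the generic linear-attention formulation the reviewer appears to be alluding to (Katharopoulos et al., 2020, "Transformers are RNNs") replaces the softmax with a feature map φ so that the key-value summary can be computed once and reused, giving cost linear in sequence length; this is the standard form, not necessarily the exact axis-wise variant used in the paper:

```latex
\mathrm{Attn}(Q,K,V)_t = \frac{\phi(q_t)^{\top} \sum_{s} \phi(k_s)\, v_s^{\top}}{\phi(q_t)^{\top} \sum_{s} \phi(k_s)},
\qquad \phi(u) = \mathrm{elu}(u) + 1 .
```

Citing this line of work and stating precisely how the axis-wise variant differs from it would address the attribution concern.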
Non-Additive Time-Series Forecasting via Cross-Decomposition and Linear Attention Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes the Time Series Interaction Machine (TIM), an MLP-based forecaster for non-additive time series forecasting that explicitly aligns with ANOVA/Hoeffding decomposition. The key innovation is the use of cross decomposition and linear attention to model non-additive interactions in time series data. The architecture consists of three main components: (1) a lightweight branch capturing main effects, (2) a Feat Fusion module using Deep Cross Network (DCN)-style cross stacks to extract multivariate interaction effects, and (3) a linear transformer backbone with channel-independent structure. The paper provides theoretical justification by connecting the architecture to ANOVA decomposition, arguing that the learned components correspond to theoretical main effects and interaction effects. Experiments are conducted on standard datasets against SOTA baselines. ## Strengths 1. **Well-written introduction and background**: The first three paragraphs of the introduction provide clear and intuitive background. The abstract paints a good high-level picture of the approach with main effects and interaction effects. 2. **Novel contribution in Feat Fusion**: The Feat Fusion module is the most impressive innovation. It effectively captures multivariate interaction effects, which distinguishes this work from previous methods that can be viewed as "main effect only" models. 3. **Clear presentation aids**: Figure 1 is very helpful in understanding the architecture. Algorithm 2 is useful and clear, aligning well with Figure 1. Lines 191-211 describing the fusion mechanisms are intuitive and easy to read. 4. **Theoretical grounding**: The connection to ANOVA decomposition provides theoretical justification for the architecture design, with learned components corresponding to theoretical components. 5. **Solid experimental setup**: Standard datasets and SOTA baselines are used appropriately. 6. **Complementary approach**: The work complements recent developments in transformer-based time series forecasting methods and appears modular—the transformer backbone could potentially be switched. ## Weaknesses 1. **Clarity issues in the abstract**: - Acronyms (DCM, CP rank) are not spelled out initially - Confusion about how "MLP forecaster" and "linear self-attention" fit together - Lines 19-20 become difficult to follow 2. **Organizational and readability issues**: - References in Section 2.1 (page 2) point to pages 5-6, disrupting reading flow - Section 3 feels abrupt; unclear transition from method description to theoretical alignment - Purpose of Sections 3.4, 3.5, 3.6, and 3.7 is unclear—are these all theoretical remarks? 3. **Insufficient justification for architectural choices**: - The role and necessity of residue branches (Res Fusion) is not well explained. Why is it essential beyond being "like a skip connection"? - Line 190: unclear how Time Fusion, Feat Fusion, and Res Fusion "share the same architecture" when designed for different purposes - Line 240: confusing whether regression targets relate to loss function construction or component interpretation 4. 
**Weak ablation study**: The "without RES" ablation confirms RES helps performance, but: - No clear high-level intuition for why RES is absolutely needed - Line 397 mentions RES captures cross-variable dependencies, but doesn't clarify its exact unique contribution - Line 403 mentions performance without RES but doesn't justify its necessity 5. **Missing key experiments**: A more comprehensive ablation study is needed to demonstrate the value of Feat Fusion across different backbones: - TIM + DLinear vs. DLinear alone - TIM + PatchTST vs. PatchTST alone - TIM + iTransformer vs. iTransformer alone This would better isolate the contribution of interaction effect modeling from the choice of backbone. I have listed structured questions (with the help of an LLM) in the weaknesses above. I am going to share here my honest thoughts while reading the paper as presented, and hopefully this can help you understand how a new reader perceives your paper. These raw feelings are genuine and I hope they provide more human-to-human communication and context for the structured questions above. # Review: Non-Additive Time Series Forecasting ## Abstract This is the review for non-additive time series forecasting via cross decomposition and linear attention. The abstract actually makes sense. It is trying to model the non-additive interactions and use a design from ANOVA decomposition. It paints an intuitive high-level picture of what it is, with main effects and interaction effects. Minor points: the acronyms, although standard, are used a lot. You can spell them out. I'm also confused: it mentions "MLP forecaster" earlier in the abstract, and then in the middle part of the abstract, it says "access via linear self-attention." So I don't see how the linear self-attention plugs in here. That's okay, I can wait. I look forward to the transparent cross-term explanations. In the middle of the abstract (the line starting around line 19-20), I start to get a little bit lost, but that's okay—I can read the main paper. ## Introduction The first two paragraphs of the background in the introduction are really good. The third paragraph is also very good. Basically, it gives the background pretty well. The proposed method is called Time Series Interaction Machine. It's an MLP forecaster explicitly aligned with ANOVA/Hoeffding decomposition. Interesting. A lightweight branch captures the main effect, and a DCN-style cross stack handles the interactions. DCN means Deep Cross Network. The mathematics is pretty simple—it's basically an MLP path that learns implicit nonlinear interaction effects. The criticism of transformers is correct regarding their quadratic time and memory complexity. There are linear transformers available. But the authors are using linear transformers. Otherwise, I have no complaint about the introduction. It's pretty good. A lot of the references on page 2 of Section 2.1 go to pages 5 and 6, making it a little bit difficult to read. But it's just a minor point. For better smoothness, I'm actually going to glance through 2.1 without paying too much attention, and move on to 2.2. ## Method It's already using linear attention in line 94. So that's good. The interpretation of a kernel smoother is established in the literature (line 100). So the key in Section 2.1.4 should be multivariate interaction features that are learned. I didn't see in the introduction right here how linear attention is used. Or is linear attention just used as a benchmark? Alright, I see. Figure 1 makes this really helpful. 
I'm assuming in Figure 1 the Feat Fusion is the main innovation. Now, the extra multivariate interaction effects—yes, like cross-vector, like DCN in time. And component (b) is a linear transformer backbone going through a channel-independent structure. I see. So really, the new thing here is the first block—the Feat Fusion that extracts multivariate interaction effects. Yes, so in that sense, the time fusion and transformer backbone previously can be thought of as a main effect only model. And this work is adding the Feat Fusion to add interaction effects. Very interesting. I don't quite see what the residue branches are doing. It seems that it's just capturing anything not in the main model as residue, kind of like a skip connection or something like that. I would be curious about how important the Res Fusion is, and why it is essential. Interesting. So because Time Fusion and Res Fusion are per-variable channel-independent, the multivariate effect is solely captured in the Feat Fusion. I'm a little confused. In line 190, how can Time Fusion, Feat Fusion, and Res Fusion share the same architecture? They are designed for very different things. In what sense are they sharing the architecture? From lines 191 to 211, the goal and the interpretation of those fusion mechanisms are intuitive and easy to read. Algorithm 2 is useful and clear for the overall flow. It aligns well with Figure 1. ## Theoretical Connections I feel Section 3 is a little bit sudden. I just thought that I had finished reading the method, and then there seems to be some alignment that needs to be done. I'm a little bit confused. Also, in line 240, it starts to introduce regression targets, which confuses me. Is it about loss function construction, or is it about the interpretation of each component? I get Section 3.2—the decomposition is somehow theoretically justified. Is the claim that the architecture mirrors this ANOVA decomposition, and therefore, the learned components will have a good correspondence to the theoretical components? Is that the story? Oh, that seems to be the case. After reading line 278, it seems that the cross-branch and main effect correspond to the earlier architecture. So I don't quite get what's the purpose of presenting Section 3.4, Section 3.5, and Section 3.6, together with Section 3.7. They are all remarks and connections to some theoretical study—theoretical interpretation. Is that right? ## Experiments Now I move on to Section 4, the experiments section. I can see that the datasets described are all standard. The SOTA baselines are also standard. The ablation study actually includes "without RES". That confirms my intuition that the residue seems to be the least coherent part of the story. I get that it helps with performance. Is there any discussion on the high-level intuition about why the residue is absolutely needed? Like you mentioned in line 397, "RES processes univariate time series as tokens and captures cross-variable dependencies." Therefore, what is the exact contribution of RES? That's a good place for you to remind me. Yes, in line 403, you actually mentioned why the "without RES" setting seems to be still ok. But you didn't remind me why the RES connection is absolutely needed. 
This Feat Fusion does seem to capture interaction effects. It also appears to me that you can actually switch the transformer backbone. So for the ablation study, I would say a fairer ablation study would be: if you're using DLinear as the backbone, then you have TIM plus DLinear; TIM applied on DLinear, see how good it is. Use PatchTST as the backbone, and then TIM plus PatchTST, see how good it is. Or iTransformer as the backbone. You get what I'm saying. I think that would illustrate how important the interaction effect and this residue prediction really are. Because I really think the Feat Fusion is the most impressive new thing here. Overall, I think this is a good paper. An interesting one. I wouldn't say the entire thing is flawless, but it is a solid read. There's some confusion, but there's one part of the idea that I really like. I'm not too convinced about the Time Fusion and the Res Fusion, but I really like the Feat Fusion perspective. Lightly AI-edited
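Since the Feat Fusion / DCN-style crossing is the part the review above singles out, here is a minimal sketch of a generic DCN-style cross layer applied across the variate axis. The class, shapes, and depth are my own assumptions for illustration, not the paper's actual module.

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One DCN-style crossing step: x_{l+1} = x_0 * (W x_l + b) + x_l."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        # The elementwise product with x0 injects explicit feature interactions;
        # the residual term keeps the main (additive) signal path intact.
        return x0 * self.linear(xl) + xl

# Toy usage: a batch of 8 time steps with 7 variates treated as the feature axis.
x0 = torch.randn(8, 7)
x = x0
for layer in [CrossLayer(7) for _ in range(3)]:  # a depth-3 cross stack
    x = layer(x0, x)
print(x.shape)  # torch.Size([8, 7])
```

Stacking L such layers yields polynomial interactions up to degree L+1, which is presumably what the paper's degree/rank guarantees formalize; the backbone-swap ablation suggested above would show how much of the gain comes from this block alone.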
Non-Additive Time-Series Forecasting via Cross-Decomposition and Linear Attention Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces TIM (Time-series Interaction Machine), an all-MLP forecaster designed from the ANOVA/Hoeffding target. TIM consists of lightweight branches that capture main effects, and a DCN-style cross stack models the orthogonal non-additive interaction subspace. TIM achieves linear time and memory complexity via axis-wise linear self-attention combined with DCN-based feature crossing, while keeping parameters comparable to compact MLP baselines. - The paper is well-structured and easy to follow. - The paper provides detailed theoretical analysis, including degree and rank guarantees for the cross stack, as well as risk decomposition identities that explain the additive error gap. These contributions ground the proposed method in a strong mathematical framework. - The proposed TIM model balances accuracy and computational efficiency, outperforming Transformer-based models while maintaining a lightweight, all-MLP architecture. This makes it particularly suitable for resource-constrained environments. - Although TIM achieves SOTA results on several datasets, its performance on certain benchmarks (e.g., Traffic, Solar-Energy) shows only marginal improvements. - While the paper compares TIM against several Transformer-based and MLP-based methods, some recent and relevant works are omitted, for example, the Transformer-based TimeXer [1] and the MLP-based TimeMixer++ [2] and SOFTS [3]. [1] Wang, Yuxuan, et al. "Timexer: Empowering transformers for time series forecasting with exogenous variables." Advances in Neural Information Processing Systems 37 (2024): 469-498. [2] Wang, Shiyu, et al. "Timemixer++: A general time series pattern machine for universal predictive analysis." arXiv preprint arXiv:2410.16032 (2024). [3] Han, Lu, et al. "Softs: Efficient multivariate time series forecasting with series-core fusion." Advances in Neural Information Processing Systems 37 (2024): 64145-64175. - While the paper highlights TIM's interpretability through its component-wise attributions and decomposition regularizer, the explanation and evaluation of this interpretability are limited. Please refer to the weaknesses. Heavily AI-edited
ExploraQA: Embodied Question Answering with Long-horizon Proactive Exploration Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces ExploraQA, a new embodied question answering (EQA) benchmark designed to evaluate long-horizon, proactive, question-driven exploration in 3D embodied environments. The benchmark contains 12k question-answer pairs across seven categories, with longer navigation trajectories and multiple valid viewpoints per question. The authors also propose Answer Quality-Guided Navigation (AQ-Nav), which integrates a topological memory, a keyframe selection mechanism, and LLM-based answer-quality reinforcement to improve question-guided exploration and answering. Experiments on Habitat environments show improvements in question-answering and navigation metrics over several baseline navigators. - Interesting benchmark targeting long-horizon, question-conditioned exploration — relevant for embodied agents. Important since proactive exploration remains under-tested in embodied AI. - Clear dataset effort: multiple viewpoints, more realistic navigation lengths, broader question categories. - AQ-Nav idea (LLM-based answer-quality signal + keyframe filtering) is novel and conceptually interesting. - Reasonable gains over baselines on their benchmark. - Pipeline is well-engineered — topological memory + relevance filtering + RL reward shaping is a coherent architecture. - LLM generation vs human verification tradeoff is unclear. If every QA pair still goes through humans, where is the true scalability advantage? ~17% rejection rate suggests humans are still the bottleneck. LLMs may also reduce linguistic diversity (mode collapse), undermining benchmark diversity. Why not human-written questions if humans are reviewing anyway? - The introduction (correctly) points out that other benchmarks like OpenEQA have "passive trajectories" that are ineffective as expert demonstrations for imitation learning, motivating the need for "proactive trajectories." However, the paper does not include any imitation-learning evaluation to support this claim, making the stated IL motivation currently unverified. - No real analysis of diversity or difficulty of QA pairs. Benchmark value depends on distributional richness and question complexity. No evaluation that LLM-generated QA isn't repetitive or too easy. - Experiments: No comparison to simple baselines like a blind LLM or a multi-frame VLM with randomly sampled frames from the scene. - Human verification details limited; no quality statistics beyond a correlation metric. No inter-annotator agreement scores. - No cost analysis (GPU hours, human annotation effort, CO2 footprint) 1. LLM-generated QA vs human QA: As discussed above, if all QA pairs require human verification anyway, what is the actual impact/efficiency gain? Did you measure linguistic diversity / question quality vs human-written questions? 2. Question collapse risk - Did you analyze entropy / lexical diversity / syntactic variety of generated questions? - Is there evidence your dataset avoids LLM-mode-collapse phrasing patterns? 3. 
Imitation learning motivation - You argue passive trajectories aren't suitable for imitation learning (and proactive exploration helps) — do you actually train an IL agent on ExploraQA to validate this? If not, can you soften the claim or add a discussion of IL results? 4. Reward shaping - Why reward only score==5? - Did lower thresholds or soft rewards fail? 5. Baselines - I'd really recommend adding simpler baselines, in particular (1) blind LLMs, (2) multi-frame VLMs, and (3) human baselines, to give estimates of the lower and upper bounds of the benchmark and some sense of the difficulty. This is standard practice, as done in previous EQA benchmarks like OpenEQA. 6. Human workload - What are the total hours / cost for human verification? Is the ~17% correction rate stable across scenes? Moderately AI-edited
ExploraQA: Embodied Question Answering with Long-horizon Proactive Exploration Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a new dataset and approach for embodied question answering in navigation settings. As a first contribution, they curate a new QA dataset using an iterative data generation framework. The main difference of this dataset (and framework) from prior work is that it uses expert trajectory data to label some QA. This is what is being referred to as proactive QA. The whole premise is that expert data should be useful to train imitation learning policies. The data generation framework uses multiple VLMs in the loop, where one model generates while the other verifies, and several rounds happen before the answer is accepted. Using this data, a planning framework (AQ-Nav) is proposed. AQ-Nav builds a topological map and a navigation module for next-subgoal prediction. This waypoint (node) selection mechanism uses the textual information and the graph structure (which includes the visual features from all the different locations) to propose an action, which is either a node in the graph or a stop action. The main sell of the approach is that the same expert data can be used for imitation learning as well as for asking QA questions. Experiments are performed on the proposed ExploraQA dataset where the approach outperforms some older baselines and a more recent ETPONav baseline. The paper is overall well written and tries to tackle an important problem. Curating high quality datasets for navigation is still challenging and it's great to see a paper exploring it. Questions: In the related work, the paper mentions that the current paper "distinguishes itself by emphasizing the agent's active question answering capability …. where the agent can proactively seek clarification or additional information". But I am not really sure how this proactive behavior shows up in any of the evals. Can you please clarify this? I think using the word "proactive" seems a bit misguided. Proactive question answering would assume that the agent has some self-awareness and can reason about uncertainty, but that is not really what is being proposed here. From my understanding some of the proactive claims (e.g. proactive trajectories) should be substantially reduced. *Need for a paired dataset*: One of the claims that the paper makes is the need for a paired dataset from which navigation can be learned via imitation learning and which also provides QA pairs for video-navigation QA. However, I am a bit unclear about this need. Why can we not learn both of these capabilities from disjoint datasets? Maybe some small amount of paired data is useful but it doesn't seem super necessary to have a large scale dataset (which is what the paper is proposing) for these capabilities. Both navigation and embodied QA capabilities seem disjoint. The agent can quite easily learn navigation from hindsight labeled data while embodied QA pairs can be generated from both optimal as well as suboptimal data. Can the authors clarify the need for this large scale paired dataset? *Comparisons*: There are some recent works such as Saxena et al. which should be cited and compared against as well. 
Some of the techniques followed in these papers are quite similar to the proposed work. Saxena et al., "GraphEQA: Using 3D semantic scene graphs for real-time embodied question answering." Please see above. Fully human-written
ExploraQA: Embodied Question Answering with Long-horizon Proactive Exploration Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper addresses the problem of asking an agent to follow some instructions to navigate to a target location, and then answer questions about the destination. It introduces a benchmark (ExploraQA) and an agent baseline (AQ-Nav). To the best of my understanding, the ExploraQA task is as follows: * Follow the instructions to navigate efficiently to a target location (using the instructions and trajectories from VLN-CE in the Habitat simulator) * Answer a question based on views near the end of the trajectory (generated using the proposed pipeline) The authors also introduce the AQ-Nav agent and compare it to other baselines on the proposed benchmark. The details that are included in the paper are clearly presented, and the paper is well-organized. Some details (like prompts) are provided in the appendix. The authors show VLM-as-a-judge evaluation has high agreement with human annotators, matching the findings in OpenEQA. The AQ-Nav baseline has multiple components that the authors show an ablation of in Table 3. My main concerns about the paper are: 1. The problem introduced does not appear well-motivated to me, in that existing benchmarks already cover these capabilities. 2. The benchmark adds some LLM-generated (and human checked) questions on top of existing trajectories from other works. Some details are missing about what design parameters were used to ensure the generation was complete (high-recall). 3. The proposed agent is evaluated only on this new benchmark, not on other existing benchmarks -- which makes it difficult to evaluate the efficacy of the agent. **Task novelty and utility** The proposed task asks the agent to follow a given trajectory, based on language instructions, and then answer a question based on what the agent sees at the destination. The episode is scored based on how efficiently the agent follows the trajectory, and whether the answer is correct. Both the instruction-guided trajectory following [1] and question answering [EQA] tasks themselves are from existing work. Putting these two tasks together does not seem to require substantially different capabilities than what is already required in existing benchmarks: * The trajectories for instruction-guided trajectory following are directly used from [1] * The question-answering from a trajectory is also covered in OpenEQA: the EM-EQA setting requires agents to answer questions from a predetermined video, and the A-EQA setting requires agents to determine how to navigate around the environment to find the answer. Both of these benchmarks use the Habitat simulator. **Question generation:** There are very few details about the question generation provided. * How do the authors guarantee coverage of useful or important questions that a user might ask? As best I can tell, the diversity comes from whatever GPT-4o generates from the prompt in Figure 6. * LLMs are known to collapse modes and produce relatively low diversity. For example, ask ChatGPT to tell a joke -- it will usually repeat the same 3 or 4 jokes. So I wonder how the authors guarantee good coverage of what humans might ask in ExploraQA? 
There are also few details about design decisions used to develop the QA evaluator: * How many feedback iterations does this usually take? How does the acceptance rate increase with more iterations? * What design decisions or facts about the prompt were important when designing the system? * The authors state that this uses Claude Sonnet 3.5, which is known to have relatively weaker visual understanding compared to GPT-4, Qwen, or Gemini. **Agent evaluation** Does the agent generalize to other simulators and other question types, navigation instructions, or to longer trajectories? For [1], it would also be helpful to compare to the SotA on that benchmark, and ideally on other benchmarks in other simulators if they exist. [1] Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments * The authors mention "proactive navigation" and "proactive" several times but I am unfamiliar with the term. Is "proactive navigation" the task in VLN-CE? * In S3.1, the paper says the viewpoints are selected s.t. "The relative orientation between poses is managed to guarantee non-overlapping visual coverage." Fully human-written
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes Fast-dLLM, a training-free inference acceleration framework for diffusion-based large language models (dLLMs). Unlike autoregressive LLMs, diffusion LLMs cannot directly leverage KV caching or efficient parallel decoding due to their bidirectional denoising process. The authors introduce two key techniques to address this: * Block-wise Approximate KV Cache: exploits temporal similarity between diffusion steps to reuse attention states across steps, enabling cache reuse in bidirectional attention without retraining. * Confidence-Aware Parallel Decoding: decodes multiple tokens simultaneously when their predicted confidence exceeds a threshold, with theoretical justification showing approximation to sequential decoding under high-confidence conditions (a minimal sketch of this thresholding rule follows this review). Experiments on several diffusion LLMs (e.g., LLaDA, LLaDA-V, Dream) across text, code, and multimodal reasoning tasks demonstrate up to 27.6x speedup with negligible accuracy loss. 1. Timely and practically relevant: addresses the major efficiency bottleneck of diffusion-based LLMs, a direction gaining interest. 2. Training-free approach: requires no retraining or model modification, making it directly applicable to existing dLLMs. 3. Comprehensive evaluation: covers both text and multimodal reasoning tasks, with consistent gains across benchmarks. 4. Strong empirical results: large acceleration factors (up to 27.6x) with small degradation make the method attractive for deployment. 1. Applicability to distilled or few-step diffusion LLMs. It remains unclear whether the proposed caching and confidence-aware decoding strategies would remain effective for distilled diffusion LLMs that operate with only a few or even a single denoising step (e.g., dParallel, arXiv:2509.26488; One-Step Diffusion LLM, OpenReview:P7OzWxOUHK). The reviewer acknowledges that these are concurrent works; still, such aggressive timestep reduction is becoming a key trend, similar to continuous diffusion distillation in image/video models. Caching-based acceleration mainly benefits multi-step teacher models, but may offer limited or no gain for few-step student variants without hurting accuracy, which restricts practical adoption. 2. Memory overhead. The block-wise KV caching mechanism likely introduces memory costs, especially for long sequences or large models. The paper does not quantify the memory–speed trade-off or report actual VRAM usage, which is important for understanding deployment feasibility. 3. Baseline comparisons. Recent concurrent and closely related works, such as dLLM-Cache (arXiv:2506.06295) and dPad (arXiv:2508.14148), are not compared or discussed. Even if concurrent, a conceptual comparison highlighting methodological similarities and distinctions would help position Fast-dLLM within this rapidly evolving research landscape. How sensitive are the results to the cache window size and confidence threshold? Could a learning-based mechanism adaptively set these parameters? What is the additional memory footprint introduced by storing bidirectional caches compared to AR caching?
Will the proposed caching mechanism be compatible with TensorRT deployment? Could the proposed methods extend to multimodal diffusion transformers (e.g., text-to-image or video diffusion models)? Fully AI-generated
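For readers unfamiliar with the decoding rule both Fast-dLLM reviews refer to, here is a minimal, hypothetical sketch of threshold-based confidence-aware parallel decoding for one denoising step of a masked diffusion LLM. It is not the authors' implementation; the function and variable names are assumptions made for illustration.

```python
# Minimal sketch (hypothetical, not the authors' code) of threshold-based
# confidence-aware parallel decoding for one step of a masked diffusion LLM.
import torch

def parallel_decode_step(logits, tokens, mask_positions, threshold=0.9):
    """logits: [seq, vocab]; tokens: [seq] current sequence containing mask ids;
    mask_positions: bool [seq] marking positions that are still masked."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)                 # per-position confidence and argmax token
    accept = mask_positions & (conf >= threshold)  # unmask all high-confidence positions at once
    if not accept.any():
        # Fallback: always commit at least the single most confident masked position.
        best = torch.where(mask_positions, conf, torch.full_like(conf, -1.0)).argmax()
        accept[best] = True
    tokens = torch.where(accept, pred, tokens)     # write accepted predictions into the sequence
    return tokens, mask_positions & ~accept        # updated tokens and remaining mask set
```

In the block-wise setting the reviews describe, this step would be repeated within the active block while approximate KV caches for the other blocks are reused and periodically refreshed.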
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates the slow inference speed of diffusion-based large language models, which stems from the lack of Key-Value (KV) cache support and from quality degradation in parallel decoding. The proposed Fast-dLLM method introduces a block-wise approximate KV cache tailored for bidirectional attention and a confidence-aware parallel decoding strategy. This strategy dynamically selects tokens above a confidence threshold, enabling efficient cache reuse and safe, training-free parallel token generation. 1. The method is training-free, making it an easily applicable plug-in for compatible diffusion language models. 2. It includes a comprehensive set of ablation studies analyzing the impact of key components and hyperparameters, such as cache block size, confidence threshold, cache variants, and generation/prefill length. 3. The empirical results are significant, showing marked improvements in throughput across multiple benchmarks with minimal accuracy loss. 1. The mechanism of 'DualCache' is not clearly explained; it is unclear how caching suffix tokens saves computation and accelerates inference. Additionally, more details on the extra memory overhead introduced by the KV cache need to be disclosed. In addition, the paper claims in Figure 2 that the cache update introduces “negligible overhead”, but it provides no concrete explanation or timing comparison. 2. The novelty of the block-wise KV cache contribution is unclear. The manuscript fails to sufficiently differentiate its caching method from existing work like Block-diffusion, and the experimental section lacks relevant diffusion model baselines. 3. The reported speedup ratios raise concerns about potential metric inflation. Furthermore, the datasets used are primarily focused on math and code problems, lacking experimental data on benchmarks specifically designed for evaluating inference acceleration. 4. The conceptual relationship and practical distinction between threshold- and factor-based strategies are insufficiently clarified. It is unclear whether the factor-based rule is a relaxed, adaptive, or independent variant, or under which conditions one should be preferred. A clearer rationale linking these strategies would improve comprehension and applicability. 5. While the paper qualitatively illustrates KV similarity, it does not provide quantitative measures showing how approximation errors change with block size, decoding length, or modality, making it difficult to assess cache mismatch risks. It would be helpful to report KV similarity decay curves across decoding steps for different block sizes and tasks, and to correlate these similarities with downstream task metrics (e.g., accuracy, EM, BLEU) to establish a “similarity threshold → refresh” rule. 6. The assumption of a “well-defined joint PMF with self-consistent marginals” may not hold for real diffusion LLMs trained approximately. The implications of this idealization are not discussed in detail, limiting the interpretability of the theoretical guarantees. 7. The paper does not include a comparison with autoregressive (AR) models.
As a result, it remains unclear whether the diffusion + Fast-dLLM approach is competitive with state-of-the-art AR systems in realistic serving scenarios. 1. The paper does not clarify how to automatically tune the commonly used 0.9 confidence threshold across different model scales, temperatures, or tasks, nor whether a simple calibration method (e.g., based on confidence–accuracy curves; a minimal calibration sketch follows this review) can be applied. It also leaves open how to select the factor-based hyperparameter $f$, and whether a universal default exists or task-specific tuning is required. 2. The paper does not quantify the computational cost (FLOPs or wall-clock time) of recomputing all blocks after completing one, nor does it explore whether incremental updates, such as refreshing only neighboring blocks or subsets of Keys/Values, could achieve comparable accuracy more efficiently. Profiling or throughput measurements would help clarify these trade-offs. 3. In the MathVista experiments, Fast-dLLM exhibited a noticeable performance degradation. Could you provide a detailed analysis of this? Lightly AI-edited
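As a concrete version of the calibration idea suggested in Question 1, the sketch below picks the smallest confidence threshold whose accepted tokens reach a target accuracy on a small held-out set of (confidence, correctness) pairs. This is a hypothetical illustration, not something the paper provides; the function name, grid, and defaults are assumptions.

```python
# Hypothetical sketch: pick a confidence threshold from a confidence-accuracy
# curve measured on a small held-out calibration set.
import numpy as np

def pick_threshold(confidences, correct, target_acc=0.99, grid=None):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    grid = np.linspace(0.5, 0.99, 50) if grid is None else grid
    for t in grid:  # smallest threshold whose accepted tokens meet the accuracy target
        sel = confidences >= t
        if sel.any() and correct[sel].mean() >= target_acc:
            return float(t)
    return float(grid[-1])  # fall back to the most conservative threshold

# Example with synthetic calibration data (toy model: accuracy tracks confidence).
rng = np.random.default_rng(0)
conf = rng.uniform(0.4, 1.0, size=2000)
corr = rng.uniform(size=2000) < conf
print(pick_threshold(conf, corr, target_acc=0.95))
```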