ICLR 2026 - Reviews

SubmissionsReviews

Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 15899 (21%) 4.43 3.58 3687
Heavily AI-edited 3233 (4%) 4.22 3.59 2990
Moderately AI-edited 7082 (9%) 4.20 3.61 2722
Lightly AI-edited 16648 (22%) 4.15 3.68 2746
Fully human-written 32938 (43%) 4.13 3.62 2917
Total 75800 (100%) 4.21 3.62 3026
Title Ratings Review Text EditLens Prediction
SpintBench: Evaluating LLMs' Complex \\ Reasoning via Spatial Integration Challenges Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a benchmark construction framework for evaluating spatial reasoning in both 2D and 3D spaces. It designed rules to automatically generate spatial descriptions of local scenes with overlapping cues, as well as corresponding question-answer (QA) pairs. It shows SOTA LLMs still struggle in their proposed benchmark SpintBench. 1. This paper find a meaningful automatic generation method to create benchmark which measures a weakness of current reasoning models, which is interesting. 1. This paper is lack of description of the differences between this work and similar previous works, for example StepGame, although relevant works already mentioned in this paper, but they has a lot of similarity with this work which is ignored and didn't mention in this paper, e.g. some of them also are multi-hop spatial reasoning using so called transitive inference. Therefore, I cannot confirm the novelty of this paper. Moreover, there are existing similar works which this paper didn't discuss, e.g. [1], they also proposed to use automatic generated benchmark for testing, and discussed scenarios of 2D and 3D, 'shuffled' and without shuffled. Can you tell the differences of your work with theirs? 2. Although this paper claimed its about spatial reasoning, but in my understanding, its more like symbolic reasoning based on spatial description. Unlike some spatial reasoning benchmarks which contains visual information requiring models to conduct 'real' spatial reasoning, the proposed benchmark require models has symbolic reasoning abilities, computation abilities, and memories. 3. Lack of human performance. Although a lot of models are compared and evaluated in this paper, this paper didn't discuss and cover the performance of human in their proposed task. [1] Chain-of-Symbol Prompting for Spatial Reasoning in Large Language Models. Please refer to the weaknesses. Fully human-written
Label-Free Attribution for Interpretability Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper examines how using a target label in attribution can bias explanations. It proposes a class-agnostic attribution that aggregates class evidence without conditioning on a specific label, paired with revised evaluation metrics intended to reduce label-driven bias. Experiments on standard image classifiers report results benchmarked using insertion and deletion metrics. 1. The impact of label choice on attribution is a meaningful topic. 2. The paper is well structured and easy to follow. 1. The target of this work is to attribute the effects of different classes. However, the method attributes to the sum over classes, producing class-agnostic maps. This collapses inter-class contrasts and weakens directional interpretability (cannot say “why A class over B class”), which is also problematic for tasks where class-specific reliance matters. 2. The paper claims that label conditioning causes information ignorance. However, softmax modeling already encodes mutual suppression among classes, and many existing attribution works explicitly use class-contrastive objectives [1]. In contrast, the objective of this work is unclear, and it fails to clarify the problems that introducing category information might cause. 3. Empirical evidence relies mainly on pixel-perturbation families (insertion/deletion games). To strengthen claims, include fidelity tests and distribution-robust benchmarks (e.g., ROAR/ROAD or other fidelity tests) to assess whether gains persist beyond pixel masking or after mitigating input distribution shift. [1] Wang, Yipei, and Xiaoqian Wang. "“Why Not Other Classes?”: Towards Class-Contrastive Back-Propagation Explanations." Advances in Neural Information Processing Systems 35 (2022): 9085-9097. 1. What failure modes follow from losing class directionality, and how would you differentiate between classes within your framework? 2. What is the distinct advantage of removing labels altogether compared to class-contrastive attribution objectives? Concretely, what specific problems arise from introducing label information that your method avoids? 3. There are some negative GAP values in Tables 2 & 3. Why do some methods report GAP < 0 (i.e., deletion curves outperform insertion, implying inverted explanations)? What does this mean for an attribution method? Lightly AI-edited
Label-Free Attribution for Interpretability Soundness: 3: good Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper porposes an attribution algotihm called Label-Free Attribution for Interpretability (LFAI) that aims to improve the limitations of gradient-based attribution methods. The authors argue that current gradient-based attribution methods can lead to two key limitations: information ignorance and extra information, caused by the methods using class information/labels to help interpret model decisions. LFAI on the other hand analyzes model decisions without introducing class information. The method is primarily applied to image classification tasks, and shows competitive performance compared to other methods in experiments and evaluation metrics. The paper is well organized and the algorithm they developed (LFAI) is well explained. There are also many experiments that show the reader how LFAI perfromance compares to other methods, and it seems LFAI performs the best making this a strong contribution to the field. I think it could be explained in a bit more detail why the authors believe attribution methods should not rely on labels. If you're explaining model behaviour, models are trained with the labels, so why should the attribution method ignore that? I think this is a slightly more debated topic in the field, so a bit more justification would be good! With all methods there are some limitations (or at least trade-offs), it would be good to see the authors discuss what they think could be the limitations of LFAI. I think addresses the above weaknesses so: 1. What is the author(s) position on the debate on whether an explanation or attribution method should also use labels to develop the explanation? 2. What are the limitations of LFAI? Fully human-written
Label-Free Attribution for Interpretability Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper identifies two critical biases in gradient-based attribution methods, termed "Information Ignorance" and "Extra Information," which arise from the dependency on a single target class, particularly in low-confidence scenarios. To address this, the authors propose a novel Label-Free Attribution for Interpretability (LFAI) algorithm that generates explanations by maximizing the model's output uncertainty rather than focusing on a specific class logit. Furthermore, the paper introduces a more robust evaluation framework, including a Confusion Feature Algorithm (CFA) for creating unbiased baselines and new KL-divergence-based metrics (KL-INS/DEL). Extensive experiments demonstrate that LFAI significantly outperforms state-of-the-art methods, especially on low-confidence samples. 1. The paper effectively identifies the problems of "Information Ignorance" and "Extra Information." By focusing on low-confidence samples, it highlights an under-explored weakness in existing attribution methods, providing a strong motivation for the proposed work. 2. The experimental validation is thorough. The authors compare LFAI against a wide range of SOTA baselines across multiple standard models and datasets. The superior performance provides strong empirical support for the paper's claims. 1. The definitions of "Information Ignorance" and "Extra Information" are primarily illustrated through examples and feel somewhat subjective. It would strengthen the paper if the authors could propose quantitative metrics to measure the extent of these two phenomena in existing attribution methods, moving beyond the conceptual formulas provided. 2. There appears to be a strong coupling between the proposed method and the proposed metric. LFAI is designed to maximize uncertainty (entropy), while the KL-INS/DEL metrics are designed to measure changes in uncertainty. Could the outstanding performance of LFAI on the KL metrics be a result of this "self-serving" evaluation, where the method is optimized for the very quantity the metric evaluates? 3. Equation (3) for entropy appears to be missing a negative sign. The standard definition of information entropy is H(x) = -Σ P(x)log(P(x)). Maximizing entropy is equivalent to minimizing Σ P(x)log(P(x)). 4. There are several minor formatting issues with citations that should be corrected for clarity and consistency. For example: - Line 40: "(IG)Sundararajan et al. (2017)" should be "(IG) (Sundararajan et al., 2017)". - Line 43: "(AGI)Pan et al. (2021)" should be "(AGI) (Pan et al., 2021)". - Line 114: "in (Zhu et al., 2024; 2023)" should be "in Zhu et al. (2024; 2023)". Please address the questions raised in the Weaknesses section. Moderately AI-edited
Label-Free Attribution for Interpretability Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a gradient-based attribution algorithm designed to interpret model decisions without requiring class label information. The motivation arises from the observation that current gradient-based attribution methods depend on predefined class labels, which can cause two biases: information ignorance (overlooking relevant non-target features) and Extra Information (incorrectly emphasizing irrelevant features). The proposed method redefines gradient accumulation to be label-agnostic, using the summed log-probability of all classes rather than a single class output. The paper also introduces new evaluation metrics to assess attribution quality and model uncertainty. Extensive experiments on Inception-v3, ResNet-50, VGG16, and additional models demonstrate that proposed method outperforms existing methods. 1. The problem of bias in class-guided attributions is real and worth studying. 2. The paper includes extensive experiments with multiple baselines and models. 3. The paper is well-written and easy to understand. 1. The distinction between Information Ignorance and Extra Information is presented as a new discovery, but these are fundamental and long-recognized challenges in attribution methods. Existing attribution techniques either fail to identify truly important features or incorrectly highlight irrelevant salient regions. 2. . The proposed label-free formulation, which aggregates the log probabilities across different classes instead of relying on a specific label, is not truly label-free but rather label-independent. It is therefore recommended that the authors restate or clarify this problem definition. 3. In Figure 2, I am not convinced that the results produced by LFAI represent the best outcome. Interpretability methods are expected to remain faithful to the model’s internal decision process rather than to align with human-perceived accuracy of attribution. 4. The formula annotations are insufficient, and many notations lack clear definitions, for example, in Equations (4) and (5). Please refer to the Weaknesses. Moderately AI-edited
Label-Free Attribution for Interpretability Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces LFAI (Label-Free Attribution for Interpretability), a new gradient-based attribution method that does not rely on specifying a class label when explaining a model prediction. Instead of asking “why class y?”, LFAI integrates gradients of the sum of log-probabilities over all classes along an adversarial-style path, with the goal of capturing all evidence the model used — including evidence for alternative classes — and avoiding biases introduced by conditioning on a single class. The authors argue that standard attribution methods suffer from two systemic problems: 1. **Information Ignorance**: they ignore features of other plausible classes, so they can’t explain model uncertainty or low-confidence predictions; 2. **Extra Information**: they sometimes assign importance to irrelevant background pixels because the chosen “target class” forces the method to rationalize that class even when the model itself isn’t actually relying on those regions. The paper also proposes new evaluation metrics (Fair Insertion/Deletion and KL-based variants) that try to remove the bias of using black/zero baselines and instead use a “maximally confusing” baseline image that maximizes predictive entropy, plus KL-based measures of how quickly the model’s uncertainty changes when adding/removing top-ranked pixels. On benchmarks using 1000 ImageNet images and standard CNNs (Inception-v3, ResNet-50, VGG16), plus additional experiments (ViT, CIFAR100) in the appendix, the authors report that LFAI beats 11 existing attribution methods across both high-confidence and low-confidence cases, especially in low-confidence regimes where traditional class-conditioned attribution tends to fail. Main strengths: - **Label-free attribution that directly targets a convincing gap in gradient-based XAI**, avoids conditioning on a single class, which the authors argue induces bias. - **Attempt to formalizing two failure modes with set-based definitions**: _Information Ignorance_ = truly relevant pixels not highlighted (missed mass); and _Extra Information_ = irrelevant pixels highlighted (spurious mass). - The papers even makes a **second, parallel, contribution on evaluation**: _Fair Insertion/Deletion_ (via a confusion-baseline) and _KL-based_ variants to assess distribution-level faithfulness. - **Method & metric alignment**: the distribution-aware objective (aggregate over all classes) pairs naturally with distribution-aware metrics (KL-INS/DEL), yielding a coherent story for uncertainty and multi-object scenes. - **Empirical relevance**: results emphasize low-confidence regimes where classic class-conditioned saliency underperforms, with additional analyses referenced in Section 4.4 / Appendix. - **Reproducibility**: code is (anonymously) released, facilitating verification and uptake. **MAJOR POINTS** - **Core definitions lack precision / clarity** (which can prevent from full appreciation of the cool work in the paper): The set-based definitions of _Information Ignorance_ and _Extra Information_ are hard to parse as written (quantifiers, what is fixed vs. varying, and what $\Phi$ denotes). For example, authors write “$\exists |\varphi|\ge k$ s.t. $\varphi=(i\mid i\in \Phi \land a_i<\tau)$” and “$\exists |\varphi|\ge k$ s.t. $\varphi=(i\mid i\notin \Phi \land a_i\ge\tau)$” without specifying whether $k$ and $\tau$ are fixed ex-ante, how $\Phi$ is defined/measurable, or whether existence is trivial by tuning $\tau$? These need explicit quantifiers (“for fixed $k,\tau$ …”) and a concrete operationalization of $\Phi$ (e.g., object masks, counterfactual evidence) to avoid vacuity. - **Equation 5 is under-motivated and its link to II/EI is not made explicit**: Why is the functional form $\frac{\partial}{\partial x_t} \left( \sum_{j} \log P_j(x_t) \right) = \sum_{j} \frac{1}{P_j(x_t)}, \frac{\partial P_j(x_t)}{\partial x_t}$ chosen, as opposed to others? A brief justification/discussion might help convince the reader to accept it. But most importantly, how does it theoretically help reduce II/EI? The paper would greatly benefit from a direct link between eq (5) and the mitigation of II/EI. A final _minor_ point related to the functional form: $\frac{1}{P_j(x_t)}$ can overreact to tiny probabilities; path-integrals may also be path/baseline-dependent and subject to OOD drift if the path leaves the data manifold. Any comments on this? **MINOR POINTS** - **Method–metric alignment risk.** The CFA baseline and KL-variants seem to be directly aligned with the label-free, distributional objective. That coherence is nice, but it may advantage your method by design, which should be made clear in your paper for honesty purposes. Not sure if it would be productive to report standard Insertion/Deletion (black baseline) results as well? - **“Class-independent” wording**: the method is better described as class-agnostic or label-free (it aggregates over all classes) rather than “class-independent,” which could be misread as not using class probabilities at all. - **CFA / entropy sign**: In the CFA definition, double-check the entropy sign: maximizing uncertainty means maximizing $H(P)=-\sum_j P_j\log P_j$. If Eq. 3 omits the minus, that’s likely a typo? Please respond to the weakness listed above. Lightly AI-edited
Bridging Discrete and Continuous RL: Stable Deterministic Policy Gradient with Martingale Characterization Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 1: You are unable to assess this paper and have alerted the ACs to seek an opinion from different reviewers. The paper studies the setting of policy gradient in continuous time and state-action spaces. The authors essentially rely on a martingale characterization to implement PG in a continuous setting. The PG theorem derived is analogous to the discrete counterpart with the advantage rate function taking the place of the advantage function. The authors list a bunch of regularity conditions to invoke the existence of Bellman equation and so on. The policies are deterministic mappings from state to action. The authors then argue that TD(0) is fundamentally incompatible in the continuous setting and propose a multi step TD approach in its place. The proposed algorithm is experimentally validated. The paper considers continuous RL which is technically challenging and not that well studied. Identifies the fundamental failure of TD(0) in this setting. Considers deterministic policies which reduces computational cost that is incurred due to sampling. Considers the finite horizon setting only. What are some challenges when it comes to extending the theory to the infinite horizon settings? The regularity assumptions _can_ be strong and may not always be met. What is the form of Q and V in line 182? How is the neighbourhood O determined in line 196? Fully human-written
Bridging Discrete and Continuous RL: Stable Deterministic Policy Gradient with Martingale Characterization Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper formulates RL directly in continuous time (finite‑horizon SDE dynamics) and derives a DPG (Deterministic Policy Gradient) formula using an *advantage rate* (A_\phi(t,x,a)\coloneqq \mathcal{L}[V_\phi](t,x,a)+r(t,x,a)), where (\mathcal{L}) is the generator of the diffusion. The main result (Thm. 3.1) expresses (\partial_\phi V_\phi) as an expectation over (\partial_\phi \mu_\phi^\top \partial_a A_\phi), i.e., a clean continuous‑time analogue of discrete‑time DPG. Thm. 3.2 characterizes both the value (V_\phi) and the advantage rate (A_\phi) via a *martingale orthogonality* condition. Practically, the authors enforce this by minimizing a “martingale loss” with test functions chosen as the networks’ parameter gradients and by reparameterizing the critic to satisfy (q_\psi(t,x,\mu_\phi(t,x))=0) (Eq. 4.1). This yields an implementable objective without sampling over actions. Building on this the paper proposes **Continuous‑Time DDPG (CT‑DDPG)** (Alg. 1). Key ingredients: (i) a *multi‑step* TD objective over a short fixed window (Lh=\delta), (ii) a target value network, replay buffer, and Gaussian exploration noise, and (iii) a discrete‑time implementation robust as (h\to0). The “martingale loss” is the core critic objective (Eq. 4.2). The DPG formula (Thm. 3.1) and the martingale identification (Thm. 3.2) are clean and connect continuous‑time stochastic control with practical deep RL estimators. The reparameterization (q_\psi=\bar q_\psi-\bar q_\psi(\cdot,\mu_\phi)) enforces the Bellman side condition (Eq. 4.1). The variance analysis (Props. 4.1–4.2) explains a widely observed pathology: one‑step TD becomes unusable as (h) shrinks, whereas multi‑step with fixed physical window (\delta) remains well‑behaved. CT‑DDPG combines these ideas into a simple recipe (Alg. 1, p. 7–8): multi‑step critic, target network, replay, plus an actor trained by the estimated (q). The experiments consistently show improved stability under smaller (h) and higher noise, the regimes where discrete‑time baselines degrade (Fig. 1). The paper proves bounded‑variance gradients and non‑vanishing signal for the critic objective (Props. 4.1–4.2), but it does **not** provide convergence or stability guarantees for the *actor–critic* recursion as a whole (with function approximation, target nets, replay, and exploration noise). The term “provable stability and efficiency” (Sec. 4) overstates what is established. Clarifying the exact scope of guarantees would help. Discrete‑time baselines use standard one‑step updates; the paper does not include multi‑step or TD(λ) versions of DDPG/SAC that might mitigate the very issue analyzed here. Among continuous‑time methods, the comparison is to a stochastic‑policy q‑learning implementation; there is no head‑to‑head with continuous‑time PPO/TRPO‑style approaches cited in Sec. A (related work). A stronger baseline suite would isolate the benefits of deterministic policies from those of *multi‑step TD*. Experiments inject i.i.d. Gaussian generalized forces at the MuJoCo level. While sensible, this perturbation may not match the SDE regularity assumptions used in analysis (e.g., uniform ellipticity in Prop. 4.1’s setup). Some discussion of this modeling gap would be needed. Theorems assume (V_\phi\in C^{1,2}) and smooth dependence on parameters/actions (Assumptions 1–2, Sec. 3). In practice, ReLU networks violate (C^1) smoothness; the paper informally argues Lipschitzness suffices (Asp. 1), but it is unclear whether subgradient analogues cover the results used in Thm. 3.1/3.2. A brief justification or reference would strengthen the story. Identification of (A_\phi) holds in a *neighborhood* (O_{\mu_\phi(t,x)}) of the policy action (Eq. 3.7). This entails a coverage condition on exploratory actions and suggests potential brittleness far from the current policy. The algorithm’s off‑policy robustness as exploration grows is not directly tested. CT‑DDPG hinges on window length (\delta=Lh) and the choice of test functions in the martingale loss. Ablations over (L), (\delta), the terminal‑value penalty weight, and the set of test functions would clarify how performance depends on these choices. (Alg. 1; Eq. 4.2.) Tasks are standard but mid‑scale; there’s no Ant/Humanoid evaluation. The return curves (Figs. 1–2) demonstrate trends, but sample‑efficiency comparisons in terms of environment interactions (not just episodes) would make claims about efficiency more concrete. Add *n‑step* or TD(λ) DDPG/SAC and TD3; test whether multi‑step alone narrows the gap to CT‑DDPG when (h) is small. Show learning speed/variance vs. (\delta) at fixed (h) and vs. (L) at different (h). Probe how performance changes with exploration noise magnitude; does the martingale identification break when actions stray far from (O_{\mu_\phi})? (Eq. 3.7.) Beyond (\zeta_t=\partial_\theta V_\theta) or (\partial_\psi q_\psi), try basis functions in (t) and (x) to assess bias/variance of the martingale loss. (Sec. 4.1.) Include continuous‑time TRPO/PPO‑style methods from the related‑work section and report stability under small (h). Compare the current MuJoCo force‑noise to process‑noise injected at the state‑equation level to align more closely with the SDE assumptions. (Sec. 5.) Fully AI-generated
Bridging Discrete and Continuous RL: Stable Deterministic Policy Gradient with Martingale Characterization Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a principled framework for deterministic policy gradient (DPG) in continuous-time reinforcement learning (CTRL). The authors derive a continuous-time analogue of the advantage-based policy gradient theorem and establish its martingale characterization, providing a theoretical foundation for stable deterministic policy optimization in continuous domains. Building upon this theory, they introduce a new algorithm, CT-DDPG, designed to mitigate the instability and discretization sensitivity often observed in discrete-time RL algorithms when applied to continuous dynamical systems. Experimental evaluations on several continuous control benchmarks demonstrate improved convergence stability and learning efficiency compared to both discrete-time and prior continuous-time baselines. Overall, the paper aims to bridge the theoretical and algorithmic gap between discrete-time RL methods and real-world continuous control systems. 1. The paper establishes a mathematically solid foundation for deterministic policy gradient (DPG) methods in continuous-time reinforcement learning. The authors derive a continuous-time analogue of the advantage function (Theorem 3.1) and rigorously prove its martingale characterization (Theorem 3.2), connecting the policy gradient to the advantage-rate function under deterministic policies. This framework generalizes the discrete-time DPG results (Silver et al., 2014) while removing restrictive assumptions such as uniform ellipticity and purely stochastic policies. The martingale-based formulation elegantly guarantees consistency of the value and advantage functions and provides a clear theoretical pathway for algorithm design. 2. The paper directly targets the issue that discrete-time RL algorithms become unstable as the time-step $h\to0$. The authors provide a precise theoretical explanation for this degradation: one-step temporal-difference (TD) updates cause gradient variance to blow up (Proposition 4.1). To overcome this, they introduce the Continuous-Time Deep Deterministic Policy Gradient (CT-DDPG) algorithm using multi-step TD objectives (Eq. 4.7), proving its variance remains bounded (Proposition 4.2). 1. The paper’s theoretical framing and algorithmic contribution appear incremental relative to prior work in continuous-time reinforcement learning (CTRL). While it emphasizes the use of a martingale characterization to derive a deterministic policy gradient (DPG) formula, the idea is conceptually close to existing stochastic-policy frameworks (Jia & Zhou, 2022a; 2022b; 2023) and the model-based DPG formulations of [5]. The paper does not convincingly clarify why the martingale representation provides a fundamentally new perspective or practical advantage over Itô-based stochastic analysis or existing actor–critic formulations. Moreover, the literature review is narrow and overlooks several recent studies towards same target. For example: [1] systematically analyzes multiple discretization strategies (MSS) for model-based CTRL; [2] proposes a time-adaptive sensing and control approach that jointly optimizes action and duration, directly tackling the same step-size sensitivity problem that this paper aims to address; [3] focuses on improving sample and computational efficiency while maintaining performance; and [4] theoretically studies how, for stochastic dynamics, the observation interval should adapt to the system’s variance. Although the authors cite [5], the discussion remains superficial—[5] already develops a continuous-time actor–critic framework that learns a deterministic policy gradient under ODE dynamics, conceptually similar to this work. The paper does not clearly articulate how its CT-DDPG algorithm differs from or improves upon these established baselines, many of which already achieve comparable goals such as variance control, discretization invariance, and stability. A more thorough comparative discussion and empirical ablation referencing these works would be necessary to substantiate the claimed contributions. [1] Treven, L., Hübotter, J., Sukhija, B., Dorfler, F., & Krause, A. (2023). Efficient exploration in continuous-time model-based reinforcement learning. _Advances in Neural Information Processing Systems_, _36_, 42119-42147. [2] Treven, Lenart, et al. "When to sense and control? a time-adaptive approach for continuous-time RL." _Advances in Neural Information Processing Systems_ 37 (2024): 63654-63685. [3] Zhao, R., Yu, Y., Zhu, A. Y., Yang, C., & Zhou, D. Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation. In _The 41st Conference on Uncertainty in Artificial Intelligence_. [4] Zhao, R., Yu, Y., Wang, R., Huang, C., & Zhou, D. (2025). Instance-Dependent Continuous-Time Reinforcement Learning via Maximum Likelihood Estimation. _arXiv preprint arXiv:2508.02103_. [5] Yildiz, C., Heinonen, M., & Lähdesmäki, H. (2021, July). Continuous-time model-based reinforcement learning. In _International Conference on Machine Learning_ (pp. 12009-12018). PMLR. 2. The set of stochastic baselines is insufficient. It is unclear why the paper does not include comparisons with ctrl methods such as those proposed in [2] and [5]. In Figure 2, the only continuous-time baseline shown is the q-learning algorithm. Since the authors already cite [5] in the main paper, it would be natural—and important—to include it as a baseline to better demonstrate the claimed advantages of the proposed deterministic approach. 3. The reported results also appear somewhat counter-intuitive. In Figure 2(c) (Hopper, h = 0.008, $\sigma$ = 0), the q-learning (L = 1) baseline exhibits a substantially larger error bar than in panels (g) and (k), even though the noise level is lower. This behavior is unexpected and not explained in the paper—it raises questions about potential implementation differences, instability in training, or inconsistent experimental conditions. Please compare with the papers listed in the weakness part. I would be happy to raise the score if the authors can address the concerns as listed in the weakness part. Fully AI-generated
Bridging Discrete and Continuous RL: Stable Deterministic Policy Gradient with Martingale Characterization Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes CT-DDPG, an discrete RL algorithm in the continuous RL paradigm. The main idea is to use the martingale orthogonality characterization. Experiments have been conducted on Pendulum-v1, HalfCheetah-v5, Hopper-v5, and Walker2d-v5. The paper is well-written, and I mostly enjoyed reading it. The theoretical results are solid. The numerical experiments also support the proposed approach. Below are a few comments: (1) The introductory part seems to be too long, e.g., Thm 3.1 and 3.2 are basically known from the literature. It isn't until Section 4 (p.5) that new results come. The authors may think of a bit reorganization. (2) The idea of applying Robbins-Monroe in continuous RL was also explored in Regret of exploratory policy improvement and q-learning, arXiv:2411.01302, Tang and Zhao -- but using the martingale loss approach. This paper considers a different martingale orthogonality approach. See also On Optimal Tracking Portfolio in Incomplete Markets: The Reinforcement Learning Approach, SICON, Bo, Huang and Yu for a similar analysis in a specific setting. (3) Proposition 4.2 (and also 4.1): the authors proved the variance blow-up/no blow-up. However, there is no analysis on the proposed algorithm. What is (roughly) the value function gap in terms of h? Also the proposed algorithm uses the number of steps of order 1/h -- which is exploding. The authors may provide some explanations. (4) Is there any insight how to choose h in practical problems? See weaknesses. Fully human-written
HIGH-AVATAR: Hierarchical Representation for One-shot Gaussian Head Avatar Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a one-shot framework for animatable 3D head avatar reconstruction using a hierarchical Gaussian representation. The model enables dynamic LOD rendering from a single image by combining transformer-based global features, projection-sampled local features, and occlusion-aware fusion guided by depth buffers. Additionally, a multi-region decomposition models head and shoulders separately, enhancing completeness. - Novel hierarchical representation that allows scalable LOD-based rendering. - Occlusion-aware feature fusion effectively balances local detail and global semantics. - Multi-region modeling significantly improves shoulder reconstruction quality. - The paper demonstrates high inference speed (85–120 FPS) while maintaining competitive visual fidelity. - Clear ethical considerations and reproducibility statement. - Overall results appear slightly over-smoothed. Fine-grained wrinkles (e.g., frown lines, crow’s feet during squinting) are often missing, leading to a soft appearance. - The oral area exhibits noticeable artifacts and blur, especially in wide-open mouth expressions. This affects perceived realism. - Although the paper claims to produce a “3D avatar,” no explicit novel-view rendering or side-view synthesis results are shown. This weakens the 3D avatar claim, even if acknowledged in the limitations. - Comparison gaps and weaker performance vs. 2D baselines - Despite the claimed advantages of 3D Gaussian representations under large pose variations, the visual quality still lags significantly behind strong 2D reenactment methods such as XNEMO, even in those large-pose scenarios where the authors argue 3DGS should excel. - the paper lacks discussion with other advanced 3D or Gaussian-based approaches (e.g., AVAT3R, HeadGaP), which would provide a more balanced and convincing evaluation of the method’s position within the current landscape. - The proposed method is conceptually close to GAGAvatar. - How does the model handle unseen poses (e.g., ±90° head rotation) or extreme lighting? Heavily AI-edited
HIGH-AVATAR: Hierarchical Representation for One-shot Gaussian Head Avatar Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents a one-shot drivable Gaussian head generation framework that leverages a multi-LOD Gaussian subdivision scheme and a depth-based feature fusion strategy to improve geometric detail and visual expressiveness. While the idea is interesting, the technical novelty and experimental validation are not convincing. Although the authors claim to outperform recent state-of-the-art methods, the supplementary videos indicate that the approach still suffers from significant visual and structural issues. 1. The paper is clearly written and the overall pipeline is easy to follow. 2. The use of a z-buffer–based feature fusion mechanism is an interesting idea that may inspire future extensions. 3. The proposed subdivision strategy can effectively reduce the computational overhead of cross-attention. 1. The claimed architectural novelty appears overstated. The proposed shoulder mask introduces unnecessary design complexity and resembles over-engineering. Such techniques have already been extensively explored in NeRF-based systems without offering substantial improvements. Moreover, 3D Gaussian frameworks inherently support learning positional offsets to refine geometry. The hierarchical coarse-to-fine feature sampling strategy is also insufficiently justified. 2. The experimental validation raises serious concerns. Despite the claim of outperforming SOTA methods, the supplementary videos reveal evident artifacts — the teeth are blurry, and the shoulder and boundary regions contain visible distortions. This contradicts the paper’s claimed performance advantages. 3. The reconstructed 3D head geometry shows noticeable inward deformation, especially in the hair region, as visualized in Fig. 2. 4. The choice of reference images could be improved by including more diverse facial expressions. The current demonstrations all use closed-mouth expressions, which may explain the degraded quality of the mouth and teeth regions. 5. When AIGC-generated source images are used, many fine-grained motion details are lost, indicating limited generalization capability of the method. 1. The model is trained only on the VFHQ dataset. Could the authors clarify whether the model generalizes well to other datasets or in-the-wild scenarios? 2. Could the authors explain the rationale behind using the coarse-to-fine (multi-LOD) design in the architecture? Why would directly using the same number of V2 points for cross-attention be suboptimal? Lightly AI-edited
HIGH-AVATAR: Hierarchical Representation for One-shot Gaussian Head Avatar Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper present HIGH-Avatar, a one-shot 3D Gaussian head reconstruction framework using hierarchical representations. This method combines transformer-based global features and projection-based local features, fusing them with depth-guided occlusion awareness. It supports multi-level-of-detail modeling and separately reconstructs head and shoulders. * This method balances detailed representation with computational efficiency through hierarchical subdivision and feature fusion. * The approach of combining global features and local features, and utilizing depth buffers for occlusion-aware fusion, is reasonable and effective. * The visualized results presented in this paper are difficult to capture the advantages of multi-level subdivisions. * Modeling the head and shoulders separately did not demonstrate advantages. (refer to questions.) * The paper lacks discussion and demonstration of limitations, extreme cases, and failure cases. * As mentioned in the paper, the 3D Gaussian head model relies on FLAME priors and accurate 3D deformable model (3DMM) tracking, and it is difficult to capture subtle facial expressions. * In the visual results shown in Figure 3, the results of LAM in the 2nd, 3rd, 4th, and 9th rows, etc., show significant misalignment with the driven expressions and are worse than other baseline methods. However, in the quantitative results, the AED of LAM is still very high. What causes this problem? * The detail advantages brought by multi-level subdivision are not obvious. For example, the white hair region in the third row of Figure 3, the beard region in the seventh row of Figure 3, and the scar on the head in the seventh row of Figure 12 indicate that the method may have difficulty capturing fine details. * From Figure 2, the separate modeling of the shoulder region does not seem very reasonable. It has already been modeled by sub0/1/2. Considering that it is essentially modeled as a rigid body without additional controllability, is the modeling of sub0/1/2 already sufficient? At the same time, from the visualization results in Figure 3, the visual effect of the shoulder after separate processing is actually worse than some baselines (especially Portrait4Dv2). * The results of the LAM method are missing in the speed comparison (Table 3). * In the ablation experiment, using only the global feature leads to a severe loss of details. What would happen if only the local feature is used? * From sub1 to sub2, although the number of Gaussian points increases significantly, the improvement in performance is very limited. What causes this? * The results in Figure 3 show that the hair region in the reconstruction/driving results is worse than the baseline methods Portrait4Dv2/GAGAvatar (with smoother textures and loss of details). Why is there no improvement in the hair region after multi-level sampling? * As can be seen from Figure 2, the predicted offset of the Gaussians in the hair region is clearly incorrect (even for the visible part of the frontal hair). What causes this? Will the same problem occur for frontal accessories such as glasses? Fully human-written
HIGH-AVATAR: Hierarchical Representation for One-shot Gaussian Head Avatar Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents HIGH-Avatar, a novel method for quickly generating high-quality 3D head avatars from a single image. By using hierarchical Gaussian representation, occlusion-aware feature fusion, and separate modeling of the head and shoulders, it achieves higher rendering quality with lower computational cost and supports real-time animation. Experiments show that it outperforms existing methods in image quality, rendering speed, and efficiency. 1. The paper is clearly written, well-structured, and rich in figures and reproducible details, making the method easy to follow and re-implement. 2. Extensive experiments on two large datasets with 12 baselines, thorough ablation studies, and multi-LOD evaluations demonstrate solid and convincing workload. 3. Novel contributions include hierarchical 3D Gaussian LOD representation, occlusion-aware global-local feature fusion, head-shoulder decomposition, and coarse-to-fine training, all working together for high-fidelity avatar generation. 4. Competitive overall performance. 1. Shoulder modeling relies on image-plane unfolding without 3D geometric priors, leading to texture stretching and geometric inconsistencies under large viewpoint or motion changes. 2. The method has not been tested on low-quality inputs (blur, low resolution, lighting variation, occlusion), so its robustness in real-world conditions is uncertain. 1. Regarding the propagation of local-feature sampling errors:As the mesh is subdivided, local features are obtained by projecting ever-higher-resolution vertices back to the single input image. Could the authors share any insight into how projection inaccuracies (e.g., small calibration or geometry errors) are prevented from being amplified at finer levels, and whether they might shift high-frequency details such as wrinkles or beard textures? 2. On the robustness of the occlusion test:The visibility mask is derived from a standard depth buffer. Have you observed cases where hair, glasses, or other thin structures produce depth conflicts, and might this lead to visible artefacts if local features are sampled from incorrectly "visible" pixels? 3. Saturation point of Gaussian count vs. quality:Sub#2 model already achieves excellent results with ≈29 k Gaussians. I’m curious whether you have experimented with even denser sets (say 100 k–200 k). At what point do further additions stop improving PSNR/LPIPS, and did you plot a full quality-vs-count curve to identify a saturation point? 4. Cross-identity reenactment with large appearance gaps:The paper shows convincing transfers across moderately different identities. I wonder if the authors tested situations where source and driver differ dramatically in gender, age, or ethnicity? If so, did expression fidelity or identity preservation degrade, and how might the method be extended to handle such large domain gaps more robustly? Fully AI-generated
Element2Vec: Build Chemical Element Representation from Text for Property Prediction Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper is concerned about learning representations for chemical elements (e.g., atoms) from text data such as Wikipedia webpages. The proposed representation consists of `Element2Vec-Global` and `Element2Vec-Locals`; the former is a representation of the whole relevant text data, while the latter is a collection of representations, each of which is obtained by prompting an LLM to focus on specific (pre-defined) attribute (details are in Figure 2). Since the number of all elements is limited, the authors also propose a test-time training approach for prediction, instead of the standard supervised learning (Section 4.3 ). The benefit of the proposed representations has been validated from several perspectives. Section 4.1 visually examines the validity of the proposed local representations, as compared against several different representations. Section 4.2 quantitatively examines the benefit of the local representation. Section 4.4 examines the effectiveness of the proposed embedding and test-time training method. - Learning a representation from text data is an interesting way of utilizing LLMs. - It is insightful that the authors have shown several ways to define local embeddings and have explained why the proposed embedding is selected among others. One of the major concerns is the empirical validation. As far as I am aware of, a quantitative validation is done on a task to predict the van der Waals radius of an element, without any existing methods to be compared. Since I am not an expert in materials science, I failed to understand the importance of the prediction task, and thus, I thought the experiment was rather a toy task rather than a real-world problem. In addition, without performance comparison with other existing methods, it is difficult to understand whether or not the proposed representation is useful or not in real applications. - I would like to ask the authors to clarify how $p_k(x)$ is computed in Section 4.2. - In Section 4,4, the authors state that "$R_{\mathrm{vdW}}$ is difficult to determine and not uniquely defined", and I'm curious about how the authors determine the ground truth labels. - I would like the authors to clarify the relevance of the van der Waals radius prediction task to real-world problems. Fully human-written
Element2Vec: Build Chemical Element Representation from Text for Property Prediction Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a framework named Element2Vec, which aims to use LLMs to extract chemical embeddings from the Wikipedia text for material property prediction. The embeddings consist of global embeddings that summarize entire text and local embeddings specific to attribute texts (e.g., mechanical, optical, thermal). To handle the high sparsity in chemical property data, the paper also designs a test-time training strategy based on self-attention, transforming property prediction into an imputation problem. Experimental results show that local embeddings achieve better clustering by element family in t-SNE visualization for classification. In property regression task, the global embeddings combined with the 'test-time training' strategy perform best under high data missing ratios. 1. This paper presents an interesting perspective, using large language models to extract embeddings from text for predicting chemical properties, which can capture richer contextual knowledge missing from traditional databases. 2. This paper proposes a Test-Time Training strategy, which treats all elements as a whole for 'imputation' prediction. According to the experiments in the paper, this method outperforms traditional inductive training methods at all data missing ratios. 1. The authors employ LLM to generate chemical representations from natural language text, however, there are several concerns. The paper fails to sufficiently address the issue of data leakage; since Wikipedia may already contain explicit numerical values or strongly correlated descriptions of the properties being predicted, it is unclear whether the model is learning genuine chemical relationships or merely retrieving and regurgitating memorized information. The authors need to provide a rigorous analysis to rule out this possibility. 2. Furthermore, the justification for choosing the specific Gemini embedding model over other general-purpose large models (like GPT) or domain-specific models (like MatSciBERT or SMILES-BERT) is insufficient. The paper needs to clearly articulate the specific advantages of the chosen method relative to these alternatives. 3. The paper's introduction to the global and local embedding generation methods lacks the necessary detail. The process of how text is segmented, summarized, and fed into the model to generate local embeddings is not clearly explained. Concurrently, the paper fails to provide a convincing scientific justification for the necessity of local embeddings. Although the authors hypothesize that local embeddings can highlight specific attributes, the empirical evidence provided appears contradictory, for instance, the results in Figure 7(b) show that the global embedding generally has a lower RMSE in property prediction than the local embeddings, which weakens the motivation for adopting the more complex local embedding method. This work resembles a simple text-processing workflow applied to the chemical domain rather than a substantial new contribution to chemistry or materials science. 4. The paper lacks modeling of the relationships between local representations. The authors generate independent vectors for attributes (e.g. atomic, chemical, and thermal), but in chemistry, these properties are deeply correlated and interconnected. The current method appears to treat them as mutually independent. 5. The paper does not provide sufficient detail regarding the dataset used for embedding generation and property prediction. Although Wikipedia is mentioned as the source, the authors need to clearly specify the data collection and cleaning procedures, as well as comprehensive statistics for the final corpus (e.g., average document length, vocabulary size, etc.). 6. The paper's experimental evaluation is weak due to a lack of adequate baseline comparisons. The authors primarily compare different variations of their own method. A benchmark against non-text-based methods, such as GNNs, is required. Please refer to the weaknesses. Lightly AI-edited
Element2Vec: Build Chemical Element Representation from Text for Property Prediction Soundness: 1: poor Presentation: 2: fair Contribution: 3: good Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes Element2Vec, a framework for representing chemical elements as vectors by leveraging textual descriptions. The authors use Wikipedia pages for each of the 118 chemical elements as the data source. An LLM-based pipeline is employed to produce two types of embeddings for each element: a single Global embedding capturing the overall content of the element’s page, and multiple Local embeddings that are attribute-specific. To obtain the Local embeddings, the approach first uses an LLM to classify each sentence of an element’s page into one of eight predefined attribute categories (e.g. Atomic, Chemical, Thermal, etc.), and then aggregates the text of each category (with a brief summary of the whole page) to generate an embedding for that attribute . The goal is that these Global and Local embeddings encapsulate meaningful chemical knowledge extracted from text, which can then be used for downstream tasks such as classifying an element’s periodic-table family and predicting various material properties The paper's key strengths lie in its innovative cross-domain approach that bridges NLP and materials science by leveraging large language models to extract knowledge from scientific text, moving beyond traditional hand-designed features. The introduction of attribute-aware embeddings significantly enhances interpretability by producing multiple vectors for each element corresponding to human-understandable categories (mechanical, thermal, chemical properties), which demonstrably organize the latent space in ways that respect known scientific classifications like periodic families. The proposed test-time training method effectively addresses sparse data challenges, showing substantial performance improvements over conventional baselines especially when 50-80% of data is missing, by cleverly allowing unlabeled instances to influence the model during inference. The work is supported by thorough empirical evaluation including comprehensive ablation studies (examining summary length effects, attribute contributions, and embedding dimension overlap) that reveal meaningful insights such as the model's ability to rediscover periodic families from text alone and capture real scientific relationships like the shared features between melting and boiling points. This combination of strong performance, interpretability through attribute-specific analysis, and alignment with domain knowledge makes the approach both scientifically valuable and trustworthy for deployment in materials research. The approach suffers from fundamental limitations in its dependence on Wikipedia as a single, uneven source of truth, with sparse coverage forcing the exclusion of 22 elements from certain analyses. More critically, the heavy reliance on LLM-based sentence classification and summarization lacks any validation or accuracy assessment. The paper provides no evidence that the LLM correctly categorizes sentences into attribute buckets or avoids hallucination during summarization, despite these steps being central to the embedding process. This unverified pipeline could propagate errors throughout the embeddings, yet the authors offer no robustness analysis or manual verification of these critical automated decisions. The experimental evaluation lacks essential baseline comparisons that would contextualize the method's performance. No comparisons are provided against simple alternatives like one-hot encodings, manual feature sets, or naive whole-text embeddings without attribute segmentation. Furthermore, the results reveal a surprising weakness: the sophisticated Local attribute embeddings actually underperform the simpler Global single-vector approach in property prediction tasks, with the authors themselves acknowledging that "global embedding generally exhibits the lowest error." This undermines a core contribution of the paper and suggests the attribute segmentation may lose important holistic information rather than enhance it. The work's practical applicability is limited by its narrow scope (118 elements only, with no demonstration on compounds or real materials) and reproducibility concerns stemming from dependence on proprietary models like Gemini. The authors provide no plan to release computed embeddings or discuss computational costs, making it unclear how others could replicate or extend this work without access to the same commercial AI services. 1. How do you handle elements with incomplete Wikipedia attribute coverage—do they receive fewer Local embeddings, default vectors, or some imputation method? 2. Did you compare Element2Vec against simpler baselines like linear regression on atomic features (e.g., atomic number, group, period) to quantify the advantage of text-derived embeddings? 3. How reliable was the LLM's sentence classification into attribute categories, and did misclassifications (especially for ambiguous sentences spanning multiple categories) impact embedding quality? 4. Why did concatenated Local embeddings underperform Global embeddings for regression, and did you explore learned fusion methods like attention mechanisms to weight attribute relevance for specific properties? 5. How sensitive is test-time training to hyperparameters and the ratio of known-to-unknown elements, and at what point does including too many unknowns cause overfitting or instability? 6. Is the Gemini embedding model publicly accessible, could alternative models like SBERT produce similar results, and will you release the Element2Vec embeddings for the 118 elements? Fully AI-generated
Element2Vec: Build Chemical Element Representation from Text for Property Prediction Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. To address the critical challenge of property prediction for chemical elements, this paper employs a Large Language Model (LLM) as an encoding module to extract knowledge from text in the form of embeddings. The proposed Element2Vec framework leverages an LLM to construct vector representations of chemical elements derived from unstructured text sources, specifically Wikipedia pages. The key contribution lies in generating two types of embeddings that capture information at different levels. The global embeddings utilize the entire document or page of an element as input to capture holistic information, while the local embeddings are learned from text grouped under specific attributes. Overall, this idea of using LLM to encode chemical text has been explored in several related works which limit the contribution of this paper. 1. The proposed Element2Vec provides an effective way for translating human-experienced, qualitative knowledge like Wikipedia into numerical representations that are machine-readable. 2. The proposed method utilizes both Global and Local (attribute-specific). The local embeddings include the information of specific characteristics (e.g., optical vs. thermal properties), which is vital for materials design and scientific analysis. The global embeddings capture more holistic knowledge. 3. The training-free framework is another strong aspect. By relying on pre-trained LLMs as feature extractors and content classifiers, the embedding generation pipeline does not require additional training. This makes it straightforward to apply to new elements or attributes without extensive retraining, thereby facilitating faster research and experimentation. 1. The main limitation of this work is that Element2Vec relies entirely on Wikipedia pages. Consequently, the quality, depth, and neutrality of the generated embeddings depend heavily on the completeness and accuracy of this data source. If a particular element’s Wikipedia page is sparse, outdated, or biased, the resulting embedding may be inaccurate. A potential improvement would be to incorporate additional sources of domain-specific knowledge, such as scientific publications or chemical databases, to enhance representation quality. 2. Wikipedia is a general data source, and most modern LLMs have already been trained on it during pretraining. Therefore, a more appropriate baseline would be to compare this approach against recent, powerful models applied directly to property prediction tasks. Additionally, several models have been fine-tuned for chemistry-related tasks, and including such comparisons would strengthen the paper. Overall, while the application of LLMs to chemical property prediction has been investigated in previous studies, this work would benefit from a clearer demonstration of its unique technical contributions and distinctions from existing approaches. Please check the weaknesses. Moderately AI-edited
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors use model uncertainty as process supervision to guide the model’s reasoning steps. To perform uncertainty estimation, they train a lightweight value head in a supervised manner to predict the model’s uncertainty. The training data are labeled either by the model itself serving as the supervisory model or by a third-party supervisory model. The authors conduct experiments on mathematics, planning, and QA datasets, comparing their approach with several unsupervised uncertainty estimation methods and third-party process reward models. 1. The method proposed by the authors is lightweight — it only requires training a value head, which makes it highly efficient. 2. The authors proposed an automated training data construction scheme. 3. The authors conducted extensive experiments, comparing their approach across datasets from three different domains. 1. The proposed method lacks novelty, as many prior works have already trained process reward models (PRMs) or used uncertainty estimation as a supervision signal. For example, the baseline methods cited by the authors employ similar ideas. The main contribution of this paper is merely implementing such supervision through a lightweight value head. And UHead is also an existing work. 2. The authors' definition of uncertainty lacks rigor. Generally speaking, a metric trained directly from accuracy should not be regarded as a measure of uncertainty. For example, when a model produces a particular wrong answer very frequently during random sampling, its uncertainty about this that answer should be very low. However, under the training method proposed by the authors, such a case would yield a high uncertainty value. For the definition of uncertainty, I recommend reading this paper: https://arxiv.org/pdf/1802.10501 Have you compared the results between full-parameter fine-tuning and Uhead? Lightly AI-edited
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces UHeads (Uncertainty quantification Heads), a lightweight alternative to Process Reward Models (PRMs) for verifying step-level correctness in LLM reasoning chains. UHeads are small transformer modules (<10M parameters) trained on frozen LLM internal states to predict step-level uncertainty, with labels generated either by larger models (DeepSeek-R1) or through self-supervision. The authors demonstrate that despite being 750-810× smaller than PRMs, UHeads achieve competitive performance across mathematics, planning, and QA tasks, particularly excelling in out-of-domain scenarios, suggesting that LLMs' internal states encode meaningful uncertainty signals for reasoning verification. 1. The proposed UHeads achieve comparable or superior performance to PRMs while using 750-810× fewer parameters (9.8M vs 7-8B), offering a highly efficient alternative for step-level reasoning verification that significantly reduces inference costs and memory requirements. 2. UHeads demonstrate superior generalization capabilities, particularly on OOD tasks where they consistently outperform much larger PRMs, suggesting they capture more transferable uncertainty signals rather than overfitting to domain-specific patterns. 3. The automatic annotation pipeline eliminates requirements for human labels, verifiable final answers, or costly Monte Carlo rollouts, supporting both external supervision (via DeepSeek-R1) and self-supervision approaches with comparable performance. 1. Tables 2-4 and 6 consistently show UHeads underperforming strong PRM baselines on in-domain mathematical tasks (MATH, GSM8K), with gaps of 5-10% in PR-AUC, raising questions about whether the computational savings justify the accuracy trade-off for domain-specific applications. 2. The 256-token generation limit during training data creation may constrain the method's applicability to more complex reasoning tasks like AIME problems that require tens of thousands of tokens, potentially limiting the approach's generalizability. 3. Given that UHeads require training on specific LLM internal states while PRMs can be used off-the-shelf across different models, and considering the performance gaps on certain tasks (e.g., ScienceQA where RLHFlow-PRM-DeepSeek significantly outperforms), the overall value proposition compared to training a single general-purpose PRM remains unclear. 1. How does the approach handle step boundary definition in complex reasoning chains that include self-verification, backtracking, or recursive refinement? The paper's reliance on structured prompts may not generalize to more naturalistic reasoning patterns. 2. In Section 2.3, the notation P(y|x,D) appears problematic since training on data D fundamentally changes model parameters θ rather than just conditioning the distribution. Should this be reformulated as P_θ'(y|x) where θ' represents post-training parameters? 3. What training factors contribute to UHeads' underperformance on in-domain tasks compared to PRMs? A deeper analysis of failure modes and potential improvements would strengthen the paper's contribution. Fully AI-generated
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses the challenge of verifying intermediate reasoning step correctness in LLMs’ multi-step reasoning and proposes a lightweight Uncertainty Quantification Head (UHead) to replace computationally expensive Process Reward Models (PRMs). - The proposed method is compared with comprehensive baselines. - Process reward is important for the development of Large reasoning models. 1. The proposed method is not clear, how is U (r(j)t ∣r(j)<t , x) be estimated, what's the archecture of the U-heads. 2. It seems that this paper utilizes the U-head to learn the uncertainty for process-reward estimation. Since the U-head is from another work, what's the contribution of this work? 3.The method should be evaluated on the latest PRM benchmark like PRMBench U-head contains few parameters compared with LLMs, but does it rely on the embedding or hidden states of LLMs? If In that case, we can not say that U-head is an more efficient methon than some simple baseline like LLM-as-judge. Fully human-written
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a method for step-wise verification based on quantifying the uncertainty involved in the reward predictions for each reasoning step. It implements a UHead, a classification module on top of LLM’s hidden states and uses its predictions for scoring verification rewards. Empirically, the paper provides experiments in step-level correctness and offline/online best-of-N using verifier-guided search, claiming on par performance with 7B-8B PRM models and strong OOD generalization. - The idea of using UQ methods for unsupervised/self-supervised verification is well motivated and justified. - The paper brings several baselines, particularly on UQ methods for verification, which is a good benchmarking for UQ-based model-based verification research. - The paper brings an interesting OOD evaluation setting, which is very relevant but overlooked in PRM research. - The major concern in the paper is its clarity/presentation on describing the proposed methodology. The background section brings a section about UQ but does not follow up on it when describing the method in Section 3. The core technique in the paper is the UHead, but this is not formally described in the paper, making it not self-contained. There are no details on how uncertainty is estimated or how/whether the terms in the equation of Section 2.3 are computed, nor what is the nature of the uncertainty estimated (e.g., predictive, epistemic). - From the description provided in Section 3, the UHead seems to be a classification network on top of a base LLM hidden state, and the uncertainty here relates to the softmax entropy among Yes/No classes. If this understanding is correct, then there are a few points to consider: - It would be important to compare against the predictive entropy from the LLM itself, i.e., consider the (re-normalized) Yes/No distribution conditioned on the reasoning step and compute its entropy as score; this baseline will validate if the classification training is indeed needed. - The claims about “comparing” against models that are 150x, 810x times larger sounds misleading since the method still requires inference over a base LLM to extract features, so all the parameters of the base LLM are also activated in the process of generating verification. This should be counted as well if the goal is to compare model sizes. - There are also strong claims about the UHeads being general, plug and play, and that they “generalize across tasks, languages and domains”. These claims are unclear and unjustified. From the paper description, these are classification models trained on top of self-supervision or even DeepSeek-R1 labels, so we need proper evidence to support these claims, otherwise I would expect them to behave similarly to other Adapter models in the literature. - The Related Work section is very superficial. It provides a little of contextualization and does not contrast with other similar works. There is also recent work in uncertainty-aware step-wise verification missed, e.g., [1, 2]. - The paper does not report confidence intervals to assess statistical significance in the results. In fact, the paper does not mention how many experimental seeds were used (I assume it is a single one). Prior literature has raised how sensitive math reasoning benchmarking is for small changes [3], requiring a more statistical grounding to evaluate whether the reported takeaways are meaningful or just observation noise. - As an illustrative example, Figure 3 (left) is used as evidence to claim scaling improvements for the proposed method. The reported gap in performance is less than 1% of accuracy (over Qwen2.5-Math-PRM-7B), which diminishes as N increases. There is no way to assess statistical significance here, yet the paper claims the “consistently better results”. The same lack of statistical rigor extends to all reported experiments, which makes it hard to evaluate scientific claims. Overall, I believe the paper requires a good rewriting in the methodological section to improve clarity on the proposed method. Some of the claims (as described above) needs to be calibrated, and the experiments should report performance across different seeds with proper confidence intervals. The related work section may also be polished to better contextualize with the literature and contrast with similar methods. References: [1] Cao et. al. More bang for the buck: Process reward modeling with entropy-driven uncertainty, 2025. [2] Ye et. al. Uncertainty-Aware Step-wise Verification with Generative Reward Models, 2025. [3] Hochlehnert et. al. A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility. COLM, 2025. N/A Fully human-written
DeepHA: Scaling Action Chains Elicits Deep Hierarchical Agents Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes the Deep Hierarchical Agent (DeepHA), an agent architecture for complex, open-world environments like Minecraft. The authors identify two main limitations in prior work: reliance on a single, predefined action space and errors from decoupled high-level and low-level policies. To address this, DeepHA introduces a manually-defined, multi-level action hierarchy (e.g., Skill, Grounding, Motion, Raw) and a Mixture-of-Policies (MoP) framework where a central VLM generates an action at a chosen level of abstraction, which is then routed to a specialized low-level policy. The paper also proposes a "Chain-of-Action" (CoA) reasoning framework, where the VLM autoregressively generates a sequence of actions from high-to-low abstraction (e.g., $Skill \rightarrow Grounding \rightarrow Motion$), using the higher-level actions as "thoughts" to guide the lower-level ones. Finally, to handle long context lengths, the authors describe a "memory-efficient mechanism" that prunes historical tokens and uses KV caching. The agent is evaluated on a large, proprietary benchmark of over 800 Minecraft tasks, where it is shown to outperform previous methods. 1. **Strong Empirical Results:** The paper's primary strength is its extensive dataset curation and empirical validation. The authors have tested a large-scale benchmark and performed a detailed evaluation, demonstrating state-of-the-art performance within their chosen domain. 2. **Clear Ablation Studies:** The ablation study provides clear, quantitative evidence that, within their hand-crafted framework, deeper hierarchical reasoning leads to better performance than shallow reasoning. This validates their central design choice. 3. **Detailed System Documentation:** The paper and its appendix are very transparent about the complex, multi-stage training pipeline and the extensive data curation process. 1. **Critical Flaw in Memory Contribution:** The "memory-efficient mechanism" is the most significant weakness. The paper claims to "manage the computational demands... in long-horizon tasks", but the proposed method is explicitly described as an "**inference-time process**". It offers **no solution** for the memory bottleneck during **training**, which is the main pain point for long-sequence models. The method is a simple application of token pruning + standard KV caching for generation. This is not a novel contribution and does not solve the problem it claims to. 2. **Lack of Algorithmic Novelty (Action Space):** The paper's core, the action hierarchy, is entirely hand-crafted. While the *concept* of hierarchy is general, the *implementation* is a fixed, domain-specific engineering choice. The work offers no generalizable method for *learning* this hierarchy, which severely limits its scientific contribution beyond the specific domain of Minecraft. 3. **Conflation of Data-Scaling with Algorithmic Novelty:** The SOTA results are impressive but appear to be the product of a massive, multi-stage data engineering effort and finetuning on data from powerful proprietary models. This is a great engineering result, but it's unclear how much of the gain comes from the *method* versus this extensive, domain-specific data advantage. 4. **Reproducibility:** The reliance on proprietary programs / pipelines for generating the foundational dataset makes a key component of the work impossible to reproduce. My questions are stated as above. Overall I do acknowledge the empirical result of this work. But the paper is a simple composition of many established / common knowledge techniques and the novelty of this work is limited. I think this paper's empirical success worth to share on data mining conferences but not ICLR. Lightly AI-edited
DeepHA: Scaling Action Chains Elicits Deep Hierarchical Agents Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. *Motivation* Previous methods are confined to a predefined action space, limiting their applicability to a diverse range of tasks. For e.g., "an action space that excels at navigation may be ill-suited for precise object manipulation”. *Proposal* The authors address this by proposing a mixture of action spaces, where each action space defines actions of a specific hierarchical level (semantic or temporal). The authors fix the level abstractions as: “high-level language skills, coordinate-based grounding actions, mid-level motion commands, and low-level raw action sequences”. To allow the action prediction to follow the pattern as, high to lower level actions, the authors propose “long action-chain reasoning through action pyramid”. Secondly, as the memory context can become intractable easily, the authors also propose an efficient memory management scheme that trims the past low-level actions from memory. The authors test the agent in an expanded version of the minecraft dataset (800 tasks) from OpenHA. The method performs significantly better than the previous SOTA methods. - Good integration of multiple modules to achieve a high performance. - Not completely clear on the full details of the construction of abstract actions. While section 2.2 discusses possible methods for automatically extracting abstract actions from datasets, the method does not use them. The only section that provides usable insights into the action abstractions used is B.2. The paper does not mention how the dataset was created, manual attribution, or automatic generation. Other small details, like the total number of instances in each abstraction, are also missing. - While the proposed abstractions improve performance (at the MineBlock task), it has not been discussed why this specific hierarchy was chosen. As the abstraction hierarchy is listed as a core contribution, it should be addressed. - A lot of data has been curated for each training phase: world knowledge QA, VQA, scene captioning, Visual grounding, Reasoning capabilities enhancement, abstracted action data, and chain of action structured data. This makes it difficult to judge the impact and transferability of the method to other tasks (such as robotics). - Not really end-to-end as there are multiple training phases, each with its own objectives and target weights subset. Only the finetuning phases are end-to-end. - While a lot of effort has been put into getting the architecture working better than SOTA, including data curation and training. The only novel contributions I see are the constructed action abstractions, but they may not be valid across domains. The other contribution of memory pruning only provides modest improvements over full memory (+1.1% ASR and -1.2% FT). - Ablation 4.3.3 shows that the greedy mode, which unrolls the full hierarchy, performs worse than the eager mode, which can break the hierarchy, which seems to invalidate the hypothesis that action hierarchies are crucial to improving performance. An analysis of how a greedy strategy can fail, or of how eager mode chooses an exit, should be provided. - The paper is slightly confusing to read, and I have to hunt for information. There are sections in the main paper that are not very relevant, while those in the appendix are required for a core understanding of the method. The appendix is currently almost mandatory to understand the method clearly. I believe a rewrite can significantly enhance readability. - Small typo in line 725 - How are the action abstraction policies trained? - Does ablation 4.3.1 use greedy mode? - What kind of memory does ablation 4.3.3 use? How is the memory context of direct mode disproportionately highest while it produces the fewest tokens? - I understood eager mode as model choosing the exit, so what does Eager-Motion/Grounding mean in ablations 4.3.3? - In section A.2, the inspirations for the abstractions are mentioned. For grounding-based policies, it is noted that they are “adept at interpreting coordinate-based instructions or visual targets and translating them into navigational actions”. However, the grounding policy in the paper’s case translates high-level goals into raw actions. Similarly, don’t all policies output in the raw action space (Fig. 3). Can the author’s clarify this? - How does the model learn to output the eager-stop token? Is it present in the data? If so, how is the early exit decided when creating the data? - The way eager mode allows for an exit is fundamentally different from how hierarchical agents act. In the proposed architecture, it seems all abstractions exist side-by-side, allowing the router to choose from them rather than stacking them vertically. Fully human-written
DeepHA: Scaling Action Chains Elicits Deep Hierarchical Agents Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes Deep Hierarchical Agent (DeepHA), a single, unified architecture for operating in heteregenous actions spaces. The authors propose the Chain-of-Action (CoA) framework that enabels the agent to generate higher-level actions as thoughts to guide the generation of finer-grained actions. The primary motivation is the observation that for domains such as minecraft, complex tasks naturally decompose into a hierarchy of actions. To deal with hetereogenous action spaces, the action use a single high-level VLM along with a low=level mixture of policies. The VLM is fine-tuned to generate actions from a specific action space which a router routes to the appropriate policy. The VLM can be etrained in different ways In direct mode, the vlm generates an abstract action from a signle space. The router simply routes it. In Greedy mode, the vlm sequentially generates actions from the action hierarchy ultimately resulting in a low-level action. Each action in the sequence serves as a thought to improve the generation of the ext action. In eager mode, execution is halted once an executable action is produced. This is either by detecting a sepcial tag for manual experiments or allowing the vlm to autonomously learn it by fine-tuning to produce the tag as its generation. The authors also propose a memory-efficient chain-of-action by compressing parts of the execution history. The experiments are conducted on Minecraft with several baselines and 800 total tasks. 1. The overall idea seems intuitive. 2. The compression feature is interesting. 3. Results look promising although there are some concerns here. 1. I dont think this paper is written well. For the most part it is okay but there are a couple of major issues that might cause confusion and as a result misunderstanding of your work 1a. Where is this Chain-of-Action architecture defined? It is proposed as a new contribution but Im not able to find it detailed anywhere? As a result, I cannot even imagine how to apply this to other tasks. 1b. There is no contrast with related work. Lines 104-107 and 131-133 seem like a citation dump with no contrast on how your approach is different. This makes it very hard to understand what value your work provides since these are hierarchical agents as well several of which you use as baselines. 2. The approach seems very hand-coded. For example, the authors state a complex task is naturally decomposed to simpler tasks. Their CoA framework in greedy mode generates a very specific sequence of actions that help minecraft (A^s -> A^g -> A^m -> a). Who determines this order? Does this also work for other tasks that are not minecraft? How to scale your approach to other domains automatically. I think currently this needs clarificaiton and seems like a major limitation on the generalizability of this framework. Please clarify if the training data needs to be formulated as such for training. If it is indeed the case, then i think you owuld need to show more than just Minecraft to demonstrate that this approach works for different types of hieracchies. You mention that such action pyramids can be learned in lines 161-185 but this is not clarified or explained in detail. 3. For Table 1, DeepHA, what mode was used? was it eager mode? if it was, then was it using manually terminated tags or a vlm fine-tuned to generate the tag that allows routing to an executable action using a pretrained policy? 4. Lines 377-378 say that the baselines are trained on the same expert datasets. Could you elaborate on how the dataset was processed for your appraoch since the action pyramid owuld need to be learned I think? As a result, you would need to annotate the dataset in a different way? 5. Why are the standard deviations so high in some tasks? eg. Mine Blocks and Kill entities. Is this just one run of 800 tasks with the avg and sd computed across the 800. Or is it the sd of the ASR across X runs? Ive asked questions in weaknesses itsefl. Overall interesting work but needs more clarifications before i can accept it. Happy to engage in discussions and increase my score. Fully human-written
DeepHA: Scaling Action Chains Elicits Deep Hierarchical Agents Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors propose DeepHA, a hierarchical agent architecture enabling dynamic action generation across multiple abstraction levels (skills, grounding, motion, raw actions) via Chain-of-Action reasoning. They introduce a memory-efficient mechanism that reduces context length by 75% through dynamic history compression while preserving high-level semantic goals for long-horizon tasks. Originality - The paper presents a novel and insightful approach. The memory compression of past intermediate steps, which offers a clever mechanism for designing scalable and efficient Vision-Language-Action (VLA) architectures. Quality & Clarity - The formulation is comprehensive and well-articulated, with clear definitions of key concepts such as action levels, inference modes, and policy mixtures. - The experiments effectively validate the proposed method and concepts. - The illustrations and the experiment details (including case studies and configurations in the Appendix) clearly demonstrate the concepts and outputs across action levels. Significance - The results are promising. The proposed approach outperforms other SOTA approaches, including instruction-conditioned policies and hierarchical agents. They also introduce a new metric ASR to show the competence of the approach. - While the paper includes comprehensive ablation studies, it lacks an analysis of failure modes. I encourage the authors to include case studies and detailed examinations of failures across inference modes, action levels, and termination mechanisms. Such analyses would offer deeper insights into the rationale behind the chosen designs and clarify whether these components are complementary and essential to the overall approach. - The paper lacks implementation details, such as the actual prompts for action generation and the implementation code. The authors should provide this information for reproduction. - I wonder if the memory efficiency could also be supported by theoretical analysis. Beyond empirical results, deriving a theoretical lower bound on memory usage would provide stronger insights into the scalability and potential applications of the proposed approach. - Have you applied the bottom-up approach in your framework? You mention the approach in Section 2.2, while there is no other discussion in this paper. How would it be used? Lightly AI-edited
DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper presents DeepOmni, a multimodal spoken language model that integrates adaptive modality-specific experts within a Mixture-of-Experts (MoE) architecture. The goal is to alleviate catastrophic forgetting in native multimodal large language models (MLLMs). DeepOmni introduces an adaptive modality expert selection strategy based on modality token load and employs a three-stage training procedure of modality alignment, unimodal training, and cross-modal joint training. Experiments on spoken QA, ASR, TTS, and text benchmarks demonstrate that DeepOmni reduces language performance degradation to 5.5%, substantially lower than existing native MLLMs (typically over 20%), while maintaining real-time response latency (<0.5 s). Overall, the paper contributes a novel and well-engineered MoE-based framework for building end-to-end speech interaction models that effectively balance linguistic competence and acoustic generation. 1. The work claims to be the first native MLLM built upon an MoE-based LLM backbone with a 3-stage post-training and addresses the catastrophic forgetting in native MLLM. Solid and highly effective. 2. It proposes an effective and intuitive expert partition strategy that selects modality-specific experts based on modality load, and the proposed model achieves a low performance drop in language capacity. 1. The paper claims native MLLMs preserve richer paralinguistic features as part of its motivation, but the evaluation lacks essential quality-based metrics to substantiate this claim and compare the expressive quality of the proposed model against other native baselines. 1. See weakness 1. Can we see results comparing DeepOmni's speech output against other native MLLMs on quality metrics like prosody and emotional expression?
 2. The process for designating the 2 shared modality experts is missing from the adaptive partitioning mechanism (Algorithm 1). Can the authors clarify this step? 3. The Phase 3 pseudo-code in Algorithm 1 puts $\text{top-}k$ inside a loop iterating over $j$. This looks confusing, as the intent seems to be applying $\text{top-}k$ globally, like \text{Audio Experts}_{l} \leftarrow \text{top-}k\!\left( \left\{ \rho_{l,j}^{A} \cdot (1 - \rho_{l,j}^{T}) \right\}_{j=1}^{M},\, k \right) 4. Some formatting issues for inline citations. Some should be using citep Fully human-written
DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes DeepOmni, a native multimodal large language model for speech interaction that leverages Mixture of Experts (MoE) architecture to mitigate catastrophic forgetting. The key contribution is an adaptive modality expert selection strategy that dynamically assigns experts to audio or text modalities based on their "modality load." The model outperform other native MLLMs like GLM-4-Voice on text-to-text, speech-to-text and speech-to-speech tasks. 1. Addresses Important Problem: Catastrophic forgetting in native multimodal speech models is a genuine and pressing challenge. The paper tackles a real bottleneck that limits the practical deployment of end-to-end speech interaction systems. 2. Novel Adaptive Selection Strategy: The adaptive modality expert partitioning based on modality load is creative and well-motivated. Unlike random assignment, this data-driven approach intelligently identifies which experts are suitable for audio vs. text. 3. Comprehensive Evaluation: The experimental evaluation is thorough, covering multiple dimensions: spoken QA, ASR, TTS, and LLM benchmarks. 1. Weak Baselines in Comparison: The results section appears to compare against relatively weak baselines. Why do Tables 2–5 not include comparisons with Qwen-2.5-OMNI and Kimi-Audio? Notably, Kimi-Audio is itself a non-modular speech LLM, making it an important baseline for fair evaluation. 2. Questionable Claims About Modular SLM Limitations: The paper’s claims regarding the limitations of Modular Speech Language Models are not fully substantiated. These models remain end-to-end differentiable, for instance, Qwen-OMNI can leverage its generated LLM hidden representations to encode paralinguistic cues. Did the authors perform any experiments showing that modular architectures are indeed worse at modeling such paralinguistic information? 3. Lack of Analysis: While the method demonstrates improved performance, there is limited insight into why it mitigates catastrophic forgetting more effectively than other methods. What distinct knowledge patterns do the audio and text experts capture? How does modality isolation help preserve capabilities? A deeper analysis—e.g., via probing or visualization—would greatly strengthen the paper. 4. Clarity and Presentation Issues: The paper is difficult to follow in several sections. The term e(h_l)_i should be explicitly defined in Eq. (2). Algorithm 1, which seems central to the contribution, is hard to interpret and should be explained in greater detail. The Expert Selection Statistics remain unclear. The multiplicative selection criterion (lines 22–23) seems arbitrary—why this specific formulation and not other combinations? An ablation or stronger motivation is needed. Formatting in Sections 3.3 and 3.4, as well as reference styling (perhaps using \citep{}), should also be improved. 5. Section 3.3 (Audio–Text Alignment): The paper mentions “downsampling the audio and using text padding tokens to align their lengths.” More details should be provided—specifically, how much padding is applied and whether it affects training stability or convergence. Please check weaknesses, particularly 1,2 and 4 Moderately AI-edited
DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. To mitigate catastrophic forgetting in native MLLMs, this paper proposes DeepOmni for adaptive modality expert learning in a MoE-based MLLM. DeepOmni goes through stages of adaptive modality expert selection based on modality load, specialized single-modality training with instruction data from different modalities, and then joint multimodal collaborative training using cross-modal instruction data. Experimental results show that DeepOmni achieves a 5.5% relative performance drop over the original LLM, substantially lower than some MLLMs such as GLM-4-voice. The E2E dialogue latency remains 0.5 secs for efficient voice interaction. 1. Using a MoE architecture for developing MLLMs has been explored in earlier works, as well as dynamic modality expert selection such as in prior works of LLMoE etc. The single-modality expert training and then cross-modal expert training has also been explored in the prior Uni-MoE framework. The main contribution of this work seems to investigate the impact of these previously proposed approaches on mitigating catastrophic forgetting of text capabilities in LALMs and omni models, which is an important research question. The experimental results show that DeepOmni achieves a 5.5% relative performance drop over the original LLM, which is better than the 6.5% relative drop from Qwen2.5-omni (a dense model) over its backbone LLM. 2. The analysis of the number of audio experts, modality load of experts at different layers are useful. The comparison between PureMoE, LoRA-PureMoE, Random Modality-specific MoE, Adaptive Modality-specific MoE shows clear advantages of adaptive modality-specific MoE over the less principled approaches. 1. Some important related, non-contemporaneous works are missing in theoretical and empirical comparisons, for example, strong MLLMs such as Kimi-audio, Ming-lite-omni (which is also a MoE-based omni model). Hence, the presentation of the experimental results is misleading. For example, in Table 2 performance on Spoken QA, Kimi-audio and Qwen2.5-omni achieved much better performance than the proposed DeepOmni, yet their evaluation results are missing. In Table 3 evaluating the T2T performance and the relative drop of the LALMs/omni models, Kimi-audio and other dense or MoE-based MLLMs are missing, such as Uni-MoE, Ming-lite-omni. 2. DeepOmni is built upon a weak backbone, DeepSeek-V2-Lite, which is further verified by its poor performance on text capabilities as shown in Table 3. As a 15.7B-A2.4B MoE model, its average score is 53.06, much worse than qwen2-7B's 70.52, qwen2.5-7B's 73.62, and GLM-4-9B's 64.08, with all these being dense models. With a low performing backbone, it is difficult to fully justify the effectiveness of the proposed dynamic modality expert selection, uni-modality expert training and cross-modal expert training. It is important to evaluate these proposed approaches on a more competitive MoE backbone. 3. The batch parallel decoding used in DeepOmni, as also used in mini-omni and other works, expands a single audio input into a batch size of two, with one audio+text sample, and one text-only sample, and embed the text-only output into the audio generation process. This is more a hybrid workaround rather than a principled solution for the speech and text interference in parallel speech-text modeling. 4. For S2S, the speech generation performance needs to be evaluated, for example, reporting WER and UTMOS. 5. The ablation study in Appendix focuses on investigating the number of acoustical experts and analysis of modality load across different layers showing the benefit of dynamic modality expert selection, but the analysis of the multi-stage training is not presented. 1. There are some formatting issues. For example, the citations could be added using \citep, so that it would appear as, for example, (Radford et al., 2023). The current citation formatting right after text, e.g., (ASR) Radford et al. (2023), degrades readability. Fully human-written
DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces DeepOmni, an MoE-based speech interaction model built on the DeepseekV2-Lite backbone. It follows a parallel modeling paradigm and employs the SNAC codec for speech tokenization. To address the catastrophic forgetting issue in native MLLMs, DeepOmni adaptively identifies modality-specific experts based on modality load, performs specialized single-modality training, and concludes with joint multimodal collaborative training. Experiments on text-to-text tasks demonstrate only a 5.5% performance drop compared to the original LLM. 1. The paper proposes the first MoE-based native speech interaction model, effectively addressing the catastrophic forgetting issue in native MLLMs, which is a critical research topic. 2. The adaptive modality-specific MoE design is innovative and is supported by ablation studies showing its advantages over MoExtend, PureMoE, LoRA, and Random Modality-Specific approaches. 3. The paper is written clearly and includes code in the supplementary material, enhancing reproducibility. 1. The paper uses batch parallel decoding, which increases computational costs and creates an unfair comparison with baselines that do not use this technique. Results without batch parallel decoding should be provided. 2. The research is based on a relatively weak backbone, making it unclear how the model would perform with stronger backbones like Qwen3-30B-A3B. 3. There is a lack of comparison with key baselines in Table 2&3 such as Kimi-Audio and Step-Audio2-Mini, which are also native MLLMs. Table 2 suggest that DeepOmni underperforms these models on Spoken QA. Minor: 1. GPU model details should be added for latency testing. Please address the issues described in the Weaknesses section. Resolving these concerns could improve the paper’s evaluation. Lightly AI-edited
Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a new evaluation framwork to characterise data samples from test sets and large language models, focussing strongly on model-data interactions under the umbrella of memetics. The authors test 4507 LLMS with many dataset characterisations and model capability probes on these datasets. 1. Capability probing is an important aspect of foundation model evaluation. It helps us make evaluations more granular and extract more information from datapoints. This work makes a positive contribution in that direction. 2. The different probes and phemotypes are well-defined, in theory. 3. Testing 4507 models from OpenLLM Leaderboard is a substantial empirical contribution. 1. The meme framework seems unnecessary and, without a more grounded theoretical framework and justification in the context of LLM evaluations, could be removed. It leads to unnecessary terminology such as perception matrix, meme probes and phemotypes. The core contributions would not be affected if the metaphor were removed. It also leads to some sections becoming quite confusing to read("latent units of model capability that can be revealed through probing"). 2. Several relevant papers have not been cited or referenced in this work. Flexibly defining new properties based on evaluation needs and the observation that datasets “contain a large number of seemingly simple questions that are nevertheless answered incorrectly by some elite models” are both insights established in [1]. Then, “models with similar accuracy may succeed on very different types of items” is adopted from [2]. Several of the meme probe properties are established frameworks already: difficulty has been defined and used in [3] as as the IRT model in [4](cited elsewhere in the paper), the . Other forms of capabilities are tested by works such as [5]. 3. This work only considers binary (0,1) evaluations. Other works [1,4] take into consideration heterogeneous metrics such as accuracy (0/1), BLEU score ([0,1]) which is a more realistic setup, ensuring that the framework of evaluation takes into consideration all datasets. 4. There is no insight into whether the phemotypes are correlated to each other and to what extent. This paper would benefit from quantitative studies to determine this. Minor 1. Figure 5 and 6 are not easily interpretable, more details and insights would be beneficial. 2. T-sne is known to result in spurious clusters. UMAP, on the other hand, preserves both local and global structure in the data and is a better algorithm for data visualisation.
 [1] Ghosh et al. ONEBENCH to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities, ACL 2025 [2] Goel et al. Great Models Think Alike and this Undermines AI Oversight, ICML 2025 [3] Prabhu et al. Efficient Lifelong Model Evaluation in an Era of Rapid Progress, NeurIPS 2024 [4] Polo et al. tinyBenchmarks: evaluating LLMs with fewer examples, ICML 2024 [5] Alyahya et al. ZEROSUMEVAL: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition, ACL 2025 1. “Certain elite models that excel in overall metrics nevertheless display anomalous errors on questions that most other models solve with ease”. Could this be the case due to train-test contamination, the presence of test samples during pretraining? What would be a way to test this? 2. Is there a unique insight into evaluation provided in the memetic definition (not talking about the probe properties here)? The interaction of data and model is basically how evaluation is done. Other works such as [1] use more applicable definitions from other fields, such as social choice theory and also conduct experiments to compare their framework with other methods in that field. 3. To measure I_j, which is it logarithmic? It seems invariant to “reduces the influence of weak models while emphasizing the contribution of stronger models”. [1] Ghosh et al. ONEBENCH to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities, ACL 2025 Fully human-written
Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces the "Probing Memes" paradigm for LLM evaluation, drawing conceptually on Dawkins' theory of memes as replicating cultural units. The authors propose treating LLM capabilities as composed of latent "memes" that can be revealed through carefully designed probes. They construct a “perception matrix” from model-data interactions. - The problem of better analyzing model evaluations is important - Some of the proposed metrics are interesting 1. **Unclear motivation for the meme framing**: The conceptual link to Dawkins and memetics feels forced and adds unnecessary complexity without clear benefit. The core contributions (probe properties and phemotypes) could stand without this metaphor. The paper states that memes are "latent units of model capability that can be revealed through probing", but this is more of a renaming than a substantive theoretical contribution. Why is this memetics lens necessary or illuminating? 2. **Limited differentiation in results**: The key weakness is visible in Figure 7, which shows phemotype scores tracking remarkably closely with accuracy across models. While the paper claims to reveal "fine-grained phenomena invisible under conventional evaluations," the phemotypes appear highly correlated with overall performance. This contradicts the motivation that current approaches "obscure fine-grained differences" - if the proposed phemotypes largely parallel accuracy, what additional insight do they provide? 3. **Inconsistent with cited literature**: The paper cites Schilling-Wilhelmi et al. on IRT where different analysis methods reveal significant ranking changes. However, Figure 7 shows surprisingly consistent orderings across metrics, undermining this motivation. 4. **Unclear practical utility**: What should practitioners do differently with phemotypes versus accuracy? The paper doesn't provide clear guidance on how these metrics inform model selection, dataset design, or capability assessment in practice. 1. Can you provide examples where phemotypes lead to different model rankings or selection decisions compared to accuracy? 2. In Figure 7, why do phemotypes correlate so strongly with accuracy if they're capturing distinct capability dimensions? What percentage of variance in phemotypes is explained by accuracy? 3. What is the empirical evidence that the meme metaphor provides insight beyond standard psychometric approaches (IRT, factor analysis, etc.)? 4. How should researchers choose which phemotype to optimize for? Are some phemotypes more important for certain applications? Fully AI-generated
Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World Soundness: 3: good Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes Probing Memes, an evaluation paradigm that jointly considers individual samples in the dataset and models evaluated on them, in contrast to the previous convention that only looks at a single model's performance aggregated over samples. The authors design evaluation metrics to model each sample as "probes" to study the capability properties of the dataset vs. the model. Experimental results reveal distinct dataset-model properties across several benchmarks, such as a high probing value of "surprise" on MMLU-pro. 1. Thorough joint analysis along the dimensions of sample and model is an interesting yet underexplored direction. 2. The proposed Probing Memes paradigm is well-motivated. The overall idea is clear. 3. Experiments are extensive and well-deliver the proposed idea. 1. The novelty of this paper lies in the proposed two-dimensional (sample-model) evaluation. However, the novelty may be undermined by the fact that previous work has been using such analysis for specific purposes. For example, [a] defines and calculates the difficulty of samples in benchmark datasets, along with 52 LLMs of different sizes, to investigate emergent abilities. This raises concern about the insufficient discussion of and distinction from related work. 2. This proposed paradigm relies on the assumption that there have been sufficiently many LLMs evaluated on the target dataset. If the number of LLMs is "not enough," the values of probes will become unreliable. On the other hand, with more LLMs evaluated on the target dataset, the values of probes and phenotypes will change since they depend on the tested LLMs. 3. I am concerned about this paper's framing. Terms like "meme", "probe," and "phenotype" might be a bit distracting and imprecise. 4. Discussions on experiments and findings seem not deep. For example, it is no surprise that there are samples with high surprise value in MMLU-pro. Can your paradigm help answer why this happens? Arguments for the practical value of this paradigm are insufficient and unclear. Another example: Given you find that "probes in IFEval, GPQA-Diamond, and BBH exhibit relatively high uniqueness" (line 431), what can we do next to make it not only a "good to know" thing? What is this finding's practical value? a. [U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models](https://openreview.net/forum?id=jjfve2gIXe) 1. Can you make a comparison of the definitions of memes in this paper and in The Selfish Gene? It's unclear to me how this term can be borrowed from the book. 2. My current understanding regarding 1. is that a meme is a facet of LLM's ability (by line 79, " From this perspective, the abilities of LLMs are conceptualized as composed of latent memes"). In this case, are the proposed 7 kinds of probes only a small subset of an unknowingly big set of all memes? 3. In Section 2.2, you define **uniqueness** through the averaged conditional entropy of two probes and define **ϕ-coefficient** to calculate the similarity of two probes. It seems to me that entropy (e.g., JS entropy) is also a natural metric to determine whether two probes should have an "edge". Is my understanding correct? 4. In Figure 6, can you explain why Astuteness has nearly identical weight distribution over the three datasets? 5. In line 429, you mention MMLU-pro has many samples with high surprise. Is this possibly due to the nature of MCQs? Assume a challenging question where a frontier LLM gets wrong. If a low-ability LLM simply randomly guesses all samples and thus picks the correct choice by chance, the surprise value of this question will be high. 6. The statement in lines 461-462 is a bit abrupt and without any supporting argument. You can move this statement to the earlier section and support it with concrete examples—text only is ok; quantitative experiments are a huge improvement. 7. In line 471, why is full reproducibility not guaranteed even with temperature=0? Isn't the probe calculation process deterministic? My current score for this paper is 2. Given that I have many questions, I may raise my score to 4 or 6 if this paper's significance and novelty becomes clear and good to me during the discussion stage. Fully human-written
Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper solved the prob of evaluating LLMs only with coarse, accuracy-centric scores that ignore how models and datasets interact at scale. This paper proposed the Probing Memes paradigm: build a perception matrix over many models × many items, derive Meme Probe Properties, and aggregate them into model-level phemotypes to reveal fine-grained capability structure across populations. 1/ In this work, the authors reconceptualize evaluation as an entangled world of models and data, formalizing a perception matrix that supports probe-level properties and interpretable phemotypes; this exposes phenomena hidden by traditional benchmarks (e.g., elite models failing items most models solve) and scales to thousands of models. 2/ The authors validate the framework on a huge amount of LLMs, showing clear probe/property distributions, family-level structure in phemotype space, and practical insights (accuracy-equal models with different behavioral profiles), demonstrating both scalability and interpretability. 1/ I suggest maybe the authors can consider broadening tasks (coding, RAG, agents) and adding head-to-head baselines (e.g., IRT-based compact sets, adversarial stress tests) to verify that phemotypes add incremental value beyond existing item- and ability-modeling approaches. 2/ An ablation on property definitions, thresholds, and clustering (e.g., Leiden parameters) would clarify robustness and generality. 3/ For the evaluation results, the authors can add multi-judge adjudication, uncertainty estimates, or multi-runs to enhance the robustness of the results. See weaknesses Heavily AI-edited
Exploring weightless neural networks: From logic gates to convolutional lookup tables Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents an empirical comparison of Weightless Neural Networks (WNNs)—Logic Gate Networks (LGNs) and Look-Up Table networks (LTNs)—against traditional deep neural networks (MLPs and CNNs). The authors train 1040+ model architectures across MNIST, Fashion-MNIST, and CIFAR-10 to evaluate test accuracy, training time, and robustness to noise. Training 1040 architectures across three datasets with multiple evaluation dimensions (accuracy, training time, robustness) represents a significant experimental effort. Introduction of convolutional LTN variants fills a gap in the literature and enables fair comparison with LGCNNs. The paper addresses real engineering questions (training time, robustness, bit depth) relevant to practitioners considering WNN deployment. Beyond accuracy, the robustness analysis (salt-and-pepper noise, occlusions) and training time measurements provide valuable practical insights. The paper also tests Fashion-MNIST, which was not done by Petersen et al. (2022;2024), and something that was missing in their evaluation. The statement “In real-world deployments, applying augmentation would likely improve performance” should simply be tested. The paper's core motivation is FPGA deployment and inference speed, yet never measures either. All experiments run on GPUs (NVIDIA L4) No inference time measurements reported No hardware resource utilization (LUTs, power consumption) No comparison to actual FPGA implementations Some or all of these are critical to draw the real-world conclusions the authors do. The statement “Note that LGNs and LTNs achieve state-of-the-art performance for MNIST and Fashion-MNIST (i.e. hand written characters and clothing items) while performing worse on CIFAR-10 (i.e. containing structurally complex images of birds, cars, and other classes), allowing these datasets to stress each model’s performance and reveal challenges with training complex model architectures” requires citations. No error bars on accuracy measurements despite stochastic training 2-fold validation is unusual—why not standard 80/10/10 or 5-fold cross-validation? Averaging over "top 5 models" biases results toward best-case scenarios Some missing related work. A few of these are merely concurrent work, but it makes sense to cite given the overlap. https://arxiv.org/abs/2508.17512 https://arxiv.org/abs/2506.07500 (you already mean to cite this. It is the Yousefi & Wattenhofer 2025 citation) https://ieeexplore.ieee.org/document/10301592 https://arxiv.org/abs/2510.03250 https://arxiv.org/abs/2506.04912 https://arxiv.org/abs/2509.25933 https://arxiv.org/abs/2504.00592 You cite “Shakir Yousefi and R Wattenhofer. Deep differentiable logic gate networks: Neuron collapse through a neural architecture search perspective. 2025.” However, this is a project description. Yousefi published their work in the Mind the Gap paper (https://arxiv.org/abs/2506.07500). Captions for the tables should be above the tables as per the formatting instructions. The figures are generally low resolution, and the font is small. Please address this. The story of the paper is interesting; however, the writing is rather clunky, and the presentation could be improved. This is in particular the case for sections 3.2 and 4. The color-coded bars in Tables 2 and 3 are hard to interpret. Typos: Line 50 should have “ML” rather than “Ml.” While my review is rather negative, the authors can and should address several of these things for the iclr submission, as the paper and reviews will be public. The issues with the citations, missing citations, figures, captions, etc., can be resolved within a day :) What is the training and validation split? Line 187 makes it sound like 50/50, but this seems quite aggressive. Why did you leave out all data augmentations? I understand omitting some to test generalization to unseen perturbations; however, if you want to determine their real-world applicability, then this should be included. Why did you use the quantization method over a temperature encoding as Petersen et al. use? What is the full distribution of accuracies (not just top-5)? Can you create a figure for section 3.2, as it is currently not easy to understand? How many layers do the models have? (e.g. in Table 1). Do you know why LTCNN’s time per epoch drops a lot for the largest models in Table 1. Both the mean and the std are very low. Why are your CNN on CIFAR results so poor? It should not be hard to get an accuracy around 80% (https://www.kaggle.com/code/faressayah/cifar-10-images-classification-using-cnns-88) You report test accuracies for your DWN that are much lower than in the original DWN paper. Your Table 3 Fashion MNIST test accuracies are around 55% while the DWN paper reports 89% (see Table 1 https://arxiv.org/pdf/2410.11112). Why is this? Fully human-written
Exploring weightless neural networks: From logic gates to convolutional lookup tables Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper conducts an extensive empirical study on Weightless Neural Networks (WNNs), particularly the Logic Gate Networks (LGNs) and Look-Up-Table Networks (LTNs), exploring their scalability, robustness, and training efficiency relative to standard MLPs and CNNs. The authors introduce a convolutional variant, the LTCNN, designed to mimic CNN kernels via sliding-window logic, and evaluate it against existing LGCNNs. Three systematic studies are presented: 1. Model Scaling: Comparing training time, accuracy, and noise robustness across model sizes. 2. Bit-Depth Variation: Assessing how quantization granularity (1-, 2-, 4-bit) affects performance. 3. Learnable Mappings: Investigating the impact of learnable interconnects between logic layers. Results across MNIST, Fashion-MNIST, and CIFAR-10 show that WNNs achieve comparable accuracy to traditional DNNs on simple datasets but require larger parameter counts and training time. LGNs display superior robustness to salt-and-pepper noise, while LTNs generally train faster. However, scaling beyond modest architectures remains challenging due to combinatorial training complexity and limited receptive fields. *Unprecedented experimental scale:* Over 3000 model variations evaluated across architectures, datasets, and encoding schemes — the largest comparative WNN study to date. *Methodological clarity:* Parameter search, optimization settings, and training details are exhaustively documented, ensuring reproducibility. *Architectural innovation:* Introduction of LTCNNs extends WNN applicability to spatially structured data. *Balanced analysis:* Includes multiple performance metrics — accuracy, training time, and robustness — not just raw accuracy. *Hardware relevance:* Considers inference efficiency for FPGA deployment, highlighting edge-device applicability. *Limited conceptual novelty:* Despite broad experimentation, the contribution is primarily empirical — no new training paradigm or theoretical framework is proposed. *Underdeveloped scalability discussion:* The paper identifies training inefficiency but doesn’t analyze why gradient-based optimization underperforms with discrete structures. *Missing SOTA comparisons:* Lacks benchmarks against Binary Neural Networks (BNNs) or quantized models (e.g., XNOR-Net, DoReFa-Net), which target similar hardware-efficient goals. *Overemphasis on small datasets:* Evaluation restricted to MNIST-family and CIFAR-10 — too elementary for claims of “real-world scalability.” *Ambiguous bit-depth insights:* The bit-depth study’s findings (“depends more on dataset than model”) feel descriptive rather than explanatory. *Unclear path forward:* Future work is listed but not tied to the limitations uncovered, weakening the narrative closure. *Detailed Analyses:* This paper stands at the crossroads of symbolic determinism and differentiable learning. It is not just a technical benchmark but a philosophical probe into how much “logic” can live inside a modern neural framework. The study’s brilliance lies in revealing that weightlessness is not simplification — it’s structure exposed. The very mechanisms that make LGNs interpretable — fixed binary operators and explicit mappings — also constrain their ability to scale. This is the paradox of discrete differentiability: transparency breeds rigidity. Yet, the work’s contribution is not diminished by its empirical focus. It charts the limits of current WNNs while providing an honest, data-driven narrative of their trade-offs. It implicitly calls for hybridization — integrating logic-based regularization or attention-like symbolic layers into conventional deep nets. In short, the paper answers a deeper question: where do Boolean ideals meet the entropy of gradient descent? And in that meeting, it maps the next horizon of neurosymbolic research. While not theoretically groundbreaking, this paper’s scale, rigor, and insight make it a valuable empirical cornerstone for the neuro-symbolic community. Its clarity and reproducibility elevate it beyond a routine benchmarking effort. However, it would benefit from stronger engagement with recent SOTA baselines and a more principled discussion of why weightless architectures hit their current limits. I expect the authors to defend or rebut the points in the weakness section during the rebuttal phase. Fully AI-generated
Exploring weightless neural networks: From logic gates to convolutional lookup tables Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents a comprehensive investigation of Weightless Neural Networks (WNNs), specifically Logic Gate Networks (LGNs) and Look-Up Table Networks (LTNs) and compares them to conventional neural models (MLPs and CNNs). It explores a very wide range of model configurations (>3000), analyzing their impact on training time, accuracy, and noise robustness using 3 image-based datasets (MNIST, Fashion-MNIST, CIFAR-10). Impact of learnable mapping (trainable inter-connects) and bit depth of inputs encoding is also studied. As part of their evaluation, the authors also introduce a novel LTN architecture (LTCNN), by applying to LTN the sliding-window mechanism characteristic of LGCNN, an LGN variant. Results show that, at the range of model sizes investigated, LTNs and LGNs achieve comparable accuracies and noise robustness to their MLP and CNN counterpart, although requiring longer training times. The optimal bit depth is primarily dataset-dependent. Learnable mapping can be beneficial for accuracy but at the cost of significantly increased model size and training time. - The main strength of this paper is the comprehensiveness of the comparative study of LGNs and LTNs, an exploration covering a very wide range of model configurations and test conditions. The results offer a consolidated reference for WNN performance - The paper is well structured and results are well organized. It is generally easy to follow, although some concepts, such as learnable mapping and sliding-window modification for LGCNN, are taken for granted and not explained for a general audience - The authors do not overclaim, they offer a balanced discussion of WNNs underperforming/overperforming compared to the counterpart reference models - Limited novelty: the primary novelty lies in 1) the introduction of the LTCNN and 2) an extensive configuration sweep. However, LTCNN is conceptually a direct adaptation of the existing LGCNN. It should be noted that the exact sliding window mechanism the authors introduce in LTCNN is not described in details in this paper, although it is understood to be equivalent to the one used in LGCNN. LTCNN do not appear to offer significant performance gains and are slower to train. On the other hand, the broad hyperparameter exploration is not a source of novelty per-se, and the discussion is primarily observational, with speculative explanations - In terms of impact, a key bottleneck to WNNs practical applicability is the long training time and this paper confirms this limitation rather than offering a solution or mitigation strategy. Consequently, experiments relies on very small-scale image datasets and small models (up to 1M parameters), severely limiting generalization to real-world or large-scale data - Other explorations (noise robustness, bit width) show mixed results, in the sense that different trends are observed across models. Although interesting, they suffer from the same lack of generalizability to larger datasets and more challenging tasks - Learnable mappings improve performance but exacerbates the fundamental limitation of WNN, the long training time, further worsening scalability - Can LTCNN be optimized to improve training time, similarly to kernel optimization implemented for LGCNN in the cited Petersen et al. 2024? - Can the observed trends be generalized to more complex datasets, at least in some scenarios like noise robustness? Fully human-written
Enzyme-Unified: Learning Holistic Representations of Enzyme Function with a Hybrid Interaction Model Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes ENZYME-UNIFIED, a multi-task learning framework that holistically predicts five key enzyme properties, including kinetic constants and environmental optima, via a novel Hybrid Interaction Model that fuses fine-grained cross-attention and global feature concatenation. The authors present three rigorously partitioned, sequence-dissimilar benchmark datasets for fair evaluation and demonstrate state-of-the-art results on both public and new benchmarks, supported by strong ablation and interpretability studies. 1. Architectural innovation: The Hybrid Interaction Model dynamically integrates token-level cross-attention and global feature concatenation, with a learned gate, to represent both local and global enzyme-substrate interactions. 2. New dataset construction: Three new, large-scale, non-homologous datasets are described, with careful cluster-based partitioning to enforce sequence dissimilarity. 3. Reproducibility: Datasets and code are promised for release, and hyperparameter details are comprehensive. 4. Clear, concise presentation: The manuscript flows well, with logically organized sections, visual explanations, and clear mathematical exposition. 1. Limited discussion and incorporation of prior multi-task and multi-label enzyme function prediction works: Both the related work section and the experimental comparisons are missing several directly relevant works, such as CLEAN (Yu et al., 2022), EnzymeCAGE (Liu et al., 2024), and EZSpecificity (Cui et al., 2025). These studies have already addressed multi-label or holistic enzyme function prediction using deep learning models. Their methods and results should be discussed, compared, and cited to properly position the contribution of this work, especially since the novelty of ENZYME-UNIFIED relies heavily on its multi-task, unified perspective. This omission hinders the reader's ability to measure the paper's progress relative to the existing literature. (1) Yu, Tianhao, et al. "Enzyme function prediction using contrastive learning." Science 379.6639 (2023): 1358-1363. (2) Liu, Yong, et al. "EnzymeCAGE: a geometric foundation model for enzyme retrieval with evolutionary insights." bioRxiv (2024): 2024-12. (3) Cui, Haiyang, et al. "Enzyme specificity prediction using cross attention graph neural networks." Nature (2025): 1-3. 2. Clarity in loss transformation and objective function. The transformation $T(y)$ (Section 3.3) is piecewise, distinct for kinetic and environmental tasks. However, it isn’t clear how this interacts with the MSE loss numerically or whether separate losses are weighted; multi-task loss balancing (if present) isn't explicitly described, which could affect model optimization in joint settings. 3. The potential for data leakage in dataset construction is unaddressed. 4. Missing details in modeling token-level chemical interactions: The fine-grained interaction pathway is elegantly formulated, but the actual operationalization of the cross-attention is not deeply detailed. 5. Limited novelty in the model architecture. 1. The paper argues that a limitation of existing deep learning models is their tendency to predict only single attributes while ignoring the correlations between them. However, ENZYME-UNIFIED appears to be trained on multiple task datasets separately, without establishing connections between these different task sets. This approach would also seem to ignore inter-attribute correlations. Could the authors please address this apparent contradiction? 2. Numerous previous works have addressed enzyme function prediction tasks. It seems these existing models could be adapted to the dataset presented in this paper by merely modifying their training objectives. What was the consideration for excluding these models from the baseline evaluation? 3. Did dataset construction strictly avoid information leakage from meta information (e.g., substrate names, assay condition annotations), not just sequence similarity? Can the authors provide statistics on maximum sequence identity or substrate overlap between train/test folds? 4. Please clarify the implementation details for handling disparate sequence lengths in cross-attention: Is there padding, masking, or special encoding to preserve biochemically plausible alignment or neighborhood context between enzyme and substrate tokens? Could position embeddings (absolute vs. relative) affect fine-grained interaction modeling? Fully AI-generated
Enzyme-Unified: Learning Holistic Representations of Enzyme Function with a Hybrid Interaction Model Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper established a suite of three new datasets for predicting catalytic efficiency, optimal temperature and optima pH. Then the paper developed a unified framework to simultaneously predict five distinct properties. 1. The curated three new datasets of different enzyme properties are good contributions to the enzyme design and enzyme engineering committee. 2. The idea of simultaneously predicting different enzyme properties are useful. 1. The performance of the provided method show very minor improvement compared to baseline models. I'm not sure if this improvement is significant. 2. The paper lacks some baselines. As it targets at overcoming the limitation of previous works that they solely predict different enzyme properties. There should be baselines to simultaneously finetune a pretrained protein model on different tasks, like finetuning ESM2 and ProtT5 using multiple task layers to achieve the goal of this method. This baseline is more fair to the setting of the proposed method and could be better compare the strong baseline and the proposed framework performance. Additionally, since the performance improvement of the proposed method is quite minor, it would be interesting to see the performance of multitask finetuning on large-scale pretrained models like ESM2-15B and ESM2-3B. Please see above weaknesses. Fully human-written
Enzyme-Unified: Learning Holistic Representations of Enzyme Function with a Hybrid Interaction Model Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper identifies two significant limitations in the current machine learning-based prediction of enzyme properties: 1) models predict properties in isolation, failing to capture the biophysical interplay between them , and 2) models are often evaluated on homology-unaware, biased datasets, leading to inflated performance. To address this, the authors present two main contributions: (1) three new large-scale, rigorously partitioned datasets for multi-property prediction; (2) ENZYME-UNIFIED, a unified framework for holistic enzyme property prediction, powered by a novel HYBRID INTERACTION MODEL that adaptively fuses global and local interaction features for more powerful and flexible representations. The authors train independent instances of this model for five properties and demonstrate SOTA performance. The paper does an excellent job motivating the work. The critique of the "fragmented" single-task paradigm and the practical need for a "holistic view" of an enzyme's profile is very compelling . The creation and public release of three new, large-scale datasets is a significant contribution to the field. The case study provides strong evidence that the fine-grained attention pathway is learning biochemically meaningful information, as it correctly identifies the catalytic histidines The introduction is built entirely on the need to move beyond the single-task research paradigm. It argues for capturing the intricate biophysical interplay and inter-property relationships using a multi-task learning paradigm that can co-predict multiple, interdependent properties. However, The implementation seems to completely contradicts this. Section 3.3 explicitly states: "The Enzyme-Unified hybrid architecture is trained independently for each of the five target properties..." Figure 1 explicitly labels the output as an "All-in-one model" but it seems that It is an all-in-one architecture used to train five "one-at-a-time" models. This seems to be a critical distinction, and the current framing overstates the contribution. If the above judgement is correct, then a direct comparison between the "trained independently" strategy and a true multi-task learning strategy (e.g., a shared hybrid trunk with five separate prediction heads trained jointly) will be very helpful. The authors use ProtT5 as baselines but combine ProstT5 in their algorithm without clear explanation (i.e. why cannot combine ProtT5 or why the baseline cannot use ProstT5) See above. Fully human-written
Enzyme-Unified: Learning Holistic Representations of Enzyme Function with a Hybrid Interaction Model Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses critical limitations in enzyme function prediction—isolated single-property prediction and overestimated performance due to homology-biased datasets—by proposing ENZYME-UNIFIED, a multi-task learning framework for holistic enzyme property prediction. The core innovation is a Hybrid Interaction Model that dynamically fuses fine-grained local interactions (via cross-attention) and global feature representations (via concatenation) using a trainable gate. The framework simultaneously predicts five key enzyme properties: turnover number, Michaelis constant, catalytic efficiency, optimal temperature, and optimal pH. To enable robust evaluation, the authors construct three large-scale, sequence-dissimilar datasets (clustered by 40% sequence identity to avoid homology leakage) for the five target properties. Experiments show ENZYME-UNIFIED SOTA performance on the public CataPro benchmark and their new datasets. Ablation studies validate the synergy of the hybrid architecture and the value of the trainable gate, while a case study on Ribonuclease A (RNase A) confirms the model’s ability to identify biochemically relevant catalytic sites, ensuring interpretability. Key contributions include: (1) the ENZYME-UNIFIED framework with a novel Hybrid Interaction Model; (2) three rigorously partitioned, homology-unaware datasets for multi-property enzyme prediction; (3) SOTA results across kinetic and environmental property prediction, with validated interpretability. - Methodological novelty: The gated hybrid architecture elegantly bridges fine-grained molecular interaction modeling with traditional global encoders. - Careful dataset curation, transparent evaluation, and homology-aware partitioning. - Consistent improvement over strong baselines (CataPro, UniKP) across multiple properties. - Limited interpretability generalization: The RNase A case study is convincing but narrow. The model’s ability to identify catalytic sites is only demonstrated for one enzyme (a ribonuclease). Extending this to 2–3 additional enzymes from different EC classes (e.g., lactase, a common hydrolase) would confirm that the attention mechanism consistently targets functional sites across enzyme types, rather than RNase A-specific patterns. - Limited evidence of cross-task synergy: Each property is modeled independently; a joint multi-output model might better support the “unified” claim. Have you tested the model’s attention mechanism on additional enzymes (e.g., lactase, cytochrome P450) to confirm it consistently identifies catalytic sites across EC classes? If not, could you include this analysis in a revised version? Fully AI-generated
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduced VL-JEPA, a visual language JEPA model that is trained by predicting the embedding of target texts. The authors demonstrate that this results in faster and more efficient training, and for selected tasks yield state-of-the-art results. The model also obtains impressive zero-shot retrieval scores despite its training paradigm. In addition, a promising approach for selective decoding was also presented. Overall, the paper is well-written, and it is easy to find the relevant pieces of information. The paper introduces (or successfully reapplies) the JEPA architecture to the VL setting. - shows notable gains in both training speed and performance on zero-shot video captioning and classification (fig 3) while using a well-argued training procedure (JEPA). - shows non-trivial adaptation to retrieval and open-label classification. E.g. seen on youcook2, MSR-VTT (table 6). - The paper explores underexplored areas in the field, by exploring alternatives to generative token decoding, resulting in promising decoding strategies (selective decoding) and a reduced number of parameters compared to alternative models (fig 3, table 2, 4) - The relevant benchmarks used for evaluation are only briefly introduced. I would have loved to see a more substantial justification for choosing these specifically. - While it is stated that the model and code will be open source, it could be shared through existing anonymous platforms - VL-JEPA is in Table 6 compared to contrastively-trained models. It is implicitly argued that this is the reason for the subpar performance on some of the tasks. I would have loved to see a contrastive adaptation to see if this assumption is correct. I would love to get a more substantial justification for the choice of benchmarks/evaluation datasets. Fully human-written
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents VL-JEPA a vision language model built on top of a JEPA architecture. The model is evaluated on video understanding and world modeling. 1 - The paper is clearly written and easy to follow and understand. 2 - A new vision-language model leveraging JEPA architecture instead of regular transformer decoders. 3 - Comparable performance to existing transformer-based VLMs, with less parameters. 4 - Extensive details are given about the training setup and resources. 1 - The model seem to be focused on video understanding as most of the training data are related to this task. This raises questions about the comparision to other VLMs that are trained and designed to be more generalist. 2 - Experiments focus on only a subset of use cases of a vision-language model (video understanding). More experiements on other types of tasks wuold have been appreciated (e.g., MMMU, OCRBench, DocVQA, etc.). If the JEPA architecture is intended to replace transformer-based VLMs then more generalization experiemnts are required. 3 - The choice of evaluation benchmarks is not well justified. For example, WORLDPREDICTION-WM is not known by the vision-language modeling community. If I'm not mistaken the paper introducing this benchmark [1] was cited once. [1] Chen, D., Chung, W., Bang, Y., Ji, Z., & Fung, P. (2025). WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning. arXiv preprint arXiv:2506.04363. 1 - Why the benchmarking of this model does not follow standard VLM benchmarking suites? 2 - Is there a justificiation for focusing on video understanding? 3 - Do VL-JEPA need a pre-training phase? How does the model size scale with the data used for training? In the paper, it is said 64M samples are seen, how much is this in number of tokens? Fully human-written
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents VL-JEPA, a vision-language model formulated on the Joint Embedding Predictive Architecture (JEPA). Instead of traditional autoregressive token-level generation, VL-JEPA learns to predict target text embeddings directly in continuous space, thereby abstracting away surface linguistic variability and focusing on semantic representation. The model demonstrates improvements in efficiency and sample complexity compared to classical token-generative VLMs, particularly in zero-shot video understanding, retrieval, and real-time streaming scenarios. Extensive experiments benchmark VL-JEPA against leading models, with ablation studies and scalability analyses provided. 1.The paper's core contribution—applying a predictive JEPA-style objective to the cross-modal VL problem—is highly novel. Moving VL learning from the discrete token space to a continuous semantic space is a well-motivated and promising direction to address the known efficiency and latency bottlenecks of standard generative VLMs. 2.The "selective decoding" mechanism (Sec 4.3) is a standout contribution. The idea of monitoring the latent embedding stream for semantic variance and only triggering the expensive text decoder when a significant shift is detected is an elegant and practical solution for real-world, low-latency video streaming applications. 3.The model achieves state-of-the-art (SOTA) results across a wide and diverse range of video-language benchmarks, demonstrating the effectiveness and generalizability of the learned representations for both zero-shot and finetuned tasks. 1. While the paper emphasizes that the shift from token-space to embedding-space simplifies the target distribution, it provides no rigorous analysis of the measurability and discriminability of the resulting semantic embedding space. To substantiate this claim, the authors should provide supplementary analysis, such as: (i) Visualization or quantitative studies on the embedding space's structure (e.g., its clustering properties, separability) to demonstrate this claimed simplification. (ii) A theoretical elucidation of the target distribution's compressibility, perhaps through the lens of the Information Bottleneck principle. (iii) A crucial ablation study comparing the performance impact of using different embedding spaces (e.g., from CLIP, SONAR, BERT-base) as the prediction target. 2. L2 loss implicitly assumes a unimodal, deterministic target distribution.Real-world VL tasks are full of "one-to-many" ambiguities (e.g., "the light went out" vs. "the room became dark"). Both are valid but semantically distinct answers. The L2 loss will penalize the model for predicting either correct answer, forcing the predictor to regress towards a non-existent "average" embedding located somewhere between the two valid target points. This regression to the mean will likely result in semantically "blurry," generic, or even nonsensical decoded outputs. The paper completely fails to address this fundamental limitation. 3. Unfair and Misleading Efficiency Comparison: The core comparison in §4.2, which pits a 0.5B VL-JEPA predictor against a 1B VLM baseline, is fundamentally biased. The authors claim superior parameter efficiency, but their 0.5B predictor is not a neutral model; it is "cherry-picked" from the top-most, most semantically potent layers (L8-16) of the 1B Llama model. The paper's own ablation study (Table 5) confirms this bias, showing that these top layers (45.20% accuracy) are vastly superior to the bottom layers (35.86%). This is not an "apples-to-apples" comparison and does not prove the framework's efficiency, but rather the known fact that top-level LLM layers handle more complex semantics. 4. Lack of Statistical Rigor: The paper suffers from a critical lack of statistical validation. All reported results—including all SOTA claims in the tables and the key efficiency curves in Figures 3 and 4—appear to be point estimates from a single training run. No error bars, standard deviations, or significance tests are provided. This makes it impossible to determine if the reported gains are statistically significant or merely the artifact of a single, fortunate random seed, which undermines the scientific validity of all conclusions. 5. Missing Ablation on the Critical Y-Encoder Component: The entire methodology is critically dependent on the properties of the frozen y-encoder (SONAR). The paper fails to provide the most crucial ablation study: testing the VL-JEPA framework with different frozen text encoders (e.g., CLIP's text encoder, or a standard Sentence-BERT). Without this, the paper's claims are not generalizable. It is impossible to know if the authors have discovered a robust, general-purpose framework or simply a special-case solution that is uniquely and luckily compatible with the SONAR embedding space. 1.How does the L2 loss framework handle inherently multi-modal or ambiguous targets where multiple, semantically distinct ground truths exist? Does this not lead to regression towards a semantically blurry "average" embedding? 2.Can you provide quantitative evidence (e.g., t-SNE, cluster variance) that the embedding space actually simplifies the target distribution (e.g., maps "light went out" and "room is dark" to nearby points) compared to a standard token space? 3.How robust is VL-JEPA to the choice of the frozen y-encoder? What is the performance impact of replacing the SONAR encoder with a standard CLIP or Sentence-BERT encoder? 4.Can you clarify the 2.85x saving calculation (is it 1Hz / 0.35Hz)? More critically, can you provide any statistical validation (e.g., mean and std. dev. over 3+ runs) for your key SOTA claims, or at least for the comparison in Fig 4? 5.Loss Function: Why was L2 loss chosen over Cosine Similarity loss? Cosine loss would ignore magnitude and only focus on direction, which might be a more robust objective. Was this tested? Fully AI-generated
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces VL-JEPA, a kind of non-autoregressive vision-language model that predicts target text beddings from visual tokens and a query. The inference is conducted in the embedding spaces. This non-autoregressive nature allows for selective and low latency decoding at inference, while still exhibits strong zero-shot capability. 1. The paper is well-written and easy to follow. 2. The exploration of a novel architecture deviated from the current mainstream (autoregressive VLM) is meaningful and encouraged. 3. The low-latency and selective decoding properties of VL-JEPA makes it adaptable for many practical applications such as robots and wearable devices. 1. The core idea of VL-JEPA, to predict the target response in the embedding space, is very similar to LCM [1]. The training objective is also the same. However, the authors did not discuss the similarity and relationship with LCM. 2. Encoder-decoder architecture may impair the capability of the model to understand long query and generate long responses. The model can neither perform multi-round interaction. 3. The evaluations use accuracy and CIDEr as the main metric. It does not consider the readability of the generated responses. User study or LLM-based evaluation is needed. 4. In Table 2, the bold metrics are not the best ones. [1] Barrault, Loïc, et al. "Large concept models: Language modeling in a sentence representation space." arXiv preprint arXiv:2412.08821 (2024). 1. Most datasets used in this paper do not contain question-style queries. What are the queries used for training and evaluation? 2. Are you using any queries when you perform CLIP-like zero-shot evaluation? 3. Since your training data only entails limited number of tasks, have you tested the instructions following capability of the model? 4. It seems that the current VL-JEPA can only perform well on some standard video understanding tasks such as action recognition and step recognition. For such tasks, what is the advantage of Encoder-decoder model like VL-JEPA, compared with CLIP-style encoder-only models? Fully human-written
Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper titled “UNDERSTANDING TRANSFORMERS FOR TIME SERIES: RANK STRUCTURE, FLOW-OF-RANKS, AND COMPRESSIBILITY” analyzes Transformer models for specifically time series (TSFMs). They show that in this scenario, the transformer models ( as a consequence of the data being passed to it) possess a uniquely low-rank structure compared to similar architectures for text or vision. They postulate thus due to the continuous nature of time-series embeddings. This low-rank input structure leads to that the Attention layer matrices being highly compressible. The authors introduce the concept of "flow-of-ranks," which describes how the numerical rank of a representation gradually increases with model depth due to nonlinear mixing, explaining why earlier layers are more amenable to compression. By leveraging these insights, the researchers demonstrate that large TSFMs like Chronos are severely over-parameterized and can be significantly compressed, achieving up to a 65% reduction in inference time and 81% in memory without losing predictive accuracy. The low rank property of time-series may prove very useful in their application using transformers, in terms of designing models with fewer parameters. Theoretically, this low-rank property is proven for continuous embeddings, with guaranteed polynomial or exponential decay of singular values for smooth or analytic functions, respectively (Theorems 1 and 2). The work provides the first general theoretical results (Theorem 3) connecting low-rank input embeddings to the compressibility of the internal Attention matrices (W_Q, W_K, W_V) The paper introduces and quantifies the concept of "flow-of-ranks," which explains how non-linear components (like activations, residual connections, and normalization) across deep layers gradually increase the rank of a representation (Theorem 4) The majority of the empirical validation and compression experiments focus almost exclusively on the Chronos family of Time Series Foundation Models. While there are references to other models the core compression techniques and deep rank analysis (flow-of-ranks, impact of heads) are primarily demonstrated on Chronos. This limits the generality of the practical findings and compression results to other TSFM architectures While the paper's core claims about rank structure are presented as a modality-dependent framework, the empirical evidence is often constrained to a small number of specialized experiments. How do we know this structure will prevail in other transformer architectures or other datasets. The core theoretical analysis (Theorems 1 and 2) is derived for the univariate time series case. The authors mention in passing that this is extendible to few-variate time series but a detailed discussion is lacking na Fully human-written
Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper analyzes Transformers for time-series data through the lens of rank structure. The authors show that time-series embeddings are inherently low-rank, unlike those in text or vision. They prove that this structure induces low-rank attention matrices, introducing the concept of flow-of-ranks to describe how rank gradually increases with layer depth due to nonlinear mixing. They demonstrate that time-series foundation models, such as Chronos, can be compressed by up to 65% in inference time and 81% in memory without accuracy loss. 1. The paper combines linear-algebraic theory with well-designed experiments confirming the predicted low-rank behavior in TSFMs. 2. The authors introduce a new analytical lens (flow-of-ranks) that connects data modality to model design. The findings have real design implications for TSFMs. 1. The main validation focuses on Chronos. Testing on other TSFMs (like TimesFM and Time-MoE) would strengthen the generality claim. 2. The comparison to prior compression methods (LoRA) is missing, which makes it unclear how much gain stems from modality vs. technique. 1. How does the proposed compression compare quantitatively with existing low-rank or sparse-attention baselines (e.g., LoRA, Linformer)? 2. Does the low-rank structure persist after fine-tuning a compressed model on downstream tasks? 3. Can the flow-of-ranks pattern be empirically confirmed on other TSFMs beyond Chronos? 4. How would the theory extend to multivariate or irregularly sampled time series? 5. Could you provide a simple practical guideline (e.g., rank schedule formula) for designing TSFMs from scratch? Fully AI-generated
Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents a rigorous linear-algebraic analysis of time series Transformer models. It first examines the ranking structure of time series embeddings, revealing unique low-rank characteristics that distinguish them from other modalities. The authors then theoretically infer the potential low-rank nature of the attention matrices in time series Transformers. Meantime, the authors introduce the concept of “flow-of-ranks” to describe how representation ranks evolve and increase across Transformer layers due to nonlinear transformations. Finally, leveraging these insights, the paper proposes two effective compression strategies for time series foundation models, achieving up to 65% reduction in inference time and 81% reduction in memory usage on the Chronos model, without compromising accuracy. 1. Strong Theoretical Grounding: Theorems 1 and 2 formally connect patch size and embedding smoothness to low-rank structure. Theorem 3 crucially links low-rank inputs to compressible attention layers, while Theorem 4 quantifies the "flow-of-ranks." These theories provide novel insights to guide time series foundation model design. 2. The translation from theory to practice is seamless and compelling. Each theoretical claim is supported by corresponding empirical evidence, making the overall theory framework more convincing and practically relevant. And the results are striking, showing that TSFMs are significantly more over-parameterized than LLMs and can be compressed dramatically. The layer-dependent rank schedule is a simple yet powerful idea derived directly from the "flow-of-ranks." 3. Clarity and Organization: Despite the complex mathematical content, the paper is well-structured and readable. The flow from data structure to single-layer analysis to depth-dependent phenomena and finally to applications is logical and easy to follow. 1. The core experiments only focus on a single architecture family. All experiments are conducted on Chronos and Chronos-Bolt, which are based on the T5 architecture. While the principles are argued to be general, validation on other TSFM architectures (e.g., TimesFM, Moirai) would have further strengthened the claim of universality. 2. The paper provides elegant existence proofs—such as the low-rank properties of time series embeddings and the W_Q/K/V matrices—based on the core assumption that time series embeddings are intrinsically low-rank. However, this assumption may be an artifact of current TSFM design choices (e.g., small patch sizes and simple MLP-based embedding layers). As more tokenization methods emerge (e.g., VisionTS [1], Wavelet-based Tokenization [2]), it remains unclear whether these conclusions will still hold. [1] Chen M, Shen L, Li Z, et al. Visionts: Visual masked autoencoders are free-lunch zero-shot time series forecasters[J]. arXiv preprint arXiv:2408.17253, 2024. [2] Masserano L, Ansari A F, Han B, et al. Enhancing foundation models for time series forecasting via Wavelet-based tokenization[J]. arXiv preprint arXiv:2412.05244, 2024. 1. Could the observed low-rank and compressibility properties be interpreted as universal characteristics of time series Transformer models? If so, does this imply that current architectures are inherently low-rank and may therefore lack sufficient expressiveness to capture more complex temporal dynamics? 2. The success of pre-trained compressed model suggests that standard TSFMs are severely over-parameterized. Does this imply that the common practice of scaling up model size for time series is misguided ? Is the low rank of TSFMs a inherant feature, or is it a sign that we are not yet challenging them with tasks of sufficient complexity? 3. Theorem 3 suggests that the attention matrix may be less compressible when dealing with more complex or higher-rank input time series (e.g., noisy financial data). Are there datasets of this nature that could be used to empirically test the boundaries of the proposed low-rank assumption? Lightly AI-edited
Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper investigates why Transformers trained on time series data behave differently from those trained on text or vision. The authors analyze the rank structure of embeddings and attention matrices to explain why time-series foundation models (TSFMs)—such as Chronos are highly compressible without losing much accuracy. Several key points include: time series are naturally low-rank, and flow-of-rank perspective. 1. To my understanding, transformer theories in time series are generally lacking or provides limited practical guidance. This paper serves as a good entry point to understand how time series data differs from other modalities. 2. The results are intuitive to me, making this paper easy to follow. 1. This paper considers univariate time series. which is limited as several TSFMs can handle any-variate inputs. 2. The paper assumes input data X is 1-rank (or low-rank). I think this is a pretty strong assumption, which may not hold in many high-dimensional data. 1. Can the authors explain footnote 1? Does it mean that if we have $n$-variate data, $x$ is then rank-$m$, where $n = m$? Is it possible that $m < n$? 2. I wonder how naive Thm 1 and Thm 2 are? Since this paper mainly shows existence proof, if the input data is low-rank, it seems like its straightforward that we can find low-rank weights to model it. Are there any counterexamples? 3. Are there any other features in the data assumption that makes it a time series? Or would the result hold for all low-rank input data? Overall, I think this paper will be a good contribution to the field and am happy to adjust my score if the authors address my concerns? Fully human-written
Reinforced Preference Optimization for Recommendation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes ReRe to fix two flaws of LLM-based generative recommenders: poor high-quality negative modeling and reliance on implicit rewards. ReRe uses constrained beam search to improve sampling efficiency/diversify hard negatives and combines rule-based with ranking rewards for finer supervision. 1. ReRe effectively addresses the two key flaws of LLM-based generative recommenders (insufficient high-quality negative modeling and reliance on implicit rewards) by integrating constrained beam search (for improving sampling efficiency and diversifying hard negatives) and a combined reward (rule-based accuracy + auxiliary ranking rewards), directly tackling the unique generation space and sparse supervision challenges of RLVR adaptation . 2. The study uses three real-world datasets (Amazon Toys, Amazon Industrial, Yelp) and compares ReRe with diverse baselines. 3. ReRe maintains robust performance across different backbone models (Qwen2.5-1.5B, Gemma-2B, Qwen2.5-7B) and initialization methods (Base, SFT). 1. There is a lack of training-time efficiency comparisons with other generative recommendation methods (e.g., those that do not use reinforcement learning) as well as with traditional methods. 2. The dataset information in Table 5 is not clearly described, and the experiments rely exclusively on relatively small-scale datasets. 1. In Table 5, do the numbers for “Tran” refer to the number of interactions or the number of users? 2. Can this method be combined with generative recommendation approaches (e.g., TIGER)? A semantic ID–based generative paradigm appears more practically viable, whereas relying on text prompts may constrain inference efficiency. Lightly AI-edited
PreviousPage 43 of 1516 (75800 total rows)Next