ICLR 2026 - Reviews

Reviews

EditLens Prediction: Fully AI-generated Heavily AI-edited Moderately AI-edited Lightly AI-edited Fully human-written All

Rating: 1 2 3 4 5 6 7 8 9 10 All

Confidence: 1 2 3 4 5 All

Summary Statistics

EditLens Prediction	Count	Avg Rating	Avg Confidence	Avg Length (chars)
Fully AI-generated	15899 (21%)	4.43	3.58	3687
Heavily AI-edited	3233 (4%)	4.22	3.59	2990
Moderately AI-edited	7082 (9%)	4.20	3.61	2722
Lightly AI-edited	16648 (22%)	4.15	3.68	2746
Fully human-written	32938 (43%)	4.13	3.62	2917
Total	75800 (100%)	4.21	3.62	3026

Title	Ratings	Review Text	EditLens Prediction
Multimodal Datasets with Controllable Mutual Information	Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper proposes a framework for generating synthetic multimodal datasets with explicitly controllable MI between modalities. The method combines causal latent-variable construction with flow-based generative models that preserve MI under bijective mappings. The authors claim this provides a testbed for studying multimodal SSL and benchmarking mutual information estimators. Although data generation is not my primary area of expertise, this work appears to address a genuinely underexplored and important problem: constructing realistic high-dimensional multimodal datasets with analytically tractable and controllable mutual information, which could enable systematic evaluation of self-supervised learning methods and mutual information estimators. The theoretical development is simple and clear. The use of flow-based generative models to maintain information structure across high-dimensional modalities is conceptually elegant and technically well-motivated. The main limitation of this paper lies in the absence of empirical validation. While the framework is theoretically elegant, the paper does not demonstrate that the generated datasets are practically useful for their intended purposes, such as evaluating self-supervised learning methods or mutual information estimators. The examples provided are purely illustrative and rely on analytic expressions rather than experiments that confirm controllability or MI preservation in practice. Moreover, the claim of producing “realistic multimodal data” is overstated: using CIFAR-10 class-conditioned flows as a proxy for distinct modalities is a weak approximation of genuine multimodality (e.g., image–text, video–audio, etc.), and it remains unclear whether the generated samples exhibit meaningful cross-modal relationships. The reliance on linear-Gaussian causal structures, while analytically convenient, limits the generality of the approach for more complex, nonlinear dependencies in real-world multimodal settings. The paper would also benefit from quantitative experiments comparing analytical MI values with empirical estimates obtained via neural MI estimators to substantiate its proposed utility. 1. Can you provide empirical evidence that the generated datasets preserve the specified mutual information after flow transformations? 2. Have you tested any self-supervised learning methods to demonstrate that controllable MI affects downstream performance as intended? 3. Does your linear-Gaussian setup generalize to nonlinear or non-Gaussian latent dependencies? 4. How scalable is the framework to higher-dimensional data (e.g. video or time series) or more modalities? 5. Have you evaluated how well existing mutual information estimators recover the known MI values on your generated datasets?	Fully AI-generated
MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service	Soundness: 1: poor Presentation: 3: good Contribution: 1: poor Rating: 0: Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	The paper “MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service” (under review at ICLR 2026) proposes a lightweight, schema-free memory system that allows frozen LLM-based agents (i.e., without any fine-tuning) to continuously improve through reflection across user sessions. Self-Reflection is an interesting idea and worth exploring. 1. Inaccurate summarizations and comparisons: In Table 1, MemGPT is not simply using key-value pairs and not just raw dialogues. They have core memory which is the summarization and extracted infomation, they also use `SQLite` or `PostGreSQL` as the storage, while the paper only says "Key-Value" in the column Storage. Also in MemGPT they do have the ability to rewrite the memory. In their codebase there is a function called `core_memory_replace` and `core_memory_rewrite`. If your "ReWrite" is talking about rewriting the query, then the table is even more incorrect. MemGPT has agentic search which definitely has the ability to rewrite the query. 2. Limited Novelty: Basically the paper is proposing that we can (1) save the raw trajectories; (2) do some self-reflection after each trajectory; (3) retrieve relevant trajectories. The novelty and design remains trivial and sounds like a simple pipeline every company would easily think of during the applications. In the experiments, there are only 130 tasks in total which makes the evaluation results highly unstable and unconvincing. 3. Limited Evaluation Datasets: If the method is simple I would expect this method to have much more powerful performance across various tasks such as Long-Horizon Agent Tasks like TAC[1], Mind2Web[2], SWE-Bench-Pro[3], etc, instead of only showing the performance on a not-so-popular Ecom-bench. This is a research paper, not a technical report in the Industry. 4. Limited Baselines: This paper compared with zero memory-related baselines. Even though they mentioned in the introduction about the limitations of existing memory-augmented methods, the limitations of long-context methods, they compared with none of them in the experiments. 5. Missing Citations: Many recent papers about agent memory systems are not cited [4,5,6,7,8,9] (there should be much more than these such as Agent Workflow Memory, Mem-p, ...). Also in Line 152, this paper mentioned that "MemOrb offers a lightweight, plug-and-play solution that improves the performance of LLM-based agents without the need for frequent model updates or large-scale retraining," However they did not mention any related works about the memory methods that require "large-scale retraining" (some representative works are MemoryLLM[10], M+[11]). They can either not mention this in the introduction, or they have to cite related papers to justify the statement. [1] TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. [2] Mind2Web: Towards a Generalist Agent for the Web. [3] SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? [4] Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science. [5] MIRIX: Multi-Agent Memory System for LLM-Based Agents. [6] EgoMem: Lifelong Memory Agent for Full-duplex Omnimodal Models. [7] Zep: A Temporal Knowledge Graph Architecture for Agent Memory. [8] MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation. [9] HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. [10] MEMORYLLM: Towards Self-Updatable Large Language Models. [11] M+: Extending MemoryLLM with Scalable Long-Term Memory. I don't have any questions.	Fully human-written
MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service	Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper introduces MemOrb, a lightweight, plug-and-play verbal reinforcement memory layer designed to address the problem of LLM-based customer service agents forgetting information across sessions and repeating errors. MemOrb enables continual self-improvement without requiring costly fine-tuning by distilling multi-turn interactions into compact strategy reflections called "Orbs". These Orbs are stored in a shared memory bank (using SQLite and ChromaDB) and are retrieved at inference time via a specialized retrieval and rewriting pipeline to guide the agent's decision-making, facilitating cross-user knowledge transfer. LLM-based agents deployed in customer service often forget information across sessions, repeat errors, and lack mechanisms for continual self-improvement. To address these limitations, the paper proposes MemOrb, a lightweight and plug-and-play verbal reinforcement memory layer. This system distills multi-turn interactions into compact strategy reflections, which are stored in a shared memory bank. These reflections are then retrieved to guide decision-making, all without requiring any fine-tuning. Experiments demonstrate that MemOrb significantly improves both success rate and stability, achieving up to a 63 percentage-point gain in multi-turn success rate. 1. The experiments were conducted on only one benchmark (ECom-Bench), and the number of baselines is too limited. There is no quantitative comparison against the advanced memory mechanisms mentioned in Table 1. 2. The ablation studies did not sufficiently discuss or evaluate the MemOrb's additional "Rewrite" and "Self-Reflection" modules. 3. Compared to the baseline agent, how much does MemOrb increase the total computational cost when accounting for the additional Rewrite and Self-Reflection modules? 1. The paper defines an Orb as a 6-tuple that includes "emotion" , and this feature is included in the embedded document for retrieval. However, its actual impact on retrieval quality or final task success rate was not evaluated. What is the effect of this feature? 2. How sensitive is the system's performance to the hyperparameter $k$? If $k$ is increased (e.g., to $k=10$ or $k=20$), does this lead to context window overload or introduce too many contradictory reflections, thereby interfering with the Actor's decision-making and degrading performance?	Lightly AI-edited
MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper introduces MemOrb, a method designed to enhance LLM agents through policy-level reflections in multi-trial scenarios. Additionally, the authors expand the ECom-Bench dataset by adding 77 new clothing-domain tasks, resulting in a total of 130 realistic multi-turn customer service tasks. 1. The idea of distilling multi-turn interactions into compact strategic representations is interesting and potentially useful for improving agent performance. 2. The paper provides detailed experimental analysis of success rate trends across different trial counts (Figure 3). 1. Experiments are conducted only on ECom-Bench, which includes 144 tasks. It remains unclear whether MemOrb generalizes effectively to other LLM agents benchmarks. 2. Compared with Reflexion [1], the methodological novelty of MemOrb appears limited or insufficiently highlighted. 3. The paper lacks detail on how the additional 77 clothing-domain tasks were constructed, including data sources, task diversity, or annotation quality. 4. As shown in Table 2, Doubao-Seed-1.6-Thinking-MemOrb surpasses Doubao-Seed-1.6-Thinking only when the number of trials exceeds 5. However, it underperforms for T1–T4, suggesting that the method is less effective in low-trial or single-pass scenarios, which are also common in practical settings. [1] Reflexion: Language Agents with Verbal Reinforcement Learning 1. The paper claims that MemOrb is motivated by tasks requiring stability and consistency. However, it is not entirely clear why multi-trial settings are important in the e-commerce domain. In customer service scenarios, users typically expect the agent to succeed in a single interaction, making the one-pass success rate more relevant than multi-trial performance.	Lightly AI-edited
MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service	Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The paper proposes a system for "policy-reflection distillation" for customer service. This system is based on emotion tagging and policy reflection modules. The paper correctly identifies some areas where existing LLM-based agents fail. The paper is not yet ready for detailed review. The main issues are as follows: -- Major writing problems that limit the clarity of the paper. See e.g. Figure 1 and its caption -- The paper contains almost no technical detail about the approach, beyond very limited pseudocode. This makes it mostly impossible to assess the technical relevance or methodological contribution of the paper -- The topic, while ML-related, is largely not relevant to ICLR -- The baselines are far too limited for the method to be properly evaluated -- Bibliography is nearly entirely just arxiv links, I assume automatically generated It would be great if the authors try to clarify the main technical contribution of the paper and its relevance to ICLR. I realize this is a lot to ask during a rebuttal phase but the paper does not seem quite ready to review without more technical detail.	Fully human-written
Projected Neural Additive Models as Universal Approximators	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This work attempts to provide a potential proof of the universal approximation property of PNAM through Theorem 2.1. Furthermore, by incorporating regularization constraints to enforce sparsity, as shown in Equation (9), the authors propose a possible approach to enhance the interpretability of PNAM. 1. It is helpful to provide a high-level overview or intuitive sketch before presenting the formal proof of Theorem 2.1. 2. The overall progression of the proof from Lemma A.2 to Lemma A.3 and then to Theorem A.1 is reasonable and clear, except for a few concerns I have raised (see Questions). 1. In the introduction, you should provide more details on how you define the interpretability of neural networks in your work. Does it refer to the significance of different input variables, the weighting parameters within the network, or another aspect? 2. In Section 2.1, which introduces the PNAM, more details should be provided about its underlying advantages and mechanisms. For example, why is a NAM needed beyond a standard neural network? How does it contribute to interpretability — by reducing the number of connections in a conventional NN, or in another way? In addition, how does PNAM enhance the expressiveness of NAM? Does the linear transformation from x to z play a beneficial role without compromising interpretability? 3. The mapping form of the PAM, as shown in Lemma A.1, appears to be more constrained than that of a conventional neural network, which could potentially reduce the hypothesis space for pattern recognition. How do you justify that the possible loss in expressiveness compared to a conventional NN is negligible or does not significantly affect performance? For example, can you justify why PNAM or NAM would not be more prone to underfitting compared to conventional neural networks? Otherwise, it is unclear why these new architectures are necessary, especially if their expressiveness is limited—even with a proof of universal approximation. 4. The notation of deg in Lemma A.2 should be explained to readers. 5. Even though the parameters may be non-unique, PNAM restricts the possible expressiveness to a subspace (as organized in Equation (A.3)) compared with a conventional neural network. How do you justify that the optimal expressiveness indeed lies within this subspace defined by PNAM? 1. In Equation (1), $\epsilon$ represents the polynomial approximation error. Can it be reasonably treated as noise (for example, assumed to follow a Gaussian distribution)? 2. Regarding Equation (7), the proof of Theorem 2.1 does not appear rigorous. What justifies the assumption that the $\epsilon$ serving as an upper bound in Equation (7) is the same as the $\epsilon$ in Equation (5)? The essential condition for this relationship to hold is that the general form $F_2(X)$ in Equation (5) and the specific form of $F_2(X)$ in Equation (7) is in the same continuous functional space. Specifically, for conventional neural networks, this existence holds because they span the full polynomial space. However, in this work, you have constrained the representation to a subspace; therefore, how do you prove that the existence result still holds under this restriction? 3. Should interpretability help reduce uncertainty? However, with respect to Equation (9), it does not appear to guarantee this key property, since we can not guarantee the "significance" values are not wide in range. In that case, can the model still be considered interpretable? 4. What are the variables related to $\mathcal{L}_P$? 5. A key condition connecting Lemma A.2 and Lemma A.3 is the statement that ``Lemma A.3 can be proved by expanding $F(x_1, x_2)$, which produces the same set of monomials as the product of $f_1(x_1)$ and $f_2(x_2)$ in Lemma A.2.'' However, how do you justify that this condition always holds, or at least holds to some extent? Otherwise, certain forms of $F(x_1, x_2)$ may lack the necessary expressibility, making Theorem A.1 inapplicable.	Lightly AI-edited
Projected Neural Additive Models as Universal Approximators	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper reviews Projected Neural Additive Models (PNAM) and targets to prove that this family can approximate any continuous target when given enough directions and flexible components. They also add practical tools for interpretability: encouraging the projection to be orthogonal and sparse, supporting monotonic/convex shape constraints on the one-dimensional components, and post-training symbolic compression of those components into compact expressions. Experiments on structured scientific problems demonstrate that PNAM outperforms a standard NAM and competes with other baselines at a similar capacity. The math formulation and demonstration are easy to follow, also aligning with the claimed approximation capability. The added practical interpretability tools (orthogonality/sparsity, shape constraints, symbolic compression) make it helpful to implement. 1. For Theorem 2.1, one clarification is needed: Is the universal approximation applied to an arbitrary continuous function? 2. From NAM to PNAM, it seems projection is what enables universal approximation here. If so, it may be clear and intuitive to show the mechanism, e.g., the projection mixes features and he 1-D parts then capture interactions that plain NAM cannot. 3. What’s the cost to get universality, like computation, memory, and sample complexity? As there is no free lunch. 4. While the claimed proof logic flow is easy to follow, the general idea needs some clarification. Why can this NN be universal at all? As formulated, the model abandons flexible raw-variable coupling, which is limiting. Is the point that the projection picks up the coupling among features into z, and then you keep separate 1-D MLP parameterizations on those directions? 5. Interpretability of NN, and NN for symbolic regression are mentioned as motivation. It's helpful to clarify where PNAM sits relative to other interpretable NN families like NALU (neural arithmetic units) and AI Feynman / SINDy / DSR, which are also stated in the paper. PNAM offers interpretable 1-D parts of a high-dimensional approximation, while NALU has exact function recovery? Please see Weakness.	Lightly AI-edited
Projected Neural Additive Models as Universal Approximators	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This study introduces a method to enhance the expressivity of neural additive models (NAMs) while allowing for interpretability. Specifically, it proves that the projected neural additive model (PNAM), which extends NAMs by applying a learnable linear transformation to the features before they enter the additive structure, achieves the universal approximation property. To address the reduced interpretability that arises from feature coupling after the linear transformation, the autors introduce regularization strategies that promote sparsity and penalize unnecessary interactions, enabling ranking and pruning of features and transformations. Finally, symbolic regression converts the pruned MLPs into mathematical expressions, further enhancing interpretability. ● Theoretical support for universal approximation. The paper makes a theoretical contribution by establishing the universal approximation property of PNAM, which had not been previously proven. ● Efforts to enhance interpretability. The framework employs both regularization and post-hoc techniques to improve interpretability by encouraging the use of only the necessary linear transformations of features. Moreover, converting the pruned MLPs into mathematical expressions enhances interpretability while reducing computational complexity. ● Reduced interpretability due to feature coupling. The learnable linear transformation entangles multiple features. This makes it difficult to attribute model behavior to individual inputs and limits interpretability. ● Computational complexity. The additional optimization required for regularization tuning and symbolic regression may introduce significant computational overhead. ● Dependence on hyperparameter choices. The performance and sparsity outcomes appear sensitive to the selection of projection dimension and regularization weights, requiring extensive tuning. ● Lack of synthetic experiments for interpretability validation. Synthetic experiments with known ground truth would enable a clearer quantitative assessment of interpretability, but such evaluation is currently missing. ● Limited evaluation of pruning and symbolic regression. Given the performance gap between the symbolic expression and the original MLP, it would strengthen the contribution to clearly justify why the derived symbolic form can still be regarded as interpretable and reliable. ● Could you clarify the column descriptions in Table 3? In particular, it would be helpful to explain the distinction between the “reported” and “ours” values in the evaluation of test accuracy, the reason for the large discrepancy between them, and the precise definition of “Total acc.”	Lightly AI-edited
Projected Neural Additive Models as Universal Approximators	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper introduces Projected Neural Additive Models (PNAM), an extension of Neural Additive Models (NAM) that achieves universal approximation capability by incorporating a linear transformation of inputs before processing them through independent single-variable neural networks. The authors establish the theoretical foundation for PNAM's universal approximation property using the Stone-Weierstrass theorem and propose regularization and post-processing techniques to enhance interpretability. Through experiments on mathematical knot invariants and phase field simulations, they demonstrate PNAM's competitive performance against MLPs and NAMs while maintaining better interpretability. The work is significant as it provides a balance between expressivity and interpretability, particularly beneficial for scientific domains where understanding model behavior is crucial. The paper presents several notable strengths. First, it provides a rigorous theoretical foundation by formally proving PNAM's universal approximation property using the Stone-Weierstrass theorem, addressing a key limitation of the original NAM. Second, the proposed architecture elegantly combines the interpretability of additive models with enhanced expressivity through input projection, representing a meaningful advancement in interpretable neural network design. Third, the comprehensive regularization framework—including weight decay, function value constraints, input coupling penalties, and sparsity promotion—effectively addresses the trade-off between model complexity and interpretability. Fourth, the post-processing techniques for feature importance ranking, parameter pruning, and symbolic regression conversion offer valuable tools for enhancing model transparency. Finally, the experimental evaluation on two distinct domains (knot theory and phase field fracture) demonstrates the model's versatility and provides convincing evidence of its competitive performance against relevant baselines. 1) While the paper presents a compelling approach, there are several aspects that warrant further consideration. First, the theoretical foundation primarily focuses on the universal approximation property, but lacks analysis of the optimization process. Specifically, there is no discussion of convergence guarantees for the Adam optimizer when applied to PNAM's architecture, nor is there an analysis of the convexity properties of the loss function with the proposed regularization terms. Additionally, the impact of the projection dimension M on the optimization landscape and convergence speed remains unexplored. 2) The experimental setup could benefit from more comprehensive details. The selection of regularization weights (w1-w5) is not adequately justified, as they are simply set to fixed values without sensitivity analysis. Similarly, the criteria for choosing key hyperparameters such as the projection dimension M and network architecture are not well explained. Moreover, the paper lacks information on computational requirements, training times, and resource consumption, which are important for assessing practical feasibility. 3) The scale and diversity of the experimental data raise some concerns. While the knot theory dataset is substantial, it only has 17 input dimensions, which may not represent high-dimensional challenges. The phase field dataset, on the other hand, is extremely small (only 96 samples), which may lead to overfitting and limit generalization. Furthermore, the absence of experiments on truly high-dimensional, large-scale datasets or standard machine learning benchmarks makes it difficult to evaluate PNAM's performance in more realistic settings. 4) The comparison with alternative approaches is somewhat limited. The paper primarily benchmarks against MLP, NAM, and KAN, but omits comparisons with other interpretable neural network methods and traditional statistical approaches like Generalized Additive Models (GAMs). Additionally, the comparison with symbolic regression methods is insufficient, especially given the emphasis on converting PNAM to symbolic form. 5) There is a lack of systematic parameter sensitivity analysis. The paper does not explore how the projection dimension M affects model performance and interpretability in depth, nor does it analyze the sensitivity to regularization weights, despite their critical role in balancing accuracy and interpretability. The impact of network architecture choices (depth, width) on performance is also not adequately addressed. Moreover, the evaluation of the quality and reliability of the symbolic expressions obtained through post-processing is limited, and there is no analysis of PNAM's computational complexity during training and inference. Please refer to the above questions. It's a fascinating insight for NAMs, while the proof seems largely similar to generalized additive models. I would be grateful and willing to raise my scores if the authors would solve my above concerns.	Fully AI-generated
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs	Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper proposes Dynamic Hierarchical Sparse Attention (DHSA), an inference-time, drop-in sparse attention module for decoder-only LLMs. DHSA first dynamically chunks the sequence via a learned boundary predictor, then builds length-normalized chunk representations, computes chunk-chunk similarities, upsamples these scores back to the token level, and finally selects Top-Nb token interactions for each query. The method targets on-device settings and reports LongBench accuracy competitive with dense attention but having lower latency. 1. This paper tries to tackle an important problem of how to improve the efficiency of LLM inference in long context by leveraging sparsity in attention. 2. A clear hierarchical routing formulation with a concrete sparse attention pipeline. The design and implementation details are explained well. 3. The paper reports accuracy improvements over existing static sparse attention baselines and lower latency over full dense attention. 1. Missing comparisons to other more recent dynamic sparse baselines. Current baselines are mostly static patterns on static template. 2. Missing upper bound analysis with oracle top-k baseline to show how close the number of tokens selected is to the optimal choice. Missing latency comparison with baselines other than dense attention. 3. Still not clear why dynamic chunking is needed if there is an accurate way to estimate the contribution of each chunk to the overall attention. 4. Not clear how the system performs under batching settings. The paper did a comprehensive analysis and evaluation with static sparse attention baselines, including StreamingLLM, MInference and Block Sparse, but misses important dynamic sparse attention baselines. For example, [MagicPig](https://arxiv.org/abs/2410.16179) uses LSH sampling to select tokens for attention computation dynamically. [Quest](https://arxiv.org/abs/2406.10774) exploits query-aware sparsity that keeps track of minimal and maximal key values in the KV cache and estimates importance based on queries. Without a comparison with these state-of-the-art sparse attention baselines, it is hard to fully evaluate the benefits of the proposed approach. It is not clear why dynamic chunking is needed, even though an ablation is provided. The ablation shows cases of DHSA without robust chunk representation and without dynamic and robust chunk representation. However, to demonstrate that dynamic chunking is indeed needed, it should further evaluate the case of DHSA with robust chunk representation and without dynamic chunking. Robust chunk representation is a normalized prefix sum for queries and keys in the chunk and should work independently of the chunk size selected. Also, I can imagine there are other ways to estimate the chunk similarity, for example, based on different clustering methods. However, the paper does not provide an evaluation of them other than the normalized prefix sum one. There are also some evaluations missing in the paper. For example, it should provide an upper-bound analysis compared with the oracle top-k baseline on the number of tokens selected. In addition, performance numbers for the batching scenario are not evaluated. 1. How does DHSA perform in terms of accuracy compared to dynamic sparse attention baselines under the same sparsity setup? 2. Can you give some intuitions on why the boundary predictor is designed as this? For example, why is the left and right window not overlapped? 3. Can you show a comparison of DHSA with robust chunk representation and without dynamic chunking as an ablation? Have you evaluated other methods that could estimate the chunk similarity other than the current approach? 4. Can you provide the evaluation performance of DHSA under the batching scenario? 5. How are the ratios and hyperparameters of the baselines selected? Can you provide latency comparison with baselines? Can you provide number of tokens selected in the optimal case? 6. In Table 2 under sparsity = 25\%, how can DHSA outperform dense attention on LongBench Synth by such a large margin? 7. In Table 3 why is more memory needed for DHSA? Is it for storing the model weights of the boundary predictor?	Fully human-written
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs	Soundness: 3: good Presentation: 3: good Contribution: 4: excellent Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposes a Dynamic Hierarchical Sparse Attention (DHSA) mechanism to make long-context inference efficient for large language models, especially when deployed on devices with limited memory and compute power. DHSA dynamically detects hierarchical attention boundaries and prunes redundant computations, adapting sparsity in real time based on token relevance. Unlike fixed sparsity or static compression approaches, it introduces a multi-level adaptive mechanism that balances local and global context retention. The method shows strong results—maintaining accuracy close to dense attention while significantly reducing latency and memory use across benchmarks such as LongBench and Needle-in-a-Haystack. The paper’s contribution is practical and well-aligned with the need for efficient, scalable, and on-device LLM deployment, demonstrating that dynamic hierarchical sparsity can effectively enable longer-context reasoning without sacrificing model performance. This paper introduces a technically elegant and well motivated solution to one of the most critical bottlenecks in modern LLMs efficient long-context inference. The proposed DHSA framework combines dynamic boundary detection with hierarchical sparsity prediction, achieving strong accuracy along with efficiency trade-offs across tasks such as LongBench and Needle-in-a-Haystack. Its design as a training-free, drop-in module makes it immediately applicable to on-device. The empirical results show consistent latency and memory gains while maintaining dense-attention-level accuracy. The presentation is thorough, with sound motivation, clear algorithmic exposition, and reproducible implementation details. Despite the contribution being incremental relative to recent dynamic sparsity and KV compression literature (e.g., MInference, H2O, PyramidKV), with limited theoretical grounding for why hierarchical chunking yields near-optimal sparsity prediction. The dependency on hyperparameter tuning for chunk size and sparsity budgets limits generalizability across architectures and devices. The method’s scalability beyond 100K context is mentioned but needs to be empirically validated. The experimental evaluation could be broadened with larger models or real-world application benchmarks (e.g., RAG or document retrieval tasks). 1. How does the method ensure that important global information isn’t lost when dynamically pruning attention? Can the authors show examples or quantitative evidence that key tokens are always retained? 2. The paper claims DHSA works well for on-device inference. Can the authors provide more details on the actual hardware setup or latency improvements in real deployment, not just simulated benchmarks? 3. DHSA is stated as “no retraining,” but boundary predictors are trained offline, though its a lightweight and doesn't touch the base model weights, but it’s not literally zero learning?	Fully AI-generated
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	Due to the quadratic nature of attention, prefill attention becomes the key bottleneck for long-context inference. Existing systems prune tokens based on heuristics or pre-chosen patterns, which limits accuracy. This paper proposes DHSA, which trains an MLP layer to dynamically predict the boundaries of token chunks and uses dot similarity to model interactions between chunks. The actual attention is then computed only on highly relevant chunk pairs. DHSA achieves good accuracy and speedups on long-context inference. - The dynamic partitioning of tokens is a novel and effective idea. - The accuracy evaluation results look promising. - DHSA requires training an MLP layer to predict chunk boundaries, making it harder to deploy than existing methods. - The efficiency evaluation is not very comprehensive. Thanks for submitting to ICLR 2026. The paper introduces an interesting idea of dynamically partitioning sequences to group similar tokens into the same chunk. However, I still have some concerns about the paper. - Firstly, since DHSA involves training, it would be more fair to compare it with other methods that also train a small predictor, such as DSA or SeerAttention. These should provide much stronger performance than the current baselines. Additionally, DuoAttention may also be a good baseline, especially at 12.5% or 25% sparsity. - Moreover, the upsampling process seems to violate the assumption of partitioning. Theoretically, similar tokens should already be grouped together, and we should expect clear boundaries between chunks. - Additionally, during MLP training, it is unclear what $ f_{MHA} $ represents. Which Q vector is being used in this computation? - Regarding efficiency, the evaluation is based on PyTorch implementations. FlashAttention might be a better baseline for fair comparison. It is also unclear how to efficiently implement block-sparse attention given the dynamic chunk sizes. - For inference, since DHSA treats all newly generated tokens as a single chunk, what happens in long-generation tasks? Will this chunk grow too large and degrade performance?	Lightly AI-edited
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper addresses the significant computational and memory costs (quadratic complexity) of the attention mechanism in long-context Large Language Models (LLMs), which is a major bottleneck, especially in resource-constrained environments. Key Contributions： * Dynamic Segmentation: DHSA first segments the input sequence into variable-length "chunks" based on the content itself. This is more adaptive than using fixed-size blocks. * Hierarchical Computation: It then computes representations for these chunks using a special "length-normalized" method to avoid bias from different chunk sizes. It calculates similarity scores at this coarse, chunk-to-chunk level. * Token-Level Upsampling: Finally, it upsamples these chunk-level scores to the token level to create an importance map. This map determines which fine-grained token-to-token attention scores are actually computed, preserving only the most impactful ones. * Efficient long-context handling: Matches dense attention accuracy while cutting prefill latency by 20–60% and peak memory usage by 35% at 8K context, and scales to 100K context on a single 24 GB GPU (where dense kernels fail). * Input-adaptive sparsity: Avoids rigid static patterns or heuristics; dynamically predicts attention sparsity via data-driven chunking and similarity, adapting to diverse tasks/inputs. * Easy integration: Functions as a drop-in module for standard decoder-only Transformers, requiring no retraining or architecture changes to the base LLM. * Robust chunk representation: Uses length-normalized aggregation to eliminate bias from variable chunk lengths, ensuring reliable similarity estimation. * Hyperparameter dependence: Its performance relies on hyperparameters like the number of chunks and preserved keys, whose optimal settings vary across models, tasks, and hardware, lacking adaptive allocation strategies. * Boundary predictor constraints: The boundary detector requires training on specific datasets (e.g., Long Data Collections) and may need adjustments for diverse text types, introducing potential generalization gaps. * Hardware adaptability limitations: While tested on NVIDIA GPUs, its performance on other hardware (e.g., CPUs, edge devices) is not evaluated, raising questions about cross-hardware applicability. * In the ablation study, DHSA without dynamic chunking degrades to standard block-sparse attention, showing the critical role of dynamic chunking. However, the paper does not compare DHSA with recent advanced dynamic chunking methods (e.g., context-aware adaptive chunking). How does DHSA’s chunking strategy perform relative to these methods in terms of segmentation accuracy and computational efficiency? * The boundary predictor uses soft labels derived from attention scores and focal BCE loss for training. If the base LLM itself has biased attention distributions (e.g., over-attending to trivial tokens), will this bias be transferred to the boundary predictor, affecting chunk quality? How to mitigate such potential bias propagation?	Fully AI-generated
F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search	Soundness: 2: fair Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposes F4-ITS, a training-free framework designed to improve fine-grained food image-to-text retrieval. The authors identify that general-purpose models like CLIP struggle with subtle distinctions in specialized domains like food. The proposed solution has two main components: - A common weighted sum multi-modal fusion strategy that combines a standard image embedding (from CLIP, etc.) with a text embedding of a "dense" description generated by a VLM, like Gemini 2.5. - A common feature-based re-ranking mechanism for top-k ingredient retrieval, where VLM-generated "sparse" ingredient lists are used to re-score an initial set of candidates based on maximum similarity. - The results are promising. The authors show that with the weighted sum fusion methods, it improve ~28.6% in top-k ingredient retrieval and ~10% in desnse caption retrieval. - Lack of novelty: The paper's main weakness is its limited novelty. The first key contribution is a weighted sum fusion method. This is a widely-known, basic ensemble technique [1]. The paper attempts to differentiate itself by focusing on the food domain, but this does not constitute a novel algorithmic contribution. - Unclear results: The table 1, 2, and subsequent tables do not specify the types of fusion methods (uni/bi direction fusion) employed to obtain the results. - Missing details: What's the prompt do you use for the Gemini and Gemma? - The comparisons are weak. In Table 1 and Table 2, the F4-ITS method is only compared against the baseline, w/o fusion. This is a weak argument. The authors should have compared their weighted-sum approach against other simple fusion methods mentioned in the related work, such as simple averaging or concatenation. More importantly, they didn't compare against any other training-free retrieval methods from related work, like PDV. [1] Liang, Paul Pu, Amir Zadeh, and Louis-Philippe Morency. "Foundations & trends in multimodal machine learning: Principles, challenges, and open questions." ACM Computing Surveys 56.10 (2024): 1-42. See weaknesses.	Fully human-written
F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposes the F4-ITS, a framework for Food Image-Text Search. This approach tackles two problems: 1) Single Image-Text Retrieval (Dense Caption Retrieval) and 2) Top-k Image-Text Retrieval (Sparse Ingredient Retrieval). The main idea is to fuse image and image caption features, and then retrieve the food text. The authors performed experiments on Food datasets and evaluated the framework's performance using different ViT architectures. This paper tackles an important task: Food Image-Text retrieval. This is important for downstream applications such as dietary monitoring, nutritional analysis, and so on. The framework is easy to understand. There is no technical innovation in the framework. Using CLIP image and text encoders to extract image and text features is widely used, and fusing them using weights is well-known. The CLIP model is relatively small. Larger models, such as BLIPv3, Qwen2.5-VL, should be used to test the performance of the proposed framework. It's better to explain what the "Dense Caption Index" is. I guess it's the input text information.	Fully human-written
F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper presents F4-ITS, a training-free framework for food image-text retrieval that mixes CLIP/SigLIP with vision-language models to link food pictures and words. It fuses features both one-way and two-way, then re-ranks answers for a better match. Experiments are conducted on MetaFood data to show its effectiveness. S1:The training-free nature of the approach gives the method a clear practical edge, avoiding the high cost of fine-tuning large models on domain-specific data. S2：The finding that lightweight fused models can rival their heavyweight counterparts is valuable for resource limited scenarios. W1:The key techniques include weighted fusion of image-text embeddings, using VLMs for caption generation, which is not novel. The contribution is primarily an engineering combination of existing methods applied to the food domain. W2:The evaluation uses only small subsets (13K and 15K samples) from the MetaFood Challenge datasets, which raises questions about generalization. No evaluation on other well-known food datasets (e.g., Food-101, Recipe1M) is provided. W3: In Equation 8, the weights (w_img=0.7, w_text=0.3) are set without ablation/parameter study. And in Equation 3, different weights are used for bi-directional fusion (w_img=0.3, w_text=0.7) without clear justification. W4: A critical yet unaddressed limitation is that the framework invokes a full VLM forward pass for every single query image, an operation that carries non-trivial latency and GPU-hour expense. No empirical analysis is provided. W5: There is no comparison results with the cited training-free methods (s PDV Tursun et al. (2025), TF-ZS-CIR Wu et al. (2025)) in related works. D1: Could you clarify how the fusion weights were set? Was a grid search conducted, or was another tuning strategy employed? How sensitive are the retrieval results to small changes in these hyperparameters? Finally, what motivates the choice of different weight values for uni-directional versus bi-directional fusion—does the shift reflect a fundamental difference in how each pathway contributes to the combined representation? D2：How does the pipeline behave when the VLM hallucinates an incorrect caption—e.g., mislabeling “green bell pepper” as “cucumber” or omitting a key ingredient? And how to deal with some potential biases in VLM-generated descriptions? D3:In Equation 11, what theoretical or empirical motivation led you to adopt max-pooling rather than an average or learnable weighted combination?	Lightly AI-edited
F4-ITS: Fine-grained Feature Fusion for Food Image-Text Search	Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	The paper proposes F4-ITS (Fine-grained Feature Fusion for Food Image-Text Search), a training-free framework designed to improve fine-grained cross-modal retrieval in the food domain. The method fuses image embeddings from pretrained CLIP-like models with dense or sparse textual embeddings generated by large vision-language models (VLMs) such as Gemini or Gemma. Evaluations on the MetaFood25 dataset show moderate gains across multiple ViT backbones. The technical novelty is minimal—the method mainly combines weighted feature fusion and cosine-based re-ranking, both well-explored in prior zero-shot retrieval works. Experiments are limited to a single dataset without generalization or robustness analysis, and key design choices (e.g., fusion weights) lack justification. Overall, the contribution is incremental and insufficient for a top-tier venue. - Applicable across ViT-B to ViT-bigG models; findings (e.g., smaller models benefit more) are interesting and practically valuable. - The experiments are clear and the results consistently show measurable improvements (e.g., +10% Recall@1, +28.6% mAP), demonstrating the empirical effectiveness of simple feature fusion under the food domain. - Technical novelty is very limited. The approach primarily relies on a weighted sum fusion of image and VLM-generated text embeddings, followed by cosine-based re-ranking — both are straightforward extensions of existing zero-shot fusion and retrieval strategies. - The paper repeatedly fixes fusion weights, there is no justification or ablation for the setting of $w_{img}$ or $w_{text}$. Similarly, other design choices (e.g., selection of Gemini/Gemma captions, reranking threshold) are not thoroughly justified. - Insufficient experiments. Experiments are restricted to MetaFood25; there is no evaluation on broader or unseen datasets such as Food-101 or Recipe1M. This limits the claimed generalization. No direct comparison with prior zero-shot or fine-grained food retrieval methods (e.g., fine-tuned CLIP variants) is presented. - The method heavily relies on VLM-generated dense/sparse captions, which might not generalize across domains. There’s no discussion of robustness under noisy or biased caption generation. - While the authors claim the system can generalize to other image-text retrieval domains, all evidence is food-specific, and the method lacks evaluation or discussion supporting such transferability. The generality of the proposed method is over-claimed. Please compare your results with the existing state-of-the-art works in the food retrieval task.	Fully AI-generated
A Unified Cortical Circuit Model with Divisive Normalization and Self-Excitation for Robust Representation and Memory Maintenance	Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper introduces a simple model that combines self‑excitated excitatory unit and shared inhibitory neuron. The authors show analytically that the system has a unique steady-state solution and can exhibit a continuous attractor when the self-excitation β exceeds the semi‑saturation constant η. They derive conditions for stability and show that with inputs the fixed point corresponds to normalized input ratios. Also really simple, this model has really interesting properties. The model is applied to a Random Dot Kinematogram denoising task and to both deterministic and probabilistic Wisconsin Card Sorting Tasks. In the RDK experiment, two excitatory units (representing left/right motion) with β = 2, η = 1 and τ = 50 ms denoise two noisy input streams; the authors report that the d′ improves from 0.20 to 1.67 and that the network retains the normalized ratio after the stimulus ends. In the WCST experiments, three units with the same parameters update beliefs about rules based on feedback and can switch rules within two trials or track probabilistic rules. I am giving the paper an overall 4 "marginally below the acceptance threshold" but will consider increasing that grade if the weaknesses are adresses. * Unified framework: The paper proposes a simple dynamical system that reduces to classical divisive normalization when β=0 and generalizes known recurrent normalization models for β=1, giving a nice mathematical link between normalization and attractor dynamics. The steady‑state derivation and identification of a transcritical bifurcation when β crosses η are clearly explained, and the continuous attractor analysis is analytically grounded. * Clarity of writing and figures: The model and its dynamics are well illustrated (Fig. 1 and Fig. 2). The tasks are described clearly, and the simulation results are easy to follow. Discussion and future directions acknowledge limitations and suggest spiking extensions. * Even when if goal is to illustrate a concept rather than optimize performance, you need to provide some justification or exploration of parameter choices because it speaks to the robustness and generality of the proposed mechanism. In your paper, the key results hinge on a few manually chosen values (for example, β = 2 and η = 1 with τ = 50 ms) in both tasks. Without showing how the system behaves when these parameters vary, readers cannot tell whether the ability to maintain normalized representations and perform inference is a general property of the architecture or a consequence of tuning to specific numbers. * Besides hyper‑parameter tuning, the paper lacks quantitative comparisons with alternative models. For instance, it does not show how the RDN’s denoising compares with classical divisive normalization (β = 0), recurrent normalization (β = 1) or attractor networks on the same tasks. 1. Can you provide baseline comparisons with β = 0 and β = 1, or with standard attractor/Bayesian models on your tasks? 2. How sensitive are the results to β, η and τ ?	Fully human-written
A Unified Cortical Circuit Model with Divisive Normalization and Self-Excitation for Robust Representation and Memory Maintenance	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper proposes a cortical circuit model with divisive normalization and self-excitation, attempting to unify noise resistance and information maintenance. The circuit model looks like an extension based on (recurrent) divisive normalization models (ref. 24-27) with a major revision of inserting a new parameter $\beta$ representing self-excitation (Eq. 1). The key contribution claimed by the authors is probability demonstrating that high-dimensional continuous attractors can emerge in the proposed extremely minimal model from the dynamical system perspective. The paper further uses the same circuit model architecture but with different parameters to implement a sensory perception task (Fig. 3) and the Wisconsin card sorting task (Fig. 4). - Analytical solutions of the nonlinear neural dynamics. - Utilize the circuit to realize two cognitive tasks (Fig. 3 and 4) ### 1. The neural dynamics is over-simplified due to the lack of recurrent excitation. ### 2. The study doesn't fully utilize the analytical tractability of a minimal model to understand the computational and algorithmic mechanism of neural circuits. The theoretical analysis of the circuit only focuses on the stability analysis of the simple model. In contrast, the comp-neuro field has developed more comprehensive theoretical analyses on more complex recurrent networks that include both recurrent connections across E neurons and static divisive normalizations. In addition, the recent development on Stabilized Supralinear Networks (SSN) by Ken Miller also yielded much deeper dynamical system understanding than the current paper, even if the SSN doesn't contain _explicit_ divisive normalization (but the supralinear inhibitory neurons activation function enables E neurons realize input-output curves similar to divisive normalization). Combined, I feel the stability analysis on the proposed minimal model doesn't provide a significant advance on circuit dynamical theories when compared with the papers mentioned above. ### 3.The disconnection of the theoretical analysis and the circuit mechanism of cognitive tasks in Figs. 3-4. While I appreciate the theoretical analysis in Sec. 2, Sec. 4 only uses numerical simulations to show the circuit model can realize the two cognitive tasks, without utilizing the theory in Sec. 2 to gain deep insight into the coding/algorithmic mechanism in the circuit model underlying the two cognitive tasks. This would be a waste of the analytical results of the proposed nonlinear circuit dynamics, and that's why I think the work is still in the preliminary stage. Since we have analytical results, we could ask many deep questions. For example, how do the circuit parameters like $\beta$ and dynamic divisive normalization affect the coding performance in the Fig. 3 task? What is the underlying probabilistic inference algorithm of the circuit in performing Fig. 4 task (variational Bayes or sampling), and/or how is the belief represented by neuronal population activities (firing rate proportional to probability or log-probability, or sampling-based representation)? The absence of these insights significantly limits the depth of this paper. In addition, considering that the two cognitive tasks have been extensively studied in neuroscience research, the paper doesn’t provide new (theoretical) insight into the circuit mechanism, nor a comprehensive comparison with existing studies of circuit mechanisms underlying the two cognitive tasks. I have no detailed question on the study. I would like to disclose that I also reviewed this paper submitted to NeurIPS 2025 earlier and accessed all questions raised by all reviewers. After browsing the ICLR version, it seems that there is no significant change of the text, even in the Discussion section.	Fully human-written
A Unified Cortical Circuit Model with Divisive Normalization and Self-Excitation for Robust Representation and Memory Maintenance	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The paper proposes a recurrent neural circuit model that integrates divisive normalization and self-excitation to jointly achieve robust sensory encoding and stable memory maintenance. The authors analytically show that, under appropriate parameters, the system forms a continuous attractor capable of denoising noisy inputs and sustaining representations after stimulus removal. Through mathematical derivations and two illustrative tasks—RDK (for perceptual denoising) and WCST (for flexible rule inference)—the model demonstrates that a single cortical microcircuit can perform both noise-suppression and working-memory functions. The authors argue that this unified framework bridges two canonical cortical computations and could inform the design of biologically plausible artificial neural networks. - The paper is clearly written. The motivation, model architecture, and experimental setup are all easy to follow. - The theoretical analysis is solid and seems correct. The derivations (steady states, bifurcation, stability) connect well with the model’s qualitative behavior. - The two chosen tasks (RDK and WCST) nicely illustrate how the same circuit can support both noise-robust encoding and persistent activity. - The figures are clear and effectively convey the model’s behavior. - The paper is well-grounded in existing computational neuroscience literature, and the link to canonical cortical computations (normalization and attractor dynamics) is thoughtfully made. - I’m not fully convinced by the motivation of a unifying model of perceptual denoising and memory maintenance. One could in principle pick any two cognitive computations and design a model to achieve both — why these two in particular, beyond the fact that they’ve been studied separately? - The AI relevance statements in the abstract and discussion feel overstated. The model is elegant but quite simple; the claims about implications for artificial intelligence seem premature. - The experiments are purely conceptual and don’t connect to biological or empirical data. That makes the “unified cortical circuit” claim somewhat speculative. - The “unification” mainly comes from adding recurrent excitation to a divisive normalization framework. It’s an interesting idea, but the novelty feels incremental. - The probabilistic inference task (pWCST) is more of a rule-tracking demonstration than a true implementation of Bayesian belief updating. The link is more conceptual than mechanistic. - Parameter choices seem somewhat arbitrary, and the paper doesn’t show how sensitive the results are to these parameters. - In the experiments, how are the R–G weights determined? Are they always set to 1, or adjusted per task? - What happens if $\beta$ is close to $\eta$ in your experiments? Will the model still be able to keep memory? - I'm not sure if I follow the experiment setup of the Bayesian inference experiment. For example, how are the (0/0.5/1) feedback inputs provided to the model? - It is known that RNNs can perform perceptual denoising and keeping working memory (e.g., see refs below) - how could you relate your model to RNN models? Maybe worth mentioning it at least in the discussion. References: - Masse, Nicolas Y., et al. "Circuit mechanisms for the maintenance and manipulation of information in working memory." Nature neuroscience 22.7 (2019): 1159-1167. - Song, H. Francis, Guangyu R. Yang, and Xiao-Jing Wang. "Training excitatory-inhibitory recurrent neural networks for cognitive tasks: a simple and flexible framework." PLoS computational biology 12.2 (2016): e1004792.	Fully AI-generated
A Unified Cortical Circuit Model with Divisive Normalization and Self-Excitation for Robust Representation and Memory Maintenance	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The paper here proposes the Recurrent Divisive Normalization (RDN) circuit model where the purpose is to bridge the gap between denoising capability of neural coding and the ability to maintain sustained representations in these neural coding, considering that these problems are considered to be of separate domains. The author shows that under certain parameter conditions, the model can show normalized representation of the input while also capable of maintaining persistent representation of the input when the input is withdrawn. The author performed a model analysis to show the stability of the model with/without input hence showing the model maintaining persistent representation. The author performed random dot kinematogram (RDK) task, and on both classical and probabilistic versions of Wisconsin Card Sorting Test (WCST) to show the model’s robustness with noise. It is interesting that the author showcases a model that is capable of being both an attractor network and a normalisation model, hence capable of sustaining persistent activities of representation while also being capable of denoising in a specific regime. Moreover, it is nice to see that there is some analytical support done for the persistent representations, as well as cross-demain demonstrations of the models’ normalization. I do feel like the abstract is lacking in describing what the key contribution is. It feels lackluster and I can only understand this as I go further into the paper. Grammar/ Structuring/ Spelling error needs to be fixed throughout. Please revise the paper for these kinds of errors. Some examples to check: Line 70-71: “ but also forms a continuous attractor that persistently maintains those representations after input withdrawn.” Line 48: “While there exists different perspectives” Line 51-52: “Bayesian approach consider the neuron population” Line 94: “β = 1 lead to a model” Line 306: “2-Alternative Foice Choice” Do define what and how the 3 excitatory units for WCST are done. It is pretty general in mentioning just shape, number and color. Although these experiments are given, there are lots of other models that could be compared to this that haven’t been mentioned. E.g. attractor networks (either discrete, continuous), classic or soft winner take all models, etc. Could experiments to further analyse these differences be done for the model? The number of classes (N) is small in the current experiments, i.e. RDK (N=2) and WCST (N=4). Could you evaluate larger N to assess scalability (e.g., 8, 16, 64) and whether denoising and persistence still hold? Figure 2 is confusing for beta = eta. Although the black dot is supposed to be an unstable fixed point, this seems to be converging towards the black dot. Could you explain this part? Questions for the author has been addressed alongside the weakness in the weakness section. Please check the weakness section for this.	Fully human-written
Physically Valid Biomolecular Interaction Modeling with Gauss-Seidel Projection	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper introduces a physically constrained diffusion framework for biomolecular structure generation that enforces local geometric and energetic constraints while maintaining structural accuracy. The method integrates a Gauss–Seidel based differentiable projection that iteratively enforces constraints to ensure the physical feasibility of generated structures. The framework supports backpropagation through conjugate gradients, enabling stable gradient-based training and inference. Empirical results demonstrate improved geometric validity, structural stability, and fast inference across complex biomolecular benchmarks. - The motivation is clear and well grounded in the need for physically valid biomolecular generation. - The method enforces atomically realistic outputs compared to unconstrained baselines, a practically important contribution. - The algorithmic formulation is sound. The iterative Gauss–Seidel projection and penalty method are mathematically principled and differentiable. - The PoseBusters results show a notable drop in docking-related metrics, but the paper does not analyze the cause, possibly due to a trade-off between hard constraint enforcement and ligand flexibility. - Some evaluation metrics are missing. Including measures such as the PoseBusters-valid success rate, TM-Score, and iLDDT (in Table 2) would make the empirical validation more complete and convincing. 1. Since the Gauss–Seidel scheme guarantees uniqueness only for each linearized subproblem, can different orderings or initialization seeds lead to non-unique final projections? Have the authors observed multiple feasible solutions in practice? 2. The drop in docking-related metrics on PoseBusters is notable. Can the authors provide analysis or ablation evidence on whether this degradation stems from the specific constraint choices or the limited number of denoising steps? 3. Why is metrics such as iLDDT excluded from Table 2? Including complementary metrics such as TM-Score or PoseBusters-valid success rate would make the evaluation more comprehensive. 4. Could the authors clarify how $\alpha$ is chosen in practice, and whether convergence depends sensitively on this value?	Fully AI-generated
Physically Valid Biomolecular Interaction Modeling with Gauss-Seidel Projection	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This paper tackles the critical issue of physical invalidity (steric clashes, distorted geometry) in structures generated by deep learning models for biomolecular interactions, particularly all-atom diffusion models. The authors propose a method to enforce physical validity as a hard constraint during both training and inference. The core idea is a differentiable projection module that maps the provisional atomic coordinates generated by a diffusion model onto the nearest physically valid configuration . This projection is efficiently implemented using a Gauss-Seidel iterative scheme, exploiting the locality of physical constraints . Crucially, the module is made differentiable via implicit differentiation, allowing it to be integrated seamlessly into existing diffusion frameworks (like Boltz-1) for end-to-end finetuning . A key result is that incorporating this module enables the generation of physically valid and structurally accurate complexes using only 2 denoising steps, achieving accuracy comparable to 200-step SOTA baselines while offering ~10x speedup and guaranteeing validity . The method is evaluated on six diverse benchmarks against strong baselines. - Generating physically plausible structures is a prerequisite for the reliability and utility of biomolecular models. This paper directly confronts the common failing of deep generative models in this regard , offering a principled solution. - The use of the Gauss-Seidel method for the projection step is well-suited for the problem, leveraging the local nature of physical constraints (bond lengths, angles, clashes) for fast and stable convergence compared to methods like gradient descent. - Making the iterative Gauss-Seidel solver differentiable via implicit differentiation is a key technical contribution, enabling end-to-end training and allowing the diffusion model to adapt to the projection step. This integration is crucial for maintaining high accuracy, as shown in the ablation study. - The evaluation of the Protenix baseline appears to be an underestimation. According to the Protenix technical report, as well as anecdotal user feedback, its performance at 200 steps (e.g., in terms of DockQ and validity metrics) is reportedly not as low as presented in this paper. I recommand the authors to either provide the detailed configuration (config) files used for their Protenix experiments or, preferably, release the raw prediction files generated by their baseline models to ensure a fair and reproducible comparison. - Regarding the Protenix mini tech report, the original authors claim that 'increasing the ODE sampling steps to 10 effectively mitigates this issue (clash)'. Given this, the comparison in Table 1 might be suboptimal. I strongly suggest that the authors re-evaluate the baseline using 10 ODE steps in Table 1, as this seems to be the recommended setting for mitigating structural clashes. - The paper states that GS projection achieves good physical constraint effects within just 2 steps. This raises a question: why did the authors not experiment with applying the projection for more steps? Furthermore, the results in Table 1 indicate that sampling for 2 steps with constraints still underperforms the original 'boltz2' baseline. The comparison is incomplete. The authors should also include experiments comparing their method against baselines like 'boltz2' when restricted to a similar small step budget (e.g., 10 steps) for structure prediction. - Although the 2-step process is much faster overall, the paper could provide more details on the computational cost (time and memory, forward and backward) of the Gauss-Seidel projection module itself, especially as the system (complex) size grows larger. The backward pass involves solving a linear system using CG, which could become expensive. - Can the authors provide metrics (e.g. RMSD) quantifying the magnitude of coordinate changes introduced by the projection step? Are there cases where projection significantly alters key interface interactions or secondary structure elements?	Fully AI-generated
Physically Valid Biomolecular Interaction Modeling with Gauss-Seidel Projection	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper addresses the physical validity challenge in generative modeling for protein structures. It proposes a first-order constraint and enforces it during both fine-tuning and inference processes. The constraint is formulated as an optimization problem, which is approximately solved using Gauss-Seidel updates. Additionally, the paper introduces implicit differentiation to enable backpropagation during fine-tuning. Compared to baseline methods, this approach achieves faster inference, requiring only 2-step denoising, while maintaining competitive accuracy. 1. The manuscript is well-written, and easy to follow. 2. The performance in terms of inference acceleration is impressive. 3. The proposed method is reasonable and offers valuable insights to the field of AI4S. In many scientific applications, strict physical laws must be upheld. When training data is limited, models often struggle to learn these constraints in a data-driven way; thus, explicitly enforcing the constraint is a reasonable and effective solution. 1. While the main focus of the paper is on inference performance under a small number of sampling steps, I believe it would be beneficial to also include results under larger-step settings. This would provide a more comprehensive understanding of the method's performance when the computational budget is less constrained, and offer a clearer comparison with existing approaches in such scenarios. 2. It would be beneficial to discuss the relationship between proposed method with reinforcement learning (RL)-based fine-tuning approaches for diffusion models (e.g., DPOK, DRaFT). Compared to inference time guidance methods, they also involve fine-tuning to optimizing the reward. I’m curious about how the fine-tuning cost of RL compares to that of your iterative optimization algorithm. 3. How is the scalability of your method? Since your method involves enforcing constraints on all substructures, as the complexity or size of the system grows, does the method remain computationally feasible during training and inference? 4. Alphafold3 is an important baseline in this domain and is suggested to be included in experiment section for a more complete comparison. Additionally, the evaluation could benefit from incorporating more comprehensive metrics such as TM-score and pb-valid. 5. To facilitate reproducibility and enable further research based on this work, I strongly encourage the authors to release the code (such as in an anonymous form during the review period). 1. In Equation (2), Is there a redundant factor of 0.5 in the definition of E(x)? It seems to be inconsistent with Equation (3). 2. The iterative linearization of eq.4 (Theorem D.1) relies on multiple approximations (replacing K by I; assume g = 0). Could you please discuss how these approximations affect the convergence speed and solution quality of the optimization problem? Maybe by numerical simulations or tests on real protein datasets) could help validate the affect of these approximations. 3. Could there be additional textual description to clarify the "invalid" structures in Figure 4? It is not always obvious which structural issues are presented: Are all cases attributed to atomic clashes, or are there other problems (e.g., bond length violations, steric hindrance)? 4. How does the method perform without relying on the Protenix-mini sampling strategy and compare it to Boltz-1? 5. While the paper emphasizes reducing the number of sampling steps from 5 to 2, it is unclear why this results in a wall-clock time reduction significantly greater than 2.5× in Figure 5 (right) when compared to Protenix-mini.	Fully human-written
Emergent Chess Skill Acquisition in Large Language Models	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This paper studies how language models acquire chess-playing abilities when trained on algebraic chess notation. The authors introduce a custom disambiguation-aware tokenization scheme and train models of varying depths on datasets. The paper reveals an approach similar to curriculum learning, with rule comprehension emerging early and higher-order abilities following later. - The motivation of the paper is sound. - The paper is well-structured with clear method descriptions and results presentation. - The paper is titled with "Large Language Models." However, the maximum size of the models trained in the paper is 100M parameters, which is relatively small. - As mentioned in Section 5.3, evaluations used only 10 games per configuration, which may limit the robustness of the proposed method, especially for cases like sacrifices or complex tactics. - There's no analysis of how the custom tokenization scheme impacts learning compared to other alternatives. NA	Lightly AI-edited
Emergent Chess Skill Acquisition in Large Language Models	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	Using chess as the research domain, the study examines how models acquire various chess skills from scratch. Lower-level skills, such as making legal moves, are learned early in training, whereas higher-level strategies, such as sacrificing pieces, are only acquired in the later stages. Provides a detailed characterization of skill acquisition during the model’s training process. 1. I am not an expert in explainable AI! 2. I find the article’s conclusion quite obvious: higher-level skills are learned later in training. This is predictable and does not provide the reader with additional insights. I suggest the authors focus on discussing how the existing findings in the paper can inform better strategies for training models. see weakness	Moderately AI-edited
Emergent Chess Skill Acquisition in Large Language Models	Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper investigates chess skills in decoder-only transformer models trained from scratch on algebraic chess notation. The authors focus on the training dynamics and developmental trajectory of these skills, rather than final performance. They systematically vary model depth (5 to 25 layers) and the training data distribution (a balanced dataset vs. a white-win-only dataset). Using a custom, disambiguation-aware tokenization scheme, they analyze the emergence of three hierarchical levels of competence: rule comprehension, tactical execution, and strategic planning. The paper concludes that chess provides a valuable, interpretable benchmark for studying how structured, hierarchical reasoning emerges in language models. The dynamics of skill acquisition rather than just end-state performance is interesting. The study is well-designed varying variables: architectural depth and data distribution. The evaluation is good, moving beyond simple win rates or Elo. The current evaluation protocol appears to test the models as the White player. It would be beneficial to clarify if any experiments were conducted with the model playing as Black. There seems to be a slight inconsistency in the evaluation methodology that I would appreciate clarification on. Rule comprehension is measured based on unconstrained generation, whereas the strategic evaluation uses prefix-constrained decoding to enforce legality. Could the authors explain the rationale for this dual approach? I wonder if this might decouple the model's strategic choices from its internal rule knowledge, potentially affecting the interpretation of the strategic metrics for shallower models that have not yet mastered legality. The paper mentions that the training data was filtered to include games between 80 and 200 plies. Could the authors elaborate on the justification for this specific range? The custom disambiguation-aware tokenization scheme is an interesting feature of the methodology. Could the authors explain why this hand-engineered approach was chosen over standard, data-driven subword tokenization methods like BPE? Please refer to the weaknesses	Moderately AI-edited
Emergent Chess Skill Acquisition in Large Language Models	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper studies how language models acquire chess skills when trained on algebraic chess notation. By introducing a disambiguation-aware tokenization scheme and train models of varying depths (5-25 layers) on different datasets to study the emergence of capabilities. They observe clear developmental patterns: shallow models struggle with move legality, while deeper models develop tactical and positional understanding. Models trained on balanced game outcomes consistently outperform those trained only on white-win games. - The paper is well-organized and clearly written. - The intuition of this paper is great. - The largest model studied (25 layers, ~100M parameters) is relatively small by current standards. It's unclear if the observed patterns would hold at scales of billions of parameters. - The paper doesn't compare performance against purpose-built chess engines. This makes it difficult to assess overall performance compared to other methods. - The paper lacks information about the computing resources needed for training. - The paper lacks cast studies. Please refer to the "Weaknesses" section.	Lightly AI-edited
UKAT: Uncertainty-aware Kernel Association Test	Soundness: 3: good Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This paper introduces the Uncertainty-aware Kernel Association Test (UKAT), a framework for statistical testing that explicitly incorporates per-observation measurement uncertainty. The authors argue that standard tests ignore this valuable information, leading to reduced statistical power. ### Method UKAT's core innovation is to treat each observation not as a single value, but as a probability distribution characterized by its mean ($X$) and uncertainty ($U$). It represents each observation as an augmented data point $\Theta = [X, U]$, assumed to be Gaussian. The framework uses a distance metric on these points, equivalent to the Wasserstein distance for Gaussians, to construct a kernel for the Hilbert-Schmidt Independence Criterion (HSIC) test. ### Applications Real-world applications demonstrate UKAT's ability to uncover novel insights. It detected significant behavioral changes in LLM responses based on self-reported confidence that accuracy-only tests missed. In astronomy, it identified potential systematic biases in exoplanet data by finding associations within measurement errors. The idea is quite simple to understand and the exposition of the idea was straightforward. These are the main weakness of the paper: 1. Limited methodological novelty. The core technical proposal is to concatenate an observation’s mean and uncertainty into a two-dimensional vector and apply the standard HSIC test. This represents a straightforward application of an existing statistical tool to augmented inputs. The connection to the Wasserstein distance for Gaussians appears to serve mainly as an interpretation rather than a design principle, and the work does not introduce a new kernel, test statistic, or theoretical framework. Overall, the methodological advance is limited. 2. Narrow Scope and Restrictive Assumptions. The reliance on a Gaussian assumption to make the link to Wasserstein distance undermines the primary advantage of HSIC as a non-parametric test, making the approach less flexible than claimed. 3. Confusing experiments. The experimental results only go up to n=50, and does not provide a sample size power plot, nor does it consider size plots for larger sample sizes. The paper also doesn't elaborate what the AUC metric does. Further it's potentially misleading to label HSIC applied to (mean, std) data pairs as UKAT-C, it should be considered a baseline method. In its current form, the manuscript presents a simple idea without the necessary methodological depth, novelty, or rigorous comparison to established alternatives to be considered a significant contribution. 1. The AUC metric is unclear, can you elaborate exactly how it is calculated? 2. Why do you only consider n=50? What happens at larger n? 3. Why is HSIC with (mean, std) called UKAT-C? It appears to be a direct application of HSIC to this type of data and should be considered a baseline 4. The standard T-test seems to be doing perfectly fine in terms of power and size? Can you construct a more convincing case when the standard T-test breaks and your proposed method works much better? 5. Can you provide a density plot of the data you're testing? It would help understanding what exactly you're testing for.	Lightly AI-edited
UKAT: Uncertainty-aware Kernel Association Test	Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	A new test of dependence, incorporating uncertainty. Seems like it works good. I am really not that sure about some critical things. My name is Joshua Vogelstein, I’ve written many papers on two-sample testing, including my favorite one on the topic, which is relevant (because it also leveraged ranks), https://elifesciences.org/articles/41690. Of note, we implemented this in SciPy, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multiscale_graphcorr.html I like the idea of this paper, but I am confused about a few things. 1. How is it that we are observing or measuring uncertainty? Is the idea that somehow we directly have an estimate of uncertainty, without multiple samples. In our work, we are often faced with multiple samples per subject, eg, we have 50 subjects, each sampled 2 times. So, we can get an estimate of the variance from those 2 observations. Is that what you have in mind? If so, why not just use all the observations, rather than use them to estimate uncertainty? This is a fundamental misunderstanding that I have, which makes evaluating the paper quite difficult for me. 2. There are lots of ways to model uncertainty, only estimating the variance is one option. The simulations seem to assume this is a good option. I wonder what happens when this is not a good option, eg, the uncertainty is bimodal. 3. When you write “AUC”, you mean area under the which curve, power? Assuming what null and alternative? And assuming alpha = 0.05? 4. In the figures, you don’t compare to just ignoring the uncertainty, eg, just running HSIC, or MGC, etc. That makes me wonder. 5. If we have a distribution, sure, we can use a 2 parameter estimate of the distribution, but we could do other things, eg, a 2-bin histogram, or a k-bin histogram. I wonder about such options. 6. I don’t understand the LLM experiment. Did the LLM give you a numerical estimate of its standard deviation? If not, how did you estimate it? 7. I don’t really understand Fig 1, and Fig 2 did not do much for me. I’d rather have more text/pseudocode on the alg.	Fully human-written
UKAT: Uncertainty-aware Kernel Association Test	Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper proposes a new dependence test: this is a statistical test aimed to determine whether two random variables $X$ and $Y$ (observed through their joint realizations) are statistically dependent. Unlike typical existing dependence tests, the new test is “uncertainty-aware” in the sense that each realization of $X$ is allowed to be accompanied by an uncertainty measure. For instance, this can be a standard deviation (associated with one realization, not the full distribution of $X$). The new test is built on the well-known kernel-based HSIC test. The direction this paper aims to tackle (i.e., accounting for uncertainty on each realization when doing a statistical test) is technically interesting. The approach essentially views each realization as a distribution, which is a rather unusual view (in a positive way); though, it is not the first work to do so. I think tackling this problem can lead to significant development down the line in the future. At a high level, the paper is easy to understand (though several technical details are missing). While the goal of treating each observation as a distribution in a dependence test is technically interesting, the paper falls short of what is expected of a statistical test in a number of ways. To briefly provide a few examples, the paper does not mathematically precisely describe how $u$ (the uncertainty measure) and $x$ (a realization of $X$) are related. There is an implicit assumption but the assumption is not sufficiently articulated.Secondly, it is unclear for what kind of joint distribution (that jointly generates $(x, y, u)$) would the proposed test provide a consistent result. This is a natural theoretical result expected from a new statistical test since it will clearly define the class of distributions that the test can work. Without this result, given a problem, it is unclear whether the proposed test is applicable. No such result is provided. Further the new test builds on top of the well-known HSIC (Hilbert-Schmidt Independence Criterion) dependence test of Gretton et al. The present work proposes to use two positive definite kernels of a particular form with HSIC, limiting the novelty. More discussion points and specific questions are given in Questions. Q1: Standard HSIC operates on two random variables $(X,Y)$ following some (unknown) joint distribution $P$. The proposed test in this work operates on three random variables $(X, U, Y)$, where $X$ and $U$ are univariate variables, and $U$ represent some kind of uncertainty measure on $X$. Question: What is the assumption on the relationship between $X$ and $U$. This is an important point that is never elaborated precisely. In Sec 3, at L166, > We therefore characterize each observation as $N(x_i , u_i^2 )$. Is this just an example, or an assumption? If it is an assumption, it must be stated more clearly. Does this mean $X$ follows a Gaussian distribution? Or does a realization $x_i$ act as the mean of another random variable? If so, what is that random variable? Q2: Following Q1, with the normality assumption, what happens in practice if $u$ is a standard deviation for $x$, but $x$ does not follow a normal distribution? This is an important point that is not discussed sufficiently. In practice, it is highly unlikely that the normality assumption would hold in general. Q3: Theoretically, for what kind of joint distribution $P$ (that generates $(X, U, Y)$) would the test provide a consistent result? To be more concrete, for instance, what is the assumed factorization form of $P(X, U, Y)$? If the three variables are independent (i.e., $P(X,U,Y) = P(X)P(U)P(Y)$), would the test be able to control the type-I error, for instance? What about other less trivial forms of factorization? I would like to see this kind of consistency statement: > For $P \in $ (Some Distribution Class), under the alternative hypothesis $H_1$, the new test gives a test power of 1 as the sample size goes to infinity. What is “(Some Distribution Class)”? This point is related to Q2. It is important to understand the scope that the new test can apply to. Q4: On a related note, at line 160, > Under the null hypothesis, the distribution of distributions is independent of other covariates. Could you please write down mathematically what the null hypothesis $H_0$ is? This is never stated mathematically. And what is the alternative hypothesis $H_1$? Q4.1: If $X,Y$ are independent, is it possible that the presence of $U$ can result in a false positive (i.e., reject $H_0$ when it should not be rejected)? Q4.2: The other way. If $X, Y$ are dependent, is it possible that the presence of $U$ can result in a false negative? Q5: Sec 3.2, L201, > RBF and Laplacian kernels are also universal but yield uncalibrated p-values… Could you please precisely describe what this means? Do you mean, with a wrong kernel choice, the new test can give an uncontrolled type-I (false positive) error? If so, then this is a big problem. The existing HSIC test at least has a well-controlled type-I error for any kernels under $H_0$; though, it may have very low test power under $H_1$ if inappropriate kernels are used. Owing to the above concerns, I cannot give a strong recommendation.	Fully human-written
UKAT: Uncertainty-aware Kernel Association Test	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper introduces UKAT, a framework that incorporates uncertainty into independence testing, which is often ignored by traditional statistical tests. UKAT represents each observation not as a point $X$, but as a distribution $N(X, U^2)$, where U is measurement uncertainty. The core idea is to use the Wasserstein distance between these distributions, which simplifies to a Euclidean distance on $\Theta=[X, U]$, to construct an energy kernel. This kernel is then used within the standard HSIC framework to test for associations. Simulations and applications demonstrate that UKAT achieves higher statistical power than traditional tests while maintaining proper Type I error control. 1. The idea of embedding uncertainty directly into hypothesis tests/kernel methods represents an interesting direction. 2. The writing is generally clear and well-structured. I appreciate the figures. 3. The proposed UKAT framework is conceptually simple yet intuitive. 1. There is no new theorem or substantive analytical insight beyond restating existing kernel-independence theory under a Gaussian-uncertainty parameterization. As such, the paper’s theoretical contribution appears limited. 2. The proof of Proposition 3.1 seems incorrect. The author stated that the universality of the proposed kernel follows from the fact that k is characteristic and translation-invariant. However, k is not translation-invariant, therefore fails to establish universality as claimed. 3. The paper is framed as a general association/independence test, leveraging the HSIC framework to detect arbitrary dependencies. However, the entire simulation study fails to test it. Instead, two special cases: detecting differences in group means and group variances. Therefore, the paper provides no evidence that the proposed energy kernel outperforms simpler tests for general association testing. 4. There is no formal theoretical grounding for the robust variant UKAT-R. 5. If I understand correctly, UKAT essentially implicitly reweights samples using the uncertainty estimates contained within the dataset as prior information. Therefore, its practicality heavily depends on the quality of these uncertainty estimates, which are often untestable or unverifiable. In real-world scenarios, this leaves us uncertain about when this approach can be reliably applied. Furthermore, the baseline lacks adaptive kernel-learning or data-driven reweighting methods [1-4]. If such adaptive strategies can be learned without explicit uncertainty priors, the practical significance of UKAT would be substantially diminished. References: [1] Liu et al, Learning Deep Kernels for Non-parametric Two-Sample Test. ICML 2020. [2] Ren et al, Learning Adaptive Kernels for Statistical Independence Tests. AISTATS 2024. [3] Xu et al, Learning Deep Kernels for Non-Parametric Independence Testing. Arxiv. [4] Li et al, Extracting Rare Dependence Patterns via Adaptive Sample Reweighting. ICML 2025. 1. Is there any discussion/experimental evidence why RBF and Laplacian kernels yield suboptimal power?	Lightly AI-edited
EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer	Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper introduces EchoMotion, a new framework designed to solve a critical problem in video generation: the synthesis of complex and kinematically plausible human motion. The authors argue that existing models, trained on pixel-only objectives, prioritize appearance fidelity at the expense of learning the underlying physical principles of human articulation, leading to anatomical artifacts and unnatural movements. To address this, EchoMotion's core idea is to model the joint distribution of video (appearance) and 3D human motion (kinematics), rather than just the video distribution conditioned on text. MVS-RoPE is proposed as a unified 3D positional encoding for both video and motion tokens and establishes an inductive bias for video-motion temporal alignment. A large-scale dataset HuMoVe with 80,000 video-motion pairs is constructed for training and achieves better human-centric video generation results. 1. The paper clearly identifies a fundamental weakness in current human-centric video generation models for kinematic correctness and proposes to explicitly model the joint distribution of video and motion as a strong inductive bias to enhance the video generation performance; 2. The MVS-RoPE design is clear and well-justified to the non-trivial problem of aligning modalities with different temporal resolutions. 3. The creation of the 80,000-pair HuMoVe dataset is a substantial contribution to the field. The lack of large-scale, high-quality, paired video and 3D motion data has been a major bottleneck. 4. The experiments are thorough and well-designed. 1. The paper does not provide a clear description of the specific "open-source datasets, movies, and the internet" used to build the HuMoVe dataset. Furthermore, the extracted motion could be noisy as the ground truth; 2. The framework's reliance on the SMPL model as its parametric motion representation creates an inherent bottleneck for fine-grained realism. SMPL is a whole-body model that offers very limited, or no, supervision for highly articulated and expressive areas like individual hand gestures and facial expressions. 3. Is the strong inductive bias harmful for those physical disabilities or significant bodily variations, such as amputees, as the underlying parametric model does not support this topology. I believe this paper is substantial, demonstrates improved results, and serves as a positive contribution to advancing the field of controllable video generation. Please refer to the weaknesses to further improve this paper.	Fully human-written
EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer	Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	Inspired by VideoJam , this work establishes a model for the joint distribution of video and motion. It explicitly denoises parametric motion and performs text-to-video & motion generation. The results demonstrate an improvement in motion smoothness and human evaluation scores compared to the baseline (Wan video ). Originality: It designs and establishes a modeling framework for the joint distribution of video and motion. Quality: The quality is acceptable. Clarity: The paper is well-structured and clearly articulated. Significance: 1. This work proposes a solution for modeling the joint distribution of video and motion. 2. Community Contribution: The authors commit to open-sourcing their code, which will be a valuable public resource for advancing the field 1. Limited quantitative experiments: The paper only compares results with its base model using metrics that are not specialized for human motion. It lacks comparisons with closed-source models like Kling or Veo3 (it doesn't necessarily need to surpass them, but at least show the gap with SoTA models). The evaluation metrics are not focused on human motion. 2. Lack of necessary ablation studies: The effectiveness of the video-to-motion and motion-to-video capabilities is unknown, as no quantitative results are provided. This is crucial for validating the joint distribution modeling. Furthermore, there is no ablation study for the complex training process . 3. The visual quality demonstrated in the supplementary materials is still subpar. There are instances of impossible human poses, and the characters' hands are very blurry. 1. In the supplementary materials, specifically in sample 6 (especially the last frame) and sample 15, some very unnatural or incomprehensible human poses appear. What are the possible reasons for this? 2. As mentioned in the paper, EchoMotion can perform video-to-motion and motion-to-video tasks. Could you provide quantitative metrics to demonstrate the performance of these tasks? Specifically, for motion-to-video, could you compare it with models like Champ , Animate Anyone, or WanAnimate (since its base model is also Wan video)? 3. The VBench metrics used in the comparison are not specifically designed for human motion. Would it be possible to compute an FID (Fréchet Inception Distance) on the generated SMPL motion parameters? 4. It is suggested to also include comparisons with closed-source models, such as Kling, Veo3, etc. 5. If you were to use SMPL-X as the motion representation instead of SMPL, would this lead to an improvement in the representation of hands? 6. Could an ablation study be provided for the complex training process described in Section 3.2 ?	Lightly AI-edited
EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This work proposes EchoMotion that accepts both video and human motion modalities for video generation with a mixed multi-modal in-context learning strategy during the training. This work also introduces a new human-centric video dataset HuMoVe that includes paired video, 3D human motion parameters, and text data. 1. This work proposed a Dual-Modality DiT architecture that accept input and output with different modality. 2. This proposed Motion-Video Synchronized RoPE is an interesting idea to add motion information to the model. 3. This paper proposed a new high-quality dataset for video, human motion and text. 1. The novelty of the Dual-Modality DiT and Motion-Video Synchronized RoPE is limited. The notion of multi-modality DiT is not new and the idea of adding motion information is well studied in human mesh and skeleton generation tasks. 2. There are only baseline model results of Wan-1.3B and Wan-5B which are not enough to give accurate evaluation of the proposed architecture. 3. There is no ablation study to show the effectiveness of each proposed block in the architecture. 4. The model efficiency evaluation can add metrics like average generation fps for a more direct comparison. 1. Why is there no video tuning result for Wan-1.3B at Table 1? 2. I saw there is a parameter named FPS in Table 4. Is that the FPS of the input video or something else? 3. Section 4.3 mentions that EchoMotion can operate bi-directionally. Can you provide quantitative results for this part to see the performance comparison with other state-of-the-art models?	Fully human-written
EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper aims to overcome the limitations of motion generation based on pixel-level supervision in previous studies by proposing joint modeling of human appearance and motion. The authors propose a DiT-based architecture that processes tokens from two modalities. The SMPL parameters are used to represent human poses, and to emphasize the dual-modality nature, Query, Key, and Value extracted from both the video and motion are concatenated and processed through self-attention. This structure enables attention to consider multiple modalities, which is advantageous for joint modeling. A motion-video synchronized RoPE (MVS-RoPE) which is an encoding method applicable to both modalities, is also proposed. Specifically, a diagonal extension is proposed to prevent interference between motion latents and video latents. In addition, the authors propose the HuMoVe dataset, containing over 80,000 video-motion pairs. This dataset includes descriptive textual captions, 3D SMPL motion parameters, and video pairs, making it valuable for multi-modal generative modeling that considers vision, text, and motion jointly. The experimental results present various metrics and human evaluations, showing performance improvements over baselines. Furthermore, ablation studies for each module are provided to analyze the effectiveness of the proposed methods. - The paper proposes the large-scale HuMoVe dataset. Since the dataset includes test captions, videos, and motion parameter pairs, it is highly useful for multi-modal modeling tasks. - MVS-RoPE that can be jointly applied to visual and motion embeddings is proposed. This encoding technique utilizes diagonal positioning to prevent interference between vision and motion latents, which is a reasonable approach (although more experimental evidence is needed to support this). - The paper is easy to follow. - The deep network structure is only a simple extension of existing networks. Except for MVS-RoPE, the network mainly uses self-attention on concatenated features for joint modeling, which is quite simple and straightforward. Discussion on whether other components could be improved to better support joint modeling would strengthen the paper. - The quantitative evaluation relies only on self-evaluation. Even if direct comparison with prior studies is difficult, the paper should include analyses comparing the video and motion decoder performance improved from joint modeling with existing conditional generation methods (e.g., VideoJAM) to show the degree of improvement or equivalence. - The explanation of how text descriptions were generated for the HuMoVe dataset needs to be clarified. In particular, since the initial data were created using an LLM, the paper should provide more detailed information about the prompts used. - p.2 L66: The authors mention that previous works are limited because, even with a 3D prior, supervision is applied after projecting it into 2D, which constrains accurate 3D (motion) generation. However, since the proposed method is also trained through a video diffusion process, hasn’t it still failed to overcome the problem of losing 3D information? - p.4 L189: Motion tokens are generated as 51 dimension. What is the specific reason for this number? - Motion tokens are added diagonally to visual tokens. Since maintaining temporal alignment is sufficient, there seems to be no strict reason for using the diagonal arrangement. Is there experimental evidence supporting the "positional collisions" mentioned in the text? - Eq. 6: If the reviewer's understanding is correct, the last term must be u_{\theta}(\phi, m_{t}, \phi) - Quantitative comparison is provided only as self-evaluation. Although direct quantitative comparison with previous studies may be difficult, joint modeling is expected to enhance the performance of both the video and motion decoders. Therefore, a quantitative comparison between the videos and motions generated by the proposed framework and those produced by conditional generation methods (e.g., VideoJAM), given GT as condition, could better highlight the advantage of joint modeling (even if the performance does not surpass that of conditional generation). - Minor Comments: -- p.5 L231: i is the motion token -> i is the motion token index ? -- Fig.3: The distinction between "noisy" and "clean" is described only in text. It would be clearer and easier to understand if visual symbols were added to indicate the presence or absence of noise.	Fully human-written
Sculpting User Preferences for Recommendation with Positive-Negative Diffusion Guidance	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This paper proposes SteerRec to enable effective and steerable negative guidance in diffusion-based recommenders. It firstly introduces Positive-Negative Guidance inference mechanism in the inference stage, which replaces the generic unconditional prior with a user-aware negative condition. To ensure the negative condition provides meaningful repulsive guidance in the dynamic embedding space, it further designs a margin-based objective that explicitly aligns the training process with PNG by ensuring the model’s prediction under a positive condition is closer to the target item than its prediction under a negative condition. Extensive experiments on three datasets provide the effectiveness of SteerRec. 1. The idea about incorporating user-aware negatives into DM is interesting and the utilized Guidance Alignment Triplet Loss is well-aligned with the PNG. 2. The experiments and analyses are entensive and the compared baselines are reasonable. 3. The source code and the utilized datasets are released in anonymous Github repo. 1. It seems that Figure 2 omits many details. Could the authors use this figure to further clarify the technical contributions and highlight the novelty of their work compared to existing methods in the sequential recommendation and diffusion-based recommendation? 2. The focus of this paper is Positive-Negative Guidance, but the utilized negative samples are just the in-batch negatives (training) and randomly-selected samples (inference). Compared to the reserve stage of diffusion models, the time complexity of performing negative sampling should not be significant. 3. Lack of the the time complexity analysis and the comparison of the actual training and inference runtimes between the proposed SteerRec and baselines. 4. In Figures 10–12, the authors claim that PreferDiff forms uneven clusters with indistinct boundaries, whereas SteerRec produces a well-structured representation space with multiple distinct and dense clusters separated by clear low-density regions. However, this is not immediately evident. Could the authors provide a more detailed explanation of the reasons behind these three figures? Please refer to the weakness. Since there is no borderline option in this review, I would be willing to raise my score if the authors can address my questions.	Fully human-written
Sculpting User Preferences for Recommendation with Positive-Negative Diffusion Guidance	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	The paper introduces SteerRec, a novel framework for sequential recommendation that leverages diffusion models (DMs) for more accurate and personalized predictions. Traditional diffusion-based recommenders struggle with incorporating negative feedback during inference, as the existing classifier-free guidance (CFG) mechanism relies on a global, user-agnostic prior, which limits control over the generative process. SteerRec addresses this limitation by introducing Positive-Negative Guidance (PNG), which uses user-aware negative feedback to explicitly steer the generation process away from undesired items. This is complemented by the Guidance Alignment Triplet Loss (GAL), a novel training objective that ensures the model generates predictions that are both closer to desired items and farther from disliked ones. Through extensive experiments on public benchmarks, SteerRec outperforms existing methods, providing a more precise and controllable recommendation system. 1. SteerRec's key contribution is the Positive-Negative Guidance (PNG) mechanism based on the well-known classifier-free guidance, which directly integrates user-aware negative feedback into the inference process of diffusion model-based recommenders, providing enhanced control over the generation process. To further optimize the proposed of PNG, the authors introduce the Guidance Alignment Triplet Loss which ensures that the negative guidance is both meaningful and effective during training. This alignment makes the model's predictions more reliable and better aligned with user preferences. The combination of both modules is both logical and effective. 2. SteerRec converges faster than its counterparts like PreferDiff, demonstrating that its direct and efficient learning signal enhances the model's ability to learn user preferences quickly. 1. Paper Writing: A major issue in the paper's writing is the lack of a detailed comparison between SteerRec and previous methods (e.g., DreamRec [1], PreferDiff [2]). This omission assumes that readers are already familiar with diffusion-based recommenders, which increases the reading burden for those who may not have such background knowledge. Further 2. Brute Force Sampling: I think the author's motivation is both reasonable and compelling. Replacing CFG with recommendation-tailored guidance is essential. However, in practice, the method of constructing the negative condition during inference by randomly sampling a set of items from the global item corpus seems somewhat crude. This approach overlooks the unique preferences encoded in the user's historical representation, potentially leading to less precise guidance. I think more tailored approaches for selecting negative samples that are more in line with the principles of diffusion-based recommenders. I also noticed in Appendix F that the authors' experiments show SteerRec performing well when high-quality negative samples from the MIND dataset are used. However, obtaining high-quality negative samples is challenging in most real-world scenarios. Addressing this issue should be a key focus of the paper, as both PNG and GAL are quite intuitive ideas. A potential solution could involve methods to generate or identify high-quality negative samples in settings where they are not readily available. 3. Missing Some Hyperparameter Experiments I find some key experiments related to diffusion, such as the impact of the diffusion steps and DDIM steps, were not presented. If SteerRec requires fewer diffusion or DDIM steps compared to PreferDiff, it would further highlight the efficiency brought by the introduction of negative guidance. Showing these results would provide stronger evidence for the practical benefits of SteerRec's approach. 4. Need More Novelty on Negative Guidance The use of relevant negative signals aggregated into a centroid is shown to be effective in PreferDiff, but this approach lacks some novelty in SteerRec. A potential improvement could be to explore why other users' representations are not used as negative samples. Incorporating negative samples from different users might provide richer and more diverse guidance. [1] Yang, Zhengyi, et al. "Generate what you prefer: Reshaping sequential recommendation via guided diffusion." Advances in Neural Information Processing Systems 36 (2023): 24247-24261. [2] Liu, Shuo, et al. "Preference Diffusion for Recommendation." The Thirteenth International Conference on Learning Representations. See weaknesses.	Lightly AI-edited
Sculpting User Preferences for Recommendation with Positive-Negative Diffusion Guidance	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper proposes SteerRec, a novel diffusion recommendation framework that enables direct and reliable negative guidance. It introduces a Positive-Negative Guidance mechanism that replaces the generic unconditional prior with a user-aware ngative condition, enabling targeted repulsion from disliked items. It also designs a complementary training objective that explicitly aligns the denoising network's behavior with the PNG mechanism by ensuring the model's positive prediciton is closer to the taget item than its negative prediction. 1. This paper is well motivated and addresses the limitation of Classifier-Free Guidance by incorporating user-aware negative feedback at inference time. 2. The proposed Guidance Alignment Triplet Loss explicitly forces the model to distinguish between positive and negative conditions, solving the critical training-inference discrepancy. 1. Even without PNG or GAL, SteerRec still significantly outperforms DiffuRec and DreamRec, which raises some doubts whether the performance gains are from PNG/GAL or some tricks. 2. All datasets are from Amazon Reviews, where the average sequence length is shorter than 10. It would be better to include experiments on datasets with longer sequences. 3. It requires about 800 - 1200 steps during inference. However, recommendation differs from image generation, so it is unclear whether so many steps are necessary. Using a large number of steps may compromise efficiency. Could you address the above three weaknesses: 1. Since SteerRec outperforms DiffuRec and DreamRec even without PNG or GAL, what factors contribute to such a large performance gap? Could there be implementation or evaluation differences that explain this result? 2. How well would SteerRec generalize to datasets with longer user interaction histories? 3. How does the number of diffusion steps affect the trade-off between model performance and efficiency?	Lightly AI-edited
Sculpting User Preferences for Recommendation with Positive-Negative Diffusion Guidance	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper focuses on the improvement of conditional guidance strategies for diffusion-based recommender systems. Specifically, the original classifier-free guidance simultaneously models both the conditional and unconditional distributions' score functions (log-likelihood gradients), thereby using the weighted difference between the conditional and unconditional score functions as an additional guidance condition during generation. In contrast, this paper replaces the unconditional vector (a trainable embedding) with a weighted embedding derived from in-batch negative samples, thereby introducing negative conditional guidance. Furthermore, this paper introduces a Guidance Alignment Triplet Loss to further regularize negative guidance. The proposed method is relatively simple and effective. However, on the one hand, it completely discards modeling the unconditional distribution, which may affect the model’s cold-start capability. On the other hand, the need for negative sampling during inference could further impact the inference efficiency of diffusion models. 1. This paper is well-organized, with clear tables and figures. 2. The motivation of this paper is well-founded, and the method is simple yet effective. 3. The experimental setup in this paper is well-aligned with prior work and relatively extensive. 1. Lack of diverse negative sampling strategies: The computation of the negative condition in this paper relies on negative sampling; however, only in-batch negative sampling is considered. Exploring more diverse and fine-grained negative condition constructions—such as incorporating hard negatives—could further enrich the content and strengthen the contributions of the paper. 2. Lack of cold-start analysis: This paper completely replaces the none condition with a negative condition. However, the computation of the negative condition relies on negative sampling (e.g., in-batch negative sampling). For cold-start users, whose positive interaction information is relatively scarce, the influence of negative signals becomes more significant, and the use of random negative sampling may adversely affect recommendation performance. Nevertheless, the paper does not include any discussion or analysis regarding the cold-start problem. 3. Training–inference inconsistency: The proposed method computes the negative condition during training using in-batch negatives; however, batch information is unavailable during inference, so only random negative sampling can be applied. In this case, if positive items are accidentally sampled as negatives, it may undermine the validity of the negative condition. 4. Lack of efficiency analysis: The proposed method requires explicit random negative sampling during inference to compute the negative condition, which could further reduce the model’s recommendation efficiency and even affect its applicability in online settings. However, the paper does not provide any comparative analysis of efficiency. Please refer to Weakness.	Lightly AI-edited
Principled Policy Optimization for LLMs via Self-Normalized Importance Sampling	Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper identifies a critical dilemma in critic-free RLHF: existing methods are either theoretically unsound or biased. Group Relative Policy Optimization (GRPO) suffers from high variance by improperly mixing sequence-level advantages with token-level importance sampling (IS) ratios . Group Sequence Policy Optimization (GSPO) achieves stability by using a geometric mean of token ratios, but this is a biased estimator that optimizes a "perturbed objective" and distorts the crucial reward-KL trade-off . The authors propose SNIB (Self-Normalized Importance Sampling with a Baseline), a novel algorithm that resolves this dilemma. SNIB is both stable and asymptotically correct. It uses the theoretically correct sequence-level IS weight (the product of token ratios) and achieves stability by applying two principled techniques: Self-Normalization: It normalizes each sample's IS weight by the average weight of the entire batch, which provably dampens outliers and reduces variance . Baseline: It uses the mean batch reward as a baseline to further reduce variance in the advantage estimates . The paper provides strong theoretical backing, proving SNIB's gradient estimator is consistent and asymptotically unbiased (with bias vanishing at O(1/G)) . This principled design is shown to be more robust to reward model uncertainty and, unlike GSPO, preserves the principled KKT conditions of the constrained reward-KL optimization problem . Empirically, SNIB outperforms GRPO and GSPO on challenging mathematical reasoning and code generation benchmarks Principled and Novel Solution: The proposed solution, SNIB, is elegant. It correctly insists on using the true sequence-level IS weight, and then intelligently applies self-normalization —a statistically-grounded variance reduction technique—to solve the exact stability problem that plagued naive IS (shown in Fig 2). Strong Theoretical Guarantees: The method is built on a solid theoretical foundation. The paper proves that SNIB is asymptotically unbiased (Proposition 1) and that this unbiasedness preserves the underlying KKT conditions of the constrained optimization problem, which biased methods like GSPO do not (Proposition 2). Missing PPO Baseline: The entire motivation for critic-free methods is to replace the expensive, critic-based PPO . However, PPO is not included as a baseline in the main results (Table 2). This makes it impossible to assess the full picture. We can see SNIB is better than other critic-free methods, but how much performance (if any) are we sacrificing compared to PPO for the gain in efficiency? Limited Task Domain and Reward Type: The experiments are conducted exclusively on math and code generation tasks, using ground-truth correctness as the reward signal. While this is a very clean and sound way to test the algorithm, it doesn't demonstrate the method's performance in the more common (and noisy) RLHF setting with a learned reward model on subjective tasks like "helpfulness" or "harmlessness." The reward noise experiment (Fig 1a) is a good simulation, but not a substitute for a real-world test. Sensitivity to Group Size G: The theory states that SNIB's bias is on the order of O(1/G). The experiments use a group size of G=8, which seems small and may imply a non-trivial bias in practice. The paper does not include an ablation study on G, which is a key hyperparameter for both performance and efficiency. Given that the primary motivation is to find an efficient alternative to critic-based PPO, why was PPO omitted from the main performance comparison in Table 2? A direct comparison is needed to understand the full performance-vs-efficiency trade-off. The experiments are on math/code tasks with ground-truth rewards. How do you expect SNIB's performance to translate to the more common RLHF setting using a noisy, learned reward model for a subjective task like "helpfulness"? Your analysis in Figure 1a is promising, but is additive Gaussian noise a sufficient proxy for the complex, correlated noise from a learned RM? The theoretical bias of SNIB is O(1/G), and a small group size of G=8 was used. Have you performed a sensitivity analysis on G? How does the performance and stability of SNIB change with a smaller (e.g., G=4) or larger (e.g., G=16) group size?	Moderately AI-edited
Principled Policy Optimization for LLMs via Self-Normalized Importance Sampling	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper introduces Self-Normalized Importance Sampling with a Baseline (SNIB), which is a critic-free policy optimization algorithm for aligning LLMs. The authors observed that GRPO uses theoretically unsound arithmetic mean of token-level importance ratios, which have high variance and scale poorly with sequence length, while GSPO uses geometric mean of token ratios but introduces non-vanishing bias, which distorts the reward-KL trade-off. The authors show that SNIB is both stable and asymptotically correct. The key idea is to use the true sequence level importance weight but to stabilize it with self-normalized importance sampling. Experiments show that the proposed method outperforms GSPO on several math and code benchmarks and it is more robust to reward noise. Articulating the stability vs. correctness dilemma as the heart of current critic free RLHF is clearly done in the work. The analysis of why GRPO and GSPO are flawed provides strong motivation for the research. The ablation shown in Fig. 2 is reasonable and convincing to show that SNIB effectively reduces the high variance of vanilla IS. The motivation for SNIB and its class of algorithms is to replace the standard PPO with a critic. However, the PPO with critic baseline is absent from all comparisons. It would be great to see if SNIB matches the performance of standard PPO with critic, or it closes the gap whereas GSPO and GRPO do not. The paper claims that GSPO is biased while SNIB is principled and asymptotically unbiased. However, it seems SNIB is also biased at any finite batch size G. The claim is only that this bias is asymptotic and vanishes as O(1/G). The PPO clipping, although as the authors noted, is also a source of bias, which does not vanish with batch size. Please see weakness part.	Fully human-written
Principled Policy Optimization for LLMs via Self-Normalized Importance Sampling	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper proposes SNIB (Self-Normalized Importance Sampling with Baseline), a critic-free RLHF algorithm that unifies the stability of GSPO with the theoretical correctness of unbiased policy gradients. SNIB uses self-normalized importance sampling to reduce variance while remaining asymptotically unbiased. Theoretical analysis proves its consistency, robustness to reward noise, and preservation of the KL–reward trade-off. Experiments on math and coding reasoning benchmarks show moderate but consistent gains over GRPO and GSPO. 1.Clear theoretical contribution: principled, asymptotically unbiased estimator for critic-free RLHF. 2.Well-presented mathematical analysis, including variance, convergence, and KKT proofs. The paper is well-written, well-structured, and effectively connects theory with practical implications for RLHF pipelines. 2.Empirical results demonstrate improved stability and robustness to reward noise. 1.Experiments limited to math/coding tasks — generalization to dialogue or multimodal alignment is unclear. 2. Lack of comparison with more recently RLHF baselines. 3. Performance improvements are modest given the theoretical complexity. Although SNIB improves theoretical soundness, its empirical advantage over GSPO is relatively small (1–2% absolute in most benchmarks). The improvements, while consistent, may not justify the added conceptual and implementation complexity. 4. Key design components (self-normalization, baseline, stop-gradient) are not individually ablated, making it difficult to attribute improvements precisely. The paper would benefit from comparing fully differentiable vs. stop-gradient SNIS. 1. Does the stop-gradient version compromise the theoretical unbiasedness claimed? 2. Can SNIB be integrated with existing GRPO/GSPO infrastructures in practice? 3. How does SNIB behave under severe reward model bias rather than random noise?	Fully AI-generated
Principled Policy Optimization for LLMs via Self-Normalized Importance Sampling	Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This paper proposes Self-Normalized Importance Sampling with a Baseline (SNIB) to address the bias and high-variance issues that accumulate over long sequences when estimating importance sampling $\pi_\theta(y\|x)/\pi_{\text{old}}(y\|x)$ in GRPO and GSPO. SNIB is consistent and asymptotically unbiased, ensuring convergence to the correct policy objective. Experimental results demonstrate that SNIB is more robust to adversarial reward perturbations and achieves a better Reward–KL trade-off compared to prior methods. - The theoretical analysis of self-normalized importance sampling is rigorous and well-justified. - The paper is poorly written; many key contributions and analyses, particularly those analyzing the high-variance issues in prior methods, are either missing or insufficiently explained in the main text. - Proposition 1, which shows that the SNIB estimator is consistent and asymptotically unbiased, is an important theoretical property. However, this result is not novel and well-known [1]. - Unrealistic reward model uncertainty experiment: The noisy reward experiment is unrealistic, as the authors simply add random Gaussian noise $\epsilon\sim\mathcal N(0,\sigma^2)$ to the rewards during training. In the context of LLM alignment, such noise cannot capture the complexity of modeling uncertainty and context dependence in human preferences (see [2, 3]). Moreover, in mathematical reasoning tasks, we typically have access to a ground-truth reward function (as also used in the paper), which can reliably provide learning signals for LLMs [4, 5]. Therefore, the results presented in Fig. 1 do not convincingly demonstrate SNIB’s robustness under realistic scenarios. - The main results in Table 2 show that GRPO achieves substantially better performance than SNIB on three out of four mathematical reasoning benchmarks, which raises questions about the effectiveness of SNIB. ## References. [1] Monte Carlo theory, methods and examples - Art Owen. [2] Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF. ICLR 2024. [3] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision. ICML 2024 Oral. [4] DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 2025. [5] DAPO: An Open-Source LLM Reinforcement Learning System at Scale. NeurIPS 2025. [6] Scaling Laws for Reward Model Overoptimization. ICML 2023. [7] Training language models to follow instructions with human feedback. NeurIPS 2022. [8] Understanding the performance gap between online and offline alignment algorithms. CoRR 2024. [9] VL Norm: Rethink Loss Aggregation in RLVR. arXiv:2509.07558 - Since SNIB remains a biased (but consistent) estimator, it is important to analyze the bias–variance trade-off and compare the performance of SNIB, GRPO, and GSPO as the number of responses per prompt increases. - The variance analysis in the paper focuses primarily on the importance weights. It is also necessary to evaluate SNIB’s ability to stabilize training by measuring gradient variance as sequence length increases, compared to GRPO and GSPO. Without length normalization, the gradient variance can grow proportionally with response length [9]. - From my experience, the normalized weights can be computed using a softmax operation: $\text{Softmax}(\{\log(\pi_\theta(y_i\|x)-\log\pi_{\text{old}}(y_i\|x)\}_{i=1}^G)$. However, the exponential function in the softmax can amplify discrepancies between response groups, allowing longer sequences to dominate the learning signal due to higher variance. Could this lead to a long-sequence bias, where SNIB favors longer responses? If so, is this effect detrimental or beneficial for exploration? It would be valuable to visualize the distribution and entropy of normalized importance weights $w$ to better understand this phenomenon and to help tune the clipping parameter $\epsilon$. Additionally, does SNIB tend to produce longer sequences or higher-entropy outputs (more explorative) compared to GRPO and GSPO? - While GSPO indeed suffers from variance accumulation over sequence length, GRPO—when formulated under a token-level MDP—employs token-level importance sampling with per-token clipping, which effectively mitigates high variance at each timestep. The paper should contrast SNIB with GRPO, explicitly showing failure modes of token-level importance sampling and explaining why GRPO achieves better empirical performance compared to SNIB. - The paper claims that SNIB provides a more predictable and principled trade-off between reward maximization and KL regularization compared to other estimators. However, in Fig. 1(b), all three estimators exhibit a similar trend where increasing $\beta$ lowers the average reward. The authors should clarify what makes SNIB “more predictable” than GRPO and GSPO under this observation. - KL regularization is commonly used to prevent the model from deviating too far from the initial policy distribution, thereby reducing reward hacking [6, 7]. However, in mathematical reasoning—where a ground-truth reward exists—KL regularization can actually hinder learning [5]. Could SNIB’s predictability help identify an optimal $\beta$ when we only have access to a proxy reward for training, without access to a golden reward function (as in [6, 8] reward hacking setting)?	Lightly AI-edited
InverseScope: Scalable Activation Inversion for Interpreting Large Language Models	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper is motivated by the goal of finding "assumption-free" interpretability methods, that don't rely on the linearity and/or sparsity assumptions present in many contemporary tools, such as sparse autoencoders (SAEs). As such, there are no assumptions made on compositionality, and the paper exclusively focuses on the "local" interpretability problem of creating a general tool for interpreting individual LLM activations, as opposed to the "global" problem of discovering compositional structure within the entire space of activations. The intuition for the method is as follows: suppose we invert the activation and get back a text input having some salient property $p$. We might think that the activation encodes that "the input has property $p$". We can check this by sampling many activations "close" to it in the embedding space, invert them, and check if they have the same property. If they all do, we have some reason to believe the activation encodes $p$; if it's a random mix of have / not have $p$, we have some reason to think maybe this activation is not sensitive to $p$. To summarize the presentation of the methodology from Section 2, the paper proposes a method to interpret activations $z$ in LLMs by: 1. approximately inverting noisy versions of $z$ back to input texts; 2. using these texts to formulate a hypothesis for a feature $f$ of text that could be represented by $z$; 3. checking the hypothesis by evaluating the probability (over a task-related text distribution $x\sim D$) that activations close to $z(x)$ have the same value of $f$ as $x$. Notably, there is no condition that activations far from $z(x)$ should have a different value of $f$ from $x$; such activations simply don't matter for the objective by which the hypothesis is evaluated. To approximately invert activations, the authors train a language model conditioned on internal representations, trained to generate input texts that result in representations close to the wanted internal representation. This is effectively a next-token prediction objective. Note that this method provides a way to classify activations $z$ whenever we have some function $f$ from texts to a finite set of classes. This is because we can approximately invert $z$ and evaluate $f$ on the inverted text. In that way, the method is similar to the well-known linear probe method, but without the linearity assumption. Experiments include: - identifying which heads in GPT-2 represent which features in the IOI task, a well-studied simple language task whose circuit has been mapped extensively in prior work. - benchmarking against SAEs on the RAVEL dataset, where InverseScope is compared to using individual SAE features as classifiers. - studying the layers in which task vectors emerge The paper tackles a somewhat under-investigated question in the interpretability literature overall: can we have an "oracle" that simply tells us what features of the input text a given activation "cares about", without relying on assumptions like linearity or sparse coding? The work makes the assumption that activation semantics is "continuous", which seems reasonable as far as assumptions go. The writing is clear & easy to follow. - The contribution over the prior work [1] is relatively incremental. The prior work also trains an activation inverter. The main contribution of the current work is not in methodology, but in the kernel used for approximate inversion and in the experiments. - The approximate activation inversion process complicates and obfuscates the method, as it introduces hyperparameters (the noise scale and the "width" of the kernel) with an unknown role in the final results. Additionally, the inversion only works on a limited task dataset, limiting the overall applicability of the method. This means that the method can only generate hypotheses for concepts that exhibit variation in the task dataset, unlike an SAE for example, which can generate hypotheses based on concepts in the entire pre-training distribution. In other words, the question being answered here is not "What is the model thinking about when processing the input that created this activation" but "What dimensions of variation in the task dataset is the model thinking about when processing the input", which is subtly but crucially different. - This is at heart a correlational method (no causal experiments are performed), and as such there's only limited interpretability utility to be found in it. I won't make this objection in detail, as it is analogous to the challenges to probes as an interpretability tool that have already been raised in the literature. See work by Belinkov and colleagues, e.g. “Probing Classifiers: Promises, Shortcomings, and Advances” or “Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance?” - Related to that, the authors say "Crucially, while the original benchmark evaluates interpretability via causal interventions on model behavior, we instead focus on a more fundamental question: assessing the method’s fidelity in identifying the correct attribute encoded within the activation itself" (line 377) - I disagree that this is more fundamental. - As the authors readily point out, in general there's no obvious way to generate feature hypotheses for step 2. of the method. [1] Xinting Huang, Madhur Panwar, Navin Goyal, and Michael Hahn. Inversionview: A general-purpose method for reading information from neural activations What is the sensitivity of the results to the hyperparameters involved in the inversion method? In general in interpretability, we always know the input an activation came from. Since in your method you don't have a general method to generate the hypothesis $f$, a plausible alternative is to skip the activation inversion altogether, and instead formulate a hypothesis by perturbing the input in salient ways and measuring activation distance. How do you think about the tradeoffs here?	Fully human-written
InverseScope: Scalable Activation Inversion for Interpreting Large Language Models	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The authors present a new technique to map from activations to plausible input sequences that would produce similar activations. To do this, the authors train a conditional generation model. This model is a small transformer model trained on next token loss but conditioned on the latent activations of the model we would like to interpret. Instead of training a site specific model, which would be computationally expensive, they train translator linear layers for each site, training a single unified conditional model. Their conditional model is a finetuned of GPT-2 small, but they apply it to larger models. The proposed method is more sample efficient than other inversion methods. The proposed method correctly identifies some of the attention heads that are important for the IOI circuit in GPT2. The accuracy on the classification task on the RAVEL benchmark surpasses that of SAEs. The 'inverting' model has to be re-trained for each specific task. Although this can be said of several interpretability methods, it is not clear here if the latent representations have any causal link to the mechanisms of the underlying model and it is not obvious how to test the predictions made. On the IOI task, the 'ground' truth heads were correctly identified by the consistency rate, but there were also other several heads that had similar consistency. In a world where 'ground' truth labels don't exist, it is not obvious how much this method could be used to identify relevant components. I think this sentence (lines 136-137) is not well constructed `Given the distribution P(x; ˆz), we propose a three-step pipeline for feature interpret, involving hypothesize, formalize and evaluate:` I couldn't quite understand how many samples were used to finetuned the InverseScope model for each task. Could an LLM distinguish between the 'true' input and the generated examples?	Fully human-written
InverseScope: Scalable Activation Inversion for Interpreting Large Language Models	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The authors propose InverseScope, a framework for interpreting neural activations by reconstructing the textual inputs that triggered them. In order to do so, the authors propose a conditional generation architecture, where a projection of the representation from the original network is used to conditionally influence reconstruction of the original network’s input. The reconstruction network is initialized to GPT-2, and then its parameters are frozen and only the conditional set is trained to induce the reconstructon behavior after a solid language modeling base. The authors perform experiments on the IOI task and RAVEL datasets, showing that the proposed architecture improves attribute identification over SAE-based baselines in the latter. In a follow up analysis, the authors also use their framework to analyze where task-specific information obtained from ICL is encoded in the model, and confirm findings from previous works which suggested that these features are encoded in the middle layers. Overall the paper is well written and easy to follow. The work opens up interesting avenues for inspecting knowledge encoded within LLM latent representations. - Proposes a novel framework for interpreting LLM internals - Operationalizes the framework with a conditional generation architecture - Results on IOI and RAVEL show the method is promising and more accurate compared to SAE-based alternatives - Interesting analysis sheds light where task-specific features from ICL are encoded within the model - It would be interesting to also show qualitative samples of reconstructed inputs, as well as failure cases. - I think it is too strong to consider a conditional LM as a standalone contribution, as such architectures have been widely used prior. - The related works section feels quite thin. Namely, conditional generation based on model latents, and the connections to the perspective of variational autoencoders seem relevant - but this is quite minor. See above	Fully human-written
InverseScope: Scalable Activation Inversion for Interpreting Large Language Models	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper proposes InverseScope, an activation-conditioned generator designed to interpret neural representations in large language models (LLMs) via activation inversion. Given an activation from a specific site (layer/head), the model defines a conditional distribution over inputs that would yield similar activations and samples from this distribution using a conditional Transformer generator. The authors also introduce the Feature Consistency Rate (FCR), a quantitative metric evaluating whether generated inputs preserve certain features (e.g., subject/object identity). Experiments show qualitative and quantitative correspondence between activation patterns and encoded semantic features. 1. Clear motivation and formalization: The paper clearly articulates the challenge of probing activations in LLMs. 2. Reasonable architectural engineering: The control-layer conditional generator is a practical and technically sound approach for conditioning a Transformer decoder on internal activations. 1. Scalability claim insufficiently supported: The paper claims to “advance inversion-based interpretability by scaling it to larger open-source LLMs and applying it to practical tasks” (lines 69–70). However, the actual generator is trained on GPT-2 small, and target models are limited to Gemma-2B and LLaMA-2-7B. No experiment demonstrates how the generator behaves with increasing activation dimensionality, even though the authors justify their method by noting that “the probability that a random input produces an activation close to $\hat z$ decays exponentially with dimensionality” (lines 166–167). If dimensionality scaling is the central motivation, a quantitative study showing how approximation accuracy degrades or stabilizes with dimension is essential. 2. Limited novelty: The paper feels incremental in method, with novelty residing mostly in framing rather than technique. The assumption that similar activations imply similar semantics has already been well discussed in prior studies (e.g., Bengio et al., 2013, IEEE TPAMI, as also noted by the authors in lines 87–88). Moreover, the study largely reuses a GPT-2-style conditional language model without structural innovation. The authors provide no justification for this architectural choice, nor any ablations showing how control-layer design or alternative decoder types affect inversion fidelity. 3. Shallow and constrained experimental validation: The experimental validation of InverseScope remains shallow and constrained in scope. All generator experiments rely exclusively on GPT-2-small, with no ablations across different generator sizes or architectures, leaving open questions about model capacity and scaling behavior. Furthermore, the input prompts used throughout the experiments are notably simple and short. Even in the Limitations section (lines 480–481), the authors acknowledge that the current setup does not test complex or compositional language; however, such long, multi-clause prompts would provide a much more meaningful evaluation of scalability and generalization. In addition, while the paper refers to “task-specific input distributions” (lines 241–242), the method is not evaluated on a broader variety of tasks beyond IOI and RAVEL, limiting the evidence for its task-agnostic applicability. 4. Human-defined feature functions: Feature functions $f(x)$ are manually constructed by the authors for each task (Appendix D.1.2-lines 691–692, Appendix D.2.4-lines 821–824). However, the paper does not explain why these task-specific, rule-based definitions are the most appropriate way to evaluate feature consistency. If users must manually define $f(x)$ for every new task, the method’s scalability and reproducibility are questionable, since performance could vary substantially depending on how $f(x)$ is specified. [1] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013. Questions for the Authors 1. Can InverseScope generate diverse and novel sentences (unseen during training) for a given target activation? Since FCR evaluation appears to be closely related to the diversity of generated samples, it would be important to measure and report explicit diversity metrics (e.g., lexical or semantic variance). 2. Could you provide a quantitative analysis showing how approximation accuracy or FCR stability changes as activation dimensionality increases? Does InverseScope maintain inversion fidelity when applied to larger-scale LLMs (e.g., LLaMA-13B or 70B) or to more complex reasoning tasks? 3. The generator architecture is fixed to GPT-2 small. Could you explore alternative generator backbones (e.g., T5, Mistral, Gemma) or larger-scale models? What motivates this specific architectural choice, and would inversion performance or sample diversity change with different configurations? 4. As seen in Figures 2 and 4, only late layers exhibit strong inversion behavior, while early-layer activations appear almost flat. Can the authors provide insight or diagnostic analysis explaining why inversion signals are weak or absent in earlier representations? Additional Suggestions 1. Typographical errors: lines 159–161; Appendix lines 687–688. 2. The related work section is too short to clearly position the paper within recent interpretability research. Expanding it—perhaps in an appendix—would improve clarity and contextual grounding	Lightly AI-edited
TruncProof: LL(1)-Constrained Generation in Large Language Models with Maximum Token Limitations	Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper introduces TruncProof, a grammar-constrained decoding method that leverages LL(1) parsers to estimate the minimum tokens needed to complete a valid output at each step, enabling LLMs to generate syntactically valid outputs within a fixed token limit. The RQ is clear, and the paper is relatively easy to follow. Conceptual clarity and motivation: The paper would benefit from clearer motivation and explanation of why enforcing token limits is necessary and how it improves decoding quality, e.g., does it improve overall accuracy or just ensure early termination? It is not fully clear why adding an LL(1)-based token constraint offers an advantage over simply applying standard grammar-constrained decoding (e.g., CFG-based methods) with an explicit token or context budget specified in the prompt. The practical benefit, especially when existing constrained decoding already guarantees syntactic validity, remains ambiguous. Writing: Providing a concrete, step-by-step example walkthrough of each phase of TruncProof (rather than only one abstract output example in Fig. 4) would greatly improve clarity and help readers understand how the approach operates and why it matters. Experimental limitations: The choice of baselines is limited and may not fully isolate TruncProof’s contribution. A stronger comparison would include variants that apply existing grammar constraints or prompt-level token limits under the same decoding setup. Additionally, the overhead introduced by TruncProof, both in preprocessing and runtime, especially whether scaling up grammar size will affect decoding efficiency, should be quantified and discussed. Finally, the evaluation focuses primarily on syntactic validity rather than overall task performance; adding metrics for semantic correctness or downstream accuracy (e.g., in code completion, data-to-JSON generation, or semantic parsing) would help clarify whether the proposed constraint mechanism provides real benefits beyond ensuring grammaticality. See weakness.	Heavily AI-edited
TruncProof: LL(1)-Constrained Generation in Large Language Models with Maximum Token Limitations	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper addresses a problem in grammar-constrained generation with token limits. When prior grammar-constrained generation techniques, such as syncode and xgrammar, reach the maximum token count, they stop generating, which often results in incomplete or invalid outputs. The paper proposes TruncProof, which uses LL(1) parsers to calculate at each generation step how many tokens are minimally required to complete a grammatically valid output. This allows the method to block token selections that would either break the grammar or exceed the token budget. The approach is implemented as a logit modifier, making it compatible with different models and decoding methods. Experiments on JSON generation and C code tasks demonstrate that TruncProof maintains grammatical validity under strict token constraints, while baseline methods largely fail in these scenarios. * The paper is well-written and easy to read * I appreciate the time complexity and space complexity analysis. It would be better if the authors could compare these complexities with prior works. * The related work section is comprehensive and considers most SOTA CFG constrained generation works. * The evaluation considers various sampling techniques (beam search, MCTS) during inference. In my knowledge, this has not been studied extensively in the prior CFG-constrained decoding works. The major issue in the paper is the limited empirical evidence showing the practical applicability of the work beyond JSON generation under token-limit restriction. I would encourage the authors to perform additional experiments on one of the other grammars such SQL, Python, Java, or Go, that have been considered in prior works and use end-to-end benchmarks such as humaneval.   * The technique uses LL(1) grammar, which is inherently weaker than LR and Earley grammars supported by some of the prior works  . * The evaluation on JSON considers only 100 examples on 2 models. Both should ideally be more. * The evaluation for C code generation is almost non-existent. Is it just evaluated on a single example in Figure 4? Am I missing something? * What is the motivation for the token limit restriction? * In case of JSON generation, LLM generates some extra whitespaces/newlines which are avoidable but is it also realistic in other programming languages?	Fully human-written

PreviousPage 31 of 1516 (75800 total rows)Next