|
Towards Generalizable Implicit In-Context Learning with Attention Routing |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes In-Context Routing (ICR) for implicit in-context learning. Instead of adding demonstration tokens to the prompt or adding shift vectors in the residual stream, the method extracts Principal ICL Directions (PIDs) from multi-domain explicit ICL runs by performing PCA at each layer. A small router then maps a new input to layer-wise weights and per-head gates. During inference, the method adds a low-rank, input-conditioned bias to the attention logits. The authors provide a kernel-view interpretation plus a low-rank reparameterization. Experiments on diverse datasets with several open models show consistent gains over other implicit ICL methods and few-shot ICL, as well as stronger OOD robustness.
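To make the mechanism concrete, here is my reading of the two core steps as a minimal numpy sketch; the function names, the choice of k, and the exact form of the logit bias are my own assumptions rather than the authors' code, and the router that produces the layer weight and head gate is treated as given:

```python
import numpy as np

def extract_pids(last_token_states, k=4):
    # last_token_states: (num_explicit_icl_runs, d) last-token Q/K states
    # gathered at one layer from multi-domain explicit ICL runs; PCA per layer.
    X = last_token_states - last_token_states.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]                                  # (k, d) principal ICL directions

def routed_logit_bias(q, K, pids, layer_weight, head_gate):
    # Low-rank, input-conditioned bias added to one head's attention logits:
    # bias_j = gate * weight * sum_r (q . v_r)(k_j . v_r), with v_r the PIDs.
    q_coords = pids @ q                            # (k,)
    K_coords = K @ pids.T                          # (T, k)
    return layer_weight * head_gate * (K_coords @ q_coords)   # (T,) per-key bias
```

If this reading is roughly right, the added cost per head is O(T k d) with small k, which is consistent with the efficiency claims below.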
- Practical efficiency: No prompt-length increase and no weight updates in the base model (compared to vanilla ICL). The added compute is low-rank and local to the attention logits, which is deployment-friendly compared with long demonstrations or broad fine-tuning.
- Clear design shift in implicit ICL: The key novelty is the move from post-hoc residual steering to structural routing at the attention logits. This places the intervention exactly where ICL mechanisms operate and turns implicit ICL into a problem of routing attention paths. This is a fresh axis, distinct from LoRA, which edits weights, and from activation steering, which edits residuals.
- Empirical breadth and stability: The method wins against several implicit ICL baselines on both in-domain and out-of-domain sets and shows fewer collapses below the zero-shot baseline. It sometimes matches or beats few-shot prompting while keeping zero-shot latency and memory.
- Data and supervision needs: Router training uses labeled data from several domains. The limits of generalization to tasks with new label spaces, or to settings without labels, are not fully explored. It remains unclear how far the train-once-and-reuse promise extends.
- Information usage in PID extraction: Using only the last-token Q and K may underuse the rich structure inside the demonstrations. The paper argues this is sufficient as an integration point, but alternative choices, such as pooling across several tokens or using attention rollouts, could strengthen the claim.
- Why restrict PID extraction to the last token only? Have you tested using several recent tokens or a learned pooling over the demonstration region, and how would that affect out-of-domain robustness and interpretability? |
Heavily AI-edited |
|
Bridging ML and algorithms: comparison of hyperbolic embeddings |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper aims to provide a systematic, cross-community comparison of hyperbolic embedding algorithms developed in machine learning (ML), network theory (NT), and algorithmic graph research.
The authors aim to close the gap between these communities by experimentally evaluating 14 embedding methods on 30 real-world and 450 simulated networks, comparing both embedding quality and computational efficiency.
This work aims to bridge ML and algorithmic perspectives on hyperbolic embeddings, attempting to show that older algorithmic methods are far more computationally efficient without sacrificing accuracy, and it proposes ICV, which is intended to be a fairer, theory-grounded quality metric for comparing embeddings.
The main strength of the paper is the mission to offer a wide comparison of algorithms for hyperbolic embedding of networks across domains.
The comparison on artificial networks is misleading. The authors should use benchmarks whose ground-truth networks are generated in hyperbolic space by the nonuniform popularity-similarity optimization (nPSO) model, which also allows the creation of the mesoscale community structure that many real networks have. They should then use measures of the ordering of nodes along the angular coordinates, the hyperbolic distance correlation, and the angular separability index of the communities. In particular, the presence of communities in the benchmark is fundamental, because communities are the property that explains why methods such as HyperMap and HyperMap-CN did not work well: being based on the simple PSO model, they have no parameter to model the community organization of real networks.
The tests on real networks are not convincing because the authors do not offer any quantitative evidence that these networks are hyperbolic, and the measures they use are empirical and ill-posed. For instance, assuming that a network is embedded better because its greedy stretch factor is higher is misleading. In reality, every network reflects an underlying complex system with an intrinsic navigability, and over-estimating it is wrong.
The greedy stretch factor is an old and superseded measure of navigability. Measures such as the greedy routing efficiency and the geometrical congruence are more advanced tools for quantifying navigability.
The comparison omits several important studies:
+ one of the recent state-of-the-art methods, proposed in this article:
CLOVE, a Travelling Salesman’s approach to hyperbolic embeddings of complex networks with communities. SG Balogh, B Sulyok, T Vicsek, G Palla. Communications Physics 8 (1), 397
+ a fast method based on network automata:
Minimum curvilinear automata with similarity attachment for network embedding and link prediction in the hyperbolic space. A. Muscoloni, C. V. Cannistraci. arXiv preprint arXiv:1802.01183
+ the studies of Filippo Radicchi on this topic seem to be neglected.
+ The authors do not seem to mention explicitly the algorithms HyperMap and HyperMap-CN in their initial review. These are inefficient algorithms with low performance, but they are among the first methods proposed: Papadopoulos, F., Psomas, C. & Krioukov, D. Network mapping by replaying hyperbolic growth. IEEE/ACM Trans. Netw. 23, 198–211 (2015).
Papadopoulos, F., Aldecoa, R. & Krioukov, D. Network geometry inference using common neighbors. Phys. Rev. E 92, 22807 (2015).
+ The authors do not seem to mention explicitly the LPCS algorithm of Wang, Z., Wu, Y., Li, Q., Jin, F. & Xiong, W. Link prediction based on hyperbolic mapping with community structure for complex networks. Phys. A 450, 609–623 (2016).
This is one of the fastest algorithms ever proposed, and it should be considered together with the subsequent evolutions proposed by the same authors, which the authors can find by reviewing the literature of these relevant scientists.
+ This study, together with the follow-up works of the same authors, also seems to be neglected: Martin Keller-Ressel and Stephanie Nargang, "HYDRA: a method for strain-minimizing hyperbolic embedding of network- and distance-based data".
+ Coalescent embedding, which is a machine learning method, was posted on arXiv on 21 Feb 2016 ("Machine learning meets network science: dimensionality reduction for fast and efficient embedding of networks in the hyperbolic space"), hence before the other machine learning studies listed.
The study does not report enough information to replicate the experiments. For many algorithms, it is not reported how their versions or hyperparameters were selected.
The organization of the article requires substantial effort to improve the quality of the information reported.
For instance, the authors could add an initial figure that organizes the different algorithms into a genealogical tree, ordered by publication date and by methodological relationship.
The main article could report selected results for the best methods, with the full results in the appendix.
I kindly ask the authors to address all the concerns raised in the Weaknesses section above.
However, in my opinion this study needs to be completely restructured and rewritten, and my recommendation is to withdraw it and resubmit a substantially improved version to a future conference. |
Fully human-written |
|
Bridging ML and algorithms: comparison of hyperbolic embeddings |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
Hyperbolic embedding is a well-studied technique for graph representation learning. This paper presents a comprehensive comparison of algorithms from the ML/DL community and the traditional algorithms community. It presents a convincing survey showing that the most popular methods from the ML/DL community (e.g., by Nickel and Kiela) are inferior to classic algorithms (e.g., by Bläsius et al.) in terms of both computational efficiency and embedding quality. This is a problem that many ML/DL researchers vaguely noticed but failed to explicitly point out. I highly appreciate the contributions of this work.
1. This paper presents a comprehensive and clear survey of both ML/DL approaches to hyperbolic embedding as well as traditional algorithmic approaches. The computational complexity of each approach is clearly discussed.
2. Very comprehensive experimental results to support the authors' claims.
3. Good writing style -- the authors wrote a very clear preliminary section on hyperbolic geometry for the readers' benefit, followed by separate sections on algorithmic approaches and ML/DL approaches.
1. Some citations should be in brackets, e.g. on Line 42.
2. Figures 2, 3, and 5 are difficult to read -- the markers are too small and too similar. The same problem applies to the figures in the Appendix. If these figures were scaled down to fit the page limit, I'd suggest redesigning Table 1 to save some space instead of sacrificing the readability of your most illustrative figures.
3. Different hyperbolic embedding strategies perform differently on different structures, be it a tree, a deep but sparse graph, a shallow but dense graph, or a nearly fully connected graph. You can refer to this ICLR workshop paper: https://arxiv.org/pdf/2407.16641. The concept of local capacity explains why some methods work better on certain datasets. You may want to include more varied network datasets in your experiments, or generate datasets with distinct structures to test the methods.
1. Lines 288-290, the authors claim that "The hMDA method from (Sala et al., 2018) looks interesting, but it depends on the scaling factor, and it is not clear how to learn this parameter; therefore we do not include this method in our experiments". Do you mean the hMDS method? You need to explain more clearly why this method does not apply to your experimental setting. |
Fully human-written |
|
Bridging ML and algorithms: comparison of hyperbolic embeddings |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper reads more like a survey with experiments, bridging hyperbolic embedding methods from machine learning and classical network embedding algorithms. It reviews learning-based hyperbolic embedding algorithms commonly used in ML alongside embedding algorithms from network theory. The authors argue that the ML community often focuses on gradient-based methods, overlooking algorithmic approaches from network theory that could offer better efficiency with comparable performance.
The paper highlights the diversity of hyperbolic embedding methods and draws attention to classical embedding algorithms from network theory. It conducts experiments on algorithms from both sides and draws a comparison.
It makes the point that embeddings from network theory are more efficient and can even perform better than the ML learning-based algorithms tracing from Nickel and Kiela (2017, 2018), and thus should be favored, cited, and adopted.
The survey-style overview may help readers unfamiliar with the historical, algorithmic side of hyperbolic embedding algorithms.
The criticism that ML work “fails to cite/use” algorithmic hyperbolic embedding literature is not very compelling.
- The reason the ML community focuses on the line of work tracing from Nickel and Kiela (2017, 2018) is that these methods allow gradient flow to the embedding layer and yield trainable, end-to-end representations that adapt to downstream tasks, rather than merely embedding data (the reported performance serves as a demonstration of the learnt representation). Static embeddings from algorithmic approaches in network theory cannot serve this role. Ultimately, in my opinion, the hyperbolic learning community targets end-to-end hyperbolic networks that can replace existing Euclidean networks on certain tasks, and it is already well established how important learnable embedding layers are.
- The focus on metrics is also somewhat misplaced: for downstream tasks (e.g., node or graph classification), it is the rank and precision of distances that drive the performance that truly matters, especially since current classification algorithms depend heavily on the hyperbolic distance.
- Some reported results appear unreliable. For example, the 2D Poincaré and Lorentz embeddings from Nickel and Kiela (2017, 2018) were not reproduced, even though many follow-up ML works have shown that they can be.
- For the reasons stated above, the framing of the paper is also misleading: it reads more like a survey than a genuine attempt to bridge ML and algorithms.
- How do the authors envision static embeddings from network theory adapting to downstream ML tasks that require trainable representations?
- What causes the non-reproducibility of the Poincaré/Lorentz experimental results? |
Fully human-written |
|
Bridging ML and algorithms: comparison of hyperbolic embeddings |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper systematically compares hyperbolic embedding methods spanning three communities. The headline result is that the BFKL algorithm is typically ~100× faster than Poincaré (NIPS’17) and Lorentz (ICML’18) embeddings while achieving similar quality under MAP/MR and greedy routing metrics. The authors also introduce an Information Control Value (ICV) criterion to penalize excessive radius/dimension, arguing that apparent gains from higher dimension are often optimization artifacts.
1. The work fills a real gap by bridging communities that have studied hyperbolic embeddings in isolation, delivering a broad, apples-to-apples experimental comparison across many methods, datasets, and metrics.
2. The paper thoughtfully treats numerical stability and contributes the ICV criterion to counter dimension/radius inflation, leading to more balanced method selection. The inclusion of both real hierarchies and scale-free networks, plus synthetic HRG controls with regression analyses on temperature/size, strengthens the empirical narrative.
1. There is no theoretical support in the paper.
2. Figures 2 and 5 are difficult to read.
3. Despite the breadth, some choices of baseline coverage limit the conclusions that can be drawn and may bias the presented landscape.
4. Some evaluations rely on discrete variants (dMAP/dMR) to avoid precision issues, which can shift scores relative to standard definitions.
No questions |
Moderately AI-edited |
|
Large Language Model Guided Dynamic Branching Rule Scheduling in Branch-and-Bound |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes an LLM-guided dynamic branching rule scheduler for branch-and-bound in MILP. It selects an initial branching rule from problem descriptors and adaptively switches rules during the search via tree-to-text prompts, asynchronous multi-LLM querying, and voting. Experiments on SC/CA/CFL/IS compare against SCIP's reliability pseudocost branching (RPB) and ML baselines (SVMRANK, LMART, GCN, tMDP).
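My understanding of the control loop, written as a rough sketch (the rule names are actual SCIP branching rules, but the prompt fields, the `choose_rule` client call, and the tie-breaking are illustrative assumptions, not the paper's implementation):

```python
import asyncio
from collections import Counter

RULES = ["relpscost", "pscost", "mostinf", "fullstrong"]   # subset of SCIP rules

def tree_to_text(stats):
    # Serialize recent sub-tree statistics into a prompt (fields are my guess).
    return (f"open nodes: {stats['open']}, depth: {stats['depth']}, "
            f"gap: {stats['gap']:.3f}, current rule: {stats['rule']}. "
            f"Choose one of {RULES} for the next search phase.")

async def ask(llm, prompt):
    return await llm.choose_rule(prompt)            # hypothetical API call

async def schedule(llms, stats):
    # Asynchronous multi-LLM querying with majority voting; keep the current
    # rule if no strict majority emerges.
    votes = await asyncio.gather(*(ask(m, tree_to_text(stats)) for m in llms))
    rule, count = Counter(votes).most_common(1)[0]
    return rule if count > len(llms) // 2 else stats["rule"]
```

Such a loop would run periodically while the solver continues with the incumbent rule, which is why the latency and call-count questions below matter.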
- Addresses an important and underexplored question: rule scheduling conditioned on evolving tree state.
- Practical design: tree-to-text representation, asynchronous queries, multi-LLM voting.
- Competitive results on multiple benchmarks without training data.
- Fairness of comparisons is questionable. GCN/tMDP are constrained to CPU inference, while the proposed method appears to rely on external LLM APIs (effectively offloading compute). If the baselines’ GPU is disabled, the LLMs should also be forced to local CPU inference or their API latency/compute must be counted explicitly. Otherwise the test-time “zero-training” advantage is conflated with outsourced compute.
- Missing comparisons to prior LLM-for-BnB work. There is a growing body of agentic/LLM methods for MILP/BnB (e.g., LLM4Solver and related), and the paper does not include head-to-head results, weakening credibility of the claimed benefits of LLM-guided scheduling.
- Limited ablations on scheduling frequency/cost. The number of LLM calls, end-to-end latency impact, and sensitivity to prompt design are not quantified.
[1] LLM4Solver: Large Language Model for Efficient Algorithm Design of Combinatorial Optimization Solver
- Will you enforce a fair compute protocol? For example: (i) deploy the LLM ensemble locally on CPU (or a fixed on-prem GPU) and include its inference time in the reported wall-clock; or (ii) if using API, report per-instance number of calls, p50/p95 latency, total API time, and treat it as part of solving time. Alternatively, allow GCN/tMDP to use GPU so all methods leverage external accelerators.
- Can you add direct comparisons to existing LLM-based BnB methods (e.g., LLM4Solver and related agentic solvers) under the same datasets and limits?
- Please report the scheduling overhead: average calls per instance, decision adoption rate, and how solving time changes if switching is disabled or made sequential (blocking), across all datasets. |
Fully AI-generated |
|
Large Language Model Guided Dynamic Branching Rule Scheduling in Branch-and-Bound |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper notes that a single branching rule used throughout B&B is often sub-optimal because the tree structure evolves. The authors propose to let large language models dynamically schedule rules:
1) At the root, an ensemble of LLMs votes for an initial rule based on problem type and size;
2) During search, every L steps the recent sub-tree is converted into text and the LLMs decide whether to switch rules;
3) Asynchronous queries and majority voting reduce latency and hallucination.
On four NP-hard benchmarks the method beats SCIP’s default RPB and four learning-based branching policies in solving time, without any training.
1. Training-free generalisation to unseen problem types, avoiding the data-hungry nature of ML-based branching.
2. Practical asynchronous + ensemble mechanism improves robustness and hides LLM latency.
3. Consistent speed-ups over SCIP default and recent learning baselines on four representative problem classes.
1. Prompts require manual curation of extensive rule descriptions; maintainability and extensibility are not discussed.
2. Only schedules existing SCIP rules; coupling with other commercial solvers is not studied.
3. No theoretical guarantees, e.g., regret bounds or convergence analysis of the scheduling policy.
1. If all LLMs hallucinate the same poor rule, is there a fallback safeguard?
2. Asynchronous advice may arrive tens of nodes late—does this still guide the search effectively, and could an adaptive trigger frequency help?
3. Have you tried smaller open-source models (e.g., 7B) to reduce cost, and how much performance is lost? |
Fully AI-generated |
|
Large Language Model Guided Dynamic Branching Rule Scheduling in Branch-and-Bound |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes an innovative framework that integrates Large Language Models (LLMs) into the Branch and Bound (B&B) algorithm to improve decision-making in combinatorial optimization. Specifically, the authors employ LLMs to assist in node selection, branching, and pruning decisions, aiming to reduce search space and accelerate convergence. The idea of incorporating high-level reasoning into a classical exact optimization algorithm is conceptually intriguing and demonstrates a novel cross-disciplinary perspective between symbolic optimization and neural reasoning.
However, while the conceptual direction is interesting, the practical feasibility remains highly questionable. The major concern lies in the extremely high computational overhead of invoking an LLM at every decision step within B&B. Given that B&B may expand thousands or even millions of nodes, the time and resource consumption quickly become prohibitive. The paper currently lacks a discussion or analysis on how to mitigate this issue, such as through model distillation, caching, or selective LLM querying. As a result, the proposed framework appears difficult to scale beyond small toy instances.
1. Creative integration of LLMs and classical optimization: The work explores a fresh and potentially impactful direction that bridges symbolic search and neural reasoning.
2. Clear methodological presentation: The paper describes how the LLM fits into the B&B pipeline in a systematic way.
3. Empirical feasibility on benchmarks: Initial experiments show that LLM-guided decisions can lead to improved pruning and shorter search depth.
1. Severe computational overhead: Calling an LLM (especially large models like GPT-4) at every B&B step is computationally infeasible for realistic problem sizes.
2. Lack of efficiency analysis: The paper does not quantify runtime costs or provide complexity estimates of LLM usage.
3. Limited scalability: Experiments are only conducted on small-scale problems; the method’s practicality for larger MILP instances is unverified.
4. No mitigation strategy: The paper lacks any discussion of reducing LLM inference cost (e.g., distillation, caching, or hybrid heuristics).
See Weaknesses |
Fully AI-generated |
|
Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors tackle the problem of how to allocate depth, width, context length, and training compute when scaling transformers for in-context learning (ICL) on regression tasks. The paper analyzes a deep looped linear-attention transformer trained (via SGD) to do linear regression in context, without finetuning at test time. It studies three task regimes: (i) isotropic data, (ii) fixed but structured covariance, and (iii) randomly rotated structured covariance (the task distribution shifts every context), and derives dynamical equations and asymptotic scaling laws linking pretraining time, width, depth, and context length. The contribution lies in a unified, theoretically grounded scaling law that distinguishes the roles of depth, width, context length, and training time in ICL.
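For readers less familiar with this setup, the object of study (in my own notation, which may differ from the paper's) is essentially a looped layer acting as repeated preconditioned gradient steps on the in-context least-squares objective:

$$
\hat{w}_{\ell+1} \;=\; \hat{w}_{\ell} \;-\; \Gamma\,\frac{1}{P}\sum_{p=1}^{P}\left(\hat{w}_{\ell}^{\top} x_p - y_p\right) x_p,
\qquad
\hat{y}_{\mathrm{query}} \;=\; \hat{w}_{L}^{\top} x_{\mathrm{query}},
$$

so the depth L plays the role of the number of optimization steps and the context length P the number of in-context samples; the scaling-law question is how the loss decays jointly in L, P, width, and pretraining time.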
1. I appreciate that the paper gives a concrete, theoretically grounded answer to a question that is widely discussed in practice: how should depth vs. width vs. context length scale for in-context learning? Instead of treating "bigger model = better," it isolates when depth specifically matters, and ties that to properties of the task distribution (shared vs. varying covariance across contexts). Specifically, the setup decouples the network width $N$ from the problem dimension $D$ when studying the scaling law, and the use of a looped transformer ensures that the total number of parameters does not grow with the amount of compute, which I believe provides a clean and generic test bench that adds to similar work studying scaling with linear models.
2. A demonstration that the usefulness of depth is task-distribution dependent: If all tasks share the same covariance, depth is asymptotically unnecessary (long enough context suffices, as the model weights can "encode" the covariate information). If covariances vary across contexts, depth is fundamentally valuable, even with infinite context (reflecting the philosophy of test-time compute).
3. A good match between theories and experiments.
1. The entire analysis is built around linear regression tasks solved via in-context learning with (mostly) looped linear attention. I did not find much discussion of the choice of the looped attention block. One benefit I can imagine is decoupling the total number of model weights from the depth. However, looping could restrict the model's expressiveness: an unconstrained model could implement higher-order optimization algorithms, e.g., a Newton step rather than gradient descent [1], and might demonstrate different scaling behavior, especially in depth.
2. The key conceptual result is that depth becomes essential when task covariances vary across contexts ("randomly rotated structured," RRS). But the diversity they model is very specific: random orthogonal rotations of a shared spectrum. That is mathematically convenient but arguably still a stylized shift. Real heterogeneity looks more like mixtures of domains, sparsity structure, nonstationary label noise, hierarchical latent factors, etc. It is not obvious that random rotations are the right stand-in for natural distribution shift.
3. The paper argues its results are relevant to large-scale LLM design, but the experiments cap out at synthetic regression and relatively small controlled transformers. There’s no ablation on modern-scale architectures (residual blocks with MLPs, nonlinear attention heads, long-context finetuning) to show even qualitative alignment. So the significance for frontier models is still somewhat speculative.
[1] Giannou, A., Yang, L., Wang, T., Papailiopoulos, D., & Lee, J. D. (2024). How Well Can Transformers Emulate In-context Newton’s Method? arXiv preprint arXiv:2403.03183.
1. Your theory and experiments focus on linear regression tasks with (mostly) linear attention, Gaussian feature distributions, and controlled covariance structure. How confident should we be that the same depth–width scaling conclusions hold for nonlinear Transformers trained on natural language, vision, or multimodal data?
2. You mostly analyze single-head linear attention plus residual depth. How do you expect multi-head structure and MLP blocks (i.e., actual transformer blocks) to affect the depth vs. width story? For example, could significantly more attention heads (possibly also scaling with D) also benefit the learning process? |
Lightly AI-edited |
|
Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies deep linear self-attention trained on the in-context linear regression task, characterizing how performance depends on width, depth, number of training steps, batch size and data per context.
The paper studies an interesting and relevant problem of training a deep linear self-attention model on in-context linear regression tasks, considering different in-context task structures. Extensive prior work has examined one-layer linear self-attention trained on in-context linear regression with isotropic task vectors. Analyzing deep models and non-isotropic task vectors represents meaningful progress.
- My main concern is the looped transformer assumption, $W_i^l=W_i^{l'}, i\in \{k,q,v\}$, which means that all layers share the same weights. The expressivity and optimization properties of a deep self-attention model can differ significantly with or without this assumption. Why would the scaling limits derived under this constraint reflect those of a real multi-layer attention model, which typically learns different weights across layers?
- In the reduced model in Equation (4), all weight matrices within a layer appear to be merged into a single trainable matrix $\Gamma$, effectively making the self-attention layer "shallow". Since gradient descent dynamics and the loss landscape are sensitive to such reparameterization, the true loss landscape probably differs from those shown in Figure 2 and Figure 4b. If this is indeed the case, it would be helpful to explicitly highlight this distinction.
- I suggest that the authors perform another round of proofreading and polishing. There are presentation issues that make the paper unnecessarily difficult to read smoothly. I list some below.
- It appears that manual vertical spacing commands have been used in several places in the paper. The formatting on page 14 seems irregular.
- The clarity of Equation (3) could be improved by specifying the dimensionality of the weight matrices.
- The authors use inner-product and transpose notations interchangeably. The expectation notation $\mathbb E(\cdot)$ and the angle-bracket notation $\langle \cdot \rangle$ are also used interchangeably. Adopting consistent notation throughout would enhance readability.
- The symbols $\boldsymbol X, \boldsymbol y$ in Equation (4) seem to be undefined.
- The symbol $i$ is used inconsistently: sometimes as an index and sometimes as the imaginary unit. In particular, it is undefined in Result 7. Clarifying its meaning in each context would avoid confusion.
- In Equations (13) and (15), it appears that a scalar is being added to a matrix, e.g., $(1-L^{-1}\gamma\Lambda), (i\omega+\Psi\Lambda)$. Please check these terms for dimensional consistency. Additionally, the trace operator $\text{tr}$ is used without parentheses, which could be ambiguous.
The problem considered in this paper is interesting and potentially important. However, recurring issues with presentation and clarity make it difficult to fully assess the contributions. Improving the clarity would make the results more accessible for proper evaluation. |
Fully human-written |
|
Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies deep linear self-attention by analyzing a corresponding solvable model. The authors systematically investigate how the model's depth, width, context length, and number of training steps affect the learned solution. Specifically, the authors focus on three distinct types of data (termed ISO, FS, and RRS), revealing that depth is unnecessary for long contexts on ISO and FS, along with a series of other results that characterize the gradient flow dynamics. Furthermore, the authors derive a separable scaling law for the RRS setting.
1. This paper is well written and well organized. The presentation is very clear and the flow of this paper is consistent, which I think can allow the readers to easily appreciate the core contributions of this work (the summarized theoretical results along with immediate numerical experimental results).
The proof is also easy to follow (though I did not check all the details): it first transforms the gradient learning dynamics into those of an equivalent linear model, which has simpler dynamics, and then analyzes these dynamics under different conditions on the covariance. Although this general idea is not new, incorporating the depth into the analysis is novel.
2. The derived theoretical results indeed consider attention in the multi-layer case, which I believe is an improvement over prior works. The authors demonstrate when and why the depth can be necessary. The summarized results are indeed novel and interesting.
While the results are interesting in that they account for depth, I think the setting is still not significantly novel compared to prior works. In particular:
- While equation (3) indicates a dependence on the layer index $l$, the induced parameter $\Gamma$ in fact does not depend on $l$, because the matrices $W_i$ are shared across layers for each given $i$. Instead, the dependence on $l$ is replaced by a simple summation over $l$ in $\Gamma$. As a result, the corresponding analysis does not provide a significantly novel contribution relative to prior works in this line of research, i.e., studying the gradient flow dynamics of the induced parameter $\Gamma$ (which is still a linear regression) as a proxy for the true weight matrices of the attention model. It remains unclear whether doing so can really capture the effects of depth.
- The RRS setting is positioned as the most general case, but a covariance that is randomly rotated and structured across contexts still cannot effectively capture the essence of task diversity. In addition, the theoretical framework is built on a very specific joint limit where $P, K, B, D, N \to \infty$ with fixed ratios. While convenient for analytical tractability, this obscures effects that are relevant at finite scales.
- Due to the aforementioned limitations, the generality of the derived results remains unclear.
Minor: As this paper considers solvable models and scaling laws of attention, I think the related work [1], which also studies a solvable model for attention and its scaling laws, could be discussed a bit in the related work.
[1]. Lyu et al. A Solvable Attention for Neural Scaling Laws. ICLR 2025.
1. As I am mostly concerned about the role of depth, which is central to the novelty of this work, can the authors justify the validity of assuming equal weight matrices across layers and the generality of the resulting conclusions?
2. Furthermore, can the authors discuss the difficulty introduced by layer-dependent weight matrices and how the current framework could still be applied in that case?
3. Is taking the joint limit of $P, K, B, D, N$ necessary for the results presented in this paper? |
Fully human-written |
|
Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time |
Soundness: 4: excellent
Presentation: 2: fair
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper theoretically investigates in-context learning for linear regression using deep linear self-attention models, analyzing performance based on width (N), depth (L), context length (P), pretraining time (t), and data structure. Examining isotropic (ISO), fixed structured (FS), and randomly rotated structured (RRS) covariance settings, the authors find that depth primarily benefits ICL in the ISO and FS settings only when context length is limited; for infinite context length in these settings, increasing depth beyond L=1 offers no advantage. However, in the more complex RRS setting where covariances vary, depth significantly improves performance even at infinite context length. For this RRS case, the paper derives a Chinchilla-like scaling law and predicts a compute-optimal shape scaling, linking optimal architecture to task data properties.
- The paper provides a comprehensive theoretical analysis of multi-layer linear self-attention models for in-context linear regression across three distinct covariate settings (ISO, FS, RRS).
- This paper rigorously characterizes the training dynamics using gradient flow analysis, revealing how the model learns under different data structures and providing an interpretation of the learned estimator as implementing multi-step gradient descent with optimal step sizes.
- The derivation of a Chinchilla-like neural scaling law incorporating time, width, depth, and context length for the RRS setting in the context of linear regression with power-law features is a significant theoretical contribution.
- The application of Dynamical Mean Field Theory (DMFT) to derive a two-point deterministic equivalent for the loss landscape under random rotations represents a novel technical approach for analyzing complex learning dynamics in this asymptotic regime.
- The presentation of the detailed proofs and derivations in the appendix could be improved for clarity and accessibility; as written, it is challenging to fully verify the technical steps.
- Could the authors elaborate on the necessity of employing DMFT to derive the closed-form loss expression in Result 7? Is this approach required because directly analyzing the gradient flow dynamics in (13) is intractable, perhaps due to the lack of a known closed-form solution for the ODE governing $\gamma(t)$ in the randomly rotated setting with finite width N?
- The current analysis focuses heavily on the proportional asymptotic regime. Are the techniques employed amenable to deriving non-asymptotic results that might provide insights into the behavior of the system at finite sizes?
- The derivation of the two-point deterministic equivalent using DMFT in Appendix A introduces notation that appears distinct from the parameters used in the main text to describe the Transformer model and its dynamics, and it is hard to directly match (22) with the main result. Could the authors provide a clearer mapping between the DMFT variables/order parameters and the model parameters/dynamics described earlier in the paper? |
Moderately AI-edited |
|
Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents FLASHVLA, a training-free and plug-and-play acceleration framework for Vision-Language-Action (VLA) models. Instead of architectural re-design or retraining, the method targets two types of redundancy: (1) temporal similarity between consecutive actions and (2) redundancy among visual tokens. To address these, the authors propose a token-aware action reuse mechanism and an information-guided token pruning strategy based on singular value decomposition (SVD) and information contribution scores (ICS).
Experiments on LIBERO, UniVLA, and VLAbench show that FLASHVLA reduces FLOPs by 55.7% and latency by 36.0%, with only 0.7% performance drop. While the idea is appealing and practical, the methodology mainly combines known techniques (e.g., token pruning, reuse heuristics) with moderate novelty.
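To fix ideas, the two redundancies the paper exploits can be sketched as follows; this is my paraphrase, and the exact similarity measure, the threshold name `eps1`, the top-r choice in the information contribution score, and the function names are assumptions rather than the authors' implementation:

```python
import numpy as np

def action_angle(prev_feats, cur_feats):
    # Angle between consecutive steps' action-related features (flattened).
    a, b = prev_feats.ravel(), cur_feats.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def maybe_reuse(prev_action, angle, eps1):
    # Reuse the previous action in stable regions (small angular change);
    # otherwise return None and run the full VLA forward pass.
    return prev_action if angle < eps1 else None

def select_visual_tokens(V, keep, r=8):
    # Rank visual tokens (rows of V) by an SVD-based information contribution
    # score, here taken as the energy of each token's projection onto the
    # top-r singular directions, and keep the highest-scoring ones.
    _, _, Vt = np.linalg.svd(V, full_matrices=False)
    ics = ((V @ Vt[:r].T) ** 2).sum(axis=1)
    return np.argsort(-ics)[:keep]
```

In this reading, the reuse threshold and the token budget are exactly the manually tuned quantities my questions below ask about.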
- Proposes a simple yet effective framework to reduce redundant computations in VLA inference.
- Training-free and compatible with FlashAttention, enabling easy integration into existing models.
- Demonstrates strong empirical results on multiple benchmarks.
- Includes detailed ablation and sensitivity analyses.
- The approach primarily combines known concepts (token pruning, reuse heuristics) without strong theoretical advancement.
- The method is evaluated mostly in simulated environments; real-robot deployment results are missing.
- Performance may depend on manually tuned thresholds; no adaptive mechanism is proposed.
- Works such as VLA-Cache [1], TinyVLA [2], EfficientVLA [3] should be discussed and experimentally compared.
- Some related token pruning methods should also be discussed and compared, such as AIM [4], DART [5], and VisPruner [6].
[1] Xu, Siyu, et al. "Vla-cache: Towards efficient vision-language-action model via adaptive token caching in robotic manipulation." arXiv preprint arXiv:2502.02175 (2025). \
[2] Wen, Junjie, et al. "Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation." IEEE Robotics and Automation Letters (2025). \
[3] Yang, Yantai, et al. "EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models." arXiv preprint arXiv:2506.10100 (2025). \
[4] Zhong, Yiwu, et al. "Aim: Adaptive inference of multi-modal llms via token merging and pruning." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025. \
[5] Wen, Zichen, et al. "Stop looking for important tokens in multimodal language models: Duplication matters more." arXiv preprint arXiv:2502.11494 (2025). \
[6] Zhang, Qizhe, et al. "Beyond text-visual attention: Exploiting visual cues for effective token pruning in vlms." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025.
- How stable is the reuse mechanism under long-horizon or dynamic-scene tasks?
- Can the reuse threshold be made adaptive, e.g., based on uncertainty or temporal confidence?
- Could the authors report real-world timing (wall-clock) results beyond FLOPs estimation? |
Fully AI-generated |
|
Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a training-free and plug-and-play acceleration framework named FlashVLA. The proposed method features a token-aware action reuse mechanism and a visual token selection strategy that integrate seamlessly with Flash Attention to avoid redundant computation. The experiments on the LIBERO simulation benchmark show that inference latency and the FLOPs of visual token computation are both decreased considerably without significantly sacrificing the success rate.
1. Substantial reduction in FLOPs and inference latency without additional fine-tuning;
2. The method is straightforward and can be integrated with models that use Flash Attention for inference.
All experiments are conducted on simulated manipulation benchmarks (LIBERO, VLABench). The evaluation lacks validation on tasks that involve highly dynamic actions (requiring frequent and rapid changes of the actuators) and rapidly changing visual scenes (with significant perturbations of objects and background).
As we know, real-world tasks are prone to visual perturbations and sensor noise, while robotic arms face a sim-to-real gap. Could the action reuse mechanism significantly compromise the execution accuracy of the robotic arm? How effective does visual token reduction remain in the face of real-world visual variations? |
Fully human-written |
|
Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes FLASHVLA, a training-free acceleration framework for Vision-Language-Action (VLA) models. The method is motivated by the identification of two key redundancies: high similarity across consecutive action steps and high redundancy in visual tokens. Experiments on the LIBERO benchmark show that FLASHVLA can reduce FLOPs and latency.
- FLASHVLA can be directly plugged into existing VLA models (e.g., OpenVLA, UniVLA) without retraining, which makes it both practical and reproducible.
- The extensive ablation studies on # of Tokens, different modules and diverse VLA architectures make the results convincing.
- Both the similarity across consecutive action steps and the redundancy across visual tokens have been well studied. The claim of being "the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models" therefore needs more justification.
- The method's hyperparameters appear to require careful, per-setting tuning, which may limit reproducibility. Specifically, the parameter $\delta$, which controls the token set stability threshold $\epsilon_2$, is set to different values for each token budget (e.g., 3, 4.5, 5, 5.5).
- The experiments should be further strengthened. First, the experiments on other VLA acceleration baselines and other benchmarks should be included. Second, the evaluation is confined to simulation. While the simulation results on LIBERO are strong, the paper lacks validation on a real-world robot.
- How sensitive is performance to the thresholds ε1 (action angle) and ε2 (token overlap)? Could these parameters be adaptively learned or adjusted at runtime?
- In scenarios with discontinuous or abrupt motion (e.g., contact manipulation), does the reuse mechanism still function reliably, or could it mis-predict stable states?
- There seems to be a critical contradiction in the description of the "FlashTrigger" mechanism (Section 3.3). Could the authors please clarify the action reuse trigger logic in Equation 6? Is the condition $\alpha(s) > \epsilon_1$ a typo, and should it be $\alpha(s) < \epsilon_1$? The current formulation seems to contradict the stated goal of reusing actions in "stable areas" where the angle change is small.
- The visual token selection strategy is applied after the first two layers ($L_p=2$) in the prefill stage. What is the rationale for this specific depth? How sensitive is the model's performance to this $L_p$ parameter? |
Lightly AI-edited |
|
Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a method to enhance the visual comprehension capability of a multimodal large language model (MLLM) by training a region-proposal network, which is initialized from a subset of the MLLM's parameters. The training data for the region proposals are derived from the attention maps of intermediate layers in the MLLM. To obtain more precise and less noisy region-of-interest (ROI) targets, the authors introduce several techniques, such as removing sink tokens and dynamically assigning labels to instances.
With the trained region-proposal network, the MLLM can perform two-stage inference: in the first stage, the model predicts ROIs using the proposal network; in the second stage, it processes a high-resolution version of the selected ROI to enable more detailed visual information extraction. The authors conduct extensive experiments to validate the effectiveness of the proposed method, and provide ablation studies to assess the contribution of each design component.
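For concreteness, one plausible reading of the pseudo-label step (the 0.2/0.1 threshold values are referenced for this paper elsewhere; the normalization, the ignore index, and the function shape are my own assumptions, not the authors' code):

```python
import torch

def pseudo_roi_labels(attn, sink_idx, tau_fg=0.2, tau_bg=0.1):
    # attn: (H*W,) response-to-image attention from an intermediate layer.
    # 1) zero out sink tokens, 2) keep only confident regions as labels,
    #    leaving ambiguous positions ignored during RPN training.
    a = attn.clone()
    a[sink_idx] = 0.0
    a = a / (a.max() + 1e-8)
    labels = torch.full_like(a, -100.0)   # -100 = ignored position
    labels[a >= tau_fg] = 1.0             # confident foreground
    labels[a <= tau_bg] = 0.0             # confident background
    return labels
```

My questions below about global context and the standalone quality of the RPN proposals should be read with this picture in mind.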
* The overall presentation of the paper is strong. The manuscript is well-written, logically structured, and easy to follow.
* Extensive experiments are conducted to demonstrate the superiority of the proposed method across multiple MLLM benchmarks.
* The method achieves notable performance improvements, particularly on OCR-related tasks, highlighting its strengths in fine-grained visual understanding.
* The ablation studies and visualizations provide clear evidence of the effectiveness of the key components in the proposed approach.
* I question the necessity of extracting ROIs from attention maps, as they can be noisy and spatially imprecise. Given the availability of large-scale object detection datasets (e.g., OpenImages) and powerful MLLMs capable of generating captions, it may be feasible to construct a dataset with high-quality region-text annotations. It would be helpful for the authors to discuss whether their approach offers any advantages—such as cost, scalability, or task-specific alignment—over using such manually or model-annotated datasets.
* The authors propose several strategies for utilizing the predicted bounding boxes. However, a concern arises: if only the cropped ROI is used for answering the current question, could this lead to a loss of global visual context? While the global scene may appear irrelevant to the immediate query, it might provide valuable information for subsequent questions in a multi-turn dialogue. In such cases, does the proposed two-stage inference risk discarding useful contextual cues, potentially harming performance over extended conversations?
* The RPN is designed to predict ROIs based on the provided textual context. However, using a portion of an MLLM with billions of parameters for this purpose seems computationally heavy. Since the goal resembles that of open-set grounding models, it would be insightful to compare against lightweight alternatives such as fine-tuned Grounding DINO or YOLO-World. Could such models achieve comparable ROI localization performance with significantly lower computational overhead?
* The standalone performance of the proposed RPN remains unclear. There are no dedicated benchmarks or quantitative metrics reported to evaluate its accuracy in ROI prediction. Could the authors clarify how they validated the quality of the generated proposals during development? For instance, were human evaluations or proxy metrics used?
* I noticed that some visualization results are provided in the supplementary material. Including a few representative examples in the main paper—such as attention maps, predicted ROIs, and corresponding model outputs—would greatly enhance the reader’s understanding of the method’s working mechanism and effectiveness.
* I am also interested in potential failure cases. Are there instances where the baseline MLLM answers correctly but the proposed method fails? Analyzing such cases could shed light on the limitations of the current approach and guide future improvements.
Please see weaknesses. |
Moderately AI-edited |
|
Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception |
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
**Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception** proposes a Self-Distilled RoI Predictor (SD-RPN) to address high-resolution image perception problems. Unlike previous SFT-based or training-free RoI methods, this paper pioneers the use of filtered middle-layer attention maps as the supervision signal and trains several middle layers to serve as the SD-RPN. The overall idea is clear, elegant, and generalizable to other encoder+LLM VLMs. The effectiveness of the proposed method is demonstrated via experiments, with solid and comprehensive gains across benchmarks.
1. The idea of the Self-Distilled RoI Predictor is valuable, as it fully leverages the contextualization power of a VLM without relying on human-annotated data.
2. The training is not computationally expensive, and the method demonstrates consistent gains as the data quantity is increased (Tab. 4b). This shows the potential of scaling up this method.
3. The writing and illustrations of the paper are excellent. The paper has expressive and clear figures, solid algorithm descriptions, and detailed explanations of the methods.
1. The experiments are limited to RoI-based methods and do not compare against other SoTA methods for high-resolution perception. If the authors could add works such as Monkey [1], TokenPacker [2], Mini-Gemini [3], Honeybee [4], and LLaVA-NeXT [5], which report competitive numbers on the selected benchmarks such as DocVQA, the paper would be more solid.
2. The design choice of using the middle layers is not ablated. I have read Appendix A. While intuitively the middle layers should strike a balance between visual details and global semantics, this should be demonstrated, since using the middle layers can be much more computationally expensive than using shallow layers.
3. (minor) The 1.6x slowdown does not support the claim of being "lightweight". Consider rephrasing the claim to be clearer.
4. (minor) The hyperparameters, such as B and the S_bg / S_fg thresholds, are not ablated.
References:
[1] Li, Z.; Yang, B.; Liu, Q.; Ma, Z.; Zhang, S.; Yang, J.; Sun, Y.; Liu, Y.; and Bai, X. 2023d. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607
[2] Li, W.; Yuan, Y.; Liu, J.; Tang, D.; Wang, S.; Zhu, J.; and Zhang, L. 2024a. Tokenpacker: Efficient visual projector for multimodal llm. arXiv preprint arXiv:2407.02392
[3] Li, Y.; Zhang, Y.; Wang, C.; Zhong, Z.; Chen, Y.; Chu, R.; Liu, S.; and Jia, J. 2024b. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814
[4] Cha, J.; Kang, W.; Mun, J.; and Roh, B. 2024. Honeybee: Locality-enhanced projector for multimodal llm. In IEEE CVPR, 13817–13827
[5] Liu, H.; Li, C.; Li, Y.; Li, B.; Zhang, Y.; Shen, S.; and Lee, Y. J. 2024a. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge.
1. Have the authors considered ablating shallow layers? If not, is there a more convincing explanation?
2. Why are the DocVQA and ChartQA results significantly lower than those of the token compression methods?
I have a positive attitude towards this paper's contributions. I would maintain the high score if authors can address my concerns. |
Fully human-written |
|
Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a novel annotation-free framework termed the Self-Distilled Region Proposal Network (SD-RPN), designed to address the challenge that MLLMs face in comprehending fine-grained text or objects within high-resolution images. It circumvents the limitations of existing approaches: training-based methods that rely on large-scale annotated datasets, and training-free techniques that suffer from computational inefficiency and lower accuracy. The proposed approach transforms the inherently noisy attention maps within MLLMs into high-quality pseudo-RoI labels, which are subsequently used to train a lightweight RPN. Furthermore, the weights from middle layers of the MLLM are used to initialize the RPN, enhancing its performance. Empirical results demonstrate the high effectiveness of the method, achieving substantial absolute accuracy gains (over 10%) on benchmarks such as TextVQA and DocVQA when integrated into existing MLLMs.
1. This paper introduces a self-distillation framework for training a lightweight RPN, resolving the difficult trade-off between training-based methods, which require costly human annotations, and training-free methods, which are often inefficient and inaccurate.
2. The proposed pseudo-label generation pipeline is clear and feasible, as it involves (i) identifying and removing "sink tokens" and (ii) adopting a selective classification strategy that labels only high-confidence foreground/background regions.
3. The paper validates the effectiveness of the proposed approach through a wide range of benchmarks.
1. This paper introduces thresholds at several stages, such as pseudo-label generation and RoI prediction. However, the authors neither justify the rationale behind these specific threshold values nor explore how varying them affects model performance (a toy sketch of the thresholding step follows this list).
2. The paper seems to lack an in-depth analysis regarding the sources of "noise" during pseudo-label generation. For instance, it does not adequately address the types of noise involved, how such noise is generated, or what kinds of images or tasks are prone to producing noisy labels.
3. The qualitative examples presented in Figure 5 effectively showcase the successes of SD-RPN. However, a comprehensive evaluation should also include an analysis of its failure modes. The paper would be strengthened by showing examples where SD-RPN fails and diagnosing the cause.
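To make the threshold discussion concrete, below is a minimal sketch of the selective foreground/background assignment described in the pipeline; the tensor shape, the sink-token filter, and the exact values (0.2/0.1, taken from the question below) are illustrative assumptions, not the authors' code.

```python
import torch

def assign_pseudo_labels(attn, tau_fg=0.2, tau_bg=0.1, sink_quantile=0.99):
    """Turn a response-to-image attention map into selective pseudo-RoI labels.
    attn: (num_image_tokens,) attention mass per image token (illustrative shape).
    Returns labels: 1 = foreground, 0 = background, -1 = ignored (low confidence)."""
    # Crude sink-token suppression: zero out extreme outliers before rescaling.
    sink_mask = attn > torch.quantile(attn, sink_quantile)
    attn = attn.masked_fill(sink_mask, 0.0)
    attn = attn / (attn.max() + 1e-6)          # rescale to [0, 1]

    labels = torch.full_like(attn, -1.0)       # default: ignored (not supervised)
    labels[attn >= tau_fg] = 1.0               # confident foreground
    labels[attn <= tau_bg] = 0.0               # confident background
    return labels
```

A sensitivity sweep over tau_fg/tau_bg on a held-out split would directly address weakness 1.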
1. How were the specific threshold values (e.g., 0.2 and 0.1 for foreground/background definition) determined?
2. Which types of images or questions are more prone to introducing noise in the attention maps? |
Lightly AI-edited |
|
Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a Self-Distilled Region Proposal Network (SD-RPN). The pipeline first converts the response-to-image attention from the MLLM into high-quality pseudo RoI labels through two key steps: removing sink tokens and assigning labels. Then, the RPN is initialized using the corresponding layers of the MLLM. It collects hidden-state sequences from the second-to-last layer and uses the corresponding token sequences to predict an RoI map. Finally, the RPN is trained with self-distillation, constrained by these pseudo RoI labels. Trained on only a few question-answer pairs, the method achieves notable accuracy and efficiency improvements on unseen benchmarks.
1. Annotation-free RoI distillation mechanism that elegantly leverages internal attention for supervision.
2. Improvements in accuracy and efficiency, with comprehensive experimental validation across diverse benchmarks.
3. Clear motivation and insight, especially the analysis of noisy attention and the role of sink tokens.
1. The pseudo-label generation method relies on fixed thresholds (e.g., $τ_{fg}$=0.2) to distinguish foreground, background, and ignored regions. These thresholds directly affect the quality of the pseudo labels, but the paper lacks experimental validation with different parameter settings across multiple datasets.
2. Lack of experiments on very recent MLLMs (e.g., Qwen2.5-VL).
3. The efficiency improvement could be further quantified in terms of wall-clock latency and GPU cost.
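One of the questions below asks why plain BCE was used despite the heavy foreground/background imbalance. For reference, a standard focal-loss variant of the binary cross-entropy objective (a textbook formulation, not taken from the paper) looks like this:

```python
import torch
import torch.nn.functional as F

def focal_bce(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss (Lin et al., 2017): down-weights easy examples so that
    abundant, easily-classified background tokens contribute less to the gradient.
    logits and targets share the same shape; targets are in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```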
1. SD-RPN was tested on LLaVA and DeepSeek-VL. How does it transfer to other architectures such as Qwen2.5-VL?
2. What would the performance of SD-RPN be if the pseudo-labels were replaced by ground-truth bounding boxes? Would that represent the upper bound of SD-RPN?
3. How does SD-RPN perform compared to "thinking with images" methods?
4. Why was BCE chosen over other robust alternatives (e.g., focal loss or IoU-based loss) given the heavy class imbalance between foreground and background tokens? |
Fully human-written |
|
From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces EduVisBench, a multi-domain and multi-level benchmark designed to evaluate the visual reasoning capabilities of foundation models in educational contexts. The benchmark includes diverse STEM problem sets and a fine-grained evaluation rubric grounded in pedagogical theory. The authors also propose EduVisAgent, a multi-agent framework that coordinates specialized agents for instructional planning, reasoning decomposition, and visualization design.
(1) The formulation of a multi-agent system specifically tailored for pedagogical visualization seems novel.
(2) The paper is well-executed, with a rigorous experimental setup involving multiple model families.
(3) The writing is clear and well-structured.
(1) While the use of GPT-4o as an automated judge is validated, it remains a single-model evaluator. Including more diverse evaluators (e.g., human teachers, multiple LVLMs) could strengthen the reliability of the scoring system.
(2) The paper does not include an ablation study to analyze the contribution of each agent in EduVisAgent. Understanding which components are most critical would help future researchers prioritize agent design.
(3) The multi-agent system is computationally intensive. A discussion of inference time, resource requirements, or potential optimizations would be useful for real-world deployment.
(1) Could the authors provide an ablation study to show the individual contribution of each agent (e.g., removing the metacognitive reviewer or reasoning decomposition agent) to the overall performance?
(2) While automated scoring is efficient, have the authors considered a more extensive human evaluation with actual educators or students to assess the pedagogical effectiveness of the generated visualizations?
(3) What are the main practical challenges in deploying EduVisAgent in real educational settings (e.g., latency, integration with LMS, adaptability to different curricula)? |
Fully AI-generated |
|
From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the critical and underexplored challenge of generating pedagogically effective visual explanations using foundation models. The authors argue that existing models, while proficient in textual reasoning, fail to create structured, interpretable visualizations that support conceptual understanding in educational contexts.
To address this, the paper presents two primary contributions:
1. EduVisBench: A comprehensive benchmark for evaluating the pedagogical visualization capabilities of FMs. It consists of 1,154 STEM problems across Mathematics, Physics, and Chemistry, organized by difficulty. Crucially, it introduces a fine-grained, five-dimensional evaluation rubric grounded in pedagogical principles.
2. EduVisAgent: A novel multi-agent collaborative framework designed to excel at this task. Inspired by expert instructional design, the framework coordinates five specialized agents to systematically decompose a problem, structure the reasoning process, and generate a coherent, interactive, and visually grounded solution.
Through extensive experiments on EduVisBench, the authors demonstrate that existing state-of-the-art FMs and LVLMs perform poorly. In contrast, their proposed EduVisAgent achieves an average score of 81.6%, representing a substantial 40.2% relative improvement over the best-performing baseline, validating the effectiveness of their structured, multi-agent approach.
1. The paper tackles a timely and important problem. As AI becomes more integrated into education, the ability to generate not just correct answers but effective teaching materials is paramount. The focus on pedagogical visualization as a distinct capability gap in FMs is novel and well-motivated.
2. The design of EduVisAgent is not an ad-hoc collection of agents but is thoughtfully grounded in pedagogical theory, mimicking the division of labor in instructional design. The performance improvement is not marginal; a 40.2% relative gain over the strongest baseline is substantial.
1. The EduVisAgent framework consists of five distinct agents. While the overall system is highly effective, the paper lacks an ablation study analyzing the individual contribution of each agent (a toy leave-one-agent-out harness is sketched after this list). For example, how critical is the Metacognitive Reviewer or the Conceptual Mapping Agent to the final score? Understanding the impact of each component would provide deeper insight into the architecture and help identify the most critical elements for pedagogical visualization.
2. The benchmark and agent are designed for STEM problems that typically have a clear, decomposable reasoning path. It is unclear how this framework would generalize to more qualitative or open-ended domains, such as literature, history, or social sciences, where visualization might serve to illustrate arguments, relationships, or timelines rather than step-by-step problem-solving.
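To make the requested ablation concrete, a toy leave-one-agent-out harness could look like the sketch below; the agent names follow the paper's description, but `run_pipeline` and `score_on_benchmark` are hypothetical placeholders, not the authors' API.

```python
# Hypothetical leave-one-agent-out ablation for a five-agent pipeline.
AGENTS = ["task_planner", "conceptual_mapper", "reasoning_decomposer",
          "metacognitive_reviewer", "visualization_designer"]

def run_pipeline(agents, problem):
    """Placeholder: compose the listed agents and return a visualization artifact."""
    raise NotImplementedError

def score_on_benchmark(generate_fn):
    """Placeholder: run the benchmark and return the average rubric score."""
    raise NotImplementedError

full_score = score_on_benchmark(lambda p: run_pipeline(AGENTS, p))
for dropped in AGENTS:
    subset = [a for a in AGENTS if a != dropped]
    ablated = score_on_benchmark(lambda p, s=subset: run_pipeline(s, p))
    print(f"without {dropped}: {ablated:.1f} (delta = {ablated - full_score:+.1f})")
```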
1. Could you provide more insight into the necessity of each of the five agents?
2. How do you envision the EduVisAgent framework being adapted for educational domains outside of STEM that rely on more narrative or conceptual reasoning? |
Fully AI-generated |
|
From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces EduVisBench, a benchmark designed to systematically evaluate the pedagogical visualization capabilities of foundation models such as diffusion models and LVLMs. The study reveals that existing models struggle with visual reasoning, semantic alignment, and text–graphic coherence. To address these issues, the authors propose EduVisAgent, a multi-agent collaborative framework comprising agents for task planning, conceptual mapping, reasoning decomposition, metacognitive review, and visualization design. Experimental results show a 40.2% improvement over state-of-the-art baselines, demonstrating superior pedagogical coherence, logical structuring, and interactivity.
1. The paper identifies a genuine research gap in the pedagogical visualization ability of foundation models.
2. Results across multiple STEM domains convincingly demonstrate performance gains with detailed quantitative metrics.
3. The five-dimension rubric provides a reproducible and extensible evaluation standard.
1. Limited real-world validation: While grounded in educational theory, no classroom-level or human-teacher evaluation supports pedagogical impact.
2. Incomplete interpretability of agent collaboration: The internal coordination among agents lacks empirical or ablation-based justification.
3. Evaluation bias risk: Heavy reliance on GPT-based automated scoring may introduce bias or circular reasoning.
4. Limited domain generalization: The benchmark focuses on STEM subjects; extension to other domains remains unclear.
5. Density and readability: The paper is information-heavy, which may reduce accessibility for general AI researchers.
1. Are there any conflicts or redundancies among the agents? Have inter-agent dependencies been empirically analyzed or ablated?
2. Has the pedagogical efficacy been validated with human teachers or learners?
3. Could the GPT-4o-based evaluation introduce model familiarity bias toward similar architectures?
4. Will the authors release full source code and interactive visualization generation modules?
5. Can EduVisBench be extended to non-STEM disciplines or open-ended educational reasoning tasks? |
Fully AI-generated |
|
From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
- The authors propose a methodology to address the limitations of existing generative models in creating effective visual explanations.
- They introduce EduVisBench, a benchmark for evaluating the pedagogical quality of visual materials.
- They successfully utilize the EduVisAgent multi-agent framework to generate high-quality educational visualization data.
- EduVisAgent achieved a significant performance improvement over existing models through the collaboration of agents guided by instructional strategies.
- Developed EduVisBench, a benchmark containing richer information than the datasets used by existing generative models.
- Successfully validated the reliability of the LLM-based automatic evaluation system using human assessments shown in Table 2.
- Utilized five specialized agents to implement instructional strategies.
- Tested broad generalization capabilities across three major academic domains: mathematics, physics, and chemistry.
- A detailed explanation of the dataset utilized for evaluation is required, as the content presented in Figure 3 is unclear.
- The input prompts used for the LLM evaluation in Table 1 were not disclosed, making the verification of fairness difficult.
- The description of each agent in EduVisAgent is too brief, leaving it unclear how the educational theories are actually implemented.
- The description of how the theory was implemented in the system is lacking.
Openness regarding the benchmark and the educational data is important for validating the reproducibility and reliability of the research.
1. Are there plans to provide a public repository link for the EduVisBench dataset, and will high resolution versions of figures, such as Figure 3, be supplemented?
2. Can you disclose the specific generation prompts used to evaluate the baseline models in Table 1 to allow for the verification of reproducibility?
3. Is it correct that all agents, except the visualization agent, are composed of LLMs? If correct, are the authors willing to clearly disclose the specific details of the structured prompts used to implement the education theories within the LLM-based agents?
4. Are there plans to provide an additional analysis of the mechanism by which each theoretical component maximizes the educational effectiveness of the generated output?
5. The baseline comparison was made against plain LLMs; is a comparison against multi-agent systems or recent prompt-engineering techniques possible?
6. What was the primary reason for constructing a multi-agent system? Given that many recent LLMs have significantly larger input token limits, what is the performance difference when all prompts are aggregated and input to a single large LLM versus the proposed multi-agent approach?
7. Although "six specialized expert agents" are mentioned on line 86, Section 3 EduVisAgent only has five bolded agents. What does the remaining one refer to? |
Lightly AI-edited |
|
From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper introduces EduVisBench, a new benchmark to evaluate how well AI models generate pedagogically sound, step-by-step visual explanations, and proposes EduVisAgent, a multi-agent framework that significantly outperforms existing models by coordinating specialized agents for planning, reasoning, and visualization to create more effective and interactive learning tools. However, the paper lacks completeness in two core aspects: the composition of the benchmark and the implementation details of the multi-agent system. Please refer to the Weakness section for more details.
1. This paper introduces the first benchmark for visualized instruction, which is one of its key contributions.
2. This paper is well written, with a clear structure.
1. The benchmark samples presented are not sufficiently representative. For example, the visualization of mathematical instruction is based on Math 500, a dataset with the difficulty level of high school math competitions. However, the left panel of Figure 3 only shows a simple addition problem (7 + 9), which is not adequate or representative.
2. The paper lacks a concrete description of the process for converting text problems into image problems, for example how prompts are designed and configured.
3. Regarding the proposed multi-agent method, although it is stated to consist of multiple agents, the paper does not detail the prompt design for each agent or which models were used. A careful check of the appendix revealed no such information—this is a significant omission in terms of completeness.
see weakness |
Lightly AI-edited |
|
Align Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes an RL-based framework to address the limitations of multimodal models in detecting camouflaged objects by aligning them with human visual perception. The method employs a progressive refocus mechanism, enabling models to iteratively refine attention and localize concealed objects through hierarchical reasoning. Using a combination of curriculum reinforcement learning and a rule-based reward system, the approach improves camouflaged object classification and detection. Extensive experiments on public benchmarks demonstrate significant performance improvements over SFT baselines.
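For reference, a minimal example of the kind of rule-based localization reward a GRPO-style trainer could use here (the exact reward design in the paper may differ; the box format and the additive form are assumptions):

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def refocus_reward(pred_box, gt_box, label_correct: bool):
    """Toy rule-based reward: classification correctness plus graded IoU credit."""
    return (1.0 if label_correct else 0.0) + box_iou(pred_box, gt_box)
```

The rollout-count concern raised below could be probed by holding the total number of sampled boxes per query fixed across refocus settings.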
- The paper is well-conducted.
- The experiments are solid.
- Given the context of COD, the use of bounding boxes instead of segmentation masks to indicate targets introduces ambiguity. From the reviewer's perspective, masks would provide a more precise representation of the target. Similarly, the comparison between human performance and the proposed method, particularly using mIoU as the metric, is invalid, as human perception does not rely on bounding-box localization (humans rely on edges and boundaries instead). Furthermore, the visualization results in Fig. 6 suggest that steps 2–4 do not provide significant improvements, merely adjusting the offsets of the bounding boxes.
- Rollout count. While the proposed method shows improved mIoU with increased reasoning steps, it is unclear whether the performance gain is due to the refocus mechanism itself or simply because additional reasoning steps allow for more rollouts. The increase in performance from RF1 through RF3 raises the question of whether these steps correspond to pass@1, pass@2, and pass@3, rather than reflecting a true reasoning improvement.
- Experiment details. Fig. 7 is not adequately explained in the paper. The meaning of RF1, RF2, RF3, RF4, and RF5 on the x-axis is unclear.
- The pipeline and proposed techniques bear significant resemblance to prior work, particularly DeepEyes. The approach appears to adapt general MLLM methodologies to the specific domain of COD without substantial innovation.
see weakness |
Lightly AI-edited |
|
Align Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a novel framework (VRRF) for enhancing the performance of VLMs at identifying camouflaged animals in the Concealed Object dataset. The authors use a modified Group Relative Policy Optimization (GRPO) with a curriculum of progressive rewards during training. The results are benchmarked against other models and human subjects on identification and classification of the camouflaged objects. A noticeable improvement is seen from applying the VRRF RL framework to Qwen2.5. Further ablation studies demonstrate the efficacy and importance of the curriculum as well as of various modifications to GRPO.
The paper is well motivated and the results are presented comprehensively.
The curriculum was interesting and a novel application to camouflage (as far as I know), with prompt engineering adapted to the problem.
The progressive reward acquisition is also demonstrated thoroughly with a set of ablation studies.
There are concrete examples throughout the paper that make the framework generally understandable (though there are points of confusion in specific areas).
The analysis of the effect of refocus steps on inference time is useful and important.
The authors make many unsubstantiated claims about human vision and search, without citation or reference to the abundant literature on these topics. There is a diverse set of opinions on how humans perform search tasks; see, e.g., Rosenholtz et al. (2012), Rethinking the role of top-down attention in vision: effects attributable to a lossy representation in peripheral vision, Frontiers in Psychology, 3, 13; Carrasco, M. (2011), Visual attention: the past 25 years, Vision Research, 51(13), 1484–1525; Wolfe, J. M., et al. (2017), Five factors that guide attention in visual search, Nature Human Behaviour, 1(3), 0058. None of the seminal work on understanding human attention is cited or even indirectly referenced in the paper. Specifically, no citations are given to support the claim that subjects in search tasks take an iterative focus, rethinking, and backtracking approach (e.g., lines 192–194). I do not even find it necessary to connect improvements in VLMs at identifying camouflaged objects to inspirations from human perception. However, given the numerous claims in the paper, including the title, the omission of research on human attention and of any real attempt at applying it is unacceptable.
The human subject studies did not outline any steps towards the protection of the safety and privacy of subjects involved (e.g. IRB protocols, consent forms, etc.) As such I am flagging this for ethics review.
The human subject experiment is performed in LabelMe, a GUI for labeling images, without any mention of controlled stimulus presentation or timing as in a standard psychophysics experiment. How did the authors control the presentation time?
Prior work is not well introduced. It appears only in the appendix, whereas it should typically be near the beginning of the paper.
There should be a broader set of experiments to demonstrate the generalizability of the VRRF framework. How does VRRF perform on models other than Qwen2.5 and on search tasks from other datasets?
How are the "Hard" vs. "Easy" datasets identified? Was this an arbitrary grouping by the authors of the paper, or was it obtained through a human subject study based on accuracy, etc.?
A more detailed description of how training is performed is needed. For example, how is the prompt, q, presented across the various examples? Is it a fixed prompt as shown in Figure 2? Or does the <explore> </explore> section vary between examples?
Minor points:
The clip art throughout the paper is distracting and not an appropriate style for a conference publication.
In Figure 8 the ordering 1243 is a bit confusing. Is this incorrectly labeled? If not, the panels should be put in a raster pattern to avoid confusion. |
Fully human-written |
|
Align Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Visual Refocus Reinforcement Fine-Tuning (VRRF), a reinforcement learning framework that aligns multimodal models with human camouflaged perception. By introducing curriculum-based rewards and in-context “refocus” trajectories, VRRF enables models to iteratively shift attention and reason from global to local regions, mimicking human visual focusing behavior. Experiments on multiple camouflaged object detection benchmarks show substantial gains over supervised fine-tuning and prior RL-based methods, even surpassing human performance in certain challenging settings. The work is novel, well-motivated, and demonstrates clear improvements, though it would benefit from comparisons with open-source reasoning models and deeper reward analysis.
- Aligning with human cognition is important because existing models only align final outputs and ignore whether the reasoning process is consistent with human cognition.
- The authors propose an innovative approach that uses visual grounding within a reinforcement learning framework to enhance reasoning models.
- Constructs a new benchmark.
- ChatGPT's behavior in Figure 1 is not necessarily a "misreading"; it may in fact attend to the wolf's region (which an attribution-based analysis could reveal) while being weak at expressing visual grounding through text generation.
- The paper does not clearly specify whether rewards from previous stages remain active when transitioning to the next stage in the curriculum reinforcement learning process. It is unclear if the training is cumulative or reinitialized at each stage, which affects reproducibility and understanding of the optimization dynamics.
- The paper lacks comparisons with traditional camouflaged object detection methods, which would better highlight the unique advantages of using MLLMs for camouflaged perception.
- The paper does not appear to evaluate the model's generalization to out-of-distribution (OOD) camouflaged categories, leaving open whether the proposed refocus mechanism can handle unseen camouflage types or patterns.
- The paper does not discuss or visualize the model’s behavior when the refocus process fails to include the ground-truth region during reasoning. It remains unclear whether the model tends to produce incorrect predictions, hallucinate alternative explanations, or refuse to answer in such failure cases.
- It would be better to consider citing some works that differentiate between foreground and background for camouflaged object detection [1].
- The paper could also discuss the importance of interpretability for visual grounding [2].
[1] Phantom-Insight: Adaptive Multi-cue Fusion for Video Camouflaged Object Detection with Multimodal LLM. 2025.
[2] Interpreting Object-level Foundation Models via Visual Precision Search. CVPR 2025.
Please see the weaknesses. |
Moderately AI-edited |
|
Align Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Visual Refocus Reinforcement Fine-Tuning (VRRF), an RL-based framework that teaches multimodal models to progressively “refocus” attention when detecting camouflaged objects. Using a modified GRPO algorithm and a curriculum-style reward schedule, the model learns human-like Focus–Rethink–Backtrace behavior, achieving large gains over SFT and even surpassing human accuracy on difficult camouflage datasets.
- Clear motivation: bridges a known gap between human and model perception in camouflaged scenes.
- Consistent improvement across COD datasets and competitive results versus GPT-4.
1. **Relatively narrow scope**: The GRPO modification is simple and verified only on camouflaged perception. The paper would be stronger if it demonstrated generalization to broader perceptual or visual reasoning tasks.
2. **Experimental completeness**: (1) The "enhanced perception" claim should be better supported by results on perception-oriented benchmarks beyond camouflage (e.g., BLINK, CV-Bench, HallusionBench, or other perceptual / image captioning benchmarks). (2) It is not clear whether the method's focus on camouflaged objects will hurt general image-understanding capability.
3. **Human-study details** are insufficient (sample size, time constraints, evaluation setup). The “surpasses human” claim should be qualified accordingly.
- Can the designed mechanism generalize to other perception tasks (e.g., general or small-object detection)? Additional numerical evaluations would be helpful here.
- Could the authors test whether the improved perception leads to measurable gains in overall model ability on general image understanding benchmarks (MMBench, MMVet)?
- What exactly is the role of the "Clip-High Objective Without KL Penalty" in enhancing localization? Could it bias the model toward higher-confidence but less generalizable patterns? Evaluations addressing the previous question could also demonstrate this.
- How reliable is the backtracking behavior? So far, it appears only in demo cases involving camouflaged objects. Does it emerge consistently across different kinds of VL tasks? |
Lightly AI-edited |
|
Multimodal Policy Internalization for Conversational Agents |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces Multimodal Policy Internalization (MPI), a task and methods for training multimodal conversational agents to follow complex, reasoning-intensive policies without carrying those policies in the inference prompt. The authors propose TriMPI, a three-stage training recipe: (1) Visually-Masked Continual Pretraining (VM-CPT) to inject policy knowledge by language modeling the combined (policy + task) streams while masking visual tokens; (2) CoT-SFT, which teaches explicit policy-guided reasoning on task data; and (3) RL fine-tuning with a new rollout augmentation called PolicyRollout (PoRo) that adds policy-aware trajectories during exploration but keeps training/inference aligned by applying policy gradients only to the no-policy path. They also release two benchmarks: ClevrPolicy (synthetic, controllable policy complexity; a text-only variant and an image-augmented policy variant) and GTAPolicy (real-world images and queries for tool-use with versioned, user-conditional rules; low-data regime). TriMPI yields large gains over SFT and in-context baselines, maintains general capabilities, and substantially reduces prompt tokens and prefill time once policies are internalized.
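For context on the visually-masked objective summarized above, the sketch below shows the common way visual-token positions are excluded from a next-token cross-entropy loss; this is an assumption about how VM-CPT might be implemented, not the authors' code.

```python
import torch.nn.functional as F

def lm_loss_with_visual_masking(logits, input_ids, visual_token_mask):
    """Next-token LM loss that ignores visual-token positions.
    logits: (B, T, V); input_ids: (B, T); visual_token_mask: (B, T) bool,
    True wherever the position holds a visual embedding rather than text."""
    labels = input_ids.clone()
    labels[visual_token_mask] = -100                      # ignored by cross_entropy
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1), ignore_index=-100)
```

Question 1 below essentially asks whether VM-CPT goes beyond this standard label masking.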
The main strengths of the paper are as follows:
1. The authors address a practical, under‑explored problem: handling long, reasoning‑intensive policies for decoder‑only models. The motivation is clear and quantified.
2. Two new datasets are introduced. ClevrPolicy (synthetic, controllable complexity; text‑only and image‑augmented policy variants) and GTAPolicy (real images and queries; tool metadata and versioned rules in a low‑data regime) provide controlled benchmarks.
3. The three components (VM-CPT, CoT-SFT, RL) are well motivated and ablated; PolicyRollout improves over GRPO/DAPO by enabling policy-grounded exploration. The results also clearly show that the SFT stage alone is not enough to train the model to internalize the long policy prompts.
4. The paper evaluates policy updates, policy knowledge referral, catastrophic forgetting, and efficiency, which is crucial for this method because, in general, extending the prompt with the policy description at inference time appears to be a more robust approach than specialized training.
Despite the strong points of the research, I noticed a few weaknesses:
1. Risk of overfitting and external validity. While TriMPI maintains general abilities (MMMU‑Pro/MMLU‑Pro) and handles policy updates (Policy Override), broader real‑policy evaluations (e.g., long stylistic/safety guidelines) would further validate external generalization.
2. The three‑stage pipeline, particularly RL, increases implementation and tuning burden; DAPO vs. GRPO behavior on small data underscores this.
3. Although the authors argue soft prompts are task‑specific and hurt robustness, a direct empirical comparison to strong prompt‑tuning/gist‑token baselines would strengthen the case.
My questions to the authors are the following:
1. I may have missed something, but you note that you apply vision masking in the first training stage. Isn't it common practice in almost all adapter-based multimodal training to exclude vision tokens when calculating the cross-entropy loss? Or is there a specific trick applied in your work? Could you please provide a more detailed explanation of this stage?
2. Beyond Policy Override, do you evaluate OOD generalization across visual domains or unseen tool types?
3. How do failure modes break down (policy‑branching vs. perception vs. tool‑argument formatting), especially on GTAPolicy?
4. What are the most common overfitting patterns you observed during training, and can VM‑CPT or PoRo be adapted to mitigate them further? |
Lightly AI-edited |
|
Multimodal Policy Internalization for Conversational Agents |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work addresses the challenge of Multimodal Policy Internalization. Policy internalization is a known problem in LLMs, which involves learning policy information through model parameters, instead of providing it in context during inference. This work highlights that this in-context policy can significantly increase inference costs, more so for simpler user queries. This work concretely explores these issues with Multimodal models and proposes a benchmark and a training paradigm for multimodal policy internalization. Overall, on the proposed benchmark, with their framework TriMPI, the multimodal models are much better at following policy and accomplishing a task, without requiring the policy in context during inference.
- Firstly, I think the paper was well written, and most of the content was very clearly understandable. Great job on that!
- The problem of multimodal policy internalization is a well-motivated one, especially in scenarios like legal or financial applications, where agentic systems might need to reference large documents with text and images to make decisions.
- The idea of PolicyRollout, which adds policy-aware trajectories to the response space, is a clever one and seems to have good empirical benefits.
- The authors have ablated various stages of their pipeline, CPT, RL finetuning, and the role of policy rollout.
- Additional analysis on computational efficiency and evaluating policy referral is interesting and should provide useful advice for future practitioners when adopting the ideas from this work.
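Several questions below concern how the policy-aware rollouts influence the update. For reference, the group-relative advantage used by GRPO-style methods (a standard formulation, not specific to this paper) simply normalizes each rollout's reward within its group, so higher-reward trajectories are reinforced relative to their group mean:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """GRPO-style advantages. rewards: (num_groups, rollouts_per_group).
    Each rollout is scored against its own group's mean and std, so rollouts that
    happen to be better grounded in the policy receive a larger positive advantage."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```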
One key question I have about this work is that, although it is geared towards multimodal models, most of the ideas seem to be bootstrapped from the language domain and, after some tuning, shown to carry over.
- This is not necessarily a bad thing, but were there any explorations as to how using policy images can benefit policy internalization?
- The lack of discussion on the vision side of things is my only major complaint about this work.
- L304: it is claimed that the model remains insufficiently grounded in the policy. Was this a qualitative observation? Is there a way it can be measured?
- Just as an additional baseline, what if the model did have access to the prompt at inference time?
- Does MPI still work better?
- Or does having the prompt, essentially, make all the baselines and MPI the same in terms of performance?
- This could be a good trade-off to understand: when might the cost of MPI tuning be higher than simply putting the policy in the prompt? Note that this does not take anything away from the contribution of the work.
- At what point does the overhead of the in-context policy relative to the user query become negligible?
- Specifically, are there scenarios where the user queries may be large enough that the policy tokens are negligible with respect to them?
- In PolicyRollout, as I understand it, the RL algorithm now has more policy-aware options, but a question comes to mind, which might be naive:
- How do these policy-aware responses affect the rollouts? Is it because they are more correct, and hence GRPO pushes the policy model to produce more similar trajectories?
- Is RAG an alternative? When thinking about multimodal policy internalization, I am curious what would happen if the LLM could leverage a RAG-like system to retrieve the necessary policy information.
- Maybe this is a different way of implementing MPI, but this can maybe reduce the amount of in-context information needed during inference, by only retrieving the relevant one. I would be curious to hear the author’s thoughts on this one.
- Layers in ClevrPolicy: do these refer to the layers of the decision tree? Is that how the complexity of tasks is defined?
- Minor: There are often references to figures and tables in the supplement; these would be easier to parse if stated explicitly, e.g., "Table 5 in the Supplement". |
Fully human-written |
|
Multimodal Policy Internalization for Conversational Agents |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper tackles the problem of internalizing multimodal policies---complex, reasoning-intensive system prompts that guide agent behavior---into the parameters of large multimodal conversational models to improve efficiency and policy adherence. It introduces TriMPI, a three-stage framework combining Visually-Masked Continual Pretraining (VM-CPT), Chain-of-Thought Supervised Fine-Tuning (CoT-SFT), and PolicyRollout, a policy-aware reinforcement learning stage extending GRPO/DAPO for grounded exploration. Experiments on two proposed benchmarks, ClevrPolicy (synthetic decision-making) and GTAPolicy (tool-use), show significant performance gains over SFT and in-context baselines, while maintaining minimal degradation on general reasoning benchmarks (MMMU-Pro and MMLU-Pro).
- The paper is easy to follow, with strong conceptual motivation and clear examples of datasets and methods.
- The paper addresses an important and relevant challenge of conversational agents under long system prompts.
- Extensive experiments, ablations, and analysis (including generalization, efficiency, and catastrophic forgetting) lend strong empirical evidence to claims.
- It requires multi-stage SFT and RL with explicit SFT data; less straightforward than prompt-based alignment approaches.
- It is evaluated on self-constructed datasets (ClevrPolicy, GTAPolicy) rather than realistic multimodal policy data.
- It is unclear how the benefit of a "multimodal policy" over a "text-only policy" translates to conversational-agent use cases.
- *L048:* The estimated range of policy prompt lengths (1K–50K tokens) is broad—are there empirical or open-source statistics supporting this claim?
- What are practical scenarios where multimodal policy adherence is critical over text-only policy for conversational agents?
- What is the distinction from deliberative alignment (Guan et al., 2024), apart from the input modality?
- Writing
- *L072—073:* The phrase "…, which requires minimal reasoning" could be clearer, for example: "…, which requires minimal reasoning to adhere to the policy."
- There seems to be an inconsistency in Section 4: Line 256 describes two stages while Line 259 and Figure 4 show three. The method description should be made consistent across sections. |
Fully human-written |
|
FATE: Focal-modulated Attention Encoder for Multivariate Time-series Forecasting |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes FATE, a Focal-Modulated Tensorized Encoder tailored for multivariate time-series forecasting. Unlike vanilla Transformers that flatten inputs, FATE preserves a 3D tensor structure, introduces tensorized focal modulation with temporal focal grouping and cross-axis (feature) modulation, and provides interpretability via dual modulation scores (head-wise and station-wise). Across seven datasets, FATE reports consistent SOTA or competitive performance, with notable gains on long horizons and high-dimensional settings. Ablations, visualizations, and limited efficiency analysis are included.
1 Clear architectural novelty: a principled tensorized design that preserves temporal and feature axes; focal modulation tailored to time-series (temporal focal groups) rather than spatial grids; cross-axis modulation for multivariate dependencies.
2 Interpretability: dual modulation scores linking heads to stations, with compelling visualizations that show evolving spatial focus as horizon increases.
3 Strong empirical performance: consistent improvements on diverse datasets, including large-scale (LargeST) and climate-related (Weather5k), and competitive performance on standard ETT benchmarks. The reported margins on several tasks are sizable.
1 Hyperparameters differ substantially across models (batch sizes, layers/heads), and several strong modern baselines are missing or lightly tuned
2 Claims of “moderate overhead” vs. Transformers are not backed by FLOPs/latency/memory scaling curves across sequence length, stations, and features; no wall-clock comparisons at different horizons or ablations on focal levels.
3 For Traffic, FATE’s MAE is slightly worse than PatchTST (though MSE improves). A broader analysis of error distributions would clarify the trade-offs.
1 Can you report FLOPs, max memory, and throughput vs. (T, S, P) compared to standard Transformer, PatchTST, and TimeTensor, across short and long horizons? |
Fully AI-generated |
|
FATE: Focal-modulated Attention Encoder for Multivariate Time-series Forecasting |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper addresses key challenges in multivariate time-series forecasting (MTSF), including the difficulty of capturing hierarchical spatiotemporal dependencies and computational inefficiencies in high-dimensional data. To tackle these issues, the authors propose **FATE (Focal-modulated Attention Encoder)**, a novel Transformer-based architecture designed to improve accuracy and scalability for long-horizon and high-dimensional forecasting tasks.
Key contributions include:
1. **Tensorized Focal Modulation**: Retains the full 3D structure of the input tensor (temporal, spatial, and feature dimensions), enabling effective modeling of long-range dependencies.
2. **Focal Temporal Grouping**: Dynamically defines temporal focal groups to capture hierarchical temporal dependencies.
3. **Cross-axis Modulation**: Extends focal modulation to the feature dimension to model cross-feature interactions.
The authors validate FATE on seven diverse datasets, demonstrating state-of-the-art (SOTA) performance, particularly on long-horizon and high-variability tasks. Extensive ablations and interpretability visualizations further support the effectiveness of the proposed method.
- **Innovative Architectural Design**
The use of tensorized focal modulation to preserve the 3D input structure is a significant innovation. This approach effectively models spatiotemporal dependencies and cross-feature interactions, which are critical for multivariate time-series forecasting.
- **Thorough Evaluation**
The authors conduct comprehensive experiments, including ablation studies, to validate the impact of each component. Visualizations of focal modulation and attention dynamics further strengthen the empirical claims.
- **Limited Discussion on Computational Trade-offs**
While the paper highlights that FATE introduces moderate computational overhead, it does not provide a detailed comparison of training and inference times against lightweight baselines like linear models.
- **Inconsistent Performance on Certain Datasets**
Although FATE achieves SOTA results on most benchmarks, its performance on the Europe dataset is relatively weaker, with LSTM models outperforming it in several scenarios.
- **Sparse Explanation of Feature Selection**
The paper mentions the use of seven meteorological features for climate datasets but does not elaborate on the rationale or methodology for feature selection.
1. **Computational Efficiency**
Could the authors provide more detailed runtime comparisons (e.g., training/inference time per epoch) between FATE and lightweight baselines (e.g., DLinear, TiDE)?
2. **Generalization to Limited Data**
The Europe dataset results suggest that FATE may struggle in scenarios with limited or unevenly distributed data. Could the authors discuss strategies to improve FATE's robustness in such cases?
3. **Feature Selection**
How were the seven meteorological features chosen for the climate datasets? Did the authors perform any feature importance analysis to validate their selection? |
Fully AI-generated |
|
FATE: Focal-modulated Attention Encoder for Multivariate Time-series Forecasting |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes FATE (Focal-modulated Attention Encoder), an encoder-only Transformer for multivariate time-series forecasting. The key ideas are: (i) tensorized QKV projections that preserve the native 3D structure of inputs; (ii) a tensorial focal-modulation mechanism that aggregates hierarchical temporal context while gating interactions across stations/variables; and (iii) dual modulation scores to provide interpretability about which sources drive forecasts. Experiments on seven datasets show consistent gains against Transformer, linear, and spatio-temporal GNN baselines, with ablations and qualitative visualizations supporting the design.
1. This paper extends focal modulation (from vision) to a tensorized, dual-axis scheme for multivariate time series, preserving temporal and variable axes and introducing dual modulation scores for interpretability.
2. Broad evaluation across 7 datasets with long-horizon regimes is conducted; consistent accuracy gains are reported, including on large-scale traffic where FATE improves MAE/MSE over the best GNN baselines. Qualitative modulation maps align with the narrative about dynamic spatial dependencies.
3. Figures explaining slice-wise QKV formation and multi-head aggregation improve intuition, and the modulation visualizations make the interpretability story tangible.
4. The approach targets real forecasting challenges and shows benefits at longer horizons—valuable for climate and infrastructure planning.
1. The distinctions from recent tensor/patch or efficiency-oriented methods (e.g., tensorized attention variants, multi-scale mixers) are not clear; ablations isolate focal levels and gating qualitatively, but do not study which tensorization choices (e.g., per-axis PE, grouped projections) are essential vs. incidental. A one-for-one replacement study (e.g., FATE vs. Time-tensorized attention with identical training) is missing.
2. The paper states “moderate overhead” and “comparable to baseline Transformers,” but provides limited wall-clock/GPU memory curves across sequence length, station count, and horizon. More rigorous asymptotic and empirical scaling (vs. quadratic attention and linear baselines) would strengthen soundness.
3. While code is provided, some training details remain high-level (e.g., feature preprocessing per dataset; normalization and leakage controls; early stopping; number of runs). HP tables are present but do not specify search ranges or budget comparability across models.
4. Modulation scores are visually appealing, yet the paper does not quantify faithfulness (e.g., perturbation tests, deletion/insertion curves) or compare to attention-based or attribution baselines.
1. Please include runtime and peak memory vs. (i) number of stations, (ii) horizon length, and (iii) input window, comparing to vanilla Transformers and linear baselines on identical hardware/settings.
2. How many random seeds per result? Could you report mean and std and significance tests for Tables 2–4? Also, are hyperparameter budgets matched across baselines?
3. Beyond the circular graphs, can you quantify whether high-score cities/heads are causally important (e.g., occlusion/ablation of stations or time segments reduces accuracy proportionally to scores)? |
Fully AI-generated |
|
FATE: Focal-modulated Attention Encoder for Multivariate Time-series Forecasting |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes FATE, which focuses on capturing spatial-temporal dependencies by extending prior work, the FocalNet Transformer, to time-series forecasting.
The exploration of spatial-temporal data with the concept of locality is plausible.
The experiments span multiple areas to show the model's performance gains.
**W1:** Notation lacks definitions: The notation system can be improved. Many terms are used before being defined. For example, in lines 51-53, the authors directly use T, S, and P for the 3-dimensional tensor without explaining what each dimension stands for; in Eq. (1), $PE(\cdot)$ is also not explained, among others. Additionally, many notations are not in math format. While these issues do not directly obscure the main ideas, they reduce the clarity and readability of the paper.
**W2:** Terms unexplained: Several uncommon terms are used without explanation. For example, what are focal levels?
**W3:** Notation needs to be revised: Apart from the unexplained notation, the dimensions of the tensors and weights also seem problematic. For example, in Eq. (2), it is not explained how every h in Q, K, V can be obtained by multiplying every p in X with the corresponding weights. The dimensions are either mismatched or require a more detailed explanation. The paper needs substantial work on notation clarity.
**W4:** Experiments not explained: The method was proposed for spatial-temporal data. However, many of the tasks used to evaluate performance are not spatial-temporal and thus have neither a "3-dimensional" tensor structure nor any explicit geographical relationships (e.g., ETTh1, ETTm2, and Traffic). The "city" or "locality" concept becomes invalid for these specific datasets, and there is no explanation of how to adapt the proposed model to tasks with unordered covariates.
**W5:** Insufficient analysis. The paper would benefit from additional experiments to provide a deeper understanding of the model’s behaviour. For example, several parameters specific to this work are presented in Table 1 without validation or discussion of their impact.
Minor:
- The baselines used are somewhat out of date, with the latest being iTransformer, published in 2024. Many prior works report lower MAE scores on the benchmark datasets than this work.
- The color scale can be improved; the current version is not color-blind friendly.
- Please explain in detail what the 4 focal levels for temporal sequences are. |
Fully human-written |
|
Robustify Spiking Neural Networks via Dominant Singular Deflation under Heterogeneous Training Vulnerability |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper investigates the unstable training issue in heterogeneous training of SNNs. To address this, they propose a hyperparameter-free method named Dominant Singular Deflation (DSD). This reduces the Hessian spectral radius, prevents convergence to sharp minima, and enhances robustness under different training conditions. Extensive experiments across multiple datasets (CIFAR, TinyImageNet, ImageNet) demonstrate that DSD improves both robustness and stability without incurring inference overhead.
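For readers unfamiliar with the term, one plausible reading of "dominant singular deflation" is a rank-one deflation of a matrix's top singular component, sketched below with power iteration; this is an illustrative guess at the operation, not the authors' algorithm, and where it is applied (weights, gradients, or surrogate-gradient terms) is not specified here.

```python
import torch

def deflate_dominant_singular(W: torch.Tensor, n_iter: int = 20) -> torch.Tensor:
    """Remove the rank-one component along the dominant singular direction of a
    2-D matrix: W - sigma_1 * u_1 v_1^T. Power iteration avoids a full SVD and
    costs O(n_iter * m * n)."""
    v = torch.randn(W.shape[1], device=W.device, dtype=W.dtype)
    v = v / v.norm()
    for _ in range(n_iter):
        u = W @ v
        u = u / (u.norm() + 1e-12)
        v = W.t() @ u
        v = v / (v.norm() + 1e-12)
    sigma = u @ W @ v                      # dominant singular value estimate
    return W - sigma * torch.outer(u, v)   # deflated matrix
```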
- The paper is well-structured and easy to follow.
- The proposed method is elegant, hyperparameter-free, and mathematically grounded, making it practical for real-world SNN training.
- The method is validated across multiple datasets and architectures, including both static (CIFAR, ImageNet) and event-driven (DVS) data, consistently outperforming state-of-the-art baselines.
- The experimental validation does not clearly substantiate the theoretical motivation of the proposed method. While the Dominant Singular Deflation (DSD) algorithm is designed to suppress the unbounded growth of the Hessian's spectral radius, it is unclear how this mechanism directly translates into enhanced robustness of the SNN models.
- The paper states that DSD is designed to mitigate model collapse during heterogeneous training, but most experimental evaluations emphasize adversarial robustness under homogeneous training. Furthermore, while DSD improves robustness, it also yields the lowest clean-data accuracy in Table 1, which contradicts the paper’s stated goal of achieving stable and reliable training.
- The theoretical analysis in Theorems 1 and 2 relies on the Gauss–Newton (GN) Hessian approximation rather than the true Hessian. Since the GN Hessian is always positive semidefinite, this assumption may limit the generality of the theoretical results.
Could the authors please clarify what specific training scenario heterogeneous training refers to in this paper, and how it differs from homogeneous training? |
Lightly AI-edited |
|
Robustify Spiking Neural Networks via Dominant Singular Deflation under Heterogeneous Training Vulnerability |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
In this work, the authors present a novel method, Dominant Singular Deflation (DSD), to address the vulnerability of Spiking Neural Networks (SNNs) under heterogeneous training conditions. The authors provide both theoretical and empirical evidence to support their claims, demonstrating significant improvements in robustness across multiple datasets and attack scenarios. The work is timely and addresses an interesting issue in the safe deployment of SNNs.
1. This manuscript identifies and systematically analyzes the phenomenon of model collapse under heterogeneous training—a realistic yet understudied scenario. The theoretical analysis linking BPTT and direct encoding to the growth of the Hessian spectral radius is rigorous and insightful.
2. In this work, the proposed DSD method is hyperparameter-free. It effectively reduces the spectral radius of the Hessian and preserves the descent property, ensuring stable training without introducing significant overhead.
3. The authors conduct extensive experiments across multiple static and neuromorphic datasets, under both homogeneous and heterogeneous training settings, and against a variety of white-box and black-box attacks. The results consistently show that DSD outperforms existing SOTA methods in robustness.
1. As reported in Table 1, DSD leads to a noticeable decrease in clean accuracy compared to vanilla SNNs. This trade-off between robustness and clean performance may limit its applicability in real-world scenarios where high clean accuracy is required. The reviewer's own prior research has observed similar phenomena, so what are the authors' thoughts on how this issue might be mitigated?
2. While the paper identifies the combination of BPTT and direct encoding as the main culprit for vulnerability, it does not extensively explore how DSD performs with other training paradigms (e.g., SLTT) or encoding methods beyond direct and rate encoding.
3. The authors seem to have overlooked some studies on the robustness of SNNs [1-3] in the related work.
[1] "Enhancing the robustness of spiking neural networks with stochastic gating mechanisms." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 1. 2024.
[2] "Towards effective training of robust spiking recurrent neural networks under general input noise via provable analysis." 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2023.
[3]"RSC-SNN: Exploring the Trade-off Between Adversarial Robustness and Accuracy in Spiking Neural Networks via Randomized Smoothing Coding." Proceedings of the 32nd ACM International Conference on Multimedia. 2024.
Please see weaknesses! |
Lightly AI-edited |
|
Robustify Spiking Neural Networks via Dominant Singular Deflation under Heterogeneous Training Vulnerability |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper indicates that spiking neural networks (SNNs) trained using the backpropagation through time (BPTT) algorithm are inherently susceptible to perturbations from heterogeneous training data (clean and corrupted). The paper analyzes this susceptibility and concludes that it stems from an excessively high Hessian spectral radius. To address this issue, the authors propose Dominant Singular Deflation (DSD), a method that explicitly removes the dominant rank-one singular component from the gradient during training. The authors conduct experiments on multiple datasets and demonstrate that their method significantly improves the robustness of SNNs.
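For readers unfamiliar with the operation, the following is a minimal sketch of what removing the dominant rank-one singular component from a gradient matrix amounts to. This reflects the reviewer's reading of the core step only; the use of PyTorch, the thin SVD, and the layer-wise application are assumptions rather than details confirmed against the paper.

```python
import torch

def dominant_singular_deflation(grad: torch.Tensor) -> torch.Tensor:
    """Subtract the dominant rank-1 singular component from a 2D gradient matrix."""
    # Thin SVD; the singular values in S are returned in descending order.
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    # grad - sigma_1 * u_1 v_1^T removes the single most dominant gradient direction.
    return grad - S[0] * torch.outer(U[:, 0], Vh[0, :])

# Hypothetical usage, applied layer-wise before the optimizer step:
# for p in model.parameters():
#     if p.grad is not None and p.grad.ndim == 2:
#         p.grad.copy_(dominant_singular_deflation(p.grad))
```

In practice the full SVD could be replaced by a few power-iteration steps that estimate only the top singular triplet, which is directly relevant to the training-time overhead noted in the weaknesses below.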
1. From a novel perspective, this paper identifies a source of robustness vulnerability in SNNs.
2. A thorough theoretical analysis supports the proposed method.
3. The experimental results demonstrate the performance advantages of the proposed method.
1. It remains unclear whether the analysis presented in this paper holds for other training methods, such as parallel training approaches [1] or single-step SNNs that propagate firing rates [2], and for other encoding schemes, such as temporal encoding.
2. The proposed method produced SNNs that performed significantly worse than other robust methods on clean data—a clear drawback.
3. As shown in Table 7, the proposed method significantly increases training time. Training time nearly doubles on the static CIFAR10 and CIFAR100 datasets.
[1] Parallel Spiking Neurons with High Efficiency and Long-term Dependencies Learning Ability. NeurIPS. 2023.
[2] Scaling Spike-Driven Transformer With Efficient Spike Firing Approximation Training. IEEE TPAMI. 2025.
See the weakness. |
Lightly AI-edited |
|
Robustify Spiking Neural Networks via Dominant Singular Deflation under Heterogeneous Training Vulnerability |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
1. Research Question
- The paper addresses the instability and vulnerability of Spiking Neural Networks (SNNs) under heterogeneous training conditions and adversarial perturbations.
- Key problem: Why do SNNs collapse under small distribution shifts, and can this be mitigated by stabilizing their spectral dynamics during training?
2. Method Proposed: Dominant Singular Deflation (DSD)
- DSD is a spectral regularization technique applied during backpropagation.
- Instead of modifying weights or data, DSD removes the dominant singular component from the gradient matrix (Δ W) at each update step.
- DSD suppresses the gradient’s most unstable direction, aiming to reduce spectral explosion and improve training stability and robustness without introducing new hyperparameters or architectural changes.
3. Theoretical Part
- The authors conduct theoretical analysis based on the spectral properties of the Jacobian and Hessian during BPTT.
- The paper formalizes *heterogeneous training vulnerability* as a consequence of the exponential growth of the Hessian’s spectral radius over time steps.
- It shows (under simplifying assumptions) that the dominant singular direction of (Δ W) aligns across time, amplifying instability.
- Removing the dominant singular direction theoretically reduces the Hessian’s largest eigenvalue, ensuring descent and smoother loss curvature.
4. Experimental Part
- Experiments on CIFAR-10/100, TinyImageNet, and event-based datasets show: DSD stabilizes training (smoother loss, bounded gradient norms).
- DSD improves robustness under FGSM and PGD without adversarial training, and maintains or slightly improves clean accuracy.
- Spectral diagnostics (Hessian radius, singular value histograms) support the theoretical mechanism.
This work reminds me of a relevant paper: SNN-RAT (https://openreview.net/forum?id=xwBdjfKt7_W). The two works share a very similar view—both aim to improve robustness by controlling the spectral properties of SNNs (singular values/Lipschitz constants). However, they differ in subtle but important ways in their conceptual framing, theoretical analysis, and technical implementation. Therefore, I will use SNN-RAT as a point of comparison in discussing this paper.
1. Theoretical Part
- Presents a novel *optimization-space* spectral stabilization method, contrasting prior *parameter-space* approaches such as SNN-RAT, which constrain the largest singular value of (W). DSD instead regularizes the gradient matrix Δ W, introducing a fresh angle on robustness grounded in training dynamics.
- Provides a clear and mathematically coherent link between Hessian spectral growth, gradient alignment, and heterogeneous training instability — a valuable conceptual contribution to understanding SNN optimization behavior.
- Demonstrates theoretical descent guarantees via spectral deflation, suggesting that DSD maintains convergence while suppressing dominant curvature directions.
- DSD is elegant, minimal, and free of additional hyperparameters, making it an analytically interpretable robustness mechanism.
2. Experimental Part
- Empirical results show that DSD improves both training stability and adversarial robustness without adversarial training, which is a notable distinction from previous works relying on adversarial data augmentation like SNN-RAT/HoSNN.
- Consistent improvements across CIFAR-10/100, TinyImageNet, and DVS datasets, while preserving or slightly improving clean accuracy.
- Spectral diagnostics (gradient singular spectra, Hessian radius) align closely with the theory, strengthening internal validity.
- The evaluation includes white-box, black-box scenarios, reducing concerns about gradient obfuscation.
3. Method Scalability
- The DSD operation is simple and parameter-free, requiring only rank-1 spectral deflation per layer, theoretically lightweight compared to full spectral regularization in SNN-RAT.
- The method integrates seamlessly into standard backpropagation and does not alter network architecture, making it easy to adopt in existing SNN frameworks.
- DSD can reduce the reliance on adversarial samples in SNN adversarial training, which greatly reduces the amount of computation while still providing adversarial robustness.
1. Theoretical Part
Overall, the paper's proofs follow the path of "spectral analysis of the Jacobian product → growth of the Hessian spectral radius → deflation-based spectral shrinkage → improved stability." This is intuitively reasonable.
- It would strengthen the rigor and clarity of the paper if the authors clearly stated the key assumptions in each theorem. In particular, indicating when the core theorems fail would deepen our understanding of the problem.
2. Experimental Part
Overall, I think the experimental part is clear, comprehensive, and rigorous.
- More attack methods, such as Gaussian noise and APGD (https://arxiv.org/pdf/2003.01690), could be tested. Only FGSM and PGD are tested in the current paper.
3. Method Scalability
- The proposed DSD step requires layer-wise SVD or power iteration, which is computationally demanding for large-scale or convolutional SNNs; the paper should quantify this cost in the main paper.
- DSD depends on full gradient access and is thus incompatible with neuromorphic or local learning implementations, limiting its deployability on real spiking hardware.
- The direct modification of the principal components of the gradient is a concern. Even though the paper reports good results overall, unpredictable behavior may occur when the method interacts with adaptive optimizers such as Adam and RMSProp.
1. Could the authors explicitly state the key assumptions/conditions in each theorem? For example:
* Under what conditions do the contraction and boundedness assumptions on the Jacobian fail?
* Are there cases where the spectral radius of the Hessian would not scale linearly with time steps (T)?
* What happens if the “approximate time-invariance” assumption of the Jacobian is violated (e.g., due to noise, dropout, or adaptive thresholds)?
2. Could the authors consider evaluating robustness under additional perturbations such as Gaussian noise or stronger adversarial attacks, e.g., APGD?
3. Could the authors quantify DSD's computational cost (e.g., time per iteration, FLOPs, or scaling with network size)?
4. Could the authors discuss potential adaptations or approximations that make DSD compatible with neuromorphic chips?
5. Have the authors observed any instabilities or performance inconsistencies during the DSD training process, or when DSD is combined with different optimizers? |
Fully AI-generated |
|
An Optimal Algorithm for Marginalization in Bayesian Networks |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper studies the following problem: given a Bayesian network over a set of variables, find the smallest Bayesian network over a given subset of the variables that introduces no extra conditional independencies and does not violate the ancestral constraints of the original Bayesian network. It proposes the first provably optimal algorithm for this problem, i.e., one that is guaranteed to find a Bayesian network with a minimal number of edges.
The paper is generally well-written. I had no problems following the arguments laid out in the paper, and the proposed algorithm appears to be sound. Moreover, the algorithm is novel and the first to (fully) address this problem.
The main weaknesses of the paper are that the problem is rather niche/specialized and that, while the proposed algorithm is sound, no formal run-time analysis is provided (the algorithm runs in exponential time), nor a complexity-theoretic classification of the problem (e.g., an NP-hardness proof).
Niche problem:
The problem that is solved is effectively, for a given maximal ancestral graph (MAG), to find the smallest Bayesian network with no extra conditional independencies that satisfies given ancestral constraints (specified by a partial order). This can be used to tackle the (perhaps more natural) problem of reducing/marginalizing a Bayesian network to a subset of variables while preserving its ancestral constraints, introducing no further conditional independencies, and adding as few edges as possible. However, it remains unclear how solving this task can be useful in a larger pipeline. Also, for causal inference tasks, the MAG representation appears to be more useful than the Bayesian network constructed from it in this procedure. This raises the question of why it is crucial to keep the ancestral constraints, which is motivated by preserving the causal ordering.
Formal complexity analysis:
The provided algorithm has exponential run-time. While it is certainly much faster than naive exponential-time algorithms, an analysis corroborating this and discussing the worst-case run-time is needed. Also, from the paper it is not clear to me whether the hardness of the given problem has been studied. Even if this is simple/obvious, a proof/discussion of, e.g., its NP-hardness would complement the exponential run-time algorithm.
1. Can you give examples how the studied problem can be useful for larger pipelines? Are there (naive) implementations in software libraries (Bayesian networks, causal inference) of algorithms tackling this or similar problems?
2. Is it possible to say whether the problem is (likely) NP-hard? Is there hope that a polynomial-time algorithm can be derived?
Suggestions/typos:
- line 062: "the topological ordering", rephrase to not sound like there is a unique topological ordering
- line 169: removing barren nodes, i.e. nodes that...
- line 206: return a *non-empty* set of BNs ...
- line 250: "independency" should be independence
- line 262: "we'll" replace by "we will" (same for "let's" in line 357)
- line 367: "in our algorithm of this method", do you mean "in our implementation of this method"? |
Fully human-written |
|
An Optimal Algorithm for Marginalization in Bayesian Networks |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The submission studies marginalization in the context of Bayesian Networks (BNs). In particular, it provides a procedure to compute a BN that omits a specified set of nodes while preserving independencies and ancestor/descendant relations.
Bayesian Networks are a well-established model that is of interest to both empirical and foundational subcommunities in ICLR.
Given that the submission focuses on marginalization in BNs, the discussion of related work in that context is lackluster. The concept of marginalization in BNs is introduced as if it were a well-established concept, but in the relevant 4th paragraph of the introduction there is only a single reference given to previous works considering marginalization in BNs - and that reference is barely relevant (it studies abstractions of BNs and only contains a single occurrence of the word "marginalization", as a by-the-way remark). If marginalization in BNs has been understudied, the introduction should have made a point of that. It seems there are several related works on marginalization in adjacent graphical models (such as Markov models) from the early 2000s as discussed in Section 2.2, and perhaps some of these can be transferred to the BN setting - however, a discussion of whether this is the case is missing.
Moreover, the definitions are not provided with sufficient formal rigor, leading to ambiguities which would otherwise be easy to avoid. For example, ancestral graphs are introduced as the extension of BNs via the inclusion of bidirectional edges, yet how these new bidirectional edges interact with the previously introduced notation for purely one-directional edges is unclear. In classical graph-theoretic terms, bidirectional edges would be interpreted as two directed arcs which form a C2 - yet that is clearly not the case here, given the claim that ancestral graphs must be acyclic. But then how do the ancestor and descendant relations interact with bidirectional edges? Can paths contain bidirectional edges?
Another issue is that the specific contributions are not clearly described. As one example of this, the problem solved by Algorithm 1 (formalized only in Section 4) asks for a set of BNs instead of a single BN, and this is neither explained nor discussed. Has this specific problem been studied before, or how closely is it related to previous work on BNs? As another example, there is no discussion of the running time of the proposed algorithm - without checking the algorithm as well as the referenced subprocedures, one cannot even ascertain whether it runs in polynomial time. The submission is also missing a description or discussion of the technical contributions underlying the foundational results: what were the main difficulties that had to be overcome (or novel insights that had to be obtained) towards establishing the new results?
There are also other, smaller issues such as:
- The very first paragraph of the introduction is repeated (probably just a copy-paste error).
- At the end of Section 3.2, the submission claims that Shachter's reduction operations need not produce the most compact BN even after considering all possible orders of edge reversals and node removals. Since this claim is crucial for the significance of the submission, I would have expected it to be substantiated by a reference or explanation.
- There is still ample space left in the main body of the submission, even though several important proof details have been delegated to the appendix without being reflected in the short proof sketches. One wonders why the remaining space has not been used in any way.
The authors are welcome to respond to any of the concerns or questions raised above. |
Fully human-written |
|
An Optimal Algorithm for Marginalization in Bayesian Networks |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper tackles the problem of graph marginalization in Bayesian networks and provides an algorithm for finding the optimal marginalization. Definitions of valid marginalization and optimality are given. The algorithm leverages the closedness of MAGs (maximal ancestral graphs) under marginalization, decomposes the MAG into bidirected connected components (BCCs), and enumerates valid topological orders. The algorithm is shown to be correct and sound. Empirical experiments are conducted to show consistent speedups against existing algorithms.
- This is the first graph marginalization algorithm with a provable optimality guarantee. Clear theoretical advancement.
- The experiments demonstrate strong speedups in a wide set of random graph configurations.
- This paper is generally well-written. The introduction and motivation are clear.
- It would be better to have a running example (such as Figure 1 in the manuscript) to illustrate (i) how the proposed algorithm works and (ii) why the existing algorithms fail to give a valid or optimal marginalization. Maybe the example in Figure 1 is not complex enough.
- The experiments only demonstrate the speedup compared to the existing algorithms, rather than the correctness of the outputs. Empirically, for the graphs considered in the experiments, in what percentage of cases do the existing algorithms fail to give a valid or optimal marginalization?
- The performance of the algorithm appears to be driven by BCC structure. How large can BCCs typically get in practice?
- Is there a runtime complexity analysis for the proposed algorithm? |
Fully human-written |
|
An Optimal Algorithm for Marginalization in Bayesian Networks |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper contributes a new algorithm to obtain a marginalized Bayesian Network, that is, a Bayesian Network DAG approximating the projection of an input Bayesian Network DAG under the following constraints: it must preserve the ancestral relations (or topological constraints) of the original DAG, it must preserve the dependences of the original DAG, and it must minimize the number of edges. The authors claim this to be the first algorithm able to ensure that an edge-minimal network is found. The algorithm performs a local brute-force search over the bidirected components of a maximal ancestral graph representation of the input DAG in order to find a candidate DAG, by enumerating local topological orderings. The authors prove that this approach ensures sufficiency and necessity, establishing soundness of the algorithm. Experiments with synthetic networks show a significant speedup compared to a naive approach.
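To make the enumeration step concrete, a brute-force sketch of listing topological orderings consistent with ancestral constraints is given below. It illustrates only the generic enumeration subroutine under a partial order; it is not the authors' algorithm (which restricts the enumeration to bidirected components and additionally minimizes edges), and all names are illustrative.

```python
from itertools import permutations

def consistent_orderings(nodes, ancestral_pairs):
    """Yield all orderings of `nodes` in which, for every pair (a, d),
    node a appears before node d (i.e., a remains an ancestor of d)."""
    for order in permutations(nodes):
        position = {v: i for i, v in enumerate(order)}
        if all(position[a] < position[d] for a, d in ancestral_pairs):
            yield order

# Hypothetical component {X, Y, Z} with the single constraint that X precedes Z.
for order in consistent_orderings(["X", "Y", "Z"], [("X", "Z")]):
    print(order)
```

Even this toy version makes the exponential dependence on component size apparent, which is the scalability concern raised under weaknesses.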
- Text is relatively easy to follow (for someone with proper background)
- Sound approach
- Unexplored task
- Lack of proper motivation for the task
- The brute-force approach scales poorly in the size of the bidirected connected components, which undermines the single motivation given in the introduction (being able to speed up inference)
- Experiments are very preliminary and use unrealistic network structures
The work misses a proper justification. Bayesian networks with thousands of nodes are not very interpretable; in fact, a neural network that uses activation functions with image in [0,1] can be seen as a Bayesian network over binary variables, which shows that this fact alone does not contribute to interpretability. As it is, it is not clear that marginalization of Bayesian networks can help in probabilistic inference; it is possible that the task is NP-hard in itself, and there is no evidence that inferences produced by a marginal Bayesian network are good approximations (as the graph might have high in-degree, and conditional probability estimates are then less statistically robust). In fact, one way of obtaining the parameters of the marginal Bayesian network is by performing probabilistic inference in the original network.
The description of the task is poorly motivated. Why would one want to obtain a marginal Bayesian network that respects the ancestral relations? With respect to causal modeling, one can often work with the ADMG (which admits a factorization), thus dispensing with the approximations introduced by the marginal DAG. If one is only interested in a probabilistic model, then keeping the ancestral relations seems unnecessary (and can lead to much larger models).
Lemma A.1 (and Cor A.1.1.) are well-known, see e.g. Koller and Friedman 2009. |
Fully human-written |
|
Background Matters: Robust 3D Human Pose Estimation via Controllable Video Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
While the paper is well written and experimentally detailed, it lacks sufficient novelty and fails to demonstrate competitive performance compared to existing methods. The absence of comparisons with strong baselines further weakens the technical contribution.
1) The paper is clearly written and the proposed idea is clearly presented. Overall the paper is easy to follow.
2) It includes a comprehensive set of experiments.
1) The idea of using pose guidance for video generation to improve generalizability has been explored in prior work (e.g., PoseSyn [1]). The novelty of this paper is therefore limited.
2) The reported results fall significantly behind SOTA performance. For instance, on the 3DHP dataset, PersPose [2] (ICCV 2025) achieves less than 75 MPJPE, while this paper reports 124 MPJPE.
3) No quantitative comparison with relevant prior work, both in:
- Dataset generation methods: PoseExaminer [3], PoseGen [4], PoseSyn, IDOL [5], AdaptPose [6];
- SOTA 3D pose estimation models: PersPose, PostoMETRO [7]
[1] PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data
[2] PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation
[3] PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation
[4] PoseGen: Learning to Generate 3D Human Pose Dataset with NeRF
[5] IDOL: Instant Photorealistic 3D Human Creation from a Single Image
[6] AdaptPose: Cross-Dataset Adaptation for 3D Human Pose Estimation by Learnable Motion Generation
[7] PostoMETRO: Pose Token Enhanced Mesh Transformer for Robust 3D Human Mesh Recovery
The authors might want to clarify the contributions, compare with related SOTA methods, and add further discussion. |
Lightly AI-edited |
|
Background Matters: Robust 3D Human Pose Estimation via Controllable Video Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes using diffusion models to generate and augment the in-studio dataset to improve the generalization of human pose estimators. Specifically, they argue that 2D-to-3D pose lifting techniques are often trained on in-studio, near-perfect data, whereas in the real world such data are scarce, and artifacts from noise and occlusion can undermine the robustness of pose estimation systems. To address this challenge, the paper introduces a two-stage augmentation pipeline that first trains a controllable video generator (Animate Anyone) on a dataset and then feeds it poses from diverse domains to generate new postures with different backgrounds. The paper supports its claims through extensive experiments, showing that augmenting and training the models on corrupted/synthesized datasets results in improved performance.
- The paper is well-written and easy to follow. It explains the technical details and provides adequate clarification.
- The cross-dataset analysis clearly shows that by augmenting RGB videos rather than 2D poses, the performance of cross-dataset generalization can increase
- Alternatives to video generation are not compared against. For instance, a simple background-pasting algorithm could serve as a straightforward baseline (a minimal sketch of such a baseline is given after this list).
- The computational cost and practicality of the approach are questionable. Training a large video generation model can be much more computationally expensive than rendering a synthetic dataset. Additionally, the paper mentions that 90% of the data is discarded, meaning that the process is highly inefficient and uncontrollable. This limitation and the lack of controllability are not fully addressed in the paper.
- No other baselines are compared against. For instance, PoseAug, which augments the 2D poses, could serve as a point of comparison. The paper cites such methods but does not include them in the comparisons.
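A minimal sketch of the background-pasting baseline referred to above, assuming a person segmentation mask is available from an off-the-shelf model; this is only one possible instantiation suggested by the reviewer, not a method from the paper under review.

```python
import numpy as np

def paste_on_background(person_rgb: np.ndarray,
                        person_mask: np.ndarray,
                        background_rgb: np.ndarray) -> np.ndarray:
    """Alpha-composite a segmented person onto a new background frame.

    person_rgb, background_rgb: (H, W, 3) uint8 arrays of the same size.
    person_mask: (H, W) float array in [0, 1], e.g. from a human-segmentation model.
    """
    alpha = person_mask[..., None].astype(np.float32)
    composite = alpha * person_rgb.astype(np.float32) + \
        (1.0 - alpha) * background_rgb.astype(np.float32)
    return composite.astype(np.uint8)
```

Applied frame by frame with randomly sampled backgrounds, such a baseline would isolate how much of the reported gain comes from background diversity alone, without any generative model.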
1. Could you please provide a detailed breakdown of the computational cost for 1) training/fine-tuning the video generator, 2) generating the dataset, and 3) training the HPE model? Please provide a comparison with other available approaches cited in the paper on line 58.
2. Please address my points in the above section, and the point about comparing with PoseAug specifically. |
Lightly AI-edited |
|
Background Matters: Robust 3D Human Pose Estimation via Controllable Video Generation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper utilizes controllable video generation for learning to estimate 3D human poses robustly. RGB video generation is intended to help produce diverse 2D human motion sequences with varying poses, scenes, and viewpoints. Besides using real 2D inputs, the idea behind using real-world detections is to build a robust and generalizable pose estimation model. The authors conduct experimental evaluations on various 3DHPE datasets to show the effectiveness of the proposed approach on real-world and corrupted 2D inputs.
1. The paper reframes the usual pose-only augmentation approach into a multi-modal data generation pipeline that explicitly models scene diversity in terms of pose, viewpoint, lighting, etc. which helps in building a generalizable 3D human pose estimation model.
2. The idea of leveraging pose-guided video diffusion models is intuitive and a straightforward implementation should be easily achievable.
3. Experiments include multiple datasets (H36M, PMR, 3DHP, 3DPW) and diverse metrics (MPJPE, P-MPJPE, velocity error). Moreover, the paper evaluates robustness under real-world corruptions (blur, compression, spatter), which strengthens the practical motivation.
- The effects of filtering ratios, pretraining strategies, and GT vs. detected 2D inputs are well-studied. Also, the results seem to demonstrate consistent improvements across nearly all configurations, which suggests robustness of the method.
- The technical contribution mainly lies in data generation and composition rather than in a novel model or algorithm, which I feel is a major discussion point. The method builds on existing diffusion video generation models with minimal architectural innovation.
2. The paper relies heavily on pretrained models like Animate Anyone (Hu et al.) and Latent Diffusion model (Rombach et al.) without domain-specific adaptation. It is unclear how much of the improvement comes from the inherent realism of these models rather than the proposed pipeline design.
- The method is validated primarily on H36M, PMR, and 3DHP, which are all captured in controlled environments. A demonstration on truly unseen in-the-wild datasets (e.g., MPII, COCO-Video, AMASS-based scenes) is missing and would support generalization claims.
4. One of the concerns is that generating and filtering large-scale video data using diffusion models is resource-intensive. The paper lacks any discussion regarding this, e.g., training time, compute requirements, or efficiency trade-offs compared to simpler augmenters like PoseAug (Zhang et al.). No user or perceptual evaluation is provided for the realism or physical plausibility of generated videos.
1. What is the precise novelty beyond combining existing diffusion video generation models with pose augmentation?
2. What is the core mechanism by which background variation improves generalization? The hypothesis is intuitive (“background matters”), but the paper lacks an analysis or visualization showing how added background diversity affects learned representations.
3. What is the computational overhead of generating and filtering data? Since video diffusion generation is extremely costly, the scalability of this method to larger datasets or real-time adaptation is unclear. Additionally, for a generalizable model, training a video generation model on extensively diverse datasets seems to be crucial but at the same time unscalable. |
Fully AI-generated |
|
STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes STITCH, a generation method that alternates between unspoken reasoning chunks and spoken response chunks. It aims to significantly reduce the latency between reasoning generation and spoken response. Experimental results on both reasoning (e.g., math) and non-reasoning datasets, such as TriviaQA, show that STITCH performs on par with or better than baselines that either reason before speaking or do not perform reasoning at all.
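As a rough illustration of the decoding schedule described above (chunk sizes, token-type names, and the reasoning-first ordering are hypothetical choices made by the reviewer for illustration, not values taken from the paper):

```python
def stitch_r_schedule(n_chunks: int, n_reason: int, n_text: int, n_speech: int):
    """Toy token-type schedule for STITCH-R style decoding: every chunk interleaves
    unspoken reasoning tokens with the text and speech tokens that are spoken aloud."""
    schedule = []
    for _ in range(n_chunks):
        schedule += ["reasoning"] * n_reason  # internal, never synthesized to audio
        schedule += ["text"] * n_text         # textual form of the spoken response
        schedule += ["speech"] * n_speech     # speech tokens sent to audio synthesis
    return schedule

# STITCH-S would start with a text/speech chunk before the first reasoning chunk,
# while TBS would emit all reasoning tokens up front. Made-up chunk sizes:
print(stitch_r_schedule(n_chunks=2, n_reason=4, n_text=2, n_speech=5))
```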
- The authors introduce three ways of integrating reasoning into spoken language models: Thinking Before Speech (TBS), Simultaneous Thinking and Talking with Reasoning First (STITCH-R), and Simultaneous Thinking and Talking with Speaking First (STITCH-S). The methodology is clearly described, and Figure 2 effectively visualizes the differences between these methods.
- The analysis is good. The authors report performance while varying the length of reasoning chunks during inference and analyze the number of tokens used in the reasoning process. They also conduct experiments using different reasoning models to study how reasoning quality affects spoken responses.
- Experiments are conducted on both reasoning-oriented and non-reasoning datasets, making the experiments comprehensive.
- Based on the performance tables (Table 1-(a) and 1-(b)), there is no clear winner that consistently outperforms all other baselines. On math datasets, STITCH-R and STITCH-S show mixed performance across models, sometimes performing significantly worse than TBS (e.g., TBS 64.94, STITCH-R 58.70, STITCH-S 56.72). Similarly, on non-reasoning datasets, STITCH-R and STITCH-S perform inconsistently relative to other baselines. The paper does not clearly explain the reasons behind these trends.
- Following the previous point, there are no clear guidelines on when to use STITCH-R versus STITCH-S.
- The main motivation for reducing interaction latency is to improve user experience while maintaining response quality. However, the paper does not include any user-centric evaluation (e.g., human preference, perceived responsiveness, or satisfaction).
- It is unclear how STITCH handles cases where reasoning takes longer than speech generation. Would the model introduce pauses until reasoning is completed? Such pauses could make the system less convenient for users.
In STITCH-S, reasoning occurs after the response. In this case, the reasoning could be merely post-hoc justification. Why does this variant outperform baselines that reason before speaking or models without reasoning? |
Lightly AI-edited |
|
STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces STITCH, a new framework that allows Spoken Language Models (SLMs) to “think” while they are “talking.” Current SLMs can only generate speech responses directly, without any internal reasoning before speaking. In contrast, humans silently think before responding aloud. STITCH mimics this behavior by dividing the reasoning and response generation into small “chunks.”
Instead of generating a long chain of reasoning first, STITCH alternates between reasoning chunks and speech chunks, so the model can keep thinking while speaking. The paper shows that STITCH improves reasoning performance by 15–20%, especially on math tasks, compared to baseline models, while keeping response latency nearly the same. It performs equally well on non-reasoning tasks, showing that "thinking" doesn't harm general speech quality.
- The idea is novel — enabling SLMs to think internally while speaking. The chunked reasoning design (STITCH-R and STITCH-S) is creative and practical for real-time systems.
- Overall the paper is very well-written
- The experiments mainly focus on math reasoning datasets like GSM8K and SVAMP. It would be valuable to test STITCH on more diverse reasoning domains such as commonsense, dialogue reasoning, or multi-hop factual reasoning.
- The paper compares only within GLM-4-Voice variants. Is it possible to see how STITCH performs when applied to other open-source SLMs like Qwen2.5-Omni or Baichuan-Audio to prove generalizability?
- How does STITCH behave under actual streaming conditions where speech synthesis latency varies? |
Fully AI-generated |
|
STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This study introduces STITCH, a method designed to incorporate reasoning into speech by interleaving it within the existing Speech-Language Model (SLM) framework—specifically building upon GLM-4-Voice, one of the text–speech interleaving architectures (e.g., GLM-4-Voice, VITA-Audio). The authors propose two variants, -R and -S, depending on whether reasoning appears at the very beginning or after the first text–speech chunk.
The proposed approach is optimized based on the A100 + vLLM configuration. Compared to the baseline method, in which GLM-4-Voice performs “thinking” before speaking, STITCH demonstrates comparable performance on mathematical reasoning tasks and achieves performance on general conversational tasks that is similar to the original GLM-4-Voice.
**[S1]** The paper is written clearly and is easy to follow. The operational mechanism of the proposed method is straightforward to understand.
**[S2]** Notably, the model achieves results on mathematical tasks comparable to approaches that explicitly perform reasoning, while reducing latency.
While the study presents an interesting direction, the scope of its **novelty and generalizability appears somewhat limited**. The following points are offered as considerations rather than criticisms:
**[W1]** The reported optimization is based on the **A100 + vLLM** setting, which may limit the applicability of the results. It remains uncertain whether the proposed approach would generalize well to limited hardware, larger models, or alternative architectures, such as those that jointly optimize a separate decoder (or talker) during inference rather than interleaving, or models that predict codecs (e.g., *moshi*). In addition, the interleaving method applied in this study appears to extend the existing text–speech interleaving approach, already used in models such as GLM-4-Voice and VITA-Audio, to the reasoning component, without introducing additional architectural or methodological considerations specific to reasoning itself.
- **[Q1]** To what extent do the reported gains persist across different hardware budgets, model scales, and architectural variants (e.g., non-interleaving pipelines with a joint talker/decoder, or codec-predictive systems such as moshi)?
**[W2]** The dataset used comprises general dialogue, mathematical, and knowledge-intensive questions generated through **GPT-4o**. It is possible that employing reasoning-specialized models, such as **Qwen3-235B** or **DeepSeek**, for path construction might have yielded broader improvements, particularly in the non-reasoning tasks presented in Table 1. The comparable performance to the baseline GLM-4-Voice on general tasks therefore leaves some ambiguity regarding how the reasoning path contributes in those contexts.
- **[Q2]** Does the choice of path-generating model (e.g., GPT-4o vs. reasoning-oriented models) materially affect downstream performance, especially on non-reasoning tasks (Table 1)?
**[W3]** The paper does not provide an analysis or guarantee as to whether the reasoning path consistently concludes earlier in the 100 / 13 / 26 configuration, nor whether reasoning reliably precedes the corresponding text segments. Given that human reasoning typically precedes linguistic expression, a more explicit discussion or examination of this alignment could further strengthen the study.
- **[Q3-1]** Does reasoning precede and complete before corresponding text/speech segments, and does early termination of reasoning path occur in the 100 / 13 / 26 configuration?
- **[Q3-2]** If the reasoning segment tends to finish much earlier than the corresponding text, could the authors examine (1) whether adjusting the reasoning chunk length during training so that it depletes at a pace similar to the text segment, and (2) independently, enforcing a structure where reasoning always precedes the textual response (to prevent the model from answering first and reasoning afterward), would lead to different experimental results?
See weaknesses |
Lightly AI-edited |
|
STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces STITCH, a Spoken Language Model (SLM) that performs internal reasoning alongside speech generation, theoretically without increasing interaction latency. STITCH generates speech chunk-by-chunk, with each chunk containing reasoning tokens, text tokens, and speech tokens. Experimental results demonstrate that this method improves reasoning performance compared to baseline approaches.
- The idea of chunk-wise reasoning alongside speech generation is novel.
- The investigation into optimal reasoning chunk length and the use of reasoning from other models adds further novelties.
- Experimental results clearly show the effectiveness of the proposed method.
- The paper does not compare STITCH against thinker-talker models. The substantial drop in reasoning performance observed in interleaved models (relative to text-only LLMs) often stems from the interleaved LLM's need to generate both text and speech tokens. In contrast, the thinker component in thinker-talker models generates only text, addressing this limitation. Although the authors mention that thinker-talker models are harder to fine-tune—which may imply challenges in adapting STITCH to such architectures—it would be fairer to include thinker-talker models in the comparisons.
- Evaluation is conducted solely on text outputs. While this is somewhat acceptable, it would be more robust to also evaluate directly on speech outputs, since the performance gap between text and speech evaluations can depend on the backbone LLM’s text-speech alignment capabilities.
- There is an inconsistency in the introduction section. Lines 084-087 state that STITCH-S cannot generate reasoning by design, but it actually generates text reasoning—it simply does not generate text reasoning in the first chunk. Am I missing something here?
- Statistical analysis should be performed to assess whether there are significant differences in responses among TBS, STITCH-R, STITCH-S, etc.
N/A. |
Lightly AI-edited |