|
AdaSpec: Adaptive Spectrum for Enhanced Node Distinguishability |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper investigates the problem of node distinguishability in spectral Graph Neural Networks (GNNs). The authors state that a spectral GNN's ability to distinguish nodes is theoretically lower-bounded by two main factors: the number of distinct eigenvalues of the graph matrix ($d_M$) and the number of non-zero frequency components of the node features in the matrix's eigenbasis. Based on this analysis, the paper proposes AdaSpec, a plug-in module that generates an adaptive graph matrix $\Omega(A,X)$ designed to maximize this lower bound. AdaSpec consists of three components: $\Omega_D(A)$ to increase distinct eigenvalues using a learnable diagonal matrix, $\Omega_S(A)$ to shift eigenvalues from zero, and $\Omega_F(X)$ to increase the number of non-zero frequency components by incorporating feature information. The authors provide theoretical guarantees that AdaSpec maintains permutation equivariance and empirically demonstrate its effectiveness at improving node classification, particularly on heterophilic datasets.
S1. The paper tackles an important and specific problem—node distinguishability for spectral GNNs.
S2. The design of the AdaSpec module is well-motivated. Each of its three components ($\Omega_D$, $\Omega_S$, $\Omega_F$) is explicitly designed to address a specific limitation identified in the theoretical analysis.
S3. The experiments are extensive, covering 18 benchmark datasets with diverse characteristics (homophilic, heterophilic, large, and small). The ablation study in Table 4 validates the contribution of each component of AdaSpec.
W1. The most critical flaw is the failure to compare AdaSpec against other relevant graph augmentation or rewiring methods. The paper frames its contribution as a graph matrix generation module, which is functionally a form of learnable graph rewiring or augmentation. Although the authors claim to be the first to study node distinguishability, many other papers on spectral GNNs have studied their expressive power, where node distinguishability was explicitly or implicitly considered. As the methodology and experiments show, AdaSpec is in fact a graph augmentation method for spectral GNNs. A comparison with other related augmentation methods for spectral GNNs is therefore necessary.
W2. The related work section acknowledges graph rewiring techniques (e.g., DropEdge, DiffWire, FoSR) but dismisses them as "fundamentally different", arguing that they operate in the spatial domain. This distinction is not convincing: the goal is the same, namely to modify the graph structure to improve GNN performance. AdaSpec likewise demonstrates improved GNN performance, not improved node distinguishability per se. Moreover, there are many rewiring methods that do operate in the spectral domain.
W3. The experiments only compare spectral GNNs with AdaSpec to the same GNNs without it (i.e., using a fixed matrix). This demonstrates that some form of adaptation is better than none, but it fails to show that AdaSpec is superior to, or even competitive with, other existing augmentation/rewiring techniques. The observed performance gains might simply stem from adding any adaptive rewiring, rather than from the specific spectral motivations of AdaSpec.
W4. The paper's central concept, "node distinguishability," is not formally defined until Section 4 (Definition 4.1). This is a major structural flaw. The term is used in the title, abstract, and throughout the entire introduction without a precise technical definition.
Q1. To validate the paper's central claim, the authors must demonstrate that their spectrally-motivated adaptive matrix is more effective than other spatially-motivated or general-purpose adaptive matrices.
Q2. It is better to move the formal definition of node distinguishability (Definition 4.1) to the preliminaries (Section 3) or to provide a concise, informal definition in the introduction.
Q3. Could the authors elaborate on the unique necessity of the $\Omega_S(A)$ component given the presence of $\Omega_D(A)$? Does the learnable diagonal matrix $B$ in $\Omega_D(A)$ not already provide sufficient flexibility to shift eigenvalues, including the zero eigenvalues?
The ablation study (Table 4) shows that $\Omega_S(A)$ on its own has inconsistent and often poor performance (e.g., on Citeseer and Cora), suggesting it may be redundant or unnecessary. |
Lightly AI-edited |
|
AdaSpec: Adaptive Spectrum for Enhanced Node Distinguishability |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes to enhance spectral GNNs by improving their node distinguishability. Namely, it starts by observing that node distinguishability is limited by two factors: repeated eigenvalues in the graph adjacency matrix and missing frequency components in the node features. Based on this, the paper proposes to modify the graph shift operator used in spectral GNNs with constructions that improve both factors, and provides supporting theory. Finally, experiments on many transductive datasets are performed by using the learned graph shift operator in various spectral GNNs. Results show improvements over baselines and confirm that the number of distinct eigenvalues is indeed increased on several datasets.
1. **Motivating objective.** The focus on improving node distinguishability in spectral GNNs is well-motivated and addresses a relevant limitation of existing approaches.
2. **Theoretical development.** The paper provides substantial theoretical analysis in support of the proposed method, offering formal insights into the factors influencing node distinguishability.
3. **Clear experimental framing.** The experiments are structured around clear research questions, making it easy to understand the intended empirical validation.
4. **Breadth of evaluation.** The method is evaluated on a range of standard transductive benchmark datasets, demonstrating applicability across multiple settings.
1. **Practical relevance of the theoretical bound.** The main conceptual contribution is an improved lower bound on the number of nodes that can be distinguished. However, it remains unclear how meaningful this bound is in practice, or whether it provides actionable guidance for model design or empirical performance.
2. **Clarity of assumptions.** Some theoretical results (e.g., Theorem 5.2) rely on assumptions that are either not fully stated or not clearly motivated. It would be helpful to explicitly articulate these assumptions and discuss their necessity and scope.
3. **Novelty claims may be overstated.** The statements “To the best of our knowledge, no existing work has systematically analyzed the interaction between the graph matrix and node features in determining node distinguishability in spectral GNNs” and “In this work, we demonstrate that node distinguishability is influenced by the eigenvalue multiplicity and the missing frequency components of node features in the eigenbasis of the graph matrix” appear stronger than warranted. Similar themes were examined in prior work, particularly [1]. The contribution would be clearer if the relationship to [1] were more explicitly discussed and the novelty claims were calibrated accordingly.
4. **Readability and exposition.** The paper is difficult to follow in several places. A clearer introduction to the problem setting, intuition for the theoretical results, and more accessible mathematical presentation would greatly improve readability.
5. **[Minor] Limited significance of empirical results.** The experimental findings are not particularly strong: in Table 2, 25 out of 40 comparisons are not statistically significant, and in Table 3, 27 out of 35 are not significant. That said, this is standard in the field.
Reference:
[1] Wang, X., & Zhang, M. (2022). How powerful are spectral graph neural networks? In ICML (pp. 23341–23362). PMLR.
1. Fig. 1(a): The paper states that a spectral GNN cannot distinguish nodes 1 and 3. Should a GNN be able to distinguish these? It seems that nodes 1 and 3 are isomorphic, as the node feature vector is [1, 0, 1, -1, -1].
2. Proof of Theorem 5.1: "More unique coefficients in characteristic polynomial implies more unique eigenvalues of the matrix." Can you clarify what you mean by this? Are you claiming that more distinct coefficients imply more distinct roots of the polynomial? A counterexample to this claim: $(x-1)^3 = x^3 - 3x^2 + 3x - 1$ has 4 distinct coefficients but a single distinct root, whereas $(x-1)(x-2)(x-3) = (x^2-3x+2)(x-3) = x^3 - 6x^2 + 11x - 6$ has only 3 distinct coefficients but 3 distinct roots.
3. Theorem 5.2 relies on Theorem A.1, which requires C to have no repeated eigenvalues; hence the statement of Theorem 5.2 is wrong/misleading, since it appears to apply to any matrix C. In particular, given that the motivation is that the adjacency matrix has repeated eigenvalues and a modification of A plays the role of C, I question the usefulness of this statement in the context of the paper.
4. Table 4: Please add the base performance of ChebNet.
5. Table 5: please add the statistics for all datasets, not necessarily in the main text. |
Fully human-written |
|
Self-Knowledge Without a Self? Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper tests whether it is possible to fine-tune an LLM to predict the correctness of itself or of other LLMs, including multiple LLMs at once, and finds roughly equal performance whether an LLM's correctness is predicted by itself or by other LLMs. The authors conduct several ablation experiments, including removing the answer and the response from the fine-tuning prompt, effectively fine-tuning the model to predict performance based on the query only, thus forcing it to extract patterns in the question space. Some experiments with in-context learning are also conducted.
- originality: the precise experimental setup (fine-tuning a model to predict correctness from various other models) is original, even though the broader idea has been around for a while
- quality: the experimental setup is sound and thorough, and so is the conceptualisation and interpretation of the findings
- clarity: the text is mostly clear
- significance: the fact that a model's performance can be predicted as well by another model as by itself is interesting and insightful, particularly when the answer is not included; also, this finding seems to partly contradict that in https://arxiv.org/abs/2410.13787, which sets the ground for an interesting debate. Moreover, the strength of predictive power is significant. Also, the evaluations on selective prediction are important.
- The paper could discuss more the real-world relevance of this for questions where the ground-truth answer is unknown (and which, therefore, the model fine-tuned to predict performance does not know either). What I mean is: performance predictors that include the answer in the prompt may rely on knowledge of the ground truth, and the authors indicate how this likely explains the increased performance over not using the answer. However, this is only true when the answer is known to the model. There may be cases where the answer is unknown (to the model and to the human users), and where performance prediction is therefore even more important. In practice, one can test the predictive methods on datasets where the performance of the model fine-tuned to predict the performance of another one is very low (for instance, because the facts are after the knowledge cut-off of the model trained to predict performance, but not of the target model).
- I think an addressable weakness is not touching with a particular strand of work exploring the question of whether performance can be predicted, such as:
- https://ojs.aaai.org/index.php/AAAI/article/view/21487 introduces the concept of "assessors", small models trained to predict the performance of a main model using the question only
- https://ceur-ws.org/Vol-3169/paper4.pdf implements assessors for language models
- https://arxiv.org/abs/2409.03563 develops assessors that work across multiple language models
I invite the authors to indicate how these approaches relate to the one they propose in their related works section. I also believe that some of these approaches can be insightful baselines to compare their method with (for instance, using simple classifiers trained on top of sentence embeddings can further indicate if a LLM is needed as the predictive model, or if instead simpler approaches allow to determine predictive patterns in the prompt space).
- some minor clarity points:
- lines 49-52 seem to suggest calibration is the only thing that matters for confidence, but this is not the case: one can have calibrated confidence by always predicting the success rate, with no discriminative power
- it would be great to make more explicit the precise finetuning prompt(s) used in the main text, particularly how the answer, response, and model name are used (for instance, putting them in a figure, instead of hidden among the text)
- for instance, the prompt in lines 167-168 asks how the model “will” respond to a prompt, but the authors say that this is appended to a prompt and model response. So why “will”? Is this what is actually used?
- the authors’ findings seem to partly contradict those in https://arxiv.org/abs/2410.13787. How would the authors reconcile the two? Is there a substantial difference between the setups that may explain the difference in findings?
- lines 306-307: “These results indicate that correctness prediction generalizes across families, sizes, and even held-out stronger models.”. Is this due to models failing and succeeding in very similar ways? |
Fully human-written |
|
Self-Knowledge Without a Self? Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper argues that a model 1) _has no privileged information about its own error distribution_ and hence that it is more effective to 2) _treat the ability to predict correctness / detect error as a downstream task on its own_. (Here I am wording the two points my own way, hopefully the authors will agree that my phrasing represents their views well.)
The paper provides evidence for (1) by testing whether a model can predict its own correctness any better than some other model could, the observation is in the negative. The paper goes over a number of design ideas to essentially deliver (2), for which of course it uses data about a model's error distribution (that is, annotation for when a model has been wrong/correct in the past). The paper shows that by engineering this component (the paper calls it a 'correctness model') carefully, the ability to predict correctness can even generalise across a number of conditions.
The paper is generally well-written, the research is interesting and the findings are likely to be echoed / built upon. That said, I do find it to miss an entire branch of relevant literature, but I will comment on that in weaknesses.
1. RQ1, both its formulation and the simplicity of the observation that supports it is very refreshing.
I personally think it's about time that someone dispels the tacit assumption that models have privileged information outside the merits of carefully engineered heuristics or uncertainty quantifiers (the 'careful' in it being the designer's evolving knowledge - paper after paper - of what works and what doesn't, effectively, a proxy to learning from historical data). For that, I find this paper rather refreshing.
2. I have a problem with how the paper positioned RQ2 (see weakness), but the strategies developed to approach it (the various designs for the so-called 'generalised correction models' and the observations in the various settings) come across as quite thorough.
My main problem with this paper is that it is written as if confidence and quality estimation had never been approached as ML tasks ever before. Learning from 'historical data about correctness' is precisely what these fields have always done (from SVMs, to Bayesian models, to neural models, to LLMs, with supervision from a single model or via system combination, with and without feature engineering, etc; confidence and quality estimation have been approached by everything in the ML toolbox). The oldest reference I can think of on the spot is [Specia's 2009 paper](https://aclanthology.org/2009.eamt-smart.10/), but please do the literature check, you will uncover more papers than I can list (from small university labs to the biggest industry labs out there) and you will find that many of these papers are hugely impactful.
To the best of my judgment, the research in this submission stands, but this paper - even if inadvertently - currently obfuscates that "learning from historical data about correctness" is an actual thing and it's been called QE for at least nearly two decades.
I am not sufficiently close to QE myself, so I cannot tell you whether the designs proposed in the paper as 'correctness models' are at all surprising. I do suspect they aren't, but I will refrain from penalising in that dimension. Instead I will just hope that there's a QE reviewer in the loop. I am however penalising for a kind of obfuscation of literature that I consider harmful.
My expectation is that the paper should be transparent / clear rather upfront about the point above, namely, the framing of confidence/quality estimation as a task on its own, for which models can be trained on historical data, etc, is generally accepted, well-established and _not_ a contribution of this paper.
Of course, in light of the findings surrounding RQ1, the importance of embracing that framing is clear, and this is an argument you are establishing and supporting in the submission. I appreciate that and am not challenging it.
Last, in your literature check, I'd recommend looking for recent papers (for example from Unbabel or WMT submissions to the QE task; these are just some entry points into modern QE) as the concrete designs you propose might have been proposed before and, I hope you agree, it is only right to give credit where credit is due (even if, incidentally, those papers did not inspire you directly). |
Fully human-written |
|
Self-Knowledge Without a Self? Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors analyzed how accurately models judge the correctness of responses, both their own and those of other models, and found that models do not exhibit higher accuracy when judging their own answers; instead, more capable models judge answers more accurately. Meanwhile, the authors trained a judgment model and observed that it could achieve higher judgment accuracy. They argue that this finding indicates that the response accuracy of models can be assessed by leveraging certain patterns and confidence levels rather than model-specific self-knowledge. Additionally, the authors suggest that both post-hoc calibration and in-context learning (ICL) are effective in improving accuracy.
The authors' writing is highly concise and accessible.
The topic of confidence is of great importance, especially in the context where hallucinations are becoming increasingly severe under the reinforcement learning paradigm brought by o1/R1.
Although the authors focus on the confidence field, in the reviewer's opinion, they fail to conduct sufficient literature research.
On the one hand, the authors overlook the field of reward models. The Correctness Model proposed by the authors appears to be a type of reward model, yet there are numerous open-source models in this domain, including General Reward Models and Process Reward Models. Notably, many existing studies in the reward model field have already reached conclusions similar to those presented in this work, yet no reward-model baseline is included.
Despite the rigorous logical flow of the research content, its innovation and inspirational value are, in the reviewer's view, insufficient to support its acceptance by top-tier conferences.
As stated in the Weaknesses section. |
Moderately AI-edited |
|
Self-Knowledge Without a Self? Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns |
Soundness: 1: poor
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This work investigates the problem of confidence calibration for Large Language Models (LLMs). The authors propose two research questions (RQs):
1. Are LLMs superior to other LLMs in predicting their own correctness?
2. What role does historical information from multiple models play in calibrated correctness prediction?
The study conducts experiments on open-source LLMs to address these questions.
1. The topic of LLM verbalized confidence calibration is highly important, and this research possesses significant practical relevance.
2. Both RQs are critical. A comprehensive solution to this issue would have a profound impact on the community, highlighting the paper's exceptional insight.
3. The authors designed a series of progressive experiments to substantiate their claims. The paper is well-structured and reads very clearly.
4. RQ2 is intuitive and is supported by convincing experimental evidence.
1. The validation for both RQ1 and RQ2 relies exclusively on logit-based confidence scores, which severely limits the paper's persuasive power. As noted in [1], logit-based confidence in large models can be unreliable due to the influence of RLHF. This is particularly problematic for RQ1: verbalized confidence generation is strongly dependent on a semantic understanding of the context. In contrast, logit confidence, which is based on probabilities adjusted by RLHF, is not semantically aligned with the concept of "confidence." Consequently, this approach fails to leverage the powerful In-Context Learning (ICL) capabilities of LLMs.
2. Following W1, the focus on logit-based scores restricts the study to smaller, weaker, open-source models. A critical claim like that in RQ1—which contradicts intuition and community consensus—requires validation on large-scale, state-of-the-art models (e.g., GPT-5, Claude 4, Gemini 2.5 Pro, DeepSeek V3.2) to be convincing. While the current experiments explore multiple angles (which is commendable), the most critical factors—model scale and quantity—are insufficient. The experiments for RQ1 are limited to Qwen2.5-7B and Llama3.1-8B. Even by small-model standards, this sample size is very small. A more appropriate study should include various sizes of Qwen, GPT-OSS models, Gemma, and others to provide a convincing basis for the conclusions.
3. The significance of "answerless" confidence is unclear. This metric seems to be a direct assessment of the question's intrinsic difficulty, which could be interpreted as a marginal distribution of confidence summed over all possible answers. However, this approach fails to account for the prior distributions resulting from the model's own predictions for specific answers. The rationale for using this as a basis for an ablation study needs more detailed justification.
[1] Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher Manning. 2023. Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5433–5442, Singapore. Association for Computational Linguistics.
1. Lines 252-254: Why does removing the model response necessarily eliminate the influence of parametric knowledge?
2. Line 196: Is "Measuring Confidence" a typo here? |
Lightly AI-edited |
|
ReCAP: Recursive Prompting for Self-Supervised Category-Level Articulated Pose Estimation from an Image |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a self-supervised, single-image, category-level articulated object pose estimation framework that avoids depth/3D supervision. The method adapts VGGT with lightweight prompting, fuses semantic and geometric cues, predicts a dense point cloud, and canonicalizes it via a learnable category template to regress global and per-part poses. Experiments show competitive performance.
+ Proposed a self-supervised, single-image articulated pose estimation framework built on a frozen VGGT backbone, avoiding any ground-truth annotations while tackling a challenging setting.
+ The prompting strategy introduced for VGGT is meaningful.
+ While adapting VGGT may be reasonable, the reported ablations show only marginal gains; it remains unclear whether the proposed prompting/recursion is necessary versus simpler alternatives or no adaptation at all.
+ Benchmark coverage is limited and qualitative results focus largely on rotational joints; prismatic/mixed-DOF categories, heavy occlusions, and classes with larger intra-class variation are underexplored.
+ For symmetric objects, near-closed configurations, or low-texture surfaces, the joint type/axis is not uniquely recoverable from a single image; the predictions read as the most plausible hypothesis under shape/semantic priors rather than demonstrably identifiable solutions.
+ How is the joint type obtained (assumed, predicted, or inferred)?
+ Using a learnable category-level template with DCD may bias predictions toward an average shape, suppressing instance-level details and skewing axis estimation. Quantifying this effect is valuable.
+ Since core geometry comes from a frozen VGGT, to what extent do the gains stem from the backbone prior rather than the proposed self-supervised training? |
Moderately AI-edited |
|
ReCAP: Recursive Prompting for Self-Supervised Category-Level Articulated Pose Estimation from an Image |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents ReCAP, a self-supervised framework for category-level articulated object pose estimation from a single RGB image — a task that traditionally requires depth or multi-view supervision. ReCAP adapts a large geometry foundation model (VGGT) using a novel Recursive Residual Prompting mechanism, which refines prompts through iterative recursion and stabilizes them via residual injection, introducing less than 1% additional parameters. This enables parameter-efficient adaptation of rigid-object priors to articulated settings.
To address occlusion and symmetry ambiguities, the paper further introduces a Cross Semantic–Geometry Pyramid (X-SGP) module that hierarchically fuses semantic and geometric cues. Experiments on OP-Align, HOI4D, and PartNet-Mobility show that ReCAP achieves state-of-the-art performance among self-supervised methods and even surpasses some supervised RGB-D baselines.
* The proposed Recursive Residual Prompting (RRP) is a well-motivated and technically elegant solution to two core issues in applying prompt tuning to large geometry backbones such as VGGT: (1) limited capacity of shallow prompts to capture complex articulation patterns, and (2) instability caused by VGGT’s dynamic token reconfiguration.
* The proposed Cross Semantic–Geometry Pyramid (X-SGP) effectively fuses semantic and geometric cues via adaptive FiLM modulation and multi-scale refinement. Ablation studies show clear performance drops when removing pyramid layers or FiLM components, confirming their necessity in handling occlusion, symmetry, and fine-grained part alignment.
* The authors provide quantitative ablations for recursion depth, prompt placement (input/output), and parameter scaling, showing that the recursive approach achieves comparable or better performance than multi-layer stacking with only 0.8% additional parameters. This level of analysis supports the soundness and reproducibility of the proposed design.
* The work thoughtfully adapts prompt tuning and DEQ-style recursion to articulated pose estimation, which is a valuable and nontrivial contribution. Still, the methodological core builds on established ideas, with innovation mainly in integration and application rather than new theoretical development.
* Despite its parameter efficiency (<1% trainable), the recursive refinement introduces noticeable latency (≈13 FPS vs. 41 FPS in baselines). Discussion of this trade-off or adaptive recursion strategies would improve clarity on practical feasibility.
* The paper could better illustrate how recursive prompting reshapes geometric or semantic representations. For example, visualizing token attention or feature evolution across recursion steps would clarify how the mechanism contributes to improved articulation reasoning beyond empirical gains. |
Fully AI-generated |
|
ReCAP: Recursive Prompting for Self-Supervised Category-Level Articulated Pose Estimation from an Image |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces RECAP, a self-supervised method for single-image category-level articulated object pose estimation. To tackle the depth uncertainty problem, the authors exploit a geometry foundation model to learn the corresponding complete point cloud for the input object, with the proposed recursive prompt for adapting to articulated objects. An alignment method is then used to optimize the per-part 6D pose by aligning the reconstructed point cloud with the RGB image.
1. The authors solve the pose estimation problem using only an RGB image, which is promising.
2. The technical presentation is sound and convincing.
3. Sufficient experiments are provided.
1. To address the depth missing problem, the authors employ a geometric foundation model for point cloud learning. However, the comparison of point cloud reconstruction with methods that do not utilize such foundation models is arguably unfair. Although the RECAP method introduces external knowledge for the pose estimation task, it only achieves marginal improvements.
2. Several relevant works on category-level articulation pose estimation are not adequately cited or discussed, such as R2-Art (AAAI 2025), U-COPE (ECCV 2024), and "Toward real-world category-level articulation pose estimation" (TIP 2022). Additionally, the well-established render-and-compare methodology, widely used for single-image pose estimation, is also overlooked in the discussion.
3. The authors utilize the OP-Align dataset as a benchmark; however, its scale is relatively limited, encompassing only four categories. Given that existing datasets contain over 2,000 objects across numerous categories, the selection of merely four categories appears insufficient.
4. The evaluation does not include two widely recognized articulation datasets—ArtImage and ReArtMix. The authors are encouraged to provide an explanation for this omission.
Please refer to the weaknesses. |
Lightly AI-edited |
|
CausalAffect: Causal Discovery for Facial Affective Understanding |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work proposes CausalAffect - a framework for causal graph discovery for facial affect analysis. Given facial images, this framework recovers a causal graph describing dependencies between facial action units and facial expressions (i.e., AU -> AU and AU -> Expression). The authors introduce several specific mechanisms, including sample-adaptive learning and feature-level counterfactual interventions, designed to improve graph discovery. The authors validate the framework empirically on six benchmark datasets, illustrating that it improves the accuracy of AU and expression predictions and recovers causal graphs that are consistent with existing psychological theories.
In this work, the authors propose the first (to my knowledge) approach for causal discovery in facial affect data. One key benefit of the proposed approach is that it can learn in a semi-supervised manner, and does not require joint annotations of AUs and facial expressions. This enables the proposed approach to ingest data from disjoint datasets during learning.
The empirical validation of the work also has several strong dimensions: the authors compare against a broad set of relevant baselines, illustrate accuracy improvements provided by the approach, and report interesting interpretive analysis of the scientific insights recovered by the framework (Figure 2; Appendix C). The authors also provide an interesting comparative analysis of the Social Smile and the Duchenne Smile, and perform hyperparameter sensitivity and ablation studies. While minor, CausalAffect is also a great name for the framework.
## Clarity of Causal Framework and Formal Assumptions
One of my key concerns with this work is that it provides an incomplete motivation for why a causal perspective is important to this problem, and why conceptually a causal approach improves over the status-quo in AU/Expression recognition.
After reading the work several times, it remains unclear what cognitive or behavioral model this work instantiates. While the paper appears to posit AU -> Expression, it could also be the case that Expression -> AU, or that there is a third unmeasured variable that manifests in both AUs and emotion expressions. When investigating such theories, I would expect to see a formal Structural Equation Model or DAG with accompanying references supporting the model. The term "relations" used throughout the work maintains this ambiguity and does not clearly identify the proposed causal pathways under study.
- For a causal discovery problem, I would expect a formal statement of causal and statistical assumptions needed to support inference, matched to the DAG or SEM above.
- The soft DAG constraint enabling cycles in the AU → AU pathway violates the acyclicity requirement of causal DAGs. This makes me question the causal model under study. I think this could nicely be viewed as a time series problem where there is a fixed DAG with states that vary as a function of time t. This could obviate the cyclic issues while also supporting the dynamics the authors mention. Generally adopting a time-dependent formulation would be natural in this setting.
In sum, there is a key gap between the paper's causal framing and its methodology. While the paper claims to "infer psychologically plausible causal relations" and "enforce genuinely causal relations", the absence of a formal causal model, stated identification assumptions, and the violation of acyclicity principles means that the learned relationships cannot be interpreted causally. The method may learn useful structured dependencies, but these are correlational rather than causal. I recommend the authors either: (1) formalize their causal framework with appropriate assumptions and modify the architecture to respect causal principles, or (2) reposition the work as learning interpretable feature relationships via a semi-supervised approach, which would still be a quite valuable contribution.
## Presentation of Technical Framework
Figure 1 is very helpful for understanding the framework and the paper is well-written overall. However, in places throughout Section 3, I found myself lost in the details of the approach with limited rationale for why these details matter. In particular, subsections 3.1-3.5 currently read as a recipe of technical details rather than providing a unified rationale for why these components are needed. For example, it's unclear why Counterfactual Interventions are necessary given HSIC-Based Disentanglement, or why sample-adaptive graphs are necessary to support heterogeneous nodes. The empirical ablation study illustrates the empirical impact of these decisions but offers a limited conceptual basis for why each component is necessary. The authors could address this by easing the notation / moving some details to the appendix to make space for more rationale for the design decisions and sharing why these problems are technically challenging.
Related to the comment above, one place where I was especially missing the rationale is Section 3.3. Why would Social and Duchenne smiles require *different graph structures* rather than *different paths* through the same graph? If we conceptualize each AU as a random variable, the causal graph structure should remain invariant across realizations—only the values of these variables should change. A graph that changes structure conditional on observed data is at odds with standard causal modeling. Note that addressing the first point (Clarity of Causal Framework) could also resolve this concern.
## Empirical Validation
While the empirical validation has several strengths (noted above) there are also a few weaknesses. Foremost, the authors do not report statistical uncertainty in Table 1 and elsewhere in the results. This makes it challenging to assess whether the claims hold across settings.
Further, based on the current results, it is unclear what the foundational goals of the empirical evaluation should be. I was surprised to see prediction accuracy featured so prominently in a causal discovery work. For causal inference, the primary validation should be whether the method recovers the true causal structure, not whether it improves predictive performance (which can be achieved by learning spurious correlations). To validate the approach, the authors may want to design a synthetic or semi-synthetic experiment illustrating that the approach can recover the ground truth when the true causal structure is known. This would provide valuable evidence substantiating that the learned causal relations (Fig 2) are valid.
Finally, my conceptual confusion surrounding the causal framework extends to the experiments. How would it be possible to recover a causal graph with edges that vary conditional on a single image (Fig. 3)? This reinforces the concern that the learned relationships may be correlational rather than causal.
More positively, a compelling visualization could show graph activation over AUs as a function of time—e.g., showing frames from a video sequence. This would illustrate the temporal activation patterns underpinning the graph structure and would align naturally with the time-series formulation suggested earlier.
Conceptual:
- What is the proposed causal model underlying this approach? Specifically, does it posit that AU → Expression, Expression → AU, or both directions exist? Could you provide a formal DAG or SEM that represents your theoretical model? How does temporality connect to this formulation?
- What causal and statistical assumptions are necessary for your method to identify causal relationships rather than correlations? For instance, do you assume causal sufficiency (no unmeasured confounders between AUs and expressions)?
- Could you clarify the theoretical justification for having graph structures that vary across individual images?
Empirical:
- To what extent can the accuracy benefits reported in Table 1 be attributed to increased dataset size available to CausalAffect?
- How do the reported empirical results vary across training runs and random seeds? |
Fully human-written |
|
CausalAffect: Causal Discovery for Facial Affective Understanding |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces CausalAffect, a novel framework designed to learn causal relations among Action Units (AUs) and between expressions and AUs. The work features extensive experimentation and analysis, providing interesting insights into these relationships.
* The trained models and code will be released, which is a significant contribution to the research community.
* Extensive experiments, including ablation studies, are conducted on multiple datasets. The paper compares the method against competitive baselines and demonstrates promising performance.
* Overall, the paper is well-written and presented in good format.
* Interesting analysis and visualizations are presented in Section 4.2. These provide readers with a straightforward understanding of how the learned causal relations align with or differentiate from previous studies. (Although some results appear unusual, which is addressed in the Questions section.)
* Straightforward case studies are provided to illustrate the effectiveness of the proposed method.
* Figure 2 is cited in line 46 but is located on page 8, which is too far from the relevant text and disrupts the flow of reading.
* The paper lacks a Related Work section.
* While Table 1 shows promising performance, the comparison feels unfair because the best CausalAffect configuration utilizes additional training data sources. Without this extra data, the performance is very close to the best baseline, and a statistical significance test is missing to confirm the difference.
* Regarding the EmotioNet experiment in Table 1, could you please explain why joint training (with AU datasets) appears to detrimentally affect the expression recognition performance?
* In Figure 2, the GNN-Learned Correlation visualization looks weird to me. It seems to imply that every AU is highly correlated with every expression. Could you elaborate on this specific finding? Similarly, for the AU-AU co-occurrence in BP4D, the strong correlation between AU6 and AU14 warrants further explanation. |
Lightly AI-edited |
|
CausalAffect: Causal Discovery for Facial Affective Understanding |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces a framework, named CausalAffect, for causal graph discovery in facial affective analysis. It aims to overcome the limitations of existing methods that lack psychological plausibility, rely on joint annotations, and ignore causality direction and inhibitory effects. Its main contribution is a weakly-supervised framework that learns a two-level (global and sample-adaptive) causal hierarchy for both AU→AU and AU→Expression dependencies, capable of modeling both excitatory and inhibitory relations. A key innovation is a feature-level counterfactual intervention mechanism that enforces true causal effects by perturbing latent AU features, eliminating the need for image synthesis.
The primary strength lies in its novel formulation of facial affective analysis as a causal discovery problem, moving beyond mere correlation to seek psychologically plausible mechanisms. The proposed framework, CausalAffect, integrates a two-level (global and sample-adaptive) graph structure to capture both stable population-level rules and context-specific dynamics. Its ability to model both excitatory and inhibitory relations, combined with an efficient feature-level counterfactual intervention, ensures the discovered dependencies are genuinely causal. It also eliminates the need for scarce jointly-annotated datasets through its weakly-supervised design. The ablation studies are well-designed and comprehensive, confirming the necessity of each component.
- The system complexity would be a concern to me. CausalAffect contains four different modules, which results in a large number of loss functions. The corresponding hyperparameters (e.g., $\lambda_{ib}$, $\lambda_{DAG}$, $\lambda_{consist}$) also **increase the risk of training instability**, and these factors would **make it difficult for other researchers to reproduce the results**.
- The paper employs a significant number of mathematical symbols and formulas, which is commendable. However, to some extent, this comes at the cost of reading fluency and also compresses the available space for text, causing some of the results analysis to be relegated to the appendix.
**Question 1:** Validating psychological plausibility against existing literature like FACS is a clever approach. However, for the 'new' causal relations discovered by the model (for example, the subtle inhibitory ones), it seems that we do not have an objective 'gold standard' or ground truth. This makes it difficult to determine whether these are genuine psychological insights or merely statistical artifacts learned from the specific datasets. What are your thoughts on this?
**Question 2:** In Table 3, the result for GFT at idx 13 should be bolded, assuming that bolding is used to indicate the best result. |
Lightly AI-edited |
|
CausalAffect: Causal Discovery for Facial Affective Understanding |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper proposes a framework for facial Action Unit (AU) detection that is based on global feature extraction followed by a per-AU classification head and two graph neural networks that perform message passing to refine the detections. Through direct supervision, the graphs are learned following both data correlations and psychologically-based knowledge. The whole network, i.e., the backbone, heads, and adjacency matrices of the graphs, is learned in an end-to-end fashion. The method is tested on standard AU benchmarks, delivering competitive results.
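For readers less familiar with this design pattern, the sketch below illustrates the general recipe the summary describes: per-AU heads whose outputs are refined by message passing over a learnable AU-AU adjacency matrix. Module names, sizes, and the single message-passing round are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AUGraphRefiner(nn.Module):
    """Minimal sketch of the pattern summarized above: per-AU heads refined by
    message passing over a learnable AU-AU adjacency (illustrative only)."""
    def __init__(self, num_aus: int, feat_dim: int):
        super().__init__()
        self.au_heads = nn.ModuleList(nn.Linear(feat_dim, feat_dim) for _ in range(num_aus))
        self.adj = nn.Parameter(torch.zeros(num_aus, num_aus))  # learned jointly with the rest
        self.msg = nn.Linear(feat_dim, feat_dim)
        self.cls = nn.Linear(feat_dim, 1)

    def forward(self, global_feat: torch.Tensor) -> torch.Tensor:  # global_feat: (B, feat_dim)
        nodes = torch.stack([head(global_feat) for head in self.au_heads], dim=1)  # (B, N, D)
        relation = torch.softmax(self.adj, dim=-1)               # row-normalized AU-AU relations
        nodes = nodes + torch.matmul(relation, self.msg(nodes))  # one round of message passing
        return self.cls(nodes).squeeze(-1)                       # per-AU detection logits, (B, N)
```

The cited prior works differ mainly in how such AU graphs are built and supervised (data-driven correlations versus FACS-style prior knowledge), which is the basis of the novelty concern raised below.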
The idea of trying to find a proper AU relationship as well as a relationship between AUs and expressions is appealing, despite not being new, and the authors try to approach it in a data-driven way.
The results are compelling, and the relation between AUs and expressions across datasets is investigated, showing correlations similar to those reported in existing work.
The paper is poorly written, and poorly presented, with many broken sentences. The narrative is very loose and the figures and notation do not serve the understanding of the paper. The paper is full of clutter and the tables and figures have been minimized to fit in the paper to an unacceptable level.
The method is not novel and combines many pieces of existing work. The discovery of knowledge-based AU graphs is not new, it has been presented in many works; as an example there are the following approaches not cited in this paper:
Song et al. Dynamic Probabilistic Graph Convolution for Facial Action Unit Intensity Estimation. CVPR 2021
Wang et al. Spatial-Temporal Graph-Based AU Relationship Learning for Facial Action Unit Detection. CVPRW 2023
Fan et al. Facial Action Unit Intensity Estimation via Semantic Correspondence Learning with Dynamic Graph Convolution. AAAI 2020
What exactly is the contribution of the paper? The paper does not include any discussion with respect to prior work and how the proposed method advances existing research.
There is no analysis of complexity and training and inference time. This needs to be included.
The use of external data to justify the results and to compare against state-of-the-art methods is unfair. For a fair comparison, the competing methods should have been trained on the same data. It is not good practice to claim state-of-the-art results when the training includes a larger amount of external data than that used by the competitors. Even with the additional data, the results are surprisingly close to the state of the art, meaning that the method barely advances existing research.
In summary, the manuscript is rather poor and needs a lot of work for it to be considered. The reading is unpleasant and the contributions are not properly justified. The results are far-fetched thanks to the addition of external data, and there is no proper comparison in the methodology against prior work on graph neural networks for Action Unit detection/intensity estimation.
Please see my comments above. In particular, I would suggest the authors to properly illustrate in which ways their method is novel wrt prior work, and what are the main contributions they propose, in a succinct, to-the-point manner. |
Fully human-written |
|
Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces CK-PLUG, a plug-and-play method for controlling knowledge reliance in RAG systems when conflicts arise between LLMs' parametric knowledge and retrieved context. The approach uses a novel Confidence Gain metric based on entropy shifts to detect knowledge conflicts at the token level. CK-PLUG modulates token probability distributions through weighted fusion of parameter-aware and context-aware predictions, controlled by a single tuning parameter α. Experiments on four LLMs (LLAMA2/3, Mistral, Qwen) demonstrate wide-range controllability on counterfactual datasets while maintaining fluency. The method also offers an adaptive mode that automatically balances knowledge sources based on model confidence, achieving consistent improvements across six diverse RAG tasks without requiring parameter modifications or retraining.
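A rough sketch of the decoding-time mechanism this summary describes may help fix ideas. It is an illustration only: the entropy comparison and the simple linear mixture below stand in for CK-PLUG's actual Confidence Gain metric and fusion rule, whose exact forms are defined in the paper.

```python
import torch.nn.functional as F

def fused_next_token_distribution(logits_with_ctx, logits_no_ctx, alpha=0.5):
    """Sketch: flag a potential knowledge conflict via the entropy shift induced by
    the retrieved context, then mix the context-aware and parameter-only next-token
    distributions with a single knob alpha. Inputs are (vocab_size,) logits."""
    p_ctx = F.softmax(logits_with_ctx, dim=-1)   # context-aware prediction
    p_par = F.softmax(logits_no_ctx, dim=-1)     # parameter-only prediction

    def entropy(p):
        return -(p * p.clamp_min(1e-12).log()).sum(-1)

    if entropy(p_ctx) <= entropy(p_par):         # context increases confidence: keep standard RAG decoding
        return p_ctx
    fused = alpha * p_par + (1.0 - alpha) * p_ctx  # alpha -> 1: parametric; alpha -> 0: contextual
    return fused / fused.sum(-1, keepdim=True)
```

In the adaptive mode described above, α would itself be derived from model confidence rather than set by hand.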
- Novel entropy-based conflict detection that provides interpretable, theoretically-grounded identification of knowledge conflicts through Confidence Gain metric
- Flexible control via single parameter enabling smooth adjustment from full contextual to full parametric reliance with optional autonomous mode
- Practical plug-and-play design requiring no training or architecture changes while demonstrating effectiveness across multiple models and diverse RAG tasks
1. **Insufficient Baseline Comparisons**
The paper lacks comparisons with existing adaptive RAG methods that also address knowledge conflicts or context utilization. Notable missing baselines include:
- Adaptive retrieval methods: FLARE, Self-RAG, DRAGIN, SeaKR
- Context-aware generation: RQ-RAG, QC-RAG, CtrlA
Without these comparisons, it is difficult to assess whether the performance gains are due to CK-PLUG's novel approach or simply from any form of adaptive control. The authors should include at least a subset of these methods to demonstrate the unique advantages of their entropy-based approach.
2. **Missing Critical Related Work**
The paper overlooks several highly relevant previous or concurrent works that employ similar entropy-based or conflict-detection approaches for RAG:
- Entropy-Based Decoding for Retrieval-Augmented Large Language Models (arXiv:2406.17519, June 2024) - uses entropy for RAG decoding
- Discerning and Resolving Knowledge Conflicts through Adaptive Decoding with Contextual Information-Entropy Constraint (arXiv:2402.11893, Feb 2024) - directly addresses knowledge conflicts via entropy
- SEReDeEP: Hallucination Detection in Retrieval-Augmented Models via Semantic Entropy and Context-Parameter Fusion (arXiv:2505.07528, May 2025) - combines semantic entropy with context-parameter fusion
- FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation (arXiv:2506.08938, Jun 2025) - models fact-level conflicts
3. **Limited Applicability to Modern Agentic RAG Systems**.
Current RAG systems are evolving toward agentic architectures involving multi-step planning, iterative search, self-reflection, and answer verification (e.g., Search-o1, Search-R1, Reason-RAG, Web-walker, Web-sailor, etc). CK-PLUG operates at the token-level decoding stage, and it remains unclear whether:
- The method can be integrated into multi-turn agentic workflows
- Conflict detection works when contexts are iteratively refined
- The approach scales to complex reasoning chains
The authors should discuss or demonstrate CK-PLUG's compatibility with agentic RAG frameworks to ensure practical relevance.
4. **Insufficient Analysis of Computational Overhead**
While claimed to be "lightweight," the paper provides no quantitative analysis of:
- Latency increases during inference (requires two forward passes for parameter-aware and context-aware distributions)
- Memory overhead from maintaining multiple probability distributions
- Scalability with increasing context length
Q1: Clarification on Notation (Line 143-144). There is a typographical error with double periods: "distributions.." Please correct.
Q2: Ambiguous "Baseline" Definition in Table 1. The "Baseline" row in Table 1 is unclear. Does it refer to:
- (a) Vanilla LLM without RAG (direct question answering), or
- (b) Standard RAG with both query and retrieved context, but without CK-PLUG? |
Fully AI-generated |
|
Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The approach detects tokens susceptible to conflicts between the model's parametric knowledge and the retrieved context by measuring per-token entropy, then interpolates between the context-dependent and context-independent probability distributions. The degree of interpolation is governed by a single hyperparameter, which can be set manually or determined automatically using a heuristic based on the entropy ratio of the two variants.
- The paper is clearly written and easy to understand.
- The authors introduce a conceptually straightforward and well-motivated approach to regulate the model’s dependence on retrieved context.
- The proposed method is empirically solid and thoroughly evaluated, demonstrating its effectiveness in balancing contextual and parametric knowledge and enhancing question-answering accuracy.
### Methodological Evaluation
From a methodological standpoint, the proposed approach offers **limited novelty**, as it also relies on **distribution interpolation** between context-dependent and context-independent probabilities, similar to [1].
The main differences are:
- **Selective interpolation:** In this paper, interpolation is applied **only to tokens whose entropy increases after adding context**, assuming these tokens indicate parameter–context conflict. In contrast, [1] applies interpolation **to all tokens**.
- **Different interpolation formula:**
This paper uses
$$
\alpha \log p(y \mid x) + (1 - \alpha) \log \frac{p(y \mid c, x)}{p(y \mid x)} = (1 - \alpha) \log p(y \mid c, x) - (1 - 2\alpha) \log p(y \mid x)
$$
whereas [1] uses
$$
(1 + \alpha) \operatorname{logit}(y \mid c, x) - \alpha \operatorname{logit}(y \mid x)
$$
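For concreteness, a minimal numerical sketch contrasting the two fusion rules; the toy distributions, the value of $\alpha$, and the softmax renormalization are my own illustrative assumptions, not taken from either paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy next-token distributions over a 4-token vocabulary (illustrative only).
p_param = np.array([0.70, 0.15, 0.10, 0.05])  # p(y | x), parametric
p_ctx   = np.array([0.10, 0.75, 0.10, 0.05])  # p(y | c, x), context-augmented

alpha = 0.3  # interpolation weight (hypothetical value)

# This paper's rule (as stated above): alpha*log p(y|x) + (1-alpha)*log[p(y|c,x)/p(y|x)]
# = (1-alpha)*log p(y|c,x) - (1-2*alpha)*log p(y|x); renormalized here via softmax.
score_ckplug = (1 - alpha) * np.log(p_ctx) - (1 - 2 * alpha) * np.log(p_param)

# CAD-style rule from [1]: (1+alpha)*logit(y|c,x) - alpha*logit(y|x),
# using log-probabilities as stand-ins for logits in this toy example.
score_cad = (1 + alpha) * np.log(p_ctx) - alpha * np.log(p_param)

print("CK-PLUG-style fused distribution:", softmax(score_ckplug).round(3))
print("CAD-style fused distribution:    ", softmax(score_cad).round(3))
```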
However, the **motivation for this specific interpolation formula** is largely **intuitive**, and the **procedure for identifying conflict-inducing tokens** is not rigorously justified.
The **improvements in accuracy** over the standard RAG baseline are **modest**—and sometimes even **negative** (e.g., on **FEVER**, performance drops from 89.5 % to 89.2 % for Mistral, see Table 2)—which is disappointing given the method requires **approximately double the compute**.
Overall, **more analysis and empirical/theoretical justification** are needed to demonstrate that the proposed method is truly worth its computational overhead and that it **outperforms [1]** in a meaningful way.
**Reference:**
[1] Shi, Weijia, et al. *Trusting Your Evidence: Hallucinate Less with Context-Aware Decoding.* *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers).* 2024.
### General Questions
- What are the exact formulas used to define **ConR** and **ParR**? They are mentioned in line 309, but no details are provided on how they are computed.
- It would be helpful to analyze how much **context vs. parametric reliance** affects performance to justify why adjusting this balance is important. I’m particularly interested in the **question-answering accuracy** corresponding to each ratio in Table 1.
- Could you please specify the **number of forward passes** (or total compute) used by each method in Table 2 to ensure a fair comparison?
---
### Suggested Experiments
#### Justify the Interpolation Formula
I recommend comparing the current method directly with [1], including:
- **No ConD + interpolation from [1]**
- **CK-Plug** results from Table 1 and Table 2
Additionally, please include an **ablation on the interpolation formula** in Table 3. At present, it includes (ConD + interpolation from CK-Plug) and (no ConD + interpolation from CK-Plug); it would be informative to also test (ConD + interpolation from [1]) and (no ConD + interpolation from [1]).
These comparisons would clarify the necessity of introducing **ConD** and justify your **specific interpolation design**. If the interpolation from [1] performs robustly without ConD, then the added component may not be needed.
---
#### Explore More Challenging Context–Parameter Conflict Scenarios
It would strengthen the paper to test **CK-Plug** in settings with **stronger context–parameter conflicts**, such as those difficult even for large models like ChatGPT.
You could evaluate performance in scenarios like §4.4.2 (where the parametric answer is inserted as a substring into the context) or use **Table 6 in [3]** as reference.
This would help determine whether CK-Plug can effectively guide the model to prefer the **contextual** rather than **parametric** answer under such conditions.
---
#### Justify the Use of Entropy for Conflict Detection
I suggest performing **ablations on different uncertainty measures** for identifying conflict-prone tokens, beyond entropy.
For example, try using **maximum token probability** or more recent uncertainty estimation techniques such as [2].
This would validate whether entropy is indeed the most suitable choice for detecting parameter–context conflicts.
---
**References**
[1] Shi, Weijia, et al. *Trusting Your Evidence: Hallucinate Less with Context-Aware Decoding.*
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), 2024.
[2] Ma, Huan, et al. *Estimating LLM Uncertainty with Logits.* arXiv preprint arXiv:2502.00290 (2025).
[3] Kortukov, Evgenii, et al. *Studying Large Language Model Behaviors Under Context–Memory Conflicts With Real Documents.*
First Conference on Language Modeling. |
Moderately AI-edited |
|
Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces CK-PLUG, a plug-and-play method that enables large language models to dynamically balance reliance on internal (parametric) knowledge and external (retrieved) context during retrieval-augmented generation. Using a novel Confidence Gain metric that detects knowledge conflicts via entropy shifts in token probabilities, CK-PLUG selectively adjusts token-level predictions with a single tuning parameter $\alpha$ (or adaptive enhancement) to favor either parameters or context. Experiments demonstrate the effectiveness of the proposed method.
- The paper was well-written and had very nice figures.
- The proposed method is lightweight and effective.
- My biggest concern with this paper is novelty. The use of entropy for identifying key tokens has been explored in recent works [1–2], yet these closely related studies are not cited—especially [1], which shares a similar methodology for token-level entropy analysis. Even if applied in a different context, omitting these references significantly weakens the originality of the contribution.
- The proposed CK-PLUG method may not generalize across all scenarios. For example, if the model is confidently wrong and the retrieved context reinforces the incorrect belief, the system may still fail. The authors should clarify the underlying assumptions and delineate conditions where CK-PLUG is reliable to enhance its scientific soundness.
- The paper lacks comparisons with prior decoding-based [3–7] and intervention-based [8–9] approaches that similarly aim to regulate factuality and knowledge conflicts. Including such baselines would better demonstrate the advantages and distinct contributions of CK-PLUG.
[1] What is Wrong with Perplexity for Long-context Language Modeling? ICLR'25
[2] Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models. ACL'25
[3] Trusting Your Evidence: Hallucinate Less with Context-aware Decoding. NAACL'24
[4] Sled: Self logits evolution decoding for improving factuality in large language models. NeurIPS'24
[5] Dola: Decoding by contrasting layers improves factuality in large language models. ICLR'24
[6] Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation. EMNLP'25
[7] AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge. NAACL'25
[8] Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models. ACL'24
[9] Taming Knowledge Conflict in Language Models. ICML'25
Aforementioned in the weakness section. |
Lightly AI-edited |
|
Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
Retrieval-Augmented Generation (RAG) reduces hallucinations in Large Language Models (LLMs) by incorporating external knowledge, yet it faces challenges from conflicts between the models’ parametric knowledge (internal) and retrieved context (external)—especially when the retrieved information is unreliable or the internal knowledge is outdated, leaving LLMs unable to decide which type of knowledge to prioritize. To solve this, the authors propose CK-PLUG, a plug-and-play method designed to control LLMs’ reliance on parametric and contextual knowledge. CK-PLUG introduces a new knowledge consistency metric called Confidence Gain, which detects knowledge conflicts by measuring entropy shifts in token probability distributions after context insertion; it then enables fine-grained control over knowledge preference by adjusting the probability distribution of tokens with negative Confidence Gain via a single tuning parameter, and also supports adaptive control based on the model’s confidence in both knowledge types.
1. The biggest advantage of this paper is that it proposes a "plug-and-play" inference-time method.
2. Conflict Detector: It introduces a metric called "Confidence Gain (CG)", which identifies conflicts by comparing the entropy change of token distribution between RAG input (context + query) and regular input (query only). A conflict is determined when there is an entropy increase (i.e., the model becomes more confused), and this definition is reasonably sound.
3. Knowledge Controller: This method isolates the logits purely contributed by the "context" through log subtraction, and then uses a single parameter to perform weighted fusion of parametric knowledge and contextual knowledge (Eq. 8). This is an extremely concise and theoretically grounded approach to logits manipulation.
4. Adaptive Model Construction: The paper also proposes an adaptive mode with automatic parameter adjustment (Eq. 10), whose logic is equally intuitive — the model automatically trusts the knowledge source with lower entropy (i.e., higher confidence).
1. This is the most serious and obvious flaw of the paper. To calculate [relevant parameters] and [relevant parameters], CK-PLUG must execute two complete forward passes in parallel at each decoding step: one for [input with context + query + generated tokens] and the other for [input with query only + generated tokens]. This almost doubles the inference latency and computational cost.
2. The core assumption of the paper is that "conflicts lead to entropy increase". However, if the erroneous context itself is highly "credible" and "fluent" (e.g., "The capital of France is Lyon"), it may well reduce the model's perplexity, resulting in an "entropy decrease" instead.
3. The calculation of (Eq. 6) may be numerically unstable. If [parametric distribution] assigns a near-zero probability to a certain token (with [log value] approaching negative infinity) while [context-enhanced distribution] assigns a high probability to it, [resulting value] may "explode". The paper does not discuss how numerical stability is handled; the sketch below illustrates the concern.
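A minimal sketch of this concern and one possible mitigation (probability flooring); the distributions, the floor value, and the clamping strategy are hypothetical and not taken from the paper:

```python
import numpy as np

eps = 1e-6  # hypothetical probability floor for numerical stability

p_param = np.array([0.9899, 1e-8, 0.01, 1e-4])   # parametric distribution, near-zero entry
p_ctx   = np.array([0.02,   0.93, 0.04, 0.01])   # context-enhanced distribution

# Naive "context-only" score via log subtraction: can blow up
# when p_param -> 0 while p_ctx stays large.
naive = np.log(p_ctx) - np.log(p_param)

# One possible mitigation: floor the parametric probabilities before subtracting.
clamped = np.log(p_ctx) - np.log(np.clip(p_param, eps, None))

print("naive context score:  ", naive.round(2))    # second entry blows up to ~18.3
print("clamped context score:", clamped.round(2))  # now bounded by -log(eps) ~ 13.8
```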
1. Supplement implementation details regarding the calculation of [relevant parameter], and explain whether and how potential numerical instability issues have been addressed.
2. Conduct more rigorous stress tests on the "Confidence Gain (CG)" assumption—specifically construct erroneous contexts that are "highly credible and highly fluent", and illustrate the changes in [relevant indicator] under such circumstances as well as CK-PLUG’s performance metrics. |
Lightly AI-edited |
|
LLM Unlearning with LLM Beliefs |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper identifies a critical failure mode in previous gradient-based LLM unlearning methods (like Gradient Ascent and NPO), terming it the "squeezing effect". The authors demonstrate that these methods, while successfully reducing the probability of the exact target response, could redistribute this probability mass onto semantically related rephrasings. This leads to "spurious unlearning", where the sensitive knowledge persists, a failure often masked by standard automated metrics like ROUGE. To address this, the paper proposes a bootstrapping (BS) framework that explicitly targets the model's own high-confidence generations (its "model beliefs") as additional unlearning signals. In practice, the method utilizes token-level BS and sequence-level BS.
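As a toy illustration of the renormalization intuition behind the squeezing effect (a crude logit-space caricature rather than the paper's actual parameter-space updates; all numbers are made up): pushing down only the target token's score redistributes its probability mass, and under the softmax most of it flows to the next-most-likely, often semantically similar, alternatives.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy vocabulary: target answer, two close paraphrases, one unrelated token.
tokens = ["target", "paraphrase_1", "paraphrase_2", "unrelated"]
logits = np.array([5.0, 3.5, 3.0, 0.0])

p_before = softmax(logits)

# Crude gradient-ascent-style step: only the target logit is pushed down.
logits_after = logits.copy()
logits_after[0] -= 4.0  # hypothetical step size

p_after = softmax(logits_after)

for tok, b, a in zip(tokens, p_before, p_after):
    print(f"{tok:13s} before={b:.3f} after={a:.3f}")
# The mass removed from "target" lands mostly on the paraphrases, i.e. it is
# "squeezed" into the high-likelihood, semantically similar region.
```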
1. The identification and mechanistic analysis of the "squeezing effect" is a novel and significant contribution. It provides a clear diagnosis for a subtle but critical flaw in widely-used unlearning methods. This finding is highly significant for the field, as it suggests many existing methods may offer a false sense of security regarding privacy and safety.
2. The core claim of the "squeezing effect" is not just asserted but convincingly demonstrated through empirical analysis of probability dynamics (Fig. 2), tracking how probability mass shifts from the target to high-likelihood alternatives. The paper also provides a theoretical analysis for BS-T (Thm. 4.2) within the learning dynamics framework, explaining why suppressing the model's beliefs helps reshape the gradient to mitigate squeezing.
3. The empirical evaluation is comprehensive, covering three diverse benchmarks (TOFU, MUSE, WMDP), multiple model families (LLaMA-2, LLaMA-3, Zephyr), and various model scales (1B, 3B, 7B, 8B), demonstrating the robustness of the findings.
1. One small weakness is the practical cost of BS-S. Algorithm 2 implies sampling $N$ high-confidence sequences for every sample in a batch during training. This requires $N$ inference passes for each training step, which seems computationally prohibitive and scales poorly. Figure 6 shows BS-S is ~2x slower than NPO, and it might be even worse as $N$ grows. The paper also notes OOM issues when setting $N=5$. It would be helpful to add an ablation on the frequency of belief sampling (e.g., once per epoch vs. once per batch).
2. As noted in the ablations (Fig. 5) and limitations (Sec. G), the methods are sensitive to the bootstrapping coefficients ($\lambda_{BST}$, $\lambda_{BSS}$). Performance appears to drop off significantly if these are not tuned correctly. This could be a major barrier to adoption, as it may require expensive, model- and dataset-specific tuning. The paper would be stronger if it provided more intuition or heuristics for setting these crucial parameters.
3. The theoretical analysis is a key strength for BS-T, but it is missing for BS-S. As BS-S is the more complex and often better-performing method, it would be helpful to explain, from a theoretical standpoint, why sampling $N$ full sequences is superior to the more efficient token-level suppression of BS-T.
1. In BS-S, what is the nature of the $N$ sampled sequences? Are they $N$ semantically distinct paraphrases, or are they minor lexical variations of the same core "belief"? If the diversity is low, would a smaller $N$ (e.g., $N=1$ or $N=2$) be enough, thereby mitigating the cost?
2. The main experiments (e.g., Table 1) appear to combine the BS methods with retention regularization (i.e., using $\mathcal{D}_r$). How much of the utility preservation is attributable to the BS method itself versus this external regularization? What does the performance of "pure" BS-T/BS-S (without $\mathcal{D}_r$) look like compared to "pure" NPO? This would help isolate the true impact of your method on the forget-retain balance.
3. Lines 302-303 state that GA will increase mass on $H_k^{(i)}$, but according to Figure 2(b), it is not obvious that GA shifts the probability mass to high-likelihood regions. Could the authors clarify this point?
Moderately AI-edited |
|
LLM Unlearning with LLM Beliefs |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper clearly proposes and defines the "Squeezing Effect" for the first time: existing gradient-ascent forgetting methods only reduce the probability of the target response, causing the probability mass to be "squeezed" into semantically similar high-confidence areas and thereby producing spurious unlearning. To address this, the proposed Bootstrapping framework uses the model's own beliefs to guide the forgetting process. Extensive experiments on multiple benchmarks confirm the effectiveness of this approach.
- The paper is easy to follow.
- The content of the paper is substantial, with both a summary of existing work and sufficient theoretical evidence.
- The paper acknowledges in Appendix G that this method is very sensitive to the settings of hyperparameters such as the bootstrapping coefficient, and often requires extensive tuning for specific datasets and models. This seriously affects the method's application in practical scenarios.
- Lack of comparison of computational overhead between various baseline methods.
- The Bootstrapping framework relies on the high-confidence results generated by the model itself to guide forgetting. However, the model's confidence does not necessarily identify the content that should be forgotten: the model may be highly confident yet wrong, or have low confidence yet be correct.
- There is a lack of experiments on more diverse model structures to prove the effectiveness of the proposed method.
Since I am completely unfamiliar with this area, I will adjust my score based on the suggestions of other reviewers. |
Lightly AI-edited |
|
LLM Unlearning with LLM Beliefs |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a bootstrapping framework for LLM unlearning that tackles the “squeezing effect,” where probability mass shifts to semantically similar outputs instead of true forgetting. Two variants are introduced: BS-T, which suppresses high-likelihood tokens, and BS-S, which augments the forget set with high-confidence generations. The authors provide theoretical analysis and experiments across TOFU, MUSE, and WMDP, showing improved balance between forgetting and retention.
1. The paper is very readable, with a logical flow from motivation → analysis → method → theory → experiments. Figures and appendices are well-organized, and pseudocode makes the algorithms easy to reproduce.
2. The authors make a thoughtful observation about the squeezing effect and systematically demonstrate its existence through both qualitative and quantitative analysis. The proposed bootstrapping strategy is a creative extension of this insight, and the experiments convincingly show that BS-T and BS-S outperform existing unlearning baselines under various settings.
3. The work is conceptually motivated by an intuitive yet underexplored idea—connecting unlearning failures with the model’s own belief distribution. This is a fresh perspective on unlearning that moves beyond purely loss-based formulations, and the motivation is clearly justified both intuitively and empirically.
1. While the paper focuses on redistributing likelihood as the core cause of spurious unlearning, the explanation still feels surface-level from a semantic standpoint. The essence of the problem may not lie solely in likelihood shifts, but rather in the fact that current unlearning methods attempt to correct predictions without accounting for semantic relatedness. Unlearning should arguably target semantic classes of knowledge, rather than isolated outputs or sequences. A more principled formulation in semantic embedding space (e.g., clustering or alignment-based unlearning) might provide a deeper understanding of what “forgetting” really entails.
2. Although BS-T and BS-S generally achieve the best average scores, in several tasks their performance margins over strong baselines like NPO or RMU are modest. The results could be strengthened with additional analysis.
1. In BS-S, high-confidence generations are added to the forget set, but such sequences may still contain unrelated or benign information.
How does the method ensure that these “bootstrapped” samples do not lead to accidental forgetting of non-target knowledge?
Is there any filtering mechanism beyond temperature-controlled decoding?
2. The theoretical part (Section 4.2) discusses the dynamics of probability redistribution under BS-T versus GA, but it would be very valuable to show empirical probability dynamics for BS-S as well—similar to Figures 2(b)–(c) for GA and NPO.
This would help demonstrate whether BS-S effectively flattens or redistributes probability mass in the way the theory predicts. |
Fully AI-generated |
|
LLM Unlearning with LLM Beliefs |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper investigates the "spurious unlearning" problem that occurs when LLMs perform unlearning tasks. The authors point out that while existing methods (such as gradient ascent and NPO) can reduce the probability of a target response, they also redistribute the probability mass toward semantically similar regions, resulting in a "squeezing effect."
To address this issue, the authors propose a Bootstrapping (BS) framework that incorporates the model's own high-confidence generation (model belief) into the forgetting objective.
BS-T suppresses high-probability tokens at the word level;
BS-S suppresses high-confidence generation of entire segments at the sequence level.
The authors validate their approach on benchmarks such as TOFU, MUSE, and WMDP, and provide a theoretical analysis of BS-T to explain its mechanism for mitigating the squeezing effect.
- Interesting and Important Discovery: The paper reveals the "squeezing effect" and the resulting "spurious unlearning", suggesting that current unlearning methods only achieve superficial unlearning and further analyzing the reasons.
- Simple Yet Effective Design: The Bootstrapping framework directly targets areas of high probability of false forgetting by jointly suppressing the target response and the model's own high-confidence output. It requires no additional models or external data and is logically self-consistent.
- Comprehensive Experimental Results: Systematic experiments on multiple datasets and models, compared with strong baselines such as NPO, WGA, and RMU, show consistent improvement. Qualitative examples and gradient dynamics analysis are also provided to further strengthen the demonstration.
- BS-S has high computational overhead: Sequence-level bootstrapping requires generating multiple belief sequences for each sample, significantly increasing computational costs. The paper does not provide clear time or resource costs, nor does it discuss scalability for large-scale applications.
- BS-S lacks theoretical support: The authors provide a gradient analysis for BS-T, but BS-S is validated solely by experimental results and lacks formal explanations or convergence guarantees.
- High-confidence suppression strategies carry risks: High confidence does not necessarily indicate content that should be forgotten. The top-k outputs of the model may contain semantically related but harmless tokens; ranking by probability alone may lead to excessive forgetting.
This paper identifies and addresses flaws in existing LLM unlearning methods, using a sound approach and robust results.
To further enhance the paper's contributions, I have the following comments:
1. Quantitatively demonstrate the effectiveness of mitigating the "squeezing effect," such as the change in semantic similarity between generated samples before and after forgetting;
2. Report the computational overhead of BS-T and BS-S, and discuss their applicability in large-scale scenarios.
3. Supplement sensitivity and ablation analysis of the hyperparameters λ, k, and N;
4. Clarify the model belief sampling strategy and evaluate the impact of different parameter settings on the results;
5. Explore dynamic k or entropy-based adaptive strategies to mitigate the over- or under-forgetting issues associated with a fixed k. |
Moderately AI-edited |
|
Layer-wise Sensitivity-aware Sparsity Allocation for Efficient LLM Inference |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces ASAF (Adaptive Sparsity Allocation Framework), a method for making LLM inference more efficient by combining rotation-based quantization and layer-wise adaptive sparsity. Unlike prior work that applies uniform compression, ASAF dynamically assigns different sparsity levels to layers based on their sensitivity. The approach uses a two-phase dynamic programming optimization:
1. Coarse-grained phase: Decides how to group layers and narrows sparsity ranges.
2. Fine-grained phase: Determines exact sparsity rates and layer assignments.
Tested on Llama-2 models (7B–70B), ASAF achieves up to 3.6× faster inference and 12.6% lower memory use, with <1% accuracy drop compared to baselines like QuaRot.
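To make the kind of search concrete, below is a minimal knapsack-style dynamic program that allocates one sparsity level per layer group under an accuracy-degradation budget; all numbers, group boundaries, and cost tables are hypothetical and only sketch the flavor of the optimization, not ASAF's actual two-phase formulation:

```python
# Toy layer-group sparsity allocation via dynamic programming.
# Each group takes one sparsity level; each (group, level) pair has a
# precomputed FLOP saving and an estimated accuracy degradation (all made up).
levels = [0.0, 0.25, 0.5]              # candidate sparsity rates
flop_saving = [
    [0.0, 0.10, 0.18],                 # group 0
    [0.0, 0.08, 0.20],                 # group 1
    [0.0, 0.12, 0.25],                 # group 2
]
acc_drop = [
    [0.0, 0.2, 0.6],                   # group 0 (in % points)
    [0.0, 0.1, 0.5],                   # group 1
    [0.0, 0.3, 0.9],                   # group 2
]
budget = 1.0                           # total allowed accuracy drop (%)
step = 0.1                             # discretization of the budget axis
B = int(round(budget / step))

NEG = float("-inf")
# dp[b] = best total FLOP saving using at most b*step of the accuracy budget.
dp = [0.0] + [NEG] * B
choice = [[None] * (B + 1) for _ in flop_saving]

for g in range(len(flop_saving)):
    new_dp = [NEG] * (B + 1)
    for b in range(B + 1):
        if dp[b] == NEG:
            continue
        for li in range(len(levels)):
            cost = int(round(acc_drop[g][li] / step))
            nb = b + cost
            if nb <= B and dp[b] + flop_saving[g][li] > new_dp[nb]:
                new_dp[nb] = dp[b] + flop_saving[g][li]
                choice[g][nb] = (li, b)
    dp = new_dp

best_b = max(range(B + 1), key=lambda b: dp[b])
print("best FLOP saving:", round(dp[best_b], 3))

# Backtrack the chosen sparsity level per group.
assignment, b = [], best_b
for g in reversed(range(len(flop_saving))):
    li, b = choice[g][b]
    assignment.append((g, levels[li]))
print("sparsity per group:", list(reversed(assignment)))
```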
- Framing sparsity allocation as a layer-grouped, constrained optimization problem with dynamic programming is elegant.
- The mathematical formulation is clean, and the dynamic programming approach (Algorithms 1 & 2) is well explained. The inclusion of tabulation to precompute FLOP and accuracy costs is a smart engineering choice that enhances reproducibility.
- The experiments emphasize prefill acceleration but offer less analysis of end-to-end latency or real-world throughput improvements.
- The proposed method involves precomputation (tabulation tables for FLOPs and accuracy degradation). This could limit practicality for very large models or rapid iteration cycles.
- It is not entirely clear how scalable the DP-based search is as model depth increases beyond 70B-scale architectures.
- The paper doesn’t deeply probe why certain layers are more sensitive or how the learned sparsity patterns correlate with model internals (e.g., attention vs MLP layers).
- The paper mentions dynamic programming and tabulation to efficiently explore the search space, but how does computational complexity scale with model depth (e.g., 70B to 180B parameters)?
- The optimization constraint depends on an estimated accuracy degradation function. How is this function obtained in practice: via heuristics, proxy metrics, or direct evaluation?
- Do the learned sparsity rates correlate with identifiable layer characteristics (e.g., attention layers being less prunable than MLP layers)?
- The experiments focus mainly on the Llama-2 family. How well does ASAF generalize to architectures with different scaling patterns (e.g., Mistral, Falcon, or OPT)?
- Reported “prefill acceleration” results are strong, but what is the effect on end-to-end latency or tokens per second under realistic batch sizes and generation lengths?
- How sensitive is ASAF to delta (the allowed accuracy degradation) and the sparsity range? |
Fully AI-generated |
|
Layer-wise Sensitivity-aware Sparsity Allocation for Efficient LLM Inference |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes ASAF, an Adaptive Sparsity Allocation Framework for efficient LLM inference. It addresses the limitation of uniform compression by recognizing that different transformer layers have varying sensitivity to sparsification. ASAF combines quantization with sparsification via a two-phase, dynamic programming-based optimization to allocate sparsity adaptively across layer groups. This approach minimizes computational FLOPs while keeping accuracy degradation under 1%, achieving up to 3.63× prefill acceleration and 12.63% memory reduction on Llama-2 models.
- Proposed a joint optimization framework for quantization and sparsification, identified the problem, formulated it mathematically, and provided a solution.
- Adopted an optimization perspective with a two-phase approach, and proposed a method to tackle the combinatorial explosion challenge.
- Conducted relatively comprehensive experiments.
1. The basis for grouping is unclear, and the assumption that consecutive layers can share a sparsity rate is not justified. Grouping merely addresses the combinatorial explosion from a computational standpoint, but the motivation behind this grouping needs further elaboration.
2. Compression ratio and prefill speed were only tested on the Llama-2 series; no such results are provided for other models, and Llama-2 is already outdated. For Llama-3, only accuracy experiments were conducted. However, differences in model architecture and training methods may affect the compression efficacy, and measuring only accuracy cannot fully demonstrate the method's effectiveness.
3. Prefill speed was only measured at an input length of 2K; what about other lengths such as 512, 4K, or longer?
4. The baseline method was proposed over 1 to 2 years ago. Are there comparisons with more recent methods from the past year?
1. What assumption is the sharing of sparsity among consecutive layers based on, and is there any experimental validation for it?
2. To verify the method's generalizability, I suggest adding compression ratio results on models of different sizes from another series.
3. Prefill speed was only measured at an input length of 2K; what about other lengths such as 512, 4K, or longer? |
Lightly AI-edited |
|
Layer-wise Sensitivity-aware Sparsity Allocation for Efficient LLM Inference |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This submission introduces the Adaptive Sparsity Allocation Framework (ASAF), an approach for efficient acceleration of LLM inference that combines rotation-based quantisation with layer-wise adaptive sparsity. To make exploration of the combined search space computationally feasible, the selection of the deployed configuration is formulated as a two-stage dynamic programming optimisation: the first stage determines high-level structural configuration parameters of both approximations (the optimal number of layer groups and the sparsity search intervals), followed by fine-grained optimisation of the exact allocation of consecutive layers and the sparsity rate of each group.
- This work studies a timely and interesting problem, by combining approximations that are often studied in isolation in efficiency works for LLM inference.
- The proposed approach demonstrates considerable speed-ups over meaningful baselines (in the examined prefill stage), with controlled impact on accuracy.
- The main drawback of the proposed approach is the lack of consideration of the whole LLM inference process (prefill + decoding). Although it is acceptable for an approach to focus its optimisation efforts solely on one of the two phases, the impact of the proposed solution on total inference time (and a discussion of the applicability or impact of the proposed method to the remaining decoding phase) is required to fairly evaluate the contribution and effectiveness of the proposed method.
- Additionally, it is unclear how the proposed hierarchical search formulation would compare to more naive heuristic exploration baselines on the combined optimisation space (constrained to similar search time).
Please consider replying to the concerns raised above.
Fully human-written |
|
In-Context Clustering with Large Language Models |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This manuscript proposes In-Context Clustering (ICC), an LLM-based method that performs clustering without predefined similarity functions. ICC uses the attention mechanism of pretrained LLMs to capture context-dependent relationships among inputs across modalities. The authors show that LLMs or multimodal LLMs exhibit zero-shot clustering ability on numerical and visual data. With additional LoRA fine-tuning under a Next Token Prediction loss, the experiments show that ICC achieves improved performance on both numeric and image datasets. Moreover, ICC supports text-conditioned image clustering, which allows prompt-based control over the clustering process.
Major Strengths:
- The writing and organization of this manuscript are clear and easy to follow (though it would be better to add the necessary details to make the main paper self-contained; jumping from the main paper to figures in the appendix is not very enjoyable).
- The experiment that visualizes the attention allocation between input data and generated cluster labels at an intermediate layer is very interesting and innovative. It provides good support for understanding the clustering mechanism behind LLMs when in-context clustering is used.
- The presentation and visualizations of this manuscript are clear and visually enjoyable.
Major Weakness:
- **My primary concern about this manuscript lies in the validity of its central claim** that "in-context clustering with large language models (LLMs) performs as well as or better than" traditional clustering methods such as K-means, spectral clustering, DBSCAN, or other related methods. The rationale is as follows: clustering algorithms are designed to handle and explore unlabeled, unseen, and novel data—such as new concepts, observations, or protein structures—across diverse modalities. In contrast, the proposed "in-context clustering with LLMs" method fundamentally depends on pre-trained LLMs or multimodal LLMs and, consequently, on the massive datasets, implicit clustering criteria, and, of course, clustering centroids that these models have already encountered during training. Therefore, it is unclear how well the proposed method would perform when the data is genuinely novel, unseen, or out-of-distribution. This data coverage limitation also raises concerns about the validity of the claimed "zero-shot" setting in the experiments. In contrast, traditional clustering methods such as k-means can be easily adapted to truly novel and unseen data.
- **The reviewer found the following statement to be an overclaim:** At Line 73, the authors state: “We believe that this ability to change the way clustering is done based on different prompts makes ICC, and this research direction, particularly compelling.” **In fact, text-conditioned or prompt-steered clustering using LLMs paradigm was first proposed by IC|TC [1], and subsequently explored in [2,3,4]. Moreover, [5] further extended this line of work by enabling automatic discovery of clustering conditions from data using LLMs.** So, such innovation and capability has already been proposed and studied by the community recently. **The authors should appropriately acknowledge prior research contributions rather than implying that this innovation originates solely from their proposed ICC method.**
- Several highly relevant studies, including [3, 4, 5], are missing from the literature review and discussion.
- Compared to IC|TC [1] and [2], the novelty of the proposed ICC method is quite limited, as it mainly adds an additional fine-tuning component.
- **Regarding the “Zero-shot In-context Clustering” experiments in Section 3.1: are they truly zero-shot? The prompt template (Lines 144–146) explicitly provides the number of clusters to the model.** This information constitutes a strong prior about the data structure, meaning the model already **knows how many ground-truth groups exist in the dataset**. With such prior knowledge given, the zero-shot nature of the setting is questionable. In real-world zero-shot scenarios, the number of clusters is often *unknown*.
- **The experimental setup and baseline comparisons in Section 4.2 (Table 2) and Section 5 (Table 3) appear to be unfair.** The IC|TC baseline is training-free and uses LLMs directly without fine-tuning (e.g., GPT-3.5-turbo). In contrast, the proposed ICC method either (1) uses GPT-4o, which is a much stronger model, or (2) uses llava-interleave-qwen-7b-hf with further fine-tuning on data drawn from a similar distribution. Comparing the GPT-4o model and the fine-tuned llava-interleave-qwen-7b-hf to a training-free GPT-3.5-turbo baseline is not fair, as it conflates improvements due to model scale and additional tuning. Consequently, the conclusions drawn from this comparison are not well supported.
- Further, the reviewer questions how ICC would compare against traditional clustering methods using strong vision features. **For example, what is the clustering performance of K-means when using features extracted from DINOv3-ViT-7B/16?** Would ICC—relying on a significantly larger model—still outperform DINOv3-based clustering under comparable settings?
[1] Kwon, Sehyun, et al. "Image clustering conditioned on text criteria." In ICLR, 2024.
[2] Luo, Yulin, et al. "Llm as dataset analyst: Subpopulation structure discovery with large language model." In ECCV, 2024.
[3] Yao, Jiawei, Qi Qian, and Juhua Hu. "Customized multiple clustering via multi-modal subspace proxy learning." In NeurIPS, 2024.
[4] Yao, Jiawei, Qi Qian, and Juhua Hu. "Multi-modal proxy learning towards personalized visual multiple clustering." In CVPR, 2024.
[5] Liu, Mingxuan, et al. "Organizing unstructured image collections using natural language." Arxiv preprint, 2024.
Minor questions are described in the following:
- The authors claim that prior similarity-based clustering methods cannot capture “context.” However, no proof, reference, or experimental evidence is provided to support this claim in either the textual or visual modality. In fact, many earlier methods in text clustering, including classical probabilistic models such as LDA [6], explicitly aim to model contextual information to group documents by topic. Similarly, in the vision domain, when images are represented through learned or encoded features, it is unclear to the reviewer why such representations would be inherently incapable of capturing context.
- Regarding the evaluation metrics in Section 3.1: while it is standard practice to compute clustering accuracy using the Hungarian linear assignment, this metric can easily be biased by its matching step (a minimal sketch of this metric is given after this list for reference). Since LLMs are capable of producing textual labels for each cluster, the authors could consider an alternative approach that approximates classification accuracy for a more direct comparison.
- At Line 189, the authors state: “We also observe that instruction tuning improves the overall accuracy.” However, the dataset used for instruction tuning, and the details of how this tuning was performed, are not specified in the paper. Without this information, the result cannot be properly interpreted or reproduced.
- Please explain what "df" (degrees of freedom) refers to in Section 3.1; it is not explained in the manuscript.
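For reference, a minimal sketch of the standard Hungarian-aligned clustering accuracy mentioned above (the toy label arrays are hypothetical); the suggested alternative would instead score the LLM's textual cluster names directly against the class names:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_accuracy(y_true, y_pred):
    """Clustering accuracy under the best one-to-one cluster-to-class matching."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # Contingency table: rows = predicted clusters, cols = true classes.
    counts = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            counts[i, j] = np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(-counts)  # maximize matched counts
    return counts[row, col].sum() / len(y_true)

# Toy example: 6/9 points agree under the best permutation of cluster labels.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 2, 2, 2, 0, 0, 0, 1])
print(hungarian_accuracy(y_true, y_pred))  # ~0.67
```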
[6] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet allocation." Journal of machine Learning research 3.Jan (2003): 993-1022. |
Fully human-written |
|
In-Context Clustering with Large Language Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The article introduces In-Context Clustering (ICC), which extends the in-context learning paradigm to unsupervised clustering tasks. The authors demonstrate that large language models (LLMs) can perform zero-shot clustering on text-encoded numeric data and images by leveraging their attention mechanisms to capture complex relationships between inputs. The work introduces fine-tuning strategies using next token prediction (NTP) loss to enhance clustering performance, particularly for heavy-tailed distributions and semantically rich data. Additionally, ICC enables text-conditioned image clustering, allowing users to specify clustering criteria through natural language prompts.
- The paper is clear about its motivation with sufficient significance and quality.
- The paper makes a compelling case for extending in-context learning to unsupervised settings. The ability to perform clustering without predefined similarity measures through prompting is innovative and addresses real limitations of classical methods.
- The zero-shot clustering results on t-distributed data with varying degrees of freedom convincingly demonstrate that LLMs can outperform k-means when Gaussian assumptions are violated. The performance gains are particularly striking for heavy-tailed distributions.
- The visualization and analysis of attention matrices revealing emergent cluster structures (Section 3.2) provides valuable insights into the internal mechanisms. The finding that spectral clustering on attention matrices achieves 85% accuracy before fine-tuning while generation only reaches 74% is particularly intriguing.
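A minimal sketch of the kind of analysis this refers to, treating a symmetrized attention map as an affinity matrix for spectral clustering; the attention matrix here is synthetic and purely illustrative (in the paper it would be extracted from an intermediate LLM layer):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
n, k = 30, 3
groups = np.arange(n) // (n // k)          # ground-truth block structure

# Synthetic stand-in for a row-normalized attention map: strong attention within
# blocks, weak attention elsewhere.
attn = np.where(groups[:, None] == groups[None, :],
                rng.uniform(0.5, 1.0, (n, n)),
                rng.uniform(0.0, 0.1, (n, n)))
attn = attn / attn.sum(axis=1, keepdims=True)

affinity = 0.5 * (attn + attn.T)           # symmetrize before spectral clustering
labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels)                              # recovers the 3 blocks up to label permutation
```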
- The paper insufficiently addresses the computational limitations for practical deployment. With O(n²) attention complexity and token limits, how does ICC handle datasets with thousands of points? The average pooling for images seems like a band-aid solution that could lose critical fine-grained information.
- While the empirical results are good, the paper lacks theoretical analysis of when and why ICC works. What properties of the attention mechanism enable clustering? Under what conditions might ICC fail?
- This is the most critical weakness of the paper. For image clustering, the comparison is limited to k-means and IC|TC. Missing comparisons with modern deep clustering methods (e.g., SCAN, NNM, SwAV, or other self-supervised approaches) makes it difficult to assess the true performance gains.
- The fine-tuning data generation process using t-distributions with random parameters seems arbitrary. How sensitive is performance to this choice?
- No ablation studies on key design choices (e.g., impact of different pooling strategies, prompt variations)
- Figure quality could be improved - some attention visualizations are difficult to interpret
- The related work section could better position this work relative to recent advances in foundation models for clustering
- How does performance degrade as the number of data points approaches context limits? Have you experimented with hierarchical clustering or other strategies to handle larger datasets?
- How robust is ICC to prompt variations? The paper uses a simple template "Cluster the following data into {k} clusters" - have you tested more sophisticated prompting strategies or chain-of-thought reasoning?
- Can you provide any theoretical justification for why attention patterns correspond to cluster structure? Is there a connection to graph-based clustering methods or spectral theory?
- What types of clustering problems does ICC struggle with? For instance, how does it handle clusters with varying densities, non-convex shapes, or hierarchical structures? |
Fully AI-generated |
|
In-Context Clustering with Large Language Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces In-Context Clustering (ICC), a novel method that leverages LLMs for clustering of images and numerical data. The authors show that the LLM’s attention mechanism captures complex relationships between inputs that can be used for clustering with spectral clustering. Further improvements are obtained through fine-tuning with next-token prediction, extending the method to numeric and image data. Additionally, text-conditioned image clustering is demonstrated where multiple different clusterings can be extracted based on the design of the prompt.
- The paper addresses a highly relevant research area, presenting an approach that enables more user-guided clustering through prompt-based interactions with LLMs
- I found it interesting that the method works well with numerical data as input for clustering via LLMs, although this capability appears to be constrained to lower-dimensional data
- Employing the attention matrix derived from the LLM as input for spectral clustering is an interesting insight
- Experiments show benefits across different datasets and modalities
## Soundness
The authors limit their comparison to a single classical clustering algorithm, namely K-Means, which already serves as a strong baseline in Tables 2 and 3. Based on this, I am missing comparisons with other classical algorithms such as Expectation-Maximization clustering, DBSCAN, or its popular extension HDBSCAN, to see whether "simpler" baselines can outperform the proposed method.
## Novelty
My main concern with this paper lies in its limited novelty. Several recent works have already explored closely related ideas, particularly the use of prompting and multimodal representations to induce or control clustering behavior, but the authors only compare to IC|TC. For example, prior studies such as
- **Jiawei et al. "Multi-modal proxy learning towards personalized visual multiple clustering." CVPR 2024.**
- **Jiawei et al. "Customized multiple clustering via multi-modal subspace proxy learning." NeurIPS (2024): 82705-82725.**
- **Stephan et al. Text-Guided Image Clustering. EACL (1) 2024: 2960-2976**
- **Stephan et al (2024). Text-Guided Alternative Image Clustering. In Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024) (pp. 177-190).**
already use prompt-based approaches to obtain one or multiple clusterings conditioned on different attributes or textual guidance.
Moreover, recent works such as
- **Gadetsky et al: Large (Vision) Language Models are Unsupervised In-Context Learners. ICLR 2025**
- **Gadetsky et al: Let Go of Your Labels with Unsupervised Transfer. ICML 2024**
demonstrate already that large (vision) language models can perform unsupervised in-context learning and clustering without explicit supervision.
Taken together, these prior works already explore the use of LLMs and in-context mechanisms for unsupervised or text-guided clustering, which significantly overlaps with the proposed In-Context Clustering (ICC) framework. The authors should therefore clearly differentiate their method from these existing methods and compare to them in benchmarking experiments. Further, a dedicated discussion of what is conceptually and technically novel about ICC compared to these earlier contributions is missing.
- In what key methodological ways does your approach differ from the prior works referenced in the weaknesses section? What are the key contributions of your method?
- The current comparison is limited to k-Means and IC|TC. How does your algorithm perform relative to other recently proposed methods mentioned above?
- How do other classical clustering algorithms compare to your approach? Are there scenarios in which simpler baselines outperform ICC, and if so, under what circumstances? More broadly, when might traditional methods be sufficient compared to LLM-guided clustering? |
Fully human-written |
|
In-Context Clustering with Large Language Models |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes and tests using LLMs for "in-context clustering", where (multimodal) LLMs are presented with sequences and are tasked with assigning cluster labels (conditioned on an a priori cluster count) to each element. The experiments include (1) synthetic numerical clustering, where points are sampled from mixtures of low-dimensional t-distributions with varying degrees of freedom; (2) attention-based analysis, where token-level attention maps are treated as affinity matrices for spectral clustering; (3) LoRA fine-tuning, where the model is trained via next-token prediction on synthetic text prompts containing sample–label pairs; and (4) image experiments, where images and captions from ImageNet are clustered through textual prompts. All experiments report Hungarian-aligned accuracy against true labels and compare only to simple baselines like k-means.
- The experiments are reproducible and clearly presented
- Analysis of attention affinity matrices is interesting
- Performance gains are unsurprising: the fine-tuned models are trained on synthetic mixtures drawn from the same/very similar distribution family as the evaluation sets, so improvement simply reflects distributional overlap, not generalizable clustering ability.
- "Classical methods often rely on predefined measures" - I don't agree with this. For example, embedding models trained with contrastive learning transform data onto a low dimensional manifold where local distance meaningfully represents semantic difference.
- Minimal novelty. The pipeline and evaluation duplicate prior IC|TC work, with only superficial framing changes (“in-context” language).
- Accuracy via Hungarian matching inflates scores and hides near-chance performance. Consider using ARI/NMI as well.
- Does attention-spectral remain strong under permutation of item order and different prompt formats? Show stability across layers/heads with an automatic selection rule.
- How do CLIP/DINOv2 features + spectral/DBSCAN/GMM compare under the same data, including conditional setups via text features? |
Lightly AI-edited |
|
A Theoretical Analysis of Discrete Flow Matching Generative Models |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper provides useful bounds for Conditional Discrete Flow Matching (CDFM), controlling the sampling error in total variation by a so-called risk directly linked to the training loss. The authors also provide quantitative universality bounds for Transformers in the context of CDFM.
- The main body of the paper is well written.
- The main TV bound is useful
- The Roadmap on page 6 is a nice addition to the paper to help understand how the proof works.
Major: the math is particularly clumsy and not at the expected level of a purely theoretical paper.
- The hypothesis that the velocity approximator is "factorized" seems unnecessary
- The whole Appendix B needs rewriting. In fact, the whole section is a combination of statements and proofs at the undergraduate level, written in a rather inefficient way. For instance,
- Lemmas B.5 and B.8 are standard undergraduate analysis statements (and the proof of B.8 is incorrect under the assumptions used; it is a common mistake, but still...). Similar remarks apply to parts of Section D.
- Lemmas B.9 and B.6 are so standard that they can probably be stated without proof. Lemma B.10 would be greatly simplified by using the Lipschitz constant of $(X,Y) \mapsto X+Y$ and better use of B.5.
- Theorem B.17 is not original, and I fail to understand why it is rewritten here or what the proofs written here bring to the discussion.
- Appendix C has too many problems to be acceptable:
- Lemma C.3 is FALSE!!!! It is only true if the matrices $U_{t,\theta}$ commute.
- Hölder's inequality seems to be unknown to the authors, as they repeatedly bound the $L^1$ norm of a product by the product of the $L^1$ norms.
- That being said, I do not doubt the final result, pending the replacement of Lemma C.3 by a slightly more sophisticated inequality that one can obtain using Grönwall's lemma; the standard statement is recalled below.
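For reference, the standard integral form of Grönwall's lemma that such a replacement argument would rest on (a textbook statement, not taken from the paper): for $b \ge 0$,
$$
u(t) \le a(t) + \int_0^t b(s)\, u(s)\, ds
\;\;\Longrightarrow\;\;
u(t) \le a(t) + \int_0^t a(s)\, b(s)\, \exp\!\Big(\int_s^t b(r)\, dr\Big)\, ds,
$$
and in particular $u(t) \le a \exp\!\big(\int_0^t b(s)\, ds\big)$ when $a$ is constant. Applied, for instance, to $u(t) = \lVert p_t - \hat p_t \rVert_1$ for the two Kolmogorov forward equations, this gives a bound growing like $\exp\!\big(\int_0^t \lVert U_s \rVert\, ds\big)$ without requiring the matrices $U_{t,\theta}$ to commute.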
Minor
- The chosen vocabulary embedding is non-standard, quite restrictive, and an unnecessary hypothesis.
- The same holds for the timestep embedding.
- The statement of Theorem 4.6 is unnecessarily complicated, and the notation for the space of transformers is not defined. Lemma 4.4 can most likely be found in a functional analysis book, although, in the context of an ML article, reproving it directly as in Appendix E is acceptable.
- some typos
- Theorem 4.6 is expected to hold in some form for any architecture for which one is able to prove a quantitative universality theorem. It is not specific to transformers, so stating the result only for transformers is rather odd.
See Weaknesses. |
Fully human-written |
|
A Theoretical Analysis of Discrete Flow Matching Generative Models |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper provides the first end-to-end theoretical analysis of Discrete Flow Matching (DFM). It establishes: (1) an intrinsic error bound linking the velocity-field risk to the distributional error; (2) an approximation theory showing Transformers can approximate DFM velocity fields; (3) an estimation theory giving statistical convergence rates for training DFM-Transformers. In summary, the paper develops a complete approximation and estimation theory, analogous to classical diffusion and flow matching results, but for discrete flow matching.
(1) The paper proposes a rigorous end-to-end theory for discrete flow matching.
(2) The chain of results (intrinsic error, approximation error, estimation error) is elegant and well-motivated.
(3) The derivation of total variation bounds via Kolmogorov equations and Grönwall’s inequality seems technically non-trivial and new.
(1) The paper currently lacks empirical experiments. In practice, it is hard to make the training objective very small, while the error bound relies heavily on terms like $\sqrt{M}\exp(M)$. The resulting guarantees could be loose for typical $M$, limiting practical interpretability.
(2) Some modeling assumptions appear stronger than usual, for example, time-Hölder smoothness of CTMC paths, exponentially large UA constants for transformers, and a global velocity-field bound. Clarifying why these are needed and how sensitive the results are to them would improve transparency.
The theoretical analysis is interesting; I would consider raising my score if the authors address the following question:
(1) Could the authors analyze error bounds for the out-of-vocabulary (OOV) case, now or in future work? The current theory appears to assume a finite discrete state space; with OOV tokens, how should the problem be treated?
(2) Recent work (e.g., Dirichlet Flow, H. Stark 2024; Diffusion LM, Shen et al., 2025) models discrete diffusion/flow using cross-entropy objectives that naturally yield KL bounds. Why is MSE chosen here instead, and why analyze total-variation (TV) bounds rather than KL? |
Fully human-written |
|
A Theoretical Analysis of Discrete Flow Matching Generative Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper provides a theoretical analysis of end-to-end training of DFM models. The authors bound the distance between the generated distribution and the true data distribution by the risk of the learned probability velocity. This risk is then analyzed through two sources: approximation and estimation errors. The approximation error is determined by the capacity of the transformer, and the estimation error is due to training on a finite dataset.
The paper provides theoretical justification (for the first time in discrete flow‐matching) that model error can be controlled and that convergence is guaranteed under ideal conditions.
The paper is well written. To my knowledge this is the first formal proof of convergence for DFM. It provides a valuable theoretical contribution to the growing area of discrete generative modeling and flow‐matching methods. They decompose the errors for learning the probability velocity into the approximation error (due to model architecture) and the estimation error (due to the sample size), assuming that training is done ideally.
The analysis is built upon a slightly different version of the loss function from the original discrete flow matching (Gat et al., 2024). It would be better to elaborate a bit more on the connection between those loss functions and to justify the use of the squared $\ell_2$ norm as the Bregman divergence.
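For concreteness, the standard connection I would like to see spelled out (in my notation, which may differ from the paper's) is
$$D_\varphi(p, q) \;=\; \varphi(p) - \varphi(q) - \langle \nabla\varphi(q),\, p - q\rangle,$$
which reduces to $\tfrac12\|p-q\|_2^2$ for the generator $\varphi(u)=\tfrac12\|u\|_2^2$, and to $\mathrm{KL}(p\,\|\,q)$ (i.e., the cross-entropy objective up to a term independent of $q$) for the negative-entropy generator $\varphi(u)=\sum_i u_i\log u_i$ on the simplex. Making explicit which generator underlies Eq. (2.10), and why, would address the first question below.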
There is a connection between discrete diffusion models and discrete flow matching. The original DFM training loss is identical to the masked diffusion training loss, and there is earlier literature on the convergence analysis of discrete diffusion models. It would be better to include some discussion of the similarities, differences, and novelty compared to earlier work on discrete diffusion.
In addition, since the paper only provides upper bounds on the TV distance and probability velocity risk, I feel that it needs some empirical evidence to show how tight those bounds are.
1. In the original discrete flow matching, the training loss uses cross-entropy terms (see Eq. (28) in Gat et al., 2024). Is it equivalent to the loss in Eq. (2.10), which uses the squared $\ell_2$-norm as the Bregman divergence?
2. Does using a different form of Bregman divergence affect the results?
3. In Theorem 3.1, is $T = 1$? How big is $t_0$?
4. Is it possible to show some empirical evidence of the scaling behavior (e.g., $M$, $n$, or $d_0$) on some toy data? |
Fully human-written |
|
A Theoretical Analysis of Discrete Flow Matching Generative Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors provide a full error analysis for a discrete flow-matching model parametrized by a transformer, trained on factorized mixture path interpolants. On the technical level, a central result is their construction of a smooth velocity field defined on a continuous domain, extending the true velocity field, which takes discrete arguments. This in turn allows them to apply approximation results for transformers and obtain error guarantees, alongside norm bounds for the model parameters.
The work extends the growing body of theoretical analyses of flow-matching models to the discrete case. This extension requires new technical tools such as Lemma 4.4, which bridges the discrete and continuous cases. The contribution is, to the best of my awareness, novel and interesting, and will thus be of interest to the ICLR community. The paper is overall clearly written, and the main technical ideas are sufficiently explained and motivated. In this light, I am overall in favor of acceptance; however, I have limited familiarity with the field and did not read the proofs in detail.
My main criticism is the lack of sufficient discussion on certain aspects, which I highlight in the questions section.
- The main weakness of the bound is an error guarantee exhibiting a curse of dimensionality in the effective dimension $Md_0$. Could the authors provide more intuition or discussion on why this is the case? Section 6 discusses the prefactor but not the rate. Notably, it seems that this depends on the choice of the model through $d_0$ and is not a fundamental limitation. If this is the case, could the authors briefly discuss this point and how it could be refined in future work?
- I understand that the discrete-to-continuous mapping leverages the natural inclusion embedding of $\mathcal{V}$ in $\mathbb{R}$, with vocabulary items of higher index receiving larger values. Naively, it seems that this embedding, unlike e.g. one-hot encodings, does not treat all vocabulary items in the same fashion, and is not the most natural embedding. Could the authors comment on this, and how this choice impacts (or not) the bounds?
- Further motivation on the choice of transformer architectures would be helpful. In particular, many previous works on continuous generative models studied ResNet or auto-encoder architectures (e.g. Boffi et al; Shallow diffusion networks provably learn hidden low-dimensional structure; 2024). Is the transformer particularly adapted to the discrete case, or simply chosen to reflect most practical settings?
- (Minor) l. 266: "total variation distance bound scales exponentially with vocabulary size M". I am confused, as I do not see any exponential dependence in Theorem 3.1.
- I invite the authors to include further comparison with analyses in the continuous case to better highlight the main differences, which would be interesting. In particular, I am wondering whether the authors have an idea of how an analysis of the time-discretization error would proceed and differ in the discrete setting? |
Fully human-written |
|
Improving End-to-End Training of Retrieval-Augmented Generation Models via Joint Stochastic Approximation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper aims to address the challenge of end-to-end optimization in retrieval-augmented generation (RAG) systems. Existing RAG models typically consist of separately trained retrievers and generators. Achieving true end-to-end optimization would require marginalizing over all relevant passages in the knowledge base, which are modeled as discrete latent variables. However, current methods tend to be either biased or to exhibit high variance.
To tackle this issue, the authors propose a new training framework called JSA-RAG, which applies the Joint Stochastic Approximation (JSA) algorithm to RAG training. By employing a stochastic EM approach to train a posterior retriever, the model enables genuine end-to-end optimization of RAG.
Experiments are conducted on five datasets across two tasks: open-domain question answering (ODQA) and knowledge-grounded dialogue. They demonstrate that JSA-RAG significantly outperforms both vanilla RAG and VRAG in terms of generation quality and retrieval performance.
1. This paper offers a well-founded approach to addressing the gradient estimation challenge arising from discrete latent variables in RAG training. In contrast to RAG (TKM), which suffers from biased gradient estimates, and VRAG, which tends to produce high variance, the proposed JSA-RAG applies the Joint Stochastic Approximation (JSA) algorithm that theoretically ensures unbiased and low-variance gradient estimation for the inference model. Overall, this work presents a well-motivated and insightful research direction.
2. JSA-RAG consistently outperforms RAG and VRAG across all five datasets. It improves both generation quality and retriever performance simultaneously, demonstrating its ability to achieve effective joint optimization.
3. The paper offers a thorough analysis that clearly demonstrates the advantages of the proposed approach. Compared with the frequent sharp spikes observed in the gradient norms of VRAG, JSA-RAG achieves lower variance and exhibits more stable training dynamics. Furthermore, the ablation studies indicate that the posterior retriever trained with JSA surpasses its VRAG counterpart in both recall and MRR performance.
1. Insufficient Baselines: In the Related Work section, the paper mentions several relevant studies, such as RetGen and Stochastic RAG, and claims that these methods "tend to be biased or have high variance." However, there is a lack of empirical comparison with these approaches. Including only VRAG as a baseline is insufficient; the experiments should incorporate more baselines for a fair and comprehensive evaluation.
2. Concern about Retrieval Performance Evaluation: Regarding the evaluation of retrieval performance, the paper uses datasets without gold passage annotations and relies on GPT-4o-mini to generate these annotations. However, there already exist datasets with human-annotated gold passages (such as MultiHop-RAG and others). Using such datasets would make the evaluation of retrieval performance more credible and convincing.
3. Mismatch in Model Scale and Analysis Scope: The dialog datasets have relatively small knowledge bases, whereas ODQA involves much larger ones. However, the ablation studies are conducted only on the OR-QuAC dataset, which differs substantially from real-world RAG scenarios. This raises concerns about whether the experimental results can generalize to larger-scale datasets.
4. Concern about Reported Training Costs: The paper claims that the training cost of JSA-RAG is comparable to that of RAG and VRAG. However, as shown in Table 7, the training cost per 100 steps for JSA-RAG is notably higher than that of the baseline models. Hence, the experimental evidence does not fully substantiate the claim of comparable training costs.
Please see the weaknesses. |
Moderately AI-edited |
|
Improving End-to-End Training of Retrieval-Augmented Generation Models via Joint Stochastic Approximation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper addresses the challenge of end-to-end RAG optimization, where marginalizing over latent passages leads to biased or high-variance gradients in traditional methods. The core contribution is JSA-RAG, a framework applying the Joint Stochastic Approximation (JSA) algorithm to obtain low-variance gradient estimates. This is achieved using an auxiliary posterior retriever and MCMC sampling. Experiments show significant performance gains over baselines on multiple benchmarks.
1. The paper correctly identifies the core gradient estimation problem and applies a theoretically sound solution (JSA) from the statistical machine learning literature.
2. The method demonstrates consistent and significant performance improvements over strong baselines across a diverse set of tasks and datasets.
3. The work is well-supported by thorough analysis, including gradient norm comparisons and ablation studies on index rebuilding and passage concatenation, which enhance the method's credibility.
1. The framework introduces an auxiliary inference model and a complex MCMC sampling procedure, which increases implementation and tuning difficulty compared to simpler RAG variants.
2. The method's effectiveness relies on a high-quality posterior retriever to guide the MIS sampler. However, the paper does not analyze the framework's sensitivity to a sub-optimal posterior retriever, which could impact training efficiency and convergence.
3. The iterative MCMC sampling introduces a non-trivial computational burden per training step, raising concerns about its scalability as model and data scales increase.
1. What was the rationale for selecting the number of MIS steps?
2. The analysis of the results should be deepened. Please connect the theoretical benefits of JSA more directly to why the method succeeds on the tested tasks. |
Moderately AI-edited |
|
Improving End-to-End Training of Retrieval-Augmented Generation Models via Joint Stochastic Approximation |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper treats the retrieved passage as a discrete latent variable and seeks to maximize the marginal log-likelihood. Instead of top‑K marginalization or ELBO surrogates (vanilla RAG/VRAG), it applies Joint Stochastic Approximation (JSA) with Metropolis‑Independence Sampling (MIS), using a posterior retriever as the proposal. Accepted samples are used as pseudo‑labels to jointly update the prior retriever, generator, and posterior. To make it scale, prior/posterior probabilities are computed on the union of their top‑k sets from a FAISS index. Experiments on ODQA (NQ, TQA, MS‑MARCO) and dialog (OR‑QuAC, DoQA) show consistent but modest absolute gains in generation and retrieval, plus lower gradient‑variance for the posterior. The paper also analyzes index rebuilding and passage concatenation.
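To make the sampling mechanism concrete, here is a minimal sketch of one MIS step as I understand it from the paper's description; all names are hypothetical, and the log-probabilities would come from the prior retriever, the generator, and the posterior retriever, respectively:

```python
import math
import random

def mis_step(z_curr, log_prior, log_lik, log_q, sample_q):
    # One Metropolis-Independence Sampling step (schematic sketch, not the authors' code).
    # Target: posterior over passages, p(z | x, y) proportional to p(z | x) * p(y | x, z).
    # Proposal: the posterior retriever q(z | x, y).
    # log_prior, log_lik, log_q are callables returning log-probabilities for a passage z;
    # sample_q draws a candidate passage from the posterior retriever.
    z_prop = sample_q()
    log_w_prop = log_prior(z_prop) + log_lik(z_prop) - log_q(z_prop)
    log_w_curr = log_prior(z_curr) + log_lik(z_curr) - log_q(z_curr)
    accept_prob = math.exp(min(0.0, log_w_prop - log_w_curr))
    if random.random() < accept_prob:
        return z_prop  # accepted: used as a pseudo-label to update retriever, generator, posterior
    return z_curr      # rejected: keep the current passage
```

Reporting how often the accept branch is taken over training (see the questions below) would clarify the compute and variance trade-offs.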
- Principled estimator for latent retrieval with lower gradient variance on the posterior retriever; clean algorithmic presentation.
- Consistent gains across five datasets (QA and dialog) with multiple metrics.
- Engineering details (FAISS + union top‑k, index rebuilding, passage concatenation) are useful to practitioners.
- The paper argues for statistical neatness (low‑variance updates), but does not answer why end‑to‑end retriever–generator training is preferable today versus strong non‑E2E alternatives (e.g., well‑tuned retrievers with instruction‑tuned generators, verification‑augmented pipelines, or separate retriever/generator training).
- Inference uses the same top-k documents for decoding; latency/QPS/VRAM numbers versus baselines are absent. Without speed or cost benefits, the case for E2E training hinges entirely on accuracy.
- Improvements are generally +1–3 points (task‑dependent). For many applications, that uplift may not justify the added training complexity and cost.
- Reported wall‑clock shows JSA slower than VRAG (same order but noticeably higher). The paper lacks scaling curves vs MIS steps m and union top‑k that would help calibrate the practicality.
- ODQA retrieval metrics rely on GPT‑4o‑selected “gold” passages from top‑100; robustness to this proxy is untested (e.g., human‑checked subsets or multi‑gold analyses).
- MIS acceptance/mixing statistics and their evolution are not presented; this is important to understand stability, variance, and compute trade‑offs.
- Limited discussion of modern non‑E2E alternatives (e.g., RA‑DIT‑style decoupled training, verifier‑augmented RAG) under matched compute; such comparisons could change the cost‑benefit picture.
- Scaling curves: How do accuracy and wall‑clock vary with MIS steps m and union top‑k? Where do returns diminish?
- Sampling behavior: What are MIS acceptance rates over training, and do they correlate with final performance or stability? |
Heavily AI-edited |
|
scCMIA: Self-supervised Dual Model for Mitigating Information Loss in Single-cell Cross-Modal Alignment |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a new method for cross-modality integration and alignment. Methods focusing on scRNA and scATAC data integration are already well studied, and it is therefore hard to identify the main contributions of this paper to the field.
The framework is clearly presented.
I have several questions or concerns regarding the current model design and model performance. I think these challenges preclude the paper from publication in this conference, at least in this format.
1. What is the unique contribution of this paper? Using VQ-based methods for multi-omic data integration or biological data learning has already been studied in several papers (https://www.nature.com/articles/s41540-020-00158-2, CVQVAE, or scBeacon). The method lacks innovation, and the training design is not very appealing.
2. The motivation is not well established. The central dogma only allows unidirectional information flow, so we do not need to model bidirectional information. RNA can never go back to chromosomes, and thus this method lacks biological interpretation.
3. The benchmarking results are also puzzling. Why do some baselines have variance reported while others do not? The authors should unify the presentation and provide variance for every model. Moreover, reconstruction in single-cell multi-omic data analysis is not a useful metric, as the expression profiles always contain noise. The authors should consider one or two new tasks for the evaluation. I recommend that the authors read https://www.nature.com/articles/s41592-025-02856-3 for additional baseline methods.
4. The comparison should be fair. The authors need to tune hyperparameters for all methods to ensure a fair comparison.
5. I cannot find information about the data scale. Are the test datasets large-scale or small-scale?
6. How about applying the method to integrate proteomic data such as CITE-seq? Since the authors do not model noise, this framework should work well.
Please see the weaknesses. |
Fully human-written |
|
scCMIA: Self-supervised Dual Model for Mitigating Information Loss in Single-cell Cross-Modal Alignment |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces scCMIA, a self-supervised framework designed to address the challenges of integrating single-cell multi-omics data, particularly focusing on cross-modal alignment between scRNA-seq and scATAC-seq modalities. The key innovation lies in leveraging mutual information (MI) principles to decouple modality-specific and semantic features within a unified discrete latent space using a VQ-VAE architecture. The proposed method aims to mitigate information loss during integration by combining intra-modal decoupling (via CLUB-based MI minimization) and inter-modal alignment (via contrastive learning with InfoNCE loss).
1. The integration of MI bounds for intra-modal decoupling and cross-modal alignment is theoretically grounded.
2. The paper provides a rigorous evaluation across multiple datasets and tasks (alignment, reconstruction, clustering, label transfer).
1. My main concern is the novelty of this work. The proposed framework is a patchwork of existing techniques and offers no new insights or benefits to the community.
2. While four datasets are used, they primarily focus on well-studied protocols (e.g., 10x Multiome). Broader validation on more complex tissues or rare cell types would strengthen generalizability.
3. The paper lacks comparison with cutting-edge approaches like scButterfly or graph-based methods beyond GLUE. Including these would better contextualize scCMIA’s advancements.
Please see the weaknesses. |
Heavily AI-edited |
|
scCMIA: Self-supervised Dual Model for Mitigating Information Loss in Single-cell Cross-Modal Alignment |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a deep learning framework that is designed to operate on multiple single-cell data modalities. It solves the problem of alignment of single cells across modalities as well as the problem of translating between modalities. The method uses an InfoNCE loss for alignment, and uses a discrete codebook to improve interpretability. Extensive empirical experiments suggest that the method outperforms various state-of-the-art competitors.
The proposed model includes several components (the VQ module and the mutual information module) that are well motivated and seem to provide significant improvements relative to the state of the art.
A major problem with this paper is that the exposition is difficult to follow. For example, the second paragraph of the introduction fails to clarify exactly what problem you are working on. Indeed, by describing multimodal protocols that assay multiple aspects of the same single cell, I was misled about what tasks you are interested in solving. What would help is a precise, formal description of the problems you are addressing. More generally, I found the text very difficult to follow. It would be better if you carefully defined terms before using them. Below I outline some of the questions that arose as I worked through the manuscript.
In general, I think a missing piece here is assessing how well these models generalize beyond the specific data set they are trained on. I think that each model is trained and validated on splits of the same data set (though I don't know for sure, because you don't tell us how this is done). So a reasonable question is whether you can apply the trained model to a new, independent dataset, generated from a different type of cell. The multimodal alignment methods mentioned at the start of Section 2 work directly in such a scenario, whereas a trained model like yours inherently has to worry about generalizability. In practice, to be useful your model has to generalize to single-modality data (i.e., I only measured scRNA-seq, and you tell me what the corresponding scATAC-seq would look like). A discussion of this issue, and some experimental characterization of it, would substantially strengthen the paper.
I thought your description of the challenges associated with multi-modal data (lines 43-49) was imprecise and not very informative. For example, what does it mean to say that there are "substantial discrepancies" between scATAC-seq and scRNA-seq? They measure entirely different things. To my mind, the fact that there are differences in feature spaces is not a "challenge" per se; it's just definitional. You wouldn't say that multimodal analysis of text and images is "challenging" because pixels don't look like words, right?
I don't actually believe your claim (line 55) that if you don't embed data into a shared space, then you "cannot fully exploit potentially complementary information across modalities." This is a very bold claim that requires substantial evidence. Indeed, I don't know how you could conclusively prove such a claim.
I am not convinced that *mean* FOSCTTM is the most useful measure. Have you considered computing a p-value for improvement of the FOSCTTM? You get a FOSCTTM score for each cell, so you could do something like a sign test.
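As a minimal sketch of the suggested sign test, assuming per-cell FOSCTTM scores for your method and a competitor are available as two equal-length lists (all names here are illustrative, not from the paper):

```python
from scipy.stats import binomtest

def foscttm_sign_test(scores_ours, scores_other):
    # Paired sign test: does our method achieve a lower (better) FOSCTTM than the
    # competitor on significantly more than half of the cells?
    wins = sum(a < b for a, b in zip(scores_ours, scores_other))
    ties = sum(a == b for a, b in zip(scores_ours, scores_other))
    n = len(scores_ours) - ties  # ties are dropped, as is standard for a sign test
    return binomtest(wins, n, p=0.5, alternative="greater").pvalue
```

A Wilcoxon signed-rank test on the per-cell differences would be a reasonable alternative if you also want to account for the magnitude of the improvements.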
In the related work section, the fact that alignment methods "suffer from poor alignment robustness when handling noisy [data]" is not a substantive critique, in my opinion. All methods degrade in performance in the presence of noise.
I do not understand the critique (line 104) of methods that do multimodal reconstruction without relying on a shared embedding space. You say that "their utility for tasks requiring direct cross-modal comparison, querying, and label transfer can be limited." Why? It's pretty straightforward to do, e.g., label transfer with an accurate multimodal reconstruction method: just reconstruct from one space to the other and then use nearest neighbors to transfer. There is no reason you have to do nearest neighbors in a latent space. I think this critique is misguided or needs to be explained much more carefully.
I found the text in lines 144-149 difficult to understand. For example, what is the difference between "modality-specific features" and "semantic characteristics"? What do you mean by the "bounds of MI"? Similarly, the sentence at lines 162-164 is not grammatical. I'm also confused about what it means to be "insufficient for effectively decoupling ... in a directed manner" (lines 167-168).
I wish you had introduced your assumption (line 184) earlier, since it seems to be important to understand the basis of much of this work. I guess this is what you were alluding to when you talked about "modality-specific features" versus "semantic characteristics."
In the description of the datasets, you should indicate what previous papers used these datasets for benchmarking, and indicate what paper you extracted results from (unless you ran all the tools yourself, in which case indicate that).
I was surprised that all the discussion about a bound on mutual information ultimately seems to boil down to just doing an InfoNCE alignment loss.
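For concreteness, the standard form of the objective and the bound it certifies (in my notation; the paper's constants may differ) is
$$\mathcal{L}_{\mathrm{InfoNCE}} \;=\; -\,\mathbb{E}\!\left[\log \frac{\exp\!\big(s(x, y^{+})/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(s(x, y_j)/\tau\big)}\right], \qquad I(X;Y) \;\ge\; \log N - \mathcal{L}_{\mathrm{InfoNCE}},$$
so, unless I am missing something, the MI-bound discussion and the InfoNCE alignment loss describe the same objective, and it would help to state that equivalence explicitly.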
Minor:
line 192: uses -> use
line 270: objection -> objective
You should delete the sentence at line 293 ("Single-cell multi-omics data are often hindered by complex and sophisticated techniques, low throughput, and high noise levels."). Just say what data you used. It doesn't even make sense to say that data is hindered by something.
Incidentally, I think calling cross-modal translation "reconstruction" is misleading, since reconstruction typically refers to starting and ending from the same place; e.g., reconstructing a scRNA-seq profile from a masked or compressed version thereof. I do recognize that other papers in the literature use "reconstruction" to mean "translation."
Did you compute the performance measures in Tables 1-4, or were some of these taken from previous publications? If the latter, did you use the same cross-validation splits?
How was train/test splitting done for each dataset? |
Fully human-written |
|
scCMIA: Self-supervised Dual Model for Mitigating Information Loss in Single-cell Cross-Modal Alignment |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces multi-modal alignment between scRNA (single-cell RNA-sequencing) and scATAC (single-cell Assay for Transposase-Accessible Chromatin using sequencing) data using a VQ-VAE (Vector Quantized Variational Autoencoder) architecture.
The justification for the modeling based on Mutual Information is well-established.
Limited Novelty
- The justification based on Mutual Information has been thoroughly explored in previous research (e.g., the CLUB paper).
- Techniques like VQ-VAE are all existing methods.
- Are there specific challenges unique to single-cell data, and does the paper introduce a corresponding novel technique to address them?
Decoupling Explanation: More explanation is needed regarding decoupling.
- Why is decoupling necessary?
- Consideration is needed on how the decoupled representations could be used independently if required.
Applicability to Uni-modal Data: The method was only applied to single-cell multi-modal data. Does it have utility for uni-modal data as well? Showing that the method performs well even on uni-modal data through experiments could further justify the use of multi-modality in the model.
See the weaknesses section. |
Moderately AI-edited |
|
WebRAGent: Retrieval-Augmented Generation for Multimodal Web Agent Planning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces WebRAGent, a framework for retrieval-augmented multimodal web agent planning. The motivation is that progress in multimodal trajectory learning is limited by the difficulty of representing rich visual information within long interaction histories that exceed a model’s context window.
To address this, the authors propose multimodal trajectory retrieval, along with:
- A benchmark for trajectory-based retrieval pairs,
- A model (GAE-Retriever) for multimodal retrieval based on a vision-language backbone, and
- A retrieval-augmented web agent integrating the retriever into agent planning.
- The paper presents a novel and intuitive idea, aiming to connect retrieval-augmented generation with multimodal web agent reasoning.
- It is well-written and easy to follow, with clear structure.
- The release of code and resources is great and improves reproducibility.
- The motivation for the multimodal trajectory retrieval task is weak and needs clearer justification—why is this problem important, and what real-world gap does it fill?
- The introduction of the GAE-Retriever lacks sufficient motivation and integration into the overall narrative.
- Figures 1 and 3, and Tables 1, 3, 9, and 10, are difficult to read due to poor formatting and small font sizes.
- The limitations of the proposed approach are not discussed.
- The related work section oversimplifies the historical relationship between retrieval and generation in the context of multimodal retrieval. Multimodal retrieval methods have existed long before generation-based approaches became dominant.
- The overall storyline feels disjointed: it is not entirely clear how the benchmark, retriever, and agent components connect to form a coherent research contribution.
- Could the authors clarify how the benchmark, retriever, and agent fit together conceptually and experimentally within one unified framework?
- Please expand on the motivation behind the proposed multimodal trajectory retrieval task; why is it necessary, and what unique challenges does it address?
- It would be helpful to include a single overview figure illustrating how all components (benchmark, retriever, agent) interact within the proposed system. |
Lightly AI-edited |
|
WebRAGent: Retrieval-Augmented Generation for Multimodal Web Agent Planning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the question of how to retrieve the most relevant parts of past multimodal trajectories to support planning. Part of the motivation is that storing all trajectories with multimodal contents in the context is impractical. The paper constructs a trajectory retrieval corpus called Unified Agent Trajectory Dataset (UATD) from annotated demonstrations and states across diverse real-world scenarios. Building on this, it constructs GAE-Bench, a benchmark containing a large number of trajectory-based retrieval pairs.
Further, the paper proposes GAE-Retriever, a retriever for multimodal trajectories that uses token selection and GradCache to optimize
the contrastive objective. It also introduces WebRAGent, a retrieval-augmented web agent that integrates GAE-Retriever. Experiments are performed on the Online-Mind2Web benchmark.
1. The core idea of retrieving similar trajectories for reuse is interesting and intuitive.
2. The GAE-Bench benchmark introduced in this paper for trajectory retrieval is a valuable resource with several patterns of trajectory retrieval, such as text-to-state, text-to-trajectory, state-to-state, etc.
3. Empirical results show a significant boost in performance over non-retrieval baselines on the Online-M2W benchmark.
1. The Unified Agent Trajectory Dataset introduced in this paper is not really novel; a similar aggregated trajectory dataset already exists [1].
2. For the Online-Mind2Web results, rather than choosing a simple MLLM baseline, the authors should add trajectory retrieval to an existing SOTA or near-SOTA model, e.g., SeeAct [2] or Agent-E [3].
3. The authors selected a subset of 100 tasks from Online-M2W without justification for not using the original dataset.
[1] Xu, Yiheng, et al. "Aguvis: Unified pure vision agents for autonomous gui interaction." ICML'25.
[2] Zheng, Boyuan, et al. "Gpt-4v (ision) is a generalist web agent, if grounded." ICML'24.
[3] Abuelsaad, Tamer, et al. "Agent-e: From autonomous web navigation to foundational design principles in agentic systems." arXiv preprint arXiv:2407.13032 (2024).
N.A. |
Fully human-written |
|
WebRAGent: Retrieval-Augmented Generation for Multimodal Web Agent Planning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses a critical challenge in the development of autonomous GUI agents: how to effectively learn from and utilize vast amounts of multimodal trajectory data (states, actions, visual observations) that often exceed the context windows of current models.
The authors present a comprehensive framework to tackle this issue:
1. **Unified Agent Trajectory Dataset (UATD):** They first curate and unify five existing GUI agent benchmarks into a standardized dataset, encompassing 7,747 demonstrations and over 82,000 states.
2. **Multimodal Trajectory Retrieval Task:** They formally define a new task, "Multimodal Trajectory Retrieval," to bridge the gap between general-purpose retrieval and agent-centric modeling.
3. **GAE-Bench:** Based on this new task, they construct a large-scale benchmark (GAE-Bench) with 714,628 retrieval pairs, derived from 12 extraction patterns that capture both temporal and semantic relationships.
4. **GAE-Retriever:** They propose a multimodal retriever built on VLM2Vec, employing optimizations like token selection and GradCache to efficiently train on high-resolution image sequences and large batches.
5. **WebRAGent:** Finally, they integrate their retriever into a retrieval-augmented agent framework, WebRAGent, which demonstrates significant performance gains (15-22%) over non-retrieval baselines on the Online-Mind2Web benchmark.
1. **Problem Significance:** The paper correctly identifies a timely and critical problem. As trajectory datasets grow, RAG is a logical and necessary step to scale agent capabilities beyond in-context learning.
2. **Foundational Dataset Contribution:** UATD and GAE-Bench are significant contributions in their own right. The engineering effort to unify heterogeneous datasets (web, mobile, desktop) into a standardized format (Section 3.1) is substantial and highly valuable for the community.
3. **Novel Task Formulation:** The "Multimodal Trajectory Retrieval" task, with its 12 extraction patterns (Figure 1), is a key conceptual contribution. This detailed formulation is crucial for training a robust retriever that understands both temporal sequence and semantic intent.
4. **Pragmatic Model Design:** The GAE-Retriever (Section 4.2) is well-designed. Using a VLM (VLM2Vec) backbone instead of CLIP-based models is well-justified for handling arbitrary combinations of multimodal inputs (lines 92-95). The use of token selection and GradCache to tackle the practical constraints of training with multiple high-resolution images is a critical and well-thought-out optimization.
5. **Strong Empirical Validation:** The paper closes the loop by demonstrating that its retrieval model directly translates to a 15-22% success rate improvement in a downstream planning task (WebRAGent). This is a very convincing validation of the entire pipeline.
While this is an excellent paper, there are a few areas that could be clarified or strengthened:
1. **Justification of "Silver Trajectories":** A key part of GAE-Bench is the semantic retrieval task (q → τ∼, lines 233-239). The authors generate "silver trajectories" via entity substitution. The example given ("Buy a t-shirt for children on Amazon" → "Order a laser printer on eBay," lines 237-239) seems to represent a pair with very **different task flows**, even if the high-level intent ("shopping") is similar.
- **Concern:** This could be a very noisy training signal. Does retrieving a trajectory for "buying a printer" actually help an agent "buy a t-shirt," or does it introduce confusion?
- **Recommendation:** The authors should provide a clearer justification for this data augmentation strategy. How is this "silver" pair more helpful than a hard negative?
2. **Architectural Novelty of GAE-Retriever:** The paper calls GAE-Retriever a "novel...framework" (line 90) but also states it is "built on VLM2Vec" (line 91) and uses optimizations (GradCache) from VLM2Vec (line 309).
- **Recommendation:** The authors should more precisely articulate the **architectural novelty** of GAE-Retriever itself, distinct from VLM2Vec. If the primary novelty lies in the *application* and *task-specific training* (i.e., being the first to successfully apply this architecture to the multimodal trajectory retrieval task), this should be stated clearly.
3. **Lack of Qualitative Analysis:** The 15-22% performance gain is impressive, but the paper does not explain *why* it works.
- **Question:** What kind of knowledge is being retrieved? Is it high-level planning steps (e.g., "log in first, then search") or low-level interaction details (e.g., "click this specific icon")?
- **Recommendation:** The paper would be significantly strengthened by adding a qualitative analysis section with 1-2 concrete examples. Show a task where the baseline fails and WebRAGent succeeds, and—crucially—show the *actual retrieved trajectory* that made the difference.
4. **Scope Mismatch (UATD vs. WebRAGent):** The UATD is presented as a highly general dataset unifying "web, mobile, desktop, and embodied environments" (line 69). However, the downstream validation (WebRAGent) is only performed on a web-based benchmark (Online-Mind2Web). This leaves the claims about cross-platform generalization underexplored.
1. **On "Silver Trajectories":** (See Weakness #1) How do you ensure that the generated silver trajectories share a similar **procedural flow**, rather than just being semantically related at a high level? The t-shirt vs. printer example seems to represent very different procedures.
2. **On Token Selection:** You use a UI-connected graph in RGB space to merge similar patches (lines 299-302). What are the advantages of this over a simpler baseline like bicubic resizing/downsampling of the image? Is there a risk of merging small but critical UI elements (e.g., a checkbox) that are similar in color to their background?
3. **On Inference Latency:** What is the computational overhead (latency) introduced by the GAE-Retriever step during WebRAGent's inference? How does this trade-off against the 15-22% gain in success rate?
4. **On "Hard Tasks":** You mention larger gains on "hard tasks" (line 106). Could you provide a concrete example of a "hard task" and qualitatively explain why retrieval was so beneficial for it? |
Fully AI-generated |
|
WebRAGent: Retrieval-Augmented Generation for Multimodal Web Agent Planning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents WebRAGent, a retrieval‑augmented multimodal web agent designed to leverage past GUI trajectories for better decision‑making. The authors introduce the Unified Agent Trajectory Dataset (UATD) and propose the new task of Multimodal Trajectory Retrieval, releasing the benchmarks GAE‑Bench and GAE‑Bench‑lite with over 700K trajectory retrieval pairs. They develop GAE‑Retriever, a VLM2Vec‑based model using token selection and GradCache for efficient contrastive training, and integrate it into the WebRAGent framework. Experiments across five datasets show substantial recall improvements over strong multimodal baselines, and on Online‑Mind2Web, WebRAGent achieves 15–22% higher success rates than non‑retrieval models.
- **Innovative dataset construction and benchmarks.**
The paper introduces a unified methodology for integrating heterogeneous GUI‑based trajectory data, resulting in the Unified Agent Trajectory Dataset (UATD) and two large‑scale benchmarks (GAE‑Bench and GAE‑Bench‑lite). This contributes valuable resources for evaluating multimodal trajectory retrieval and provides a standardized foundation for future agent studies.
- **Novel and well‑structured web‑agent framework.**
The proposed WebRAGent framework innovatively integrates multimodal retrieval with generation in agent planning. Its modular design, combining observation, retrieval, memory, and planning, demonstrates a clear architectural innovation and effectively bridges trajectory retrieval and agent planning.
- **Comprehensive and substantial work.**
The paper presents extensive data preparation, thorough model development, and large‑scale experiments across multiple benchmarks. The amount of work is significant, covering dataset unification, retriever training, and online evaluation, showing strong technical depth and implementation effort.
- **Lack of clarity in technical details.**
Several key components are insufficiently explained, such as the reward design, data annotation procedures, and implementation specifics of dataset generation and evaluation. These omissions make it difficult to reproduce and precisely understand how the framework works. (See detailed questions in the Questions section.)
- **Unclear core contribution.**
The proposed GAE‑Retriever primarily builds upon existing methods like VLM2Vec and integrates known techniques such as Token Selection and GradCache. Since these components are not original, the novelty of the contribution is uncertain. If the main innovation lies in the integration, the authors should provide clear ablation studies or quantitative evidence demonstrating the necessity and contribution of each module.
- **Unfair performance comparison.**
The retriever is trained and evaluated on data from the same source, which naturally favors high recall scores. Comparisons with untrained or zero-shot models are therefore not entirely fair. Moreover, the paper does not compare WebRAGent's web-retrieval capability against existing web-search or web-retrieval models, making it difficult to know whether its performance actually surpasses existing systems.
- **Formatting and presentation issues.**
Figures and tables are sometimes awkwardly placed, often disrupting the reading flow. Aligning them consistently—at the top of pages—would significantly improve the presentation quality.
- **Reward mechanism ambiguity.**
The paper mentions the use of an “LLM‑as‑judge” strategy for rewarding but provides no implementation details. Which specific LLM model was used for judging (e.g., GPT‑4, GPT‑4‑turbo, or others)? What were the prompts, scoring criteria, and calibration procedures? Given that the reward signal directly affects policy and evaluation, this should be clarified in detail.
- **Unspecified reranking model.**
The framework claims to apply an LLM‑based reranking step after retrieval, yet there is no description of the rerank model, prompt design, or how it integrates with GAE‑Retriever. How is the reranker implemented, and what quantitative performance gain does it contribute?
- **Insufficient transparency in data construction.**
At line 377 the paper states “Data are annotated with gpt‑4o‑mini‑2024‑07‑18.” but does not explain the detailed annotation process. What prompts were used for labeling? How were data quality and potential data leakage from pretrained sources verified?
- **Lack of comparison with other web‑search models.**
WebRAGent’s web‑retrieval performance is not compared with existing Web search or Web‑retrieval systems, such as dense retrievers or LLM‑based search agents. Without such baselines, it remains unclear whether the model actually advances the state of web‑scale retrieval.
- **Computational cost of retrieval augmentation.**
The paper does not quantify how much additional latency or computation the retrieval process introduces during inference. How large is the retrieval database, and what is the average time‑to‑action compared to the non‑retrieval baseline?
- **Fairness of planning models across baselines.**
In the Planning and Action section, the authors state that WebRAGent uses GPT‑4.1 for DOM mode and OpenAI’s computer‑use‑preview model for Vision mode—both very strong, closed‑source models. Do the non‑retrieval baselines employ exactly the same planners? If not, how can we separate the performance gain attributable to retrieval augmentation from that potentially caused by stronger planning models? |
Fully AI-generated |
|
Contrastive Residual Energy Test-time Adaptation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a test-time adaptation framework based on residual energy, CRETTA to enable efficient and well-calibrated adaptation under distribution shifts. In contrast to entropy-minimization methods that rely on unreliable pseudo-labels or energy-based approaches that demand expensive sampling, CRETTA employs a residual energy function to capture the discrepancy between source and target distributions. By integrating this residual function into a contrastive learning objective, the framework eliminates the need for normalization constant estimation and substantially reduces computational cost.
1. CRETTA avoids both sampling and normalization constant estimation, leading to remarkable efficiency gains relative to other energy-based methods.
2. The residual design is well-motivated, it allows the model to adapt using minimal, controlled adjustments to the existing parameters, ensuring stability and preserving previously learned knowledge.
3. CRETTA consistently improves both accuracy and ECE across diverse datasets and corruption severities. The method maintains stable calibration even on challenging settings such as TinyImageNet-C and PACS, demonstrating that the proposed residual-energy mechanism contributes to reliable uncertainty estimation rather than merely higher accuracy.
4. The paper is well written and easy to follow, with logical organization and smooth transitions between motivation, method, and experiment.
1. The paper leverages an energy-based formulation for test-time adaptation, yet it remains unclear why energy modeling should be theoretically effective in this context. Could the authors provide more intuition or formal justification for why minimizing or adapting an energy function leads to improved generalization under distribution shift? (A sketch of the standard parameterization I have in mind is given after this list.)
2. While the experiments cover standard corruption and small-to-medium-scale datasets (CIFAR10/100-C, TinyImageNet-C, PACS), the paper does not evaluate CRETTA on larger and more diverse domain generalization datasets such as DomainNet. Validation on such benchmarks would better demonstrate the scalability and robustness of the proposed method under complex, real-world domain shifts.
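To make the first point concrete, under the usual parameterization of energy-based classifiers (an assumption on my part about how CRETTA defines its energy), the two quantities being optimized are
$$E_\theta(x) \;=\; -\log\sum_{y}\exp f_\theta(x)[y], \qquad p_\theta(x) \;\propto\; e^{-E_\theta(x)}, \qquad \text{versus} \qquad H\big(p_\theta(\cdot\,|\,x)\big) \;=\; -\sum_{y}\mathrm{softmax}\big(f_\theta(x)\big)_y \log \mathrm{softmax}\big(f_\theta(x)\big)_y.$$
The former shapes the marginal $p_\theta(x)$ while the latter sharpens the conditional $p_\theta(y\,|\,x)$; an explicit argument for why adapting the marginal improves generalization of the conditional classifier would strengthen the paper.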
Could the authors clarify the fundamental difference between energy-based and entropy-based test-time adaptation methods? Specifically, how does optimizing an energy function over the marginal distribution differ in objective and behavior from minimizing the prediction entropy of y given x? |
Heavily AI-edited |
|
Contrastive Residual Energy Test-time Adaptation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces CRETTA, a sampling-free energy-based framework for Test-Time Adaptation (TTA). Unlike conventional TTA methods that rely on uncertain conditional predictions (e.g., entropy minimization) or costly energy-based sampling, CRETTA focuses on modeling only the residual energy, the discrepancy between the source and target distributions. By embedding this residual energy into a contrastive learning objective, CRETTA removes the need for normalization constant approximation or Markov Chain Monte Carlo (MCMC) sampling, achieving well-calibrated and efficient adaptation.
- The paper introduces a residual energy formulation that redefines TTA as learning only the distributional discrepancy between source and target domains. It is conceptually fresh and removes a reliance on normalization constant approximation.
- The paper is well-structured and readable.
- Experiments are extensive, covering CIFAR10/100-C, TinyImageNet-C, PACS, and ImageNet-C, with consistent improvements in both accuracy and calibration error (ECE).
- Although the paper argues that residual energy learning stabilizes adaptation, its claimed insensitivity to the source buffer suggests that the absolute source energy distribution plays a minor role. This raises the question of whether residual learning is fundamentally required: could similar stability be achieved simply by modulating target energies relative to arbitrary reference energies? (For example, AEA uses low-energy target samples as a source buffer to reduce the source-target energy gap.)
- The ablation studies focus mainly on buffer composition and size; additional analysis, e.g., of temperature sensitivity or other factors in the residual learning, could strengthen understanding of the method's robustness.
- While the paper reduces the computational overhead of energy-based TTA by removing normalization constant estimation, the core idea of using EBMs for TTA follows prior works such as TEA and AEA. The contribution feels incremental, as sampling-free energy optimization has already been actively explored in other domains (e.g., sampling-free EBMs, RLHF, DPO). Moreover, performance gains over those baselines seem to be marginal.
- Evaluation is confined to standard online TTA; results under continual or episodic adaptation are missing, limiting the understanding of CRETTA’s robustness in dynamic environments.
- The ablation using CIFAR10-C with CIFAR100 as a replay buffer is not fully convincing, since the two datasets share similar data distributions and semantics. A more meaningful test would involve substituting with a cross-domain dataset (e.g., PACS or TinyImageNet) or reversing the setup (using CIFAR10 as the buffer for CIFAR100-C).
See the weaknesses section. |
Lightly AI-edited |
|
Contrastive Residual Energy Test-time Adaptation |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents CRETTA, a residual energy-based test-time adaptation framework designed to achieve efficient and well-calibrated adaptation under distribution shifts. Unlike entropy minimization-based methods that depend on unreliable pseudo-labels or energy-based approaches that require costly sampling, CRETTA introduces a residual energy function to model the discrepancy between source and target distributions. By embedding this residual function within a contrastive learning objective, the method removes the need for normalization constant approximation and significantly reduces computational overhead. Experiments across multiple benchmarks, including CIFAR10/100-C, TinyImageNet-C, PACS, and ImageNet-C, show consistent improvements in both accuracy and calibration error, with strong robustness to overfitting and catastrophic forgetting.
1. The paper introduces a residual energy perspective on test-time adaptation, which elegantly models distribution shifts as residual corrections to a pretrained energy landscape. This idea is both conceptually appealing and technically original, offering a clear advance over existing MLE- or entropy-based methods.
2. By eliminating sampling and normalization constant estimation, CRETTA achieves major computational savings (over 6× reduction in GFLOPs compared to TEA) without sacrificing performance, making it practical for real-time or resource-constrained deployment.
3. The experiments are thorough, covering multiple benchmarks and including ablation studies, buffer analysis, and gradual shift scenarios. The results convincingly demonstrate CRETTA’s robustness, calibration quality, and insensitivity to buffer composition.
1. The proposed framework adapts to the marginal distribution $p(x)$ via residual energy modeling, yet classification fundamentally depends on the conditional distribution $p(y|x)$. The paper does not clearly explain how aligning $p(x)$ leads to improved conditional decision boundaries or classification accuracy. Without a theoretical bridge (e.g., via Bayes decomposition or information-theoretic reasoning), the causal link between marginal alignment and better predictive performance remains speculative.
2. Dependence on source data: CRETTA relies on a small source buffer to perform contrastive adaptation. While the buffer can be as small as 1% of the source dataset and even substituted with similar-domain data, this still departs from the strict source-free TTA setting. In privacy-sensitive or memory-limited scenarios, this requirement might constrain the method’s deployment.
3. The contribution of the contrastive component is central to the method, yet there is no targeted ablation isolating its effect from the residual modeling itself. Including such analysis would help clarify whether performance gains stem mainly from contrastive optimization or other architectural factors.
The proposed framework adapts to the marginal distribution $p(x)$ via residual energy modeling, yet classification fundamentally depends on the conditional distribution $p(y|x)$. Could the authors clarify how aligning $p(x)$ contributes to improved conditional decision boundaries and classification accuracy? Is there any theoretical justification (e.g., based on Bayes decomposition or information-theoretic reasoning) for this linkage? |
Fully AI-generated |
|
Contrastive Residual Energy Test-time Adaptation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper aims to improve model robustness under distribution shift without access to labeled source data. The authors argue that existing TTA methods relying on conditional distributions suffer from poor calibration, while energy-based approaches, though label-free, are computationally expensive due to sampling. To address these limitations, the authors propose Contrastive Residual Energy Test-Time Adaptation (CRETTA), which defines a residual energy function over target data and incorporates it into a contrastive objective. An adaptive gradient reweighting mechanism is used to mitigate overfitting and eliminate the need for sampling. Experimental results are reported to show that CRETTA achieves better calibration and efficiency compared to prior TTA methods.
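To ground this summary, below is a minimal sketch of how a residual energy could be fit without any partition-function estimation, assuming a logistic density-ratio (NCE-style) objective between incoming target batches and a small source buffer; the head `r_theta`, the buffer, and the exact loss are my assumptions for illustration and may differ from the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def residual_energy_step(r_theta, target_batch, source_buffer, optimizer):
    """One adaptation step: fit the residual energy r_theta(x) by logistic
    density-ratio estimation between target samples and buffered source samples.
    Only the residual energy enters the logit, so no normalization constant
    has to be approximated or sampled."""
    logits_t = -r_theta(target_batch).squeeze(-1)   # higher logit => "looks like target"
    logits_s = -r_theta(source_buffer).squeeze(-1)

    labels = torch.cat([torch.ones_like(logits_t), torch.zeros_like(logits_s)])
    loss = F.binary_cross_entropy_with_logits(torch.cat([logits_t, logits_s]), labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

If the target density is modeled as $p_t(x)\propto p_s(x)\exp(-r_\theta(x))$, the optimal logit of such a discriminator recovers the log density ratio up to an additive constant, which is presumably why a contrastive formulation can sidestep the sampling that plain maximum-likelihood energy training would require.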
1. Test-time adaptation remains an active and challenging area, and the paper’s focus on calibration and computational efficiency is well-motivated.
2. The integration of residual energy modeling with a contrastive objective is conceptually interesting and may open paths toward energy-efficient adaptation.
1. While the paper presents an interesting reformulation of energy-based adaptation, the overall contribution appears incremental—largely combining existing ideas (energy modeling, contrastive learning, and gradient reweighting) with limited novelty.
2. According to Table 1, the reported improvements are marginal at best and are worse in many cases. In general, the empirical section demonstrates some improvements but does not convincingly establish robustness, scalability, or significant gains over strong baselines.
1. Why does the performance not improve over TEA in Table 1? Why is the performance improvement larger in Table 4?
2. What is the computational cost of CRETTA relative to energy-based methods that rely on sampling?
3. "CRETTA consistently outperforms other methods on most of corruption types in calibration as reported in Table 9." But there is no Table 9, there are only 7 tables. |
Lightly AI-edited |
|
SpintBench: Evaluating LLMs' Complex \\ Reasoning via Spatial Integration Challenges |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a new text-based benchmark for spatial relations, called SpintBench. This benchmark is designed to evaluate a model’s ability to infer global spatial information from given local information. The authors evaluate 17 large language models (LLMs), including both language-only and multimodal models. The results show that while these models can handle recall-based inferences within a single grid in the distance-finding task, they struggle to combine local knowledge into a coherent global understanding of spatial relations.
The paper also presents an investigation into the effects of global size and object count on task accuracy. Finally, the authors perform an error analysis and find that non-thinking models often assume local grids are close to each other and ignore common objects shared between grids. In contrast, thinking models avoid such shortcut assumptions and perform more deliberate reasoning, leading to better performance.
- The proposed benchmark effectively evaluates LLMs’ ability to perform mental reasoning and reconstruct global spatial relations from local information. This addresses an important problem in spatial reasoning, enabling models to infer the structure of a global environment based on partial, local cues.
- The paper provides detailed evaluations and ablation studies on the dataset, covering the key conditions of order vs. shuffle and recall vs. infer. Order vs. shuffle refers to whether the local grids are presented in order or shuffled in the context. Recall vs. infer refers to whether a question can be answered within a single grid or requires merging information across grids.
- The paper evaluates both 2D and 3D settings for this problem. The results are similar across the two domains: models fail to reconstruct the global representation from the given local cues.
- The paper also includes a detailed failure analysis, revealing how non-thinking and thinking models differ in their reasoning strategies when solving spatial relation tasks.
- The paper is well written and provides illustrations that help explain the evaluated task.
- Figure 3, which illustrates model performance across different global sizes for inference and recall questions, may cause confusion, as it can be misinterpreted to suggest that one result represents an improvement over the other. Consider revising or clarifying this visualization to better distinguish the two.
- Although an error analysis is provided, it lacks sufficient detail. Including 2–3 sentences explaining how the analysis was conducted would improve clarity and help ensure the reproducibility of the findings.
- While the task design is interesting, the current evaluation (based on computing Euclidean distance) may be somewhat impractical. A model might still compute the correct distance despite having an incorrect understanding of the layout or swapped object positions (a short worked example of this ambiguity follows this list). Extending the benchmark to include more practical downstream tasks such as relation identification, layout reconstruction, or compositional reasoning would make the evaluation more meaningful.
- The error analysis highlights that models struggle with multi-step reasoning; however, it lacks deeper discussion or examples demonstrating the specific types of reasoning failures. Expanding this section to categorize and illustrate the observed errors would strengthen the paper’s insights into model behavior.
- No initial prompt engineering or methodological approach is proposed to address the shortcomings of LLMs in this task, aside from utilizing CoT, which should be considered a baseline.
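As a short worked example of the distance-metric ambiguity noted above (my illustration, not the paper's): the queried quantity is invariant both to swapping the two endpoints and to any rigid transformation of the model's internal layout,

$$\|a-b\|_2=\|b-a\|_2,\qquad \|(Ra+t)-(Rb+t)\|_2=\|R(a-b)\|_2=\|a-b\|_2$$

for any rotation or reflection $R$ and translation $t$, so a model whose reconstructed map is mirrored, rotated, or has the two queried objects swapped can still be scored as correct.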
- Is there any difference in performance between models trained with language-only data and those trained with multimodal data when interacting with grids or combining local information?
- Is there a specific reason for selecting Euclidean distance as the downstream task over other potential tasks such as relation identification or layout reconstruction? |
Lightly AI-edited |
|
SpintBench: Evaluating LLMs' Complex \\ Reasoning via Spatial Integration Challenges |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces SpintBench, an automatically generated benchmark to evaluate spatial integration reasoning in both 2D and 3D contexts. It extends transitive inference from 1D to higher dimensions, requiring LLMs to synthesize local spatial cues into global configurations. The benchmark evaluates 17 models and explores effects of parameters, spatial density, and prompting methods (CoT, ReAct). The study concludes that even “thinking” models struggle, suggesting persistent limits in spatial reasoning.
1. Novel angle: Evaluating spatial integration reasoning in text-only LLMs is timely and relatively unexplored.
2. Automatic generation pipeline: Offers scalability and resistance to data contamination.
1. Misaligned problem definition (the teaser fails to represent the paper’s intent).
The teaser and introduction claim to study “spatial reasoning,” but the benchmark remains restricted to 2D distance estimation rather than reasoning about 3D spatial relations or physical layouts. Traditional spatial reasoning involves object localization and relational understanding in 3D (e.g., topology, containment, occlusion, perspective), not mere 2D metric inference. The conceptual framing thus misrepresents the task scope and inflates the claimed contribution.
2. Insufficient benchmark scale and unclear data composition.
The dataset is relatively small, and the paper lacks comprehensive statistics or visual summaries of the benchmark. Readers cannot assess the distribution of grid sizes, object counts, or overlap ratios. Without transparent visualization (e.g., histograms or embedding maps), the dataset’s diversity and representativeness remain uncertain.
3. Poor organization and weak narrative flow.
The paper’s exposition is disjointed and difficult to follow. Key concepts are introduced abruptly, and methodological details are scattered across sections without coherent logic. Many paragraphs mix motivation, implementation, and results. Substantial restructuring and language polishing are needed for clarity and readability.
4. Oversimplified evaluation dimension.
The current evaluation focuses only on a single scalar distance metric, which cannot sufficiently reflect a model’s true spatial reasoning ability. Spatial reasoning should involve qualitative relations (left/right/front/behind), multi-object configurations, or relational inference beyond metric computation. The current setting thus evaluates numerical estimation, not reasoning.
I am curious about the true value of the dataset, as I strongly suspect that it may merely overfit to its own format.
If a base model (e.g., Qwen-VL) could be fine-tuned on this dataset and subsequently demonstrate performance gains on other benchmarks (such as VSI-Bench, OmniSpatial and SPACE), I would be much more inclined to recognize the dataset’s contribution. |
Heavily AI-edited |
|
SpintBench: Evaluating LLMs' Complex \\ Reasoning via Spatial Integration Challenges |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
SpintBench introduces a text-only benchmark for spatial integration reasoning in both 2D and 3D, where models reconstruct a global map from overlapping local grids and answer Euclidean distance queries between two objects. The construction is automated: a global m×m grid is populated with n objects, partitioned into k×k local grids with one-row/column overlaps; ordered or shuffled “context stories” list local coordinates. Results show strong difficulty and discriminative power—especially infer-shuffled—where even top “thinking” models perform modestly.
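For concreteness, a toy re-implementation of the described generation pipeline is sketched below; the parameter names, the one-line overlap convention, and the requirement that $m=k+j(k-1)$ for exact coverage are my reading of the summary above, not the authors' released code.

```python
import itertools
import math
import random

def make_instance(m=7, k=4, n=10, shuffle=True, seed=0):
    """Toy generator in the spirit of the described pipeline (not the authors' code):
    place n named objects on an m x m global grid, cover it with k x k local grids
    whose neighbours share one row/column, report positions in local coordinates,
    and ask for the Euclidean distance between two objects.
    For the local grids to tile the global grid exactly, m should equal k + j*(k-1)."""
    rng = random.Random(seed)
    cells = rng.sample([(r, c) for r in range(m) for c in range(m)], n)
    objects = {f"obj{i}": cell for i, cell in enumerate(cells)}

    # Local grids start every (k - 1) rows/columns so adjacent grids overlap by one line.
    starts = list(range(0, m - k + 1, k - 1)) or [0]
    grids = []
    for r0, c0 in itertools.product(starts, starts):
        local = {name: (r - r0, c - c0) for name, (r, c) in objects.items()
                 if r0 <= r < r0 + k and c0 <= c < c0 + k}
        grids.append(local)
    if shuffle:
        rng.shuffle(grids)  # "shuffled" condition: context order is randomized

    a, b = rng.sample(sorted(objects), 2)
    answer = math.dist(objects[a], objects[b])  # ground-truth Euclidean distance
    return grids, (a, b), answer
```

Even this toy version makes the integration demand visible: a cross-grid query can only be answered by chaining the objects that sit on shared rows/columns across local grids, which appears to be the step the benchmark is designed to stress.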
1. The local-to-global integration setup is well motivated and implemented with ordered/shuffled contexts.
2. CoT/few-shot yields modest improvements, with clear prompt templates in the appendix.
1. The paper acknowledges a significant theoretical gap, as it does not provide a proof for the uniqueness or solvability of global reconstruction from local overlaps. This leaves open questions about the reliability and consistency of the model's output.
2. The empirical validation is constrained by a limited inference test size, which consists of only 100 samples. This small scale may not be sufficient to generalize the model's performance and robustness across a wider range of data.
3. The reliance on distance-only queries may limit the ecological validity of the findings. This approach potentially underrepresents other critical spatial reasoning skills, such as orientation and relative positioning, which are integral to comprehensive spatial understanding.
4. The authors assert that the model is resistant to data contamination; however, this claim is not substantiated with empirical data leakage checks or a rigorous theoretical justification. The absence of this validation makes it difficult to assess the true robustness of the model against contaminated training data.
5. While the 3D study includes comparative results against other methods, it offers a limited analysis of key performance factors. Specifically, it lacks a deep dive into how the model's performance is affected by scaling, variations in context length, and overall robustness to different conditions.
Could you broaden the task set to include relation classification, direction/orientation, connectivity/topology, and step-count diagnostics for multi-hop spatial integration?
Would you add algorithmic baselines (graph/constraint solvers) to contextualize LLM performance and validate benchmark difficulty?
For 3D, can you study context length systematically (grid count, overlap degree) and propose scalable summarization or memory mechanisms to mitigate long-text effects? |
Moderately AI-edited |