ICLR 2026 - Reviews

SubmissionsReviews

Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 15899 (21%) 4.43 3.58 3687
Heavily AI-edited 3233 (4%) 4.22 3.59 2990
Moderately AI-edited 7082 (9%) 4.20 3.61 2722
Lightly AI-edited 16648 (22%) 4.15 3.68 2746
Fully human-written 32938 (43%) 4.13 3.62 2917
Total 75800 (100%) 4.21 3.62 3026
Title Ratings Review Text EditLens Prediction
Beyond Accuracy: Measuring Reward Variance as a Predictive Benchmark for RLHF Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper presents the reward variance benchmark. It is designed as an evaluation benchmark to measure not only the pairwise accuracy, but also the quality of predictions via three variance-derived metrics. They use a custom-constructed benchmark of prompts and responses to evaluate 23 existing reward models. For me, the paper is a case of "less would have been more". I like the research problem; the paper shows some interesting ideas. However, I am not convinced by the formulation of metrics; they feel overly complex, and I am skeptical of the positioning of this work as a benchmark. I don't find that the reported results in terms of downstream performance prove a strong predictive capacity of the derived metrics. Instead, I would have preferred framing the paper as an empirical investigation of the relationship between reward variance and performance (which the dataset can be a precursor for). This would allow for avoiding complex wrapped metrics and a more open-ended exploration of the relationship. I think this could provide more insightful results and discussions. + Evaluation benchmarks in general are a great resource for the community + Investigating and evaluating contributing factors to training success in RLHF is a relevant topic + I really appreciate an empirical investigation of the reward variance hypothesis, and its effect on downstream performance, and think this is a big strength of the paper + For most parts, an adequate level of detail for reproducibility + The developed RVB benchmark shows an ability to predict downstream RL performance + A series of interesting experiments, and good reporting of results - The dataset construction process in section 3.1. is not really motivated or evaluated. Why choose this subset of the RMB benchmark? Why use the four temperature settings, and what difference did it actually make? I think there is a description of Appendix A3.4, but it seems disconnected from Eval-Core. I feel that providing the template (A.5) is insufficient. E.g., “we keep the RMB candidates (about 3-6)”. How were they selected? - How were the five candidate metrics (Section 4) designed? What was the process? Was it based on related work? They come a bit out of the blue. I get the overall idea for each, but it contributes to some issues I have: - I find the introduction of new terms for the metrics a bit cumbersome, as this requires memorizing these new terms and hides the relationship to existing concepts. I know that keeping metrics within the range [0-1] is attractive, but it adds complexity: For example, instead of the SEI, directly reporting H(p) would also be comparable across models, and “prediction entropy” would be more comprehensible than SEI (“lower is better” metrics are totally valid in my opinion). I think this relates to my question 1, where I feel that these metrics are potentially interesting to investigate, but not something I want to necessarily optimize for - Similarly, for DCI, which is another custom metric, in my mind, not well motivated, as an aggregation metric - Finally, this is even exacerbated with the composite score in 4.5. At this point, there are so many layers of metrics and normalizations that I find the composite score somewhat incomprehensible except for a vague “higher is better”. I feel that an evaluation benchmark should try to make an effort for generalizable and comprehensible metrics - The results in Figure 4 also coincide with accuracy (the RewardBench score of Tulu is noticeably lower than the other two, which seems to be reflected in the training success). The Skywork model seems to converge even faster, although the composite metric is lower than the URM one. So these results do not fully indicate the predictive performance of the metrics compared to a simple metric like accuracy (you report a relatively high correlation of 0.51). Minor: - I feel line 170 with the listed reward models is missing references for the different families of RMs - “Wikipedia contributions” is an unusual citation; I would prefer a permanent document as a reference. - Figure 3 is difficult to read. I would advise filtering some categories, introducing some highlighting, or choosing a different type of visualization - I am somewhat skeptical of the central motivating hypothesis: Ideally, I want the output distribution of my reward model (or any model for that matter) to reflect an underlying (aleatoric) uncertainty. So while a “sharper” model indeed leads to faster learning speeds, I do not find it obvious that this is more representative of the actual prediction target (i.e., it is well-calibrated). So I am not convinced that a low variance reward model is a metric that should be directly optimized for. A simple way to optimize for this benchmark seems to be to apply an entropy-penalization term during RM training, but I am unsure if this would result in a “better” reward model. So, while the general hypothesis of RM variance as an important factor is supported by peer-reviewed related work, I am somewhat skeptical whether benchmarking RMs for low variance is a desired quality per se. - Have you compared the variance metrics to using the Brier score (I guess you call it the top-gap metric)? Have you run experiments for comparison? I would be really interested in also seeing the predictive capacity of top-gap metrics for downstream performance to better motivate the need for these new metrics - Don’t the results of Chen et al. and Leng et al. point to the risk of overconfidence and even show that models with lower accuracy might lead to better learning? Isn’t low variance a potential sign of overconfidence? Fully human-written
Beyond Accuracy: Measuring Reward Variance as a Predictive Benchmark for RLHF Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper argues that pairwise accuracy, a dominant metric for evaluating reward models (RMs), misses an important dimension: variance in reward scores. Low-variance reward functions create flat optimization landscapes that can slow policy learning. To address this, the authors propose the Reward Variance Benchmark (RVB), a suite of variance-aware metrics for assessing RMs’ “teaching effectiveness.” - Evaluating reward models is a critical challenge in RLHF and preference-based RL. Poor evaluation can lead to slow learning or unsafe learned behaviors. By focusing on metrics beyond pairwise accuracy, the paper tackles an important open problem. - Reward model evaluations focused on reward variance seem to be novel and underexplored in evaluation frameworks. As prior work has found that reward variance affects the optimization landscape, evaluting reward models based on this woud be useful. Framing variance as a “teaching signal” provides a new perspective on reward model utility. - Comparison across 23 models covering multiple families (LLaMA, Qwen, Gemma, Mistral, etc.) adds breadth. **Missing related work:** The paper does not discuss prior reward model evaluation frameworks, such as TAC [1], DARD [2], and EPIC [3], which evaluate reward functions beyond pairwise accuracy. **Eval-Core benchmark:** - The authors describe refining an existing benchmark (RMB 1.9k prompts to 354 prompts) to create Eval-Core. It is unclear why this reduction is useful or what statistical or practical advantage it provides. The contribution seems incremental, as it largely repurposes a preexisting dataset. - The benchmark focuses on “helpfulness,” not on “harmlessness”. Why was this set chosen? Extending RVB to this additional set would strengthen claims. **Metrics:** - The paper claims the proposed metrics (SEI, nGMD, and DCI) provide interpretability, but it is unclear how they do so or how they convey “clear optimization semantics.” - Razin et al (2025) [4] note that the same reward model can induce high reward variance for one language model, yet low reward variance for another. Therefore, different language models can require different reward model teachers. How do these metrics handle that? Will this be an issue for these metrics? Can these metrics be high for a reward model with respect to one language model, but low for the same reward model with respect to a different language model? **Experimental Design Choices + Claims:** - The paper states that three representative RM families were scored, but it is unclear what makes these families “representative.” More justification is needed. - The claim that RVB metrics predict policy performance is weakly supported, as only three reward models were trained. It is difficult to infer a meaningful correlation from such a small sample. - Why not use standard reward variance as a baseline metric? As included in Razin et al (2025) [4], which the authors do discuss. **Resources** - TAC [1]: https://rlj.cs.umass.edu/2025/papers/RLJ_RLC_2025_280.pdf - DARD [2]: https://openreview.net/pdf?id=CALFyKVs87 - EPIC [3]: https://openreview.net/pdf?id=LwEQnp6CYev - Razin et al (2025) [4] https://arxiv.org/pdf/2503.15477 - Are there examples of reward models where the RVB score better predicted policy performance than another metric (e.g., RewardBench Score)? For instance, is there an example of a reward model where another metric incorrectly had a high score, but RVB had a low score, and policy performance was indeed worse? - Can the authors explain how the 23 reward models were chosen? - Can the authors further elaborate on how the SEI, nGMD, and DCI provide interpretability? - Can the authors explain how the proposed metrics are better than standard reward variance with normalization that is done in Razin et al (2025)? Fully human-written
Beyond Accuracy: Measuring Reward Variance as a Predictive Benchmark for RLHF Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper identifies a critical gap in current reward model (RM) evaluation for reinforcement learning from human feedback (RLHF)—namely, the lack of systematic attention to the variance and distributional properties of reward signals. The authors propose the Reward Variance Benchmark (RVB), a comprehensive evaluation suite that introduces three variance-sensitive metrics (SEI, nGMD, DCI) to profile RMs along axes of score concentration, pairwise separation, and cross-prompt stability. Through extensive analysis of 23 popular RMs, with supportive experiments and visualizations, the RVB suite is shown to predict downstream RLHF convergence and select RMs more effectively than accuracy alone. The paper is accompanied by a standardized benchmark data release and reproducible tools. 1 Clear Motivation and Relevance: The paper effectively justifies why accuracy alone is insufficient for RM evaluation, referencing empirical and theoretical findings (Section 2, references to Chen et al., 2024; Razin et al., 2025). The flatness of reward landscapes under low variance, as depicted in Figure 1, directly motivates the shift in perspective. 2 Metric Suite Design: The introduction of the SEI (Softmax-Entropy Index), nGMD (normalized Gini Mean Difference), and DCI (Decision Consistency Index) is mathematically well-founded. The metrics are robust (using median, MAD), interpretable, and explicitly decoupled from accuracy (Section 4). 3 Empirical Validation: The RVB metrics are validated against RLHF convergence rates using multiple models (Figure 4), and variance-based rankings provide new insights beyond accuracy rankings (Table 1). 1 Overlapping Metrics & Composite Score: While the correlation analysis in Appendix A.2.3 mitigates concerns, there is still notable overlap between SEI and nGap (correlation $\rho \approx 0.78$) and partial redundancy with nGMD. The choice of metric aggregation (equal weighting of MAD-z scores in the composite) is somewhat heuristic (Section 4.5). The impact of alternative weighting or selection criteria for the composite is not fully explored, raising the possibility of overfitting to the presented evaluation set. 2 Empirical Baselines: While 23 RMs are evaluated, there is limited discussion of calibration or strong baselines from variance-aware reward modeling (e.g., DGRO, GRPOVI, GVPO), or ablation against RM ensembles that explicitly regularize variance. This omission weakens claims about RVB's broad applicability. Please see weaknesses. Fully AI-generated
Beyond Accuracy: Measuring Reward Variance as a Predictive Benchmark for RLHF Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a framework for evaluating reward models (RMs) used in reinforcement learning from human feedback (RLHF), arguing that pairwise accuracy alone fails to capture how effective an RM is in providing signals to policies. The authors introduce the Reward Variance Benchmark (RVB), which measures distributional characteristics of RMs through three metrics: SEI (Softmax–Entropy Index) for score concentration, nGMD (normalized Gini Mean Difference) for global pairwise separation, and DCI (Decision Consistency Index) for stability across prompts. * The paper suggests a better evaluation framework for reward models, which is an important problem crucial to developing robust reward models. * Experiments in verification (Section 5.4) are quite shallow. Only three policies fine-tuned with RMs are compared with each other. Due to this, it is hard to believe that the analyses provided in the previous sections 5.2 & 5.3 are a decisive factor in RM performance in teaching policies. * The 4 'teaching styles' provided in section 5.2 is not backed by empirical results. To really see if the teaching styles do produce any effect to the fine-tuned policy, some distinctive pattern for each of these fine-tuned policies from different groups of RMs should have been observed * Figure 3 lacks providing useful insights other than the fact that there is variance between task categories for different types of reward models. * Using GPT-4o might benefit certain RMs that are used to the specific generation style of the model. Comparison with other models would be beneficial. * Minor corrections * In page 4, MAD, the citation is pointing to Wikipedia contributors, which is not a proper citation. Please update this with statistics textbooks or relevant papers. * What is the motivation to exclude the 2 metrics from the initial 5 metrics? Fully human-written
Monitoring Decomposition Attacks with Lightweight Sequential Monitors Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The submission produces a dataset ("DecomposedHarm") of highly-effective LLM jailbreaks that rely on decomposing a harmful query into multiple benign-seeming subtasks. Careful experiments show that LLM-based sequential monitors (classifiers of the subtask/prompt sequence) can accurately identify when these subtasks will begin to induce a harmful output. To address the financial cost and latency of such LLM-based monitors, prompt engineering is used to enhance the performance of lightweight LLM sequential monitors (beyond the performances of more sophisticated and expensive models). Notably, even when additional adversarial strategies are added to the decomposition attack, the lightweight sequential monitors robustly discriminate between benign and harmful subtask sequences. **Originality** The work introduces a new dataset (DecomposedHarm), and a simple, practical method for addressing decomposition attacks. **Quality** DecomposedHarm addresses a variety of potential jailbreak settings, from image generation to agentic LLM tasks. Comparisons include strong baselines like Llama Guard, which is greatly outperformed. Adversarial evaluation provides additional evidence of the proposed method's benefits. Decompositions are verified by human reviewers. The limitations are thorough and helpful. **Clarity** The paper is very well written, with clear figures and tables. **Significance** The submission addresses an urgent problem of practical importance. The dataset DecomposedHarm will facilitate future research in this area. The submission has no significant weaknesses. In the limitations section, perhaps emphasize that decomposition is just one of many attack strategies, and decomposition could potentially play a role in building composite attacks (e.g. with genetic algorithms) stronger than those explored in the submission. 1. Line 178: why are the harmful indices the last indices? Couldn't the image become harmful before the last subtask? 2. Did you employ any checks for redundancy in the LLM generated examples that populate DecomposedHarm? A little redundancy is okay, but not if you have overlap between your validation and test sets.  3. Adding a newer model (Gemini 2.5, Sonnet 4.5, and/or GPT 5) as a reference in Table 1 could boost the relevance of the analysis and clarify what the frontier is. Relatedly, a newer compact model (Haiku?) could be a strong candidate for prompt-engineering. 4. In Table 1, GPT 4o without any optimization appears cost effective and performant. It would be interesting to see how the references perform with a prompt (latency and F1). 5. Table 1: The GPT 4o mini F1 does not seem to match Table 6’s.  6. Line 351: there seems to be a typo here – is “o3-mini” supposed to be “GPT 4o”? Fully human-written
Monitoring Decomposition Attacks with Lightweight Sequential Monitors Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper examines decomposition attacks against LLM agents. These involve breaking down malicious intents into subtasks that are safe when taken in isolation and can bypass safety filters. The paper presents DecomposedHarm, a dataset of 4634 human-annotated agent tasks for conducting and evaluating such attacks, and finds that they are up to 87% successful against GPT-4o. The authors also develop a defense using sequential monitoring of LLMs which can detect these attacks with up to 93% success and with low additional latency. 1. The research problem holds significant value, as attacks decomposed through simple prompts remain undetectable by LLMs. This issue exhibits universality, scalability, and urgency, while the novel dataset DecomposedHarm demonstrates high research value. 2. The author innovatively introduces a defense method achieving a maximum defense success rate of 93%, while maintaining robustness under adversarial conditions. 3. DecomposedHarm is an extensive and diverse dataset for studying decomposition attacks, providing clear splits (Table 4) and strong empirical visualization (Figure 2). 4. The authors provides solid quantitative analyses (Figures 2&5, Tables 1–3), consistently showing decomposition sharply reduces refusal rates and generalizes across models and interaction modes. 1. Applying o3-mini, GPT-4o, and Claude-3.7-Sonnet model can reach the highest F1 scores, but the costing problem is also concerning, especially examing cumulative context. 2. The authors only applies in-context learning (prompt engineering) methods to improve the sequential monitors. 3. The ability of defense in the method does not stem from the robustness of the algorithm itself, but rather from the accidental effect of the prompt wording. 4. Although the decomposition attack (Fig. 9) demonstrated in the paper is effective, it is limited to non-adaptive scenarios and single pipelines. 1. How much does monitoring performance drop if the best-performing ICL or CoT prompts are replaced with simpler, less-engineered prompts? Can the authors quantify the risk that monitoring is reliant on specific prompts? 2. If a single attack prompt can be decomposed into many sub-questions, achieving high F1 could bring significant token consumption. How do the authors address this scalability issue? 3. Please report the distribution of harmful metric percentages (harmful metric / total subtask length) and stratify performance reporting by this percentile in test evaluations. Fail to do so may introduce bias in assessments due to the tendency to place harmful steps in advantageous positions. 4. Please evaluate the framework under varying benign-subtask injection rates by fixing the fraction of benign subtasks (e.g., 25%, 50%, and 75% ,benign subtasks / total subtasks). For each setting report F1, cost per task, and average latency. No need for generating new tasks, just pick the tasks that satisfies corresponding ratio already had. If the main concerns have been addressed, I am considering raising my score. Fully human-written
Monitoring Decomposition Attacks with Lightweight Sequential Monitors Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper shows that well aligned models such as GPT-4o and Llama-Guard are vulnerable to decomposition attacks, and are effective across different settings. It proposes that lightweight models can cumulatively monitor the queries of decomposition attacks as they progress, and through careful prompt engineering, outperform much stronger baselines in terms of detection rates and cost. The paper also introduces an extensive new dataset for decomposition attacks, where each task is broken down seemingly benign subtasks. 1. The paper is timely, as this problem is a growing concern. 2. The solution is simple, cheap and effective, beating far more expensive zero-shot baselines. 4. The evaluation is extensive, the metrics used are appropriate and the results are convincing. 3. The dataset will prove very useful to the area going forward as a benchmark. 1. The defense is reliant on an engineered, static prompt to control detection behavior, which raises concerns about adaptive attacks. While the PAIR baseline does give an adaptive attack (where subtasks are made more benign while maintaining semantics), it doesn't include the system prompt as part of the input, which can lead to a suboptimal objective for the adversary. 2. The dataset could more details and descriptions regarding the diversity of tasks (subcategories), length of decomposed prompts, whether the subtasks are independent of each other, etc. 1. How well does the sequential monitor framework perform when the ICL prompt is included as part of PAIR's input? 2. Are there scenarios where prompts can be decomposed into independent subtasks? Adversaries could make singular independent queries to avoid providing a cumulative prompt history to the monitor. Fully human-written
Monitoring Decomposition Attacks with Lightweight Sequential Monitors Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper looks at decomposition attacks in the context of LLM agents. These are attacks where a harmful task is broken up into a sequence of benign steps that are cumulatively harmful. The authors produce datasets that capture this category of jailbreaks and then introduce sequential monitors that can detect decomposition attacks. They demonstrate that prompt optimization of a sequential monitor can block decomposition attacks more effectively than standard filters. ## Contributions * A novel dataset of decomposition attacks * A sequential monitoring method for filtering * Results that indicate their sequential monitoring is feasible with reasonably small monitoring models Overall, the paper deals with an important area of concern: decomposition attacks are an effective attack and mitigations are an important area for further study. As a result, they are studying a problem of interest to the community. The dataset contribution is the most valuable from my perspective, although prior work has defined and demonstrated decomposition attacks, I believe there is no existing dataset of decomposition attacks that allows for comprehensive evaluation or study. Separately, the approach described for defense seems reasonable, and the authors do a good job of balancing concerns for the effectiveness of monitors with the feasibility of deployment. In summary, the paper studies an important problem and provides a reasonable defense proposal. The evaluations indicate that this is a potentially promising direction for the development of new guardrails. The primary weakness of the paper is a heavy level of overclaiming. The introduction and title suggest that it is "surprisingly easy" to defend against these attacks. However, I do not believe the author's evaluations are sufficient to justify such a claim. There are a few specific issues with the framing: * The authors study a variant of the problem where the decomposition attack occurs within a single context. This is easily bypassed if the attacker can carry state over from one context to another. For example, they describe an image generation task to generate a potentially racist propaganda image that violates guidelines. However, it seems like it would be straightforward to split these requests into several contexts and avoid the defense mechanisms proposed. While this is briefly mentioned in the limitations section, it deserves further discussion and motivation for why this particular attack setting is appropriate for study. * While the authors do some study of adaptive attacks, it is not clear that these are representative of how a capable attacker would behave. In particular, 10 attempts by GPT-4.1 is a valuable measurement, but far below the level of resources that a capable attacker could apply to such a system. * Most of the comparisons show performance against o3-mini as an expensive evaluator; they don't show that performance is good enough from an objective risk minimization perspective. As a result, the evidence the authors actually have supports that lightweight sequential classifiers can be optimized through prompt optimization to perform similarly to more expensive classifiers. This is a different and weaker claim. * It seems possible that there is information leakage from the similarity in how the data is generated. There's an attempt to account for this by using in-context examples from one task and exploring how well it transfers to another. I'm not sure what justifies this approach. Similarly, the fact that the examples from the train/validation set are programmatically selected by the same process as the test set means that this approach risks undercounting vulnerability. ## Clarity about evaluations I also have some questions about the evaluation setup overall. * First, in comparison to GPT-4o and o3-mini as a reference, why are the prompts unoptimized? It seems to me that it would be appropriate to compare to an optimized prompt for the reference models as well. This provides an effective limit on performance. By comparing against an unoptimized prompt for the more expensive models, you are stacking the deck in your favor. This makes the analysis of the results misleading: the optimized sequential monitor matches the performance of a more expensive, but unoptimized model. * Second, it's unclear why the authors are comparing 4.1-nano with 4o-mini. This is a bit strange because they are different model sizes and I would expect 4.1 to be more effective as a guardrail model. This raises some questions about the overall choices of which models to compare and why they were chosen over alternatives. * How did you evaluate the quality of your LLM-judge for harmfulness? * How did you select which models to use for your experiments? E.g., why is 4o-mini being compared with 4.1-nano? * Is there a way to evaluate your performance against a better-resourced adversary? * Can you explain why these defenses are interesting, given that we can/should expect decomposition attacks to be executed across different contexts? * Can you clearly and concisely articulate how the evidence in your paper supports the claim that defending against decomposition attacks is 'surprisingly easy'? Fully human-written
Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces TransFIR, a transferable inductive reasoning framework for Temporal Knowledge Graphs that enables effective reasoning about emerging entities by leveraging historical interaction patterns of semantically similar known entities via a codebook-based semantic clustering approach, achieving significant performance gains over baselines in predicting facts involving new entities. 1. Intuitive experimental results clarify the motivation, making the paper easier to understand. 2. A novel codebook-based approach is proposed to address emerging entities in temporal knowledge graphs. 3. Experimental results comprehensively and clearly demonstrate the effectiveness of the proposed method. 1. The related work section omits some recent inductive reasoning methods for temporal knowledge graphs. 2. Lines 157–158 state, “after training, emerging entities deviate sharply from known entities in the embedding space.” Since emerging entities rarely appear in the training set and are updated less frequently, this phenomenon is unsurprising. 3. With BERT-encoded and frozen entity embeddings, the method likely relies on BERT’s semantic encoding to address emerging entities. Ablation results on ICEWS18 in Figure 5 support this. It is recommended to provide additional experiments to further assess the impact of LM on performance. 4. The method depends on having a reliable textual description for each entity to generate initial BERT embeddings . In domains where such text is unavailable, noisy, or ambiguous, the quality of the codebook clustering could degrade significantly, weakening the entire framework. 5. The complexity analysis shows a time complexity of $O(n_t L(k^2d + kd^2))$ for the IC encoder, which could become a bottleneck for graphs with very long interaction histories. See Weakness Lightly AI-edited
Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Existing Temporal Knowledge Graph (TKG) reasoning methods primarily focus on modeling relation dynamics but typically assume a closed entity set. In the real world, new entities are continuously added to the graph but lack historical interaction data, leading to a significant drop in reasoning performance for these entities. TRANSFIR offers a systematic solution to the inductive reasoning problem for emerging entities without historical interactions. It enables transferable temporal reasoning through semantic similarity transfer and a codebook-based classification mechanism, achieving significant progress in both performance and scalability. 1. The paper introduces the concept of semantic similarity transfer, providing an effective solution to prevent representation collapse. 2. Through empirical research, the paper demonstrates the widespread presence of emerging entities in Temporal Knowledge Graphs (TKGs), with approximately 25% of entities being new. The study also shows that existing methods experience a significant performance degradation when handling these emerging entities. This provides strong theoretical and experimental support for the proposed TRANSFIR framework. 1. The evaluation could be more comprehensive. It only includes one large-model-based method, whereas other relevant approaches like ICL [1] and PPT [2] are not considered. 2. Unclear novelty over existing similarity-based approaches. The main innovation of the proposed TRANSFIR framework lies in leveraging the behavioral evolution patterns of similar entities to assist in predicting emerging entities. However, similar approaches already exist — for example, MGESL[3] also considers the similarity between entities and analyzes the behavioral evolution patterns of semantically related entities. Moreover, MGESL discusses both settings where candidate entities are known and unknown. [1] Dong-Ho Lee, Kian Ahrabian, Woojeong Jin, Fred Morstatter, and Jay Pujara. 2023. Temporal Knowledge Graph Forecasting Without Knowledge Using In-Context Learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 544–557, Singapore. Association for Computational Linguistics. [2] Wenjie Xu, Ben Liu, Miao Peng, Xu Jia, and Min Peng. 2023. Pre-trained Language Model with Prompts for Temporal Knowledge Graph Completion. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7790–7803, Toronto, Canada. Association for Computational Linguistics. [3] Shi Mingcong, Chunjiang Zhu, Detian Zhang, Shiting Wen, and Li Qing. 2024. Multi-Granularity History and Entity Similarity Learning for Temporal Knowledge Graph Reasoning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5232–5243, Miami, Florida, USA. Association for Computational Linguistics. 1. In the ablation experiment on the GDELT dataset, is the performance without the textual encoding module better than TransFIR? This is difficult to determine from the figure. If the performance without the textual encoding module is better than TransFIR, what could explain this result? 2. How does TRANSFIR fundamentally differ from existing similarity-based models such as MGESL[3]? Would including MGESL[3] in the experimental comparison change the relative performance ranking of TRANSFIR? Lightly AI-edited
Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes TRANSFIR, an inductive reasoning framework for temporal knowledge graphs, designed to handle emerging entities that appear without historical interactions. The authors first conduct an empirical investigation showing that approximately 25% of entities in common TKG benchmarks are unseen during training, leading to severe performance degradation and representation collapse. To address this, TRANSFIR introduces a Classification–Representation–Generalization pipeline: 1. Codebook Mapping via a learnable vector-quantized (VQ) codebook that clusters entities into latent semantic categories, even for unseen ones. 2. Interaction Chain Encoding, which models temporal dynamics as ordered interaction sequences instead of unordered neighborhoods. 3. Pattern Transfer, which propagates learned temporal dynamics within semantic clusters, preventing collapse and enabling inductive generalization. Experiments across four standard datasets (ICEWS14, ICEWS18, ICEWS05-15, and GDELT) demonstrate significant performance improvements (average +28.6% MRR) compared to strong baselines such as LogCL, REGCN, and InGram. Ablation, sensitivity, and visualization analyses confirm the contribution of each component and show how TRANSFIR prevents embedding degeneration. The paper provides theoretical motivation, detailed methodology, and strong empirical validation. 1. Clear Problem Definition and Motivation The paper explicitly defines inductive reasoning on emerging entities — a setting rarely formalized before. The authors provide convincing empirical evidence that around one-quarter of TKG entities lack training interactions, motivating the need for inductive treatment. This establishes a meaningful gap between existing “closed-world” assumptions and real-world scenarios. 2. Well-Designed Methodology The Classification–Representation–Generalization pipeline is logically structured and technically coherent. Each stage (codebook clustering, interaction chain encoding, and pattern transfer) addresses a distinct aspect of the emerging-entity problem: type-level priors, temporal dynamics, and generalization. 3. Empirical Rigor and Breadth The experimental setup is comprehensive: four datasets, multiple categories of baselines (graph-based, path-based, inductive), and both strict Emerging and relaxed Unknown evaluation settings. Quantitative improvements and stable results across hyperparameters demonstrate robustness. 4. Insightful Analysis and Visualization The inclusion of t-SNE visualizations and the quantitative Collapse Ratio metric provides clear evidence that TRANSFIR effectively mitigates representation collapse. The cluster case study concretely illustrates transferable reasoning patterns. 5. Clarity and Organization The writing is technically clear, equations are well formatted, and the pipeline diagram helps convey the overall structure. The ablation and sensitivity analyses provide transparency regarding the influence of each module and hyperparameter. 1. Limited Theoretical Explanation of Codebook Semantics The VQ-based codebook serves as the foundation for TRANSFIR’s semantic generalization, yet the paper offers limited theoretical or empirical analysis of what these latent clusters truly capture. Beyond a few illustrative examples, there is no quantitative assessment of the semantic coherence or stability of the learned clusters. It remains unclear whether the grouping behavior arises from shared linguistic semantics, co-occurrence frequency, or inductive biases in the embedding space. A more explicit discussion of how the codebook representation links to underlying entity semantics would strengthen the interpretability claim. 2. Incomplete Scalability and Efficiency Evaluation Although Appendix D.3 presents an asymptotic complexity discussion, the main text lacks direct empirical comparisons of runtime and memory usage with strong baselines such as REGCN and LogCL. Given that TRANSFIR integrates multiple computational stages—including codebook updating, transformer-based interaction encoding, and intra-cluster pattern propagation—a detailed runtime profile and resource breakdown on large-scale datasets would be valuable for assessing its real-world feasibility and computational efficiency. 3. Sensitivity to Textual Initialization and Encoder Choice The model initializes entity representations using fixed BERT-based textual embeddings, yet the influence of these pretrained representations is not examined. The paper does not analyze whether the model’s performance depends on the semantic quality of textual inputs, nor whether substituting alternative or domain-specific encoders would change outcomes. Since the codebook mapping step relies heavily on the textual embedding space, understanding this dependency is important for assessing generalization across domains or datasets with varying textual richness. 4. Limited Exploration of Temporal Chain Configuration The Interaction Chain length parameter defines the temporal window used for reasoning, but the paper provides minimal empirical or theoretical discussion on its effect. The impact of varying chain length on information propagation, noise accumulation, and temporal dependency modeling remains underexplored. A systematic analysis of how chain truncation influences accuracy and stability across datasets would clarify how TRANSFIR balances temporal coverage with computational overhead. 5. Absence of Detailed Error and Failure Case Analysis The qualitative examples focus on successful transfer cases and reduced collapse, but the paper omits analysis of failure conditions. Instances where semantic clusters merge unrelated entity types or where temporal transfer fails due to inconsistent interaction histories are not discussed. Identifying and characterizing such failure modes—especially on heterogeneous datasets like GDELT—would provide important diagnostic insights and demonstrate a more complete understanding of model behavior. 1. Causal Path Discovery Assumptions The paper defines causal path discovery as the foundation of CausER’s reasoning process, but the assumptions that guarantee the validity of discovered causal paths remain implicit. Could the authors specify under what structural or temporal conditions the learned paths can be regarded as causally valid rather than correlational? Clarifying how the model ensures causal sufficiency and mechanism stability in multi-relational temporal graphs would help readers understand the theoretical boundary of the proposed intervention objective. 2. Identifiability and Theoretical Guarantees The theoretical section presents an identifiable counterfactual objective but does not detail how identifiability is maintained under partially observed temporal data. Are there specific assumptions—such as temporal faithfulness or stable mechanism transitions—that must hold for the causal estimator to remain unbiased? A more explicit discussion of these conditions and their relation to the structural causal model defined in Section 3.2 would strengthen the theoretical contribution. 3. Causal Path Generator Efficiency and Scalability The causal path generator explores multi-hop relational paths using differentiable interventions, which can be computationally intensive on dense graphs. Could the authors provide empirical runtime and memory profiles for this module on larger datasets such as GDELT? Including a quantitative comparison with baselines in terms of cost per epoch or per sample would clarify whether the causal discovery process scales efficiently to real-world graph sizes. 4. Effect and Behavior of the Counterfactual Regularizer The counterfactual regularizer is presented as a key mechanism that improves robustness to temporal confounding, yet its operational behavior is described qualitatively. Could the authors further explain how this regularizer alters the score distribution during training? For instance, how does it affect the relative weighting of causal versus spurious temporal correlations over epochs? More detailed training dynamics or representative examples would make its impact on model behavior clearer. 5. Evaluation Protocol and Emerging Entity Setting The paper emphasizes inductive generalization to unseen entities and uses chronological splits for evaluation. Could the authors clarify whether the evaluation explicitly separates emerging entities from known ones and whether metrics are reported both for emerging and overall subsets? Such clarification would allow more precise comparison with other inductive temporal reasoning frameworks and highlight how CausER handles first-appearance nodes. Fully AI-generated
Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a novel framework tailored to the unseen entity link prediction in temporal knowledge graphs. The authors observed that entities sharing similar semantics often have comparable interaction histories and interaction patterns. Inspired by this, the authors propose the TransFIR framework that uses the semantically similar known entities to augment the unseen entity reasoning, where a codebook-based classifier is used to map entities to semantic clusters, and the semantics of unseen entities will be augmented by other entities within the cluster. Extensive experimental results showcasing the effectiveness of the proposed method. S1. The paper is well-written and easy to follow S2. Learning the reasoning strategy for emerging entities is a challenging and valuable direction in the field of temporal knowledge graphs. S3. Technical details of the proposed framework are well-motivated and justified. S4. Extensive experimental results are provided, offering a comprehensive understanding of the model performance. W1. The current framework assumes static cluster assignments for entities after training. However, in reality, entity semantics often evolve over time, leading to potential shifts in their associated clusters. This inherent limitation is likely to impair the model's performance in long-term prediction scenarios, where semantic changes can become more pronounced. W2. Under the open-world assumption, emerging entities may belong to entirely new categories that exhibit no discernible similarities to existing ones. It is therefore worthwhile to examine how the framework performs in handling such entities. None Lightly AI-edited
EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces EVALUESTEER, a synthetic benchmark probing whether LLM-based reward models can steer to user value and stylistic profiles. Using WVS-derived value statements and controlled style transforms, the authors evaluate six RMs across 11 prompting setups, finding style-over-substance biases and ≤75% accuracy with full context. 1. Steerability to pluralistic values and styles is underexplored for reward models; the paper targets an important gap beyond generic reward benchmarks. 2. Orthogonal manipulation of four value dimensions and four stylistic families with six pairwise comparison regimes is well designed for isolating effects. 3. Large evaluation (165,888 pairs), explicit prompting conditions (11), and an “oracle” filtration step make the pipeline easy to reason about; human validation adds rigor. 1. Completions are generated with GPT-4o and filtered by a GPT-4.1 oracle, while GPT-4.1-Mini is among evaluated judges. This tight coupling risks self-agreement bias and overestimates generalization. Suggestions: Use a disjoint oracle family (e.g., non-OpenAI) or a committee oracle with disagreement filtering; report results when the example pool is filtered by each of {OpenAI, Google, Meta, Alibaba} or by majority vote across families. --- 2. Even though you include a 4-vs-4 ablation, the main condition gives ≈200 value statements vs 4 style statements. This still encourages attention to surface style cues. Suggestions: (1) Add a profile-summarization condition: top-k relevant value sentences via retrieval; (2) a matched-token-budget condition; (3) require RMs to cite which profile sentences justify their choice to check relevance sensitivity. --- 3. While the synthetic design isolates factors, it may not reflect messy real conversations where values are implicit and style/value signals co-occur with other attributes (domain knowledge, safety, politeness norms). Suggestions: Include a human-in-the-loop subset: have annotators with known WVS-like profiles express preferences over the same pairs; report agreement and calibration vs EVALUESTEER labels. --- 4. “Readability,” “verbosity,” and “confidence” can correlate (longer → harder; confident → fewer hedges → sometimes simpler). Current stylometrics are univariate proxies. Suggestions: Provide multivariate separability analyses (e.g., logistic regression predicting each style label using all stylometric features), and release per-sample metrics distributions showing minimal cross-loading. See Weakness Part Fully AI-generated
EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a benchmark dataset to evaluate RM's steerability against user value and stylistic preferences. The authors evaluated a set of LLM judges based RMs and found that they manifested noticeable gaps against Oracle labels. * This paper studies an important problem on LLM alignment against user value and stylistic preferences. * The construction of user profiles and values draw from real human data from World Values Survey. * A diverse range of reward models are evaluated (based on models of varied sizes/capacity). * When it comes to user value and preference assessments, having human in the loop is critical and necessary. The dataset construction has no human involvement, and it's unclear to what extent the generated pairs align with what people would prefer in practice. The fact that the authors use GPT-4o filter as the "oracle" basically caps the performance of human preference alignment by 4o's inherent capabilities. Having human validation will make the datasets much more credible and convincing. * The main results presented by the paper (Figure 2) are confounded by the fact that 4o is the oracle. It is not surprising that models smaller than 4o didn't match the validation by 4o. This is less about human value alignment but more about aligning against the judgements made by 4o. * The evaluation was only done on prompting-based LLM judges. When it comes to RM, it's important to test whether the introduced dataset can be used to improve RM through finetuning. Please see points in weaknesses Fully human-written
EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces EVALUESTEER, a benchmark designed to evaluate reward model (RM) steerability towards users’ value and style preferences. Built on the World Values Survey (WVS) and enriched with synthetic stylistic variations, EVALUESTEER tests how well reward models align model outputs with user profiles reflecting diverse human values and communication styles. The synthesized benchmark comprises 165,888 preference pairs spanning four value dimensions (traditional, secular-rational, survival, and self-expression) and four stylistic dimensions (verbosity, readability, confidence, and warmth). Six RMs, including GPT-4.1-Mini and Skywork variants, are evaluated under 11 prompting conditions. Key findings include: * Even with full context, top models achieve only ~75% accuracy, leaving a 25% gap to the oracle setting. * RMs show secular and self-expression biases on the Inglehart–Welzel value map. * Most RMs exhibit stylistic bias toward verbosity and confidence. * When values and styles conflict, RMs favor style over values ("style over substance" bias). 1. Overall, the paper is easy to follow, and the findings are clearly presented. 2. EVALUESTEER fills a research gap by systematically measuring RM steerability to user-specific values and styles. The main weakness of the paper lies in its quality, especially the synthesized benchmark and the evaluated model sizes. To support the claims made in the paper, a manual check and more evaluations across different model sizes are needed. 1. [Quality] The samples in the benchmark are synthesized by GPT-4o, which raises concerns about their quality. For example, it is possible that GPT-4o may produce similar responses for opposites of certain types of WVS statements. In that case, the small gap in different values would not be caused by the inherent abilities of the evaluated models but by the benchmark itself. I strongly recommend conducting a manual check of the proposed benchmark, and if that is too difficult, consider reducing the number of samples in the benchmark. 2. [Quality] While validation metrics are reported, the paper lacks direct human benchmarking, e.g., RM performance vs. human annotators on identical tasks. 3. [Significance, Quality] All tested models are small (-mini/-flash or <10B), which affects the generality of the conclusions when extending to modern large-sized models. For instance, the 25% gap in accuracy may simply be mitigated by increasing model size. To address this concern, experiments on larger models are needed, such as Gemini-2.5-Pro, GPT-4.1/5, or DeepSeek-R1 (671B). 4. [Novelty] Several existing works [1][2] also evaluate LLM steerability. It is worth comparing, or at least discussing, the main differences between EVALUESTEER and these works. 5. [Clarity, Quality] * It is strongly recommended to include a figure illustrating the overall synthesis pipeline. * The font size in Figure 3 should be enlarged, especially the text in the right subfigure. * Table 2: "pm" -> $\pm$ * Figure 9: Some numbers extend outside the box. * Line 1112: ">=" -> "$\ge$" ### References [1]: Chen, Kai, et al. "STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models." EMNLP 2025. [2]: Chang, Trenton, et al. "Measuring steerability in large language models." Neurips Safe Generative AI Workshop 2024. 2024. * I am wondering whether this style bias can be mitigated as model size increases, similar to a "scaling law" in model steerability. * I am wondering which version of GPT-4o is used. * In Section D, a small portion of the synthesized data is manually validated. I am curious about the background of the annotators (the PhD and MS students), in particular: - 1) Whether they are experts familiar with different inter-country values. - 2) Whether they are authors of the paper and may have biases or conflicts of interest in the rating results. Fully human-written
G-Verifier: Geometric Verifier for Robust 3D Point Cloud Semantic Search with Spatial Relation Reasoning Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces G-Verifier, a module to encode explicit spatial relations between objects to re-rank the object proposals (candidates) in 3DVG. G-Verifier outputs RoSE features for each candidate and applies contrastive learning to implement the re-ranking. This paper also introduces 3D-SpAn, a large-scale 3DVG dataset with structured explicit spatial relationship annotations, to train and evaluate G-Verifier. Experimental results show that adding G-Verifier improves the backbone model (Grounded 3D LLM [a]). a. [Grounded 3D-LLM with Referent Tokens](https://arxiv.org/abs/2405.10370) 1. This paper is well-structured. 2. Body texts are easy to follow. 3. The proposed 3D-SpAn dataset adds value to the 3DVG community. 4. Disentangling semantics and spatial relations processing can be a good attempt. 1. Many typos, missing spaces, and incorrect citation formats, such as "F1-score(0.96)" in line 028, "...geometry. (Xu et al. (2024b))." in line 070, and "(MLP) Haykin (1994) to" in line 302. 2. Over-squeezed spaces between images and body text. Figure 1 & 2 abuse \vspace. 3. In line 157, the authors misrefer to Figure 6 in the appendix, and Figures 2 & 6 are the same. 4. Vigor [a] and CoT3DRef [b] should be discussed because they have been attempts to decouple the semantic and geometric information of 3DVG, which is very relevant to your core argument. 5. G-Verifier and the proposed 3D-SpAn dataset seem to target a simplified scenario where object relations are a pre-defined closed set, like SR3D [c]. However, real-world scenarios may be users verbalizing natural, lengthy descriptions without fixed templates, as NR3D [c]. It is unclear to me how G-Verifier will tackle such natural descriptions. 6. G-Verifier is a second-stage refinement on another model's outputs—Grounded 3D-LLM itself has had the ability to propose object candidates and select the most possible one. That is, G-Verifier is a "plug-in" to backbone 3DVG models, analogous to CoT3DRef. In this case, G-Verifier should be evaluated on several backbone models to verify its general benefits. The only experiment on performance improvement is on Grounded 3D-LLM [d] (Table 2), where numbers seem not significant, given that the extra stage and parameters are used. 7. The novelty (settings, engineering techniques, and insights) of G-Verifier is limited. G-Verifier merges explicit spatial info into the last process, while the main reason why previous works did implicit learning is because they consider scenarios where spatial info is complex and descriptions have open forms. a. [Data-Efficient 3D Visual Grounding via Order-Aware Referring](https://arxiv.org/pdf/2403.16539) b. [Chain-of-Thoughts Data-Efficient 3D Visual Grounding](https://arxiv.org/abs/2310.06214) c. [ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes](https://github.com/referit3d/referit3d) d. [Grounded 3D-LLM with Referent Tokens](https://arxiv.org/abs/2405.10370) I may raise my score if authors address the above weaknesses. My major concerns include limited novelty, simplified settings, and insufficient experiments. Fully human-written
G-Verifier: Geometric Verifier for Robust 3D Point Cloud Semantic Search with Spatial Relation Reasoning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces G-Verifier, a geometric verifier and re-ranker designed for post-processing grounding results. To train this module, the authors constructed a large-scale 3D spatial relation annotation dataset, 3D-SpAn, and employed a specific language alignment training strategy. Experimental results indicate that the G-Verifier module possesses discriminative capabilities. When integrated into the end-to-end pipeline, it significantly enhances the baseline model's localization accuracy while maintaining high stability. However, as a post-processing module, the paper lacks sufficient experiments demonstrating the method's generality and efficiency. 1. This paper designed a post-hoc re-ranker, G-Verifier, which can be conveniently integrated into existing 3DVG frameworks. 2. The paper construct the large-scale 3D-SpAn dataset, featuring structured spatial relation annotations, holds significant value for relevant research within the community. 3. The results in the paper clearly demonstrate G-Verifier's effectiveness in enhancing spatial relation reasoning robustness, even when compared against a strong baseline model. 1. The paper primarily compares against the authors' re-implementation of the Grounded 3D-LLM baseline. While strong, it lacks direct comparison with the results reported in the original Grounded 3D-LLM paper or other recent SOTA 3DVG methods that also focus on spatial relations. 2. G-Verifier functions as a post-processing step, introducing additional computation. The paper does not explicitly analyze the resulting increase in inference time or computational resource requirements. 3. The authors acknowledge that using "inverse querying" to generate anchor pseudo-labels may introduce noise. Further discussion or quantification of this noise's impact on model training, final performance, and the model's robustness to such noise is recommended. 4. The rationale behind the chosen feature fusion method within ROSE could be more clearly articulated. Additionally, the specific advantages of 3D RoPE compared to alternative 3D relative position encoding methods could be further elaborated. 5. Does the method merely overfit to the authors' proposed dataset? Can this post-processing approach improve the model's general grounding capabilities? Although this is a post-processing method, its overall simplicity and seamless integration with other methods are commendable. Therefore, if the authors can address these concerns, I will consider increasing my score. see in weakness Lightly AI-edited
G-Verifier: Geometric Verifier for Robust 3D Point Cloud Semantic Search with Spatial Relation Reasoning Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes G-Verifier, a post-hoc geometric verification module intended to improve spatial reasoning in 3D visual grounding. The design follows a Propose, Select, Verify pipeline, where the final verification stage re-ranks candidate objects using a newly proposed RoSE relational embedding. Experiments show moderate improvements on a constructed benchmark. - The paper is clearly written and motivated by well-known weaknesses of current 3DVG/3D-LLM models. - The re-ranking idea is conceptually intuitive and easy to integrate into existing systems. - The new dataset appears reasonably large and might be useful to the community. - **Lack of Novelty**: The core idea, introducing a post-hoc re-ranking module using relation embeddings, is well-established in both 2D grounding and classical IR-style pipelines. Most components (contrastive alignment, RoPE encoding, relation-type embeddings, weighted score fusion) are direct adaptations of existing methods. The paper does not demonstrate what is fundamentally new beyond recombining known elements. If the authors present the dataset contribution as their primary contribution, they should emphasize the associated analysis and insights instead of stressing methodological novelty. - **Confusing Experimental Design**: A critical issue is that the paper does not include comparisons against any existing 3D visual grounding or spatial reasoning baselines. The experiments only evaluate variants of the proposed method on a custom dataset, which makes it impossible to assess contribution or practical significance. - **Marginal Practical Impact**: The reported improvements are small (e.g., +2.5% Acc@0.50) and limited to a curated evaluation split rather than standard benchmarks. There is no evidence that the proposed verifier improves generalization in broader or real-world conditions. - **Insufficient Detail in Data**: The model relies heavily on the newly constructed 3D-SpAn dataset. However: The pseudo-labeling process is under-specified. The dataset is not benchmarked against alternatives. It is unclear whether the module performs well without this dataset, limiting reproducibility and applicability. - **Limited Insight Into Failure Cases**: The analysis focuses on overall performance but does not sufficiently explore when and why the verifier fails. Certain relation types show no improvement (e.g., behind, right of), suggesting that the approach may not truly address viewpoint-dependent spatial reasoning, despite claiming to. - **Conceptual Gap in the Claimed Contribution**: The paper’s thesis is that decoupling semantic and geometric reasoning is necessary. However, the verifier still depends on semantic embeddings and does not provide insight into how “geometric reasoning” is being meaningfully separated rather than just appended. Please see Weaknesses Fully AI-generated
IDSPACE: A Model-Guided Synthetic Identity Document Generation Framework and Dataset Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper proposes IDSPACE, a framework to generate synthetic ID documents for training and evaluating document-authentication systems. It separates metadata (like name, country, age, background, etc.) from control parameters (rendering/tamper effects) and then uses Bayesian Optimization (BO) to tune those parameters so that synthetic documents “behave like real ones” according to perceptual similarity (SSIM) and model-prediction consistency. They also release a big dataset with about 180k template and 180k scanned images of 10 European IDs. The paper is fairly well written and clear about its pipeline. The authors clearly put a lot of engineering work into the rendering system and dataset. Splitting “metadata” and “control parameters” is tidy, I can see how that helps in producing balanced or controlled sets (say, same ID but different background or lighting). Also, if someone genuinely can’t share real ID images due to privacy, a synthetic set could be convenient for testing. I’ll be blunt: the problem this paper solves doesn’t feel significant enough for a research venue like ICLR. It reads more like a software engineering or product demo paper. 1. **The core problem is weak.** The whole paper is built around “checking whether an ID image *looks* real or fake.” But modern ID verification systems don’t rely on that alone, they combine **live selfie + liveness**, **OCR + database checks**, and **cryptographic MRZ/barcode verification**. Once you have those, this “visual authenticity” check adds very little. In many cases, you can just match the text and face to the DB, and you’re done. 2. **Over-selling small tricks.** The “two key innovations” from the abstract: (1) separating metadata/control parameters, and (2) optimizing parameters with Bayesian Optimization, sound fancy, but in practice, that’s just normal pipeline design plus Optuna-style tuning. There’s no conceptual novelty or learning contribution. It’s mostly sugar coating over standard components. 3. **Feels like a tool demo, not research.** Lines like “IDSPACE enables model-guided synthesis” (Section 3) make it sound like a new ML idea, but it’s really a parameter tuning loop wrapped around a rendering engine. This could be a great engineering blog post or experiment, but not a strong scientific contribution. 5. **Synthetic is not real, and it won’t generalize.** They assume synthetic ID tampering detection can train robust models. But this kind of model tends to fail on **unseen manipulation algorithms**, exactly what attackers will do. So a model trained on their dataset could collapse the moment a new generator or Photoshop trick appears. Honestly, I’m struggling to see how this work could be improved without rethinking it from the ground up. The current motivation and problem framing (“synthetic image–based document authenticity”) feel too narrow and disconnected from how real-world ID verification is actually done. That said, here are the only questions that might change my mind if addressed clearly: 1. **Practical necessity:** Can the authors explain concrete scenarios where document-image authenticity still matters *even when* OCR → database checks, chip/MRZ verification, and selfie+liveness pipelines exist? Who exactly would use this, and why is synthetic doc-auth better than existing data-augmentation or template-rendering pipelines? 2. **Attack realism:** Most real fraud uses *real* faces and text but manipulates fields, print–scan loops, or replays. Why would a model trained on synthetic tamper patterns generalize to such attacks? Have you tested unseen manipulations (e.g., new generator, Photoshop edits, reprint attacks)? 5. **Novelty and scope:** What prevents this from being just “Optuna on top of a rendering engine”? The two main “innovations” (metadata separation + BO search) don’t seem new. If you disagree, please make the distinction explicit—what’s the scientific contribution beyond engineering? Fully AI-generated
IDSPACE: A Model-Guided Synthetic Identity Document Generation Framework and Dataset Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces IDSPACE, a novel framework for generating high-quality, customizable synthetic identity documents, accompanied by a substantial open-source dataset. The core strength lies in its ability to produce diverse and realistic synthetic IDs (spanning various countries, ethnicities, and environmental backgrounds) while maintaining high fidelity to target domain characteristics. This is achieved through a unique model-guided Bayesian optimization approach that simultaneously ensures both visual similarity and model prediction consistency, even in few-shot scenarios, outperforming existing GAN-based methods like StyleGAN. The authors also demonstrate the utility of this data for training fraud detection models and validate the "realism" of the generated data using large language models like GPT-4. 1. The most crucial contribution is the provision of a customizable, large-scale, and open-source synthetic identity document dataset. This directly tackles the persistent challenge of data scarcity and privacy concerns in identity document fraud detection, offering a valuable resource for the research community. The framework's ability to generate diverse synthetic ID documents based on customizable parameters (e.g., different countries, ethnicities, and environmental backgrounds) is a major strength. This flexibility is essential for comprehensive benchmarking and evaluating fraud detection models across varied real-world conditions. 2. The use of model-guided Bayesian optimization, integrating both image similarity (SSIM) and model prediction consistency, is highly effective. This ensures that generated documents, while allowing for detail variations, maintain overall semantic consistency (e.g., preserving user identity) with the target domain. This sophisticated alignment mechanism is key to the data's utility. 3. The paper provides extensive empirical evidence for the effectiveness of the generated data in training fraud detection models. The comprehensive comparisons, particularly highlighting IDSPACE's superior consistency even with few samples compared to GAN-based baselines like StyleGAN, underscore the method's robustness and practical advantage. The innovative use of large models (e.g., GPT-4) to verify the "realism" or stealthiness of the generated data adds a unique and compelling layer of validation, demonstrating that the synthetic outputs are difficult for advanced AI to distinguish from real ones. 4. The demonstrated capability to produce high-consistency synthetic data from a minimal number of original samples is a critical advantage, making the framework cost-effective and applicable in privacy-sensitive, low-resource settings. 1. Diffusion Model Exploration: The paper does not explicitly discuss or compare the performance of IDSPACE against state-of-the-art diffusion models for synthetic identity document generation. Given the recent advancements and high fidelity of diffusion models in image synthesis, this omission leaves a gap in understanding IDSPACE's position relative to this powerful generative paradigm. 2. Generative Fraud Paradox and Control Mechanisms: A significant concern arises from the method's ability to generate highly realistic, new types of fake IDs (beyond traditional crop-and-move or inpainting). If fraud detection models are trained on IDSPACE's non-fraudulent synthetic data, they might inadvertently learn to recognize and potentially ignore the "generative artifacts" as legitimate, making them vulnerable to this new class of generative fraud. The paper does not address how models trained on IDSPACE data would detect such "generative" forged documents, nor does it propose control mechanisms to prevent the misuse of the framework for creating undetectable forgeries (e.g., when generating IDs with fabricated personal information). This ethical and practical dilemma warrants deeper discussion and potential solutions. 3. Dynamic Scene Generation: The current framework focuses on static image generation. However, in real-world remote identity verification, documents are often captured in dynamic video streams, involving subtle movements, varying lighting, and interactions with the environment. The absence of functionality to synthesize identity documents within such dynamic scenarios limits the full applicability of the dataset for training and evaluating models in more challenging, realistic conditions. See the part of weaknesses Fully AI-generated
IDSPACE: A Model-Guided Synthetic Identity Document Generation Framework and Dataset Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces IDSPACE, a framework for generating synthetic ID documents from a small number of real samples. The authors tune the generation parameters (such as noise, font, and blur) for generating synthetic documents using a Bayesian Optimization method to determine how the generation parameters would need to be adjusted till the guide model produces the same prediction for real and synthetic documents, and utilize these parameters to generate synthetic data. The authors demonstrate that models trained on the IDSpace dataset generalize better than existing methods. The paper uses Bayesian Optimization (BO) for targeted update of synthetic data generation parameters based on the performance of the guide model. The generation parameters are controllable and tuned to adapt to cases when guide models fail, and the experiments show that the data generated in this method helps the detector models generalize better. The weaknesses of the guide models can potentially give interesting insight into which set of parameters is difficult to predict for which type of model. The data generation method is limited by the expressiveness of the control parameter space (fonts, blur, noise, etc). IDNet[1] already utilized Bayesian Optimization to tune parameters to generate synthetic ID documents using SSIM, and the primary contribution of this work is adding model-consistency to the objective, which makes it an important but incremental improvement. [1] L. Xie et al., "IDNet: A Novel Identity Document Dataset via Few-Shot and Quality-Driven Synthetic Data Generation," 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 2024, pp. 2244-2253, doi: 10.1109/BigData62323.2024.10825017. 1. In the case when the target document contains features that cannot be represented within the space (such as a unique printing artifact), how would the BO behave? Would it try to find a “shortcut” by introducing non-realistic artifacts? 2. Do the changes in the generation parameters by BO indicate any specific pattern? For example, do they change any specific generation parameter more frequently, or are there any particular changes specific to a single model? 3. Training the guide model is a stochastic process that can be affected by the model seed. How robust are the generation parameters to these variations of guide models? For example, if the same model were to be trained on different seeds, would the resultant generation parameters from them converge, or would they vary significantly? Fully human-written
IDSPACE: A Model-Guided Synthetic Identity Document Generation Framework and Dataset Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper presents IDSPACE, which is a system for producing realistic synthetic identity document images tailored to a specific domain using few-shot learning. The framework does document creation as a combination of controllable parameters and high-level metadata. It applies Bayesian Optimization to automatically tune these parameters so that the resulting synthetic images match the appearance and prediction patterns of real target documents. This functions as a flexible data generator so users can define desired document attributes, and the system automatically aligns low-level visual properties to the target domain. The optimization process is guided by pre-trained fraud detection models to ensure that both visual fidelity and semantic consistency are achieved. The authors claim the experiments show that IDSPACE produces synthetic data with higher realism and cross-model consistency than baselines. The authors also released a synthetic ID dataset. 1. The paper tackles a real problem regarding public ID datasets. The main contribution IDSPACE provides is creating a tunable method that can flexibly generate IDs based on user inputs, which would be very useful for controlled and targeted analysis of existing fraud-detection models' capabilities. Furthermore, IDSPACE releases a synthetic dataset that can be utilized for fraud detection tasks. 2. Allowing users to specify entity metadata and capture conditions is a practical feature. It makes the synthetic data more relevant and interpretable than other generation model outputs. 3. Since all identities and content are synthetic, privacy is preserved. This is crucial for identity-based datasets. Furthermore, the authors explicitly note that no real personal information is used. 1. I am doubtful about basing the performance of the technique based on the model consistency evaluation only. Here the creation is optimized using an objective encompassing SSIM and model consistency, and this is also evaluated with model consistency as well. While I appreciate the fact that this evaluation is shown using different models as guidance across different models for consistency (basically showing model invariance of the proposed method), this type of evaluation isn't robust enough for fraud-detection systems that should depend on finegrained nuances. I would suggest to conduct a human study regarding correctness across the different baselines. 2. Overall the experiments and baselines compared against is limited. For example in Table 2, the methods that are utilized as baselines are CycleGAN and BO with SSIM only objective. However, generative models such as diffusion-based approaches aren't used. Furthermore, methods such as IDNet/DocXPand that utilize similar approaches should also be considered. It is also unclear how much the BO tuning actually helps. Furtheremore, there should be experiments comparing different optimization strategies to show the validity of using BO tuning for this specific setting. 3. The paper's novelty is in decoupling some metadata to be user-controlled, and including model prediction consistency in BO (BO is already done in IDNet, but that has less user control and only SSIM). Furthermore, they show that their method is better using model prediction consistency (but this is the same metric they optimized for in BO). Overall, the novelty of the contribution is incremental, but there is immense utility to the contributions as this is a user-controlled dataset generation framework which can be used to generate specific target scenarios. However, the paper didn't show evaluation or specific use-cases in this regard. 1. Could the authors clarify how they ensure that model-prediction consistency, which is used both as an optimization objective and an evaluation metric, does not bias the results toward self-validation? Have the authors considered any independent metrics (e.g., human perceptual evaluation, FID/LPIPS) to assess the correctness of the generation process? 2. Is there any specific reason only CycleGan and BO (with SSIM) are used for baselines? Can the authors show results for diffusion models or generative models (preferably trained on similar data)? 3. Is there other optimization techniques the authors utilized for experimentation? Can the authors show some baselines regarding that? 4. Can the authors demonstrate a specific user-controlled generation scenarios that highlights the practical utility of your decoupled metadata and automatic tuning? Some experiments showcasing this utility would also be great. Fully human-written
IBiT: Utilizing Inductive Biases to Create a More Data Efficient Attention Mechanism Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a learned mask in self-attention to better capture the inductive bias that is lacking in the conventional ViT models. The authors compared their method on the ImageNet-1K dataset, showing good performance compared to other ViTs. Introducing inductive bias to the ViTs is a good topic to study. - The motivation of the paper is that introducing the inductive bias to the ViTs can improve ViT training on small-scale datasets. However, only one experiment on ImageNet-1K is shown in the paper, which is not considered to be “small-scale”. The experiment does not correspond to the motivation of the paper. - The paper lacks a significant amount of related work discussion. There are many methods that do weight selection, weight mimetic, introducing CNN structure, etc. These methods are very related to the topic explored in this paper; the authors should properly cite this research and give reasonable discussions. - The paper does not provide a solid, thorough explanation of the proposed method. No valid theoretical proofs or thorough quantitative proofs. - The paper is not well-organized or polished. There are many unnecessary parts in the paper that only add to the pages. For example, algorithm 2 and figure 3 are redundant, training curves are not explained, just randomly shown in the paper, etc. There exist many typos, unexplained figures, and unreasonable arguments. No more questions. Please reconsider this topic and write a proper paper. Fully human-written
IBiT: Utilizing Inductive Biases to Create a More Data Efficient Attention Mechanism Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. To improve the sample efficiency of vision transformers and enable them to perform effectively on small datasets, the authors propose adding a learned masks to the transformer to incorporate CNN-like inductive bias (spatial inductive bias). The method demonstrates positive results on ImageNet, outperforming the DeiT baseline while using fewer parameters and achieving faster performance. - The paper is easy to follow. - The results show improvement over the DeiT baseline on a well-known dataset such as ImageNet. - The proposed method is simple to apply and can therefore be easily adopted by the community. - The paper omits a significant amount of important related work. Several examples include Swin Transformer[1,2], Conv-based ViTs [3,4,5], and other works that incorporate inductive bias into vision transformers (for instance MEGA [6], 2D-SSM [7], MaxViT [8], MixFFN and others). These works should be used both as baselines and to better differentiate the proposed method. - There is no analysis of latency, FLOPs, or memory usage, during either training or inference. - Further analyses are missing, such as evaluating robustness (Imagenet-A, Imagenet-E, and others) and beyond-classification performance (segmentation, generation). ___ **References:** [1] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021 [2] Swin Transformer V2: Scaling Up Capacity and Resolution. CVPR 2022 [3] CMT: Convolutional Neural Networks Meet Vision Transformer. CVPR 2022 [4] Early Convolutions Help Transformers See Better. NeurIPS 2021 [5] CoAtNet: Marrying Convolution and Attention for All Data Sizes. NeurIPS 2021 [6] Mega: Moving Average Equipped Gated Attention. ICLR 2023. [7] 2D-ssm a general spatial layer for visual transformers. ICLR 2024 [8] MaxViT: Multi-Axis Vision Transformer. ECCV 2022 - Is there a reason for not comparing the proposed method with the methods in [1–8]? - Where do the authors think this method can be applied, and what are the limitations of using distillation or other approaches for small datasets like [10]? [10] Vision Transformer for Small-Size Datasets Fully human-written
IBiT: Utilizing Inductive Biases to Create a More Data Efficient Attention Mechanism Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors are trying to mimic inductive bias in CNNS and plant it into self-attention mechanisms. They propose using learnable masking and low rank approximation to learn this mask, and they claim it should benefit in low data regime. * Better understanding of the process mechanism of attention heads is important, and handling low-data regime might be beneficial in some cases where data is hard to harvest. * The importance maps of your method look nice, however beside the first two images, I cannot understand what the attention map represents, on which object does it focuses and why. * Poor level of writing which is not up to standard, to elaborate on some: * The abstract and the introduction sections are far from being comprehensive, they are too short and not informative enough. * It is not common to embed citations in the abstract. * The figures – for example Fig. 1 and Fig. 4 are too detailed, visualizations are better in clarifying or convey ideas, not for detailed explanations. * Unbalanced layout, the Figs are took over the space in the paper, where in the extreme case in page 5 – Figure 3 and Algorithm 2, are covering the entire page (beside 2 sentences in the middle). * Uncommon notations – it is not common to denote variable as a full name like width, height and size. Using of notations without explaining them like the full dot (which I can get that it is multiplication I guess) but it is better to declare prior to the usage. * There is no period at the end of sentences of almost all captions of figures. * The comparisons are not convincing, with low margin and against too few exemplars. * Overuse of capital letters – see for example Lines 12 and 17 in the abstract. * The equations are not numbered. * Lack of flow and tension in the paper, for example the separation for paragraphs is not enhancing flow and sometimes seems random. * The mathematical notations are confusing and very hard to follow – The symbol * is not what you want to use for multiplication, it is better to leave it without any symbol for multiplication. Moreover, the indexing is super hard to follow. * The experimental section is too poor. There is too few comparisons with only DeiT and ConViT only on ImageNet. Moreover, there seems to be no improve, or only marginal when using lower portions of ImageNet, whereas this is the claim to fame of the paper. * Why do you think of $X_m$ as flattening of the image? In transformers it is consist of the patches of the input images, or tokens in LLMs and not the pixels themselves. * When you talking on inductive bias in CNNs with respect to those of attention – do you mean that CNNs are inherently handling spatial consistency? While it might be that transformers are not dedicated to? If so, then I think that this claim should be clarified and exemplified – since there is a positional encoding – so the spatial information is somehow embedded in the transformer learning scheme. * What should I learn from tables 1 and 2? It is a configuration of the training, you can just state it, why do you place it in table? Moreover, the values are almost identical. Im sorry, but this paper is far from being up to academic standard, especially for top-tier conference. The level of writing, depth of analysis, thorough comparisons, mathematical correctness are far from the required. I tried read it few times but I think I couldn’t catch the conceptual novelty here, both because the writing level and the mathematic confusing notations. Even if there is conceptual contribution hidden here, this paper should be dramatically revised. Fully human-written
IBiT: Utilizing Inductive Biases to Create a More Data Efficient Attention Mechanism Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes an Inductively Biased Image Transformer (IBiT) framework to improve data efficiency and small-dataset performance of ViT by introducing convolution-like inductive biases into the self-attention mechanism. Specifically, IBiT includes two core technical ideas: Learnable Mask and Low-Rank Approximation. The Learnable Mask module applies trainable, convolution-inspired locality constraints directly onto the attention maps, enabling the ViT to mimic CNN’s translational equivariance and locality without the use of knowledge distillation or teacher models. The Low-Rank Approximation technique exploits the sparsity of local attention patterns to represent the inductive bias mask using two sub-matrices, significantly reducing parameter count and computational cost. The experimental evaluation in this paper assessed the performance of the proposed IBiT on ImageNet-1K, comparing it to various state-of-the-art Transformer-based methods. The results indicate that IBiT consistently outperformed baselines with the same parameter count while maintaining better scaling behavior on smaller subsets of the dataset. - The authors mathematically derive that convolution operations can be exactly implemented using specific sparse structures in the attention weights, thereby establishing a mathematical connection between self-attention and convolution. - The attempt to introduce convolution’s translational equivariance and locality into ViT, with the goal of improving performance on small datasets, is considered meaningful. - Experimental results show that on the ImageNet dataset, IBiT outperforms DeiT and ConViT, which provides some evidence for the effectiveness of the approach. - Although the mathematical derivation in Sec. 3.1 is valid; there remain key differences from actual convolution. For example, attention weights are input-dependent, while convolution kernel parameters are independent of the input. In self-attention, the weights depend on the dot products between queries and keys, so different inputs will produce different attention distributions. The proposed method merely multiplies the attention weights by a fixed-shape mask; while the mask enforces locality, the attention values are still content-dependent, and thus, its translational equivariance may be less stable than that of convolution. - Convolution not only enforces locality, but also weight sharing and a fixed linear combination pattern, both of which are important for feature stability. The proposed Learnable Mask employs a rolling mechanism to maintain consistency in the mask pattern. However, it still multiplies the mask pattern with attention weights that may vary significantly across positions, thereby ignoring the inherent variability of the weights themselves. - One of ViT’s advantages over CNNs is the ability to model global dependencies. Applying a strong locality mask may suppress interactions between distant tokens, which could negatively affect tasks requiring full scene semantics. - The experiments mainly compare against DeiT and ConViT, lacking a comprehensive comparison with other locality-enhanced Transformers (e.g., Swin Transformer). - All experiments are conducted on classification tasks; there is no evaluation on detection or segmentation downstream tasks, leaving it unclear whether the proposed inductive bias would also be effective for tasks requiring precise spatial localization. The presentation of this paper should be largely improved. Lightly AI-edited
AALawyer: A Generative Retrieval-Augmented Large Language Model System for Legal Reasoning Soundness: 2: fair Presentation: 1: poor Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. Authors present AALawyer, a three-stage legal RAG system: (1) AC-RAG — a classifier that predicts article numbers and retrieves canonical article text, (2) CCs-RAG — case-to-case retrieval, and (3) AA-LeLLM — a finetuned legal generator. They introduce an HR-Benchmark (4 dimensions including a “Hallucination” score) and report improved scores versus some baselines. The central claim is that their generative RAG (AC-RAG + CCs-RAG + AA-LeLLM) reduces hallucination and improves explainability for criminal-law reasoning. I 1.Thoughtful pipeline architecture. The tri-stage structure is intuitively aligned with how legal professionals think: identify governing statute, retrieve precedent, and then craft reasoned text. This mapping from professional workflow to system architecture is a strong design. 2. Focus on hallucination risk. The inclusion of HR benchmark and metrics signals good awareness of the major failure mode in legal generative systems. The fact that the authors treat hallucination explicitly is a positive step. 1. Figure 1(b) can have some additional illustrations. What are R and G? Can you also attach a clearer message on what the difference this approach aims at achieving? 2. The terminologies in this paper are really confusing. There are two module names, and there's also AA-LeLLM, along with other abbreviations like Auth, CP, FAP, ITI, DFI, and the four dimensions of your eval also comes in abbreviations, just leaving the paper with an overwhelming impression. I find it hard to jump back and forth when I read it. The paper will improve presentation a lot by simplifying terminologies. 3. The paper structure is atypical and interrupts the flow of the article. I would not recommend putting related work between your method and experiments. 4. It also seems slightly unfair that the paper is comparing a LM system with off-the-shelf LMs. Should have compared with stronger baselines which are also LM systems. This undermines the validity and meaningfulness of the comparison. Suggestions: I really encourage the authors to simplify the formulation of this paper as it feels needlessly complicated. Fonts in Figure 2 are also a bit too small. The improvement in accuracy are also less valid and meaningful since the comparison is a bit unfair between LM systems and off-the-shelf models. Fully human-written
AALawyer: A Generative Retrieval-Augmented Large Language Model System for Legal Reasoning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents AALawyer, a generative retrieval-augmented legal AI system that combines a fine-tuned legal LLM, precise legal and case retrieval modules, and a 4-dimensional evaluation benchmark to improve accuracy, reduce hallucinations, and enhance explainability in criminal law reasoning. 1. This paper introduces a new RAG method. The ARTICLE-CONTENT RAG provides insights. 2. The experiment shows the effectiveness of the proposed method. 1. **Unclear definition of hallucination measurement.** In Eq. (3), the calculation of `auth()` is not clearly introduced. Does it originate from the article itself, or is it derived through another computational method? As this function is a crucial component of the AC-RAG module, it should be explained in detail. Providing a concrete example would also help readers better understand the process. 2. **Lack of clarity in presentation.** The paper is somewhat difficult to follow. For instance, Figure 1 provides limited information and would be more informative if it used a legal-related context. Similar issues are observed with Figure 2. 3. **Minor inconsistencies and typos.** There are inconsistent expressions such as *G&G RAG* and *G&G-RAG*. It is recommended to standardize these terms throughout the paper for consistency and professionalism. 4. **Outdated baselines.** Some comparison methods appear outdated. To the best of my knowledge, the experiments do not include the latest Legal LLMs [1], and the LawGPT method has been updated in [2]. It would strengthen the paper to incorporate these recent models in the experimental comparison. *** Ref: [1] Chen, H., Xu, Y., Wang, B., Zhao, C., Han, X., Wang, F., ... & Xu, Y. (2025). LexPro-1.0 Technical Report. arXiv preprint arXiv:2503.06949. [2] Zhou, Z., Yu, K. Y., Tian, S. Y., Yang, X. W., Shi, J. X., Song, P., ... & Li, Y. F. (2025). Lawgpt: Knowledge-guided data generation and its application to legal llm. arXiv preprint arXiv:2502.06572. See weaknesses above Moderately AI-edited
AALawyer: A Generative Retrieval-Augmented Large Language Model System for Legal Reasoning Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses hallucinated citations and limited explainability in legal LLMs by proposing AALawyer—a generative retrieval-augmented system for criminal law reasoning (integrating AA-LeLLM, AC-RAG, CCs-RAG)—and HR-Benchmark (4-dimensional evaluation). Experiments show AALawyer achieves 88.84% accuracy on LawBench’s FAP task (↑71.98% vs. DeepSeek-7B) and reduces hallucination risk by 37.6% on HR-Benchmark. Key contributions include the novel AC-RAG pipeline, 176k criminal case dataset, and HR-Benchmark. 1. AC-RAG innovates a "generate-retrieve-generate" pipeline, using numeric identifiers to convert article retrieval into a classification task, enabling parameter sharing and reducing hallucination. 2. The 176k case dataset fills gaps in criminal law retrieval data; HR-Benchmark addresses the lack of RAG system evaluation tools in legal AI. 3. Experiments focus on core FAP tasks; ablation studies (e.g., AC-RAG cuts hallucination by 59%) and Lemma 1 (mathematical proof of hallucination bounds) ensure rigor. 4. The 3-stage pipeline (AC-RAG/CCs-RAG/AA-LeLLM) and theoretical foundations are clearly presented. 1. Generalization Limits: AA-LeLLM is only based on 7B DeepSeek-7B (no 13B/70B analysis) and restricted to criminal law (no discussion on extending to civil/administrative law). 2. HR-Benchmark Reliability: Relies solely on DeepSeek-Chat for evaluation (no scoring criteria or multi-Judge comparison); 200 CAIL2018 cases lack clarity on covering major crimes. 3. AC-RAG Gaps: Fails to quantify numeric identifier prediction errors or disclose Format1/Format2 prompt templates; no analysis of "ambiguous cases". 4. Related Works: Omits discussion of domain association models like Hyper-RAG, which suit legal complex relationships. see weaknesses Moderately AI-edited
AALawyer: A Generative Retrieval-Augmented Large Language Model System for Legal Reasoning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes AALawyer, a legal reasoning system for Chinese criminal law that combines a fine-tuned legal LLM (AA-LeLLM) with two retrieval-augmented generation (RAG) modules: AC-RAG (Article-Content RAG) and CCs-RAG (Case-Cases RAG). The AC-RAG module converts article retrieval into a classification task to reduce hallucination in legal citations, while CCs-RAG retrieves similar precedent cases using a dataset of 176k criminal court cases. The authors evaluate their system on LawBench classification tasks and introduce a new Hallucination Risk-Benchmark (HR-Benchmark) with four dimensions: hallucination score, professionalism, informativeness, and explainability. The results show improvements over baseline models. 1. **Well-motivated approach**: The paper grounds its design in legal syllogism theory and real-world judicial practice, providing a principled framework for the three-stage pipeline (retrieving legal articles, referencing precedent cases, and generating conclusions). 2. **Comprehensive system design**: The integration of classification-based article retrieval (AC-RAG) with case-based retrieval (CCs-RAG) addresses both hallucination reduction and explainability, which are critical for legal applications. 3. **Strong empirical results on target tasks**: The model achieves impressive performance on the criminal article prediction task (FAP), with a 71.98% improvement over the base model, demonstrating the effectiveness of the proposed AC-RAG approach. 1. **Limited novelty in core concept**: The idea of incorporating legal articles into legal reasoning to reduce hallucination is not new. Previous work such as Lawyer LLaMA (Huang et al., 2023) has already explored similar methods of retrieving and integrating legal provisions into the generation process. 2. **Questionable benchmark validity**: The construction and evaluation methodology of the Hallucination Risk-Benchmark is insufficiently rigorous and lacks credibility as a reliable evaluation framework: * Inadequate dataset construction details: The paper states that 200 cases were "randomly selected" from CAIL2018 but provides no justification for this sampling strategy. Critical questions remain unanswered: What criteria ensure these cases have appropriate difficulty levels? How is diversity guaranteed—do they cover a broad range or representative set of criminal charges? Why only 200 cases rather than a more comprehensive evaluation set? Without this information, the benchmark's representativeness and reliability are questionable. * Ambiguous and unvalidated evaluation dimensions: The four dimensions (hallucination, professionalism, informativeness, explainability) are vaguely defined without clear operational definitions or grounding in legal theory. It is unclear whether these dimensions are legally meaningful or simply intuitive categories. The paper does not justify why these specific four dimensions are appropriate for evaluating legal reasoning systems. * Insufficient LLM-as-judge methodology: While the authors employ an LLM-as-judge approach, Section 4 provides no details about the evaluation prompts or scoring rubrics. The validity of using LLMs to evaluate legal reasoning quality—especially for domain-specific dimensions like "professionalism"—is not established. Critically, there is no validation against legal expert judgments to confirm that the LLM's assessments align with human expert evaluations. This is a significant omission given that the benchmark is proposed as a contribution of the paper. Please refer to the weaknesses. Heavily AI-edited
Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks? Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a firewall-based defense method against indirect prompt injection attacks. Specifically, the authors design two LLM-based firewalls: (1) a Tool-Input Firewall that minimizes unnecessary confidential information from tool input arguments, and (2) a Tool-Output Firewall that sanitizes returned tool observations from potentially malicious content. Both firewalls are implemented using LLMs with specifically designed prompts. The authors evaluate their method on AgentDojo, Agent Security Bench, InjecAgent, and Tau-Bench. The results show that the approach effectively reduces attack success rates while preserving benign utility to some extent. Finally, the authors analyze the strengths and weaknesses of existing prompt injection benchmarks and propose suggestions for improvement. 1. The method is simple and effective, requiring no additional training to achieve defense capabilities. 2. The evaluation is comprehensive, covering commonly used datasets in the field. 1. The novelty is limited. The idea of using LLMs themselves to filter harmful information has been extensively explored in both jailbreak defenses and prompt injection defenses. The authors should clearly articulate what distinguishes their approach from prior work beyond the specific application to tool-input and tool-output filtering. 2. The method causes a degradation in benign utility. As shown in Table 1, benign utility drops dramatically from 83.02% without defense to only 58.41% with the firewall. This suggests the firewall is probably conservative and filters out valuable information necessary for legitimate task completion. The authors should provide a detailed analysis of what types of benign tasks are being incorrectly blocked and whether this trade-off is acceptable in real-world deployments. 3. The paper lacks important analysis on practical deployment considerations: (1) Computational overhead: Since both firewalls are LLM-based, what is the additional latency and cost introduced? This could be prohibitive for real-time applications. (2) False positive/negative analysis: As mentioned in 2, the paper should provide detailed statistics on misclassification rates to better understand the firewall's reliability. Please refer to the weaknesses part. Moderately AI-edited
Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks? Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a simple defense against indirect prompt injection (IPI) attacks in tool-using LLM agents. It introduces two lightweight modules — a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer) — that filter inputs and sanitize tool outputs. Experiments on several benchmarks (AgentDojo, ASB, InjecAgent, τ-Bench) show near-zero attack success rates. The paper also points out flaws in existing benchmarks and suggests improvements. The paper is clearly written and easy to follow. The idea is simple and practical — the defense can be easily applied without retraining. The benchmark analysis is detailed, and the authors identify real issues in existing evaluation setups. The core idea (“minimize & sanitize”) is too simple and incremental, offering little novelty beyond existing “firewall” or “guardrail” defenses. Most results come from benchmarks that the authors themselves criticize as flawed, so the findings feel self-contradictory and less convincing. The paper lacks deeper insight or analysis about why the approach works and how it generalizes. The work doesn’t propose new attack strategies or theoretical understanding — it’s mainly an engineering evaluation rather than a research contribution. Overall, the contribution feels minor; it’s a straightforward application rather than a new idea. see weakness Fully AI-generated
Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks? Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents a two-fold contribution. First, it proposes a simple, model-agnostic defense against indirect prompt injection (IPI) attacks called the "minimize & sanitize" firewall. This defense consists of a "Tool-Input Firewall" (Minimizer) to filter sensitive data from tool inputs and a "Tool-Output Firewall" (Sanitizer) to remove malicious instructions from tool responses. The authors demonstrate that this defense, particularly the Sanitizer, achieves state-of-the-art (SOTA) results, reducing the attack success rate (ASR) to ≈0 on AgentDojo and $\tau$-Bench and to ≲0.3 on InjecAgent. The paper's second, and arguably more significant, contribution is a rigorous critique of these same benchmarks. The authors reveal that the SOTA results are largely an illusion, as the benchmarks suffer from critical flaws, such as "forced attack-tool injection" in ASB and brittle utility metrics in AgentDojo. This makes them poor evaluators of true security. To prove their point, the authors develop a stronger, obfuscated (Braille-based) attack that successfully bypasses their own SOTA firewall, thereby highlighting the urgent need for stronger, more realistic security benchmarks. 1. The paper's primary strength is its dual contribution. It not only proposes a simple, effective, and model-agnostic defense (the Minimizer-Sanitizer firewalls) but also provides a rigorous critique of the very benchmarks used to measure success. 2. This paper uncovers flaws in ASB and AgentDojo that distort ASR and utility, and provide concrete fixes to make evaluations more trustworthy. 3. The proposed firewall defense is commendable for its simplicity and practicality. As a modular, model-agnostic approach, it serves as an excellent and easily replicable baseline. 1. Potential Data Contamination. While the reported results of the proposed defense are strong, this method relies primarily on frontier models (GPT-4o and Qwen3), and the paper does not analyze potential training–evaluation contamination (prior exposure to attack styles or benchmark artifacts). Could you replace the Minimizer/Sanitizer with an older model and report the performance? This would show whether the defense truly depends on frontier-model memorization. Overall, this is an interesting and valuable paper that productively revisited progress on benchmarking prompt-injection attacks and defenses. It would be even better with a quantitative treatment of optimization-based adaptive attacks [1, 2]. Conceptually, these attacks should also serve as strong baselines, especially since many defenses in current benchmarks are largely static and plausibly vulnerable to adaptive optimization (e.g., tuning an adversarial suffix). [1] Liu, Xiaogeng, et al. "Automatic and universal prompt injection attacks against large language models." arXiv preprint arXiv:2403.04957 (2024). \ [2] Pasquini, Dario, Martin Strohmeier, and Carmela Troncoso. "Neural exec: Learning (and learning from) execution triggers for prompt injection attacks." Proceedings of the 2024 Workshop on Artificial Intelligence and Security. 2024. For rebuttal, please refer to the weaknesses. Lightly AI-edited
Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks? Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper claims to provide a simple, effective, modular and model-agnostic defense for tool-calling agents based on: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer). They demonstrate that their approach achieves 0% or the lowest attack success rate on four public benchmarks while maintaining high task success. They also found and fixed key flaws in widely popular benchmarks to enable more reliable evaluation in the agent attack community. Finally, they also provided case study about the failure of their own method to call for more stronger defenses. 1. The contribution of fixing existing benchmarks is very useful for future benchmarking in this field. 2. The method seems to work well on the 4 benchmarks, making them strong candidates for agent security defense. 3. this paper is easy to understand, and the demonstration is very good and illustrative. 1. The method is, despite its fancy names, imo, a pre-processor and a post-filter, which is not new. And I did not see any necessity to associate it with firewalls as inspiration. 2. The two filters (pre&post) seem only been built by a very short system prompt. thus, it's questionable why those system prompts serves for the claim of 'firewall is equipped with a robust system prompt'. How do you justify this? why that system prompt is robust? how did you choose those system prompt? 3. In 7 DISCUSSION, I understand stronger attacks can unsuperisingly bypass your firewalls. But how about other baselines? does the same attack also succeed on other baselines or it a flaw of your own methods? more discussion on this would be good. 1. How about finetuning the two filters? Why is the system prompting enough? 2. What's the difference if the backbone model is not gpt4o? Any trade-off analysis? Fully human-written
Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes Token-Selective Dual Knowledge Distillation (TSD-KD), a framework for transferring large language model reasoning abilities to smaller models. The method is designed to provide targeted supervision, focusing on high-uncertainty tokens to mitigate issues with existing KD approaches. Three key components are listed: 1) Token-Selective Indirect Distillation: The teacher provides preference rankings over the student's top-k generated tokens in the initial sequence of reasoning (the "opener"), utilizing a Plackett-Luce model. 2) Token-Selective Direct Distillation: A JSD-based distribution matching loss is applied only to tokens where the student's uncertainty (entropy) significantly exceeds the teacher's confidence (the "uncertainty gap"). 3) Token-Selective Entropy Regularization: The entropy of the student's top 10% most uncertain tokens is minimized. 1. The core idea of Token-Selective Direct Distillation is well motivated. 2. Authors provide comprehensive ablation study in demonstrating effects of each component and showed strong empirical results over baselines. W1: Hyperparameter Sensitivity: The framework relies on an extremely sensitive set of hyperparameters ($c$, $k$, $s$, $\beta$), as demonstrated by sharp performance drop-offs in the appendix analyses. This suggests the method is brittle and lacks practical generalizability. In the Table 1, authors also only report the performance from the best hyperparamter selections. I wonder how much this complex setup could transfer into new domains or tasks. W2: Conflict Between On-Policy Learning and Entropy Minimization ($\mathcal{L}_{EM}$): The $\mathcal{L}_{EM}$ term, which minimizes entropy on the top 10% most uncertain tokens, fundamentally conflicts with the core on-policy principle of preserving and encouraging exploration. While selectivity is claimed as a mitigation, the paper does not analyze the true impact on the student's output diversity or rigorously justify that minimizing entropy is superior to simpler confidence maintenance. W3: The paper provides insufficient analysis to attribute the performance at Token-Selective Indirect Distillation. It is unclear if the success of the indirect distillation is due to the preference ranking (teacher's subtle knowledge transfer) or simply the top-k token candidate proposal. Based on prior work, latter might be the bigger contribution. I believe authors should perform additional ablation experiments to justify that preference ranking is necessary. W4: There is a very relevant paper Speculative KD (https://arxiv.org/abs/2410.11325). Authors should consider compare to or mention. At W3 Fully human-written
Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes Token-Selective Dual Knowledge Distillation (TSD-KD), a student-centric framework for improving reasoning through selective and dual-mode knowledge transfer. The method integrates: Indirect distillation, the student proposes candidate tokens and receives preference ranking feedback from the teacher (similar to DPO-like weak supervision); Direct distillation, selective distribution matching on tokens with large student–teacher uncertainty gaps using JSD, Entropy regularization — confidence enhancement through minimizing entropy of the most uncertain tokens. The authors conduct comprehensive experiments on reasoning benchmarksusing Qwen2.5 and Gemma2 model families. Results show consistent improvements, with the student model occasionally outperforming its teacher. 1. Complete and sound framework combining distillation with entropy regularization forms a coherent pipeline. 2. Writing is good. 3. Use of token entropy for identifying important tokens aligns with recent research trends in reasoning-focused LLMs (e.g., 80/20 entropy rule, ARPO). 4. Comprehensive experiments across two model families demonstrate generalizability. 1. The core idea of TSD-KD is highly similar to <Keypoint-based Progressive Chain-of-Thought Distillation (icml 2024)> in motivation. Both approaches emphasize selective token weighting and distillation; TSD-KD replaces KPOD’s mask-learning with entropy-based selection but retains the same underlying philosophy. However, this previous work is completely neglected. 2. At this time, it is unclear why the authors did not conduct experiments on the Qwen3 family, such as Qwen3-8B, which has become the de-facto principle for reasoning evaluation (like in <Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning>.). The paper only reports results on Qwen2.5 and Gemma2, both of which are now relatively outdated and substantially weaker in reasoning capability. Since the proposed method explicitly targets reasoning enhancement, it is essential to verify its effectiveness on more competitive and up-to-date models. 3. Limited theoretical insight are provided in this paper. For example, the paper lacks a clear theoretical justification for why entropy-based token selection truly improves reasoning robustness beyond serving as a heuristic importance measure. refer to Weaknesses, I will adjust my score according to the responses. Lightly AI-edited
Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces Token-Selective Dual Knowledge Distillation (TSD-KD), a framework for distilling reasoning ability from large language models to smaller student models by focusing on important tokens in the reasoning chain. TSD-KD combines two main innovations: (1) indirect, preference-based distillation, where the teacher re-ranks student-generated token candidates without forcing its output distribution, and (2) direct, gated distillation that selectively applies distribution-matching to tokens where the student is uncertain and the teacher is confident. The approach is regularized with selective entropy minimization on the most uncertain tokens. Empirical evaluation across 10 challenging reasoning benchmarks demonstrates TSD-KD’s strong performance, including cases where the compressed student surpasses its teacher. 1.The use of student-generated candidates (preference-based indirect distillation) and selective, entropy-based token gating in direct distillation is thoughtfully motivated and distinguishes the framework from prior “teacher-forcing” approaches. The focus on letting the student “explain in its own words” resonates with cognitive insights and supports the central claim. 2.The explicit combination of indirect and direct knowledge distillation, each carefully limited to critical tokens, is well-positioned to address known weaknesses of pure distribution-matching or of-point-wise imitation. 3. The mathematical formulation is transparent, the underlying assumptions are stated, and the algorithmic components are described with appropriate rigor.  4. TSD-KD consistently outperforms strong baselines, with substantial absolute gains. Importantly, in multiple cases, the student model trained via TSD-KD surpasses the teacher. 1. The preference-based indirect distillation encourages the student to align with the teacher’s ranking on top-$k$ student candidates. However, this assumes that the student's beam search is likely to generate candidates close to the correct reasoning trace, which may not hold for weaker students or for highly ambiguous problems.  2. While tables and figures provide extensive quantitative results, the paper lacks qualitative or error analysis on the types of reasoning improvements the student makes with TSD-KD (beyond aggregate accuracy). 3. Even though performance improvements are noticeable, the paper does not report any statistical significance tests. 1. How robust is the preference-based indirect distillation if the student’s top-$k$ candidates are mostly incorrect? Does the framework degrade gracefully if initial reasoning is off-policy, or does performance collapse? Are there analyses on very weak students or pathological candidate proposals? 2. Do any ablation results suggest redundancy between direct distillation with entropy gating and selective entropy minimization? Are there tasks where one suffices without the other, and can the gains be attributed to only one component in certain domains? 3. What steps are in place to detect or mitigate biases propagated from teacher to student, considering that only a subset of tokens is distilled but on the student’s own output? Fully AI-generated
Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes Token-Selective Dual Knowledge Distillation (TSD-KD), a framework designed to efficiently transfer the reasoning abilities of a large teacher model to a smaller student model, aiming to reduce the cost of Chain-of-Thoughts (CoT) generation. Adopting a student-centric, on-policy distillation paradigm, TSD-KD applies supervision only to the most critical or uncertain tokens during the reasoning process, thereby avoiding the distribution mismatch and overwhelming issues associated with traditional Teacher-Forcing. The method integrates three key components, all guided by token selection: Indirect Distillation (teacher acts as a preference ranker for student candidates), Direct Distillation (applying GKD loss to tokens with a large uncertainty gap—student uncertain but teacher certain), and Entropy Regularization (selectively minimizing student entropy on critical tokens). 1) TSD-KD achieves State-of-the-Art performance across 10 challenging reasoning benchmarks. Experimental results demonstrate its significant superiority over existing baseline methods across multiple tasks. 2) The student model, after training, even surpasses its teacher model on some reasoning tasks (with improvements up to 20.3%). This result strongly suggests the framework is not merely imitative but effectively promotes the student model in building its own, more generalizable reasoning logic. 1) The core insight of the paper—that "high-entropy/uncertain tokens are critical branching points in reasoning" and should be targeted for selective supervision—is not an original discovery. This phenomenon, which guides the model learning process, has been well-established in antecedent works (such as the RL-based methods by Wang et al. (2025) and Lei et al. (2025)). Therefore, the paper's contribution lies primarily in the engineering application and integration of this existing principle into the knowledge distillation domain for selective supervision, rather than a breakthrough in fundamental mechanism discovery or method innovation. 2) The TSD-KD methodology lacks deep theoretical innovation in distillation, being an effective combination of existing techniques and intuitive heuristic rules. Specifically, the Indirect Distillation employs the established Plackett-Luce (PL) model from RLHF, and Direct Distillation uses the known Generalized JSD (GKD) loss. While the "uncertainty gap" token selection mechanism is novel, it functions as an intuitive heuristic rule. Consequently, the paper's main contribution is the effective integration of these existing components, rather than the proposal of a new foundational distillation mechanism or a novel loss function. 3) The crucial length of the "Opener" for selective supervision is defined by an empirical hyperparameter: the c% accumulated entropy threshold (set to c=10% based on ablation studies). This fixed-ratio approach is a heuristic inherited from similar suggestions in other reinforcement learning works. The absence of a dynamic or adaptive mechanism that adjusts this threshold based on the specific complexity and depth of the reasoning task limits the theoretical generalizability of the method, as the optimal empirical value may vary significantly across different domains (e.g., mathematical vs. common-sense reasoning) and model architectures. 1) Given that the insight "high-entropy/uncertain tokens contribute more" is highly similar to recent RL-based works (e.g., Wang et al. (2025) and Lei et al. (2025) as mentioned), how do the core innovative mechanisms of this paper (e.g., the uncertainty gap selection, the Dual Distillation design) demonstrate a theoretical or empirical advantage over the Token Importance mechanisms in the precursor works? 2) In the context of Knowledge Distillation, what specific advantages—such as increased data efficiency or stability—does selective supervision offer that cannot be achieved or are less efficient using traditional RL frameworks (i.e., penalizing/rewarding only critical tokens via sparse reward signals)? 3) Indirect Distillation is only applied during the Opener phase. How does the student model maintain reasoning consistency and quality during the subsequent unsupervised phases? If the student selects a path consistent with the teacher's preference during the Opener, to what extent does this restrict its ability to develop new, non-imitative reasoning logic "in its own words" in the subsequent steps? Heavily AI-edited
Revisiting the Role of Homophily in Fair Graph Representation Learning Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors introduce CSBM-S, a controllable synthetic benchmark that decouples label homophily (h_y) and sensitive homophily (h_s), enabling precise evaluation of fairness mechanisms. CSBM-S identifies two empirical trends: Group disparity peaks when label homophily and Bias tend to decrease as sensitive homophily Then, based on these insights, they propose FairEST, a method that enforces ~0.5 by flipping sensitive attributes and correlated features during training to mitigate bias. Experimental results show consistent improvements in fairness metrics across baselines. I like the idea of fairness in GNNs through a homophily view, offering a new conceptual angle on how topology affects bias propagation. Also, FairEST is conceptually straightforward, model-agnostic, and integrates easily into existing GNN pipelines. The observations linking fairness with specific homophily ranges could inform future fairness-aware graph design. - To me, GNNs inherently rely on the homophily principle, learning from neighboring nodes under propagation. Therefore, attributing fairness issues primarily to homophily may oversimplify the problem. The root causes of unfairness might instead come from global structural factors, such as community topology or node identity, rather than local structural properties like node degree or neighborhood similarity. - The rationale for flipping sensitive attributes may appear heuristic. That is, its connection to causality or representation disentanglement could be better articulated. - Could the authors clarify whether the feature flipping operation might leak or distort semantic information critical for downstream tasks? - How does FairEST perform on heterophilous graphs where h_y and h_s are both low? Does the method still yield fairness gains? - How sensitive is FairEST to incorrect or noisy sensitive attributes? Lightly AI-edited
Revisiting the Role of Homophily in Fair Graph Representation Learning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper studies group fairness in GNNs through the lens of label homophily $h_y$ and sensitive homophily $h_s$. It introduces CSBM-S, a synthetic generator that independently controls $h_y$ and $h_s$ to analyze bias under message passing, observing that disparity peaks near $h_y \approx 0.5$ and diminishes as $h_s \to 0.5$. Building on this, the authors propose FairEST, which iteratively edits the sensitive attribute $s$ and its most correlated features to steer neighborhoods toward $h_s \approx 0.5$, with an auxiliary group-fairness loss. Experiments across multiple datasets and GNN backbones show reduced group-fairness gaps with comparable accuracy. + The motivation is clear. The paper formalizes node-level $h_y$, $h_s$, and standard group-fairness metrics ($\Delta\mathrm{SP}$, $\Delta\mathrm{EO}$), then analyzes how message passing amplifies or attenuates disparities. It further employs CSBM-S to vary $h_y$ and $h_s$ independently. Grid sweeps and a mean-field analysis yield interpretable patterns that motivate the method. + FairEST is backbone-agnostic and easy to implement. Edits to $s$ and correlated features, combined with a fairness loss, reduce bias without architectural changes. + The experimental study is extensive, covering multiple datasets/backbones with ablations and hyperparameter analyses that reveal both gains and failure modes. - The method assumes the sensitive attribute $s$ is observed, which may be unrealistic in high-stakes settings. Please evaluate fairness when $s$ is hidden or unavailable, e.g., by comparing to adversarial/invariant approaches and to a setting where $s$ is predicted from proxies. - The algorithm greedily balances neighborhood homophily $h_s$ toward $0.5$ using node-wise majority and a fixed iteration cap. It is unclear whether these local flips provably reduce global disparity or instead induce distributional shifts in $P(s)$. - The paper currently targets binary labels and a single binary sensitive attribute. How about the generalization to multi-class or multi-attribute settings? - Only group fairness is evaluated. Individual fairness is discussed conceptually but not assessed. Many applications may require both. Please refer to the above weaknesses. Lightly AI-edited
Revisiting the Role of Homophily in Fair Graph Representation Learning Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This work aims to study fairness in GNNs from the perspective of homophily. Specifically, the authors focus on notions of label and sensitive attribute homophily, assessing which neighborhood patterns cause fairness degradation. Through their CSBM-S model, a synthetic graph model that controls the label and sensitive attribute homophily, the authors demonstrate that group fairness degrades as label homophily tends towards 0.5, while group fairness improves as sensitive attribute homophily tends towards 0.5. To use these findings, the authors present FairEST, a method which aims to optimize the sensitive attribute homophily at training-time to 0.5. Generally, FairEST is able to achieve decent fairness metrics, but does incur a performance cost. 1. I found the method sections of the paper relatively easy to read given that each section naturally follows from the previous and logically builds on one another. 2. The empirical results, at least for the fairness metrics, tend to be decent. 3. The methods simplicity as a training-time augmentation makes it amenable to different backbones and settings. 1. My main concern about this work is that it largely uses methods and insights already well established in the literature. Moreover, the authors do not offer enough arguments as to how their work explicitly differs from these methods, sometimes missing citations all together. A few examples are: - Homophily and fairness are already well connected in the literature. As far as I can tell, the authors do not explicitly address how their work builds on, or "revisits", these previous findings [1, 2, 3]. - Beyond just connecting homophily and fairness, the proposed CSBM-S model and analysis in section 4 are highly similar to that in [2], both from the model design of manipulating label and sensitive attribute homophily, as well as the resulting takeaways. - The idea of flipping the sensitive attribute, aiming to "debias" the message passing process, does not seem sufficiently different from previous methods which manipulate the graph structure to encourage different treatments across sensitive attributes [3, 4]. 2. For the majority of the experimental results, while the fairness metrics are decent, the accuracy drops are sometimes quite large. Given my points above, I think significantly more effort needs to go in to remedy this issue and establish more novelty in the method. 3. While trying to assess whether the author justified the performance drops, I realized there are instances in section 6.2 which do not seem to correspond to Table 1. For instance, on line 364, the reported accuracy numbers changes are significantly higher than the drops seen in the table (e.g., authors report -2.1% on GIN-bail, yet it would appear the drop is closer to 5.5% in table 1). In all, I think this work needs quite a bit more effort to both sufficiently ground itself in the literature and also improve presentation in the experimental section. [1] Wang et al. “Improving Fairness in Graph Neural Networks via Mitigating Sensitive Attribute Leakage” [2] Loveland et al. “On Graph Neural Network Fairness in the Presence of Heterophilous Neighborhoods” [3] Li et al. “On Dyadic Fairness: Exploring and Mitigating Bias in Graph Connections” [4] Rahman et al. “Fairwalk: Towards Fair Graph Embedding” Please see my weaknesses above. Fully human-written
Revisiting the Role of Homophily in Fair Graph Representation Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper studies the relationship between homophily and fairness in GNNs. They claim that the degree of label homophily $h_y$ and sensitive homophily $h_s$ significantly impacts bias amplification during message passing. To analyze this, they propose CSBM-S, a synthetic graph model that decouples label and sensitive homophily, allowing controlled experiments. They further introduce FairEST, an algorithm that enforces $h_s \approx 0.5$ to improve fairness by iteratively flipping sensitive attributes and correlated features. Experiments on several benchmarks and GNN baselines show modest improvements in fairness metrics with comparable accuracy. 1. The paper identifies an intersection between fairness and graph homophily. They show how label homophily and sensitive homophily could shape fairness under the message passing of GNNs. 2. The paper introduces CSBM-S as a controlled simulator for fairness studies, which allows disentangling the effects of different homophily levels in a reproducible manner, potentially benefiting future research. 3. Experiments include multiple models and datasets, and the authors conduct ablations, sensitivity analyses, and noise robustness tests. 1. The idea that message passing propagates sensitive signals via edges with attribute correlation is well-established (Wang et al., 2022; Dong et al., 2022; Dai & Wang, 2021). The notion of balancing sensitive attribute distributions (making $h_s \approx 0.5$) is just a graph-level rephrasing of feature decorrelation or resampling. FairEST’s “flip and reflect” procedure is essentially a stochastic data augmentation trick, not a theoretically or algorithmically novel approach. As such, the manuscript somewhat overstates the conceptual novelty of the approach by framing it as a “homophily-centric fairness framework,” when the underlying idea remains relatively straightforward. 2. In Section 4.4, the analysis appears to restate known results from mean-field diffusion analysis. The authors find that bias is largest when label information is weak ($h_y \approx 0.5$) and when sensitive channels dominate (extreme $h_s$). While this observation is intuitively consistent, i.e., bias tends to increase when sensitive features drive predictive signals. It does not seem to offer new theoretical insight into why or how GNN architectures amplify fairness issues. 3. Despite proposing CSBM-S, the paper does not use it to uncover deeper causal or structural insights about fairness dynamics in graphs. The synthetic experiments are mainly limited to grid-sweep heatmaps and a few straightforward observations, without quantitative analyses of robustness, sensitivity to graph topology, or comparisons with alternative fairness mechanisms. As a result, the proposed “homophily-centric toolkit” currently functions more as a synthetic data generator than as a framework for deeper theoretical understanding. 1. How does FairEST compare with trivial strategies such as random node feature shuffling, attribute dropout, or standard reweighting schemes? 2. Are the fairness improvements statistically significant across runs? Please report confidence intervals or p-values. 3. The method assumes full access to sensitive attributes during training and precise correlation estimation, which is rarely met in practical settings. How would FairEST operate under partial or uncertain sensitive attribute availability? 4. Have the authors tested on larger or more realistic graphs (e.g., OGB datasets)? The current experimental setup lacks scalability evidence. 5. Does enforcing $h_s \approx 0.5$ actually remove causal influence of the sensitive attribute, or does it merely mask correlations? Fully AI-generated
Geometric and Information Compression of Representations in Deep Learning Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work disputes the connection established in previous literature (e.g., by Goldfeld et al., 2019) between the geometric compression of latent representations $Z$ and mutual information $I(X;Z)$, where $X$ are the inputs of a neural network. The authors argue that a decrease in $I(X;Z)$ does not imply a more clustered $Z$; on the contrary, their experimental results suggest that compression correlates with higher mutual information. The authors also provide a supplementary theoretical result that justifies the Information Bottleneck (IB) analysis of DNNs with analytic activation functions and continuous dropout. The paper's key strengths are its main theoretical result (Theorem 3.1) and the sheer scale of its experimental evaluation. While Theorem 3.1 is an extension of the result from Adilova et al. (2023), I consider it to be important for proving the non-vacuousness of IB analysis for a wider class of neural networks. Furthermore, the experimental results are quite insightful, highlighting the intricate interplay between $I(X;Z∣Y)$, Neural Collapse, various hyperparameters ($\beta$ in CEB and $\lambda$ in the Gaussian dropout framework) and accuracy+generalization. I have two major concerns in regard to the methodology: 1. **How MI is measured**. - The authors are rather inconsistent with their choice of MI estimators. For Figure 1, they use NPEET (which is, by the way, not referenced in the main text); for CEB, the variational bound is employed; for Gaussian dropout (GD), they use DoE estimator. - The paper lacks concrete justification for the choice of estimators. Specifically, the use of NPEET is unexplained, the claim of a "practically tight" variational bound (line 248) is unsubstantiated, and the superior performance of DoE over other SOTA estimators (lines 250-254) is not demonstrated. I kindly ask the authors to elaborate on these decisions and provide the experimental results that support their claims. 2. **When MI is measured.** As stated in lines 128-130, this study focuses on the connection between MI and clustering at the end of training. While this is an interesting direction, I find it only loosely connected to the original works on the IB principle. For instance, Shwartz-Ziv & Tishby (2017) measure MI throughout the training. They identified a distinct compression *phase*, where $I(X;Z)$ begins to decrease after a certain epoch. Goldfeld et al. (2019) later connected this *phase* to geometric compression (under normal conditions). Therefore, an anti-correlation between MI and NC observed only at the end of training does not preclude the occurrence of such a compression phase, nor does it rule out geometric compression as its driver (for example, geometric compression can drive MI to the minimal value *throughout the training*, but the minimum itself can still be anti-correlated with NC). Moreover, recent studies suggest that compression phases can be transient and may not result in ultimate MI compression (e.g., $I(X;Z)$ might exhibit an overall steady growth punctuated by rapid drops correlated with improvements in training loss). Please, refer to Figure 5 in [1] or non-ReLU IB experiments in [2]. I also encourage the authors to include the full proof of Theorem 3.1, since there is no limit on the length of the Appendix. [1] Butakov et al. "Information Bottleneck Analysis of Deep Neural Networks via Lossy Compression". Proc. of ICLR 2024. [2] Anonymous Authors. "A Generalized Information Bottleneck Theory of Deep Learning". ICLR 2025 submission: [https://openreview.net/forum?id=reOA4r0FGL](https://openreview.net/forum?id=reOA4r0FGL). **Minor issues:** 1. In lines 167-170, the joint distribution of $X$ and $Y$ is said to be "typically continuous", while $Y$ is clearly discrete since the task is classification. 2. The $\parallel$ symbol in Tables 1-2 is not visually appealing due to misalignment. The missing values are also a bit confusing. I understand that they suppose to mean that "gen" requires evaluation on both train and tests subsets. Perhaps, a viable option is placing it in-between the columns using `\makecell` or a similar macro. Finally, it is not immediately obvious what "Perf." stands for. Overall, I suggest an overhaul of these tables. 3. Perhaps, Figure 2 might benefit from log-scaling the `y` axis. 4. The equivalence in line 406 requires additional explanation. As I understand, $$ I(X;Z \mid Y) = I(X;Z) - I(X;Y;Z) = I(X;Z) - I(Z;Y) + \underbrace{I(Z;Y \mid X)}_0, $$ where $I(Z;Y \mid X) = 0$ since $Y \to X \to Z$ is a Markov chain. Please, elaborate on this in the main text. 5. In line 716, a backslash before `log` is missing. I also kindly suggest using `\text` for `dist`, `vol` and `finite` in the subsequent derivations. **Conclusion:** Overall, the paper appears rather unpolished. The methodology also needs stronger justification. For these reasons, I recommend a major revision. 1. Why did you use NPEET instead of DoE for Figure 1? 2. The original implementation of the DoE estimator relies on rather weak approximations of the distributions (e.g., Gaussian). Are they good enough for your complex task? 3. Do you have any intuition behind the anti-correlation between MI and NC? For me, a positive correlation is quite intuitive (clustered representations are "degenerate" and typically encode less information), but I still can not explain the opposite behavior that you observe. Fully human-written
Geometric and Information Compression of Representations in Deep Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses an open question in representation learning: Does low mutual information (MI) between inputs and learned representations imply geometric compression of those representations, and vice versa? The authors probe this through experiments on classification networks with continuous dropout (injecting noise) and with the Conditional Entropy Bottleneck (CEB) objective. They also attempted to examine the role of generalization 1) Theoretically sound MI estimation 2) The authors present evidence that one can observe low mutual information without strong within class variation collapse, and variation collapse can occur even when mutual information remains high (as was known for deterministic networks). 3) they also measured that the relationship between generalization and compression is not causal. While the experimental design is solid and the question is important, the theoretical framing is not as rigorous. 1) The paper repeatedly refers to “Neural Collapse,” but only measures NC1 (within-class variance). The co-occurrence of NC1-NC2 are critical for a geometry to be called neural collapse (Thm 1 and 2 in papyan 2020). Only NC1 can include degenerative solutions. 2) Neural collapse also refers to when training accuracy goes to 100% (or plateau nearby 100%), did you observe that in your experiments? If not (low beta in the CEB objective, which may lead to compressing away even classification relevant information), it is hard to even say your model attained neural collapse. 1) Would you please clarify your definition of compression: whether it’s informative compression (e.g., late-phase IB) or trivial compression (e.g., untrained/noisy encoding)? 2) Is it possible to control for test accuracy to demonstrate that generalization is a confounder of compression and low MI between input and latent representations. Fully human-written
Geometric and Information Compression of Representations in Deep Learning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work examines whether information-theoretic compression, measured by the mutual information I(X; Z), implies geometric compression, quantified by Neural Collapse (NC). The authors find that the relationship is not reliable. Their theoretical and experimental results show that a decrease in mutual information does not necessarily lead to a more collapsed geometric structure. This work establishes the finiteness of mutual information in dropout networks employing analytic activation functions. It presents experiments across different architectures and datasets. The identification of generalization as a potential confounder in the relationship between compression and generalization. The theoretical toy model in Appendix illustrates why mutual information and neural collapse measures can diverge in practice. 1) The paper relies heavily on the accuracy of MI estimates, yet the justification for the chosen methods is somewhat brief. For CEB, the claim that the variational bound is "practically tight" is asserted but not thoroughly validated. The gap $\mathbb{E} [D_{KL}(p_{Z \mid(|) Y} \mid(|) q_{Z \mid(|) Y})]$ is assumed to be small due to co-training (line 247), but no evidence is provided to quantify this gap. 2) While the DoE estimator might be reasonable choice in some situations, its sensitivity and potential biases in the high-dimensional regimes of state-of-the-art models are not deeply discussed or ablated as far as I know. Thus, it would be more convincing to compare results against a wider suite of MI estimators (line 252). 3) I agree that I(X; Z \mid(|) Y) = I(X; Z) - I(Z; Y), but the claimed I(X; Z \mid(|) Y) \approx I(X; Z) in line 406 should be justified more rigorously. 4) By "geometric compression" the paper means the NC metric. While this is a well-established measure for class-separation geometry and used recently (lines 39-40), it is not the only possible measure. I think the work would be more insightful if it were expanded to include other geometric measures, such as intrinsic dimension, which is mentioned in passing (line 37), or others. To further validate the findings, it would be helpful to see results with other MI estimators and to also compute other geometric measures (e.g., intrinsic dimension). This would test whether the correlation with MI holds for geometric properties beyond Neural Collapse. Lightly AI-edited
Geometric and Information Compression of Representations in Deep Learning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper attempts to better elucidate the relationship between geometric quantities, such as the neural collapse of a neural network, and the information compression that network is capable of. In doing so this paper aims to compare between conditional entropy bottleneck models and models trained with "continuous dropout", and track to what degree neural collapse happens at the same times as information compression. They find that there are several differences between these different metrics but there is typically a negative relationship between information compression and neural collapse. The proof presented here is to my knowledge novel, and the proof is mathematically interesting and nontrivial. I think that the use of dropout as a way to introduce Stochasticity to allow for the information bottleneck theory to become sensible is also interesting. The paper also contains a detailed and clear related works section, which can help in reading the literature in this area. 1. The primary weakness with this paper is that it is not clear, from the empirical results provided, what the takeaway is. Is the takeaway intended to be that neural collapse and information compression are not very strongly or obviously related, as Fig. 3 seems to display? In that case is the purpose of the paper to display a null result (which I think is not an issue, but it should be stated as such)? 1. From Figure 1, what are we supposed to take from this? My impression is that the neural collapse is somehow orthogonal to the mutual information, is this the right way to interpret this? 2. Are there other ways to measure the mutual information that could provide with more stable estimates for the continuous dropout model? 3. Can you provide some more information as to what the data in Table 1 and 2 are coming from? Are these the combinations of the four considered setups here? Are there differences between them? Fully human-written
STDACN: a Spatiotemporal Prediction Framework based on Dynamic and Adaptive Convolution Networks Soundness: 2: fair Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes STDACN, a model for spatio-temporal prediction, which aims to improve upon static TCNs and GCNs. The method consists of two main components: (1) a high-order dynamic gated Temporal Convolutional Network (TCN) and (2) an adaptive dynamic Graph Convolutional Network (GCN). S1 - The paper proposes a lightweight model that shows comparable empirical results on traffic forecasting benchmarks when compared to significantly larger Transformer-based models. W1 - The paper suffers from a significant lack of novelty. The core concepts of dynamic gating convolutions and adaptive graph convolutions are well-established. The paper fails to position its contribution relative to prior methods adequately. It composes a pipeline of prior techniques rather than a new, principled approach. W2 - The empirical evaluation is flawed and insufficient to support the paper's claims. The paper omits comparisons to several state-of-the-art dynamic graph models that are directly relevant to this work. These omissions include DGCRN [1], a key work in dynamic graph-based forecasting, as well as more recent and high-performing methods like MSTFGRN [2] and SDSINet [3] [1] Luo, Xunlian, et al. "Dynamic graph convolutional network with attention fusion for traffic flow prediction." arXiv preprint arXiv:2302.12598 (2023). [2] Zhao, Wei, et al. "Multi-spatio-temporal fusion graph recurrent network for traffic forecasting." Engineering Applications of Artificial Intelligence 124 (2023): 106615. [3] Yang, Shiyu, and Qunyong Wu. "SDSINet: A spatiotemporal dual-scale interaction network for traffic prediction." Applied Soft Computing 173 (2025): 112892. Could you please confirm and correct the presentation errors? Q1 - Is Table 1 for METR-LA and PEMS-BAY? Q2 - Is Table 8 for PEMS03 and PEMS08? Q3 - Can authors provide details of Solar dataset? Lightly AI-edited
STDACN: a Spatiotemporal Prediction Framework based on Dynamic and Adaptive Convolution Networks Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes STDACN, a spatiotemporal prediction framework combining high-order gated TCN with recursive causality and adaptive GCN using diffusion convolution. - Addresses a relevant problem in spatiotemporal modeling - Combines temporal and spatial modules in a unified framework - Evaluation on real-world datasets (traffic, electricity) - The general motivation for moving beyond strict weight sharing is reasonable - Dynamic convolution via channel calibration (Eq. 2-5) resembles existing attention mechanisms like SENet and CondConv. The recursive gating appears as a minor modification of standard gated TCN. The distinction between this work and prior self-learning graph methods is unclear. - The GCN formulation (Eq. 6) is incomplete in the submission. Missing details on how temporal dynamics integrate with spatial adaptive learning. The "dynamic diffusion convolution" mechanism needs fuller explanation. - Experimental validation is insufficient with only two dataset types mentioned. No ablation studies demonstrate the contribution of individual components. Computational cost analysis is absent. The reported 1.2%-4.7% improvement range is ambiguous. - Theoretical analysis is lacking. No justification for why recursive gating helps or under what conditions. The choice of mish activation for gradient issues needs support. Hyperparameter selection (e.g., rt in Eq. 4) appears arbitrary. 1. How does your dynamic calibration mechanism differ from squeeze-and-excitation or dynamic convolution networks? Can you provide direct comparisons? 2. What is the computational overhead of the K-order recursive gating? How do you choose K? 3. The claim about "breaking weight sharing constraints"—can you formalize what constraint is broken and prove this improves expressiveness? 4. Can you provide ablation studies showing: (a) recursive gating vs. standard gating, (b) mish vs. tanh, (c) dynamic calibration vs. static kernels? 5. How does performance scale with graph size? What about computational complexity? Fully AI-generated
STDACN: a Spatiotemporal Prediction Framework based on Dynamic and Adaptive Convolution Networks Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces **STDACN**, a spatio-temporal forecasting model designed to overcome the limitations of **static weight-sharing** in conventional convolutional networks. The architecture is hierarchical, combining three main dynamic components: a **recursive high-order gated TCN** ($g^n$Conv), a **Dynamic Causal Temporal Convolution (DCTC)** that generates channel-wise calibration weights $\pi$, and an **Adaptive Dynamic GCN (SAGC)** that learns a time-varying adjacency matrix $\tilde{A}_{\text{adp}}$. Experiments across five datasets (METR-LA, PEMS-BAY, PEMS03/08, and Solar) demonstrate competitive performance and improved robustness compared to a broad range of established baselines. 1. **Motivation and Coherent Design.** STDACN directly addresses static weight-sharing/stationarity limitations in TCNs and GCNs, integrating temporal adaptivity (DCTC) and spatial adaptivity (SAGC) into a unified framework. 2. **Competitive Empirical Performance.** The model frequently ranks top-1 or top-2 across prediction horizons and diverse datasets compared to many baselines (GNN, TCN, Transformer families). 3. **Thorough Component Analysis.** Ablation studies, convolution type comparisons, and graph variant analyses clearly show the contributions of individual design choices. 4. **Demonstrated Robustness and Efficiency.** Anti-noise experiments and efficiency analysis (Table 6) support practical applicability with favorable tradeoffs relative to larger baselines. 1. **Limited Technical Novelty.** The architecture integrates existing concepts (dynamic filters, channel-wise attention, adaptive GCNs) without isolating clear novel contributions or justifying the choice of components. 2. **Ambiguity in Dynamic Mechanisms.** The operations for temporal calibration ($\pi$) and spatial coefficients ($\beta$) are unclear; it is not specified whether $\pi$ is per-channel, full kernel generator, or feature modulator. 3. **Inconsistent Notation & Reproducibility Gaps.** Key dimensions (e.g., $K$, $P$, $r$) are inconsistently defined, and details on optimizer, learning rate schedule, weight decay, and dropout are missing. 4. **Insufficient Baseline Tuning and Statistical Validation.** The hyperparameter search for baselines is not detailed, and no statistical significance tests confirm that improvements are reliable. 5. **Lack of Qualitative Interpretation.** No visualizations are provided for temporal evolution of $\pi(t)$ or adaptive adjacency $\tilde{A}_{\text{adp}}(t)$, which would help verify meaningful learning of dynamic components. 1. **Calibration Mechanism.** How is the calibration vector $\pi=\Pi(X)$ applied to the temporal kernels $W$? Please provide tensor shapes and a clear forward-pass equation for the DCTC block. 2. **Adaptive Adjacency.** What is the exact formula for $\tilde{A}_{\text{adp}}$? How is it normalized, and is it computed per time step or via learned static parameters? 3. **Regularization & Overfitting.** Which strategies (weight decay, dropout) are used to prevent overfitting given the added flexibility from dynamic kernels? 4. **Visualization Request.** Please provide plots of predicted vs. ground-truth traces, and visualizations of learned dynamic components, e.g., $\pi(t)$ and $\tilde{A}_{\text{adp}}(t)$, over periods including significant events (e.g., rush hour). 5. **Baselines and Tuning.** Describe the hyperparameter search protocol for baselines and confirm that each received comparable compute budget for fair comparison. Fully AI-generated
STDACN: a Spatiotemporal Prediction Framework based on Dynamic and Adaptive Convolution Networks Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a Spatiotemporal Dynamic Adaptive Convolution Network (STDACN) for forecasting complex spatiotemporal data such as traffic and energy flow. It integrates a high-order dynamic gated TCN to capture long-term temporal dependencies and an adaptive dynamic GCN to model time-varying spatial relationships. Dynamic calibration mechanisms allow the model to adjust convolutions and adjacency matrices adaptively. The paper presents a clear and technically sound framework that combines TCN and GCN with dynamic adaptive mechanisms. The integration of dynamic gating and adaptive graph convolution improves the model’s ability to capture time-varying spatiotemporal dependencies. Overall, the approach is well-motivated and contributes to advancing dynamic deep learning methods for spatiotemporal prediction. 1. The paper’s originality appears limited. The proposed dynamic temporal module closely resembles the TCN component in Graph WaveNet [1], with only minor modifications such as replacing the activation with Mish and introducing input-driven dynamic weight adjustment. Similarly, the dynamic GCN shares strong conceptual overlap with the adaptive adjacency mechanisms in Graph WaveNet, with limited methodological novelty. 2. Notation and formulation inconsistencies also affect clarity. The motivation for up/down-sampling and the use of the hyperparameter r_t are not clearly justified, and their effects are not analyzed experimentally. 3. Dataset usage and experimental reporting are incomplete: the Solar dataset is not introduced, while datasets mentioned in the abstract and introduce (e.g., population, electricity) do not appear in the experiments. 4. Baseline selection omits several strong and widely recognized spatiotemporal forecasting models (e.g., Graph WaveNet [1], PDFormer [2], HimNet [3], STD-PLM [4]). [1] Wu Z, Pan S, Long G, et al. Graph wavenet for deep spatial-temporal graph modeling[J]. arXiv preprint arXiv:1906.00121, 2019. [2] Jiang J, Han C, Zhao W X, et al. Pdformer: Propagation delay-aware dynamic long-range transformer for traffic flow prediction[C]//Proceedings of the AAAI conference on artificial intelligence. 2023, 37(4): 4365-4373. [3] Dong Z, Jiang R, Gao H, et al. Heterogeneity-informed meta-parameter learning for spatiotemporal time series forecasting[C]//Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining. 2024: 631-641. [4] Huang Y, Mao X, Guo S, et al. Std-plm: Understanding both spatial and temporal properties of spatial-temporal data with plm[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2025, 39(11): 11817-11825. 1. Could the authors clarify the definition of P in X \in \mathbb{R}^{F\times N\times P} and the dual use of k for both kernel size and recursion depth? 2. The Diffusion Convolution seems conceptually part of the GCN but appears in the TCN block in Figure 2 — could the authors clarify this architectural inconsistency? 3. What is the rationale for the up/down-sampling design in the TCN block, and how does the hyper-parameter r_t influence temporal aggregation? Why is it not analyzed in the hyper-parameter study? 4. Is the FC layer intended to map H to C_{in}? If so, why is \pi not shaped as C_{in} \times 1 \times \frac{P(l)}{r_t}? 5. Why is the Solar dataset omitted from the dataset description, and what are the “population” and “electricity” datasets mentioned in the abstract and introduce? 6. Could the authors justify the baseline choices and clarify whether comparisons were made against leading spatiotemporal models (e.g., GWNet, PDFormer, HimNet, STD-PLM)? 7. On which dataset were the component experiments conducted? 8. Please check the consistency of table titles and correct typographical errors across the paper. Moderately AI-edited
PreviousPage 7 of 1516 (75800 total rows)Next