ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Title Ratings Review Text EditLens Prediction
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 0: Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper introduces VCode, a benchmark framing multimodal understanding as generating SVG code from images and reasoning over the rendered output. It proposes CodeVQA to test whether SVG-based representations preserve semantic visual information and VCoder, which combines iterative code revision and vision-tool assistance. Experiments show gains over existing VLM coders but also reveal persistent weaknesses in fine-grained visual reasoning. 1. The idea of using SVG as an intermediate symbolic space for vision-language reasoning is conceptually novel and touches on an underexplored direction in multimodal representation. 2. The work incorporates test-time revision and tool-assisted perception, which reflects awareness of limitations in current models and attempts to address them through modular augmentation rather than purely scaling. 1. The evaluation protocol is fragile: SigLIP similarity offers weak guarantees on fine-grained structure, and CodeVQA depends on the answering model’s biases and failure modes, making correctness a function of the evaluator rather than the representation. This undermines reliability and fairness, which is critical for a benchmark. 2. The dataset is almost entirely repurposed from prior benchmarks without substantial new curation or justification for domain coverage, scale, or annotation quality. As a result, it is unclear whether the benchmark truly captures the core challenges of the proposed problem. 3. The approach lacks grounding in practical vision tasks or downstream applications, and the SVG abstraction remains unconvincing as a scalable representation (especially for natural images with complex textures, occlusions, or fine geometry). Additional empirical evidence or ablations are needed to justify that benefits outweigh the loss of fidelity and that this direction can generalize beyond small synthetic-like cases. Is the evaluation protocol reliable? Fully AI-generated
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces VCode, a benchmark for visual-centric multimodal coding that redefines multimodal understanding as the task of generating SVG code from images. The work is motivated from the observation that most multimodal and coding benchmarks focus on linguistic or pixel-based representations, whereas SVG provides a symbolic, interpretable, and executable visual abstraction. VCode covers three domains: 1) General commonsense (MM-Vet), 2) Professional knowledge (MMMU), and 3) Visual perception (CV-Bench). To evaluate the symbolic fidelity of SVG representations, the authors propose CodeVQA, where a policy model must answer questions about the rendered SVG image. They further introduce VCoder, an augmented agentic framework that enhances existing vision–language models (VLMs) through: 1) Thinking with revision: iterative refinement based on visual discrepancies between generated and target images 2) Acting with visual tools: using detectors, segmenters and OCR to provide structured cues like object boundaries with text. Empirical results show that leading VLMs (e.g., GPT-5, Claude-4-Opus, Gemini-2.5) struggle on visual coding tasks, while VCoder achieves an +8.7-point overall improvement over Claude-4-Opus. Human studies show that people reason more robustly over symbolic SVGs than raw images, suggesting that symbolic visual coding could be key to more human-like multimodal intelligence 1) The paper introduces a novel paradigm: treating image understanding as code generation (SVG rendering). 2) The VCoder framework combining iterative refinement and external visual tools aligns with recent trends in agentic model enhancement. 3) Experiments are comprehensive, covering both closed- and open-source VLMs with detailed ablations (revision loops, tool usage, modality inputs). 1) The dataset contains only 464 image–question pairs, which is small compared to major multimodal benchmarks. Although the repurposing from MM-Vet/MMMU/CV-Bench ensures diversity, it may limit generalization and statistical reliability of reported differences. 2) CodeVQA uses an external policy model (GPT-4o-mini) as evaluator. This introduces evaluation bias and circularity, especially since some tested models are from the same family. 3) While the paper argues that SVG captures symbolic abstraction, it lacks quantitative or theoretical grounding for what constitutes “symbolic fidelity.” E.g., there could be metrics for structural alignment (e.g., object counts, relative positions) alongside SigLip and VQA accuracy. 1) How sensitive are the CodeVQA scores to the choice of the policy model? Would results differ significantly if another evaluator (e.g., Claude-Sonnet or Gemini-Pro) were used? 2) Why was SVG chosen over other symbolic representations like scene graphs or DSLs for vector graphics? Could the same paradigm extend to 3D symbolic representations (e.g., CAD or mesh code)? 3) In Table 4, the Img2Text2SVG pipeline outperforms direct Img2SVG. Does this suggest that current models inherently reason better through language than through direct visual coding? Fully AI-generated
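Both VCode reviews above lean on SigLIP similarity between the rendered SVG and the source image as the fidelity signal under discussion. For readers unfamiliar with that metric, a minimal sketch of how such a score is computed is below; the checkpoint name and the Hugging Face API usage are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch: scoring a rendered SVG against the source image with SigLIP
# image embeddings, i.e. the kind of "SigLIP similarity" the reviews discuss.
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

model = SiglipModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

def siglip_similarity(original: Image.Image, rendered: Image.Image) -> float:
    """Cosine similarity between SigLIP image embeddings of two images."""
    inputs = processor(images=[original, rendered], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize each embedding
    return float((feats[0] * feats[1]).sum())         # cosine similarity in [-1, 1]
```

As the first review notes, a high embedding similarity of this kind says little about fine-grained structure, which is exactly the fragility being criticized.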
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a novel visual-centric benchmark that reframes multimodal understanding as code generation, and proposes an evaluation protocol in which a policy model answers questions over the rendered SVG. The paper finds that there is a persistent gap between language-centric and visual-centric coding, so it proposes an agentic framework that provides VLMs with two abilities: (1) thinking with revision; (2) acting with visual tools. The experimental results show that the proposed framework achieves a significant improvement on the visual-centric benchmark. 1. Extending language-centric coding to a new visual-centric coding task is an interesting and novel research direction. 2. This paper converts the multimodal understanding task into a visual-centric coding task and utilizes a vision-language model (VLM) to evaluate whether the generated code is an adequate and faithful visual representation. 3. The proposed VCoder framework is equipped with two capabilities: thinking with revision and acting with visual tools. Experimental results demonstrate the effectiveness of the proposed method. 1. The dataset in this paper was not processed; it simply used the original images and QA from MM-Vet, MMMU, and CV-Bench. Since the SVG code is entirely generated by the VLM being evaluated, the authors only proposed SVG code generation as a benchmark approach. This benchmark does not design a unified principle for SVG code generation to guide subsequent VLM generation. The lack of a unified principle for SVG code generation can easily lead to instability in the generated code, resulting in unstable code evaluation. 2. While CodeVQA evaluates the accuracy of code generation, I believe there are two issues: First, code generation itself is a capability that needs careful evaluation; it should not be confused with multimodal understanding. For example, minor issues with the SVG code might cause rendering failures, leading to poor results, but this does not necessarily mean the model's understanding is flawed. Second, CodeVQA is easily influenced by text input. When the policy model struggles to obtain effective information from the rendered image, it may output an answer based on publicly available knowledge from the text input. 3. CodeVQA is disadvantageous for small models (e.g., models with around 7B parameters or fewer) because these small models have difficulty generating compliant SVG code, resulting in poor final evaluation results. However, in reality, these small models have already achieved very good multimodal understanding results. 1. Refer to the issues raised in the weaknesses section. 2. The authors lack experimental results on other baseline models. Lightly AI-edited
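The second weakness above (rendering failures being conflated with understanding failures, and the policy model falling back on text-only knowledge) is easier to see against a concrete sketch of a CodeVQA-style check. The renderer call and the `answer_with_policy_model` helper below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a CodeVQA-style check: render the model-generated SVG, then
# ask a policy model a question about the rasterized result.
import cairosvg

def answer_with_policy_model(image_path: str, question: str, choices: list[str]) -> str:
    """Hypothetical stand-in for the VLM evaluator call (e.g. an API request)."""
    raise NotImplementedError("plug in the policy model of your choice")

def codevqa_item(svg_code: str, question: str, choices: list[str]) -> str:
    """Render model-generated SVG, then query the policy model about the raster."""
    try:
        cairosvg.svg2png(bytestring=svg_code.encode("utf-8"), write_to="rendered.png")
    except Exception:
        return "RENDER_FAILURE"  # scored as wrong, whatever the model understood
    return answer_with_policy_model("rendered.png", question, choices)
```

A render failure scores zero regardless of whether the underlying perception was correct, which is the conflation the reviewer points out.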
Don't Shift the Trigger: Robust Gradient Ascent for Backdoor Unlearning Soundness: 2: fair Presentation: 4: excellent Contribution: 4: excellent Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors identify and systematically demonstrate the problem of trigger shifting in traditional gradient ascent, and propose robust gradient ascent algorithm to alleviate the phenomenon. (1) The observation of trigger shifting is very novel (2) The proposed unlearning method that avoids trigger shifting is elegant (3) The proposed evaluation metric for this observation is well-defined (4) The motivation is well presented (especially in Figure 1) (1) The observation of trigger shifting is valuable; however, your theoretical explanations do not seem comprehensive. For example, intuitively, the trigger shifting issue is less severe as the number of parameters of models increase, but this is not reflected in your formula. (2) Following (1), robust gradient ascent is not necessarily the best approach to resolve the issue. (3) More advanced/SOTA models, such as the Qwen3 families should be evaluated to justify whether your observation holds with models exposed to more training data, or have more parameters. Minor issue: Table 2 is really hard to read See weakness. Fully human-written
Don't Shift the Trigger: Robust Gradient Ascent for Backdoor Unlearning Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes Robust Gradient Ascent (RGA), a novel framework that enhances the stability and reliability of GA-based backdoor unlearning. Specifically, this paper shows that GA does not necessarily eliminate the backdoor effect but can instead redirect it to a new backdoor effect. To address this, this paper proposes RGA, which introduces a dynamic penalty mechanism that adaptively regulates the strength of GA during backdoor removal. Extensive experiments demonstrate that RGA effectively eliminates backdoors without trigger shifting, while preserving model utility, and offers a more reliable GA-based defense against backdoor attacks. 1. This paper makes a significant contribution by presenting the discovery of trigger shifting. Based on this, this paper proposes RGA, with its innovative dynamic penalty mechanism based on KL divergence, as a principled and effective solution. 2. This paper conducted comprehensive experiments across diverse datasets, models, and attack methods. The introduction of the PACC and ΔPACC metrics is a novel and essential contribution. 3. The paper is well written and clearly structured. 1. This paper does not discuss or compare other potential penalty functions (e.g., linear decay, step functions, or other divergence measures) to prove that the design of this dynamic penalty mechanism is optimal. 2. This paper does not report a runtime comparison with baseline methods other than retraining. Please refer to the weaknesses. Lightly AI-edited
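For readers who want the mechanics behind the "dynamic penalty mechanism based on KL divergence" mentioned in these reviews, a minimal sketch of a gradient-ascent unlearning step with a KL restraint is below. The placement of the penalty, the frozen reference model, and the fixed `lam` are illustrative assumptions; how that weight is adapted during unlearning is precisely the paper's contribution and is not reproduced here.

```python
# Hedged sketch of GA-based unlearning with a KL penalty, in the spirit of what
# the reviews describe; not the paper's exact objective or schedule.
import torch
import torch.nn.functional as F

def rga_style_step(model, ref_model, batch, optimizer, lam=1.0):
    """One unlearning update on detected poisoned samples."""
    inputs, labels = batch
    logits = model(inputs)
    with torch.no_grad():
        ref_logits = ref_model(inputs)                   # frozen pre-unlearning model
    ascent = -F.cross_entropy(logits, labels)            # gradient ascent on the poisoned loss
    drift = F.kl_div(F.log_softmax(logits, dim=-1),
                     F.softmax(ref_logits, dim=-1),
                     reduction="batchmean")               # penalize drifting from the reference
    loss = ascent + lam * drift                           # lam is adapted dynamically in RGA
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```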
Don't Shift the Trigger: Robust Gradient Ascent for Backdoor Unlearning Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper identifies a trigger shifting problem in the gradient ascent (GA) based backdoor unlearning methods in the text classification domain. Then, it proposes a GA-based unlearning method (RGA) with a decaying loss and L2 regularization to prevent trigger shifting. The paper shows the effectiveness by using three backdoor attacks with three datasets on three models. - The paper shows a side-effect of unlearning that is previously overlooked. - The paper presents a clear threat model and the problem. - The paper uses CUBE (backdoor detection method) by assuming unknown poison samples to reflect the real-world scenarios before unlearning the backdoor by the proposed method. - The paper is well written and easy to follow. - The attacks in the experiments (BadNets, AddSent, HiddenKiller) are not recent. The papers do not consider the recent attacks, such as, [1] https://aclanthology.org/2024.findings-acl.468/ [2] https://arxiv.org/abs/2412.18975 - The models used in the experiments are also not recent. The effectiveness of RGA in recent models is not known. - In Section 4.2, the multi-class classification case is referred to Appendix A.2 theoretically. The paper does not show empirical results except AG (4 classes). - If the trigger overlaps with the frequent concepts, there may be some collateral damage. The effectiveness of RGA on multi-class classification is not clear, and whether there is collateral damage on multi-class classification. Minor: - In Section 6.1 (Unlearning Baselines), the items do not follow a parallel structure. The (1) and (2) are noun phrases, and the third one (3) is a sentence. - The numbers in Tables 2, 7, and 8 are too small. - In Table 6, DGA is better in some cases, especially in the HiddenKiller case, where the gap is significant. Is it related to the size of the dataset? Why is regularization bad in this case? - Have you tried using actual poisoned examples rather than the ones detected by CUBE? If so, what is the performance gap? - Does trigger shifting exist in other modalities, such as image? - Can RGA be applied to generative tasks in LLMs? Fully human-written
Don't Shift the Trigger: Robust Gradient Ascent for Backdoor Unlearning Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses the problem of removing backdoor effects from poisoned language models through a new method called Robust Gradient Ascent (RGA). The authors first identify and formalize the trigger-shifting problem (a situation in which the backdoor is not eliminated but merely moved into another class) that arises when applying unlearning approaches to remove the influence of backdoor attacks. To mitigate this, the proposed RGA introduces an adaptive penalty term that dynamically modulates the unlearning process to prevent divergence and preserve model utility. Experimental evaluations are conducted on three text classification datasets, three model architectures, and under three distinct backdoor attacks. The results demonstrate that RGA stabilizes the unlearning process and prevents trigger-shifting. The paper is well written and organized. The motivation is clearly stated, the mathematical formulation is elegant, and the proofs are concise and well integrated into the appendix. The methodology is conceptually simple yet effective. The adaptive penalty strategy in RGA is a neat and intuitive idea that provides theoretical grounding for stability and convergence while maintaining interpretability. The experimental setup is comprehensive and convincing. The authors evaluate across multiple datasets, architectures, and attack types, providing a strong empirical validation. Each research question is well defined, and the experimental section is systematically designed to address it. **Multiclass setup.** The experimental evaluation is limited to two-class or, at most, four-class classification problems. The theoretical framework for addressing trigger shifting is well motivated in the binary case, where the gradient ascent direction can be clearly interpreted as moving away from one class and toward another. However, it remains unclear whether this analysis holds when extending to a larger number of classes, where the gradient effects may be distributed across multiple class directions, potentially reducing the impact or stability of the unlearning process. To solve this limitation, the authors could include an additional experiment on a synthetic dataset with an increasing number of classes, or on a more complex multi-class setting, to examine whether the dynamics of RGA generalize beyond the binary scenario. **Assumption on access to poisoned samples**. A minor recommendation is related to the assumption that the model owner has access to the poisoned data samples ( D_p ). This assumption is explicitly stated early in the text but not reinforced in Section 5, where the methodology is presented in detail. **Minor issues**: - The abstract Figure 1 is visually appealing but not very clear at first glance. The figure could benefit from a simplified layout and clearer labels to better convey the main conceptual flow. - The notation ( f_\theta(y|x) ) could be simplified to ( f_\theta(x) ) throughout the paper. Since the method does not explicitly model conditional distributions in the optimization, enforcing the conditional notation adds unnecessary complexity and may confuse the reader. 
How does RGA behave when the detected poisoned dataset ( D_p ) contains false positives or false negatives? Does the adaptive regularization remain stable or does it diverge under misidentified samples? Fully AI-generated
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions? Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes the Inverse IFEval benchmark to measure models’ capability to generalize to OOD instructions. It includes evaluation results of different models on the benchmark tasks and some analysis of why models perform better or worse. The paper has a strong motivation – language models may struggle to follow instructions that fall outside their training distribution, which can constrain their use cases. The authors develop a benchmark to evaluate this shortcoming over 8 different instruction categories and evaluate several open and closed source models. The paper is also structured and written clearly. While the benchmark has a reasonable motivation (measuring OOD instruction-following), it is not clear that the tasks in the benchmark would provide a useful real-world signal of progress on this failure case. Including more tasks such as “question correction” which can have real-world applicability would strengthen the contribution. The paper benchmarks many models on the proposed tasks, but the analysis feels too succinct and lacking insight. There could be a more thorough discussion of the impact of instruction tuning and why reasoning models are better than non-reasoning models. Following OOD instructions can also lead to harmful responses which we seek to prevent through alignment. A discussion of this concern and how to balance model alignment w/ OOD instruction following feels lacking. - To better motivate the benchmark, would be good to show at least one compelling example of a real-world problem where a model fails OOD due to instruction-following. - How were the categories for this benchmark selected? Please provide more detailed motivation. - While the authors mention that the benchmark is similar to an “IQ-test” and doesn’t directly have relevance in terms of task utility, it would be valuable to include a few tasks that are more meaningful or demonstrate that performance on these tasks is correlated with performance on more meaningful tasks so that practitioners are more motivated to use the benchmark. - Responding to out-of-distribution prompts or instructions is not always desirable and can raise concerns of safety and model alignment. It would be good for the paper to discuss this concern and describe a way to balance the need for model alignment (e.g. not responding to toxic or harmful queries) and model flexibility (responding in OOD formats). How can we measure that a model strikes this balance? - section 2.5: Please clarify how the 88% and 98% accuracy is computed? and on what dataset and sample size? - section 2.5: Using LLM-as-a-judge for evaluation on the benchmark is useful, but getting human annotations on a subset of tasks/questions would make these results stronger. - section 3.1: Overall these findings are not too surprising or insightful. It is intuitive that thinking models or larger models would tend to perform better with OOD formats. It is a bit surprising that instruction-tuned models tend to perform worse as this stage should make the model more prompt or formatting agnostic. 
Including a discussion / analysis on why this is + results on a base model and its corresponding instruction tuned version would be helpful. - section 3.2.1: Some more analysis on why thinking improves performance on the benchmark would be valuable. - section 3.2.2: This comparison of performance across instruction types also feels cursory. Why are some categories harder than others? Fully human-written
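The question above about how the reported 88% and 98% judge accuracies were computed usually comes down to agreement with human verdicts on an audited subset. A toy sketch with hypothetical field names and sample data:

```python
# Hedged sketch of how an LLM-judge accuracy figure is typically measured:
# agreement with human verdicts on a labeled audit subset.
def judge_accuracy(audit_items):
    """Fraction of audited items where the LLM judge matches the human verdict."""
    agree = sum(int(it["judge_ok"] == it["human_ok"]) for it in audit_items)
    return agree / len(audit_items)

audit = [  # hypothetical audit subset with paired human/judge verdicts
    {"id": 1, "human_ok": True,  "judge_ok": True},
    {"id": 2, "human_ok": False, "judge_ok": True},   # judge over-credits the model
    {"id": 3, "human_ok": False, "judge_ok": False},
    {"id": 4, "human_ok": True,  "judge_ok": True},
]
print(f"judge-human agreement: {judge_accuracy(audit):.0%}")  # 75% on this toy set
```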
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions? Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces Inverse IFEval, a benchmark that probes whether LLMs can follow counterintuitive instructions rather than defaulting to habits learned during SFT. The authors motivate a new evaluation dimension called Counterintuitive Ability and organize 8 instruction types such as question correction, intentional textual flaws, code without comments, counter-conventional formatting, deliberately incorrect answers, instructional induction, mid-turn instruction modification, and counterfactual answering. The dataset contains 1,012 items in Chinese and English across 23 domains, built through an expert-seeded, human-in-the-loop pipeline with LLM generation and careful filtering. Experimentally, the study evaluates many closed and open models and finds that an o3 model leads the results, that thinking variants outperform non-thinking ones, and that reduced thinking budget models trail their full budget counterparts. Fine-tuned models perform worse on this benchmark, consistent with its goal of testing out-of-distribution instruction following. The authors also analyze test-time compute and show a clear Best-of-N effect, where increasing N from 1 to 16 to 32 raises scores, and several models approach or pass 90 at N equals 32. Together, these results suggest the benchmark exposes failure modes not captured by prior instruction following tests while leaving room for gains with better post-training. 1. Counterintuitive instructions provide a strong stress test of IF capability, revealing habitual biases and mode locking that standard prompts miss. This perspective fills a gap in existing evals. 2. The benchmark is carefully curated and well-documented, with bilingual coverage and diverse task types that the community can readily reuse. Its transparent construction and release improve reproducibility. 3. The authors evaluate a broad range of models and add thoughtful analyses of thinking modes and test-time compute. The comparisons yield actionable insights on trade-offs and failure modes. 1. **Insufficient annotation detail.** The authors lack a detailed description of the annotation process. For example, what is the detailed background of the "experts," and are they affiliated with the authors' team? What is the specific inter-rater agreement rate? Were there multiple rounds of iteration, and what was the proportion of data modified in each round? Also, clarify whether a written guideline existed and, if so, summarize key rules and provide examples of borderline cases and resolution criteria. Adding these details would give us a clearer understanding of the data quality. 2. **Unexplained outliers in results.** Some results appear to be strong outliers, for example Qwen3-235B-A22B-Thinking on QC English at 6.53 and DeepSeek-V3.1 on DIA English at 0.00. An analysis that probes these anomalies and discusses possible causes would be helpful. 3. **Limited error analysis.** The error analysis is mostly a few appendix case studies. 
A more systematic treatment that categorizes error types, quantifies their frequencies, and relates them to model families and test-time settings would make the findings more actionable. See the above weaknesses. Lightly AI-edited
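The Best-of-N trend noted in this review (scores rising as N goes from 1 to 16 to 32) corresponds to the usual any-of-N protocol; a minimal sketch with `generate` and `judge_passes` passed in as hypothetical stand-ins for the model call and the LLM-as-a-judge check:

```python
# Hedged sketch of Best-of-N scoring: an item counts as solved if any of the
# n sampled responses passes the judge.
def best_of_n_score(items, generate, judge_passes, n=32):
    """Fraction of items solved by at least one of n sampled responses."""
    solved = 0
    for item in items:
        candidates = [generate(item["prompt"]) for _ in range(n)]
        if any(judge_passes(item, cand) for cand in candidates):
            solved += 1
    return solved / len(items)
```

Note that this measures whether a compliant response exists somewhere in the sampling distribution, not reliability at N = 1, which is consistent with the review reading the gains as headroom for better post-training rather than a solved problem.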
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions? Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses the cognitive inertia of Large Language Models (LLMs)—their struggle to follow counterintuitive instructions conflicting with training conventions—by proposing Inverse IFEval, a novel benchmark. It introduces "Counterintuitive Ability" to measure LLMs’ capacity to override training-induced biases. Inverse IFEval includes 8 challenging task types (e.g., Question Correction, Deliberately Incorrect Answers, Counter-Conventional Formatting) and a 1012-sample dataset , built via a human-in-the-loop pipeline (observation→seed construction→large-scale generation→automatic filtering→human verification). An optimized "LLM-as-a-Judge" framework, with task-specific judge models and refined prompts, achieves 98% evaluation accuracy. Experiments on various LLMs (closed-source like o3-high, open-source like Qwen3) reveal: thinking models outperform non-thinking ones, larger models perform better, and fine-tuned models lack flexibility. The work fills gaps in LLM evaluation, serving as a diagnostic tool and guiding future alignment to enhance LLMs’ reliability in real-world unconventional scenarios. The idea is neat and interesting. The paper’s emphasis on Large Language Models (LLMs)’ ability to comply with adversarial instructions that conflict with training inertia fills the gap in the existing LLM evaluation landscape regarding the "unconventional instruction-following" dimension. It further complements instruction-following benchmarks of the IFEval category, and to some extent, provides a `lower bound' for instruction-following capability assessment. Additionally, several experimental findings are particularly insightful: for instance, instruction-finetuned models exhibit a decline in long-tail instruction-following capability, which can be attributed to their overfitting to the paradigms of Supervised Fine-Tuning (SFT). Even though the paper provides a critical extension to instruction following evaluation, there are some aspects that I think the paper can further improve on. Firstly, the relevance between the benchmark’s tasks and real-world scenarios requires enhancement. Some counterintuitive instructions designed in the work are overly extreme and not aligned with actual user long-tail needs (e.g., deliberately generating texts with a fixed number of typos), which may lead to a discrepancy where models performing well on the benchmark still struggle to address practical unconventional requirements. Then, the analysis of the root causes of models’ cognitive inertia is insufficient. The paper only observes surface phenomena—such as the underperformance of fine-tuned models and the superiority of thinking models—but fails to delve into how core factors (e.g., supervised fine-tuning data scale, reinforcement learning with human feedback (RLHF) feedback types, or model architectural characteristics) shape cognitive inertia, making it difficult to fundamentally explain the mechanisms behind the observed performance differences. Thirdly, the error analysis could be improved with in-depth analysis. 
While the paper presents isolated error cases of several representative models (e.g., Claude-4-Opus, Doubao-1.6-Thinking), it does not summarize common error patterns across different instruction types or explore cross-linguistic error regularities, limiting the insights into models’ systematic weaknesses in counterintuitive instruction-following. Additionally, if the pos training phase integrates counterintuitive instruction training into their development pipelines, would the current 8 task types in Inverse IFEval lose their ability to effectively discriminate model capabilities? Do you have any idea how to avoid this problem? NA Lightly AI-edited
DisCo-Layout: Disentangling and Coordinating Semantic and Physical Refinement in a Multi-Agent Framework for 3D Indoor Layout Synthesis Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. DisCo-Layout proposes a multi-agent framework for indoor layout synthesis: a planner derives grouping and placement order from text and available assets; a designer produces initial object poses conditioned on a top-down view; and an evaluator uses structured VQA to check semantic and physical compliance. If criteria are not met, two decoupled refinement tools are invoked—SRT (semantic refinement) to minimally correct relations/orientations, and PRT (physical refinement) that uses grid matching to remove collisions/out-of-bounds issues and align objects to walls when needed. This generate → evaluate → targeted (semantic/physical) refinement loop improves physical feasibility while preserving semantic consistency. Experiments across various room types and baselines show zero physical violations with equal or better semantic scores, and ablations confirm the necessity of both refiners and the evaluator. 1. The method follows a transparent sequence—planning → group-wise layout generation → evaluation → semantic refinement → physical refinement—with well-defined module boundaries, which makes it easy to reproduce and extend. 2. Semantic errors are handled only by the SRT, and physical errors only by the PRT, so the VLM focuses on a single objective per step, reducing interference in model judgments. 3. Room-defining, large objects are placed first, followed by ancillary and decorative items. This ordering allows each group to be refined in a targeted way, avoiding unnecessary global rearrangements. 4. The PRT discretizes the floor into valid grid positions and “snaps” violating objects to feasible cells, which reliably removes collisions, out-of-bounds cases, and wall-attachment violations, and is more controllable than open-ended generative adjustments. 1. The pipeline assumes the availability of structured 3D assets (with geometry/bounds), which limits applicability when such assets are absent or incomplete. 2. Heavy reliance on VLM/LLM components. Planner, designer, and evaluator all depend on the capability and stability of the underlying model; the paper does not thoroughly analyze robustness to weaker or shifted models. 3. One-way semantic→physical pipeline. The system first fixes semantics and then applies physical refinement; the physical stage selects the nearest valid grid but does not re-validate semantics afterward. In cases where multiple physically valid placements exist but only a subset preserves the intended semantic relations (e.g., chair-side arrangements around a table), the current design cannot guarantee the semantically preferable choice. 4. Limited experimental scale. Experiments are mainly on 9 categories and 45 scenes, which shows promise but is still small for strong claims about generalization to richer room layouts. 5. No vertical/small-object reasoning. Evaluation is essentially 2D/top-down; vertical relations and placement of small items (on tables, in shelves, stacked) 1. How long does it take to generate a room, and how many tokens are consumed? 
End-to-end generation time and token count might be acceptable for a single pass, but the step-by-step setting needs more detailed reporting. 2. How strong is instruction following? If a user specifies spatial relations directly in the input, what results do you obtain? 3. In principle, agentic approach should perform better in irregular rooms. Can you evaluate some cases? Moderately AI-edited
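The grid-matching step this review credits for removing collisions and out-of-bounds placements can be sketched compactly: discretize the floor, drop infeasible cells, and snap the violating object to the nearest remaining one. The axis-aligned footprints, geometry checks, and cell size below are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of the grid-matching idea behind a Physical Refinement step.
import itertools
import math

def snap_to_valid_cell(obj_xy, obj_size, room_wh, placed, cell=0.25):
    """Nearest collision-free, in-bounds grid position for an axis-aligned footprint.

    placed: list of ((x, y), (w, h)) footprints already in the room.
    Returns (x, y) or None if no feasible cell exists.
    """
    room_w, room_h = room_wh
    half_w, half_h = obj_size[0] / 2, obj_size[1] / 2
    xs = [half_w + i * cell for i in range(int((room_w - 2 * half_w) / cell) + 1)]
    ys = [half_h + j * cell for j in range(int((room_h - 2 * half_h) / cell) + 1)]
    best, best_dist = None, float("inf")
    for cx, cy in itertools.product(xs, ys):
        overlaps = any(abs(cx - px) < half_w + pw / 2 and abs(cy - py) < half_h + ph / 2
                       for (px, py), (pw, ph) in placed)
        if overlaps:
            continue  # cell would collide with an already placed object
        dist = math.hypot(cx - obj_xy[0], cy - obj_xy[1])
        if dist < best_dist:
            best, best_dist = (cx, cy), dist
    return best
```

Because "nearest feasible cell" is a purely geometric criterion, a snap can land on a physically valid but semantically worse position, which is the one-way semantic-to-physical concern raised in the weaknesses.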
DisCo-Layout: Disentangling and Coordinating Semantic and Physical Refinement in a Multi-Agent Framework for 3D Indoor Layout Synthesis Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a multi-agent framework for 3D scene layout synthesis. It involves collaborations between three agents, namely planner, designer and evaluator. The overall pipeline is fairly straightforward, and the proposed method seems to work effectively for the examples shown. However, the tasks performed by individual agents are fairly straightforward, and often considered in earlier works (apart perhaps not using VLM), so the technical novelty of the method is limited. Relying on a multi-stage process could lead to the system being less robust, as failure in one step could lead to failure of the whole system (there are some mechanisms for correction such as the evaluator, but even this agent can make mistakes). The overall pipeline is reasonable and appears to produce plausible results, by exploiting the capabilities of VLMs. The design is fairly straightforward, and the individual agents are mimicking steps in traditional or deep learning pipelines. The collaboration between agents is fairly limited, and there are no theoretical guarantee that mistakes can be effectively corrected. The paper thus has limited technical novelty. Having a pipeline with multiple (fragile) steps could mean the overall system is not robust. For example, although the file refinement can help address some issues, there are no guarantees that a plausible solution can be found as changes may have to be applied from the very beginning of the coarse layout). The layout is represented as an image for the VLM. However, it cannot effectively handle cases where objects overlap from the top-down view, some of which are valid and others may not be so (e.g. objects can be placed on a table). The individual agents are nearly hard-coded for 3D indoor layout synthesis, and the proposed pipeline has limited potential inspiration for broader applications. There are also quite a few typos in the paper, e.g. on Page 3, 3D indoor layout synthesis have => ... has. How does the method handle failures of individual agents? What is the success rate in practice? Fully human-written
DisCo-Layout: Disentangling and Coordinating Semantic and Physical Refinement in a Multi-Agent Framework for 3D Indoor Layout Synthesis Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents DisCo-Layout, a framework for room layout generation. DisCo-Layout uses several VLMs: a planner, which derives high-level placement rules; a designer, which predicts initial 2D coordinates; and an evaluator, which assesses the layout for further refinement. It also develops the Semantic Refinement Tool (SRT) and Physical Refinement Tool (PRT) to correct local errors made by the designer. The authors evaluate the method on 45 indoor scenes and claim some improvements over baselines. 1. The paper is well-written and easy to follow. All components are clearly described. 2. Splitting the agents into planner and designer makes sense. Grouping furniture into groups and optimizing their locations together also seem to be a good idea. - Problem with the evaluation of the proposed method: 1) room layout has a high degree of freedom and is very hard to judge automatically. In this paper, no human evaluation provided, and the qualitative results are very limited. 2) Even out of the limited qualitative results, results are not promising. For example, Fig4, row4: the bottom half of the room is basically empty, which is not realistic. 3) In general, for all the test cases presented in the paper, I find the prompts being not specific enough, and the constraints being too easy to fulfill given the large size of the room and very few furniture to put in. These make the proposed method seem unnecessarily complicated, as the constraints are likely to be fulfilled by some heuristic-based methods (such as the ones used in Infinigen [1]), without even involving VLMs. Given the presented results, I'm not convinced that the proposed method is better than baselines such as Holodeck. I suggest that the authors to provide human studies, and provide more visualizations on rooms which much more complicated instructions. - Why is the evaluator necessary? From my understanding, the evaluator checks if the orientations of the placed furniture are correct, and if they are collision-free and lie within the room boundary. I think all these can be easily checked by hand-designed rules, why do you need VLM invlolved here? - Semantic refinement and Physical Refinement: the refinement could solve collisions when the rooms are very empty, but it becomes way harder to resolve when the placement of furniture becomes compact. It could be impossible for the model to adjust the local placement to satisfy all constraints. I.e., some global re-adjustment would be necessary at some moment. This global adjustment is common when humans design their room layout. Does your method have any stopping criteria when the refinement fails, and does it have any fall-back plans? - Overall, the contribution and the novelty of the paper are limited. The papers chain several off-the-shelf VLM agents together, which is not novel. The core components such as Physical Refinement is basically placing objects on regular grids to avoid intersections, and seems to be a liitle bit too trivial to be a solid contribution. 
- (minor) typo: L074 laxyout -> layout [1] Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation N/A Fully human-written
DisCo-Layout: Disentangling and Coordinating Semantic and Physical Refinement in a Multi-Agent Framework for 3D Indoor Layout Synthesis Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents a method for synthesising furniture layouts for rectangular rooms. It uses pretrained VLMs to assemble a multi-agent pipeline – a planner decides a set of furniture and constraints on placement; a designer refines this to exact layouts; an evaluator checks whether all constraints are satisfied; and a refinement stage improves placements where objects are not in an ergonomically-sensible arrangement. The evaluator and refinement tools are used to provide feedback to the designer, improving its layout iteratively. The method is tested on 45 prompts for nine room types. Quantitative results exceed existing LLM/VLM-based furniture layout methods. The overall pipeline, i.e. the combination of agents, their respective tasks, and how they communicate, is novel. The idea of using VLMs for layout generation is a sensible one, and having separate critics is also logical. The separation of semantic from physical concerns (i.e. whether objects are placed ergonomically to form a functional room, versus whether there are interpenetrations, floaters, etc.) is a good idea. Experiments are conducted on diverse prompts for nine room types, including several unusual ones for which the benefit of pretrained VLMs is clearer (since large training datasets would be hard to obtain). For evaluation, both physical and semantic correctness are measured; the latter is scored by another VLM. Quantitative results against three fairly recent baselines (LayoutGPT, Holodeck and LayoutVLM) show that the proposed method generally out-performs them across room types. Prompts vary in levels of detail, some simplify specifying a room type, others giving details of the furniture. In general adherence to the prompts is fairly good in the qualitative examples presented in the manuscript, and better than the selected LLM/VLM-based baselines in these cases. There is an ablation study measuring the benefit of certain components, showing that each of them yields a noticeable contribution either to semantic or physical quality. The paper is clear, well-structured, and pleasant to read. This is a complex pipeline that yields underwhelming results. In particular, the qualitative results in Figure 4 are greatly lacking realism – half rooms left empty, sparse layouts, and objects placed somewhat randomly around the edge of the room. Results are considerably worse than Merrell's work from 2012, and other more recent methods that are learnt from datasets (ATISS, etc.). Claims of "realism" and "coherence" are manifestly too strong. While I appreciate the proposed approach is intended to be training-free, these datasets and methods nonetheless exist and should be seen as defining state-of-the-art. There are no related works prior to 2021 referenced, despite various methods existing dating back to at least 2011 (e.g. 
Interactive furniture layout using interior design guidelines, Merrell, SIGGRAPH 2011; Make it home: automatic optimization of furniture arrangement, Yu, TOG 2011; Automatic Generation of Constrained Furniture Layouts, Henderson CoRR 2017; Human-centric indoor scene synthesis using stochastic grammar, Qi CVPR 2018), among others. The actual technical contribution is fairly small. Multi-agent systems for refining layouts etc. are now fairly well established, and the specific extensions here are small (without interesting technical novelties) and very domain-specific (hence not of great interest to the broader community). The overall 'recipe' for the pipeline also does not seem to yield many interesting or transferable insights for general readers in the ICLR community. The problematic issues that require addressing are mentioned under "Weaknesses" above. In particular, for me to favor acceptance, the paper would have to demonstrate considerably more impressive results, particularly given the complexity of the pipeline. It would also be good to describe genuine technical contributions in the introduction (i.e. fundamentally novel techniques, insights, etc.), rather than just "we built a pipeline" and "we evaluated it". Fully human-written
Training-Free Adaptive Frame Selection for Video-Language Understanding Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper proposes CoSeLECT, a training-free adaptive frame selection method that efficiently selects the most informative frames from a large pool by combining query relevance and temporal continuity. This approach achieves better performance than existing training-free methods on multiple video understanding benchmarks. The paper presents a clear and practical training-free frame selection method that combines query relevance and visual continuity in a straightforward manner. While the individual components are not novel, their combination into an adaptive, query-aware selection pipeline is sensibly designed and effectively executed. The method is well-described, easy to reproduce, and evaluated across multiple benchmarks with ablations that support key design choices. Its significance lies in offering a lightweight, plug-and-play solution that improves efficiency and performance for video understanding with MLLMs without requiring model retraining. This is useful for real-world deployment, though not theoretically groundbreaking. - The novelty of the paper is limited. Query-aware frame selection for Videl-LLM is not innovative, as discussed in KeyVideoLLM [1], AKS [2], and Q-Frame [3]. The paper pointed out that `these heavier methods are typically limited to sparsely pre-sampled frame pools` is interesting, but the experiment did not support the solution of this problem. - It is not clear whether the comparison of the experimental results in the paper with other training-based methods is fair. - The paper claims the proposed CoSeLECT is lightweight. However, there is a lack of systematic evaluation of latency and computing consumption, which is crucial for actual deployment. > but in its lightweight, principled fusion of two readily available ones—frame–text similarity for semantic relevance and inter-frame similarity for temporal continuity. - Lack of discussion of limitations. - Minor weaknesses - The first equation in Section 3.4 is missing a number - $\sqrt{D_i}$ in equation (4) lacks definition [1] Liang H, Li J, Bai T, et al. Keyvideollm: Towards large-scale video keyframe selection[J]. arXiv preprint arXiv:2407.03104, 2024. [2] Tang X, Qiu J, Xie L, et al. Adaptive keyframe sampling for long video understanding[C]//Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 29118-29128. [3] Zhang S, Yang J, Yin J, et al. Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs[J]. arXiv preprint arXiv:2506.22139, 2025. - I am confused about the avoidance of redundant calculations mentioned in the article in line 204. Although LLaVA-OneVision uses SigLIP-So400M-patch14-384 as the visual encoder, it fine-tunes SigLIP during the training process, which results in them having the same structure but different parameters. So is it really possible to avoid redundant calculations? > Since SigLIP is also used as the vision encoder in LLaVA-OneVision (Li et al., 2024a), these embeddings can be directly reused, avoiding redundant computation. - Does pre-$\textbf{E}_{im}$ in section 4 mean $N$ in section 3, and post-$\textbf{E}_{im}$ means $K$? 
If so, please use consistent expressions to improve reading; if not, I hope the author can further elaborate on the difference. - Lack of in-depth analysis of Table 1. With the increase of pre-$\textbf{E}_{im}$, there is no consistent performance improvement across all benchmarks. Does this contradict the motivation of the paper? > Crucially, these heavier methods are typically limited to sparsely pre-sampled frame pools in order to remain computationally feasible—risking the permanent loss of “needle-in-a-haystack” moments before the selection algorithm can even evaluate them, a limitation that becomes particularly acute under resource constraints. - I am confused about the experimental results in Table 2. Is this comparison meaningful? - Frame-Voyager is similar to the proposed CoSeLECT, and is also a plug-and-play model. But its LLM Size does not seem to be 7B. - LongVA and VideoChat2 are Video-LLMs. How to compare them with CoSeLECT ? - Supplement CoSeLECT compares the results of the Qwen2-VL [1] experiment with AKS and Q-Frame [2], which will provide a more comprehensive evaluation. [1] Wang P, Bai S, Tan S, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution[J]. arXiv preprint arXiv:2409.12191, 2024. [2] Zhang S, Yang J, Yin J, et al. Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs[J]. arXiv preprint arXiv:2506.22139, 2025. Fully human-written
Training-Free Adaptive Frame Selection for Video-Language Understanding Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This work introduces another method to select which frames to use for video-language understanding. This is an important topic since MLLMs often have input token limitations, and having some way to prefilter the data for the most relevant content can often help improve results. The method is training-free and can be plugged into different models. Their main competitor is AKS (Adaptive Keyframe Sampling), which also adopts a training-free paradigm. However, there are some differences from the method proposed here. Whereas AKS splits the video into equal halves, CoSeLECT segments the video into different lengths depending on intra-clip similarity. The method introduced by the authors seems a bit more flexible than AKS while providing slightly better results. The authors evaluate their method across several common benchmarks such as NextQA, MLVU, VideoMME, MVbench, and LongVideoBench. They show that their method is competitive with training-based methods such as LongVU. They also compare their method with different token reduction techniques such as uniform sampling, VisionZip, PruneVid, and others, for which they also have competitive results. - A simple and yet effective training-free method for frame selection - Adaptive and not reliant on a specific video sub-clip size - Paper well-written - Extensive comparison with similar methods such as AKS - Extensive comparison with both training-based and training-free frame selection methods - Good ablation over the different hyper-parameters such as similarity threshold or frame pool size - Some overhead introduced by the method since frames need to be processed through a SigLIP encoder to compute frame similarity. Depending on the number of frames being processed, this can have an important impact even if this operation can be parallelised. - Lack of ablation over the vision and text encoders. - The paper title and abstract do not exactly match the ones in OpenReview (don't know how much that can be an issue or not) Why choose SigLIP-ViT and not another model? Did you perform an ablation on those? Fully human-written
Training-Free Adaptive Frame Selection for Video-Language Understanding Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes a training-free method that can be seamlessly integrated into existing multimodal large language models (MLLMs). The approach jointly considers both frame-level visual diversity and overall video length, leading to more balanced and informative video representations. The method achieves state-of-the-art performance across multiple video understanding benchmarks, demonstrating both simplicity and effectiveness. 1. The proposed method is simple but effective, requiring no additional training while significantly improving performance. 2. The design is model-agnostic and can be easily plugged into various MLLM architectures, indicating strong generality and practical utility. 3. Experimental results are convincing and comprehensive, covering multiple datasets and metrics, with clear visualizations that illustrate the method’s contribution. 4. The paper is well-written and easy to follow, making the technical insights accessible. 1. The main concern lies in the limited novelty of the method. The use of text–visual embedding similarity as a selection strategy is not conceptually new and has been widely seen in prior works as an auxiliary component or ablation. While the empirical results are strong, the contribution is mainly engineering-level, lacking deeper methodological insight or theoretical advancement. 2. In addition, the paper does not clearly explain how the method mitigates temporal information loss when modeling long videos. Please see the weaknesses. Heavily AI-edited
Training-Free Adaptive Frame Selection for Video-Language Understanding Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes a training-free query-guided frame selection method for efficient video processing in MLLMs. It uses SigLIP cosine similarity between each frame and the given query to measure query relevance. Then it identifies scene transitions based on visual similarity. Based on these, the method adaptively allocates tokens to the scenes that are more relevant to the query via relevance reweighting. CoSeLECT is evaluated on six video understanding benchmarks and achieves state-of-the-art performance compared with both training-free and fine-tuned methods. + The paper writing is clear and easy to follow. + The method is training-free and can be applied to any LVLMs. + The method improves the performance of the base models LLaVA-OV and Qwen2.5-VL-7B. It also outperforms other frame selection methods. + The paper should compare with more video token compression or frame selection methods. For example, BOLT [1] is a frame selection method. + When the retained ratio goes down to 12.5%, CoSeLECT is lower than FastVID on several benchmarks. + Although comprehensive evaluation has been done, the paper is mainly based on empirical observation and has very limited innovation. + The comparison does not seem entirely fair. Although 8k video tokens are finally fed into the MLLM, it still needs to process additional frames during intermediate steps. Given the method involves intermediate steps and introduces computational overhead, the comparison should be made against the base model's best performance. For example, Qwen2.5-VL got 65.1 on VideoMME. [1] BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding + When comparing with baseline models LLaVA-OV and Qwen2.5-VL-7B, how many frames and tokens per frame are used within the 8k context length? + Have you tried the LLM's text embeddings instead of SigLIP text embeddings? + Some complicated questions cannot be used to select key frames based on embedding similarity. For example, many questions in VideoMME are like 'which of the statement is correct?' Therefore, query-frame embedding similarity is not a fine-grained way for frame selection. + What is the computation overhead introduced and inference speed compared with the base model, since the method needs to calculate similarity between consecutive frame embeddings? Fully human-written
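The recipe summarized in this review (query-frame cosine similarity for relevance, inter-frame similarity for scene transitions, relevance-weighted token budgets per scene) can be written down compactly. The sketch below is a generic reconstruction under assumed thresholds and pre-computed embeddings, not the authors' code.

```python
# Hedged sketch of training-free, query-aware frame selection of the kind the
# reviews describe; scene_thresh and the allocation rule are assumptions.
import numpy as np

def select_frames(frame_emb, query_emb, budget, scene_thresh=0.85):
    """Pick `budget` frame indices from pre-computed SigLIP-style embeddings."""
    frame_emb = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    query_emb = query_emb / np.linalg.norm(query_emb)
    relevance = frame_emb @ query_emb                          # query-frame cosine similarity
    sims = (frame_emb[1:] * frame_emb[:-1]).sum(axis=1)        # consecutive-frame similarity
    bounds = [0] + [i + 1 for i, s in enumerate(sims) if s < scene_thresh] + [len(relevance)]
    scenes = [list(range(a, b)) for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
    weights = np.clip(np.array([relevance[s].mean() for s in scenes]), 1e-6, None)
    alloc = np.maximum(1, np.round(budget * weights / weights.sum())).astype(int)
    picked = []
    for scene, k in zip(scenes, alloc):                        # keep top-k most relevant per scene
        picked.extend(sorted(scene, key=lambda i: relevance[i], reverse=True)[:k])
    return sorted(picked)[:budget]
```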
SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper analyzes SGD when the student is supervised by (i) exact Bayes class probabilities (BCPs) and (ii) noisy BCP estimates, then argues for Bayesian teachers as better BCP estimators. Core technical claims: (a) when supervising with exact BCPs, the empirical optimization interpolates and classical SGD "neighborhood" terms vanish, enabling larger admissible stepsizes (Thms. 1–2); (b) with noisy BCPs, convergence rates include a variance (gradient-noise) term whose magnitude depends on how close the teacher is to the true BCPs (Thms. 3–4 and Prop. 3). 1. Showing that the CE risk with BCP supervision shares the same minimizer as standard supervision (the Bayes posterior; the minimum equals $H(Y|X)$), then establishing interpolation for the BCP-supervised objective (Props. 1–2), is crisp and well-grounded. 2. Thms. 1–2 remove the variance neighborhood term found in standard SGD and allow a wider stepsize range, formalizing a compelling optimization advantage of distillation from *accurate* probabilities. 3. The Dirichlet perturbation appendix helps ensure the targets remain on the simplex and shows the main conclusions persist. 1. Prop. 3 weights Jacobian norms by $1/P(y_k|x)$ (or $1/P(y_k|x)^2$ with noisy BCPs). If any class probability can be arbitrarily small, the gradient-noise bounds can blow up. You should make explicit an assumption like $P(y_k|x)\ge \epsilon>0$ (or work with smoothed targets) and reflect this in all statements depending on Eqs. (13)–(14). 2. Additive perturbations can leave the simplex. While Appendix D covers a Dirichlet alternative, the main text should either use the Dirichlet model (preferred) or state an explicit projection/renormalization step and argue it does not break linearity in the second argument of CE used in the proofs. 3. Prop. 2 proves interpolation for the *BCP-supervised* objective under AS4. It would help to make explicit that interpolation is in the sense of *matching the Bayes distribution* (not zero training error on one-hots). Also, connect more tightly to Def. 1 and spell out that interpolation implies zero gradient at every sample (Eq. (25)), which underpins the disappearance of neighborhood terms. A short lemma bridging these steps in the main text (not only App. C) would aid readability. AS1/AS2/AS3 are standard in optimization but nontrivial for deep CE losses. Two concrete requests: (a) Explain where **expected smoothness** (AS3) comes from for CE with typical architectures (e.g., via bounded logits/Jacobians). (b) In Thms. 2–4 the stepsize bounds use "$\mu/(LL)$" and "$\mu/(2LL)$". Please define "$LL$" or fix the notation (likely $L$ vs. another constant $L'$). As written, it's ambiguous. Moderately AI-edited
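For reference, the standard identity behind the claim that the CE risk under BCP supervision is minimized by the Bayes posterior with minimum value $H(Y|X)$ (a generic decomposition, not the paper's exact statement or notation):

```latex
% Cross-entropy risk against the Bayes class probabilities p(y|x),
% with q_\theta the student's predictive distribution:
\begin{align*}
\mathcal{R}(\theta)
  &= \mathbb{E}_{x}\!\left[-\sum_{k} p(y_k \mid x)\,\log q_\theta(y_k \mid x)\right] \\
  &= \mathbb{E}_{x}\!\left[\,\mathrm{H}\!\left(p(\cdot \mid x)\right)
      + \mathrm{KL}\!\left(p(\cdot \mid x)\,\|\,q_\theta(\cdot \mid x)\right)\right] \\
  &\ge \mathrm{H}(Y \mid X),
\end{align*}
% with equality iff q_\theta(\cdot|x) = p(\cdot|x) almost everywhere, i.e.,
% the minimizer is the Bayes posterior and the minimum value is H(Y|X).
```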
SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper carries out a convergence analysis for SGD-based KD with a Bayesian teacher and with a noisy Bayesian teacher, showing faster and more stable convergence compared to standard SGD. Based on the above analysis, the paper proposes to use BNNs as Bayesian teachers, either trained from scratch or converted from deterministic pretrained models. Experimental results are provided to validate the theoretical analysis and some performance gain is demonstrated. 1. The paper studies the benefit of Bayesian teachers in KD from the perspective of SGD convergence, which is a solid and principled choice. 2. The paper connects the literature on BNNs to KD through the use of Bayesian teachers. 1. The paper lacks high novelty. The Bayesian-teacher perspective on KD has been well studied in the literature, and many results, both theoretical and empirical, have been presented to show that a Bayesian teacher is optimal for student learning. The really important issue is how to practically obtain a Bayesian teacher. However, the paper does not place much emphasis on this issue and simply resorts to existing Bayesian DL methods. 2. The experimental results are not comprehensive enough. (1) All results are based on standard KD, without extending to any of the latest logit-based distillers. (2) No results on ImageNet are presented. (3) No results for transformer-based models, e.g., ViT, are presented. 1. For converting a deterministic pretrained model into a BNN, is there any way to introduce stochasticity without modifying the teacher model itself (since in many cases it is not desirable/feasible to modify the teacher model)? For example, through the use of data augmentation or by adding auxiliary probabilistic modules to the teacher model. 2. The paper refers to probability distributions that are closer to the BCPs as more "calibrated" distributions. Then why not show some results on model calibration, such as expected calibration error (ECE) and reliability diagrams [1]? [1] C. Guo et al. On Calibration of Modern Neural Networks. ICML 2017. Fully human-written
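Since the question above asks for calibration results such as ECE, here is a minimal generic sketch of the standard binned ECE computation (the bin count and the toy data are arbitrary; this is not code from the paper):

```python
# Generic expected calibration error (ECE) with equal-width confidence bins.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """probs: (N, K) predicted class probabilities; labels: (N,) integer labels."""
    conf = probs.max(axis=1)                 # confidence of the predicted class
    pred = probs.argmax(axis=1)
    acc = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # |accuracy - confidence| within the bin, weighted by the bin's mass.
            ece += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return ece

# Toy usage with random predictions over 10 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 10, size=1000)
print(expected_calibration_error(probs, labels))
```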
SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines Soundness: 3: good Presentation: 1: poor Contribution: 3: good Rating: 8: accept, good paper Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper demonstrates, from a Bayesian theoretical perspective, that utilizing soft probabilistic outputs in Knowledge Distillation (KD) leads to the highest performance. Under the SGD optimizer, it shows that when the teacher's accurate Bayes Class Probabilities (BCPs) are used as target data, the performance surpasses that obtained using one-hot encoding. Furthermore, the paper analyzes how the error deviation and accuracy change as noise is added to the original BCP distribution. 1. Originality: The paper provides a mathematical proof, from a Bayesian theoretical perspective, explaining why using soft probabilistic outputs in KD leads to better performance under the setting of an SGD optimizer. As mentioned in the Related Work section, the authors generalize this theoretical result beyond special cases such as self-distillation or model compression to more general classification settings, which represents a clear contribution compared to prior research. 2. Quality: The paper offers mathematically rigorous explanations throughout all derivations, giving the theoretical sections a strong sense of completeness and internal consistency. 3. Clarity: The logical flow of the paper is well-structured, and the authors successfully connect theoretical findings with experimental results, presenting a coherent narrative from theory to practice. However, the extensive mathematical derivations make it somewhat difficult for readers to follow the core ideas and fully grasp the main contributions. 4. Significance: By providing a mathematically complete explanation for why soft probabilistic outputs improve performance in KD, a question that has not been sufficiently addressed in previous KD research, the paper offers meaningful theoretical significance within the field of knowledge distillation. 1. In Figure 1, it would be beneficial to quantify the amount of added noise and report it explicitly. Moreover, based on the plots of generalization error and test accuracy per epoch, it seems that the results were obtained using a single random seed. If the authors were to test with multiple seeds and compute the variance of generalization error and test accuracy per epoch for the four cases, it could more clearly demonstrate that the true Bayes probabilities exhibit significantly lower variance. Additionally, instead of using abstract expressions such as "less noisy probabilities" or "more noisy probabilities", it would be clearer to explicitly specify the exact noise levels, which would also enable a direct comparison with one-hot labels. 2. Although the paper's overall contribution is meaningful, it is somewhat disappointing that it does not go beyond showing that using BCPs as labels for the student model improves training stability due to lower variance. 3. The transition from Equation (10) to Equation (11) omits too many intermediate steps, making it difficult to follow the derivation. Furthermore, the mathematical expressions are overly complex, and the overall explanations feel somewhat unfriendly to the reader and difficult to follow. 1.
Do the properties established for SGD-based knowledge distillation still hold if optimizers other than SGD, such as Adam or RMSprop, are used? 2. Is the proposed method applicable only to classification tasks, or could a similar approach be extended to regression problems as well? 3. In Table 1, under what specific conditions were the tests conducted? (For example, were results averaged over multiple runs with different random seeds, such as Seed 0 to Seed 5?) 4. On line 272 of page 6, the paper states that "we model ϵ ∼ P_{ϵ} as zero-mean noise with variance ν and uncorrelated entries." How was the variance ν determined or chosen? Lightly AI-edited
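To make the noise model in question 4 concrete, a small illustration of the two perturbation schemes the reviews discuss: additive zero-mean noise versus a Dirichlet perturbation that stays on the simplex. The BCP vector, the variance ν, and the concentration below are hypothetical values for illustration only.

```python
# Illustrative perturbations of a Bayes class probability (BCP) vector.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])          # a hypothetical BCP vector

# (a) Additive zero-mean noise with variance nu; this may leave the simplex,
#     so clip and renormalise as a crude projection back onto it.
nu = 0.01
eps = rng.normal(0.0, np.sqrt(nu), size=p.shape)
p_additive = np.clip(p + eps, 1e-8, None)
p_additive /= p_additive.sum()

# (b) Dirichlet perturbation centred at p: stays on the simplex by
#     construction; a larger concentration c means less noise.
c = 100.0
p_dirichlet = rng.dirichlet(c * p)

print(p_additive, p_dirichlet)
```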
SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper aims to provide a theoretical analysis of logit-based knowledge distillation from a Bayesian perspective. The authors provide analyses for both perfect BCPs and noisy BCPs, drawing the conclusion that knowledge distillation can lead to variance reduction and neighborhood term removal. Based on the theoretical findings, the authors further propose to utilize Bayesian deep learning models to improve effectiveness. 1. This paper is well organized, highly detailed, and balanced between theoretical depth and readability. 2. The theoretical analysis is supported by empirical evidence. 3. There is potential practicality, as the authors also show the benefit of converting pre-trained models into BNNs to improve the effectiveness of knowledge distillation. 1. (Minor) The analysis is based on SGD, but the experiments are conducted with Adam. While this may suggest that the analysis also applies to other SGD-style optimizers, it would be better to include some analysis, or at least citations, supporting such generalizability from a theoretical perspective. 2. (Minor) The experiments are based solely on image classification. Could there be more complex tasks, such as semantic segmentation or object detection? 3. (Minor) Some related work should be discussed, such as [1, 2]: [1] ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via α-β-Divergence. ICML 2025. [2] f-Divergence Minimization for Sequence-Level Knowledge Distillation. ACL 2023. See Weaknesses. Fully human-written
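To make the "converting pre-trained models into BNNs" point above more concrete, a generic sketch of one simple way to obtain smoothed probabilistic targets from a frozen deterministic teacher, by averaging its softmax over stochastic input perturbations. The teacher, the augmentation, and the sample count are placeholders, not the paper's conversion procedure.

```python
# Sketch: Monte-Carlo average of a frozen teacher's probabilities over
# stochastically augmented views of the input, as one way to obtain
# smoother soft targets for distillation without modifying the teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # frozen stand-in
teacher.eval()

def augment(x):
    """Hypothetical stochastic augmentation: small additive pixel noise."""
    return x + 0.05 * torch.randn_like(x)

@torch.no_grad()
def soft_targets(x, n_samples=8):
    """Average teacher probabilities over several augmented views."""
    probs = torch.stack([F.softmax(teacher(augment(x)), dim=1) for _ in range(n_samples)])
    return probs.mean(dim=0)   # (B, K) smoothed probability estimate

x = torch.randn(4, 3, 32, 32)
print(soft_targets(x).sum(dim=1))   # each row sums to ~1
```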
How much correction is adequate? A Unified Bias-Aware Loss for Long-Tailed Semi-Supervised Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper attempts to address the issue of dynamic bias in long-tailed semi-supervised learning (LTSSL). The authors propose BiAL, which replaces the static class prior with an online estimate of the model's own bias, measured from its output on no-information inputs. The authors subtract this estimated bias from the model's logits to form a debiased energy, and applying this energy uniformly across all loss functions provides a more adaptive and effective correction mechanism. The experiments reportedly achieve state-of-the-art performance on multiple datasets. 1. The idea of probing model bias with no-information inputs seems simple and effective, and the experiments verify that the method achieves state-of-the-art performance. 2. The paper is clearly written and easy to understand. 1. The core idea of this paper, using the model's response to no-information inputs to estimate and correct for its bias, has already been proposed in CDMAD [1]. The use of the estimated bias is just a dynamic variant of the classic Logit Adjustment (LA) [2]. 2. The paper lacks a theoretical foundation for why $b_{\theta}$ is a good estimator of the "effective training prior"; no rigorous proof or analysis is provided. 3. The paper is replete with grandiose terms like "unified," "universal," and "fundamental," which do not align with the actual substance of the contribution. The so-called "unification" is just applying a simple logit subtraction to different components. [1] CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning. [2] Long-Tail Learning via Logit Adjustment. See Weaknesses. Fully human-written
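As a rough illustration of the mechanism described in these reviews (estimate class bias from a no-information input and subtract it from the logits), a minimal sketch with a toy classifier; the model, the EMA rate, and the shapes are assumptions for illustration, not the authors' implementation.

```python
# Sketch of bias probing: estimate class bias from the model's output on an
# all-black image, EMA-smooth it, and subtract it from the logits before CE.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier
ema_bias = torch.zeros(10)
momentum = 0.99

def estimate_bias():
    """Probe the model with a black image and EMA-smooth the resulting logits."""
    global ema_bias
    with torch.no_grad():
        black = torch.zeros(1, 3, 32, 32)          # no-information input
        probe = model(black).squeeze(0)
        ema_bias = momentum * ema_bias + (1.0 - momentum) * probe
    return ema_bias

def debiased_loss(x, y):
    """Cross-entropy on bias-corrected ('debiased') logits."""
    logits = model(x)
    bias = estimate_bias()
    return F.cross_entropy(logits - bias, y)

# Toy usage.
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
print(debiased_loss(x, y).item())
```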
How much correction is adequate? A Unified Bias-Aware Loss for Long-Tailed Semi-Supervised Learning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper makes the following key contributions for long-tailed semi-supervised learning (LTSSL): 1. **Unified Bias-Aware Objective**: Proposes Bias-Aware Loss (BiAL), which replaces static distribution priors (limiting existing methods) with the model's current class bias (estimated from no-information inputs). BiAL unifies bias correction across cross-entropy/logit adjustment and contrastive heads, and extends to supervised learning, enabling consistent debiasing in diverse architectures (e.g., FixMatch, CCL). 2. **Theoretical Guarantees**: Establishes Fisher consistency for balanced error rate (BER) with BiAL's debiased energy, derives dynamic-regret advantages under prior drift (induced by pseudo-labeling), and proves that static-prior methods suffer from unavoidable mismatch, quantifying their excess BER to justify BiAL's adaptivity. 3. **Plug-and-Play Implementation**: Adds negligible computational overhead (only lightweight bias probing and logit adjustment) without extra components. It integrates seamlessly into existing SSL pipelines via warm-up/ramp scheduling for stability. 4. **Empirical Validation**: Achieves state-of-the-art (SOTA) performance on CIFAR10/100-LT, STL10-LT, and ImageNet-127 across consistent/uniform/reverse unlabeled distributions. It concurrently improves pseudo-label quality and test accuracy, outperforming strong baselines (e.g., CPE, Meta-Experts, CCL). 1. **Well-motivated and unified formulation**: The paper proposes a unified Bias-Aware Loss (BiAL) that replaces static class priors with model-induced bias estimated from no-information inputs. This principled abstraction enables a plug-and-play correction mechanism compatible with multiple paradigms such as CE, LA, and contrastive heads. The idea is conceptually clean and addresses a central limitation of prior long-tailed SSL approaches. 2. **Solid theoretical justification**: The authors provide clear theoretical insights, showing that pseudo-labeling induces dynamic class-prior drift and that static prior correction becomes misspecified. The analysis within the balanced-error framework demonstrates that BiAL can reduce dynamic regret and align more closely with the evolving effective prior. This theoretical grounding strongly supports the method. 1. **Bias estimation stability remains unclear**: The bias is estimated using model predictions on no-information images (e.g., black images) and stabilized with EMA and warm-up strategies. However, the accuracy and robustness of such estimation, particularly during early training, may be questionable. More analysis on sensitivity to batch size, input type, and noise would be helpful. 2. **Performance improvements can be marginal in some setups**: Although the method achieves state-of-the-art results on several benchmarks, improvements over strong baselines (e.g., FixMatch+ACR/CPE) are sometimes relatively small and appear within statistical variance. More significance analysis or discussion would help contextualize these gains.
See Weaknesses. Fully AI-generated
How much correction is adequate? A Unified Bias-Aware Loss for Long-Tailed Semi-Supervised Learning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces Bias-Aware Loss (BiAL), a unified framework for long-tailed semi-supervised learning (LTSSL) that replaces static distribution priors with the dynamically estimated bias of the model. The core idea is to measure this bias by probing the model on no-information inputs (e.g., solid black images) and then use it to correct logits throughout training and inference. The method is simple, theoretically grounded and empirically strong. It achieves highly competitive performance across multiple datasets (CIFAR-10/100-LT, STL-10-LT, ImageNet-127). The paper provides good theoretical guarantees (Fisher consistency, dynamic regret bounds) and several experiments that validate the method's robustness. 1. Novel conceptual framework - Introduces a dynamic bias estimation mechanism that generalizes prior static approaches. - Unifies existing bias-corrective losses under a single principle. 2. Strong theoretical foundation - Fisher consistency and dynamic regret proofs provide mathematical backing. - Gradient-level analysis clarifies improvements in minority class margins. 3. Comprehensive empirical validation - Benchmarks across 4 datasets and multiple distribution regimes. - Thorough ablations and sensitivity checks. 4. Practical utility - Plug-and-play integration with minimal overhead. - Includes practical engineering details (warm-up, EMA smoothing, ramp-up). - Clear implementation and reproducibility potential. 5. The paper is well written, with clear motivation, good organization, and visual presentation. **Limited bias source analysis:** The method captures aggregate bias, but the paper does not separate its components (e.g. class imbalance vs. architectural difficulty). **Hyperparameter sensitivity:** The introduction of new parameters $(\beta, E_\mathrm{est}, E_\mathrm{warm}, E_\mathrm{ramp}, \alpha)$ adds to the tuning cost. While a sensitivity analysis is provided, clear heuristics for setting these on new datasets are limited. **Minor writing issues:** Includes missing references to figures and a minor equation labeling error, which slightly impact the reading flow. **Limited exploration of "No-Information" inputs:** The paper uses all-black images but does not ablate this choice. Exploring other types of non-informative inputs (e.g., noise patterns) could have strengthened the methodological foundation. **Domain-specific semantic meaning of "No-Information" inputs:** The method's core assumption is that a solid black image serves as a neutral, non-informative baseline. However, in specialized domains like medical imaging, the color black can carry significant clinical meaning (e.g. specific tissue types, or the absence of a finding). In such cases, using a black image would not probe the model's intrinsic class bias but would instead measure its response to a semantically charged input, leading to a corrupted and misleading bias estimate. **Contextual bias from training data:** The bias estimation relies on the model's output for a constant-colour input. 
However, if the original training dataset contains correlations between plain backgrounds and specific classes, the model may learn these spurious associations. Consequently, the estimated bias vector $b_\theta$ would capture this dataset-specific contextual bias (e.g., a bias towards classes frequently appearing with blank slides) alongside the intended class-frequency bias. 1. Bias composition: The measured bias $b_\theta$ is treated as a unified vector. Can you disentangle how much of this bias originates from class imbalance versus other factors, such as the model's architectural prior or dataset-specific visual biases (e.g., background correlations)? Is the bias on no-information inputs a pure reflection of the label distribution? 2. Hyperparameter tuning: What heuristics or adaptive schemes could help set $\beta, E_\mathrm{warm}$, and $E_\mathrm{ramp}$ for new datasets? 3. Scalability: How does computational cost scale with model size and class count? 4. Failure cases: When might BiAL underperform relative to static priors? 5. Bias estimation robustness: What happens when bias estimation is noisy or unstable? 6. Visual similarity: Could you analyse how BiAL affects discrimination among visually similar head vs. tail classes? 7. Experimental fairness: Were the same data samples used consistently across all method comparisons? Fully AI-generated
How much correction is adequate? A Unified Bias-Aware Loss for Long-Tailed Semi-Supervised Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper studies the problem of long-tailed semi-supervised learning (LTSSL), where both label imbalance and pseudo-label noise cause strong class bias during training. The authors observe that most existing debiasing methods use static distribution priors (e.g., class frequencies), which become inaccurate as the model evolves and pseudo-labels change the effective class distribution. To address this, the paper proposes Bias-Aware Loss (BiAL), a unified bias-aware objective that replaces static priors with the model's current bias, estimated directly from its responses to no-information inputs (e.g., blank images). This approach allows consistent correction during both training and inference, and can be easily plugged into different SSL frameworks such as FixMatch and CCL. 1. The paper is well written and overall well organized. The main idea of using model bias estimated from no-information inputs is conceptually close to DebiasPL ("Debiased Learning from Naturally Imbalanced Pseudo-Labels," CVPR 2022). Both approaches rely on the model's self-bias for correction without external priors. DebiasPL already frames this correction as a causal-inference pipeline; the new method mainly wraps the same idea into a unified loss formulation (BiAL) but does not introduce a very different underlying mechanism. The contribution seems more incremental than fundamentally new. Most experiments are conducted on small benchmarks such as CIFAR10/100-LT and STL10-LT, with limited data diversity and visual complexity. While the method shows nice improvements there, it is unclear whether the gains can generalize to large-scale or real-world long-tailed semi-supervised scenarios (e.g., ImageNet-LT, WebVision, or domain-shifted data). The paper only discusses classification tasks. It is not clear whether the proposed bias-aware correction can be generalized to other settings like detection, segmentation, or multimodal learning. Since those tasks often involve structured outputs and continuous predictions, the practical applicability of BiAL outside classification remains uncertain. Marginal performance gains: as shown in Table 1 and Table 2, the performance gain is often within 0.5%. Please check the Weaknesses section. Lightly AI-edited
HARPA: A Testability-Driven, Literature-Grounded Framework for Research Ideation Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes the HARPA framework, which generates testable research proposals through literature mining, hypothesis space exploration, and an execution-feedback-based rater, and achieves significant improvements in feasibility and literature support. This paper points out the problem of the "disconnection between innovation and feasibility" when generating hypotheses with LLMs, which is of great significance for promoting the practical application of ASD systems. 1. The paper proposes a complex multi-stage process, but the necessity of its components does not seem to be justified. 2. This paper needs to provide a detailed cost analysis and comparison. 3. Can the scorer work effectively on other ASD systems or in other scientific domains? 1. How do you handle the data for the "Uncertain" class? 2. How much does the reasoning trace contribute to performance? 3. How do you assess inter-rater reliability, i.e., whether different experts agree on the same proposal? Fully human-written
HARPA: A Testability-Driven, Literature-Grounded Framework for Research Ideation Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a scientific research hypothesis generation framework named HARPA, aiming to address two major challenges of large language models (LLMs) in scientific research creativity generation: ensuring the testability and literature-groundedness of hypotheses. The HARPA framework consists of two core components: (1) a multi-stage hypothesis generator (Proposal Generator), which identifies research trends, constructs hypothesis space through literature mining, and finally converges to specific hypotheses that fill research gaps; (2) a hypothesis scorer (Proposal Scorer), which is a reward model (RM) trained based on previous experimental execution trajectories (success or failure) and used to predict the feasibility of new hypotheses. The authors evaluated HARPA through two sets of experiments: (1) Human expert evaluation (compared with the AI-Researcher baseline), which showed that HARPA was significantly superior in "feasibility" and "groundedness"; (2) Automated scientific discovery (ASD) agent evaluation, which showed that the hypotheses generated by HARPA had a higher execution success rate (20 vs 11) on the ASD agent (CodeScientist). Additionally, an independent experiment demonstrated that HARPA's scorer was 28% more accurate than the untrained LLM baseline in predicting execution success rates. 1. Important Issue: This paper addresses a critical and timely issue: how to bridge the gap between the "creative generation" and "actual scientific research execution" of LLMs. 2. Execution-oriented rewards: Taking "feasibility" as a key indicator and attempting to use actual ASD agent execution trajectories (not just zero-shot judgments from LLMs) to train the reward model is the right and valuable direction. 3. Breadth of Evaluation: The evaluation design of the paper (although flawed) attempts to cover both human expert evaluation and real agent execution simultaneously, and this multi-dimensional evaluation approach is commendable. 1. Lack of end-to-end evaluation: There is a lack of critical end-to-end (End-to-End) verification. The paper claims that its core advantages lie in "testability-driven" and "Self-Adaptation to prior experimental outcomes". This strongly implies a closed-loop system: the feedback from the Scorer should be able to guide or improve the Generator. However, the experimental design in this paper is completely disjointed: * Experiment 5.1 evaluated the generator (compared to AI-Researcher). * Experiment 5.2 evaluated the scorer (compared to the untrained LLM). * Missing Experiment: The authors never conducted an end-to-end evaluation to prove that the combination of "generator + scorer" outperforms the "generator (alone)". The authors only demonstrated that they could train a scorer, but never used this scorer in experiments to filter or re-rank the generator's output and verify whether doing so (e.g., taking the Top-K) could further improve the ratings of human experts or the execution success rate of ASD agents. 
This leaves the paper's core claims of "Self-Adaptation" and "learning from experience" without the most direct experimental verification. 2. Unfair baseline comparison: There are serious confounding variables in the core human evaluation and agent evaluation in Section 5.1. The input to HARPA (a full paper) is far more informative than the input to the baseline (topics generated from abstracts). This makes it impossible for us to attribute the observed improvements in "feasibility" and "groundedness" to the HARPA framework; they may simply be due to the difference in input information. This fundamentally weakens the argument for the effectiveness of the HARPA generator. It manifests in at least two aspects: * (a) Input Inconsistency: As described in Section 4.2, the input of HARPA is the "source paper", while the input of the baseline AI-Researcher is the "topic generated from the abstract of each source paper". A complete paper clearly provides far richer and more specific context than a single topic. The significant advantages of HARPA in "groundedness" (+0.85) and "feasibility" (+0.78) are most likely due solely to this difference in input granularity, rather than to the superiority of its generation process itself. * (b) Uncontrolled Retriever: The paper acknowledges that HARPA and AI-Researcher each used their own internal literature retrieval processes. Literature retrieval is the lifeblood of "groundedness". The authors did not control this variable (e.g., by having both systems use exactly the same retrieval results as input), making it impossible for us to determine whether the performance improvement comes from HARPA's novel generation process or simply from a (possibly superior) literature retrieval component. 3. Sacrificed novelty: This is a key issue. According to Table 6 (Appendix), in the human evaluation, HARPA's "Novelty" score (5.98) is actually lower than the baseline's (6.43). Although the authors state in the main text that "novelty is not sacrificed", for a top-tier conference like ICLR, a paper that excels in feasibility (possibly due to the flawed comparison) but is on par with or even lags behind the baseline in novelty has limited contributions. 4. Generalization Ability and Domain Mismatch: The Scorer of HARPA was trained on a dataset of ACL (NLP domain) papers (Section 4.3). However, the evaluation of the papers (Section 4.2) covers a broader range of topics, including "RL (reinforcement learning), Optimization". The authors did not demonstrate whether this "feasibility" predictor trained on NLP can generalize to distinct domains such as RL or optimization. The universality of this framework for scientific domains other than those tested in the paper has not been fully explored. 5. Failure to address complex issues: The methodology and evaluation of the paper seem to focus on relatively straightforward hypotheses that can be reduced to "key variables". The paper does not explore how HARPA addresses multi-faceted research questions that require multi-step experiments or involve complex interactions among multiple variables. 6. Lack of computational-resource analysis: The paper does not conduct a detailed analysis of the computational resources required for training and executing the HARPA framework (including the generator and scorer). This makes it difficult for other researchers to evaluate the feasibility of reproducing or deploying the framework in their laboratories. 1. (Regarding Weakness 1): Why didn't the authors conduct an end-to-end evaluation?
For example, using the trained HARPA-Scorer to re-rank the proposals generated by the HARPA-Generator and then submitting the top-K proposals with the highest scores to human experts and ASD agents would be a direct way to verify the actual utility of the scorer, and is also crucial to support your "test-driven" claim. 2. (Regarding Weakness 2): How can the authors prove that the improvements in feasibility (+0.78) and groundedness (+0.85) observed in Section 5.1 stem from HARPA's superior generation process, rather than simply from the fact that HARPA was given a much more informative input (full paper vs. abstract topic)? 3. (Regarding Weakness 3): The evaluation in Table 6 shows that HARPA scores lower than the baseline in terms of novelty. Does this mean that the framework sacrifices novelty in exchange for (possibly problematic) improvements in feasibility? This seems to be a significant compromise for a "creative generation" tool. 4. (Regarding Weakness 4): The HARPA scorer is trained on NLP (ACL) data. How accurate is it when evaluating the feasibility of proposals in non-NLP domains (such as RL or optimization)? Are there out-of-distribution (OOD) generalization issues? 5. (Regarding Weakness 5): The method of the paper focuses on extracting variables and values (A + B). How does it handle more complex research questions that require multi-step experiments or involve complex interactions (non-direct causality) between variables? 6. (Regarding Weakness 6): Could the authors provide a detailed analysis of the computational resources required for the training and inference of HARPA (including the generator and scorer)? Fully human-written
HARPA: A Testability-Driven, Literature-Grounded Framework for Research Ideation Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a new framework named HARPA, aimed at addressing the ideation-execution gap problem that is commonly found in current large language models (LLMs) when generating scientific ideas. HARPA is mainly composed of a proposal generator and a scorer. The former simulates the workflow of human researchers to generate high-quality ideas, and the latter predicts the feasibility of the proposed ideas. The scorer mechanism is interesting: it does not rely on heuristic rules or pure LLM judgments, but learns feasibility from actual experimental execution results. In addition, the paper precisely points out a core pain point in the current AI-for-Science field: the disconnection between the innovativeness and feasibility of ideas. 1. Over-optimization for "feasibility" may suppress breakthrough innovation: The core of HARPA is to enhance the feasibility of ideas. However, a potential risk is that the system may therefore prefer ideas that are safer, simpler, and more incremental (such as simply replacing a network layer or testing an old model on a new dataset), while filtering out ideas that are high-risk but may bring breakthroughs. The experimental results of the paper also partly confirm this point (HARPA's novelty score is slightly lower than the baseline's). The authors should discuss this "feasibility-novelty" trade-off more deeply in the paper and explore whether the HARPA framework can be adjusted to balance or encourage higher-risk innovation. 2. The organization and readability of the paper need to be improved: - Frequent jumps to appendices: The narrative flow of the main text is frequently interrupted by "see Appendix X," making it difficult for readers to understand the core methods. - It is strongly recommended that the authors reorganize the structure of the paper, optimize the writing, and improve the reading experience. 3. The paper lacks comparison with other methods that can be used for idea generation, such as AI-Scientist [A] and NovelSeek [B]. - [A] Lu C., Lu C., Lange R. T., et al. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv preprint arXiv:2408.06292, 2024. - [B] Team N. S., Zhang B., Feng S., et al. NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification. arXiv preprint arXiv:2505.16938, 2025. 4. The scope of related work can be broader: Although the paper cites a large number of related works, some recent highly relevant research seems to have been omitted. For example: - Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM-Generated Ideas. - Large Language Models Are Zero-Shot Hypothesis Proposers. - Large Language Models for Automated Open-Domain Scientific Hypotheses Discovery. - Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers. - Dolphin: Moving Towards Closed-Loop Auto-Research through Thinking, Practice, and Feedback. If the authors can address my concerns, I will reconsider my rating.
The authors should revise the article and the figure to make the paper more readable. Fully human-written
Delta-Triplane Transformers as Occupancy World Models Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes Delta-Triplane Transformers (DTT), a 4D Occupancy World Model (OWM) for autonomous driving, aiming to address three key limitations of existing OWMs: loss of vertical spatial information, long-term prediction error accumulation, and excessive model complexity. The core designs of DTT include: 1) A pretrained triplane autoencoder that compactly preserves spatial information of 3D occupancy; 2) Multi-scale Transformers that predict "triplane deltas" (instead of full occupancy states) to leverage the sparsity of changes for reduced modeling difficulty; 3) A sparse query-based motion planning module designed using these deltas to simplify the decision-making process. Experiments on the nuScenes and Occ3D datasets validate DTT's superiority: compared to the state-of-the-art (SOTA) method DOME, DTT improves mean IoU (mIoU) from 27.10% to 30.85% and IoU from 36.36% to 74.58%, while achieving real-time inference on an RTX 4090 GPU. 1. **Clarity of Writing:** The paper is very clearly structured and written, allowing readers to easily follow the authors' reasoning about the DTT method and its components. 2. **Clear Motivation:** The motivation for DTT is well-justified: by adopting incremental modeling and leveraging sparsity (via delta prediction), the method significantly reduces the computational burden compared to full-state prediction approaches. 3. **Relevance to Research Needs:** The research addresses a topic of high interest in autonomous driving. The achievement of real-time inference on an RTX 4090 makes DTT more relevant to real-world deployment than slower baselines. 4. **Compelling Experimental Results:** The experiments clearly demonstrate that DTT improves both computational efficiency and core performance metrics, avoiding the common trade-off between speed and accuracy. 1. **Insufficient Theoretical Analysis of Methodology:** The paper primarily uses experimental results to validate its design choices, for example, pretraining the triplane model first and then fixing the encoder to train DTT and the decoder. While Table 4 shows that omitting pretraining degrades performance, the authors lack an intuitive theoretical analysis to explain why this is the case. For instance, why does end-to-end training (without separating pretraining and fine-tuning) not yield better results? Or is end-to-end training too challenging (e.g., due to high dimensionality or unstable gradients) and prone to poor local convergence? 2. **Incomplete Analysis of Cross-Input Performance:** From Table 1, DTT achieves significant performance gains when the input is 3D occupancy ground truth (3D-Occ). However, when the input is replaced with camera-derived data (Camera), the improvement margin shrinks substantially; for example, DOME-F outperforms DTT-F significantly at 0s and 1s. The authors need to provide a detailed explanation for this discrepancy (e.g., whether camera-based 3D occupancy predictions introduce noise that disproportionately affects DTT's delta-based modeling). 3. **Cumulative Error Risks:** DTT adopts an autoregressive framework that predicts the next frame using historical information.
This raises a critical question: Does DTT still suffer from cumulative error over time? If the model is tasked with predicting longer occupancy sequences (e.g., beyond 3 seconds), will severe prediction drift occur? The current experiments only evaluate up to 3 seconds, and no analysis of long-horizon stability is provided. 4. **Motion Planning Performance Trends:** In terms of collision rate (Table 2), DTT-O performs significantly worse than OccLLaMA-O at 1s and 2s but outperforms it at 3s. The authors need to explain this counterintuitive trend. For example, is DTT less effective at short-term (near-future) frame prediction, and if so, what causes this delay in performance improvement? Please refer to the "Weaknesses" section for detailed questions and suggestions. Lightly AI-edited
Delta-Triplane Transformers as Occupancy World Models Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces Delta-Triplane Transformers (DTT), a new 4D occupancy world model (OWM) for autonomous driving. Unlike previous works (e.g., DOME, OccWorld), DTT does not predict the full occupancy state but instead models changes (deltas) in a compact triplane representation (xy/xz/yz). By leveraging separate multi-scale Transformers per plane, DTT predicts sparse occupancy changes and fuses them to reconstruct future scenes and ego trajectories. The method achieves state-of-the-art results in motion planning and 3D-Occ-based 4D occupancy prediction and runs in real time (26 FPS). 1. The triplane representation preserves vertical 3D information and yields a compact latent space, helping reduce drift in long-term prediction. 2. Modeling occupancy deltas is intuitively efficient and effective, as we do not need to "copy" the existing states into the prediction. 3. State-of-the-art results: consistent improvement over DOME, OccWorld, and OccLLaMA, both in accuracy (mainly in 3D-Occ-based 4D occupancy prediction) and efficiency. 4. The experiments and supplementary materials are rich. The writing and figures are overall clear. 1. Lack of analysis of why learning the "changes" is easier. First, sparser does not mean easier (line 085). Moreover, the full state is not just hard to learn; for works like DOME, the error accumulation may additionally come from exposure bias. The claim is intuitive, but if we compare the results in Table 1 with the "w/o triplane changes" ablation in Table 3, we can see that the mIoU is inferior to DOME's. So perhaps the slower error accumulation achieved in Table 1, compared with DOME, mainly comes from the triplane representation rather than from the delta estimation? 2. In lines 455-456, the authors claim that predicting everything in parallel hinders autoregressive error correction. This is the first time I have seen "correction" rather than "accumulation"; more discussion of this point is welcome. 3. Lack of analysis of Table 1's camera-based 4D occupancy forecasting. Why is DTT clearly inferior to DOME? 1. Why compare only with OccWorld in the qualitative experiments and the supplementary materials? The reviewer considers DOME a better choice, as it is the current state-of-the-art. Fully human-written
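As a toy illustration of the delta-prediction idea discussed in these reviews, a minimal sketch in which a small network predicts the change of a flattened latent and adds it to the current state during an autoregressive rollout. The shapes and the predictor are hypothetical stand-ins, not DTT's architecture.

```python
# Toy delta predictor: next state = current state + predicted delta,
# rolled out autoregressively over future steps.
import torch
import torch.nn as nn

class DeltaPredictor(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, z_t):
        return z_t + self.net(z_t)   # add the predicted change to the current latent

model = DeltaPredictor()
z = torch.randn(2, 256)              # current (flattened) latent, toy shape
rollout = []
for _ in range(6):                   # autoregressive rollout over 6 future steps
    z = model(z)
    rollout.append(z)
print(torch.stack(rollout).shape)    # (6, 2, 256)
```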
Delta-Triplane Transformers as Occupancy World Models Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces Delta-Triplane Transformers (DTT), a novel occupancy world model designed for autonomous driving. The core idea is to represent the 3D environment using triplane latent features and to model occupancy changes (deltas) over time rather than full occupancy states. By leveraging triplane representations, the method preserves vertical (z-axis) structural information that conventional BEV-based approaches tend to lose, avoiding the need for fine-grained semantic cues or large model capacity. Moreover, instead of predicting the entire occupancy state, DTT focuses on modeling occupancy changes (deltas). Since these deltas are sparser and more concentrated around zero, they exhibit lower variance and simplify the learning process. Experiments on the Occ3D and nuScenes datasets demonstrate that DTT achieves SOTA results in both occupancy forecasting and motion planning tasks. 1.Comprehensive methodology and clear architecture design. The paper provides a detailed explanation of the encoder, decoder, delta-based occupancy predictor, and motion planner, as well as how these modules interact within the overall framework. The description of the temporal triplane prediction module, in particular, is well-articulated and technically sound. 2.Novel representation design. The adoption of triplane as an intermediate latent representation is novel and effective. It addresses key limitations of existing BEV-based occupancy world models, which often lose vertical geometric information and rely on large feature maps for compensation. 3.Impressive results with lightweight architecture. DTT achieves impressive performance on both Occ3D and nuScenes benchmarks, outperforming current occupancy-based methods while maintaining a smaller model size. This demonstrates strong potential for real-time and practical deployment. 1.Limited novelty in delta prediction. Predicting changes (deltas) over time is not a new concept in temporal forecasting. The paper should clarify what makes delta prediction within the triplane representation particularly advantageous compared to delta prediction in other representations. A suggested ablation study could compare variants such as OccLLaMA or OccWorld, where these methods also predict deltas instead of full states, to isolate the contribution of the triplane-delta combination. 2.Limited discussion on limitations and robustness. The paper argues that occupancy deltas are sparse and thus easier to learn. However, this assumption might not hold under adverse conditions such as rain, snow, or dense sensor noise, where occupancy changes are no longer sparse. A brief discussion or empirical evidence regarding DTT’s robustness in such noisy or highly dynamic scenarios would strengthen the paper. 3.Inconsistency in figure section titles. In Section 4.2, the heading “Visual comparisons with SOTA” is inconsistent with the following subsection title “Visual comparisons of motion prediction.” It would be clearer to use a consistent title such as “Visual comparisons of occupancy prediction.” In addition, the visualizations (e.g., Figure 6) could be enlarged to make the differences between methods more clear. See above. Fully AI-generated
Delta-Triplane Transformers as Occupancy World Models Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper proposes Delta-Triplane Transformers (DTT), a 4D occupancy world model that represents scenes via a compact temporal triplane latent and predicts future states by modeling sparse deltas instead of full occupancy, which are then decoded for occupancy forecasting and used as sparse queries for motion planning. DTT pretrains an autoencoder to obtain a triplane latent, applies plane-specific multi-scale Transformers to predict future deltas autoregressively, and couples these with a planning module that attends to change features and history to output trajectories. 1. Modeling occupancy deltas in a triplane latent exploits sparsity, reduces variance, and enables lighter sequence models while preserving vertical structure versus BEV-only latents. 2. The performance is better than previous occupancy world models. 1. The paper claims improvements in motion planning but neglects much of the related work. It does not situate contributions against recent intention-aware or end-to-end planning approaches such as World4Drive, BEV-Planner, GenAD, and PPAD. 2. NuScenes is a small dataset. Would it be possible to evaluate the approach on a larger dataset, such as Waymo [4], or perform the motion planning experiments on NAVSIM? 3. Qualitative results are limited. More diverse and challenging scenarios and explicit failure-case analyses would make the qualitative story more convincing. 4. In Figure 6’s first example, the 2.0s–2.5s frames for the middle cars appear to exhibit stretching/residual artifacts, and in places, OccWorld looks cleaner. 1. How does DTT perform under closed-loop evaluation and distribution shift (e.g., NavSim or nuPlan), and does the delta modeling reduce compounding error in closed-loop rollouts relative to full-state predictors? 2. Figure 4 caption contains a “traiplane” typo. 3. Why is the reconstruction performance of the camera input setting in the last row in Table 1 lower? Lightly AI-edited
Condensing Videos by Learning Where Motion Matters Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces Dynamic Frame Synthesis (DFS), a novel approach for video dataset condensation. DFS progressively selects key frames with a gradient-guided synthesis mechanism and linearly interpolates the allocated frames. Results across multiple benchmarks demonstrate the effectiveness of the proposed DFS method. 1. Simple yet elegant core idea: DFS adaptively includes key frames based on gradient misalignment, removing the need to disentangle static content from dynamic motion. 2. Comprehensive experimental results: Results across diverse benchmarks validate the effectiveness of DFS. A full ablation study is conducted to provide in-depth analysis. DFS also shows potential for efficient storage. 1. Lack of study on hyper-parameters: The threshold $\epsilon$ is an important parameter of the method. It would be interesting to see how different values of $\epsilon$ affect performance (across different datasets). 2. Limited performance gains: From Tables 1 and 3, the improvement of DFS over the previous method seems very limited on UCF, Kinetics, and SSv2. 3. Following the previous point, prior work [1][2] shows that HMDB and UCF are biased toward objects and backgrounds (i.e., semantics), while SSv2 is a more motion-heavy dataset. I wonder whether DFS truly captures the motion in videos. The update rule seems to assume smooth transitions between frames, so frames with more motion might not be included as key frames. A quantitative analysis of whether the synthesized videos really capture motion would help to better demonstrate the effectiveness of the method, and more ablations and analysis on SSv2, or results on other motion-based datasets like Diving48, would benefit this paper. [1] Li, Yingwei, Yi Li, and Nuno Vasconcelos. "ResOund: Towards Action Recognition Without Representation Bias." Proceedings of the European Conference on Computer Vision (ECCV), 2018. [2] Choi, Jinwoo, et al. "Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition." Advances in Neural Information Processing Systems 32 (2019). See Weaknesses. Fully human-written
Condensing Videos by Learning Where Motion Matters Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper deals with the topic of video condensation and specifically the interdependence of content and motion. They propose Dynamic Frame Synthesis (DFS), a method for video condensation that synthesizes video keyframes at times where linear interpolation breaks. They start with the first and last frames and adaptively synthesize frames at temporal locations where the gradient signal indicates that linear interpolation is insufficient. The paper deals with the interesting topic of video condensation and proposes a method based on a clever, intuitive idea: adding to the "summary" the frames that cannot be captured via linear interpolation of the existing ones. 1) The experiments are only on short, curated, single-action classification datasets. This is a very basic scenario for our times, and it is not clear if and how their method could be extended to a less toy scenario. This would be the equivalent of CIFAR for image classification. 2) The method assumes that the first and last frames are highly important (for every class). This makes precise trimming crucial. 3) Whether a frame can or cannot be represented through interpolation in Eq. (6) is based on a threshold on the cosine similarity of the gradients. This is highly heuristic and sensitive. It is unclear to me why the same epsilon should apply everywhere. The authors do not explore this. 4) It is unclear to me what Lemma 1 offers in practice. The intuition is right, but whether "further loss minimization" is important is highly arbitrary. 5) Comparisons are to very basic baselines and to only one other related work, Wang et al., a very simple method that decouples motion and appearance, representing the latter with a single RGB image. Other notes: - l40: "A pioneering work in this area by Wang et al." - Not sure if "pioneering" is the right word, given that the impact of that paper does not yet seem to be wide. Q1) How can the method be extended to more complex, less curated datasets? Q2) How crucial is video trimming? Q3) Why do you say that the flow shown in the bottom row of Fig. 3 "closely resemble those of the real videos"? This seems arbitrary to me from the figure. Q4) What is the sensitivity to the threshold epsilon? This is a hyperparameter that needs to be ablated. Fully human-written
Condensing Videos by Learning Where Motion Matters Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces Dynamic Frame Synthesis (DFS), a new method for video dataset condensation that synthesizes key frames based on motion complexity. This complexity is detected via a heuristic based on the cosine similarity of gradients between an interpolated frame and its adjacent key frames. Small-scale experiments on some benchmarks show good results. 1. This paper is well prepared, and the technical details are described clearly. 2. The task of video dataset condensation holds significant value for the community, as training large-scale video models requires substantial compute and storage space. 1. The core idea that gradient misalignment is a reliable proxy for motion complexity is an unsubstantiated heuristic. The method's reliance on an arbitrary and hard threshold for frame insertion suggests a lack of robustness. 2. The provided theoretical support, Lemma 1, is a proof for a highly constrained and simplified scenario that does not accurately model the optimization process, thereby providing a false sense of theoretical grounding. 3. The experimental evaluation contains instances where the proposed method is outperformed by established baselines on large-scale datasets, contradicting the paper's repeated claims of achieving state-of-the-art performance. 4. All experiments in the paper are conducted with very small model and data scales. Even on Kinetics and Something-Something, only a small number of frames are used (e.g., 8), and the resolution (e.g., 64x64) is particularly low. 1. The authors should consider adopting more standard experimental settings, such as an input resolution of 224×224. They should also use well-established models for experiments, such as I3D, SlowFast, and ViViT. The experimental results should also be competitive with those reported in these papers to be convincing. Lightly AI-edited
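To make the insertion rule described in these reviews concrete, a minimal sketch of the gradient-misalignment test: an interpolated frame is promoted to a key frame only when its gradient disagrees (cosine similarity below ε, with ε = 0 as the reviews describe) with the gradients of both adjacent key frames. The gradients below are random placeholders, not gradients from the paper's models.

```python
# Sketch of the key-frame insertion criterion based on gradient misalignment.
import numpy as np

def should_insert(g_interp, g_left, g_right, eps=0.0):
    cos_l = g_interp @ g_left / (np.linalg.norm(g_interp) * np.linalg.norm(g_left) + 1e-12)
    cos_r = g_interp @ g_right / (np.linalg.norm(g_interp) * np.linalg.norm(g_right) + 1e-12)
    # Promote the interpolated frame only if its gradient disagrees with BOTH neighbours.
    return cos_l < eps and cos_r < eps

rng = np.random.default_rng(0)
g_mid, g_a, g_b = rng.normal(size=(3, 128))
print(should_insert(g_mid, g_a, g_b))
```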
Condensing Videos by Learning Where Motion Matters Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes Dynamic Frame Synthesis (DFS) for video dataset condensation. DFS starts from two key frames (start/end) and linearly interpolates the rest; during optimization, it inserts new key frames only when the gradient at an interpolated frame is directionally misaligned with the gradients of its two adjacent key frames, operationalized via cosine similarity; if both cosine similarities fall below a threshold $\epsilon$, the frame is promoted to a key frame. In all experiments, $\epsilon$ is set to 0 (i.e., insert when both cosines are negative). The method uses a simple distribution-matching objective (squared L2 between mini-batch mean features) and includes warm-up/cool-down phases to stabilize insertion. Experiments on multiple datasets (miniUCF, HMDB51, Kinetics-400, and Something-Something V2) report accuracy and storage comparisons vs. coreset and prior condensation baselines. - **Intuitive Core Insight:** - The gradient-based criterion for identifying where linear interpolation fails is conceptually appealing - allocating representational capacity only where motion complexity demands it. - **Adaptive storage:** - Starting with 2 frames and inserting only when needed yields substantially lower storage than fixed-length approaches at similar accuracy. Storage grows sub-linearly with VPC. - **Comprehensive Ablations:** - The ablation studies systematically examine initialization strategies, similarity metrics, training phase components, and frame selection strategies. Each ablation shows meaningful performance deltas that support the corresponding design choices. - **Inconsistent Empirical Results:** - On Kinetics-400 VPC 5, DFS (8.1±0.1) substantially underperforms DM (9.1±0.9), directly contradicting claims of state-of-the-art performance. - Performance gains over baselines are often marginal with overlapping confidence intervals (miniUCF VPC 1: 17.9±0.3 vs Wang et al. 17.5±0.1; SSv2 VPC 1: 3.9±0.2 vs Wang et al. 4.0±0.1). - Underperformance on large-scale datasets raises serious scalability concerns that are neither explained nor addressed. - **Critical Hyper-parameter Lacks Any Justification:** - The threshold $\epsilon = 0$ is the single most important hyperparameter; it determines exactly when frames are inserted. No sensitivity analysis over $\epsilon \in \{-0.3, -0.2, -0.1, 0, 0.1, 0.2\}$ is provided. There should be an ablation on the value of $\epsilon$. - **Computational Cost Completely Unreported:** - The method requires computing gradients for interpolated candidate frames between every pair of key frames at each training step. "Low storage" does not mean cheap to condense. So, there should be an analysis of runtime/computational cost, etc. - **Theoretical foundations:** - Lemma 1 proves only that endpoint updates cannot decrease loss at $s_t$ (a necessary condition), but does not prove that promoting $s_t$ to a key frame is optimal or even beneficial (no sufficiency). - Linear interpolation assumption (Eq. 3) for motion is physically unrealistic (ignores acceleration, easing, complex trajectories), with no empirical validation of when this holds. - **Experimental setup:** - Only action recognition evaluated; generalization to other video tasks unaddressed. - Only short videos (T=8 or T=16); no evaluation on longer sequences. Is it possible to apply it to longer sequences? How would incorporating longer sequences increase the cost of condensation? - Why does DFS underperform DM on Kinetics-400 VPC 5 by ~11%? This contradicts the main thesis and requires explanation. - How many interpolated candidates are evaluated per training step? Are there statistics on this? Does it scale with the current number of key frames? - Why are warm-up/cool-down phases fixed at 20% each? What about 10%, 15%, 30%? Any sensitivity analysis? - What is the computational cost? Wall-clock training time, memory requirements, FLOPs compared to baselines? - Can the method scale to longer videos (T=32, 64, 128)? How would computational cost scale? - Can the method scale to higher resolutions (e.g., 512×512 or 1024×1024)? All experiments use 112×112 (miniUCF/HMDB51) or 64×64 (Kinetics/SSv2). How would gradient computation cost and memory requirements scale with resolution? - Why is optical flow analysis only qualitative? Any quantitative motion fidelity metrics (flow EPE, trajectory consistency)? - What about stronger matching objectives? Would pairing DFS frame selection with multi-moment or MMD matching improve results further? Fully AI-generated
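To make the insertion test described in the two reviews above concrete, the following is a minimal sketch of a gradient-misalignment check of that kind; the gradient arrays, the helper names, and the toy inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two flattened gradient tensors."""
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def should_insert_keyframe(grad_interp, grad_left, grad_right, epsilon=0.0):
    """Promote the interpolated frame to a key frame only if its loss gradient
    is directionally misaligned (cosine below epsilon) with BOTH adjacent
    key-frame gradients, mirroring the criterion summarized in the reviews."""
    return (cosine(grad_interp, grad_left) < epsilon and
            cosine(grad_interp, grad_right) < epsilon)

# Toy usage with random arrays standing in for real loss gradients.
rng = np.random.default_rng(0)
g_left = rng.normal(size=(3, 64, 64))
g_right = rng.normal(size=(3, 64, 64))
g_mid = -(g_left + g_right)   # deliberately misaligned with both neighbours
print(should_insert_keyframe(g_mid, g_left, g_right))   # True -> insert a key frame
```

With $\epsilon = 0$, as in the reported experiments, this reduces to requiring both cosine similarities to be negative.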
Episodic Memory Representation for Long Video Understanding Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a novel framework, Video-EM, to improve video QA performance for long-form video understanding by generating clip-level descriptions and scene details of key events as episodic memory representations and answering with VLMs over them. The key steps for obtaining episodic memory representations are to select key frames, build events by expanding adjacent frames, generate summaries of the form {when, where, what, which object} for events, and construct scene details covering object counts and location relationships. The framework integrates Chain-of-Thought (CoT) reasoning on VLMs over these representations. The effectiveness of the proposed method appears to be validated experimentally, with state-of-the-art results on 4 long-video understanding benchmarks. - The paper identifies the bottlenecks of previous methods for long-form video understanding, which revolve around context window limitations and keyframe redundancy. The proposed idea seems well-motivated and easy to analyze, with a readable representation of key events as episodic memories. - Video-EM is training-free and can be integrated with other Video-LLM backbones, showing good modularity and extensibility. - The paper provides experimental results on 4 benchmarks (Video-MME, LVBench, HourVideo, Egoschema), consistently outperforming state-of-the-art methods with fewer frames. - The overall performance of Video-EM appears to depend heavily on computer vision modules such as object detection, boundary decision, and captioning components. In complex or atypical scenes, misdetections or poor captions can undermine reliability. - It would be helpful to provide failure cases so that readers can assess weaknesses and robustness. - While the modularity of Video-EM is emphasized, the dependencies between modules (e.g., how errors propagate from object detection to CoT reasoning) are not deeply analyzed. The robustness of the system under suboptimal conditions (e.g., noisy input, failed detection) is not empirically validated, which is crucial for assessing the reliability of the proposed approach. How does the framework handle errors in object detection or captioning? Are there any mechanisms, such as the CoT reasoning, to mitigate or correct such errors? - I don't think this manuscript provides enough information to reproduce all of the results. Releasing the code would help resolve this issue. Will the code be publicly available? Fully human-written
Episodic Memory Representation for Long Video Understanding Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper introduces Video-EM, a training-free framework designed to improve the performance of Video Large Language Models in understanding long-form videos by overcoming context window limitations and frame redundancy. Video-EM reformulates long-form video question answering by treating isolated keyframes as temporally ordered episodic events, capturing essential spatio-temporal relationships often missed by traditional static sampling methods. The framework involves three core components: Key Event Selection, Episodic Memory Representation (which encodes dynamic scene narratives and relationships), and a Chain-of-Thought reasoning module that iteratively selects a minimal yet highly informative subset of memories. Extensive experiments across four long-video benchmarks demonstrate that Video-EM enhances the accuracy and efficiency of some Video-LLM backbones using fewer frames on average. - Instead of treating selected frames as disconnected images (a stated limitation of previous keyframe retrieval methods), Video-EM reformulates them as temporally ordered episodic events, avoiding the temporal discontinuities that often disrupt the semantic narrative of events in traditional methods. - Video-EM leverages a Chain-of-Thought (CoT) thinking strategy to iteratively identify and retrieve a minimal yet informative subset of episodic memories. - Video-EM is a training-free framework that can be integrated with off-the-shelf Video-LLM backbones without requiring retraining or architectural modification. - Video-EM is a complex, multi-stage pipeline that relies heavily on the quality and coordination of several external, specialized foundation models. - The concept of episodic memory has been explored by HERMES [1], also a plug-and-play model, with similar claims to Video-EM, yet the differences/similarities between the two are not specified, nor are the results of HERMES discussed in the manuscript. - Several plug-and-play modules for Video-LLM accuracy/efficiency improvements have been published in recent years, such as FastV [2], VisionZip [3], and VFlowOpt [4], in addition to the aforementioned HERMES [1]. I am curious about the comparison results with these other plug-and-play frameworks in terms of accuracy/efficiency tradeoffs, and also in terms of methodology. - While Video-EM successfully reduces the number of frames processed by the final Video-LLM (from 41 frames down to an average of 9 on EgoSchema, for example), the preceding processing steps require extensive computation across multiple large models (CLIP, DINOv2, RAFT, Grounding-DINO, Tarsier2-7B, Qwen3-8B). I thus believe the slight accuracy improvement does not justify the upstream cost of putting such a system together. - I also think such a system is very fragile. A deficiency in the initial retrieval stage (Key Event Selection) or the intermediate processing stages directly impacts the quality of the final input provided to the Video-LLM. It follows that these results would be a headache to replicate. - The authors' efficiency claims are not substantiated. Fewer frames do not automatically mean greater efficiency.
- Ambiguous variable definition: In Section 3.2, in the "Adaptive Event Expansion" paragraph, the authors define $\alpha$ as a variable with a value between 0 and 1, yet immediately after that, the paper states that $\alpha$ is set to 2. I am quite confused by that. - I think Figure 2 has too much text, is quite convoluted, and the bright red color is not easy on the eye. [1] Faure, Gueter Josmy, et al. "Hermes: temporal-coherent long-form understanding with episodes and semantics." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025. [2] Chen, Liang, et al. "An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024. [3] Yang, Senqiao, et al. "Visionzip: Longer is better but not necessary in vision language models." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025. [4] Yang, Sihan, et al. "Vflowopt: A token pruning framework for lmms with visual information flow-guided optimization." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025. See weaknesses, plus: - The paper highlights the reduction in frames input to the final Video-LLM (e.g., 41 frames down to an average of 9 on EgoSchema). What is the complete end-to-end inference latency (or total computational cost) for the full Video-EM pipeline (including Key Event Selection, Episodic Memory Representation, and Chain-of-Thought steps using all five foundation models and Qwen3-8B)? How does this total cost compare to the baseline model running the maximum allowed frame input? - The paper acknowledges that the method is "limited by the accuracy of captioners and object detectors". What testing or simulation was performed to quantify how a decrease in accuracy (e.g., failure rate) in a crucial upstream component (such as Grounding-DINO missing key objects or Tarsier2-7B generating an inaccurate Dynamic Scene Narrative) propagates and impacts the final Video-LLM performance? - Given the strong claims of superiority over prior methods, why were empirical comparisons against other existing plug-and-play, training-free long-video understanding frameworks with similar goals, such as HERMES (which also uses episodes and semantics), omitted? Providing context for these comparisons is crucial for substantiating Video-EM's novelty and competitive edge in the crowded field of V-LLM accelerators. - In the description of the multi-grained semantic retrieval (L193 onwards), is the summation of equation (1) over the set $Q=\{q_1, q_2, q_3\}$ or $Q=\{q, q_o, q_s\}$? In other words, what is $q_i$, and why do we have $W_{q_1}$, $W_{q_2}$ and $W_{q_3}$ but no $W_q$, $W_{q_o}$ and $W_{q_s}$?
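The last question above concerns a weighted multi-query retrieval score. The sketch below shows one plausible reading of such a score as a weighted sum of cosine similarities between per-granularity query embeddings (e.g. the full query, an object-level query, and a scene-level query) and a frame embedding; the weights, the embedding dimension, and the helper names are assumptions for illustration, not the paper's Equation (1).

```python
import numpy as np

def frame_score(frame_emb, query_embs, weights):
    """Weighted sum of cosine similarities between one frame embedding and
    several query embeddings (e.g. full query, object-level, scene-level)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return sum(w * cos(frame_emb, q) for w, q in zip(weights, query_embs))

# Toy usage with random vectors standing in for CLIP-style embeddings.
rng = np.random.default_rng(0)
frame = rng.normal(size=512)
queries = [rng.normal(size=512) for _ in range(3)]   # q, q_o, q_s (illustrative)
print(frame_score(frame, queries, weights=[0.5, 0.3, 0.2]))
```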
Episodic Memory Representation for Long Video Understanding Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes Video-EM, a training-free framework for long-form video understanding inspired by human episodic memory. Instead of treating keyframes as isolated tokens to VLLMs, Video-EM groups them into temporally ordered key events, expands events to recover missing context, and builds rich episodic memory representations that capture when, where, what, and which objects. It then uses Chain-of-Thought reasoning to iteratively select a minimal but informative subset of episodic memories before passing them to a VLLM. * Clear motivation. * Training-free approach that can equip state-of-the-art VLLMs with improved performance. * The paper appears to be aware of the related work. * The key event selection is sound, and they expand each event to recover query-relevant context that similarity-based approaches may miss. This sounds novel and important, as pure semantic retrieval can yield a sparse set of disjoint frames. * Video-EM reduces frames while improving accuracy. * The paper provides ablation studies. * The adaptive event expansion module feels somewhat heavy for a training-free method; simpler alternatives (e.g., adjacent-frame motion thresholds) could be discussed or compared. * Heavy reliance on object detectors and captioners, whose failure cases may propagate. * A notable limitation is that the method introduces several hyperparameters across multiple stages (e.g., similarity thresholds, expansion limits, CoT confidence and depth, temporal gap $\Delta t$). While the authors provide ablations showing relative robustness, the number of hyperparameters is still large, and tuning them in practice may be non-trivial. * As a video agent, Video-EM requires too many different models, which can lead to efficiency problems and a lack of end-to-end practicality. * The baselines differ from dataset to dataset. While this is OK, it makes it a bit difficult to assess Video-EM's capabilities. * You should test Video-EM with LLMs other than the Qwen family (such as VideoLLaMA3, InternVL3...). * Results are not state-of-the-art; however, improvements over backbone models are achieved. Minor comments: * Authors use \citet instead of \citep. * Use Gemini 2.5 Pro instead of the 1.5 version. * How do you capture object-level semantics $q_0$ and scene-level context $q_s$? * Why do you need an adaptive event expansion mechanism based on the spatio-temporal difference metric? What does such a complex method bring to the table? Couldn't simpler methods yield similar results? Fully human-written
Episodic Memory Representation for Long Video Understanding Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a new pipeline for preparing video features for LLMs. Beyond simple keyframe retrieval, it introduces an agentic flow designed to capture temporally ordered events and reconstruct the underlying narrative. Based on the extracted components, the authors employ a CoT prompting strategy to enhance reasoning and improve understanding. Experiments are conducted across several benchmark datasets. 1. The paper attempts to construct scene graphs to decompose video content, which is an interesting idea. 2. The proposed pipeline is reasonable, and the implementation details are concrete and easy to understand. 3. The experiments are comprehensive, covering most mainstream long-video benchmarks currently available. 1. The performance does not reach state-of-the-art results. For example, it is notably inferior to Video-XL-2 [1]. Additionally, some training-free retrieval methods (e.g., BOLT [2]) are missing from the comparison table, which weakens the technical contribution. 2. The main contribution lies in pipeline design rather than technical innovation. The approach feels closer to a text-based agent framework, so the title’s emphasis on “representation” may be misleading—it seems more like an engineering effort. [1] Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification [2] BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding The numbers reported in Table 1 (for Qwen2.5-VL) show a large discrepancy compared to the original paper. For instance, LVBench should report 45.3 for Qwen2.5-VL-7B. This inconsistency raises concerns about the results’ reliability. Although the relative improvement over the baseline is significant, the absolute performance values are not aligned with prior reports. Lightly AI-edited
When Scores Learn Geometry: Rate Separations under the Manifold Hypothesis Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper studies diffusion models and argues that their success is due to learning the manifold supporting the data distribution rather than finely learning the data distribution itself. The authors support their claim by showing that, in the small noise limit, the dominating term in the noised log density is determined purely by the manifold and does not involve the data distribution. The authors go on to argue that from a methodological perspective, the aim should be to estimate the manifold rather than the distribution since this task can be done successfully even with larger errors in the score estimation; in this sense, geometric learning is easier than distribution learning. Additionally, the authors propose a sampling algorithm to sample from the uniform distribution on the manifold given a score function estimator. Finally, the authors present experimental results to support their claims. The paper makes a good case for switching paradigms to geometric learning (i.e. learning the uniform distribution on the manifold) as the score error requirements are much less stringent than that required for full distribution learning. It also seems natural to me from a generalization perspective that the manifold ought to be the target. It is also quite nice that a simple (one-line!) modification to the sampling dynamics can tolerate higher score errors and give uniformly distributed samples on the manifold. The paper is written very clearly and the authors provide good exposition. The experiments indeed show better diversity as one expects from targeting the manifold rather than the data generating distribution. 1) The result of Theorem 4.1 requires an $L^\infty$ bound on the score estimation error. This is quite stringent and undesirable, especially since diffusion models are typically trained via score matching which is an $L^2$ loss. Furthermore, there is some empirical evidence that $L^\infty$ assumptions are not typically satisfied in practice (e.g. see Section 3.1 of "Fast Sampling of Diffusion Models with Exponential Integrator", Zhang & Chen 2023). The authors do note this in their Limitations section, but it is a weakness nonetheless. 2) The results in Theorem 3.1 are somewhat qualitative (i.e. consistency results and weak convergence), and it would be a stronger paper if quantitative rates in some distance (e.g. Wasserstein distance) were available. For example, if E(\sigma) = o(1), is it meaningfully easier to estimate the uniform distribution on $M$ rather than the true $p_{\text{data}}$? If it is not, then perhaps the justification for geometric learning becomes weaker. 3) I'm confused about the justification for learning the uniform distribution from the Bayesian Inverse problem perspective. I agree that choosing it as the uniform yields a weaker score estimation condition for correct posterior sampling. But I'm not convinced this is by itself a good enough reason. There is strong information that is being lost by switching to the uniform distribution which might have resulted in a "better" posterior distribution but harder to sample from. 
It's not obviously clear why this is less desirable than a less informed prior with easier sampling requirements. 1) Can the authors discuss regularity conditions on $p_{\text{data}}$? It seems that learning the uniform distribution on the manifold can get quite difficult if there are meaningful regions on $M$ where $p_{\text{data}}$ is very small. But this doesn't seem to be reflected in any of the results. Is this due to the qualitative nature of Theorem 3.1? It would be very clarifying if the authors could discuss this point in some detail. 2) Can the authors provide some high-level discussion on sample complexity? With finite data, is it the case that learning the uniform distribution on $M$ is only going to be feasible when $p_{\text{data}}$ is bounded above and below by a constant on its support (i.e. there's enough mass everywhere on the manifold)? 3) What are the difficulties in getting a quantitative version of Theorem 3.1 (say in Wasserstein distance or something else)? It would be nice to add a bit of discussion about this in the paper. Fully human-written
When Scores Learn Geometry: Rate Separations under the Manifold Hypothesis Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. While score-based models such as diffusion models are typically thought as approximating the data distribution, this paper hypothesizes that---assuming the popular manifold hypothesis---their success arises from implicitly learning the manifold instead ("geometry learning"). The authors provide theoretical results that show that (1) recovering the data distribution (as the noise level $\sigma \to 0$) is *difficult* as it requires a strict $o(1)$ error on the learned scores, while (2) mere *concentration* on the data support (i.e., the manifold) can be achieved with a much larger score error of $o(\sigma^{-2})$. However, in the latter case, it is shown that the learned distribution supported on the manifold can be arbitrary. To this end, the paper proposes a simple modification to Langevin-based samplers that draws from the *uniform distribution* on the manifold, which still only requires $o(\sigma^{-2})$ score errors. Similarly, the authors show that in Bayesian inverse problems, sampling from the posterior induced by a prior that is uniform on the manifold also tolerates $o(\sigma^{-2})$ score error, while $o(1)$ error must be assumed when picking the (Gaussian smoothed) data distribution as the prior. The author's theoretical results are empirically validated using low-dimensional synthetic data, as well as pre-trained large-scale diffusion models. + **Motivation and Relevance.** The work is clearly motivated: Understanding learning dynamics in score-based models is highly relevant for contemporary generative modeling and Bayesian inference, but also for potential downstream tasks such as using the learned density for out-of-distribution detection etc. + **Presentation.** The theoretical results are presented well. The Taylor expansion in Theorem 3.1 gives a good intuition for the results, while being less technical. + **Novel, important theoretical results.** The rate separation results (Theorem 4.1) are novel and important for this line of research. Under standard assumptions, the results give new insight into why learning geometry information is "easier" than distribution learning in score-based models. + **Uniform-on-manifold sampling.** Tampered Score (TS) Langevin dynamics is a surprisingly straightforward adaptation that comes with relatively strong theoretical guarantees regarding convergence to the uniform distribution on the manifold. Specifically, the guarantees also hold in the practical case where $s(\cdot, \sigma)$ is a non-conservative vector field (as in most diffusion models that directly output the score). + **Empirical Validation.** + The empirical results are limited in scope, and do not directly support the main rate separation results in Theorem 4.1. The paper would benefit from such experiments (e.g., synthetic data with known manifold and ground truth scores, controlled injection of score error, systematic analysis on how manifold concentration and distribution recovery behaves). + The experiment in Section 7.1 would benefit from a quantitative evaluation of the qualitative results in Figure 2. 
Moreover, higher-dimensional synthetic experiments (with known manifolds) would shed light on more practical settings, as score errors can be very different in 2D when compared to high-dimensional problems. + The experiment in Section 7.2 seems weak as it shows only slight quantitative improvements on merely three prompts. No error bars are provided, and it is unclear if the results are significant. While in Table 2, $\alpha=1$ is fixed, in Table 1, TS was tuned on both the number of correction steps *and* $\alpha$, while PC was only tuned on the former, which makes the tuning budget unbalanced. + **Discretization Error.** All theoretical analyses assume continuous-time models, while discretization error is disregarded. However, in practice, discretization error has an empirically large influence on the performance. Extending the theory in this direction would make it more practically applicable. + How sensitive are the results in Section 7 to the choice of $\alpha$? + A possible relaxation of the $L^{\infty}$ error assumption is mentioned in the limitations. Can you elaborate on the importance of this relaxation? How could the presented results be related to, e.g., the denoising score matching loss (Fisher divergence) of trained models?
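The reviews above describe TS as a small, $\alpha$-controlled modification of a Langevin-type sampler, but they do not give its exact update rule. The sketch below is therefore only a generic annealed Langevin loop in which the score drift is scaled by an assumed factor `alpha`, paired with a toy score for data concentrated near the unit circle; the form of the tempering, the function names, and the toy score are all hypothetical, not the paper's TS dynamics.

```python
import numpy as np

def langevin_sample(score, x0, sigmas, alpha=1.0, step=1e-3, n_steps=100, rng=None):
    """Annealed Langevin sampler whose score-drift term is scaled by `alpha`.
    The scaling is only a placeholder for the paper's tampering; the actual TS
    update rule is not specified in the reviews."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for sigma in sigmas:                        # anneal the noise level downward
        for _ in range(n_steps):
            drift = alpha * score(x, sigma)     # the "one-line" modification would live here
            x = x + step * drift + np.sqrt(2.0 * step) * rng.normal(size=x.shape)
    return x

def toy_score(x, sigma):
    """Score of a toy density concentrated near the unit circle in R^2."""
    r = np.linalg.norm(x) + 1e-12
    return (1.0 - r) * x / (r * sigma**2)       # pulls x toward radius 1

print(langevin_sample(toy_score, x0=[2.0, 0.0], sigmas=[0.5, 0.2, 0.1]))
```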
When Scores Learn Geometry: Rate Separations under the Manifold Hypothesis Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. In this work, the authors study a rate separation when score-based methods learn manifold data, namely that recovering the data manifold is easier than recovering the exact underlying distribution on the manifold, where the difficulty is measured by the error tolerance for score approximation. The authors thus argue for a paradigm shift from distribution learning to geometry learning - more specifically, targeting uniform distributions on manifolds - and introduce the Tampered Score (TS) dynamics to generate distributions on manifolds that are close to uniform. 1. The authors' main hypothesis - that recovering manifolds is easier than recovering the exact distributions on manifolds for Langevin-type dynamics with approximate score functions - is novel to my knowledge. If fully validated on score-based generative models, it can potentially shape our understanding of how such models fundamentally work. 2. The authors derive precise theorems to quantitatively characterize the hypothesis, enabled by a rigorous framework based on differential geometry and impressive mathematical techniques (not fully checked). 3. Although the idea of tempering the drift term in Langevin dynamics is not entirely new, the proposal to use it for learning uniform distributions on sub-manifolds is novel to the best of my knowledge, and also as a modification to the corrector step in predictor-corrector sampling. 4. The presentation is clear overall. 1. The analysis is performed with respect to the stationary distributions of the various score functions, rather than the distributions obtained from the denoising dynamics (i.e. reverse ODE / SDE) as in score-based generative models. Hence it is unclear to what extent the theoretical insights derived for the former idealized setting also hold for the latter (even in the continuous-time setup). In the latter case, for example, even small deviations in the score function can cause the final distribution to be supported beyond the manifold. 2. In particular, I wonder whether it tends to take longer for the TS dynamics to converge to the stationary distribution than the (untempered) Langevin dynamics. Intuitively I suspect this is the case since the drift is reduced with $\alpha > 0$. As the theoretical analysis is concerned only with the stationary distributions, the results are unable to address questions regarding the rate of convergence, which are nevertheless quite relevant for practice. 3. In the paragraph after Theorem 5.2, the authors claim that the TS scheme helps with recovering "the uniform distribution on the data manifold from samples of $p_{data}$", but I don't see a justification of this by the theoretical results. In fact, all the main theoretical results require the regularity assumption (Assumption 2.2) that the noiseless distribution $p_{data}$ has a continuously differentiable density on the manifold, which does not hold in the case of empirical distributions of finite samples. 4.
In Remark 4.1 on Page 5, the authors claim that the assumption of a compact support of the limiting distribution is reasonable because "many diffusion models apply clipping to generated samples". I think this is a bit misleading - applying clipping as a post-processing step after sampling doesn't affect the properties of the stationary distributions of the score functions, about which the assumptions are stated. Typos: 1. Line 202, "difficult" -> "difficulty" 2. Line 336, "satisfie" -> "satisfy" See the "Weaknesses" section above. Fully human-written
AdaptiveResidual: Inference-Time Trust Calibration for Contextual Knowledge Injection Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes AdaRes, a parameter-free, inference-time mechanism to address knowledge conflicts between an LLM's internal (parametric) knowledge and external (contextual) information. The method is inspired by interpretability findings that identify the Attention module as the context aggregator and the FFN as the knowledge store. AdaRes calculates instance-specific "trust scores" for each source ($\alpha^{(l)}$ for context, $\beta^{(l)}$ for parametric) and uses them to reweight the respective contributions within the residual pathway. The paper reports empirical results on knowledge editing and conflict-aware QA benchmarks. The paper introduces a novel, training-free mechanism to address the critical problem of knowledge conflicts in LLMs, which is highly relevant for improving the reliability of RAG. The core idea is well-motivated by mechanistic interpretability findings, specifically the distinct roles of the Attention (context) and FFN (parametric) modules. - The method's design is heavily "context-first" (as seen in the focus on Scenario #4) and seems to require a priori knowledge that the context should be trusted. This is a significant limitation that is not clearly acknowledged. A simple but crucial baseline is missing: prompt engineering. The results in Table 3 show low performance for the "Original" baseline even on instruction-tuned (it) models, suggesting that a simple, well-crafted prompt to "follow the context" was not explored as a point of comparison. The "IKE" results hint at a positive role of the instruction for Qwen 2.5. The actual prompts used are not disclosed, which harms reproducibility. - The paper's description of the methodology is lacking. Important details about the FFN probing mechanism (for $\beta^{(l)}$ estimation) are relegated to the appendix. Furthermore, a "Top-n" selection is depicted in Figure 2 but never explained in the text, and it appears to be a hyperparameter. This contradicts the claim that the set of target layers $\mathcal{H}$ is the "sole hyperparameter" (line 229). The description of the context trust estimation ($\alpha^{(l)}$) is also unnecessarily convoluted. - The submission is missing highly relevant citations in its related work (Section 2.2) regarding dual-response or fusion strategies. For example, [Huang et al. (ICLR 2025)](https://openreview.net/forum?id=K2jOacHUlO) addresses the identical problem of "dynamically calibrating trust" to "resolve knowledge conflicts" and should be discussed. - The claim of "negligible runtime cost" is questionable. Algorithm 1 and the three-stream design imply that at least one additional, full forward pass is required for the probes. The significant increase in inference time in Figure 6 demonstrates this cost, while the paper downplays it. - How is AdaRes intended to operate when it is not known a priori whether the context or the parametric knowledge is correct? What happens if it is applied in Scenario #2 (correct model, wrong context)? - Can you please disclose the prompts used for the "IKE" baseline?
A comparison against a strong, instruction-based prompt (e.g., "Follow the context provided exactly") seems like a critical and missing baseline. - Please clarify the "Top-n" selection from Figure 2. Is this a hyperparameter, and how was it set? This contradicts the claim that the layer set $\mathcal{H}$ is the sole hyperparameter. ### Minor Issues - Table results (e.g., Table 1) would be more readable as percentages. - Tables 2 and 3 are difficult to parse; adding a separate header row for model labels (e.g., Phi3, Gemma3) would improve clarity. - There are minor typos (e.g., "an" -> "a" in line 248). Lightly AI-edited
AdaptiveResidual: Inference-Time Trust Calibration for Contextual Knowledge Injection Soundness: 2: fair Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. AdaRes (AdaptiveResidual) is a training-free, inference-time trust calibrator to resolve knowledge conflict in LLM parametric knowledge and contextual knowledge. . It probes each layer on-the-fly to compute two “trust scores” (query-to-context attention and FFN memory affinity), then asymmetrically rescales the residual contributions to prioritize the more trustworthy source; the only hyperparameter is which layers to apply it to (chosen by a simple greedy search). Across conflict-centric evaluations (ZsRE, CounterFact, ConflictQA variants), AdaRes strongly improves adherence to the supplied context and preserves locality, often outperforming editing and parameter-editing baselines 1. Resolving knowledge conflicts is a timely and important problem. 2. The paper’s methodology is presented and written clearly, with a well-structured description that makes the approach easy to follow. 1. **Problem Scope**. While the paper is framed as resolving knowledge conflicts, in practice it mainly addresses how to make LLMs more faithful to external contexts, that is, how to prioritize retrieved evidence over internal memory. This effectively reduces the problem to *enforcing context faithfulness rather than truly deciding between conflicting knowledge sources*. The more interesting challenge is how to determine which side deserves trust; if we already assume the external context is more reliable, the task becomes much simpler. In that case, one might wonder why not simply optimize the prompt or training objective to explicitly instruct the model to follow the context. The long-context scenario might make this harder, but benchmarks used in the paper (e.g, ConflictQA) involve short passages that are far below the model’s context limit. 2. **Missing benchmarks**. There exist several datasets [1, 2, 3] that explicitly evaluate knowledge conflict resolution, but the paper only reports results on ConflictQA, while the rest are knowledge editing benchmarks, which only partially capture the intended problem. 3. **Missing baselines.** Numerous prior works, both prompting-based and training-based, directly tackle knowledge conflict resolution, yet none are included as baselines [3, 4, 5, 6]. The authors should either compare with or at least discuss why these methods were omitted. [1] “ClashEval: Quantifying the tug-of-war between an LLM’s internal prior and external evidence”, NeurIPS 2025 \ [2] “FaithEval: Can Your Language Model Stay Faithful to Context, Even If 'The Moon is Made of Marshmallows'”, ICLR 2025 \ [3] “To Trust or Not to Trust? Enhancing Large Language Models' Situated Faithfulness to External Contexts”, ICLR 2025 \ [4] “KnowPO: Knowledge-aware Preference Optimization for Controllable Knowledge Selection in Retrieval-Augmented Language Models”, AAAI 2025 \ [5] “FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation”, ACL 2025 \ [6] “Trusting Your Evidence: Hallucinate Less with Context-aware Decoding”, NAACL 2024 See weakness. Lightly AI-edited
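The AdaRes reviews describe per-layer trust scores that rescale the attention (contextual) and FFN (parametric) contributions on the residual stream at selected layers. A minimal sketch of that kind of reweighting is given below; the toy modules, the exact placement of the scaling, and the omission of layer norms and of the probing that produces alpha and beta are simplifying assumptions, not the paper's Algorithm 1.

```python
import torch

def trust_weighted_block(h, attn, ffn, alpha, beta):
    """One residual block in which the attention (contextual) and FFN
    (parametric) contributions are scaled by instance-specific trust scores
    before being added back to the residual stream."""
    h = h + alpha * attn(h)   # contextual contribution, weighted by alpha
    h = h + beta * ffn(h)     # parametric contribution, weighted by beta
    return h

# Toy stand-ins for a real transformer layer (illustrative only).
d = 16
attn = torch.nn.Linear(d, d)
ffn = torch.nn.Sequential(torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d))
h = torch.randn(2, 5, d)                      # (batch, tokens, hidden)
print(trust_weighted_block(h, attn, ffn, alpha=1.3, beta=0.7).shape)
```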
AdaptiveResidual: Inference-Time Trust Calibration for Contextual Knowledge Injection Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes AdaRes (Adaptive Residual), a lightweight, training-free inference method for dynamically reconciling knowledge conflicts in large language models (LLMs). The method reparameterizes the residual connections in selected layers to adaptively balance the influence of contextual knowledge (from attention) and parametric knowledge (from the feed-forward network) at some heuristically chosen layers. Experiments on knowledge editing benchmarks showcase the effectiveness of the proposed approach. - The paper was well-written and easy to follow. - The paper provides very detailed explanations of its experiments, fostering reproducibility. 1. *Methodological justification is weak.* The core assumption that the entire attention module represents contextual knowledge while the entire MLP (FFN) module represents parametric memory is insufficiently supported. The paper provides no empirical or theoretical evidence for this decomposition. Prior work has shown that attention layers can themselves act as associative memory mechanisms [1, 2], directly challenging this simplification. Moreover, Equation (4) implicitly assumes that all contextual information is trustworthy, which rarely holds in realistic settings. Although the authors categorize four types of knowledge conflict in Figure 3, they only address the “context-preferred” case (Scenario #4), leaving the other scenarios unhandled by the proposed formulation. This narrow scope leads to potential overclaiming of generality. In addition, the use of dynamically computed trust values $\alpha$ and $\beta$ lacks clear motivation or theoretical/empirical grounding. The paper does not explain why these specific scaling forms are appropriate or how they relate to the underlying model dynamics, making the mechanism appear ad hoc. 2. *Limited novelty and contribution.* The paper’s conceptual framing overlaps substantially with existing literature on scaling and analyzing attention heads and FFN modules (e.g., works in [3]). While the implementation is lightweight, it does not introduce new insights or mechanisms that significantly advance the mechanistic interpretability’s understanding of contextual–parametric interactions. 3. *Experimental design and evaluation issues.* The experimental setup raises several concerns. Although the work claims to address knowledge conflict, it evaluates primarily on knowledge editing benchmarks, which are conceptually distinct. (1) Datasets: For the contextual-conflict case (Scenario #4), standard benchmarks such as NQ-Swap [4] and Memo-Trap [5] should be included. If the paper intends to cover other conflict types (e.g., Scenarios #1 and #2), corresponding datasets should also be used; otherwise, these discussions should be removed for focus and clarity. (2) Baselines: In Table 1, the comparison set omits a number of decoding-based methods explicitly designed to mitigate contextual hallucination and knowledge conflict, such as [6-10] and there are more missing ones. Including these would provide a fairer and more meaningful evaluation. 4. *Key related works are missing*. 
[11] also discusses the role of MLP and attention in much more detail, and [12] shows that intervening on the entire attention module could lead to superposition. Both works can reconcile knowledge conflicts in both Scenarios 1 and 4 by intervening only in the attention module. These omissions weaken the contextualization of the proposed approach and raise questions about its incremental contribution. [1] Memorization capacity of multi-head attention in transformers. ICLR'23 [2] Understanding factual recall in transformers via associative memories. ICLR'25 [3] Attention Heads of Large Language Models: A Survey. ArXiv'24 [4] Entity-Based Knowledge Conflicts in Question Answering. EMNLP'21 [5] https://huggingface.co/datasets/Albertmade/memo-trap [6] Trusting Your Evidence: Hallucinate Less with Context-aware Decoding. NAACL'24 [7] Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation. EMNLP'25 [8] Sled: Self logits evolution decoding for improving factuality in large language models. NeurIPS'24 [9] Dola: Decoding by contrasting layers improves factuality in large language models. ICLR'24 [10] AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge. NAACL'25 [11] Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models. ACL'24 [12] Taming Knowledge Conflict in Language Models. ICML'25 See the points raised in the Weaknesses section above.
AdaptiveResidual: Inference-Time Trust Calibration for Contextual Knowledge Injection Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces an inference method for adapting the reliance of an LLM's output more on the context provided, as opposed to its internal knowledge. The key idea is that the residual connection which adds the attention outputs to the FFN outputs can serve as a modulator between the two, with the attention output serving as a proxy for the reliance on the context. Empirical results on Qwen and Llama models in the 7-8B range demonstrate that this can indeed improve the contextual dependence of the model responses. - The observation that residual connections can serve as a modulator of context vs internal knowledge dependence is, to my knowledge, new and quite interesting. - The paper provides quite extensive empirical results in terms of both the models and datasets, as well as the hyperparameter choices and other variables involved in the methods. - The paper is well written and quite easy to follow, even if it is unnecessarily math-y when describing the methods. - Section 3.2.1 for trust estimation for \alpha seems to be described incorrectly. In the formulation presented, \alpha is computed as an average of the per-row softmax outputs of query -> context attention. But softmax outputs sum to exactly 1, so it is not clear why these would sum up to anything other than 1/M. This is clearly not the case based on the results presented later, so I suspect the issue is in the description of the method. - The intro and motivation seem to position the method as a general "dynamic" scheme for selecting between context and internal knowledge. But in practice, it only applies to the setting where the context is correct and the internal knowledge is incorrect (Figure 3). The claims would be supported more strongly if there were also experiments studying the reverse direction -- internal knowledge is correct and the context is wrong. - On a similar note, the paper is lacking important baselines: (i) simply prompting the model to trust the context instead of its internal knowledge (after all, the trust method already assumes that the context is correct); and (ii) baselines from a very relevant paper published at ICLR 2024. (This paper is actually completely missed from the related work discussion). - The layers used for AdaRes seem to be quite sensitive to the choice of model and dataset. There is no discussion if the layers selected for one setup will generalize to other setups, limiting the practical applicability of the method. - Contrary to what the text claims, there seems to be quite a significant impact on the latency of the model (in some cases 50-100% slowdown in Figure 6). - Why are the main results in the paper on base models instead of instruction tuned ones? The latter seem more relevant for QA tasks. - What is the vanilla res method in Table 2? Fully human-written
Data- and Hardware-Aware Entanglement Selection for Quantum Feature Maps in Hybrid Quantum Neural Networks Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors present a framework for devising an entanglement strategy for a feature map of a QNN based on data and hardware. The authors did not provide a background section introducing the setup of a QNN. I would expect the paper to at least describe the setup in terms of encoded features, ansatz, and loss function, and also to describe the BP phenomenon in more detail. Moreover, explanations for QFI and the content of the section on Theory-consistent design are missing. Further, there is a substantial amount of literature on quantum architecture search, QFI applications, and hyperparameter optimization that is not referenced or discussed. The paper has structural issues. Large parts of the theory connecting important concepts of the main part are moved to the Appendix, making it hard to follow. I would urge the authors to refactor the paper. The appendix should be used to add additional, non-fundamental details, not to completely outsource large fractions of the paper. Further, the benchmarking setup is lacking explanations. At least an introduction and overview of the experiments should be given in the main text (the appendix should only be used to give supplementary information, not to contain crucial aspects of the whole setup). The baselines do not represent what is commonly used (i.e., why use linear and random instead of common entanglement strategies that are implemented in Qiskit or PennyLane? Pairwise, circular, etc. are what is commonly used, while to the best of my knowledge random entanglement is never done). The results are missing statistical significance statements, in particular considering that the differences between the methods are often only marginal. The method is moreover not scalable, which is briefly explained in the conclusion by stating that for larger-scale applications, direct state vector access is not a feasible alternative. Given that the effort to retrieve and store the state vectors is exponential (as is full state tomography, which is noted in the conclusion as well), I am questioning why this is even proposed. I would be curious to hear how this method could be scaled to go even to twenty qubits. I think this is an important aspect that should be addressed in the rebuttal by the authors. It is for these reasons that I cannot recommend the paper to be published at ICLR at this point. - HW-inspired architectures are indeed a promising research direction and could lead to significant benefits. - The objective function they propose provides a meaningful way to balance considerations in entanglement choices. - Abstract: "While entanglement in this feature map layer can enhance expressivity, heuristic choices often degrade trainability". Entanglement-induced BPs are a proven phenomenon - be more precise in what you want to state. - Please use \citep if you want to cite something inline; otherwise it is hard to read. - P.3: "Data-driven methods engineer the feature map to reflect data structure. This includes, ... , and data re-uploading".
Data re-uploading does not engineer the feature map to necessarily reflect data structure; rather, it just continuously encodes the input over and over. - Theory-consistent design: There are no further details on "manifold-local pairings" and, to me, it seems to be a claim that the HSD as a regularizer preserves favorable (non-BP) scaling. This section needs further elaboration and background. - Line 181: referencing of Eq. - Line 140: Please refrain from using "ideal" when used with heuristics. - Elaborate on degree hints; it is not entirely clear what this means. Also, if this is a formula, why not just state it? - No explanation of synthetic dataset generation in the main paper; a short description in the main text is at least necessary. - Exaggeration of results: the results do not demonstrate robustness of the approach if oftentimes the results are only very marginally better than a random (!) baseline (Heart data AUC and Accuracy). Further, for the synthetic data, the main text states the performance compared to the worst baseline result, which is misleading. - Typos: line 298. - Line 129: Please elaborate on how utility and cost terms are normalized before being combined and why this is not shown in the equation. - Line 130: Elaborate on the rank-based scaling that is employed. - Eq 2: Define $\psi$. - Line 146: It does not suffice to state that the two-qubit DMs are computed in a similar process. State the process in the main part of the paper if it is such a significant part of the work. You can put details/derivations/further explanations in the Appendix, but at least stating it in the main part would be appropriate. - Line 166: What theoretical concerns are you mitigating? If stated like that, it requires at least a short explanation in the main paper before linking to the appendix. - Line 169: The claimed connection between correlation and trainability seems to be based only on empirical observations. Is there any theory to back this up? Otherwise, it needs to be framed differently to mirror that this is only an empirical observation. - Eq 6: Why is the accumulated error from the SWAP sequences 3? Can you elaborate in the paper? - Line 209: Why is the full objective function not restated? This seems to be an important aspect. - Why is random entanglement not integrated into ATP and Data Reuploading? - How is random entanglement enforced? What is the strategy? How often do you repeat these results to get any sort of robustness? - With randomization involved, can you elaborate on the statistical significance? - What is the Silhouette score in Fig. 1a, and why is it not explained earlier? - Why are the training dynamics in 4.2 based on different experiments than 4.1? Again, missing details on the experimental setup. - Line 321: Why would better convergence be evidenced by a higher class separability? How are the two connected? - Line 325: Why is the central hypothesis only detailed in the Appendix? This is an important aspect of the work, while the Appendix is only supposed to provide additional information. - Line 326: Elaborate on the connection of QFI with the optimization direction. - Line 332: What does "our proposed method exhibits rapidly increase in trajectory" mean? Faster convergence? Please clarify. - Section 4.3: I am missing an explanation of the ablation study. - Table 2: The differences in performance seem very marginal. Please elaborate on the statistical significance. - Line 408: How would IQ and HSD criteria ensure global trainability? Please be more precise.
- Line 411: The theory and experiments do not confirm that the architecture is fundamentally trainable, and the experiments show that it is often only marginally better than random entanglement. Therefore, it is unjustified to claim superior performance - in particular since the experiments are also very limited. - Line 417: Please elaborate on the effects of realistic noise. You only consider two-qubit errors, which is not enough to state that it will work well on NISQ. - Line 422: How would you employ state tomography? This most generally involves constructing the full DM, so I do not understand why it would be a vital and practical avenue. Fully human-written
Data- and Hardware-Aware Entanglement Selection for Quantum Feature Maps in Hybrid Quantum Neural Networks Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a framework for optimizing entanglement structures in the data-encoding layer of Hybrid Quantum Neural Networks (HQNNs). The key idea is to formulate entanglement selection as a multi-objective optimization problem balancing data-driven trainability, hardware noise robustness, and circuit efficiency. The approach introduces a data utility term based on Hilbert-Schmidt Distance (HSD) and an intrinsic dependency metric (IQ), combined with a hardware-aware cost computed from calibrated IBM Quantum backend fidelities. A bi-level optimization process searches discrete entanglement structures, guided by short inner-loop training runs. Experiments on synthetic and real datasets show improved accuracy and robustness compared to baseline entanglement patterns. 1. Addresses a practically important problem for near-term quantum hardware by integrating data- and hardware-aware considerations. 2. The proposed multi-objective formulation is conceptually coherent and unifies several recent heuristics under one framework. 3. The experimental section includes multiple encoding strategies (Angle, ATP, and Data Re-uploading), showing generality of the application. 1. In Section 3.4, the algorithm’s convergence behavior is not discussed. There is no theoretical or empirical evidence that the alternating updates between outer and inner levels lead to consistent improvement or stable optima. The paper should add convergence diagnostics or comparison with reinforcement-based search. 2. The use of rank-based scaling across objectives in Equation (1) is insufficiently justified. It may introduce bias depending on dataset size or search granularity. A sensitivity analysis to α, β parameters is missing. 3. Figure 1 lacks numerical labels and units for the Silhouette Score, Gradient Norm, and Tr(QFI). The interpretation in the text (“robust quantum gradient norm”) remains qualitative. Quantitative evidence (e.g., rate of gradient decay) would strengthen claims. 4. In the Ablation Study in Table 2, the claim that combining IQ and HSD “synergistically” improves trainability is not strongly supported. The difference between “Full Method” and “HSD Only” is marginal. 5. The noise model is limited to 2-qubit gate errors and neglects readout and single-qubit noise, which dominate in IBM hardware. Consequently, the conclusion of “hardware robustness” is not appropriate. 6. Many mathematical derivations are restatements of standard QFI-gradient relationships without formal proofs or novel insights. The theoretical justification does not clearly support the empirical advantage claimed in Section 4.2. 7. The limitation that the computational complexity of computing pairwise HSD and IQ metrics scales as O(n²) in qubits is only mentioned but not quantified. 8. All experiments use simulators. No experiments using a real device are shown despite claiming “hardware-aware optimization.” A small-scale run on IBM Q hardware would greatly increase credibility. 9. The discussion of “future scalability” is generic. 
Concrete numerical estimates or runtime comparisons would help evaluate feasibility on larger qubit systems. 1. How sensitive is the bi-level optimization to hyperparameters α, β and the number of outer iterations? 2. How does the proposed method compare to differentiable quantum architecture search or reinforcement learning-based approaches? 3. Could the runtime complexity or search-space statistics be provided to quantify computational cost relative to brute-force search? 4. Would incorporating realistic noise calibration (including readout errors) change the conclusions in Table 3? 5. Can the framework be extended to hybrid ansätze beyond angle encoding? Fully AI-generated
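Both reviews of this paper describe candidate entanglement pairs being scored by combining rank-scaled data-utility metrics (HSD and IQ) with a hardware-noise cost under weights such as alpha and beta. The sketch below illustrates that kind of weighted, rank-based combination; the weights, sign conventions, and rank normalization are assumptions for illustration, since the exact form of Equation (1) is not reproduced in the reviews.

```python
import numpy as np

def rank_scale(values):
    """Map raw metric values to [0, 1] by rank; a stand-in for the paper's
    rank-based scaling, whose exact normalization is not given in the reviews."""
    ranks = np.argsort(np.argsort(values))
    return ranks / max(len(values) - 1, 1)

def score_pairs(hsd, iq, hw_cost, alpha=0.5, beta=0.3):
    """Combine data utility (HSD, IQ) and hardware cost for candidate
    entanglement pairs; higher is better. Weights are illustrative."""
    utility = alpha * rank_scale(hsd) + beta * rank_scale(iq)
    penalty = (1.0 - alpha - beta) * rank_scale(hw_cost)
    return utility - penalty

# Toy metrics for three candidate qubit pairs (all values are made up).
pairs = [(0, 1), (1, 2), (0, 2)]
scores = score_pairs(hsd=np.array([0.8, 0.2, 0.5]),
                     iq=np.array([0.1, 0.6, 0.4]),
                     hw_cost=np.array([0.01, 0.05, 0.09]))
print(pairs[int(np.argmax(scores))], scores)
```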