|
Perishable Online Inventory Control with Context-Aware Demand Distributions |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies online contextual inventory control where both the expected demand and the noise distribution depend on observable features, modeling context-aware heteroskedastic demand. The authors provide an online-learning algorithm combining a mean-estimation oracle (either ridge regression for linear demand or over-parameterized neural networks for nonlinear demand) with a kernel regression estimator for the noise CDF under measurement and observation errors. They also prove a regret upper bound for the algorithm, together with an almost-matching lower bound, and these bounds reduce to classical results under stronger noise assumptions.
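For orientation, the decision rule I have in mind when reading this summary (my own shorthand for this class of algorithms, not necessarily the paper's exact policy) is the plug-in newsvendor quantile: with holding cost $h$ and lost-sales cost $b$, order

$$y_t \;=\; \hat{\mu}(x_t) \;+\; \hat{F}_{x_t}^{-1}\!\Big(\frac{b}{b+h}\Big),$$

where $\hat{\mu}$ is the mean-estimation oracle and $\hat{F}_{x_t}$ the kernel-regression estimate of the context-dependent noise CDF, so the regret is driven by how quickly both estimates converge.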
1. The contextual heteroskedastic noise setting is original and practically motivated. Modeling context-dependent uncertainty is realistic.
2. The theoretical results are complete. The minimax-optimal characterization of regret is interesting. The integration of nonparametric CDF estimation and learning-while-optimizing also appears to be novel.
3. The paper is well written and easy to follow.
### Scope and Audience ###
I am slightly concerned about whether this paper has a sufficiently broad audience at ICLR. While it falls under the umbrella of online learning, the problem formulation and motivation are quite domain-specific to inventory management. I will not push strongly in this direction, but I would leave it to the senior reviewers and ACs to decide whether the paper fits the conference scope.
### Practical Relevance of the Setting ###
Although there has been some work on applying online learning to inventory decisions, it remains debatable whether online learning is the right paradigm for such problems. In practice, inventory decisions are not made frequently, typically on a weekly or even monthly basis. In the numerical experiment, Figure (c) shows that both proposed algorithms begin to outperform OSGD only after about 100 days, which is quite long in a practical context. In real-world implementations, such algorithms would likely be shut down before they exhibit a performance advantage.
### Use of "Perishable Goods" ###
The paper describes the setting as involving perishable goods, but the model essentially corresponds to a classical single-period (newsvendor-type) problem where unsold inventory is discarded at the end of each period. In the more recent literature, "perishable goods" refers to goods that remain sellable for (only) a couple of periods, which makes the state space much larger than in the classical single-period problem. I would suggest that the authors reconsider the terminology used.
### Technical Questions ###
I have some technical questions:
1. It is not clear to me how the constructed $Q_t(y)$ is made zero-mean. In Section A.1, the authors mention that this is done "through tuning the first component of $\theta_*$ and $x_t$," but $x_t$ and $\theta_*$ have already been fully defined. Could the authors provide more specific guidance on this?
2. Since everything is bounded (please correct me if I am wrong), I could always ignore the heteroskedastic structure and treat the noise as generated from a sub-Gaussian distribution (perhaps with a large variance proxy). Doing so, I believe I would still obtain a $\sqrt{T}$ bound. Am I missing something? |
Fully human-written |
|
Describe-to-Score: Text-Guided Efficient Image Complexity Assessment |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
Describe-to-Score (D2S) tackles image complexity assessment by first generating captions with BLIP, then aligning vision and text to predict a scalar complexity score from visual features alone. It introduces entropy distribution alignment (EAL) to match modality entropy statistics and a CLIP-style feature alignment (FAL) in a shared space. The authors motivate D2S with information- and generalization-theoretic arguments (higher fused entropy; reduced effective dimensionality) and implement learnable pooling for efficient inference. Experiments on IC9600 show state-of-the-art correlations and lower latency, while transfer to NR-IQA yields competitive results, notably on KADID-10K.
- The paper proposes a distinctive “describe → align → score” pipeline for image complexity: captions from a VLM guide visual features during training, while inference remains image-only—a way to inject semantics without runtime cost. It further introduces Entropy Distribution Alignment (EAL) with an energy-distance loss and buffers/EMA to stabilize cross-modal statistics, plus CLIP-style Feature Alignment (FAL)—a creative combination that is new in ICA. The information- and generalization-theoretic motivation (higher fused entropy; reduced effective dimension) gives the method conceptual clarity and elevates its potential impact on complexity assessment.
- Method details are concrete: the paper specifies the projection/connector, formulates EAL analytically, and illustrates the full training/inference workflow with clear figures and a prompt template. Empirically, D2S attains state-of-the-art correlations on IC9600 with notable latency advantages, and shows competitive transfer to NR-IQA—evidence of real-world significance beyond a single benchmark.
- The paper posits that “entropy increases → richer representation → closer to true complexity,” and relies on entropy‐distribution alignment plus feature alignment (EAL/FAL), but gives no operational, reproducible definition for estimating $p(\cdot)$ or verifying the premise that “semantic compression reduces effective dimension.” Please add a concrete mapping from features to distributions (e.g., temperature-scaled softmax or KDE), run controlled synthetic tests to validate (or falsify) the “dimension compression” hypothesis, and include ablations that hold effective dimension fixed while toggling text guidance.
- Only one captioner/encoder pairing is explored. Please provide a 2D grid over captioners (e.g., BLIP variants) × prompt designs (length/order/style), measure performance vs. compute, and analyze which caption attributes (entity count accuracy, relation coverage) correlate with complexity prediction.
See Weaknesses. |
Heavily AI-edited |
|
Describe-to-Score: Text-Guided Efficient Image Complexity Assessment |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents D2S (Describe-to-Score), a novel framework for image complexity assessment (ICA) that integrates visual and textual semantic information. The method first uses a pre-trained vision-language model (BLIP) to generate image captions and then aligns visual and textual features through two key mechanisms: Entropy Distribution Alignment (EAL) and Feature Alignment (FAL). Importantly, D2S employs multimodal information during training but only requires visual input at inference, achieving both semantic richness and computational efficiency.
Comprehensive experiments on multiple datasets (IC9600, KADID-10K, and others) show that D2S attains state-of-the-art (SOTA) performance with significantly reduced inference latency. Theoretical analyses based on information theory and Rademacher complexity further justify the proposed design.
- The combination of text-guided semantics with visual complexity assessment is original and well-motivated, bridging a key gap in prior visual-only ICA approaches.
- The paper provides clear theoretical arguments using entropy and generalization theory to explain the advantages of multimodal fusion.
- By discarding the text branch during inference, D2S achieves SOTA performance while maintaining low latency — a practical and elegant design choice.
- The experiments cover supervised, unsupervised, small-sample, cross-dataset, and cross-task settings. Results are consistently superior or competitive across diverse benchmarks.
- Ablation studies and error analyses effectively demonstrate the contribution of each module (EAL, FAL, AttnPool) and the benefits of semantic guidance.
- The manuscript is clearly written, logically organized, and provides sufficient implementation details for reproducibility.
- Demonstrating transfer to no-reference image quality assessment (NR-IQA) further enhances the general interest and robustness of the method.
- While D2S improves performance, the interpretability of what textual semantics contribute (beyond activation histograms) could be elaborated, for example with qualitative examples of captions’ influence.
- Since captions are generated automatically, the performance might depend on BLIP’s accuracy; this dependency and its robustness are not deeply analyzed.
- Although the ablation studies are extensive, additional comparisons with other text-guided approaches (e.g., CLIP-based fusion or textual embeddings without caption generation) would further strengthen the validation.
- Some theoretical derivations (e.g., in Proposition 1) are concise and could benefit from clearer notation or discussion of assumptions.
- How sensitive is D2S to the quality or type of captions generated by BLIP? Would fine-tuning BLIP or using alternative VLMs (e.g., CLIP-ViT-L or Florence-2) affect results significantly?
- Have the authors evaluated how the accuracy or reliability of BLIP captions impacts image complexity estimation? Since BLIP is not a perfect captioning model and may produce incomplete or incorrect descriptions, it would be valuable to understand whether such caption errors significantly affect the downstream complexity predictions.
- Could the entropy alignment mechanism generalize to other multimodal tasks beyond ICA (e.g., aesthetics or memorability prediction)?
- During inference, since the text branch is discarded, to what extent are the visual encoders truly semantically informed versus statistically regularized by text during training?
- Is there any noticeable trade-off between performance and training time introduced by the entropy buffers and momentum model in EAL?
- Could the authors provide qualitative examples showing how textual descriptions guide the visual branch — for example, comparing visual attention maps with and without text alignment? |
Fully AI-generated |
|
Describe-to-Score: Text-Guided Efficient Image Complexity Assessment |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes D2S, a model for Image Complexity Assessment (ICA) that integrates vision-language learning. D2S leverages BLIP to generate textual descriptions (captions) for images, thereby injecting high-level semantic information into the visual encoder during training.
The method introduces two alignment mechanisms — Feature Alignment (FAL) and Entropy Distribution Alignment (EAL) — to align the textual and visual feature spaces.
Experiments show that D2S achieves competitive results on both ICA and NR-IQA tasks while maintaining low parameter count and inference latency, demonstrating its efficiency and scalability.
1. The paper explores an interesting idea: leveraging textual information for image complexity assessment. The training-time multimodal alignment with BLIP-generated captions and visual features is technically well-motivated.
2. The theoretical derivation from information theory and generalization theory provides conceptual depth and connects intuition to formal analysis.
3. The model achieves good performance–efficiency tradeoff, with low parameter count and short inference time.
1. Caption quality and reliability are critical yet under-analyzed. Figures 17 and 18 mention that the final generated text (“the overall visual complexity is...”) can often be incorrect, but the paper does not further analyze how such errors influence D2S’s performance. A detailed study or ablation on the four BLIP prompts would make this much stronger.
2. The paper could better discuss the role of the Projection module and whether it affects the training of the core visual encoder, especially since it is not used at inference time.
3. The main experiments (Table 1) are limited to IC9600, making it difficult to confirm generalization across other ICA datasets.
4. Possible typographical errors exist in Table 4 (e.g., TOPIQ, LoDa), which need verification.
1. Since the projection module (as shown in Figure 4) is used during training but discarded during inference, could its presence unintentionally affect the optimization of the visual encoder?
2. Why were generalization experiments conducted only on datasets such as Nagle4k and Savoias, without performing full-scale experiments similar to IC9600? |
Lightly AI-edited |
|
Describe-to-Score: Text-Guided Efficient Image Complexity Assessment |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a text-guided, vision-encoder-only method for image complexity assessment (ICA).
In the training phase, the text feature from the CLIP text encoder and the visual feature from the vision encoder are aligned with the proposed entropy distribution alignment (EAL) and feature alignment (FAL).
After training, the vision encoder thus yields a vision-text-aligned (multi-modal) feature, which effectively reduces the empirical Rademacher complexity and improves generalization.
In experiments, the proposed method outperforms previous works on the IC9600 benchmark and shows fast adaptation in the early stages of training.
The proposed method performs a form of vision-text fusion for IC modeling.
- Through theoretical background (Section 2), it demonstrates that utilizing text features can achieve increased accuracy and generalization, from which the core components of the proposed method (EAL and FAL) are derived.
- Compared to existing studies that utilize high-level information such as object counts (Shen et al., 2024) and motion trends (Li et al., 2025), the proposed method can leverage more flexible high-level information through text.
Key Features of the Proposed Method
- For vision-text feature alignment, EAL employs energy distance loss (Szekely & Rizzo, 2013) while FAL adopts InfoNCE loss (van den Oord et al., 2019).
- The text encoder is kept frozen while only the vision encoder is trained, enhancing efficiency during inference by utilizing only the vision encoder.
- D2S outperforms existing methods on the IC9600 benchmark.
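For reference, the two alignment losses named above take the following standard forms (my notation; the exact instantiation in D2S may differ). The energy distance used by EAL between distributions $P$ and $Q$ is

$$\mathcal{E}(P, Q) \;=\; 2\,\mathbb{E}\|X - Y\| \;-\; \mathbb{E}\|X - X'\| \;-\; \mathbb{E}\|Y - Y'\|, \qquad X, X' \sim P,\;\; Y, Y' \sim Q,$$

and the CLIP-style InfoNCE objective adopted by FAL, with paired visual/text features $v_i$ and $t_i$ and temperature $\tau$, is

$$\mathcal{L}_{\mathrm{InfoNCE}} \;=\; -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\big(\mathrm{sim}(v_i, t_i)/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(v_i, t_j)/\tau\big)}.$$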
Experimental Advantages
- Demonstrates advantages over existing methods in small-sample training (Table 2).
- Conducts ablation study (Table 5) to show that both EAL and FAL are necessary.
- Shows applicability to downstream tasks, including NR-IQA.
Lack of empirical justification for optimal image captioning template design.
- The core idea of the proposed method is to reduce effective feature dimensions by utilizing effective semantic (text) features (Theorem 2).
- To achieve this goal, designing an optimal image captioning template (Figure 5) is considered crucial. However, the paper lacks sufficient discussion regarding the criteria used for this design and the underlying rationale. For instance, questions remain unanswered: Is the template sufficiently rich in textual description? Is the template selected through extensive experimental validation? Clear justification for the template design choices is not provided.
Limited novelty in EAL and FAL architectures.
- The forms of EAL and FAL represent common architectures for vision-text feature alignment that have been widely adopted in other tasks beyond ICA. These structures are not novel from an architectural perspective, nor can they be considered specifically tailored for ICA.
Minor experimental design concerns.
- Caption model selection: Why was BLIP used instead of more recent, superior captioning methods?
- Vision encoder choice: Why was ResNet employed instead of CLIP's vision encoder?
- Limited benchmark evaluation: Why were results compared only on IC9600 without evaluation on other benchmarks?
- Comparative analysis: Figure 7 would benefit from direct comparison with vision-only approaches for more reasonable evaluation.
Regarding Proposition 1 (Eq. 2):
- I am still not convinced of its validity even after reading A.1.
- For example, is it valid when $\alpha = 0.1$ and $\beta = 0.1$, especially for the left inequality? |
Fully human-written |
|
UniFLoW: Universal Multi-Modal Federated LoRA Fine-Tuning Framework with Analytical Aggregation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
To efficiently leverage distributed multimodal data under heterogeneous multimodal settings, the paper proposes FedA²-LoRA within an FL framework for MLLMs. The method adopts a two-stage training strategy (first fine-tuning the modality-specific encoder's LoRA, then the LLM's LoRA) and introduces Tikhonov regularization on the LoRA A matrix to approximate the B matrix, thereby improving aggregation consistency. Experimental results demonstrate the effectiveness of the proposed method.
The method takes into account the issues of modality heterogeneity and the aggregation bias between the LoRA A and B matrices, making its research motivation reasonably meaningful.
1. The approach of approximating $B$ from $U$ and $A$ (Equations 9–11) is not very reasonable. If each client needs to upload both $B$ and $A$ to the server to compute $U$, then it would be more straightforward to directly multiply $B$ and $A$ on the server and aggregate the results, which would inherently avoid the aggregation inconsistency. Moreover, uploading both $B$ and $A$ does not actually reduce the communication cost.
2. The use of Tikhonov regularization to approximate the matrix $B$ lacks theoretical justification, making the approach less convincing.
3. The results in Tables 2–4 seem to show only that the two-stage training strategy performs better than the single-stage approach that trains the modality encoder and LLM LoRA simultaneously. While this two-stage strategy could be an effective training method, it may not be sufficient to constitute a complete innovation.
4. There are no additional ablation studies to verify the effectiveness of using Tikhonov regularization for approximating the matrix $B$.
5. Figures 1 and 2 are not clearly presented—particularly Figure 2, which is overly complicated and fails to highlight the key points. In addition, the overall writing quality of the paper still needs improvement.
The main issues are that the approach for approximating $B$ appears unreasonable and does not actually reduce communication costs. The experimental results mainly highlight the effect of the two-stage training strategy; however, this strategy alone is insufficient to constitute a complete innovation, and the method for approximating $B$ also lacks theoretical justification. |
Lightly AI-edited |
|
UniFLoW: Universal Multi-Modal Federated LoRA Fine-Tuning Framework with Analytical Aggregation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses the challenge of fine-tuning Multimodal Large Language Models in federated learning settings. The authors explain that applying traditional FL to MLLMs can be very expensive and methods like LoRA in FL can suffer from "aggregation inconsistency". Therefore, they propose UniFLoW, a unified framework with three core contributions:
1- It uses a pre-trained universal encoder (ImageBind) and a base LLM (Vicuna-7B), applying LoRA to both components.
2- Clients first fine-tune their respective encoder's LoRA parameters and then fine-tune the base LLM's LoRA parameters within a single local training round.
3- Their server-side aggregation algorithm (FedA²-LoRA) computes the global A matrix by simple averaging. Then, they find the corresponding global B matrix based on A.
The authors evaluate UniFLoW on multi-modal QA (image and audio) and the FedA²-LoRA component on unimodal NLU (GLUE).
* The Federated MLLM is an interesting problem.
* The ablation study confirms some of the choices. For example, it shows that the two-stage training is effective, yielding better results than end-to-end local fine-tuning.
* The authors show the effectiveness of their method through different experiments.
* The paper compares its performance against methods like FFA-LoRA (which freezes $A$ and only uploads $B$) and FedSA-LoRA (which only uploads $A$). These methods have half the client-to-server communication cost.
* The authors do not provide any justification for some claims, for example, “The A matrices capture more general information.”
1- Are BERTScore and Token Accuracy reliable metrics for evaluating open-domain, generative QA?
2- Line 096 is not clear. How should I read this part?
3- Figure 2 is very unclear and does not help in understanding the method. The figure description is very short and does not help much.
4- It is not clear to me whether this paper is the first federated MLLM fine-tuning framework (line 119), or whether, based on the beginning of line 192, there are other FedMLLM approaches. |
Fully human-written |
|
UniFLoW: Universal Multi-Modal Federated LoRA Fine-Tuning Framework with Analytical Aggregation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes UniFLoW, a universal multi-modal federated LoRA fine-tuning framework targeting the challenges of applying Federated Learning (FL) to Multi-modal Large Language Models (MLLMs), namely client-side modality heterogeneity, high communication costs, and LoRA aggregation inconsistency. The central contribution is FedA²-LoRA, which aggregates client-side LoRA parameters by analytically reconstructing the $B$ matrix via a closed-form ridge-regression solution. The framework adopts a two-stage training strategy that fine-tunes LoRA modules in both the modality encoders (ImageBind) and the base LLM (Vicuna-7B). Experiments on multi-modal QA tasks indicate its effectiveness.
1. FedA²-LoRA introduces an efficient analytical approach to address federated LoRA aggregation inconsistency. By directly aggregating the $A$ matrices and analytically recovering $B$, it offers a communication-efficient alternative.
2. The work is significant in applying FL to MLLMs with heterogeneous modality data.
3. UniFLoW combines a general-purpose modality encoder (ImageBind) with an LLM and employs LoRA in key modules to cope with modality heterogeneity; the design rationale is clear and sensible.
1. FedA²-LoRA assumes “$A$ is more global and $B$ is more local,” hence averaging $A$ and reconstructing $B$ from the aggregated update and $A$. This rests on heuristic motivation; the paper should specify theoretical conditions under which the “global” nature of $A$ holds, and whether it remains valid under non-IID settings.
2. The closed-form solution in Equation (11) essentially solves a ridge-regression problem in $B$ (fitting $BA$ to the aggregated update with a Tikhonov penalty on $B$), yet the paper does not explicitly present the objective nor provide a proof of optimality.
3. Although FedA²-LoRA is said not to increase communication costs, the experiments do not report measured communication budgets or parameter payloads.
4. The paper references FedEx-LoRA and related work but does not provide head-to-head comparisons under matched communication budgets and client participation. Current conclusions rely mainly on comparisons with FedSA and FFA and are therefore less convincing.
5. While the paper emphasises heterogeneous client resources, it does not propose explicit heterogeneity-aware mechanisms (e.g., variable-rank or variable-layer LoRA) that reflect realistic constraints.
1. Mixed use of “Ⅱ/II stage”; please standardise.
2. Typographical issues: instances include “does not exists” in Equation (11) and “AAk” around line 300. There are also occasional logical jumps and paragraph repetitions that reduce readability.
3. Please verify the consistency of symbols and variable definitions throughout. |
Lightly AI-edited |
|
UniFLoW: Universal Multi-Modal Federated LoRA Fine-Tuning Framework with Analytical Aggregation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work proposes UniFLoW (Universal Multi-modal Federated LoRA Fine-tuning Framework with Analytical Aggregation), a unified federated framework that leverages pre-trained large models and a multi-modal architecture. Moreover, it introduces Federated Aggregating Analytical Low-Rank Adaptation (FedA2-LoRA), which directly averages $A^t$ to obtain $A^{t+1}$, and then recovers the corresponding $B^{t+1}$ matrices from the aggregated update $\Delta W^*$ using a closed-form solution of regularized least-squares regression (ridge regression).
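For concreteness, my reading of this recovery step is the standard ridge-regression fit below (notation mine; $\lambda$ denotes the Tikhonov coefficient); whether this matches the paper's Equation (11) exactly is something the authors should confirm:

$$B^{t+1} \;=\; \arg\min_{B}\; \big\|\Delta W^* - B\,A^{t+1}\big\|_F^2 + \lambda \|B\|_F^2 \;=\; \Delta W^*\,(A^{t+1})^{\top}\big(A^{t+1}(A^{t+1})^{\top} + \lambda I\big)^{-1}.$$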
- The introduced FedA2-LoRA is both novel and interesting, effectively addressing aggregation errors in FL with LoRA fine-tuning.
- The paper is well-written and clearly articulated, making it easy to understand.
- This work exaggerates its contributions. The advantage of UniFLoW in addressing architectural incompatibility when dealing with multimodal data (Problem 2) stems from the characteristics of the modality-specific encoder (ImageBind [1]), which can handle various modalities, rather than from the contributions of this work.
- The proposed UniFLoW is based on specific encoders (ImageBind [1]) and LLMs (Vicuna-7B [2]). Can different encoders and LLMs be used?
- Why are the experimental results presented in Table 5 much worse than the results presented in Table 1 in FedSA-LoRA [3]?
[1] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15180–15190, 2023.
[2] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023.
[3] Pengxin Guo, Shuang Zeng, Yanran Wang, Huijie Fan, Feifei Wang, and Liangqiong Qu. Selective aggregation for low-rank adaptation in federated learning. arXiv preprint arXiv:2410.01463, 2024.
- In the first stage, when the number of local iteration steps is less than τ, the model updates only the parameters of the corresponding encoder. When the steps exceed τ, the model updates only the parameters of the LLMs (Lines 242-245). What would be the effect of training the LLMs first and then training the encoder?
- Why is it better to train the LLM and the encoder in an II-stage approach rather than training both simultaneously? Is this related to the statement: "However, in FL, when client data exhibits certain biases, only specific types of multimodal data may be available. If
the encoder is not fine-tuned, this data can influence the fine-tuning of the base model, causing it to specialize for a specific modality and thus negatively impacting the model’s generalization." (Lines 238-241) Thus, what would be the effect of training the encoder for the first T communication rounds and then training the LLM for the following T rounds?
- What is the time complexity of solving Equation 11? In particular, how costly is the matrix inversion it involves?
- Line 373, "Please refer to the Appendix for a detailed evaluation." I did not find it in the Appendix. |
Fully human-written |
|
Think First, Then Select and Verify with Query–Key Alignment |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper is not written well enough for me to understand it clearly and appreciate the contributions. Based on what little I could make out, the paper tries to solve the MCQA task by forcing the model to do CoT before selecting the final answer choice and, in that process, aims to use QK-scores, which are part of the self-attention calculation anyway. It is unclear to me what the novel idea in this paper is and why the standard self-attention mechanism cannot handle what is being proposed here. I am clear about neither the research gap being addressed nor the novelty of the contributions. At the least, this paper needs a thorough rewrite for me to be able to understand and appreciate its contributions.
- None that I could identify.
- The writing style of the paper is quite informal and unclear. A thorough rewrite is needed for Section 3 to convey the ideas clearly.
- This paper lacks novelty in both the problem definition and the solution approach. The proposed ideas are well known in the literature.
- It is unclear why standard self-attention would not already take care of the QK-score that is being used here.
- It is unclear what reasoning is happening on the MCQA datasets and how. No illustrative example is provided.
- There are some ill-formed sentences in multiple places. For example, see lines 123-124. |
Fully human-written |
|
Think First, Then Select and Verify with Query–Key Alignment |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 0:
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes using internal QK-scores (attention scores without RoPE + before softmax) for
1. MCQA Evaluation (Section 4.3): selecting the model’s choice during Multiple Choice QA evaluation (instead of generating answer token)
2. Self-Verification (Section 4.4): verifying model’s own reasoning step
3. Candidate Generation Selection (Section 4.5): selecting the most promising generation (instead of majority voting)
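For reference, the QK-score in question is, as I understand it, the raw pre-RoPE, pre-softmax query-key inner product at a single selected head $h$, taken between a chosen "premise"/option token position $s$ and a "response"/decision token position $t$ (notation mine; $x_s$, $x_t$ are the hidden states at those positions):

$$\mathrm{QK}^{(h)}(s, t) \;=\; \big\langle W_Q^{(h)} x_t,\; W_K^{(h)} x_s \big\rangle,$$

with the candidate achieving the largest score under that head being returned.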
While previous works (both cited in the paper) have explored the use of QK-scores for MCQA without CoT [1] and Self-Verification [2], the paper hopes to extend the setting to MCQA with CoT and Candidate Generation Selection. The extension from MCQA without CoT to MCQA with CoT is barely incremental. The paper claims that they have novelty beyond [2], but it is unclear how their setting on Self-Verification is different. The only novel contribution, if any, would be the Candidate Generation Selection. However, this setting can also be understood as a variation of an MCQA (the question is "which generation is most promising" and each option is the candidate generation).
[1] Tulchinskii, Eduard, et al. "Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA." arXiv preprint arXiv:2410.02343 (2024).
[2] Tulchinskii, Eduard, et al. "Quantifying Logical Consistency in Transformers via Query-Key Alignment." arXiv preprint arXiv:2502.17017 (2025).
I find no strength in this paper. Instead, I will elaborate on its weaknesses.
# Lack of Novelty/Contribution
Using QK-values to choose the best option in Multiple Choice QA evaluation has already been proposed [1]. The only addition this paper proposes is the usage of chain-of-thought (CoT). This is barely incremental. Furthermore, the paper later states that with CoT, the baseline approach (letting the model generate the answer token) catches up to the performance of their QK-value-based method (line 193, page 4). This further downweights the significance of their results.
Using QK-values to verify the model's own answer has already been proposed [2]. It is unclear how the proposed setting in this paper is any different from [2].
Selecting the most promising generation can be understood as a Multiple Choice QA, as is currently written. The question would be "what is the most promising generation" and the list of options would be each of the candidate generations.
[1] Tulchinskii, Eduard, et al. "Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA." arXiv preprint arXiv:2410.02343 (2024).
[2] Tulchinskii, Eduard, et al. "Quantifying Logical Consistency in Transformers via Query-Key Alignment." arXiv preprint arXiv:2502.17017 (2025).
# Lack of Details of Experimental Setup
The paper is lacking details that are fundamental to understanding the experimental setup.
1. The prompt examples for Section 4.4 and Section 4.5
2. Whether the MCQA evaluation is done zero-shot (Section 4.3)
3. The decoding setting (temperature, top-p etc.)
4. How exactly is the best QK head selected from the validation set?
5. Why do you need an external LLM-as-a-judge (Qwen3-70B) (line 247)? The answers for MATH-500 should be standardized, right? Is it just for HLE? What is the prompt for this external LLM-as-a-judge?
etc. etc.
Many of the terms are not used consistently throughout the paper and add to the confusion. For example, MCQA+CoT (line 16), MCQA with CoT (line 67, 196, 216), MCQA with reasoning (line 51, 118, 124), MCQA-with-CoT-reasoning (line 128), MCQA with integrated CoT reasoning (line 181) all seem to refer to the same thing.
# Experimental Setup Copied From Previous Works?
Some parts of the experimental setup are either out of place or follow the two previous works [1, 2] too closely, without properly acknowledging the resemblance.
The term "premise" in the paragraph under **QK-score and connection between reasoning parts.** (lines 100-1345) seems out of place, potentially except for the Logic Consistency Verification (line 129). It is also unclear why "premise" would be abbreviated as "c" (line 102). Upon further inspection, it seems like this term was directly copied from [2]. Relevant part from [2]: "In our setup, each input consists of a context c (which provides the premises), a statement s (a candidate conclusion), and a candidate answer a_i" (page 2).
The paper mentions: "When it is not stated otherwise, we do not aggregate predictions or QK-scores from
multiple attention heads. Instead, in each experiment we use a separate calibration subset of the data from the same domain to select the single best performing head." (lines 137-139). This is weird since the paper never mentions aggregating scores across attention heads. Upon inspection, this seems to be modified from a similar sentence from [1]: "We do not aggregate heads predictions. Instead, we use the scores from the single best head, which is selected by the accuracy on the validation set D_val." (page 4).
The paper defines Permutational Accuracy (PA) in Equation 1 and explains "where I_i is the indicator value equals to 1 if the model answers question i correctly, while I^p_i equals to 1 iff the model answers question i correctly after answer options were permuted." (lines 174-175). This is almost copied word-for-word from [1]: "where I_i is the indicator value equals to 1 iff model’s answer on question i is correct, while I^p_i equals to 1 iff model gives correct answer on question i after its options (their texts not letters) were permuted" (page 6).
None of these similarities are properly acknowledged.
[1] Tulchinskii, Eduard, et al. "Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA." arXiv preprint arXiv:2410.02343 (2024).
[2] Tulchinskii, Eduard, et al. "Quantifying Logical Consistency in Transformers via Query-Key Alignment." arXiv preprint arXiv:2502.17017 (2025).
# Questionable Results
The baseline numbers for MCQA (just letting the model generate answer tokens) seem off. See Table 1 for example.
1. DeepSeek-R1-Distill models and Qwen3 models have numbers lower than or similar to LLaMA-3.1-8B?
2. No benefit of model scale? For example, 8B/14B/32B models do not show a smooth increase in performance?
3. Qwen3 numbers are significantly lower than expected. For example, Qwen3-32B should score around 68% on MMLU-Pro (5-shot) [3]. Even if this paper used zero-shot and a different evaluation setting (which is why the paper should have added more details on what setting was used), the number should not change this much.
[3] Yang, An, et al. "Qwen3 technical report." arXiv preprint arXiv:2505.09388 (2025).
# Internally Inconsistent Content
Now this is getting into the interesting part. Upon manual inspection of the codebase provided by the authors, I found that not only is the full codebase not provided, but some of the results also do not match the content shown in the paper.
For example, in the 16th and 17th output cells of the `HLE_MCQA_qwen3_14b.ipynb` file in the codebase, they report "Baseline: 0.336 QK: 0.354" and "Baseline: 0.36 QK: 0.35", neither of which is consistent with Table 2.
For another example, in Figure 1, the paper claims that they use "Options:\n" in the prompt for MCQA. However, in the provided `HLE_MCQA_qwen3_14b.ipynb` file in the codebase, such text is not included.
# Numerous Mistakes in the References Section
There are many mistakes in the References section that raise suspicion. These are not the kind of mistakes humans typically make.
1. The paper writes "DoLa: Decoding by contrasting layers improves **factuality and faithfulness**" (line 334-335). The correct title of this paper is "DoLa: Decoding by Contrasting Layers Improves **Factuality in Large Language Models**"
2. The paper includes "Decoding-time baseline." (line 335) at the end of an entry under Chuang et al. The paper also includes "Introduces the GSM8K benchmark." (line 338) at the end of an entry under Cobbe et al. These seem to be comments that humans generally won't include.
3. For the entry under Ren et al. (lines 367-368), the name of the journal is simply "Proceedings on"
# Nitpicky Details
1. The paper changed the margin settings. Is the left margin 1 inch? This is interesting, since the paper would still fit under the 9-page limit with the permitted margin settings.
2. No description of the green color in Table 3.
3. The 16% entry in Table 3 is not colored green. It should also be 14% instead.
1. Can you provide more details on the experimental setup? Specifically, the prompts for Section 4.4 and Section 4.5 would be useful. Also, please provide the decoding budget and the temperature / top-p setting for all experiments.
2. Can you explain how the baseline numbers were derived in Tables 1 and 2? Why are the numbers so different from what you might expect of these models?
3. If the Baseline approach with CoT performs as well as your QK-value-based approach, what is the significance?
4. In your abstract, you say "By leveraging this signal, we surpass the performance of full-scale,
preference-optimized LLMs on two fundamental reasoning tasks: multiple-choice question answering
and solution correctness validation." What are the "full-scale preference-optimized LLMs"?
5. In your abstract, you say "Our method achieves performance gains of up to ≈ 22% across various benchmarks and models." Where do you get a gain of "22%"? |
Fully human-written |
|
Think First, Then Select and Verify with Query–Key Alignment |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents a white-box method for answer selection and verification in LLMs. The main idea is to use the raw Query-Key (QK) dot-product score from the transformer's attention mechanism as an "internal signal" from the model. The authors claim that a "think-first" phase via CoT prompting strengthens internal QK alignment, allowing for more reliable selection of answers directly from model activations. This method is evaluated in several settings, such as MCQA, verification, and hypothesis selection. The MCQA setting is tested on the MMLU-Pro and HLE-1/4 datasets; the verification and hypothesis selection settings are tested on MATH-500 and HLE-1/4. HLE-1/4 is the authors' adaptation of the Humanity's Last Exam benchmark. Additionally, the MATH-500 dataset is used for testing open-ended reasoning. The authors report significant performance gains of up to ~22% for this method.
- Proposed method is novel. While prior work used QK for probing, this paper uses it as a decision rule for selection and verification.
- Some of the reported results are impressive. 1) In the MCQA setup on MMLU-Pro, the QK-score method outperforms the baseline for several models, such as Qwen-14B: 17.72% -> 44.42% or Qwen-32B: 16.6% -> 49.32%. 2) In the hypothesis selection, the QK-score method (tested with LLaMA-3.1 8B on the MATH-500 dataset) outperforms baseline by almost 22 pp.
- Figure 2 shows a high correlation of head performance between MATH-500 and HLE-1/4 for hypothesis selection. This provides evidence that QK-score method captures more generalizable signal.
- I appreciate the usage of the PA metric. Positional bias is often forgotten when using MCQA tasks.
- The entire method relies on a crucial and potentially fragile step -- selecting a single best-performing head from a calibration dataset. The paper provides no analysis of how stable this selection is. How large does the calibration set need to be? What is the performance distribution across heads? Is there only one good head, or are there many of them? Is head selection done in each setting/experiment separately? If yes, it weakens the claim of generalization of the method. The lack of these analyses is a major limitation of this work.
- The method heavily depends on the choice of "premise-representing" and "response-representing" tokens. The paper lacks an ablation study showing how sensitive the results are to this choice.
- While some results presented are strong, others contradict the paper's main narrative. For example in the MCQA with CoT setting, the QK-score method underperforms the baseline for several models (on HLE-1/4 -> DeepSeek-R1-Distill-Qwen-14B: 33.25% acc on baseline vs. 31.56% acc on QK-score; Qwen3-14B: 33.06% acc on baseline vs. 29.06% acc on QK-score). These underperformances are not discussed or explained, weakening the claim that the QK-score is a systematically better selection mechanism when CoT is used.
- The MCQA baseline in Tables 1 and 2 is not explained.
- How stable is the "golden head" selection? Is the same head selected across different tasks and datasets for a given model?
- Please provide results for a top-k head ensemble to check for robustness, rather than relying on a single head.
- Why does the QK-score method underperform the baseline after CoT is applied in several cases in Table 2? This seems to contradict the main hypothesis.
- Please provide an ablation study on the choice of "premise-representing" and "response-representing" tokens to justify using punctuation.
- How was the "MCQA Baseline" in Tables 1 and 2 calculated?
I am willing to increase my score if these concerns are resolved (especially regarding the head selection). |
Fully human-written |
|
Test-Time Accuracy-Cost Control in Neural Simulators via Recurrent-Depth |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces the Recurrent-Depth Simulator (RecurrSim), an architecture-agnostic framework designed to give neural simulators explicit, test-time control over their accuracy-cost trade-off. This capability is standard in classical numerical methods but largely absent in modern deep learning-based simulators.
The core idea is to replace a fixed-depth network with a recurrent block that is iterated a user-specified number of times (K) at inference. The model is trained by sampling K from a distribution and using truncated backpropagation-through-depth to maintain a fixed memory footprint. The authors demonstrate RecurrSim's effectiveness on a wide range of benchmarks and show that it can be applied to various backbones, consistently outperforming standard architectures and other adaptive-compute models.
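In shorthand (my notation; $E$, $\mathcal{R}_\theta$, and $D$ denote the backbone's embedding, shared recurrent block, and readout, respectively):

$$z_0 \sim p_0, \qquad z_{k+1} = \mathcal{R}_\theta\big(z_k,\, E(u_t)\big) \;\; \text{for } k = 0, \dots, K-1, \qquad \hat{u}_{t+1} = D(z_K),$$

with $K$ sampled during training, chosen by the user at test time, and gradients taken only through the last $B$ iterations.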
1. Sufficient Novelty: The paper addresses a critical and practical problem in scientific machine learning. While the core mechanism (using recurrent iterations for an accuracy-cost trade-off) has been explored in other domains like computer vision and natural language processing, this paper's novelty lies in its application and validation in the neural-simulator domain, where this feature is a standard expectation of classical solvers but has been a major missing piece for deep learning methods.
2. Methodological Simplicity and Generality: The RecurrSim framework is "plug-and-play." It requires minimal code changes to a standard architecture, much like LoRA methods. The paper strongly supports its "architecture-agnostic" claim by successfully applying it to FNO, ViT, and UPT.
3. Strong comparisons with other baselines across a wide range of PDE problems.
4. Scalability and Efficiency: The results on 3D CNS are impressive (a 0.8B param RecurrFNO outperforms a 1.6B param FNO with 13.5% less training memory).
5. High-Quality Presentation: The paper is exceptionally clear, well-structured, and easy to follow. The appendices provide strong justifications for design choices.
1. Lack of a dedicated Reproducibility section as recommended in the author guidelines (although the appendix provides enough detail).
2. Insufficient Justification for truncated backpropagation-through-depth: The authors propose truncated backpropagation-through-depth to bound memory. However, they fail to discuss or compare this to gradient checkpointing, a standard alternative. Gradient checkpointing would compute the exact full-depth gradient (trading compute for memory) instead of the approximate gradient from truncated backpropagation-through-depth. The paper provides no justification for why an approximate gradient is sufficient or preferable.
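To make the comparison concrete, here is a minimal PyTorch-style sketch of the two alternatives I have in mind (function and argument names are hypothetical, not the authors' code):

```python
import torch
from torch.utils.checkpoint import checkpoint

def rollout_truncated(block, z, cond, K, B=4):
    # Truncated backprop-through-depth: run the first K - B iterations without
    # building a graph, so only the last B applications receive gradients
    # (approximate gradient, activation memory bounded by B).
    with torch.no_grad():
        for _ in range(max(K - B, 0)):
            z = block(z, cond)
    for _ in range(min(B, K)):
        z = block(z, cond)
    return z

def rollout_checkpointed(block, z, cond, K):
    # Gradient checkpointing: store no intermediate activations and recompute
    # each iteration during the backward pass, giving the exact full-depth
    # gradient at the cost of roughly one extra forward pass.
    for _ in range(K):
        z = checkpoint(block, z, cond, use_reentrant=False)
    return z
```

Checkpointing trades extra forward compute for an exact gradient, whereas truncation trades gradient fidelity for a fixed window, so an explicit justification of the latter choice would strengthen the paper.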
1. Minor comment: there is a typo around "Optimization" at line 1215. Can you correct it?
2. You justify using truncated backpropagation-through-depth as a way to bound memory, which provides an approximate gradient. Could you elaborate on why this was chosen over gradient checkpointing, a standard alternative that computes the exact full-depth gradient by trading compute for memory?
3. The ICLR guidelines strongly encourage a dedicated 'Reproducibility Statement' paragraph to help reviewers locate the relevant details. While the appendices provide excellent, comprehensive details for reproducibility, this specific statement is missing. Would the authors be willing to add this paragraph in the final version to improve clarity for future readers? |
Lightly AI-edited |
|
Test-Time Accuracy-Cost Control in Neural Simulators via Recurrent-Depth |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces a new procedure for training any block of a neural architecture when learning solutions of PDEs. This procedure consists of incorporating recurrent calls to the block, whose number is controlled by a parameter $K$. The parameter $K$ is varied during training so that the resulting recurrent network can learn the solution for any $K$, with the intuition that the approximation will be more accurate for high $K$ than for low $K$. As a result, it is possible to tune the accuracy-cost trade-off at test time by adjusting $K$. The approach is validated on several benchmarks and for several underlying neural architectures.
- The paper is well written, easy and pleasant to follow.
- The idea is simple and original, and it represents a clever way of adding an inductive bias towards physical solvers into the resulting neural network while also controlling the cost-accuracy trade-off.
- The approach is thoroughly validated on small- to large-scale physical learning problems, and its applicability to different existing SOTA architectures is demonstrated (RecurrFNO, RecurrVIT, RecurrUPT).
- The benchmark would benefit from a more systematic evaluation of UPT, ViT and FNO, i.e., applying those three models and their Recurr variants to all three high-dimensional datasets.
- The high-dimensional benchmark lacks a study on the effect of $K$.
Are there practical limitations that prevented the authors from applying UPT, ViT and FNO and their Recurr variants to all three high-dimensional datasets? |
Fully human-written |
|
Test-Time Accuracy-Cost Control in Neural Simulators via Recurrent-Depth |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose a very simple framework for controlling, at test-time, the accuracy/speed of a neural simulator model, without requiring retraining or architecture adaptations. They show that this technique can be incorporated into a variety of architectures.
The proposed framework is easily used in multiple different architectures, and that flexibility is a strong point.
No additional custom losses or tricks are required, and the authors provide a simple explanation of the algorithm, making adoption simple.
The authors demonstrate improved performance over baselines with reduced compute/parameter counts.
The authors say that repeated applications of the recurrent block encourage it to contract toward a fixed point — what is the justification for this claim? Is there any theoretical proof that the recurrent blocks do indeed converge toward a fixed point?
I would like to see some ablations on the initial latent distribution. The authors claim that the choice “primarily affects early iterations”. The authors also show that the early iterations are the ones that lead to the largest reduction in L2 error and are the most “important” in this sense, so it would be interesting to see whether the choice of the initial latent distribution makes a big difference in terms of the overall performance of this method.
Could the authors more clearly distinguish their method from DEQ, which also repeatedly applies a function (here the recurrent block) and converges to a fixed point, with the number of function applications being controllable to achieve a desired accuracy?
How was the recurrent iteration distribution chosen? It would be interesting to see how changing this distribution changes the performance of the model.
The authors show in the top of Figure 2 that performance saturates relatively quickly with the number of recurrent steps K, with the earliest steps leading to the largest reduction in L2 error. However, for memory purposes, the authors use a fixed backpropagation window where only the last B steps are backpropagated through, with the earlier steps being treated as constant. Would it not make more sense to backpropagate through the earliest recurrent layers, given that they are the ones that lead to the largest reduction in L2 error? |
Fully human-written |
|
Test-Time Accuracy-Cost Control in Neural Simulators via Recurrent-Depth |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes an architecture-agnostic framework, the Recurrent-Depth Simulator. During the training phase, the framework randomly samples the number of recurrent iterations K from a distribution and optimizes using truncated backpropagation; during the test phase, users can explicitly specify the number of iterations K to trade off between computational cost and simulation accuracy. The authors validate this framework across multiple datasets, including Burgers, Korteweg-de Vries (KdV), Kuramoto-Sivashinsky (KS), high-dimensional Compressible Navier-Stokes (CNS), Active Matter, and ShapeNet-Car. The method is compared against other adaptive-compute models, such as FNO-DEQ, ACDM, and PDE-Refiner, as well as standard architectures such as FNO, ViT, and UPT. The paper concludes that RecurrSim offers a superior accuracy-cost trade-off curve compared to baselines. On the high-dimensional CNS task, a lower-parameter RecurrFNO variant outperforms a higher-parameter FNO baseline while also reducing training memory.
- The framework's core contribution is providing explicit test-time control, allowing users to flexibly trade computational cost for accuracy by adjusting the number of iterations. Compared to baselines, this method offers a smoother, more predictable trade-off curve, avoiding the early saturation or erratic behavior seen in alternatives.
- The framework achieves excellent parameter efficiency through weight-sharing, enabling it to match or exceed larger baseline models with significantly fewer parameters and lower training memory consumption.
- The method is a plug-and-play, architecture-agnostic framework, and its generality has been validated across diverse backbones, including FNO, ViT, and UPT
- The core mechanism of this work, a recurrent-depth block trained with truncated backpropagation, is conceptually very similar to a standard Recurrent Neural Network (RNN), making the contribution potentially incremental as it applies existing techniques to a new domain.
- The paper is lacking in visual comparisons. The authors did not provide corresponding visualizations for baselines like FNO-DEQ or ACDM, making it difficult to visually assess differences in physical fidelity. Also, none of the cases is provided with a range.
- The paper suffers from several typographical errors and unclear phrasings. In particular, the descriptions of some experimental setups (e.g., Section 4.3) are brief, which may create difficulties for readers attempting to reproduce the results.
See weaknesses |
Lightly AI-edited |
|
ClusCAM: Clustered Visual Explanations for Vision Models in Image Classification |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes ClusCAM, a novel gradient-free post-hoc explanation framework designed to enhance the faithfulness and interpretability of visual explanations in image classification models. Unlike conventional CAM-based approaches that treat internal representations as independent, ClusCAM clusters them into semantically coherent meta-representations using the K-Means++ algorithm. The importance of each cluster is quantified through logit-based differences, followed by a dropout mechanism and temperature-scaled softmax to suppress irrelevant signals and highlight the most influential regions. ClusCAM is architecture-agnostic, effectively applicable to both convolutional neural networks (CNNs) and Vision Transformers (ViTs). Extensive experiments show that ClusCAM consistently outperforms state-of-the-art baselines across multiple quantitative metrics, producing sharper and more interpretable visualizations.
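To fix ideas for the comments below, here is a rough sketch of the pipeline as I read it: cluster channel activations with K-Means++, score each cluster by the logit change its soft mask induces, and combine the clusters with a temperature-scaled softmax. The chosen layer, the number of clusters, and τ are my assumptions, and the dropout of the least important clusters is omitted for brevity.

```python
# Rough sketch of a clustered-CAM saliency map; not the authors' implementation.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def clustered_cam(model, x, feats, target, n_clusters=10, tau=0.5):
    # x: (1, 3, H, W) input image; feats: (C, h, w) activations from a chosen layer
    C, h, w = feats.shape
    labels = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10).fit_predict(
        feats.reshape(C, -1).detach().cpu().numpy())
    base_logit = model(x)[0, target]
    metas, scores = [], []
    for k in range(n_clusters):
        meta = feats[torch.as_tensor(labels) == k].mean(0, keepdim=True)      # (1, h, w)
        mask = F.interpolate(meta[None], size=x.shape[-2:], mode="bilinear")  # upsample
        mask = (mask - mask.min()) / (mask.max() - mask.min() + 1e-8)         # normalize to [0, 1]
        scores.append(model(x * mask)[0, target] - base_logit)                # logit difference
        metas.append(mask[0, 0])
    weights = torch.softmax(torch.stack(scores) / tau, dim=0)                 # tau < 1 sharpens
    saliency = (weights[:, None, None] * torch.stack(metas)).sum(0)
    return saliency.clamp(min=0)
```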
ClusCAM introduces a group-wise attribution strategy by clustering internal representations into higher-level meta-representations. This approach marks a significant improvement over conventional CAM methods, which assume that individual features contribute independently and with equal importance—often resulting in noisy or unreliable explanations. Furthermore, the paper presents a data-driven procedure for selecting key hyperparameters, thereby reducing the reliance on manual tuning and improving the overall stability and reproducibility of the method.
ClusCAM introduces additional computational overhead compared to highly efficient methods such as Grad-CAM. The initial K-Means++ clustering of internal representations increases inference time, particularly for large-scale models. Although ClusCAM can operate faster than exhaustive ablation-based approaches like Score-CAM and Ablation-CAM when applied to Vision Transformers (ViTs), its computational cost still limits scalability in real-time or resource-constrained environments. Moreover, while the paper proposes data-driven strategies for selecting key hyperparameters, these procedures are primarily heuristic and lack a strong theoretical foundation, leaving room for further formal analysis and optimization.
While the proposed method demonstrates several notable strengths, I have some concerns regarding its broader applicability and theoretical grounding. For instance, gradient-based analyses still provide valuable information for enhancing output probabilities, as they capture model sensitivity through backward propagation. It would be worthwhile to investigate whether ClusCAM could be integrated with gradient-based interpretability approaches, since many state-of-the-art explanation frameworks leverage both forward and backward reasoning. Although the reported results indicate that ClusCAM outperforms several contemporary baselines, a more comprehensive comparison with recent state-of-the-art models, such as Attention-Guided CAM (AAAI 2024), which combines forward and backward attention mechanisms to suppress noise in Vision Transformers, would further strengthen the empirical validation of this work. Additionally, for small target objects, the hyperparameters, particularly the number of clusters, may have a significant impact on the interpretability and stability of the resulting explanations, and this sensitivity warrants further analysis. For the temperature-scaled softmax, ClusCAM uses a τ value less than one. I agree that without temperature scaling, the softmax weights can become overly uniform. However, a low τ amplifies noisy or erroneous signals. Therefore, it would be beneficial to validate this behavior using test images that contain large homogeneous backgrounds with small target objects. |
Moderately AI-edited |
|
ClusCAM: Clustered Visual Explanations for Vision Models in Image Classification |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes ClusCAM, a gradient-free post-hoc explanation method that groups internal representations into meaningful clusters (meta-representations). The importance of each cluster is then measured through logit differences with dropout and temperature-scaled softmax, emphasizing the most influential groups.
By modeling group-wise interactions, ClusCAM generates sharper, more interpretable, and faithful explanations. The method is architecture-agnostic, working with both CNNs and Vision Transformers. Experimental results show that ClusCAM surpasses state-of-the-art interpretability techniques.
The paper proposes a new interpretability method, ClusCAM, and provides extensive experimental validation to demonstrate its effectiveness. The authors present numerous quantitative results that highlight the superiority of their approach.
The experimental setup is highly comprehensive, covering multiple datasets—including the ILSVRC2012 benchmark and an Alzheimer’s MRI dataset—and a wide range of model backbones, such as ResNet variants (ResNet-18/34/50/101), EfficientNet, InceptionNet, and various Vision Transformers (e.g., ViT-B, Swin-B, LeViT-192/256, CaiT-XXS-24, and PVTv2).
A diverse set of evaluation metrics is also employed to thoroughly demonstrate the robustness and effectiveness of the proposed method.
The paper presents extensive experiments across multiple backbones to demonstrate the superiority of ClusCAM, but it lacks comparisons with several important baseline methods:
1. Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: randomized input sampling for explanation of black-box models. In British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018, page 151, 2018.
2. Quan Zheng, Ziwei Wang, Jie Zhou, and Jiwen Lu. 2022. Shap-CAM: Visual Explanations for Convolutional Neural Networks Based on Shapley Value. In Computer Vision–ECCV 2022: 17th European Conference. Springer, Tel Aviv, Israel, 459–474
Beyond proposing ClusCAM and validating it extensively across multiple backbones to demonstrate its superiority, the paper offers limited insight into the underlying model behavior. Several prior studies have explored interpretability and explanation in deep models from different perspectives.
1. Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. On the adversarial robustness of visual transformers. arXiv preprint arXiv:2103.15670, 2021.
2. Yutong Bai, Jieru Mei, Alan L Yuille, and Cihang Xie. Are transformers more robust than cnns? Advances in Neural Information Processing Systems, 34, 2021.
3. Mingqi Jiang, Saeed Khorram, and Li Fuxin. Comparing the decision-making mechanisms by transformers and cnns via explanation methods. In IEEE Conf. Comput. Vis. PatternRecog. (CVPR), pages 9546–9555, 2024.
Are there any future plans to extend or apply this method to other tasks or domains? |
Moderately AI-edited |
|
ClusCAM: Clustered Visual Explanations for Vision Models in Image Classification |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper aims to enhance the explainability (XAI) of CNNs and ViTs. The idea is to cluster the representations into K groups, then upsample and normalize them to the input size and use them as soft masks of the input. The difference between the logit activation of the masked input and that of the benign one is then aggregated. This approach treats the different concepts of the input in a weighted manner; it is akin to projecting the input onto concepts and highlighting mostly the dominant ones.
* XAI is a very important topic, especially nowadays when it is crucial to better understand the inner process of networks.
* The Related Work section is detailed and informative, and the academic gap is well explained.
* The idea of combining concept fetching with explainability is intriguing.
* The approach is written as if it can be applied to both CNNs and ViTs. One of the gaps mentioned is that current approaches overlook weighting when aggregating representations. While this might hold for CNNs, it is not true for ViTs; see for example [1], [2] and [3]. Moreover, there is a large body of dedicated XAI approaches designed specifically for ViTs that the paper fails to mention.
* The method is primarily empirical and lacks an intuitive explanation of the considerations behind each step. For example: why use K-Means++ specifically (and not another clustering scheme)? What does the element-wise product between a meta-representation and the input represent? Why is formula 5 reasonable, and what if it were a division instead? Why remove the r% least important meta-representations (if they are not important for the classification, they should have little impact anyway)? Empirical progression may be acceptable on its own, but in my opinion it should come with intuitive explanations so that follow-up researchers can extend these ideas; this is lacking in the current submission.
* Most of the improvement comes from rather empirical steps, such as filtering out less relevant projections. I believe the authors need to clearly separate the conceptual novelty from the empirical contributions and focus more on the conceptual contribution. Most of the paper elaborates on the empirical steps, which carry less academic value.
Minor weaknesses:
* Confusing notation: H and W represent the input dimensions while h and w represent the representation dimensions. I would choose other symbols to make this clearer.
* The placement of visualizations is a bit awkward. For example, the algorithm is presented before it is explained in the text.
* The abbreviation ViT is more common for Vision Transformers than VT.
* r is used both for the ratio of the dropout and for the percentage of meta-representations to filter.
* What is the meaning of the colors in Fig. 2? If it is just to indicate different meta-representations, then it is not so clear.
refs:
[1] Transformer interpretability beyond attention visualization, Chefer et al. CVPR 21.
[2] Token transformation matters: Towards faithful post-hoc explanation for vision transformer. We et al. CVPR 24.
[3] From Attention to Prediction Maps: Per-Class Gradient-Free Transformer Explanations. Schaffer et al. PrePrint
* It is known that there are polysemantic neurons, i.e., neurons that activate differently for different inputs [1]. How do you think your approach will be affected by this? Specifically, I am curious about the hard clustering step, which I assume will force a hard selection of a single "meaning".
* The clustering stage is closely related to concept clustering, which is typically implemented through Sparse AutoEncoders (SAEs). Have you tried implementing it with SAEs ([2] for example, but there are many papers on this topic)? If so, what are the results? If not, I would recommend trying it, since it is always better to lean on grounded approaches.
* How is the normalization done? Which upsampling method is applied?
* Why does M_j represent importance? The element-wise multiplication is explained as a kind of soft masking, where M is the soft mask matrix. It is implicitly assumed that the magnitude of the representation reflects importance (personally I agree with this observation), but in your view, why does it hold?
* Why did you select dropout to filter out outliers? It is a stochastic operator, so in some cases it may not filter them at all. Moreover, it is not good practice for inference to be random (seed-dependent). It has been found that "registers" may be the cause of outlier heads in ViTs [3]; at least for ViTs, this could be a starting point for finding these outliers systematically instead of filtering them statistically.
In general, I think the paper is too empirical in nature, and the authors should clearly distill the pure conceptual contribution from the empirical steps. In some cases it is even preferable to have a method that performs slightly worse but is much more understandable. As it stands, the approach is so empirical that it is very hard to isolate its core contribution. Moreover, I think the authors should focus more on explaining the intuition behind each step of the approach and on making clearer what information is better captured by it.
refs:
[1] Interpreting the Second-Order Effects of Neurons in CLIP. Gandelsman et al. ICLR 25.
[2] Interpreting CLIP with Hierarchical Sparse Autoencoders. Zaigrajew et al. ICML 25.
[3] Vision Transformers Need Registers. Darcet et al. ICLR 24. |
Fully human-written |
|
DHG-Bench: A Comprehensive Benchmark for Deep Hypergraph Learning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents DHG-Bench, a large-scale benchmark for Hypergraph Neural Networks (HNNs). The authors implement 17 existing methods and evaluate them on 22 datasets across node-, hyperedge-, and graph-level classification tasks. The evaluation covers four axes: effectiveness (accuracy), efficiency (runtime and memory), robustness (to structural, feature, and supervision perturbations), and fairness ($\Delta$DP and $\Delta$EO metrics). The authors also release an open-source library to ensure reproducibility. Overall, this is an ambitious and useful engineering effort that aims to unify evaluation practices in hypergraph learning.
1. **Comprehensive benchmark coverage.** DHG-Bench implements 17 HNN models and evaluates them on 22 datasets spanning node-, hyperedge-, and graph-level tasks.
2. **Multi-dimensional evaluation.** The benchmark goes beyond accuracy to assess efficiency, robustness, and fairness, providing a more holistic view of model behavior.
3. **Reproducibility focus.** The open-source code and datasets enable other researchers to replicate results and extend the benchmark.
4. **Insightful findings.** The experiments reveal important phenomena such as scalability bottlenecks, fairness gaps, and underperformance of HNNs on heterophilic datasets.
1. **Limited conceptual novelty.** The benchmark aggregates existing models but does not introduce new methodologies or theoretical advances.
2. **Insufficient graph-based baselines.** Only two GNNs are included, and simple but strong baselines (e.g., direction-aware GNNs) are missing, making it difficult to quantify the advantage of hypergraphs.
3. **Directed hypergraphs ignored.** The benchmark only evaluates undirected hypergraphs, omitting variants that model asymmetric or causal relations, i.e. directed hypergraphs.
4. **Superficial scalability analysis.** Many methods fail with OOM errors, but mitigation strategies (e.g., mini-batching, sparse operations) are not reported.
5. **Shallow heterophily treatment.** HNNs underperform MLPs on heterophilic datasets, but causes such as oversmoothing or feature mixing are not analyzed in detail.
1. **Baseline selection.** What criteria determined the included baselines? Why were simple but strong baselines (MLPs, direction-aware GNNs) not systematically included?
2. **Directed hypergraphs and variants.** Do you plan to extend DHG-Bench to support directed hypergraphs, heterogeneous hyperedges, or temporal hypergraphs? If not, please justify.
3. **OOM diagnostics.** For models that ran out of memory, which mitigation strategies were attempted (e.g., mini-batching, sparse matrices, mixed precision)? Can you report peak memory usage?
4. **Heterophily analysis.** Can you quantify heterophily per dataset, test hypotheses (e.g., oversmoothing, feature collapse), and relate results to prior literature on heterophilic GNN behavior?
5. **Fairness metrics.** For datasets lacking explicit sensitive attributes, how were $\Delta$DP/$\Delta$EO computed (the standard definitions I am assuming are recalled below)? Were proxy attributes used, and how were they validated?
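For reference, the definitions I have in mind are $\Delta \mathrm{DP} = |P(\hat{y}=1 \mid s=0) - P(\hat{y}=1 \mid s=1)|$ and $\Delta \mathrm{EO} = |P(\hat{y}=1 \mid y=1, s=0) - P(\hat{y}=1 \mid y=1, s=1)|$, where $s$ denotes the (possibly proxy) sensitive attribute and $\hat{y}$ the prediction; please clarify if the paper computes something different. |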
Fully AI-generated |
|
DHG-Bench: A Comprehensive Benchmark for Deep Hypergraph Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose a comprehensive hypergraph neural network (HNN) benchmark, which is called DHG-Bench.
The benchmark implementation incorporates most of the representative HNNs and datasets.
Moreover, the authors evaluate the HNNs under diverse scenarios (e.g., classification, hyperedge prediction, and noisy cases).
S1. The authors provided a timely benchmark for hypergraph neural networks.
S2. The benchmark is comprehensive, in terms of both HNNs and downstream tasks.
S3. Most of the benchmark hypergraph datasets have been covered.
I do not have major criticisms of this work, but I have several suggestions.
**W1. [pip installation]** For now, I think one needs to download the GitHub repo to run the code. I think authors can improve the code to be easier to use, as in PyG (https://pytorch-geometric.readthedocs.io/en/2.4.0/install/installation.html).
**W2. [Label split]** While many HNNs use a 50/25/25 split for node classification, I personally think this ratio contains too many training nodes compared to common graph evaluation settings. Can the authors analyze the HNNs in more label-scarce (i.e., fewer training nodes) scenarios? One possible split construction is sketched below.
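A minimal sketch of the kind of label-scarce split I have in mind (Planetoid-style, e.g. 20 labeled nodes per class); the exact numbers are only an assumption, not a prescription:

```python
# Construct a label-scarce train/val/test split from node labels.
import torch

def label_scarce_split(labels: torch.Tensor, per_class: int = 20, n_val: int = 500, seed: int = 0):
    g = torch.Generator().manual_seed(seed)
    train_parts = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        train_parts.append(idx[torch.randperm(len(idx), generator=g)[:per_class]])
    train_idx = torch.cat(train_parts)
    train_set = set(train_idx.tolist())
    rest = torch.tensor([i for i in range(len(labels)) if i not in train_set])
    rest = rest[torch.randperm(len(rest), generator=g)]
    return train_idx, rest[:n_val], rest[n_val:]   # train / val / test indices

y = torch.randint(0, 5, (1000,))                   # toy labels
train_idx, val_idx, test_idx = label_scarce_split(y)
```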
See Weakness |
Fully human-written |
|
DHG-Bench: A Comprehensive Benchmark for Deep Hypergraph Learning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a large benchmark for hypergraph neural networks (HNNs), consisting of 17 HNN methods across 22 datasets. The benchmark standardizes training/evaluation spanning node-, hyperedge-, and graph-level tasks, then analyses four dimensions: effectiveness, efficiency, robustness, and fairness. Empirically, no model dominates and models resist structural noise but are sensitive to feature/label noise.
1. The work organizes experiments around RQ1–RQ4 (effectiveness, efficiency, robustness, fairness), with clear task coverage across node/edge/graph; complete leaderboards appear in the appendix with standardized splits and operators.
2. Findings such as no single HNN dominates, accuracy–efficiency trade-offs, and fairness brittleness of message passing are likely to steer algorithm development (e.g., toward robust/efficient or debiased HNNs).
1. Hyperedge prediction uses random splits with mixed heuristic negative sampling. This ignores temporal/inductive drift and can create unrealistic candidate sets. The paper can be strengthened by adding (a) temporal splits (train earlier hyperedges, predict future ones), (b) inductive splits with disjoint node/hyperedge partitions, and (c) open-world negatives drawn from realistic candidate generators (size- and motif-conditioned but respecting time).
2. All tasks are supervised. Many real deployments use self-supervised pretraining or contrastive objectives. The benchmark can be strengthened by adding self-supervised learning tracks (node/hyperedge masking, motif prediction, contrastive pretraining) and evaluate fine-tuning across tasks to reflect modern practice.
3. The paper successfully demonstrates that HNN performance varies significantly across datasets (a key insight for RQ1), but it falls short of explaining why. The analysis largely stops at reporting performance rankings.
1. Is performance sensitive to datasets with a few very large hyperedges versus many small ones?
2. How do different models handle nodes with very high vs. low degrees?
3. Given the results, what is the recommended decision-making process for a practitioner when selecting an HNN model for a new task?
4. What are the key trade-offs, supported by the benchmark's findings, that one should consider between performance, scalability, and specific data characteristics (e.g., homophily)? |
Heavily AI-edited |
|
DHG-Bench: A Comprehensive Benchmark for Deep Hypergraph Learning |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces a benchmark for hypergraph representation learning, evaluating 17 popular hypergraph networks on 22 datasets containing node-level, edge-level and hypergraph-level tasks. The evaluation is performed in terms of accuracy, efficiency, robustness and fairness. The paper also provides the code for the benchmark, representing a useful resource for evaluating future models.
- Hypergraph models suffer from inadequate evaluation, which slows down the advancement of the field. Moreover, most existing setups contain only node-level classification tasks. This paper represents a significant step forward in understanding the limitations of current models and provides a consistent, uniform framework for evaluating new models in a fair way.
- The paper points out a couple of limitations exhibited by current models, which represent good areas for future research. In particular, it shows how little progress has been made in hyperedge-level prediction, highlighting the need for more focused efforts in this area.
- Exploring additional metrics such as robustness and fairness represents an important direction for advancing the field.
- The node classification setup used in the paper appears similar to that of ED-HNN. However, the reported results are noticeably lower. Is there any major difference in the training setup?
- I agree that structural robustness is an important metric. However, for models that explicitly take connectivity into account, not observing a drop in performance at a 90% perturbation ratio seems more like a negative result than a positive one. The paper presents robust performance across different perturbation levels as a desirable outcome, but to me, this suggests either that the dataset does not require higher-order processing or that the model does not properly incorporate structural information. I suggest that once the new metric is introduced, the authors also include a discussion on how a good hypergraph model is expected to behave under such conditions.
- It would be interesting to see more statistics on the datasets used in the experiments to highlight the extent to which they are representative of higher-order interactions. For example, it is known that PubMed has around 80% of its nodes isolated. Reporting statistics such as the number of isolated nodes, the number of nodes involved only in pairwise interactions, and the number of nodes participating in higher-order interactions would be particularly useful for node-level classification tasks, where the performance on isolated nodes would depend solely on the MLP component, regardless of the model used.
Please see the Weaknesses section |
Fully human-written |
|
SPEAR: Structured Pruning for Spiking Neural Networks via Synaptic Operation Estimation and Reinforcement Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces SPEAR, a SynOps-constrained structured pruning framework for SNNs. It proposes LRE, a linear regression model that accurately estimates post-finetuning SynOps, and TAR, a reinforcement learning reward function that smoothly enforces resource constraints. Extensive experiments show that SPEAR achieves higher accuracy, better compression, and greater energy efficiency across multiple benchmarks.
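For concreteness, here is a toy sketch of the two ingredients as I understand them; the linear form of LRE follows the paper's description, but the soft-penalty reward below is only my illustration and not the exact TAR of Eqs. (4)-(6).

```python
# (i) LRE: fit  synops_post ~ W * synops_pre + b  from sampled pruned subnets (toy data).
# (ii) A target-aware reward that softly penalizes exceeding the SynOps budget.
import numpy as np

synops_pre = np.random.uniform(1e6, 1e8, size=500)
synops_post = 0.8 * synops_pre + 1e5 + np.random.normal(0, 1e4, size=500)  # synthetic example
W, b = np.polyfit(synops_pre, synops_post, deg=1)
estimate = lambda s_pre: W * s_pre + b             # post-finetuning SynOps estimator

def target_aware_reward(accuracy, est_synops, target_synops, lam=5.0):
    violation = max(0.0, est_synops / target_synops - 1.0)   # relative budget overshoot
    return accuracy - lam * violation                        # soft penalty, no hard rejection

r = target_aware_reward(accuracy=0.71, est_synops=estimate(4e7), target_synops=3.5e7)
```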
Originality
1. The paper integrates SynOps constraints into structured SNN pruning using reinforcement learning.
2. The TAR reward formulation elegantly transforms hard resource constraints into soft penalties, enabling smooth optimization.
Quality
1. The paper demonstrates technical rigor, with detailed methodology, theoretical motivation, and algorithmic clarity.
2. Extensive quantitative results across both static and neuromorphic datasets validate generalizability.
3. Ablation studies systematically isolate the effects of LRE and TAR, supporting the claimed contributions.
Clarity
The writing is clear and well-structured, figures and tables effectively illustrate problem motivation, algorithm design, and empirical results.
Significance
Addresses a key bottleneck for deploying deep SNNs on edge devices. The integration of SynOps-aware pruning can influence future SNN model compression and neuromorphic design studies.
1. Limited novelty in RL formulation: While the paper integrates reinforcement learning into SNN pruning, the use of DDPG and reward shaping is largely inspired by existing ANN pruning frameworks. The novelty lies mainly in the application to SynOps constraints rather than a fundamentally new RL algorithm.
2. Comparison limited to few baselines: The evaluation primarily compares against NetworkSlimming and SCA-based pruning, which provides a limited perspective. Including more recent and diverse SNN pruning or NAS methods would strengthen the experimental validation and make the results more convincing.
1. Generalization of LRE: Does the linear correlation between pre- and post-finetuning SynOps hold for all network architectures (e.g., SNN-Transformers or temporal attention-based models)? Could a nonlinear or adaptive estimator further improve accuracy?
2. The comparison currently focuses on NetworkSlimming and SCA-based pruning. Could the authors include more diverse baselines to strengthen the empirical analysis? |
Fully AI-generated |
|
SPEAR: Structured Pruning for Spiking Neural Networks via Synaptic Operation Estimation and Reinforcement Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work proposes a structured pruning algorithm for SNNs. The method is based on the principle of neural architecture search, incorporating the number of synaptic operations (#SynOps) as one of the key constraints. To enable faster estimation of #SynOps, the authors introduce a linear regression-based estimation method. In addition, a reward function is designed for the reinforcement learning process to encourage architectures with an appropriate #SynOps.
+ The paper is clearly written and well structured, making it easy to follow the methodology and contributions.
+ The work introduces SynOps as a constraint within the NAS process, which is straightforward.
- NAS encompasses various search paradigms, including RL-based, gradient-based, and evolutionary approaches. Why do the authors specifically adopt an RL-based framework for deriving the pruning strategy? A more detailed justification and comparison with alternative NAS strategies would strengthen the methodological motivation.
- Although the paper empirically shows a linear correlation between pre-fine-tuning and post-fine-tuning SynOps, the proposed estimation technique lacks theoretical grounding and analysis of generalizability. Could the authors further clarify the rationale behind the linear regression model? Additionally, more discussion is needed on why the reward formulation in Equations (4), (5), and (6) is theoretically sound.
- From a hardware perspective, the energy evaluation remains simplified. Despite frequent references to “hardware,” the work does not validate results on an actual neuromorphic platform such as TrueNorth [1] or Loihi [2]. Incorporating real hardware experiments would significantly improve the paper’s credibility regarding energy efficiency claims. More importantly, measuring SNN execution on physical devices may reveal that SynOps is not always the dominant contributor to energy consumption, which would challenge a core assumption of the proposed method.
[1]Akopyan, Filipp, et al. "Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip." IEEE transactions on computer-aided design of integrated circuits and systems 34.10 (2015): 1537-1557.
[2]Davies, Mike, et al. "Loihi: A neuromorphic manycore processor with on-chip learning." Ieee Micro 38.1 (2018): 82-99.
See the weakness. |
Lightly AI-edited |
|
SPEAR: Structured Pruning for Spiking Neural Networks via Synaptic Operation Estimation and Reinforcement Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents SPEAR, a reinforcement learning-based structured pruning framework for SNNs, which directly uses SynOps as a constraint. It demonstrates that a simple linear regression (LRE) can be used to predict post-finetuning SynOps. Innovatively, it combines LRE with reinforcement learning, enforcing resource constraints implicitly during the search process through a Target-Aware Reward (TAR). Experimental results across multiple datasets demonstrate that SPEAR achieves superior compression rates compared to existing methods.
1.**Well-motivated problem**. The observation that SynOps change significantly and irregularly after finetuning is important for SNN deployment and distinguishes this work from ANN pruning methods.
2.**Practical and effective approach**. The LRE method, despite its simplicity, achieves strong performance with low computational overhead. The TAR design cleverly converts hard constraints to soft penalties, enabling smoother optimization.
1. **Limited theoretical foundation of LRE**: While a linear relationship is empirically observed, the paper does not investigate the underlying mechanism or the conditions under which it applies. Furthermore, using 500 samples to learn two parameters (W and b) appears inefficient and requires further clarification.
2. **Limited experimental comparison**: Although SPEAR primarily addresses search-based structured pruning, a more comprehensive comparison with design-based methods, as well as other search-based structured and unstructured pruning methods, would be valuable.
1. A comprehensive comparison with related works should be included, covering search-based methods that use SynOps for unstructured pruning [1] as well as SCA-based design approaches [2], with a focus on SynOps, parameter count, and performance. Additionally, comparisons with SNN architecture search works, as mentioned in the related work section, should be incorporated for a more thorough evaluation.
2.Provide a theoretical analysis or intuition for why the relationship between pre- and post-finetuning SynOps is approximately linear.
3.How often must the LRE surrogate be retrained (e.g., for new target SynOps, new datasets, or new architectures)? What is the marginal cost compared to overall SPEAR training time?
[1] "Towards energy efficient spiking neural networks: An unstructured pruning framework." ICLR. 2024.
[2] "QP-SNN: Quantized and pruned spiking neural networks." ICLR. 2025. |
Lightly AI-edited |
|
SPEAR: Structured Pruning for Spiking Neural Networks via Synaptic Operation Estimation and Reinforcement Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes SPEAR, a structured pruning framework for SNNs that leverages reinforcement learning and introduces Synaptic Operation (SynOps) estimation to directly control energy efficiency during compression. SPEAR directly targets SynOps, a biologically meaningful and hardware-relevant measure of energy use in SNNs. Linear Regression for SynOps Estimation predicts post-finetuning SynOps from pre-finetuning SynOps via a linear regression model, which avoids costly retraining. The Target-Aware Reward ensures the reinforcement learning agent learns compression policies that meet resource targets without violating constraints.
1. SPEAR’s focus on SynOps aligns closely with energy efficiency on neuromorphic hardware, which is relevant to SNN training.
2. The combination of linear regression estimation and RL-based optimization is elegant and practical, and the paper demonstrates that linear estimation is sufficient and more stable than nonlinear approaches.
3. This paper is well-structured and clearly written, with detailed explanations of each component and its motivation.
1. There is a green rectangle on pages 6 and 7. This should be corrected in the final version.
2. The linear relationship assumption between pre- and post-finetuning SynOps is empirically validated but lacks formal theoretical grounding.
3. Though the overhead is not excessive, cost-benefit trade-offs for extremely large SNNs are unexplored.
1. How well does linear regression for SynOps estimation generalize when the pruning ratios or datasets differ significantly from those used to train the regression model?
2. Would a nonlinear SynOps estimator (e.g., shallow MLP) improve estimation accuracy on larger datasets or more complex networks?
3. How stable is the reinforcement learning search: does it require many episodes to converge, and how sensitive is it to reward scaling?
4. Are any other timestep settings used in this study? Although timestep = 4 is already a small setting, would there be any chance of using even fewer timesteps? |
Fully human-written |
|
Energy-Guided Prompt Optimization for Controllable Cross-Architectural Diffusion Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper aims to improve the semantic constraint enforcement of text-to-image diffusion models.
The paper proposes a training-free energy-based optimization technique that corrects the latent at each timestep in the generation process with gradient-based optimization of an energy function, in order to suppress the generation of negative concepts.
1. Enhancing the controllability of text-to-image models is an important research problem.
2. The results of the proposed energy-based optimization method are promising.
1. The paper is claimed to contribute two major components, including 1) Jacobian-based diagnostic and 2) energy-guided optimization, in the abstract and introduction parts, where they are posed as almost equally important. However, the Jacobian-based diagnostic is only briefly introduced in Section 3.2, and empirical evidence on its utility is not provided in the main paper, but rather in the appendix. In Appendix C, the paper only uses a comparison between two guidance methods (DNG and EGP in Table 6) to show the correlation between Jacobian changes (measured by $L_F$ and $R_\sigma$) and constraint adherence (measured by Neg-ACC). More comprehensive experimental evidence on a wider range of model architectures is missing, which should be added to make the conclusion more convincing. In addition, a preliminary experiment on how the Jacobian-based analysis can help architectural selection and design for better constraint adherence should be provided, from which findings would be much more interesting and inspiring to the community. Such an experiment, unfortunately, remains missing in the current paper. The above issues would raise concerns about the significance of the Jacobian-based diagnostic and the amount of contribution involved in the paper.
2. The design choice of $\Delta J_t$ is not well justified and explained. What motivates the Jacobian-based definition of $\Delta J_t$? Why can $\Delta J_t$ quantify the architectural differences between two models?
3. Section 3 is difficult to follow, mainly because some concepts suddenly appear without any context, some concepts are not used after being introduced, and connections between some concepts are not clear. For example, what are unconstrained ($\mathcal{E}_u$) and constrained ($\mathcal{E}_c$) energy functionals in Section 3.4? What are their relationships to the energy functional ($\mathcal{E}$) in Section 3.3? Where is the objective $\mathcal{L}$ in Eq. (13) used? What is the relationship of $\mathcal{L}$ to the update equation in Eq. (14)?
4. Experimental results are incomplete. The visual comparison of images generated by different models is lacking in Section 4.2.2, and should be added. The results on Representation Balance Index (RBI) are missing in the paper.
1. In the experiments, which diffusion model is the proposed EGP method based on?
2. What is the detailed setup for the experiment in Section 4.2.3? What are inputs to different models? Why is the EGP compared with only SD-2.1 and SD-XL? |
Fully human-written |
|
Energy-Guided Prompt Optimization for Controllable Cross-Architectural Diffusion Models |
Soundness: 3: good
Presentation: 1: poor
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces Energy‑Guided Prompt Optimization (EGP), a method that enforces negative prompts during image synthesis to suppress unwanted content. At each DDIM reverse‑diffusion step the latent trajectory is adjusted so that it satisfies the negation constraints while preserving the Markov property of the diffusion process . Experimental results show that EGP attains the highest negative‑prompt accuracy, while achieving comparable or lower FID values.
- The method enforces negative prompts through an energy formulation, so it improves adherence without any model retraining.
- Experiments show that EGP attains the highest Neg‑ACC score while maintaining (or slightly improving) CLIPScore and FID, indicating superior handling of negations without sacrificing visual fidelity.
- It is not described what the qualitative assessment (Sec. 4.2.2) is based on
- Although the introduction claims a new latent‑space attribution metric, the metric is never evaluated or used in the experiments.
- Overall, the writing of the paper could be polished further:
- The related‑work section reads like a series of disconnected paragraphs, which is why it is a bit difficult to follow the paper.
- The experimental section seems to be a collection of independent paragraphs. More transitions and further explanations/analyses would be nice and help the reader to follow and understand the paper better.
- In 4.4, the table is placed in the middle of the text, disrupting the reading flow.
Minor things:
- Oftentimes, a sentence ends with a full stop immediately after an equation, but the next sentence then continues with "Where ...". Either the full stop should be changed to a comma, or the second sentence should be rephrased.
Q1: How is the runtime affected? Is the inference slower when applying EGP?
Q2: Since the authors did not include an ethics statement: Could this technique also be used for guiding towards malicious concepts?
Q3: What is the quality assessment in Section 4.2.2 based on?
Q4: Where is the latent-space attribution metric used? |
Fully human-written |
|
Energy-Guided Prompt Optimization for Controllable Cross-Architectural Diffusion Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces EGP, a training-free framework for improving semantic constraint enforcement in DMs, particularly for negation and exclusion prompts. The approach combines two main components: (1) a Jacobian-based diagnostic tool for analyzing how different model architectures respond to constraints, and (2) an energy-guided optimization method that reshapes the latent space during sampling to avoid unwanted concepts. The authors evaluate EGP across multiple diffusion model architectures (SD-2.1, SD-3.5, SD-XL, Flux) and demonstrate improvements in constraint adherence (Neg-ACC) while maintaining image quality. The method operates by adding gradient-based correction steps during DDIM sampling, using CLIP embeddings to measure similarity between generated images and negative prompts.
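For reference in the questions below, this is a minimal sketch of how I understand the correction step: an energy that penalizes CLIP similarity to the negative prompt above a threshold, minimized by a few gradient steps on the latent before each DDIM update. The encoder/decoder stubs, the step size, and the inner-loop count are placeholders of my own, not the paper's code or any library's API.

```python
# Sketch of an energy-guided latent correction inside a DDIM loop (stand-in modules).
import torch
import torch.nn.functional as F

decode_to_image = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(4 * 8 * 8, 512))
embed_image = torch.nn.Linear(512, 256)               # stand-in for a CLIP image encoder
neg_text_emb = F.normalize(torch.randn(256), dim=0)   # stand-in for the negative-prompt embedding

def energy(z, tau=0.25):
    img_emb = F.normalize(embed_image(decode_to_image(z)), dim=-1)
    sim = (img_emb * neg_text_emb).sum(-1)             # cosine similarity to the negative concept
    return F.relu(sim - tau).sum()                     # only penalize similarity above the threshold

def corrected_ddim_step(z, ddim_update, eta=0.1, n_inner=1):
    for _ in range(n_inner):                           # a few gradient steps on the energy
        z = z.detach().requires_grad_(True)
        (grad,) = torch.autograd.grad(energy(z), z)
        z = (z - eta * grad).detach()
    return ddim_update(z)                              # then take the usual DDIM update

z_t = torch.randn(1, 4, 8, 8)
z_prev = corrected_ddim_step(z_t, ddim_update=lambda z: z)   # identity stands in for the sampler
```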
- The EGP algorithm (Algorithm 1) is well-documented with step-by-step details, making the method reproducible. The mathematical formulations are generally precise and the energy function design is well-motivated.
- The paper includes thorough experiments across multiple diffusion architectures (SD-2.1, SD-3.5, SD-XL, Flux), multiple datasets (COCO, MedicalX, ComicArt, AbstractPrompt), and multiple evaluation metrics (FID, LPIPS, CLIPScore, Neg-ACC). Human evaluations add credibility to the quantitative results.
- The method's ability to work across different model architectures without requiring retraining is a practical advantage. This makes it broadly applicable to existing pretrained models.
- The connection to energy-based models and constrained sampling provides theoretical grounding for the approach.
- What is the specific purpose of the Jacobian diagnostic in your framework? The authors mention it's proposed to analyze different existing models and adapting EGP without training (lines 071-075), but I don't see experiments demonstrating this (nothing provided in the main paper, only providing some contexts in appendix C). How does the Jacobian analysis inform the EGP method? Can you provide clear examples of how the diagnostic guides model selection or parameter tuning?
- The main technical contribution, using ReLU thresholding with energy-based guidance, is a relatively incremental modification over existing negative prompting techniques (e.g., SLD [R1]). While theoretically motivated, the practical novelty feels limited.
- The paper emphasizes being "cross-architectural" but all evaluated models are text-to-image DMs with similar underlying architectures (latent diffusion with U-Net variants). It's unclear what architectural diversity is actually being addressed. What specific architectural diversity are you addressing? Would the method work on fundamentally different architectures (e.g., autoregressive models, GANs, or diffusion models with different parameterizations)?
- The paper doesn't clearly explain what base model EGP uses or how it builds upon existing models. Does it use a specific SD variant as foundation? Does it combine different pretrained components? The strong performance of EGP could be partially due to the underlying base model capabilities rather than purely the EGP contribution.
[R1] Schramowski, Patrick, et al. "Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models." CVPR. 2023.
- The method adds ~40% latency and ~46% more FLOPs compared to baseline (Table 7). While this overhead is mentioned, the paper doesn't adequately discuss whether this cost is justified or explore more efficient alternatives in the main experiments.
- Since EGP is training-free, why not position it as a complementary module that can enhance existing models rather than as a standalone method? This would make it clearer that EGP adds value on top of any base architecture. Your current framing makes it seem like a separate model competing with SD variants.
- Have the authors conducted ablations showing that the Jacobian diagnostic actually improves EGP's performance? If the diagnostic is a key contribution, it would be helpful to see experiments demonstrating its utility.
- How sensitive is the method to the choice of threshold τ = 0.25? How should practitioners set this threshold for new concepts or domains? Is there a principled way to select it?
- Some guidance-based negative prompting methods for DMs [R1, R2, R3] are specifically designed for safety applications, particularly NSFW content detection and mitigation. Have you considered or evaluated EGP for this important use case? Given that your method is training-free and focuses on constraint enforcement, it seems like a natural and practically important application. If you have explored this, what are the results? If not, could you discuss whether your approach would be suitable for safety-critical constraint enforcement, and what modifications (if any) might be needed?
[R1] Schramowski, Patrick, et al. "Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models." CVPR. 2023. \
[R2] Yoon, Jaehong, et al. "Safree: Training-free and Adaptive Guard for Safe Text-to-Image and Video Generation." ICLR. 2025. \
[R3] Liu, Zhili, et al. "Implicit Concept Removal of Diffusion Models." ECCV. 2024. |
Fully AI-generated |
|
Energy-Guided Prompt Optimization for Controllable Cross-Architectural Diffusion Models |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 0:
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes an Energy-Guided Prompt Optimization (EGP) framework aimed at improving the controllability and cross-architecture consistency of text-to-image diffusion models. Specifically, it introduces an energy-based correction mechanism applied during the sampling process to enhance the effectiveness of negative prompts and align outputs across different diffusion architectures (e.g., Stable Diffusion 2.1, XL, and Flux). The authors further propose a Jacobian-based diagnostic tool to analyze model sensitivity to textual conditioning, arguing that such differences explain inconsistencies between architectures. Experimental results are reported on several datasets, suggesting that EGP improves negative prompt adherence and semantic alignment without retraining. The paper positions its contributions as a unified, training-free approach to controllable text-to-image generation and cross-model consistency analysis.
The paper addresses an underexplored aspect of diffusion-based generative models—cross-architecture consistency and negative prompt reliability—which reflects an effort to formalize controllability issues that are often treated heuristically. The introduction of an energy-guided optimization framework at inference time represents a creative adaptation of energy-based modeling concepts to prompt control without additional training. The Jacobian-based diagnostic for analyzing model sensitivity provides an interpretable, theoretically motivated lens for comparing architectures, which could inspire future research on representational alignment in generative models. The manuscript is generally clear in its high-level organization and conveys the intuition behind energy-guided correction in a way accessible to readers familiar with diffusion processes.
The paper presents serious issues in clarity, structure, and overall scholarly presentation, which significantly detract from its readability and impact. The writing often lacks grammatical and logical precision—for instance, the very first sentence in the introduction is syntactically broken and semantically unclear, making it difficult to understand the problem being introduced. Many passages are written in a way that feels disconnected or overly verbose, with inconsistent use of terminology and weak linkage between motivation, method, and results. The manuscript also raises questions about its adherence to standard conference formatting conventions—for example, the inclusion of a “Keywords” section seems unusual for this venue. From a technical standpoint, while the idea of energy-guided prompt optimization is conceptually interesting, the theoretical grounding remains vague, and the motivation for enforcing cross-architecture consistency is not convincingly justified.
1. The introduction lacks grammatical precision and conceptual clarity, particularly in the opening sentence, which makes it difficult to grasp the main motivation and scope of the work. Clarification of the introduction framing and problem definition would help establish a clearer context.
2. The purpose and value of pursuing cross-architecture consistency remain uncertain. Further explanation is needed to clarify whether such consistency is a scientifically meaningful objective or an engineering consideration, and how it contributes to broader progress in diffusion modeling.
3. The distinction between the proposed energy-guided prompt optimization and prior inference-time control methods is not clearly demonstrated. A more explicit comparison—both conceptually and experimentally—would help assess the novelty of this approach.
4. Certain formatting choices, such as the inclusion of a “Keywords” section and unconventional section organization, raise questions about adherence to the conference submission format. Clarification of whether these reflect intentional stylistic decisions or formatting oversights would be helpful. |
Fully AI-generated |
|
FastFace: Training-Free Identity Preservation Tuning in Distilled Diffusion via Guidance and Attention |
Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents a set of training-free tricks to augment the distilled ID-consistent person generation model to achieve better ID preservation and preserve the instruction following ability.
For this objective, the paper presents two methods to augment adapter-based person generation models:
1. Decomposed Classifier-free guidance. The authors decompose classifier-free guidance into an ID-preservation part and a prompt-following part. By applying different weights and schedules to these two parts, existing person generation models can achieve a better trade-off between ID conditions and text conditions.
2. Attention Manipulation. The authors locate the tokens that correspond to the face region in the generated images and design a set of denoising operations to make the attention maps focus more on the face regions.
Experimental results on their curated benchmark and a subset of OmniContext dataset indicate their methods can effectively improve the identity preservation across two distilled ID-preserving person generation methods.
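For clarity, here is a minimal sketch of the decoupled guidance as I understand it; the combination rule and the linear schedule below are standard multi-condition CFG forms that I am assuming, not necessarily the paper's exact formulas.

```python
# Separate guidance scales (and schedules) for the text and ID conditions.
import torch

def decoupled_cfg(eps_uncond, eps_text, eps_id, w_text, w_id):
    # eps_*: noise predictions with no condition, text only, and text + ID adapter
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)   # prompt-following term
            + w_id * (eps_id - eps_text))        # identity-preservation term

def id_scale_schedule(step, n_steps, w_max, w_min=1.0):
    # assumed schedule: stronger ID guidance early, weaker near the end
    return w_min + (w_max - w_min) * (1 - step / max(n_steps - 1, 1))

eps_u, eps_t, eps_i = (torch.randn(1, 4, 64, 64) for _ in range(3))
eps = decoupled_cfg(eps_u, eps_t, eps_i, w_text=5.0, w_id=id_scale_schedule(1, 4, w_max=2.0))
```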
1. The idea of decomposing classifier-free guidance into ID-preservation and prompt-following parts sounds reasonable to me. Independent control of both parts can achieve a better trade-off.
2. The proposed method achieves high efficiency after applying a set of test-time algorithms.
3. The proposed method can generalize to many adapter-based person generative models.
1. The writing of this paper requires improvement.
2. The evaluation looks weak. The proposed DiverseFaces benchmark contains only synthetic data, which makes the evaluation biased. The single_character subset of OmniContext has only 50 samples for evaluation, making the results analysis less convincing.
3. The idea of attention manipulation is not novel. Previous works have explored visualization of attention maps, as in PuLID [1], and manual updates to attention maps during inference, as in PersonaHOI [2]. How the proposed attention manipulation performs compared to existing methods is unclear. The motivation for designing such a complicated attention-map denoising pipeline is also unclear to me.
4. The visualization has low quality, making it difficult to check quality of the generated images.
[1] PuLID: Pure and Lightning ID Customization via Contrastive Alignment. NeurIPS 2024.
[2] PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation. CVPR 2025.
How do you find the optimal values for the many new hyperparameters across the different operations? |
Fully human-written |
|
FastFace: Training-Free Identity Preservation Tuning in Distilled Diffusion via Guidance and Attention |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents a method for facial ID preservation in distilled image generation using SDXL. This is a useful yet underrepresented topic in research. The paper proposes well-motivated though rather minimal insights into how to implement this properly for the cases of realistic and stylistic generations. Most justifications are perfectly reasonable and justified. In terms of evaluation, the paper compares to rather old models - restricted to SDXL distillations - with very few qualitative results. This is a marginally good paper with some insights that might benefit the community, even though it is a bit dated and its contribution is incremental.
- Simple to implement, understand, and justify insights.
- Results seem to improve upon the baselines.
- Identity preservation is still far from perfect.
- Comparisons and implementation are rather dated.
- How would this approach affect newer models, employing flow matching and hybrid attention? (such as Flux, SD3, QWEN etc.)
- Specifically, Qwen Image Edit supports multiple images, including identity, and has a small iteration count LoRA distillation, so that is an interesting comparison.
- Significantly many more qualitative examples should be added.
- I didn't understand the scale * power transform. Please explain in more detail.
- How would this approach compare against contemporary personalization method, such as anyStory, DreamO, or others?
- Why are the boldfacing and underlining in the tables inconsistent? It seems a number is boldfaced or underlined only if it is in your method's rows. That is misleading. |
Fully human-written |
|
FastFace: Training-Free Identity Preservation Tuning in Distilled Diffusion via Guidance and Attention |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes FastFace, a training-free framework for identity-preserving tuning in distilled diffusion models. The method consists of two components: the Decoupled Classifier-Free Guidance (DCG) module, designed for stylistic generation, and the Attention Manipulation (AM) module, designed for realistic generation. Both modules operate solely during inference and require no additional training. They can be directly applied to various distilled diffusion models and ID adapters, enabling fast, few-step identity-preserving generation.
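For readers less familiar with decoupled guidance, a minimal sketch of the general idea follows: separate guidance scales for the prompt-following and ID-preservation directions. The function name, conditioning arguments, and default scales are placeholders, and this is not necessarily FastFace's exact DCG formulation or step schedule.

```python
def decoupled_cfg_step(model, x_t, t, text_emb, id_emb, w_prompt=3.0, w_id=1.5):
    """Generic decoupled guidance: independent scales for the prompt-following
    and ID-preservation directions. Illustrative only; `model`, the conditioning
    arguments, and the scales are hypothetical placeholders."""
    eps_uncond = model(x_t, t, text=None, ident=None)       # unconditional
    eps_text = model(x_t, t, text=text_emb, ident=None)     # prompt only
    eps_full = model(x_t, t, text=text_emb, ident=id_emb)   # prompt + identity
    return (eps_uncond
            + w_prompt * (eps_text - eps_uncond)   # prompt-following term
            + w_id * (eps_full - eps_text))        # ID-preservation term
```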
1. Both mechanisms work at inference time and can be applied across multiple distilled diffusion models and ID adapters without retraining.
2. Maintaining identity consistency in distilled diffusion models is a practical and novel problem. The authors divide the task into stylistic and realistic goals and design DCG and AM modules respectively, which is clear and targeted.
1. The method proposed in this paper lacks theoretical rigor, and many of its design choices are derived purely from empirical experience rather than from complete theoretical derivation. For instance, in Appendix D, the authors acknowledge that their derivation relies on an independence assumption that “generally is not true,” which leaves the method without a solid theoretical foundation. Moreover, DCG is applied only at the intermediate steps, yet no theoretical explanation is provided for this choice; the authors merely observe through visualization that applying DCG at the first or last step leads to artifacts and therefore adopt this empirical rule. Although these design decisions may be effective in practice, they lack the necessary analytical justification and theoretical reasoning to support their validity.
2. The paper lacks a systematic ablation study of the DCG and AM modules. The authors clearly distinguish their respective roles: DCG is designed for stylistic generation and prompt-following, while AM focuses on realistic generation and identity preservation. However, in the experiments, both modules are directly combined and evaluated as a unified framework, making it difficult to quantify their individual contributions. Although the appendix provides partial ablation results for AM and DCG, it lacks comparative experiments conducted under identical conditions, for example, evaluating the complete performance metrics when using only DCG or only AM. Consequently, the current experimental setup is insufficient to reveal the specific contribution and mutual relationship of the two modules in the overall performance improvement.
3. The evaluation dataset used in this paper, DiverseBench, is overly idealized, as the experiments are conducted primarily on synthetic identities and synthetic prompts. All samples are generated by diffusion models and therefore lack the complexity found in real-world human faces, such as variations in lighting, pose, occlusion, and facial expression. Since all experiments are performed in such a controlled environment, the model’s generalization and robustness in real-world settings remain unverified. Consequently, the reported results mainly reflect performance under idealized conditions rather than stability in practical applications.
4. Although the authors describe the fundamental principles and respective roles of DCG and AM, they do not explain why combining the two modules leads to the best overall performance. According to the results presented in the appendix, using DCG alone increases the CLIP score while decreasing ID similarity, whereas using AM alone improves ID similarity but lowers the CLIP score. This suggests that the two modules respectively favor stylistic coherence and identity fidelity, and that using either one individually merely achieves a trade-off between different performance metrics. However, the paper does not provide a theoretical or mechanistic explanation of why their combination can simultaneously improve both metrics, nor does it discuss whether conflicts may arise between the two modules during inference. Therefore, the joint use of DCG and AM appears to be an empirical engineering design rather than a theoretically grounded method.
5. Equations (16) and (17) appear to be incorrect. Both expressions contain mismatched parentheses, and Equation (16) additionally includes an extra “ϵ”.
1. Can the proposed method be applied to preserve non-facial human characteristics? AM enhances facial feature retention, but human characteristics are not limited to facial features—they also include factors such as body shape. Does AM have any effect on preserving these other features?
2. Can validation be conducted on real faces or public face datasets to confirm that the method can maintain stable ID similarity under complex, real-world conditions?
3. Please explain the interaction between DCG and AM. In some cases, when DCG and AM are used together, could their respective gains counteract each other? Are there situations where using only DCG or only AM yields the best performance? |
Lightly AI-edited |
|
FastFace: Training-Free Identity Preservation Tuning in Distilled Diffusion via Guidance and Attention |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
To address the incompatibility issue between existing ID adapters and distilled diffusion models—where either additional training is required or the performance is poor—this work proposes FastFace, a universal training-free framework. This framework achieves adaptation through two core mechanisms: 1) Decoupled Classifier-Free Guidance (DCG), which decomposes and optimizes guidance terms to support few-step stylistic generation (e.g., 4-step sampling); 2) Attention Manipulation (AM), which enhances attention on facial regions in decoupled blocks via scale-power transformation (AM1) and scheduled-softmask transformation (AM2) to improve identity similarity. Experiments have verified the effectiveness of this method.
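To make the AM1 discussion concrete, the sketch below shows one plausible shape of a scale-power boost applied to the attention paid to ID-adapter tokens. The paper's exact operator, the block selection, and the schedule are not reproduced here; the mask layout, shapes, and default values are assumptions.

```python
import torch

def scale_power_attention(attn_logits, id_token_mask, scale=1.5, power=2.0):
    """One plausible reading of a scale-power transform on attention weights:
    amplify the probability mass on ID-adapter tokens, then renormalize.
    attn_logits: (batch, heads, queries, keys); id_token_mask: (keys,) bool."""
    probs = attn_logits.softmax(dim=-1)
    boosted = probs.clone()
    boosted[..., id_token_mask] = (scale * probs[..., id_token_mask]).pow(power)
    return boosted / boosted.sum(dim=-1, keepdim=True)

# Toy usage: 77 text tokens plus 4 hypothetical ID tokens.
logits = torch.randn(1, 8, 16, 81)
mask = torch.zeros(81, dtype=torch.bool)
mask[-4:] = True
out = scale_power_attention(logits, mask)
```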
1. This work addresses the incompatibility between existing ID adapters (e.g., FaceID, PuLID) and distilled diffusion models (e.g., SDXL-Hyper, Lightning) without requiring any additional training.
2. FastFace demonstrates robust compatibility with a wide range of distilled diffusion models and mainstream ID-preserving methods.
3. This paper is well-written and easy to follow.
1. Lack of validation on real human face datasets. All identity images used in the DiverseFaces benchmark—FastFace’s primary evaluation dataset—are synthetically generated rather than real human faces.
2. Limited compatibility with non-SDXL-based distilled models. FastFace’s experiments are exclusively conducted on SDXL-derived distilled models. There is no validation of its performance on distilled models from other base architectures.
3. There is no testing on extreme scenarios, such as low-quality reference images (blurred, occluded, or low-light), which can validate the robustness of the method.
4. No exploration of multi-reference image scenarios. FastFace’s entire design and evaluation focus on single-reference-image ID preservation. There is no experimental exploration of multi-reference setups (e.g., adapting to an identity using 2-5 reference images), which are common in practical personalized generation.
Please see the weaknesses above. |
Lightly AI-edited |
|
Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Syncphony, a diffusion-transformer-based framework for audio-to-video (A2V) generation that achieves fine-grained synchronization between sound and motion. The method introduces two key innovations: 1) Motion-aware Loss — reweights reconstruction loss to emphasize high-motion regions, improving temporal alignment. 2) Audio Sync Guidance (ASG) — uses an auxiliary “off-sync” model (without audio layers) to guide the full model during sampling toward stronger audio-motion coupling. To evaluate synchronization, the authors further introduce CycleSync, a new video-to-audio-based metric that measures whether the generated motion contains sufficient cues to reconstruct the original audio. Experiments on AVSync15 and TheGreatestHits datasets show improved synchronization and comparable or better visual quality relative to baselines such as AVSyncD and TempoTokens.
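For concreteness, a minimal sketch of what motion-based loss reweighting can look like is given below. The paper reportedly derives weights from optical flow; this sketch uses plain temporal differences and an arbitrary weighting constant purely to keep the example self-contained.

```python
import torch
import torch.nn.functional as F

def motion_weighted_mse(pred, target, lam=2.0):
    """Up-weight the reconstruction loss in regions with large frame-to-frame
    change. pred/target: (batch, frames, channels, H, W). Illustrative only."""
    diff = (target[:, 1:] - target[:, :-1]).abs().mean(dim=2, keepdim=True)
    motion = F.pad(diff, (0, 0, 0, 0, 0, 0, 1, 0))  # zero-motion pad for the first frame
    weight = 1.0 + lam * motion / (motion.amax(dim=(-2, -1), keepdim=True) + 1e-6)
    return (weight * (pred - target) ** 2).mean()
```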
1. The paper proposes Motion-aware Loss and Audio Sync Guidance to improve the audio-visual synchronization of A2V generation, which are conceptually simple and empirically effective.
2. The proposed metric CycleSync offers a meaningful step forward over prior metrics (AV-Align, AlignSync, RelSync) by enabling evaluation at 24 fps and better correlating with human perception.
3. The paper also proposes a principled strategy to adapt a pretrained I2V model for the A2V task by selecting the most relevant layers to inject the audio cross-attention layer.
4. Syncphony consistently outperforms baselines on synchronization (CycleSync) and achieves competitive or superior FID/FVD, indicating that temporal precision does not come at the expense of visual quality.
1. While the paper’s ideas are sound, the architectural novelty is moderate. Syncphony heavily relies on a pretrained Pyramid Flow backbone, and the main innovations are at the loss and sampling levels rather than core model design.
2. The writing of the paper can be improved. For example, the introduction could include more details and motivation for the proposed methods rather than mostly background information.
3. Limited baselines: the proposed method is only compared with AVSyncD and Pyramid Flow. Is it possible to include more baselines, such as those listed in AVSyncD?
1. Fig 5 and 6: it will be helpful to include the ground-truth video as a reference.
2. How does ASG compare with the vanilla classifier-free guidance? For example, use the features with and without audio input as guidance.
3. The paper only finetunes the last 16 blocks (8–23) of the Pyramid Flow backbone. Are there any experimental results supporting the benefits of this choice, e.g., compared to finetuning the full model?
Others: see the weakness section. |
Lightly AI-edited |
|
Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
Syncphony demonstrates meaningful progress in the audio-to-video generation domain. Its main contributions include: (1) a motion-aware loss that re-weights the training objective based on optical flow to focus the model’s learning on motion-intensive regions; (2) a training-free guidance technique that enhances the injection of audio information during inference without additional training; and (3) a more intuitive and reasonable evaluation metric for measuring audio-visual synchronization through reconstructed audio from generated videos. Experiments on two public datasets and qualitative case studies further validate the method’s effectiveness in improving both synchronization and visual quality.
- The proposed training-free guidance is novel and effectively points out the core challenge of achieving precise spatiotemporal alignment in audio-to-video generation.
- The proposed evaluation metric is intuitively reasonable and addresses the limitation of conventional metrics like FVD, which fail to effectively measure spatiotemporal alignment.
- The use of RoPE for spatiotemporal positional encoding enhances temporal consistency and spatial coherence in video generation, contributing to smoother and more structured motion representation.
- The demo results demonstrate good temporal alignment consistency.
Although the work is technically sound, there are several issues that I feel must be discussed.
- Regarding the motion-aware loss, it relies on a strong assumption that the visual content remains static and confined to a single scene. This assumption holds almost perfectly in the authors' setting, where video clips are only around two seconds long. However, in more realistic video generation scenarios, background changes, camera motion, and scene transitions can occur without producing strong audio cues, but may catastrophically distort optical flow estimation and thus limit the scalability of the proposed approach.
- The proposed guidance technique randomly drops audio cross-attention layers during inference; however, this inference structure (unlike CFG or Autoguidance) is never encountered during training, making the resulting “unconditional” outputs less predictable. Moreover, although the authors provide some visual demonstrations, the underlying motivation may primarily apply to relatively smaller models where audio conditioning tends to be weaker. In larger-scale video generation frameworks such as Wan or Hunyuan-Video, audio conditions may not be as easily ignored, and models could exhibit different skip-layer behaviors. Rather than validating only on additional datasets, I would prefer the authors to verify this idea across multiple video generation baselines to strengthen the generality of their claims.
- Although the proposed new metric is intuitively reasonable, current video-to-audio (V2A) models also suffer from (or are still addressing) spatiotemporal alignment issues, which means they may not serve as a fully reliable ground-truth proxy. Moreover, the approach fundamentally increases evaluation time, potentially limiting scalability to larger experiments.
- Could the authors elaborate on how they plan to address the challenges posed by more realistic (or longer) audio-to-video scenarios, where optical flow estimation can be severely affected by background motion, camera movement, or scene transitions beyond the main subject?
- Could the authors show how the skip-layer behavior manifests in other video generation models and whether similar phenomena can be consistently observed? In addition, if temporal alignment is indeed a crucial aspect of the task, why not consider using energy-based audio features as a more direct form of control or guidance?
- Given that modern multimodal large language models (e.g., Gemini 2.5) already demonstrate strong capabilities in understanding audio-visual information, and that existing video-to-audio alignment metrics (such as DeSync, which can be applied similarly in the A2V setting) provide reliable proxies for spatiotemporal correspondence, could the authors clarify or compare what specific advantages their proposed metric offers over these established approaches? |
Lightly AI-edited |
|
Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Syncphony, an audio-to-video (A2V) model capable of generating 380×640 resolution, 24fps videos synchronized with diverse audio inputs. Built upon a pre-trained video backbone, it emphasizes audio-visual synchronization through two main contributions:
- Motion-aware Loss, a loss weighting mechanism emphasizing on the high-motion regions of the video.
- Audio Sync Guidance, a classifier-free guidance design focusing solely on audio conditioning to emphasize better synchronization.
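To fix notation for the discussion, here is a minimal sketch of guidance that contrasts an audio-conditioned prediction with one in which the audio pathway is disabled; the names, the way the audio layers are bypassed, and the scale are assumptions rather than the authors' exact implementation.

```python
def audio_sync_guidance(model, model_no_audio, x_t, t, audio_emb, w_asg=2.0):
    """Sketch of audio-focused guidance: amplify the component of the prediction
    that the audio conditioning contributes. model_no_audio stands for the same
    network with its audio pathway disabled (a hypothetical handle)."""
    v_audio = model(x_t, t, audio=audio_emb)   # audio-conditioned prediction
    v_plain = model_no_audio(x_t, t)           # prediction without audio influence
    return v_plain + w_asg * (v_audio - v_plain)
```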
Moreover, the authors introduce an auxiliary contribution for model evaluation:
- CycleSync, an audio similarity metric computed between the detected peaks of the ground-truth audio and of the audio regenerated from the A2V output by a pretrained V2A model.
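To illustrate the kind of comparison CycleSync performs, a rough peak-matching sketch follows; the paper's actual detection and scoring details are not reproduced, and the envelope smoothing, thresholds, and tolerance below are arbitrary choices for the example.

```python
import numpy as np
from scipy.signal import find_peaks

def peak_match_score(audio_gt, audio_v2a, sr=16000, tol=0.1):
    """Fraction of ground-truth audio peaks with a counterpart (within tol
    seconds) in the audio regenerated from the A2V output by a V2A model.
    Illustrative stand-in for CycleSync, not its actual definition."""
    def peak_times(x):
        win = max(1, int(0.02 * sr))                       # ~20 ms smoothing
        env = np.convolve(np.abs(x), np.ones(win) / win, mode="same")
        idx, _ = find_peaks(env, height=0.3 * env.max(), distance=int(0.1 * sr))
        return idx / sr
    t_gt, t_gen = peak_times(audio_gt), peak_times(audio_v2a)
    if len(t_gt) == 0:
        return 1.0
    hits = sum(bool(np.any(np.abs(t_gen - t) <= tol)) for t in t_gt)
    return hits / len(t_gt)
```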
The A2V model is finetuned from a pretrained I2V model by adding cross-attention layers in the latter transformer blocks of the pretrained model. The backbone is a pretrained PyramidFlow Video model, trained on videos up to 5 seconds long at 24 fps and 380 × 640 resolution. Audio is sampled at 16 kHz and encoded through DenseAV for conditioning. Text is encoded through CLIP. Temporally-aware RoPE frequencies are employed to align the modalities in the aforementioned cross-attention layers.
Evaluations are conducted on the AVSync15 dataset using both objective (FID, FVD, Image-Audio similarity, Image-Text similarity, CycleSync) and subjective metrics (IQ, FC, Sync). They demonstrate that the proposed method outperforms AVSyncD on all axes.
On the Greatest Hits dataset, the proposed method comes out ahead on the CycleSync and FVD metrics, showing its emphasis on both video generation quality and audio-visual synchronization.
The paper's quality is reinforced by thorough ablations and experiments, presented both in the main sections and the appendices, notably:
- a demonstration of the greater sensitivity of the CycleSync metric compared with previously proposed ones, via correlation with a human study and experiments in controlled settings such as temporal shifts.
- ablations on the Motion-aware Loss and Audio Sync Guidance contributions
- a study of the pretrained model's behavior to understand where to inject the finetuned layers
Although this paper presents an audio-to-video model, its contributions should also translate into better video-to-audio model designs.
Figure 1 is not clear enough to me. It is not clear what the frozen and trainable layers refer to (are those transformer blocks?). I would suggest stating in the caption that the audio features are injected in the latter blocks (the usual expectation is to add cross-attention to each block, so the presentation is quite counter-intuitive without a corresponding explanation).
Formatting:
- Add spacing before opening parentheses (around line 347).
- In Table 3, I suggest highlighting more clearly which model is the final version (perhaps by placing it in the top row and adding a row separator).
- As one of the paper's contributions is a classifier-free guidance design, I would suggest adding some ablations on the choice of the CFG parameters. |
Fully human-written |
|
Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Syncphony, an audio-to-video generation model that focuses on improving synchronization between audio and motion.
To achieve this, the authors introduce a Motion-Aware Loss for capturing movement intensity and an Audio Sync Guidance (ASG) mechanism for enforcing synchronization between audio and visual dynamics.
A new evaluation metric, CycleSync, is also proposed to better align with human perceptual judgments of synchronization.
Experimental results show that Syncphony performs better than most existing baselines.
1. The introduction of the CycleSync metric sounds reasonable and shows higher alignment with human perceptual evaluations than prior synchronization metrics.
2. The model achieves consistently better performance than most baselines across several benchmarks.
1. Generating a 5-second video takes nearly 3 minutes, which significantly limits practical usability.
2. The main model builds heavily on Pyramid Flow, with only moderate extensions (audio conditioning and synchronization guidance). As such, the contributions feel incremental.
3. The CycleSync metric relies on pretrained V2A models that may introduce background or irrelevant audio content, potentially biasing the evaluation.
4. The proposed Motion-Aware Loss does not explicitly capture semantic motion as the authors claim. It merely measures pixel or latent differences between consecutive frames, which may not correspond to meaningful sound-related motion.
In the demo, only 2-second video samples are provided, even though the model can generate 5-second videos. Why is that? |
Lightly AI-edited |
|
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes DeepDive, a web-browsing deep search agent. DeepDive develops a method to generate synthetic, hard-to-find QA data from knowledge graphs via long random walks with attribute obfuscation, and combines it with multi-turn RL using GRPO plus a redundancy penalty. The method aims to improve long-horizon reasoning with tools in a ReAct-style agent. DeepDive-32B shows superior performance compared to open models, though a significant gap remains compared to GPT deep research.
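To make the data-synthesis idea concrete, a toy sketch is given below; the knowledge graph, relations, and the hard-coded obfuscation string are invented for illustration, whereas the paper works with real KGs and uses an LLM for obfuscation and filtering.

```python
import random

# Toy KG: entity -> list of (relation, target) edges. Entities/relations invented.
KG = {
    "Ada Lovelace": [("collaborated_with", "Charles Babbage"), ("born_in", "London")],
    "Charles Babbage": [("designed", "Analytical Engine"), ("born_in", "London")],
    "Analytical Engine": [("described_in", "Sketch of the Analytical Engine")],
}

def synthesize_qa(start, hops=2, seed=0):
    """Random-walk QA synthesis in the spirit of the paper: walk a few hops,
    keep the relation chain, and obfuscate the start entity so the answer is
    hard to find with a single lookup."""
    rng = random.Random(seed)
    node, relations = start, []
    for _ in range(hops):
        if node not in KG:
            break
        rel, node = rng.choice(KG[node])
        relations.append(rel)
    blurred = "a 19th-century figure known for early writings on computation"
    question = f"Starting from {blurred}, follow: {' -> '.join(relations)}. Which entity do you reach?"
    return question, node  # (hard-to-find question, ground-truth answer)

print(synthesize_qa("Ada Lovelace"))
```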
1. The paper combines KG-based synthesis of harder, multi-hop queries with multi-turn RL, and demonstrates meaningful improvements on challenging deep web-search tasks, showing that tougher supervision plus RL can better exploit tools and sustain longer reasoning chains.
2. The ablation studies isolate the effects of the reward design and the synthetic data generation, providing evidence that each component contributes measurably to performance and search efficiency.
1. Uncontrolled inference-time scaling. The paper does not standardize or report consistent test-time budgets across baselines (e.g., max tool-call limits). Several baseline scores appear to be taken from prior work rather than re-evaluated under a matched inference budget, making the headline comparisons confounded by inference-time scaling.
2. Incremental contribution in data generation. The proposed synthetic data pipeline generates multi-hop questions/targets by traversing the graph (random walks) to obtain search paths and then obfuscating attributes. Many existing efforts already use KG/graph path extraction to script multi-step reasoning trajectories, and there are also works that use LLMs to paraphrase and harden queries.
3. Use more recent models. The experiments could also evaluate the Qwen3-32B model, which has enhanced tool-use ability compared to QwQ-32B.
The authors can address the concerns mentioned in weaknesses 1 and 3 above. |
Fully human-written |
|
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents DeepDive, a method to enhance the capabilities of LLMs on complex "deep search" tasks, where they are often limited by long-horizon reasoning and a lack of quality training data. The approach has two key components:
1) A strategy to automatically synthesize complex, obfuscated question-answer pairs from Knowledge Graphs by using random walks and LLM-based entity blurring; and 2) An end-to-end multi-turn RL framework based on GRPO, which introduces a "redundancy penalty" to encourage diverse search.
Experiments show that DeepDive-32B achieves competitive results among open-source models on the BrowseComp benchmark.
- The paper's primary contribution is an automated data synthesis method for deep search. By leveraging the multi-hop structure and attributes of Knowledge Graphs (KGs) and using an LLM for "obfuscation," this method addresses the critical lack of "hard-to-find" questions in existing datasets.
- The paper provides detailed analyses, including scalability with tool calls, parallel sampling strategies, and the evolution of the model's search behavior during RL training.
- A significant concern arises from Appendix C, which introduces a semi-automated, i.i.d. data synthesis strategy. This method, using only ~3,000 QA pairs, achieves far superior results (22.2% on BrowseComp) compared to the main paper's automated KG method (15.3%). This large performance discrepancy calls into question the true value and scalability of the main proposed KG method.
- The novelty of the RL framework is limited. The core training framework is based on the existing GRPO algorithm, with the primary addition being a "Redundancy Penalty." While effective, this appears to be more of an effective engineering trick rather than a substantial algorithmic contribution.
- The ablation study for the "Format Reward" (Figure 7a) is inconclusive. The results show that removing it causes learning to stagnate (around 8.0% accuracy). This only demonstrates its necessity within the current framework, not its superiority or design effectiveness. The authors fail to compare it against other alternative intermediate rewards.
1. Regarding the large performance gap between the main KG method (15.3%) and the appendix i.i.d. method (22.2%): Since 3k QA pairs is not that much, can the authors elaborate on the cost-benefit trade-off? Specifically, what was the human-hour cost for the ~3k i.i.d. samples compared to the compute cost (including LLM obfuscation and filtering) for the KG data?
2. Regarding the Format Reward: As the ablation shows, this reward is critical for learning. Can you compare it against simpler baselines (e.g., a sparse reward given only for a successfully parsed tool call, rather than the strict, full format check)? This would help clarify if the strictness of the reward is the key factor, or just the presence of an intermediate signal. |
Lightly AI-edited |
|
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces DeepDive, a framework designed to improve the "deep search" capabilities of open-source Large Language Models (LLMs), turning them into more effective web-browsing agents for solving complex problems.
Key Contributions:
* Automatic Data Synthesis: It proposes a novel strategy to automatically generate complex, difficult, and hard-to-find question-answer pairs from open knowledge graphs. This creates a rich dataset for training without manual effort.
* Reinforcement Learning (RL) for Training: It employs an end-to-end multi-turn Reinforcement Learning process to train the LLM. This method enhances the model's long-horizon reasoning. It also introduces a "redundancy penalty" to discourage the agent from making repetitive or similar search queries, encouraging more efficient and diverse exploration.
* The knowledge graph-based QA synthesis ensures high-quality, reasoning-intensive training data—by blurring entity attributes and filtering via frontier models (e.g., GPT-4o), it creates "hard-to-find" questions that truly stimulate deep search capabilities.
* The multi-turn RL framework (with the GRPO algorithm and a redundancy penalty) encourages diverse, efficient search: the penalty discourages repeated queries, measured by Jaccard similarity (a minimal sketch is given after this list), while the end-to-end design integrates reasoning and tool use, enhancing long-horizon search.
* Strong experimental validation: DeepDive-32B sets a competitive open-source record on BrowseComp (15.3% accuracy), outperforms other open agents, and demonstrates test-time scalability (better performance with more tool calls/parallel sampling).
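A minimal sketch of the Jaccard-based redundancy penalty referred to above (token-level Jaccard over issued queries; the threshold, weight, and the way it is folded into the trajectory reward are assumptions):

```python
def jaccard(q1, q2):
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / max(1, len(a | b))

def redundancy_penalty(queries, threshold=0.6, beta=0.1):
    """Count query pairs in one trajectory whose token-level Jaccard similarity
    exceeds a threshold; illustrative values, not the paper's exact ones."""
    repeats = sum(
        jaccard(queries[i], queries[j]) > threshold
        for i in range(len(queries))
        for j in range(i + 1, len(queries))
    )
    return beta * repeats

# e.g. reward = float(answer_correct) - redundancy_penalty(issued_queries)
```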
Many of the contributions in this paper lack originality and novelty. For specifics, please refer to the "Questions" section below.
Many of the paper's contributions seem to lack novelty. For example:
* One of the core claimed contributions is the synthesis of QA pairs from KGs with an obfuscation step. However, this approach was already proposed in the prior work, WebSailor. What is the specific contribution of this paper in this regard?
* There is extensive prior work on applying reinforcement learning (RL) to search agents. What is the novelty and contribution of the RL framework presented here compared to existing methods? |
Lightly AI-edited |
|
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
Augmenting large language models (LLMs) with browsing tools greatly enhances their potential as deep search agents capable of tackling complex, real-world tasks. However, open-source LLMs still struggle in such settings due to limited long-horizon reasoning with browsing tools and the lack of sufficiently challenging supervised data. To overcome these limitations, this paper introduces DeepDive, a framework designed to advance deep search agents. First, it proposes an automated strategy for synthesizing complex, difficult, and hard-to-find questions from open knowledge graphs. Second, it employs end-to-end multi-turn reinforcement learning (RL) to improve LLMs’ deep search and long-horizon reasoning capabilities. Built upon open models, DeepDive-32B achieves 15.3% accuracy on BrowseComp and demonstrates strong test-time scaling in both tool usage and parallel sampling.
Clarity:
1. The paper presents its methods and experimental analyses clearly.
2. It provides sufficient ablation experiments to support the effectiveness of the proposed approach.
Originality: How does your work differ from “WebSailor: Navigating Super-human Reasoning for Web Agent”? I believe your approach shows little innovation in both the knowledge graph–based data construction and the multi-turn RL training methods.
1. What obfuscation strategies are used in the paper?
2. Regarding the multi-turn RL training method mentioned: it seems that the agent performs multi-step tool calls to produce an answer — but isn’t this the standard approach in existing work? There are no methods that produce an answer in a single tool call, right? Or do you compute rewards separately for each intermediate tool call? Based on my understanding, your method assigns a single reward to the entire multi-step trajectory, combining answer correctness and a redundancy penalty.
3. Is your training and evaluation framework based on open-source code? |
Lightly AI-edited |
|
Towards Automatic Discovery and Explanation of Differences Between Vision Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a framework for automatically generating natural language explanations that describe performance differences between two vision models. It tackles a very important explainability problem for vision models, although there appear to be gaps in terms of performance and reliability.
- The paper introduced well-defined and complementary metrics, such as completeness, density, and token length, that are reasonable for evaluating explanations.
- It performed ablation studies to understand each design choice’s impact on metrics.
- A practical use case on CelebA was shown.
- While the paper used a generator to produce images for model evaluation, it has not been validated whether the generated images are diverse and realistic.
- The conditions proposed by LLMs may be prone to errors and may not be diverse.
- There is a significant gap between human and LLM-generated explanations in Table 1. The practical implications of the proposed method should be more clearly discussed.
- How sensitive are the results to the quality and diversity of the images/conditions generated by the underlying data generators and the LLM for exploration? Have any analyses been conducted regarding potential biases or limitations in the synthetic data or LLM-proposed conditions?
- What was the computational cost of the method? |
Fully human-written |
|
Towards Automatic Discovery and Explanation of Differences Between Vision Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a method to automatically identify and explain performance differences between machine learning models beyond benchmark scores. The authors introduce a framework that generates natural language explanations describing the differences between two models' performance.
To evaluate explanation quality, they define three metrics: Completeness (measuring correctness and overall informativeness), Density (capturing token-level informativeness), and Token Length (assessing verbosity). Based on these metrics, three explanation generation methods are proposed: Raw Differences, which enumerates all performance differences; Summarization, which condenses them into concise summaries; and Optimization, which balances informativeness and conciseness.
Experiments on the CMNIST, CLEVR, and CelebA datasets demonstrate that the Optimization method effectively reveals model differences and biases through natural language.
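For reference, one hypothetical way the three metrics could be operationalized is sketched below; the paper's exact definitions may differ, and the per-condition "sign" inputs are an assumption about how completeness is judged.

```python
def explanation_scores(signs_from_explanation, true_signs, explanation_text):
    """Hypothetical formalization: Completeness = fraction of conditions whose
    performance difference the explanation gets right; Token Length = verbosity;
    Density = Completeness per token."""
    assert len(signs_from_explanation) == len(true_signs) > 0
    completeness = sum(
        p == t for p, t in zip(signs_from_explanation, true_signs)
    ) / len(true_signs)
    token_length = len(explanation_text.split())
    density = completeness / max(1, token_length)
    return {"completeness": completeness, "density": density, "token_length": token_length}
```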
The paper introduces three evaluation metrics—Completeness, Density, and Token Length—along with three complementary methods to analyze performance differences between models.
The results demonstrate that the Summarization and Optimization methods generate more concise and informative explanations, effectively capturing high-level insights.
Moreover, the authors validate the robustness and generality of their approach using different large language models (LLMs), further confirming the effectiveness of their proposed framework.
Although the proposed framework aims to reduce the extensive human effort, time, and resources required to identify model strengths and weaknesses, the paper does not provide quantitative comparisons with existing approaches to substantiate this claim.
In addition, the evaluation is conducted on only 100 conditions, which is insufficient to yield strong quantitative evidence or confidently demonstrate the framework’s effectiveness.
While experiments are performed on three datasets—CMNIST, CLEVR, and CelebA—these datasets are relatively limited in scope, as they represent simple numerical, synthetic, and facial data, respectively. Evaluating the method on larger and more diverse datasets, such as ImageNet, could provide a stronger benchmark and better validate generalization.
Furthermore, the paper lacks clarity regarding the models used for training on these datasets, including whether they share the same architecture or differ, which makes it difficult to interpret the reported comparisons.
It remains unclear why the study did not include direct comparisons between existing visual models to further validate the proposed framework. This would be particularly interesting, as numerous prior works have explored differences between vision models.
1. Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, and Cho-Jui Hsieh. On the adversarial robustness of visual transformers. arXiv preprint arXiv:2103.15670, 2021.
2. Yutong Bai, Jieru Mei, Alan L Yuille, and Cihang Xie. Are transformers more robust than cnns? Advances in Neural Information Processing Systems, 34, 2021.
3. Mingqi Jiang, Saeed Khorram, and Li Fuxin. Comparing the decision-making mechanisms by transformers and cnns via explanation methods. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pages 9546–9555, 2024. |
Lightly AI-edited |
|
Towards Automatic Discovery and Explanation of Differences Between Vision Models |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes to explain the differences between two vision models in natural language, using an LLM and a conditional data generator. The authors propose three evaluation metrics for the explanations: Completeness, Density, and Token Length. They also derive three explanation methods: Raw Differences enumerates a list of all conditions but yields too much information; Summarization gives a brief summary of all conditions but may omit details; Optimization steers an LLM with the objective of combining the strengths of both Raw Differences and Summarization. Experiments on CMNIST, CLEVR, and CelebA show that the Optimization method produces the most concise and informative explanations, effectively uncovering true model differences.
- Explanations between the differences of two models via natural language is an interesting and unstudied direction.
- The motivation is well defined and is interesting.
- [W1] The paper does not investigate realistic scenarios. Most of the cases are binary (e.g., gender), based on toy, unrealistic datasets such as CMNIST and CLEVR, and even then the experiments are highly controlled, with the authors training their own models under their own bias conditions. I am not really sure how this paper would have an impact in the field. On the other hand, GLOVE [R1] is a similar work in essence that also uses LLMs as implicit optimizers to generate input prompts for vision-language models; here, the authors' explanations can be viewed as input prompts. Furthermore, GLOVE shows how useful the method is on a variety of real-world tasks and models, such as CLIP and autoregressive VLMs, across a huge variety of tasks including multi-label zero-shot classification, generalization, and VLM safety.
- [W2] I believe that simply designing datasets, models, and conditions, and then evaluating them on self-proposed metrics without demonstrating their broader applicability, offers limited value to the community. While such metrics can indeed serve as useful reward signals for guiding LLMs (as in GLOVE), reporting them in isolation—without showing their practical relevance or impact—does not meaningfully advance the field.
- [W3] The explanations rely heavily on synthetic data, which might differ from the real data distribution that the model is trained on. How do the authors address this issue? If that is the case, the explanations could be completely misleading.
[R1] GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models, TMLR 2025
While the idea is nice and interesting, W1 and W2 are major issues, and sadly my decision will be negative. I encourage the authors to continue in the same direction, but to show the practicality and usefulness of their method in real-world settings and tasks.
Minor:
- The figures are misleading. For example, in L240 and Fig. 3, "man with a beard", "woman with a stroller", etc. are not attributes; the gender is explicitly specified. The claim in L246 that "the explanation becomes lengthy" is not reflected in the figure.
- I recommend citing related work on explaining the difference between two models via concepts [R2].
[R2] REPRESENTATIONAL SIMILARITY VIA INTERPRETABLE VISUAL CONCEPTS, ICLR 2025 |
Fully human-written |
|
Towards Automatic Discovery and Explanation of Differences Between Vision Models |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes using large language models to explain the differences between two models through three steps: Raw Differences, Summarization, and Optimization. The approach leverages LLMs to generate comparative explanations. In addition, the authors introduce three metrics to evaluate the quality of the explanations: Completeness, Density, and Readability. Experimental results show that the proposed method outperforms baseline approaches on multiple datasets.
1. The method is simple: by leveraging the capabilities of large language models, it generates comparative explanations without requiring complex training processes.
1. Based on Definitions 1.1 and 1.2, the task can be reduced to explaining a binary classifier and using the explanation to predict the model's output. One could simply define a difference model $f_{\text{diff}} = f_1 - f_2$ and apply existing explanation methods to $f_{\text{diff}}$ (a minimal sketch of this baseline is given after this list). Since the paper does not compare against this natural baseline in the related work or the experimental evaluation, the effectiveness of the proposed method remains unconvincing.
2. The paper primarily uses Raw Differences as a baseline but does not clarify how the conditions $c$ for generating Raw Differences are chosen. The quality of Raw Differences is highly sensitive to the choice of $c$. Intuitively, selecting appropriate conditions could significantly improve the results. In Section 4, the authors draw an analogy between conditions and concepts. However, existing methods for concept selection [1] already provide effective ways to identify concepts that strongly influence model predictions, which could also improve Raw Differences. The lack of comparison with such established methods further weakens the paper's claims.
3. The authors introduce three new metrics for evaluating explanation quality. However, the evaluation of explanations is already a well-studied problem [2-4]. Without a comparison to existing evaluation metrics, presenting these three as a main contribution is inadequate.
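A minimal sketch of the difference-model baseline mentioned in point 1; f1 and f2 are placeholder classifiers, and plain gradient saliency stands in for any off-the-shelf attribution method.

```python
import torch

class DiffModel(torch.nn.Module):
    """Wrap two models into f_diff(x) = f1(x) - f2(x) so that any existing
    single-model explanation method can be applied to their difference."""
    def __init__(self, f1, f2):
        super().__init__()
        self.f1, self.f2 = f1, f2

    def forward(self, x):
        return self.f1(x) - self.f2(x)

def diff_saliency(f1, f2, x, class_idx=0):
    """Gradient saliency of the class-`class_idx` logit difference w.r.t. x."""
    x = x.clone().requires_grad_(True)
    DiffModel(f1, f2)(x)[:, class_idx].sum().backward()
    return x.grad.abs()
```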
[1] Concept-based Explainable Artificial Intelligence: A Survey
[2] A comprehensive study on fidelity metrics for XAI
[3] XForecast: Evaluating Natural Language Explanations for Time Series Forecasting
[4] On the (in) fidelity and sensitivity of explanations
1. How are the conditions or concepts $c$ chosen when generating Raw Differences? Would different choices of $c$ significantly affect the quality of the results?
2. How does the proposed method compare to the alternative approach of transforming the problem into explaining a single difference model $f_{\text{diff}}$ using existing explanation methods? |
Lightly AI-edited |
|
One Measure, Many Bounds: Bridging TV, Variance, and Mutual Information |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces a new information-theoretic framework for generalization bounds based on an alternative dependence measure $V_\alpha(S; W)$. It generalizes classical results built on total variation and mutual information, and at $\alpha = 2$ it yields a novel variance-based bound that links the generalization gap directly to the variance of the algorithm's output distribution. The authors further introduce Adaptive Density Stability as a sufficient condition for non-vacuous generalization and provide empirical validation on Bayesian linear regression, demonstrating that their bounds can be tighter than MI and CMI baselines.
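For reference, a small Monte Carlo sketch of how the variance quantity highlighted at $\alpha = 2$ could be probed in a 1D Bayesian linear regression setting; the prior/noise values, grid, and dataset counts below are arbitrary, and the paper's exact bound and constants are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
tau2, sigma2, n, n_datasets = 1.0, 0.25, 20, 500           # arbitrary choices
w_grid = np.linspace(-3.0, 3.0, 200)
densities = np.zeros((n_datasets, w_grid.size))

for k in range(n_datasets):
    w_true = rng.normal(0.0, np.sqrt(tau2))                 # draw S ~ P(S)
    x = rng.normal(size=n)
    y = w_true * x + rng.normal(0.0, np.sqrt(sigma2), size=n)
    post_var = 1.0 / (1.0 / tau2 + (x @ x) / sigma2)        # Gaussian posterior p(w|S)
    post_mean = post_var * (x @ y) / sigma2
    densities[k] = np.exp(-0.5 * (w_grid - post_mean) ** 2 / post_var) \
        / np.sqrt(2 * np.pi * post_var)

var_over_S = densities.var(axis=0)                          # Var_S[p(w|S)] on the grid
dw = w_grid[1] - w_grid[0]
print("integral of sqrt(Var_S[p(w|S)]) dw ~", np.sqrt(var_over_S).sum() * dw)
```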
* The introduced dependence measure $V_\alpha(S; W)$ bridges total variation, mutual information, and variance. It is both novel and unifying.
* The $\alpha=2$ case corresponds to a variance-based bound, which gives a direct, interpretable stability quantity and provides a data-dependent measure for generalization.
* The proposed ADS condition introduces a new way to ensure non-vacuous bounds, connecting distributional stability to practical learning rates.
Major:
* The related works discussed in the paper are a bit narrow and outdated. Only two baselines are considered: the MI bound from (Xu & Raginsky, 2017) and the CMI bound from (Steinke & Zakynthinou, 2020). There have been new bounds based on binary KL [1] or binary JS [2] comparators, and new individual information measures that provably lead to tighter bounds [3-5]. These works should be included in the comparison, and it is currently not clear whether the proposed $V_\alpha$ surpasses these new methods or not.
Minor:
* The empirical evidence is limited to 1D Bayesian regression and simple channels. More real-world demonstrations would strengthen the paper’s impact.
* Typo: ? at L880.
[1] A new family of generalization bounds using samplewise evaluated CMI. 2022.
[2] Exactly Tight Information-theoretic Generalization Bounds via Binary Jensen-Shannon Divergence. 2025.
[3] Tightening mutual information-based bounds on generalization error. 2020.
[4] Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. 2020.
[5] Individually conditional individual mutual information bound on generalization error. 2022.
* Except for Figure 2, the comparison only involves $V_\alpha(S; W)$ and $I(W; S)$. I suggest including other related works (see weaknesses) in the comparison and reporting the standard deviation/confidence interval.
* I took a look at the source code and am a bit confused about the choice of $\sigma$. The current implementation uses $\sigma = 1$ to compute the MI and CMI bounds, but uses a smaller $\sigma = 0.4$ to compute the $V_2$ bound, which may lead to an unfair comparison. From my reading (L950-960), they should be computed using the same $\sigma$. Can the authors justify this choice? |
Fully human-written |
|
One Measure, Many Bounds: Bridging TV, Variance, and Mutual Information |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces a new distribution- and algorithm-dependent complexity measure and uses it to derive several generalization bounds. The measure is intended to capture the dependence between the trained model $W$ and the sample $S$:
$$V_\alpha(S;W) \;=\; \mathbb{E}_S\!\left[\left(\mathbb{E}_{W|S}\!\left[\frac{p_{W|S}(W\mid S)}{p_W(W)}\right] - 1\right)^{1/\alpha}\right].$$
For different $\alpha$, the framework connects to TV- and MI-type bounds. The main contribution is a generalization bound based on $V_\alpha(S;W)$ (Theorem 3.1).
The strength of this work is that it proposes a new distribution- and algorithm-dependent complexity measure for generalization.
**MI scaling is not recovered.**
The paper suggests it "recovers" MI-style results, but the $\alpha = 1$ specialization (Eq. 14) does not match the classical Xu–Raginsky rate:
$$|\mathbb{E}[\mathrm{gen}]| \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(W;S)}.$$
As written, Eq. 14 has no explicit $1/\sqrt{n}$ factor; since $I(S;W)$ is typically constant in $n$, the RHS of Eq. 14 can only be constant and does not decrease with $n$.
**Eq. 15 is unclear or vacuous in rich or continuous hypothesis spaces.**
For continuous outputs, it’s not obvious how the sum/integral is defined or controlled; for large finite hypothesis classes, the term
sum_{w in W} sqrt( Var_S[ p_{W|S}(w|S) ] )
can scale poorly—contrast with the classical baseline sqrt( log|W| / n ).
**Computability/estimability of $V_\alpha$**
All results hinge on access to $p(W|S)$; outside simple models this is typically intractable.
**summary**
Without clear regimes where (i) the MI-type case exhibits $1/\sqrt{n}$ behavior, (ii) Eq. 15 remains finite and scales with capacity, and (iii) $V_\alpha$ (or a surrogate) is computable in realistic pipelines, it is hard to assess the practical significance. Clarifying these points would substantially strengthen the paper.
1. Please clarify in what sense Eq. 14 is MI-type and state conditions under which the deviation term yields $O(1/\sqrt{n})$.
2. Please specify conditions that make Eq. 15 finite. Also, how does it extend to continuous $W$? |
Moderately AI-edited |
|
One Measure, Many Bounds: Bridging TV, Variance, and Mutual Information |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The work introduces a one-parameter family of information-theoretic generalization bounds based on the vector-valued $L^p$-norm correlation measure $V_\alpha$. This framework unifies and interpolates between existing bounds, such as those based on total variation and Rényi information. According to the authors, the primary conceptual contribution emerges at $\alpha = 2$, where it yields a variance-based bound. This result establishes the variance of the algorithm's output distribution, $\mathrm{Var}_S[p(w|S)]$, as a direct, data-dependent measure of algorithmic stability, providing a new, information-theoretic perspective on how unstable (high-variance) algorithms fail to generalize.
- The paper is well-written and structured. It is easy to follow.
- Understanding generalization is an important topic and deriving generalization bounds can provide new insights on when and why NN generalize.
- I checked the theoretical proofs of the paper and I think they are correct.
- **Limited novelty/impact:** My main concern with this paper is that the proposed bound does not provide any new insights. Although deriving a variance-based bound is interesting, the resulting formulation does not provide any new fundamental insights into generalization. In fact, it is well established in the literature that (algorithmic) stability implies better generalization (see the many works on algorithmic stability, e.g., Section 4.3 of [1]). While the authors redefine stability in terms of variance rather than sensitivity to the removal of one training sample (β-stability as in algorithmic stability), the two notions are equivalent, as can be shown through the Efron–Stein inequality (recalled after this list for reference). Consequently, the proposed framework does not provide any new theoretical insights; it essentially reinterprets existing stability-based results in a different mathematical form (with a novel yet simple proof trick).
- **Incomplete coverage of related work:** The authors fail to mention and discuss several related works that derive information-theoretic generalization bounds (e.g., [1–5]). In particular, [1] and [5] also provide stability-based analyses closely related to the one proposed here; [1-5] are only examples. In fact, the literature in this area is now quite rich, and omitting prior works prevents readers from understanding how the current approach differs from or improves upon prior efforts. I recommend that the authors expand the related work section to include and properly discuss all relevant recent studies.
- **Weak empirical validation:** The experimental results are confined to a Bayesian linear regression task and a few toy classification setups, which provide only preliminary evidence for the proposed framework. However, these tasks are simplistic and low-dimensional. There is no demonstration on real-world or deep learning scenarios, where the tightness or practical relevance of such bounds would matter most. Furthermore, comparing only against two classical information-theoretic baselines (MI bound of Xu & Raginsky (2017) and CMI bound of Steinke & Zakynthinou (2020)) is insufficient. These are indeed foundational bounds, but the comparison omits more recent and stronger bounds, such as those from [1–5], which employ sample-wise or conditional decompositions and often achieve much tighter estimates in practice. Without benchmarking against these more contemporary methods, it is difficult to assess the true empirical or theoretical advantages of the proposed approach.
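For reference, the Efron–Stein inequality invoked above: for independent $X_1,\dots,X_n$ and any square-integrable $f$,
$$\mathrm{Var}\big(f(X_1,\dots,X_n)\big) \;\le\; \frac{1}{2}\sum_{i=1}^{n}\mathbb{E}\Big[\big(f(X_1,\dots,X_n)-f(X_1,\dots,X_{i-1},X_i',X_{i+1},\dots,X_n)\big)^2\Big],$$
where $X_i'$ is an independent copy of $X_i$. Uniform ($\beta$-type) stability bounds each squared replace-one difference, which is how sensitivity-based and variance-based notions of stability connect.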
[1] Harutyunyan, Hrayr, et al. "Information-theoretic generalization bounds for black-box learning algorithms." Advances in Neural Information Processing Systems 34 (2021): 24670-24682. \
[2] Hellström, Fredrik, and Giuseppe Durisi. "A new family of generalization bounds using samplewise evaluated CMI." Advances in Neural Information Processing Systems 35 (2022): \
[3] Dong, Yuxin, et al. "Towards generalization beyond pointwise learning: A unified information-theoretic perspective." Forty-first International Conference on Machine Learning. 2024. \
[4] Wang, Ziqiao, and Yongyi Mao. "Generalization bounds via conditional $ f $-information." Advances in Neural Information Processing Systems 37 (2024): 52159-52188. \
[5] Wang, Ziqiao, and Yongyi Mao. "Sample-conditioned hypothesis stability sharpens information-theoretic generalization bounds." Advances in Neural Information Processing Systems 36 (2023): 49513-49541.
See section above. |
Fully human-written |