|
Finetuning-free Alignment of Diffusion Model for Text-to-Image Generation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a finetuning-free, plug-and-play alignment strategy for diffusion models in text-to-image generation by casting the problem as sampling from a reward-weighted distribution. The authors analyze the challenges of existing guidance-based alignment schemes—particularly the emergence of adversarial artifacts—and propose a novel regularization for guidance signal stabilization. The method is evaluated on established text-to-image benchmarks, achieves strong alignment to human preferences using a lightweight guidance network, and demonstrates substantial computational savings over finetuning-based approaches.
- **Formulation Innovation:** The paper reframes text-to-image alignment as direct sampling from a reward-weighted distribution, moving away from common parameter fine-tuning approaches and offering a generic plug-and-play control mechanism.
- **Practical estimator for the guidance:** The paper adopts a simple regression trick (Eq. 13) to approximate the conditional expectation and then converts it to a guidance gradient (Eq. 14), avoiding expensive backprop through the sampler (see the sketch after this list).
- **Stabilization for adversarial guidance:** The instability of naïve guidance with increasing strength is documented and addressed via a consistency regularizer introduced in Eq. 16.
- **Lightweight and fast:** The guidance net is only ~72 MB and reuses the reference model's VAE/tokenizer/text encoder. Combined with SDXL-Turbo, this enables effective **one-step** generation.
- **Agnostic to Reward:** The method supports both differentiable and non-differentiable reward settings, with Table 4 in the appendix demonstrating applicability on GenEval with binary rewards.
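For concreteness, here is a minimal sketch of how I read the regression-then-guidance estimator of Eqs. 13-14; all module and variable names are mine (hypothetical), and the paper's actual parameterization may well differ:

```python
# Hedged sketch of a regression-based guidance estimator (my reading of Eq. 13-14);
# guide_net, base_score, and all other names here are hypothetical placeholders.
import torch
import torch.nn.functional as F

def guidance_regression_loss(guide_net, x_t, t, cond, reward_x0):
    """Fit E[r(x_0) | x_t, c] by plain regression on noisy latents (cf. Eq. 13)."""
    pred = guide_net(x_t, t, cond).squeeze(-1)   # scalar reward prediction per sample
    return F.mse_loss(pred, reward_x0)

def guided_score(base_score, guide_net, x_t, t, cond, scale):
    """Add the gradient of the fitted reward as guidance at sampling time (cf. Eq. 14)."""
    x_t = x_t.detach().requires_grad_(True)
    r_hat = guide_net(x_t, t, cond).sum()
    grad = torch.autograd.grad(r_hat, x_t)[0]    # ∇_{x_t} of the predicted reward
    return base_score(x_t, t, cond) + scale * grad
```

If this reading is correct, the appeal is that the sampler itself never needs to be differentiated through.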
1. **Analysis Depth of Regularization:** While the regularization is empirically justified and its effect visualized (see Figure 2), the theoretical underpinnings and limits of this regularization are not fully elucidated. Which artifact modes are suppressed, and does the regularization always guarantee avoidance of adversarial guidance? The practical selection of the regularization hyperparameter $\eta$ (Eq. 13 & 15) also remains ad hoc.
2. **Reward Dependence & Generality:** Although the proposed scheme is reward-agnostic in form, its empirical evaluation—especially in Table 2—is predominantly based on PickScore and similar human-preference proxies. It is unclear how robust the approach is to poorly calibrated, biased, or low-signal rewards. There is only a narrow demonstration on non-differentiable rewards in Table 4 (GenEval), which is limited in scope and size.
3. **Scope of generalization is narrow:** Experiments are concentrated on SDXL-Turbo; the paper asserts model-agnosticism and one-step benefits, but offers limited cross-backbone verification or stress tests on distribution shift.
4. **Hyperparameter Sensitivity:** The proposed method claims to "fix" the problem of carefully tuning the guidance strength, but practical recipes or robustness studies for the guidance parameter, regularization weight, or hyperparameter $\beta$ are lacking.
5. **Comparisons to very recent alignment methods are light:** Table 2 includes Tweedie/Backprop and two finetuning methods (Diffusion-DPO, SPO), but a broader slate of strong, related alignment methods (and best-practice configs) would better establish the relative advantage.
6. **Not Strictly Finetuning-free:** The precise definition of "finetuning-free" should be clarified. The scenario described in this paper can at best be considered "no base-model fine-tuning".
1. **Regularization Mechanics:** Can the authors provide more intuition on how the proposed regularization term shapes the guidance network’s landscape? Are there scenarios or reward functions where this regularization might fail or even worsen adversarial behaviors?
2. **Sensitivity Analysis:** How does performance vary with η and β? Please provide curves (PickScore/HPSV2/ImageReward/Aesthetic vs. η, β) and report variance across seeds.
3. **Extension to Other Backbones:** Have you tested non-Turbo SDXL or SD 2.1 latent backbones, or text-conditional DiT variants (like Flux)? Are there empirical results or qualitative observations on data distribution shifts not covered by the current benchmarks?
4. **Robustness to Reward Misspecification:** Beyond GenEval's binary reward, how does the method fare under noisy, sparse, or biased rewards? Can the guidance network overfit to reward artifacts, and how would the regularizer respond?
5. **Comparisons with Other Alignment Related Work:** Where are the practical/theoretical boundaries vs. other plug-and-play or inference-time guidance-based alignment methods? This will help substantiate the method's effectiveness and superiority. |
Fully AI-generated |
|
Finetuning-free Alignment of Diffusion Model for Text-to-Image Generation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a finetuning-free method that improves the alignment of text-to-image diffusion models. It frames alignment as a sampling problem from a reward-weighted distribution. Specifically, this paper decomposes the score function to include a guidance term and proposes a regularization technique to train the model. Experimental results on the Pick-a-Pic dataset show the improvement of the proposed method over baseline studies.
This paper proposes a finetuning-free method that is efficient compared to finetuning-based methods. The proposed regularization strategy stabilizes the guidance signal and improves text-to-image diffusion models.
1. I have concerns regarding the evaluation of the proposed method. According to line 418, the evaluation is conducted *using 500 validation prompts from the validation unique split of Pick-a-Pic.* How are these prompts selected? Moreover, the baseline method SPO is evaluated on 4K prompts from Pick-a-Pic, which is eight times more than this method.
2. This paper evaluates its method based on SDXL-Turbo, which was released in 2023. Considering the rapid emergence of new models, SDXL-Turbo is somewhat dated and does not adequately support the effectiveness and generalization of the proposed method. How does the proposed method perform when generalized to recent models?
3. Figure 1 shows some visualization results, while the prompts are provided in the appendix. It is difficult for me to see the improvement of the proposed method over the baselines. It seems the baseline method already achieves sufficiently good results.
4. Is it intentional that the related work is presented in Section 1.1 instead of Section 2?
5. The citation formatting could be improved; for example, line 107 reads:
> In (Liang et al., 2024), Liang et al. propose....
Please refer to the weaknesses. |
Fully human-written |
|
Group Pattern Selection Optimal: Let LRMs Pick the Right Pattern for Reasoning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes Group Pattern Selection Optimization (GPSO), an RL framework for training reasoning models to adaptively select the right reasoning pattern for a problem. The paper considers three reasoning patterns: direct solution, reflect-and-verify, and explore multiple solutions, each of which is a 1-2 sentence addition to the prompt template (App. B.2). GPSO rolls out multiple responses for each pattern, selects the pattern with the highest empirical accuracy, and updates the policy only on rollouts from the optimal pattern. They mask attention to the suffix tokens during gradient computation to prevent overfitting to explicit prompts. Their experiments show consistent gains across the four reasoning benchmarks.
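To check my understanding of the core mechanism, here is a minimal sketch of the rollout-selection step as I read it; all names and signatures are mine, not the authors':

```python
# Hypothetical sketch of GPSO's best-pattern rollout selection (my reading of the method);
# `rollout` and `verify` stand in for the policy sampler and the answer verifier.
from typing import Callable, List

def select_best_pattern_group(
    problem: str,
    pattern_suffixes: List[str],                 # e.g. direct / reflect / explore prompts
    rollout: Callable[[str, int], List[str]],    # samples m responses for a prompt
    verify: Callable[[str, str], bool],          # verifier reward: is the answer correct?
    m: int = 8,
) -> List[str]:
    """Roll out m responses per pattern and keep only the group with the highest
    empirical accuracy; the GRPO-style update then uses only these trajectories."""
    best_group, best_acc = [], -1.0
    for suffix in pattern_suffixes:
        responses = rollout(problem + "\n" + suffix, m)
        acc = sum(verify(problem, r) for r in responses) / m
        if acc > best_acc:
            best_group, best_acc = responses, acc
    return best_group
```

If this matches the method, the computational-cost question below follows directly from the extra loop over patterns.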
The paper is generally clearly written. It has clear motivation and a good visual presentation.
- The abstract claims "GPSO learns the intrinsic mapping from problem to pattern", which led me to believe that the model is autonomously discovering reasoning patterns during training. In reality, GPSO selects one of three (or four?) pre-defined prompt templates during training. I think the paper's framing overstates this, and that it's more accurate to view GPSO as an improvement to rollout sampling that prioritizes supervision from high-reward patterns.
- Generally, I think the fact that the method relies on the diversity in a small number of hand-crafted prompt templates is a limiting factor.
- In the intro (line 82-89), you claim that Figure 1 "demonstrates that if LLMs were capable of dynamically selecting the most suitable pattern, their overall performance could be enhanced by a substantial margin". This is a core motivation for your paper. If I understand the Best bar in Figure 1 correctly, you took a question-wise maximum over the different patterns and averaged that value. This sampling procedure is an unfair comparison, similar to pass@k vs pass@1.
- The captions are overall quite sparse and don't contain enough self-contained context.
- Could you comment on computational cost? If I understand correctly, you'd have to roll out three times more examples at each training iteration. If that is correct, perhaps it's reasonable to consider regular GRPO with 3x more iterations as another point of comparison?
- In a few places, you describe your method name as Group Pattern Selection *Optimal* (GPSO). This is a typo, right? I'm assuming you mean Optimization.
- Very minor, but in line 283, you say that your metric is Pass@k with k=1. This is just accuracy; I don't see why you're mentioning Pass@k at all. |
Fully human-written |
|
Group Pattern Selection Optimal: Let LRMs Pick the Right Pattern for Reasoning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper identifies that Large Reasoning Models (LRMs) often default to dominant, sub-optimal reasoning patterns. It introduces Group Pattern Selection Optimization (GPSO), a reinforcement learning framework extending GRPO. The method operates by performing multi-pattern rollouts, where a single problem is prompted with several different "reasoning pattern suffixes". A verifier signal is used to identify the most effective pattern for that specific problem, and the model's policy is then updated using only the trajectories from that optimal pattern group. GPSO employs attention masking to prevent the model from overfitting to the explicit pattern suffixes, thereby forcing it to learn an intrinsic mapping from the problem to the optimal reasoning strategy. Experiments on math and science benchmarks demonstrate consistent, though often modest, performance gains.
- The motivation is clear. Figure 1 demonstrates that different reasoning patterns yield different accuracies for the same problem and that current RLVR pipelines can bias models toward a single, sub-optimal dominant pattern.
- The proposed training mechanism, GPSO, is general and well-designed. It builds on GRPO by adding multi-pattern rollouts, best-pattern selection, and attention masking, which could potentially be integrated into various RLVR-style pipelines.
- The contribution of the reasoning pattern optimization, using masking for pattern prompting invariance and a best-group update to force the model to learn an internal reasoning pattern policy, is a valuable idea for improving the quality of intermediate reasoning tokens generated by the model.
- The method shows consistent performance gains, particularly on weaker to midsize models (e.g., ~2–3 point average gains on 1.5B–8B parameter models) on difficult math benchmarks.
1. The pipeline’s absolute effectiveness remains unclear because the trained model still trails far behind the oracle “Best” per-question pattern upper bound from the analysis in Figure 1; the gap (e.g., a 90% oracle best on AIME2024 with Qwen3-thinking versus the achieved 77.5%, which is less than a 1-point gain over the 76.7% baseline and trails even some reasoning-pattern prompting methods) suggests the bottleneck is only slightly addressed.
2. The training objective relies on a "hard" best-pattern selection. The method always picks the single best pattern group (by highest verified accuracy) and completely ignores all other patterns. This discards potentially useful training signals from other rollouts. An ablation exploring "soft" inclusion (e.g., weighting each pattern group by its verifier score) would be useful to support the hard best-only design.
3. There is a lack of statistical significance reporting. Evaluation relies on Pass@1 averaged over 4 samples per problem, decoded at a non-zero temperature (T=0.6). Results are reported as single numbers without variance or confidence intervals. This reduces the impact of the findings, especially since several gains on stronger models are less than 1 point (e.g., +0.8 on Qwen3-8B) and could be statistical noise.
4. Despite discussing patterns like “employing tools,” the experiments are confined to math/science QA with textual reasoning. There is no evaluation on tasks that require external tools (e.g., code execution, retrieval, calculators), task decomposition, or search, so it is unclear whether pattern effects and GPSO’s gains persist in those domains.
1. Can you report variance or confidence intervals for the results in Table 1?
2. How does GPSO's performance compare to a simpler inference-time selection baseline? For example, prompting the base model with all n (3) patterns, sampling m times, and using majority voting for example to select the final answer? This would help isolate the gains from the RL training itself versus the multi-pattern ensemble.
3. What is your intuition for why a "soft inclusion" of other reasoning patterns (e.g., a weighted mix based on verifier scores) would not outperform the "best-only" hard selection strategy?
4. Have you measured the intrinsic problem -> pattern mapping accuracy (e.g., probability of generating the correct pattern without an explicit prompt) before and after GPSO training? |
Moderately AI-edited |
|
Group Pattern Selection Optimal: Let LRMs Pick the Right Pattern for Reasoning |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The authors present a method that incorporates a set of handcrafted reasoning patterns when fine-tuning LRMs with RLVR. To do so, they include the pattern in the prompt and keep only the best pattern, according to success rate, for the advantage calculation. They additionally mask out the pattern suffix so that it does not contribute to the updates.
- Evaluate the performance of the approach in a variety of math reasoning domains
- Perform an ablation study of each component of their method to show the contribution
- GRPO with a sample-equivalent N (e.g., N = num_patterns × num_samples per pattern) is not compared against as a baseline
- Patterns are handcrafted per task and can be viewed as a basic form of prompt optimization [1-5], which has a rich history of approaches and is not compared against. The novelty of the proposed approach is thus unclear.
References
[1] DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
[2] RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning
[3] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
[4] A Systematic Survey of Automatic Prompt Optimization Techniques
[5] Prefix-Tuning: Optimizing Continuous Prompts for Generation
See Weaknesses above. |
Fully human-written |
|
Group Pattern Selection Optimal: Let LRMs Pick the Right Pattern for Reasoning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
GPSO is a reinforcement-learning framework that lets a model try several high-level reasoning patterns on the same problem and then pick the best one using verifier signals. It extends GRPO with multi-pattern rollouts and an attention-masking trick so the model doesn’t just memorize pattern suffixes but actually learns when to use which pattern. Their analysis shows current models often choose a suboptimal pattern, and GPSO learns a problem→pattern mapping that reduces this mismatch. The authors also conduct some experiments to show the performance of their method.
1. The method explicitly models diverse reasoning patterns instead of forcing one dominant pattern.
2. The method uses verifier signals to select the best pattern per problem, improving sample efficiency and final accuracy.
1. The paper should include a baseline that uses the original GRPO but with a comparable (i.e., larger) number of rollouts/samples per batch, to show that GPSO’s gains are not just from extra sampling.
2. The procedure for constructing the pattern set needs to be spelled out: how are the patterns obtained, filtered, and validated, and what evidence do we have that this set is sufficiently diverse for the target domains?
3. The motivation for “selecting” a single best pattern is unclear; please clarify why optimizing all correct patterns isn’t preferable, since learning to solve a problem via multiple patterns may improve robustness and generalization.
4. Please discuss applicability to stronger/larger models: do they still benefit from explicit pattern conditioning, or do their naturally diverse trajectories make GPSO less useful at scale?
Please refer to the weaknesses. |
Moderately AI-edited |
|
Learning Task-Invariant Features in VLMs via Dynamic Bayesian IRM |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper applies Bayesian Invariant Risk Minimization (BIRM) to address the performance degradation of vision-language models (VLMs) in out-of-distribution (OOD) tasks. By introducing a dynamic regularization weight into the original BIRM loss, the proposed method—Dynamic BIRM—enhances model robustness under OOD conditions compared to baseline BIRM approaches.
1. Improving the robustness of VLMs is an important and timely research direction. This paper presents a solid case study demonstrating the application of BIRM within the VLM context.
1. Since the motivation of this work lies in dynamically adjusting the regularization term, it would be beneficial to include empirical analyses showing how the decay of the regularization weight affects model performance.
2. Sections 4.1 and 4.2 both serve as preliminaries; it would improve readability and narrative flow to move these into Section 3.
3. The experimental scope appears limited. Given that the paper’s primary contribution is to improve VLM performance on OOD tasks, evaluating only an OCR dataset is insufficient. It would strengthen the work to include additional OOD benchmarks—such as medical QA, chemical reasoning, or other domain-specific tasks. Furthermore, the authors could consider generating synthetic OOD tasks by applying perturbations to visual or textual inputs to better validate the robustness claims.
1. In Section 3.3 (Preliminary on BIRM), the paper states that the model can be represented by $h_u$ and $g_w$. However, Equation (5) does not clearly incorporate these terms. Please clarify how $h_u$ and $g_w$ are used within the objective function and their connection to Equation (5).
2. The experimental setup is unclear. Please explicitly specify which datasets or benchmarks were used for evaluation, and clearly distinguish between in-distribution (ID) and out-of-distribution (OOD) tasks.
3. Dynamic BIRM consistently underperforms on ID tasks. Could the authors provide an explanation or discussion of this trade-off between ID and OOD performance? |
Moderately AI-edited |
|
Learning Task-Invariant Features in VLMs via Dynamic Bayesian IRM |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the important problem of distribution shift in Visual Language Models (VLMs), where models fail when encountering out-of-distribution task types not seen during training. The authors propose Dynamic BIRM, which extends Bayesian Invariant Risk Minimization to generative VLMs by (1) reframing autoregressive generation as sequential classification, (2) defining environments based on task types, and (3) introducing a dynamic regularization coefficient that adaptively adjusts during training to prevent regularization decay. Experiments on LLaVA-OneVision with SmolVLM-2B show +33.8% absolute improvement in CEE Score on OOD OCR tasks while maintaining in-domain performance.
- Task-based distribution shift is a real challenge for deployed VLMs
- First to adapt Bayesian IRM to generative multimodal models
- +33.8% improvement in CEE Score is substantial
- No comparison to CORAL, meta-learning, or other domain generalization methods
- No systematic study of hyperparameter sensitivity or alternative design choices
- Lacks formal justification for task-based environments and when dynamic adjustment helps
- 30-40% overhead is significant with no cost-benefit analysis
- Why only OCR for OOD evaluation? Have you tested on other held-out task types (e.g., mathematical reasoning, chart understanding)?
- How does performance scale with dataset size? Your training set is quite small - do gains persist with larger datasets?
- What about automatic environment discovery? Can task types be discovered automatically rather than manually specified? |
Fully AI-generated |
|
Learning Task-Invariant Features in VLMs via Dynamic Bayesian IRM |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Dynamic BIRM, an adaptive variant of Bayesian Invariant Risk Minimization for generative VLMs, where “environments” are defined by task types (e.g., VQA, captioning, OCR). It claims improved OOD robustness on a small LLaVA-OneVision split using SmolVLM-2B, chiefly on OCR, by dynamically tuning the invariance penalty during training.
1. The paper targets a timely and meaningful goal: learning task-invariant representations for VLMs under task-based distribution shift.
2. Framing VLM generation as token-level classification to port BIRM and proposing a dynamic penalty schedule is a reasonable direction with potentially useful intuition.
1. The Introduction lacks citations, leaving prior work and novelty unclear.
2. Narrative redundancy with insufficient exposition of the proposed method.
3. Citation practices are non-standard and inconsistent.
4. Experiments are insufficient: too few baselines, limited models, and narrow evaluation.
1. Core strands in DG/IRM and multimodal robustness aren’t cited where claims are made, making it hard to judge the incremental contribution or distinguish it from standard baselines. There is no citation in the entire Introduction, and the paper appears to have been polished with an LLM, since abbreviations are redefined (e.g., VLM, ERM).
2. The method section gives a verbal, four-step schedule for $\lambda$ (Eq. 8–10) but no pseudo-code or algorithm box, complexity, or stability analysis beyond prose; definitions of $q_u$, $q_e$ are stated but their approximations are not concretely specified.
3. Mixed formatting, uneven venue info, and in-text citation gaps reduce credibility; the bibliography should follow a consistent style and anchor each claim to canonical prior work. See, for example, “Ghosh et al. (2025)” in Section 2.2 and “Wang et al. (2025)” in Section 2.3.
4. Omits standard DG baselines, evaluates mainly on a single small model and narrow OOD setting, and lacks robust ablations or stronger evaluations. For example, baselines are only ERM and static BIRM, but no standard DG competitors, limiting the strength of the claim. Moreover, evaluation uses a single small model (SmolVLM-Base-2B) and defines OOD solely as OCR; the train/test sizes are very small (3,200 train), restricting generality. |
Moderately AI-edited |
|
Learning Task-Invariant Features in VLMs via Dynamic Bayesian IRM |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses distribution shift in Visual Language Models (VLMs), where traditional ERM fails to learn task-invariant features. The authors adapt Bayesian IRM for generative VLMs and propose Dynamic BIRM, which dynamically adjusts the invariance penalty during training. Experiments on LLaVA-OneVision show substantial improvements on out-of-distribution tasks, particularly OCR, while maintaining or enhancing in-domain performance.
1. Introducing invariant learning methods to address out-of-distribution generalization challenges in large generative VLMs is an interesting and promising direction.
2. The paper is well organized and clearly written, making it easy to follow and understand.
1. The contribution of this work appears somewhat limited. The core novelty lies in introducing Bayesian Invariant Risk Minimization (BIRM) to generative VLMs and designing a dynamic balancing weight for the BIRM objective.
2. There exist many methods in the invariant learning literature (e.g., IRM-IB [1], MRI [2], [3]). The rationale for specifically adopting Bayesian IRM for generative VLMs requires further clarification. Including comparisons with other invariant learning approaches could strengthen the evaluation and highlight the advantages of the proposed method.
3. The proposed method is evaluated on only one dataset and one VLM model. Evaluating it across additional datasets and models (e.g., Qwen3-VL) would enhance the robustness of the results and better demonstrate the general effectiveness of the approach.
4. Only a single out-of-distribution setup (OCR task as OOD, others as in-distribution) is considered. Evaluating additional OOD scenarios (e.g., VQA task as OOD while others are ID) would provide a more comprehensive assessment of the method’s generalization capabilities.
[1] Ahuja, K., Caballero, E., Zhang, D., Gagnon-Audet, J. C., Bengio, Y., Mitliagkas, I., & Rish, I. (2021). Invariance principle meets information bottleneck for out-of-distribution generalization. Advances in Neural Information Processing Systems, 34, 3438-3450.
[2] Huh, D., & Baidya, A. (2022). The Missing Invariance Principle found--the Reciprocal Twin of Invariant Risk Minimization. Advances in Neural Information Processing Systems, 35, 23023-23035.
[3] Montasser, O., Shao, H., & Abbe, E. (2024). Transformation-invariant learning and theoretical guarantees for OOD generalization. Advances in Neural Information Processing Systems, 37, 108649-108673.
No. |
Lightly AI-edited |
|
Diffusion Aligned Embeddings |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper explores a new representation learning framework that models data as a stochastic diffusion or random walk process, formulated through a continuous-time Markov chain (CTMC). The authors aim to capture intrinsic relationships among data points beyond static embeddings by treating representation learning as a dynamic probabilistic evolution over time. Overall, the paper presents a clear narrative and the proposed framework is intuitively appealing. However, the study appears insufficiently mature for ICLR. While the topic is conceptually relevant, there are notable issues with scope, technical depth, and experimental completeness that weaken its overall contribution.
S1. The paper is logically structured, with a clear narrative and motivation.
S2. The proposed problem is meaningful and lies within the broader scope of the venue.
S3. The writing is fluent and the paper is easy to read.
**Concerns**
C1. While representation learning is indeed a core theme at ICLR, this paper appears to align more closely with scientific or interdisciplinary journals rather than a machine learning conference. The reference list includes almost exclusively Nature or Science publications, with no ICLR or ICML papers cited. This suggests a lack of engagement with the ML research community. Furthermore, the most recent related work cited in Section 1 dates back to 2021, raising concerns either about insufficient literature review or about the maturity and saturation of the studied problem. The authors are expected to better clarify how this work advances the state of machine learning beyond the existing literature.
C2. Although the paper’s ideas are clearly articulated, the technical core appears relatively shallow. The proposed method resembles a random-walk process enhanced with continuous-time Markov modeling, but lacks substantial theoretical innovation or algorithmic novelty. It would be better if the authors emphasized what makes their formulation non-trivial—for example, any novel mathematical insights, optimization challenges, or new theoretical guarantees. Otherwise, the approach risks being perceived as an incremental reformulation of well-known stochastic models.
C3. The experiments are currently insufficient to convincingly support the paper’s claims. It would be better if the authors addressed these issues, which would significantly improve the paper’s credibility and completeness:
C3-1. The experimental design is limited in diversity—more datasets or settings would strengthen generality.
C3-2. The evaluation metrics are not clearly explained.
C3-3. The discussion of results is brief or missing, lacking qualitative or interpretive analysis that explains why the proposed method performs as it does.
C3-4. Ablation studies or sensitivity analyses could help clarify the contribution of individual components.
C3-5. No runtime or complexity analysis is provided to demonstrate the practicality of the approach.
Please respond to C1, C2, and C3. |
Fully AI-generated |
|
Diffusion Aligned Embeddings |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes to tackle the problem of estimating low-dimensional embeddings by aligning a diffusion process defined on the graph spanned by the data points in the low-dimensional (target) space with the diffusion process on the graph spanned by the same points in the (given) high-dimensional space. They propose to optimize this alignment through the Path-KL divergence and provide theoretical guarantees for this optimization.
- The theoretical idea of optimizing through Path-KL seems interesting and novel, albeit the practical implementation of it seems unclear (see Weaknesses).
- The experimental part is sloppy at best and **lacks any mention or discussion of the actually achieved results** – what is the performance in terms of reconstruction quality of DAE against existing methods, and what can we draw from that?
- The **reporting of experimental results on their method is missing**; DAE is not to be seen in Figure 1. **There are no further figures or tables in the main paper**. Also in the Appendix, there are **no reported metrics on the real-world datasets**.
- Even for competitors, it is unclear what the achieved results show – the metrics were averaged across datasets, which seems odd. In the Appendix, they instead provide mean and std across different metrics, which also does not make sense. The achieved **results per metric and per dataset** should be shown for each method, ideally using the standard benchmark metrics from the field.
- The work misses a proper discussion and comparison to relevant related work, in particular works that are targeted to (very) large-scale datasets and efficiency [1-3], as well as variants that improve the reconstruction quality compared to existing works, e.g. [4], or explicitly reconstruct local and global properties at once [5]. [1-3] should scale (more) effectively to the 5 million point dataset. [4] sets a SOTA performance for reconstruction quality.
[1] Tang, J et al. *Visualizing Large-scale and High-dimensional Data.* WWW 2016.
[2] Artemenkov, A, Panov, M. *NCvis: Noise contrastive approach for scalable visualization.* WWW 2020.
[3] Linderman, G et al. *Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data.* Nature Methods 2019.
[4] Narayan, A, et al. *Assessing single-cell transcriptomic variability through density-preserving data visualization.* Nature Biotechnology, 39(6):765–774, 2021.
[4] Kobak, D, Berens, P. *The art of using t-sne for single-cell transcriptomics.* Nature Communications, 10(1):1–14, 2019.
[5] Kury, N et al., *DREAMS: Preserving both Local and Global Structure in Dimensionality Reduction*, arXiv:2508.13747, 2025.
- You claim that through the diffusion you get both local and global properties of the data in the embedding. How much does your **choice of restricting to the kNN graph influence the results, which were derived without that constraint**?
- How do you get the stationary distribution $\pi_Q$?
- The final **optimization objective (Eq. 13) looks eerily close to the tSNE objective**, with $Q_{ij}$ even being a kernel function using l2 distances the same way that tSNE does. Could you elaborate on what the difference is?
- I know the Student-t kernel. **What is the "student -t kernel"?**
- What are the results for **each method (including yours and above mentioned SOTA), each metric (not aggregated across metrics) and each dataset (not aggregated across datasets)**? Almost all of this is missing in the paper. |
Fully human-written |
|
Diffusion Aligned Embeddings |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces DAE (Diffusion-Aligned Embeddings), a dimensionality-reduction framework that aligns continuous-time diffusion CTMC path distributions between the high-dimensional data graph and a learned low-dimensional graph by minimizing a Path-KL relative entropy rate between path laws. The authors prove two main theoretical guarantees: generator closeness (small Path-KL implies the two generators are close in a weighted operator norm) and semigroup closeness (closeness of the diffusion operators across timescales). They also provide a computationally efficient algorithm for this objective. Empirically, DAE is compared to UMAP, t-SNE, PaCMAP, PHATE and TriMAP on several single-cell RNA-seq datasets using the ZADU framework and the results show consistent improvement in preserving both local and global structure.
Formulating embedding as alignment of CTMC path distributions is a good idea that unifies local and global preservation through a single probabilistic objective (Path-KL), rather than ad-hoc balancing terms.
The paper’s theoretical development is one of its strongest aspects. The paper contains nontrivial, technically meaningful theorems linking Path-KL to generator and semigroup closeness.
The experiments on multiple realistic single-cell RNA-seq datasets are appropriate for the paper’s target application and show measurable gains.
The theoretical results assume an irreducible generator and, implicitly, accurate knowledge of high-dimensional generator Q and stationary π. In practice, the method builds a sparse k-NN graph; the paper does not fully quantify how approximations introduced by sparse graph construction affect the Path-KL bounds. The gap between theory (dense/ideal generator) and sparse practical graph is not fully characterized.
Key algorithmic parameters (kernel K and its bandwidth, α controlling repulsion weight, nneg, importance sampling schedule, Pmax normalization) are only lightly discussed. Ablations showing sensitivity and guidance for practitioners are missing.
All experiments are on single-cell RNA-seq datasets. While that’s an important and challenging domain, it’s quite specific. The authors claim DAE is a general embedding method for any high-dimensional data, but there’s no evidence it works outside biology. Comparisons to broader synthetic manifolds or vision/text representations (where ground truth geometry is known) would strengthen generality claims.
The paper compares against UMAP, t-SNE, PHATE, PaCMAP, and TriMAP, but it explicitly says it used default parameters.
This reduces my confidence in DAE, as these methods are very sensitive to parameters like perplexity or neighbor count. DAE might look better simply because the baselines weren’t tuned.
On page 15, a figure appears to be missing. In addition, the final figures are not discussed in the text, so their purpose and key message are unclear. A discussion interpreting these figures should be added to help readers understand their relevance.
A minor point: the last two figures significantly increase the PDF file size, making it difficult to open. Consider optimizing or resizing them to improve accessibility.
1) Theorems 2.2–2.3 assume an irreducible generator Q with stationary distribution π, but the implementation uses a sparse k-NN graph. How sensitive are the Path-KL bounds to this sparsification? Could you provide empirical evidence or an analytical bound on the error introduced by using sparse Q?
2) Have you conducted parameter sweeps over α, nneg, or kernel bandwidth? If so, can you share sensitivity plots or ranges where performance remains stable? Could you provide practical guidance for choosing Pmax and the importance-sampling schedule across different dataset scales?
3) Have you evaluated DAE on synthetic manifolds (e.g., Swiss roll, S-curve) or vision/text embeddings to confirm its generality? Can you discuss more the Figures in the appendix?
4) Why were the baselines run with default hyperparameters instead of tuned settings per dataset? Could you provide a supplementary tuning experiment (e.g., varying n_neighbors or perplexity) to show that DAE’s improvement persists under optimized baselines? |
Moderately AI-edited |
|
Diffusion Aligned Embeddings |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a new method for dimensionality reduction that relies on matching the generator matrices of CTMC built on (1) a graph built upon the high dimensional data and (2) a graph built upon the low-dimensional data.
A key insight of this paper is to minimize the KL divergence between the path distributions (equivalent to the relative entropy rate), which should enable multi-scale representation of the data (because it's matching these distributions at all times t). The paper also proposes an efficient algorithm to compute the embeddings.
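For reference, my understanding is that the objective reduces to the standard relative entropy rate between CTMC path laws: for generators $Q$ (data graph) and $\tilde{Q}$ (embedding graph) on a shared state space, with $\pi$ the stationary distribution of $Q$, this rate takes the form (in my notation, which may differ from the paper's)

$$
\mathcal{H}\big(\mathbb{P}_{Q}\,\|\,\mathbb{P}_{\tilde{Q}}\big) \;=\; \sum_{i}\pi_i \sum_{j\neq i}\Big[\,Q_{ij}\log\frac{Q_{ij}}{\tilde{Q}_{ij}} \;-\; Q_{ij} \;+\; \tilde{Q}_{ij}\Big].
$$

If this is indeed the quantity being minimized, stating it explicitly early on would make the later derivations easier to follow.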
The authors compare the performance of their method on 5 different datasets, although it is not clear that results for all datasets are actually presented in the paper.
- This paper proposes an interesting approach to avoid the problem of choosing a diffusion time (which is an important hyper-parameter in diffusion-based dimensionality reduction methods).
- The authors designed an efficient computational algorithm to compute the embeddings, enabling it to scale to millions of data points (225 seconds for 9.5 M points in 50 dimensions).
- Although most of Section 2 is well presented and easy to follow, the authors fail to nail down the critical equivalence between Eq (4) and Eqs (6-7). If I understand correctly, this is the key to the paper, as Eq 4 gives the general motivation and Eqs 6-7 provide a computationally tractable way of minimizing that objective.
- Q is effectively the Laplacian of the graph; this should be stated in the text.
- Another recent work also uses heat-diffusion on a graph for dimensionality reduction [1]. In particular, that work showed that directly using the heat kernel (the matrix exponential of the Laplacian), already corresponds to combining diffusion operators at multiple scales. However, this is not discussed in the paper.
- The experiment section should contain more results in the main text. The text suggests there are 5 datasets but Figure 1 seems to only be on a single dataset with different values of k. It's not even clear what dataset this refers to. I encourage authors to clarify the results.
- Also in the results, you should report DAE as its own bar, as the current presentation erases the stochasticity of the method. You should also not normalize to 1, so that we can appreciate the difference in performance across multiple k. In particular, it's pointless to improve upon t-SNE for k<30 if absolute performance is worse in that regime than for k>30.
- As per the presented results, there is no convincing evidence that their method is better than t-SNE.
[1] Huguet, Guillaume, et al. "A heat diffusion perspective on geodesic preserving dimensionality reduction." Advances in Neural Information Processing Systems 36 (2023): 6986-7016.
- Could you please clarify the link between Eq (4) and Eqs (6-7), as I pointed to above?
- Could you elaborate on the connection with the work in [1], as referred above?
- Could you incorporate more quantitative results (and at least one convincing qualitative result) in the main text?
- Figure 1 should be refactored completely. You should report DAE as its own bar, as the current presentation erases the stochasticity of the method. You should also not normalize to 1, so that we can appreciate the difference in performance across multiple k. In particular, it's pointless to improve upon t-SNE for k<30 if absolute performance is worse in that regime than for k>30.
- If possible, it would be great to show a dataset where DAE outperforms tSNE.
- An important contribution of this paper is the efficient algorithm for computing the embeddings. However, there are no results comparing the computational performance of the different methods. I encourage the authors to provide such comparisons. |
Fully human-written |
|
Diffusion Aligned Embeddings |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Diffusion-Aligned Embeddings (DAE), a novel diffusion-based dimensionality reduction method that aligns diffusion processes between high- and low-dimensional spaces. The key idea is to use the Path-KL divergence to align diffusion generators. By minimizing this divergence, DAE formally guarantees closeness between the high and low dimensional diffusion semigroups across scales. The authors derive theoretical bounds on generator and semigroup preservation and propose an efficient, parallelizable optimization algorithm with unbiased stochastic gradients.
The paper presents a rigorous formulation of dimensionality reduction based on aligning diffusion generators, rather than fixing a single diffusion timescale. The use of the Path-KL divergence is a novel and elegant idea that provides formal multiscale preservation guarantees and removes the need to choose a specific diffusion time parameter t, a common limitation in methods like PHATE and diffusion maps. The theoretical analysis is technically sound and clearly motivated, connecting generator alignment to semigroup closeness with well-defined mathematical guarantees. The proposed optimization framework is well-engineered and scalable, demonstrating practical feasibility for large datasets.
A second surprising fact is that the authors provide no visualizations in the main text, making it difficult to assess whether the embeddings preserve meaningful global geometry or suffer from distortions. Moreover, the metrics they use are obscure. They should use manifold affinity preservation as in PHATE.
The embedding looks somewhat similar to tSNE in the one figure in the appendix. This is not surprising: tSNE also performs a limited version of matching a diffusion process. Conceptually, matching diffusion processes via generator alignment may not guarantee geometric faithfulness. Because diffusion probabilities are locally normalized, embeddings that differ by large-scale rescaling or local contraction/expansion can produce equivalent diffusion behavior. Thus, the method can preserve relative diffusion dynamics while arbitrarily distorting the manifold geometry.
The paper omits direct comparison to methods like PHATE, multiscale PHATE, and diffusion maps, despite positioning itself as a diffusion-based embedding method. This omission is surprising, as multiscale PHATE also models multiscale diffusion structure and is a clear conceptual predecessor. The authors should include both in the related work and in additional experimental comparisons. Another point of comparison would be the Heatgeo embedding [Huguet et al., NeurIPS 2023], which unifies diffusion-based methods like PHATE, tSNE, and diffusion maps.
The paper would benefit from a dedicated background section explaining key concepts such as diffusion maps, diffusion distances, and continuous-time Markov chains before introducing the Path-KL objective. Important prior work like PHATE (and its variants) and diffusion maps should be summarized in a background or related work section to clarify how DAE builds on or differs from existing diffusion-based methods. The current exposition moves quickly into derivations without establishing sufficient mathematical or conceptual background, making it difficult for readers unfamiliar with diffusion geometry to follow. Figure 2 should be in the main body of the paper. It should also include comparisons to PHATE.
How do you ensure that preserving the diffusion process also preserves geometry?
Why don't you show comparisons to other diffusion-based methods? |
Fully human-written |
|
Membrane Potential Perturbation Dynamic Is Total Variation |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper reframes a heuristic SNN stabilization mechanism (MPPD) into a rigorous TV theory, generalizes it via a new ℓ₁-based variational framework, and validates its advantage in both adversarial and noisy environments.
#### **1. Research Problem**
* Spiking Neural Networks (SNNs) are vulnerable to adversarial and noisy perturbations that destabilize their dynamics. The existing *Membrane Potential Perturbation Dynamic* (MPPD) technique empirically improves robustness but lacks solid theoretical grounding.
* The paper tries to reveal the mathematical nature of MPPD and how it can be formalized to enhance SNN robustness in a principled way.
#### **2. Proposed Method**
* The authors prove that MPPD is mathematically equivalent to Total Variation (TV)
* Based on this equivalence, they introduce the MPPD–TV–ℓ₁ framework, extending prior ℓ₂-based formulations (MPPD–TV–ℓ₂).
#### **3. Theoretical Contributions**
* Rigorous proof that MPPD is TV under measurable perturbations.
* Establishment of a new TV–ℓ₁ regularization theory for SNNs, encompassing:
* The coarea formula specific to SNN membrane dynamics.
* A dominated TV property showing layer-wise boundedness of perturbations.
* A closed-form subgradient that enables backpropagation through non-smooth TV terms.
#### **4. Experimental Contributions**
* Across CIFAR-10 and CIFAR-100, the MPPD–TV–ℓ₁ model outperforms both the ℓ₂ variant and other baselines under Gaussian noise and adversarial attacks (FGSM, PGD, CW, AutoAttack).
* Demonstrates higher resistance to increased attack intensity and step size, confirming TV–ℓ₁’s superior denoising behavior.
### **1. Theoretical Contributions**
Overall, this paper provides a theoretically sound bridge between signal variation analysis and adversarial robustness in SNNs, offering a new mathematical perspective for neuromorphic robustness theory with clear derivations.
* Introduces a novel reinterpretation of membrane potential perturbation dynamics (MPPD) as a form of total variation (TV), unifying biological spiking dynamics and variational regularization theory in an original analytical framework.
* Transforms prior MS-MPPD regularization (previously heuristic) into a mathematically grounded TV–ℓ₂ model and further generalizes it into the TV–ℓ₁ formulation, expanding the functional space of admissible membrane potentials and enabling sharper perturbation modeling.
* Rigorously establishes formal results, including the coarea formula for spiking potentials, the dominated TV property linking layerwise stability to weight norms, and a closed-form subgradient for optimization without additional computational cost.
### **2. Experimental Contributions**
Experimental result supports that total variation regularization can serve as a universal principle for temporal–spatial robustness in SNNs.
* Conducts extensive controlled experiments on CIFAR-10 and CIFAR-100 using both VGG11 and WRN16 backbones under Gaussian and adversarial training, comparing six state-of-the-art SNN methods.
* Demonstrates consistent and often substantial gains in adversarial accuracy, validating that MPPD-TV–ℓ₁ effectively suppresses perturbations across attack intensities and steps.
* Shows that the closed-form subgradient introduces no measurable computational overhead while maintaining compatibility with mainstream deep learning frameworks.
Overall, this paper provides rigorous theoretical analysis and effective experimental results to demonstrate the theoretical foundations of MPPD and offer a more complete version. I particularly appreciate the paper's rigorous treatment of pulse discontinuities, which is rare but meaningful in the SNN field.
1. My main concern is that MPPD has not yet become a mainstream method for SNNs, and this paper is almost entirely based on this premise, which limits its broader impact. I am not sure how interested most people in the SNN field are in it. For example, can the techniques for handling spike discontinuities in the paper be generalized to other SNN research?
2. Please add some missing citations. Some related work also focuses on smoothing membrane potential perturbations under adversarial attacks using different methods, such as dynamic thresholding (https://arxiv.org/pdf/2308.10373) and stochastic gating (https://ojs.aaai.org/index.php/AAAI/article/view/27804).
3. In Theorem 4, the assumption that "every node i in layer L uses the same set of preceding nodes in layer L-1" does not hold for sparse or skip-connected SNN architectures. Relaxing this constraint would enhance generality.
1. Can the techniques for handling spike discontinuities in the paper be generalized to other SNN research?
2. Please add some missing citations. |
Fully human-written |
|
Membrane Potential Perturbation Dynamic Is Total Variation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper establishes a theoretical foundation for Membrane Potential Perturbation Dynamics (MPPD) in spiking neural networks (SNNs), proving that MPPD corresponds to Total Variation. The authors propose a new framework that improves robustness to adversarial perturbations compared to the existing MPPD model. Experimental results on CIFAR-10 and CIFAR-100 demonstrate that MPPD achieves superior accuracy and robustness under various adversarial attacks.
- The paper provides a clear mathematical link between MPPD and total variation, offering the first formal theoretical explanation for an empirically effective mechanism in SNN robustness.
- Experimental results on CIFAR-10 and CIFAR-100 demonstrate that MPPD achieves superior accuracy and robustness under various adversarial attacks.
- The experiments are limited to CIFAR-10 and CIFAR-100, which are small-scale image datasets. Experiments on larger-scale datasets and neuromorphic datasets are encouraged.
- Although the paper mentions efficiency, there is no detailed analysis of training time and gradient stability.
- The paper does not report clean test accuracy alongside adversarial robustness results.
In the abstract, the authors state that “this finding may provide a new insight into the essence of perturbation characterization.” Could the authors clarify what specific insights are being referred to here? |
Moderately AI-edited |
|
Membrane Potential Perturbation Dynamic Is Total Variation |
Soundness: 4: excellent
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work builds on the concept of Membrane Potential Perturbation Dynamics (MPPD) as a method for enhancing the robustness of SNNs, particularly in the face of adversarial perturbations. The authors propose that MPPD can be framed as a Total Variation (TV) model and further develop a novel MPPD-TV-L1 framework, which they show improves the robustness of SNNs in adversarial environments. The proposed approach demonstrates superior performance over existing TV-L2 models on image classification tasks using the CIFAR-10 and CIFAR-100 datasets.
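For context, under the classical variational-denoising convention that the TV-ℓ₁ / TV-ℓ₂ naming suggests, the two models would differ in the data-fidelity norm applied to the perturbed membrane potential $f$ (my notation; the paper's formulation may differ):

$$
\min_{u}\ \mathrm{TV}(u) + \lambda\,\lVert u - f\rVert_{1} \quad (\text{TV-}\ell_1), \qquad \min_{u}\ \mathrm{TV}(u) + \tfrac{\lambda}{2}\,\lVert u - f\rVert_{2}^{2} \quad (\text{TV-}\ell_2).
$$

Clarifying early on whether the ℓ₁ refers to the fidelity term or to the TV term itself would help readers place the contribution.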
1: This work proves that MPPD is equivalent to TV, providing a strong mathematical foundation that underpins the proposed method.
2: The experimental setup is comprehensive, involving state-of-the-art methods and adversarial training schemes.
3: The motivation for extending the existing TV-L2 framework to TV-L1 is well-articulated.
4: The proposed framework has clear practical implications for improving the security and reliability of SNNs.
A major concern is the incremental novelty of this work. The MPPD-TV-L2 framework was already proposed in Ding et al. (2024), and this work introduces the MPPD-TV-ℓ1 framework. Furthermore, as shown in Figure 1, in the case of AT+Reg, the MPPD-TV-ℓ1 shows only minor (or no) improvement. This suggests that MPPD-TV-ℓ1 has a similar effect to the regularizer (Ding et al., 2022), which somewhat weakens the novelty and necessity of this work.
The writing and clarity of the paper can be improved. For example, the title may mislead readers into thinking that the paper proposes MPPD (which it does not). Additionally, there is no punctuation for $\epsilon$ in Equation (3), and the term "MS-MPPD" looks awkward in Equations (4) and (5).
See the weaknesses section for my major concerns.
In Table 1, the cases where MPPD-TV-L1 performs worse than MPPD-TV-L2 all occur with the FGSM and the APGD_DLR. It would be helpful if the authors could provide a theoretical or intuitive explanation for this behavior.
Could you include a comparison that shows the performance of ANNs in handling adversarial perturbations, to better highlight the relative robustness (if any) of SNNs? |
Lightly AI-edited |
|
Membrane Potential Perturbation Dynamic Is Total Variation |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents a theoretical analysis of the Membrane Potential Perturbation Dynamic (MPPD) in Spiking Neural Networks (SNNs). The authors' main contribution is framing MPPD as a form of Total Variation. This provides a theoretical foundation for MPPD.
The primary strength is the formal theoretical analysis offered for the MPPD approach, addressing a known gap in the field.
The method appears to achieve good performance, suggesting the theoretical insight translates into practical benefits.
A major issue is that Table 1 does not report the "clean" accuracy (performance without noise), making it impossible to evaluate the true cost of the denoising improvement.
The choice of the key parameter ζ in Section 4.1 is not explained. It is unclear if it was set arbitrarily, tuned for this work, or copied from another paper. If the compared papers in Table 1 used different ζ values, the comparison is misleading and this should be noted.
The preliminary discussion describes previous MPPD work in a discrete setting, but the proposed method uses a continuous formulation. The paper does not justify this shift or explain how the continuous form is compatible with or translates to the discrete SNN simulation.
The proposed TV loss might not fully capture the original MPPD behavior. Specifically, when a reset mechanism is involved, small perturbations that are insufficient to evoke a spike may be excluded from the loss calculation, potentially making the model less sensitive to certain types of noise.
In Equation 2.4, using a shorter minus sign (e.g., \text{-}) would improve visual alignment and readability.
The statement that previous work "lacks reliable explanations and theoretical foundation" (Line 55) is too strong, especially if the authors' own prior work is in the same theoretical domain. This should be phrased more precisely.
Why was the continuous formulation chosen for the proposed method when the preliminary MPPD description is discrete? How is the continuous formulation implemented or made compatible with the discrete-time dynamics of an actual SNN?
What are the clean (noise-free) accuracy scores corresponding to the results in Table 1? This is critical for assessing the performance-robustness trade-off.
How does the proposed TV loss account for sub-threshold membrane potential perturbations that do not lead to a spike reset? Does excluding these perturbations limit the model's sensitivity to low-intensity noise?
Minor: The use of bold font in the main text seems arbitrary and should be applied more consistently. |
Lightly AI-edited |
|
Exploiting Low-Dimensional Manifold of Features for Few-shot Whole Slide Image Classification |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper identifies a novel cause for overfitting in few-shot WSI classification: the distortion of low-dimensional feature manifolds by standard linear layers. The authors propose a plug-and-play "Manifold Residual (MR) block" that replaces these layers, using a fixed random matrix as a "geometric anchor" to preserve manifold structure and a separate low-rank pathway for task adaptation.
- The paper is built on a strong, clear, and insightful hypothesis. The diagnosis of overfitting as a geometric problem (i.e., manifold distortion by geometry-agnostic layers) rather than purely a data-scarcity problem is a novel and compelling contribution to the field.
- The core hypothesis is well-supported by quantitative analysis before the method is introduced. The use of spectral analysis to show low effective rank (Fig. 1) and tangent space analysis to demonstrate both the manifold's curvature and its distortion by standard linear layers (Fig. 1) provides a solid and convincing foundation for the proposed solution.
- Limited evaluation datasets. While many MIL methods are tested, the number of task types (classification only) and the number of organs (limited to 3) are quite low for demonstrating a robust method improvement.
- The tasks are also artificial few-shot tasks. These tasks (e.g., NSCLC subtyping) have thousands of data points, but the few-shot splits are artificially sampled. I recommend trying some real few-shot tasks, such as treatment response prediction. This type of task will always be few-shot in nature, and improving performance in this domain would carry tremendous benefit for the field, which is not true for the largely solved tasks of RCC and NSCLC subtyping. 10+ treatment response prediction tasks can be found at: https://huggingface.co/datasets/MahmoodLab/Patho-Bench
- Why was $r=64$ used for the main comparison tables, while it is shown that performance saturates at $r=32$ (Fig. 3)? Does this choice, which doubles the parameters of the LRP, potentially understate the true parameter efficiency and performance of the MR block at its optimal rank? It may be useful to report the $k=16$ results for MR-CATE with $r=32$ in the text to show that the SOTA performance holds at this more theoretically-motivated rank.
- Is a random matrix the optimal choice for preserving the specific, learned structure of a foundation model's feature manifold? For instance, would a fixed projection based on the principal components (PCA) of the training set features serve as a more "informed" (but still fixed) geometric anchor? |
Fully human-written |
|
Exploiting Low-Dimensional Manifold of Features for Few-shot Whole Slide Image Classification |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper tackles few-shot whole-slide image classification by arguing that the root cause of overfitting lies in a geometric mismatch between pretrained pathology features and downstream linear classifiers. The authors propose a Manifold Residual block, which introduces a random geometric anchor and a trainable low-rank residual path to preserve manifold structure while reducing model capacity. Experiments across several MIL backbones show accuracy improvements and parameter reductions. The paper positions itself as introducing a geometry-aware inductive bias for few-shot computational pathology.
1. The paper identifies a real and practically significant issue in computational pathology. The connection between feature geometry and data efficiency is conceptually interesting and relevant to current efforts in adapting large pretrained models for medical imaging.
2. The proposed MR block is lightweight, easy to implement, and compatible with a wide range of MIL backbones. It can be viewed as a structured parameter-efficient adapter.
3. The paper reports consistent accuracy gains across multiple models with substantial parameter reductions.
Major:
1. The paper attributes few-shot overfitting to the “destruction” of pretrained feature manifolds by downstream linear layers. This interpretation is not entirely convincing. Linear mappings are expected to reshape representations to achieve class separability, which is the very purpose of a classifier. The observed overfitting could instead result from limited data or excessive model capacity rather than geometric distortion. The causal link between ‘destruction’ and overfitting is not yet well established and could be further clarified with additional controlled experiments.
2. The proposed MR block introduces a fixed random matrix \(B\) as a geometric anchor. The t-SNE panel shows a non-trivial disagreement (~14%) in neighborhood structure. From an intuitive perspective, once the input features are multiplied by \(B\), much of the pretrained manifold structure and discriminative geometry are likely disrupted. Classifying on \(XB\) rather than on the original \(X\) would likely reduce performance. Even with sufficient data, a full-rank MIL training setup might not learn to counteract this direct perturbation of pretrained features, let alone a low-rank adaptation like LoRA. In contrast, linear layers transform pretrained features into a task-relevant space in a data-driven manner. Injecting random noise in this way fits more closely with the definition of "destruction" than "preservation." (A minimal sketch of the kind of neighborhood-overlap measurement I have in mind is given after this weaknesses list.)
3. It is not entirely clear why extreme few-shot WSI classification is a key constraint here. The computational bottleneck typically lies in patch-level feature extraction and pretraining, not in the downstream classifier. Moreover, pretrained slide-level feature extractors such as TITAN already exist, which weakens the motivation for emphasizing few-shot adaptation at the classifier level.
Minor:
1. The finding that pretrained pathology features exhibit low-dimensional manifold structure is broadly consistent with prior work in vision and contrastive representation learning. Classification layers are expected to transform pretrained features into spaces that better align with downstream tasks, which naturally alters the geometry.
2. The reported performance gains may primarily arise from substantial parameter reductions. The current experiments do not separate this effect from the claimed geometric contribution, making it difficult to assess which factor drives the improvement.
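To make the neighborhood-structure concern in major weakness 2 concrete, below is a minimal sketch of the kind of k-NN overlap measurement I have in mind; the feature matrix, its dimensions, and the random matrix \(B\) are synthetic placeholders rather than the paper's data or setup.

```python
# Minimal sketch: how much does a fixed random projection change local neighborhoods?
# All data here is synthetic; replace X with actual pretrained features to run the check.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 512))                 # stand-in for pretrained patch/slide features
B = rng.standard_normal((512, 512)) / np.sqrt(512)   # fixed random "anchor" matrix

def knn_indices(Z, k=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(Z)
    idx = nn.kneighbors(Z, return_distance=False)
    return idx[:, 1:]                                 # drop each point itself

before, after = knn_indices(X), knn_indices(X @ B)

# fraction of k-NN neighbours that change after applying B (cf. the ~14% figure above)
disagreement = np.mean([1 - len(set(a) & set(b)) / before.shape[1]
                        for a, b in zip(before, after)])
print(f"k-NN disagreement: {disagreement:.1%}")
```

Reporting such an overlap on the actual foundation-model features, before and after \(B\) and after the full MR block, would make the preservation/destruction claim quantitative rather than purely visual.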
1. Could the authors disentangle the improvement due to parameter reduction from the claimed geometric preservation? A control experiment using an equally sized models would clarify this.
2. Does the MR block help when training data is not extremely limited? This would clarify whether the method primarily acts as a regularizer rather than a geometry-preserving transformation.
3. Please refer to the other weaknesses for additional concerns. |
Moderately AI-edited |
|
Exploiting Low-Dimensional Manifold of Features for Few-shot Whole Slide Image Classification |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors introduce a novel layer to preserve low-dimensional manifold geometry within modern Multiple Instance Learning (MIL) frameworks for few-shot classification of whole slide pathology images. They first show that embedding spaces from well-known pathology foundation models manifest low-dimensional manifold geometries and that these are not well preserved in popular attention-based MIL frameworks. The authors show that this tends to be due to linear layers, such as those present within the gated-attention mechanism used in ABMIL, and they propose a novel layer, called the Manifold Residual (MR) block, to better preserve geometry. The latter is decomposed into 2 parts operating on a feature matrix $X$: (1) a fixed random matrix linearly transforming $X$, useful to preserve topology; (2) a trainable low-rank residual pathway (LRP). The authors study the theoretical properties of their method, demonstrate its relevance for improving many MIL models on few-shot classification tasks, and perform a range of ablation studies.
- The authors propose a relevant analysis to emphasise low-dimensional manifold properties of a range of foundation models for pathology.
- Propose a novel layer for few-shot MIL, the MR block with a custom training strategy.
- They provide theoretical results on a range of geometric/statistical properties preserved by perturbations by random matrices.
- Demonstrate a universality approximation theorem for the MR block.
- Show on 3 datasets that using MR blocks instead of linear layers within 5 MIL frameworks improves few-shot WSI classification, while leveraging 3 different types of pretrained models (CONCH, UNI, ResNet50).
- Perform ablations on the 2 parts included within the MR blocks, which tend to show that coupling these 2 parts brings the best performance.
- Conduct several sensitivity analyses, questioning where to replace linear layers with MR blocks within ABMIL, how to initialize the MR blocks, and whether the MR blocks are robust to their rank hyperparameter.
- **W1 : clarity** There are several points in the paper that would benefit from clarification and/or further detail:
- a) L63: "linear layer". For readers who know the MIL literature, it is not clear at this stage which linear layers are being referred to, e.g., those included in the gated-attention layer of ABMIL or the linear classifier at the end of the architecture, which can also have an effect. This should be clarified.
- b) The dataset used for the geometric studies reported in Fig 1 and 5 is never mentioned.
- c) Figure 2: The box "MIL model" explaining how the MR blocks are supposed to intervene is really not clear.
- d) Section 2.2: it is not clear to me why more generic few-shot learning paradigms/literature (see e.g [A]), applicable to any bag representations in MIL is omitted in the related work. For instance well-known prototypical neural networks [B] could be applied as a readout within ABMIL instead of a linear layer and their inherent dependence to distances could be a good proxy to preserve geometric properties. This observation underlines that it is not clear in the paper why different readout strategies are not discussed. I invite authors to do so during the rebuttal.
- e) Section 3.1: While both instance-level MIL and embedding/bag-level MIL are mentioned in Section 2.1, Section 3.1 only formalizes bag-based approaches. It can be sufficient to mention that in Section 3.1 with a disclaimer that only bag-based methods are benchmarked in the paper.
- f) To improve the readability of most tables, I suggest the authors express everything in % instead of 0.x.
[A] Song, Y., Wang, T., Cai, P., Mondal, S. K., & Sahoo, J. P. (2023). A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Computing Surveys, 55(13s), 1-40.
[B] Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. Advances in neural information processing systems, 30.
- **W2: benchmarks and ablations**:
- a) Most tested datasets are significantly imbalanced, hence I don't think metrics such as accuracy and AUC are the most appropriate. I believe that simply presenting macro F1 scores in the main paper could be sufficient to share the main messages, potentially with AUPRC [C] included for completeness in the main paper or supplementary.
- b) In most ablation studies and sensitivity analyses, many strong claims are made by the authors when the results hold for at most 2 out of 3 datasets. Therefore I strongly encourage the authors to include at least 2 other WSI datasets in their experiments to better support these claims.
- c) The authors argue that a central issue of MIL methods for few-shot classification is overfitting. Nonetheless, there is a plethora of implicit or explicit regularizations (e.g., dropout, attention dropout, norm constraints, etc.) that could be envisioned. Hence the scope of the baselines chosen by the authors is not clear and should be further justified or completed by including different regularization techniques.
- d) Hyperparameters of benchmarked MIL models are not present in the paper and should be added.
- **W3: asymptotic analysis**: While authors mention that their method brings less improvements in 16-shots WSI classification than with less supervision, it could be interesting to stress test their methods with higher ranges of shots on the larger datasets like TCGA-NSCLC.
[C] McDermott, M., Zhang, H., Hansen, L., Angelotti, G., & Gallifant, J. (2024). A closer look at auroc and auprc under class imbalance. Advances in Neural Information Processing Systems, 37, 44102-44163.
I invite authors to discuss the weaknesses mentioned above, knowing that I am really inclined to increase my initial grade. A last question:
Q1. Could authors clarify whether there are correlations between geometric properties of the different datasets with the results observed in the ablation studies reported in Table 2, Table 3 and Figure 3? |
Fully human-written |
|
Exploiting Low-Dimensional Manifold of Features for Few-shot Whole Slide Image Classification |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 10: strong accept, should be highlighted at the conference
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The study proposes the **Manifold Residual block** to address the issue of overfitting in few-shot whole slide image (WSI) classification. It argues that overfitting stems not just from data scarcity but is a fundamentally geometric problem: features from pathology foundation models lie on a low-dimensional, nonlinear manifold that linear layers in MIL models distort.
1) The study provides quantitative proof that CONCH features exhibit a low-dimensional manifold with nonlinear curvature, which linear layers disrupt.
2) The study proposes **MR Block Innovation** with a fixed random geometric anchor and a trainable low-rank residual pathway, reducing overfitting and parameter count.
3) The study provides extensive validation to demonstrate the generalization of the proposed method.
4) The **MR block** demonstrates SOTA performance on three datasets across 4-, 8-, and 16-shot settings.
1) The study does not provide comparisons with SOTA methods for few-shot whole slide image classification such as MGPATH [1], MSCPT [2], and FOCUS [3].
2) The study does not report inference time and FLOPs for the proposed method.
3) The study does not fully explain the effect of rank on the model's performance. For example, the sensitivity analysis (Fig. 3) shows that performance saturates around a rank of **r=32**. The authors note this **aligns remarkably** with the features' effective rank of 29.7. However, all main experiments in Table 1 and the ablation studies in Table 2 were run with **r=64**. In Fig. 3, for NSCLC 8-shot, **r=64** performs worse than **r=32** or **r=48**, suggesting **r=64** may be suboptimal.
4) The study lacks a clear description of how to apply **MR** block to complex methods such as TransMIL or CATE.
**Reference**:
1. Nguyen, A.-T., Nguyen, D. M. H., Diep, N. T., Nguyen, T. Q., Ho, N., Metsch, J. M., Maurer, M. C., Sonntag, D., Bohnenberger, H., & Hauschild, A.-C. (2025). MGPATH: A vision-language model with multi-granular prompt learning for few-shot whole-slide pathology classification. Transactions on Machine Learning Research (2025).
2. Han, Minghao, et al. "Mscpt: Few-shot whole slide image classification with multi-scale and context-focused prompt tuning." IEEE Transactions on Medical Imaging (2025).
3. Guo, Zhengrui, et al. "Focus: Knowledge-enhanced adaptive visual compression for few-shot whole slide image classification." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
1) How does the MR block perform if B is a non-random, fixed matrix, such as an identity matrix?
2) can you confirm if the same geometric distortion problem exists for pathology slide-level foundation models such as TITAN [1]?
3) Given this strong evidence for **r=32**, why were all main experiments (Table 1) and ablation studies (Table 2) run with **r=64**?
4) Could you elaborate on the methodology used to apply **MR** block to TransMIL and CATE?
**Reference**:
1. Ding, T., et al. "Multimodal whole slide foundation model for pathology." (2024). URL: https://arxiv.org/abs/2411.19666
Fully human-written |
|
TeFlow: Enabling Multi-frame Supervision for Feed-forward Scene Flow Estimation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper investigates self-supervised scene flow estimation from multi-frame point clouds. It introduces a self-supervised framework that segments the scene into static and dynamic regions, and leverages temporal ensembling and voting to obtain supervision signals for the dynamic parts. Experimental results on the Argoverse 2 and nuScenes datasets show that the proposed approach achieves competitive performance with low computational cost.
- The proposed approach demonstrates competitive performance compared to other feed-forward methods on the Argoverse 2 and nuScenes datasets.
- The experimental evaluation is comprehensive.
1. The writing should be improved.
- Figure 2 needs improvement.
As the key figure illustrating the overall framework, Figure 2 does not effectively help readers understand the temporal ensembling and voting algorithms. In particular, the meanings of the different colors and arrows in the motion candidate pool are not explained, making it difficult to interpret the figure.
- The writing of Section 4.1 should be improved.
In Line 213, the paper states that "we establish correspondences by finding, for each point p_i, its nearest neighbor in P," whereas in Eq. (3), the nearest-neighbor search is performed between p_k and P. This makes the process of motion candidate generation hard to follow.
2. The rationale of Motion Candidate Generation needs to be further clarified.
When generating the supervisory signal from previous frames, the method directly finds the nearest neighbor of the current points in the previous frame, without warping the current points according to the (predicted) motion between the two frames. Because the inter-frame motion is ignored, performing nearest-neighbor search without such warping or motion compensation is inappropriate for establishing accurate correspondences. It is worth noting that in self-supervised scene flow estimation, almost all self-supervised loss functions (e.g., the Chamfer loss) warp the source points toward the target frame to find correspondences and thereby generate the supervision signal (a minimal sketch of this warping step is given after this weaknesses list).
3. In Eq. (5), the authors use the motion direction (i.e., cosine similarity) to measure the consistency between two flow candidates. It would be helpful to explain why the end point error (EPE) is not used. Since cosine similarity only accounts for the direction of motion while ignoring its magnitude, the consistency evaluation may be incomplete or potentially misleading.
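To illustrate the warping step mentioned in weakness 2 above, here is a minimal sketch of the warped nearest-neighbor correspondence used by typical self-supervised scene-flow losses (e.g., Chamfer-style objectives); tensor names and shapes are illustrative and not the paper's notation.

```python
# Minimal sketch of warped nearest-neighbor correspondence, as used in common
# self-supervised scene-flow losses. Shapes and names are illustrative only.
import torch

def warped_nn_loss(src, tgt, flow):
    """src: (N, 3) points at time t; tgt: (M, 3) points at time t+1; flow: (N, 3) predicted motion."""
    warped = src + flow                   # warp source points by the predicted flow
    d = torch.cdist(warped, tgt)          # (N, M) pairwise distances
    nn_dist, _ = d.min(dim=1)             # distance to the nearest target point
    return nn_dist.mean()

# Without the `+ flow` warp, the nearest-neighbor search matches each source point to
# wherever it already is, ignoring inter-frame motion -- the concern raised in weakness 2.
src, tgt = torch.rand(1024, 3), torch.rand(1024, 3)
flow = torch.zeros_like(src)
print(warped_nn_loss(src, tgt, flow))
```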
1. Please explain the detailed process of Motion Candidate Generation, especially Line 213 and Eq. (3).
2. Please clarify the rationale behind the design of Motion Candidate Generation.
3. Why is cosine similarity used instead of the EPE for measuring the consistency? Is there any experimental evidence supporting this design choice? |
Lightly AI-edited |
|
TeFlow: Enabling Multi-frame Supervision for Feed-forward Scene Flow Estimation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work introduces a multi-frame supervision framework for feed-forward scene flow prediction. To address issues such as occlusion and multi-frame temporal expansion, the proposed TeFlow presents an effective temporal aggregation strategy which, according to the authors, yields significant speed improvements and performance advantages.
1. Good presentation and clear writing, which make the paper easy to read.
2. Effective method design and good performance.
3. Comprehensive Experimental Validation. The study includes rigorous evaluations on two large-scale autonomous driving datasets, with detailed ablation studies on input frame count, loss components, and hyperparameters.
1. Although the method leads on many metrics, TeFlow still has room for improvement on some indicators.
2. Line 464, inconsistent capitalization.
3. Has the speed of this method been averaged from multiple measurements? Specifically, how many times?
Please refer to the weaknesses. |
Lightly AI-edited |
|
TeFlow: Enabling Multi-frame Supervision for Feed-forward Scene Flow Estimation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper presents a feed-forward network that learns to solve scene flow using a temporal ensembling strategy. The results are strong and, on part of the datasets examined, show significant improvement over SOTA. The technique involves adding temporal data and a joint cost function over points and blocks.
The primary strength of the TeFlow method is its introduction of cluster loss that enables balanced multi-frame supervision. While prior feed-forward methods rely on two-frame correspondence losses, TeFlow first aggregates a highly stable and temporally consistent motion target for each dynamic object cluster through a temporal ensembling strategy. This cluster-level averaging prevents the loss from being dominated by larger objects with more points, ensuring that smaller dynamic objects, such as pedestrians, receive fair and effective supervision.
The ideas presented in this paper are not new, but their combination provides a strong outcome. Specifically, clustering for object-level loss enforcement was already published (and cited by the authors), as were temporal constraints (more than two frames). Hence, while the provided solution is worthwhile and achieves SOTA in some cases, it is an incremental improvement over known methods.
Please elaborate on the contribution of each item already used and known in literature over the provided solution. |
Fully human-written |
|
KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper argues that most existing ASVA (Audio-Synchronized Visual Animation) models adopt the strategy of uniformly sampling video frames, which leads to two core problems in high-dynamic motion scenarios: (1) failure to capture key audio-visual moments, resulting in unsmooth motion transitions; (2) audio-visual temporal misalignment, especially for low-frame-rate models, which struggle to match the fine-grained temporal information of audio.
Therefore, this paper proposes a keyframe-aware audio-to-visual animation framework that first localizes keyframe positions from the input audio and then generates the corresponding video keyframes using a diffusion model. The framework includes a keyframe generator network that selectively produces sparse keyframes from the input image and audio, effectively capturing crucial motion dynamics.
1. The comparison of uniform-frame vs. keyframe generation and the keyframe-oriented pipeline in Figure 1 are interesting and beneficial to the research community.
2. The multi-condition cross-attention fusion is carefully designed.
3. The quantitative comparison results and demos show the effectiveness of proposed method, which is convincing to me.
1. The ablation studies are not very convincing since the results in Table 2 are similar. In particular, for the "w/o Frame Index" setting, the FVD improvement is 1.7% and the degradations of the synchronization metrics are 2.1%~2.4%. So it is not clear to me why the Frame Index is necessary.
2. There is no computational efficiency analysis, which is essential for real-world applications. I am wondering whether it is computationally heavy to conduct the multi-condition cross-attention in the U-Net blocks.
3. The paper does not analyze the performance differences of the proposed method across different scenarios. The paper claims that its method is particularly advantageous in "intensive motion" scenarios (Line 485), but this lacks quantitative analysis and verification.
1. Discuss and explain the effectiveness of the proposed techniques in this paper, especially the "Frame Index".
2. Compare the time efficiency of the proposed method with that of the baselines. For example, Real-Time Factor (RTF) and GFLOPs should be taken into consideration.
3. Add more comparisons with baselines on intensive-motion scenarios and non-intensive-motion scenarios, and discuss the differences.
4. Will the code and pretrained model be released? |
Fully human-written |
|
KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents KeyVID, a keyframe-aware diffusion framework for generating videos that are temporally synchronized with input audio. The core idea is to exploit the correlation between peaks in the motion signal (optical flow intensity) and peaks in the audio signal to determine key moments of action. The system decomposes the task into three modules: a Keyframe Localizer that predicts motion peaks from audio, a Keyframe Generator that synthesizes visual frames conditioned on audio, text, and the first image, and a Motion Interpolator that fills intermediate frames for smooth transitions. While the underlying assumption “strong sounds correspond to large motions” is conceptually simple, the paper demonstrates that modular design and diffusion-based conditioning yield high-quality, audio-synchronized animations, outperforming prior methods (e.g., AVSyncD) in both quantitative metrics and human preference.
The paper’s strength lies in its clear conceptual simplicity combined with strong engineering design. Instead of introducing a novel generative paradigm, it isolates key factors affecting audio-visual synchronization and builds an effective three-stage system around them. The modular structure (localization–generation–interpolation) makes the overall process interpretable and flexible. The idea of learning motion saliency from audio peaks via optical-flow supervision is intuitive yet elegantly implemented, enabling temporal precision without requiring explicit motion labels. Moreover, the integration of first-frame conditioning and frame index embeddings ensures temporal consistency and visual coherence across non-uniformly sampled keyframes—an aspect that many prior diffusion-based approaches fail to achieve. Experimental results are convincing, showing SOTA performance on both synchronization and visual quality metrics. The paper is also well-written, with clear motivation and comprehensive ablations that help readers understand the contribution of each module. The proposed framework feels robust, scalable, and generalizable beyond its training distribution.
Despite its strong empirical results, the conceptual novelty is somewhat limited. The paper’s main assumption—that audio peaks align with motion peaks—is simple and well-known in the audio-visual literature. The novelty mainly comes from a careful engineering decomposition rather than a new theoretical insight. The keyframe selection mechanism remains heuristic (based on fixed thresholds and local extrema), which, while effective, feels ad hoc and could limit robustness for more complex or subtle motion types. For instance, the model performs less consistently on “subtle-motion” videos (e.g., violin, trumpet) or single-event sequences (e.g., frog croaking), where perceptual synchronization is harder to judge and the heuristic peak detection may fail. Furthermore, the 2-second clip length used in both training and user studies constrains the evaluation of long-term consistency and overall narrative quality. The model’s dependence on the first frame also raises concerns about appearance drift or overfitting to static conditions when generating longer sequences.
In addition to the weakness, it would be great if authors can response to the following minor comments.
- The paper would benefit from more discussion of failure cases, especially where KeyVID underperforms in the user study (e.g., low-motion or single-event clips).
- Figure 5 and Appendix F could be expanded to show visual differences in subtle-motion scenarios, not just high-intensity ones.
- The authors might consider exploring learnable or probabilistic keyframe selection instead of the fixed heuristic used in Section 3.1.
- The limitation of using short 2-second videos for subjective evaluation should be explicitly acknowledged; looping or extended clips could help reduce perceptual bias.
- It would be interesting to see comparisons against pose-based or structure-aware baselines such as TANGO, to assess generalization to human-centric motion. |
Fully AI-generated |
|
KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes an approach for audio-driven image animation, where static images are animated into videos synchronized with input audio both semantically and temporally. The method decomposes the animation process into two stages: the first generates keyframes corresponding to key actions derived from the audio, and the second interpolates between these keyframes to produce continuous motion. Both stages use a video inbetweening model to generate frames.
I appreciate the idea of generating keyframes or key actions first, which need not be uniformly distributed. This design effectively mitigates the potential mismatch between audio and generated video arising from differences in their sampling frequencies.
1. I am skeptical about the definition of keyframes as frames with peak motion scores. The authors should discuss the applicability and limitations of this definition. For instance, in dance videos, key movements often occur on musical beats, where the motion velocity is near zero—these moments would not correspond to frames with the highest motion scores.
2. I would like the authors to provide further justification for this keyframe definition.
3. Based on the provided video result, the method appears to be applicable primarily to sound events. Moreover, the paper presents too few video examples to convincingly demonstrate the effectiveness of the proposed approach.
See the above weakness section |
Lightly AI-edited |
|
KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper adds audio conditioning to an existing text-image-to-video (TI2V) model.
1. The backbone is DynamiCrafter, and the dataset is the open-source audio-visual generation dataset AVSyncD. The generated videos are around 2s (48 frames).
2. The target is to solve audio-visual misalignment. The idea is to first select audio keyframes, then generate keyframes using the selected audio, and finally do video interpolation.
- The authors train an audio-to-optical-flow network to predict optical flow and select audio keyframes based on local minima/maxima (a rough sketch is given after this summary).
- The audio features of these keyframes, the image and text, and the indices of the frames to generate are used to produce the keyframes.
- Video interpolation is done by finetuning DynamiCrafter with Wan 2.2-style image-mask conditioning.
3. The objective scores beat SoTA, and 7 video results are attached.
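As a rough sketch of the keyframe-selection step summarized above (local minima/maxima of a predicted per-frame flow-magnitude curve), the following is my understanding of the rule; the threshold values and function names are placeholders, not the paper's.

```python
# Rough sketch of rule-based keyframe selection from a predicted flow-magnitude curve.
# Threshold values are illustrative placeholders.
import numpy as np
from scipy.signal import find_peaks

def select_keyframes(flow_mag, min_prominence=0.1, min_gap=2):
    """flow_mag: (T,) predicted optical-flow magnitude per frame."""
    peaks, _ = find_peaks(flow_mag, prominence=min_prominence, distance=min_gap)
    valleys, _ = find_peaks(-flow_mag, prominence=min_prominence, distance=min_gap)
    # always keep the first and last frame; non-selected frames are filled by interpolation
    return np.unique(np.concatenate([[0], peaks, valleys, [len(flow_mag) - 1]]))

flow_mag = np.abs(np.sin(np.linspace(0, 6 * np.pi, 48)))   # toy 48-frame motion curve
print(select_keyframes(flow_mag))
```

The hard-coded prominence/threshold in such a rule is exactly where the concerns about smooth audio and very fast repeated events (see the weaknesses) come from.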
1. The paper is well written and easy to follow.
2. The evaluation contains both objective and subjective metrics/samples, and it shows results better than previous methods.
3. The authors include the details of each module in the appendix.
1. The high-level idea sounds rule-based, and there is not enough evidence why it is better than generating all frames at once.
- Limitation of the rule-based design: using optical flow and picking local minima/maxima may not be suitable for some smooth audio, e.g., a river or a plane taking off. The idea may not be general enough to push the boundary of current ATI2V models; it may require a more general mapping model, for example based on contrastive learning, as is done for text and image.
- How is the threshold on the number of keyframes set? For the hammer case, if the hitting is very fast, e.g., 10 times in 2 seconds, should there be at least 20 keyframes?
2. The implementation, which uses a video model to generate discontinuous frames via a learned frame embedding while keeping the original RoPE, does not sound straightforward.
- First, is using only the selected audio keyframe features enough? Considering the hammer case, only the sounds of hitting are captured.
- Regarding adding the frame-index condition to the network, is it possible to directly modify the existing position embedding instead?
Overall this is a clearly written paper with competitive experiments. My concern is that the idea itself sounds rule-based and not general. I wonder whether, for 2s audio-video generation, which is a length for which there is enough GPU memory to train directly, end-to-end modeling could achieve good results after filtering out the misaligned audio-visual data from the dataset. The details of my questions are in the weaknesses part. |
Fully human-written |
|
Mode-conditioning unlocks superior test-time compute scaling |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents a simple yet effective approach to diversify the model's outputs by providing explicit control over the modes. It explores two methods: 1) training separate specialist models and splitting test-time compute between them, and 2) training a single model with mode-specific prefixes and sampling equally with the corresponding prefixes. The paper shows that these approaches surpass a mixed model trained on both modes. Moreover, in the case of unspecified modes, the paper proposes an automatic mode discovery method based on gradient clustering. It shows that the method captures the labels reliably and that further mode-conditioning on them recovers the improvements.
The paper pinpoints a simple but important suboptimality in training language models with diverse data. It verifies the intuition with experiments in both toy settings, such as different strategies for Countdown, and on real-world tasks with traces distilled from teachers. It is comprehensive in experimenting with different forms of chain-of-thought (short and long) generated with different models. Moreover, the work pushes its practical relevance further by providing a method for discovering unobserved modes in the data based on gradient clustering, which makes the idea more generalizable to different settings.
The paper could improve its presentation by defining its metrics more clearly. For example, it’s not clear how the “Fraction of BFS per problem” metric is computed for Figure 2. In section 5.1, p_\theta is not defined, so it’s not obvious how the gradient is computed.
I also did not understand how heuristic prunings and search budget constraints make the problems solvable with only one of BFS and DFS, which makes it unclear why this setting captures the diverse setting desired.
The novelty of the idea of learning separate models and aggregating them, instead of learning from a mixed dataset, is questionable given the literature around mixture-of-experts and other works such as "Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning".
1. Could you please explain how the heuristic prunings limit the solution to one of BFS and DFS?
2. How is the ‘fraction of BFS used’ computed?
3. For the distillation experiments, what kind of prefix do you use for different teachers? How does knowledge sharing happen in those experiments if the model learns to follow one strategy given a prefix?
4. Could you explain how the gradient is computed in the gradient clustering method?
5. Did you run the gradient clustering method for the long-CoT datasets too? |
Fully human-written |
|
Mode-conditioning unlocks superior test-time compute scaling |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work suggests a method called ModC for improving diversity in generation during parallel scaling.
The idea is that different ways to approach a problem may fall into different "modes", corresponding to different broad strategies. A diverse generator of possible proofs should try to sample as diversely as possible from different modes.
This paper suggests two possible ways to do this: (1) train a separate model on each mode, or (2) train a model with prefix-tuning to make it use one mode.
Modes can be either (1) known a priori, or (2) found automatically with a gradient clustering method. The paper tests the idea on several benchmarks such as NuminaMath, AIME, and Countdown, and finds benefits over vanilla parallel scaling -- especially when there is a large amount of parallel scaling.
* ModC has a conceptually clear motivation.
* The proposed method is simple to implement, and seems to yield increased performance. This might be of interest to much of the ICLR community, since methods to improve to model reasoning are quite popular.
* The paper presents a way to find modes automatically, using gradient clustering. This makes it more broadly applicable than it would be if the modes had to be created manually.
* The experimental methodology seems mostly sound (although I have a question -- see weaknesses below).
* On the methodology: I'm not sure how good of a metric pass@k is for AIME, when k = 1000, because there are only 1000 possible solutions for any problem as far as I know.
- Having a model that outputs a random number from 0 to 999 would give roughly 63% pass@k accuracy, which is roughly the accuracy reported in Figure 5 (a short worked calculation is given at the end of this review).
- On the other hand, having 1000 models (each of which outputs a constant number) would give a 100% pass@k accuracy.
* There's a quite relevant prior work called "Metadata Conditioning Accelerates Language Model Pre-training" by Gao et al., 2025, that this made me think about. There they show that adding metadata of which website a text came from can improve model performance. It could be good to discuss the connection with this work.
Typos: "this achieves up to xx% improvement" |
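To spell out the 63% figure from the first weakness above: a guesser that outputs an independent uniform draw over the 1000 possible integer answers on each of the $k = 1000$ samples achieves
$$\text{pass@}1000 \;=\; 1 - \left(1 - \tfrac{1}{1000}\right)^{1000} \;\approx\; 1 - e^{-1} \;\approx\; 0.632.$$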
Fully human-written |
|
Mode-conditioning unlocks superior test-time compute scaling |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes mode-conditioning, a test-time inference strategy that allocates a certain number of samples to each mode in order to improve diversity of samples, mitigating the issue of mode collapse. The authors show that ModC leads to consistent gains across tasks both when modes are fixed, as well as being able to discover modes automatically via gradient clustering.
- The ModC method is novel, creative, and effective, addressing the critical issue of lack of diversity.
- The authors demonstrate that ModC training works well on a variety of tasks. They explore the idea throughout a variety of settings and domains, and it performs above standard baselines in all cases. The technique seems to be quite general and could have potential downstream applications beyond those listed in the paper.
- The authors also compare different ways of implementing mode conditioning, and do a thorough analysis on other factors like model size, CoT length, etc.
- The work does not have any comparisons with other diversity-inducing techniques, for example pass@k training (https://arxiv.org/abs/2508.10751) or optimal sample allocation (https://arxiv.org/abs/2410.22480). While ModC is evidently effective against simple baselines, it is difficult to understand the advantages and disadvantages of this method against some of these other methods.
- Most of the ablations are comparing variants of ModC. Could you provide a comparison of ModC against other diversity-inducing techniques (see comment in weaknesses section)?
- Do you see a clear diversity increase after ModC? For example, for MATH500, if you consider how many distinct answers are produced for each problem, how much does it increase with ModC?
- Does the idea also apply to other domains, such as code generation? |
Fully human-written |
|
Mode-conditioning unlocks superior test-time compute scaling |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies mode conditioning as a way to address the lack of diversity in LLM generations for reasoning tasks. The problem is that when we scale test time compute with parallel sampling, current models tend to collapse to one or two dominant strategies, so additional samples mostly repeat the same errors. The paper’s proposal is to make the modes explicit and to allocate test time samples across modes rather than drawing all samples from a single collapsed distribution. They instantiate this in a controlled search setting (Countdown, a generalization of Game of 24) where the target value can be found either by a DFS style search or a BFS style search, and where the search trace itself reveals which mode was used. They then extend the idea to math post training with multiple teacher models and finally to a setting where modes are discovered automatically via gradient clustering. They show that mode conditioned training and mode conditioned inference improves Pass@k relative to standard mixed training.
I think the problem is interesting and well chosen. How to obtain diversity in reasoning style without simply increasing sampling temperature (which has its own issues) is still not well understood, and most current post training pipelines / RL algorithms do in fact make diversity worse.
I like the synthetic setup with countdown game since it cleanly isolates the question they are trying to answer, with a way to verify which mode of problem solving is used. The experiments are pretty thorough and they also show some nice results in the math CoT setting, as well as in the automatic mode finding setting using gradient clustering.
1. Several parts of the paper are somewhat hard to follow on a first read. One example is Figure 2. It is not completely clear how the per-problem histograms are computed. My reading is that for each test problem the authors sample the model repeatedly, detect for each sample whether the model used DFS or BFS, compute the fraction of BFS samples for that problem, and then plot the distribution of that fraction over all problems (a small sketch of this reading is given after this list). If that is correct, the number of samples per problem needs to be stated. If that is not correct, the figure needs a more explicit description. Right now it is difficult to tell what exactly is being compared.
2. In the separate-model setting each mode gets its own model trained on the subset of data for that mode. In Countdown the paper notes there are about 97k DFS trajectories and 65k BFS trajectories. If each of those is used to train a full model of size $N$ then the total training compute for the separate-model setting is roughly $6 \times (2 N) \times (97+65)/165 \approx 11.7 N D$ FLOPs, whereas standard training or prefix-based ModC is roughly $6ND$. This would mean that the separate-model setting uses roughly twice the training compute. The paper should clarify whether the training budget was controlled, whether epochs were scaled down for the separate models, or whether the comparison is intentionally not compute-matched. As written, it is not a fair comparison.
3. A natural application of this work is to post train with RL, where we know that the distribution becomes sharper and diversity decreases. As far as I can tell, the paper only considers SFT / distillation like settings. A natural question is: after RL, does mode conditioning still preserve the benefits shown here, or does RL erase them. It would be useful to see in the same synthetic Countdown setting a comparison of RL that samples from the usual policy versus RL that is constrained to use mode conditioned sampling during rollouts. If the authors can show that they can keep the sampling gains after RL training (evaluating 0-shot after RL) that would be a nice finding, even if on a synthetic task.
4. There are a few grammatical and formatting issues. For example Section 5.1 appears to have an incomplete closing sentence. (Did not penalize for this.)
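To make my reading of Figure 2 (weakness 1) concrete, the computation I have in mind is roughly the following; the sampler below is a synthetic stand-in, and the per-problem sample count is exactly the number the paper should state.

```python
# Sketch of my reading of the Figure 2 histogram: for each problem, draw many samples,
# label each as BFS or DFS, and plot the per-problem BFS fraction. Synthetic stand-in data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_problems, n_samples = 200, 64                    # n_samples is the unstated quantity

# stand-in for "sample the model and detect whether each trace is BFS":
# each problem gets a latent BFS propensity, then per-sample BFS/DFS labels
propensity = rng.beta(0.5, 0.5, size=n_problems)
is_bfs = rng.random((n_problems, n_samples)) < propensity[:, None]

bfs_fraction = is_bfs.mean(axis=1)                 # one value per problem
plt.hist(bfs_fraction, bins=20)
plt.xlabel("Fraction of BFS samples per problem")
plt.ylabel("Number of problems")
plt.show()
```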
I am willing to improve my score if the authors can meaningfully address these comments.
1. Unclear algorithm in Section 4.2 - The post training described for math reasoning in Section 4.2 seems to be plain SFT on two teacher traces with either mode specific prefixes or separate models. It would be good to state clearly that no RL was used here, if that is in fact the case.
2. Figure 4 interpretation - The caption says that the dark gray line is “best teacher.” Does this mean this curve corresponds to distillation from only the best single teacher, not distillation from the union of best teacher traces across problems? |
Fully human-written |
|
IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces IMPQ, a mixed precision PTQ method for LLMs that leverages Shapley values to estimate the importance/sensitivity of individual layers in LLMs, and forms a Hessian scaled objective for the mixed-precision problem that can be solved using quadratic integer programming solvers.
The method is clear and the evaluation supports its claim of effectiveness over baselines.
- The complexity of the algorithm needs to be analyzed: the calculation of Shapley values with Monte Carlo sampling seems very computationally expensive for LLMs (a generic sketch of the cost is given after this list). A comparison of the computational complexity of IMPQ against the baselines may be needed, given important factors such as the number of layers and the number of samples.
- Downstream task evaluation is missing. The evaluation section only shows results on WikiText perplexity, but the IMPQ method and some baseline quantization methods need calibration. It would be more convincing if results on downstream tasks were included. Also, given only the perplexity evaluation, how hard would it be to apply the method to MoE models?
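To make the cost concern concrete, a generic permutation-sampling Shapley estimator has roughly the following shape (this is not the paper's SPQE procedure, and the value function below is a toy stand-in for the NLL payoff): it already needs on the order of M x L evaluations of the value function, and in the LLM setting each evaluation is a forward pass over the calibration data.

```python
# Generic Monte Carlo (permutation-sampling) Shapley estimation -- a toy stand-in,
# not the paper's SPQE. Note the M * L calls to value(), i.e., M * L "model evaluations".
import numpy as np

rng = np.random.default_rng(0)
L, M = 32, 100                          # number of layers, number of sampled permutations
toy_weights = rng.standard_normal(L)

def value(quantized_set):
    # toy stand-in for the payoff v(S); in the real setting this is a forward pass
    # over calibration data with the layers in `quantized_set` held at low precision
    return float(toy_weights[list(quantized_set)].sum())

phi = np.zeros(L)
for _ in range(M):
    order = rng.permutation(L)
    S, prev = set(), value(set())
    for layer in order:
        S.add(layer)
        cur = value(S)                  # one evaluation per (permutation, layer) pair
        phi[layer] += (cur - prev) / M  # running average of marginal contributions
        prev = cur
```

Stating how many such evaluations SPQE actually performs, and how long each takes, would let readers compare against the one-shot sensitivity proxies used by the baselines.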
Besides the weaknesses section, could the authors answer the following questions:
1. Why is the average per-token NLL used as the payoff?
2. Could the authors explain or give some intuitive hints on why IMPQ outperforms the baselines?
3. Apart from the computational complexity analysis requested in the Weaknesses section, could the authors give a comparison of quantization time? This would make the complexity-performance trade-off straightforward for readers to grasp.
4. The selection of hyperparameters, such as $\alpha$ in line 260: could the authors explain why this specific value of 0.5 was chosen?
There are a few typos in the paper due to citation formatting issues, e.g., lines 109 and 112. |
Fully human-written |
|
IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces IMPQ, a novel framework for mixed-precision quantization of Large Language Models (LLMs) that addresses the critical challenge of deploying massive models on resource-constrained devices. The core innovation lies in modeling quantization as a cooperative game among transformer layers, where the authors propose SPQE (Shapley-based Progressive Quantization Estimation) to capture layer sensitivities and inter-layer interactions through progressive quantization rather than abrupt pruning.
The application of cooperative game theory and Shapley values to mixed-precision quantization represents a significant conceptual advance. By framing layers as players in a cooperative game, the authors provide a principled approach to quantifying both individual layer contributions and inter-layer interactions, which existing methods neglect.
SPQE's progressive quantization (from 4-bit to 2-bit) is a clever innovation that maintains model stability during Shapley estimation. This approach effectively avoids the catastrophic performance degradation and high variance associated with layer pruning, enabling more accurate and reliable layer importance assessment.
While the paper analyzes the impact of Monte Carlo samples, it neglects a thorough examination of other hyperparameters. The diagonal shrinkage parameter $\alpha$ is fixed at 0.5 without justification or sensitivity analysis. The choice of baseline (4-bit) and target (2-bit) precisions is also not motivated or varied.
Experiments rely primarily on C4 for Shapley estimation and WikiText-2 for evaluation. Testing on more diverse domains (e.g., code, multilingual text) and larger datasets would strengthen claims about generalizability, especially given the domain-specific nature of quantization effects.
While several baselines are included, comparisons with recent Hessian-based methods like HAWQ are limited. The paper also doesn't compare against neural architecture search approaches for quantization, which could provide additional context for the performance gains.
I am mainly interested in the presentation and the theoretical parts; some points below confuse me. Please correct me if I am wrong.
The paper assumes Monte Carlo sampling provides accurate Shapley approximations without a theoretical analysis of approximation error. The value function $v_{\text{NLL}}(S) = \mathbb{E}_{(x,t)\sim D}[-\log p(x_{t+1}|x_{\leq t}; S)]$ in Equation 3 is not justified as the optimal choice for measuring layer contributions in the cooperative game framework.
Section 3.2 begins with a second-order Taylor approximation $\Delta L \approx \sum_{i=1}^{L} g_i^\top \epsilon_i + \sum_{i=1}^{L} \sum_{j=1}^{L} \epsilon_i^\top H_{ij} \epsilon_j$ but then switches to a Shapley-based approach without reconciling these perspectives. The covariance matrix $C = \frac{1}{M}(\Delta v_\ell - \hat{\phi})^\top (\Delta v_\ell - \hat{\phi})$ in Equation 8 is proposed as a Hessian proxy without theoretical justification for this equivalence.
The distribution $D$ in Equation 3 is not clearly specified. The perturbation $\epsilon_i$ in Section 3.2 is used without defining its relationship to quantization error. The construction of the covariance matrix $C$ in Equation 8 lacks clarity regarding dimensions and the precise meaning of $\Delta v_\ell$. It's kindly suggested to check these notations.
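To make the notation concern concrete, here is my reading of the quantities in Equation 8 in code form; the shapes are assumptions inferred from the text, not something the paper states explicitly, and I would appreciate a correction if they are wrong.

```python
import numpy as np

# Assumed shapes: delta_v is (M, L), its m-th row holding the marginal NLL
# contributions of the L layers under the m-th sampled permutation; phi_hat is
# the (L,) Monte Carlo Shapley estimate (the column mean); C is then an (L, L)
# covariance-like matrix used as the Hessian proxy.
M, L = 64, 32
delta_v = np.random.randn(M, L)                         # placeholder contributions
phi_hat = delta_v.mean(axis=0)                          # Shapley estimates
C = (delta_v - phi_hat).T @ (delta_v - phi_hat) / M     # (L, L) proxy
```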
See the weakness, mainly in the exp settings, theory, and notations. |
Moderately AI-edited |
|
IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces SPQE, a Shapley-based approach for estimating layer importance via progressive quantization, and IMPQ, a MILP-based method for assigning 2- or 4-bit precision under memory constraints. The authors frame mixed-precision quantization as a cooperative game among layers, capturing inter-layer dependencies more effectively than existing heuristics. Evaluated on several LLMs and PTQ backends, IMPQ achieves significantly lower perplexity, especially under 2-bit constraints, demonstrating strong empirical performance and robustness across models and settings.
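For context, the binary 2-/4-bit assignment described above can be posed as a small integer program. The sketch below is my own illustration with placeholder costs and without the inter-layer interaction term, so it is not the paper's exact MILP.

```python
import pulp

num_layers = 32
sens_2bit = [1.0 + 0.1 * l for l in range(num_layers)]  # placeholder: predicted loss increase at 2 bits
mem_4bit = [4.0] * num_layers                            # placeholder per-layer memory at 4 bits
mem_2bit = [2.0] * num_layers                            # placeholder per-layer memory at 2 bits
budget = 100.0                                           # placeholder memory budget

prob = pulp.LpProblem("bit_assignment", pulp.LpMinimize)
x = [pulp.LpVariable(f"x_{l}", cat="Binary") for l in range(num_layers)]  # 1 = keep 4 bits

# Minimize the predicted loss increase from layers pushed down to 2 bits.
prob += pulp.lpSum(sens_2bit[l] * (1 - x[l]) for l in range(num_layers))
# Stay within the memory budget.
prob += pulp.lpSum(mem_4bit[l] * x[l] + mem_2bit[l] * (1 - x[l])
                   for l in range(num_layers)) <= budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
assignment = [4 if x[l].value() > 0.5 else 2 for l in range(num_layers)]
```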
The key strengths lie in the originality of modeling quantization as a cooperative game, the method’s stability under aggressive bit reductions, and the extensive and thorough experimental validation. The results clearly show that accounting for inter-layer interactions leads to better bit allocation and quantized performance than isolated sensitivity measures.
1. The method carries substantial computational overhead, with SPQE requiring many hours to estimate Shapley values even for mid-sized models.
2. The approach is currently limited to binary 2-bit/4-bit decisions, which restricts its generality, and the MILP formulation, though optimal in theory, raises questions about scalability to larger models or finer-grained bit options.
3. Moreover, the paper does not explore how robust the final assignments are to noise in Shapley estimates, nor does it fully explain implementation details such as memory constraint handling or solver configurations.
1. It would help to know whether the authors plan to support finer bit precision (e.g., 3-bit or 8-bit layers), and whether SPQE or MILP runtimes could be reduced through approximation or more scalable formulations.
2. Additionally, can this game-theoretic framework extend to other compression tasks, such as structured pruning, where modeling interactions is equally important? |
Fully AI-generated |
|
IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes SPQE to obtain accurate Shapley estimates of layer sensitivities and inter-layer interactions. SPQE is based on cooperative game theory. The authors also use IMPQ to find optimal bit-width assignments.
1. An innovative use of Shapley value analysis and cooperative games among LLM layers to model mixed-precision quantization.
2. Demonstrated performance improvements on Llama-3, Gemma-2, and Qwen-3.
1. Why is modeling mixed-precision quantization using Shapley value analysis and cooperative games among LLM layers more effective than traditional Hessian-based methods? The authors did not clearly explain the motivation.
2. The experimental setup seems problematic. I believe the paper does not evaluate quantization performance under an accepted standard setting.
(a) The performance of the full-precision baseline is not stated.
(b) The bit range defined in the paper (e.g., 2.5–3.0) is overly broad. It is not explained how these bits correspond to specific mixing ratios or group sizes, nor how fairness is maintained across different baselines.
(c) The statistical results in the paper are inconsistent with previous work. A perplexity (ppl) around 15–25 is too high—much worse than those reported in existing papers. For example, in [1], methods such as OmniQuant and CherryQ achieve ppl < 10 at 2.15 bits. Although the experimental setup in [1] differs from this paper, the discrepancy should not be this large.
3. Continuing from 2.(c), I believe that while parameter sensitivity has room for optimization, improving only the sensitivity is of limited benefit. The optimal sensitivity selection strategy may only result in a small decrease in perplexity. Therefore, the claim in Table 1 that IMPQ reduces ppl by about 10 compared to other baselines may not be credible. The authors may not have obtained the optimal performance for the baselines.
4. Some writing issues—e.g., the font in Table 1 is too small.
[1] Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models
Please see weaknesses. |
Lightly AI-edited |
|
HSIC Bottleneck for Cross-Generator and Domain-Incremental Synthetic Image Detection |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a HSIC-based bottleneck to enhance the generalization of CLIP features for synthetic image detection across diverse generator families. The method encourages representations to retain label-relevant information while suppressing spurious correlations with input semantics, and further introduces HSIC-Guided Replay (HGR) to mitigate catastrophic forgetting in domain-incremental learning. Experiments on diffusion, GAN, and 3D Gaussian Splatting (3DGS) models demonstrate improved cross-generator performance and continual learning stability compared to recent baselines.
- The paper articulates the challenge of detectors overfitting to generator-specific artifacts and semantics, which is a timely and relevant problem for synthetic image forensics.
- The continual learning setting further shows practical awareness of evolving generative models, making the work meaningful for long-term applicability.
- The adaptation of HSIC into the CLIP feature pipeline is implemented in a straightforward and principled manner, and the loss formulation is coherent with prior HSIC-based bottleneck approaches.
- Ablation studies on HSIC components, use of intermediate ViT features, and kernel options provide supportive empirical evidence that each design choice is beneficial.
- The evaluation spans distinct generative paradigms, and the method shows consistent gains across them, indicating improved robustness of the learned representations.
- The 3DGS results highlight the method’s applicability to emerging synthetic formats beyond classical image generators.
- While effective, the core contribution largely builds on established concepts such as HSIC-based information bottlenecks and CLIP feature refinement. The paper does not sufficiently articulate what is fundamentally novel beyond applying HSIC to a different backbone and combining it with a replay mechanism. As a result, the contribution may be perceived as incremental rather than conceptually innovative.
- The paper provides qualitative intuition and t-SNE visualizations but lacks deeper analysis on what specific semantic attributes are suppressed or preserved through HSIC regularization. A more detailed investigation into feature disentanglement, representation shift, or artifact suppression would strengthen interpretability and scientific value. Without such analysis, the method may appear as a black-box regularizer rather than a principled representation intervention.
- The experiments focus on ProGAN, SDv1.4, and 3DGS-based generators, which do not reflect the current state of generative technology, such as Stable Diffusion 3+, Midjourney, Sora 2, or FLUX models. Since newer generators produce more photorealistic and harder-to-detect outputs, evaluation on these models is essential to demonstrate real-world utility. The absence of such results weakens the strength of the claimed “generalization” capability.
- How does HSIC specifically reshape CLIP features at different semantic granularity levels? Can the authors provide more concrete evidence—beyond t-SNE—that illustrates which types of semantic or generator-specific correlations are suppressed?
- How well does the method scale when extended to modern, highly realistic generators such as SD3.5, Midjourney, Sora2, or FLUX? Have the authors tested whether the model remains effective with these more challenging sources?
- In continual learning scenarios with more than 6–8 sequential domains, does HGR maintain long-term stability? Could the authors provide results over longer task horizons to support claims of scalability and robustness?
- What is the computational overhead of using 24-layer CLIP features + HSIC loss during training and inference? Could the model be made more efficient without sacrificing performance?
- Several recent approaches build upon CLIP to improve cross-generator generalization, and comparing against such methods would better contextualize the contribution of the proposed HSIC bottleneck. |
Fully AI-generated |
|
HSIC Bottleneck for Cross-Generator and Domain-Incremental Synthetic Image Detection |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces a new synthetic image detector with contributions to the model architecture, adaptations to continual learning and a new synthetic image benchmark that contains 3D Gaussian Splatting (3DGS) rendered images. The evaluation is twofold, including both a binary supervised detection task and its continual learning variant with a HSIC-Guided Replay (HGR) adaptation. The proposed model achieves state-of-the-art performance in cross-generator evaluation, generalizing between diffusion-based and GAN-based images, with an improvement of over 5 percentage points. Moreover, it demonstrates strong continual learning capability when incrementally trained to detect 3DGS-generated images.
- The method achieves state-of-the-art performances, especially in the cross-generators evaluation setup. Furthermore, results in the continual learning setup improve over the single-dataset training baseline and, in some cases, even surpass those obtained by jointly training on the additional 3DGS datasets.
- The inclusion of a 3DGS benchmark is a valuable addition, introducing a new family of synthetic image generation methods beyond GANs and diffusion models which have dominated the detection research.
- The paper reinforces the effective use of HSIC in both supervised and continual learning setups.
- The HSIC bottleneck is not entirely novel; it can be seen as a combination of the RINE [1] and DualHSIC [2] approaches. RINE's performance is also missing from Table 1; including it could narrow the gap between the current model and the top-performing prior works reported in the same table.
- The performance gains of the HSIC term in HGR are uncertain. An ablation on the gains due to the inclusion of the \( 1 - \mathcal{N}(r_i) \) term in Equation 10 would help justify its contribution to the overall performance and clarify its impact.
Typo: in Table 5 (b), the Cosine kernel achieves the highest mACC on ProGAN and should be bolded instead of the median version of the RBF kernel.
- While the paper improves performance on the GenImage benchmark, the core method relies heavily on existing approaches and therefore provides limited new contributions to the synthetic image detection community. However, the inclusion of 3DGS samples in the continual learning setup represents a significant strength supporting acceptance. To further strengthen the paper, the authors should better motivate the method’s novelty and its relevance to the community. Additionally, a useful way to justify the method’s performance would be to compare its cross-generator performance on 3DGS with previous methods (using the base method results from Tables 3 and 4, which show a significant gap between 3DGS and Diffusion or GAN generators).
[1] Christos Koutlis and Symeon Papadopoulos. “Leveraging representations from intermediate encoder-blocks for synthetic image detection.” ECCV 2024
[2] Zifeng Wang, Zheng Zhan, Yifan Gong, Yucai Shao, Stratis Ioannidis, Yanzhi Wang, and Jennifer Dy. “DualHSIC: HSIC-bottleneck and alignment for continual learning.” ICML 2023
- What is the difference between your method (referred to as "Ours" in Tables 1 and 2) and DualHSIC with a CLIP backbone?
- What architecture does the classifier $g_{\theta_g}$ have?
- Where do the real samples in the 3DGS datasets come from? |
Fully human-written |
|
HSIC Bottleneck for Cross-Generator and Domain-Incremental Synthetic Image Detection |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper addresses two critical challenges in synthetic image detection: poor cross-generator generalization and catastrophic forgetting in domain-incremental learning. To tackle these issues, the authors propose two core components: (1) an HSIC (Hilbert-Schmidt Independence Criterion) bottleneck applied to intermediate CLIP ViT features, which suppresses text-image alignment semantics (irrelevant to authenticity) while enhancing discriminative representations for real vs. synthetic images; (2) HSIC-Guided Replay (HGR), a rehearsal strategy that selects per-class exemplars via a hybrid score combining HSIC relevance (information centrality) and k-center coverage (spatial diversity), mitigating forgetting during domain adaptation. Additionally, the authors curate a 3D Gaussian Splatting (3DGS) head avatar benchmark dataset, covering multi-view reconstruction, single-view reconstruction, and generative pipelines, to support domain-incremental evaluation. Empirical evaluations are conducted in two phases: Phase I tests cross-generator transfer between diffusion and GAN models, and Phase II assesses sequential adaptation to 3DGS domains. Results show the HSIC bottleneck improves cross-generator generalization, while HGR sustains prior-domain accuracy during 3DGS adaptation. The paper's main contributions include the HSIC bottleneck design, the HGR rehearsal mechanism, and the 3DGS benchmark dataset.
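For reference, my mental model of the HGR exemplar selection is a greedy trade-off between a relevance score and k-center coverage, roughly as sketched below; the actual relevance definition and weighting in the paper may differ, so this is an assumption rather than the authors' algorithm.

```python
import numpy as np

def select_exemplars(feats, relevance, k, lam=0.5):
    """Greedy hybrid selection. feats: (n, d) array, relevance: (n,) array.
    At each step, pick the sample maximizing relevance plus lam times the
    distance to the nearest already-chosen exemplar (k-center-style coverage)."""
    chosen = [int(np.argmax(relevance))]
    for _ in range(k - 1):
        d = np.linalg.norm(feats[:, None, :] - feats[chosen][None, :, :], axis=-1).min(axis=1)
        score = relevance + lam * d
        score[chosen] = -np.inf          # do not re-select
        chosen.append(int(np.argmax(score)))
    return chosen
```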
(1) The HSIC bottleneck innovatively leverages intermediate CLIP features to resolve the interference of text-image alignment semantics (a key limitation of CLIP-based detectors), and its combination with information-theoretic regularization (minimizing input dependence, maximizing label dependence) is theoretically grounded and practically effective.
(2) HGR addresses the inefficiency of traditional replay methods by fusing HSIC relevance (ensuring exemplar informativeness) and k-center coverage (ensuring diversity), achieving compact memory usage while mitigating forgetting--filling a gap in domain-incremental synthetic image detection.
(3) The 3DGS head avatar dataset (with identity-disjoint splits and standardized preprocessing) addresses the lack of benchmarks for rendered synthetic images, supporting research on domain-incremental adaptation to 3D-generated content.
(4) The two-phase evaluation (cross-generator generalization + domain-incremental learning) covers diverse scenarios (diffusion, GAN, 3DGS). The authors compare against 6+ baselines (e.g., CNNSpot, LGrad, UniFD, iCaRL) and conduct detailed ablations (HSIC components, kernel choices, intermediate features), verifying the necessity of each module.
(5) Ablation studies confirm the role of HSIC(x,z) (suppressing input shortcuts) and HSIC(y,z) (aligning with labels), while t-SNE visualizations qualitatively demonstrate that the HSIC bottleneck reshapes features into more separable real/synthetic clusters--strengthening the credibility of the proposed method.
(1) The paper lacks critical implementation specifics. For example, regarding the HSIC bottleneck: the authors mention a "64-D projection" but do not specify the projection layer's structure (e.g., fully connected layer with activation function? Number of neurons in hidden layers, if any?). For training parameters: the learning rate is set to \(10^{-4}\) (SGD), but no details are provided on batch size, number of training epochs, weight decay, or learning rate scheduling (e.g., step decay, cosine annealing)--parameters that directly impact model convergence and performance. For data preprocessing: while "standardized preprocessing" is mentioned for the 3DGS dataset, there is no description of specific steps (e.g., image resizing resolution, normalization mean/std values, whether face cropping is applied for head avatars). Without these details, other researchers cannot replicate the experiments, violating the reproducibility principles of academic research.
(2) The paper's theoretical foundation for the HSIC bottleneck and HGR is insufficient. For the HSIC bottleneck: Equation (6) defines the loss function, but the authors do not analyze its convergence properties (e.g., whether the loss decreases monotonically during training, or under what conditions the model converges to a global optimum). There is also no discussion of why minimizing HSIC(x,z) (input-feature dependence) effectively suppresses text-alignment semantics--only qualitative t-SNE results are provided, lacking quantitative evidence (e.g., semantic similarity scores between features and text captions before/after applying the bottleneck). For HGR: the authors claim the hybrid score (HSIC relevance + k-center coverage) improves exemplar selection, but there is no theoretical justification for why this combination outperforms single-criterion methods (e.g., pure HSIC or pure k-center). For instance, no proof is given that HSIC relevance correlates with exemplar informativeness, or that k-center coverage effectively reduces redundancy. This weakens the method's theoretical rigor.
(3) The evaluation is limited to specific scenarios, failing to test the method's robustness across broader conditions. First, **dataset scope limitation**: The 3DGS benchmark only focuses on head avatars, with no evaluation on 3D-generated non-face scenes (e.g., 3DGS-rendered landscapes, objects). This raises questions about whether the method generalizes to other 3D-rendered content. Second, **synthetic image diversity limitation**: Cross-generator evaluation only includes classic diffusion models (e.g., SDV1.4, ADM) and GANs (e.g., ProGAN, StyleGAN), but not recent variants (e.g., Stable Diffusion 3, GANformer) or hybrid models (e.g., diffusion-GAN hybrids). Third, **image quality robustness**: There is no evaluation of detection performance on low-resolution synthetic images (e.g., 32×32, 64×64) or images subjected to post-processing (e.g., JPEG compression, Gaussian blur, rotation)--common in real-world scenarios. Fourth, **adversarial robustness**: The paper does not test whether the method retains performance under adversarial attacks (e.g., FGSM, PGD attacks on synthetic images to evade detection), a critical consideration for practical deployment.
(4) While the paper compares against multiple baselines, several critical comparisons are missing or insufficient. First, **latest method omissions**: The paper cites baselines up to 2025 (e.g., VIB-Net, 2025) but does not compare against any 2025-post methods (e.g., diffusion-specific detectors or 3DGS-focused detection methods) that may have addressed similar problems. Second, **unclear baseline parameter consistency**: For baselines like UniFD and NPR, the authors do not confirm whether they used the official implementations or default parameters--if the baselines were not optimized (e.g., using suboptimal hyperparameters), the comparison results may overstate the proposed method's advantages. Third, **incomplete cross-method ablation**: For example, when comparing HGR with iCaRL and CBRS, the paper does not conduct ablation studies on combining HGR with other rehearsal strategies (e.g., iCaRL's class-mean herding) to test for synergies. Fourth, **computational efficiency comparison**: No comparison of inference time or training memory usage between the proposed method and baselines is provided--critical for practical deployment (e.g., on edge devices).
(5) Key results are presented unclearly or incompletely, hindering result verification. First, **table data gaps**: Tables 1 and 2 (cross-generator generalization results) contain empty cells (e.g., Table 1's "| 61.61/83.59 60.74/90.24 48.82/47.51 61.43/82.74 | 58.65/51.77 60.30/83.30 89.70/96.59 97.54/99.64 99.49/99.99 88.60/98.44 | | | |") and missing dataset labels for some columns, making it impossible to determine which targets the results correspond to. Second, **lack of quantitative clustering analysis**: While t-SNE visualizations (Figures 2, 6, 7) show qualitative improvements in real/synthetic separation, no quantitative metrics (e.g., silhouette coefficient, Davies-Bouldin index, or inter/intra-cluster distance ratios) are provided to measure clustering quality--weakening the evidence for feature reshaping. Third, **insufficient statistical significance**: Most results report mean accuracy/mAP but lack standard deviations (except in Figure 3) or confidence intervals. For example, Table 3 and 4 (domain-incremental results) do not specify how many runs were averaged, or whether differences between methods are statistically significant (e.g., via t-tests). Fourth, **parameter sensitivity analysis gaps**: The HSIC bottleneck uses λx=900/500 and λy=700/600 for SDV1.4/ProGAN training, but no sensitivity analysis is provided (e.g., how performance changes when λx/λy varies by ±20%, ±50%). Similarly, HGR's λkc (controlling k-center weight) only tests "λkc=0" and "larger values"--no gradient-based analysis of optimal λkc for different datasets.
(6) The domain-incremental phase only evaluates adaptation to 3DGS domains, with several limitations. First, **limited domain diversity**: No adaptation to other emerging synthetic domains (e.g., text-to-video frame extracts, neural radiance field (NeRF)-rendered images) is tested, raising questions about HGR's generalizability to non-3DGS domains. Second, **short adaptation sequence**: Only 3 3DGS sub-domains are used (GHA, SA, GAGAvatar)--no evaluation of long-sequence adaptation (e.g., 5+ domains) to test cumulative forgetting. Third, **fixed memory budget**: The paper uses a fixed keep_frac=0.01 (1% of training samples) for the replay buffer but does not test how memory size impacts performance (e.g., keep_frac=0.005, 0.02) or compare against dynamic memory allocation strategies. Fourth, **no backward transfer analysis**: Backward transfer (improvement in prior-domain performance after adapting to new domains) is a key metric for continual learning, but the paper only reports "preserving prior accuracy" without quantifying backward transfer--failing to fully demonstrate HGR's advantages over baselines.
(7) The paper does not acknowledge or discuss the proposed method's inherent limitations. First, **backbone dependence**: The method relies on CLIP ViT features, but no analysis is provided of performance degradation when using lighter backbones (e.g., MobileNet, EfficientNet) for edge deployment. Second, **data imbalance impact**: The 3DGS dataset uses balanced real/synthetic splits (e.g., GHA: 45,772 real / 45,772 synthetic), but no test of imbalanced splits (e.g., 1:10 real:synthetic) is conducted--common in real-world scenarios where synthetic images may be more abundant. Third, **modal limitation**: Only single-image detection is supported, with no extension to multi-modal synthetic data (e.g., synthetic images with text overlays, audio-synced synthetic video frames). Fourth, **computational overhead of HSIC**: HSIC calculation requires Gram matrix construction and centering, which increases computational complexity--no quantification of training/inference time overhead compared to non-HSIC methods (e.g., how much slower the HSIC bottleneck is than a standard CLIP linear probe).
(8) The related work section has gaps and superficial comparisons. First, **HSIC application gaps**: The paper cites HSIC (Gretton et al., 2005) but does not discuss recent HSIC applications in computer vision (e.g., HSIC for domain adaptation, few-shot learning) or compare how its HSIC bottleneck differs from existing HSIC-based feature regularization methods. Second, **continual learning omissions**: Key rehearsal-based methods (e.g., Memory Replay GANs, Contrastive Replay) are not cited, and no discussion of how HGR differs from contrastive exemplar selection methods is provided. Third, **3DGS detection gaps**: No discussion of existing 3DGS/rendered image detection methods (if any) is provided--failing to position the paper's 3DGS benchmark within the broader literature. Fourth, **superficial baseline analysis**: For baselines like VIB-Net (2025), the paper only states it "uses a variational information bottleneck" but does not compare the HSIC bottleneck (information-theoretic) with VIB (probabilistic) in terms of theoretical framework or performance--missing an opportunity to highlight the HSIC bottleneck's advantages.
(9) The paper states the HSIC bottleneck "concatenates features from 24 intermediate CLIP ViT layers and the final layer" but provides no details on aggregation. First, **layer selection rationale**: No explanation is given for choosing 24 intermediate layers (e.g., why not 12, 36 layers?) or which specific layers (e.g., early, middle, late) are selected. Second, **aggregation method**: Concatenation may lead to high dimensionality (e.g., 25 layers × 768 dim (ViT-B) = 19,200 dim), but no dimensionality reduction (e.g., PCA, t-SNE) or feature fusion (e.g., attention-based fusion) is mentioned--raising questions about computational efficiency and redundancy. Third, **layer-wise contribution**: No ablation of individual layer contributions (e.g., removing early layers) is conducted--failing to identify which layers are most critical for detection.
(10) Qualitative results (e.g., t-SNE, sample images) are not fully analyzed. First, **t-SNE interpretation gaps**: Figures 6 and 7 (t-SNE of CLIP vs. HSIC features) show "tighter clusters" but do not explain why some datasets (e.g., GauGAN in Figure 7) still have overlapping clusters--failing to address the method's limitations for specific generators. Second, **no failure case analysis**: No discussion of misclassified samples (e.g., why some real images are mislabeled as synthetic) or analysis of common artifacts in misclassified synthetic images--critical for guiding future improvements. Third, **3DGS sample visualization**: The paper mentions Figure 1 (3DGS sample images) but does not provide qualitative comparisons of detection performance across 3DGS sub-domains (e.g., why SA has higher accuracy than GAGAvatar)--missing insights into domain-specific challenges.
**To facilitate discussions during the Rebuttal phase, authors are advised to respond point-by-point (indicating the question number).**
(1) Could you provide the following critical implementation details to ensure reproducibility? (a) The exact architecture of the HSIC bottleneck's projection layer (e.g., number of fully connected layers, activation functions, output dimension); (b) Full training hyperparameters (batch size, number of epochs, weight decay, learning rate scheduler, optimizer momentum); (c) Specific data preprocessing steps (image resolution, normalization parameters, face cropping logic for 3DGS avatars); (d) Code for HSIC calculation (e.g., Gaussian RBF kernel bandwidth calculation via median heuristic, Gram matrix centering implementation).
(2) (a) Could you provide a formal analysis of the HSIC bottleneck's convergence (e.g., proof of loss monotonicity or bounds on generalization error)? (b) How do you quantitatively verify that the HSIC bottleneck suppresses text-alignment semantics? For example, using cosine similarity between CLIP features and text embeddings (e.g., "face" captions) before/after applying the bottleneck. (c) Could you provide a theoretical justification for combining HSIC relevance and k-center coverage in HGR (e.g., a bound on the expected error reduction compared to single-criterion selection)?
(3) (a) Did you use official implementations and default hyperparameters for baselines (e.g., UniFD, NPR, VIB-Net)? If not, what modifications were made, and why? (b) Could you add comparisons with 2025-post synthetic image detection methods (e.g., any new diffusion-specific detectors or 3DGS detection methods)? (c) Could you provide computational efficiency metrics (inference time per image, training memory usage) for your method and baselines on the same hardware (e.g., NVIDIA RTX 4090)?
(4) (a) Could you fill in the missing cells in Tables 1 and 2 and clarify dataset labels for all columns? (b) Could you add quantitative clustering metrics (silhouette coefficient, Davies-Bouldin index) for t-SNE visualizations (Figures 2, 6, 7) to quantify real/synthetic separation? (c) Could you provide standard deviations and 95% confidence intervals for all reported mean accuracy/mAP values, along with the number of independent runs (e.g., 5 runs)?
(5) (a) Could you conduct a sensitivity analysis of HSIC's λx and λy (e.g., λx=300, 700, 1100; λy=500, 700, 900) for SDV1.4 and ProGAN, and plot performance trends? (b) Could you test multiple values of HGR's λkc (e.g., 0.1, 0.5, 1.0, 2.0) and analyze how it impacts exemplar selection and domain-incremental performance? (c) Could you explain the rationale for choosing keep_frac=0.01 and test keep_frac=0.005, 0.02 to show memory-size impact?
(6) (a) Could you extend the domain-incremental evaluation to include non-3DGS domains (e.g., NeRF-rendered images, SDv3-generated images) to test HGR's generalizability? (b) Could you evaluate long-sequence adaptation (e.g., 5+ domains) and report cumulative forgetting curves? (c) Could you quantify backward transfer (using the formula: Backward Transfer = (Accuracy after new domain - Accuracy before new domain) / Accuracy before new domain) for all prior domains?
(7) (a) Could you evaluate detection performance on low-resolution synthetic images (32×32, 64×64) and post-processed images (JPEG compression: quality 20, 50; Gaussian blur: σ=1, 3)? (b) Could you test adversarial robustness using FGSM/PGD attacks (ε=0.01, 0.03) and report accuracy degradation? (c) Could you test performance on imbalanced real/synthetic splits (1:5, 1:10) and compare against rebalancing strategies (e.g., class weights)?
(8) (a) Could you test the HSIC bottleneck on lighter backbones (MobileNetV3, EfficientNet-B0) and report performance vs. efficiency trade-offs? (b) Could you extend the method to multi-modal data (e.g., synthetic images with text overlays) by fusing text and image features in the HSIC bottleneck? (c) Could you provide ablation results for CLIP layer selection (e.g., only late layers, only middle layers) to identify the most critical layers for detection?
(9) (a) What are the main limitations of the HSIC bottleneck in practical deployment (e.g., computational cost, backbone dependence)? How do you plan to address them in future work? (b) How does the method perform when synthetic images are designed to mimic real-image statistics (e.g., adversarial synthetic images)? (c) Could you discuss scenarios where the method fails (e.g., specific generators, image types) and provide failure case examples?
(10) (a) Could you explain the "identity-disjoint split" implementation for the 3DGS dataset (e.g., how identities were labeled, tools used for identity verification)? (b) Could you provide the exact sample counts for each sub-dataset in the GAN/diffusion evaluation (e.g., ProGAN: 10k samples, SDV1.4: 15k samples)? (c) Could you release the 3DGS dataset (or a sample subset) and provide access links to facilitate further research?
(11) (a) Could you quantify the training time overhead of the HSIC bottleneck compared to a standard CLIP linear probe (e.g., % increase in epochs per hour)? (b) Could you propose optimizations for HSIC calculation (e.g., batch-wise Gram matrix computation) to reduce overhead?
(12) (a) Could you compare HGR's forward transfer (performance on new domains) with baselines (iCaRL, CBRS) using quantitative forward transfer metrics? (b) Could you analyze why HGR performs better on SA than GAGAvatar in Table 4? Are there domain-specific artifacts that HGR captures more effectively?
(13) (a) Could you test alternative feature aggregation methods (e.g., attention-based fusion, average pooling) for CLIP intermediate layers and compare performance with concatenation? (b) Could you provide a dimensionality analysis of the concatenated features (24 intermediate + 1 final layer) and explain how you avoid overfitting due to high dimensionality?
(14) (a) Could you discuss the method's potential deployment scenarios (e.g., social media content moderation, forensics) and any practical challenges (e.g., real-time inference, scalability)? (b) Could you test the method on a real-world dataset (e.g., Reddit synthetic image subsets) with uncurated synthetic/real images?
(15) (a) Could you conduct a direct comparison of the HSIC bottleneck and VIB-Net's variational bottleneck (e.g., performance on the same test sets, computational cost, robustness to noise)? (b) Could you explain why the HSIC bottleneck is more effective at suppressing text-alignment semantics than VIB? |
Fully AI-generated |
|
HSIC Bottleneck for Cross-Generator and Domain-Incremental Synthetic Image Detection |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose a new bottleneck loss for synthetic image detection based on HSIC. The method computes the Hilbert-Schmidt Independence Criterion (HSIC) between the encoded image embeddings and the labels, and adds it to the binary cross-entropy loss already in use. For the domain-incremental setting, HSIC is used to guide replay. The work includes experiments on cross-generator generalization and continual adaptation, as well as an ablation on the HSIC components and an analysis of the domain-incremental learning.
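For concreteness, the HSIC terms involved are presumably of the standard biased-estimator form; a minimal sketch follows, where the RBF kernel and the median-heuristic bandwidth are my assumptions rather than necessarily the paper's choices.

```python
import torch

def rbf_gram(x, sigma=None):
    # Gaussian RBF Gram matrix; bandwidth via the median heuristic if not given.
    d2 = torch.cdist(x, x) ** 2
    if sigma is None:
        sigma = torch.sqrt(0.5 * torch.median(d2[d2 > 0]))
    return torch.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y):
    # Biased empirical HSIC: trace(K H L H) / (n - 1)^2, with H the centering matrix.
    n = x.shape[0]
    K, L = rbf_gram(x), rbf_gram(y)
    H = torch.eye(n, dtype=x.dtype, device=x.device) - torch.ones(n, n, dtype=x.dtype, device=x.device) / n
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

# Presumed shape of the overall objective (weights are placeholders):
# loss = bce_loss + lam * hsic(embeddings, labels_one_hot)
```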
S1) The intuition portions of the paper are fairly easy to read.
S2) The t-SNE plots are nice included analysis.
S3) The method seems mathematically well-grounded.
W1) In the DIL setting, the comparison methods are out-of-date (the newest being from 2020). This makes it unclear how the presented method compares with SOTA.
W2) In the cross-generator generalization setting, the chosen models are also out-of-date (the newest being from 2022). It would be much more relevant to test on the SOTA generative models being used today, to understand how applicable this method is in practice (e.g. FLUX, Qwen-Image, etc).
W3) The related work section is also not in-depth enough and outdated in places, making it difficult to place the work within contemporary literature. For example, in Section 2.2 (continual learning related work), the newest method is from 2021, while much newer work exists, e.g. [A].
W4) The mathematical background section (2.3) omits explicit definitions of some mathematical notation (variables and functions), which would improve clarity for readers. Most notably $1$, but also e.g. $\mathrm{tr}$ and $I$, should be defined for completeness.
W5) An ablation over the choice of replay would be useful in understanding its role, given it is part of the proposed methodology.
[A] Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need, Wang et al., CVPR 2025.
None at this time |
Fully human-written |
|
Transport Clustering: Solving Low-Rank Optimal Transport via Clustering |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper builds upon the ideas introduced in [1], which establish a connection between the optimal transport problem and the k-means clustering algorithm through a non-negative factorization of the transport plan. A key distinction, however, is that while [1] focuses on clustering a single dataset, the present work extends these concepts to the co-clustering of two datasets — a framework the authors refer to as Transport Clustering. Since solving the low-rank optimal transport problem is NP-hard, the authors propose a multi-step approximation method and derive theoretical approximation ratios under various metric settings, including negative, kernel, and general metrics.
[1] Scetbon, Meyer, and Marco Cuturi. “Low-rank Optimal Transport: Approximation, Statistics, and Debiasing.” Advances in Neural Information Processing Systems 35 (2022): 6802–6814.
The paper is very well written and addresses an important problem which could be used in the development of computational tools for domain registration and alignment. The proposed methods provide a strong foundation for bridging distributional differences between datasets through transport-based formulations. Furthermore, the techniques presented in the paper could be extended to design large-scale, class-conditioned domain registration frameworks, enabling more structured and semantically meaningful alignment across complex datasets.
One major source of confusion for me is the constant switching between the assignment-form and partition formulations of the low-rank optimal transport problem. In the main body, results are stated in assignment form, whereas the proofs in the appendix are written in the partition formulation, which makes it harder for the reader to follow the proofs.
While the authors applied their technique to synthetic data, co-clustering on CIFAR-10, and cellular transcription data, I wonder whether it could also be applied to a small domain-alignment problem. |
Fully human-written |
|
Transport Clustering: Solving Low-Rank Optimal Transport via Clustering |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
Optimal Transport (OT) has been a popular technique in machine learning in recent years. Its low-rank variant has also received wide attention due to its computational and statistical advantages. This paper proposes a method for low-rank OT, called "transport clustering", that approximately solves the low-rank OT problem with a clustering subroutine. The approximation error is theoretically analyzed. Experiments demonstrate advantages over existing low-rank OT methods.
1. The transport clustering method is elegantly designed.
2. The authors provide theoretical proofs for the approximation factors.
3. Experimental results such as Figure 2 illustrate its advantages over existing low-rank OT methods.
1. When the marginal distributions are arbitrary distributions, the authors outline an extension for the method (Line 267-Line 276), but the paper seems to lack theoretical guarantees for such an extension. For example, does problem (9) even have a solution when the marginal distributions are arbitrary? How robust is this extension?
2. The method relies on existing full-rank OT solvers such as the Hungarian algorithm or the Sinkhorn algorithm to register the cost matrix, which undermines the computational efficiency of low-rank OT. The Hungarian algorithm may take prohibitive runtime for large-scale real-world problems, and it may not be easy to extract a permutation matrix from the transport plan obtained by the Sinkhorn algorithm (see the sketch after this list). In addition, in some applications it may be unnecessary to consider low-rank OT after the full-rank OT has already been computed.
3. The considered tasks in Sec. 5 seem a bit too traditional.
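To illustrate the concern in point 2: the entropic plan returned by Sinkhorn is dense, so recovering a permutation for registration requires an additional rounding step, itself a Hungarian-type solve. The snippet below uses POT and SciPy with placeholder data; it is an illustration, not the paper's procedure.

```python
import numpy as np
import ot
from scipy.optimize import linear_sum_assignment

n = 500
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(n, 3)), rng.normal(size=(n, 3))
a = b = np.full(n, 1.0 / n)

C = ot.dist(X, Y)
C = C / C.max()                                   # normalize costs for numerical stability
P_sinkhorn = ot.sinkhorn(a, b, C, reg=0.05)       # dense entropic plan
row, col = linear_sum_assignment(-P_sinkhorn)     # extra rounding step to get a permutation
```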
See above. |
Fully human-written |
|
Transport Clustering: Solving Low-Rank Optimal Transport via Clustering |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
Low-rank optimal transport (OT) aims to approximate the optimal transport plan by constraining it to have small nonnegative rank, effectively routing mass through a few latent anchors. This structure naturally induces a clustering of the supports—points that share an anchor form co-clusters across the source and target distributions. While the motivation for low-rank OT is scalability, the optimization problem itself is nonconvex and NP-hard, in contrast to the convex linear formulation of classical OT. However, practical algorithms for low-rank OT often lead to scalable approximations in practice.
This paper takes the inverse view: instead of discovering clusters through a low-rank constraint, it first computes a full OT map to register the supports, then clusters the induced correspondences to identify anchors, and finally reconstructs a low-rank coupling consistent with this clustering. The resulting “transport clustering’’ approach reduces low-rank OT to a single generalized k-means problem and achieves a provable constant-factor approximation (between 2× and 3×, depending on the cost function) to the best rank-K OT cost for metric or kernel costs.
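A rough sketch of the pipeline as I understand it, with my own placeholder steps: exact registration via `ot.emd`, a vanilla k-means in place of the paper's generalized K-means, and a naive rank-K reconstruction. With equal-size uniform marginals the exact plan is, up to ties, a permutation, which the argmax step below assumes.

```python
import numpy as np
import ot
from sklearn.cluster import KMeans

n, d, K = 200, 5, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                 # source support
Y = rng.normal(size=(n, d)) + 1.0           # target support
a = b = np.full(n, 1.0 / n)                 # uniform marginals

# 1) One full OT solve registers the two supports.
P = ot.emd(a, b, ot.dist(X, Y))
match = P.argmax(axis=1)                    # induced source -> target correspondence

# 2) Cluster the registered pairs into K co-clusters (anchors).
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(
    np.concatenate([X, Y[match]], axis=1))

# 3) Rebuild a rank-K coupling P = Q diag(1/g) R^T routing mass through the anchors.
Q = np.zeros((n, K))
R = np.zeros((n, K))
Q[np.arange(n), labels] = a
R[match, labels] = b
g = Q.sum(axis=0)                           # anchor (cluster) masses
P_lowrank = Q @ np.diag(1.0 / g) @ R.T
```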
The presentation is somewhat confusing. The introduction motivates low-rank OT as a scalable alternative to full OT, yet the proposed method begins by computing a full OT map, which appears contradictory. The paper does not clearly emphasize that the full OT is computed only once to extract reusable structure for downstream efficiency. As written, it reads as if the “scalable’’ approach is computationally harder, whereas the real contribution is to show that—despite the NP-hardness of low-rank OT—one can obtain a constant-approximation solution through a simple clustering reduction built from a single OT computation.
The paper tackles an interesting compression problem where the goal is to incur a one-time OT computation cost in order to learn clusters that enable faster and more structured future OT computations. The connection to generalized k-means is elegant, and the constant-approximation result is novel and theoretically clean.
The presentation is lacking in clarity; it takes significant effort to piece together the motivation and the sequence of reductions. The introduction could better separate what is meant by “scalability” in this context.
The experiments do not account for the cost of the initial full-rank OT computation.
How does this paper relate to the result of Indyk and Price (STOC 2011), where k-median clustering provides a sparse, relative approximation of the Earth Mover’s Distance? Could their compression-based approach be viewed as an implicit low-rank or clustered approximation to OT? |
Lightly AI-edited |
|
Transport Clustering: Solving Low-Rank Optimal Transport via Clustering |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes “transport clustering’’ (TC), which reduces the low-rank OT problem to a generalized $K$-means task by first registering the cost matrix with the optimal full-rank transport plan and then solving a single clustering subproblem. The authors obtain polynomial-time constant-factor guarantees—$(1+\gamma)$ for negative-type metrics and $(1+\gamma+\sqrt{2\gamma})$ for kernel costs—and instantiate the idea with mirror-descent (GKMS) and SDP-based solvers. In the experiments, TC outperforms prior low-rank OT methods on synthetic and real datasets.
The reduction from low-rank OT to a single clustering problem via cost registration is, in my opinion, novel: it conceptually unifies co-clustering with $K$-means while providing constant-factor guarantees, it can leverage mature toolboxes such as Lloyd iterations and SDP relaxations, and it consistently improves transport costs over LOT, FRLC, and LatentOT in the experiments. Altogether, this is a genuinely useful new perspective on low-rank OT.
TC still requires a full OT solve up front, so the practical savings are unclear without runtime data. The constant-factor guarantees depend on assumptions and on a small gap $\gamma$ that are not quantified empirically. Registration for non-square or weighted marginals is only sketched, with no experiments. Implementation details (for example, ensuring positive column sums, initialization fairness, and iteration counts) are lacking, and the cost registration may create prohibitive memory demands.
- How sensitive is TC to errors in the transport registration?
- Do the constant-factor guarantees still hold when $n\neq m$ or when the marginals are non-uniform? If so, can you illustrate Kantorovich registration with at least one experiment in that setting?
- Could you provide ablations where baseline solvers are initialized with the same transport-registration (e.g., feeding TC's clusters into LOT/FRLC) to isolate whether the gains come from the reduction or from better initialization?
- Since Algorithm~1 begins by solving a full OT problem, can you quantify the end-to-end complexity savings? In settings where the standard OT solve already dominates runtime, does the K-means reduction still offer any benefit? |
Fully AI-generated |
|
QORA: A Sustainable Framework for Open-World Generative Model Attribution with Quasi-Orthogonal Representation Disentanglement |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work proposes Quasi-Orthogonal Representation Attribution (QORA), a unified framework for sustainable, open-world generative-model attribution. QORA comprises two core modules. The Progressive Orthogonal Learning Module (POLM) uses Stiefel-manifold optimization to construct a quasi-orthogonal feature space, reducing redundancy while preserving a stable attribution subspace under open-world shifts. The Fingerprint Disentanglement and Enhancement Module (FDEM) leverages classifier-guided attention and multi-auxiliary contrastive learning to disentangle and amplify model-specific fingerprints. Across GAN and diffusion benchmarks, QORA achieves superior closed-set accuracy, strong open-set robustness, and stable performance during incremental learning.
1. The paper proposes a unified framework for sustainable, open-world generative-model attribution.
2. The introduced Progressive Orthogonal Learning Module (POLM) and Fingerprint Disentanglement and Enhancement Module (FDEM) are well-motivated and technically sound.
3. Extensive experiments cover both GAN and diffusion generators, showing competitive closed-set accuracy and strong open-set performance.
1. The novelty appears limited: the classifier-guided channel attention, the orthogonal learning component, and the fingerprint disentanglement all seem closely related to prior CAM/score-guided attention methods.
2. Report the computational cost of the full framework, especially the Stiefel-constrained MLP, such as FLOPs, parameters, and memory usage.
3. Provide ablation studies for each component of the framework (e.g., POLM, FDEM), including turning individual FDEM losses on/off and sweeping the loss weights used in Eq. 12.
4. Explain the class-prototype update policy in detail.
1. Please add a comparative analysis situating QORA’s components relative to prior work, for example, clarifying how the classifier-guided channel attentions differ from CAM/score-guided attention.
2. Please include computational cost results and analysis for the full framework (e.g., parameter counts, training/inference time, and memory).
3. Please provide comprehensive ablations covering each component of the framework, including the impact of individual losses and their weights. |
Moderately AI-edited |
|
QORA: A Sustainable Framework for Open-World Generative Model Attribution with Quasi-Orthogonal Representation Disentanglement |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes QORA (Quasi-Orthogonal Representation Attribution), a framework for attributing generated images to their source models in an open-world setting. The system consists of two main components: 1. POLM (Progressive Orthogonal Learning Module), which creates a quasi-orthogonal feature space using Stiefel manifold optimization; 2. FDEM (Fingerprint Disentanglement and Enhancement Module), which uses classifier-guided attention to isolate and amplify model-specific fingerprints.
1. This paper addresses a critical need for AI-generated content attribution in an evolving landscape of generative models.
2. The method supports incremental learning without full retraining, which is crucial for scalability.
1. The presentation lacks clarity. For example, critical details are missing from the pipeline illustrated in Figure 1, making it difficult to trace the flow of computation. The input's entry point is ambiguous, and the derivation of key components, such as the enhanced feature $f_p^i$, the auxiliary noise feature $f_r^i$, and the weight $w_y$, is not specified.
2. The method section only describes how the objectives are designed; it offers no explanation of why these objectives are necessary and sufficient.
3. The mixing of residuals, noise, and other-class fingerprints appears arbitrary and is not sufficiently motivated. It is unclear why and how features can be divided into these three types.
1. How do you actually reject unknown models? What's the threshold?
2. Are there any FLOPs, memory, or inference-time comparisons? |
Lightly AI-edited |
|
QORA: A Sustainable Framework for Open-World Generative Model Attribution with Quasi-Orthogonal Representation Disentanglement |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces QORA, a practical and sustainable framework for open-world generative model attribution. QORA combines a Progressive Orthogonal Learning Module (POLM), leveraging Stiefel manifold optimization to enforce quasi-orthogonal embeddings, and a Fingerprint Disentanglement and Enhancement Module (FDEM) that disentangles and amplifies model-specific generative fingerprints using classifier-guided attention and contrastive learning. QORA supports efficient incremental learning via exemplar replay and classifier initialization, without retraining the backbone. Extensive experiments on GAN- and diffusion-based benchmarks verify the efficacy of the proposed method.
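For reference on the POLM update, a standard Cayley-transform retraction on the Stiefel manifold looks as follows; this is a generic sketch of the technique, and QORA's exact update rule and step size may differ.

```python
import torch

def cayley_step(W, G, lr=1e-2):
    # W: (n, p) matrix with orthonormal columns; G: Euclidean gradient of the loss w.r.t. W.
    # One orthogonality-preserving update: W <- (I + lr/2 * A)^{-1} (I - lr/2 * A) W,
    # where A = G W^T - W G^T is skew-symmetric.
    n = W.shape[0]
    A = G @ W.T - W @ G.T
    I = torch.eye(n, device=W.device, dtype=W.dtype)
    return torch.linalg.solve(I + 0.5 * lr * A, (I - 0.5 * lr * A) @ W)
```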
1. The dual-module architecture is thoughtfully designed: POLM enforces quasi-orthogonality with Stiefel manifold optimization to reduce redundancy and stabilize the attribution subspace. FDEM proposes a concrete disentanglement mechanism using classifier-guided attention maps and a contrastive, prototype-centric loss.
2. The integration of lightweight, memory-efficient incremental updates via exemplar replay and feature-similarity-based classifier initialization allows QORA to scale with emerging models while mitigating catastrophic forgetting.
3. The extensive experiments validate the effectiveness of the proposed modules and demonstrate their efficiency in handling incremental learning tasks.
1. The motivation behind the module design in the paper is unclear. For instance, the design of the quasi-orthogonal space lacks a strong connection to the source attribution task. The authors claim that this design reduces redundancy, but no corresponding experiments or visualizations are provided to validate this assertion.
2. The experimental details are inadequately described. In the comparison presented in Table 4, how were the baseline methods fine-tuned for incremental learning scenarios? Were they simply retrained by directly combining new data with the original dataset?
3. The ablation study is insufficient, as it lacks experiments on key hyperparameters such as the coefficient of the exponential moving average. Moreover, the FDEM module proposes two types of negative sample representations, how does each contribute to the performance of the method?
1. Can you elaborate on the degree to which QORA’s gains stem from the CLIP backbone versus the custom POLM or FDEM modules? Have you considered alternative backbones or initializations for true ablation? Please present additional evidence if available.
2. Regarding the claim about the quasi-orthogonal space, could you provide additional validation beyond final performance metrics, such as dedicated ablation experiments or visualization analyses, to substantiate the alleged reduction in redundancy?
3. For the Stiefel manifold optimization and Cayley transform, could you describe computational cost, convergence rate, and robustness to hyperparameters or initialization? Is there a principled justification for the chosen learning rates and orthogonality penalties? |
Moderately AI-edited |
|
QORA: A Sustainable Framework for Open-World Generative Model Attribution with Quasi-Orthogonal Representation Disentanglement |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a novel solution to the open-world generative model attribution problem, which extends standard generative model attribution to handle samples from unknown sources as well as continuously emerging new models. The solution features two core modules: a Progressive Orthogonal Learning Module (POLM), which reduces redundancy while enlarging the distances between different source models, and a Fingerprint Disentanglement and Enhancement Module (FDEM), which disentangles model-specific fingerprints.
- The paper studies a pressing topic in the generative AI era. Generative model attribution (especially under the "open-world" definition) is more realistic than previous settings.
- The experiments show clear improvements for the new method.
- The paper also creates an experimental setting for continuously emerging new models in generative model attribution, which is suitable for follow-up work.
- The motivation behind the method design is not clear (or not presented well). Here are some questions whose answers would help me assess the paper more accurately.
- The high-level reason why we need an orthogonally constrained encoder with a dimension-wise normalized classifier.
- How do POLM and FDEM intuitively improve the results? Figure 5 and Table 3 provide empirical ablation results, but the gaps between the different ablation settings are small.
- Figure 3 is very hard to read, and the design motivation is difficult to understand from it. One suggestion is to include a small subsection in Section 3, right after the problem definition, that introduces how the problem is solved in previous work (with some formulas) and describes the weaknesses and the research gap. After that, the proposed method would be much easier to follow.
See weakness. |
Fully human-written |