|
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Fast-dLLM, a training-free acceleration framework for diffusion-based LLMs that addresses their slow inference speed compared to AR models. The method incorporates two key innovations: (1) a block-wise approximate KV Cache mechanism tailored for bidirectional attention that enables cache reuse with negligible performance drop, and (2) a confidence-aware parallel decoding strategy that selectively generates multiple tokens simultaneously based on confidence thresholds to preserve generation quality. Experimental results on LLaDA and Dream models across multiple benchmarks demonstrate remarkable throughput improvement with little accuracy loss, effectively closing the performance gap with autoregressive models.
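For concreteness, here is a minimal sketch of how a confidence-thresholded parallel unmasking step could look; the function name, tensor shapes, and fallback rule are illustrative assumptions rather than the paper's actual implementation:

```python
import torch

def confidence_parallel_step(logits, is_masked, threshold=0.9):
    """One illustrative decoding step: decode in parallel only those masked
    positions whose top-1 probability exceeds the confidence threshold."""
    probs = torch.softmax(logits, dim=-1)
    conf, top1 = probs.max(dim=-1)            # per-position confidence and argmax token
    accept = is_masked & (conf >= threshold)  # positions unmasked this step
    if is_masked.any() and not accept.any():
        # Always decode at least the single most confident masked position,
        # so the loop makes progress even when no position clears the threshold.
        idx = torch.where(is_masked, conf, torch.full_like(conf, -1.0)).argmax()
        accept[idx] = True
    return top1, accept

# Toy usage: random logits stand in for a real diffusion-LLM forward pass.
logits = torch.randn(8, 100)                  # (seq_len, vocab)
is_masked = torch.ones(8, dtype=torch.bool)
tokens, accepted = confidence_parallel_step(logits, is_masked)
```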
1. Novel adaptation of KV Cache to bidirectional diffusion models via block-wise approximation, with insightful analysis showing high cosine similarity between adjacent steps.
2. Theoretical foundation with Theorem 1 proving the equivalence between greedy parallel and sequential decoding under certain conditions.
3. Comprehensive ablation studies covering key hyperparameters (block sizes, thresholds, generation lengths) and evaluation across models and benchmarks.
The evaluation relies on only four benchmarks (GSM8K, MATH, HumanEval, and MBPP for LLaDA), which primarily cover math reasoning and code generation. Important capability dimensions such as commonsense reasoning (HellaSwag), factual knowledge retrieval (TriviaQA), and real-world code generation (LiveCodeBench, BigCodeBench) are missing; covering them would give a more comprehensive picture of the method's generalization and potential failure modes across diverse task types.
1. Why does the throughput plateau as the block size continues to increase in Fig. 4?
2. Is the KV activation similarity pattern largely consistent, or can it be different among layers? |
Lightly AI-edited |
|
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes a novel training-free acceleration framework Fast-dLLM, which aims to address two key bottlenecks in Diffusion LLMs: the lack of KV caching and the quality degradation associated with parallel decoding. First, the paper introduces a block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models. This approach is justified by demonstrating the high similarity of KV activations across adjacent inference steps, which enables cache reuse with a negligible performance drop. Second, the paper posits that the quality degradation in parallel decoding stems from the conditional independence assumption, which disrupts critical token dependencies. To mitigate this, it proposes a confidence-aware parallel decoding strategy that selectively decodes only those tokens exceeding a confidence threshold. This approach is shown to reduce dependency violations while maintaining generation quality. The authors validate their method through comprehensive experiments on the LLaDA and Dream models across multiple LLM benchmarks, and results demonstrate that Fast-dLLM achieves significant throughput improvements (up to 27.6x) while incurring negligible degradation in generation quality.
1. The paper introduces a novel, training-free framework that tackles two foundational challenges in dLLM inference.
2. The proposed methods are well justified and supported by solid theory and empirical observations. The approximate KV cache is empirically validated by the high similarity of KV activations in adjacent steps. The parallel decoding strategy is theoretically supported by Theorem 1, which proves the equivalence of greedy parallel and sequential decoding under high-confidence conditions, motivating the threshold-based approach. Furthermore, the identification of these two phenomena is a valuable contribution, offering clear insights and new avenues for addressing these challenges in dLLM inference.
3. The paper provides extensive and sufficient experiments. Main results validate the method's effectiveness. A wide range of ablation studies on variables (e.g., cache block size, confidence thresholds, generation lengths) assess scalability and robustness, with clear analysis and interpretation of the results.
4. The paper is well written and clear. The structure is logical and easy to follow, making the core concepts and contributions readily understandable.
The experimental evaluation is primarily limited to comparisons against the baseline dLLM pipelines. The paper lacks a comparison to other existing acceleration techniques for diffusion LLMs.
1. Appendix C.4 demonstrates that the 'Factor' decoding strategy significantly outperforms the 'Threshold' strategy in terms of throughput, with only a minor accuracy trade-off. However, the main experimental results (e.g., Table 1) are reported using the slower 'Threshold' strategy. Could the authors clarify why the seemingly superior 'Factor' strategy was not used for the main results? Does the 'Factor' strategy have other undiscussed drawbacks that led to this decision?
2. Could the authors comment on the robustness of the approximate KV cache strategy for very long sequence generation? Is there a risk that approximation errors will accumulate over many blocks, leading to a loss of attentional fidelity to the initial prompt?
3. The qualitative comparisons in Appendix B are based on a simple arithmetic prompt. Could the authors provide more diverse qualitative examples (e.g., generating idioms or logical pairs) where the risk of dependency failure is more prominent? |
Fully human-written |
|
MARWA: Multi-agent retrieval-augmented framework for reliable bioinformatics workflow automation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents an innovative Multi-Agent Retrieval-Augmented framework aimed at enhancing bioinformatics workflow automation. The approach leverages multi-perspective LLM-enhanced tool descriptions combined with contrastive representation learning to achieve robust semantic representations of bioinformatics tools, ultimately improving tool retrieval accuracy. The evaluation dataset constructed for this purpose could serve as a valuable resource for the research community.
1. The integration of multi-perspective LLM-enhanced tool descriptions is a promising approach for tool selection in complex scientific domains, potentially benefiting agent systems.
2. The two proposed datasets could significantly aid in the evaluation of bioinformatics agents.
1. The claim of complete automation in the current workflow seems somewhat overstated. Is there any algorithmic illustration provided? Is the sequence of operations predefined?
2. The framework includes six cooperative LLM-based expert agents. Are the same models used across all six components, or are there distinct characteristics for different experts? Insights into model selection would be beneficial.
3. There is a sentence structure issue on lines 263-264 that needs clarification.
4. Is the file system intended to be multimodal?
5. Which specific LLMs are used as evaluators? What distinguishes the evaluation process from the judging operation?
6. A more detailed presentation of the dataset's difficulty and characteristics would enhance clarity. Additionally, what are the cost differences between MARWA and test-time scaling with powerful LLM-only methods in solving the task?
Identical to the 'Weaknesses' noted |
Moderately AI-edited |
|
MARWA: Multi-agent retrieval-augmented framework for reliable bioinformatics workflow automation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a multi-agent retrieval-augmented framework called MARWA for reliable bioinformatics workflow automation. MARWA's architecture is composed of six specialized LLM-based agents (Analyzing, Planning, Selecting, Generating & Executing, Debugging, and Judging) that operate in a step-by-step, closed-loop fashion. The authors also design an embedding method called LAFT, based on contrastive-learning fine-tuning of a pretrained BERT model. Experiments show that MARWA consistently outperforms baselines like AutoBA and BioMaster, particularly in generating correct installation commands and file paths, leading to higher workflow success rates.
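As an illustration of the kind of contrastive fine-tuning described above (LAFT on a pretrained BERT), here is a minimal InfoNCE-style sketch; the query/description pairing, model choice, and temperature are assumptions for illustration, not the paper's configuration:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def info_nce(query_emb, doc_emb, temperature=0.05):
    """In-batch InfoNCE: the i-th query's positive is the i-th description;
    all other descriptions in the batch act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

# Hypothetical pair: a task-style query and an LLM-generated tool description.
queries = ["align paired-end RNA-seq reads to a reference genome"]
tools = ["HISAT2: fast and sensitive spliced alignment of sequencing reads"]

q_in = tok(queries, return_tensors="pt", padding=True, truncation=True)
d_in = tok(tools, return_tensors="pt", padding=True, truncation=True)
q_emb = enc(**q_in).last_hidden_state[:, 0]   # [CLS] embedding as sentence vector
d_emb = enc(**d_in).last_hidden_state[:, 0]
loss = info_nce(q_emb, d_emb)
loss.backward()                               # fine-tune the encoder on this objective
```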
S1. The six-agent, step-by-step framework is a well-reasoned and significant improvement over one-shot generation. By breaking the complex problem of workflow creation into discrete and verifiable stages, the system introduces robustness and error-handling capabilities that are critical for this domain.
S2. Experiments show that MARWA and LAFT outperform other baseline methods.
W1. The paper emphasizes that the proposed method is specifically designed for bioinformatics workflow automation. However, although the evaluation datasets are related to bioinformatics, the architectures of the proposed MARWA and LAFT methods do not appear to have domain-specific optimizations for bioinformatics. Could the authors leverage the characteristic structures of bioinformatics data to optimize the model framework itself (rather than only the prompts)?
W2. In the embedding section, fine-tuning with contrastive learning is already a well-established approach for training embedding models. This work merely uses LLM-generated synthetic data to fine-tune the embedding model (BERT), without introducing a novel method. In addition, state-of-the-art embedding models are often based on decoder-only architectures with larger parameter scales, such as BGE-EN-ICL and Qwen3 Embedding.
W3. The experiments show that MARWA achieves a higher success rate compared to baseline methods. However, given its highly complex workflow structure (including six agents and a loop structure), it is expected to consume significantly more tokens and inference time than other methods. The experiments, however, do not evaluate MARWA’s token costs, inference time, or similar metrics.
1. Based on Table 1, can the authors compare retrieval performance when replacing the BERT model with more recent open-source base embedding models, such as BGE-EN-ICL or Qwen3 Embedding?
2. Can the authors design experiments to evaluate the cost of MARWA, such as the number of tokens consumed and the average inference time?
3. Can the authors compare MARWA with some general-purpose agent methods, such as ReAct? |
Fully human-written |
|
MARWA: Multi-agent retrieval-augmented framework for reliable bioinformatics workflow automation |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a multi-agent retrieval enhancement framework, MARWA, to address the robustness and scalability issues in bioinformatics workflow automation.
With the rapid growth of multi-omics data, bioinformatics analysis workflows are becoming increasingly complex, and manually constructing workflows is both time-consuming and error-prone.
Existing automation methods based on Large Language Models (LLMs) suffer from issues such as "one-off generation," inaccurate tool retrieval, and insufficient evaluation. MARWA significantly improves the accuracy of tool retrieval and command generation by employing six collaborative agents (analysis, planning, selection, generation and execution, debugging, and judgment), combined with retrieval augmentation (RAG), multi-perspective LLM tool descriptions, and contrastive learning.
Furthermore, the paper proposes a two-stage evaluation system that combines expert execution and large-scale LLM evaluation. Experiments demonstrate that MARWA outperforms existing methods across pass rate, workflow quality, and scalability, laying the foundation for reliable bioinformatics automation workflows.
- This work employs a multi-agent collaborative architecture, involving a series of intricate processes including analysis, planning, selection, execution, debugging, and judgment, thereby significantly improving process robustness and flexibility.
- The method uses an LLM to generate multi-perspective tool descriptions and optimizes tool embeddings through contrastive learning on BERT, achieving tool retrieval accuracy higher than mainstream baselines.
- Real-world execution and large-scale evaluation are combined: 40 expert validation tasks and 2270 LLM evaluation tasks provide a comprehensive evaluation system, making the results highly convincing.
- While the evaluation system is comprehensive, the large-scale tasks rely primarily on automated evaluation by an LLM, so only a limited number of tasks are actually executed. Furthermore, for biomedicine, can over-reliance on automated evaluation accurately reflect real-world usability?
- Expansion of the tool database primarily relies on manual verification and command logging, leaving room for improvement in automation and scaling up.
- The description of the contrastive learning component is rather brief, lacking sufficient disclosure of hyperparameters, training set size, and other details.
- Could the multi-perspective generation of tool descriptions in MARWA's retrieval augmentation lead to semantic drift due to LLM hallucination? How can description consistency be guaranteed?
- Are there compatibility solutions for the file system interface across different operating systems (e.g., Windows, macOS)? How general is its practical deployment?
- In large-scale evaluations, is there more detailed statistical analysis of the accuracy of LLM automatic scoring and its consistency with expert scoring?
- How adaptable is MARWA to new tools or parameter changes? Does it support automatic tool version identification and compatibility?
- Will multi-agent collaboration lead to significant computational resource consumption? What are the actual hardware requirements for deployment? |
Moderately AI-edited |
|
MARWA: Multi-agent retrieval-augmented framework for reliable bioinformatics workflow automation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper tackles the challenge of **automating bioinformatics workflows**, where existing LLM-based systems often fail due to **ambiguous task definitions, heterogeneous tools, and unreliable one-shot generation**.
To address these issues, the authors propose **MARWA**, a multi-agent retrieval-augmented framework that **decomposes workflow construction into modular stages with dedicated agents** for task clarification, tool retrieval, command synthesis, and error correction.
MARWA enhances tool selection through contrastive-learning-based retrieval and validates reliability via a two-stage evaluation combining expert execution and large-scale LLM-based assessment.
Experiments demonstrate that MARWA substantially improves workflow accuracy, robustness, and scalability over existing baselines.
The paper designs a comprehensive end-to-end execution pipeline that covers the entire process,including task analysis, tool selection and workflow execution and validation.
This holistic design ensures not only that each stage is logically grounded and traceable, but also that potential errors can be detected and corrected through contextual feedback, significantly enhancing overall reliability.
The paper mainly proposes solutions to surface-level problems without uncovering the deeper reasoning gaps between the task requirements and the chosen methods.
For instance, it claims that replacing a one-shot generation process with a step-by-step approach improves workflow reliability, yet it never explains *why* one-shot generation fails in this task setting, *what specific reasoning capabilities* are lacking, or *why* step-by-step reasoning would inherently address them.
Since step-by-step generation is already a standard configuration for large language models, this modification only strengthens a weak baseline rather than constituting a principled innovation.
Moreover, the framework appears as an incremental extension or a more fine-grained reactive pipeline, without demonstrating true multi-agent cooperation or emergent division of labor. Consequently, the proposed contributions lack clear conceptual novelty and task-specific motivation, resulting in a framework that feels more like a layered adaptation of existing paradigms than a genuinely innovative approach.
1. Have previous studies already introduced benchmark datasets or evaluation protocols for workflow automation, and how were these used to assess reliability or execution quality?
2. What specific improvements or novelties do the proposed dataset and evaluation metrics in this paper provide beyond existing ones — in terms of coverage, realism, or reproducibility?
3. Has the paper reported the computational or financial cost (e.g., model inference time, agent coordination overhead, or GPU usage) associated with the multi-agent setup, and how does this compare to the baselines? |
Fully AI-generated |
|
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper studies the use of a pretrained small language model (SLM) to train larger ones via knowledge distillation (KD).
The paper first provides a theoretical study of the risk bound, including the effects of KD.
It then proposes a two-stage pretraining approach (SALT) that reuses the SLM: first, perform KD to train the model (optionally selecting learnable data to further accelerate training); then, pretrain the model as usual.
This exploits the intuition that the SLM can help with easier tasks early on, while in the later stage the capacity of the larger model can be leveraged to train it further.
The paper provides experimental validation comparing mainly against from-scratch pretraining and KD (without the second stage). Results for post-training are also given.
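To make the two-stage recipe concrete, a minimal sketch of how the loss schedule could be implemented is given below; the KD weight, temperature, and transition step are placeholders rather than the paper's actual values:

```python
import torch
import torch.nn.functional as F

def salt_loss(student_logits, teacher_logits, targets, step, n_kd, kd_weight=0.5, tau=1.0):
    """Stage 1 (step < n_kd): mix soft-label KD from the small teacher with the
    standard next-token loss. Stage 2 (step >= n_kd): plain next-token loss only."""
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), targets.view(-1))
    if step >= n_kd:
        return ce
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return kd_weight * kd + (1.0 - kd_weight) * ce

# Toy shapes (batch=2, seq=4, vocab=10); random tensors stand in for model outputs.
s_logits = torch.randn(2, 4, 10, requires_grad=True)
t_logits = torch.randn(2, 4, 10)
targets = torch.randint(0, 10, (2, 4))
loss = salt_loss(s_logits, t_logits, targets, step=1000, n_kd=36_000)
loss.backward()
```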
- The paper provides both theoretical and empirical contributions to KD with smaller models.
- Experiments are performed with care, with ablation studies given to justify the hyperparameters used, etc.
1. The data selection method is not very practical, as it involves using an early model checkpoint, which is not typically available if one wishes to use off-the-shelf public models. Moreover, it introduces additional hyperparameters to tune.
2. Lack of baselines: I think the proposed method should be compared to model growth methods as baselines. These methods are quite straightforward compared to SALT (simply stacking or expanding the widths works well; [2405.15319]) and also reuse small models to accelerate the training of large ones.
3. While the theory provides some clue on why KD helps, it does not provide concrete guidance on when or how to transition from KD to pretraining without it. Using a two-stage training process also works well intuitively without the theory (a weak teacher can provide useful supervision to a strong but blank-slate student only in the beginning), making the theoretical contribution seem redundant.
1. The accuracies seem to be converging at larger training-step counts. I wonder whether, at even larger step counts, from-scratch pretraining becomes better and KD becomes unnecessary?
2. The improvement over from-scratch pretraining seems small. I wonder whether, when the baseline and KD are compared under equal compute (KD requires extra compute due to SLM inference), from-scratch pretraining could perform better? |
Fully human-written |
|
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose SALT (Small‑model Aided Large‑model Training), a two‑stage pre‑training recipe: early knowledge distillation (KD) from a smaller teacher followed by standard next‑token training; an extended variant (SALTDS) uses the SLM to select “challenging yet learnable” sequences for the KD phase (Algorithm 1, p. 5). A statistical framework (Theorems 3.2 & 3.4) provides excess risk bounds for language modeling under KD, highlighting a bias–variance trade‑off where SLM‑provided soft labels can reduce variance if teacher–data divergence remains small (pp. 3–4). Empirically, 2.8B/8.6B students pre‑trained on The Pile with UL2 show that SALT/SALTDS reach or exceed baseline few‑shot performance using 70% of training steps and achieve ~25–29% wall‑clock savings, with further downstream gains after supervised fine‑tuning (Tables 1–4, pp. 7–8; Appendix J, p. 44). A teacher‑size ablation (Table 3, p. 8) shows benefits diminish when the teacher is very small (0.5B).
- Clear, simple method with practical payoff. SALT's two-stage schedule is easy to implement and consistently improves overall few-shot averages at the same step budget, and matches/beats baselines at 70% steps, delivering ~25-29% time savings in the authors' TPU setup (Table 2, Table 13).
- Theory tailored to language modeling under KD. The paper gives sequence‑level generalization bounds for KD in LMs (Theorem 3.4), relating gains to reduced variance vs teacher–data divergence, and motivates selective/early KD (Remark 3.5 & §3.3).
- Thoughtful diagnostics and positioning. The histograms (Fig. 2; Appendix C) empirically support the stability assumption; the bucketed analyses (e.g., Table 5 on XLSum-EN; Appendix M) illustrate that early KD mostly helps "easy" slices, which fits the theory and design.
- Robustness checks. Ablations on transition strategies and KD duration and on teacher quality (Table 21) give a fuller picture; the method still helps across two student sizes (2.8B & 8.6B).
1. Fairness of the efficiency claim at 70% steps. The core headline, "SALT surpasses BASELINE at 146k (~70%) steps" (Table 2, Table 13), is not matched with a BASELINE@70% few-shot evaluation. The only "early" baseline shown near the main text is BASELINE@36k (Table 1 & Fig. 3), which is too early to compare to 146k. Without BASELINE@146k, it is hard to attribute step-efficiency to KD rather than general training dynamics. Please include BASELINE evaluated at the identical step counts reported for SALT/SALTDS.
2. Theory-practice gap in key assumptions. The token-level bound (Theorem 3.4) assumes a finite function class Θ and bounded per-token log-loss (Assumption 3.1), and quantifies bias via TV distance to the data conditional distribution, a quantity that is intractable to estimate in practice. The empirical proxy uses completions from a strong LM "oracle", which approximates model-to-model divergence, not teacher-to-data divergence; this weakens the validation of DIV. Clarify how the bounds should be interpreted operationally given these gaps.
3. Baselines could be stronger. Reverse KD (RKD) is a useful strawman but not a competitive baseline. Consider curriculum KD, top-k token KD (Appendix A.2, p. 19) with tuned k, or self-distillation controls. For SALTDS, a data-selection baseline that mimics the same selected subset without KD would isolate the data-selection contribution more fairly.
4. Statistical reporting. Few-shot results are single-run with no seeds/variance/confidence intervals. Given the small deltas (e.g., +0.62 average points for 2.8B at 100% steps; Table 2), error bars are important.
5. Cost accounting transparency. Wall‑clock savings (25–29%) depend on specific hardware and rematerialization settings. Reporting FLOPs or a normalized throughput‑adjusted cost would make claims more portable.
6. Scope of evaluation. The Pile is English‑heavy; a brief multilingual check or domain‑shift test (e.g., code/math) beyond MBPP and MATH citations would strengthen generality claims. (You do show strong LAMBADA gains—Table 2—but broader coverage would help.)
1. BASELINE@146k: Can you report few‑shot averages and domain breakdowns for BASELINE at 146k steps (2.8B and 8.6B) to match the SALT reporting? This is critical for the step‑efficiency claim.
2. SALTDS selection: In Eq. (10) (p. 6), you use median of filtered per‑token losses with top‑k masking. Did you try mean/trimmed‑mean or per‑document entropy to balance “challenging yet learnable”? Any diversity constraint to avoid duplicative sequences?
## Suggestions
- Add BASELINE@70% few‑shot (and maybe @80%, @90%) to align curves across methods. Also show SALT@70% vs BASELINE@70% on key individual tasks (beyond overall averages)
- Report variability. Provide ≥3 seeds for the few‑shot suite with mean±std or 95% CIs; likewise for SFT results
- Stronger baselines. Include self‑distillation and top‑k token KD; try KD weight annealing beyond the tested linear variants (Appendix K, Table 23)
- Wider evaluation. Include multilingual (e.g., TyDi beyond English) or code/math reasoning benchmarks, and a brief data‑shift analysis to test whether SALT’s gains persist out of distribution. |
Fully AI-generated |
|
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes SALT (Small model Aided Large model Training), a two-stage pre-training method for Large Language Models (LLMs). It leverages a Small Language Model (SLM) as a teacher for Knowledge Distillation (KD) in the initial stage, followed by standard pre-training. An optional extension $SALT_{DS}$ uses the SLM to select "challenging yet learnable" data for the KD phase. The authors provide a theoretical analysis presenting risk bounds for KD in the language modeling context. Empirically, SALT is shown to achieve performance comparable to or better than a standard baseline with fewer training steps (70%), resulting in significant wall-clock time savings (25-29%) for 2.8B and 8.6B models, and also improves performance after SFT.
The paper tackles the critical and practical problem of reducing the high computational cost of LLM pre-training.
It demonstrates significant wall-clock time savings (25-29%) on large-scale models (2.8B, 8.6B) while maintaining or improving performance over the baseline.
The paper provides a theoretical analysis by developing risk bounds for knowledge distillation specifically in the autoregressive language modeling context.
The improvements from SALT pre-training are shown to carry over to downstream tasks after supervised fine-tuning.
The core components (small-to-large KD, early-stage KD, data selection via small models) are not new. The claimed novelty rests on a specific combination of these ideas, and the theoretical contribution appears to be an application of existing analyses to this specific setting.
The paper lacks direct empirical comparisons against stronger, contemporary baselines for both data selection and KD scheduling. It is also unclear if the standard baseline was sufficiently tuned.
Can you clarify the specific novel contributions over prior work (like Qin et al., 2022, Ankner et al. 2024) beyond scale and the specific data selection heuristic?
The method performed poorly with a 0.5B teacher. What is the explanation for this performance degradation, and what does it imply for the method's practical limits?
If training runs for more steps, does the improvement disappear?
If the training data is a more recent, cleaner corpus such as FineWeb-Edu, does the improvement disappear?
What is the cost of the hyperparameter search? And do the selected hyperparameters generalize when reused in other settings? |
Lightly AI-edited |
|
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses a core problem in the large language model (LLM) field: high pre-training costs. The authors propose a novel and efficient training paradigm, leveraging small language models (SLMs) to aid the training of LLMs, which they name SALT (Small model aided large model training).
The core contributions of this method are twofold:
1. **Theoretical Framework:** The paper first establishes a statistical framework to theoretically analyze the application of knowledge distillation (KD) in language models. Specifically, it explores the feasibility of "reverse distillation" (using a *weaker* SLM as a teacher to train a *stronger* LLM student). The theory reveals this as a **bias-variance trade-off**: soft labels from the SLM can reduce training variance, but due to its weaker capability, it also introduces bias (especially on "hard" samples).
2. **SALT Algorithm:** Guided by this theory, the authors designed the SALT algorithm. It is a **two-stage training method**:
* **Stage 1 (KD):** In the early phase of training ($n_{KD}$ steps), it uses the SLM as a teacher for knowledge distillation, capitalizing on its low bias and variance-reduction benefits in "easy" data regions.
* **Stage 2 (Standard):** In the subsequent phase, it switches back to standard (ground-truth-based) next-token prediction training, allowing the LLM to learn the "hard" samples that the SLM could not master.
3. **$SALT_{DS}$ (Data Selection):** The paper further proposes an extension, $SALT_{DS}$, where the SLM is also used for **data selection**. It uses a scoring function (Eq. 10) to filter for "**challenging yet learnable**" training sequences, specifically for the KD in the first stage.
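For intuition, here is a sketch of one possible "challenging yet learnable" sequence score in the spirit described above (keep tokens whose ground-truth id falls in the SLM's top-k predictions, then take the median of their losses); the exact form of Eq. (10) may differ:

```python
import torch

def challenging_yet_learnable_score(slm_logits, targets, k=10):
    """Illustrative sequence score: a token counts as 'learnable' if its
    ground-truth id is within the SLM's top-k predictions; the sequence score
    is the median per-token loss over those learnable tokens (higher = more
    challenging while still within the teacher's reach)."""
    losses = torch.nn.functional.cross_entropy(
        slm_logits.view(-1, slm_logits.size(-1)), targets.view(-1), reduction="none"
    )
    topk = slm_logits.topk(k, dim=-1).indices                 # (seq, k)
    learnable = (topk == targets.unsqueeze(-1)).any(dim=-1)   # (seq,) bool
    if not learnable.any():
        return torch.tensor(float("-inf"))                    # nothing learnable: rank last
    return losses[learnable].median()

# Toy example: seq=6, vocab=50; random logits stand in for an SLM forward pass.
logits = torch.randn(6, 50)
targets = torch.randint(0, 50, (6,))
score = challenging_yet_learnable_score(logits, targets)
```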
**Experimental Results:**
The authors validate their method by training 2.8B and 8.6B parameter LLMs on the Pile dataset (using 1.5B and 2.8B SLMs as teachers).
* **Efficiency:** SALT-trained LLMs can match (or exceed) the performance of standard-trained (BASELINE) LLMs using **less than 70% of the training steps**.
* **Performance:** The final SALT models outperform the BASELINE on a wide range of few-shot benchmarks and supervised fine-tuning (SFT) tasks.
* **Time Savings:** This translates to an estimated **~25% (2.8B) to ~28% (8.6B) wall-clock time saving**.
1. **Addresses a Critical Problem:** The work focuses on reducing LLM pre-training costs, which is a highly important and valuable research direction.
2. **Reported Efficiency Gains:** The paper reports significant wall-clock time savings (~25-28%). If robust, this result is of high practical value.
3. **Theoretical Framework:** The paper provides a theoretical framework that attempts to explain the "weak-teacher-strong-student" distillation mechanism via a bias-variance trade-off.
1. **Unclear and Minimal Contribution of $SALT_{DS}$:** This is a **major weakness**. The paper introduces $SALT_{DS}$ (with data selection) as a key extension, yet the experimental results show it offers **no significant or consistent benefit** over the simpler SALT baseline (e.g., 8.6B model avg @100% steps: SALT 52.96 vs $SALT_{DS}$ 52.81). This makes the contribution of the data selection part *nearly void*. The authors need to justify why this more complex method (requiring an extra SLM scoring pass) is proposed if it provides no tangible benefit.
2. **Severe Lack of Ablation for Data Selection:** The core of $SALT_{DS}$ is the scoring function in Eq. (10), which relies on a critical hyperparameter $k$ (set to 10) to define "learnability". The paper provides **no ablation or sensitivity analysis** on this $k$ value. What happens if $k=1$ or $k=50$? Without this analysis, the effectiveness of this data selection mechanism is **unproven**, and its design appears arbitrary.
3. **Unprincipled Choice of $n_{KD}$:** The choice of the transition point $n_{KD}$ (36K steps) seems **arbitrary**. While Appendix K shows robustness for $n_{KD}$ between 20k and 60k, the paper fails to discuss how one might *principally* determine this optimal transition point. Lacking this discussion makes the method feel more like a specific "recipe" than a general approach.
1. **($SALT$ vs $SALT_{DS}$) Necessity of $SALT_{DS}$:** As noted in the weaknesses, the performance of $SALT$ and $SALT_{DS}$ is very close. Can the authors elaborate on whether the practical benefit of $SALT_{DS}$ justifies its additional complexity (i.e., a full forward pass over the pre-training data by the SLM)? Under what conditions would you expect $SALT_{DS}$ to *significantly* outperform $SALT$?
2. **($SALT_{DS}$) Data Selection Hyperparameter $k$:** The choice of $k$ in Eq. (10) seems critical. You used $k=10$. Did you experiment with other $k$ values? For example, how would performance be affected by a very small $k$ (e.g., $k=1$, focusing only on tokens the SLM gets right) or a very large $k$ (e.g., $k=100$)? This is important for understanding the definition of "learnability."
3. **(SALT) Choice of Transition Point $n_{KD}$:** The ablation in Appendix K (Table 22) shows $n_{KD}=60K$ has slightly better average performance (47.99) than $n_{KD}=36K$ (47.94). You mention choosing 36K for efficiency. My question is: is there a *principled* way to determine the optimal $n_{KD}$? For example, does it correspond to a point where the SLM teacher's training loss begins to "saturate," or some metric indicating that "easy" samples have been sufficiently learned? |
Fully AI-generated |
|
RHGCL: Representation-Driven Hierarchical Graph Contrastive Learning for User-Item Recommendation |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Representation-driven Hierarchical Graph Contrastive Learning (RHGCL), which enhances user-item recommendation by explicitly modeling hierarchical item structures within a graph contrastive learning framework. Unlike existing GCL methods that overlook multi-level item similarities, RHGCL first pre-trains user and item representations using cross-layer contrastive learning, then clusters items into hierarchical groups via representation compression, and finally fine-tunes embeddings on a two-hierarchy bipartite graph combining both original and clustered interactions. By leveraging both fine-grained and coarse-grained item relationships, RHGCL improves recommendation accuracy, achieving superior performance on benchmark datasets.
1. RHGCL learns item hierarchies directly from data-driven representations, enabling multi-resolution modeling without relying on external graphs or attribute labels.
2. The proposed experiments across three benchmark datasets seem to demonstrate the effectiveness of the proposed method
1. The paper presents item clustering as a key innovation, but similar representation-driven clustering ideas (e.g., neighborhood-enriched clustering in NCL [1]) have already appeared in recent recommendation literature; the proposed method is not compared with such closely related works, weakening the novelty claim. Could the authors add comparisons with such methods to further highlight the unique contribution of the proposed clustering module?
2. All contrastive-learning baselines compared are from 2023 or earlier; several stronger 2024/2025 GCL recommenders are ignored, so the performance gap may be smaller than reported.
3. The paper offers no case study or concrete examples to illustrate why the hierarchical module matters, leaving readers without intuitive evidence of its practical value.
[1] "Improving Graph Collaborative Filtering with Neighborhood-enriched Contrastive Learning" Lin et al. WWW 2022
See Weakness. |
Moderately AI-edited |
|
RHGCL: Representation-Driven Hierarchical Graph Contrastive Learning for User-Item Recommendation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents a novel Graph Contrastive Learning (GCL) method for user-item recommendation that incorporates hierarchical item structures. It pre-trains a GCL module, constructs a two-hierarchy user-item bipartite graph via representation compression and clustering, and then fine-tunes representations. Experiments show RHGCL outperforms existing models by enhancing GCL with representation-driven hierarchical item structures for recommendation tasks
1. An effective method that is clearly written.
2. The model consistently outperforms strong baselines (e.g., XSimGCL, SimGCL, LightGCN) across three benchmark datasets (Yelp2018, Amazon-Kindle, Alibaba-iFashion), with meaningful improvements in Recall@20 and NDCG@20
3. The paper presents a complete pipeline—pretraining, hierarchical clustering (via t-SNE), and fine-tuning on dual-level graphs—illustrating conceptual clarity and reproducibility.
1. As mentioned by the authors, incorporating hierarchical information into a recommendation system is not new. The novelty is my main concern.
2. Adding hierarchical information will incur more computation; it would be better to provide some analysis of this overhead.
3. Using t-SNE followed by radial–angular partitioning introduces hyperparameters that may not generalise well. The clustering might be unstable or computationally expensive for large graphs.
The questions are related to the weakness.
1. What is the core novelty compared to other works, beyond combining several existing components?
2. Can you provide a computational analysis?
3. Can you provide an ablation study on the clustering? |
Lightly AI-edited |
|
RHGCL: Representation-Driven Hierarchical Graph Contrastive Learning for User-Item Recommendation |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes RHGCL, a hierarchical graph contrastive learning method for user-item recommendation that incorporates hierarchical item structures. The approach consists of three stages: (1) pre-training using XSimGCL to obtain user and item representations, (2) clustering items in a 2D space using t-SNE and geometric partitioning, and (3) fine-tuning on both the original user-item graph and a user-clustered item graph. Experiments on three datasets show improvements over baseline methods.
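To illustrate what t-SNE followed by geometric (radial/angular) partitioning might look like in practice, here is a small sketch; the numbers of radial bands and angular sectors are assumed hyperparameters, and the paper's exact partitioning rule may differ:

```python
import numpy as np
from sklearn.manifold import TSNE

def polar_clusters(item_emb, n_radial=3, n_angular=8, random_state=0):
    """Project item embeddings to 2D with t-SNE, then assign each item to a
    cluster defined by its radial band and angular sector around the centroid."""
    xy = TSNE(n_components=2, random_state=random_state, perplexity=10).fit_transform(item_emb)
    xy = xy - xy.mean(axis=0)                       # center so polar coordinates are meaningful
    r = np.linalg.norm(xy, axis=1)
    theta = np.arctan2(xy[:, 1], xy[:, 0])          # angle in (-pi, pi]
    r_bin = np.minimum((r / (r.max() + 1e-9) * n_radial).astype(int), n_radial - 1)
    a_bin = ((theta + np.pi) / (2 * np.pi) * n_angular).astype(int) % n_angular
    return r_bin * n_angular + a_bin                # cluster id per item

item_emb = np.random.randn(200, 64)                 # stand-in for pretrained item embeddings
cluster_ids = polar_clusters(item_emb)
```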
1. The paper identifies a relevant gap in existing GCL methods regarding hierarchical item structures and provides concrete examples of why this matters for recommendation systems.
2. The method demonstrates performance gains across all three datasets and metrics, suggesting the approach has merit.
3. The paper includes appropriate baselines, multiple datasets of varying sizes, and sensitivity analyses for key hyperparameters.
1. The core contribution appears to be applying t-SNE dimensionality reduction followed by geometric clustering (radial/angular divisions) to create hierarchical structures. This is a relatively straightforward extension of XSimGCL without significant methodological innovation.
2. The paper does not provide theoretical or empirical justification for why t-SNE is optimal for capturing hierarchical item relationships over alternatives like UMAP, spectral embedding, or learned hierarchical representations.
3. The deterministic polar coordinate partitioning seems arbitrary and overly simplistic. Why should semantic item relationships align with geometric sectors in a 2D embedding space? This lacks theoretical grounding.
4. No comparison with methods that learn hierarchical structures (e.g., differentiable pooling variations adapted for bipartite graphs).
5. The related work section (Appendix A) lists several hierarchical GCL methods for recommendations (HKGCL, HGCL, ReHCL, HEK-CL, HNECL), but none are included in the experimental comparison. While the authors claim these methods require knowledge graphs or review data, this justification is insufficient: At minimum, methods that only require interaction data (like HNECL) should be compared.
6. The improvements over XSimGCL are very modest.
1. Can you provide theoretical or empirical justification for why t-SNE followed by geometric clustering is superior to alternative approaches for capturing hierarchical item structures?
2. Why not compare with other hierarchical recommendation methods, particularly those that don't require external knowledge graphs (e.g., HNECL)?
3. Can you provide statistical significance tests demonstrating that the observed improvements are not due to random variation?
4. How do the learned clusters correspond to actual item categories or semantic relationships? Can you provide qualitative analysis? |
Fully AI-generated |
|
Local-Curvature-Aware Knowledge Graph Embedding via Extended Ricci Flow |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper targets the geometric mismatch that arises when knowledge graph embeddings are forced into a predefined, homogeneous manifold, which distorts distances under locally heterogeneous curvature. It proposes RicciKGE, coupling the KGE loss gradient with local discrete Ricci curvature via an extended Ricci flow so that the latent geometry and entity embeddings co-evolve. The method offers theoretical guarantees of curvature flattening and linear distance convergence and reports consistent, if sometimes modest, gains on standard link prediction and node classification benchmarks.
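As a rough picture of the coupling described above, one plausible reading of the edge-wise update is sketched below; this is an assumed form based on the summary (a curvature term plus a coupling-weighted loss-gradient term), not the paper's exact flow equation:

```python
import numpy as np

def extended_ricci_step(d, kappa, grad_loss, eta=0.01, beta=0.1):
    """One illustrative edge-distance update: the discrete Ricci-flow term
    (-kappa * d) contracts or expands edges according to local curvature, while
    the beta-weighted KGE loss gradient steers the same distances toward the task.

    d:         (E,) current edge distances
    kappa:     (E,) discrete Ricci curvature per edge
    grad_loss: (E,) gradient of the KGE loss w.r.t. each edge distance
    """
    return d - eta * (kappa * d + beta * grad_loss)

# Toy numbers: four edges with heterogeneous curvature signs.
d = np.array([1.0, 0.8, 1.2, 0.5])
kappa = np.array([0.3, -0.2, 0.1, -0.4])
grad_loss = np.array([0.05, -0.02, 0.01, 0.03])
d_next = extended_ricci_step(d, kappa, grad_loss)
```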
1. The motivation is precise and grounded in limitations of homogeneous manifolds for KGE: the paper clearly articulates how a static geometric prior misaligns with locally varying curvature in real KGs. This framing connects well to manifold-based KGE literature and sets a concrete failure mode that the method aims to fix.
2. Theoretical analysis is not superficial: Theorem 1 proves exponential decay of edge-wise Ricci curvature under bounded coupling, and the corollary establishes linear convergence of distances under strong convexity. The interplay between curvature decay and optimization is discussed rather than just stated, which raises confidence in the mechanism’s stability.
3. The algorithmic presentation is serviceable: Algorithm 1 aligns the distance-flow step with embedding updates, and the complexity section isolates the main overhead (curvature estimation via Sinkhorn) with a parallelization argument. While brief, this helps practitioners estimate costs of adoption.
1. The contribution over prior curvature-flow work feels incremental and needs sharper differentiation. The paper extends discrete Ricci flow with a gradient coupling term, but related graph Ricci-flow models and geometric regularizers already exist; the delta versus prior discrete/extended flows (e.g., geometry-only flows or Ricci-guided graph methods) remain under-quantified. A stronger ablation isolating “pure Ricci flow”, “pure gradient”, and “coupled” variants across datasets is necessary to establish novelty in effect, not just form.
2. The theoretical guarantees rely on strong and somewhat idealized assumptions that may not reflect KGE training practice. The curvature result assumes volume/diameter bounds, Sobolev inequalities, and a spectral gap on a closed manifold; the distance convergence assumes µ-strong convexity in distance, while typical KGE losses combine non-convex components, negative sampling, and relation-specific transforms. The paper does not empirically validate robustness when these assumptions fail, leaving a theory–practice gap.
3. There is a conceptual tension between the motivation (preserving heterogeneous local curvature) and the main theorem (all edge-wise curvature decays to zero). The text argues that curvature “imprints” are absorbed into embeddings during transients, but the paper provides limited qualitative or quantitative evidence of what structural signals survive once the manifold is flat. Visualization or probing tasks before/after flattening would help reconcile this tension.
4. The empirical gains, while consistent, are often small and sometimes not state-of-the-art, and the statistical significance is not established. Table 1 shows multiple deltas around +0.1 to +0.4 MRR on FB15K-237 and cases that do not surpass the strongest baseline; no confidence intervals or paired tests are reported. Given the added O(|E|(k^2+kd)) overhead per epoch from curvature estimation, time–accuracy tradeoffs (including wall-clock, GPU memory, and scaling on larger KGs) should be reported to justify practicality.
5. Reproducibility and implementation specifics are thin at submission time. Code is promised only upon acceptance, and several details are under-specified: the exact Sinkhorn parameters for W1, normalization schemes for edge weights, β search ranges and schedules, negative sampling strategies, and seed/variance reporting across runs. The β sensitivity plot is helpful, but does not replace a thorough hyperparameter protocol and release of full configs for each backbone
Please refer to the weaknesses. |
Fully AI-generated |
|
Local-Curvature-Aware Knowledge Graph Embedding via Extended Ricci Flow |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes RicciKGE, a curvature-adaptive knowledge graph embedding method that couples the KGE loss gradient with local edge-wise curvatures via an extended Ricci flow, allowing embeddings and manifold geometry to co-evolve. It targets the mismatch introduced by homogeneous manifolds (Euclidean, spherical, hyperbolic) when real-world graphs have heterogeneous local curvature, which can distort distances and limit expressiveness. The authors provide theory showing that, with a properly bounded coupling coefficient, edge-wise curvatures decay exponentially toward Euclidean flatness and KGE distances strictly converge to a global optimum, indicating mutual reinforcement between geometric flattening and embedding optimization. Empirically, RicciKGE improves link prediction and node classification performance on standard benchmarks, supporting the effectiveness of curvature adaptation for heterogeneous KG structures.
The paper is overall easy to follow.
The proposed method demonstrates a superior performance on large-scale graph dataset as shown in table 2 in the experiments section.
I mainly have concerns about the experiments.
The experiments use different embedding dimensions across datasets without explaining the rationale, and statistical significance measures are not reported in Table 1. This makes it difficult to assess the robustness and fairness of the comparisons.
The baseline results reported in Table 1 differ from those in the original paper (e.g., GoldE). It is unclear whether a re-implementation was used, and if so, the experimental settings, hyperparameters, and code differences should be documented to enable verification.
Insufficient baseline selection and discussion. The choice of baselines in Table 1 is not well justified. Although the paper shows performance improvements when integrating RicciKGE into several existing KGE models, the baselines appear limited to relatively basic KGE methods (except the recent GoldE). A more comprehensive comparison—including stronger, contemporary baselines and a discussion of why each was selected—would better demonstrate the generality and competitiveness of the approach.
It is not clear to me why the claim in lines 258-260 holds:
"To validate its effectiveness, we must demonstrate that under this iterative mapping, the Ricci curvature on every edge converges pointwise to zero, i.e., the manifold becomes asymptotically flat, thereby ensuring that all intrinsic curvature heterogeneity is faithfully captured in the learned entity embeddings."
Can you explain this in more detail? |
Fully human-written |
|
Local-Curvature-Aware Knowledge Graph Embedding via Extended Ricci Flow |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces RicciKGE, a knowledge graph embedding method that leverages a Ricci flow on the input graph to gradually flatten a low-dimensional representation of the entities in that graph, ultimately arriving at the usual flat Euclidean manifold while benefiting from awareness of the local curvature of each neighbourhood in the graph and from task-related gradient information. This approach is evaluated on both the link prediction and node classification tasks, consistently performing on par with or beating the baseline methods.
The presentation is very good, with appropriate formulation, helpful illustrations, and an algorithm listing. Furthermore, the underlying theory is (mostly, see below) clearly presented.
As I understand the paper, the authors extend the embedding evolution from Ricci flow (as in GNRF) to use also task-specific gradient information. I appreciate that there are analyses on the convergence, as this gives more confidence on the properties of the proposed method.
Experimentally, the authors explore not only KGE-specific tasks, but also node classification, along with additional insights into how the total loss and curvature variance evolve over time.
Given it is also predominantly based on Ricci flow, I believe the paper should address more explicitly the differences between its contributions and GNRF.
Experimentally, while the authors thoroughly evaluate multiple scenarios with multiple methods (well done!), I had the impression that little space was left to properly analyse the results. While the results are convincing, I expected more analysis here (see Question 4 for details).
1. Can you define "neighbourhood measures" in the main paper before or around line 143? Since $\kappa(i, j)$ is so central to the Ricci curvature used to define the method, I believe it is important that most concepts it depends on are contained in the main paper.
2. Can you define $\beta$ and $\text{Ric}_{ij}$ in Eq. (2)? The latter is also not used in the remainder of the paper, and its definition is left vague. Is it supposed to be some form of $\kappa(i, j)$? Is that important? Either way, you could consider rewriting Eq. (2) in a way that makes this explicit.
3. Algorithm 1 suggests that embeddings are updated in sequence. Is that the case? If yes, since certain entities can appear in multiple triplets, this might lead to multiple updates happening, which seems problematic. If not, can you please make this clearer in the algorithm?
4. Previous empirical evidence, such as AttH (Chami et al., 2020), suggests that a good match between the geometry of embeddings and the graph structure would enable lower-dimensional embeddings to outperform methods that ignore that (e.g., TransE, DistMult, etc.). In this context, looking at Table 1 (emb. dim. 32) would suggest that the performance gap should be larger for those methods versus AttH/GoldE, while that gap is indeed small in higher embedding dimensions. Why is that not consistently the case for lower dimensions in your results? Is there a more elaborate phenomenon going on?
**Other comments:**
- l. 483: Our theoretically => theory?
**References**
Chami, I., Wolf, A., Juan, D. C., Sala, F., Ravi, S., & Ré, C. (2020). Low-dimensional hyperbolic knowledge graph embeddings. arXiv preprint arXiv:2005.00545. |
Fully human-written |
|
Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes the method of non-transferable examples, which tackles the problem of model-specific data authorization. The authors establish formal bounds through solid theoretical analysis and demonstrate the effectiveness of the method with strong empirical results.
- I really like the idea that "NEs leverage a structural property of neural networks in which many input directions have negligible effect on early features, yielding a model-specific set of insensitivity directions that rarely align across models." This is a great observation that is worth leveraging in this problem domain.
- The experimental results show strong protection of the encoded data; the method achieves protection without retraining or expensive encryption.
- The paper provides solid theoretical analysis, with bounds for both authorized and unauthorized performance that are mathematically sound and adequately discussed.
- It would be better if more visual examples could be provided rather than just two images.
- I have several more questions that need the authors' further clarification regarding applicability, limitations, etc. Please see [Questions].
- What are the benefits of this method compared with simply adding a fixed Gaussian noise mask to the data (and subtracting the mask at inference time)?
- I am a bit concerned about how this method eventually changes the image. As shown in Figure 1, at around 40 dB the method largely degrades the visual quality of the image and makes it hard for a human to identify. In this case, how would the method be used in real applications? Can you provide example scenarios where we do not care what the image looks like after applying the method and only care about inference on a specific model?
- Can this method be applied across models, i.e., can the data owner allow the NEs to be used by different clients?
- Are there any approaches that could undermine this method? For example, if an attacker has collected some data with the correct labels, could they tune their model's initial layers to adapt to this method?
- What is the computational overhead of this method?
- What are the limitations of the method? What kinds of models can it not be applied to? |
Fully human-written |
|
Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes non-transferable examples, a training-free, input-side recoding that preserves utility for a single authorized model while sharply degrading utility for any other model. The key idea is to add a calibrated perturbation within a model-specific low-sensitivity subspace of the target model's first linear map, estimated via SVD. Theoretically, the paper provides bounds linking authorized retention to a spectral threshold and unauthorized degradation to cross-model spectral misalignment via the Hoffman–Wielandt inequality. Experiments on CIFAR-10 and ImageNet across multiple vision backbones and on VLMs demonstrate that authorized performance is retained while unauthorized models are severely degraded.
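For concreteness, here is a minimal numerical sketch of the kind of subspace encoding described above (my own illustration, not the authors' implementation; I use the exact null space of a random first-layer weight in place of the paper's estimated low-sensitivity subspace, and the budget `eps` is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 256))       # first-layer weights of the "authorized" model
x = rng.standard_normal(256)             # a clean (flattened) input

# Right singular vectors with zero singular value span directions W is insensitive to.
_, _, Vt = np.linalg.svd(W, full_matrices=True)
V_low = Vt[64:]                          # 192 null-space directions of the 64x256 map

eps = 5.0
delta = eps * (V_low.T @ rng.standard_normal(V_low.shape[0]))
x_enc = x + delta                        # encoded input

print(np.linalg.norm(W @ x_enc - W @ x))              # ~0: authorized features unchanged
W_other = rng.standard_normal((64, 256))               # a different, unauthorized model
print(np.linalg.norm(W_other @ x_enc - W_other @ x))   # large: features heavily perturbed
```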
- The paper tackles a novel and societally relevant problem: enforcing model-level usage control without retraining or cryptographic infrastructure. Unlike anti-learnability or differential privacy approaches, the proposed method operates directly at inference and does not require access to non-target models, representing a new protection paradigm.
- The method is mathematically grounded, with clear theoretical analysis connecting spectral properties to performance retention/degradation through bounded inequalities.
- Experiments span both vision backbones and multimodal VLMs, with consistent evidence of non-transferability.
- The analysis focuses solely on the first-layer linear map of the network, assuming subsequent layers implicitly preserve the property. This simplification may not hold for architectures with highly nonlinear early blocks or skip connections (e.g., ResNet).
- The threat model allows the adversary to preprocess inputs (including adaptively), but the paper does not evaluate adversaries that learn an inversion (e.g., distilling a new first layer aligned to $V$). The empirical reconstruction attempts are classical denoising and do not include learned inversion with supervision on even a small clean subset.
- Some comparisons (e.g., with FHE) rely on cited rather than reproduced results, and differential privacy is known to address a different threat model.
- How robust is the non-transferability property when unauthorized models share partial architecture?
- Does the low-sensitivity subspace remain stable under data-domain shifts (e.g., different ImageNet subclasses or out-of-distribution inputs)?
- How does the computational cost of subspace estimation scale with model dimension, and can it be applied to large models such as VLMs? |
Fully AI-generated |
|
Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces NEs, a data-centric method designed for model-specific authorization in machine learning systems.
The idea is to recode data inputs via perturbations aligned with the target model's low-sensitivity subspace, so that the authorized model maintains high performance while utility is drastically degraded for any unauthorized model.
No model retraining is needed. NEs are also agnostic to the target architecture and operate solely at the input side. Both theoretical guarantees and a comprehensive empirical evaluation are provided, demonstrating the method's effectiveness.
- The problem is somewhat novel and interesting to me. It could address a pressing gap in AI governance in a practical and computationally feasible way, without retraining or cryptographic overhead.
- The empirical validation is comprehensive across many vision and VLM models
- Theoretical guarantees only cover the first linear layer; deeper nonlinear effects and adversarial undoing are unexplored. How do perturbations propagate through deeper nonlinear layers, and can adaptive adversaries undo them?
- Hyperparameters for the perturbation magnitude seem not to be well discussed and may not generalize across models or data. Can a method for systematic hyperparameter calibration be developed for different settings?
- My concern is whether NEs resist sanitization/purification/denoising techniques. If they do not, an attacker can simply purify NEs before feeding them to unauthorized models.
- Generating NEs might take some time, yet the runtime and computational cost of generating NEs are not well discussed. How practical is NE generation for real-time or large-scale use?
- No evaluation in high-stakes or privacy-sensitive real-world domains. How effective and applicable are NEs in critical privacy-focused applications (e.g., health care)?
See Weaknesses |
Lightly AI-edited |
|
Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Non-Transferable Examples (NEs) — a training-free, input-side method for model-specific data authorization. By analyzing the first-layer weights of the target model via singular value decomposition, it recodes inputs within a low-sensitivity subspace so that the authorized model’s performance is preserved while unauthorized models experience severe degradation. Theoretical analysis links this effect to spectral misalignment, and experiments on vision and multimodal tasks show that NEs retain authorized accuracy while rendering other models unusable.
1. The paper introduces a new perspective on data authorization by proposing Non-Transferable Examples (NEs)—a lightweight, training-free method that ensures data usability is restricted to a specific target model. This formulation shifts the control from model training or encryption to input-level encoding, representing a genuinely novel conceptual contribution.
2. The proposed approach is simple, efficient, and broadly applicable. Since NEs only require access to the first-layer weights of the target model, they can be deployed across diverse architectures, including CNNs, Vision Transformers, and vision-language models, without retraining. This makes the method attractive for real-world deployment.
3. The authors support their approach with a solid mathematical analysis. By leveraging spectral theory and the Hoffman–Wielandt inequality, they rigorously show why the encoded data maintain performance on the authorized model but degrade substantially on others due to subspace misalignment. The theoretical framework aligns well with empirical findings.
1. If many encoded samples are publicly available and generated under a consistent subspace or seed, an adversary could apply PCA or covariance-based spectral analysis to infer correlated energy patterns and approximate the manipulated subspace, partially neutralizing the encoding and restoring unauthorized performance.
2. NE is an input-side, inference-stage defense that becomes ineffective when large-scale datasets are directly used for training or fine-tuning. Attackers can adapt by learning on re-encoded data or mixing it with original samples.
3. Because NE encoding depends on precise spectral alignment with the target model’s first-layer subspace, common transformations—compression, resizing, cropping, or illumination shifts—may distort the encoded signals and weaken the authorization effect. Robustness under realistic noise and preprocessing variations remains an open issue.
Please see weakness |
Fully AI-generated |
|
Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a unifying framework for designing memory-efficient LLM optimizers through structured Fisher Information Matrix (FIM) approximation. The authors show that existing optimizers (Adam, Shampoo, gradient normalization, etc.) can be viewed as special cases of FIM approximation with different structural constraints, and propose two design principles: (1) selecting structures that balance generality and efficiency, and (2) applying a low-rank extension framework. Based on these principles, they introduce two novel optimizers—RACS (achieving SGD-like memory with row-column scaling) and Alice (a low-rank variant of generalized Adam with tracking, switching, and compensation mechanisms)—which demonstrate over 2× faster convergence than Adam on LLaMA pre-training tasks while maintaining low memory overhead.
1. Originality: The unifying FIM approximation framework is highly original, elegantly connecting diverse optimizers (Adam, Shampoo, normalization, Muon) under a principled theoretical lens. The low-rank extension framework with subspace switching and optimal compensation represents creative innovation beyond existing ad-hoc approaches. The insight linking GaLore to Gadam provides a valuable new perspective on existing methods.
2. Quality: The mathematical formulations are rigorous, with analytical solutions derived for various structural constraints. Experiments are comprehensive and well-designed, spanning multiple model sizes (60M-1.3B) and demonstrating consistent, substantial improvements. The convergence analysis under continuous-time setup adds theoretical grounding.
3. Clarity: The paper is well-structured with logical progression from framework exposition to practical optimizer design. The step-by-step derivation of existing optimizers as special cases effectively illustrates the framework's unifying power. The two design principles provide clear, actionable guidance.
4. Significance: High practical impact—Alice achieves over 2× convergence speedup compared to Adam with minimal memory overhead, addressing critical LLM training efficiency challenges. The framework provides valuable conceptual foundation for systematic optimizer design, moving the field beyond heuristic approaches toward principled methodology. The work opens promising research directions for incorporating model structure and task-specific information into optimizer design.
1. The convergence analysis is limited to continuous-time setup (Theorem H.2), which is a gap given that practical optimizers operate in discrete time. The one-step refinement approximation in Theorem 3.4 for Gadam lacks justification for why this approximation is reasonable or how the error propagates.
2. All pretraining uses only LLaMA on C4. Section K.7 adds GLUE on RoBERTa, but this is insufficient. Generalization to other architectures (GPT, T5, Mamba) and domains (code, multilingual) is unclear.
3. Table 2 shows Alice has 15% throughput drop on 1B models (45,523 vs 53,588 tokens/sec), but lacks breakdown of where this overhead originates. Specifically, no profiling data separates the cost of: (1) EVD approximation via subspace iteration (O(m²r) every K steps), (2) subspace switching with QR decomposition, and (3) per-step compensation computation. Without this profiling, practitioners cannot determine which components to optimize (e.g., increasing K to amortize EVD cost, or simplifying Eq. 14 to reduce compensation overhead) or assess when the 15% penalty is acceptable given 2× step-wise speedup.
Q1. Can you provide convergence analysis for Alice under discrete-time dynamics, including convergence rates in terms of learning rate η, rank r, and switching frequency K, and how the one-step refinement approximation error affects final convergence?
Q2. Can you provide detailed profiling showing time spent on EVD approximation, subspace switching, and compensation computation, and demonstrate how throughput scales with update interval K to help practitioners optimize the 15% throughput penalty?
Q3. Why is random uniform sampling from the complement space optimal in Algorithm 1, and have you compared against informed strategies like gradient-based importance sampling or residual-guided selection using Σ_t from Theorem G.8? |
Fully AI-generated |
|
Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper advocates a structured Fisher Information Matrix (FIM) approximation as a unifying lens for designing efficient optimizers for LLM training. It shows that a range of optimizers/gradient operators—including Adam, Shampoo, normalization/whitening, and SOAP/AdaDiag++—arise as solutions to a Frobenius-norm projection of the empirical FIM onto different structure families. On LLaMA pretraining, RACS/Alice outperform Adam and strong memory-efficient baselines; Alice achieves >2× step-speedup over Adam with comparable memory to GaLore, and RACS delivers strong results with ~SGD-like memory. The contribution is mainly a unifying viewpoint and design methodology, not a definitive theory of LLM optimization.
1. This paper provides a unified and clear perspective: casting popular methods as structured FIM projections.
2. On C4 pretraining, Alice yields 2.22×–2.82× step-speedup vs. Adam with strong effective throughput; RACS beats multiple memory-efficient baselines (GaLore, Fira, Apollo, Muon) under comparable or lower memory.
3. The writing is good and easy to follow.
4. Solid theoretical analysis is provided.
5. Both pretraining and finetuning experiments are provided, showing the superiority of the proposed method under different cases.
Overall, I believe this is a good paper; some small cons:
1. The font size in some tables and figures is too small for potential readers, for example, Figure 1 and Table 1. It would be better if the font size were larger.
2. The authors may consider adding illustrative figures to accompany the complex mathematical formulations. Visual aids would help readers grasp the core ideas and intuitions behind the proposed methods more easily.
3. The current experiments are all conducted on models with fewer than 1B parameters. It would be better to see if the proposed method can scale up to 7B model pre-training, as Galore and Fira do. Alternatively, if constrained by computational resources, the authors may consider offering more in-depth discussions or analyses to compensate for the lack of large-scale experiments.
see weaknesses |
Fully human-written |
|
Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In this paper, the authors propose a unified framework for optimizers that are both memory- and computation-efficient for LLMs. To achieve this goal, the authors use structured Fisher information matrix (FIM) approximation. Many existing optimizers, such as Adam and Shampoo, can be viewed as approximations of the natural-gradient update in which the FIM is replaced by certain structured forms, such as block-diagonal and Kronecker-product matrices.
The strengths of this paper are summarized as follows:
1. This paper presents a novel optimizer design. Row and Column Scaled SGD (RACS) introduces a minimal-overhead, two-sided scaling method that achieves strong results with almost no additional memory.
2. Empirically, the authors report results at different model scales and compare against many baselines, such as the well-known Adam and GaLore. The experimental results show consistent performance improvements.
3. Theoretically, the authors have proved the convergence of Alice under the continuous-time setup.
The weaknesses of this paper are summarized as follows:
1. Theoretically, the authors have given the convergence analysis only in the continuous time setting and “leave the general treatment of the convergence to future work”. It would be better if the authors could give a brief discussion about what else needs to be done, what the difficulties are, and what the differences are with the continuous-time setup.
2. Empirically, the authors only analyze “pre-training LLaMA on the C4 dataset”. It would be better to include evaluations on more diverse domains. Also, the model sizes are small; nowadays, people are more interested in larger LLMs, so it would be better to include experimental results on larger models.
Galore has numerous follow-up works, such as Galore 2 [1], Golore [2], and Sara [3]. Could the authors give a comparison with these works?
[1] DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, and Jiawei Zhao. "Galore 2: Large-scale llm pre-training by gradient low-rank projection." arXiv preprint arXiv:2504.20437 (2025).
[2] Yutong He, Pengrui Li, Yipeng Hu, Chuyan Chen, and Kun Yuan. "Subspace optimization for large language models with convergence guarantees." ICML'25.
[3] Haochen Zhang, Junze Yin, Guanchu Wang, Zirui Liu, Tianyi Zhang, Anshumali Shrivastava, Lin Yang, and Vladimir Braverman. "Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining". NeurIPS'25. |
Fully human-written |
|
Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
Memory, sample efficiency, and throughput are key performance metrics for optimization algorithms, and algorithmic development is essential for mitigating the cost of LLM training. This work leverages an insightful lens of Fisher Information Matrix (FIM) preconditioning (the Natural Gradient Descent view) to study successful optimizers such as Adam, Shampoo, Muon, and Soap, and unifies them all under this lens by identifying the specific structure of FIM approximation (different variants of Kronecker-product and block-diagonal structure) underlying each of these optimizers. Interestingly, the update rule of the one-sided Soap algorithm is also derived and analyzed by leveraging the block-diagonal structure, and the one-sided Soap update rule together with the derived structure is referred to as Generalized Adam (GAdam).
Inspired by FIM unification, this work hence argues that exploration of novel structures for FIM approximations is the key design principle for systematically developing new optimizers to further balance the optimizers' performance metrics, especially memory vs sample efficiency. Building on this design principle, the paper proposes two novel optimization algorithms. The first proposed algorithm is called RACS, where the preconditioner is structured as a Kronecker product of two diagonal matrices, and by minimizing the FIM approximation error (measured in Frobenius norm), the two diagonal matrices are derived subsequently. This optimizer hence reduces the second moment memory significantly and matches the memory complexity of AdaFactor (for instance, Adam's second moment needs O(mn) while RACS needs O(m+n)).
Moreover, this work also proposes another novel low-rank optimizer called ALICE, achieving GaLore-like memory while significantly improving both sample efficiency and throughput. ALICE extends the GAdam algorithm (one-sided Soap update rule) to low-rank subspace optimizer settings by maintaining only the top r eigenvectors of the full-rank projection in GAdam, and then the projected gradients are accumulated in the subspace akin to GaLore. Subspace switching and compensation methods are also proposed to account for the residual update term that remains orthogonal to the projection matrix and is wiped out during projection.
The two proposed algorithms are validated on LLaMA pre-training and GLUE benchmarks against AdamW, GaLore, Fira, Apollo, and Muon. The reported results indicate that these memory-efficient optimizers (ALICE and RACS) not only match the final validation perplexity of the best baseline but also improve wall-clock time by a factor of 2 compared to Adam.
1. The paper is well-presented and clear. The proposed algorithms are derived based on rigorous theoretical principles, and experiments are comprehensive and profile all key performance metrics (throughput, memory, and perplexity) across the LLaMA model family (from 60M to 1.3B). Comprehensive ablations verify the effectiveness of the key algorithmic components introduced for the two proposed optimizers.
2. This paper makes it clear that the most adopted and successful optimizers in deep learning indeed are different choices of FIM approximation. Principled design of optimizers based on exploring the space of FIM approximation structures is a very clever and strong idea. Finding principles for one-sided Soap based on this framework is an interesting contribution.
3. Both proposed optimizers, ALICE and RACS, achieve consistent speedup by a factor of 2 (measured in wall-clock time) and significant memory savings compared to AdamW. These gains represent very strong results, and upon reproduction of these results, the ALICE and RACS algorithms have the potential to be adopted as mainstream optimizers for LLM training based on reported results. Especially considering the benchmark of current state-of-the-art optimizers in similar pre-training settings [1], the speedup of the best SOTA optimizer compared to Adam is around 1.1-1.4, and Muon is among the best-performing optimizers. Interestingly, the paper reports both ALICE and RACS outperform Muon significantly while using less memory.
### Critiques on the FIM Framework (Section 3)
Applying the Shampoo preconditioner directly on gradients can recover spectral normalization (whitening) as well, and hence whitening can be understood without leveraging Proposition 3.3. Moreover, [3] already finds the optimal solution to Equation 2 (line 147), where the optimal R and L are computed as principal singular vectors of reshaped F. Then [3] leverages power iteration for approximating the optimal solution of Equation 2, and hence minimizing the proposed upper bound in Theorem 3.2 does not provide significant additional insight when compared to established results in [3]. For Theorem 3.4, if we assume all D_i are equal, then the structure would be equal to the whitening structure that is used to derive Equation 3, and hence the derivation of Equation 6 appears redundant. Moreover, since GAdam appears to be equivalent to one-sided Soap, and this algorithm is actually discussed in the original Soap paper, the justification for not referring to GAdam as one-sided Soap is unclear. While the derived principle has novelty, the update rule remains the same.
### Novelty of RACS Compared to AdaFactor
If we let the initial S and Q be identity matrices with proper scaling, then one iteration of the fixed point procedure in Equation 11 would indeed recover the AdaFactor update. Moreover, when applying the structure in Equation 10 to gradients as in Equation 1, we can write:
$\text{Vec}(\Delta W) = -\lambda (S \otimes Q)^{-1/2} \text{Vec}(G)$. Since both S and Q are diagonal, $S \otimes Q$ would be diagonal as well, and hence we can write $\Delta W = -\lambda \frac{G}{\sqrt{V}}$, where $V = qs^\top$. Since Proposition 4.1 shows that q and s converge to the left and right singular vectors of $G^2$ up to scaling factors, this shows that V would be a rank-one SVD approximation of $G^2$, and the iterative fixed point procedure of RACS is essentially a power iteration that approximates those principal vectors. Moreover, the original AdaFactor (Algorithm 4 in [4]) similarly maintains momentum on q and s. Hence, RACS appears to be very similar to AdaFactor. RACS performs 5 iterations of Equation 11, while AdaFactor performs one iteration. Therefore, the novelty of the RACS update rule could be considered limited, as RACS appears to be another rank-one approximation method of the second moment, which AdaFactor has already explored and studied in detail.
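To make this concrete, here is a small numerical illustration of the argument (my own sketch, not the paper's Eq. (11) nor AdaFactor's exact implementation): both constructions below yield rank-one approximations of $G^2$, differing mainly in how many power-iteration-style refinement steps are applied.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((32, 48))
G2 = G ** 2                                    # elementwise square of the gradient

# AdaFactor-style factorization: outer product of row and column sums, normalized.
r, c = G2.sum(axis=1), G2.sum(axis=0)
V_adafactor = np.outer(r, c) / G2.sum()

# Several power-iteration steps toward the principal singular pair of G2.
q = np.ones(32) / np.sqrt(32)
for _ in range(5):
    s = G2.T @ q; s /= np.linalg.norm(s)
    q = G2 @ s;  q /= np.linalg.norm(q)
V_power = (q @ G2 @ s) * np.outer(q, s)        # approximately sigma_1 * u_1 v_1^T

for name, V in [("adafactor-style", V_adafactor), ("5 power iterations", V_power)]:
    print(name, np.linalg.norm(G2 - V) / np.linalg.norm(G2))
```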
### Reproducibility
The code has not been made available, and considering the significant speedup reported in this work and the potential for improving upon the best existing optimizers in the field, making the source code and implementation public would greatly benefit the research community and enable better evaluation of this paper. Even sharing the source code for the 130M setup would be highly valuable.
1. Could the authors please address my critique regarding RACS, and particularly explain the key algorithmic differences between RACS and AdaFactor beyond performing more steps of power iterations? This is indeed a very important concern to address, as AdaFactor does not achieve anywhere close to the RACS speedup compared to AdamW, and having more power iterations as the only key algorithmic component does not seem sufficient to explain these gains.
2. Could the authors please provide a detailed comparison of the compensation step in ALICE to LDAdam? Line 467 mentions Fira as the closest work for compensation, but LDAdam has also made significant contributions regarding this compensation mechanism, which merits discussion and comparison.
3. Proposition 4.1 in this work appears to be a diagonal version of Proposition 1 (Section 3.1 in [3]). It would be interesting to see what additional challenges the proof of Proposition 4.1 introduces compared to Proposition 1 in [3], and how these relate to power iteration convergence.
[1] Wen, Kaiyue, et al. "Fantastic pretraining optimizers and where to find them." arXiv preprint arXiv:2509.02046 (2025).
[2] Robert, Thomas, et al. "Ldadam: Adaptive optimization from low-dimensional gradient statistics." arXiv preprint arXiv:2410.16103 (2024).
[3] Morwani, Depen, et al. "A New Perspective on Shampoo's Preconditioner." arXiv preprint arXiv:2406.17748 (2024).
[4] Shazeer, Noam, and Mitchell Stern. "Adafactor: Adaptive learning rates with sublinear memory cost." International Conference on Machine Learning. PMLR, 2018. |
Fully human-written |
|
Integrating Solving Forward and Inverse Problems in PDEs with Flow-based Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper leverages the Rectified Flow framework introduced by Liu et al. (2023) to develop a unified approach for solving both forward and inverse PDE problems with a single trained model. The key idea is to treat PDE parameters and solutions as the initial and terminal states of a rectified flow, respectively. A neural network parameterizes the velocity field governing the flow, and the model is trained by minimizing a trajectory alignment loss. Once trained on forward-problem data, the model can solve inverse problems by reversing the ODE dynamics, without requiring retraining. The paper includes empirical evaluations demonstrating the effectiveness of the approach relative to existing methods.
The idea of unifying forward and inverse PDE solving within a single framework is novel and conceptually appealing. Leveraging Rectified Flow to model the transition between PDE parameters and solutions is a reasonable.
1. The text states in line 146 that the velocity field $v_\theta$ is parameterized by a neural network, but the architecture and configuration of the network are not specified.
2. Insufficient presentation of experimental results. In the experiments, RFO is compared with multiple baseline methods, including GNO, LNO, and FNO for forward problems, whereas for inverse problems, it is only compared with FNO. In Wang & Wang (2024), experimental results of LNO on the 1D Burgers inverse problem are provided, whereas this paper does not present a comparison with these results. Furthermore, this paper also lacks sufficient comparison with existing models, such as GNOT by Hao et al. (2023) and Transolver by Wu et al. (2024).
3. Lacks further theoretical explanation/justification. Rectified Flow aims to construct a transport path between the source and target distributions that is as “straight” as possible, including rectification and recursive refinement steps. The paper does not provide a theoretical justification for why using Rectified Flow to model the relationship between PDE parameters and solutions is reasonable. This lack of explanation leaves me somewhat confused about the theoretical soundness of the proposed approach.
1. Clarify the architecture of the neural network modeling the velocity field (e.g., network type, depth, width, activation functions, etc.).
2. In the experiments, LNO is included as a baseline for forward problems but not for inverse problems. Could the authors clarify the reason for omitting LNO in the inverse problem comparison?
3. Several advanced models are not included in the experimental comparisons, such as GNOT (Hao et al., 2023) and Transolver (Wu et al., 2024). Could the authors comment on the rationale for not including these baselines in either forward or inverse problem evaluations?
4. It would be helpful if the authors could provide a brief theoretical explanation for why the rectification and recursive steps in Rectified Flow are suitable for modeling both forward and inverse PDE problems, without requiring extensive formal proofs. |
Moderately AI-edited |
|
Integrating Solving Forward and Inverse Problems in PDEs with Flow-based Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents an application of flow matching to PDEs. By learning the solution of a PDE with a flow matching approach, the method allows solving the inverse problem out-of-the-box by reversing the direction of the rectified flow from the solution to the inputs. The method is evaluated on multiple PDEs and compared against neural operators such as FNO and LNO.
- Conceptually interesting application of flow matching to inverse problems.
- The paper is generally easy to follow.
- Evaluation on a wide range of PDEs.
1. Only applicable to cases where the input parameters have the same dimension as the outputs.
2. For the inverse problem, the paper only compares against FNO, even though other models designed for inverse problems are referred to in the related work section. Additionally, LNO was also used for inverse design in the original paper.
3. Formatting issues (exceeds page limit, no parentheses around citations).
4. Almost no information on the model architecture is provided. Additionally, more details on the experiments should be provided, such as the range of the input parameters, the number of training trajectories, and how the baselines were applied.
1. How did you solve the inverse problem with FNO?
2. For the different resolutions, did you repeat the training for each resolution level, or does this refer to a super-resolution setting? |
Fully human-written |
|
Integrating Solving Forward and Inverse Problems in PDEs with Flow-based Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a novel method based on rectified flow that integrates the solution of both forward problems and inverse problems in a single model. In the training stage, the velocity field in rectified flow is approximated with fixed pairs z(0)=a and z(T)=u, where a is the input parameter and u is the corresponding solution. In the inference stage, the forward problem can be solved by feeding z(0) with a and running the forward pass of the flow model, and the inverse problem can be solved by feeding z(T) with u and running the reverse pass of the flow model. Numerical results on various equations demonstrate its effectiveness in both forward problems and inverse problems within a single model.
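For concreteness, here is a minimal sketch of how I understand the training objective and the two-way inference (my own illustration with a toy MLP velocity field and Euler integrator; the architecture, `dim`, and step counts are placeholders, not the paper's choices):

```python
import torch
import torch.nn as nn

dim = 64
v_theta = nn.Sequential(nn.Linear(dim + 1, 256), nn.GELU(), nn.Linear(256, dim))

def rf_loss(a, u):
    """Velocity-matching loss with fixed pairs z(0)=a (parameter) and z(1)=u (solution)."""
    t = torch.rand(a.shape[0], 1)
    z_t = (1 - t) * a + t * u               # straight-line interpolant
    target = u - a                          # constant velocity along the line
    return ((v_theta(torch.cat([z_t, t], dim=-1)) - target) ** 2).mean()

@torch.no_grad()
def integrate(z, forward=True, steps=50):
    """Euler integration of dz/dt = v_theta; the reverse direction solves the inverse problem."""
    ts = torch.linspace(0, 1, steps + 1)
    if not forward:
        ts = ts.flip(0)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        t = torch.full((z.shape[0], 1), float(t0))
        z = z + float(t1 - t0) * v_theta(torch.cat([z, t], dim=-1))
    return z

a, u = torch.randn(8, dim), torch.randn(8, dim)
print(rf_loss(a, u).item())
u_pred = integrate(a, forward=True)          # forward problem: parameter -> solution
a_pred = integrate(u, forward=False)         # inverse problem: solution -> parameter
```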
1. Easy to follow and easy to implement.
2. Both forward problems and inverse problems are solved in a single model.
3. To my knowledge, this is the first work integrating both forward problems and inverse problems in a single model based on rectified flow.
4. The inferences in both forward problems and inverse problems are fast without extra training.
1. The compared baselines are mainly FNO and its variants, and there is only one baseline for inverse problems. It is suggested to compare with more recent neural operator methods, such as Neural Inverse Operators (NIOs) for inverse problems.
Neural Inverse Operators: Roberto Molinaro, Yunan Yang, Björn Engquist, and Siddhartha Mishra. Neural inverse operators for solving PDE inverse problems. arXiv preprint arXiv:2301.11167, 2023.
2. The training details are not given.
For the initial conditions of each equation, what distributions did you use when drawing samples? And how many training and testing samples did you use? |
Fully human-written |
|
Integrating Solving Forward and Inverse Problems in PDEs with Flow-based Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 0:
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a Rectified Flow-Based Operator (RFO) that unifies the solution of forward and inverse PDE problems within a single model. By training a rectified flow velocity field between PDE parameters (inputs) and solutions (outputs), the model can perform forward prediction via forward integration and inverse inference via reverse integration, without retraining. Experiments on Burgers’, Darcy, Navier–Stokes, and Advection equations demonstrate that RFO achieves accuracy comparable to or better than existing neural operators (e.g., FNO), while also handling inverse problems in a zero-shot fashion. The paper’s main claim is that rectified flow’s reversibility naturally supports both forward and inverse tasks in a unified framework.
- The idea of using rectified flow’s reversibility to connect forward and inverse PDE tasks is novel and intuitive, bridging generative modeling and operator learning.
- The method does not require retraining or architectural changes to switch between forward and inverse problems.
- Across multiple PDE benchmarks, RFO shows accuracy comparable to or better than FNO, with improved computational efficiency in some tasks.
- The experiments are mostly qualitative and small-scale, lacking systematic ablation studies
- The baselines (e.g., GNO, LNO, FNO) are not fairly tuned or described, and no analysis is given for computational cost vs. accuracy trade-offs.
- The comparisons rely mostly on older operator-learning methods such as GNO, LNO, and FNO, without inclusion of recent state-of-the-art frameworks, for instance, diffusion-based PDE solvers (DiffusionPDE, Physics-Informed Diffusion Models), or PINO variants.
- The baseline LNO is not referenced
- This paper is purely experimental, but no implementation is provided, making it hard to assess the reproducibility.
- Could you clarify whether RFO offers any theoretical justification (e.g., existence or uniqueness of the rectified flow mapping for PDE operators) beyond empirical validation?
- The baselines are mostly older neural operator models (GNO, FNO, LNO).
- Why were recent diffusion- or flow-based PDE solvers (e.g., DiffusionPDE (2025), PIDM (2025), FunDiff (2025)) not included?
- Could you provide more details on the network architecture (depth, hidden width, total parameters) and training hyperparameters used in each PDE task? |
Fully AI-generated |
|
The Unreasonable Effectiveness of Scaling Agents for Computer Use |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces bBoN, a “wide‑scaling” paradigm for computer‑use agents that generates multiple trajectories in parallel and selects the best via narrative comparison, a concept well‑framed and clearly presented in the abstract. However, it provides little discussion of the substantial computational cost such scaling entails.
The paper tackles a central issue in computer‑use agents—how to scale behavioral selection effectively—which is timely and relevant as larger models are increasingly limited by interaction efficiency.
The proposed bBoN framework is clearly formulated, with intuitive motivation and a well‑explained pipeline from trajectory generation to narrative‑based comparison.
The experiments, especially the comparison in Figure 5, provide concrete evidence that bBoN outperforms both average‑rollout and simple WebJudge baselines, illustrating the benefit of cross‑trajectory evaluation.
Although bBoN achieves better performance through multiple rollouts and narrative evaluations, this improvement comes with a potentially huge computational overhead. The paper does not quantitatively analyze cost, efficiency, or fairness under equal‑budget conditions, leaving readers uncertain about the real‑world practicality of the method.
The main contribution—running many rollouts in parallel and selecting via pairwise narrative evaluation—follows a relatively straightforward ensemble logic: performance generally improves when sampling more trajectories and filtering them with a learned judge. While well‑executed, this approach feels incremental rather than conceptually innovative, relying on existing ideas of scaling and comparative evaluation.
The method relies on aggregating multiple rollouts into a single narrative summary for comparison. As the number of rollouts or the length of each trajectory increases, both token‑level and computational costs grow rapidly. Moreover, very long context windows may dilute key information and amplify model attention errors, leading to less reliable judgments. The paper does not analyze how accuracy or reasoning quality scales with context length, which raises concerns about robustness in large‑scale or long‑horizon tasks.
How does the framework handle potential side effects or environment resets when multiple rollouts are executed repeatedly? Do many resets cost more?
Does the narrative summarization stage require multimodal input (e.g., multiple images per rollout)? If so, how is this handled given that some current LLM/VLM APIs only support single‑image inputs?
What is the additional inference or token cost introduced by the narrative‑comparison stage?
In Section 3.2, the paper describes the Behavior Best‑of‑N Judge as performing a multiple‑choice (MCQ) comparative evaluation over N candidate trajectories. However, the corresponding system prompt (Appendix) only supports pair‑wise comparison (“which one better completes the user request”), without any structure for handling N > 2 options. Does this discrepancy mean that the actual implementation is Best‑of‑2 rather than the claimed Best‑of‑N?
Moderately AI-edited |
|
The Unreasonable Effectiveness of Scaling Agents for Computer Use |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work proposes a method for improving the performance of computer-use agents by generating multiple rollouts (wide scaling), summarizing their behavior by generating facts per action using a VLM (Behavior Narrative Generator), and selecting the best one through comparative evaluation using a VLM (Behavior Best-of-N Judge). The proposed framework achieves state-of-the-art performance on OSWorld, a standard benchmark for computer-use agents, in success rate for long-horizon tasks.
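For concreteness, the selection loop as I understand it, written as a self-contained toy sketch (the agent, narrator, and judge below are stand-ins for the paper's VLM-based components; all names are mine):

```python
import random

def run_agent(task, seed):
    """Toy stand-in for one full rollout: a list of (action, observed screen change)."""
    rng = random.Random(seed)
    return [(f"click element {rng.randint(1, 5)}", f"dialog {i} opened") for i in range(3)]

def narrate(trajectory):
    """Compress each step into a factual 'action -> screen change' statement."""
    return [f"{action} -> {effect}" for action, effect in trajectory]

def judge_best(task, narratives):
    """Toy stand-in for the multi-way judge; the real system compares narratives with a VLM."""
    return max(range(len(narratives)), key=lambda i: len(" ".join(narratives[i])))

def behavior_best_of_n(task, n=4):
    narratives = [narrate(run_agent(task, seed)) for seed in range(n)]   # wide scaling
    best = judge_best(task, narratives)
    return best, narratives[best]

print(behavior_best_of_n("rename report.pdf to final.pdf"))
```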
- Experimental setup: Detailed experimental evaluation, including ablation studies, baselines, and datasets.
- Strong empirical performance: Achieves state-of-the-art success rate on OSWorld.
- Clear presentation: The method is well-described and motivated.
- Inefficiency: Naive wide scaling means running N times more rollouts, with no mechanism to prune unpromising rollouts early or summarize fewer behaviors.
- No learning or adaptation: The method relies on the performance of pre-trained LLMs and VLMs in a zero-shot setting.
- VLM dependency: The approach depends on pre-trained VLMs both for summarizing behavior and for selecting the best trajectory, which may compound errors.
- Could you add in Table 2 the runtime for different BoN values? Is it expected to be exactly N× that of Agent S3?
- How many steps does an optimal agent typically need to solve OSWorld tasks? Why did you choose 50 and 100 steps? Could you report the statistics (e.g., mean, median, variance) for the number of steps of an optimal agent?
- What is the impact of visual augmentations for pointer interactions on the performance of the Behavior Narrative Generator? Could you provide further details on the visual augmentation method?
- Have you tried generating facts for the entire behavior with a single prompt (i.e., a trajectory-level summary) instead of per-action summaries? |
Lightly AI-edited |
|
The Unreasonable Effectiveness of Scaling Agents for Computer Use |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces a test-time scaling framework for computer-use agents called Behavior Best-of-N (bBoN): it runs multiple full task trajectories in parallel and then selects the best one at the end, greatly improving reliability on long, brittle workflows. To make different candidates directly comparable, it compresses each step into a concise Behavior Narrative that records the action and resulting screen change, and uses a multi-way bBoN judge to pick the strongest trajectory rather than scoring runs in isolation. The authors also streamline the underlying agent (Agent S3) by removing heavy hierarchical planning and integrating a coding agent, so it can generate stronger candidates faster. Together, these components yield higher success rates and better cross-platform generalization (e.g., to Windows and Android) under practical step budgets.
- Running multiple full trajectories in parallel and selecting the best (bBoN) reduces single-run brittleness in long-horizon tasks and provides predictable gains as the number of candidates increases.
- Behavior Narratives distill each step into factual “action → screen change” summaries, enabling a multiway judge to compare candidates directly, more reliable and scalable than independent scoring or pairwise elimination.
- The streamlined base agent (Agent S3) produces stronger rollouts with lower overhead, and the full stack shows consistent success-rate improvements and positive cross-platform transfer (e.g., Windows/Android) under realistic step budgets.
- Missing implementation details for parallel rollouts: The paper claims Best-of-N by running multiple full trajectories concurrently but omits a reproducible execution recipe: no specification of multi-machine vs. single-host isolation, environment cloning/reset procedures (images/snapshots/templates), seeding and cross-run isolation (caches, network, account state), or resource-parity policies (CPU/GPU/IO). This gap undermines reproducibility and comparability.
- Best-of-N increases cost roughly with N, yet the paper does not report wall-clock time, CPU/GPU hours, energy, or cost-normalized success. There is no equal-compute baseline comparison (e.g., single/multi-sample methods under the same total budget), so observed gains may primarily reflect more sampling rather than a more efficient method.
- Gains may hinge on heterogeneous candidate pools (different backbones, prompts, temperatures). Without isolating diversity vs. pure sampling, it’s unclear whether bBoN helps when all candidates come from the same model/config.
- The method is demonstrated on tasks with clear success signals. It remains unclear how the judge handles open-ended goals, partial credit, or ambiguous outcomes where “best” is not binary.
- The work does not characterize marginal gains as N grows or provide guidance to select N under a fixed budget, making it hard to tune for real-time constraints.
- When generating N parallel trajectories, what infrastructure do you use (multi-machine vs. single host with multiple VMs/containers), how are environments cloned/reset, and how are seeds and resource quotas set to ensure isolation and fairness across candidates? Could you release scripts/configurations to reproduce the same parallel conditions?
- Under equal total compute budgets, how do bBoN results compare to single-run or few-sample baselines, and what wall-clock/CPU-GPU hours and energy are reported?
- Do gains depend on heterogeneous candidate pools (different models/prompts/temperatures), or do they persist when all candidates come from the same model and config?
- Are there early-exit policies to stop clearly bad candidates mid-run, or adaptive N strategies by task difficulty to control latency and cost at test time?
- How does bBoN perform under UI/layout drift, app updates, ads/pop-ups, and network jitter, particularly when initial snapshots cannot perfectly normalize the environment? |
Fully AI-generated |
|
The Unreasonable Effectiveness of Scaling Agents for Computer Use |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a Behavior Best-of-N (bBoN) method, which scales over agents by generating multiple rollouts and selecting the best trajectory among them for execution.
The paper explores scaling over agents via a simple but effective best-of-N method, which largely improves performance on benchmarks.
The paper proposes an effective Behavior Judge method that could support the scaling, which is better than the WebJudge.
1. The proposed method is not practical. It requires a simulator or virtual machine to perform multiple rollouts before real execution, which limits its applicability and can incur excessive computational costs during testing, especially when N is set to a large number.
2. This is also based on the strong assumption that the environment's transition dynamics are stationary and that the simulator can be exactly the same as the real environment. This is difficult to guarantee in the real world, as websites are not stationary and can always have pop-ups or ads [1]. Thus, although the proposed method can achieve better performance on benchmarks, it is actually overfitting to the benchmark, which cannot represent real-world performance.
3. Since one can access the simulator, it would be better to also run the baseline N times and use the task success indicator to select a successful trial among the N trials. This allows calculating the success rate for each task, which can serve as the performance upper bound of the best-of-N method.
[1] DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning.
See the weakness. |
Lightly AI-edited |
|
Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces Enumerate–Conjecture–Prove (ECP), a neuro-symbolic pipeline for answer-construction problems in competition mathematics. ECP (i) enumerates candidate objects via LLM-generated Python, (ii) conjectures a closed-form Lean expression from these hints, and (iii) proves correctness using Lean automation or learned provers (e.g., DeepSeek-Prover-V2-7B). The authors also curate ConstructiveBench, a Lean-verified dataset of 3,640 answer-construction problems gathered and autoformalized with an error-guided loop and an LLM judge. Empirically, ECP improves answer-construction accuracy across models and datasets—for example, GPT-5 mini rises from 69.7% → 73.6% on ConstructiveBench, while end-to-end accuracy increases 32.5% → 33.1%; on the PutnamBench subset ECP solves 6/337 end-to-end vs 4/337 for a CoT baseline.
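For readers unfamiliar with the task format, here is a toy answer-construction instance in plain Lean 4 (my own illustration, not drawn from ConstructiveBench): the answer is committed to as a standalone definition, and the accompanying theorem states the property it must satisfy, which is what rules out simply echoing the statement as the answer.

```lean
-- Toy problem: "Find the least natural number whose square is 16."
abbrev toyAnswer : Nat := 4

theorem toyAnswer_spec :
    toyAnswer * toyAnswer = 16 ∧ ∀ m < toyAnswer, m * m ≠ 16 := by
  decide
```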
- **Clear problem focus & formulation:** The paper crisply separates theorem-proving from answer-construction and formalizes the latter in Lean’s dependent type theory, including canonical-format constraints to prevent trivial “echo the statement” answers.
- **New dataset with care for quality:** ConstructiveBench (3,640 problems) includes multiple sources, deduplication, a retrieval-aided autoformalization loop, and a post-cutoff test split (106 problems after June 2024) to mitigate contamination. The human study on 100 random items finds 77% fully correct formalizations.
- **Well-structured, modular pipeline:** The three stages—Enumerate → Conjecture → Prove—are easy to reason about and reproduce. Figure 1 illustrates the looped tool-use design; the Prove stage sensibly leverages both Lean tactics and learned provers.
- **Small end-to-end gains:** While answer-construction accuracy improves notably, end-to-end improvements are modest (e.g., 32.5% → 33.1% on ConstructiveBench; 4 → 6 on PutnamBench). It’s unclear whether these differences are statistically significant or practically meaningful given Pass@32 variance.
- **Compute and sensitivity not fully surfaced:** The enumeration stage caps runtime (60 s) and candidates (≤100), but the paper does not quantify sensitivity to these budgets or how often enumeration misleads conjecturing (the under-generalization can hurt a subset of problems). A cost-vs-accuracy curve and failure taxonomy would help.
- **Limited comparative analysis with concurrent frameworks:** The related work cites broader formal problem-solving frameworks (e.g., FPS/D-FPS) and domain-specific systems, but experiments compare mainly to a CoT baseline rather than to other tool-integrated or search-based pipelines. Including such baselines (where feasible) would clarify ECP’s unique contribution.
- **Lack of related work:** CounterMath [1] is a counterexample-driven mathematical benchmark that also needs answer construction.
- Line 152: should be $a \neq 0$.
- Figure 2 appears incomplete: the percentages are missing.
[1] Li, Yinghui, et al. "One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs." Forty-second International Conference on Machine Learning. |
Moderately AI-edited |
|
Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the challenge of solving "answer-construction" mathematical problems, which require both the creative generation of a mathematical object and a rigorous proof of its correctness. The authors introduce the `Enumerate-Conjecture-Prove (ECP)` framework, a modular neuro-symbolic method. `ECP` first uses an LLM to generate and execute Python code to enumerate potential solutions, then uses these examples to guide the conjecture of a closed-form answer in Lean, and finally attempts to formally prove the result using a Lean prover. To support this work, the paper also presents `ConstructiveBench`, a new dataset of 3,640 answer-construction problems autoformalized into Lean via an iterative, retrieval-augmented process. The authors report that this process achieves a 77% correctness rate in a human audit. Experiments show that `ECP` yields consistent, though modest, gains over a CoT baseline, improving the end-to-end solve rate on `ConstructiveBench` from 32.5% to 33.1% and increasing the number of solved `PutnamBench` problems from 4 to 6.
1. The effort put into arriving at correct autoformalizations is commendable, involving multiple autoformalization models, retrieval of relevant documentation and Lean entities, and a feedback loop.
2. The presented formalized dataset has significantly lower error rate than those presented in prior work. Further, including a human validation audit and an "after-cutoff" subset to address data contamination concerns adds significant credibility to the benchmark.
3. The `ConstructiveBench` dataset is a significant new resource, notable for its scale (3,640 problems) and its focus on answer-construction tasks.
4. The modular nature of the `ECP` pipeline allows for incorporating future advances in models and provers.
1. The primary weakness is that the empirical improvements from the complex `ECP` pipeline are marginal, especially for state-of-the-art models (e.g., a 0.6% absolute gain on `ConstructiveBench`). The paper provides no confidence intervals or statistical significance tests, making it difficult to determine if these small gains are meaningful or simply due to noise.
2. The different components of the pipeline are not separately evaluated. It is crucial that the authors perform ablation studies to verify how the removal/addition of each component in the `ECP` and `ConstructiveBench` pipelines affects both the dataset and the results.
3. Crucial details are missing, hindering the paper's clarity and reproducibility. Specifically:
- The mechanism for verifying against vacuous or trivial answers in the "Conjecture" stage is not explained.
- The method for formally checking the equivalence of a conjectured answer and the ground truth is underspecified.
- The exact format for how Lean compiler messages and retrieved definitions are passed back to the model during iterative refinement is not provided.
4. While the creation effort is a strength, the result has limitations. The human audit revealed a 17% "major error" rate in the final dataset. This is a non-trivial error rate for a "verified" benchmark and raises concerns about the validity of the evaluation results, as models are being tested on a significant number of incorrectly formalized problems. The reliance on an LLM judge for the final semantic check also risks introducing systematic, hard-to-detect biases.
5. The related work section rarely relates to the proposed paper, making it difficult to contextualize the work within the broader field. Further, the authors might find it beneficial to emphasize why the problem they are working on is important and, in particular, how answer-construction tasks differ from theorem-proving ones (given that in most benchmarks the former can be rephrased into the latter).
1. What was the nature of the "major" and "minor" errors found during the human validation of `ConstructiveBench`? For instance, did errors typically make statements more permissive, factually incorrect, or unprovable? What were the experimental outcomes (for both CoT and `ECP`) on these incorrectly formalized problems?
2. It is unclear how much each component in the autoformalization pipeline affects the quality of the formalized statements. Can the authors provide an ablation study or evidence from prior work to quantify how much each component of the autoformalization pipeline (retrieval, feedback, LLM judge) contributes to the final formalization quality?
3. Can the authors present 95% confidence intervals on their results, so that readers are informed about the statistical significance of the results? (A small sketch of one way to do this follows this question list.)
4. What are the precise differences between the CoT and `ECP` frameworks in the experiments? The paper implies CoT also involves conjecturing and proving. Is the only difference the "Enumerate" stage? How does `ECP` compare to a simpler baseline that only uses a prover on a ground-truth formalization with the answer provided?
5. How is equivalence between the conjectured and ground-truth answers verified programmatically?
6. How is the check against vacuous answers (L214-215) implemented?
7. Could the authors include the full prompts for the LLM judge, the Lean solver/prover, and the iterative refinement loop in the Appendix?
8. How does this work compare methodologically and empirically to recent work like Hilbert [1], which also combines informal reasoning with formal methods?
9. What are the baseline results on the PutnamBench problems when the final answer is provided (as in the original benchmark)? This is crucial for judging the value of an end-to-end pipeline that must discover the answer itself.
10. The autoformalizations use Lean v4.23.0, while many prover models were trained on earlier versions (e.g., v4.15.0). Have the authors verified that this version mismatch does not negatively impact prover performance?
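To make question 3 concrete, the following sketch shows one way such intervals could be reported. It is only an illustration using the headline ConstructiveBench numbers; a proper paired test would require the per-problem outcomes, which only the authors have.

```python
# Hypothetical illustration: 95% Wilson intervals for the reported solve rates
# on ConstructiveBench (32.5% CoT vs. 33.1% ECP over 3,640 problems).
from statsmodels.stats.proportion import proportion_confint

n = 3640
for name, rate in [("CoT", 0.325), ("ECP", 0.331)]:
    k = round(rate * n)
    lo, hi = proportion_confint(k, n, alpha=0.05, method="wilson")
    print(f"{name}: {k}/{n} solved, 95% CI [{lo:.3f}, {hi:.3f}]")

# With n = 3640 the half-width is roughly ±1.5 percentage points, so the two
# intervals overlap heavily; a paired test on per-problem wins/losses (e.g.,
# McNemar's test) would be needed to claim a statistically significant gain.
```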
## Current recommendation
I have assigned this paper a score of **2: Reject**. I believe the paper's core contributions are potentially valuable to the field. However, the paper in its current form suffers from significant weaknesses that undermine its claims. The empirical gains are marginal without statistical validation, critical implementation details are omitted, and there is no ablation to justify the framework's complexity or comparison to key alternative methods. If the authors can thoroughly address my concerns, I would be happy to raise my score.
### References
[1] Varambally, Sumanth, et al. "Hilbert: Recursively Building Formal Proofs with Informal Reasoning." arXiv preprint arXiv:2509.22819 (2025). |
Fully human-written |
|
Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents the Enumerate-Conjecture-Prove (ECP) framework, a method for solving answer-construction problems in three phases: (1) enumeration of answers satisfying the constraints by hooking up an (informal) LLM to Python, (2) conjecturing of the full solution set using an LLM based on the enumeration, and (3) solving the problem in Lean using a formal LLM with the conjecture from step 2 applied. Additionally, the authors present ConstructiveBench, a benchmark consisting of 3,640 formal answer-construction problems taken from public high-quality competitions. The authors show in their experiments that ECP outperforms a baseline that uses only steps 2 and 3.
- Code is provided, well-documented, and well-structured.
- Paper is clean and easy to read and understand.
- The authors investigate an important area that is lacking in current formal reasoning, namely coming up with the answer.
- The authors slightly outperform a reasonable baseline.
My main concern with this paper is the lack of any interesting contribution. I have structured my concerns into three categories: major weaknesses, weaknesses, and remarks. The latter have not influenced my judgment, but should be fixed.
**Major Weaknesses**
- **ConstructiveBench is very limited**, and is just another formal benchmark with little effort taken in its creation (a simple, cheap pipeline is used that the authors themselves show contains significant noise). As such, it provides no benefit over already existing benchmarks such as MiniF2F and PutnamBench. In fact, it is worse than these benchmarks for three reasons:
- Noise: 23% noise is high, and no measure is taken to reduce it. PutnamBench has very little noise since it is constructed by human experts.
- Contamination: Because the problems are taken from existing public competitions, significant contamination concerns arise. The authors even acknowledge this through their "post-June 2024" selection. While PutnamBench also has this problem, it has several advantages that make it less concerning: it has been a well-known benchmark for years, so decontamination procedures for existing models have likely been implemented; it is drawn from a single competition, making exclusion of these problems during training much easier; it is constructed by human experts; and it does not use the same autoformalizers that are also used by the LLMs for their training.
- While ConstructiveBench likely has more answer-construction problems, the authors show that a significant number can also be found in PutnamBench.
- **ECP is simple and leads to very small improvements.** The only novel element of ECP is the "E" stage (as the authors themselves admit by including "CP" as a baseline). However, enumerating answers is quite restrictive and applicable only to the subset of problems where such enumeration can be performed. While it can be made more general (e.g., for proving an upper bound, a model could try various parameters and see what maximum it gets), the system prompt used by the authors makes it clear they did not envision this: "Your task is to write a Python program to enumerate possible answers to assist in conjecturing and proving the true answer." Furthermore, the E stage can very easily be folded into the C stage by simply allowing tool use by the model. In fact, the implementation, even in the E stage, should use such a tool-calling setup, especially for reasoning models. Currently, the authors manually parse the Python from the model output, which is suboptimal. Finally, the numbers increase by only up to 2% (looking only at reasoning-model results; the others I am discarding as they are not relevant, see weaknesses). This is very limited even if the authors had proposed a very nice and novel method.
**Weaknesses**
- **Informal models are given very limited reasoning tokens.** This is especially problematic because, with a high reasoning effort, models would likely also perform the enumeration stage more manually, exploring specific directions; 4000 tokens is not enough for this. Additionally, the authors should include at least one state-of-the-art model (Grok 4, Gemini 2.5 Pro, GPT-5), not just mini variants. Finally, non-reasoning models are unnecessary: they are essentially never relevant for any mathematical task.
- **The compute limit should be equal between the baseline and ECP.** Since the first stage is skipped for the baseline, it essentially uses less compute than ECP. I am not entirely sure how to fix this most appropriately (or whether it even can be fixed), but at the very least, the number of iterations allowed for the baseline should be 8 instead of 5. Note that this comment is very much on the boundary between remark and weakness, as I do not think it will influence much.
- **The differentiation between answer-construction and theorem-proving tasks is somewhat artificial in places**. Many theorem proving tasks also require you to come up with answers. I do understand where the authors come from: in formal theorem proving, the answer for such theorem proving tasks is always already given in the theorem statement that needs to be proven. However, the way it is presented now makes it seem as if this is also often the case in natural language mathematical problems, which is most definitely not the case. For instance, the comment in L40-44 is somewhat out of place for the motivation here: most of the USAMO problems also require finding an answer and then proving it. The authors should more clearly differentiate this. Personally, I think it is much easier to argue the necessity by simply noting that formal theorem proving almost always skips the "finding the answer" stage of a mathematical problem.
- **The use of the term "answer-construction" problems is somewhat strange here and there.** "Final-answer problems" is a more widely agreed-upon term in the literature, and "answer-construction" makes it seem as if the authors are only handling constructive problems. Interestingly, they also define answer-construction problems as constructive problems in L138-139. However, the more important issue is that solving any final-answer problem is skipped in formal reasoning, not just those where enumeration can take place. For instance, finding the maximum value of a certain expression, the number of ways in which something can be done, etc. all fall under final-answer problems, but not necessarily under problems where enumeration is possible (as for constructive problems). I believe the authors do mean the "final-answer" interpretation of "answer-construction" problems, as becomes clear in L155-170, but in this case the use of an enumeration stage becomes somewhat problematic, as it is not relevant for many problems.
- The authors should use the recommended hyperparameters for all models. For instance, Deepseek-v3.1 is advised to be run with temperature 0.6, not temperature 1.0.
**Remarks**
- L152 and L154 in agent.py in the code read two files that do not exist; should these be conjecturer_formal.txt?
- The claim in L463-465 is not true: DeepMind won silver with a formalized approach last year, and SeedProver won gold this year (although it solved P1 only after the grading was finished).
See above |
Fully human-written |
|
Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper is based on the observation that LLMs, when guided by careful prompting and tool integration, excel at solving high-school mathematics problems, but struggle with mathematical reasoning problems requiring formal proof or the construction and verification of complex mathematical objects. In mathematical reasoning, LLMs can tackle difficult answer-construction tasks but are prone to errors from hallucinations and unverifiable steps. On the other hand, symbolic methods guarantee rigor but falter in creative answer construction. Motivated by these observations, this paper aims to address the following gap: "how to solve answer-construction problems while preserving both LLM creativity and mathematical rigor?"
Towards bridging this gap, this paper proposes the Enumerate-Conjecture-Prove (ECP) framework, a modular neuro-symbolic method. ECP is model-agnostic and shows consistent improvements over pure LLM baselines. The ECP framework is essentially an agentic flow in which an LLM is invoked to enumerate candidate answers; the enumeration is achieved by having the LLM generate a Python program.
This paper focuses on two kinds of problems in math competitions: answer-construction problems and theorem-proving problems. For answer-construction problems, it proposes a mapping of the task into Lean's dependent type framework.
Next, this paper proposes a dataset called ConstructiveBench, a collection of answer-construction problems drawn from multiple original sources, including AMC 12 A/B (from 2000 onward), AIME I/II, HMMT, regional Olympiads, and the IMO Shortlist and Longlist. Additionally, the dataset integrates data from established informal math datasets, including OlympiadBench (He et al., 2024), Omni-Math (Gao et al., 2024a), and MathOdyssey (Fang et al., 2024). The paper also proposes an autoformalization pipeline to translate each informal answer-construction problem into a formal version.
To show the efficacy of the ECP framework, this paper evaluates it on the proposed ConstructiveBench and on PutnamBench, and compares it against a classic chain-of-thought (CoT) baseline. On a subset of PutnamBench for answer construction, ECP formally solves 6 out of 337 answer-construction problems end-to-end (up from 4 without ECP) using GPT-5 mini and DeepSeek-Prover-V2-7B. On ConstructiveBench, ECP achieves 33.1% end-to-end state-of-the-art accuracy (up from 32.5%).
- The ConstructiveBench dataset could be useful to the community for advancing neuro-symbolic systems.
- ECP improved the state of the art on ConstructiveBench, raising GPT-5 mini’s answer-construction accuracy from 69.7% to 73.6% and its end-to-end problem-solving accuracy from 32.5% to 33.1%.
- It was unclear whether the focus of the paper is only on answer-construction problems or also on theorem-proving problems. Most of the discussion seems to focus only on answer-construction problems. It would be better to clarify this.
- Section 3.2 is a bit cryptic and somewhat unclear. There is scope to improve the writing and make it more accessible.
- Figure 3 is not referred to anywhere in the text. I, as a reader, found myself lost when reading Section 4.3 because it starts rather abruptly and does not reference the example given in Figure 3 to motivate the section. Some rewriting would help improve this section.
- In Line 152, should it be $a \ne 0$ instead of $b \ne 0$?
- From the given illustration of ECP in Section 3.2, it is not quite clear what the difference between the Enumerate and Conjecture stages is. It looks to me as if the Enumerate stage itself would be sufficient for answer-construction tasks?
Fully human-written |
|
LU-500: A Logo Benchmark for Concept Unlearning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper proposes LU-500, a benchmark for “logo unlearning” in text-to-image models, built from 9,584 prompts across Fortune Global 500 brands with explicit (LUex-500) and implicit (LUim-500) tracks.
The benchmark design is clear. LU-500 isolates logo unlearning with two realistic prompting modes and a sizable, vetted prompt set.
1. I think this work essentially belongs to "prompt engineering": not only is LU-500 built from a set of prompts, but ProLU itself consists of three prompt-based LLM agents. Unfortunately, I do not see any algorithmic innovation in this work.
2. This work heavily relies on GPT-4o to build both the benchmark and the ProLU agents, yet Appendix A claims LLMs were used only to polish the writing; that is ridiculous.
3. The authors claim to "propose" five metrics as a core contribution in the introduction, but CLIPScore and SSIM are common metrics, so there is nothing new here.
See the weakness part of my review |
Fully human-written |
|
LU-500: A Logo Benchmark for Concept Unlearning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a benchmark for logo unlearning. It focuses on evaluating inference-time unlearning methods on this benchmark and proposes a baseline unlearning method based on prompt editing. Through experiments, it shows that existing inference-time unlearning methods are not effective at unlearning logos.
The presentation of this paper is generally clear, and it provides an interesting correlation analysis between unlearning performance and various logo characteristics (area, location, edge density, etc.).
This paper has the following major limitations:
- The scope of copyright protection is limited. The exclusive focus on logos is too narrow for meaningful copyright protection; other crucial copyrighted elements include characters, protected artworks, patterns, and so on. The methods may not generalize to other types of copyrighted content.
- The sole focus on inference-time unlearning methods is a major limitation because it does not represent the full spectrum of unlearning approaches. There is no explanation of why other unlearning approaches would not work. Fine-tuning and model-manipulation unlearning methods should be included in the comparison, even if they are more computationally demanding.
- The proposed baseline shows fundamental flaws. As we can see from Figure 6, residual logos remain clear and brand identities are recognizable even when logos are partially removed. It may also fail when implicit brand indicators beyond logos are present in the prompt.
- The current metric does not guarantee complete information removal. It does not test against adversarial attempts to recover logos.
The front-page image needs some work: the layout is cluttered and does not clearly convey the main information.
Why focus exclusively on logos rather than a broader range of copyrighted visual content? Have you tested whether your benchmark and methods generalize to other types of copyrighted material (e.g., characters, artistic styles, patented designs)? |
Fully human-written |
|
LU-500: A Logo Benchmark for Concept Unlearning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The work introduces LU-500, a new benchmark designed to evaluate concept unlearning methods for company logos within text-to-image diffusion models.
The dataset contains prompts derived from Fortune Global 500 companies and is divided into explicit (LUex-500) and implicit (LUim-500) tracks.
The authors propose five quantitative metrics (CLIPScore, LogoScore, LogoSSIM, ImageScore, ImageSSIM) to assess both local logo removal and global image preservation across pixel and latent spaces (a toy illustration of this local-versus-global split is sketched after this summary).
Experiments compare inference-time unlearning methods (NP, SLD, SEGA) and fine-tuning approaches (ESD, Forget-Me-Not) on Stable Diffusion 3 Medium.
All baselines perform poorly, motivating the authors’ prompt-based baseline, ProLU, which edits prompts through a three-agent pipeline (Remover, Reflector, Checker). ProLU achieves stronger logo removal but somewhat weaker background preservation.
The paper also performs a correlation analysis between unlearning effectiveness and image characteristics (area, location, fractal dimension) and finds only weak relationships.
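As noted above, the evaluation separates logo-region fidelity from whole-image fidelity. Here is a minimal sketch of that idea; the bounding box, function names, and use of plain SSIM are my own illustrative assumptions, and the paper's LogoSSIM/ImageSSIM and CLIP-based scores are defined on its own pipeline.

```python
# Minimal sketch (assumptions mine): compare an original and an "unlearned"
# generation both inside a known logo bounding box and over the full image,
# so that logo removal and global preservation are scored separately.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def local_global_ssim(orig: np.ndarray, edited: np.ndarray, box):
    """orig/edited: HxW grayscale arrays in [0, 1]; box: (y0, y1, x0, x1)."""
    y0, y1, x0, x1 = box
    logo_ssim = ssim(orig[y0:y1, x0:x1], edited[y0:y1, x0:x1], data_range=1.0)
    image_ssim = ssim(orig, edited, data_range=1.0)
    return logo_ssim, image_ssim

# Successful unlearning should give LOW similarity inside the logo box
# (the logo changed) and HIGH similarity elsewhere (the rest is preserved).
rng = np.random.default_rng(0)
orig = rng.random((128, 128))
edited = orig.copy()
edited[32:64, 32:64] = rng.random((32, 32))   # simulate a replaced logo region
print(local_global_ssim(orig, edited, (32, 64, 32, 64)))
```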
- New benchmark focusing on logo unlearning. This is a neglected but socially relevant copyright-protection task.
- The work is clearly written and easy to follow.
- The five proposed metrics systematically separate local logo removal from global fidelity, going beyond binary success rates.
- LU-500 focuses only on Fortune 500 logos; small-brand or non-Latin logos are not covered.
- Reliance on CLIP and SSIM metrics raises concerns about semantic leakage or bias: low CLIPScore may not perfectly reflect successful logo removal. Human evaluation or perceptual studies would strengthen claims of “logo removal.”
- The benchmark and metrics are valuable, but ProLU mainly repurposes LLM-based prompt rewriting without clear algorithmic innovation beyond dataset design.
See weaknesses above. |
Heavily AI-edited |
|
LU-500: A Logo Benchmark for Concept Unlearning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a logo-unlearning benchmark, LU-500, together with five metrics designed for quantitative evaluation of logo-unlearning efficacy. Furthermore, a prompt-based unlearning method, ProLU, is provided.
1. Well-motivated contribution. This paper fills a gap: prior benchmarks largely focus on natural images and stylistic concepts. Logos are intellectual property that diffusion models often memorize, and they are highly relevant for both safety and legal compliance.
2. Proposed benchmark. The proposed LU-500 contains 500 logos across 10 commercial categories.
3. Five quantitative metrics are proposed for evaluating logo-unlearning efficacy.
1. Stronger concept-unlearning baselines (e.g., [1]) should be included in the benchmark evaluation.
2. Limited benchmark scale and diversity. Despite its value, 500 logos remain a small sample relative to the variety of commercial marks. Many logos share similar geometric primitives, which might cause evaluation saturation.
3. Ambiguous boundary between memorization and semantic retention. Some metrics (e.g., CLIP-based LS) may not effectively differentiate between semantic similarity (“a bitten apple”) and literal logo reconstruction (“Apple Inc.” logo). A clearer delineation between concept-level leakage and pixel-level memorization would improve interpretability.
[1] Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models, NeurIPS 2024
Check the above weakness section. |
Lightly AI-edited |
|
BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces BEARD, an open and unified benchmark designed to systematically evaluate the adversarial robustness of models trained via dataset distillation (DD) methods. The benchmark covers multiple DD algorithms, adversarial attacks, and widely used image datasets. The authors formalize an adversarial game framework and employ three key evaluation metrics: the Robustness Ratio, the Attack Efficiency Ratio, and the Comprehensive Robustness-Efficiency Index.
1. The code, leaderboard, and data pools are open-sourced, which can help facilitate future research.
2. The adversarial game formalism is thoughtfully articulated.
1. The empirical results do not directly benchmark against some newer strategies for adversarial training (e.g., [1]) or adversarial attacks (transformation-based attacks [2] and generative approaches [3]), nor do they cover other widely used datasets (e.g., CINIC-10, ImageNet, and MNIST).
2. Section 5 reports trends but lacks deeper causal explanations (e.g., why DM improves CREI).
3. Section 3 introduces too many mathematical definitions, but provides limited experimental interpretation or discussion later.
4. Typos and grammatical errors: 1) "DISCUSSION THE DIFFERENCES BETWEEN BEARD AND OTHER BENCHMARKS" (B.5) -> "THE DIFFERENCES BETWEEN BEARD AND OTHER BENCHMARKS". 2) Missing space between "IDM" and "BACON" in Figure 3. 3) "TinyImageNet" or "Tiny-ImageNet"? The usage should be consistent.
5. From the current description, BEARD appears conceptually similar to DD-RobustBench in both purpose and experimental scope, though the authors claim they provide a more holistic assessment. However, RRM does not provide substantial novelty beyond existing robustness evaluation metrics, and it would be easy to integrate the target settings into DD-RobustBench. Furthermore, the DD methods and attack methods provided in BEARD are also limited. The paper reads more like an engineering consolidation than a fundamentally new contribution. I am not sure I understand it correctly.
[1] Yang, Zhuolin, et al. "Trs: Transferability reduced ensemble via promoting gradient diversity and model smoothness." Advances in Neural Information Processing Systems 34 (2021): 17642-17655.
[2] Yun, Zebin, et al. "The Ultimate Combo: Boosting Adversarial Example Transferability by Composing Data Augmentations." Proceedings of the 2024 Workshop on Artificial Intelligence and Security. 2024.
[3] Wei, Zhipeng, et al. "Enhancing the self-universality for transferable targeted attacks." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023.
What is the meaning of $|\epsilon|$? Why do the authors use $\epsilon = 8/255$ and $|\epsilon| = 8/255$ interchangeably?
Fully human-written |
|
BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces BEARD, a unified benchmark for evaluating the adversarial robustness of models trained on distilled datasets. It proposes an adversarial game framework and three new metrics (RR, AE, CREI) to systematically assess robustness across multiple datasets, distillation methods, and attack strategies. The benchmark includes a dataset pool, model pool, and leaderboard, with extensive experiments showing that dataset distillation can improve robustness, especially when combined with adversarial training.
- Novel and well-defined evaluation framework and metrics.
- Comprehensive experiments across datasets, methods, and attacks.
- Open-source code and leaderboard support reproducibility and community adoption.
- Limited to image classification; does not cover other modalities.
- Does not include very large-scale datasets like ImageNet.
- Some results (e.g., robustness improvements) are not thoroughly analyzed or compared to non-distilled baselines.
- Some metrics (e.g., AST) may be sensitive to hardware and implementation details.
1. Why does dataset distillation often improve adversarial robustness? Is it due to implicit regularization or reduced capacity?
2. How does BEARD compare to training on random subsets of the original data?
3. Could the benchmark include more recent distillation methods (e.g., SRe2L, CAFE)? |
Fully AI-generated |
|
BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces a benchmark for evaluating the robustness of dataset distillation approaches against adversarial examples. The authors wrap several models, datasets, and attacks, and provide a set of metrics to evaluate the adversarial robustness. The wide range of experimental results is also collected in a leaderboard, which is publicly released (together with the implementation code).
- The paper is clear and well written
- The addressed problem is relevant, and I think those benchmarks and their codebase are very valuable for the research community and can serve as a baseline both for attacks and defenses
- The authors made a lot of effort to wrap together models, datasets, and attacks, and run a considerable amount of experiments
- I am concerned about the contribution, as it appears weak (particularly considering this venue), both from a technical and novelty point of view. The authors (although I recognize the hard work that has been made) "simply" wrap together existing works, whereas the most novel contribution appears to be the proposed metrics (on which I have some concern, see below). Additionally, there is a non-negligible overlap with the competing DD-RobustBench work, with only incremental improvements over it.
- I don't understand the reason to define relative metrics (RR and AE), as the maximum ASR and AST, which serve as baselines for them, are strongly influenced by several factors (model pool, attack performance, which in turn depends on many parameters, etc.). Why not use absolute metrics, such as an average?
- Using GPU time to measure computational cost is not reliable, as it can be influenced by multiple side effects. A more suitable metric for the attacker's cost is the number of model inferences and gradient computations; in this way, both the model itself and other overheads unrelated to the attack are excluded (a minimal sketch of such a counter follows this list).
- I also have concerns about the attack hyperparameters. Unlike AutoAttack, which is parameter-free (and thus suitable for benchmarking different models), the other approaches require careful tuning of the hyperparameters for each attacked model (e.g., iterations, step size) to achieve the best results. For this reason, the results of those attacks might not be reliable. Moreover, using 10 iterations for PGD is unlikely to lead to convergence of the optimization.
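As a concrete version of the cost-accounting suggestion in the second point above, here is a minimal sketch using standard PyTorch hooks (my own illustration, not part of BEARD): counting forward and backward passes through the attacked model gives a hardware-independent measure of attacker cost.

```python
# Minimal sketch (not part of BEARD): count forward passes and backward passes
# through the attacked model, independently of GPU timing and other overheads.
# Assumes the wrapped model returns a single tensor (e.g., logits).
import torch.nn as nn

class CountedModel(nn.Module):
    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model
        self.forward_calls = 0
        self.backward_calls = 0

    def forward(self, x):
        self.forward_calls += 1
        out = self.model(x)
        if out.requires_grad:
            # Fires once per backward pass that reaches this output,
            # i.e. once per gradient computation used by the attack.
            out.register_hook(self._count_backward)
        return out

    def _count_backward(self, grad):
        self.backward_calls += 1
        return grad

# Usage: wrap the victim, run the attack, then report
#   wrapped.forward_calls / n_samples  and  wrapped.backward_calls / n_samples
# as the attack's query and gradient cost.
```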
- Could you please justify the use of relative metrics (RR and AE), especially considering that adding other models/attacks to the benchmark might lead to recomputing the entire results? |
Fully human-written |
|
BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents the BEARD benchmarking framework to evaluate the robustness of models trained via dataset distillation against adversarial attacks. The authors point out that existing benchmarks such as DD-RobustBench and RobustBench fail to sufficiently reflect the actual performance of dataset distillation techniques in adversarial scenarios. To address this issue, BEARD integrates multiple dataset distillation techniques, attack methods, and datasets, while introducing three new metrics: RR, AE, and CREI. These metrics employ game-theoretic ideas to simultaneously evaluate attack effectiveness and efficiency. The benchmarking platform is publicly available, featuring leaderboards and curated model and dataset collections to support reproducible research. Experiments demonstrate that dataset distillation significantly enhances adversarial robustness, particularly when combined with adversarial training.
1. The paper presents the first unified benchmark for adversarial robustness in dataset distillation, introducing a novel adversarial game framework and three tailored metrics (RR, AE, CREI).
2. As dataset distillation gains traction in resource-constrained settings, understanding its robustness is critical. BEARD provides a standardized platform for comparative evaluation.
3. The paper is well-structured, with clear descriptions of the framework, metrics, and experimental setup. The public release of code and leaderboard enhances transparency and usability.
4. The benchmark covers multiple DD methods, attack types, and datasets, offering a comprehensive evaluation landscape.
1. Completeness is somewhat insufficient. This paper systematically expands and deepens DD-RobustBench by introducing unified evaluation metrics, incorporating more attack types, and proposing a game-theoretic framework, yet it does not validate these contributions on larger datasets, more complex architectures, or additional algorithms, and remains confined to relatively simple scenarios.
2. In Section 3.10, the CREI metric locks α at 0.5 without explanation. Giving robustness and efficiency equal weight might not suit every task; an ablation on α or a data-driven reason for this choice would help.
3. In Section 5.1, the claim that “dataset distillation improves adversarial robustness” is counter-intuitive and lacks mechanistic explanation. The observed CREI drop with increasing IPC is noted but not interpreted. Include a discussion on why distilled datasets may enhance robustness—e.g., whether they filter out non-robust features or reduce overfitting. Analyze the IPC–robustness trade-off more deeply.
4. In Section 5.1, the performance differences among DD methods (e.g., why DSA/DM/BACON perform better) are reported but not explained. The analysis remains descriptive. Provide hypotheses or further experiments (e.g., feature analysis, robustness curvature) to explain why certain methods excel.
5. In Figure 4 and Figure 5, the paper claims that "distilled datasets improve adversarial robustness," a conclusion that runs counter to intuition (smaller datasets are usually expected to yield more fragile models), yet no convincing explanation is provided. Discuss the interaction between dataset scale, distillation method, and adversarial training to provide more actionable insights.
1. Why was α=0.5 chosen for CREI? Have you experimented with other values, and how sensitive are the rankings to this parameter? (A small sketch of the kind of sweep I have in mind follows these questions.)
2. Can you provide a deeper explanation for why some DD methods (e.g., DSA, DM, BACON) exhibit stronger adversarial robustness? Is it related to their distillation objectives or synthetic data diversity?
3. The conclusion that “distillation improves robustness” contradicts the common belief that smaller datasets lead to weaker models. Can you discuss potential reasons for this phenomenon? |
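Regarding question 1, the kind of sensitivity check I have in mind is sketched below. It assumes CREI is a convex combination of RR and AE with weight α, which is only inferred from the description of a fixed α = 0.5; the RR/AE values are placeholders, and the paper's actual definition should be substituted if it differs.

```python
# Sketch of an alpha-sensitivity check (see question 1). Assumes
# CREI(alpha) = alpha * RR + (1 - alpha) * AE, which is an inference on my
# part; the RR/AE values below are hypothetical placeholders.
import numpy as np

methods = {"DSA": (0.62, 0.48), "DM": (0.58, 0.55), "BACON": (0.60, 0.51)}  # hypothetical (RR, AE)

for alpha in np.linspace(0.0, 1.0, 11):
    scores = {m: alpha * rr + (1 - alpha) * ae for m, (rr, ae) in methods.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(f"alpha={alpha:.1f}  ranking={ranking}")

# If the rankings are stable across alpha, fixing alpha = 0.5 is harmless;
# if they flip, the choice needs explicit justification or a data-driven setting.
```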
Fully AI-generated |
|
Enabling Your Forensic Detector Know How Well It Performs on Distorted Samples |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
Conventionally, fake-image detectors assign a real/fake label. However, detecting fake images in the wild involves dealing with post-processing operations that modify the original signal and hence affect detection. The authors aim to develop an effective method for predicting the detector's confidence as well. This is especially challenging since neural networks are not well calibrated. To do this, the authors leverage image quality to measure confidence. The change in image quality depends on the reference image, which is conventionally not available at test time; to handle this, the authors train a neural network to predict it. Experiments show improved calibration, and the authors also show how this can be used to detect fake images more effectively.
1. The problem of uncertainty estimation in fake image detection is both interesting as well as practically relevant. It is also an understudied problem.
2. The focus of the study on distortions also makes the paper practically relevant.
3. Multi-Detector Routing and Confidence-Based filtering are good use cases for the method.
1. The experiment reported in Section 3 would benefit from the inclusion of more details, for instance, what data are used, to what degree each post-processing operation is applied, etc.
2. My main concern comes from the fact that the existing method seems to account only for single distortions. However, one can compose distortions (for instance, resize first, then blur). It is currently unclear to me how the method would work in these settings.
3. The limitations should be discussed in further detail. The issues that the current method has with data coming from different sources would be interesting and insightful to the community.
4. The plots have extremely small text and can be hard to follow; it would be better if the text in the plots were much larger than it currently is, especially for the plots in the appendix and Fig. 2.
Minor Weaknesses
1. Lines 42-43: This statement is not correct. Many detectors use common post-processing operations as part of their training [1,2].
References:
1. Wang, S. Y., Wang, O., Zhang, R., Owens, A., & Efros, A. A. (2020). CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8695-8704).
2. Gragnaniello, D., Cozzolino, D., Marra, F., Poggi, G., & Verdoliva, L. (2021). Are GAN generated images easy to detect? A critical analysis of the state-of-the-art. arXiv preprint arXiv:2104.02617.
1. Does the method currently account for multiple distortions? If not, can it be made to account for this case?
2. For training, why does Equation 7 use an MSE loss as opposed to a binary cross-entropy loss?
Fully human-written |
|
Enabling Your Forensic Detector Know How Well It Performs on Distorted Samples |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose to train a predictor for calibration-type confidence, in the sense of the probability of the prediction being wrong. They do this specifically for GAN-generated image detection.
To do this, they use detector features, features from a predictor of distortion type, and features from a predictor of no-reference image quality, with an MLP on top to predict the confidence via regression.
They perform experiments on the correlation between the predicted and the true calibration, and they show the usefulness of the predictor for top-1 routing among GAN-generated image detectors and for confidence-based vote abstention, evaluated by a ranking measure.
- They perform experiments beyond just training the calibrator.
- It shows its usability for top-1 routing and for vote abstention / or low confidence flagging.
- If one wanted to hide the fact that an image was generated by a deep learning model, image distortions are a natural candidate for obfuscation, so the setting makes sense.
- Clear idea
- well readable paper
- The novelty is not the greatest; it is not the kind of work one would expect to be presented as an oral.
- The dataset used for the main experiments, ProGAN, is from the pre-diffusion-model era.
It would be better to see the results for distortions on diffusion datasets as well. The authors do this for the cross-evaluation in Section 5.4, but it would be good to have done it for Section 5.3 and for the confidence evaluation too.
none |
Fully human-written |
|
Enabling Your Forensic Detector Know How Well It Performs on Distorted Samples |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes DACOM, a neural regressor that predicts the probability that a given forensic detector will correctly classify a (possibly distorted) image. The key insight is that FR-IQA scores correlate monotonically with detector accuracy conditioned on distortion type. Training labels are obtained by bucketing FR-IQA scores per distortion type and mapping bin-wise balanced accuracies to [0,1]. At inference, DACOM uses detector features, NR-IQA features, and a distortion-type embedding to estimate sample-level confidence without references. The score enables "selective abstention" and "top-1 routing" among detectors and improves overall accuracy on several benchmarks.
1. The paper introduces a novel framework that tackles reliability estimation, which is orthogonal and complementary to existing works.
2. It provides a principled, detector-conditioned confidence definition and a practical solution that avoids requiring reference images at test time.
3. Extensive experimental validation on diverse detectors and a broad spectrum of distortions is conducted. Ablations clearly show each component’s contribution.
4. The paper is well written and easy to follow.
1. Evaluations rely mainly on Balanced Accuracy and EER. In practice, real and fake samples are imbalanced. Would precision-recall–based measures (e.g., AUC-PR, F1) change the conclusions?
2. The inference pipeline runs QualiCLIP, ARNIQA, and the detector for every input, which may be costly on edge devices. A detailed analysis of timing/FLOPs is missing.
3. While abstention can improve safety, it also off-loads decisions to human operators and potentially offers adversaries a mechanism to trigger systematic abstention. The paper lacks discussion of these risks and possible mitigations.
1. How would the results change if Balanced Accuracy is replaced with AUC-PR or F1 measures?
2. Please report the inference overhead of DACOM and compare it to baselines.
3. Please add more discussions about ethical aspects.
4. Minor issue: text in Fig. 2 and Fig. 6 is tiny; please enlarge.
Lightly AI-edited |
|
Enabling Your Forensic Detector Know How Well It Performs on Distorted Samples |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper addresses the problem that forensic detectors for AI-generated images produce predictions without indicating reliability when test images undergo various distortions such as compression and noise. The authors propose DACOM (Distortion-Aware Confidence Model), which uses full-reference image quality assessment metrics to label training data with detectability scores during training, then learns to predict sample-level confidence using detector features, no-reference image quality descriptors, and distortion-type information at inference time.
1. The paper's motivation and perspective are well-founded and novel. Few existing works approach AI-generated image detection from the lens of image quality assessment to evaluate detector reliability under distortions. This angle provides a fresh and meaningful contribution to the forensics community.
2. The paper is well-organized with clear logical flow. The authors systematically progress from problem identification (Section 3 analysis) to method design (Section 4), making the work easy to follow. The empirical analysis establishing the correlation between FR-IQA scores and detection accuracy provides solid justification for the proposed approach.
3. The experimental evaluation is comprehensive and thorough. The authors provide extensive ablation studies (Section 5.5), test on multiple distortion types (both seen and unseen), evaluate across different datasets (Evaluation-dataset and Cross-dataset), and include detailed supplementary results in the appendix, all of which substantiate their claims with adequate evidence.
1. Concerns regarding the use of detector performance for labeling in Stage A. The authors use the detector's balanced accuracy on each bin to generate detectability labels (Equation 3-4). This raises several concerns: (a) If FR-IQA scores already exhibit monotonic correlation with detection performance (as shown in Section 3), why is the additional step of computing detector accuracy necessary? Could the FR-IQA scores themselves serve as supervision? (b) More critically, this design may limit generalizability—if training data is labeled using Detector A's performance, will DACOM trained on this data generalize well to Detector B? This detector-specific labeling could hinder practical deployment across different detection models. (c) The requirement to evaluate detector performance on large distorted datasets during Stage A significantly increases the computational cost of the training pipeline.
2. Limited discussion on robustness training and its interaction with the observed monotonicity. The authors address distortion robustness by applying light data augmentation (10% JPEG compression and blur) during detector training. However: (a) Only two distortion types are used for augmentation, which seems insufficient given the diversity of real-world distortions. (b) It remains unclear whether the monotonic relationship between FR-IQA and detection accuracy (Section 3) still holds when detectors are trained with more aggressive data augmentation strategies. If extensive augmentation flattens the performance curve across distortion levels, would DACOM's premise still be valid? This interaction between robustness training and the proposed method deserves further investigation.
3. Limited training data scale may restrict generalization. The model is trained on only 2,500 base images from a single dataset (ProGAN subset). While distortion augmentation expands this to 200K+ samples, the underlying content diversity remains limited. This could lead to: (a) Overfitting to the specific visual patterns in these 2,500 images. (b) Poor generalization to different content types, as evidenced by the noticeable performance drop in Cross-dataset evaluation (Table 4). Experimenting with larger and more diverse training sets would strengthen the claims about generalizability.
4. The proposed applications have practical limitations that merit further exploration. While the paper demonstrates two uses of DACOM—selective filtering and multi-detector routing—both have notable drawbacks: (a) Selective abstention necessarily reduces coverage, which may be unacceptable in applications like content moderation where all samples must be processed. (b) Multi-detector routing requires maintaining and running multiple detectors (6× computational cost in experiments), which may be prohibitively expensive for real-time systems. I suggest the authors explore alternative applications that leverage DACOM more seamlessly, such as incorporating confidence-aware calibration directly into a single detector's training or inference process, or using confidence scores to dynamically adjust decision thresholds rather than completely abstaining from prediction.
Overall assessment: Despite these concerns, I view this as a valuable contribution that introduces a novel perspective on detector reliability. If the authors can adequately address the above concerns—I would be inclined to raise my score.
1. The formula y = 2×|BAcc - 0.5| assumes equal difficulty in improving accuracy across the entire range (e.g., 50%→75% vs. 75%→100%). However, achieving near-perfect accuracy is typically much harder than reaching moderate levels, suggesting a non-linear relationship.
Have you experimented with non-linear transformations (e.g., y = (2×|BAcc - 0.5|)^α with α > 1, or logarithmic scaling) that better reflect the diminishing returns at higher accuracy? Alternatively, why not use BAcc directly as labels without transformation? An ablation study comparing different label functions would clarify whether this linear design is optimal or just a convenient choice. (A short numerical illustration follows these questions.)
2. You train four DACOM variants with different FR-IQA metrics (Table 6) and all perform similarly. Does this mean the choice of FR-IQA is not critical? If so, why present four variants instead of selecting one? What guidance do you offer to practitioners on choosing the FR-IQA metric?
3. What is the parameter count and FLOPs of DACOM? Since it must run alongside the detector, efficiency matters. How does DACOM's overhead compare to the detector itself (e.g., DACOM adds X% latency)? |
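To illustrate what question 1 is asking, the small numerical sketch below contrasts the described linear map with one possible power-law variant. The values are illustrative only; whether a non-linear map actually helps would need to be tested on DACOM's real training labels.

```python
# Illustration of the alternatives raised in question 1: the described linear
# map y = 2*|BAcc - 0.5| versus a power-law variant y = (2*|BAcc - 0.5|)**alpha.
import numpy as np

bacc = np.linspace(0.5, 1.0, 6)           # balanced accuracy per bin
linear = 2 * np.abs(bacc - 0.5)           # transform as described in the paper
power = (2 * np.abs(bacc - 0.5)) ** 2.0   # alpha = 2: compresses the moderate
                                          # range, spreads the near-perfect bins
for b, l, p in zip(bacc, linear, power):
    print(f"BAcc={b:.2f}  linear={l:.2f}  power(alpha=2)={p:.2f}")
```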
Fully AI-generated |
|
Network of Patterns: Time Series Forecasting with Pattern Passing |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes the Network of Patterns (NoP) for time series forecasting, which flexibly connects all relevant pattern segments to enable comprehensive interactions while employing a Pattern Passing strategy to efficiently propagate information.
The paper presents a complete structure and introduces SegmentTST as an auxiliary study, outlining a relatively clear research pathway.
- In the last sentence of the third paragraph in the Introduction, the reference should be made to Figure 2(b) rather than Figure 2(a).
On the other hand, the explanation based on *dependency* is not convincing — a low attention score may simply result from high similarity and redundant information between the two elements. This raises concerns about the fundamental motivation of the work.
Moreover, even if the explanation were valid, the rationale for removing the constraints of chain or tree structures remains weak, since a graph structure can naturally be regarded as a generalization of both. The work also lacks realistic motivations such as temporal dependency or hierarchical periodicity.
Without solid reasoning, the model design appears to offer limited novelty, as the objective seems not to address a genuine problem but merely to pursue a higher SOTA performance.
- Comments on the writing of Method section:
1. The subsections are poorly connected, resulting in a strong sense of fragmentation. Readers may find it difficult to understand the purpose of each part when reading, which hinders overall readability.
2. It is recommended to provide the shape changes of the variables throughout the pipeline to enhance clarity and reproducibility.
3. The authors should explicitly state that the model is constructed using only the seasonal component and explain the rationale behind this design choice, which would improve both readability and methodological rigor.
4. For example, the opening sentence of Section 3.1 lacks fluency and could be revised to improve readability and clarity.
- Regarding the virtual pattern node proposed in Section 3.3:
The authors arbitrarily introduce an additional node, and the ablation study demonstrates its effectiveness. From another perspective, this suggests that the graph construction may be incomplete. The authors do not clarify how the virtual node connects with other nodes. If the connections are bidirectional, it effectively introduces additional edges between existing nodes, which could be seen as evidence of imperfect graph construction. If the connections are unidirectional (as illustrated in Figure 3), the aggregation appears to be meaningless. The authors should provide evidence that the graph construction is sufficiently complete—for example, by considering alternative graph construction methods and corresponding experiments with and without the virtual pattern node.
- The experiments are not fully convincing: The manuscript lacks comparisons with strong baselines such as xPatch. Moreover, given the rapid development of large-scale time series models, it is necessary to include comparisons with models like Moirai and Time-MoE. In scenarios with sufficiently long input sequences, the full-shot performance of the proposed model should surpass the zero-shot performance of these models to provide convincing evidence of its effectiveness.
Additionally, I recommend that the authors supplement their experiments with results on other datasets, such as Monash, since the current benchmark is somewhat controversial (https://cbergmeir.com/talks/neurips2024/) and its performance has reached a high level of saturation (https://arxiv.org/abs/2510.02729). Including datasets like Traffic could further enhance the credibility of the experimental evaluation.
The main points are detailed in the Weaknesses section. The authors should reiterate the core motivation of their work, provide additional experiments to demonstrate the effectiveness of the virtual pattern node design, furnish further evidence to validate the overall model, and improve the clarity and completeness of the writing. |
Lightly AI-edited |
|
Network of Patterns: Time Series Forecasting with Pattern Passing |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces NoP, a novel method for time series forecasting that organizes multi-scale pattern segments into a network structure, using a Spectrum KL Divergence to measure similarity between segments. A Pattern Passing mechanism is then employed to aggregate information. The authors claim that NoP overcomes the limitations of traditional chain- and tree-based structures by enabling more flexible and comprehensive interactions between patterns. Extensive experiments on multiple benchmarks show that NoP achieves state-of-the-art performance, validating its effectiveness.
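For concreteness, here is one plausible way a spectrum-based KL divergence between two pattern segments could be computed: take the FFT magnitude of each segment, normalize it into a distribution over frequency bins, and compute the KL divergence between the two distributions. This is my own reconstruction for illustration; NoP's exact normalization and any symmetrization may differ.

```python
# Minimal sketch (assumptions mine, not the paper's exact formulation) of a
# spectrum-based KL divergence between two pattern segments.
import numpy as np

def spectrum_kl(x: np.ndarray, y: np.ndarray, eps: float = 1e-8) -> float:
    px = np.abs(np.fft.rfft(x)) + eps   # magnitude spectrum of segment x
    py = np.abs(np.fft.rfft(y)) + eps   # magnitude spectrum of segment y
    px, py = px / px.sum(), py / py.sum()
    return float(np.sum(px * np.log(px / py)))

seg_a = np.sin(np.linspace(0, 8 * np.pi, 96))          # 4 cycles
seg_b = np.sin(np.linspace(0, 8 * np.pi, 96) + 1.0)    # phase-shifted copy
seg_c = np.sin(np.linspace(0, 2 * np.pi, 96))          # 1 cycle
print(spectrum_kl(seg_a, seg_b))  # small: similar spectra despite phase shift
print(spectrum_kl(seg_a, seg_c))  # larger: different dominant frequencies
```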
* **Novelty**
The core idea of organizing pattern segments into a network, as opposed to chain or tree structures, is novel and represents a creative contribution to the field of time series forecasting.
* **Empirical Results**
The NoP demonstrates superior performance over a wide range of baselines, including recent pattern-based and other forecasting methods, across both long- and short-term tasks. The ablation studies further substantiate the importance of each component.
* **Thorough Evaluation**
The authors have conducted a comprehensive set of experiments, including comparisons with different structural variants (chain, tree, network), hyperparameter sensitivity and efficiency analysis.
* **Insufficient Motivation and Theoretical Justification**
The paper's primary weakness is the lack of a compelling theoretical motivation for why a network structure is necessary and fundamentally more suitable than chain or tree structures. The authors empirically show that the network-based method performs better; however, this argument remains largely heuristic. The paper would be significantly strengthened by a more rigorous theoretical discussion or analysis that explains the inherent advantages of NoP.
* **Limited Analysis of Model Effectiveness**
While the paper shows that the method works, it falls short of deeply analyzing why. For instance, beyond the visualization in Figure 5, a more detailed analysis of the properties of the learned pattern network would be insightful.
* **Lack of Computational Complexity Analysis**
The paper mentions the efficiency of NoP but does not provide a detailed computational complexity analysis.
* **Reproducibility Concerns**
The paper includes a "Reproducibility Statement" (on page 10, and I’m not sure if this violates the rule that "At the time of submission, the main text should be 9 pages or fewer"). However, at the time of review, no code or implementation is provided.
See Weaknesses. |
Fully AI-generated |