|
ACTS: Adaptive Control for Test-time Scaling |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Adaptive Control Token Sampling (ACTS), a framework that dynamically regulates the reasoning length of LLMs at test time using the sub-argmax probabilities of control tokens (e.g., EOS, EOT). The authors frame generation as an optimal stopping problem and design several policies to balance “underthinking” (premature termination) and “overthinking” (unnecessary reasoning). Experiments across reasoning and instruction-following (AlpacaEval) benchmarks show small gains in accuracy or efficiency.
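To make the mechanism concrete for other readers: the signal ACTS relies on can be illustrated with a small sketch. The threshold value, token ids, and probability matrix below are my own placeholders, not the authors' implementation.

```python
# Minimal sketch (my construction, not the authors' code): flag decoding steps
# where the probability mass on a control token (EOS / end-of-thinking) spikes
# above a threshold, even when that token is not the argmax.
import numpy as np

def control_token_spikes(step_probs, control_ids, threshold=0.05):
    """step_probs: (num_steps, vocab) next-token probabilities; returns the
    steps at which any control token exceeds `threshold`."""
    spikes = []
    for t, probs in enumerate(step_probs):
        if max(probs[i] for i in control_ids) > threshold:
            spikes.append(t)
    return spikes

# Toy example: 6 steps over a 5-token vocabulary; id 4 plays the EOT role.
probs = np.tile([0.30, 0.30, 0.25, 0.13, 0.02], (6, 1))  # control token quiet
probs[3] = [0.20, 0.20, 0.15, 0.05, 0.40]                # spike at step 3
print(control_token_spikes(probs, control_ids=[4]))      # -> [3]
```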
- Using control-token probabilities for inference-time control is original and interesting, potentially offering a lightweight alternative to external reward models or verifier signals.
- The paper correctly situates itself within the growing literature on test-time scaling, referencing S1, Self-Consistency, and Speculative Rejection works.
- The submission appears rushed and lacks careful proofreading. Algorithm 2 and Figure 5 overflow the page boundaries; Figures 6 & 7 have inconsistent spacing; some sub-figure captions (e.g., “accumulated,” “last-interval”) are unclear; Table 2 includes ambiguous terms such as “unconditional forking.” Overall readability and figure organization require substantial revision.
- Improvements in Tables 1–2 are small (≈ 2–3% absolute at best) and the accompanying token-efficiency trade-offs are inconsistent. It is unclear whether these changes are statistically meaningful (one simple check is sketched after this list) or justify a new inference-control paradigm.
- The paper claims ACTS mitigates overthinking and underthinking, yet provides no quantitative trajectory analysis or diagnostic evidence (e.g., token-level reasoning-trace inspection, error typology, or heuristic measures of thought quality). Without such evidence, the claimed cognitive interpretation remains speculative.
- The spike thresholds and critique-trigger parameters appear hand-tuned, but no ablation or development experiment explains their choice or sensitivity.
- Operating on the “stopping probability” is a modest extension of known heuristic stopping rules; the conceptual leap from S1’s budget forcing or early-termination heuristics is incremental.
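To substantiate the significance concern, a paired bootstrap over per-example correctness is one check the authors could report. The sketch below uses synthetic outcome vectors and is illustrative only, not the paper's evaluation code.

```python
# Illustrative paired bootstrap (synthetic data, not the paper's results): how
# often could a ~2-3% absolute accuracy gain arise from resampling noise alone?
import numpy as np

def paired_bootstrap(base_correct, new_correct, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.asarray(new_correct, dtype=float) - np.asarray(base_correct, dtype=float)
    observed = diffs.mean()
    n = len(diffs)
    # Resample examples with replacement and track the resampled mean gain.
    boots = diffs[rng.integers(0, n, size=(n_boot, n))].mean(axis=1)
    return observed, float((boots <= 0).mean())   # one-sided bootstrap p-value

# Synthetic example: 500 problems, baseline ~62% accuracy, method ~2-3% higher.
rng = np.random.default_rng(1)
base = rng.random(500) < 0.62
gain = (~base) & (rng.random(500) < 0.10)   # cases the method newly solves
loss = base & (rng.random(500) < 0.02)      # cases the method newly breaks
new = (base | gain) & ~loss
obs, p = paired_bootstrap(base, new)
print(f"observed gain = {obs:.3f}, bootstrap p = {p:.4f}")
```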
See weaknesses. |
Heavily AI-edited |
|
ACTS: Adaptive Control for Test-time Scaling |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces Adaptive Control Token Sampling (ACTS) that effectively mitigates the negative effects of "underthinking" and "overthinking" through a fine-grained control over special tokens.
The paper proposes several effective methods for mitigating overthinking and underthinking. Experiments do show their effectiveness over benchmarks where these phenomena are observed.
1. I must question the contribution of the paper, as it appears to me overclaimed. The main idea is three-fold: when underthinking is a problem, make the model think longer by saying "wait"; when overthinking is a problem, cut the model short once it has had several opportunities to end its thinking; and self-critique, which is a rather self-contained method. The rest of the paper deals with engineering these approaches. However, there is no central theme tying these methods together. It feels as if the authors tried these methods out, saw some performance improvements, and attempted to put them under the same framework, whereas they really should be treated as separate engineering tricks and studied individually. The most interesting method by far is the self-critique framework, with an intuitive explanation of efficient tree search for LLM problem solving, but it is only studied superficially through performance gains, without any attempt at theoretical analysis or a closer look into its mechanisms.
2. The mitigation of underthinking and overthinking is restricted to reasoning and instruction-following tasks, respectively. This is a problem because the methods appear to target these two problems separately and only within their respective applications, putting into question the generalizability of the proposed methods.
3. The general presentation of the method was not intuitive to me. I think the paper tried to go from a general framework to a more specific implementation, where Section 3 establishes the ACTS framework and Section 4 specifies it into the proposed methods. This overcomplicates the narrative because Section 3, taken out of context, is rather confusing; for example, "forcing the emission of the appropriate control token" is too general on Line 151.
4. Section 4 has numerous writing problems that also overcomplicate things. First, the titled paragraphs are not individual policies but a progression of scattered ideas. Among the actual methods used in Sections 6.2/6.3 (Accumulated / Last-Interval / N-Spike), only one is explained in the main text. Next, many concepts are mentioned but not discussed adequately, such as $N_{patience}$ being used without definition on Line 178 (minor issue), the unspecified directive for critic self-evaluation through Lines 185-192, an unfinished sentence on Line 203 (also, I would caution against comparing the LLM thinking procedure to humans unless it is highly relevant), the unspecified "opportune" moment on Line 205, etc. Algorithm 1 is also not really needed in the main text, as a straightforward textual explanation makes it clear enough for me.
Q1. Most importantly, I think the narrative of the paper needs to be revised. Targeting my weakness 1 and 2, can you clarify what main methods are used in the paper and how they fall under the same framework? Just to be clear I'm not asking for a simple restatement of contributions, but a more structured discussion of the methods' relations, synergies, etc.
Q2. Can you also clarify what the requirements are before applying your methods? Specifically, beyond the tested benchmarks, do we need to know if the model is underthinking or overthinking as a priori, and what other metrics need to be assessed if any?
Q3. For Section 6.1, are all the plots obtained with a single prompt? For Figure 2, did you apply a cutoff at 0.0001? If so, why did you choose this cutoff and what is the largest observed value? If not, why does the probability reach this value but not beyond so frequently? For Figure 3, can you explain why the spikes become much sparser after a while? Did the model rollout collapse in a way so as to never stop thinking?
Q4. What are "Accumulated" and "Last-Interval" methods in Figures 5 and 6? May be related to Q1.
Q5. For Figure 6, since we are dealing with underthinking, I assume that every time the model terminates prematurely, a "Wait" token is appended to keep it thinking. In this context, what does N-Spike refer to? Why is there a distinction between different numbers of "Wait" tokens? Or is it "whichever comes first"?
Q6. For Section 7, the main table shows both accuracy and token count increasing over the baseline. In fact, the Avg. Token Count column is marked incorrectly in terms of best performance. Shouldn't this be an application of mitigating underthinking rather than overthinking, since both accuracy and token count increase over the baseline? |
Fully human-written |
|
ACTS: Adaptive Control for Test-time Scaling |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
Current language models often operate in "thinking" mode, allocating a fixed budget to reasoning. However, models may overthink or underthink in these settings; there is a need for calibration. This paper introduces an inference-time framework, ACTS, which leverages EOT probability spikes to mitigate underthinking in complex reasoning scenarios. Across a variety of reasoning tasks and models, the proposed method improves over naive decoding strategies, pushing the Pareto frontier of best performance at a reduced token budget.
* The proposed approach is training-free, making it lightweight and accessible (to models that provide token probabilities)
* The problem is clearly motivated and timely
* Analysis is provided that justifies the method (i.e., token probabilities of EOS)
* Section 4.1 is well-justified and seems to cover the most likely scenarios encountered during decoding.
* ACTS explicitly frames reasoning control as an optimal stopping problem; this provides a more principled lens for test-time scaling. This formulation unifies stopping, critique, and branching decisions under one controller.
* ACTS improves accuracy while reducing token usage on reasoning and instruction-following tasks.
* Though the formulation is interesting, it is unclear how it will generalize to models outside the ones tested; reasoning is only tested on two families of model, and instruction-following on one. Since this is a lightweight, inference-time method, it can be more easily verified by running on more models.
* This space is saturated and it is unclear how significantly this improves over existing work like S1, and other early-stopping methods (including ones that are inference-time only as well). In addition to further explaining this novelty, it would be helpful to show empirically it works well against more baselines, instead of against fixed decoding strategies (like greedy, or inserting wait x times).
* There are several formatting errors which make the paper slightly harder to read. For example, Figure 5 extends beyond the page boundary, and also impacts the next page by forcing a single column. Same thing for Algorithm 2.
* The paper shows spikes correlate with reasoning boundaries but doesn’t explore this in more depth. It would be interesting to know why these spikes emerge or whether they consistently indicate correctness vs. hesitation.
* Can ACTS improve performance on other benchmarks besides reasoning/instruction-following, or is this a specialized phenomenon in these domains? |
Fully human-written |
|
ACTS: Adaptive Control for Test-time Scaling |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes ACTS (Adaptive Control for Test-time Scaling), which measures spikes in EOS and EOT token probabilities to detect the model's intention to terminate. The proposed method treats generation as a control process, monitoring the control-token probabilities and deciding at each decoding step whether to continue reasoning, critique itself, or terminate. The strong empirical results demonstrate that ACTS can effectively reduce the average number of generated tokens during decoding while achieving comparable or better results, and can effectively help address the underthinking and overthinking issues of recent reasoning models.
1. The idea of monitoring spikes of the signal tokens (EOS and EOT) to determine the model's intention to terminate is novel and interesting. It also aligns well with recent work that measures token entropy or probability to track major transitions during the LM reasoning process. [1]
2. The empirical results are strong, showing significant improvement on the MATH500 and AIME datasets while reducing the average token length to save compute.
3. The method is simple and easy to implement, and can easily be applied to all kinds of reasoning tasks.
4. The paper is well written and easy to follow.
[1] Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
1. Minor issue: The threshold choices for spikes and critique triggers are empirically tuned. The inability to determine the thresholds automatically can make the method harder to apply across tasks.
None |
Fully human-written |
|
Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a method to prune out extra reasoning steps in reasoning models, and then SFT+RL with the result to allow for more concise chains of thought.
1. Clarity: The paper is generally written pretty clearly, and the method is not very complex which is nice. I would say that the writing is a little repetitive (saying the same thing many times), but not to the point where it makes the paper hard to understand.
2. Significance: It seems that the method, while simple, is reasonably effective. It greatly compresses the resulting CoT at a reasonable loss in accuracy.
1. To be honest, the method feels rather "hacky" to me, inserting skip tokens based on heuristics. My feeling is that the community in general is trying to move towards methods that perform end-to-end RL in a more principled way rather than these sorts of processes.
2. Relatedly, while this paper proposes methods to compress chains of thought, there are other methods to directly control the length of chains of thought such as L1 (Aggarwal and Welleck). These seem simpler, more elegant, and can also be retro-fitted onto existing models. I was surprised that there was no discussion of this work, and it seems like it would be a competitive baseline.
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
Pranjal Aggarwal, Sean Welleck
None |
Fully human-written |
|
Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes **step entropy** as a signal to identify and prune redundant segments within Chain-of-Thought (CoT) traces, aiming to make slow “deliberate” reasoning faster without sacrificing accuracy. Concretely, the step-level entropy for the \(i\)-th step \(S_i\) is defined as the sum of token entropies conditioned on the prior context, \(H(S_i \mid S_{<i})=\sum_j H(t_{i,j}\mid c_{i,j})\). The core hypothesis is that **low-entropy steps contribute little information** to the final answer and can be safely skipped. The authors present (i) an information-theoretic motivation; (ii) a pruning recipe that removes the lowest-entropy steps and replaces them with a special token (e.g., [SKIP]); and (iii) a training pipeline (SFT and GRPO) to teach models to emit compressed CoTs during inference. Empirically, they report substantial token reductions (often 16–57%) with modest accuracy degradation on math-style reasoning benchmarks.
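For concreteness, my reading of the step-entropy definition corresponds to the sketch below; the segmentation into steps and the array shapes are assumptions on my part, not the authors' code.

```python
# Sketch (my own reading of the definition, not the authors' code): step entropy
# as the sum of token entropies within a step, where each token in a step has a
# next-token distribution conditioned on the prior context.
import numpy as np

def token_entropy(dist: np.ndarray) -> float:
    p = dist[dist > 0]
    return float(-(p * np.log(p)).sum())

def step_entropies(step_token_dists: list) -> list:
    """step_token_dists[i] has shape (num_tokens_in_step_i, vocab); the step
    entropy is the (unnormalized) sum over its token entropies."""
    return [sum(token_entropy(d) for d in dists) for dists in step_token_dists]

# Toy example: two 'steps' over a 4-token vocabulary.
confident = np.array([[0.97, 0.01, 0.01, 0.01]] * 5)   # long but low-entropy step
uncertain = np.array([[0.25, 0.25, 0.25, 0.25]] * 2)   # short but high-entropy step
print(step_entropies([confident, uncertain]))
# Low-entropy steps would be the candidates for pruning / [SKIP] replacement.
```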
- **Conceptual clarity:** A clear, information-theoretic criterion (step entropy) with an intuitive link to redundancy.
- **Granularity choice:** Results indicate step-level pruning is more effective than naïve token-entropy pruning, suggesting the *step* is a useful unit.
- **Practical interface:** Using a placeholder like [SKIP] makes the compression operationally simple; ablations on replacement strategies are helpful.
1. **Autoregressive dependency not directly addressed**: Evidence is largely post-hoc (compress after generating a full CoT). In AR decoding, earlier “redundant” content can steer later tokens; removing it afterward does not prove it can be skipped causally during generation.
2. **Low-entropy step pruning shows no pre-training practical gain; current use is post hoc.** As implemented, “Inference with Compressed CoT” is applied **after generating the full CoT**, so it yields no acceleration and provides limited practical value before additional training. To make this genuinely efficient without extensive post-training, the method should be reframed as an inference-time control.
3. **Attribution vs. training data/compute:** The main contribution is both a compression rule and a data-construction pipeline (e.g., ~130k compressed pairs). Without baselines trained on identical data with matched optimization budgets, it’s hard to attribute gains to step entropy rather than more/better post-training.
4. **Fixed compression ratio:** A static global pruning rate (e.g., “up to 80%”) ignores that redundancy varies by instance difficulty, dataset, and model size; no mechanism adapts compression per-instance.
5. **Step segmentation heuristic:** Steps defined by formatting (e.g., `\n\n`) can be brittle. The paper does not validate segmentation accuracy or analyze sensitivity to finer/coarser granularity or token-entropy sparsity.
1. **Causal necessity at inference:** Can you run **inference-time interventions** that compress the low-entropy steps while holding other decoding parameters fixed, and report accuracy relative to full-CoT? This would directly test whether those steps are unnecessary *causally* rather than *post hoc*.
2. **Fair baselines on identical data/compute:** Train strong baselines (e.g., token/chunk compression, rule-based CoT pruning) on the same ~130k instances with the same training budget. Do your gains persist under these controls?
3. **Difficulty-aware adaptivity:** Can the compression ratio be predicted per instance (e.g., via a learned controller that thresholds entropy or estimates a target depth \( \kappa(x) \))?
4. **Segmentation robustness & granularity:** How exactly are “steps” defined and validated? Have you analyzed token-entropy distribution and how aggregation affects pruning decisions? |
Fully AI-generated |
|
Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes step entropy as a principled, per-step information measure for CoT, shows that pruning low-entropy steps preserves accuracy while cutting tokens, and trains models to self-compress via SFT + GRPO with an explicit [SKIP] token. Theory upper-bounds each step’s information by its entropy (Lemma/Theorem 1), and experiments across DeepSeek-R1 and Qwen families show strong accuracy–efficiency trade-offs.
1. Using entropy as information proxy is clean. The paper defines step entropy by summing token entropies within a step and proves the conditional information contribution is bounded by this entropy, offering clear intuition for “skip low-entropy steps.” This is simple, transparent, and theoretically motivated.
2. Theorem 1 provides usable intuition. Bounding the information of any subset of steps by the sum of their entropies gives a direct justification for step-level pruning rather than token-level pruning. The token-vs-step ablation empirically supports this semantic unit choice.
3. Fine-tuning setup is sensible and practical. The two-stage SFT → GRPO pipeline, rewarding correctness and compression while discouraging degenerate [SKIP] flooding, is clear and leads to better compression than static pruning on hard sets (e.g., AIME 2024).
4. Empirical results are good across models/benchmarks. They show ~30–55% token reductions with minimal accuracy loss; on some tasks accuracy even improves. Cross-architecture results (DeepSeek-R1-7B/14B, Qwen3-8B) and comparisons to recent compression methods are solid.
1. Step segmentation heuristic.
Steps are segmented using simple newline heuristics. While this works reasonably, it can blur step boundaries or merge logically distinct thoughts. A sensitivity study with sentence-based or LLM-predicted segmentation would improve robustness.
2. Fixed 80% pruning threshold.
The global 80% rule is justified empirically but may not generalize across datasets or reasoning styles. An adaptive or learned κ could better reflect per-problem difficulty.
3. Unnormalized entropy may bias toward longer steps.
The paper uses total (unnormalized) entropy per step. While this matches the theoretical bound, longer steps automatically accumulate more entropy even when per-token uncertainty is low, potentially biasing the pruning policy. A length-normalized or mixed variant could help disambiguate whether information or verbosity drives retention.
4. Scope of benchmarks.
Most experiments center on math and logic tasks. Including one additional open-ended domain (commonsense, code, or writing) would broaden the evidence that entropy-guided compression generalizes.
1. Entropy normalization:
Did the authors try normalizing entropy by step length (e.g., average or log-length scaling)? If so, how did this affect correlation with information contribution?
2. Alternative to entropy-based labeling:
Instead of relying purely on entropy, have the authors tried using an LLM itself to label which steps are informational or non-trivial (e.g., “steps that advance reasoning” vs. “repetitive or obvious steps”)? Such annotations could provide a complementary supervision signal for training or validating entropy thresholds.
3. Fine-tuning stability:
During GRPO fine-tuning, how sensitive are results to the [SKIP] penalty coefficient? Does the model ever collapse to always skipping or never skipping?
4. Adaptive threshold:
Could the target entropy ratio κ be dynamically chosen based on per-question entropy distribution? |
Fully AI-generated |
|
Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces step entropy, an information-theoretic metric to quantify each reasoning step’s contribution in LLM Chain-of-Thought. By pruning up to 80% of low-entropy steps, the method reduces tokens by 16–57% with minimal accuracy loss across DeepSeek-R1 and Qwen3 models. A two-stage training framework combining SFT and GRPO further enables models to autonomously skip redundant steps via [SKIP] tokens. The approach outperforms prior CoT compression methods, offering an efficient and interpretable way to accelerate LLM reasoning.
1. The method is simple and easy to implement, requiring only entropy calculation and pruning based on low-information steps to construct the pruned CoT. It achieves strong performance, such as maintaining accuracy on Math500 even with a 30% compression ratio, demonstrating both effectiveness and efficiency.
2. The introduction of the SFT+RL framework makes the approach more practical. By allowing the model to learn when to skip redundant steps automatically, it extends the static compression method into a trainable and deployable solution.
1. The segmentation and granularity of reasoning steps are not rigorously defined. The approach relies on manually designed delimiters like \n\n, which may not generalize well across datasets or model architectures.
2. The definition of step entropy as the sum rather than the average of token entropies could bias the metric toward longer steps, potentially misrepresenting their true informativeness.
3. The presentation of Table 1 is poor; a clearer organization would make the results easier to interpret.
Please see the weaknesses. |
Moderately AI-edited |
|
Quantifying Statistical Significance in Diffusion-Based Anomaly Localization via Selective Inference |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In this work, the authors propose a statistical framework to quantify the reliability of anomaly localization methods that use diffusion models. These generative models are used in domains like medical diagnosis and industrial inspection by reconstructing a "normal-looking" version of an input image; the difference between the input and reconstruction highlights potential anomalies. However, the authors note that inherent uncertainties and biases in these models can lead to inaccurate localizations, posing a critical risk.
To address this, the paper introduces the Diffusion-based Anomaly Localization (DAL) Test, which is based on the principles of Selective Inference (SI). The key problem with standard statistical tests is that the hypothesis (i.e., the specific anomalous region) is selected using the same data that is used for testing. This "double-dipping" invalidates traditional p-values and leads to an inflated false positive (Type I error) rate. The proposed DAL-Test framework computes a valid p-value by performing the statistical test conditional on the selection event—the fact that the specific anomalous region was identified by the diffusion model. The authors formulate this as a two-sample test comparing the mean pixel values in the detected region between the test image and a reference image.
Technically, the framework conditions on a nuisance parameter, which reduces the problem to a one-dimensional search. The authors show that if the diffusion model's U-Net uses piecewise-linear activation functions (like ReLU), the entire reconstruction process is piecewise-linear. This crucial insight allows the conditional sampling distribution of the test statistic to be characterized as a truncated Gaussian distribution. The truncation intervals (the set of values that produce the same anomalous region) are identified analytically using parametric programming.
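For readers less familiar with selective inference, the final step of such a pipeline (computing a p-value from a Gaussian statistic restricted to the identified truncation intervals) can be sketched as follows; the intervals and parameters are synthetic and the function is not the authors' implementation.

```python
# Illustrative only (synthetic numbers, not the paper's algorithm): once the
# truncation region Z for the test statistic is known, the selective p-value is
# the tail mass of a Gaussian restricted to Z, evaluated at the observed value.
from scipy.stats import norm

def truncated_gaussian_pvalue(observed, intervals, mu=0.0, sigma=1.0):
    """Two-sided selective p-value for a N(mu, sigma^2) statistic truncated to
    `intervals`, a list of (low, high) pairs identified by parametric programming."""
    def mass(lo, hi):
        return norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
    total = sum(mass(lo, hi) for lo, hi in intervals)
    upper = sum(mass(max(lo, observed), hi) for lo, hi in intervals if hi > observed)
    p_upper = upper / total
    return 2.0 * min(p_upper, 1.0 - p_upper)

# Example: the selection event restricts the statistic to two disjoint intervals.
print(truncated_gaussian_pvalue(2.1, [(-0.5, 0.5), (1.8, 3.0)]))
```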
The authors validate their method on synthetic data and real-world datasets (BraTS brain tumors and MVTec industrial defects). Experiments show that the "naive" and "permutation" methods fail to control the Type I error rate, exhibiting high false positives (e.g., 0.46–0.88 on MVTec). In contrast, the proposed method successfully controls the Type I error rate at the desired significance level (e.g., $\alpha=0.05$) and demonstrates higher statistical power than the conservative Bonferroni correction. The authors conclude that their framework provides a principled way to assess the statistical reliability of diffusion-based anomaly detection.
The following are the positives of this work:
- The paper addresses a gap in the use of generative models for anomaly localization. It moves beyond merely generating reconstruction maps and provides a formal statistical framework, the DAL-Test, to quantify the risk of false positive detections by computing valid p-values.
- The authors propose using the selective inference (SI) framework for diffusion models. The key insight is the characterization of the U-Net-based reconstruction process as a piecewise-linear function, which enables the analytical derivation of the conditional sampling distribution as a truncated Gaussian.
- The experimental validation is well-conceived. It correctly focuses on demonstrating the failure of standard and permutation-based approaches to control the Type I error rate.
I have the following concerns regarding the paper:
- The entire tractability of the proposed selective inference framework, as detailed in Appendix B, revolves around U-Net and the entire reconstruction process $\mathcal{D}(X)$ being a piecewise-linear function of the input $X$. This is achieved by restricting the model to components like ReLU activations and pooling (line 703). This is a significant constraint, as many state-of-the-art diffusion models employ non-linearities such as SiLU/Swish or attention mechanisms, which are not piecewise-linear. This raises a critical question about a potential trade-off: does one have to sacrifice the generative performance of the diffusion model to gain statistical rigor? The paper does not explore this trade-off or discuss the method's applicability to non-piecewise-linear architectures.
- The method relies on parametric programming (Algorithm 2) to identify all truncation intervals $\mathcal{Z}$ along a 1D path. For a deep, piecewise-linear network, the number of linear regions (and thus potential intervals) can grow extremely large. The authors acknowledge this in the limitations ("growing the size of the diffusion model also leads to increased computational demands," line 483) but do not provide a formal complexity analysis or empirical runtime data. This makes it difficult to assess the practical feasibility of the DAL-Test for larger, higher-resolution images (e.g., $256 \times 256$ or $512 \times 512$) or deeper models, which are common in real-world applications.
- The paper formulates the statistical test as a two-sample comparison of the mean pixel value within the detected region $\mathcal{M}_x$ (Eq. 6, line 246). This test statistic (Eq. 7) is sensitive to changes in mean intensity but may lack statistical power for anomalies characterized by other features. For instance, subtle textural changes, fine scratches, or complex tissue abnormalities might not significantly alter the mean pixel value of a region but would be visually distinct. The framework's validity is not in question, but its ability to detect these more complex, non-mean-based anomalies could be limited.
Please see weakness section above |
Heavily AI-edited |
|
Quantifying Statistical Significance in Diffusion-Based Anomaly Localization via Selective Inference |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a statistical framework based on Selective Inference (SI) to quantify the significance of anomalous regions detected by diffusion-based anomaly localization methods. The authors focus on Denoising Diffusion Probabilistic Models (DDPM) as a proof of concept, applying the framework to medical diagnosis (e.g., brain MRI) and industrial inspection tasks. The core idea is to compute valid p-values that control the false positive detection rate by conditioning on the data-driven selection event induced by the diffusion model's reconstruction.
1. First to use Selective Inference in anomaly detection: The application of Selective Inference to diffusion-based anomaly localization is highly novel and addresses a critical, often overlooked problem of reliability and statistical validity in this rapidly growing field. This is particularly relevant for high-stakes applications like medical diagnosis.
2. Solid Theoretical Foundation: The paper is well-grounded in selective inference theory. The derivation of the conditional sampling distribution and the proof for the validity of the proposed p-value are sound and clearly presented in the appendix.
3. Awareness of Limitations: The authors acknowledge the computational demands of their method, which is a significant and honest point for discussion regarding practical deployment.
1. Lack of Depth in SI Applications: The section lists citations but provides only high-level categorizations (e.g., "linear model features," "complex feature selection," "unsupervised learning tasks," "deep learning models"). It does not explain specific application scenarios, such as: What problems were solved in linear models?
2. Lack of novelty: The core framework essentially extends established SI techniques—originally developed for feature selection in linear models and later adapted to deep learning and unsupervised tasks—to the context of diffusion models like DDPM, without introducing fundamentally new methodological innovations or theoretical breakthroughs
3. Insufficient Discussion of Computational Complexity: The computational burden of the parametric programming approach is drastically understated. For high-resolution images, exhaustively searching the one-dimensional line to identify all truncation intervals Z is likely computationally prohibitive. The paper provides no data on runtimes or scalability, making it impossible for the reader to assess its practical feasibility.
4. Limited and Weak Baselines:
1. The experimental comparisons are insufficient to convincingly demonstrate the method's advantage. The chosen baselines are weak: the naive and permutation methods fail to control Type I error, making power comparisons with them uninformative. While the Bonferroni correction controls error rates, it is a notoriously conservative baseline, and outperforming it does not sufficiently prove the method's power.
2. The evaluation is limited to only Type I error and power at the image level. It does not provide segmentation-level metrics, such as AUROC or F1-score, which are standard in anomaly detection (AD). Consequently, the paper fails to demonstrate a tangible improvement for practical AD tasks.
3. The description of the real-world datasets is unclear, omitting essential details for reproducibility, such as precise data splits and preprocessing steps.
1. Computational Scalability: Could you provide concrete data on the runtime of your algorithm (e.g., for the largest image size n=4096 or the 128x128 MVTec images)? How does the computational cost scale with image size (n) and model complexity? This is critical for assessing the method's practical utility.
2. Baseline Comparison: Would you consider adding a comparison with the Benjamini-Hochberg procedure to control the False Discovery Rate (FDR)? This is a more powerful and commonly used alternative to Bonferroni for multiple testing, and a comparison would better situate the performance of your method.
3. Parameter Sensitivity and Robustness: The anomaly region Mx is highly dependent on the threshold λ. How sensitive are the Type I error and power of the DAL-Test to the choice of λ? |
Fully AI-generated |
|
Quantifying Statistical Significance in Diffusion-Based Anomaly Localization via Selective Inference |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This study proposes a method to compute p-values for anomaly localization based on diffusion models (DAL). Because the anomalous regions are data-dependent, standard hypothesis testing suffers from inflated Type I error, so selective inference is necessary. This study therefore proposes selective inference for diffusion models.
Using the reconstruction errors of deep generative models for anomaly detection has a long history and is widely practiced. However, it has been difficult to assess what constitutes a meaningful difference. The proposed method addresses this question in a rigorous way.
There are many studies on selective inference and p-values, but this study addresses a highly unique setting, which gives it strong originality. The extensions needed to apply the framework to diffusion models are non-trivial and represent a substantial contribution.
This study assumes a U-Net with ReLU. It may not apply to architectures that use other activation functions or layers. For example, more modern diffusion models heavily use attention layers and normalization layers. Can this method still be applied?
In the context of unsupervised anomaly segmentation, reconstructions by diffusion models are not necessarily the default, and for strong performance, feedforward methods appear more powerful and more common (for example, see the leaderboard of the VAND 3.0 Challenge at CVPR). In this sense, the practical utility of the proposed method may be limited.
See Weaknesses. |
Fully human-written |
|
Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This work investigates why benign relearning happens when using gradient-based heuristics for LLM unlearning. Departing from prior explanations, the authors observe that syntactic similarity, rather than topical relevance, is the primary driver of why relearning works. The paper analyzes the BLUR dataset and shows that the relearn success rate is mainly due to syntactic similarity, evaluated via the cosine similarity between the relearn set and the target set. The paper further proposes a new way of unlearning that uses GPT to rewrite the forget set so that it is syntactically different from the target set, which effectively limits the power of relearning.
- The paper is well written.
- The paper provides a new explanation of the success of relearning in the context of LLM unlearning, which is important for understanding in this community.
- One thing that has been significantly overlooked in this paper is **between which two sets** syntactic similarity should be measured. There are three sets: the unlearning set $D_{forget}$, the eval set $D_{target}$, and the relearn set $D_{relearn}$. Section 5 characterizes the syntactic similarity between $D_{target}$ and $D_{relearn}$, but there are other dimensions: the syntactic similarity between $D_{forget}$ and $D_{relearn}$, and between $D_{forget}$ and $D_{target}$. In the TOFU case, an implicit assumption is that $D_{target}$ and $D_{forget}$ highly overlap. Therefore, $D_{relearn}^{syntactic}$ is syntactically different enough from both $D_{target}$ and $D_{forget}$. However, such an assumption might not hold for, e.g., WMDP, where $D_{forget}$ consists of PubMed articles and $D_{target}$ consists of expert-drafted questions, not necessarily about the verbatim content of the articles themselves. What is missing from the current analysis is **whether the success of relearning is due to the syntactic similarity between $D_{target}$ and $D_{relearn}$, between $D_{forget}$ and $D_{relearn}$, or something more complex among all three sets**.
- It is also important to make a clear separation between knowledge unlearning (such as WMDP, where $D_{target}$ and $D_{forget}$ can usually be very syntactically different) and verbatim unlearning (such as TOFU, where $D_{target}$ and $D_{forget}$ are potentially closer; here, also try cases where you paraphrase $D_{target}$ itself), and to investigate the above question under both cases.
- The robust unlearning part is less convincing to me. In practice, designing the unlearn set given the target set is unfair. The idea should be that a defender builds an unlearn set to defend against the model outputting sensitive information, not against a set of fixed queries. Moreover, one can always rewrite the target set with GPT as well.
- Figure 8 is also less convincing. For TOFU, the original relearn set is a subset of the unlearn set. If the forget set changes, shouldn't the relearn set also change? Otherwise, this is not an apples-to-apples comparison, as the adversary in this case has less information.
- Have the authors explored a systematic analysis on knowledge unlearning tasks such as WMDP?
- I like this paper and think it provides good observations. The most important thing is a clearer and more systematic investigation of weakness 1 above; that is the major issue I think the authors should address. The defense part in its current form is also not convincing, as it is too rough and does not consider a stronger adversary. **I am willing to raise my score if both points have been further explored and explained in the rebuttal.** |
Fully human-written |
|
Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies why unlearning comes back after benign fine-tuning (“benign relearning”) in LLMs. Contrary to the common belief that topic overlap (e.g., Harry-Potter-related text) is the main culprit, the authors argue and empirically show that syntactic similarity (shared surface forms/templates) is the real driver: fine-tuning on sentences with the same structure (even about different entities) reactivates the forgotten content. They formalize and test this on BLUR/TOFU-style setups, diagnose evaluation confounds in prior relearning studies, and propose a simple mitigation, syntactic diversification (paraphrasing forget queries into varied structures before unlearning). This diversification both reduces relearning, speeds up forgetting, and improves utility trade-offs.
1. The paper shows syntactic similarity (in the query), not topicality, is the consistent relearning driver across methods (GA, NPO, SCRUB) and datasets. The paper also identifies evaluation confounds (dataset size -> step budget i.e., non-monotonic training trajectories) that can make topicality look stronger than it is, then re-evaluates with a step-standardized protocol. This corrects the narrative and is a valuable insight for the community.
2. The analysis provided with the heatmaps and the relearning-vs-unlearning-steps plots shows that syntactically similar relearn sets recover forgotten names more strongly than topically relevant ones (measured via keyword presence / ROUGE-L to base). Representation- and gradient-similarity analyses support the mechanism: overall, syntactic overlap aligns hidden states and gradients with the target set after unlearning, explaining recovery.
3. Syntactic diversification (multi-paraphrase forget queries via GPT-4o) reduces template rigidity, balances loss between templates and keywords, delays/attenuates relearning, requires fewer unlearning steps, and improves utility (e.g., Real Authors, World Facts and Retain split). That’s a rare win-win (better robustness with less utility damage).
1. The approach relies on GPT-4o paraphrasing. What is the cost/latency at unlearning time for large forget sets, and does quality vary by domain or language? A scaling/cost analysis and a cheaper in-house paraphrase baseline would help adoption.
2. Keyword-based relearn success rate captures name reappearance but may miss partial leakage or paraphrastic leakage. Similarly, ROUGE-L to base captures surface similarity but not factual equivalence. Including embedding-based and judge-LM evaluations would strengthen claims.
3. The overall message in the paper (templates get suppressed more than keywords during unlearning) is compelling, but stronger causal tests (e.g., controlled template injection/removal, counterfactual templates, layerwise intervention) would be needed to further substantiate this analysis.
1. Can you report results with richer syntactic similarity measures (e.g., tree edit distance, template mining) and show which correlates best with relearning?
2. Can you add judge-LM / embedding-based leakage metrics to complement keyword/ROUGE and analyze disagreements? If not, can you explain why such an analysis could not be performed? |
Heavily AI-edited |
|
Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper delves into a key and perplexing phenomenon in the field of LLM unlearning—benign relearning.
The core argument of this paper is that the primary driver of benign relearning is not, as previously thought, topical relevance, but rather syntactic similarity. The authors first use rigorous experimental design to identify confounding variables in the evaluation methods of previous benchmarks and then correct these experiments, finding that the role of topic relevance is overestimated.
To verify their core hypothesis, the authors construct two relearning datasets: "topically related but syntactically different" and "topically unrelated but syntactically similar." Experimental results show that syntactically similar data triggers information retrieval more effectively. The paper further analyzes the underlying mechanism from the perspectives of representation and gradients, revealing that syntactically similar data and the forgetting target are highly aligned within the model, and that the standard forgetting process suppresses "templates" rather than "keywords," thus leaving structural "backdoors."
Based on this finding, the paper proposes a simple yet effective solution—syntactic diversification. This method enriches the syntactic structure of the forgetting set through paraphrasing before forgetting. Experiments demonstrate that this method not only effectively inhibits benign relearning but also accelerates the forgetting process and significantly reduces the impairment to the model's generalizability.
1. The problem addressed is clear and significant. The authors challenge mainstream understanding in the field (topical-relevance-driven relearning) and propose a novel and insightful perspective (syntactic-similarity-driven), which is crucial for understanding the failure of forgetting mechanisms.
2. The experimental design is rigorous and comprehensive, ensuring a fair evaluation. Evaluation confounding in BLUR is identified and eliminated, the number of steps is standardized, and the optimal result is chosen, making the conclusions more reliable.
3. The defense methods are simple and practical. Syntactic diversification requires no modification to the optimizer or model structure, yielding significant results (slower and weaker relapses, fewer forgetting steps, and better utility), and demonstrating robustness.
4. The writing is clear and logically coherent: the paper's structure is clear and easy for readers to understand.
1. Limitations of the syntactic similarity metric. The paper uses normalized Levenshtein distance as the measure of syntactic similarity. While this is a simple and effective character-level metric (see the sketch after this list), it may fail to capture more abstract, deeper syntactic structure (such as the structural similarity of parse trees). I suggest exploring more sophisticated syntactic analysis tools to measure syntactic similarity, which could reveal more subtle mechanisms.
2. Cost of the proposed solution. The proposed solution relies on the robust GPT-4o model to generate syntactic variants. It is suggested to conduct a short ablation experiment to explore the effects of using smaller, more cost-effective open-source models (such as Llama-3-8B itself) or simpler methods for syntactic diversification.
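As referenced in weakness 1, a minimal version of the character-level metric in question is sketched below (my own implementation; the paper's exact normalization may differ). It illustrates why the metric captures template overlap but not deeper syntactic structure.

```python
# Sketch of a normalized Levenshtein similarity (my own implementation, not the
# paper's). Character-level edit distance treats two questions sharing a template
# but describing different entities as near-identical, while a parse-tree metric
# could separate template overlap from content overlap.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def normalized_similarity(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Hypothetical template-sharing queries about different (invented) entities.
print(normalized_similarity("Who is the author of Watermelon on the Moon?",
                            "Who is the author of Dreams of the Deep Sea?"))
```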
see weakness |
Lightly AI-edited |
|
A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper presents ARGOS, a neuro-symbolic framework designed to solve logic problems that require abductive reasoning—the ability to infer missing commonsense information. It addresses a well-known gap in existing systems: while symbolic solvers are rigorous, they are brittle and require a complete set of premises, whereas Large Language Models (LLMs) possess vast commonsense knowledge but often fail at complex proof planning.
- Clarity: The paper is exceptionally well-written and easy to follow.
- Quality: The experiment is executed to a high standard, both methodologically and empirically.
- Durability of the Problem Statement Against Frontier Models: The paper's motivation hinges on the inability of LLMs to perform abductive reasoning. I find that SOTA thinking models like Gemini-2.5-pro can solve the paper's motivating "winter fox" example directly via chain-of-thought. This raises the question of whether the proposed method addresses a fundamental limitation or a capability gap in a specific class of models that may soon be obsolete.
- Worst-Case Complexity: The paper reports an average cost of 18.4 COT calls (Table 3), which is reasonable. However, the worst-case cost is unbounded in theory and in practice determined by the number of iterations allowed. For very hard problems requiring many abduction steps, the cost could become prohibitive, as each iteration involves multiple LLM calls (generation, commonsense scoring, relevance scoring) and solver calls. A discussion of the distribution of costs, not just the average, and the performance/cost trade-off would be valuable.
- Inability to Express More Complex Rules: Real-world commonsense often takes more complex forms. The current llm_generate prompt structure seems hard-coded for the two-antecedent form. The paper would be more complete if it acknowledged this limitation and discussed potential extensions.
Minor suggestion on paper structure: Section 4 ("PROBLEM STATEMENT") is very concise and could be integrated into the end of Section 3 ("BACKGROUND") to improve the paper's narrative flow. |
Lightly AI-edited |
|
A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the problem of reasoning with missing commonsense information. It proposes a method in which an LLM iteratively provides the missing information in the form $L_1 \land L_2 \rightarrow L_3$, where $L_1$, $L_2$, and $L_3$ are all literals and $L_1$ and $L_2$ are already deducible from the current premises. The paper experiments with 3 logical datasets and 3 less logic-oriented datasets. Experimental results demonstrate that the proposed method outperforms existing neural and neuro-symbolic methods.
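As I understand the pipeline from this summary, the interaction between the LLM and the solver can be sketched roughly as follows; the function names, scoring, and clause representation are placeholders of mine, not the authors' implementation.

```python
# Rough sketch of the iterative loop as I understand it from the paper's
# description (placeholder callables and clause encoding; not the authors'
# implementation). The solver is tried first; if neither Q nor its negation is
# entailed, the LLM proposes commonsense rules L1 & L2 -> L3 over literals
# already deducible from the current premises, and the best-scored rule is added.
def abductive_solve(premises, query, solver_entails, deducible_literals,
                    llm_propose_rules, max_iters=10):
    for _ in range(max_iters):
        if solver_entails(premises, query):
            return "entailed"
        if solver_entails(premises, f"~({query})"):
            return "contradicted"
        backbone = deducible_literals(premises)      # literals the solver already derives
        scored = llm_propose_rules(backbone)         # [(score, "L1 & L2 -> L3"), ...]
        if not scored:
            return "unknown"
        premises = premises + [max(scored)[1]]       # add the best-scored rule and retry
    return "unknown"
```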
1. The paper addresses the important problem of reasoning with missing commonsense information.
2. The paper proposes a simple but potentially effective method to abduce and reason with the missing information. In contrast to neuro-symbolic methods based on auto-formalization, the method relies on a more involved interaction between neural and symbolic components.
3. Experimental results demonstrate the viability of the proposed method.
1. The paper states that it is dealing with abductive propositional logic problems (sec 4. Problem statement). But I believe the reasoning problem is first-order. Especially, the used dataset FOLIO is a typical dataset for natural language reasoning with first-order logic. The paper does not specify which SAT solver it is using.
2. Some use of logical notions in the paper is improper. For example, Line 141: “Propositional logic is a logical system that involves propositions about variables”. This is not a proper introduction of propositional logic. Line 91: “contain variables not previously mentioned in the input problem“. I had difficulty understanding this at first, but later realized it actually means “contain propositions”.
3. Many logical notations used in the paper are problematic. I only give some examples here. The logic formula in Line 187 is incorrectly written. The formula in Line 214 is confusing. Proposition 1 in Appendix A is not well-stated.
4. From Sec 6.2, it seems that the work of the paper is founded on perfect logical translation. On the one hand, auto-formalization is still a challenging topic. On the other hand, when the paper presents the performance of SAT-LM, I assume it is based on auto-formalization, in which case the comparison might be unfair.
Does the paper deal with propositional or first-order reasoning? Which SAT solver is used? |
Fully human-written |
|
A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes a new neuro-symbolic approach for enhancing the logical commonsense reasoning capabilities of LLMs and suggests a new way of integrating LLMs with logic solvers. In the proposed approach, the LLM iteratively provides unstated commonsense clauses to a logic solver, guided by feedback from the solver in the form of the SAT problem backbone. This approach allows the system to perform abductive reasoning, filling in missing background facts while keeping the search tractable. Overall, this work aims to leverage the benefits of existing neural and symbolic methods to tackle commonsense logical reasoning problems.
1- The paper is easy to follow and is well-presented (modulo some issues that I point out in the weaknesses). The worked example presented in Section 5.2 and the methodology overview in Figure 3 facilitate understanding of the work.
2- The proposed methodology is novel and insightful. I think the general idea of the work in providing a new paradigm of interaction between an LLM and a symbolic solver is interesting. Existing approaches either initiate reasoning from the LLM and delegate theorem proving to a solver, mimic inference rules using an LLM, or propose methodologies to leverage the LLM to undergo a rigorous reasoning process while leveraging its commonsense. This paper introduces interactions between the LLM and the solver which I find novel and interesting.
3- The topic of focus, commonsense logical reasoning with LLMs, is quite an important topic with numerous practical applications. I think the idea of proposing novel frameworks for LLM interaction with formal reasoners can be impactful by reducing the reasoning errors of LLMs, but the paper's evaluation needs to be strengthened to validate this effect more properly.
1- There are several statements in Section 3 which I think are vague, inaccurate, or wrong:
- Line 141: “Propositional logic is a logical system that involves propositions about variables.”
Propositional logic does *not* involve “propositions about variables.” It is a logic of propositions themselves, and variables are symbols for propositions, not objects that propositions are “about.”
- “A proposition, such as A → ¬B, is some statement about literals”
Propositions aren’t “about literals”. They’re built from literals (or propositional variables) using logical connectives.
- “We assume that ¬(P ∧ C → ⊥), that is that the premises P are not contradictory with commonsense.”
The correct way to show consistency is “P∧C⊬⊥”.
- “A predicate is a function, such as MotherOf (x, y)”
A predicate represents a property or relation that can be true or false depending on its arguments, whereas an FOL function points to a particular object in the domain as its output. For example, MotherOf (x, y) is a predicate which can be true or false, but MotherOf (x) is a function that returns a specific object y, i.e., MotherOf (x) = y.
- “∀(x, y)MotherOf (x, y) → ¬Male(x)”
This is a very unusual syntax. In standard FOL, you either write ∀x ∀y (MotherOf(x, y) → ¬Male(x)) or ∀x ∀y [MotherOf(x, y) → ¬Male(x)].
- Line 190: “First-order logic problems…”
I encourage the authors to use the conventional terms “grounding” or “instantiation”.
- Line 214: we first try to solve the problem using the SAT Solver (sat_solve) to test whether (P ∧ C) = P ⊢ Q or ¬Q
What does (P ∧ C) = P ⊢ Q or ¬Q mean? I think you’re just trying to test whether (P ∧ C) ⊢ Q or (P ∧ C) ⊢ ¬Q (a short solver sketch at the end of the weaknesses illustrates the reading I have in mind).
2- The use of self-consistency as one of the ways ARGOS can arrive at the final answer is questionable. At the end of the day, self-consistency relies on the LLM to do the reasoning, but the reason people combine symbolic theorem proving with LLM reasoning in the first place is that LLMs alone may make reasoning errors or generate hallucinated answers.
The experiments section only reports the accuracy of the final answers, whereas in LLM reasoning works such as [1], the correctness of the reasoning process is also critical. Specifically, I think this metric would be insightful for checking whether the correctness of the reasoning process behind answers provided by ARGOS's self-consistency mechanism is also improved. This would be a useful complement to the results in Figure 5.
3- There are some typos in the text such as line 81.
[1] Kazemi, Mehran, et al. "LAMBADA: Backward Chaining for Automated Reasoning in Natural Language." Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
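To make the reading I propose in my notation comment above concrete, here is a minimal sketch of the entailment test I assume is intended. It uses Z3 purely for illustration (the paper does not state which solver is used, which I also ask about below), and the propositions A, B, Q and the premise sets are made up for the example:

```python
from z3 import Bool, Implies, Not, Solver, unsat

def entails(premises, query):
    # premises ⊢ query  iff  premises ∧ ¬query is unsatisfiable
    s = Solver()
    s.add(*premises)
    s.add(Not(query))
    return s.check() == unsat

# Toy instance: P = {A, A -> B}, C = {B -> Q}.
A, B, Q = Bool("A"), Bool("B"), Bool("Q")
P = [A, Implies(A, B)]
C = [Implies(B, Q)]

print(entails(P + C, Q))       # True:  (P ∧ C) ⊢ Q
print(entails(P + C, Not(Q)))  # False: (P ∧ C) ⊬ ¬Q
```

If this matches the intended semantics of sat_solve, stating it explicitly (and naming the solver) would resolve the ambiguity in Line 214.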
1- In Line 228, the expression rankB(L) = #{L′ ∈ B | L′ has an entity in common with L} is written. Is this intended to be score(L)? Why would the rank of each literal be the number of existing literals that share the same variable? Regardless, the rationale for this approach is unclear to me. The only explanation provided is “which gives a measure of relevance of the literal to the problem” which is vague. I understand space limitations in the main paper, but I strongly suggest you explain the rationale near algorithm 2 in the appendix.
2- The methodology proposed in this work for leveraging the LLM’s commonsense knowledge is restricted to generating literals that can be deduced from the existing literals in the backbone. While this approach is in agreement with the way existing logical reasoning datasets are formed, I don’t think it is general enough for all practical applications. For example, a commonsense rule can be generated using only one literal from the backbone (e.g., ∀x Car(x) → Vehicle(x)), so why are pairs of literals necessary in the proposed approach?
3- Aside from being limited, I think the commonsense rule generation process in this work is also inefficient. Every pair of literals is presented to the LLM, and the scoring mechanism explained in appendix D4 is used to filter irrelevant ones. Some works cited in the paper such as LAMBADA and LLM TRes take a goal-driven approach to only generate rules that can contribute to solving the problem. Why isn’t a similar approach taken in ARGOS?
4- In Appendix D3, why are the proposition-scoring approaches different across datasets? I suggest a clarifying sentence to explain the rationale. I also appreciate the running example in Section 5.2, which facilitates understanding.
5- In line 202, it’s stated that: “Four annotated examples are provided, intended for few-shot prompting.“
How are these few-shot examples chosen? Do they differ per-dataset? Are they chosen in a way that there is no risk of revealing the answer to the LLM?
6- What is the rationale for reducing γ at each iteration? By reducing γ, you are making the method more lenient, accepting answers even if there is less consistency in LLM generations. Doesn’t this approach reduce the rigor of reasoning as the algorithm proceeds?
7- Self-consistency is a key component of ARGOS and in fact one of the ways by which ARGOS generates its answers. As I mentioned in my earlier comments, using the LLM to generate the final answer might be sub-optimal, potentially generating hallucinated answers. Figure 5 nicely provides insight about how ARGOS improves accuracy of self-consistency responses, but I think the paper’s analysis also requires reporting the accuracy of reasons for all methods, at least for one dataset. Also, I think a study in which the self-consistency module of ARGOS is ablated should be provided, at least for one dataset.
8- Regarding RQ1, two mechanisms are used in ARGOS for scoring in the thresholding calculation. Are they both necessary? An additional ablation would be helpful. I also appreciate the honest discussion of limitations in RQ2.
9- I find the discussion in Section 6.2 questionable. Logical translation is a core part of ARGOS, not an orthogonal one, and assuming that a correct propositional translation is available in real applications is a very big assumption to make. I think experiments on at least one dataset without filtering out the translation failures are needed to shed light on how critical this step is to the framework. Using a more powerful LLM than the ones used for generating the commonsense rules is acceptable if it is a major bottleneck, but an experiment using the same LLM that ARGOS uses for reasoning would also be quite beneficial. A proposed method isn't required to beat all baselines on all tasks, but the reader must know the strengths and limitations of the proposed method. |
Fully human-written |
|
A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a neuro-symbolic framework called ARGOS to improve commonsense reasoning. This framework addresses the inability of logic solvers to handle missing commonsense facts by using an LLM to iteratively provide new commonsense propositions. An interesting contribution is the use of feedback from the symbolic SAT solver itself to guide the LLM's search for relevant facts. This allows ARGOS to search a larger space of potential facts, including new variables not present in the original problem. The framework also uses the LLM to score the generated facts for commonsense and relevance before adding them. The authors show that this approach improves performance on three abductive reasoning datasets.
- The proposed method provides an intuitive framework for combining the strengths of symbolic solvers and LLMs.
- The use of the SAT solver's backbone to guide the generation of new commonsense facts is novel.
- The empirical results are strong and show consistent improvements over existing neural and symbolic baselines on 3 datasets.
- The ablation studies show the value of the two main contributions, i.e., backbone-guided search and score-based thresholding.
- The tasks/datasets used are not practically relevant and lack real-world applicability. In addition, the paper relies on modified versions of existing datasets (ProntoQA, CLUTRR, FOLIO) to create an abductive setting, which means the evaluation is on a somewhat artificial task.
- The method requires logit-level access to score generated clauses for commonsense and relevance, which may not always be accessible for closed-source models.
- The main experiments assume a perfect logical translation from text, as failed translations were filtered out. But this ignores the issue of imperfect translation, which could be a bottleneck for this method and neuro-symbolic systems in general.
- The method relies on an LLM itself to score its own generated clauses for commonsense and relevance. The reliability of this LLM-as-a-judge component is not validated against human-annotated scores.
- There is a lack of examples accompanying the error analysis in the paper, showing failure cases of ARGOS.
- The paper does not report the latency of all approaches compared in the main results. This is important because it would seem to me that ARGOS likely takes much higher computation time.
- Since ARGOS depends on the LLM's ability to reliably score commonsense and relevance, did you do any human analysis to verify that the LLM-generated scores are calibrated & accurate?
- How often does ARGOS introduce an unseen variable that is important for solving the problem? |
Fully human-written |
|
Autonomous Urban Region Representation with LLM-informed Reinforcement Learning |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes an urban representation learning method named SubUrban aiming to reduce human efforts in feature selection and engineering. The core idea is to represent an urban region with a set of POIs within or near the region. A reinforcement learning-based approach is presented to automatically learn the representative POIs for each region (and hence the representation/embeddings). LLMs are applied to help pre-select a subset of the POIs within a region, and to guide the optimization of POI category weighting. Experimental results using data from four cities (Beijing, Shanghai, Singapore, and NYC) across three downstream tasks (Population density, house price, and GDP density prediction) showed the effectiveness of the proposed method.
1. The proposed method uses POI data only and helps avoid manual feature selection.
2. Datasets from different cities (and countries) and different downstream tasks showed the effectiveness of the proposed method.
3. Source code has been made available.
1. Motivation:
- The motivation of using POIs within a region and its $\delta$-neighborhood to represent the region needs further discussion and justification. Also, how is $\delta$ determined?
- Using LLMs to generate keywords for each region to serve as POI filters seems quite restrictive (especially for less known/small regions). The LLM prompt template shown in Appendix C.1 treats each borough of NYC as a region which does not match the number of regions in NYC as shown in Table 1. It is unclear how exactly the POI pre-selection prompt is designed for each city or region. Both the motivation and implementation need further discussion.
2. Technical details:
- More details are needed on how k-means is applied to prune the POIs and why this helps "regulate spatial density and ensure more uniform coverage across the regions".
- What are $q_c$ and $C$ in Equation 3?
- Where do the candidate $p_i$'s in Equation 4 come from?
- What does the prompt look like for the LLM-instructed CEM tuning process?
3. Experiments:
- The choice of baselines in the experiments needs further justification. Only two baselines are on urban representation learning. More baselines are needed:
Li et al. Urban region representation learning with OpenStreetMap building footprints. In KDD 2023.
Yan et al. UrbanCLIP: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web. In WWW 2024.
Jin et al. Urban region pre-training and prompting: A graph-based approach. In KDD 2025.
Hao et al. UrbanVLP: Multi-granularity vision-language pretraining for urban socioeconomic indicator prediction. In AAAI 2025.
While these methods may use more features, a POI-only method that still shows substantial performance gaps against them may not fully justify the advantage of the proposed solution.
- The population density prediction results reported in Table 2 for CityFM are close to those in Table 7 of the CityFM paper for Singapore but quite different for NYC. Clarification is needed.
- How are the LLMs and prompts chosen for the implementation? How do their choices impact overall model performance?
- It is also a bit odd to use Random Forest as the downstream task prediction model given that the downstream tasks are regression tasks.
See the Weaknesses section. |
Fully human-written |
|
Autonomous Urban Region Representation with LLM-informed Reinforcement Learning |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 0:
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
**IMPORTANT: The anonymous code repo was last updated on Oct. 1, 2025, which is a few days after the paper and supplementary material submission deadline. I believe this violates the ICLR code of conduct and should lead to a desk rejection. The review below is only for reference.**
This paper proposes SubUrban, an RL-based framework for urban region representation learning, aiming to reduce reliance on manual feature engineering and city-specific heuristics. The proposed approach includes 1) LLM-guided POI preprocessing to filter redundant or low-value urban features, 2) a submodular-aware hypernode expansion mechanism to adaptively construct expressive regional representations, and 3) an LLM-instructed CEM optimization strategy to calibrate category-wise attention weights. Experiments conducted on four cities (Beijing, Shanghai, Singapore, NYC) and three prediction tasks (population, house price, GDP density) demonstrate improved performance and robustness.
1. Overall, the paper is well-organized, and the motivation is clearly stated from the standpoint of reducing human-designed heuristics.
2. Experimental results are extensive, involving multiple cities and tasks, and the reported data efficiency improvements seem to be promising.
1. While the model claims to reduce heuristic dependency, involving an LLM naturally introduces a new form of heuristic (e.g., manually designed prompt templates, assumed semantic priors of regions). Clear clarification is needed to demonstrate how stable or reproducible these LLM-based components are across different language models or prompt variations.
2. The description of hypernode expansion (soft/hard selection alternation) is technically detailed, but the intuitions behind key parameters are relatively under-explained. It is generally more important to explain "why" rather than simply introducing "how". Besides, it is unclear how sensitive the performance is to these hyperparameters.
3. Although experiments are conducted on four cities, the source and partitioning methods differ (GADM vs. OSM vs. NYC planning). It could be helpful to study and clarify whether these differences influence evaluation comparability.
Please refer to the weakness section for my questions. |
Fully human-written |
|
Autonomous Urban Region Representation with LLM-informed Reinforcement Learning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a self-supervised learning paradigm based on submodular functions and reinforcement learning, which models POI selection as a sequential decision-making process. By defining states such as Coverage, Saturation, and Buffer, and incorporating reward signals that combine downstream task performance with improvements in local states, it autonomously learns an expansion strategy, thereby reducing reliance on manual feature engineering and heuristic design. It introduces LLMs to provide semantic guidance in the urban domain, including generating representative keywords during the preprocessing stage to filter the initial POI candidate set, and guiding the Cross-Entropy Method during the optimization stage to adjust the attention weights for POI categories, consequently accelerating convergence and enhancing cross-city transferability. Experiments across multiple cities and downstream tasks demonstrate that SubUrban outperforms existing state-of-the-art methods using only 10% of the data, exhibiting exceptional data efficiency, robustness across cities and tasks, and interpretability.
1. This paper innovatively combines submodular functions with the sequential decision-making capability of RL for autonomous construction of urban hypernodes. This approach offers a novel and automated perspective to address the long-standing pain points in urban computing that rely on manual heuristics and city-specific tuning. The utilization of LLMs to inject domain knowledge for guiding data selection and optimization processes is also an interesting methodology.
2. The proposed framework in this paper demonstrates high practical value, as it can significantly reduce the costs associated with data processing and model tuning for urban AI applications, while enhancing the model's generalization capability across cities with varying data distributions. This is crucial for the scalable deployment of smart city applications.
1. The study primarily compares POI-encoding-based representation learning methods, which is reasonable given its core focus on processing POIs. However, incorporating some powerful multimodal fusion methods (such as UrbanCLIP or UrbanVLM) that also generate high-quality regional representations as baselines, or comparing/combining SubUrban's learned representations with those from such models, could yield more compelling evidence.
2. The entire system integrates multiple complex components including RL, submodular rewards, LLM preprocessing, and LLM-instructed CEM. Although ablation studies were conducted, it remains unclear, for instance, to what extent the LLM contributes. How much would performance degrade if the LLM were replaced with a simple statistics-based method (e.g., using information gain) to generate initial keywords? Such analysis would help determine whether the LLM truly provides irreplaceable semantic understanding or merely offers a decent initialization.
1. The paper designs multiple reward signals. In practical training, how are these reward terms (e.g., $R_{GAT}$ , $R_{MHA}$ , $R_{buf}$ ) balanced during optimization to prevent any single component from dominating the entire training process?
2. The case study in Section D.5 is insightful. Could you briefly comment on whether the expansion strategies learned by SubUrban demonstrate consistent and interpretable patterns across the multiple regions you observed? For instance, are there systematic differences in the focused POI categories and spatial expansion patterns for different functional area types, such as residential versus commercial zones?
3. Are complete results for House Price and GDP Density prediction available for Singapore and NYC? |
Fully human-written |
|
Autonomous Urban Region Representation with LLM-informed Reinforcement Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents SubUrban, a framework that combines submodular rewards with reinforcement learning and incorporates LLM-based semantic guidance in preprocessing and parameter search. The system first uses LLM-generated keywords and clustering to semantically pre-filter large POI sets, then treats the filtered POIs (by category) as actions and defines a reward balancing coverage, saturation, and buffer to train a modular policy that selects the most informative POIs under a budget to expand hypernodes. Category-weight search is accelerated with an LLM-guided Cross-Entropy Method (CEM). The authors evaluate on Beijing, Shanghai, Singapore, and New York across downstream regression tasks (population density, house price, GDP), including sparse-data settings (e.g., using only 10% of POIs), and report that SubUrban outperforms several strong baselines in many settings while offering data efficiency and interpretability; implementation details and appendices are provided and the authors commit to open-sourcing the code.
1. The idea is clear and practical: using submodularity to model diminishing marginal returns of POI selection and learning policies under budget constraints via RL is intuitive and engineering-ready.
2. The LLM-in-the-loop engineering attempt is valuable: using LLMs for semantic prefiltering and to guide CEM reduces manual heuristics and has practical appeal.
3. Broad empirical coverage: comparisons and ablations across four cities, several regression tasks, and sparse-data scenarios (e.g., 10% POIs) demonstrate applicability in varied settings.
4. Interpretability and intuitive design: the Coverage/Saturation/Buffer components help explain why certain POIs are selected, aiding qualitative analysis and visualization.
1. Different baselines in the paper use embeddings of varying dimensionalities (e.g., BERT 768, OpenAI 1536, HGI 64, CityFM 1024), which can significantly influence downstream Random Forest performance and lead to unfair comparisons.
2. The study relies solely on Random Forest (with a 4:1 train–test split) as the downstream evaluator, without demonstrating results from stronger or more diverse supervised learners (e.g., MLP, GBDT/XGBoost, or a linear regression baseline). This may overestimate or underestimate the embedding quality.
3. Although LLM prompts and templates are provided in Appendix C.1/C.2, the main text omits crucial operational statistics—such as the number of LLM calls, average query size, total token cost, and whether any manual filtering of outputs was performed.
4. While the paper frequently refers to “submodular gains” and “marginal utilities” to motivate its selection strategy, it does not provide a formal proof or sufficient conditions showing that the designed reward function or policy is truly submodular. If submodularity does not hold, the approximation guarantees of the greedy policy become invalid.
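To make the request in weakness 4 more actionable: even if a formal proof is out of reach, a cheap empirical check of (approximate) diminishing returns could accompany the paper. The sketch below is only illustrative; `f` is a stand-in set function (here a toy coverage reward over made-up POI categories), not the paper's actual reward.

```python
import random

def submodularity_violation_rate(f, ground_set, trials=1000, tol=1e-9):
    """Estimate how often diminishing returns is violated:
    for random S ⊆ T and x ∉ T, check f(S∪{x}) - f(S) >= f(T∪{x}) - f(T)."""
    items = list(ground_set)
    violations = 0
    for _ in range(trials):
        T = set(random.sample(items, random.randint(1, len(items) - 1)))
        S = set(random.sample(list(T), random.randint(0, len(T) - 1)))
        x = random.choice([i for i in items if i not in T])
        gain_small = f(S | {x}) - f(S)   # marginal gain on the smaller set
        gain_large = f(T | {x}) - f(T)   # marginal gain on the larger set
        if gain_small + tol < gain_large:
            violations += 1
    return violations / trials

# Toy coverage-style reward over POI categories (truly submodular).
covers = {1: {"food"}, 2: {"food", "retail"}, 3: {"transit"}, 4: {"retail", "park"}}
f = lambda S: len(set().union(*(covers[i] for i in S))) if S else 0
print(submodularity_violation_rate(f, covers))  # expected 0.0 for a submodular f
```

Reporting such a violation rate (or marginal-gain curves) for the learned reward on a few representative regions would go a long way toward justifying the greedy/submodular framing.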
1. The appendix indicates substantial differences in embedding dimensionality across baselines (e.g., BERT 768, OpenAI 1536, HGI 64, CityFM 1024), which could affect the fairness of downstream comparisons. Have the authors attempted to unify or project these embeddings to a common dimension? If not, please consider adding unified-dimension experiments or a sensitivity ablation, and specify this in the tables.
2. The current experiments primarily rely on Random Forest as the downstream evaluator. To provide a more comprehensive assessment of embedding quality, it would be valuable to include results from other evaluators (e.g., linear regression, MLP, XGBoost/LightGBM, or end-to-end fine-tuning) and indicate whether the main conclusions hold consistently across these setups.
3. The paper mentions using different LLMs at various stages (e.g., prefiltering and CEM optimization), yet the corresponding statistics remain somewhat abstract. It would be helpful if the authors could provide a systematic summary of the LLM models/versions used at each stage, along with call counts, average tokens, total runtime, and estimated cost. Including ablation results such as no-LLM / small-LLM / GPT-4 in the appendix would further clarify the performance–cost trade-off introduced by LLM integration.
4. The discussion of submodularity in the reward function is mostly intuitive and lacks explicit theoretical assumptions or validation. Under what conditions can the reward be guaranteed to be submodular? If a formal proof is challenging, please consider providing marginal-gain curves or statistics for representative regions to demonstrate approximate submodularity. |
Moderately AI-edited |
|
Implicit Regularization Through Hidden Diversity in Neural Networks |
Soundness: 4: excellent
Presentation: 1: poor
Contribution: 3: good
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
- The paper proposes writing a single neural network with a dense final layer as a weighted sum of subpredictors, where the weights are the rows of the logit layer scaled by a scalar multiplier that depends on the initialization scheme.
- It applies Wood et al.'s decomposition to these subpredictors, viewing them as an ensemble of learners that exhibit a "diversity" term.
- The experiments suggest that, if I understood correctly, the risk of a single neural network can be estimated by computing the bias, variance, and diversity terms on the subpredictors.
If I understood correctly, this work presents a novel application of Wood et al.'s decomposition to a single neural network, offering interesting insights into the generalization properties of deep nets. This is particularly relevant given the widespread use of deep learning models and the lack of consensus on the mechanisms underlying good generalization. It provides an additional perspective on this important topic, using quantities that are simple to compute.
- The paper appears to be a shortened version of a journal article, which may be better suited for a journal submission. For an ICLR submission, more careful editorial choice could be applied to decide what content belongs in the main body versus the appendix.
- The main body of the paper is not fully self-contained, with missing details. For example, it is not clearly stated what is plotted in Figure 3, which I believe is the primary experimental result. As a result, I largely had to guess what exactly the takeaway is, what the plotted quantities are, and how exactly they are computed.
- There is an excessive focus on irrelevant details, such as in Section 3.2, where the general case for centroids is explained, even though the rest of the paper only requires a simple discrete weighted average for both the MSE loss and the KL divergence.
- The paper lacks explicit actionable takeaways. For instance, it could be beneficial to directly incorporate diversity as a regularizer during training, making this concept more practical and accessible.
1. In Section 3.3, Theorem 1: What is y, as defined in Section 3.1? What are q_s? Are they densities as described in Section 3.2?
2. Line 261: Why is it necessary for the coefficients to satisfy $\sum_i \beta_i = 1$? From Section 3 alone, it is not clear why the p(q_{(i)}) coefficients should form a valid probability mass function rather than an arbitrary set of weights. Is this actually relevant in the context of the paper? (A small worked identity after question 4 below illustrates where I suspect the constraint comes from.)
3. Line 276: h_{(1)} depends on the training set, but up to that point it is never explicitly stated how. I assume the models are trained, but this should be clearly stated. Are different training sets used, since D is a random variable in the expectations further below (Eq. 10)?
4. What exactly is plotted in Figure 3 (right)? How are all the terms computed? Is it (option 1) the same training subsets as in Figure 3 (left), or (option 2) are these estimated quantities over subpredictors? If option 2, are the subpredictors trained on the entire training set, while the estimators in the left figures are trained only on subsets? Is there a scaling factor between quantities on the left and their estimators on the right ?
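For what it's worth, here is the small worked identity I referred to in question 2; I am assuming (perhaps wrongly) that the squared-loss case in the paper rests on the classical ambiguity identity. For any weights $\beta_i$ with $\sum_i \beta_i = 1$ and combiner $\bar{q} = \sum_i \beta_i q_{(i)}$,

$$(\bar{q} - y)^2 = \sum_i \beta_i (q_{(i)} - y)^2 - \sum_i \beta_i (q_{(i)} - \bar{q})^2,$$

which one can verify by expanding both sides; the cross terms only cancel because the weights sum to one (non-negativity is not needed). If this is indeed what Section 3 relies on, saying so explicitly would answer my question; if a different normalization is used, please state it.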
Minor comments:
===============
5. Line 256: q^i versus q_{(i)} — why the change from subscript to superscript?
6. Section 3.1: The indices i are used both for training examples and subpredictors, which causes confusion (e.g., the formula for R_{emp} on line 117 does not make sense). It is also used in d_i, where I assume it refers to input space dimension.
7. Line 376: "It is known that width is the primary factor for good network performance" — this should be toned down, as for example, large language models often require increased depth rather than just increased width.
8. Line 849 (Appendix): "Importantly, we use the same seed to initialize the neural network weights on each trial set." If this is an important point, it should be discussed in the main body: what happens when seeds are not fixed?
9. Experiments are a bit toyish. Ideally, it would be interesting to include experiments on more realistic datasets, even just a ResNet on CIFAR-10 rather than fully convolutional neural networks. |
Fully human-written |
|
Implicit Regularization Through Hidden Diversity in Neural Networks |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
- This work combines two lines of work: understanding implicit regularization in deep neural networks, and studying ensembles from the perspective of bias, variance, and diversity terms. It considers a single neural network as an ensemble of multiple neural networks and hence breaks down the loss as one would for an ensemble. The authors connect the diversity term to an additional form of implicit regularization in neural networks. They also show that this diversity term is large for overparameterized neural networks and can explain the double descent phenomenon.
- The paper overall is well written.
- The paper combines two different lines of work on ensembles and implicit regularization in NNs. I find the study of implicit regularization through the ensemble loss decomposition interesting and novel.
- Although the work provides interesting connections, the main theorem 1 is taken from previous work and does not provide new theoretical contribution.
- In this work, the authors have considered one way of decomposing the model into multiple subnetworks along the last layer. But, would the results change if we use a different decomposition?
- It seems like in all the experiments, diversity has the same behavior as variance. Can the authors suggest a case where this is not true?
- Does this diversity have any connection to other forms of implicit regularization that are classically studied in this literature? |
Fully human-written |
|
Implicit Regularization Through Hidden Diversity in Neural Networks |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work reinterprets a single neural network as an implicit ensemble, drawing upon existing literature on parameterizations that incorporate output (or logit) scaling factors, as well as on diversity theory.
- Section 3 provides a well-organized summary of the essential components from Wood et al. (2023), which helps readers grasp the necessary background.
- It is quite interesting to incorporate the output (or logit) scaling factors from MFP, SP, or muP into the ensemble combiner. This reminds me of Kirsch et al. (2025), where a slight connection between implicit ensembles and the NTK was discussed.
- I enjoyed the “ensemble” decomposition of a “single” model presented in Section 4.3. In the past, there have been discussions in the context of CNNs suggesting that the final average pooling layer could be interpreted as a form of implicit ensemble; the formulation here feels much more direct and follows the well-defined diversity decomposition of Wood et al. (2023).
---
- Kirsch et al. (2025), (Implicit) Ensembles of Ensembles: Epistemic Uncertainty Collapse in Large Models.
- The limitation, as also acknowledged by the authors in Appendix A.1, is that the analysis is restricted to networks whose final component is an MLP with ReLU activations. In practice, modern neural network architectures (yes, there’s really only one nowadays, transformers) do not typically fit this assumption, which makes this a clear weakness of the work. That said, it still offers a valuable perspective, so I wouldn’t consider this a major flaw.
- wrong left quotation mark in line 89.
- One notable point in Wood et al. (2023) is that they consider (pre-softmax) logit ensembling for classification models. I’ve often felt that this differs somewhat from the common practice of performing (post-softmax) probability ensembling. While logit ensembling is sometimes used, from a Bayesian perspective, if we think of modeling the categorical predictive distribution, (post-softmax) probability ensembling would be the more appropriate formulation in the context of Bayesian model averaging. I wonder, though, whether a similar line of reasoning could be extended to justify post-softmax ensembling as well; what are your thoughts on that? (A tiny numerical contrast between the two combination rules is included after these questions.)
- One interesting aspect of neural network ensembles is that training individual members separately and then combining them often leads to different outcomes compared to training them jointly in an ensemble form from the start (Allen-Zhu and Li, 2023). The ensemble considered in this work corresponds to the latter case. I wonder whether similar results would still hold if the former approach were taken instead.
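For concreteness, here is the tiny numerical contrast I mentioned in the first question; the logits are made up for a 3-class toy case and nothing here is taken from the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Two ensemble members' logits for one 3-class input (made-up numbers).
logits = np.array([[4.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0]])

pre  = softmax(logits.mean(axis=0))   # average logits, then softmax
post = softmax(logits).mean(axis=0)   # softmax each member, then average

print(pre)   # ≈ [0.47, 0.47, 0.06]
print(post)  # ≈ [0.49, 0.49, 0.02]
```

The two rules generally disagree (here mostly in the tail mass on the third class), and only the post-softmax average corresponds to averaging categorical predictive distributions, which is why I read it as the Bayesian-model-averaging analogue. Whether the bias-variance-diversity machinery carries over cleanly to that combiner is exactly what I am curious about.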
---
- Allen-Zhu and Li (2025), Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning. |
Lightly AI-edited |
|
Implicit Regularization Through Hidden Diversity in Neural Networks |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper finds that by interpreting single neural networks as implicit ensembles, one can use existing decompositions for the risk of ensembles into a bias, variance and diversity term to explain the expected risk of these single networks. They then argue that the diversity term in this decomposition of the risk of a neural network forms a new source of implicit regularization regulating the variance term. Finally, they empirically investigate this regularization in experiments using CNNs and MLPs.
The paper makes use of the elegant idea to interpret single neural networks as implicit ensembles, thereby being able to make use of already developed theory for ensembles to analyze single networks. Using this approach, they can in particular identify the diversity of the subpredictors of the model as a new source of implicit regularization, which is a relevant finding.
More generally, the story of the paper was relatively clear and the paper was well-organized. Using the examples of the square loss and the KL divergence throughout the paper helped a lot in being able to understand the introduced concepts better.
One concern I have is that the paper does not provide substantially new insights beyond applying the existing theory from Wood et al. (2023) to subpredictors of a single neural network. I would have been interested in seeing slightly more discussion on where they provide new insights beyond the results from Wood et al. (2023) throughout the paper.
Furthermore, most of the discussion in Section 4.4. seemed relatively speculative (e.g., multiple 'we hypothesize') and I would have been interested in seeing these hypotheses being empirically tested more explicitly.
1. I assume that the Bias-variance-diversity decomposition (Theorem 1) does not need independence between the subpredictors (since this would not be fulfilled for the subpredictors in your model). Could you expand on why independence is not necessary here and how non-independence affects the different subterms?
2. How are the Bias and Variance terms in your decomposition in Equation (10) and (12) related to a more classical Bias-Variance Decomposition?
3. Could you clarify in more detail which points mentioned in the discussion in Section 4.4. you think have been validated by your experiments and how you would test any other hypotheses that remain?
4. Do you think the following is generally true (even for subpredictors in large networks):
> On their own, these subpredictors are relatively simple models (a hidden node multiplied by a weight) and each subpredictor will likely have a high bias error.
5. Could you explain what you mean by this sentence in more detail/why you made this choice:
> To minimize some of the implicit regularization effects due to mini-batch SGD (Smith et al., 2021), we make use of full batch gradient descent as an optimizer.
6. What is your explanation for why the diversity term is tracking the variance term so closely (e.g., in Figure 1)? |
Fully human-written |
|
DRBench: A Realistic Benchmark for Enterprise Deep Research |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes DRBench, a benchmark and reproducible enterprise environment for evaluating deep-research agents that must synthesize evidence from both public web sources and private organizational data across real applications. It provides 15 persona-grounded tasks spanning 10 domains with injected ground-truth insights and distractors, generated via a structured LLM+human-in-the-loop pipeline and anchored to dated, authoritative URLs. The evaluation framework scores reports along insight recall, factuality via per-claim citation verification using a fixed RAG pipeline, and multi-dimensional report quality via LLM-as-a-judge. A baseline DRBench Agent with planning variants is analyzed; results show adaptive planning boosts recall while lightweight planning better preserves factuality, and app-based environments are notably harder than local file access. The paper includes ablations across multiple backbone models, browser-only baselines, and a small human study validating metric alignment.
- The benchmark genuinely bridges public web retrieval with private enterprise data across heterogeneous formats and real apps, grounded in personas and company context; this goes beyond web-only research settings.
- The evaluation is thoughtfully designed—atomic insight extraction, strict per-claim citation checks with a controlled RAG pipeline, explicit distractor avoidance, and nuanced report quality scoring—plus an anti-gaming cap on evaluated insights.
- A clear, reproducible pipeline with human verification produces distractor-rich files and stable, dated public sources; the environment is containerized and well-documented, enhancing reproducibility.
- Ablations on planning strategies, backbone models, and local vs app-based settings expose concrete failure modes, offering actionable guidance for future agent design.
- DRBench fills a gap between deep research benchmarks and computer-use agent tests, with code/scripts promised and a setup that feels close to real enterprise workflows.
- The paper’s presentation could be slightly improved;
- Figure 8 could be better presented.
- The paper introduces “golden insight” (e.g., Sec. 6 and Prompt 15) without prior definition, seemingly synonymous with “ground-truth insights,” which creates confusion in the evaluation description. Please unify terminology and define the term at first occurrence
- There are two versions of labels in Figure 4 that overlap with each other.
- Only one evaluation backend (GPT-4o) is used. The stability of the evaluation across different backends is not assessed.
- What do the stars in L948–L949 mean?
- MinEval selects only the retail domain. Given that stratifying the same total number of tasks across industries (retail, healthcare, EV) need not increase computational/evaluation cost, why not adopt stratified sampling or include at least one representative task per industry?
- Can the agent access public resources beyond the Task URL? If not, how do you limit the behavior of agents like OpenAI Deep Research? |
Lightly AI-edited |
|
DRBench: A Realistic Benchmark for Enterprise Deep Research |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces DRBench, a benchmark for answering deep research questions that require synthesizing information from both Web pages and enterprise documents embedded in files or apps like emails. The 15 tasks are curated using LLMs with a human-in-the-loop approach that involves generating a company and persona, collecting relevant public URLs and extracting insights from them, generating questions based on the public insights, generating internal insights and distractors for those questions, and generating internal documents to embed those insights. The answers are evaluated based on insight recall, factuality, distractor avoidance and report quality measure using LLM-as-a-Judge. A baseline DRBench Agent (DRBA) is also developed that consists of research planning, action planning, adaptive action planning and report generation. Experimental results show that even SoTA models struggle with these tasks, particularly on insight recall.
1. Novel and challenging task that involves assimilating information from various sources and also interacting with apps like emails
2. Systematic pipeline for benchmark creation
3. Extensive experiments and analyses to measure the performance of many models on four comprehensive criteria. Results demonstrate benchmark complexity
4. Human evaluation to validate both benchmark creation and evaluation metrics
1. Missing relevant work "Benchmarking Deep Search over Heterogeneous Enterprise Data" by Choubey et al.
2. Very small dataset consisting of only 15 tasks
3. No agent identified public insights. There should be some analysis to understand whether this is due to a lack of web indexing, the retrieval approach, tool limitations, benchmark design, agent extraction, etc. For example, are the public insights inaccessible to the search tool?
1. Although some tables (e.g., Table 11) show the number of questions answered successfully, why is such an accuracy not reported for all tasks and models? This seems to be the most important evaluation criterion. For example, an agent might retrieve all insights but still not synthesize them into the correct answer. LLM-as-a-Judge could be used for this criterion too. Does the benchmark include gold answers for all questions?
2. Which model is used for LLM-as-a-Judge? This could impact results as the judge LLM is known to be biased towards models from its own family. |
Fully human-written |
|
DRBench: A Realistic Benchmark for Enterprise Deep Research |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces DRBench, a benchmark for evaluating deep-research agents that must integrate public web information with private, enterprise-like data (e.g., files, chats, emails) inside a realistic multi-app environment. It proposes three evaluation axes—Insight Recall & Distractor Avoidance, Factuality (via evidence-checked citations), and Report Quality—and presents a baseline DRBench Agent (DRBA) with variants (SRP/CRP/AAP). Experiments span 15 persona-grounded tasks across 10 domains, with analyses of planning strategies, backbone LLMs, and app-based vs. local file access.
The paper convincingly argues that prior deep-research benchmarks are predominantly web-only and do not measure whether agents surface the most salient enterprise insights or ground claims with citations.
Tasks require tool use across storage, chat, email, and documents, which distinguishes DRBench from web-only retro-search settings such as DeepResearchGym and Deep Research Bench, both of which rely on fixed corpora or “frozen web” for reproducibility rather than mixed private+public sources.
Multi-axis evaluation design. The insight recall vs. distractor avoidance split is well motivated; factuality uses RAG-style evidence checks; report quality is judged on structured dimensions. The methodology reflects current best practice in LLM-as-a-judge evaluations.
Limited evidence of external validity & task coverage. While the 15 tasks are persona-grounded, the coverage across industries and the depth of internal knowledge heterogeneity remain modest. Benchmarks such as DeepResearchGym and BrowseComp-Plus now report hundreds to thousands of instances or large curated corpora; DRBench’s small task count risks overfitting and limited statistical power.
LLM-as-judge reliance & bias. All key metrics (recall alignment, factuality judgments, report quality) ultimately depend on LLM judges. The paper would benefit from more thorough human-vs-LLM agreement studies and inter-rater reliability beyond the limited assessments reported (a minimal sketch of one such reliability computation is included at the end of the weaknesses).
How robust is Insight Recall to paraphrase or partial matches? The paper does not evaluate span-level alignment or evidence coverage per insight.
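To make the inter-rater reliability request concrete, here is a minimal sketch of the kind of statistic I have in mind (a from-scratch Fleiss' kappa; the rating matrix below is entirely hypothetical and not taken from the paper):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (n_items x n_categories) matrix of label counts,
    where each row sums to the (constant) number of raters per item."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]
    # Per-item agreement: fraction of agreeing rater pairs.
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.sum(p_j ** 2)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 5 report-quality judgments, 3 raters, 3 quality bins.
ratings = [[3, 0, 0],
           [2, 1, 0],
           [0, 3, 0],
           [0, 1, 2],
           [1, 1, 1]]
print(round(fleiss_kappa(ratings), 3))
```

Reporting such a coefficient per evaluation dimension, alongside the existing human-alignment numbers, would substantially strengthen the LLM-as-judge claims.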
Can you quantify LLM-judge and human agreement for each metric (beyond small samples), and report Fleiss’ κ or Krippendorff’s α per dimension?
What are the exact artifacts you will release (images, VM snapshots, container specs, synthetic email/chat generators, grading scripts)? Any non-redistributable components?
How robust is Insight Recall to paraphrase or partial matches? Do you evaluate span-level alignment and evidence coverage per insight? |
Fully AI-generated |
|
DRBench: A Realistic Benchmark for Enterprise Deep Research |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces DRBench, a benchmark for evaluating AI agents on multi-step, long-horizon enterprise deep research tasks. It consists of 15 persona-grounded tasks across 10 domains. The tasks comprise retrieval and insight generation from public web content as well as private enterprise data (emails, chats, documents, spreadsheets) to answer business-related queries. It proposes 4 evaluation metrics - Insight Recall, Factuality, Distractor Avoidance, Report Quality. It introduces the DRBench baseline agent and evaluates it on the benchmark across multiple planning strategies and backbone models.
1. The paper tackles an important challenge of enterprise deep research which is a problem space that remains highly unexplored.
2. The benchmark incorporates private data sources distributed across realistic enterprise applications (such as cloud storage, chat, and the file system) and containing diverse file formats, which creates a highly authentic and realistic evaluation environment.
3. The evaluation framework consists of multiple complementary metrics that help evaluate agentic systems in terms of both precision and recall as well as report quality.
4. The paper includes ablation studies across planning strategies, backbone llms and environmental settings (local vs app based).
1. A major limitation of the benchmark is its limited size. Only 15 tasks and 114 insights make the benchmark significantly smaller, which raises questions about the statistical significance of the evaluation.
2. Extraction of atomic insights from the final report is a very important step in the evaluation method, since 3 of the 4 metrics depend on it. However, because LLMs are used for this step, it will be noisy, which will lead to less reliable metric scores.
3. Having LLM-as-a-judge as the only method of evaluation raises questions about the accuracy of the evaluation, since LLMs can hallucinate and exhibit bias. Even though the authors discuss correlation with human preference, it does not really indicate how accurate the LLM evaluations are.
4. Synthetic data generation, even though it makes the benchmark-generation approach more scalable, raises questions about how realistic the internal enterprise data is. Combined with the fact that an LLM is also used for evaluation, it can lead to more noise and bias in the evaluation results.
5. Even though the paper includes several ablation studies, it does not provide an in-depth analysis of why the results are the way they are. It simply states that a certain model/approach is better than another without trying to explain why that might be so. A prime example is the statement that no agent managed to successfully source external knowledge, offered without any explanation of why this happened.
1. Results show relatively poor performance on the insight recall metric across models and methods. Why is this so? What percentage of it is due to incorrect or noisy insight extraction?
2. How good is the atomic insight decomposition? Is there any quantitative analysis measuring the performance of the insight decomposition as well as the scoring for the different metrics?
3. The decision to cap at the number of ground-truth insights plus five when calculating the insight recall score seems arbitrary. Were other methods for penalising the copying of all content into the generated report explored? |
Fully human-written |
|
Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents Pixel3DMM, a method for 3D face reconstruction from single RGB images. The idea lies somewhere between conventional pure-3DMM parameter-regression approaches and screen-space facial normal prediction. The approach trains two Vision Transformers (ViTs) built on DINOv2 features to predict per-pixel surface normals and UV coordinates, which are then used to constrain the FLAME 3DMM fitting optimisation. The authors also introduce a new benchmark for evaluating both posed and neutral facial geometry reconstruction.
1. The overall idea is simple but effective - this is appealing.
2. The method achieves significant improvements over state-of-the-art, particularly on posed expressions.
3. The paper introduces the first benchmark that jointly evaluates both posed and neutral facial geometry, addressing an important gap in the field. The benchmark includes diverse, extreme expressions from NeRSemble.
4. Training requires only 2 GPUs for 3 days using publicly available data, making the work reproducible and accessible to the research community.
5. The paper includes extensive ablations, comparisons on multiple benchmarks (NoW, FaceScape, plus their own), and evaluations of the normal estimation component.
6. Qualitative results (Fig. 1, Fig. 8) demonstrate robust performance on challenging in-the-wild images with occlusions, lighting variations, and diverse appearances.
1. The method primarily fine-tunes DINOv2 with a simple prediction head (4 transformer blocks + 3 up-convolutions). So there is limited novelty in the architecture or set up itself.
2. Despite strong posed reconstruction, the method only marginally improves over MICA for neutral faces.
3. The paper lacks discussion of when and why the method fails. What types of expressions or conditions are most challenging? The qualitative comparisons show strong results, but no failure cases are presented.
4. The method critically depends on MICA's identity predictions (Table 3: "no MICA" ablation shows significant degradation). This is a strong assumption that limits the method's independence and could propagate MICA's biases or errors.
5. The use of IC-Light for lighting augmentation is neat but not thoroughly evaluated. How much does this contribute to robustness? An ablation would be valuable.
6. The fitting takes 30 seconds in an "unoptimised implementation." How does this compare to baselines? Real-time performance matters for many applications.
7. In the UV coordinate loss design (Eq. 6), the nearest-neighbour lookup seems clumsy to me. Can't you use barycentric interpolation from the per-pixel UVs to interpolate a vertex position? (A small sketch of what I mean follows this list.)
8. The normal loss is a simple L1 difference in normal space. Why not use cosine similarity or angular error, which are more standard for normal estimation?
9. The paper states all datasets are registered to FLAME topology using NPHM's procedure, but doesn't discuss registration quality or errors this might introduce.
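To clarify point 7: below is a minimal numpy sketch of the barycentric alternative I have in mind, assuming the UV-space triangle containing each predicted UV has already been located (that lookup, and whether gradients should flow through it, are separate questions). All names and values are illustrative, not taken from the paper.

```python
import numpy as np

def barycentric_uv_target(uv_pred, tri_uv, tri_xyz):
    """Interpolate a 3D target for one pixel from its predicted UV coordinate.

    uv_pred : (2,)   predicted UV for the pixel
    tri_uv  : (3, 2) UV coordinates of the triangle assumed to contain uv_pred
    tri_xyz : (3, 3) corresponding FLAME vertex positions
    """
    a, b, c = tri_uv
    basis = np.stack([b - a, c - a], axis=1)      # 2x2 matrix with edge vectors as columns
    w1, w2 = np.linalg.solve(basis, uv_pred - a)  # barycentric weights for b and c
    w0 = 1.0 - w1 - w2                            # weight for a
    return w0 * tri_xyz[0] + w1 * tri_xyz[1] + w2 * tri_xyz[2]

# Toy triangle: the predicted UV (0.25, 0.25) maps to (0.25, 0.25, 0.0) in 3D.
tri_uv  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
tri_xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(barycentric_uv_target(np.array([0.25, 0.25]), tri_uv, tri_xyz))
```

Even if the nearest-neighbour variant works well enough in practice, a comparison against (or a brief justification over) this kind of interpolated target would be informative.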
Besides responding to the above listed weaknesses, some additional questions are:
Can you provide quantitative analysis of where the method fails? What percentage of benchmark images have errors above certain thresholds?
Have you explored learning identity/expression disentanglement more explicitly, rather than relying on MICA?
What is the actual runtime comparison with baselines in a fair setting (same hardware)? |
Fully AI-generated |
|
Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes Pixel3DMM for 3D face reconstruction from a single image. It first predicts pixel-aligned geometric priors by training two foundation models to predict per-pixel surface normals and UV coordinates separately. Then it uses them as supervision in an optimization-based FLAME fitting process for reconstruction. To get the training data, the authors register three public 3D facial datasets (FaceScape, NPHM, Ava256) to NPHM and apply lighting augmentation to obtain images with corresponding normal and UV ground truth. Moreover, they introduce a new benchmark from the NeRSemble dataset for both posed and neutral face reconstruction evaluation. Experimental results show that the model outperforms SOTA methods on reconstruction and normal prediction, and demonstrates solid generalization to in-the-wild images.
The idea of using pixel-level geometry prior for 3D reconstruction supervision is sound to me. This paper combines the geometric prior of foundation model with FLAME optimization, showing an interesting direction to improve 3DMM-fitting robustness. Another strength is that it only requires 2 48G GPUs to train, making it computationally accessible. Quantitative and qualitative comparisons with previous methods show better performance across multiple benchmarks with large expression and poses. Moreover, the model is robust to in-the-wild examples and video tracking. The paper is overall easy to read and provides enough details for data processing, model architecture and training, which contributes to reproducibility.
1. The proposed 3D reconstruction method takes two steps and requires two networks to predict normal and UV coordinates separately, which is relatively complex compared with previous feed-forward or optimization-only methods.
2. The reconstruction still relies on FLAME parameters, which have limited representation capacity, so the method cannot reconstruct fine-grained details beyond the 3DMM space. Also, the paper uses NPHM to obtain a uniform topology for supervision, which introduces registration error.
3. The ablation study shows that MICA identity initialization plays a significant role in performance, which raises questions about the importance of the predicted priors versus the identity cues inherited from MICA.
4. Qualitative ablation results are lacking; they would help clarify the benefit of each component.
Beyond the concerns listed in the weakness section, I have a few questions. First is the choice of normal maps and UV coordinates as geometry cues. Have you considered other 3D representations such as depth maps or point maps? Another question is why not perform FLAME fitting on all the training data and then train a feed-forward network to regress the parameters? Would it lead to degraded generalization or accuracy?
Fully human-written |
|
Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a hybrid method for single-image 3D face reconstruction, named Pixel3DMM, which leverages a powerful prior model to predict normals and UV coordinates as supervisory signals that guide FLAME mesh fitting through test-time optimization. The approach utilizes two screen-space priors—surface normals and UV correspondences—predicted via a customized ViT architecture. Despite training on a moderately sized dataset, it achieves competitive accuracy with reduced training resources compared to prior methods. The estimated priors are used to fit the FLAME model, delivering strong performance on a newly proposed benchmark, particularly excelling in the expression disentanglement evaluation based on the NeRSemble dataset.
* Employs a lightweight ViT-based approach for face normal estimation using limited data, providing a reproducible alternative to complex methods.
* Enhances model robustness by carefully processing a large-scale multi-view dataset and applying IC-light-based data augmentation to account for lighting variations.
* Innovatively decomposes FLAME parameter recovery into an image translation problem and dense keypoint optimization, yielding strong experimental performance.
* The paper's core techniques, including the use of ViT for predicting screen-space attributes and the FLAME fitting process, show minimal novelty, as similar approaches have been extensively explored in prior research with only minor architectural adjustments.
* While the method outperforms baselines on the new benchmark, it falls short of state-of-the-art methods like FlowFace and TokenFace on established benchmarks (e.g., NoW, FaceScape), indicating a limited competitive advantage.
My core concerns regarding this paper lie in its limited innovation and relatively minor improvements, as highlighted in the Weaknesses section. Should the authors identify any aspects overlooked in my evaluation of innovation and performance enhancements, I encourage them to point them out. Such clarifications may potentially lead to a revision of my overall assessment. |
Moderately AI-edited |
|
Reinforcement Learning from Dynamic Critic Feedback for Free-Form Generations |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper tackles the important problem of identifying the most likely failure case of generated free-form text by introducing a novel adversarial learning scheme for training a generator and a critic. The generator outputs multiple candidate solutions and the critic, given a candidate, identifies the case most likely to fail for that generation. They employ a DPO loss for both the generator and the critic. Consistent results on factual text generation and code generation show improved performance over baseline methods.
-The paper is well-written and easy to follow
-The paper outperforms baseline methods in both factual text generation and code generation
-The proposed approach is novel, applying adversarial learning to the important, significant problem of identifying failure cases in free-form generation
-I find the use of DPO fairly interesting and novel
-For fact verification, I do not necessarily see the need for a critic to specify which fact to check. Can one not separate out all the facts in the generated responses, either programmatically or with an LLM, and then run each one through the fact verifier? I understand from later in the paper that verification is costly, but verifying each fact would provide more accurate rewards for the generator, correct? For code generation, I understand this simplification is not possible because the test cases cannot be parsed from the generated code.
-If I understand correctly, given a candidate answer a_i to s, the reward used to fine-tune the generator is sparse. Therefore, if even one fact is incorrect or one test case fails, the whole candidate output is assigned a reward of 0, correct? (A schematic sketch of this all-or-nothing reward follows this list.) If so, in the case of code generation, that may be too strict because realistically code is regularly updated to handle unseen test cases, not marked as entirely wrong. Also, for fact generation, if one fact is wrong, the sparse reward does not help the generator learn which fact was wrong. I understand the issue may be mitigated by sampling multiple candidates and using the DPO loss, so similar candidate solutions with different rewards can help the generator learn from finer-grained feedback. However, how often, especially at the beginning of training, do you get a group of candidates with both reward = 0 and reward = 1 to provide some distinction to the generator?
-I’m not too familiar with FactScore, so I am not sure what the bottleneck cost for verification mentioned in L249 is. But for code generation, I do not see why the number of test cases executed is a good metric for efficiency analysis. Test cases, especially the ones provided in the used datasets, can be executed quickly.
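To make the sparse-reward concern above concrete, here is a minimal sketch of the all-or-nothing reward as I read it; the function names are hypothetical stand-ins, not the paper's implementation, and `verify` denotes the external validator (fact checker or test runner).
```python
# Illustrative only: the all-or-nothing candidate reward described above.
def candidate_reward(units, verify):
    # one failing fact or test case zeroes out the entire candidate
    return 1.0 if all(verify(u) for u in units) else 0.0
```
Under such a reward, DPO pairs carry signal only when a sampled group contains both reward-1 and reward-0 candidates, which is exactly the regime my question asks about.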
See Weakness 1 and 2
Minor Suggestions/Weaknesses:
-L86 has two empty parentheses
-Missing reference in L248 |
Fully human-written |
|
Reinforcement Learning from Dynamic Critic Feedback for Free-Form Generations |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
Post-training methods for large language models (LLMs) constitute an active area of research. However, reinforcement learning (RL)-based fine-tuning is very challenging due to the generative nature of LLM outputs. In particular, the design of efficient reward functions is difficult.
To address this issue, the authors introduce Reinforcement Learning from Dynamic Critic Feedback (RLDCF), a framework that frames post-training as an adversarial game between a generator and a critic. In RLDCF, each instruction/prompt provided to the LLM is associated with a set of rubrics representing task-specific requirements that the output should satisfy. The objective is then to train a generator that maximizes the probability of providing correct outputs.
The critic, modeled as a stochastic policy, aims at providing the worst-case criterion for a given instruction–action pair. The generator is then trained by solving a mini–max optimization problem. Both the generator and critic are implemented as LLMs fine-tuned using a DPO loss.
The proposed approach is evaluated using text and code generation tasks. For factual text generation, experiments are conducted on the Wikipedia Biography Dataset using base generators Qwen3-4B and Qwen3-8B, compared against three baseline methods. For code generation, the authors employ the AceCode-87K-hard subset, with base generators Qwen2.5-Coder-7B-Base and Qwen2.5-Coder-7B-Instruct, also benchmarked against three baselines.
Post-training of large language models is a crucial task for improving task-specific use of generative models and providing robust LLMs.
The presentation of the paper is clear, the problem is well motivated and the overall description of the method is good.
Experimental results demonstrate that RLDCF yields competitive results in both text and code generation quality, highlighting the effectiveness of adversarial critic feedback to finetune.
- In the text generation experiment, RLDCF achieves the same level of factuality as FacTune-FS with fewer verification calls; the KL divergence also improves along epochs with monotonic FactScore gains.
- In the code generation task, the proposed approach outperforms the enumerative method and the static reward-model method on most benchmarks.
The results are interesting and promising on the two proposed tasks. However, as the paper is mostly experimental, I would expect more discussion of the methodological choices, for instance of the way the critic and generator are updated. The influence of K (the number of candidate outputs for each instruction) and N (the number of criteria sampled from the critic) on the results should be strong. Even if no theoretical guarantees are provided (which is probably a very hard question), I would expect more discussion of these "hyperparameters".
In the experiments, it is hard to assess the statistical significance of the results, as there is no uncertainty quantification (e.g., the standard deviation of the metrics over independent runs).
Although adding additional experiments or simulations is not strictly necessary to demonstrate the soundness of RLDCF, the contribution being primarily methodological, its impact would be strengthened if the empirical evaluation, especially for the factual text generation task, were illustrated with a broader variety of performance results.
Minor weakness: the paper would benefit from a careful proofreading to remove typos (Appendix and Tables wrongly referenced for instance).
In both experiments N and K are fixed. However, these parameters should have a great impact on the results. Can you discuss this?
These hyperparameters differ between the two experiments. Is there a reason for this?
In the experiments, can you add some quantification of uncertainty (e.g., std over multiple runs)?
Is it possible to highlight the performance of RLDCF on other text generation tasks to support the applicability of the method? For instance, using datasets from papers associated with the baselines (e.g., the medical question answering of [Tian et al., 2024]).
Does changing the backbone model for the generator and critic influence the experimental conclusions (factuality level/number of calls, dynamics of the KL divergence along epochs)?
Fully human-written |
|
Reinforcement Learning from Dynamic Critic Feedback for Free-Form Generations |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper tackles a real pain point in training large language models: how do you optimize for tasks where outputs need to satisfy tons of different criteria, and you can't possibly check them all? The authors propose RLDCF, which basically turns training into a game between two models. A generator tries to produce good outputs, while a critic tries to catch the generator's mistakes by proposing specific ways it might fail. An external validator then checks if the critic found a real error. The critic gets better at spotting weaknesses, and the generator gets better at avoiding them. The idea is neat: instead of exhaustively checking every possible criterion or relying on a static reward model that can be gamed, you dynamically focus on the most likely failure points.
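To fix ideas, here is a minimal, schematic sketch of one training round as I understand it from the paper; the callables (`sample`, `propose_failure`, `check`) are hypothetical stand-ins rather than the authors' API, and the DPO updates themselves are omitted.
```python
# Schematic only: one generator-critic round as described in the summary above.
def collect_preference_pairs(prompts, sample, propose_failure, check, K=4):
    gen_pairs = []       # (prompt, preferred, dispreferred) pairs for the generator
    critic_records = []  # (prompt, answer, claim, error_found) records for the critic
    for s in prompts:
        scored = []
        for _ in range(K):
            a = sample(s)                     # generator proposes a candidate
            claim = propose_failure(s, a)     # critic names the most likely failure
            error_found = check(s, a, claim)  # external validator checks the claim
            scored.append((a, 0.0 if error_found else 1.0))
            critic_records.append((s, a, claim, error_found))
        winners = [a for a, r in scored if r == 1.0]
        losers = [a for a, r in scored if r == 0.0]
        gen_pairs.extend((s, w, l) for w in winners for l in losers)
    return gen_pairs, critic_records
```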
The paper reports promising results. For biography generation, they hit a FactScore of 0.889 while doing 5.7 times less verification work than existing methods. For code generation, they claim the best scores despite using only 9% of the training data. However, a potential problem in the code generation experiments suggests circular logic: they essentially created “reference solutions” using the same model family they're training, so the model could just be learning what Qwen is already able to do. The Qwen-7B-Instruct training shows limited improvement over the base model, which could be well within variance, and the authors don't provide much detail on what improved or how.
The core adversarial-training idea isn't particularly novel; similar approaches have appeared in recent work on generative verifiers. The biography experiments are more solid, and the overall problem they're solving matters. But between the circular validation issues, limited novelty, and some unfair experimental comparisons, this feels like a decent idea that needs another round of work, with more solid experiments and applications, to justify a sufficient contribution to the area.
- Important and natural problem to tackle for rubric-based reward modeling and RL training in LLM post-training.
- Theoretical formulation is solid.
- Strong factual text generation results.
- Good ablation studies.
- The 4-to-8-sentence comparison shows the method scales well with complexity
- Presentation is clear
- I have doubts about the code experiment setup, as mentioned above
- Lack of theoretical or empirical analysis of the method and the experimental results
- Could benefit from more analysis, takeaways, and experiments.
I'm open to changing my score if the authors can provide sufficient justification or insight for both experiments, especially the coding one
Fully human-written |
|
Scalable Second-order Riemannian Optimization for $K$-means Clustering |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a scalable second-order Riemannian optimization algorithm for solving the K-means clustering problem through a low-rank semidefinite relaxation. The authors reformulate the nonconvex problem as a smooth unconstrained optimization problem over a product manifold and apply a cubic-regularized Riemannian Newton method. The algorithm is claimed to achieve linear per-iteration complexity with provable convergence to second-order critical points.
However, several issues remain. The key assumption (Assumption 1) seems inconsistent and lacks justification; the algorithmic novelty is limited. If the authors can address my concern, I would raise my score.
- The mapping from the constraint manifold $\mathcal{M}$ to the product manifold provides a principled and novel way to represent K-means as a smooth manifold problem. This reformulation reduces the projection cost from $O(n^2)$ to $O(nr + r^3)$.
- By exploiting block-diagonal-plus-low-rank Hessian structure (Appendix E), each iteration scales linearly in $n$ in the cubic-regularized Newton approach.
- A bisection search is used to solve the subproblem in the cubic-regularized Newton approach.
- Assumption 1 is difficult to interpret. Problem (1) does not require $U > 0$; $U > 0$ is enforced only in problem (3), and it is a sufficient but not necessary condition for $Z = UU^\top > 0$. Hence, the assumption is logically inconsistent with the model definition: there may be cases where the optimal solution has $U_{ij} < 0$. In addition, this assumption is verified only empirically (Fig. 1), with no analytical justification or example provided. Since all global optimality claims rely on this assumption, its ambiguity weakens the theoretical contribution.
- The method essentially applies a standard cubic-regularized Newton algorithm to a reformulated manifold problem, and does not introduce a new algorithm, adaptive strategies, or theoretical improvements.
- Although the reformulation simplifies the constraint structure, the computation of gradients and Hessians remains costly. I recommend the authors discuss hybrid methods that separate simple manifold constraints (handled via Riemannian optimization) from the remaining constraints (handled via an augmented Lagrangian). Related work worth discussing includes:
- Wang, J., & Hu, L. (2025). Solving low-rank semidefinite programs via manifold optimization. Journal of Scientific Computing, 104(1), 33.
- Monteiro, R. D., Sujanani, A., & Cifuentes, D. (2024). A low-rank augmented Lagrangian method for large-scale semidefinite programming based on a hybrid convex-nonconvex approach. arXiv preprint arXiv:2401.12490.
- Hou, D., Tang, T., & Toh, K. C. (2025). A low-rank augmented Lagrangian method for doubly nonnegative relaxations of mixed-binary quadratic programs. Operations Research.
- Wang, Y., Deng, K., Liu, H., & Wen, Z. (2023). A decomposition augmented lagrangian method for low-rank semidefinite programming. SIAM Journal on Optimization, 33(3), 1361-1390.
- Important components—algorithm pseudocode, subproblem solvers, and retraction details—are placed only in the appendix.
Why is the cubic-regularized Newton framework preferred? The original method (Agarwal et al., 2021) solves subproblems via *Lanczos iterations*, whereas this paper adopts bisection search. The authors should discuss those two approaches. |
Lightly AI-edited |
|
Scalable Second-order Riemannian Optimization for $K$-means Clustering |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a new smooth, unconstrained Riemannian reformulation of a nonnegative low-rank factorization of the K-means SDP, enabling second-order optimization with feasibility preserved via retractions. The key idea is to map the original constraint set $\mathcal{M}$ onto a product manifold $\tilde{\mathcal{M}} = \mathcal{V}\times \mathrm{Orth}(r)$ via a submersion, which yields simple $O(nr + r^3)$ retractions and efficient expressions for Riemannian gradients/Hessians.
Building on this geometry, the authors design a cubic-regularized Riemannian Newton method that solves each Newton subproblem in $n\cdot\mathrm{poly}(r,d)$ time by exploiting a block-diagonal-plus-low-rank structure and a bisection scheme on the regularization parameter. The overall algorithm finds $\epsilon$–second-order points in $O(n\cdot\epsilon^{−3/2} \cdot\mathrm{poly}(r, d))$ time.
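For orientation, the generic cubic-regularized Riemannian Newton template (Agarwal et al., 2021) solves at each iterate
$$\eta_k = \operatorname*{arg\,min}_{\eta \in T_{x_k}\mathcal{M}} \; \langle \operatorname{grad} f(x_k), \eta \rangle + \tfrac{1}{2}\langle \operatorname{Hess} f(x_k)[\eta], \eta \rangle + \tfrac{\sigma_k}{3}\|\eta\|^3, \qquad x_{k+1} = R_{x_k}(\eta_k),$$
where $R$ is a retraction; the paper's contribution, as I read it, is making this subproblem cheap on the specific product manifold via the structured inner solves and the bisection on the regularization parameter.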
The work hinges on an empirical "benign nonconvexity" assumption: in regimes where the convex $K$-means SDP recovers ground truth, all approximate second-order critical points of the nonnegative low-rank model are near-global optima. The authors provide extensive empirical evidence for this behavior.
Experiments on synthetic GMMs and CyTOF data show that the proposed second-order method converges in hundreds of iterations, achieves similar or better clustering accuracy than the strongest first-order baseline (NLR), and substantially reduces time despite costlier iterations. It also outperforms prior Riemannian $K$-means methods and classical RTR/CG solvers that struggle with the log-barrier’s ill-conditioning.
The paper supplies detailed derivations: manifold geometry, LICQ verification, submersion proof, closed-form Lagrange multipliers for projections, efficient Hessian-vector products, feasible initialization (and necessity of $r > K$), and implementation details for the linear-time inner solves.
Conceptual advance: The submersion to a product manifold with simple retractions is elegant and practically impactful, removing the main bottleneck that hindered prior Riemannian approaches to $K$-means (expensive retractions and feasibility maintenance).
Algorithmic engineering: The cubic-regularized Riemannian Newton solver is carefully tailored—analytical gradients/Hessians, efficient tangent projections, and a block-diagonal-plus-low-rank exploitation for the inner linear systems. The bisection-based solver for the regularization parameter is simple and reliable.
Clear bridge to theory: The paper situates the contribution within the exact-recovery phase transition for the $K$-means SDP and explains how second-order points suffice under the benign nonconvexity hypothesis. The LICQ and manifold calculus are handled rigorously; the smoothing argument for the log penalty clarifies the use of second-order guarantees.
Strong empirical evidence: On both synthetic and real data, the method exhibits rapid, stable convergence to second-order points with near-zero mis-clustering; it consistently reduces iteration counts by orders of magnitude compared to NLR and RTR, translating to 2-4x faster runtimes despite more expensive steps.
Reproducibility and completeness: The paper provides explicit formulas, complexity accounting, initialization constructions, hyperparameter guidance, and ablations (rank $r$, penalty $\mu$, comparisons with several baselines), which make the contribution actionable for practitioners.
1. Assumption 1 (benign nonconvexity) is purely empirical in this work. While the authors motivate it with analogies to Burer–Monteiro and back it with experiments, the lack of any partial theoretical characterization (e.g., under separation/noise conditions and mild overparameterization $r > K$) limits the scope of the main claim in average-case regimes.
2. Dependence on the log-barrier: Although handled well algorithmically, the severe ill-conditioning induced by the log term drives both the design choices and some limitations (e.g., RTR/CG underperform). A discussion or experiment on alternative barrier/penalty designs (e.g., smooth-plus hard positivity via projections on $U$) could strengthen the case or show robustness.
3. Sensitivity to $\mu$ and feasibility interior: The method requires strictly positive $U$ and shows a phase transition when $\mu$ is too large. While the paper provides heuristics, a more systematic procedure (or adaptive schedule with safeguards) would make the solver more turnkey across datasets; also, the necessity of $r > K$ to ensure interior feasibility may be restrictive in memory-limited settings.
4. Scalability constants: The per-iteration complexity is linear in $n$ with $\mathrm{poly}(r, d)$ factors; however, the inner solves involve $r^3$ and $d$-dependent Schur complements. Scaling with varying $r$ and $d$ would help practitioners understand the limits and guide parameter choices.
5. Generality beyond GMM: Although the manifold formulation applies to kernelized $K$-means, empirical validation is focused on GMMs and one CyTOF setup. Broader tests (imbalanced/many clusters, higher $d$, other real datasets, kernels) would better support claims of robustness.
1. Can you provide partial theory toward Assumption 1? For example, under the separation in Chen & Yang (2021b), and mild overparameterization $r = K + O(1)$, can you show absence of spurious second-order points in a neighborhood of the ground-truth factor, or establish a strict saddle property?
2. How robust is the method to mis-specified $K$ and to cluster imbalance? Could you include experiments where $K$ is under/over-estimated, and where cluster sizes vary significantly, and report both accuracy and convergence behavior?
3. Could you explore alternative penalties that retain positivity while improving conditioning (e.g., softplus smoothing, additive offsets, or barrier homotopy/continuation schedules), and compare convergence to your log-barrier?
4. What is the practical guidance for selecting $r$ beyond $r = K + 1$? Are there cases where larger $r$ improves optimization (escaping poor basins) or statistical robustness, and what is the runtime trade-off empirically?
5. Can you extend experiments to kernel $K$-means and non-Gaussian mixtures, or real vision/NLP datasets where the Gram matrix is built from learned features? This would help demonstrate the generality implied in Appendix A and the stated manifold framework. |
Fully AI-generated |
|
Scalable Second-order Riemannian Optimization for $K$-means Clustering |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper tackles the computational difficulty of solving the nonconvex, constrained formulation of K-means clustering. It reformulates the low-rank SDP relaxation as a smooth optimization over a Riemannian manifold, so that constraint feasibility is handled intrinsically. This allows the use of a cubic-regularized second-order Riemannian Newton method that guarantees convergence to second-order critical points—empirically aligned with global optima in K-means.
To make the approach scalable, the authors introduce a product-manifold factorization and exploit the Hessian’s structure to achieve linear-time per-iteration complexity. Experiments on Gaussian mixture and real-world datasets show faster convergence and lower mis-clustering rates than state-of-the-art first-order methods, demonstrating both theoretical soundness and strong empirical performance.
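For context, the SDP relaxation the paper builds on is typically written in the Peng–Wei form, with $A$ the Gram/affinity matrix,
$$\max_{Z \in \mathbb{R}^{n \times n}} \; \langle A, Z \rangle \quad \text{s.t.} \quad Z \succeq 0, \; Z \ge 0 \ (\text{entrywise}), \; Z\mathbf{1} = \mathbf{1}, \; \operatorname{tr}(Z) = K,$$
and the low-rank route factorizes $Z = UU^\top$ with $U \in \mathbb{R}^{n \times r}$, which yields the manifold-plus-positivity structure handled here; the exact formulation in the paper may differ in details.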
1. The paper successfully integrates second-order Riemannian methods, specifically the cubic-regularized Newton approach, into the K-means SDP framework. Moreover, The proposed product-manifold factorization resolves the computational bottleneck of expensive retraction operators in previous works.
2. Demonstrates consistent improvements in convergence speed and clustering accuracy over leading baselines, including NLR, RTR.
3. Sections 2 and 3 provide a concise review of SDP relaxations and information-theoretic thresholds. Theorems 1 and 2 are well chosen to connect existing Riemannian convergence theory to this specific manifold formulation.
4. Reduces per-iteration complexity to linear in n, making second-order methods computationally competitive. The approach can be generalized to other manifold-constrained ML problems.
1. Assumption 1 (Benign Nonconvexity) remains unproven. The paper relies heavily on this assumption for its theoretical motivation but provides only empirical justification. A rigorous explanation or partial theoretical evidence would substantially strengthen the work.
2. In the numerical experiments, only a single real-world dataset (CyTOF) is used. Additionally, there is no ablation study on manifold dimension, sensitivity to initialization, or robustness to cluster imbalance.
3. The main theorems of complexity are adaptations of known Riemannian results rather than new convergence analyses specific to K-means.
4. The paper’s contribution seems somewhat incremental. It reformulates the K-means problem and employs a standard cubic-regularized Newton method, with the primary advancement lying in a more efficient computation of the Newton step by exploiting its structure.
1. In line 295, how sensitive is the algorithm to the initial point $(V_0,Q_0)$?
2. Could this approach generalize to kernelized or probabilistic K-means variants? If so, how would the manifold and retraction structures adapt?
3. How is the second-order condition in Eq.(10) verified numerically? Does this influence empirical runtime?
4. As discussed in Appendix C.1, the Riemannian cubic-regularized Newton is equivalent to certain SQP methods. What are the computational advantages compared to projected Newton or augmented Lagrangian solvers?
5. In the proof of Lemma 4, the fourth equality is incorrect, but I believe it's just a typo and not a significant issue.
6. In Eq. (11), is the coefficient of the Hessian 1/2?
7. Typo on line 1127: this is not a quintic equation, but a quartic one. |
Heavily AI-edited |
|
Scalable Second-order Riemannian Optimization for $K$-means Clustering |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper tackles the nonconvex low-rank factorization of the SDP relaxation of K-means clustering. By penalizing the nonnegativity constraints, the authors view the resulting problem as a manifold optimization problem and apply a Riemannian cubic-regularized Newton method to obtain a second-order solution.
1. Develops a fast algorithm from the manifold optimization viewpoint (Theorem 2), which is novel, at least to me.
2. The subproblems of Riemannian cubic regularized Newton can be solved efficiently.
3. Empirically, 2nd-order critical points are globally optimal.
1. The paper only provides a sublinear convergence rate. Is it possible to prove stronger local convergence results, e.g., a quadratic or superlinear rate? Fig. 1 shows a fast convergence rate in practice.
2. The benign nonconvexity is described in Assumption 1. Is it possible to say something about the benign landscape, for example by relating it to existing benign-nonconvexity results?
1. Is the material in lines 311-355 a new result? If not, it would be better to cite some references.
2. Could the authors explicitly explain why Riemannian cubic regularized Newton can overcome the ill-conditioning and RGD/RTR cannot? |
Fully human-written |
|
Spatial Sign based Direct Sparse Linear Discriminant Analysis for High Dimensional Data |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Spatial Sign-based Direct Sparse Linear Discriminant Analysis (SSLDA), a novel classification method designed to address the critical challenge of robust high-dimensional classification under heavy-tailed distributions. The authors identify that classical Linear Discriminant Analysis (LDA) and its high-dimensional sparse variants often fail when data deviates from the Gaussian assumption, as they rely on non-robust sample mean and covariance estimators.
This paper presents a robust high-dimensional classifier, Spatial Sign-based Sparse LDA (SSLDA), which directly estimates the optimal discriminant direction under elliptical distributions. The method's core innovation lies in replacing conventional, non-robust estimators with the spatial median and the spatial sign covariance matrix, enabling accurate classification even for heavy-tailed data where standard methods fail. The authors provide strong theoretical guarantees, proving the estimator's consistency and establishing optimal convergence rates for the misclassification error.
- **1. Limited Discussion on the Elliptical Distribution Assumption:** The paper's entire theoretical framework relies on the assumption that the data follows an elliptical distribution. This is a potential limitation for real-world datasets that may exhibit significant skewness or more complex, non-elliptical dependency structures. The work could be improved by explicitly discussing the robustness of SSLDA to violations of this assumption. A constructive suggestion would be to include a simulation where data is generated from a clearly non-elliptical (e.g., skewed) distribution to empirically explore the method's performance boundaries and better define its applicability.
- **2. Scope of Experimental Validation Could Be Broadened:** Although the experiments are well-designed, they could be more comprehensive to fully demonstrate generalizability. Specifically, the empirical validation relies heavily on synthetic data and a single, specific image classification task. To more convincingly argue for the method's broad utility, the authors could include experiments on a wider range of real-world benchmark datasets from other domains where high-dimensional, heavy-tailed data is common, such as finance (e.g., stock returns), genomics, or text analysis. This would provide stronger evidence of the method's practical impact beyond the presented application.
- **3. Lack of Comparison with Alternative Robust Sparse Methods:** The paper effectively compares SSLDA against several leading sparse LDA methods. However, it does not include comparisons with other classes of robust, high-dimensional classifiers that are not based on the LDA framework, such as robust sparse logistic regression or support vector machines with robust kernels. Including such comparisons would help to position SSLDA more clearly within the broader landscape of robust classification tools and would provide a more complete picture of its relative strengths and weaknesses.
- **1. On the Robustness Beyond Elliptical Distributions:** The theoretical guarantees of SSLDA are firmly established under the elliptical distribution assumption. Could you please comment on the empirical robustness of SSLDA when this assumption is violated, for instance, with significantly skewed distributions? Have you conducted any preliminary tests on such data? A discussion on the expected behavior or potential modifications to handle non-elliptical data would greatly help users understand the boundaries of the method's applicability.
- **2. On the Generalizability and Practical Impact:** The experimental results on synthetic data and the image classification task are compelling. To further demonstrate the general utility of SSLDA, it would be highly beneficial to see its performance on one or two additional benchmark datasets from domains known for high-dimensional, heavy-tailed data, such as finance (e.g., asset returns) or genomics. This would significantly strengthen the claim of the method's broad practical impact.
- **3. On the Comparison with the Broader Robust Classification Landscape:** The paper provides excellent comparisons against other sparse LDA methods. Could you discuss the rationale for not including comparisons with other paradigms for robust high-dimensional classification, such as $l_1$-regularized robust logistic regression or sparse SVM with robust kernels? A discussion on how you expect SSLDA to perform relative to these alternative approaches, or results from such a comparison, would help to better position your contribution within the entire field of robust classification, not just the LDA family.
- **4. On the Choice of Tuning Parameters:** The method involves a regularization parameter $\lambda_n$. It would be helpful for practitioners if you could provide more detailed guidance on the selection of this parameter in practical scenarios, especially when the underlying distribution is unknown. Did you observe any robust strategies for choosing $\lambda_n$ across different distributional settings in your simulations? |
Fully AI-generated |
|
Spatial Sign based Direct Sparse Linear Discriminant Analysis for High Dimensional Data |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes SSLDA (Spatial Sign-based LDA), a method that constructs discriminant directions using robust estimators of the mean and covariance based on spatial signs, which are less sensitive to extreme values than classical moment-based estimators. Unlike traditional LDA, which relies on sample means and covariances (vulnerable to heavy tails), SSLDA uses spatial sign transformations to achieve stability.
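For readers unfamiliar with the tool, the spatial sign function and the spatial sign covariance matrix that replace the sample mean and covariance are
$$U(\mathbf{x}) = \begin{cases} \mathbf{x}/\|\mathbf{x}\|_2, & \mathbf{x} \ne \mathbf{0}, \\ \mathbf{0}, & \mathbf{x} = \mathbf{0}, \end{cases} \qquad \hat{S} = \frac{1}{n}\sum_{i=1}^{n} U(\mathbf{x}_i - \hat{\boldsymbol{\theta}})\, U(\mathbf{x}_i - \hat{\boldsymbol{\theta}})^\top,$$
where $\hat{\boldsymbol{\theta}}$ is the spatial median $\operatorname*{arg\,min}_{\boldsymbol{\theta}} \sum_i \|\mathbf{x}_i - \boldsymbol{\theta}\|_2$; projecting each centered observation onto the unit sphere is what bounds the influence of heavy-tailed observations.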
Unlike traditional LDA, which assumes Gaussian data and is sensitive to outliers, SSLDA leverages spatial sign-based estimators for mean and covariance.
This provides strong theoretical guarantees, ensuring that SSLDA performs well even in high-dimensional settings.
Simulation studies and real-data experiments demonstrate that SSLDA outperforms state-of-the-art robust LDA methods.
SSLDA relies on the assumption that data follows an elliptical distribution.
The performance of SSLDA depends on the choice of spatial sign scaling parameters.
It is crucial to explore the model's extension to multi-class scenarios. Additionally, the paper does not explicitly address or discuss the computational complexity associated with SSLDA.
The experimental evaluation on real-world datasets remains limited.
Some robust LDA variants, such as regularized LDA and L1-norm LDA, should be addressed.
Moderately AI-edited |
|
Spatial Sign based Direct Sparse Linear Discriminant Analysis for High Dimensional Data |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper focuses on direct sparse linear discriminant analysis for high-dimensional classification. The proposed SSLDA directly estimates the discriminant direction under the assumption of an elliptical distribution, which accelerates training without corrupting the final accuracy. Moreover, the spatial sign-based methodology is introduced to handle heavy-tailed outliers. Theoretical and experimental results demonstrate the superior classification ability of the proposed SSLDA. However, the work seems incremental and the presentation is relatively poor. The detailed comments are summarized in the weakness list. Overall, I think this paper falls short of the ICLR bar.
1. This paper introduces the spatial sign-based methodology to the classic LDA algorithm, which can provide a reference for further research.
2. Experiments show the effectiveness of the spatial sign-based theory in enhancing LDA.
1. The proposed SSLDA seems to be a straightforward combination of existing technologies. As discussed in Section 1, the spatial-sign-based methodology is a mature tool for high-dimensional data classification, and it has been integrated into many machine learning algorithms. This paper combines the spatial-sign-based approach with LDA straightforwardly, without making enough novel and significant improvements.
2. In Section 1, the first main contribution states that ‘we establish theoretical results for SSLDA in the sparse scenario’. What exactly does this theoretical result mean? There is a lack of detailed explanation.
3. Section 1 provides a detailed history of LDA for high-dimensional classification, which is somewhat long-winded, and fails to elaborate on the key concepts relevant to SSLDA. What are the spatial sign and the elliptical distribution? The authors should shorten the review of previous works and provide a more detailed introduction to SSLDA. It is necessary to state clearly and intuitively why the spatial sign can handle high-dimensional heavy-tailed distributions.
4. The comparative algorithms are too old. The latest was published in 2019.
5. SSLDA is presented as a robust classification approach. However, there are no experiments testing the robustness of SSLDA in handling outliers and noisy points.
6. Some equations lack punctuation, such as Eqs. (3) and (5).
Please see the weakness list. There are no more questions. |
Fully human-written |
|
DBMSolver: Fast Diffusion Bridge Sampling for High-Quality Image-to-Image Translation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes DBMSolver, a training-free fast sampler for Diffusion Bridge Models (DBMs) used in image-to-image translation. It derives (i) a closed-form step for the bridge SDE and (ii) an exponential-integrator form for the bridge probability-flow ODE, then uses a truncated Taylor expansion ($k=2$) to build a few-step, higher-order sampler. Experiments on E2H, DIODE, Face2Comics, ImageNet inpainting, and CelebAMask-HQ report better FID at low NFE than baselines.
– The paper is mostly well written.
– The method is training-free, yielding strong results at low NFE without any fine-tuning or retraining procedure.
– Novelty is incremental relative to DPM-Solver. The work largely transposes DPM-Solver’s semi-linear/exponential-integrator recipe to bridges (the non-bridge identity in question is recalled after this list). With the closed-form DBM SDE/ODE from [1] and the generalized ODE solution from [2], the framework appears to extend simply by swapping in bridge formulas. Please state precisely what is technically new beyond adapting DPM-Solver to DBMs and the initial SDE step.
– The bridge-SDE derivation itself does not seem to add practical benefit beyond the standard single SDE “warm” step used to avoid the singularity (as in DBIM [3]) before switching to ODE sampling; Prop. 1 and the algorithm's SDE-style sampling at the first step look very close to DBIM's ODE sampler.
– Appendix B.3 is missing. Therefore, I couldn't check on the analytical solution of Exponential Integral.
– DPM-Solver originally provides $k \in \\{ 1,2,3 \\} $ with order proofs; here the final algorithm fixes $k=2$. A higher-order ablation should be included to justify the chosen order.
– DBMSolver’s gains over DBIM are marginal in several settings. E.g., ImageNet-256 @ NFE=20: DBIM (in original paper) **4.07** vs DBMSolver **4.07**; @ NFE=10 with 2nd-order DBIM vs 2nd-order DBMSolver: **4.33** vs **4.38**. Please compare against DBIM-2nd-order at matched NFE and settings for fairness.
– Result trends are inconsistent on new datasets: while E2H, DIODE, and ImageNet inpainting are close between DBIM and DBMSolver, CelebA-MaskHQ and Face2Comics show large DBMSolver gains. Please explain the cause of this discrepancy.
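For reference, the non-bridge identity in question (DPM-Solver, [2]) is the exact solution of the semi-linear probability-flow ODE in the noise-prediction parameterization,
$$\mathbf{x}_t = \frac{\alpha_t}{\alpha_s}\mathbf{x}_s - \alpha_t \int_{\lambda_s}^{\lambda_t} e^{-\lambda}\, \hat{\boldsymbol{\epsilon}}_\theta(\hat{\mathbf{x}}_\lambda, \lambda)\, d\lambda, \qquad \lambda_t = \log(\alpha_t/\sigma_t),$$
with the $k$-th order solvers obtained by Taylor-expanding $\hat{\boldsymbol{\epsilon}}_\theta$ in $\lambda$; the question above is what changes in this recipe beyond inserting the bridge drift and the $x_T$ boundary condition.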
– What is formally new beyond adapting DPM-Solver’s exponential-integrator to bridges and adding an initial SDE step? Any identity unique to bridges?
– Do you provide a convergence/order analysis for DBMSolver on bridges , analogous to DPM-Solver’s guarantees?
– Why fix $k=2$? What is the effectiveness of higher order (e.g., $k=3$)?
– Please compare DBMSolver vs DBIM (2nd-order) at the same NFE, since DBMSolver uses a 2nd-order solver
– Why does DBMSolver outperform DBIM much more on CelebA-MaskHQ/Face2Comics than on E2H/DIODE/ImageNet?
[1] Zhou, et al. "Denoising Diffusion Bridge Models", ICLR 2024
[2] Lu, et al. “DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps”, NeurIPS 2022
[3] Zheng, et al. “Diffusion Bridge Implicit Models”, ICLR 2025 |
Lightly AI-edited |
|
DBMSolver: Fast Diffusion Bridge Sampling for High-Quality Image-to-Image Translation |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents DBMSolver, a training-free sampler for Diffusion Bridge Models that uses their semi-linear structure and exponential integrators to sample faster with fewer function evaluations.
(i) Empirically, the method attains the strongest NFE-versus-performance trade-off in terms of the standard I2I metrics among training-free baselines.
(ii) The evaluation covers a broader set of domains than the prior works DDBM [1] and DBIM [2], specifically label-to-face generation on CelebA and image stylization on Face2Comics.
[1] https://arxiv.org/abs/2309.16948 Denoising Diffusion Bridge Models
[2] https://arxiv.org/abs/2405.15885 Diffusion Bridge Implicit Models
(i) The main idea of the paper (Proposition 2) seems to already exist in [2] (see Appendix C.4, Eq. 60) and was tested there (Table 6). The only difference between the proposed Algorithm 1 and the earlier DBIM (high-order) algorithm in [2] that I see is the final Euler update step. If this is true, then the claimed novelty is overstated. The authors should clearly explain what is new compared to [2].
(ii) There are also some problems with how the theory is presented:
1. Proposition 1: The authors claim to give an exact solution for the SDE from $x_s$ to $x_t$, but the proof (line 744) uses a first-order Taylor approximation (k = 1). So, it is not exact, and this should be made clear in the text.
2. Incorrect score functions in Equations (2) and (5): These equations use $\nabla_{x_t} \log p_t(x_t \mid x_0, x_T)$, but based on DDBM [1] (see Theorem 1), they should use $\nabla_{x_t} \log q_t(x_t \mid x_T)$. These two are related as:$$q_t(x_t \mid x_T) = \int p_t(x_t \mid x_0, x_T) q_{\text{data}}(x_0\mid x_T) dx_0$$ (see [1], Appendix A.3).
(i) Can the authors clearly explain the theoretical difference between their method and the high-order DBIM sampling in [2]? This would help to understand what is new in this work.
(ii) If the methods are different, it would be more fair to include a direct comparison with high-order DBIM in the experiments.
(iii) Based on the comments above, I suggest the authors revise Proposition 1 and Equations (2) and (5) to be more accurate. |
Fully human-written |
|
DBMSolver: Fast Diffusion Bridge Sampling for High-Quality Image-to-Image Translation |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes a training-free algorithm, DBMSolver, to accelerate existing diffusion bridge models for image-to-image translation problems, based on higher-order exponential-integrator theory for the bridge SDE and bridge ODE. The theoretical contributions are Propositions 1 and 2, which give closed-form formulas for the solutions of the bridge SDE and bridge ODE, respectively. The evaluation of DBMSolver includes sketch-to-image on edges2handbags, normals-to-image on DIODE-Outdoor, face-to-comic, image inpainting with central masks on ImageNet, and semantic label-to-face on CelebAMask-HQ. The method is compared with the closely related exponential-integrator methods DBIM, DDBM, and DPM-Solver++ using the image-realism metric FID and inference time. The results show the superiority of DBMSolver over the compared methods in terms of FID while maintaining a small NFE between 6 and 30.
1. Proposition 1 and 2 provide general results for solutions of Bridge SDE and Bridge ODE, which can be applicable for other data-to-data bridge models, for example, audio restoration models and text-to-speech models.
2. Figure 2 shows faster convergence in terms of FID with increasing NFE for DBMSolver compared with DBIM, so 6-30 NFE is enough for all considered image-to-image translation problems with DBMSolver, while the method remains training-free.
3. The method alleviates the problem of divergence of Bridge ODE for the initial step from Proposition 2 using the Bridge SDE solution from Proposition 1, which unifies the proposed theoretical results.
4. The DBMSolver was tested on 5 various image-to-image translation problems, which supports the practical usage of the method.
1. Lack of comparisons. I appreciate that the major advantage of the proposed method is that it is training-free, which distinguishes it from other methods for accelerating diffusion bridge models, such as distillation. However, there is a gap between the NFE used by DBMSolver and that of distillation methods such as CDBM (He et al.) and IBMD (Gushchin et al.), which achieve good I2I results with 2 or even 1 NFE. In particular, Tables 5 and 6 in IBMD show that the distilled models achieve results very close to those reported in Tables 2 and 5 of DBMSolver with NFE=6, while using only NFE=2 or NFE=1. The paper lacks a clear discussion of the practical positioning of the proposed training-free method relative to existing fast distillation-based diffusion bridge models.
2. Lack of evaluation. The method uses only the FID metric to evaluate image generation quality for four image-to-image translation problems: DIODE, edges-to-handbags, face-to-comics, and CelebAMask-HQ. In the DBIM and DDBM papers, diffusion bridge models for DIODE and edges-to-handbags were also evaluated with other metrics: paired LPIPS and MSE, and IS for image generation. Additional image quality metrics (paired or no-reference) are important in light of the observation in Section 5.2 of the IBMD work, where the authors found that for the edges-to-handbags and DIODE-Outdoor problems the evaluation protocol of DDBM and DBIM effectively reports FID on the training set, while the test sets are too small for FID computation.
3. Lack of discussion relating the proposed Propositions 1 and 2 to existing theoretical results on exponential integrators for bridge diffusion models. The novelty of the proposed theoretical results should be highlighted in relation to existing methods that apply exponential-integrator theory to data-to-data translation models with diffusion bridges. In particular, Propositions 1 and 2 in DBMSolver seem very close to Proposition 3.2 in the Bridge-TTS work of Chen et al.
4. Lack of ablation studies. Since DBMSolver mixes steps from the Bridge ODE and Bridge SDE solutions, there is a question of which formulation provides better practical results. In particular, the authors use $k = 2$ for the Bridge ODE but do not explore $k = 2$ for the Bridge SDE. I also expect the Bridge SDE formulation to provide diversity, which is important for multimodal image-to-image translation problems (see also Tables 1 and 2 in BBDM). An explanation of the choice of the Bridge ODE over the Bridge SDE would clarify the method.
References:
CDBM - Consistency Diffusion Bridge Models. NeurIPS-2024.
IBMD - Inverse Bridge Matching Distillation. ICML-2025.
Bridge-TTS - Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis, arxiv, 2023 (https://arxiv.org/abs/2312.03491v1).
BBDM - BBDM: Image-to-image Translation with Brownian Bridge Diffusion Models, CVPR-2023.
1) Can you comment on comparison between DBMSolver and fast distillation bridge diffusion models for image-to-image translation, such as CDBM and IBDM in terms of quality and inference speed?
2) Can you provide other metrics to evaluate DBMSolver on image-to-image translation problems, such as LPIPS, MSE and IS?
3) Can you explain the choice behind Bridge ODE and Bridge SDE in your method? Can DBMSolver be applicable with Bridge SDE formulation and $k = 2$?
4) Can you comment on diversity of DBMSolver?
5) Since there is an issue with the evaluation protocol for the DIODE-Outdoor and edges-to-handbags problems, can you comment on the results of your method on other image-to-image translation problems, such as JPEG image restoration, which was considered with bridge diffusion models in I2SB (Table 3) and IBMD (Tables 2 and 4)? I ask about these problems because there is a gap between the results of teacher and student diffusion bridge models, as shown in Tables 2 and 4 of IBMD.
6) Can you comment on the relation between Propositions 1 and 2 in DBMSolver and Proposition 3.2 in Bridge-TTS?
References:
I2SB: I2SB: Image-to-Image Schrodinger Bridge, ICML-2023.
Bridge-TTS - Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis, arxiv, 2023 (https://arxiv.org/abs/2312.03491v1). |
Fully human-written |
|
DBMSolver: Fast Diffusion Bridge Sampling for High-Quality Image-to-Image Translation |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work identifies a critical limitation in existing Diffusion Bridge Models (DBMs): current samplers are primarily based on stochastic differential equations (SDEs), which introduce two major issues: low sampling efficiency requiring a large number of steps, and stochasticity from random noise at each step, leading to output uncertainty. While these challenges have been recognized in score-based diffusion probabilistic models (DPMs), they remain unresolved in DBMs. In this study, the authors observe that the Bridge ODE retains the semi-linear structure of the Bridge SDE while enabling fast convergence through Taylor expansion. Building on this insight, they propose DBMSolver, a novel ODE-based diffusion bridge sampler that eliminates the need for stochastic sampling while maintaining high fidelity. Extensive experiments demonstrate that DBMSolver significantly outperforms existing benchmarks in terms of FID and NFE, greatly enhancing the practicality and efficiency of DBMs in image-to-image (I2I) translation tasks.
This paper is well-written and easy to follow. DBMSolver exhibits significant advantages through its efficient and training-free sampling mechanism. By introducing a novel sampler that requires no additional training or fine-tuning, it substantially accelerates the sampling process of DBMs and applies seamlessly to both conditional and unconditional I2I translation tasks. Grounded in a rigorous theoretical foundation, DBMSolver provides exact analytical solutions to ODEs governing diffusion bridge dynamics, ensuring both mathematical soundness and interpretability. Extensive experiments across diverse I2I tasks and image resolutions demonstrate that DBMSolver consistently achieves state-of-the-art performance, outperforming prior leading methods in both image quality and computational efficiency.
1. The paper does not provide a code example or supplementary materials related to implementation. Including a code link or additional details about the experimental setup would greatly enhance reproducibility.
2. The current experiments are conducted on relatively simple datasets and tasks. It would strengthen the work to evaluate the proposed method on more complex scenarios, such as image editing, to further demonstrate its generalization and robustness.
1. In Section 3.3, the paper mentions that an SDE solver is required for the initial steps due to $\rho(\lambda_s, \lambda_T) = 0$. Could the same issue not be addressed by introducing a small constant in the denominator instead? This alternative seems more intuitive and might simplify the implementation.
2. In Appendix B.2, it is unclear why the derivative of the simple logSNR term $\frac{d\log{\frac{\alpha_t}{\sigma_t}}}{dt}$ is omitted, while the derivative of the more complex logSNR term $\frac{1}{\frac{SNR_t}{SNR_T} - 1}\frac{d\log{\frac{\alpha_t}{\sigma_t}}}{dt}$ is retained. Intuitively, discarding the complex term could lead to a more concise and elegant derivation. |
Lightly AI-edited |
|
Splat Feature Solver |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses the feature lifting problem, which aims to transform vision foundation model features (e.g., DINO/CLIP) into Gaussian-splatting-based representations. The authors innovatively formulate this problem as a sparse linear inverse problem and propose a closed-form solver. In addition, they introduce two modules to suppress the noise inherent in foundation model features. Experimental results demonstrate the proposed method’s effectiveness and generalization across different Gaussian splatting kernels. Overall, the paper presents solid contributions both theoretically and practically.
* The paper formulates feature lifting as a sparse linear inverse problem and derives a closed-form solution, which is elegant and theoretically sound.
* The mathematical derivations and reasoning are solid and well-motivated.
* The space allocation in the manuscript is unbalanced — too few visualizations are included in the main text, while most figures are deferred to the supplementary material. Moreover, some figure captions are vague.
* The paper lacks discussion and comparison with feed-forward models related to VGGT, such as Anysplat, which can also lift DINOv2 features to Gaussian-splatting representations. Considering their feed-forward nature, such models are likely to offer faster runtime performance.
* In Table 1, parts (b) and (c) lack highlighted numerical values, making it difficult to visually discern the performance trends.
* The statement “Third, many existing methods are specialized for particular feature types or geometric kernels, which may limit generalization across broader settings” requires further clarification. Specifically, which feature types are those methods specialized for? It would also strengthen the paper to include visual evidence showing that the proposed approach generalizes better across diverse cases.
* In Figure 2, why are the segmentation boundaries so noisy? Moreover, for the two similar and adjacent eggs, why is the method able to segment only one of them?
* According to Section 3.2, the solver is closed-form and enables one-shot estimation without iterative SGD. Theoretically, this should result in very fast inference, yet the runtime is still reported as 1–3 minutes. Could the authors clarify what factors contribute to this computational cost? |
Lightly AI-edited |
|
Splat Feature Solver |
Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
1. The paper formulates feature lifting in splat-based 3D representations as a sparse linear inverse problem $AX = B$ and proposes a closed-form, row-sum-preconditioned solver with a provable $(1+\beta)$-approximation error bound under convex losses.
2. It introduces two regularization strategies, Tikhonov Guidance (to enhance diagonal dominance and numerical stability) and Post-Lifting Aggregation (to filter noisy SAM masks via clustering), and evaluates the method on open-vocabulary 3D semantic segmentation using mIoU on the LeRF-OVS and 3D-OVS benchmarks; a rough sketch of my reading of these two components is given after this summary.
3. Comprehensive ablation studies validate each component, and experiments across multiple splat kernels (3DGS, 2DGS, DBS) and feature backbones (CLIP, DINOv2, ResNet, etc.) demonstrate state-of-the-art performance with minutes-level runtime, confirming both effectiveness and generality.
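For context on my later questions, the following is my own rough sketch of what I understand the closed-form lift and the Tikhonov Guidance term to be doing; the function name, the `lam` parameter, and the row-sum expression are my assumptions and are not taken from the paper.

```python
import numpy as np
import scipy.sparse as sp

def lift_features(A, B, lam=1e-3):
    """Closed-form feature lift for the sparse system A X ~= B.

    A   : (n_pixels, n_splats) sparse matrix of per-pixel splat weights.
    B   : (n_pixels, d) array of 2D foundation-model features.
    lam : Tikhonov-style damping strength (assumed hyperparameter).
    Returns X : (n_splats, d) per-splat features.
    """
    AtB = np.asarray(A.T @ B)                         # (n_splats, d)
    # Row sums of A^T A without forming it: rowsum(A^T A) = A^T (A 1).
    row_sums = np.asarray(A.T @ (A @ np.ones(A.shape[0]))).ravel()
    return AtB / (row_sums + lam)[:, None]            # diagonal closed-form solve

# Toy usage: 6 pixels, 3 splats, 4-dimensional features.
rng = np.random.default_rng(0)
A = sp.random(6, 3, density=0.5, random_state=0, format="csr")
B = rng.standard_normal((6, 4))
print(lift_features(A, B).shape)  # (3, 4)
```

If the actual solver differs substantially from this picture (for example, if the preconditioner acts on $A$ rather than on $A^\top A$), a short pseudocode listing in the paper would remove the ambiguity.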
The paper makes a strong and cohesive contribution: formulating feature lifting in splat-based 3D representations as a sparse linear inverse problem is an original, theoretically grounded perspective that unifies and improves upon prior heuristic, training-based, and grouping-based methods. The proposed closed-form solver with a provable $(1+\beta)$-approximation error bound adds technical depth, while the two lightweight yet effective regularization strategies (Tikhonov Guidance and Post-Lifting Aggregation) address real-world noise and inconsistency without sacrificing efficiency. The work is clearly presented, with a logical flow from problem definition to theoretical analysis and extensive experiments across kernels, features, and benchmarks. Its significance lies in enabling fast, general, and high-fidelity semantic enrichment of 3D scenes, advancing open-vocabulary 3D understanding with both practical impact and theoretical insight.
1. The paper lacks a clear and detailed pipeline diagram: Figure 1 is overly abstract and fails to illustrate concretely how high-dimensional features are assigned to Gaussian splats, making the core lifting mechanism hard to grasp.
2. Despite claiming SOTA performance on LeRF-OVS, the paper provides minimal qualitative comparisons (only Figures 2 and 8, each against a single baseline), severely limiting confidence in the method's robustness across diverse scenes.
3. Table 1(b) reports cosine similarity across feature types but does not link these metrics to downstream task gains, which raises questions about the necessity of this experiment.
4. Additionally, the paper suffers from formatting issues, such as overly large table captions, excessively long figure titles, and inconsistent font sizes in the visuals, which detract from readability and professionalism.
See the weaknesses above. |
Heavily AI-edited |
|
Splat Feature Solver |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Splat Feature Solver (SFS), a self-supervised framework designed to learn 3D scene representations using 3D Gaussian Splatting (3DGS) as the core rendering primitive.
The key idea is to reconstruct multi-view images from learnable Gaussian feature fields, optimizing both photometric and perceptual losses without camera supervision. The authors claim that the model learns geometry-aware features from raw multi-view imagery and achieves competitive results on downstream 3D tasks such as novel-view synthesis and depth prediction.
1. The paper targets a highly relevant goal: efficient and scalable self-supervised 3D representation learning using Gaussian splatting, an area of growing academic and industrial interest.
2. Compared to NeRF-style volumetric sampling, the splatting-based pipeline is computationally lighter and supports faster convergence. The engineering design is practical and well-motivated.
3. The pipeline, loss functions, and training strategy are described with good clarity. Figures are intuitive and well-illustrated.
4. Experiments across multiple datasets show consistent, if modest, improvements over previous self-supervised 3D baselines. Ablation results are included to demonstrate the influence of loss terms and feature solvers.
1. The method essentially reuses the existing Gaussian Splatting pipeline as a self-supervised pretext task, with minor modifications to the loss formulation. The “feature solver” concept adds no clearly new principle beyond standard photometric reconstruction with latent feature regularization. The contribution is incremental and primarily engineering-driven.
2. The paper does not provide any analysis explaining why the proposed self-supervised optimization leads to meaningful 3D representations. There is no exploration of feature alignment, depth consistency, or the information content of Gaussian features. The claim of “self-supervised 3D understanding” is thus empirically unsubstantiated.
3. Although the method is described as “self-supervised,” it implicitly assumes access to approximate camera poses or adjacency constraints during training. The paper does not clarify how SFS handles unposed or unordered images. True pose-free capability is not demonstrated.
4. Reported performance gains over NeRF-based SSL or other Gaussian-based SSL approaches (e.g., GS3, UniSplat) are small (often <1% absolute improvement). Key comparisons to Pose-Free Gaussian Fields, PixelSplat, or DUSt3R are missing, leaving the evaluation incomplete.
1. How does SFS perform on unposed or pose-free datasets compared to pose-supervised ones?
2. Could the authors provide any quantitative measure of learned 3D consistency (e.g., reprojection error or latent geometry alignment)? A rough sketch of the kind of diagnostic I have in mind follows these questions.
3. Are the improvements statistically significant across multiple runs?
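To clarify question 2, a minimal reprojection-consistency check could look roughly like the sketch below; the camera model, function name, and shapes are illustrative assumptions on my part, not an API from the paper.

```python
import numpy as np

def reprojection_error(points_3d, K, R, t, observed_2d):
    """Mean pixel distance between projected 3D points (e.g., splat centers)
    and their observed 2D locations in a target view."""
    cam = points_3d @ R.T + t          # world -> camera coordinates
    proj = cam @ K.T                   # camera -> homogeneous pixel coordinates
    px = proj[:, :2] / proj[:, 2:3]    # perspective divide
    return float(np.linalg.norm(px - observed_2d, axis=1).mean())

# Toy usage: identity camera looking down +z.
pts = np.array([[0.0, 0.0, 2.0], [0.5, -0.2, 3.0]])
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
obs = np.array([[320.0, 240.0], [403.3, 206.7]])
print(reprojection_error(pts, K, R, t, obs))  # small sub-pixel error on this toy example
```

Reporting such a number (or any comparable latent-geometry alignment metric) across held-out views would make the claim of geometry-aware features considerably more convincing. |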
Fully AI-generated |