|
ARMOR: Conceptual Augmentation for Robust Multi-Concept Erasure in Stable Diffusion via Model Retrieval |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents ARMOR, a framework for robust and scalable multi-concept erasure in diffusion models. The approach introduces three main components: (1) conceptual augmentation that expands textual representations using image-derived tokens via textual inversion, (2) lightweight per-concept fine-tuning of cross-attention key/value matrices through a closed-form update, and (3) a retrieval-based composition module that dynamically selects and blends relevant per-concept erasers at inference. Experiments are conducted on object, explicit-content, celebrity, and artistic-style erasure tasks, showing improved results compared to several existing methods (ESD, UCE, MACE, etc.).
1) Addresses a highly relevant challenge: scalable and robust concept erasure in diffusion models.
2) Well-structured and coherent framework combining conceptual token expansion, modular fine-tuning, and retrieval-based composition.
3) Demonstrates consistent improvements across multiple erasure categories and robustness evaluations.
4) Lightweight and modular design makes it efficient and easily extendable to large concept sets.
5) Clear motivation and strong practical relevance to safety and controllability in text-to-image generation.
Weaknesses
1) Missing direct comparison to the most relevant recent baseline (Receler).
The paper cites Receler but does not include experimental results against it. As Receler also focuses on reliable concept erasure with modular lightweight updates and retrieval-based activation, its absence prevents a fair and complete evaluation of ARMOR’s effectiveness.
2) Lack of clear evidence supporting the claimed novelty.
Several components closely parallel prior works: textual-inversion-based concept expansion (Kumari et al., ICLR 2023; Gandikota et al., ICCV 2023), selective cross-attention fine-tuning (ESD, ICCV 2023), and retrieval-based modular erasure (MACE, CVPR 2024; Receler, ECCV 2024). The paper would benefit from more explicit differentiation and justification of its unique contribution.
3) Limited discussion of challenging or failure cases.
While results are strong, the paper primarily highlights successful examples. A more balanced analysis of difficult multi-concept or visually overlapping scenarios would better establish reliability.
Questions
1) How does ARMOR quantitatively compare to Receler under the same setup and datasets?
2) How are multiple per-concept Wₖ/Wᵥ matrices combined during inference without causing interference or instability?
3) Does the retrieval module ever produce false activations, and if so, how sensitive is performance to this? |
Fully AI-generated |
|
ARMOR: Conceptual Augmentation for Robust Multi-Concept Erasure in Stable Diffusion via Model Retrieval |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents a new framework ARMOR to improve the reliability and scalability of concept erasure in text-to-image diffusion models. The method addresses two key issues: the limited robustness of existing approaches against adversarial or synonymous prompts, and the severe quality degradation that occurs when multiple concepts are erased. ARMOR introduces Conceptual Augmentation, which transfers visual features into the text domain to enrich semantic coverage and strengthen robustness, together with Model Retrieval, which fine-tunes the cross-attention key and value projections for each concept and employs a contrastive retrieval module to select the appropriate erasure parameters during inference. This design allows the model to suppress specific undesired concepts while preserving general image generation quality. Extensive experiments demonstrate that ARMOR achieves superior performance compared with state-of-the-art baselines, effectively resists red-team attacks, and delivers over 10 percent improvements in CLIPScore across multiple erasure benchmarks.
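For concreteness, my reading of the retrieval-then-erase inference flow, as a minimal sketch (all names, the threshold, and the blending rule are my own assumptions, not the paper's API):

```python
import torch
import torch.nn.functional as F

def select_and_blend_erasers(prompt_emb, concept_embs, kv_deltas, tau=0.5):
    """Hypothetical sketch: activate per-concept K/V updates whose concept
    embedding is close to the prompt, then blend them by similarity."""
    sims = F.cosine_similarity(prompt_emb.unsqueeze(0), concept_embs, dim=-1)
    active = (sims > tau).nonzero(as_tuple=True)[0]
    if active.numel() == 0:
        return None  # no eraser fires; the base weights are used unchanged
    weights = torch.softmax(sims[active], dim=0)
    # convex combination of the selected per-concept K/V weight deltas
    return sum(w * kv_deltas[i] for w, i in zip(weights, active.tolist()))

# toy usage with random tensors (dimensions are arbitrary)
d, n_concepts = 768, 5
delta = select_and_blend_erasers(
    torch.randn(d), torch.randn(n_concepts, d),
    [torch.randn(d, d) for _ in range(n_concepts)], tau=-1.0,
)
```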
1. The proposed ARMOR framework combines conceptual augmentation and model retrieval in an elegant way. Conceptual augmentation enriches textual representations with visual information, while model retrieval dynamically selects fine-tuned submodels for different concepts, reducing interference and catastrophic forgetting. The closed-form fine-tuning approach is also efficient and theoretically sound.
2. The paper precisely identifies two major limitations of existing concept erasure methods, namely lack of robustness to adversarial or synonymous prompts and poor scalability when erasing multiple concepts. This motivation is well grounded and practically important for safe deployment of text-to-image models.
3. The paper provides extensive experiments across multiple concept types such as objects, styles, nudity, and celebrities. Results consistently show improved robustness, better image quality, and superior performance under adversarial settings compared to strong baselines.
1. The main quantitative metrics are CLIPScore and accuracy on erasure detection, which focus on semantic similarity but not on perceptual fidelity or human satisfaction. Metrics such as FID could provide a fuller picture.
2. The paper does not provide quantitative measurements of training time, inference latency, or memory cost, especially as the number of erased concepts grows. This omission weakens the claim of scalability.
3. Architectural generalization is untested, as all results are reported on Stable Diffusion v1.4 without validation on SDXL or other diffusion backbones.
4. Ablations are incomplete, focusing on the retrieval module and a fine-tuning check, but lacking sensitivity to top-K thresholds, regularization strengths $\lambda$, and the number of augmented tokens.
1. How is the number of learned tokens per concept chosen in practice, and how sensitive is performance to this choice?
My other questions follow from the weaknesses above. If the authors can address these problems, I will consider raising my score.
Fully AI-generated |
|
ARMOR: Conceptual Augmentation for Robust Multi-Concept Erasure in Stable Diffusion via Model Retrieval |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This study is motivated by the need for robust and scalable concept erasure methods, i.e. methods to erase certain concepts from text-to-image diffusion models. The main methodological idea of the presented ARMOR approach is the application of Textual Inversion (TI) to map a generated set of images back into the input embedding space as a way of augmenting the erasure target concept. It is claimed that this inversion-based concept augmentation improves the efficacy of the erasure. The erasure algorithm itself appears to be the closed-form approach from UCE, so there is no contribution on the core algorithm, but the presented augmentation is demonstrated to improve robustness through the help of textual inversion, which in itself is quite similar to what STEREO (Srivatsan et al. CVPR'25) is doing, but without the closed-form erasure process. When it comes to the multi-concept erasure scenario, they propose an inference-time retrieval mechanism that compares the input prompt to the erasure targets of the already existing individual adapters, but their custom retrieval model is only minimally better than taking CLIP off-the-shelf. Overall, this work lacks novelty and timeliness, as every individual component has already been explored (closed-form erasure, textual inversion for target augmentation, retrieval of adapters), and their joint ARMOR approach does not achieve convincing results. It is only developed and tested for SD v1.4, while most existing research on new erasure methods already focuses on newer architectures. Most importantly, this work is motivated by "robustness" and "multi-concept erasure", but it is never actually demonstrated that ARMOR is capable of both at the same time.
- (S1) **Intuitive idea** of deriving a richer concept representation through augmentation by inversion.
- (S2) **Exploration of systematic approach** of using multiple adapters that are activated at inference-time based on the input prompt, following prior works like MACE or SPM.
- (S3) **Great range of erasure baselines and scenarios** that this study uses in its experiments to evaluate the proposed ARMOR method.
I find the following list of things to be major weaknesses:
- (W1) **Distinction to STEREO**: My current understanding: STEREO applies textual inversion during training to deeply collect adversarial examples that it can patch during the erasure, while ARMOR "mines" these augmented examples broadly before the erasure. STEREO uses an ESD-inspired non-linear erasure, while ARMOR uses a linear UCE-based erasure. If that is true, then the main baseline is STEREO (which is missing in Figures 13 and 14, see W6). Generally, ARMOR does not seem to outperform STEREO; however, ARMOR has an advantage: it is likely cheaper or faster due to the internal closed-form erasure inherited from UCE. To answer whether ARMOR and STEREO are on par, it is important to test ARMOR against CCE. I would greatly appreciate some clarifying comments on this from the authors or other reviewers.
- (W2) **No results for reliable multi-concept erasure**: Table 3 suddenly does not show any robustness metrics, even though the motivation of this work is robust and scalable concept erasure. There are better approaches for robustness, and simpler (such as using CLIP for retrieval) or already existing methods (such as MACE) for scalability.
- (W3) **Lack of clarity on the key methodological contribution**: Section 3.4 seems to be largely explaining the methodology of prior work (such as UCE). The proposed approach of applying contrastive learning for a retrieval mechanism proves to be less successful than simply using pre-existing CLIP as a retriever. The relevant contribution is thus the inversion-based augmentation, which I think needs to be a more prominent focus of this work, with more specific experiments for this particular part of the contribution. The reader wants to understand what role this augmentation actually plays, how many augmentations one needs, whether this number differs between scenarios, and how it compares to more naive augmentation approaches like simply rephrasing the prompt, adding noise to the embedding, or using synonyms or translations.
- (W4) **Unfair multi-concept erasure comparison to baselines**: Fine-tuning a separate model for each target and then using a retrieval module makes the comparison unfair to many of the other methods that do in-weight unlearning without adding multiple additional adapters/parameters. When the method relies on multiple of these adapters being ready at inference time, the LoRA adapters cannot be merged back into the model, which fundamentally changes the model and is thus only applicable to API-based black-box scenarios where users do not have system or model access. For example, MACE merges all the target-specific adapters into a single one at the end.
- (W5) **Overall results are not convincing**: The results overall only suggest a slight superiority of ARMOR when it comes to the CLIPScore metric. However, even this advantage is far from clear, and CLIPScore is generally not a very informative or sensitive metric.
And here the minor weaknesses:
- (W6) **Lack of consistency in results**: No consistent comparison to baselines. Figures 13 and 14, for example, miss STEREO. Table 2 misses the most challenging metric: CCE robustness.
- (W7) **No results for SD v2, SD v3, or beyond**, while the field is slowly moving to newer models, as SD v1 pales w.r.t. image quality and prompt adherence in comparison to those newer models.
- (W8) **Figure 7 lacks original samples** before the erasure with the same prompts.
- (Q1) Table 1: It is a bit strange that STEREO, PRUNE, and ARMOR all have the same values for "Others". I would appreciate a small comment on that.
- (Q2) Wrong highlighting in Table 5! The lowest "Erase" accuracies are not achieved by "Ours", right? |
Fully human-written |
|
ARMOR: Conceptual Augmentation for Robust Multi-Concept Erasure in Stable Diffusion via Model Retrieval |
Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes **ARMOR**, a framework for robust and scalable multi-concept erasure in Stable Diffusion. The authors identify two key challenges in existing concept erasure methods: vulnerability to adversarial or synonymous prompts (robustness) and severe degradation when erasing multiple concepts (multi-concept forgetting). ARMOR addresses these by combining **conceptual augmentation**, which back-optimizes text tokens against concept-related images to enrich textual representations, and **model retrieval**, which fine-tunes separate key/value cross-attention layers per concept and uses contrastive learning to select the most relevant “eraser” models during inference. Experiments across four benchmarks—object, explicit content, celebrity, and artistic style erasure—show that ARMOR achieves superior erasure robustness and general image quality, outperforming recent baselines such as MACE, UCE, and SPM by large margins in both quantitative and qualitative results.
1. The paper is well-organized, with detailed mathematical formulation (including a closed-form update for fine-tuning) and a clear pipeline diagram that illustrates the conceptual and retrieval stages.
2. The paper identifies two critical and underexplored challenges in concept erasure: robustness to adversarial prompts and scalability to large numbers of erased concepts, both of which are critical in concept erasure.
1. The proposed ARMOR framework is somewhat incremental and lacks substantial new insights. In Section 3.3 (*Concept-Augmented Dataset Construction*), the preprocessing pipeline largely follows prior work—*STEREO* employs textual inversion for adversarial training, and *MACE* uses SAM for object masking. In Section 3.4 (*Per-Concept Model Fine-Tuning*), the closed-form optimization is essentially a variant of *UCE* without introducing new analytical terms. As a result, the main technical novelty lies in Section 3.5 (*Contrastive Learning for Model Retrieval*). However, as elaborated below, this module is conceptually misaligned with the paper’s original motivation.
2. The authors argue that concept erasure is more effective and harder to bypass than post-filtering approaches, which I agree with—post-processing filters are impractical in real-world concept erasure applications. In black-box settings, simple rule-based or neural filters can already screen undesired content at input/output, while in white-box settings these filters (e.g., the *Stable Diffusion* safety checker) can be easily disabled. From this perspective, the proposed model retrieval module essentially acts as a post-filter mechanism. Regarding the two motivations stated in the paper:
(a) For **robustness**, this module can be trivially bypassed in white-box scenarios, rendering it ineffective against adversarial attacks such as *CCE* or *UnlearnDiff*.
(b) For **multi-concept erasure**, the improved performance mainly stems from decomposing the multi-concept task into multiple single-concept submodels, which is unsurprising. In fact, a simpler keyword-based matching strategy (e.g., matching celebrity names) might yield comparable results. Overall, the retrieval module contradicts the authors’ motivation and weakens the contribution.
3. In Table 2, using CLIPScore to evaluate *nudity* erasure may not be reliable. Most prior works use **accuracy (by NudeNet)** as the main metric for NSFW removal, while CLIPScore is rarely adopted for this purpose. Furthermore, CLIP itself is not well-trained on NSFW data, since such samples are typically filtered out during training, leading to inaccurate text-image alignment for explicit content. Although Figure 3 reports ASR results under “normal” and “adversarial” prompts, the distinction between the two cases is unclear. It would be better to follow standard benchmarks and report **ASR-based metrics** for a fair comparison.
4. In Figure 3, *MACE* achieves the best erasure performance, yet ARMOR is described as an incremental extension of it. It is unclear why ARMOR performs worse in robustness—does the use of textual inversion or other augmentations have a negative effect? In addition, *STEREO*’s reported results differ significantly from its original paper under the same benchmark. The authors should clarify what causes these discrepancies (e.g., different data splits or training hyperparameters).
5. In Table 3 and Figure 4, the performance of *UCE* deviates sharply from its reported results, despite *UCE* also being a multi-concept erasure method. The ΔAcc value (0.44) is much lower than expected, although its protection performance in Table 4 appears normal. It is unclear whether the authors aligned the experimental setup—for example, using 100 celebrities as the retain set for celebrity erasure and 100 styles for style erasure. Since *UCE* only provides the retain set for style erasure in its official release, the authors should double-check the experimental alignment to ensure fairness.
See the weakness part. |
Lightly AI-edited |
|
Modeling Student Learning with 3.8 Million Program Traces |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work presents a set of model training experiments on a large dataset of programming traces from Pencil Code. Specifically, the authors use five models to model students' behaviors and also investigate the representations of both code and students.
+ It is a major contribution to show that, at a large scale, programming traces are useful for modeling students' programming.
- The contribution of this paper is unclear. One key issue is that while it comes from an educational discipline, it does not have any task involving actual educational goals. The five models are all about student behaviors or representations of students or code, but what is the next step? There is almost no educational implication discussed in the work.
- There are some key claims that are counterintuitive for general machine learning tasks. One of the biggest issues is about student embedding -- how exactly can we expect a model learned with student IDs to be generalizable to future new students? There are discussions about the result when new students are involved, and the result says "generalization is still difficult" -- this is almost certain, even for a large language model trained with a lot of data. If you ask GPT-4 who a student is, it likely won't give you any good answer. Generalization cannot help with a task that amounts to overfitting to IDs.
- Line 89: Why is this large dataset suitable for language model training? For small language models, smaller datasets could also work, especially if we want to create models for specific contexts.
- Line 94: The requirement and context of learning will be very important for the final program states. In classroom settings, students' final program states are almost all correct, while in situations of informal learning, there's often a lack of motivation for students to finish programming for many. This is actually not a minor issue -- context is very important for educational applications and this is missing.
- Line 99: We cannot train from IDs, but we can examine certain classes or sessions.
- Line 118: So what is the goal? Education happens in a certain context, and the model will need to show that it surpasses small models in their own context to make sense. Otherwise, we can always use smaller models trained in specific contexts.
- Line 130: While training LMs is important, it is still important to show what exactly would be a good downstream educational task.
- Line 140: I don't get this -- student IDs are involved in training and in this case, any new student will be an unknown input. |
Fully human-written |
|
Modeling Student Learning with 3.8 Million Program Traces |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper presents an analysis of a dataset of 3.8 million code editing traces.
These traces are taken from PencilCode, which is a web-based code editing
platform focused on education. PencilCode allows the user to read and edit code
in both textual and graphical form, and seamlessly switch between the two.
However, this paper focuses on the textual representation.
The paper performs continuous pre-training or fine-tuning (you can argue which)
of GPT2-124M and Olmo-1B using the trace dataset. Each training item is a
sequence of code edits, along with certain metadata such as student ID. The
paper ablates the training data format: using synthetic traces (assuming each
step adds a line), using the ground-truth traces, and using just the final
program. The natural traces perform best on several days. The tasks considered
include getting the trained models to correct errors in student traces (i.e.,
completing a student trace to be correct), predicting the program title from the
trace, etc.
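For illustration, here is a toy construction of the three ablated training formats (a sketch under my own assumptions about the serialization; the paper's exact delimiters and metadata fields may differ):

```python
# Toy trace: each step is a snapshot of the student's program.
trace = [
    "speed 10",
    "speed 10\npen red",
    "speed 10\npen red\nfd 100",
]
student_id, title = "student_0042", "draw-a-line"
SEP = "\n<STEP>\n"  # hypothetical delimiter between snapshots

# (1) natural trace: the real edit sequence plus metadata
natural = f"<ID>{student_id}</ID> <TITLE>{title}</TITLE>\n" + SEP.join(trace)

# (2) synthetic trace: pretend the final program was written line by line
final = trace[-1].split("\n")
synthetic_steps = ["\n".join(final[: i + 1]) for i in range(len(final))]
synthetic = f"<ID>{student_id}</ID> <TITLE>{title}</TITLE>\n" + SEP.join(synthetic_steps)

# (3) final-program-only: just the last snapshot
last_only = f"<ID>{student_id}</ID> <TITLE>{title}</TITLE>\n" + trace[-1]
```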
- This paper presents a dataset that is potentially very interesting. However, I
believe there is no plan to release the dataset publicly.
The primary weakness of this paper is that it is missing several obvious
baselines that involve prompting pretrained models (e.g., any open-weight model
that is 32B+ or even a proprietary model). Since the traces involve program
execution in JavaScript and CoffeeScript, I imagine that a reasonable pretrained
model will pick-up enough in-context signals given the trace and a reasonable
prompt. I thought the most interesting task in the paper was on L428, where the
fine-tuned model completes a prefix of a student-written trace that ended in
failure with a successful trace 60% of the time. I expect that if you give a
broken program or trace to a reasonable pretrained model, it will identify and
fix the bug at least as well. I don't expect a pretrained model to be good at
probing student representations, but it's worth asking if they can do the other
code representation tasks. E.g., asking "will a student backtrack" is similar to
asking "is there a bug".
I also think this paper needs to do a better job engaging with related work.
There is enormous interest in studying how students learn to code, with
and without LLMs:
- BlueJ Blackbox (ICER 2018) has very detailed logs of edit actions. The
ICER paper lists 18 papers that use the dataset.
- FalconCode: https://huggingface.co/datasets/koutch/falcon_code
- StudentEval (this is LLM related): https://huggingface.co/datasets/wellesley-easel/StudentEval
The datasets above are either open or relatively easy to get access to.
See weaknesses. |
Fully human-written |
|
Modeling Student Learning with 3.8 Million Program Traces |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a dataset of over 3.8 million programming reasoning traces from a free online educational platform. The authors develop and compare five model variants trained on this dataset: the trace model, last model, synthetic model, trace downsampled model, and synthetic downsampled model, which are evaluated from both behavioral and representational perspectives. They demonstrate that models trained on full traces acquire stronger representations of student coding behavior compared to models trained solely on synthetically generated traces or final program submissions.
1. This paper is well-motivated, and a decent amount of technical details are given.
2. The idea of modeling students' coding behavior through intermediate traces is both interesting and practical.
1. Insufficient dataset presentation
2. Missing discussion of related work and evaluation metric
3. Lacks user study
4. Limited evaluation to outdated model (GPT-2)
5. Code not provided
1\. **Concerns about the dataset presentation**
A key contribution of this paper is the presented programming reasoning traces dataset. I suggest the authors add a dedicated section within the main text to thoroughly introduce the dataset's features and characteristics rather than placing this important information in the appendix. Additionally, providing an illustrative visualization of the dataset structure would help readers better grasp its organization and content.
2\. **Missing discussion of related work and evaluation metrics**
For the behavioral evaluation, the authors compare generated samples against actual student-written code. This objective seems similar to the work "Open-ended Knowledge Tracing for Computer Science Education" (EMNLP, 2022), which should be cited and discussed. Also, I suggest adopting CodeBLEU—a variant of the traditional BLEU metric specifically adapted for code—as suggested by this related work, as it would allow for a more accurate assessment of similarity between the predicted and actual student code (a minimal usage sketch is given after this list).
3\. **User study**
The authors demonstrate that their trace model can help students recover from errors. I suggest that the authors conduct a user study in real educational settings to further validate this claim. Such an evaluation would provide valuable empirical evidence for the practical effectiveness of the proposed model.
4\. **Clarification on Figure 6 results**
In Figure 6, as the number of fine-tuning traces increases, the performance on trace generation appears to be lower compared to final program generation. Could the authors provide a more detailed analysis or explanation of this phenomenon?
5\. **Evaluation on more advanced language models**
The authors conduct experiments using base GPT-2 and OLMo-2 models. Given that GPT-2 is somewhat outdated, I suggest extending the evaluation to include more advanced models, such as those from the Llama series or other state-of-the-art LLMs, to further strengthen the generalizability of the findings.
6\. **Code and reproducibility**
The authors are encouraged to release the code to facilitate reproducibility and benefit the research community.
7\. **Typo**
Page 4, line 202: "a an" → "an" |
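Regarding point 2 above, a minimal CodeBLEU usage sketch, assuming the open-source `codebleu` PyPI package (one implementation among several; Pencil Code programs are in CoffeeScript, which this package does not support, so JavaScript output or another supported language would have to serve as a proxy):

```python
from codebleu import calc_codebleu  # pip install codebleu

predicted = "var x = 10;\nconsole.log(x + 1);"
actual = "var x = 10;\nconsole.log(x + 2);"

result = calc_codebleu(
    [actual], [predicted], lang="javascript",
    weights=(0.25, 0.25, 0.25, 0.25), tokenizer=None,
)
print(result["codebleu"])  # combines n-gram, syntax (AST), and data-flow match
```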
Lightly AI-edited |
|
Modeling Student Learning with 3.8 Million Program Traces |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces a 3.8M programming trace dataset from Pencil Code and trains language models to capture student coding behavior, comparing models trained on real traces, synthetic traces, and final programs only. The focus on modeling "how" students code rather than just "what" they produce is interesting, but the scope is limited and the experimental section needs significant reorganization.
* The focus on "how" students code instead of just "what" they produce is a valuable perspective shift for modeling programming behavior.
* The 3.8M trace dataset from real students over 9 years is substantial and could benefit the education and code generation community.
* The base models (GPT-2 124M, OLMo-2 1B) are outdated. Modern models (such as qwen3, starcoder) would be more convincing baselines.
* Line 132 mentions "reported in Table 3" but I cannot find Table 3 anywhere in the paper.
* The entire work is based on one platform (Pencil Code) teaching "simple programming concepts" with visual blocks. This feels too narrow for ICLR. There is no evidence the findings generalize to other languages, platforms, or more complex programming tasks.
* The citation format does not follow ICLR style. Please check the formatting guidelines.
* Figure 3 has overlapping numbers that make it hard to read. Please fix the visualization.
* The experimental results section is very hard to follow. There are too many sub-research questions (5 major sections, each with multiple questions) but they are not well-justified. For example, Section 4.1 asks "Can models generate code that reflects real student programming behavior?" but I don't understand why this matters. The model is still just generating programs, so what is the point? The later experiments on probing and adapting to new students are more interesting, but they get buried.
* The paper tries to answer too many research questions at once. The authors should narrow down to 2-3 core questions and go deeper on those instead of spreading thin across many shallow analyses. More research questions does not equal a better paper.
see weakness above |
Heavily AI-edited |
|
Optimal Sub-data Selection for Nonparametric Function Estimation in Kernel Learning with Large-scale Data |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper deals with the problem of estimating a nonparametric function from a given reproducing kernel Hilbert space, when using large-scale (possibly with large sample size) training data.
This is a recurring problem in the field.
Specifically, the work focuses on a kernel Ridge Regression (KRR) problem where the RKHS is assumed given. Then, the goal is to estimate the kernel expansion coefficients and regularization weight (called tuning parameter in the paper) in a data-scalable manner.
For that, the proposed method looks for a fixed number of clustered data subsets whose centers act as a representative values of all the data samples in the cluster, and then solves the KRR problem over the resulting (smaller) data.
Finally, a theoretical discussion about the rate of convergence of the proposed method and experimental results are presented.
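For concreteness, a schematic of the center-based KRR step as I reconstruct it from the description above (the paper's exact weighting by cluster size may differ):

```latex
% With centers C_1,\dots,C_L and cluster-averaged responses \bar{y}_i,
% the reduced KRR problem has the usual closed form:
\hat{\theta} = \big(K_C + \lambda L I_L\big)^{-1}\bar{y},
\qquad
\hat{f}(x) = \sum_{i=1}^{L}\hat{\theta}_i\,K(x, C_i),
% where (K_C)_{ij} = K(C_i, C_j) and \lambda is the tuning parameter
% selected using only the centers.
```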
The presented approach seems novel and is well motivated.
**W1**: To me, the biggest weakness of this paper is an inconsistent, and sometimes vague, mathematical notation.
This makes following the main claims of the paper a tedious task.
Overall, this paper would benefit substantially from a consistent and properly introduced mathematical notation.
To give some examples: \
**W1.1**: The equation in line 132 defines the optimal argument as $\hat{f}\_{N,\lambda\_T}$ .
In the ensuing sections the comma in the subindex disappears. \
**W1.2**: In the expression in line 140, the authors use brackets [ ] to denote a vector. Later, vectors are denoted with parenthesis ( ), e.g., line 177, or curly braces { }, e.g., line 195, instead. Same for the transpose symbol, sometimes is an apostrophe $'$, e.g., line 266, and sometimes a superscript $^T$, e.g., line 268. \
**W1.3**: Some notation is introduced without definition, such as $\hat{f}\_{s\lambda}\^\*$ in line 151, and $\theta_i$ and the subscript $\_{n*n}$ in line 226.
I believe there are also some minor typos/mistakes such as: \
**W1.4**: The sentence spanning lines 158 to 161 defines the centers of the clusters redundantly. \
**W1.5**: In line 192, $\bar{\epsilon}_i$ is using the definition intended for $\bar{y}_i$. \
**W1.6**: In line 196 the $(i,j)$-th element is a set. It should likely be $K(C_i,C_j)$ or a singleton. \
**W1.7**: In line 277 $\mathscr{S}(x_0)$ is defined as a set of $x^*_i$ values; however, in line 279 it is used as the set of indices of those values. \
**W1.8**: There is typo in line 476, ``can effectively use~~s~~''.
**W2**: A discussion of the memory and time complexity comparing the proposed method to alternatives (e.g., Nyström and FALKON) would have been appreciated, for instance in a comparison table.
**Q1**: In table 3, I assume that ``centers-time'' refers to the execution time of the clustering algorithm. Taking that into account, the computation time of the proposed method is still one, or even two, orders of magnitude larger than FALKON and Nyström.
Can you elaborate on that? |
Fully human-written |
|
Optimal Sub-data Selection for Nonparametric Function Estimation in Kernel Learning with Large-scale Data |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces a subset selection method for large-scale nonparametric regression in reproducing kernel Hilbert spaces (RKHS). The main goal is to improve prediction accuracy when computational resources limit the use of all available data. The proposed approach first partitions the dataset using clustering methods such as k-means to identify representative centers, then assigns optimal sampling weights to clusters in order to minimize the pointwise mean squared error of the resulting kernel estimator. Weighted subsampling is performed multiple times, kernel ridge regression models are trained on each subset, and their predictions are averaged. The authors claim to achieve better MSE compared with basic Nyström or FALKON (though only against the original 2017 Matlab version).
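To fix ideas, a schematic version of the pipeline as I read it (the true sampling weights use a transformed kernel vector depending on the center kernel matrix and the regularization; the plain per-cluster $|K(x_0, C_i)|$ below is a simplified stand-in, and all other choices are my assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

def predict_at(x0, X, y, n_clusters=50, n_sub=500, B=10, lam=1e-3, gamma=1.0, seed=0):
    """Schematic: cluster, weight clusters by kernel proximity to the test
    point, draw B weighted subsets, fit KRR on each, average predictions."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    labels, centers = km.labels_, km.cluster_centers_
    # per-cluster weight ~ |K(x0, center)| -- a simplified proxy for the
    # paper's transformed kernel vector involving (K_C, lambda)
    wc = np.abs(rbf_kernel(x0.reshape(1, -1), centers, gamma=gamma)).ravel()
    counts = np.bincount(labels, minlength=n_clusters)
    p = wc[labels] / counts[labels]
    p = p / p.sum()
    preds = []
    for _ in range(B):
        idx = rng.choice(len(X), size=n_sub, replace=False, p=p)
        krr = KernelRidge(alpha=lam, kernel="rbf", gamma=gamma).fit(X[idx], y[idx])
        preds.append(krr.predict(x0.reshape(1, -1))[0])
    return float(np.mean(preds))
```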
The topic is interesting and the paper targets the well-known bottleneck of kernel methods, i.e. scalability. The clustering and sampling procedure with optimal weights can be interesting, but a clearer and more precise exposition, and deeper experiments are needed to evaluate it.
### **Critical Observations**
1. **Outdated comparison**
The paper compares only against the *original* **FALKON (2017)** and a basic **Nyström** baseline, but not against more recent or optimized implementations — for example, *“Kernel methods through the roof: handling billions of points efficiently”* (Meanti et al., NeurIPS 2020), which provides a **modern, fast version of FALKON**. As a result, the experimental comparison does not reflect the current state of the art.
2. **Misleading comparison and trivial findings**
The comparison is conceptually weak. Works such as *“Less is more: Nyström computational regularization”* (Rudi et al., 2015) and *“FALKON: An optimal large scale kernel method”* (Rudi et al., 2017) have already shown that Nyström-based methods can **achieve optimal learning rates** while maintaining **computational efficiency**, provided that a **sufficient number of landmarks** are used (e.g., √n in kernel ridge regression).
It is therefore **expected** that using too few subsampled points degrades accuracy, and that more sophisticated (but slower) sampling strategies can improve MSE. Even a fully greedy point-selection scheme — which maximizes accuracy at each step — would outperform uniform sampling, but at a much higher computational cost.
-> Consequently, the reported “superior accuracy” is **trivial and somewhat misleading**: it arises from sacrificing scalability, missing the main purpose of FALKON, which is to remain **fast while preserving accuracy** once enough centers are used.
3. **Computational inefficiency**
The proposed method is **consistently slower** than FALKON (often by an order of magnitude) while achieving only **moderate MSE improvements**.
-> The key insight of FALKON — maintaining accuracy while being computationally efficient — is not addressed or appreciated in this work.
4. **Clarity and presentation issues**
The paper is **poorly structured and very difficult to follow**. Among other issues, definitions and notations are often unclear or inconsistent:
- The meaning of \( y_{ij} \) (double index) is never explained.
- The notation \( \hat{w}_{x,i,C} \) contains an unexplained hat (possibly a typo).
- \( K_{xi} \) is defined *after* Equation (2) but used *before* it.
More than that, theoretical assumptions are **scattered and vaguely presented**, often embedded directly within theorem statements without being clearly introduced and discussed. How do these assumptions compare with the rest of the literature (which is **largely** missing in the entire paper)?
Overall, the exposition lacks precision and logical flow, making the paper unnecessarily hard to read and evaluate.
Most of my concerns are already expressed above under Weaknesses.
- Why is the most recent literature (both theory and algorithms) not considered?
- In the Results on page 8, it is said that the method "remains efficient", but it is 100x slower than the old version of FALKON; how can it be considered efficient when datasets contain billions of points?
Lightly AI-edited |
|
Optimal Sub-data Selection for Nonparametric Function Estimation in Kernel Learning with Large-scale Data |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies kernel ridge regression (KRR) in reproducing kernel Hilbert spaces (RKHS) in the context of large sample sizes ($N$) and proposes a sub-data selection scheme aimed at matching or improving the statistical accuracy while simultaneously reducing computational complexity. The method clusters inputs into $L$ groups (via $k$-means or variants), chooses a tuning parameter using only the cluster centers, and then, for each test location $x$, assigns sampling probabilities to training points proportional to $|K_x|$ (a transformed kernel vector that depends on $x$, the chosen kernel, the center kernel matrix, and the tuning parameter $\lambda$). The estimator averages KRR fits over $B$ resampled subsets of size $n$ much smaller than the sample size $N$. The paper provides a clean derivation of KRR rates and the optimal tuning parameter value $\lambda_T$ under eigenvalue decay (Theorem 1) and an asymptotic IMSE rate for the proposed estimator (Theorem 2), showing improvement over simple random sampling (SRS) under assumptions on the informativeness-profile-based weights $\omega_{x_0,j}$. Simulations and a real-data study show competitive IMSE vs. full-data KRR and improved test MSE versus the Nyström method and FALKON at comparable budgets.
The conditional MSE-minimizing weights $w_{x,i,C}(x)\propto |K_{xi}|$ derived from the cluster-center surrogate model are simple, somewhat interpretable, and targeted at the desired loss (pointwise prediction at a given $x$). Theorem 1 recovers the optimal full-data KRR rate of convergence under different choices of target RKHS function class and eigendecay rate, while Theorem 2 formalizes how an informative sampling profile (captured by $\omega_{x_0,j}$) yields an IMSE rate faster than SRS for fixed $n$. The paper shows the value of using information in the full input data to determine sub-data selection.
Theorem 2 makes the assumption $\omega_{x_0,j}\asymp j^{2\beta}$ with $0\le 2\beta<k$ (and $2\beta\le k\le 4\beta$). However, it is not shown that the proposed $k$-means + $|K_x|$ weighting mechanism induces such a condition, for any $\beta>0$, even under standard kernels and a benign marginal input distribution $\pi(x)$. As written, the theorem establishes rates for a class of informative samplers rather than for the concrete algorithm. A lemma connecting cluster geometry and $K_x$ to eigenfunction mass could strengthen the result. Also, using Euclidean $k$-means to define centers may not align with the RKHS geometry for non-RBF kernels. A short discussion (or a kernel-$k$-means variant such as $k$-medians) would clarify when the centers faithfully represent $K(\cdot,\cdot)$. In the definition of relative efficiency (RE) on Lines 316-318, the choice of the exponents of IMSE and Time is not well motivated.
Please see the Weaknesses section.
Lightly AI-edited |
|
Optimal Sub-data Selection for Nonparametric Function Estimation in Kernel Learning with Large-scale Data |
Soundness: 3: good
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the computational challenge of large-scale nonparametric function estimation in Reproducing Kernel Hilbert Spaces (RKHS) by proposing a weighted sub-data selection method. Theoretical results show that the proposed method achieves a faster convergence rate than simple random sampling (SRS).
S1. The paper provides a theoretical analysis, demonstrating that its proposed estimator achieves a convergence rate superior to Simple Random Sampling.
S2. The method demonstrates superior performance in MSE compared to established benchmarks.
W1. The technical components of the proposed method rely on established algorithms and principles that lack innovation. For instance, the core clustering step relies on k-means, a classical algorithm initially proposed more than six decades ago (e.g., by Stuart Lloyd in 1957). Similarly, the use of variance minimization to derive optimal weights is a long-standing principle in classical optimization and statistics. While the integration of these elements for point-wise prediction in RKHS is applied to a specific setting, the overall methodology constitutes a recombination of existing techniques rather than a conceptually novel contribution. Furthermore, the problem being addressed, scaling kernel methods, does not align with pivotal frontier challenges in contemporary AI research, thereby limiting the broader impact and relevance of this work.
W2. The paper suffers from several issues in scholarly rigor and exposition that impact its professionalism and readability, for example:
1. In Section 2, the theoretical foundation lacks references to crucial theorems like Mercer's theorem and the Representer Theorem. Their absence makes the theoretical setup incomplete.
2. The meaning of notation is sometimes unclear. For instance, in Section 2.1, the expression {K(x, C_1), ...} does not specify whether the braces {} denote a set or a vector, creating unnecessary confusion.
3. Incorrect formatting of references is present throughout the manuscript, which diminishes the work's professionalism and adherence to standard academic conventions.
Q1. In the first part of Section 2, the parameter $\lambda_T$ is introduced. Could you please clarify the relationship and distinction between $\lambda_T$ and the regularization parameter $\lambda$ used later in Section 2.1? Specifically, what does the subscript $T$ denote, and what is the conceptual or mathematical difference between these two symbols?
Q2.The proposed sampling strategy appears to rely on a clustering step. Could you please discuss the potential sensitivity of your method's final performance to the specific clustering algorithm chosen? Have you experimented with different clustering methods? |
Heavily AI-edited |
|
Focusing on the Riskiest: Gaussian Mixture Models for Safe Reinforcement Learning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes GMM-SSAC, a safe RL framework that models the safety-cost distribution using Gaussian Mixture Models (GMMs) and introduces Supremum Conditional Value-at-Risk (SCVaR), the maximum CVaR across mixture components, as a conservative risk metric. The safety critic is trained through a Bellman-consistent incremental EM update, while the actor minimizes an SAC-style objective penalized by SCVaR. Theoretical sections show that SCVaR upper-bounds the mixture CVaR and is a coherent risk measure; experiments on Safety-Gymnasium and velocity-constrained MuJoCo tasks show improved constraint satisfaction with comparable reward.
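For reference, with a GMM critic $\sum_k \pi_k \mathcal{N}(\mu_k,\sigma_k^2)$, the quantities involved have simple closed forms (standard Gaussian CVaR; the paper's exact $\alpha$ convention may differ):

```latex
% CVaR of one Gaussian component at level \alpha (tail mass 1-\alpha):
\mathrm{CVaR}_\alpha\big(\mathcal{N}(\mu_k,\sigma_k^2)\big)
  = \mu_k + \sigma_k\,\frac{\phi\big(\Phi^{-1}(\alpha)\big)}{1-\alpha},
\qquad
\mathrm{SCVaR}_\alpha = \max_k\,\mathrm{CVaR}_\alpha\big(\mathcal{N}(\mu_k,\sigma_k^2)\big),
% where \phi and \Phi are the standard normal pdf and cdf. Note that the
% mixture weights \pi_k drop out of the maximum, which is one source of
% the conservatism relative to the mixture CVaR.
```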
* **Sound extension**: Modeling multimodal or heavy-tailed cost distributions is a reasonable step beyond unimodal Gaussian critics.
* **Intuitive concept**: The SCVaR metric provides an easy-to-understand conservative surrogate for tail-risk control of the GMM.
* **Technical integration**: The paper combines a GMM-based safety critic, an incremental EM fitting step, and a primal-dual-style policy update into a coherent framework.
## Conceptual and Empirical Alignment
1. **Multi-modality assumption**: The paper tries to motivate the necessity of a multimodal safety cost distribution in Fig. 1, but it is unclear to me whether the true safety cost is multimodal. The constraints tested in the experiment section all seem to be unimodal (e.g., velocity limit, distance to hazard). It would help to illustrate scenarios where a unimodal cost model fails and a multimodal one is required to maintain safety.
2. **Missing connection to distributional robust safe RL**: The proposed critic models a full cost distribution (instead of a point estimate like SAC-Lagrangian) and optimizes a SCVaR. This approach is conceptually closer to distributionally robust CMDP formulations than to standard SAC-Lag baselines. The paper would benefit from comparisons or discussion along that line.
## Theoretical Clarity
3. **Significance of Theorem 1**: The paper could discuss the significance of Theorem 1. It's understood that SCVaR ≥ CVaR and implies conservatism. But since the GMM distribution is estimated and available, why not use the CVaR of the GMM distribution directly? Using an upper-bound (SCVaR) introduces additional conservatism and it's unclear to me why it should be used in the optimization program instead of CVaR.
4. **Intuition on the refinement operator $\mathcal{R}$ and $\beta$**: The historical sample set $\Psi(s, a)$ and Bellman-transformed set $\Psi_{\beta}(s, a)$ are both sampled from target safety critic $\mathcal{G}^{\pi}$, albeit at different time points. The paper could benefit from discussing the conceptual stabilization effect of $\mathcal{R}$ and $\beta$. For example, why not simply use the most up-to-date Bellman-transformed set $\Psi_{\beta}(s, a)$ only?
## Implementation Details
5. **Neural update with MSE loss**: Lines 240–245 describe regressing the network to the EM-updated parameters using an MSE loss, but this step is not accounted for in the convergence analysis. Is there an approximation gap introduced by this?
6. **Constraint formulation**: Eq. 18 converts an episode-level cost limit (problem setting, Eq. 2 and Eq. 7) to a per-step constraint, which can be stricter than the original CMDP constraint. The paper should clarify whether the goal is to enforce stricter per-step limits and justify this choice.
7. **Figure choices**: The "training cost" boxplots (Fig. 5 bottom row) convey coarse trends; line plots with confidence bands might better illustrate progression of the training cost.
8. **$\alpha$ interpretation**: Lines 430-431 claim $\alpha = 0.1$ "achieves precise balance," yet this value yields the lowest rewards. The sweet spot appears to be $\alpha = 0.5$, but I think it could vary significantly from task to task (and safety cost distribution shape). Perhaps the paper could discuss more on this.
9. **Missing detail in ablation**: The $\alpha$ value used in the ablation of Section 4.2.1 is not specified.
(related to the Weaknesses above)
1. Could you discuss what example safety tasks demonstrate genuine multimodality?
2. Is the per-step constraint intentional, and how does performance change under episode-level limits?
3. Why is SCVaR preferred over the directly computed mixture-CVaR if both are available from the GMM critic? |
Fully AI-generated |
|
Focusing on the Riskiest: Gaussian Mixture Models for Safe Reinforcement Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes GMM-SSAC, which models the cumulative cost distribution with a Gaussian Mixture and introduces SCVaR, the maximum CVaR across mixture components, to emphasize the worst-case tail among multimodal risks. This matters because single-Gaussian critics can underestimate tail risk and miss heavy-tailed/multi-peaked structure in safety-critical settings; in contrast, SCVaR is conservative (upper-bounds mixture CVaR) and coherent. Empirically, GMM-SSAC reduces safety violations both during training and evaluation while maintaining competitive rewards, with $\alpha$ controlling the safety–reward trade-off.
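For context on the conservatism discussed below, a small numerical sketch comparing SCVaR with the mixture CVaR (Gaussian closed form per component, Monte Carlo for the mixture; the $\alpha$ convention is my assumption):

```python
import numpy as np
from scipy.stats import norm

def gaussian_cvar(mu, sigma, alpha):
    # E[X | X >= VaR_alpha] for X ~ N(mu, sigma^2)
    z = norm.ppf(alpha)
    return mu + sigma * norm.pdf(z) / (1.0 - alpha)

def scvar(mus, sigmas, alpha):
    # supremum CVaR: worst component; the mixture weights are ignored
    return max(gaussian_cvar(m, s, alpha) for m, s in zip(mus, sigmas))

def mixture_cvar_mc(pis, mus, sigmas, alpha, n=200_000, seed=0):
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(pis), size=n, p=pis)
    x = rng.normal(np.array(mus)[comp], np.array(sigmas)[comp])
    var = np.quantile(x, alpha)
    return x[x >= var].mean()

pis, mus, sigmas, alpha = [0.9, 0.1], [0.0, 3.0], [1.0, 0.5], 0.95
print(scvar(mus, sigmas, alpha))                 # conservative upper value
print(mixture_cvar_mc(pis, mus, sigmas, alpha))  # mixture tail mean
```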
- Addresses the limitation of single-Gaussian critics in modeling multimodal risk.
- Solid formulation of SCVaR and clear integration with SAC.
- Empirical results show fewer safety violations on benchmark tasks.
- Novelty is limited relative to the existing WC-SAC (CVaR-SAC with Gaussian costs) and CAL (multiple Gaussian cost estimates with UCB aggregation). The contribution centers on SCVaR and the Bellman–EM projection, which is incremental rather than a fundamentally new safety-risk paradigm.
- The evaluation is conducted on standard RL benchmarks, but the ablations show that performance depends on $K$ and $\beta$; very high $\beta$ degrades performance (variance from EM+Bellman), suggesting instability that merits deeper analysis.
- Runtime/sample-efficiency analysis is missing: while compute setup and wall-times are reported, there’s no per-update overhead or learning-curve comparison against baselines to quantify the cost of EM/GMM (especially as K grows).
- The observed multimodality in cost distributions is assumed rather than empirically validated. Is there any validation for this?
- Does SCVaR fundamentally add value over simply using a lower CVaR confidence level (smaller $\alpha$) with a standard critic?
- How do we know the observed multimodality in cost distributions is real and not an artifact of function approximation noise?
- How stable is the online EM procedure under off-policy distributional shift? |
Heavily AI-edited |
|
Focusing on the Riskiest: Gaussian Mixture Models for Safe Reinforcement Learning |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses a key limitation in CMDP safety constraints by replacing the standard Gaussian cost distribution assumption with Gaussian Mixture Models (GMMs). The authors introduce SCVaR (Supremum Conditional Value at Risk) to capture worst-case risk across GMM components, providing a more robust safety measure than traditional CVaR. To enable online cost distribution estimation without waiting for episode completion, they propose a Bellman-style incremental update that bootstraps from instantaneous safety measures and historical distribution estimates. The approach is evaluated on Safety Gym benchmarks, demonstrating improved safety guarantees while maintaining performance.
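My understanding of the Bellman-style incremental update, as a rough sketch (the mixing of bootstrapped and historical samples, and all constants, are my guesses from the text; the paper's exact operator may differ):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bellman_em_step(gmm, c, gamma, hist_samples, beta=0.5, n_boot=256, seed=0):
    """Schematic critic update: bootstrap cost samples through the Bellman
    backup, mix them with historical samples, and refit the GMM via EM."""
    rng = np.random.default_rng(seed)
    nxt, _ = gmm.sample(n_boot)                 # samples from the target GMM
    boot = c + gamma * nxt.ravel()              # Bellman-transformed sample set
    n_hist = int(beta * len(boot))              # beta controls the mixing ratio
    hist = rng.choice(hist_samples, size=n_hist, replace=False)
    mix = np.concatenate([boot, hist])
    new = GaussianMixture(n_components=gmm.n_components, random_state=seed)
    return new.fit(mix.reshape(-1, 1))          # full EM refit as a stand-in
                                                # for the incremental update

# toy usage
rng = np.random.default_rng(0)
init = GaussianMixture(n_components=2, random_state=0).fit(rng.normal(size=(500, 1)))
hist_samples = rng.normal(loc=1.0, size=1000)
updated = bellman_em_step(init, c=0.1, gamma=0.99, hist_samples=hist_samples)
```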
- This is a nice contribution in terms of modeling cost and a right step for studying CMDPs. As an initial approach to CMDP cost modeling, making such estimates more conservative (rather than just expectation over cost) and integrating the Bellman update for cost distribution, this will inspire more methods that will reframe the cost and approach solving CMDPs with more robust methods compared to GMMs (see weakness and/or questions below on this).
- The paper includes an effective ablation study to validate the components - they show comparisons between CVaR and SCVaR, usage of different number of Gaussian components for GMMs and their sensitivity to it.
- The paper has a nice bit of theory as well, including a proof of Bellman update contraction.
- Sound empirical results including a comparison of SCVaR vs CVaR performance in GMMs.
- The paper presents three figures: the first explains the current use of a Normal distribution for the cost, the second explains SCVaR, and the third explains the algorithm and its integration with the RL environment–actor cycle. I would have preferred a more linear approach to Figure 3 that sequentially explains the RL process and the integration of SCVaR + GMM - this could have been accomplished by rearranging the figure to first show the RL env-actor cycle, breaking the action into three components (policy, value, and cost), and then breaking down the cost to explain the use of SCVaR and GMM.
- Why GMM? Perhaps the choice lies in its simplicity and relatively "easy" treatment in terms of theory. But questions arise about accuracy, expressiveness, parameter efficiency, and so on. The paper actually motivates this choice in terms of expressiveness and the fact that GMMs have universal approximation capability. We have so many popular, SOTA distributional models currently used for other purposes (e.g., as generative models) that might serve as a better basis or backbone for the SCVaR component. I am curious what the authors think about this.
Fully human-written |
|
Focusing on the Riskiest: Gaussian Mixture Models for Safe Reinforcement Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work mainly considers safe RL with safety constraints. Specifically, since CVaR may fail to capture complex, multimodal, or heavy-tailed risks, this work proposes the Supremum Conditional Value-at-Risk (SCVaR) to capture the worst-case tail across all components of a Gaussian mixture. Combined with an EM-based method to update the GMM parameters, the proposed GMM-SSAC (Gaussian Mixture Model-Based Supremum CVaR-Guided Safe Soft Actor-Critic) can estimate SCVaR reliably. Extensive theoretical and experimental results show that GMM-SSAC outperforms previous CVaR-based RL methods.
- I really like the idea of introducing new metrics in safe RL, so I think SCVaR is a clear contribution if the authors can present its advantages in safe RL more clearly (see weaknesses).
- SCVaR has some obvious insights like it can be easily computed in GMM, which is an expressive distributional framework.
- Lots of theoretical analyses clearly state the properties of SCVaR.
- The writing is great and easy to read.
- My major concern is: what are the main advantages of SCVaR compared with CVaR? In my understanding, CVaR considers the tail of the distribution, while SCVaR considers the maximum CVaR over the components. What are the benefits of ignoring the tail distributions of the other components? Providing some theoretical or experimental insight would make this work more solid. Also, a natural question is whether SCVaR can be extended to distributions that cannot be represented by a GMM. Since CVaR is well defined for all distributions, the applicability of SCVaR will be limited if it can only be considered on GMMs (of course, the CVaR of a complex distribution cannot be computed directly, but it can still be estimated).
- Assuming the ground-truth distribution is a GMM, there is always an estimation gap between the fitted GMM and the ground truth. In this situation, what is the relationship between the estimated SCVaR and the ground-truth SCVaR?
- Assuming the ground-truth distribution is **not** a GMM, we can of course fit a GMM and compute the SCVaR of the estimate, but what then is the meaning of this estimated SCVaR?
- In the experiments, a natural ablation study would be to estimate the distribution with a GMM but use the CVaR of the estimated GMM to measure risk, which would give a direct comparison of SCVaR and CVaR.
- Several works on safe RL with different risk measures need to be discussed, e.g., CVaR [1-3] and EVaR [4-5].
Overall, I think this work is currently borderline. I'd be happy to actively join the discussion and adjust my score if the authors can address my concerns.
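For concreteness, a minimal formal comparison under my reading of the definitions (notation mine, not the authors'): for a cost GMM $\sum_{k=1}^K w_k\,\mathcal{N}(\mu_k,\sigma_k^2)$, each component's CVaR has the closed form

$$\mathrm{CVaR}_\alpha\big(\mathcal{N}(\mu_k,\sigma_k^2)\big) = \mu_k + \sigma_k\,\frac{\phi\big(\Phi^{-1}(1-\alpha)\big)}{\alpha}, \qquad \mathrm{SCVaR}_\alpha = \max_{1\le k\le K}\ \mathrm{CVaR}_\alpha\big(\mathcal{N}(\mu_k,\sigma_k^2)\big),$$

whereas the CVaR of the full mixture averages the tail mass contributed by all components according to the weights $w_k$. SCVaR discards the $w_k$ entirely, which is exactly the information loss my question above is about.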
Ref:
[1] Towards safe reinforcement learning via constraining conditional value-at-risk
[2] Efficient off-policy safe reinforcement learning using trust region conditional value at risk
[3] Risk-sensitive reward-free reinforcement learning with cvar
[4] Risk-sensitive reinforcement learning via Entropic-VaR optimization
[5] Evar optimization in mdps with total reward criterion
See weaknesses above |
Fully human-written |
|
Focusing on the Riskiest: Gaussian Mixture Models for Safe Reinforcement Learning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
- The authors propose a risk-averse safe RL algorithm that maximizes reward while reducing the risk measure of the cost return.
- They parametrically estimate the distribution of the cost return using a Gaussian Mixture Model (GMM).
- They propose a coherent risk measure, called SCVaR, which can be computed directly from the GMM parameters (a computational sketch follows the strengths below).
- While using a GMM to estimate the return distribution has been addressed in prior work (GMAC [1]), proposing SCVaR on top of this parameterization is novel.
- The authors analyze the convergence of the proposed Bellman operator.
- The paper is clearly and effectively presented.
[1] Nam, Daniel W., Younghoon Kim, and Chan Y. Park. "Gmac: A distributional perspective on actor-critic framework." *International Conference on Machine Learning*. PMLR, 2021.
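To illustrate the computational convenience noted above, here is a minimal sketch of evaluating SCVaR from fitted GMM parameters (my reconstruction for illustration, not the authors' code):

```python
import numpy as np
from scipy.stats import norm

def gaussian_cvar(mu, sigma, alpha):
    """Closed-form CVaR_alpha of a cost ~ N(mu, sigma^2): the mean of the
    upper alpha-tail, mu + sigma * phi(z) / alpha with z = Phi^{-1}(1 - alpha)."""
    z = norm.ppf(1.0 - alpha)
    return mu + sigma * norm.pdf(z) / alpha

def scvar(mus, sigmas, alpha):
    """Supremum CVaR over GMM components; the mixture weights are
    deliberately unused, so a rare high-cost mode dominates the measure."""
    return max(gaussian_cvar(m, s, alpha) for m, s in zip(mus, sigmas))

# Bimodal cost with a rare high-cost mode (weights would be 0.95 / 0.05):
print(scvar(mus=[0.0, 5.0], sigmas=[1.0, 0.5], alpha=0.1))  # ~5.88
```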
- The introduction lacks analysis of prior work and appears biased.
- They mention only methods approximating the cost return distribution with a single Gaussian.
- However, numerous distributional RL approaches exist for more realistic estimation, such as quantile regression [1], percentile-based methods [2], and moment parameterization [3].
- Omitting these references reveals a limited understanding of prior work.
- Additionally, the authors did not cite GMAC, a prior method that estimates return distributions using a GMM, which is closely related to the proposed method.
- While convergence is shown for the critic, convergence to an optimal policy is not guaranteed.
- Quantile-based parameterization can use various risk measures, but the proposed method is limited to SCVaR.
- This drawback is neither mitigated nor offset by advantages of the proposed method.
- While SCVaR is more conservative than CVaR, adjusting $\alpha$ of CVaR could achieve similar effects.
- Additional analysis of SCVaR's physical properties would help readers intuitively tune $\alpha$ and $K$.
- The experiments include too few risk-constrained RL baselines.
- CAL focuses on conservative policy updates rather than solving risk-defined constraints.
- SAC-Lag is risk-neutral.
- Only WC-SAC is relevant.
- Others, such as CPPO [4], CVaR-CPO [5], and SDAC [6], should be included.
- SRCPO [7], which proves convergence to an optimal policy for risk-constrained RL, is essential for comparison.
- In the experimental results, mean + std exceeds the threshold in all tasks except Ant.
- Despite using risk constraints, this indicates failure to obtain conservative policies.
[1] Bellemare, Marc G., Will Dabney, and Rémi Munos. "A distributional perspective on reinforcement learning." *International conference on machine learning*. PMLR, 2017.
[2] Dabney, Will, et al. "Distributional reinforcement learning with quantile regression." *Proceedings of the AAAI conference on artificial intelligence*. Vol. 32. No. 1. 2018.
[3] Cho, Taehyun, et al. "Bellman Unbiasedness: Toward Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation." *Forty-second International Conference on Machine Learning*.
[4] Chengyang Ying, Xinning Zhou, Hang Su, Dong Yan, Ning Chen, and Jun Zhu. Towards safe reinforcement learning via constraining conditional value-at-risk. In Proceedings of International Joint Conference on Artificial Intelligence, 2022.
[5] Qiyuan Zhang, Shu Leng, Xiaoteng Ma, Qihan Liu, Xueqian Wang, Bin Liang, Yu Liu, and Jun Yang. CVaR-constrained policy optimization for safe reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 2024.
[6] Kim, Dohyeong, Kyungjae Lee, and Songhwai Oh. "Trust region-based safe distributional reinforcement learning for multiple constraints." *Advances in neural information processing systems*, 2023.
[7] Kim, Dohyeong, et al. "Spectral-risk safe reinforcement learning with convergence guarantees." *Advances in Neural Information Processing Systems,* 2024.
- Can it be shown that SC-MGR converges to the ground truth distribution as the number of GMM components $K$ goes to infinity?
- Figure 3 contains too many equations, making it hard to follow. Can it be simplified?
- According to the primal-dual method, should Equation 22 be written as $-\kappa (\Lambda - d)_{\leq 0}$?
- Experiments on $K$ in SCVaR show that larger $K$ increases conservativeness.
- However, the relationship between $\alpha$ and $K$ remains unclear, making it difficult to choose an appropriate $K$.
- Could you provide guidance to help readers select a suitable $K$? |
Fully human-written |
|
Learning Non-Gradient Diffusion Systems via Moment-Evolution and Energetic Variational Approaches |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a two-stage weak-form learning framework for recovering drift decompositions in generalized diffusions without detailed balance. Stage 1 identifies the drift from first-moment evolution; Stage 2 recovers the pseudo-potential using an energy-dissipation law with a physics-motivated orthogonality penalty. The idea of combining weak-form moment evolution with energy-based learning is interesting and potentially impactful for non-gradient stochastic dynamics.
However, several parts of the theoretical formulation, numerical justification, and experimental design remain insufficiently rigorous or clearly motivated.
1. Addresses the important problem of learning non-gradient stochastic dynamics, beyond detailed-balance systems.
2. The weak-form formulation is appealing for noisy data, avoiding higher-order derivative estimation.
3. The paper provides multiple 2D diffusion examples, including noisy and rough potentials, plus ablation studies on penalty and training strategies.
1. Derivation of Equations (6)–(9) lacks rigor and clarity.
1.1 The transition from Eq. (4) to Eq. (6) appears ad hoc and not rigorously derived from the underlying stochastic dynamics or variational principles.
1.2 It is unclear how Eq. (8) is obtained from Eq. (6)—specifically, how the second term in Eq. (6) is eliminated and under what assumptions this simplification holds.
1.3 The statement that “we can minimize (8) in a weak form to learn the pseudo-potential and the rotation part” lacks justification. The rationale for why minimizing this functional corresponds to learning the desired decomposition should be explicitly established.
2. The constraint $\nabla\psi \cdot R = 0$ is enforced only via an integrated (global) penalty term. There is no theoretical argument showing that minimizing this global loss guarantees pointwise orthogonality (see the note after this list). A discussion of this discrepancy and its practical implications would be important.
3. The paper provides no analysis of the consistency, bias, or variance of the proposed estimators. Without such analysis, it is unclear under what conditions the learned drift and potential converge to the true physical quantities. The good numerical results currently shown may depend strongly on the specific form of the training data rather than the generality of the method.
In particular, the dataset includes distributions at long times, which might already be close to the stationary distribution. This could artificially improve the training performance. It is recommended to quantify this effect—for example, by computing and plotting the distance between the data distribution at large t and the stationary distribution—to clarify how much of the observed accuracy stems from near-stationary data.
4. All numerical examples are synthetic 2-D toy problems. The absence of higher-dimensional or real-world cases limits the demonstration of scalability and practical relevance. Moreover, no quantitative evaluation of runtime, efficiency, or robustness across architectures is provided.
5. The assumption $b = -\tfrac{1}{2}\sigma^2\nabla\psi + \tfrac{1}{2}\sigma^2R$ is central to the method but not theoretically or physically discussed. All test cases are artificially constructed to satisfy this decomposition, which weakens the claim of general applicability. The paper should clarify under what conditions this assumption holds and how violations would affect learning performance.
6. The assumption that $\sigma$ is a scalar function is not discussed either.
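To make the concern in point 2 explicit: a penalty of the schematic form (my notation; the paper's exact weighting may differ)

$$\mathcal{L}_{\perp} \;=\; \lambda \int \big(\nabla\psi(x)\cdot R(x)\big)^2\, f(x)\,\mathrm{d}x$$

only enforces $\nabla\psi\cdot R = 0$ $f$-almost everywhere at an exact minimizer; with a finite penalty weight $\lambda$ and quadrature on a grid, the pointwise residual can remain large wherever the density $f$ is small, which is precisely where the decomposition is least constrained by data.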
Please see weaknesses above
Fully AI-generated |
|
Learning Non-Gradient Diffusion Systems via Moment-Evolution and Energetic Variational Approaches |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes a two-stage learning algorithm for generalized diffusion processes with non-gradient drift fields: the first stage learns the drift field from first-moment estimates, and the second stage learns the pseudo-potential part of the drift field by applying a physically consistent penalty in the loss to enforce orthogonality between the pseudo-potential and rotational components. The method builds on prior work such as Lu et al. (2024) and introduces a penalty term based on the pointwise orthogonality of the pseudo-potential and rotational components to improve robustness to noisy density data. Numerical experiments in low dimensions illustrate that the method recovers the rotational components better than baseline approaches.
The paper presents an interesting and physically grounded approach to learning non-gradient diffusion drift decompositions via the Helmholtz decomposition. However, the empirical and theoretical scope remains limited to low-dimensional, synthetic settings. With stronger sensitivity analyses, realistic applications, and tighter parameter guidance, the work could become significantly more compelling.
(1) The application of Helmholtz decomposition to drift learning is conceptually compelling: separating the gradient (pseudo-potential) and divergence-free (rotational) parts aligns with physical modelling of non-equilibrium systems.
(2) The physically consistent penalty enforcing pointwise orthogonality is well motivated, matches the dimensions of the energy dissipation rate, and may improve robustness to noisy data.
(3) The authors provide clear implementation details and present a set of representative synthetic examples, which show improved drift reconstruction over simpler baselines.
(1) The two-stage algorithm requires accurate density function data on a relatively large domain with dense spatial grids; this limits applicability to low-dimensional problems. The manuscript primarily uses numerical solutions of the Fokker–Planck (FP) equation as “given” density data, which raises the question: if the FP drift/diffusion terms are known (so the equation can be solved), then the learning task is less realistic.
(2) There is minimal discussion or theory guiding the choice of the time windows $t_1,t_2$ and $T_1,T_2$ and the penalty strength $\lambda$ with respect to the underlying diffusion process (e.g., drift/diffusion regularity, relaxation time scales, spectrum of the generator).
(3) As acknowledged by the authors, the method cannot currently learn time-dependent pseudo-potentials, rotational components that vary in time, or diffusion processes with nonlocal effects. These restrictions should be discussed more clearly in terms of limitations and future work.
(4) The method depends on an accurate density field; the manuscript lacks any numerical study of how errors in the density data propagate into drift estimation and result in bias.
(5) The numerical examples remain synthetic and low-dimensional. I suggest applying the method to a more practically motivated diffusion (e.g., impurity diffusion in crystalline solids) to demonstrate relevance beyond toy problems.
(1) I wonder how errors in the density data propagate into drift estimation, affect the choice of hyperparameters, and bias the learning objectives.
(2) Following the previous question, I wonder whether some types of errors or noise in the density data are less detrimental to the learning of the rotational components.
(3) I suggest applying the method to a more practically motivated diffusion (e.g., impurity diffusion in crystalline solids) or higher dimensional examples to demonstrate relevance beyond toy problems. |
Lightly AI-edited |
|
Learning Non-Gradient Diffusion Systems via Moment-Evolution and Energetic Variational Approaches |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In this paper, the authors propose a data-driven method to learn the drift vector field of stochastic dynamical systems. Specifically, a two-stage method based on a physically consistent penalty and first-moment evolution is proposed to solve this problem.
- The investigated problem is important.
- The mathematical formulation is clear and well written.
- The idea of separating the rotational field and the potential field is reasonable and interesting.
- The code is not provided, which limits the reproducibility of the work.
- The experimental section is a significant weakness of the paper. The most critical issue is the lack of comparisons with state-of-the-art (SOTA) baselines from top-tier conferences such as ICLR and NeurIPS. The current comparisons are limited to relatively simple methods, many of which are simplified variants proposed by the authors themselves.
- Another weakness lies in the experimental metrics, which are not intuitive, while the analysis tends to be overly subjective. For instance, in line 384, it is stated that “our method still yields reasonably reliable results.” However, it remains unclear what RMSE value qualifies as “reasonable,” as this is highly dependent on the specific scale and context of the system under study. Such claims may lead to confusion.
- The results in Figure 2 are also difficult to interpret. It is unclear from the figure whether the proposed method performs well or poorly. At the very least, a side-by-side comparison with the fields learned by SOTA methods should be provided to better illustrate the effectiveness of the proposed approach.
- The experimental comparisons are currently limited to simplified baselines. Could you include comparisons with state-of-the-art methods to better demonstrate the relative performance and competitive advantage of your proposed approach?
- Please clarify the plans for releasing the source code and the experimental setup. |
Lightly AI-edited |
|
Learning Non-Gradient Diffusion Systems via Moment-Evolution and Energetic Variational Approaches |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose a two-stage method for learning the drift of SDEs in the setting where the ground truth SDE does not satisfy a detailed balance condition. The approach is based on decomposing the drift into a gradient (pseudo-potential) and a rotational term. The framework uses snapshots of the probability density function generated from different initial Gaussian densities, captured at both short and intermediate times. Stage 1 learns the total drift via first-moment evolution, and Stage 2 learns the decomposition using an energy dissipation law.
- The authors identify a tractable subset of the difficult non-gradient SDE learning problem: systems where the drift decomposition satisfies a pointwise orthogonality constraint.
- The paper proposes a novel two-stage learning framework that cleverly combines moment-evolution and energy-dissipation principles.
- The numerical evaluation, while limited to 2D examples, is thorough. It effectively demonstrates the method's robustness to significant data noise, rough potentials, and non-canonical rotations.
- The work is very well presented.
- The method's primary weakness is its data requirement. It assumes access to full, gridded snapshots of the density function, which is unrealistic in most practical applications where data typically consists of sparse, noisy particle trajectories.
- The reliance on gridded data and Riemann sums for integration raises concerns about the method's scalability to high-dimensional problems due to the curse of dimensionality.
- The authors claim applicability to biology and engineering, but the experiments are limited to 2D toy problems.
- The entire method is contingent on the pointwise orthogonality constraint. There is limited discussion on the prevalence of this assumption in real-world systems or how the method's performance degrades if this constraint is only approximately satisfied.
- The noise robustness experiment is a good inclusion. However, could the authors provide a more formal analysis of error propagation? Specifically, if the density f were estimated from sparse data (e.g., via KDE), how would that estimation error propagate through the loss functions?
- How does this method perform in practical applied settings? |
Lightly AI-edited |
|
PromptFE: Automated Feature Engineering by Prompting |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes an automated feature engineering method using large language models (LLMs). The proposed method, termed PromptFE, automatically constructs features in a compact string format by using reverse Polish notation and generates semantic explanations based on dataset descriptions. The experimental result demonstrates that PromptFE outperforms existing automated feature engineering methods on tabular datasets.
- The proposed method leverages canonical Reverse Polish Notation (cRPN) to represent features.
- The effectiveness of the proposed method is verified through experiments on several tabular datasets.
- The overall framework of the proposed method is similar to that of existing methods such as CAAFE (Hollmann et al., 2023). The primary difference lies in the use of RPN for feature representation, so the novelty of the proposed method seems limited.
- It is not clear whether RPN is really effective as a feature representation for LLM-based AutoFE (a small evaluator sketch is given after this list). It would be better if the authors could compare RPN with other representations, such as mathematical expressions or Python code, under the same prompt template setting.
- The number of datasets used in the experimental evaluation seems small compared to prior works.
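For concreteness, a feature such as income/age would be written in RPN as `income age /` (operands first, operator last), and canonicalization fixes an ordering so each feature has a unique string. A minimal stack-based evaluator, with hypothetical column names and an illustrative operator set (not necessarily the paper's), could look like:

```python
import numpy as np
import pandas as pd

BINARY = {"+": np.add, "-": np.subtract, "*": np.multiply, "/": np.divide}
UNARY = {"log": np.log1p, "sqrt": np.sqrt}  # log1p avoids log(0)

def eval_crpn(expr: str, df: pd.DataFrame) -> pd.Series:
    """Evaluate an RPN feature string over a dataframe using a stack."""
    stack = []
    for tok in expr.split():
        if tok in UNARY:
            stack.append(UNARY[tok](stack.pop()))
        elif tok in BINARY:
            b, a = stack.pop(), stack.pop()  # operand order matters for - and /
            stack.append(BINARY[tok](a, b))
        else:
            stack.append(df[tok])  # token is a column name
    assert len(stack) == 1, "malformed RPN expression"
    return stack[0]

df = pd.DataFrame({"income": [30_000.0, 52_000.0], "age": [25.0, 40.0]})
print(eval_crpn("income age / log", df))  # log1p(income / age)
```

A comparison along these lines would let the authors hold everything fixed except the representation string handed to the LLM.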
- The reviewer wonders about the description "PromptFE reduces the search space with pre-defined operators and represents features in compact cRPN." How does the use of cRPN reduce the search space? What is the definition of the search space in this context? Is it possible to quantify the reduction of the search space by using cRPN?
- What is the performance of PromptFE when using mathematical expressions or Python code representations instead of cRPN? This comparison would help in understanding the effectiveness of cRPN for feature representation.
- Could you clarify the rationale for selecting the datasets used in the experimental evaluation?
- The title of this paper seems ambiguous. It might be better to include specific words that reflect the key contribution. |
Lightly AI-edited |
|
PromptFE: Automated Feature Engineering by Prompting |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces PromptFE, a framework using large language models (LLMs) for automated feature engineering (AutoFE) on tabular data. The method iteratively prompts an LLM (GPT-3.5/4) with dataset/attribute descriptions, a predefined set of operators, and a ranked list of the best-performing features generated so far, represented in canonical Reverse Polish Notation (cRPN). The LLM generates new candidate features in cRPN and provides semantic explanations. Features are evaluated using cross-validation, and the scores guide subsequent LLM prompts via in-context learning. The authors report competitive performance against several AutoFE baselines on public datasets
[Note: I have used LLMs to improve my writing and help me answer paper questions]
- Partial Novelty: Uses cRPN to represent features, ensuring uniqueness and providing a structured, concise format potentially easier for the LLM to process and generate compared to code or natural language descriptions.
- Strong Empirical Results: Demonstrates significant performance gains over raw features and achieves competitive or superior average performance against several SOTA AutoFE baselines across multiple datasets and downstream models.
- Novelty: Many concepts are taken directly from CAAFE with only minor adjustments
- Limited Expressiveness of Feature Space: The chosen feature representation (cRPN) combined with a predefined, relatively basic set of mathematical operators (+, -, *, /, log, sqrt, etc.) significantly restricts the complexity and type of features that can be generated. This approach cannot invent novel, domain-specific transformations (e.g., handling date differences, geospatial calculations, complex conditional logic) that might be crucial for certain datasets. Compared to methods generating arbitrary code, the expressive power of PromptFE's feature space is inherently limited to combinations of these fixed operators, potentially failing to capture more intricate data relationships.
- Scalability Limitations (Cost & Evaluation Overhead): Each iteration involves potentially multiple LLM API calls and, more significantly, feature evaluation using cross-validation. Repeated model training for scoring can be computationally expensive, particularly for complex downstream models or large datasets. The overall wall-clock time might still be substantial.
- Heuristic and Opaque Search: The LLM-driven search is fundamentally heuristic, relying on the model's opaque internal mechanisms to interpret scores and examples. There are no guarantees of convergence, optimality, or systematic exploration. The process might fixate on suboptimal patterns derived from early high-scoring features (local optima). Sensitivity to hyperparameters like temperature and k (number of examples) is also noted.
Handling Complex Feature Types: How would PromptFE handle the need for features involving date differences, string manipulations, conditional logic (if-then-else), or interactions with external knowledge, given the limitations of the current operator set and cRPN? Is the framework extensible in this regard?
Search Strategy Robustness: How sensitive is the search process to the initial random features, the number of top-k examples (k), and the LLM's temperature? Is there a risk of premature convergence or unproductive exploration? |
Fully AI-generated |
|
PromptFE: Automated Feature Engineering by Prompting |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes PromptFE, an automated feature engineering framework based on large language models (LLMs). Its core idea is to guide LLMs to generate new features through prompting, while continuously improving feature quality via an in-context learning mechanism. PromptFE employs canonical reverse Polish notation (cRPN) as the feature representation format, ensuring uniqueness and parseability of feature expressions. Methodologically, PromptFE integrates dataset descriptions, field semantics, operator definitions, and high-scoring feature examples into prompts. It iteratively generates and evaluates new features, ultimately selecting the optimal feature subset through cross-validation. Experiments across seven real-world datasets compare PromptFE against four baseline methods, including traditional AutoFE and LLM-based approaches, encompassing both linear and nonlinear models. Results validate PromptFE's advantages in performance, efficiency, and interpretability.
(1) Method design is simple yet effective: PromptFE ingeniously combines the context learning capability of LLMs with feature engineering tasks. By representing features through cRPN, it avoids expression ambiguity and enhances the model's understanding of feature structures and generation quality.
(2) Comprehensive experimentation with thorough comparisons: Systematic evaluations against mainstream AutoFE methods like OpenFE, DIFER, and CAAFE across multiple public datasets—encompassing diverse model types—consistently demonstrate PromptFE's significant performance advantages, achieving over 15% improvement specifically on linear models.
(3) High interpretability with semantic utilization: PromptFE not only generates features but also provides semantic explanations, enhancing feature comprehensibility and credibility. Additionally, ablation experiments validate the critical role of dataset semantic information in feature quality.
(1) Computational efficiency still has room for optimization: Although PromptFE evaluates far fewer features than traditional methods, each round still requires invoking the LLM and training the model for evaluation, resulting in relatively high overall computational costs. We recommend introducing early-stopping mechanisms or uncertainty-based feature selection strategies to reduce redundant evaluations.
(2) Strong dependency on LLMs: The current approach relies heavily on the generative capabilities and semantic understanding of GPT-3.5/GPT-4, without fully exploring the applicability of smaller or locally deployed models. This may limit its use in resource-constrained scenarios. Testing lightweight models like LLaMA-7B is recommended to enhance the method's versatility and controllability.
(3) Lack of stability analysis for generated features: While experiments demonstrate significant performance gains, systematic analysis of feature stability across different random seeds or data partitions is absent. We recommend incorporating feature consistency evaluations (e.g., Jaccard similarity, feature importance stability) to strengthen robustness validation.
(1) Does PromptFE maintain its current efficiency and performance advantages on larger datasets (e.g., millions of samples)? Are there corresponding complexity analyses or scalability experiments?
(2) The current method primarily relies on GPT-series models. Have performance differences been considered when using open-source LLMs like LLaMA or Qwen with identical prompts? Are there impacts from model bias or variations in semantic understanding?
(3) Does PromptFE support user-defined operators or domain-specific knowledge embedding? Are there future plans to introduce a more flexible, extensible framework to accommodate domain-specific requirements? |
Moderately AI-edited |
|
PromptFE: Automated Feature Engineering by Prompting |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper proposes an automated feature engineering framework, PromptFE. The method uses dataset metadata and feature correlations to generate new features; its core is to leverage LLM reasoning by incorporating dataset descriptions into prompts for feature selection. It further iteratively produces feature-enhancement code based on a compact feature representation and a predefined set of operators. Experimental results show that this method improves feature quality and model performance compared to baseline approaches.
The motivation of this paper is sound. Automated feature engineering should be more explainable and visible, and using LLMs to generate features is an interesting direction.
The workflow is well-structured, as it defines a compact feature representation and integrates predefined transformation operators to make the feature generation process interpretable and reproducible.
The experimental evaluation includes comparisons with some existing methods, the proposed method achieves better performance in the given benchmark cases. The paper also provides illustrative examples.
The novelty is limited. Although a comparison is given (in the appendix), I still do not see much difference between this work and CAAFE or LFG.
The algorithm is not well aligned with the method and problem definition, and it is not clearly presented as a standalone procedure.
Regarding the experiments, I do not see large performance gains. The experiments were run on GPT-3.5 and GPT-4 only; more models, especially white-box models such as Llama or Qwen, should be tested. There should also be an ablation study to demonstrate the effectiveness of each component.
See above. Also, the subfigures in Figure 6 are very small; even on my largest screen I find them hard to read.
Fully human-written |
|
A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes MeRF, a clever approach to improving RLVR for LLMs by injecting a natural language description of the reward function directly into the training prompt, termed "in-context motivation". This makes the model aware of the optimization objective during generation, aligning its outputs with desired behaviors more efficiently than the traditional trial-and-error RLVR paradigm, which relies solely on random exploration. Empirical results across benchmarks, including K&K Logic Puzzles, MATH datasets, and the CountDown number game, demonstrate substantial gains and better exploration, as evidenced by higher entropy during training. Ablation studies show that the performance benefit comes primarily from the training process rather than inference-time motivation, and that MeRF is robust to suboptimal or even adverse motivations.
- A novel, simple and very practical approach to improve RLVR, which also makes sense
- Interesting experimental design and results on Q4
- Well presented (in terms of design), making the paper easy to read
- The experimental results are scattered around the paper and do not seem complete:
- Figure 1 includes results on 4 different LLMs and Figure 3 includes results on DeepSeek, but most of these are not presented in Table 1.
- The results in Figure 2 (right) lack details; which dataset is this?
- Figures 1, 3, 5, 6, 7, and 8 all show performance increasing with training steps but are grouped differently (some by metric, some by dataset); this feels repetitive and scattered, and needs better organization.
- Figure 2 (right) and experiment Q3 (Figure 8) essentially convey the same message.
- Several analysis experiments use different base models and datasets while the main results use only a single base model, which makes the analysis models feel cherry-picked.
- The main results need to be complemented with several different base models to show that the method is robust to the choice of LLM.
- The many figures with repetitive messages, the system prompt in the main text, the small number of main results, and the repeated analyses all make the paper less information-dense.
- I would expect the improvement of MeRF to depend strongly on the reward function used for each dataset. What happens if the reward function is much denser (with many different criteria)? What happens if the reward function is harder to express in natural language (e.g., for the MATH dataset)? There is no analysis of where the proposed approach would help most and where it would help least.
- System prompts for tasks other than the K&K puzzle are not provided.
- Some questionable analyses (continued in the questions below).
- Why does MeRF have higher entropy, and is that even a good thing? The paper motivates MeRF as providing structured exploration instead of the naive exploration of standard RLVR, but entropy is more a measure of *unstructured* exploration; I would usually expect lower entropy when moving from unstructured to structured exploration. Why is that not the case here? And what happens if we control the temperature to increase entropy: does that help RLVR?
- It is interesting to see pass@8 saturate quickly with RLVR, so that it is soon outperformed by MeRF (the standard pass@k estimator is recalled below for reference). However, to attribute this early saturation to exploration differences, I think the authors should also show that the samples generated by RLVR rarely achieve high rewards (higher than what the LLM currently gets in expectation). The early saturation might instead be caused by different training dynamics arising from the different prompts.
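For reference, the standard unbiased pass@k estimator (Chen et al., 2021), which I assume the paper uses: with $n$ samples per problem of which $c$ are correct,

$$\widehat{\mathrm{pass@}k} \;=\; \mathbb{E}_{\text{problems}}\!\left[\,1 - \binom{n-c}{k}\Big/\binom{n}{k}\right].$$

Early saturation of pass@8 then only says that $c$ stops growing under RLVR; on its own it does not distinguish weaker exploration from altered training dynamics.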
Fully human-written |
|
A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a simple, low-cost intervention for RL with verifiable rewards (RLVR): append a natural-language description of the reward ("motivation") to each training prompt. The claim is that exposing the policy to the reward structure during rollouts improves exploration (higher entropy, stronger pass@k) and yields consistent gains on K&K logic puzzles, several math benchmarks (AIME/AMC/MATH), and CountDown, even when the motivation text is removed at evaluation. The paper is properly evaluated and ablated.
Simple, well-motivated idea that's easy to implement; the paper reads clearly.
Consistent improvements over RLVR across two model families (Qwen2.5, DeepSeek-R1-Distill) and multiple reasoning benchmarks; importantly, performance holds without motivation at test time.
The method achieves better performance in fewer training steps. For example, in one experiment, MeRF achieved better pass@4 and pass@8 performance at step 140 than the final RLVR model did at step 280.
Currently, MeRF is compared only to standard RLVR. Given that MeRF consists of injecting the reward description into the instruction, consider comparing against tuned-prompt variants via DSPy (https://github.com/stanfordnlp/dspy) to see whether the benefit comes simply from better prompting.
The method's effectiveness is tied to tasks where the reward function is verifiable and describable in simple natural language. This limits the scope of MeRF, making it unclear how it would apply to tasks with more complex or non-describable reward signals, such as human preference scores.
The entropy analysis (Figure 4) shows higher entropy for MeRF, interpreted as "better exploration," but higher entropy could also indicate increased uncertainty. Alternative explanations aren't ruled out.
Can you provide evidence that models actually use the motivation during generation (e.g., attention analysis, probing)?
Catastrophic forgetting / over-alignment: After MeRF training, how does the model perform on unrelated general-purpose tasks? Any drop vs. base/RLVR?
Have the authors analysed how complex the motivation prompt needs to be? How sensitive are results to motivation wording, length, or position? Any robustness sweep?
How much improvement comes from simply having better prompts vs. reward-specific information?
How does performance scale with reward function complexity? |
Fully human-written |
|
A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper introduces Motivation-enhanced Reinforcement Finetuning (MeRF), a method that injects a natural language description of the reward function into the prompt during RLVR training to make LLMs aware of the optimization objective. This leverages in-context learning to improve efficiency over standard RLVR. Contributions include empirical evaluations on logic puzzles and math benchmarks showing performance gains, and analyses on motivation-reward consistency.
1. The approach creatively combines in-context learning with RL by explicitly providing reward rules as "motivation," offering a simple extension to existing RLVR paradigms that could inspire hybrid training methods.
2. Experiments cover multiple models (e.g., the Qwen2.5 series), with consistent comparisons to baselines, providing some evidence of improved accuracy and efficiency.
3. The paper is well-structured, with clear illustrations of the method, prompts, and results, making the core idea accessible.
1. The method is overly simplistic and lacks rigorous theoretical justification; it is unclear how the specific reward scoring rules (e.g., +2 for correctness, -1.5 for understandable but wrong answers; a sketch of such a rule-based reward follows this list) mechanistically influence the model's generation of correct reasoning trajectories. The argument relies too much on intuition without deeper analysis.
2. Extensive experimental evidence is provided mainly for logic puzzles; for more general tasks such as mathematics and code generation, the motivation descriptions appear ineffective or irrelevant, as evidenced by the smaller gains on MATH benchmarks and the absence of code-specific results.
3. Experiments rely heavily on simple numerical comparisons (e.g., accuracy curves), lacking in-depth qualitative analysis, such as case studies of trajectory changes or failure modes, which fails to convincingly support the paper's motivation and leaves readers questioning the underlying mechanisms.
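To pin down what point 1 asks about, the quoted scoring rules presumably amount to something like the following rule-based verifier (a hypothetical reconstruction from the numbers quoted in the paper, not the authors' code; the parser and the -2.0 unparseable case are my own assumptions):

```python
import re

def extract_final_answer(response: str):
    """Toy stand-in parser (illustrative): read an <answer>...</answer> tag."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return m.group(1) if m else None

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Hypothetical rule-based reward reconstructed from the scores quoted
    above; the -2.0 unparseable case is my assumption, not from the paper."""
    answer = extract_final_answer(response)
    if answer is None:
        return -2.0   # output cannot be parsed at all
    if answer.strip() == gold_answer.strip():
        return 2.0    # understandable and correct
    return -1.5       # understandable but wrong

print(verifiable_reward("... <answer>knight</answer>", "knight"))  # 2.0
```

The mechanistic question is then how such a coarse, discontinuous signal shapes the intermediate reasoning tokens that the verifier never inspects.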
1. Could the authors provide a theoretical explanation or ablation on how the reward rules in the motivation prompt causally affect trajectory generation? For instance, why do negative scores for "understandable but wrong" answers guide the model better than a binary reward?
2. Why are gains on math tasks (e.g., only 3-4% average improvement) much smaller than on puzzles? Please elaborate on why the method may not generalize to code or other domains, and suggest experiments to test this.
3. The analysis section mentions Q1-Q4 but seems incomplete in the provided document. Can you expand on deeper insights, such as visualizing prompt-motivation interactions or reward hacking examples, to better convince readers of the method's value? |
Heavily AI-edited |
|
A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
By providing the evaluation criteria in the system prompt (a fine-grained guideline), the RLVR pipeline can be trained more effectively (a sketch of this prompt construction is given below).
Through a series of experiments, the paper demonstrates that, when training a model in a domain where the grading criteria are clearly defined, providing information about the grading scheme alongside the data can significantly accelerate the model's learning process.
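Concretely, the mechanism amounts to something like the following prompt construction (the rubric text and function are my illustration, not the paper's exact prompt):

```python
def build_motivated_prompt(question: str, grading_rules: str) -> str:
    """Prepend a natural-language description of the grading scheme
    to the task prompt used during RL rollouts."""
    return (
        "You will be scored as follows:\n"
        f"{grading_rules}\n\n"
        f"Problem: {question}\n"
        "Show your reasoning, then state the final answer."
    )

# Hypothetical rubric for a verifiable task:
rubric = ("+2 for a correctly formatted, correct final answer; "
          "-1.5 for a readable but wrong answer.")
print(build_motivated_prompt("Is A a knight or a knave?", rubric))
```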
* As mentioned in the limitations, in settings where there is no prior knowledge of how grading will be done, an approach that dynamically identifies the motivation and still solves the problem effectively would have greater scalability.
* Since even a suboptimal motivation still provides non-zero additional information, it is unsurprising that performance improves over RLVR. An interesting phenomenon is shown in Fig. 9, where performance increases after 500 steps even when an adversarial motivation is provided. It would be important to check whether this result appears consistently across multiple runs, as the paper does not seem to report repeated experiments for this setting. Furthermore, when an adversarial motivation is given as the guideline, it may be worth analyzing whether the RLVR training process includes any mechanism that allows the model to ignore or correct such misleading guidance.
Naturally, an ML model can align its answers with the evaluation criteria when grading guidelines are provided, rather than only being told whether an answer is correct. But does that constitute a discovery?
Lightly AI-edited |
|
JudgeLRM: Large Reasoning Models as a Judge |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper establishes a relationship between reasoning ability (enhanced through reinforcement learning) and judge quality. The authors first find a negative correlation between SFT performance and judge quality on reasoning-heavy tasks. They then propose RL-based rewards and train the judge model using RLVR. Experimental results show that JudgeLRMs yield significant improvements.
1. The paper introduces a new dimension of judge quality, i.e., its relationship with reasoning ability.
2. The paper adapts RLVR to pairwise response comparisons and introduces three types of content rewards.
3. The paper conducts experiments using models of various sizes and from different families to validate the effectiveness of the RL method.
1. The authors regard SFT as the opposite of reasoning, which is not convincing and makes the overall claim somewhat confusing.
- Q1: SFT models distilled from GPT-4 should also be capable of generating CoT. As shown in Figure 18, the response lengths of JudgeLRM-3B and JudgeLRM-7B are not particularly long. What, then, is the major difference between the responses produced by the SFT and RL models?
- Q2: What if trajectories from LRMs such as DeepSeek-R1 or the Qwen3 series are used to fine-tune the model via SFT?
2. The ablation study is not comprehensive, so the necessity of the proposed rewards cannot be fully verified. The authors introduce three types of content rewards, but the ablation results only report *w/o r_absolution + r_confidence*.
- Q3: Could you provide more ablation studies to demonstrate that all three rewards are indispensable? I am interested in whether the policy model can learn such complex relationships from a sparse scalar reward.
3. The writing quality is relatively low. In Table 2, there is a nonsensical phrase "yaobuyao," and the table also mixes up *Instruct* and *Ins*. Lines 315–323 lack indices (4) and (5). Additionally, "Qwen3 Base" in Table 3 does not clarify whether it refers to the Qwen3-Base series or a model based on Qwen3 (the reasoning version).
Q4: What are the results of Qwen3-4B and Qwen3-8B without any training? Since they are reasoning models, are they better than their non-reasoning counterparts?
Q5: Is JudgeLRM-8B based on Qwen3-8B-Base or Qwen3-8B? If it is based on Qwen3-8B-Base, why not use Qwen3-8B, given that the focus is on LRMs? If it is based on Qwen3-8B, could you explain why it performs worse than JudgeLRM-7B on PandaLM, considering that Qwen3-8B has been shown to outperform Qwen2.5 in reasoning ability? |
Lightly AI-edited |
|
JudgeLRM: Large Reasoning Models as a Judge |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces JudgeLRM, a family of reinforcement learning–trained models designed for judgment tasks that demand complex reasoning. The central hypothesis is that many judgment and reward-scoring tasks inherently require reasoning, which limits the effectiveness of standard SFT-based judges and reward models. To address this, the authors propose an RL framework that trains models to reason and evaluate by rewarding them based on the accuracy of their predicted scores relative to gold human or frontier model judgments. Experiments demonstrate that JudgeLRM models outperform both SFT and baselines such as BT and DPO, highlighting the importance of using reasoning and RL for developing better judges.
* The motivation is clear: judgment tasks are inherently reasoning-intensive, and this work provides an interesting attempt to align reward learning with reasoning ability.
* The empirical results are strong, showing consistent improvements over SFT and existing baselines (e.g., BT, DPO).
* **Restricted applicability**: The framework seems limited to settings where ground truth scores are available. It’s unclear how it extends to preference-based or binary feedback data.
* **Writing**: The paper is poorly written. The method is not introduced properly, the introduction is too experimental, and the results are not written up cleanly. The baselines and experimental design could be explained significantly better.
* **Heuristic reward design**: The absolute reward formulation and its hyperparameters appear heuristic and tuned to specific score distributions, raising concerns about generalizability.
* **Confidence reward is poorly motivated**: It’s unclear why a non-continuous confidence reward is used, and it seems to specifically encourage overconfidence rather than calibrated predictions.
* **Pairwise-only framework**: Since the model is trained pairwise, it may struggle to produce meaningful single-response scores, limiting its inference-time utility.
* **Weak case study**: The case study relies purely on qualitative analysis, without quantitative evidence to substantiate claims about emergent complex reasoning behaviors (e.g., verification. subgoal setting).
* **Limited related work**: The related work section is underdeveloped. There have been a few other works training reasoning reward models. Even if they are concurrent, some more discussion is necessary.
* **Poor presentation choices**: Table 1's placement is suboptimal; this section should instead feature a main figure summarizing the method or key results.
* **Cluttered introduction**: The introduction includes excessive experimental discussion. These details would fit better in the Motivation or Background section.
1. How would the framework handle preference data (e.g., pairwise comparisons without numeric scores)?
2. Can you clarify the motivation and formulation of the confidence reward? Why is it not continuous or calibrating?
3. I am highly doubtful about the generalizability of the method to different score distributions. For example if the ground truth scores are distributed between 0-100, then reward hyperparameters need to be tuned accordingly.
4. The claim that these rewards encourage calibrated confidence is probably incorrect, as no proper scoring rule [2] is used in the reward function (the definition is recalled after the references below). Can the authors argue why the claim should hold?
5. How were the reward hyperparameters chosen, and how sensitive are results to these values?
6. Since the framework is pairwise by design, how is it adapted for single-response evaluation, and how reliable is that setup? If single-response evaluation is possible, it would be useful to have a baseline trained specifically for that setting to test if relational reasoning is truly necessary.
7. Could you provide explicit equations for the baselines, especially DPO? Is the DPO implementation just the standard version? It is okay to put these into the appendix if space is a constraint.
8. Could you compare against pairwise preference models trained explicitly for this setting (e.g., [1] but also used in multiple other works)?
9. In the “length reward” experiment, why was a threshold-based reward (120 tokens) chosen instead of a more standard continuous reward formulation? Why was 120 tokens picked as a threshold?
10. Can the authors formally define the training setting and data assumptions—specifically, the use of continuous scores between [0,10]—and discuss how this constrains generalizability?
[1]: Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., ... & Piot, B. (2024, July). Nash learning from human feedback. In Forty-first International Conference on Machine Learning.
[2]: Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477), 359-378. |
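For reference, the notion invoked in question 4: a scoring rule $S(q, y)$, the reward for reporting distribution $q$ when outcome $y$ occurs, is proper if truthful reporting maximizes expected score,

$$\mathbb{E}_{y\sim p}\big[S(p,y)\big] \;\ge\; \mathbb{E}_{y\sim p}\big[S(q,y)\big] \quad \text{for all } q,$$

and strictly proper if equality holds only at $q = p$ [2]. A confidence reward that is not (strictly) proper gives the model an incentive to misreport its confidence, typically toward overconfidence.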
Lightly AI-edited |
|
JudgeLRM: Large Reasoning Models as a Judge |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work addresses one of the problems of existing LLM-as-a-judge approaches: their poor performance on tasks requiring complex reasoning. The authors show a negative correlation between the performance gains from SFT and the proportion of reasoning-demanding samples in a given domain. To close this gap, they propose a family of models, JudgeLRM, trained with GRPO using a "judge-wise, outcome-driven" reward function designed to optimize for structural correctness, relational accuracy, absolute score accuracy, and judgment confidence. Their results show that the 3B JudgeLRM model is more accurate than GPT-4 on the human-annotated PandaLM benchmark, and the 7B model outperforms DeepSeek-R1. Additional analysis shows that JudgeLRM gains most on reasoning-demanding tasks where SFT models fail.
- Clear motivation: The authors show that SFT is insufficient for judging tasks requiring some extend of reasoning.
- Technical Contribution: Core contribution -- the judge-wise, outcome-driven reward function -- is novel and give significant results.
- Strong results and in-depth analysis: The authors show that JudgeLRM works better than SFT on reasoning-demanding tasks. They provide ablation studies to justify the reward-function design and qualitative examples of improved reasoning.
- Contradictory statistics in the analysis: In the caption of Figure 3, the authors mention a negative linear trend, but both the plot and the fitted parameters (y = 0.2x − 1.05) indicate a positive slope. In Section 4.3 the authors also state "we observe a correlation coefficient of 0.20 between relative improvement and reasoning rate", which implies $R^2=0.04$, whereas the Figure 3 caption gives $R^2=0.95$.
- "Reasoning" Labeling Methodology: In the two most important figures (Figure 1 and Figure 3) the authors use "Proportion Requiring Reasoning to Judge (%)" metric. The process of creating this metric is only revealed in the appendix. While authors did validate this method against human annotators it still creates a potential circularity. The paper essentially uses GPT-4's own definition of "reasoning" to motivate the need for a new model. The paper then uses this new model to claim it has surpassed the performance of GPT-4 itself. Given that this metric is so central to the paper's argument, the use of an LLM to generate it should be discussed transparently in the main paper, not just in the appendix.
- Overstated claims about surpassing GPT-4: The JudgeLM dataset used to train JudgeLRM was built entirely from GPT-4 answers, and the reward function was designed to match GPT-4's labels. In light of this, claiming that JudgeLRM "surpasses GPT-4" may be too strong. A more precise framing is that this specialized 3B model outperforms its general-purpose teacher (GPT-4) on a different test (the human-annotated PandaLM benchmark). That is still a very impressive result, but the framing is more accurate.
I don't have specific questions, but I would be more than happy to see improvements on the points raised in the weaknesses section.
Fully human-written |
|
Adaptive Dual Prompting: Hierarchical Debiasing for Fairness-aware Graph Neural Networks |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a fairness-aware graph prompting method, ADPrompt, which integrates Adaptive Feature Rectification and Adaptive Message Calibration to mitigate biases in both node attributes and graph structure. The method reduces group bias in GNNs while improving adaptability to downstream tasks. The authors also provide empirical results on multiple datasets showing that ADPrompt outperforms several baselines.
Theorem 1 gives a fairness guarantee for ADPrompt: it shows that ADPrompt reduces initial feature bias and suppresses bias propagation, providing a tighter upper bound on Δ_GSP than a standard GNN. The authors also provide empirical results to support their theoretical findings. These results are very interesting.
Theorem 1 only establishes an inequality (less than or equal to); it does not quantify how large the gap is.
1. Can you explain how tight Eqs. 12 and 16 are? What is the best possible inequality (i.e., the limit) for Eq. 17?
2. Where are the complete proofs of the results in Section 5.2? I cannot check all the details in the current version.
3. The empirical results show that their proposed method achieves the best or highly competitive performance across various pre-training strategies (Table 1). My question is, does a method that can better suppress the bias always show better performance? Can we analyze it quantitatively?
4. Small issue: after Theorem 1, the explanation "This formally shows that ADPrompt reduces initial feature bias and suppresses bias propagation, providing a tighter upper bound on ΔGSP than a standard GNN." should be not part of the theorem. |
Fully human-written |
|
Adaptive Dual Prompting: Hierarchical Debiasing for Fairness-aware Graph Neural Networks |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces ADPrompt, a fairness-aware prompting framework for adapting pre-trained GNNs to downstream node classification while improving group fairness. It comprises two modules: Adaptive Feature Rectification (AFR), which gates feature dimensions via learnable attribute prompts to suppress sensitive information, and Adaptive Message Calibration (AMC), which injects edge-specific structure prompts at each layer to calibrate message passing. Experiments on four datasets and four pre-training strategies show higher accuracy with lower $\Delta\mathrm{EO}/\Delta\mathrm{SP}$ than seven baselines.
+ The modular method is compatible with frozen backbones. AFR and AMC are lightweight prompts on features and messages, easy to add to existing GNNs.
+ The theoretical results are tied to design. The $\Delta\mathrm{GSP}$ upper bound links AFR to reduced initial bias and AMC to damped propagation amplification.
+ Experiments are comprehensive. Four datasets $\times$ four pre-training schemes with seven baselines demonstrate the method's effectiveness.
- The work is restricted to binary $y$ and a single binary $s$. How about multi-class or multi-attribute evaluation?
- AMC learns an edge-specific prompt $e^{(l-1)}_{ij}$ at each layer, implying $\mathcal{O}(|E|\cdot d \cdot L)$ memory/compute overhead, yet the paper does not report runtime/memory comparisons with baselines (a back-of-the-envelope estimate follows this list).
- The fairness bound relies on Lipschitz assumptions and multiplicative factors $\tilde{\gamma}^{(l)}, \tilde{\epsilon}^{(l)}$, but the paper provides no estimators or empirical diagnostics for these constants. It is unclear how the training losses control the bound.
- Individual fairness is not assessed, though structural edits may alter local similarities.
- Potential error amplification from mislabeled $y/s$ is not analyzed.
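To make the overhead concern concrete, here is a back-of-the-envelope estimate; all sizes are hypothetical, since the paper reports none of them:

```python
# Rough memory estimate for edge- and layer-specific prompts e_ij^(l).
# num_edges, dim, and num_layers are illustrative assumptions.
num_edges = 1_000_000   # |E|: edges in the graph
dim = 64                # d: prompt dimension
num_layers = 3          # L: GNN layers

params = num_edges * dim * num_layers     # O(|E| * d * L) floats
print(f"{params:,} parameters ~= {params * 4 / 1e9:.2f} GB in fp32")
# -> 192,000,000 parameters ~= 0.77 GB in fp32; Adam's two moment
#    buffers would roughly triple this footprint.
```

Even at these moderate sizes the prompts alone rival a small GNN backbone, which is why a runtime/memory comparison matters.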
Please refer to the above weaknesses. |
Lightly AI-edited |
|
Adaptive Dual Prompting: Hierarchical Debiasing for Fairness-aware Graph Neural Networks |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper proposes Adaptive Dual Prompting (ADPrompt), a fairness-aware graph prompting framework for adapting pre-trained GNN backbones to downstream node classification while improving group fairness. It introduces two complementary prompt modules: (i) Adaptive Feature Rectification (AFR), a self-gated attribute prompt that suppresses sensitive information at the input; and (ii) Adaptive Message Calibration (AMC), edge- and layer-specific structure prompts to softly calibrate message passing. A min–max objective combines supervised training with an adversary predicting sensitive attributes from the prompted representations. Theoretical analysis bounds group disparity across layers, and experiments on four datasets demonstrate the effectiveness of the proposed method compared to existing baselines.
1. The idea is straightforward and easy to follow.
2. Theoretical analysis of the method is provided.
3. The experimental results across four datasets demonstrate the empirical effectiveness of the proposed method.
1. The writing is unclear and could be substantially improved. Specifically, (i) in Section 1 (Introduction), the authors fail to explain why the proposed method is needed instead of existing fairness-aware graph prompting methods, such as [1]; the challenges mentioned in this section are merely well-known fairness issues of GNNs, leaving the motivation for the proposed design unclear. (ii) The contributions listed in Section 1 should also be largely rewritten: the first two points are essentially the same thing.
2. In Section 4 (Methodology), while some limitations of existing methods are mentioned, the points are not convincing. For example, the authors claim that FPrompt [1] may disrupt critical topological information, but it is unclear how and why it would do so and lead to a performance drop; I could only find the underlying reason when reading Section 4.2. The authors should rewrite this section to make it clearer and, if possible, add preliminary empirical results to validate the claims. This clarification should also appear in Section 1.
3. While theoretical analysis is provided, the exposition in Section 5.2 is not sufficient to conclude Theorem 1. I understand that this subsection aims to give proofs and propose Theorem 1 to validate the effectiveness of the proposed method; however, a detailed proof of Theorem 1 should be included. Otherwise, it is unclear where Theorem 1 comes from and why Eq. (16) leads to it.
4. Computational overhead. AMC learns edge- and layer-specific structure prompts, which may impose significant memory and time costs on dense graphs or deep GNN backbones. The paper omits complexity analysis and runtime/memory profiling relative to baselines.
5. It seems that the framework presumes a binary sensitive attribute is known for all nodes; the AFR module and the adversary rely on this supervision. Real-world graphs often have missing or multi-valued sensitive labels, so more discussion and analysis of this scenario would benefit the paper.
6. The current ablation studies focus on removing AFR or AMC. However, it is also important to investigate the sensitivity of hyperparameters such as $\lambda$, and the transferability of the proposed method. The impact of the number of GNN layers also matters for assessing the effectiveness of the layer-specific techniques.
7. Too many critical analyses are placed in the appendix. The authors should reorganize the paper so that the important analyses appear in the main text.
[1] Fairness-aware prompt tuning for graph neural networks. WWW 2025.
1. Scalability analysis. Can the authors provide runtime/memory comparisons between ADPrompt and GPF/FPrompt on large graphs? Are there ways to sparsify AMC prompts (e.g., top‑k neighbors or low-rank factorization)?
2. Transferability of prompts. If prompts are learned on one dataset or under one pre‑training method, can they be transferred to another domain or backbone without retraining? Preliminary results would be interesting. |
Moderately AI-edited |
|
Adaptive Dual Prompting: Hierarchical Debiasing for Fairness-aware Graph Neural Networks |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents ADPrompt, a fairness-aware prompting framework for adapting pre-trained GNNs to downstream tasks. The core idea involves two adaptive prompting modules: an Adaptive Feature Rectification (AFR) module that purifies node attributes at the input layer to suppress sensitive information, and an Adaptive Message Calibration (AMC) module that dynamically adjusts the message-passing between nodes at each GNN layer to mitigate structural bias. By jointly optimizing these lightweight prompts with a combination of supervised and adversarial losses, the method aims to enhance fairness while maintaining task utility, without updating the frozen pre-trained GNN parameters. Extensive experiments on four datasets under various pre-training strategies demonstrate its effectiveness.
1. This paper is well-written and easy to understand.
2. This paper grounds its proposed method, ADPrompt, in a robust theoretical framework.
1. The paper's primary motivation—using graph prompting for fairness—rests on the assumption that pre-trained GNNs are a valuable and widely adopted resource that should be efficiently adapted. However, this premise is not thoroughly justified. In contrast to large language models or vision transformers, GNNs are often task-specific and can be trained from scratch relatively quickly and efficiently. The claimed benefit of prompting—parameter efficiency by freezing the backbone—is less compelling when the backbone itself (a GNN) is not an exceptionally large or general-purpose model. The paper would be stronger if it provided a more convincing justification for why prompting is the right paradigm for this problem, compared to simply building a fairness-aware objective into an end-to-end GNN training process, which is common in the graph fairness literature.
2. The baselines are mostly graph prompting methods, lacking dedicated, state-of-the-art graph debiasing methods that do not rely on pre-training or prompting (e.g., EDITS [1], FairVGNN [2]).
[1] EDITS: Modeling and Mitigating Data Bias for Graph Neural Networks
[2] Improving Fairness in Graph Neural Networks via Mitigating Sensitive Attribute Leakage
Please see weaknesses |
Moderately AI-edited |
|
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision-Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes RUDDER, a single‑pass inference‑time steering method for LVLMs that (1) extracts a per‑sample direction (CARD) from self‑attention residual updates during prefill and (2) applies a per‑token Beta‑gate to adapt the steering strength during decoding. On CHAIR and POPE, across three LVLMs and three decoding strategies, RUDDER typically matches or outperforms strong ITI baselines, while keeping ~baseline latency/throughput. General ability on MME is largely preserved.
1. Low‑overhead (no extra forwards) with clear efficiency gains versus prior steering methods; quantitative latency/throughput reported.
2. Minimal code hooks; works across three distinct LVLM architectures; integrates with standard decoding loops.
3. Token‑wise gate improves precision vs fixed‑strength steering; ablations show why adaptive > fixed for open‑ended captioning.
4. Solid experiments, including cross‑model, cross‑decoding, efficiency measurements, and layer/parameter sweeps.
1. The "Bayesian‑inspired" gate is heuristic: there is no formal guarantee that the steering always improves the negative log-likelihood, though ablations suggest it works well.
2. Layer choice is model‑specific (e.g., late layers for LLaVA/Idefics2; early for InstructBLIP with Q‑former), and final configs differ substantially across backbones. Ablations confirm a sensitive trade-off between CHAIR scores and recall, implying non-trivial parameter search per model/task.
3. The evaluation scope is modest, evaluation on more capabilities like MM-Vet would be beneficial.
4. No diagnostic attribution of why corrections happen. The paper shows outcome metrics and some internal geometry analyses but lacks faithfulness diagnostics that could verify that the method truly reduces language-prior reliance rather than suppressing certain token types.
1. Exactly which tensors are pooled to form CARD (per‑layer, per‑head residual updates after self‑attention, before MLP)? What pooling (mean, median, head‑weighted)?
2. How were g_min, g_max, and softplus temperature chosen? |
Fully AI-generated |
|
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision-Language Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces RUDDER, a lightweight inference-time framework to reduce hallucinations in LVLMs with (almost) no extra computational cost. RUDDER extracts a Contextual Activation Residual Direction (CARD) vector from residual updates during a single forward pass to capture visual evidence, and applies an adaptive Beta Gate to modulate correction strength per token based on visual alignment. Experiments on benchmarks like CHAIR and POPE show that RUDDER matches or surpasses SoTA hallucination mitigation methods while maintaining nearly identical inference speed and general multimodal performance, making it a practical solution for real-world LVLM deployment.
- **Good Writing.** The writing is clear and easy to follow (although the high-level intuition and motivation could be conveyed better).
- **Extremely Low Computational Overhead.** It's a smart idea to utilize the intermediate results (embeddings, attention heads, etc.) of the pre-filling phase for later steering of the LVLMs. This avoids the repetitive computation typical of contrasting-based methods, and the empirical results on the efficiency analysis support this well.
- **Extensive Experimental Results.** Experiments are conducted in various evaluation benchmarks on multiple LVLM backbones, supporting the main claims of the paper.
+ **Lack of (Sometimes Contradictory) Intuitive Explanation for the Proposed Method.** Despite its practical value in terms of performance and efficiency, I find it hard to understand the rationale behind the proposed method:
  + What is the meaning of the main body of the steering vector $v\_{\text{CARD}}$?
  + Why is pooling the token-wise attention output $\Delta$ a good idea? Does it not cause too much information loss?
  + If the similarity score $g$ between the current token's hidden state $h$ and $v\_{\text{CARD}}$ is high, then this hidden state already contains a lot of visual information. Why would we want stronger steering, $v\_{\text{steer}}=g \cdot v\_{\text{CARD}}$, in this case? Shouldn't we apply more steering to the tokens that lose visual information?
+ **Over-claims about "Bayesian".** It is hard to be persuaded that the gating mechanism is "Bayesian": it is a training-free strategy, and no parameters are updated based on new observations. To me this gating mechanism is at most "adaptive".
+ **Sensitive hyperparameter settings.**
  + The method introduces many hyperparameters, all of which are adjustable: $L$, $\alpha\_{\text{max}}$, $k$, etc.
  + The hyperparameters are set differently for different models and benchmarks, highlighting their sensitivity.
  + It is not clear how the authors found the optimal settings. Were they tuned on an extra validation dataset?
See above (Weaknesses). |
Fully human-written |
|
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision-Language Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces RUDDER, an inference-time intervention technique designed to mitigate object hallucinations in large vision-language models (LVLMs) with minimal computational overhead. The method features two core components: (1) the Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence representation derived from residual updates in a self-attention layer during a single forward pass; and (2) a Beta Gate, a Bayesian-inspired adaptive gating mechanism that dynamically steers generation toward stronger visual grounding. Evaluations on hallucination benchmarks (CHAIR and POPE) and general multimodal assessments (MME) across three LVLM architectures demonstrate that RUDDER achieves hallucination reduction at lower inference costs compared to existing interventions.
1). The CARD vector is efficiently extracted from the standard computation pipeline, while the Beta Gate relies on lightweight vector operations, enabling seamless deployment in latency-constrained real-world applications.
2). The ablation studies and illustrative examples are thorough, providing clear insights into the method's mechanics.
1). The approach shares conceptual similarities with cross-layer methods like DeCo, which integrates early-layer logits into later layers to address visual forgetting, and RUDDER similarly incorporates attention from early generated tokens into later ones. A direct comparison with such methods would strengthen the novelty claims.
2). The method introduces multiple hyperparameters (e.g., $L$, $k$, and $\alpha_{\max}$), raising concerns about its stability and generalizability across diverse VLMs.
3). Performance gains on the test sets are modest, and the method's efficacy diminishes as the underlying VLM's capabilities improve. The baselines are somewhat dated, omitting recent mainstream VLMs like Qwen-VL and InternVL, which calls into question the method's applicability to current open-source models. Additionally, comparisons with recent training-free hallucination mitigation techniques, such as DeGF and AGLA, are absent, limiting the comprehensiveness of the benchmarking.
4). The writing could be refined for clarity and conciseness; for instance, the introduction devotes excessive space to reiterating the need for an "effective and lightweight" solution and listing RUDDER's components, without adequately highlighting key insights, motivations for each module, or defining terms. The final three paragraphs overlap significantly, resulting in low information density.
1). Could you include experimental results on contemporary VLMs such as Qwen-VL and InternVL? Additionally, please add comparisons with current SOTA methods like DeGF and AGLA.
2). To facilitate broader adoption, could you provide a concrete recipe for automated hyperparameter tuning? For example, suggest strategies like grid search, Bayesian optimization, or other efficient approaches for optimizing $L$, $k$, and $\alpha_{\max}$ in new deployment scenarios? |
Lightly AI-edited |
|
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision-Language Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes RUDDER (Residual-Update Directed DEcoding Regulation), a method to mitigate object hallucination in Large Vision-Language Models (LVLMs). The method introduces two key components: (1) the CARD vector, a per-sample visual steering signal extracted at negligible computational cost during the prefill stage, and (2) the Beta Gate, an adaptive token-wise mechanism that dynamically adjusts intervention strength. While the experimental results appear promising, several aspects require clarification and further validation.
1. The paper clearly identifies the trade-off in existing methods—existing approaches incur high computational overhead and require multiple forward passes, which limits their practical deployment.
2. The concept of dynamically adjusting intervention strength based on the model's deviation from visual context is well-motivated.
3. The paper is generally well-structured and clearly written, making it accessible to readers.
1. The Beta Gate design appears fundamentally counter-intuitive. According to Equation (3):
- When $h_{l,t}$ has high similarity with $v_{\text{CARD}}$ (i.e., $\cos(h_{l,t}, v_{\text{CARD}}) \approx 1$), the intervention strength $g_t$ becomes large.
- When $h_{l,t}$ deviates from $v_{\text{CARD}}$ (i.e., $\cos(h_{l,t}, v_{\text{CARD}})$ is negative), the intervention strength $g_t$ becomes small.
- This design contradicts common intuition: one would expect stronger intervention when the model deviates from visual grounding, not weaker. The paper does not adequately justify this seemingly backward design choice (a sketch of the behavior as I read it follows below).
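A minimal sketch of the gate as I read Equation (3); the clipping range and the linear mapping are my assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def beta_gate_steer(h, v_card, g_min=0.0, g_max=1.0):
    """Steering strength grows with the cosine similarity between the
    hidden state and v_CARD; g_min/g_max are illustrative assumptions."""
    sim = F.cosine_similarity(h, v_card, dim=-1)         # in [-1, 1]
    g = g_min + (g_max - g_min) * sim.clamp(min=0.0)     # small when sim < 0
    return h + g.unsqueeze(-1) * v_card                  # apply v_steer = g * v_CARD

# Under this reading, a token already aligned with the visual evidence
# (sim ~ 1) receives the largest push, while a drifting token (sim < 0)
# receives almost none -- exactly the backward behavior noted above.
```

If this reading is correct, the gate amplifies tokens that are already grounded rather than rescuing those that drift, which deserves an explicit justification.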
2. While the paper claims Beta Gate is "Bayesian-inspired," it lacks rigorous Bayesian derivation. The connection between the Beta distribution framework and the specific formulation in Equation (3) is unclear.
3. The paper omits several state-of-the-art hallucination mitigation methods. Missing references and performance comparisons (including effectiveness and efficiency): OPERA (Huang et al., 2024), HALC (Chen et al., 2024), ADHH (Yang et al., 2025).
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation. CVPR 2024.
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding. ICML 2024.
Understanding and Mitigating Hallucinations in Large Vision-Language Models via Modular Attribution and Intervention. ICLR 2025.
1. How are k (sensitivity) and c (concentration) determined? Why are they necessary?
2. Do hyperparameters need adjustment for different models (e.g., 7B vs. 13B)? |
Moderately AI-edited |
|
Adaptive Multi-Scale Attention-Based LSTM Coupling for Early Detection |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses the core challenges of real-time scene recognition in complex automotive electronic and electrical (E/E) systems and proposes a solution: an adaptive attention-coupled LSTM architecture. The study constructs a dual-path time-series modeling paradigm that combines specialization and collaboration, aiming to resolve the inherent tension of a single time-series model capturing both long- and short-term dependencies. The core innovation is a bidirectional, adaptive attention coupling mechanism that serves as an "intelligent information hub" connecting the two specialized paths. To achieve optimal collaborative performance, the authors design a progressive multi-stage training protocol (independent expertise cultivation, progressive coupling introduction, and global joint optimization) intended to make the model both specialized and collaborative. Some issues remain to be addressed, as follows.
1.The paper addresses the core challenges of real-time scene recognition in complex automotive electronic/electrical (E/E) systems, proposing a novel and breakthrough solution—the adaptive attention-coupled LSTM architecture.
2.It represents the first systematic construction of a dual-path time series modeling paradigm that effectively combines specialization and collaboration, fundamentally resolving the inherent contradiction in single models for capturing both long- and short-term dependencies.
3.The core theoretical innovation lies in the introduction of a bidirectional and adaptive attention coupling mechanism, which acts as an "intelligent information hub" to dynamically connect the two specialized paths, enhancing information exchange.
4.The authors designed a progressive multi-stage training protocol (including independent expertise cultivation, gradual coupling introduction, and global joint optimization) to ensure optimal collaborative performance, balancing specialization and integration.
1.There is a lack of experimental data or justification for the design choices regarding the number of hidden units in LSTM A (24-32) and LSTM B (32-48). The paper does not provide empirical evidence or ablation studies to validate why these specific ranges were selected.
2.The paper fails to compare the proposed model with recent state-of-the-art time series prediction architectures, such as Transformer-based models (e.g., Informer, FEDformer), which have demonstrated advantages in capturing multi-scale dependencies. This omission limits the comprehensiveness of the evaluation.
3.The computational complexity and memory usage of the dual LSTM paths, combined with the attention mechanism, are not thoroughly analyzed. There is no deployment feasibility study or lightweight experiments for real-time inference on in-vehicle embedded devices, raising concerns about practical applicability.
4.The experiments are entirely based on synthetic data and lack validation on real-world automotive E/E system data. This reduces the reliability and generalizability of the results for actual automotive applications.
1. Is there any experimental evidence justifying the choice of 24-32 and 32-48 hidden units for LSTM A and LSTM B, respectively?
2. Why not compare against time-series prediction architectures that have performed well in recent years, such as Transformer-based models (e.g., Informer, FEDformer)? These models also have advantages in capturing multi-scale dependencies.
3. How much does the computational complexity and memory usage of dual LSTM paths combined with the attention mechanism increase compared to a single LSTM baseline? Is there a deployment feasibility analysis or a lightweight experiment for real-time inference of in-vehicle embedded devices?
4. The experiments are entirely based on synthetic data and have not been verified on real automotive E/E system data. |
Heavily AI-edited |
|
Adaptive Multi-Scale Attention-Based LSTM Coupling for Early Detection |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 0:
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes an adaptive attention-coupled LSTM architecture. The architecture employs two separate LSTMs to process data with different window lengths, thereby capturing information across distinct temporal scales, and integrates information from the two pathways via cross-attention with adaptive weights to generate predictions for multiple time horizons. It is specifically designed for real-time time-series forecasting and scenario detection in complex automotive electrical/electronic (E/E) systems.
1.Clear Methodological Logic. The core idea of the adaptive attention-coupled LSTM is straightforward and well-structured. The technical route from problem definition to solution implementation is logical and easy to follow.
2.Tailored Weight Parameter Schemes for Different Scenarios. The paper designs distinct weight parameter strategies for various application scenarios. These scenario-specific weight designs enhance the method’s adaptability to diverse temporal patterns in automotive E/E system data.
1.The proposed method has not been validated through experiments on other types of data.
2.Lack of Justification for the Addressed Problem. The paper fails to sufficiently demonstrate the necessity and urgency of the problem it claims to solve, e.g., the limitations of existing methods in real-time time-series forecasting for automotive E/E systems.
3.Limited Innovation and Oversimplified Methodology. The adaptive attention-coupled LSTM primarily combines existing techniques (dual-pathway LSTMs + cross-attention) without introducing fundamental innovations.
4.No Demonstration of Runtime Efficiency for Real-Time Scenarios. While the paper claims the method is designed for "real-time scenarios," it provides no quantitative data on runtime efficiency.
5.Insufficient Experiments and Lack of Baseline Comparisons. The experimental scope is narrow, and there is a lack of comparative data with mainstream baseline methods. For example: The paper does not include benchmarks widely used in automotive time-series forecasting; It fails to validate performance on publicly available automotive E/E dataset, relying instead on potentially custom or synthetic data.
6.Incomplete Presentation. The paper lacks a clear architectural schematic diagram and a related-work section.
7.Narrow Application Scope. The method is exclusively designed for automotive E/E systems and has not been validated on other time-series scenarios. This limits the method’s generalizability and academic impact.
8.Lack of Ablation Studies and Parameter Sensitivity Analysis. There are no ablation experiments to validate the effectiveness of key components (e.g., adaptive weights, cross-attention) or analyze the impact of hyperparameters (e.g., time window length, number of hidden units). It is impossible to determine whether the improved performance stems from the proposed design or merely increased model parameters.
1.More experiments are needed.
2. See weaknesses.
Moderately AI-edited |
|
Adaptive Multi-Scale Attention-Based LSTM Coupling for Early Detection |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a deep learning based time-series prediction model comprising two data encoding towers, one based on short-term data (past 6 time-steps) and another based on long-term data (past 15 time-steps). The final prediction is based on an attention coupling of predictions from the short-term and long-term towers to obtain the `enhanced` final prediction. The authors evaluate the proposed method on synthetic data that they have generated.
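For concreteness, here is the architecture as I understand it (a minimal PyTorch sketch; the hidden sizes, single attention head, and prediction head are my assumptions, with sizes chosen within the ranges the paper mentions):

```python
import torch
import torch.nn as nn

class DualPathLSTM(nn.Module):
    """Sketch: a short-term tower over the last 6 steps and a long-term
    tower over the last 15 steps, coupled by cross-attention."""
    def __init__(self, n_features, d_short=32, d_long=48, horizon=3):
        super().__init__()
        self.short = nn.LSTM(n_features, d_short, batch_first=True)
        self.long = nn.LSTM(n_features, d_long, batch_first=True)
        self.proj = nn.Linear(d_long, d_short)   # align dims for attention
        self.attn = nn.MultiheadAttention(d_short, num_heads=1, batch_first=True)
        self.head = nn.Linear(2 * d_short, horizon)

    def forward(self, x):                         # x: (B, 15, n_features)
        h_short, _ = self.short(x[:, -6:, :])     # (B, 6, d_short)
        h_long, _ = self.long(x)                  # (B, 15, d_long)
        h_long = self.proj(h_long)                # (B, 15, d_short)
        # short-term states query the long-term context
        coupled, _ = self.attn(h_short, h_long, h_long)
        fused = torch.cat([h_short[:, -1], coupled[:, -1]], dim=-1)
        return self.head(fused)                   # (B, horizon)
```

Having an explicit schematic along these lines in the paper would make the coupling mechanism much easier to assess.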
1. The paper proposes a method for an important problem of scenario recognition and prediction in the automotive system context.
2. Overall, the de-coupled modeling approach to estimate the short-term and long-term patterns are intuitive and the attention based mechanism to couple representations from the two towers is somewhat novel.
1. The paper is difficult to understand and lacks cohesion, as many of the architecture choices seem somewhat arbitrary and are scattered across Sections 3 and 4. The paper (especially the methodology description) would benefit from a significant rewrite. Several sections in the appendix can be eliminated; for example, A.1.1 and A.1.2 are unnecessary, as the machine learning audience is familiar with recurrent architectures like the LSTM and with attention mechanisms.
2. Overall, the paper lacks rigorous evaluation on real-world data and has only been evaluated on synthetic data. Further, the dataset generation procedure has not been described in detail.
3. There is no rigorous baseline comparison with other popular baselines e.g., state-space models like MAMBA.
1. How are $w_{i,base}$ and $\gamma_i^{(t)}$ estimated? The text says $\gamma_i^{(t)}$ reflects the "average contribution" of each feature but it is unclear how this is derived / what "contribution" means.
2. How is $\beta^t$ estimated?
3. What are the various hyper-parameters that need to be tuned if the proposed method is to be adapted to a new dataset?
4. Why has no comparison been conducted with state of the art baselines (e.g., Mamba [1], standard autoregressive based and single-layer linear [2] models) which have shown effective performance in time-series forecasting scenarios?
5. Why is there no ablation study conducted to highlight the importance of various components (e.g., short-term, long-term and attention based components)? This is imperative to holistically understand how the various components contribute to the overall performance.
6. Are there any real-world datasets on the automobile real-time scenario recognition and prediction use-case that can be employed to test the effectiveness of the proposed method? Synthetic data is useful but cannot serve as a comprehensive evaluation of the performance of the proposed method.
# References
1. Wang, Zihan, Fanheng Kong, Shi Feng, Ming Wang, Xiaocui Yang, Han Zhao, Daling Wang, and Yifei Zhang. "Is mamba effective for time series forecasting?." Neurocomputing 619 (2025): 129178.
2. Zeng, Ailing, Muxi Chen, Lei Zhang, and Qiang Xu. "Are transformers effective for time series forecasting?." In Proceedings of the AAAI conference on artificial intelligence, vol. 37, no. 9, pp. 11121-11128. 2023. |
Fully human-written |
|
Adaptive Multi-Scale Attention-Based LSTM Coupling for Early Detection |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes an Adaptive Multi-Scale Attention network for time series forecasting, specifically motivated by early anomaly detection and scenario prediction in automotive systems.
The main idea is to separate short-term and long-term dependencies into 2 specialized LSTM branches (“dual-path”), which exchange information through bidirectional scaled dot-product attention.
A 3-stage training schedule (specialization, coupling ramp-up, joint refinement) is introduced, along with an adaptive feature-weighting mechanism to emphasize dynamically relevant inputs.
Experiments on a synthetic dataset show a significant reduction in Mean Squared Error compared to isolated LSTM baselines.
While the architecture is conceptually sound and intuitively appealing, the novelty is not clearly demonstrated with respect to prior multi-scale or hierarchical LSTM approaches, and the evaluation is restricted to synthetic data. The paper would benefit from clearer theoretical justification and improved reproducibility.
- The idea of decoupling short and long-term temporal patterns and coupling them via attention is intuitive and could have practical advantages.
- The 3-phase schedule is well thought out and potentially generalizable.
- The reported improvements in MSE demonstrate the model’s potential effectiveness in controlled scenarios.
- The paper includes implementation information, which, if integrated into the main text, could help reproducibility and lead to a better understanding of the approach.
- The paper could benefit from better motivation.
- The authors do not clearly differentiate their proposal from prior work, particularly in the Related Work section.
- All results are based on synthetic data with Gaussian noise, and there is no testing on real or public benchmarks, which limits the applicability and impact of the findings.
- No code or data repository has been provided.
- There are contradictions within the paper, especially regarding the notation and ranges for hyperparameters.
- The description of the methodology is vague and contains excessive repetition of certain concepts (e.g., dual path or short vs. long). Some explanations are found only in the appendix. A more comprehensive description of the proposal should be included in the main body of the paper.
1) Please clarify the discrepancy regarding β_max: the main text limits it to [0.1, 0.8], while Appendix A.2.1 lists 1.5 as optimal and 2.0 as the maximum limit.
2) The main text uses “d” for the dropout rate, whereas the appendix uses p_drop. Please confirm which notation is correct.
3) Regarding adaptive windowing: Section 3.3 mentions averaging over 10 steps, while Appendix A.6.2 refers to 50 steps. What is the actual configuration used in the experiments?
4) How exactly does LSTM A “dynamically adapt”? The methodology is not clearly explained.
5) What optimization strategy was utilized for hyperparameter optimization?
6) Considering that the dataset is synthetic, can this approach be tested on a publicly available time series dataset?
7) Many important methodological choices (e.g., preprocessing steps, window sizes, data generation parameters, hyperparameter ranges) are only located in the appendices. Could the authors summarize or move this information to the main text to clarify the overall pipeline and justify these design choices? |
Heavily AI-edited |
|
OASIS: An Optimized Approach to Systematic Calibration Data Selection |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper identifies that pruning large language models is highly sensitive to the calibration data used, and that existing heuristic-based methods often lead to inconsistent and suboptimal results due to data quality variance. To address this, it proposes OASIS, a fully differentiable framework that optimizes calibration data selection end-to-end by backpropagating task-level gradients through a soft-mask proxy, allowing the model to learn which samples most improve post-pruning performance. Experiments on structured and unstructured pruning across Llama and Qwen models show that OASIS consistently outperforms heuristic and synthetic data baselines, establishing a new standard for data-aware model compression.
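The selection mechanism, as I understand it (a minimal sketch; the sigmoid relaxation, the Wanda-style saliency, and all names are my assumptions rather than the paper's exact formulation):

```python
import torch

def selection_step(W, calib_acts, sample_logits, task_loss_fn,
                   sparsity=0.5, tau=1.0):
    """One optimization step: learnable per-sample logits weight the
    calibration statistics, a differentiable soft mask stands in for
    hard pruning, and the task loss backpropagates to the logits."""
    w_sample = torch.softmax(sample_logits, dim=0)        # (N,) sample weights
    acts = (w_sample[:, None] * calib_acts).sum(dim=0)    # weighted activation stats
    saliency = W.abs() * acts                             # Wanda-style |W| * |X|
    thresh = saliency.flatten().quantile(sparsity)
    soft_mask = torch.sigmoid((saliency - thresh) / tau)  # differentiable proxy
    loss = task_loss_fn(W * soft_mask)                    # task-level objective
    loss.backward()                                       # gradients reach sample_logits
    return loss
```

Reporting the cost of this loop (iterations to converge, backward passes per iteration) would make the practicality of the approach verifiable.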
1. This paper provides a thorough investigation of the impact of calibration data on pruning from both macro and micro perspectives, offering valuable insights for future research in this area.
2. The experiments are solid and comprehensive, covering multiple LLMs under both structured and unstructured pruning settings, which strongly support the paper’s conclusions.
3. The writing is well-organized and easy to follow.
1. The macro-level conclusions have already been established in prior work, so the novelty in this aspect appears limited.
2. The motivation for introducing noise perturbations into the input is not clearly explained. Although the authors demonstrate its effectiveness through ablation studies, it remains unclear why adding noise would lead to more stable optimization.
3. The paper does not report the time or computational cost of data selection. Excessive overhead could undermine the practical value of the proposed method. If I allocate the same computational cost for data selection to gradient-based iterative pruning or recovery training, would it yield better performance?
1. Why would adding noise lead to more stable optimization?
2. If I allocate the same computational cost for data selection to gradient-based iterative pruning or recovery training, would it yield better performance? |
Lightly AI-edited |
|
OASIS: An Optimized Approach to Systematic Calibration Data Selection |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper provides an analysis of the influence of individual calibration quality and proposes a soft-mask-based pruning method combined with data selection to improve pruning performance. Experimental results show that the proposed method outperforms existing randomized and synthetic data selection approaches.
The paper provide detailed analysis for the importance of data selection based on multiple pruning methods, which provide good motivation for the problem studied.
- The paper needs a thorough proofread. Additionally, the overall presentation should be improved. Although I did not read every section in detail, I noticed a significant number of writing issues throughout the paper (see the Questions section for more specific examples).
- Beyond the writing, the contribution of this paper feels limited. The main contributions can be summarized in two parts: (1) an analysis of the influence of pruning data, and (2) the proposed OASIS method. However, the analysis largely revisits well-established findings—such as the impact of data quality and quantity—which have been studied in prior work. As for the method, it essentially builds on existing soft pruning frameworks, with a gradient-based weight for data selection. These contributions, in my view, are not substantial enough to warrant publication at ICLR.
- The experimental section is also quite limited, particularly in terms of baseline coverage. The authors should include more direct comparisons related to the data selection component, as this is the core novelty of the paper. Specifically, comparisons with prior data selection techniques would help clarify the relative effectiveness of the proposed approach.
Here are some typos or mistakes I found:
- Line 154: The pruning score for Wanda is not correct.
- I’m curious about the definition of (golden, mediocre, detrimental) data. Maybe I missed something, but I think the author should give a clear definition for the criteria at the very beginning of the paper, since these terms are mentioned many times without any explanation.
- Line 289: There are typos in the parentheses.
- The saliency score is defined in Section 3 as S = |WX^2|; however, in Line 269, the parameter becomes a vector without an explanation or a new definition of the saliency score.
There are many other typos and mistakes in this paper; I strongly suggest the authors revise it further.
Fully human-written |
|
OASIS: An Optimized Approach to Systematic Calibration Data Selection |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The OASIS framework proposes a novel approach to improving post-training pruning of large language models (LLMs) by addressing the problem of calibration data selection. Traditional calibration data selection methods rely on simple heuristics, such as random sampling or entropy, which often result in suboptimal and inconsistent pruning outcomes. The authors point out that this inconsistency arises because the importance of calibration samples varies and is context-dependent (i.e., it depends on the specific model and pruning method). A key feature of OASIS is its end-to-end framework, which formulates calibration data selection as an optimization problem and solves it using a differentiable soft-mask proxy. This allows task-level gradients to be backpropagated to the calibration data, dynamically discovering the subset most beneficial for pruning. Experiments show that OASIS improves the performance of various state-of-the-art pruning methods, establishing a new standard for data-aware model compression.
1. Context-aware calibration: The adaptive selection of calibration data allows pruning results to be optimized based on the specific model and pruning algorithm, providing high specificity.
2. Improved pruning performance: Compared with traditional heuristic methods, OASIS offers more consistent pruning outcomes and can reduce variance in pruning results.
3. Wide applicability: The method is compatible with various pruning techniques, making it practical and suitable for different types of model compression.
1. Poor figure readability: The legends and chart sizes in the paper are relatively small. Although the figures are information-dense, it is difficult to extract clear conclusions, which affects readers’ intuitive understanding of the experimental results.
2. Limited improvement for low-accuracy models: When the base model has low accuracy, OASIS provides only minimal gains in perplexity and downstream task performance. For example, for Llama-3.1-8B, the average accuracy is 79.36, which drops to 52.50 after pruning. With OASIS, it only increases to 52.97, indicating that the method does not significantly improve low-accuracy models and does not bring substantial performance breakthroughs.
3. Unclear iterative process and high time cost: OASIS relies on iterative optimization to dynamically select the optimal calibration data subset, but the paper does not specify the number of iterations needed or the computational cost per iteration. This may require significant computational resources and long training time in practical applications, limiting the feasibility of the method.
4. Generality issue: Experiments are conducted only for a 50% pruning rate and models under 8B parameters. It remains unclear how the method performs at higher pruning rates or on larger models, limiting the assessment of its general applicability.
1. How many iterations are required for the optimization problem to converge, and what is the computational cost per iteration?
2. Does the method still work effectively at lower pruning rates?
3. Can the method be applied to large models, such as Llama-65B, and is the computational cost still acceptable? |
Moderately AI-edited |
|
OASIS: An Optimized Approach to Systematic Calibration Data Selection |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a data selection method for LLM pruning. The authors first investigate how different data selection strategies affect pruning performance from both macro and micro perspectives, and find that "heuristics fail in calibration data selection." They then propose an algorithm called OASIS to select datasets for pruning tasks. Experiments demonstrate that OASIS outperforms other data selection approaches and is suitable for both structured and unstructured pruning.
1. A comprehensive study on how various calibrated data selection strategies affect model pruning performance, encompassing both structured and unstructured pruning methods.
2. The proposed algorithm is straightforward and easy to implement, while consistently improving upon the performance of baseline methods in experiments.
1. The findings are sound but not very surprising. First, it is apparent that performance saturates as data size increases, while data diversity significantly impacts model performance including in model pruning, and the optimal data composition varies across different tasks. Additionally, the statement "A single low-quality ('detrimental') sample can contaminate the entire set and severely degrade the performance of a high-quality ('golden') set" is somewhat confusing. What exactly is the size of the "entire set"? For example, if we have a selected calibrated set chosen by OASIS and introduce just one low-quality sample, will the performance indeed degrade severely?
2. The perturbation of embeddings requires further justification. Specifically, it is unclear why such perturbation ensures stability. Moreover, the ablation studies do not report the final model performance without perturbation.
1. It would be helpful to include a small experiment showing how performance degrades when a single detrimental sample is added to a high-quality calibrated set of realistic size.
2. The ablation study should report final model performance without perturbation to better illustrate its contribution.
3. The provided code link appears to have no content.
Lightly AI-edited |
|
ParaFlow: Parallel Sampling for Flow Matching Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses the fundamental challenge of accelerating the inherently autoregressive sampling in Flow Matching (FM) models such as Stable Diffusion 3 and Flux from a numerical systems perspective. It introduces a unified framework that recasts the autoregressive sampling process as solving a system of triangular nonlinear equations (TNEs), enabling a paradigm shift toward non-autoregressive sampling with parallel vector field computation across multiple timesteps. Within this generic framework, the paper establishes two key points: (1) the TNE system has a unique solution that precisely corresponds to the autoregressive sampling trajectory; (2) solving the TNE system ensures convergence to this exact trajectory in far fewer sequential iterations. Building on these insights, this paper presents ParaFlow, a training-free, step-parallel sampler for accelerating autoregressive FM samplers. Extensive experiments validate that ParaFlow reduces sequential sampling steps by up to 4× and achieves significant wall-clock speedup of up to 4.3×, with negligible impact on FID and CLIP scores. The source code will be publicly released.
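The core construction, as I understand it (a minimal sketch; the naive initialization and the batched vector-field call are my assumptions):

```python
import torch

def parallel_euler_fixed_point(v, x0, ts, num_iters):
    """Solve the Euler equations x_{i+1} = x_i + (t_{i+1} - t_i) * v(x_i, t_i)
    for all i jointly by fixed-point (Picard) iteration: each iteration
    evaluates v at every timestep in one batched call, so only num_iters
    sequential rounds remain instead of N sequential steps."""
    N = len(ts) - 1
    xs = torch.stack([x0] * (N + 1))                       # naive init: repeat x0
    dts = (ts[1:] - ts[:-1]).view(-1, *([1] * x0.dim()))   # broadcastable step sizes
    for _ in range(num_iters):                             # K << N in practice
        vec = v(xs[:-1], ts[:-1])                          # N evals in parallel
        xs = torch.cat([x0[None], x0[None] + torch.cumsum(dts * vec, dim=0)])
    return xs[-1]
```

Since the system is triangular, each iteration fixes at least one additional prefix state, so at most N iterations reproduce the sequential trajectory exactly; the reported speedup comes from converging in far fewer.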
1. Clear motivation and good writing.
2. Introduces a unified framework that recasts autoregressive sampling as solving triangular nonlinear equations (TNEs), enabling a paradigm shift to non-autoregressive sampling with parallel vector field computation across multiple timesteps.
3. Presents a training-free, step-parallel sampler, avoiding extra training costs.
4. Achieves up to 4× reduction in sequential sampling steps and 4.3× wall-clock speedup, with negligible impact on FID and CLIP scores.
1. The experiments lack validation in other domains, e.g., text-to-video generation and class-conditional image generation.
2. The paper is missing related work on deep equilibrium models [1, 2, 3, 4], which are actually trainable fixed-point iteration models.
3. The paper also omits related work on Jacobian decoding [5], which similarly involves fixed-point iterations.
4. Would a fixed-point iteration solver—such as Anderson acceleration—be helpful?
I would be willing to raise the score if these concerns are addressed.
[1] Bai, Shaojie, J. Zico Kolter, and Vladlen Koltun. "Deep equilibrium models." Advances in neural information processing systems 32 (2019).
[2] Pokle, Ashwini, Zhengyang Geng, and J. Zico Kolter. "Deep equilibrium approaches to diffusion models." Advances in Neural Information Processing Systems 35 (2022): 37975-37990.
[3] Bai, Shaojie, et al. "Deep equilibrium optical flow estimation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
[4] Wang, Shuai, Yao Teng, and Limin Wang. "Deep equilibrium object detection." Proceedings of the IEEE/CVF international conference on computer vision. 2023.
[5] Song, Yang, et al. "Accelerating feedforward computation via parallel nonlinear equation solving." International Conference on Machine Learning. PMLR, 2021.
[6] https://en.wikipedia.org/wiki/Anderson_acceleration
see weakness |
Lightly AI-edited |
|
ParaFlow: Parallel Sampling for Flow Matching Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes ParaFlow, a parallel sampling algorithm for Flow Matching (FM) generative models such as Stable Diffusion 3 and Flux. Instead of sequentially integrating the learned ODE, ParaFlow reformulates the sampling process as a system of triangular nonlinear equations (TNEs), which can be solved using a parallel fixed-point iteration (FPI) scheme. This approach allows simultaneous computation of multiple ODE steps, thus reducing sequential latency. Experiments on Stable Diffusion 3 and Flux demonstrate wall-clock speedups of up to 4.3× with negligible changes in FID and CLIP scores.
1. The paper gives a clean explanation of the autoregressive sampling property of ODE integration in flow matching, and the idea of equating Euler iteration with a TNE system is novel and interesting.
2. ParaFlow can be directly applied to existing flow-based models without retraining, making it practically feasible.
3. The method achieves noticeable acceleration with minimal degradation in visual quality on strong baselines (Stable Diffusion 3, Flux).
1. The core formulation mainly relies on classical ODE discretization and fixed-point iteration theory. The novelty lies more in application and engineering than in new theoretical development; both Propositions 1 and 2 are standard facts from ODE courses.
2. The paper does not quantitatively relate the number of parallel iterations $K$ to the actual error or convergence precision. A clear trade-off curve between accuracy and iteration count is missing.
3. The experiments are confined to only two pretrained models (Stable Diffusion 3 and Flux) and mostly show image-level metrics (FID, CLIP). There are no comparisons with other recent parallel diffusion solvers such as ParaSolver (Lu et al., 2025) or ParaTAA (Tang et al., 2024).
4. Although the method reduces sequential steps, it increases total NFE substantially. The impact on total compute cost and GPU utilization efficiency is not rigorously analyzed.
Q1: Could the authors provide a quantitative relation between the number of parallel iterations $K$ and the achieved numerical accuracy?
Q2: Have the authors compared ParaFlow with other parallel ODE or diffusion solvers (e.g., ParaTAA, ParaSolver)? This would clarify whether the benefit comes from the specific TNE formulation or general step-parallel strategies. |
Fully AI-generated |