ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Title Ratings Review Text EditLens Prediction
Data- and Hardware-Aware Entanglement Selection for Quantum Feature Maps in Hybrid Quantum Neural Networks Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents a data- and hardware-aware framework for selecting entanglement structures in hybrid quantum neural networks (HQNNs). It formulates entanglement selection as a multi-objective optimization problem that balances data-driven trainability, hardware cost, and circuit efficiency. By integrating the Hilbert–Schmidt Distance (HSD) and a quantum correlation metric (IQ), the proposed bi-level search algorithm automatically identifies entanglement patterns that improve performance, robustness, and efficiency on realistic quantum devices. Experiments on synthetic and real datasets demonstrate significant gains in accuracy, trainability, and noise resilience compared to heuristic baselines. Na 1. The manuscript suffers from severe issues in clarity and technical presentation. Many critical terms and symbols are undefined or inconsistently introduced, making it difficult to follow the proposed framework and assess its validity. Specifically: - The terms $\mathcal{U}_{data}$, $C_{hardware}$, $R_{eff}$ in Eq. (1) are not properly defined, leaving their mathematical or physical meanings unclear. - The notation $\mathcal{D}$ in Eq. (2) is introduced without explanation. - The weights $w_{corr}$ and $w_H$ in Eq. (5) are presented without specifying their definition. - While the authors denote $M$ as the structure of QNNs, the specific mathematical description of $M$ is not given. How to understand the notations $(i,j) \in M$? - The parameter $N_{CNOT}^{SWAP}$ in Eq. (6) appears without any accompanying explanation of its computation or physical significance. Overall, the presentation quality is poor, and the manuscript appears to be insufficiently prepared for formal submission. The lack of rigorous definitions, consistent notation, and clear exposition severely undermines the readability and credibility of the work. 2. While the paper introduces a seemingly principled multi-objective optimization framework, its theoretical underpinnings remain vague and insufficiently justified. Key formulations, such as the bi-level optimization structure and the data-utility function, are presented without clear derivations, assumptions, or proofs of convergence. The connection between the Hilbert–Schmidt Distance (HSD) and Quantum Fisher Information (QFI) is asserted but not rigorously demonstrated within the manuscript. 3. The experimental evaluation is weak and lacks sufficient evidence to support the claimed contributions. All results are obtained on small-scale simulators with limited datasets (e.g., synthetic and UCI Heart), which fail to demonstrate scalability or generalization to more realistic quantum or hybrid settings. The questions are included in the weakness. Fully AI-generated
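To make the kind of scalarized objective under discussion concrete, here is a minimal sketch that treats the entanglement structure M as a set of qubit pairs (i, j) and combines a data-utility term, a hardware-cost term, and an efficiency term with fixed weights. All function bodies, weights, and names below are illustrative assumptions, not the paper's actual $\mathcal{U}_{data}$, $C_{hardware}$, or $R_{eff}$:

```python
import itertools
import numpy as np

def data_utility(M, corr_matrix, w_corr=0.5, w_h=0.5, hsd=1.0):
    # Reward entangling pairs of features that are strongly correlated in
    # the data; hsd is a stand-in for a Hilbert-Schmidt-distance term.
    corr_term = sum(abs(corr_matrix[i, j]) for (i, j) in M)
    return w_corr * corr_term + w_h * hsd

def hardware_cost(M, coupling_map, swap_cnot_cost=3):
    # Pairs not natively coupled on the device need SWAPs, each costing extra CNOTs.
    return sum(0 if (i, j) in coupling_map else swap_cnot_cost for (i, j) in M)

def circuit_efficiency(M, n_qubits):
    # Fewer entangling gates -> shallower circuit.
    return 1.0 / (1 + len(M) / n_qubits)

def objective(M, corr_matrix, coupling_map, n_qubits, lambdas=(1.0, 0.1, 0.5)):
    l1, l2, l3 = lambdas
    return (l1 * data_utility(M, corr_matrix)
            - l2 * hardware_cost(M, coupling_map)
            + l3 * circuit_efficiency(M, n_qubits))

# Toy search over all two-edge entanglement maps on 4 qubits.
n = 4
rng = np.random.default_rng(0)
corr = rng.uniform(-1, 1, (n, n))
coupling = {(0, 1), (1, 2), (2, 3)}          # linear device topology
pairs = list(itertools.combinations(range(n), 2))
candidates = list(itertools.combinations(pairs, 2))
best = max(candidates, key=lambda M: objective(set(M), corr, coupling, n))
print("best entanglement map:", best)
```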
Data- and Hardware-Aware Entanglement Selection for Quantum Feature Maps in Hybrid Quantum Neural Networks Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents an approach to quantum kernel methods through Quantum Generator Kernels (QGKs), leveraging variational generator groups for parameter-efficient quantum embeddings. The work demonstrates theoretical foundations with rigorous Lie-algebraic framework and comprehensive experimental validation across multiple datasets, showing superior performance over classical and quantum baselines. The concept of generator-based kernels addresses a critical scalability challenge in quantum machine learning. This paper presents a highly innovative approach to quantum kernel methods through Quantum Generator Kernels (QGKs), leveraging variational generator groups for parameter-efficient quantum embeddings. The work demonstrates strong theoretical foundations with rigorous Lie-algebraic framework and comprehensive experimental validation across multiple datasets, showing superior performance over classical and quantum baselines. The concept of generator-based kernels addresses a critical scalability challenge in quantum machine learning. 1. NISQ Hardware Practicality: The compiled circuit depths (e.g., 4,455 gates for 5-qubit MNIST task in Table 3) significantly exceed current NISQ device capabilities. While the hybrid compression strategy is proposed, actual hardware validation or concrete depth-reduction techniques are lacking, undermining near-term applicability claims. 2. Baseline Comparison Fairness: The HEE baseline encodes only 5 features versus QGK's 93 for MNIST, creating an unfair comparison due to vastly different classical preprocessing burdens. This skews performance comparisons and should be addressed through balanced feature encoding or explicit discussion of this limitation. 3. Noise Robustness Gap: The theoretical framework assumes perfect generator commutativity, but no analysis is provided on how realistic hardware noise might disrupt this property or affect kernel performance. Noise simulation results remain limited to ideal conditions. 1. NISQ Hardware Practicality: The compiled circuit depths (e.g., 4,455 gates for 5-qubit MNIST task in Table 3) significantly exceed current NISQ device capabilities. While the hybrid compression strategy is proposed, actual hardware validation or concrete depth-reduction techniques are lacking, undermining near-term applicability claims. 2. Baseline Comparison Fairness: The HEE baseline encodes only 5 features versus QGK's 93 for MNIST, creating an unfair comparison due to vastly different classical preprocessing burdens. This skews performance comparisons and should be addressed through balanced feature encoding or explicit discussion of this limitation. 3. Noise Robustness Gap: The theoretical framework assumes perfect generator commutativity, but no analysis is provided on how realistic hardware noise might disrupt this property or affect kernel performance. Noise simulation results remain limited to ideal conditions. Fully AI-generated
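For background on the kernel framing discussed in this review, a generic fidelity-based quantum kernel can be sketched as follows; the single-qubit RY feature map is purely illustrative and is not the paper's generator-based (QGK) construction:

```python
import numpy as np

def ry(theta):
    # Single-qubit RY rotation matrix.
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def feature_state(x):
    # Illustrative one-qubit feature map |phi(x)> = RY(x)|0>.
    return ry(x) @ np.array([1.0, 0.0])

def quantum_kernel(x1, x2):
    # Generic fidelity kernel k(x, x') = |<phi(x)|phi(x')>|^2.
    return abs(np.vdot(feature_state(x1), feature_state(x2))) ** 2

X = np.array([0.1, 0.7, 2.0])
K = np.array([[quantum_kernel(a, b) for b in X] for a in X])
print(K)  # symmetric PSD Gram matrix usable in a kernel SVM
```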
Structuring Semantic Embeddings for Principle Evaluation: A Kernel-Guided Contrastive Learning Approach Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper concerns evaluating whether synthetic text can adhere to human-defined principles. This is a practical problem for aligning AI with humans. The authors seem to transform this into a classification problem in which each principle is treated as a class. Then a contrastive learning framework is proposed to learn text embeddings. The goodness of those embeddings is evaluated using downstream tasks, including emotion classification and toxicity detection. Finally, the authors conduct experiments on the GoEmotions, Amazon Reviews, and toxic-comments datasets to evaluate the learned embeddings, in comparison with the raw text. - The main problem of interest is practical and relevant. - The trained embeddings seem to be better than the raw representation. - Although focusing on principle adherence, the authors do not formally discuss principles. This leads to an unclear research problem in this paper. - There seems to be no explicit kernel in their proposed framework, despite repeated mentions of "kernel-based" in the paper. - No baseline is used for comparison. The main framework focuses on learning representations for the text, so the authors should include some existing text-embedding methods in the comparison. - The current manuscript lacks many details. For example, AdaptedInfoNCE is used without definition. - The experiments focus on some supervised problems, but not the paper's main concern of human-defined principles. This means the experimental results may not adequately support the paper's claims. Can you define principles explicitly? Fully human-written
Structuring Semantic Embeddings for Principle Evaluation: A Kernel-Guided Contrastive Learning Approach Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a kernel-guided contrastive learning framework that uses learnable prototype kernels and a novel offset penalty to restructure fixed embeddings, forcing disentanglement of principle-specific features for post-hoc evaluation. The optimized embeddings consistently yield statistically significant performance improvements compared to raw features and demonstrably outperform few-shot LLMs on structured proxy tasks. The resulting structured, low-dimensional subspace provides a reusable intermediate representation that simplifies downstream modelling and improves computational efficiency. TL;DR: while the work offers some performance improvements, the proposed method is largely an amalgamation of known techniques without significant novelty, so I would not consider it appropriate for ICLR and suggest submitting to an alternative venue. * Excessive Loss Complexity: The method relies on a highly composite loss function $L_{\text{total}}$ combining five distinct, weighted terms (Contrastive, Offset, Classification, Orthogonality, Magnitude). This fragility implies that no single mechanism is robust, mandating extensive hyperparameter tuning (e.g., numerous $\lambda$ weights, $\tau$, $\delta_{\text{intra}}$, $\delta_{\text{inter}}$). * Derivative Nature of Core Components: The introduction of "learnable prototype kernels" is an architectural modification of existing Prototypical Contrastive Learning (PCL), and the "novel Offset Loss" is fundamentally a configuration of standard metric-learning distance-margin constraints applied via Euclidean distance penalties. * Verification Gap in Problem Scope: The framework is motivated by the "critical challenge" of evaluating abstract principles (e.g., fairness, honesty, safety). However, empirical validation is confined to more measurable, structured proxy tasks, namely fine-grained emotion classification (GoEmotions), ordinal star ratings (Amazon Reviews), and binary toxicity detection. This validation scope avoids the subtle, context-dependent, and inherently subjective principles that prompted the research. * Non-Rigorous Baseline Comparison: The highlighted superior performance against few-shot Large Language Models (LLMs) lacks rigour. The comparison contrasts a specialized, supervised transformation network against generalist LLMs restricted to simple prompting. A meaningful assessment requires benchmarking against state-of-the-art specialized methods in supervised contrastive learning or deep metric learning. see Weaknesses Moderately AI-edited
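To make the kind of composite objective criticized above concrete, here is a minimal PyTorch sketch of a prototype-based contrastive term plus a margin-style offset penalty. The margins, temperature, and weighting are illustrative assumptions, and the remaining terms (classification, orthogonality, magnitude) are omitted, so this is not the paper's actual $L_{\text{total}}$:

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(z, labels, prototypes, tau=0.1):
    # InfoNCE-style loss against learnable class prototypes.
    logits = F.normalize(z, dim=-1) @ F.normalize(prototypes, dim=-1).T / tau
    return F.cross_entropy(logits, labels)

def offset_penalty(z, labels, prototypes, delta_intra=0.5, delta_inter=2.0):
    # Hinge-style margins: pull points within delta_intra of their own
    # prototype, push them at least delta_inter from other prototypes.
    d = torch.cdist(z, prototypes)                      # (B, C) distances
    own = d.gather(1, labels.unsqueeze(1)).squeeze(1)   # distance to own prototype
    mask = F.one_hot(labels, prototypes.size(0)).bool()
    other = d.masked_fill(mask, float("inf")).min(dim=1).values
    return F.relu(own - delta_intra).mean() + F.relu(delta_inter - other).mean()

# Toy usage: 8 samples, 16-dim embeddings, 4 "principles".
B, D, C = 8, 16, 4
z = torch.randn(B, D, requires_grad=True)
labels = torch.randint(0, C, (B,))
prototypes = torch.nn.Parameter(torch.randn(C, D))
loss = prototype_contrastive_loss(z, labels, prototypes) + \
       0.5 * offset_penalty(z, labels, prototypes)
loss.backward()
```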
Structuring Semantic Embeddings for Principle Evaluation: A Kernel-Guided Contrastive Learning Approach Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces a kernel-guided contrastive learning framework for evaluating how well text aligns with abstract human principles such as fairness or honesty. Conventional text embeddings often mix principle-related signals with general semantic information, making evaluation unreliable. To address this, the authors project embeddings into a structured subspace where each principle is represented by a learnable prototype vector and the surrounding geometry is shaped by an offset penalty that enforces clear separation while preserving contextual nuance. Through joint optimization of contrastive and regularization losses, the model learns embeddings that better capture the essence of each principle without discarding general meaning. Experiments across emotion, sentiment, and toxicity datasets show that these structured embeddings yield more accurate classification and regression results than raw embeddings or few-shot large language models, indicating that purpose-built representations can enhance reliability in principle evaluation. - The paper correctly identifies that post-hoc value or principle evaluation is underexplored compared to controllable generation methods. - The approach reframes alignment evaluation as a representation optimization problem, which is a useful perspective that could inspire modular or plug-in evaluators. - The experiments report cross-validation and multiple downstream models, providing some statistical grounding for the claims. ### w1. Conceptual Ambiguity: “Principle Evaluation” vs. “Value Alignment Evaluation” The paper frames its goal as “principle evaluation” but positions it as a step toward “value alignment evaluation.” The distinction is unclear: - Value alignment typically concerns consistency between model and human values. - Principle evaluation here appears to test adherence to explicit, human-defined rules. If these are equivalent, standard alignment terminology should be used; if not, the boundary between the two should be clarified and the focus on “principles” justified. This ambiguity weakens the conceptual framing. ### w2. Limited Experimental Coverage Experiments are limited to one embedding model (Jina v3) and three proxy datasets. - The tasks (emotion, sentiment, toxicity) evaluate classification separability but not “principle adherence” or “value alignment.” - Comparison with few-shot LLMs is not directly relevant, as prompting-based generation differs fundamentally from embedding-level evaluation. Without alignment-specific benchmarks, it is difficult to assess whether the proposed framework advances value-alignment evaluation. ### w3. Missing Alignment Evaluation Baselines If the goal is value alignment evaluation, the paper should compare against existing alignment evaluators or benchmarks. Instead, it relies on few-shot LLMs and proxy tasks that assess classification rather than alignment. This makes it unclear whether the method improves on standard classification. If the task differs from conventional alignment evaluation, this distinction and its rationale should be explicitly stated.
### w4. Use of “Kernel” Terminology The paper uses “kernel” to describe learnable prototype vectors acting as embedding anchors. However, these are not kernel functions in the classical RKHS sense, and no kernelized similarity is applied. This unconventional terminology may confuse readers familiar with kernel methods. Unless a formal connection is provided, “prototype vectors” or “anchors” would be clearer. ### w5. Limited Qualitative and Interpretability Analysis While t-SNE visualizations, kernel–principle correlations, and offset analyses illustrate geometric structure, they remain low-level and do not reveal how learned “principles” map to human-interpretable semantics. More qualitative examples or probing analyses (e.g., principle-specific attribution or activation studies) would strengthen the interpretability claims. Please refer to the Weaknesses. Heavily AI-edited
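For reference on the terminology point in w4: a kernel in the RKHS sense is a positive semi-definite similarity induced by a feature map, whereas a learnable prototype enters the loss only as an anchor point. A compact statement of the distinction (standard material, not drawn from the paper):

```latex
% RKHS kernel: a PSD similarity induced by a feature map \phi into H.
k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}},
\qquad \sum_{i,j} c_i\, c_j\, k(x_i, x_j) \ge 0 \quad \forall\, c \in \mathbb{R}^n.
% Prototype score: a per-class anchor distance, inducing no pairwise kernel.
s_c(x) = -\,\lVert f(x) - p_c \rVert_2
\qquad \text{or} \qquad
s_c(x) = \cos\!\bigl(f(x),\, p_c\bigr).
```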
Structuring Semantic Embeddings for Principle Evaluation: A Kernel-Guided Contrastive Learning Approach Soundness: 1: poor Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a kernel-guided contrastive learning framework to improve text embeddings for "principle evaluation" - essentially classifying text according to various attributes like emotions, toxicity, or ratings. The authors argue that standard embeddings entangle principle-specific signals with general semantic content, making evaluation difficult. Their solution involves learning prototype kernels for each "principle" and using a complex multi-component loss function to structure a 64-dimensional embedding space. They test on three datasets: GoEmotions (emotion classification), Amazon Reviews (rating prediction), and toxic comment detection. - The paper includes proper statistical reporting with cross-validation and provides extensive implementation details in the appendix. - The optimized embeddings do show performance improvements over raw embeddings across the tested scenarios, even if the gains are sometimes modest. - Thorough ablation study: The authors validate the contribution of different loss components. - I do not understand the core motivation of this work: The claim that post-hoc evaluation is "less explored" compared to generation-time control is not correct. Evaluation metrics and classifiers are tools that drive AI safety development. The motivation for why embeddings are the preferred approach "at scale" over specialized classifiers is not clear. - The paper conflates completely different tasks (emotion classification, toxicity detection, rating prediction) under the vague umbrella of "principle evaluation." These are distinct problems with different characteristics. - Overcomplicated solution: The framework involves several loss components with numerous hyperparameters. For what amounts to learning better representations, this seems unnecessarily complex compared to standard approaches. - Unclear technical contributions: The "learnable prototype kernels" are essentially learned class centroids, which isn't novel. The paper reads as if it is searching for a problem, applying complex machinery to what seem to be standard classification tasks with different embedding pre-processing. The paper is confusing about whether they're learning principle-specific dimensions or just mapping to 64 dimensions. The relationship between the "structured subspace" and the actual 64-dimensional output is unclear. Heavily AI-edited
Learning from Examples and Self-Exploration: A New Paradigm for Dynamic Fusion Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes LESE (Learning from Examples and Self-Exploration), a way to train large language models by dynamically blending supervised fine-tuning and reinforcement learning. Instead of fixing their ratio statically, LESE adjusts it per example based on how well the model understands the task (task mastery) and how diverse its solutions are (exploration diversity). This helps the model stay stable when it’s uncertain and explore more once it’s confident. It also uses a filtering technique to skip training on examples that already receive high rewards. With fine-grained rewards for clearer feedback, LESE outperforms existing methods on several math benchmarks (ID, OOD), showing better reasoning, stability, and generalization. I think this paper explores a really interesting direction by trying to **dynamically combine SFT and RLHF** instead of treating them as separate stages. The idea of adjusting their balance per instance based on the model’s current mastery and exploration diversity feels both intuitive and well-motivated. I also like the **filtering mechanism for saturated instances** — skipping updates when all responses already reach maximum reward is a clever and practical way to avoid overfitting and wasted computation, making the training process more efficient and stable. 1. The explanation of Exploration Diversity (D) is unclear. The equation introduces a summation over N without defining what N represents or how it relates to the reward components. It would help to clarify whether N is the number of reward signals or samples, and how exactly D is computed (per instance or across batches). 2. Figure 3 is confusing. What exactly does R_format represent? What are the y-axis units or limits? If I understand correctly, the GRPO curve actually rises faster at the beginning — so why does the text claim that “LESE converges rapidly and maintains stable preference rewards”? Do you mean a different phase of training or a different metric? It would help to clarify this or add a zoom-in or wall-clock-time plot to support the claim. 3. I’d like to see a concrete example comparing LESE with a standard SFT-then-RL pipeline on the same prompt. Showing the original model’s answer and how it improves through each training regime would make the adaptive mechanism much easier to understand. 4. Similarly, providing an example with outputs from SFT-only, RL-only, and LESE for one problem would help illustrate the differences in behavior — for instance, how SFT falls into the imitation trap, how RL might show exploration risks, and how LESE balances the two. 5. The experiments are limited to mathematical reasoning tasks. Have you tried LESE on other domains like coding or logical reasoning to test its generality? That would strengthen the paper’s broader claims. 6. It would be useful to quantify the rollout efficiency gains introduced by the “saturated instance filtering” mechanism. How many rollouts per prompt are actually skipped compared to standard SFT→RL training? Presenting this comparison would make the efficiency argument more convincing.
7. The Figure 4 caption seems incorrect for subfigures (c) and (f). Also, the color coding isn’t explained, making it hard to interpret. Please clarify which colors correspond to which methods, specify what metrics the axes represent, and add short explanations for each subfigure. Please take a look at the weaknesses section. Also, I think the overall writing of the paper needs to be improved a lot. Moderately AI-edited
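A minimal sketch of the per-instance fusion described in the review above, assuming mastery M is the fraction of rollouts whose reward clears a threshold, diversity D is the spread of rewards across rollouts, and alpha gates a convex combination of the SFT and RL losses. The gating function and the threshold names (c, R_thres) are illustrative stand-ins for the paper's Eq. (6) and saturation filter, not reproductions of them:

```python
import torch

def instance_alpha(rewards, c=0.5, scale=0.2):
    # rewards: per-rollout rewards in [0, 1] for a single prompt.
    mastery = (rewards >= c).float().mean()                      # fraction of "solved" rollouts
    diversity = (rewards - rewards.mean()).pow(2).mean().sqrt()  # spread across rollouts
    # Illustrative gate: more RL weight when mastery and diversity are high.
    return torch.sigmoid((mastery * (1 + diversity) - 0.5) / scale)

def lese_loss(sft_loss, rl_loss, rewards, r_thres=1.0):
    if rewards.min() >= r_thres:             # saturated instance: skip the update
        return torch.zeros((), requires_grad=True)
    alpha = instance_alpha(rewards)
    return (1 - alpha) * sft_loss + alpha * rl_loss

# Toy usage with 4 rollouts for one prompt.
rewards = torch.tensor([0.0, 0.5, 1.0, 1.0])
sft_loss = torch.tensor(2.3, requires_grad=True)
rl_loss = torch.tensor(0.8, requires_grad=True)
loss = lese_loss(sft_loss, rl_loss, rewards)
loss.backward()
```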
Learning from Examples and Self-Exploration: A New Paradigm for Dynamic Fusion Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This work proposes instance-level dynamic fusion of SFT (using expert examples) and RL (via self-exploration). This is done by adaptively weighting their losses, based on task mastery and exploration diversity: intuitively, the method increases SFT weight when task mastery is low or exploration diversity is limited, and increases RL weight otherwise. Empirical comparison with prior SFT-RL fusion methods are conducted on math and reasoning benchmarks. - Presents a reasonable motivation and practical approach for instance-level dynamic fusion of SFT and RL, in contrast to prior fusion methods that are largely static. - Shows promising empirical performance compared to selected baselines. **Concerns about experimental setup:** - Different reward designs are used for baseline methods, according to Appendix A.5. This is problematic for fair comparison, as reward design is orthogonal to SFT–RL fusion strategy. - No details are provided on how hyperparameters for baseline methods are selected or tuned. - Experiments are conducted only with Qwen2.5-math-1.5B/7B models, and primarily on mathematics benchmarks (with the exception of GPQA), raising the question about how generalizable the results could be. - In Section 4.1, the training set contains 32,000 samples, with an RL batch size of 16. This implies one epoch equals 2,000 steps, yet RL methods are trained for only 500 steps. The rationale for this choice is unclear, and it is uncertain whether the algorithms had converged before the reported results in Table 1 were obtained. - Training samples are ordered according to an easy-to-hard curriculum, which appears non-standard and somewhat arbitrary. - The number of rollout generations per sample is not reported. - For the 7B model results (Line 364), the paper compares with baseline numbers from prior works. This is unlikely to be a fair or meaningful comparison, as differences in training data, curricula, and initial models (e.g., instruct vs. math models) introduce significant variability. **Writing and presentation issues:** - Lines 40–50: The discussion on limitations of SFT and RLVR (e.g., “imitation trap,” “exploration risks,” “vanishing gradients”) lacks citations. - Lines 77 and 80 are repetitive. - Line 116: The sentence listing PPO, DPO, and GRPO is grammatically awkward and unclear. - Eq. (5): The variable $R_{j,i}$ is not clearly defined, though it seems to refer to the $i$-th sub-reward for the $j$-th response. Similarly, the meanings of $R_{\text{score}}, R_1, R_2$ in Figure 2 are unclear. - Line 223: The claim that “the marginal utility of exploration grows non-linearly with task competence” does not seem intuitive. The proposed function in Eq. (6) seems more like an ad-hoc choice that grows with both $M$ and $D$. - Line 278: The claim regarding GRPO’s failure with binary rewards is not supported by Figure 3. - Line 349: The assertion that the method “adheres to stringent human preference rewards” is not supported by results in Table 1. - Lines 357–358: The reported “4.5 and 2.9 points improvement” over baselines should clarify how these numbers are calculated exactly. 
- Figure 4 caption: - (a, d): The test datasets are unspecified. - “(b, e) pass@k performance on the AIME benchmark” should read (c, f), and it is unclear whether AIME24 or AIME25 was used. - The 3D plots in (c, f) offer no apparent benefit over simpler 2D line plots. - Line 455: The statement that LESE-Static and LESE-Random have “formatting optimized at the expense of reasoning” conflicts with their 0 format score in Table 3. **Other concerns:** - The proposed method introduces multiple new hyperparameters (e.g., $R_{\text{thres}}$ in Line 192; $c$ and $\ell$ in Line 215) with limited guidance on their selection. - The second claimed contribution (“fine-grained preference reward design”) is not novel, as reward design has long been a standard RL practice. - The notion of “human preference” in this paper is vague; based on Line 269, it appears to mainly refer to formatting constraints (e.g., tag presence, ordering, format compliance). - The paper does not discuss its limitations, despite clear issues raised above. Please see "weaknesses" above. Lightly AI-edited
Learning from Examples and Self-Exploration: A New Paradigm for Dynamic Fusion Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a method to adaptively combine SFT and RL depending on mastery (M), i.e., how often the model already solves the task, and diversity (D), i.e., how much reward varies across samples. The authors propose to combine these metrics to set alpha, a parameter that interpolates between the SFT loss and the RL loss. Intuitively, SFT is used when the model cannot solve the task naively, and RL when the model can solve the task at least once. Results are reported on math reasoning benchmarks. The problem statement is well motivated; if solved, it could be impactful in how we combine the strengths of imitation and on-policy RL for language models. 1. Method generality * Most results are on Qwen2.5-Math-1.5B–style math solvers. While the results are positive, adding tasks such as those from Reasoning Gym, which are harder to solve, would also improve confidence in the method. Qwen 2.5-1B is known to have poor initial performance on these tasks. 2. Diversity collapse / pass@k vs pass@1 * When $\alpha$ is low (i.e., more SFT), do we see any loss of diversity in outputs? * Concretely: how does LESE affect pass@k (best-of-k sampling) vs pass@1? RL-style exploration often helps pass@k more than pass@1; heavy SFT pressure might over-regularize toward a single canonical answer style (a standard unbiased pass@k estimator is sketched after this review). * If entropy collapse is an issue with the standard GRPO baseline, having the comparison would be helpful. 3. Choice of mastery threshold c, number of rollouts. * How is the mastery cutoff c picked in practice? Is it tuned per dataset, fixed globally, or annealed? Please report sensitivity to c, since $\alpha$ rises sharply once M crosses c. In the appendix, for the math-specific task, c=0. This assumes that a model unable to solve an instance will rely completely on SFT. * The appendix states that only 4 rollouts were used; can this be increased to mitigate the stability issues for RL mentioned in the Appendix? - An ablation on how the number of rollouts affects the stability of the method would strengthen the paper. 4. Section 4.2 – using output length as confidence. * The authors claim that LESE gives short answers when confident and longer chains when uncertain. That’s plausible, but length is only a proxy for hardness/uncertainty. Stronger metrics would make the claim more convincing. 5. Line 266: fine-grained reward * The authors mention a fine-grained / multi-component reward. Can you spell out the components (format correctness, reasoning validity, final answer match, etc.), how they’re combined, and whether weights are hand-tuned or learned? This is important for reproducibility and for understanding how generalizable the approach is beyond math-style verifiable tasks. * The details in the appendix are for math-related datasets; would similar reward functions have to be handcrafted for any RLVR task? 6. Out-of-domain alignment degradation. * To argue the SFT component isn’t overbearing, can you add evals on unrelated chat / instruction benchmarks (e.g., open-ended dialogue like WildChat, safety / helpfulness like MT-Bench) to check that LESE hasn’t harmed general usability? 
This would also help position the method as “alignment-friendly,” not just math tuning. Especially since high levels of SFT have been shown to hurt model performance. 7. Curriculum details (Appendix A).: The authors mention a hand-crafted curriculum. * How exactly is that curriculum in the training data curated (sampling weights? staged unfreezing? ordering of problems)? Is it model specific? This is unclear from the details in the appendix * The related work section would benefit from a discussion on curriculum learning as well, and not just RLHF vs SFT. Please see weaknesses section. Fully human-written
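On the pass@k question in point 2 above, the standard unbiased estimator from the code-generation literature can be computed from n rollouts with c correct; using it here is a suggestion for the authors, not something the paper reports:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator of pass@k given n samples with c correct:
    # 1 - C(n-c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 rollouts per prompt, 5 correct.
print(pass_at_k(16, 5, 1))   # 0.3125, i.e. pass@1
print(pass_at_k(16, 5, 8))   # much higher, rewards diversity of solutions
```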
Learning from Examples and Self-Exploration: A New Paradigm for Dynamic Fusion Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes to combine the RL and SFT losses as a convex combination for training LLMs on reasoning datasets. The goal of this combination is to compensate for the weaknesses of individual methods: SFT alone leads to overfitting, while RL alone leads to reward hacking or sample inefficiency. The convex combination parameter, $\alpha$, is composed of two components: task mastery and exploration diversity, which are dynamically learned during the training of the LLM. This approach, when paired with “fine-grained rewards” (a concept the paper references but does not detail), resulted in improved performance on reasoning benchmarks under a specific training setup. - The core idea of combining the RL and SFT losses to harness the strengths of both methods is a promising approach, as it allows both successful and unsuccessful trajectories to provide useful gradient updates to the LLM. - The proposed approach dynamically combines these losses (rather than using a static combination), which reportedly results in faster training by weighting one loss over another based on question difficulty and the model's competence. - The results demonstrate improved performance under a specific training setup over its baselines on a variety of math reasoning benchmarks, including AIME and MATH-500. - The approach appears mathematically unsound. There is more nuance to SFT and RL-based training approaches that requires careful and thorough investigation (e.g., [1]). Trivially combining the losses is methodologically questionable, as the units of the two losses are incommensurable (analogous to adding kilograms and kilometres). - The paper lacks a clear explanation or mathematical insight for Eq. 6. Furthermore, it does not justify why the proposed estimation method for the $\alpha$ parameter is appropriate or correct. - The experiments are performed in a non-standard setup (lines 306-312), and the results, including those of baselines, differ drastically from the standard, publicly reported results on Huggingface for the same 1.5B model (https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B). - The presentation in some parts of the draft is unprofessional. For example, the plots lack titles and y-axis labels, and appear to be unprocessed screenshots. There are no discussion or limitation sections in the paper. [1] Roux, Nicolas Le, et al. "Tapered off-policy reinforce: Stable and efficient reinforcement learning for LLMs." arXiv preprint arXiv:2503.14286 (2025). - The discussion of token lengths in Table 2 presents a counterintuitive finding. Intuitively, RL approaches should have higher token counts due to exploration, while SFT methods should have fewer because the LLM imitates a known answer. The table, however, shows the opposite. The authors should explain this discrepancy. Furthermore, the relevance of this discussion on token length is unclear, given that the proposed method does not explicitly optimize for it. - The authors should address the apparent poor performance of the standalone RL method in Figures c and f (compared to SFT and LESE). 
This result is counterintuitive, given that RL methods are generally expected to generalize better and eventually outperform SFT-only approaches. - The paper requires a detailed explanation of the “fine-grained rewards” referenced in the text. Specifically, the authors should describe: (1) what these multiple reward signals are, (2) how they are combined with the true reward, and (3) what mechanisms are in place to prevent reward hacking, especially given their heuristic nature. Lightly AI-edited
Reflective Flow Sampling Enhancement Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors propose RF-Sampling, a training-free sampling method that enhances sample quality in flow models. The method extends Z-Sampling, which alternates between denoising and inversion steps with controlled CFG; RF-Sampling instead interpolates between text embeddings and repeatedly applies a reflection loop. This design allows it to support CFG-distilled models such as FLUX-dev and enables test-time scaling. Experimental results demonstrate that RF-Sampling achieves better performance compared to various existing sampling baselines across multiple image generation benchmarks. Additionally, an ablation study on hyperparameters and further applications to video generation and image editing models validate the effectiveness of the proposed method. **S1**. It seems interesting that the sampling method also supports some of the recently popular CFG-distilled models. **S2**. The paper demonstrates the effectiveness of the method across various tasks and models, with experiments conducted under multiple settings, e.g., when used in combination with acceleration methods. **W1**. Although the paper proposes an improved training-free sampling method, it lacks theoretical justification or analysis explaining why it works. A theoretical explanation is necessary to clarify the meaning and role of temporal embedding interpolation, the hyperparameters $\alpha,\gamma$, and the overall underlying mechanism of the method. **W2**. The improvement in image generation quality appears to be only marginal, and the effect of inference-time scaling shown in Figure 2 does not seem significant. A comparison with the Inference-Time Scaling paper [1] would be necessary to clearly demonstrate the effectiveness of the proposed method. [1] Ma et al., Scaling Inference Time Compute for Diffusion Models, CVPR 2025. **W3**. The method involves some hyperparameters, which appear to have been selected manually. As shown in the ablation study, the method seems quite sensitive to these choices. **W4.** Line 232: “We then take one step of the ODE solver,” but the corresponding equation 3 indicates multiple ODE steps with $\sigma$, which could cause confusion. If my concerns are addressed, I would be happy to reconsider the score. **Q1.** In Tables 1 and 2, how is the inference process configured for RF-Sampling? Is it conducted using the same sampling time as in the standard setting? For a fair comparison, it would be necessary to include a report on the inference time. **Q2**. Although FLUX is a CFG-distilled model, it still includes a CFG scale input condition. Would it be possible to apply this sampling method by adjusting the CFG scale through that input? Fully human-written
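A rough sketch of the reflection loop as this review describes it: one denoising step under a semantically strengthened embedding, an inversion step under a weakened one, then a standard step. The interpolation rule, step sizes, and the gamma parameters are assumptions for illustration only and are not the paper's actual $c_{mix}$/$c_{w}$ equations:

```python
import torch

def mix_embeddings(c_text, c_weak, gamma):
    # Illustrative interpolation/extrapolation between a strong and a weak
    # text embedding; the paper's actual c_mix / c_w construction may differ.
    return gamma * c_text + (1 - gamma) * c_weak

def reflective_step(x_t, t_hi, t_lo, velocity, c_text, c_weak,
                    gamma_hi=1.2, gamma_lo=0.3):
    # 1) Denoise one Euler step under a semantically strengthened embedding.
    c_strong = mix_embeddings(c_text, c_weak, gamma_hi)
    x_denoised = x_t + (t_lo - t_hi) * velocity(x_t, t_hi, c_strong)
    # 2) Invert back under a weakened embedding, reflecting prompt
    #    information into the noisier latent.
    c_soft = mix_embeddings(c_text, c_weak, gamma_lo)
    x_reflected = x_denoised + (t_hi - t_lo) * velocity(x_denoised, t_lo, c_soft)
    # 3) Resume a standard denoising step from the refined latent.
    return x_reflected + (t_lo - t_hi) * velocity(x_reflected, t_hi, c_text)

# Toy usage with a placeholder velocity field on a 4-dim "latent".
velocity = lambda x, t, c: c - x          # stand-in flow-matching velocity
x = torch.randn(4)
c_text, c_weak = torch.ones(4), torch.zeros(4)
x_next = reflective_step(x, t_hi=1.0, t_lo=0.9, velocity=velocity,
                         c_text=c_text, c_weak=c_weak)
```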
Reflective Flow Sampling Enhancement Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes RF-Sampling, a method whose core idea, for flow models like Flux.Dev, is to denoise during inference under text embeddings with higher semantic intensity and then perform inversion under text embeddings with lower semantic intensity. This process helps obtain a noise latent that better aligns with the prior of the text prompt, thereby improving image generation quality. The authors conducted experiments using different flow models combined with various sampling enhancement techniques, demonstrating the effectiveness of RF-Sampling. 1. The paper contains a rich set of experiments, conducting extensive tests under different task-based flow models (including flux.dev and flux.lite for text-to-image generation, Wan2.1 for video generation, etc.), verifying the versatility of RF-Sampling; 2. Exploring inference-time enhancement strategies for CFG-distilled flow models like flux.1 dev is promising; 3. The paper is well-written and easy to read. 1. CFG-distilled flow models, such as flux.dev, typically still allow for specifying the CFG scale at inference time to produce outputs at varying guidance strengths (often by modulating the latent via AdaLN). Given this, Z-Sampling should, in principle, be applicable to flux.dev. Why do the authors not compare their method against a Z-Sampling variant adapted for this model? Theoretically, the output of a perfectly distilled model at cfg_scale=1.0 should be identical to the output of its non-distilled counterpart conditioned on an empty-text prompt. This suggests that the paper's construction of $c_{mix}$ and $c_{w}$ might be unnecessary, as Z-Sampling could likely be migrated to flux.dev with appropriate modifications. 2. The image quality metrics used in the paper, such as ImageReward and HPSv2, primarily reflect human preferences. Such metrics tend to emphasize aesthetics and prompt fidelity, but may not adequately assess the diversity or realism of the generated images. I am concerned that the proposed RF-Sampling, by weighting text embeddings and latents during the sampling process, might shift the model's inputs away from their original prior distribution. This could potentially lead to a decrease in image diversity or the introduction of visual artifacts. Using a metric such as FID, which directly assesses realism and diversity, would perhaps be more appropriate. 3. For a standard 28-step sampling process, each step of RF-Sampling appears to require three model forward computations (forward, inversion, and re-forward). This effectively triples the computational cost relative to the number of steps. Consequently, the comparisons in Tables 1 and 2 may be unfair. The results for RF-Sampling should be compared against a baseline standard sampler using three times the number of inference steps (e.g., $28 \times 3 = 84$ steps). While Figure 2 suggests RF-Sampling also performs better under an equivalent inference time, the specific experimental settings (e.g., the number of steps for the baseline) are not detailed, raising concerns about the fairness of this comparison.
4. The different starting points for the standard sampling and RF-Sampling curves in Figure 2 indicate that RF-Sampling incurs a significant initial overhead or increase in inference time. A more relevant comparison would be: using the *total* time taken by RF-Sampling, how does it compare to a standard sampling baseline that generates multiple candidates ('best-of-N') and selects the one with the highest metric score? 5. ICLR policy requires the appendix to be included in the same PDF as the main paper. The paper has incorrectly placed the appendix in the separate supplementary materials. I am willing to revise my score if any of these points are based on a misunderstanding of the proposed method. See Weaknesses. Fully human-written
Reflective Flow Sampling Enhancement Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces Reflective Flow Sampling (RF-Sampling), a training-free inference-time enhancement approach for text-to-image (T2I) generation using flow-based generative models, particularly those that are CFG-distilled (such as FLUX). The method leverages an interpolation of text embeddings and a three-stage inference loop (composed of high-weight denoising, low-weight inversion, and standard denoising) to guide the generative process toward better semantic alignment with text prompts. RF-Sampling is demonstrated, through extensive experiments on multiple T2I and related tasks, to yield significant improvements in both generation quality and prompt fidelity compared to standard sampling and several baseline enhancement strategies, particularly where conventional diffusion-based techniques are inapplicable to flow models. 1. The work clearly identifies a real limitation in the applicability of inference-time enhancement techniques to flow-matching-based text-to-image models, a rising class of efficient generative models that are not well-served by prior methods. 2. The approach is mathematically formalized, with explicit equations describing the reflective sampling mechanism (see Eqs. for staged denoising/inversion, Section 3.3), and its integration with flow-based ODE solvers is well-articulated. The staged loop and embedding interpolation are presented in sufficient detail, including the merge and amplification parameters. 3. Experiments are thorough and span a large suite of benchmarks, including HPDv2, Pick-a-Pic, DrawBench, GenEval, T2I-Compbench, and evaluations on video and image editing tasks. 1. While the reflective mechanism is motivated by semantic accumulation in latent spaces and interpolation of embeddings, the core reason behind why the three-stage loop (particularly the low-guidance inversion step) should regularize generation toward prompt-faithful images remains largely empirical. The mathematical foundations of convergence or guarantees (e.g., what class of distributions are targeted, what properties are preserved or enhanced during the reflective step) are not rigorously analyzed in Section 3.3 or elsewhere. This limits reproducibility and makes the method feel heuristic. 2. In Section 3.2, the equations for embedding mixing and semantic amplification ($c_{\text{mix}}$ and $c_w$) are presented, but their concrete integration into each ODE step is scattered and not fully formalized. For example, it is left ambiguous in Eqns. for the inversion step whether $c_{\text{mix}}$ is always being recomputed per time step and how the standard scale $w$ interacts with $s$ and $\beta$. This may impede direct implementation from the text. 3. While Figures 7 and 8 and the corresponding ablation analyses add value, the scope is restricted to interpolation and amplification parameter sweeps. There are no ablations studying the impact of each stage of the loop independently (e.g., what happens if the low-guidance inversion/reflection is omitted entirely, or replaced with a linear or simpler heuristic?), nor are qualitative failure cases or negative results provided. 
The efficiency and scaling comparisons, though favorable, would be strengthened by a more detailed breakdown versus parameter count, steps, or compute time. 4. While Table 2 shows superior scores for RF-Sampling, in some settings the improvements over standard sampling are marginal (see FLUX-Dev AES on DrawBench: 6.1866 vs. 6.1459), raising questions about practical significance in certain operational regimes. 5. The UMAP analysis in Figure 6 purports to show that RF-Sampling trajectories align better with the true data distribution. However, there is little discussion on how this alignment concretely translates to improved perceptual or semantic outcomes, or whether it is artifactually driven by the chosen projection or data statistics. 1. Can the authors provide rigorous analysis (not only empirical) for why the three-stage reflective loop in RF-Sampling leads to better semantic alignment or image quality than direct/high-weight denoising? For example, can theoretical guarantees or explanations be offered for convergence, robustness, or generalization? 2. The mathematical integration of $\beta$, $s$, and merge ratio $\gamma$ parameters in the flow equation steps can be made more explicit, possibly via an explicit algorithmic pseudocode in the main text. Would the authors include this in a revision? 3. Can the authors elaborate on what prevents adapting state-of-the-art diffusion-based inference enhancement methods (e.g., Z-Sampling, W2SD) to flow models such as FLUX? Are failure cases due to model architecture, incompatible objective/loss, or something else? 4. Is there a detailed efficiency breakdown (step counts, FLOPs, wall time) for RF-Sampling vs. standard sampling (and possible alternatives) beyond the high-level graphs (Fig. 2)? 5. Is the improvement in Table 2 statistically significant across multiple seeds/runs, or is it within experimental noise in lower-difference settings? Fully AI-generated
Reflective Flow Sampling Enhancement Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper introduces RF-Sampling, a new inference-time sampling technique that enables trading off computational cost and output quality. The authors take inspiration from Z-Sampling, which introduces a process where a denoising step with a high guidance scale is followed by an inversion step with a low guidance scale to better align the noise with the desired semantics before proceeding with standard denoising. They adapt this idea for CFG-distilled flow matching models and propose a similar approach that applies CFG-like guidance directly on the text embeddings instead. The method is comprehensively evaluated across several text-to-image, text-to-video, and image-editing models, showing improved performance compared to prior techniques. - The paper is very well written and easy to follow. - The proposed algorithm is simple to understand and easy to implement. - The experiment suite is quite extensive with an impressive number of models, benchmarks, and ablations. - Since the proposed algorithm can be viewed as an inference-time scaling method, including a simple Best-of-$N$ baseline in the experiments would help better position the paper within the literature. For instance, according to Figure 17, performing RF-Sampling on all steps appears to yield the best results. With the $\alpha = 1$ setting used in the experiments, RF-Sampling requires three times as many forward passes as generating a single sample from the base model. Therefore, comparing this approach against a Best-of-3 baseline from the base model would be a useful addition. - The paper argues that Z-Sampling cannot be directly applied to CFG-distilled models such as FLUX. However, it is unclear why this would be the case, since these models typically distill a range of guidance values during training. As a result, one could, in principle, perform a variant of Z-Sampling by simply changing the guidance scale values during the denoising and inversion steps. - Following the previous point, Section 3.2 (Lines 203–205) states: “Flow models are typically trained only under conditional settings (Labs, 2024; Daniel Verdú, 2024). As a result, directly using CFG techniques or adopting an empty-text embedding as guidance for flow models is inappropriate.” According to the cited literature [1], this claim is inaccurate. To obtain a guidance-distilled model such as FLUX-dev, a standard flow matching model is first trained with text embedding dropout to enable guidance. This model is then CFG-distilled into a student that takes the guidance scale as input and outputs the guided velocity in a single forward pass. [1] Meng, Chenlin, et al. "On distillation of guided diffusion models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023. 1. Could the authors clarify why Z-sampling cannot be applied for FLUX-dev? Fully human-written
Mapping Overlaps in Benchmarks through Perplexity in the Wild Soundness: 3: good Presentation: 2: fair Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces benchmark signatures to quantify overlaps among LLM benchmarks by mining salient tokens from in-the-wild corpora where token-level perplexity predicts benchmark performance. Using 32 LLMs and 88 benchmarks, the authors apply Thrush correlation filtering and stepwise forward selection to extract signatures. The work examines three overlap levels: semantic similarity, performance correlation, and signatures. Key findings show signatures discriminate better than semantic or performance measures, performance correlations are biased by format and benchmark family, and coding is the least overlapping domain while logic, math, language, and instruction-following are interconnected. Despite the writing issues, I strongly appreciate the motivation of this work. I believe the problem this paper addresses is extremely important and has troubled many researchers working on large language model pretraining. The perspective that in-the-wild corpora implicitly encode the benchmark signature is quite interesting. The discovery that performance-based benchmark agreement is contaminated by format and family effects is valuable for the community. The community should encourage bold papers that tackle important problems. Soundness can be gradually supplemented during the rebuttal phase. This represents ambitious thinking we need for fundamental issues in LLM evaluation. With revisions addressing scalability and adding qualitative analysis of signatures, this will be a strong contribution. The motivation is strong. Benchmark submissions have grown sevenfold and we need methods to understand if new benchmarks measure distinct capabilities or just repackage existing ones. The insight that in-the-wild corpora implicitly encode benchmark signatures through perplexity patterns is interesting and provides a theoretical connection between pretraining exposure and benchmark performance. The experimental scope with 32 models and 88 benchmarks is substantial. The finding that performance correlations are more influenced by question format and benchmark family than actual function (Figure 1) exposes a real limitation in current benchmark agreement studies. The observation that coding benchmarks are relatively clean and distinct is useful. The statistical framework using Thrush correlation screening and AIC-based forward selection is reasonable for the ultra-high-dimensional setting. The writing has issues. It feels like the authors have limited writing experience. The abstract is not very clear and could benefit from referencing how other papers structure theirs. Also, the last line of the last page is not aligned to the bottom. This directly affects the clarity of presentation. My main concern is computational cost and scalability. The authors use only 1B tokens from RedPajama, then downsample by 50x to ~16.9M token contexts. In-the-wild pretraining corpora for large models are on the scale of tens of trillions of tokens. It is unclear how this work would handle larger scales. 
From Section A.5.1: - Initial scale: ~8.45 × 10^9 tokens (RedPajama 1B variant) - Downsampling: 1/50 - Final scale: ~1.69 × 10^7 tokens - Feature matrix: P ∈ R^(32 × 1.69×10^7) The complexity of generalizing this needs discussion. A previous work addressing similar scale is LESS [1]. Getting scaling behavior from 1B → 10B → 100B tokens seems quite difficult, but the authors should discuss this. The paper never shows what the actual salient tokens are. They extract ~30 tokens per benchmark but provide no examples. What tokens predict math performance? What about coding? Without this, the signatures remain black-box features. This is a significant missed opportunity. Methodological choices lack justification. The authors mention Lasso or Elastic Net but choose forward selection for "interpretability" without demonstrating this. No comparison with Spearman correlation or mutual information for screening. More ablations would help. The paper discusses two interpretations of overlap - genuine cognitive interdependence versus "leaky" benchmarks - but never resolves this. When instruction-following and logic overlap more than within-logic benchmarks, the authors could use signature tokens to distinguish these interpretations but do not. [1] Chen et al., "LESS: Selecting Influential Data for Targeted Instruction Tuning," ICML 2024. Can you provide analysis of how signatures change with corpus size? Even experiments with 100M, 500M, and 1B token subsets would show whether signatures stabilize. What is the computational cost and how does it scale to 10B or 100B tokens? Please show examples of actual salient tokens. What are the top tokens for MMLU math versus MBPP coding? Do they make intuitive sense? When instruction-following and logic show high signature overlap, what do the shared tokens look like? This could distinguish between genuine interdependence versus contamination. Have you validated signatures on held-out models not used in extraction? This would show signatures capture fundamental properties rather than overfitting. Heavily AI-edited
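A condensed sketch of the two-stage pipeline the review summarizes (correlation screening of per-token perplexities against benchmark scores, then greedy forward selection scored by AIC). The screening statistic, threshold, and data shapes are placeholders; the actual Thrush correlation and selection procedure may differ:

```python
import numpy as np

def screen_tokens(P, y, keep_frac=0.01):
    # P: (n_models, n_tokens) log-perplexities; y: (n_models,) benchmark scores.
    # Keep the tokens whose perplexity correlates most strongly with score.
    Pz = (P - P.mean(0)) / (P.std(0) + 1e-8)
    yz = (y - y.mean()) / (y.std() + 1e-8)
    corr = Pz.T @ yz / len(y)                     # per-token Pearson correlation
    k = max(1, int(keep_frac * P.shape[1]))
    return np.argsort(np.abs(corr))[::-1][:k]     # indices of candidate tokens

def aic(y, X):
    # AIC for an OLS fit of benchmark score on the selected token features.
    design = np.c_[np.ones(len(y)), X]
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    rss = float(resid @ resid) + 1e-12
    n, p = len(y), X.shape[1] + 1
    return n * np.log(rss / n) + 2 * p

def forward_select(P, y, candidates, max_feats=30):
    selected, best_aic = [], np.inf
    for _ in range(max_feats):
        scores = {j: aic(y, P[:, selected + [j]]) for j in candidates if j not in selected}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_aic:            # stop when AIC no longer improves
            break
        selected.append(j_best)
        best_aic = scores[j_best]
    return selected

# Toy scale: 32 models x 5,000 tokens (the paper screens ~1.7e7 token contexts).
rng = np.random.default_rng(0)
P = rng.normal(size=(32, 5000))
y = P[:, 7] * -0.8 + rng.normal(scale=0.1, size=32)   # score driven by one token
signature = forward_select(P, y, screen_tokens(P, y))
print(signature[:5])
```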
Mapping Overlaps in Benchmarks through Perplexity in the Wild Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces the concept of benchmark signatures that can be used to characterize and compare LLM benchmarks. The signatures are defined as a set of salient tokens drawn from large corpora such that the perplexity of various LMs on those tokens predicts their performance on a benchmark. The paper’s contributions are (1) a novel two-stage pipeline to derive “fingerprints” of a benchmark. The pipeline involves perplexity-based screening to identify candidate tokens, and then applying stepwise regression to select a sparse set of tokens most predictive of performance. (2) A framework for multi-level benchmark comparison. (3) Empirical insights into how current LLM benchmarks may be redundant or divergent. - The introduction of benchmark signatures is a new way to quantify what a benchmark tests. It introduces a principled multi-level framework for analyzing benchmark redundancy. - The idea of a misaligned benchmark intent – e.g., logical reasoning benchmarks ending up testing a model’s instruction-following capability – is a notable discovery worthy of further investigation. - The paper’s general hypothesis that pretraining familiarity affects performance has been noted in works on benchmark saturation. For example, Humanity’s Last Exam (HLE) was created specifically to address the saturation of MMLU by providing expert-written questions. The paper doesn’t mention HLE or similar broad evaluations that address saturation or overlapping knowledge issues. Phan, L., Gatti, A., Han, Z., Li, N., Hu, J., Zhang, H., ... & Wykowski, J. (2025). Humanity's last exam. arXiv preprint arXiv:2501.14249 - It appears the evaluation focuses on academic QA-style tasks (e.g., MMLU, Big-Bench Hard), many of which are multiple-choice or short-answer. This omits classes of benchmarks, such as open-ended generation tasks like summarization or dialogue, where the signature approach may be harder to apply because it inherently works better with well-defined quantitative performance metrics. This focus should be acknowledged as a limitation. - Some choices in the signature extraction pipeline could use further justification. The authors use a fixed threshold (top/bottom 1% by thrush correlation) for pre-filtering, and then a greedy forward selection up to 1500 tokens. Why 1%, and how was 1500 chosen as the maximum number of features? There’s no comparison to regularization methods like Lasso or Elastic-Net, which the authors mention as a possible extension but do not report results for. - A conceptual weakness is the paper’s implicit conflation of model capability and pretraining familiarity. By design, the signature tokens are those that models find easier (lower perplexity) and that correlate with better benchmark scores. This risks a tautology, since models may perform well on a benchmark because they have seen similar content before. For benchmarks aiming to test reasoning beyond memorized knowledge, a high signature overlap might indicate the benchmark is inadvertently easier for models that have seen certain clues. In other words, the approach might not distinguish generalization ability from training-set overlap. 
- If classical regression methods are ill-posed for this high-dimensional problem, did the authors consider regularization techniques or matrix factorization/filtering methods? The use of Lasso or Elastic Net regression is mentioned as an extension but is not explored in this paper. - For the THRUSHPREFILTER step, the paper mentions using "approximately the top 1% of tokens with the strongest signal” but doesn't specify the exact threshold. Is there a more principled data-driven procedure for selecting this threshold; for example, using cross-validation or analyzing the correlation distribution? - Relatedly, how sensitive are the results to the fixed variable choices, such as 1% thrush filtering and up to 1500 tokens used in forward selection? - While code is provided, certain details of the experimental setup are not explicit in the paper. For example, data processing details for extracting signatures from the RedPajama corpus would be helpful; e.g., how the corpus was tokenized, how large $d$ was, and how contexts for tokens were chosen. - The paper would be stronger if it included concrete examples of signature tokens or token clusters for a few benchmarks. Including a table of signature tokens for different benchmarks and their regression coefficients would enhance interpretability. Fully AI-generated
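Both of the preceding reviews ask why L1-regularized alternatives were not reported. The sketch below shows what such a comparison could look like with scikit-learn; the toy matrix, cross-validation settings, and `l1_ratio` are assumptions chosen for illustration, not the submission's setup or data.

```python
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV

rng = np.random.default_rng(1)
n_models, n_tokens = 32, 2000                          # toy stand-in for the 32 x ~1.69e7 matrix
P = rng.normal(size=(n_models, n_tokens))
y = 0.9 * P[:, 3] - 0.7 * P[:, 17] + 0.5 * P[:, 255] + 0.1 * rng.normal(size=n_models)

# L1-regularized alternative to greedy forward selection: the penalty strength is
# chosen by cross-validation instead of an AIC stopping rule.
lasso = LassoCV(cv=5, max_iter=50_000).fit(P, y)
enet = ElasticNetCV(cv=5, l1_ratio=0.9, max_iter=50_000).fit(P, y)

for name, model in [("lasso", lasso), ("elastic-net", enet)]:
    sig = np.flatnonzero(model.coef_)                  # the "signature" under this selector
    print(f"{name}: {len(sig)} selected tokens, e.g. {sig[:10]}")
```

Comparing the stability of the selected token set across such selectors (and across bootstrap resamples of the 32 models) would directly address the stability questions raised in both reviews.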
Mapping Overlaps in Benchmarks through Perplexity in the Wild Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces benchmark signatures: sets of salient tokens from in-the-wild corpora where LLM perplexity predicts benchmark performance. Using 32 LLMs and 88 benchmarks, the authors compare three overlap measurement approaches: semantic (question similarity), performance (score correlations), and signature-based (perplexity patterns). Key findings: (1) signature analysis better distinguishes benchmarks than semantic/performance approaches, (2) performance-level results are strongly influenced by benchmark-orthogonal factors, (3) coding is the most distinct domain while math/logic/language show cross-functional overlap. 1. I like the idea of signatures. The signature approach elegantly unifies semantic and performance perspectives: signatures inherently contain semantic information (salient tokens from meaningful contexts) while perplexity directly hints at performance. This is a creative and well-motivated synthesis. 2. The experiments are conducted at an impressive scale: 32 models × 88 benchmarks with multi-level analysis provides comprehensive coverage. 3. The theoretical grounding is good: the justification via SIS theory for ultra-high-dimensional feature selection is sound, and the two-stage filtering approach appropriately handles the d >> m problem. 1. All signatures are extracted from RedPajama only, yet the 32 models were trained on diverse corpora. It remains unclear whether RedPajama signatures are representative of these varied training distributions. Furthermore, proprietary models may rely on private training data containing unconsidered signatures. If signatures are biased, can they still meaningfully reflect model capabilities? Robustness experiments are needed to validate the approach. 2. The claim that signature correlations reveal "model capabilities" is questionable. Two alternative explanations exist: (a) High within-dataset vs. between-dataset correlations may simply reflect training data contamination, as the authors acknowledge; (b) The same dataset may inherently represent a single capability independent of intended subject divisions; for instance, if all MMLU subjects are sourced from Wikipedia, signatures may capture "Wikipedia-style capability" (or even spurious correlation) rather than genuine subject-specific skills (history vs. chemistry). The perplexity-performance link could be coincidental. Could you provide more causal analysis on this point? See weakness. Lightly AI-edited
Mapping Overlaps in Benchmarks through Perplexity in the Wild Soundness: 3: good Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper exposes a critical flaw in benchmark categorization: benchmark overlap is dominated by superficial factors like question format and benchmark family rather than the actual capabilities measured, and it proposes "benchmark signatures" to reliably describe benchmark features. Key findings from this analysis include: - Logic, math, IF and language form an interconnected cluster - Coding stands distinctly alone from other domains - Many "logic" benchmarks actually measure instruction-following - Multilingual / cultural benchmarks show low within-category similarity The takeaway is that current evaluation methods conflate test design with capability. Signature-based analysis reveals what LLMs actually learn. 1. The benchmark signature concept is novel and elegantly bridges pretraining data characteristics and benchmark performance through perplexity. The mechanistic view (rather than correlational view) reveals intrinsic capabilities that benchmarks measure rather than superficial artifacts. 2. The paper repurposes proven perplexity-correlation methods from the data selection literature for benchmark characterization and provides rigorous theoretical justification through the SIS framework. It explains why the proposed method works through rigorous theoretical analysis. 3. The paper reveals that cross-function overlap (instruction following vs. logic) exceeds within-function overlap in some cases, shows the interesting observation that some benchmarks designed to test "logic" actually measure instruction following, and this further demonstrates the power of the methodology, which leads to non-trivial findings. 1. Table 1 shows the token level has the greatest standard deviation and interquartile range, and the paper claims this indicates the token level is the most informative. However, this is not necessarily true: the higher variance could also originate from larger noise. It would be better to show a signal-to-noise ratio analysis (like R^2) for the token vs. chunk vs. document levels. 2. All analysis uses the same 32 models for extraction and comparison. Cross-validation and held-out test results are absent. 3. Across 16 targets, token-level values achieved 12 wins, which means a 25% loss rate. Please provide analysis of when and why the token level loses. 4. Forward selection with AIC is a greedy procedure known to be suboptimal; why not use LASSO, elastic net, etc.? Could you provide a stability analysis of the feature selection? 1. Can signatures from 24 models predict the performance ranking of the held-out 8 models on benchmarks (a held-out test for predictive validation)? 2. What are examples of actual salient tokens for specific benchmarks? Do they align with human intuition? 3. Please add confidence intervals to Fig. 4 and conduct pairwise tests for within-category vs. cross-category differences. Fully human-written
Weak Correlations as the Underlying Principle for Linearization of Gradient-Based Learning Systems Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper offers a nice unified explanation for NTK-style linearization (weak derivative correlations) and a careful asymptotic calculus to make that explanation precise. It provides some helpful intuition for understanding the driver of the lazy regime vs. feature-learning regimes. - Clear unifying idea. The paper puts forward weak derivative correlations, small correlations at initialisation between the first and higher-order parameter derivatives, as the underlying mechanism for NTK-style linearization, with precise definitions ($C_{D,d}$) and two equivalence theorems (Theorems 3.1 and 3.2) tying correlation decay to linearised dynamics and learning-rate scaling. This gives a compact, testable lens on "lazy training." - Technical framework. Section 2 builds a random-tensor asymptotics calculus (subordinate tensor norm + stochastic big-O, uniform bounds), and proves existence/uniqueness of a tight upper bound. - Deviation-over-time statement. Corollary 4.1 bounds the SGD deviation $F - F_{\text{lin}}$ by $O(1/m(n))$ over (finite) time under an exponential NTK-phase contraction assumption, making the linearisation guarantee feel more operational. - Attempt at architectural breadth. By leaning on tensor programs, the authors argue many wide architectures satisfy weak correlations (with rates tied to activation derivatives, Equation 22), offering a route to reason about how architectural choices and learning-rate reparametrization $\eta \mapsto r(n)\eta$ push systems toward or away from linearisation. - All experiments use tiny subsets of MNIST, CIFAR-10, and Fashion-MNIST; fully connected nets; MSE loss; and an NTK-normalised learning rate with long training (1,000 epochs). These datasets/architectures in this setup typically do not require rich feature learning and are well known to be close to the lazy/NTK regime already. Thus, showing that the relative discrepancy $|f - f_{\text{lin}}|$ shrinks with width (Figure 1) and that estimated 2nd/3rd-order correlation proxies decrease (Figures 2-3) does not validate the central predictive claim in regimes where feature learning matters. That is, nowhere do the experiments validate that networks known to feature-learn do not also have weak correlations between the first and higher-order derivatives. I appreciate that it is hard to compute the derivatives with large amounts of data (where feature learning typically happens), but there are toy models that show feature learning with relatively small datasets (e.g., sparse parity, multi-index, staircase functions). 1. Main question (feature learning regime). Can you evaluate on feature-learning regimes (even on small datasets as described in the weaknesses), and show that your correlation diagnostics measured at initialisation predict the gap between finite-width training and NTK? 2. Learning-rate scaling predictions. Theorem 3.2 claims reparametrising $\eta \mapsto r(n)\eta$ modulates linearity. Can you add experiments that sweep $r(n)$ with width to confirm the predicted $O(r(n)/m(n))$ deviation and the $O((1/\sqrt{m(n)})^d)$ decay of $C_{D,d}$? 
Moderately AI-edited
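For readers who want to see the quantity these reviews keep returning to, the sketch below trains a small two-layer network and its first-order Taylor expansion around initialization side by side, and reports the relative gap |f - f_lin| as the width grows. The architecture, data, widths, and learning rate are illustrative choices under NTK-style 1/sqrt(m) scaling; they are not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))            # toy regression targets

def init(m):
    return rng.normal(size=(m, d)), rng.normal(size=m)

def forward(W, a, X):
    h = np.tanh(X @ W.T)                       # hidden activations, shape (n, m)
    return h @ a / np.sqrt(len(a)), h

def jacobian(W, a, X):
    """Per-example gradient of f w.r.t. (W, a), flattened: shape (n, m*d + m)."""
    m = len(a)
    h = np.tanh(X @ W.T)
    dW = (a * (1 - h ** 2))[:, :, None] * X[:, None, :] / np.sqrt(m)
    da = h / np.sqrt(m)
    return np.concatenate([dW.reshape(len(X), -1), da], axis=1)

def train_gap(m, lr=0.1, steps=2000):
    W0, a0 = init(m)
    f0, _ = forward(W0, a0, X)
    J = jacobian(W0, a0, X)                    # frozen features of the linearized model
    W, a = W0.copy(), a0.copy()
    delta = np.zeros(J.shape[1])               # parameters of f_lin = f0 + J @ delta
    for _ in range(steps):
        f, h = forward(W, a, X)
        g = 2 * (f - y) / n                    # dL/df for MSE
        grad_a = h.T @ g / np.sqrt(m)
        grad_W = ((a * (1 - h ** 2)) * g[:, None]).T @ X / np.sqrt(m)
        W -= lr * grad_W
        a -= lr * grad_a
        f_lin = f0 + J @ delta
        delta -= lr * (J.T @ (2 * (f_lin - y) / n))
    f, _ = forward(W, a, X)
    f_lin = f0 + J @ delta
    return np.linalg.norm(f - f_lin) / (np.linalg.norm(f) + 1e-12)

for m in [16, 64, 256, 1024]:
    print(f"width {m:5d}: relative |f - f_lin| = {train_gap(m):.4f}")
```

The same harness could be pointed at a feature-learning toy task (sparse parity, a multi-index target) to probe the reviewers' main question: whether the gap, and the correlation diagnostics at initialization, actually move together outside the lazy regime.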
Weak Correlations as the Underlying Principle for Linearization of Gradient-Based Learning Systems Soundness: 3: good Presentation: 1: poor Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper tries to connect the linearity of training wide neural networks with the weak correlations between the first and higher-order derivatives. The authors propose a novel formalism for this purpose by investigating the asymptotic behavior of random tensors, which might be generalized to other cases in machine learning. The framework of this paper is generally novel. The goal of this paper is clear and straightforward. To this end, the authors are able to develop a sophisticated asymptotic theory for random tensors, which might be of independent interest. This formalism is applied fairly consistently across the proofs, providing a uniform treatment of various architectures. 1. I think the presentation is clearly below the bar for ICLR, which sometimes makes the paper challenging to read. Below I list some examples: - line 278 "under the conditions described above". I'm confused about the exact conditions meant here: there does not appear to be any condition stated above this line. Similar issues exist in Theorem 3.2, line 298. - line 279 "sufficiently small learning rate $\eta < \eta_{the}$". How is this $\eta_{the}$ obtained? And how small should it be? - line 294: I think a "for example" should not appear in a formal theorem statement. - lines 405-407 "then exists some 0< T, such as for every s =1 ... S, if:". I'm confused by this statement: - "$0 < T$" is trivially true, so why "exists"? - "such as for every $s$": why "such as" here? "Such as" should not appear in a corollary. - Again, how is $\eta_{cor}$ obtained here? Is it the same as $\eta_{the}$ in Theorem 3.2? 2. The authors do not sufficiently justify that existing tools are inadequate for deriving the main results, which makes the motivation for designing a new formalism a bit unclear. In addition, the overwhelming technical machinery in fact raises a barrier for readers trying to appreciate the methodology, and I was only able to read the proofs at a high level. 3. The authors make a new and even bold claim, yet the empirical validation is far from sufficient to support it. 4. Indeed, the framework is general. However, the core insight seems to be a deep reformulation of existing knowledge (lazy training, NTK, infinite-width limits). From this perspective, I think the contribution is limited. Please see the first point of the Weaknesses. Fully human-written
Weak Correlations as the Underlying Principle for Linearization of Gradient-Based Learning Systems Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a criterion, that of having weak correlations between the first derivative of the model and the higher derivatives, and shows that it is equivalent to the model being in a linear/NTK regime where one can take a Taylor approximation around initialization. Thus any model with weak correlations can be proven to converge exponentially fast, and wide DNNs can be thought of as just a special case of this result. This is, to my knowledge, the first equivalent condition for NTK-type linearization, and it offers a rather novel point of view. Other conditions, such as the ratio of the Hessian norm to the gradient norm, were merely sufficient conditions. This aligns well with an intuition I (and others) had: proving NTK dynamics by looking at a ball around initialization often yields loose rates; instead, one has to leverage the fact that the parameters typically move along directions where the NTK moves little. Some parts of the paper are very technical and a bit hard to follow, as are some parts of the discussion. Also, the criterion is probably very hard to compute in practice because it involves high-dimensional objects (higher-order derivatives). It seems that it would be easier in practice to simply compute how much the NTK moves rather than computing these correlation values. - It seems that you have missed a paper that is very closely related (it is very similar to Dyer & Gur-Ari 2019): https://arxiv.org/abs/1909.08156. I wonder how close your condition is to assuming that the higher-order terms vanish in the Neural Tangent Hierarchy defined in that paper. Fully human-written
SELU: Energy-based Targeted Unlearning in LLMs Soundness: 1: poor Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper introduces SELU, a parameter-efficient framework for LLM unlearning. SELU addresses a key challenge in unlearning, balancing knowledge removal with the preservation of model confidence on retained data, by combining Low-Rank Adaptation (LoRA) with an energy-based objective optimized via straight-through estimators. The method raises the energy of forget examples while maintaining low energy for retain examples, creating a stable and effective forgetting signal. Experiments on the TOFU benchmark demonstrate that SELU achieves better forgetting vs. utility trade-offs than existing suppression-based unlearning methods, while maintaining coherent and contextually appropriate outputs. - [S1] **Interesting direction.** The paper explores the intersection of energy-based modeling and unlearning, which is a relatively underexplored and potentially promising area for improving controllability in LLMs. - [S2] **Parameter efficiency.** The attempt to integrate LoRA within the unlearning framework is appealing from a practical standpoint, as full fine-tuning is often computationally expensive and resource-prohibitive for large models. - [W1] **Incoherence in motivation and contribution.** The paper repeatedly emphasizes the *learning rate mismatch* between pretraining and unlearning as a key motivation but does not convincingly demonstrate why this constitutes a fundamental challenge. If the issue arises primarily due to conservative fine-tuning, it is unclear why full fine-tuning with the same learning rate would not address the concern. Furthermore, the paper does not clearly connect how SELU specifically resolves this mismatch beyond proposing a new loss formulation. Without a well-defined link between the identified challenge and the proposed solution, the overall narrative feels fragmented and conceptually incomplete. - [W2] **Unjustified design choices and unclear optimization rationale.** The SELU loss is described as a weighted combination of four different objective terms, yet Section 4 only provides surface-level intuition for each term rather than theoretical or empirical justification for their inclusion and relative weighting. It remains unclear how the energy-based formulation aligns with the stated goal of likelihood-scale alignment or how the optimization is expected to achieve a better balance between forgetting and retention. Moreover, the paper’s claim that SELU mitigates “leakage through partial or paraphrased generations” (Lines 272–277) is unsubstantiated: since losses are computed token-wise over exact samples, it is difficult to see how generalization to paraphrases is explicitly addressed. - [W3] **Concerns about experimental rigor and overfitting.** The method introduces several hyperparameters (multiple loss weights $\lambda$’s, thresholds $\tau$’s, and energy margins) that are not systematically analyzed. This design opens the door to potential overfitting and makes it unclear whether the observed performance gains generalize across settings. The mention of training instabilities in Section 6 further reinforces this concern. 
To strengthen the empirical validity, a sensitivity analysis or ablation on key hyperparameters would be necessary to demonstrate the robustness of SELU’s claimed advantages. - [Q1] The list of contributions in Section 1 cites forget quality and utility scores near 0.3, but the paper does not contextualize what these numbers represent. Could the authors provide comparative baselines or prior work results to help interpret these values? - [Q2] Table 1 appears incomplete: the discussion mentions red-highlighted tokens and coherent completions, yet the table itself contains truncated sentences and no visible highlighting. Could the authors clarify or provide full examples? Heavily AI-edited
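Several of the reviews of this submission question how the energy terms and the straight-through estimator fit together. The sketch below shows only the generic recipe: an energy head applied to straight-through token samples, with a margin pushing the forget energy above the retain energy. The module, margin value, and loss weighting are invented for illustration and do not reproduce SELU's actual four-term objective or its LoRA setup.

```python
import torch
import torch.nn.functional as F

class EnergyHead(torch.nn.Module):
    """Maps (soft) token distributions to a scalar energy via an embedding table and MLP."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.mlp = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(),
                                       torch.nn.Linear(dim, 1))

    def forward(self, token_probs):            # token_probs: (batch, seq, vocab)
        emb = token_probs @ self.embed.weight  # soft embedding lookup
        return self.mlp(emb.mean(dim=1)).squeeze(-1)

def straight_through(logits):
    """Hard one-hot sample in the forward pass, soft gradient in the backward pass."""
    probs = F.softmax(logits, dim=-1)
    hard = F.one_hot(probs.argmax(dim=-1), logits.size(-1)).float()
    return hard + probs - probs.detach()

def energy_margin_loss(energy_head, forget_logits, retain_logits, margin=1.0):
    e_forget = energy_head(straight_through(forget_logits))
    e_retain = energy_head(straight_through(retain_logits))
    # push forget energy up relative to retain energy, and keep retain energy near zero
    return F.relu(margin - e_forget + e_retain).mean() + e_retain.pow(2).mean()

# toy usage with random logits standing in for model outputs
head = EnergyHead(vocab_size=100, dim=32)
forget_logits = torch.randn(4, 10, 100, requires_grad=True)
retain_logits = torch.randn(4, 10, 100, requires_grad=True)
loss = energy_margin_loss(head, forget_logits, retain_logits)
loss.backward()
print(float(loss))
```

Even this stripped-down version makes the reviewers' concern visible: the energy head is itself trained, so whether its landscape reflects genuine forgetting, rather than an arbitrary learned separation, has to be checked empirically.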
SELU: Energy-based Targeted Unlearning in LLMs Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a new LLM unlearning method, SELU, which involves an energy-based model over the forget data (high energy) and retain data (low energy). It uses straight-through estimators to elevate the energy of forget data while keeping retain data at low energy. The experiments show that it outperforms baselines on the TOFU benchmark. * Involving an energy-based model in LLM unlearning is an interesting idea. * Complete ablation study experiments. The ablations are complete, with all components involved. * Optimization instability concern. The proposed training involves six loss terms, which seem hard to keep in balance. Section 6 also mentions optimization instability across different learning rates when performing unlearning on the forget10 subset. * Limited evaluation setup. The experiments only involve the TOFU dataset, with synthetic data about fictitious authors. Involving real-world knowledge benchmarks like RWKU and WMDP is necessary. * Unsupported energy landscape shape. While Figure 1 shows an idealized energy landscape for the forget and retain data distributions, there is no evidence supporting this ideal case for the unlearned LLM in the TOFU experiments. [1] RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models [2] The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning * Why only LoRA training instead of full-model training? As previous works suggest, LoRA training generally performs worse than full-model unlearning. Fully human-written
SELU: Energy-based Targeted Unlearning in LLMs Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a novel parameter-efficient framework for machine unlearning in large language models (LLMs), named Straight-through Energy Language Unlearning (SELU). The method integrates energy-based modeling with Low-Rank Adaptation (LoRA) to selectively remove specific knowledge while maintaining model utility. By leveraging straight-through estimators, SELU projects discrete token outputs into a differentiable energy function, assigning high energy to forget examples and low energy to retain examples. Experiments on the TOFU benchmark using LLaMA-2-7B show that SELU achieves superior forgetting–utility trade-offs and generates coherent, context-preserving responses. 1. The proposed SELU framework introduces a novel combination of energy-based modeling and Low-Rank Adaptation (LoRA) for parameter-efficient unlearning. 2. The use of straight-through estimators to connect discrete token outputs with continuous energy functions is technically innovative and enables fine-grained control. 3. This paper is well-written and easy-to-follow. 1. The theoretical justification for using energy-based modeling in the unlearning context is underdeveloped. The connection between “energy elevation” and “knowledge removal” remains mostly empirical. 2. The stability and convergence properties of the straight-through estimator (STE) are not well analyzed. Gradient variance and potential optimization bias could significantly affect performance. 3. The experiments are limited to the TOFU benchmark and a single model (LLaMA-2-7B). see above Fully AI-generated
SELU: Energy-based Targeted Unlearning in LLMs Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces Straight-through Energy Language Unlearning (SELU), a targeted and parameter-efficient method for removing specific knowledge from large language models (LLMs) without retraining from scratch. SELU combines Low-Rank Adaptation (LoRA) with an energy-based learning objective guided by Straight-Through Estimators (STEs), including Gumbel–Softmax and straight-through argmax variants. Its key idea is to assign high energy to examples that must be forgotten and low energy to retained data, explicitly shaping the energy landscape so that forgetting is targeted and confidence on desired knowledge is restored. The framework introduces four coordinated loss terms: push-up/push-down energy separation, pairwise margin, calibration, and coupling losses, which collectively maintain a balance between forgetting precision and overall model coherence. The use of an EBM for LLM unlearning is new to me. The paper introduces an energy-based unlearning mechanism using straight-through estimators, bridging discrete token generation and differentiable optimization. I am not very familiar with EBMs; could the authors explain in Sec. 3 why an EBM is suitable for LLM unlearning? In the first paragraph, the authors state the drawbacks of GD-based unlearning as 1) requiring the design of a forget loss and 2) the interaction with the retain term. How does the EBM address these issues? From my understanding, an EBM still requires the design of an energy function, and the proposed SELU loss also involves interaction between unlearning and retention. The architecture and loss design are too specific. From the reader’s perspective, I cannot clearly figure out the motivation and usefulness of each part. Overall, it is hard for me to understand the significance of the proposed framework; the authors mainly try to apply an EBM to unlearning, but the contribution of each part is unclear (although the authors try to state some contributions in the introduction, they are not directly linked to Sec. 4). The mapping from model responses to energy scalars requires an MLP to be trained. How can the authors guarantee that such a learned energy is meaningful for unlearning and retention? Is it possible that, if we fix the base model and only train $W_q$ and $W_k$, we can still minimize $\mathcal{L}$? In that case, the overall framework would be meaningless and stochasticity would explain the reported good performance. This is also reflected by the limitations mentioned in Sec. 6. The authors only conducted experiments on the TOFU benchmark with LLaMA-2-7B, which is not enough to show general superiority. The results also look odd to me; for example, it is counterintuitive that GA outperforms NPO in model utility, as in Fig. 2 (left). The authors claim addressing instability as one of the contributions in the introduction, yet state in Sec. 6 that their proposed method is unstable. I think the authors should carefully polish the paper, think about the real contribution, make the description clearer, and conduct more experiments. More ablation studies are also required, e.g., over other hyperparameters and with vs. without LoRA. The related works are not up-to-date. 
Please try to add more recent works [1-3]. [1] Rethinking Unlearning for Large Reasoning Models [2] LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics [3] Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs How does EBM overcome the drawbacks of gradient-descent (GD)-based unlearning mentioned earlier—namely, (1) the need to design a forget loss and (2) the complex interaction with the retain term? Isn’t EBM still subject to similar design choices and term interactions through the proposed energy function and SELU loss components? Explain the intuition behind each component and its necessity. Clarify the overall significance of SELU beyond simply applying EBM concepts to unlearning. Strengthen the link between the claimed contributions in the introduction and the details in Sec. 4. How do the authors ensure that the MLP mapping from model responses to energy values produces meaningful supervision for forgetting and retention? Could the observed improvements be due to stochastic noise rather than genuine learning? Only one benchmark (TOFU) and one model (LLaMA-2-7B) are used, which is insufficient to demonstrate generalizability. Lightly AI-edited
From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes TokenCast, an LLM-driven framework for context-aware time series forecasting based on symbolic discretization. Instead of processing continuous numerical values directly, the authors convert time-series data into discrete temporal tokens via vector quantization and reversible normalization, enabling the model to operate in the same token space as textual inputs. By extending the vocabulary of a pre-trained LLM, the method aligns time-series and text representations within a shared semantic space, allowing joint reasoning through next-token prediction. Extensive experiments on six context-rich datasets (economic, health, web, and stock domains) show that TokenCast achieves competitive or superior results compared with strong baselines such as Time-LLM, GPT4TS, and Crossformer. Ablation and sensitivity studies confirm the effectiveness of the proposed tokenization and alignment strategies. Overall, the paper offers a novel perspective on unifying numerical and textual modalities under the LLM generative paradigm, though the baseline coverage could be broader and the efficiency analysis remains limited. - **Proper positioning within current research trends.** The paper is aligned with the recent movement toward symbolic or token-based time-series modeling, showing that the authors are aware of ongoing developments in the field. - **Well-organized framework.** The three-stage pipeline (tokenization, alignment, and generative prediction) is logically structured and easy to follow. - **Readable presentation.** The writing is clear, and figures effectively illustrate the workflow. - **Lack of novelty relative to existing work.** The proposed vector quantization and tokenization strategy is highly similar to the approach used in Amazon’s Chronos model, which also discretizes numerical sequences into symbolic tokens for autoregressive forecasting. Several recent works (e.g., Chronos, Chronos-Bolt, SymbolicTS, and TokenTS) have already explored nearly identical ideas. The paper does not clearly differentiate itself in methodology or theoretical contribution, making the innovation appear incremental. - **Lack of clear evidence for multimodal gains.** Although the paper emphasizes context-aware forecasting, it does not clearly show how textual or non-temporal modalities enhance numerical prediction. Many so-called multimodal datasets contribute little meaningful contextual signal, and in some cases may even introduce data leakage risks. - **Overreliance on existing LLM architecture.** The contribution lies primarily in applying tokenization to an existing LLM rather than introducing a new modeling principle or objective. - **Efficiency and scalability not evaluated.** Tokenization and vocabulary extension introduce additional computation, but the paper provides no analysis of training or inference cost. 1. Could the authors provide clearer **evidence that multimodal context actually improves forecasting performance**? 
For example, are there quantitative comparisons between using and omitting textual/contextual inputs, or analyses showing which modalities contribute the most? 2. The vector quantization approach appears similar to that used in **Chronos**. Could you clarify the methodological or empirical differences? 3. Are the reported results averaged across **multiple random seeds** for reliability? 4. Can you provide **runtime, memory, or parameter comparisons** to support the claimed efficiency? 5. How sensitive is performance to the size of the token vocabulary or the choice of LLM backbone? Fully AI-generated
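As background for the Chronos comparison raised in the review above, here is a minimal sketch of value-to-token discretization with lookback-only (and therefore reversible) normalization and a nearest-code lookup. The fixed uniform codebook is a stand-in assumption; TokenCast's learned VQ tokenizer and encoder are not reproduced here.

```python
import numpy as np

def tokenize(series, codebook):
    """Normalize with lookback-only statistics, then map each value to its nearest code."""
    mu, sigma = series.mean(), series.std() + 1e-8     # statistics from the history only
    z = (series - mu) / sigma
    tokens = np.abs(z[:, None] - codebook[None, :]).argmin(axis=1)
    return tokens, (mu, sigma)

def detokenize(tokens, codebook, stats):
    mu, sigma = stats
    return codebook[tokens] * sigma + mu                # reversible up to quantization error

rng = np.random.default_rng(0)
history = np.sin(np.linspace(0, 6, 96)) + 0.05 * rng.normal(size=96)
codebook = np.linspace(-3, 3, 256)                      # fixed uniform codebook as a stand-in
                                                        # for a learned VQ codebook
tokens, stats = tokenize(history, codebook)
recon = detokenize(tokens, codebook, stats)
print("discrete token ids:", tokens[:8])
print("max reconstruction error:", np.abs(recon - history).max())
```

The difference between such a fixed binning scheme (Chronos-style) and a learned codebook with a trained encoder is exactly the methodological gap the reviewer asks the authors to spell out.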
From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper studies context-aware time series forecasting, where the goal is to predict future multivariate trajectories from historical signals together with auxiliary contextual information such as textual event or domain descriptions. The proposed framework, TokenCast, discretizes time series via a VQ-style tokenizer with reversible instance normalization (to avoid future leakage), injects these discrete indices into the shared vocabulary of a frozen large language model through a learned unified embedding layer that aligns time-series tokens and text tokens, and then generatively fine-tunes the model to autoregressively produce future tokens that are decoded back to continuous values. The approach is presented as a unified pipeline that allows an LLM backbone to consume numeric history and contextual signals without altering its core architecture beyond the shared embedding layer. The method is evaluated on six real-world datasets spanning economics, public health/mobility, web traffic, stock markets, and environmental sensing using MSE/MAE across multiple horizons and baselines, and the paper reports lower errors on most datasets plus ablations linking gains to alignment, generative training, and contextual conditioning. 1. The paper formulates context-aware forecasting as conditional sequence generation by mapping multivariate time series into discrete tokens, aligning them with text tokens in a shared LLM vocabulary, and autoregressively generating future trajectories. 2. The method includes reversible instance normalization using only historical context and a shared codebook and encoder-decoder, which keep the tokenization invertible. 3. Experiments span six real-world domains and compare against LLM-based, Transformer-based, linear, and self-supervised forecasting baselines, reporting averaged MSE/MAE over multiple horizons. 1. Dataset descriptions contain internal inconsistencies (e.g., the Economic dataset is described as daily in the main text but as monthly macroeconomic data in the appendix), which obscures the exact sampling frequency and temporal structure assumed in training and evaluation. 2. The reported MSE/MAE averages lack standard deviations, confidence intervals, or significance tests, which limits assessment of robustness when baseline performance is numerically close. 3. The paper only sketches how contextual features are constructed, temporally aligned, and used at inference time, and this under-specification affects reproducibility and the scope of claims about context-driven forecasting. 4. Figure/table references are inconsistent. In Section 4.1.1, the panel summarizing domains, frequencies, lengths, and variable counts is captioned as Figure 3 but referred to as Table 3. 1. Can you provide robustness or failure-case analysis, for example, regimes such as market shocks, policy changes, or abrupt environmental shifts where the approach does not reduce error relative to baselines? 2. The method conditions on contextual features. 
For identical historical numeric input, can you show how adding and removing specific contextual signals changes the generated forecast and explain how those changes reflect the contextual content? Fully AI-generated
From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes TokenCast, an LLM-driven framework for context-aware time series forecasting, which consists of three stages: time series tokenizer, modality alignment, and supervised fine-tuning. Experimental results show that TokenCast achieves strong performance. 1. The paper introduces a novel LLM-driven framework, named TokenCast, for time series forecasting by leveraging LLMs to utilize unstructured contextual information. 2. The paper is clearly written and well-organized, making it easy to follow the main ideas. The methodology is technically sound and clearly explained. 1. The discussion of related work on contextual information integration could be strengthened. While many existing approaches incorporate numeric contextual signals to enhance forecasting, the integration of unstructured contextual information requires cross-modal alignment strategies. Several recent studies have explored this direction; however, this emerging line of work is not sufficiently discussed or contrasted with TokenCast. 2. In line 181, the paper states that RevIN may risk leaking future information. However, this claim might not be fully justified, as RevIN typically computes normalization statistics (e.g., mean and standard deviation) based only on the lookback window within the input sequence. 3. In line 82, the paper states that it is unclear whether time series forecasting can be addressed through autoregressive generation over discrete tokens. However, this direction has been explored in prior work. For example, Chronos and AutoTimes both employ a decoder-only architecture and transform numeric time series into discrete tokens via value-based quantization. 1. I am interested in how the model's performance would change if it only outputs time series tokens, instead of a mixture of time series and textual tokens. 2. I am confused by the organization of the input tokens. In the text, the paper states that time series tokens are placed in front of textual tokens. However, in Figure 2, the textual tokens appear in front of the time series tokens, which seems inconsistent. 3. In the stage of the time series tokenizer, TokenCast employs a TCN as a causal encoder. The choice of convolution kernel length is likely to have a significant impact on performance, and it would be helpful to include an ablation study to examine this effect. Lightly AI-edited
From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces TokenCast, a novel framework for context-aware time series forecasting via symbolic discretization. By transforming continuous time series data into discrete tokens and embedding them into a semantic space shared with contextual features, TokenCast leverages the generative and reasoning capabilities of pre-trained LLMs. The proposed approach demonstrates superior performance across various real-world datasets and provides a new perspective on integrating time series data with contextual information. - The framework is well-motivated and clearly presented. - The proposed method is extensively evaluated on various real-world datasets, covering diverse domains such as healthcare, finance, and environmental monitoring. TokenCast consistently outperforms existing baselines. - The paper claims to leverage the modeling and reasoning capabilities of LLMs, which are generally associated with larger-scale models. However, the experiments primarily rely on a relatively small LLM (Qwen2.5-0.5B). This raises questions about whether the claimed reasoning capabilities are being fully utilized and whether such a small-scale LLM can truly demonstrate the generative and reasoning power the framework aims to exploit. The choice of model size contradicts typical expectations for LLM usage and requires further explanation. - The multi-stage training process (e.g., symbolic discretization, cross-modal alignment, and generative fine-tuning) introduces significant computational overhead. It seems that each stage requires separate optimization. The training cost and efficiency of this approach compared to existing baselines are not adequately discussed. - The overall design of the framework lacks novelty. For example, as mentioned in Line 186, normalization is a standard component in many existing models. Additionally, the contextual feature selection is similar to the existing Time-LLM. - In Line 191, there is a T that denotes the number of latent vectors, but there is no explanation of how T is determined or computed. Could the authors provide more details of the encoder? Lightly AI-edited
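One of the reviews above notes that the tokenizer uses a TCN as a causal encoder, and the review just above asks how the number of latent vectors T is determined. The sketch below shows one plausible reading: with left-padded causal convolutions and stride-2 downsampling, T is fixed by the lookback length and the encoder depth. All layer sizes, kernel lengths, and strides here are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at past timesteps (left padding)."""
    def __init__(self, c_in, c_out, kernel_size, dilation=1, stride=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, stride=stride, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

# With stride-2 blocks, a history of length L yields roughly T = L / 2^depth latent vectors,
# which is one plausible way the number of time-series tokens could be set.
encoder = nn.Sequential(
    CausalConv1d(1, 32, kernel_size=3, dilation=1, stride=2), nn.ReLU(),
    CausalConv1d(32, 32, kernel_size=3, dilation=2, stride=2), nn.ReLU(),
    CausalConv1d(32, 64, kernel_size=3, dilation=4, stride=2),
)
x = torch.randn(8, 1, 96)                        # (batch, variables, lookback length)
latents = encoder(x)
print(latents.shape)                             # torch.Size([8, 64, 12]), i.e. T = 12
```

Under this reading, the kernel length affects the receptive field but not T, whereas the stride and depth set T directly; the reviews' requests for encoder details and kernel-length ablations would settle which knobs actually matter.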
Learnable Spiking Neural P System with Interval Excitation Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a model named "Learnable Spiking Neural P System with Interval Excitation" (LSNP IE), which aims to address two core challenges of traditional Spiking Neural P (SN P) systems when processing real-world data: their limited expressive capacity and the non-differentiable nature of their excitation mechanism. The paper tries to discuss three key innovations: 1. Interval Excitation Mechanism: The traditionally strict point-triggered firing condition (where the potential must exactly equal the threshold) is relaxed to an interval. This improves the model's robustness and ensures continuous information flow when handling real-valued (floating-point) data. 2. Potential Adjustment Module: The paper introduces two normalization-like modules (M1 and M2) to align the input and residual potentials and to shift the fused potential's distribution towards the excitation interval, ensuring firing stability and effective learning. 3. Surrogate Gradient-based End-to-End Training: The Surrogate Gradient (SG) method is employed to handle the non-differentiable parts of the firing function, enabling end-to-end backpropagation-based training for the entire network. The authors validate their method on neuromorphic datasets like N-MNIST and MNIST-DVS, reporting competitive performance compared to traditional non-spiking and spiking models. The main contribution of this work lies in bridging the gap between the theoretically-oriented SN P systems and modern deep learning training frameworks, demonstrating the feasibility of applying such models to vision tasks. 1. Problem-Driven Design. The proposed "interval excitation" and "potential adjustment" modules are well-thought-out. The interval excitation directly addresses the issue of point-triggered firing having near-zero probability in a continuous domain. The potential adjustment module counters the "distributional collapse" or "distributional drift" issues during training, which is crucial for stable learning. The effectiveness of these modules is also validated through ablation studies. 2. Clear Structure. The paper is well-structured, logically flowing from background to method, experiments, and conclusion. The appendix provides a detailed introduction to WSN P systems and supplementary experiments on the necessity of the potential adjustment module, all of which strongly support the reader's understanding of the paper's core ideas. 1. For the broader audience, SN P systems are a relatively niche concept. The introduction and related work mention that SN P systems possess a "parallel and distributed architecture" and "modularity," hinting at their potential for hardware implementation. However, the paper fails to articulate more concretely what unique advantages this architecture offers over more established SNNs or traditional ANNs for tasks like image classification. This lack of justification might leave readers questioning the necessity of introducing the complexity of P systems. 2. The potential adjustment module is formally very similar to Batch Normalization or Layer Normalization. 
While the ablation study proves its necessity, the paper could provide a deeper discussion of its design motivation. Is it merely an engineering trick to "push" the potential values into the firing interval? Did the authors consider other, simpler methods (like clipping or simpler scaling/shifting)? Clarifying this would help in understanding whether this module is a general-purpose solution within the SN P framework. 3. There is something of a disconnect between the paper's core motivation and its experimental design. It argues that SN P systems offer advantages beyond standard SNNs, but then evaluates the model exclusively on image classification, a task where SNNs are already well established and highly effective. I think the experiments thus fail to answer the crucial question: why should one use the SN P framework if it does not showcase an experimentally unique capability (e.g., in computational modeling or structured problem-solving) that would justify its additional complexity over SNNs? 1. About the motivation of SN P systems: could the authors elaborate on what practical or potential advantages (e.g., in computation, energy efficiency, or scalability) the parallel and modular structure of SN P systems offers for a task like image classification, compared to standard SNNs? 2. Could you comment on the choice of experimental tasks? Given that a key motivation for SN P systems is their unique structure inherited from P systems, have you considered tasks where this structure might offer a more distinct advantage over standard SNN architectures (beyond visual classification)? Furthermore, I think providing a comparison or discussion of more recent SOTA SNNs on these datasets would better contextualize the performance of the proposed method. Moderately AI-edited
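To make the interval excitation and surrogate-gradient mechanism discussed in these reviews concrete, here is a minimal PyTorch sketch of a band-style firing condition with a rectangular surrogate gradient around the interval edges. The interval bounds and surrogate width are invented for illustration and are not the paper's exact formulation or its potential adjustment module.

```python
import torch

class IntervalFire(torch.autograd.Function):
    """Fire when the potential lies inside [lo, hi]; use a boxcar surrogate gradient."""
    @staticmethod
    def forward(ctx, u, lo=0.5, hi=1.5, width=0.5):
        ctx.save_for_backward(u)
        ctx.lo, ctx.hi, ctx.width = lo, hi, width
        return ((u >= lo) & (u <= hi)).float()

    @staticmethod
    def backward(ctx, grad_out):
        (u,) = ctx.saved_tensors
        lo, hi, w = ctx.lo, ctx.hi, ctx.width
        # nonzero gradient only in a band around each interval edge (rectangular surrogate)
        near_edge = ((u - lo).abs() < w) | ((u - hi).abs() < w)
        return grad_out * near_edge.float(), None, None, None

u = torch.linspace(-1, 3, 9, requires_grad=True)   # toy membrane potentials
spikes = IntervalFire.apply(u)
spikes.sum().backward()
print(spikes)
print(u.grad)
```

The forward pass is the relaxed (interval) firing rule; the backward pass is where the design choice lives, and it is the part a standard surrogate-gradient SNN would implement almost identically, which is the comparison the reviews ask the authors to defend.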
Learnable Spiking Neural P System with Interval Excitation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a Learnable Spiking Neural P System (LSNP_IE) to improve the expressive capacity of traditional SN P systems, using the interval excitation mechanism and potential adjustment module. The authors introduce surrogate gradients, which are widely used in SNNs, to train the SN P system. The performance of LSNP_IE is evaluated on two neuromorphic datasets, N-MNIST and MNIST-DVS. 1. The paper proposes an innovative SN P system. 2. The paper has a complete writing structure and clear logic. 1. The experiments lack validation on higher-resolution datasets. 2. The paper lacks comparative studies with other SN P systems of the same type. 3. The hyperparameter sensitivity analysis is incomplete. 4. The experimental results report the standard deviation, but the experimental settings do not specify which random seeds are used. 1. Why does LSNP_IE perform significantly worse on the scale-16 version of the MNIST-DVS dataset compared to the other two scales? 2. Why are tests not conducted on static images? Can LSNP_IE only be applied to neuromorphic data? 3. Why is it difficult for existing SN P systems to adopt complex structures such as convolutional layers? Fully human-written
Learnable Spiking Neural P System with Interval Excitation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a Learnable Spiking Neural P (LSNP_IE) system, contributing a novel interval excitation mechanism that enables effective gradient-based training for SN P systems on neuromorphic datasets. It provides a mathematically precise framework and demonstrates competitive performance, representing a step towards bridging the theory of membrane computing with practical learning. However, the work is limited by its use of a simple MLP architecture, which fails to demonstrate scalability, and provides no empirical evidence for the core claimed advantage of energy efficiency, leaving its practical superiority over established spiking neural models unproven. 1. It introduces a learnable Interval Excitation Mechanism, effectively solving the "probability zero" firing problem for SN P systems with continuous data, a significant conceptual advance in the field. 2. It demonstrates end-to-end gradient-based training for an SN P system, a crucial step towards bridging the gap between their theoretical potential and practical application on real-world tasks. 3. It provides a mathematically precise definition of LSNP_IE and supports it with thorough ablation studies (e.g., on surrogate gradients, decay coefficient d) that empirically validate design choices. 1. The core innovation, the interval excitation mechanism, is essentially a relaxation of a discrete threshold to a continuous one—a well-established concept in traditional SNNs (e.g., the use of surrogate gradients often implicitly does this). The potential adjustment module is a direct analog of Batch Normalization, adapted for membrane potentials. While the application of these ideas to the SN P system formalism is new, the conceptual building blocks are largely borrowed from adjacent fields. The paper does not demonstrate a fundamental theoretical advance in membrane computing itself. 2. Using a simple Multi-Layer Perceptron (MLP) is outdated and fails to demonstrate the model's compatibility with modern deep learning architectures. The claim that the framework supports "arbitrary depth" is unsupported, as no deep or convolutional networks are tested. This raises serious doubts about its practical applicability. 3. A primary motivation for SN P systems is their purported energy efficiency on neuromorphic hardware. However, the paper provides zero evidence for this claim. There is no analysis of computational complexity, number of spikes, or energy consumption compared to standard SNN baselines. Without this, the work remains a purely theoretical exercise, and its advantage over simpler, more established SNN models is unproven. 4. Given that the interval excitation is conceptually similar to the soft-thresholding used in surrogate gradient methods for SNNs, what is the fundamental advantage of the LSNP_IE formalism over a standard, well-optimized SNN with surrogate gradients? The experiments currently show comparable, not superior, performance on simple tasks. 5. Lack of Empirical Evidence for Computational Efficiency Claims: A core motivation for spiking models and neuromorphic hardware is energy efficiency. 
The paper repeatedly mentions this advantage (e.g., "significant energy efficiency advantages," "low-energy implementations") but provides no empirical measurements or estimates to support this claim for LSNP_IE. There is no analysis of the number of synaptic operations (SOPs), spike counts, or any other proxy for energy consumption compared to the baseline models. Without this data, the practical benefit of LSNP_IE for low-power applications remains an unverified assertion. The paper emphasizes the 'inherently parallel, distributed, and modular structure' of SN P systems as a key differentiator from 'monolithic SNNs.' However, in the presented LSNP_IE, which uses a standard MLP topology, how does this modularity manifest in a way that is functionally different from a standard, layered SNN? Could you specify a concrete computational or representational benefit provided by the membrane/rule formalism in this specific instantiation that cannot be achieved by an SNN? Fully AI-generated
Learnable Spiking Neural P System with Interval Excitation Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a Learnable Spiking Neural P System with Interval Excitation (LSNP-IE) that integrates a differentiable learning framework into traditional Spiking Neural P (SN P) systems. The proposed LSNP-IE introduces three main innovations: (1) an interval excitation mechanism that replaces the point-triggered rule with a continuous interval, (2) a potential adjustment module to stabilize and normalize membrane potentials, and (3) a surrogate-gradient-based back-propagation algorithm for end-to-end training. Experiments on two neuromorphic datasets, N-MNIST and MNIST-DVS, show that LSNP-IE achieves competitive or superior accuracy compared to existing spiking and non-spiking baselines. 1. The paper extends the SN P system toward differentiable and adaptive learning. 2. The paper is well-structured with clear notation. 3. The presentation of back-propagation with surrogate gradients through interval excitation and potential adjustment is clear. 1. Evaluation is restricted to small-scale neuromorphic datasets (N-MNIST, MNIST-DVS). These are relatively simple and may not demonstrate scalability to complex or high-dimensional spatiotemporal tasks. 2. The paper does not report training/inference time, memory usage, or energy efficiency, which are central claims for spiking and membrane computing models. 3. LSNP-IE uses an MLP-style feed-forward topology only. The absence of convolutional or recurrent structures limits its comparison with advanced SNN architectures. 4. The study omits recent transformer-based or event-driven spiking frameworks. 5. Although the model is inspired by neural dynamics, the biological plausibility of the interval excitation mechanism is not discussed or experimentally supported. 1. How does LSNP-IE scale to deeper or convolutional architectures? Could the interval excitation mechanism be integrated into spiking CNNs? Fully AI-generated
CLAMP: A Chebyshev-Weighted Multi-Gradient Approach for Multi-Objective LLM Alignment Soundness: 3: good Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces CLAMP, a multi-objective alignment framework that operates without an explicit reward model. It utilizes distinct preference datasets for various human preference dimensions to optimize a vector-valued objective function. The framework integrates weighted Chebyshev scalarization with multi-gradient descent algorithms to find Pareto-stationary solutions. The authors also provide a theoretical guarantee for a finite-time convergence rate for the framework, which is notably independent of the number of alignment objectives. Experimental results confirm CLAMP's effectiveness in aligning LLMs to heterogeneous human preferences, showing significant improvement over existing methods. 1. The paper introduces the CLAMP framework and establishes a theoretical guarantee for solving the Multi-Objective LLM Alignment problem. 2. Experimental results validate the efficiency of the proposed method. My main concern is the concept of Pareto optimality and the Pareto front. Though the consideration of multi-objective LLM alignment across different preference functions is reasonable, Pareto optimality seems too weak a criterion. In detail, a Pareto optimal solution only requires that the solution is not dominated by others. Based on this definition, if we consider a summation reward function $f(\theta)=f_1(\theta)+\dots+f_m(\theta)$ and only try to maximize this reward function, then the optimal solution for this summation reward function will naturally be Pareto optimal. This is because any solution that dominated it would have to be at least as good in every objective and strictly better in at least one objective (or agent $m$), and would therefore achieve a strictly higher summation, contradicting optimality. In an extreme case, when one of the functions $f_i(\theta)$ is continuous, it typically does not take the same value at different solutions. In this case, maximizing only that single function $f_i(\theta)$ can still yield a Pareto optimal solution. Overall, Pareto optimality seems not to capture the fundamental intuition behind multi-objective LLM alignment, and computing Pareto optimal solutions can easily be reduced to a single-objective LLM alignment problem, which calls the contribution of this paper into question. 1. The current experimental results are limited to LoRA-based fine-tuning. Could the authors provide results using full fine-tuning to give a more complete performance comparison? 2. The two primary multi-objective tasks ("Helpfulness-Harmlessness" and "Helpfulness-Honesty-Instruction-Following") appear similar in nature. Do the authors have any results demonstrating the divergence or degree of conflict between the objective functions in these two evaluation settings, which is essential for reflecting the difficulty of true multi-objective problems? Lightly AI-edited
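To make the scalarization argument in the review above concrete, here is the standard one-line justification from multi-objective optimization (a textbook fact added for illustration, not a claim taken from the reviewed paper): if $\theta^\star \in \arg\max_\theta \sum_{i=1}^m f_i(\theta)$, then no $\theta'$ can Pareto-dominate $\theta^\star$, because $f_i(\theta') \ge f_i(\theta^\star)$ for all $i$ with strict inequality for some $j$ would imply $\sum_i f_i(\theta') > \sum_i f_i(\theta^\star)$, contradicting the optimality of $\theta^\star$. Hence the maximizer of a plain (or any positively weighted) sum is already Pareto optimal, which is exactly why Pareto optimality alone is a weak target for multi-objective alignment.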
CLAMP: A Chebyshev-Weighted Multi-Gradient Approach for Multi-Objective LLM Alignment Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes CLAMP (Chebyshev-Weighted Multi-Gradient Alignment), a new framework for multi-objective LLM alignment that does not rely on reinforcement learning or reward models. The method addresses the challenge of aligning models with multiple, potentially conflicting objectives by formulating training as a multi-objective optimization problem. CLAMP defines a Chebyshev-weighted loss, which minimizes the maximum deviation among all objectives, effectively prioritizing the worst-performing one at each step. The optimization combines this scalarization with the Multi-Gradient Descent Algorithm (MGDA) to compute a single update direction that achieves Pareto-stationary improvements across objectives. The approach includes theoretical analysis proving an $O(1/T)$ convergence rate independent of the number of objectives, indicating good scalability. Empirically, CLAMP is tested on multi-preference alignment benchmarks and compared with MORLHF and related baselines. Results show improved Pareto front coverage, alignment stability, and task trade-offs with minimal computational overhead. The paper claims CLAMP offers a theoretically principled and efficient alternative for balancing multiple alignment goals in LLMs. The paper addresses a problem in multi-objective alignment for LLMs and proposes an RL-free framework that is both theoretically motivated and empirically supported. In terms of originality, the integration of Chebyshev-weighted scalarization with multi-gradient descent (MGDA) offers an interesting combination of classical multi-objective optimization principles and modern LLM alignment methods. The formulation provides a clear geometric interpretation of balancing conflicting objectives and offers an alternative to traditional RLHF-based approaches. Regarding quality, the paper includes formal convergence analysis and claims an $O(1/T)$ convergence rate independent of the number of objectives, suggesting theoretical soundness and scalability. The optimization strategy is simple yet mathematically grounded, and the experimental evaluation demonstrates that CLAMP can achieve balanced alignment across objectives while maintaining low computational overhead. In terms of significance, the method contributes to a growing line of research aiming to reduce reliance on RL and reward modeling in preference alignment. The focus on Pareto-stationary updates aligns well with real-world alignment challenges where multiple preferences must coexist. Although the implementation could be clarified, the framework itself has potential for broader applicability in multi-objective fine-tuning of LLMs. 1. While the paper introduces a theoretically motivated framework, several issues limit its clarity and empirical strength. First, the core loss formulation in Equation (3) appears mathematically equivalent to the MaxMin-RLHF objective when incorporating the preference vector $p$. However, the paper only compares with MORLHF and does not include MaxMin-RLHF. 
This omission makes it difficult to assess whether CLAMP’s improvements arise from the algorithm itself or simply from reparameterization of an existing loss. 2. The proposed loss function does not have a unique solution. The Chebyshev-weighted max-min scalarization naturally admits multiple Pareto-stationary points, depending on initialization and gradient geometry. The paper does not provide any analysis or experiments on how these different solutions behave in practice. Multiple training runs could yield models representing distinct trade-offs on the Pareto front, but the paper lacks results or visualizations demonstrating this diversity or consistency. Clarifying how these solutions differ and whether they produce meaningful alignment trade-offs would substantially strengthen the empirical section. 1. Comparison with MaxMin-RLHF and Computational Advantage: Equation (3) appears to define the same loss function as the MaxMin-RLHF formulation when incorporating the preference vector $p$, i.e., minimizing $\min_\theta \max_m \{ p_m f_m(\theta) \}$. However, the paper only compares CLAMP with MORLHF rather than with this closely related MaxMin-RLHF method, which shares the same scalarized objective. Could the authors clarify how CLAMP provides a meaningful improvement, either theoretically or empirically, over MaxMin-RLHF, given that both methods optimize an equivalent objective but differ in optimization dynamics? 2. Could the authors include any experimental results or analysis comparing multiple runs to demonstrate whether the algorithm consistently converges to similar solutions or explores diverse regions of the Pareto front? The loss function defined in Equation (3) does not appear to yield a unique optimal solution, as multiple Pareto-stationary points can exist depending on initialization or gradient geometry. It would also be helpful to visualize or quantify the diversity of solutions obtained from different random seeds to confirm the robustness and stability of CLAMP’s optimization process. Fully AI-generated
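For reference when comparing the two objectives discussed in this review, the general weighted Chebyshev scalarization is usually written as $\min_\theta \max_{m \in \{1,\dots,M\}} p_m \left( f_m(\theta) - z_m^\ast \right)$ for a preference vector $p$ and an ideal point $z^\ast$ (the ideal-point term is part of the textbook form and is stated here as an assumption, not taken from the reviewed paper); with $z^\ast = 0$ it collapses to $\min_\theta \max_m p_m f_m(\theta)$, i.e., the same scalarized loss as the MaxMin-RLHF-style objective referenced in the questions above. The empirical comparison requested by the reviewer would therefore isolate whether CLAMP's MGDA-based optimization of this objective behaves differently from optimizing the identical scalarization directly.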
CLAMP: A Chebyshev-Weighted Multi-Gradient Approach for Multi-Objective LLM Alignment Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes CLAMP (Chebyshev-weighted LLM alignment with multi-objective preferences), a method that integrates stochastic multi-gradient descent and Chebyshev-weighted scalarization to achieve multi-objective alignment for LLMs. Experimental results demonstrate that the proposed approach improves multi-objective alignment compared to existing baselines. - The proposed method is theoretically grounded. - The method is RL-free and reward model-free, and its training time remains unaffected by the number of objectives. - The clarity of the paper needs improvement. It took me some time to understand the methodology, and after reading, it remains unclear how to compute the multi-objective loss. For example, given a single sample $(x, y_w, y_l)$ to optimize and three objectives, how is the loss function computed for each objective? - The novelty appears limited, as the method primarily applies previous theories to the multi-objective alignment problem. - Comparing with more recent baselines, such as those mentioned in the related work (e.g., MO-GRPO), would enhance the overall quality of the paper. - The term "heuristics" could be misleading. Are all existing multi-objective alignment algorithms heuristic-based? The authors should provide a clearer explanation of this. - The proposed method is sensitive to the hyperparameter μ. - The experimental settings are not clearly stated. For instance, the test set and the reward models used should be explicitly specified, with direct references to the appendix where applicable. - The paper does not validate whether the proposed algorithm performs well with other methods, such as IPO or SimPO. - A concern is the notably poor performance of MORLHF reported in the paper. In my own experience, using multi-objective reward-weighted PPO/GRPO often yields better results than DPO. Could the weak performance be due to the reward models used? It would be beneficial if the authors could report the results of training with more advanced reward models, such as Skywork-llama3-8b-v2 on UltraFeedback. - Formatting issues: The citation format is incorrect. For example, "Reinforcement learning from human feedback (RLHF) Christiano et al. (2017);". - Typo: Line 232: “meta-algorithm.” should be “meta-algorithm”. - Can CLAMP be applied to the online RL setting? Lightly AI-edited
Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces Perception-R1, a method to improve multimodal reasoning by fixing the key bottleneck of poor visual perception. The authors find that standard accuracy-only reinforcement learning (RLVR) fails to correct these perception errors, as models can guess the right answer despite flawed visual understanding. Perception-R1 addresses this by adding a visual perception reward. This reward is calculated by a judging LLM that compares the model's response to pre-extracted "visual annotations" (key visual facts) from correct solutions. Experiments show this method achieves superior performance on multiple benchmarks with high data efficiency, using only 1,442 training samples. 1. The paper provides a clear and compelling statistical analysis (using McNemar's test) of accuracy-only RLVR-trained MLLMs. This builds a strong case that a significant bottleneck for current models is indeed multimodal perception, not just high-level reasoning. 2. The proposed visual perception reward is intuitive and cleverly designed. By having an LLM judge responses against verifiable, extracted annotations rather than training a holistic reward model, the method directly targets the identified bottleneck. This approach appears significantly more robust to the reward hacking that can harm end-to-end MLLM-as-reward-model RLVR. 3. The performance gains achieved using only 1,442 training samples are impressive. This strongly suggests that a higher-quality, more targeted reward signal (i.e., combining perception and accuracy) can be far more sample-efficient than simply scaling up data for a sparser, accuracy-only reward. 4. The method delivers substantial performance improvements not only on its training domain (math/geometry) but also, surprisingly, across several general-domain benchmarks, outperforming baselines that used orders of magnitude more data. 1. Limited analysis of generalization: The model's strong generalization from geometry-only training (Geometry3K) to general-domain benchmarks (like MMMU and MMStar) is a key result, but it is not fully explained. The authors hypothesize that they are improving a foundational perception capability, but the link between 'perceiving geometry diagrams' and 'perceiving real-world images' could be strengthened. To make this claim more concrete, the authors could include: - Qualitative analysis on general benchmarks: Provide qualitative examples from MMMU or MMStar. Does the Perception-R1 model now exhibit the same "describe-then-solve" behavior on these general-domain images? Where do the baseline models fail on perception in these tasks? Is Perception-R1 delivering more accurate visual perception in these examples? - Error breakdown on general benchmarks: Conduct a small-scale error analysis on a subset of a general benchmark (like MMMU-Pro, where they show strong results). What percentage of the baseline's failures on these tasks are due to perception errors, and what percentage of those specific errors does Perception-R1 fix? This would directly support the claim of foundational perception improvement. 2. 
Dependence on a single training data domain: The reliance on Geometry3K, while clearly effective, is a potential limitation. The data curation pipeline itself seems general, but its effectiveness has only been demonstrated on this one domain. An ablation study training on a different domain (e.g., general textbook diagrams, or even a VQA dataset) using the same pipeline would be highly valuable. This would help demonstrate the general applicability of the Perception-R1 framework, distinguishing its contribution from the (clearly very effective) choice of geometry data as a training source. Please see the weaknesses above. Heavily AI-edited
Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 1: You are unable to assess this paper and have alerted the ACs to seek an opinion from different reviewers. This paper proposes Perception-R1, a method to enhance multimodal reasoning in Multimodal Large Language Models (MLLMs) by introducing a visual perception reward alongside the standard accuracy reward in Reinforcement Learning with Verifiable Rewards (RLVR). Key Contributions: 1. Problem Identification: Through McNemar's test, the authors demonstrate that existing accuracy-only RLVR fails to improve MLLMs' multimodal perception capabilities, which they identify as a major bottleneck. 2. Method: They introduce a visual perception reward that extracts textual visual annotations from CoT trajectories and uses a judging LLM to assess consistency between these annotations and model responses, which provides additional training signal beyond answer correctness. 3. Results: Using only 1,442 training samples, Perception-R1 achieves SOTA performance across multiple benchmarks, outperforming Vision-R1 (which uses 200K samples) and other baselines. 1. The methods introduced in the paper provide a denser reward for the reinforcement learning process, which directly addresses the problem identified by the authors. 2. The experiments are extensive and thorough. 3. Using only 1,442 training samples, Perception-R1 achieves SOTA performance across multiple benchmarks, outperforming Vision-R1 (which uses 200K samples) and other baselines. Concern 1: Marginal Improvement Beyond GRPO Baseline. According to Table 2, the performance improvements appear to be primarily driven by GRPO rather than the proposed visual perception reward. The reviewer notes that GRPO is also used in Vision-R1, making it unclear how much of the improvement is attributable to the novel contribution versus the baseline RL algorithm. Concern 2: Judging LLM Quality Dependency. Figure 3b shows that when using smaller judging LLMs (e.g., Qwen2.5-7B or 14B), the performance sometimes drops below even the base model performance (e.g., MathVerse: 46.1% vs 47.4% baseline; MathVision: 24.2% vs 25.1% baseline). This raises questions about the robustness and practical applicability of the method when high-quality judging models are unavailable. The reviewer suggests that the authors conduct additional experiments: (1) train Vision-R1 on the same 1,442 samples used in this paper, and (2) use the same model to generate CoT trajectories for a fair comparison. Lightly AI-edited
Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces Perception-R1, a reinforcement learning framework that enhances multimodal reasoning in MLLMs by explicitly improving their visual perception. Specifically, the authors (1) extract visual annotations from correct chain-of-thought trajectories as ground-truth perceptual references, (2) employ a judging LLM to evaluate the consistency between these annotations and the model’s generated reasoning, and (3) aggregate this feedback with accuracy and format rewards under the GRPO optimization scheme. - The idea of augmenting RL with a verifiable visual perception signal represents a clear conceptual advance over prior RLVR frameworks (e.g., Vision-R1, MM-Eureka) that focus solely on final answer correctness. - The authors conduct extensive evaluations on multiple multimodal benchmarks, demonstrating the method's effectiveness and robustness. - The paper is well-structured and clearly written. - The paper lacks systematic exploration of critical parameters such as the perception reward weight (γ) and judgment thresholds, leaving robustness questions unanswered. - Although data-efficient, the additional judging and reward assignment stages may increase computational overhead, which is not quantitatively discussed. - The paper would benefit from more qualitative evidence demonstrating how the model’s perception improves—e.g., visual attention maps, step-by-step perception-reasoning examples, or case studies showing corrected misperceptions. Such analyses would strengthen interpretability and directly connect the proposed reward to perceptual behavior. - The method’s success relies heavily on the quality and alignment of the judging LLM used to evaluate perceptual consistency. As shown in Figure 3(b–c), weaker judges introduce reward hacking and degrade performance, but the paper stops short of analyzing why this happens or proposing safeguards (e.g., calibration, ensemble judgment, or confidence filtering). Further discussion or mitigation strategies would make the approach more robust and reproducible across settings. - The perception reward weight (γ) and the number/quality of visual annotations are central to the method, yet their interactions are not fully studied. Figure 3(a) provides only coarse exploration. More systematic experiments varying γ and annotation noise would clarify stability and guide practitioners in tuning the method. Fully AI-generated
Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper tackles a key but often neglected limitation in reinforcement learning for Multimodal Large Language Models (MLLMs): existing Reinforcement Learning with Verifiable Rewards (RLVR) methods focus solely on final answer correctness, overlooking the accuracy of visual perception during reasoning. The authors show that such outcome-only rewards allow models to guess correct answers despite severe perception errors. To address this, they propose Perception-R1, which introduces a verifiable visual perception reward into RLVR. This reward is derived from textual visual annotations extracted from high-quality CoT trajectories and evaluated by a judging LLM that measures consistency between model outputs and these annotations. Contributions: 1. Empirically and statistically demonstrate that accuracy-only RLVR fails to enhance multimodal perception. 2. Introduce a novel, verifiable visual perception reward that alleviates reward sparsity and improves perceptual grounding. 3. Achieve state-of-the-art performance on multiple multimodal reasoning benchmarks using only 1,442 training samples, showing exceptional data efficiency. 1. This paper reveals the impact of poor perception on reasoning performance. Current RLVR methods fail to enhance multimodal perception, which fundamentally limits the reasoning performance of MLLMs. 2. The introduced Perception-R1 framework incorporates a novel visual perception reward that significantly strengthens the visual understanding and reasoning capabilities of MLLMs, particularly in mathematical reasoning tasks. 3. Extensive experiments across multiple benchmarks verify that Perception-R1 substantially improves both perception and reasoning performance, achieving superior results even with highly limited training data. 1. The paper claims that it enhances the multimodal reasoning capabilities of MLLMs through improved perception. However, the presented results do not provide direct evidence that the observed performance gains stem specifically from enhanced perception. I suggest including an analysis or ablation that directly links perception improvement to the reasoning gains. 2. While the paper reports significant improvements on multimodal math benchmarks, these results primarily reflect reasoning performance rather than perception itself. To convincingly demonstrate perception enhancement, it would be helpful to include evaluations on dedicated perception-level benchmarks (e.g., BLINK, MMBench, MME, or similar datasets). 3. The method employs Gemini-2.5 Pro to generate CoT trajectories and uses an LLM to extract visual annotations, followed by GRPO training on these annotations. This pipeline closely resembles a distillation process from Gemini-2.5 Pro, which may primarily transfer reasoning knowledge rather than genuinely improving perception. It would strengthen the paper to disentangle and clarify whether the observed gains truly originate from improved perception rather than implicit reasoning distillation. 
1. After distilling the CoT trajectories from Gemini 2.5 Pro, could you clarify why an LLM is used to transform these trajectories into atomic statements? In particular, how does this approach differ from directly inputting the trajectory data into the LLM to evaluate the atomic statements? Lightly AI-edited
CMPS: Constrained Mixed Precision Search Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This work proposes a new constrained mixed precision search for post-training quantization. To solve the constrained optimization problem, the authors leverage a barrier-based interior-point method. The method keeps model weights frozen and needs only a small calibration set (128 samples). Experiments on various LLMs report consistent gains over uniform precision baselines at the same or lower effective bit budgets. 1. The work discusses the problem of reducing computational and memory footprints for the deployment of DNNs, which is practical and important. 2. The paper is well written and easy to follow. 3. At 4.5 bits, the proposed method often beats MXFP in terms of perplexity on the examined benchmarks. 1. The authors claim that after rounding there always remains a strictly feasible solution with respect to the budget. I believe a proof for this claim is required. 2. The comparison is limited. The work only compares itself to the MX baselines, but there are many other strong PTQ techniques. Only a single dataset was used in the experiments. 3. No throughput/latency comparisons are provided. 4. The improvement over the baselines is marginal. 5. How does the method operate compared to integer PTQ techniques? 6. According to the experiments, the proposed algorithm does not always meet the constraint. How were the samples for calibration chosen? What is the meaning of the upside-down question mark in the caption of Figure 2? There is a typo in line 75 (duplicated "Our contributions are as follows:") Fully human-written
CMPS: Constrained Mixed Precision Search Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. A DNAS-based post-training mixed-precision quantization method (CMPS) is proposed. CMPS provides fine-grained control over model compression, enabling stable and predictable performance. The proposed CMPS method is compared with uniform quantization baselines, demonstrating the advantages of learnable mixed-precision bit allocation. 1. This paper works on post-training mixed-precision quantization with controllable compression ratios. The problem studied is important and the motivation is clear. 2. A detailed theoretical analysis is provided. 1. Quantization details are missing. It seems that CMPS is a weight-only quantization method. However, the quantization details are not provided. 2. Optimization cost is not provided. The advantage of PTQ is its efficiency in quantization optimization. CMPS relies on end-to-end tuning with multiple branches. The speed and memory cost overheads should be reported. 3. Comparison with previous methods is also missing. The authors didn't provide any quantization details, including the uniform quantization baselines. In the LLM quantization literature, many high-performance PTQ methods have been proposed. What are the performance advantages over these methods? How can the proposed CMPS be combined with these techniques? Moreover, the authors only compared with uniform quantization baselines; the comparison with previous mixed-precision methods is missing. 4. In several places, it says "hardware-constrained bit allocation"; however, only "total model size in bits" is modeled during the optimization. Moreover, only two bit levels are explored in the bit allocation (MXFP4 and MXFP8). 5. In the experiments part, previous methods commonly use wiki2 for calibration in addition to C4. For the zero-shot scenario, only one task (LAMBADA) is evaluated, which is clearly not enough. The largest model used is 3B; experiments on larger models or architectures like MoEs are also needed. 6. In the limitations, regarding the statement "the memory required to hold activations or gradients for multiple low-bit options might still be comparable to, or less than, holding a single higher-precision (e.g., FP16 or BF16) baseline tensor", more careful and precise expression should be used. Many PTQ methods do not need to store all activations, and the gradients are not needed. However, in CMPS, full-precision activations of all layers and gradients are needed, which increases memory usage. If these tensors (activations and gradients) can be stored in low-bit, then the authors should verify this using controlled experiments. Please refer to the Weaknesses for further questions. Fully human-written
CMPS: Constrained Mixed Precision Search Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces a differentiable NAS algorithm for data format allocation in a network. It first formalizes the optimization problem, including architecture constraints (the maximum average number of bits for a model), then proposes a gradient-descent-based heuristic for solving the mixed precision data format allocation problem. While the method is fully post-training, it still requires a small calibration data set to perform the training of the data format precision parameters. The paper shows that using this formalism, mixed precision constrained NAS can achieve better results than uniform quantization. - The mathematical formulation of the constrained optimization problem for mixed precision data format allocation seems fairly general; - The large number of results (which are combinations between models and tasks used for the calibration) seems to demonstrate the robustness of the approach. - The paper completely lacks any comparison with the state-of-the-art! No comparison with other mixed-precision post-training optimization methods is even attempted... yet plenty exist. That is clearly a major issue in this paper. - While the foundations of differentiable NAS methods seem to be adequately described and cited, the novelty of the proposed method remains hard to grasp. I would suggest adding a short but clear statement on what it brings compared to the closest SoTA work. - While the method seems very general, it is frustrating to see it tested on a single NAS scenario, namely, 4.5-bit mixed precision with the MXFP data format. What about mixing different formats (integer, FP...)? Or testing other maximum average numbers of bits (like 3.5 or 5.5...)? - The perplexity/accuracy gains of the method remain modest and the proposed NAS scenario is too limited. Please carefully answer the issues mentioned in the weaknesses section. I may increase my rating provided that at least 1) a quantitative comparison with other SoTA methods is provided, and 2) an additional NAS scenario beyond 4.5-bit mixed precision is evaluated. Fully human-written
Unmasking the Tiny: Foreground Probing for Small Object Detection Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper tackles small-object detection in high-resolution imagery by reframing the failure mode as "suppressed foreground scores" in one-stage, decoupled-head detectors. It introduces a Foreground Probing paradigm built on YOLOX: (i) a Sparse Token Selection Module (STSM) that keeps the top-K candidate tokens by objectness from all feature maps, and (ii) a Foreground Refinement Module (FRM) that uses classification-feature attention to refine the foreground/objectness branch via a gated combination of regression- and classification-self-attentions. Integrated into YOLOX-X, the method reports improvements on VisDrone with a small FPS drop, and improvements on UAVDT over ESOD. Task-tailored designs for small object detection. Promising improvement over baseline. Balanced paper organization. There are weaknesses regarding novelty, clarity, and presentation. 1. The method is only designed for single-stage YOLO-style detectors that use separate classification and localization heads. It does not apply to two-stage models like Mask R-CNN or transformer-based models such as DETR and RT-DETR, which already handle small objects quite well through better region proposals or global attention. The paper should clearly state this limitation in the title or abstract. Right now, the writing sometimes gives the impression that the method can improve any type of detector. The authors should also compare or at least discuss more recent DETR-based small-object detectors. 2. Figure 1 does not clearly explain what the "foreground probe" actually is or how it works. The terms "foreground probe" and "foreground refinement" are introduced but not illustrated in an intuitive way. As it stands, the figure is too abstract to help readers understand the mechanism. 3. The experimental comparisons do not fully support the claimed advantages. The competing methods often use weaker or older backbones, making the improvements less meaningful. Also, efficiency comparisons (FPS) are not clearly standardized. FPS varies greatly depending on GPU type, input size, and framework. The results would be more convincing if the authors compared models with similar settings and included standard measures like parameters, FLOPs, and latency on the same hardware. It would also help to report results by object size (small/medium/large, like in COCO, if possible) to show that the proposed method truly improves detection for small targets. 5. All results are based on YOLOX. It is unclear whether the proposed modules can be easily applied to other detectors. In addition to the concerns listed in weaknesses, another question is: the gate function in Eq. (4) seems to contain a typo: $f_{cls}$ should come from $F_{cls}$ rather than $F_{reg}$ (correct me if I am wrong). The paper does not explain whether the gate is applied globally or per token, which affects the interpretation. Lightly AI-edited
Unmasking the Tiny: Foreground Probing for Small Object Detection Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper addresses the challenge of detecting small objects in high-resolution images and proposes a Foreground Probing paradigm, which aims to recover suppressed foreground confidence through collective semantic features. The method’s effectiveness is validated through experiments on the VisDrone and UAVDT datasets. # **Strengths** 1. **Clear and Reasonable Idea:** The paper analyzes the internal mechanisms of detectors and identifies that the bottleneck for small object detection lies in the "systematic suppression of foreground confidence." It proposes a solution to feed classification features back to improve foreground estimation. This approach is novel and distinct from traditional "filter-and-detect" or "crop-and-detect" paradigms. 2. **Sound Experiments:** The authors conduct comprehensive evaluations on two mainstream high-resolution UAV datasets (**VisDrone** and **UAVDT**) and perform ablation studies on key hyperparameters (e.g., number of tokens, gating parameter λ). The results show robustness to hyperparameter variations and consistent performance improvements. 3. **Good Readability:** The paper is well-structured, with clear motivation, intuitive figures, and straightforward comparative experiments. 4. **Practical Significance:** Small object detection has broad applications in UAV security, remote sensing, and traffic perception. The proposed improvement strategy is practically valuable. # **Weaknesses** 1. **Limited Depth of Innovation:** Although the "Foreground Probing" paradigm provides a new perspective, its core implementation (based on token selection and attention fusion) is still a combination of existing mechanisms rather than a fundamentally new theoretical framework. There is a lack of evaluation on the transferability of this paradigm to different detector architectures, such as Transformer-based or anchor-free DETR series. 2. **Incomplete Baseline Comparisons:** Experiments only compare with a few one-stage detectors like YOLOX and ESOD, without including recent detectors specifically designed for tiny object detection or other mainstream detectors. 3. **Insufficient Ablation Studies:** Ablation studies only cover the number of tokens and the gating parameter λ, without further investigation of other parameters or structural components. 4. **Limited Generalization and Scenario Coverage:** Experiments are confined to UAV aerial datasets (**VisDrone** and **UAVDT**), limiting evaluation of the method’s generalization to other scenarios. 5. **Other Formatting Issues:** Table 2 appears before Table 1 and is cited first; Figure 4’s corresponding dataset is not explicitly specified. # **Questions** 1. **Computation Efficiency:** For STSM, the K value (500) and the attention computation in FRM (O(K²))—do they become speed bottlenecks? Could Table 3 include FPS results for different K values (100/300/500/700) to quantify the trade-off between token number, speed, and accuracy? 2. **Experimental Extension:** Could the authors compare with more mainstream small object detection methods, such as RT-DETR or Deformable DETR? 3. 
**Sensitivity of K and Gating Coefficient λ:** Are the K value and gating coefficient λ consistent across different datasets, or do they require dataset-specific tuning? 4. **Validation in Diverse Scenarios:** Only two datasets are used, which may be insufficient to demonstrate generalization, especially as UAVDT has only three categories and high frame similarity. Have the authors considered evaluating on more diverse scenarios, such as **DOTA**, **TinyPerson**, or other small object detection datasets? 5. **Inference Efficiency:** Tables 1 and 2 show that the baseline YOLOX and the proposed method achieve only 16 FPS. It is recommended to consider stronger baselines to improve efficiency. 6. **Incomplete Ablation Studies:** Ablation experiments only cover K values and the gating parameter λ. Further investigation into other parameters or structural components is suggested. Moderately AI-edited
Unmasking the Tiny: Foreground Probing for Small Object Detection Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes a new paradigm for Small Object Detection (SOD) called Foreground Probing (FP). Through two core modules, the Sparse Token Selection Module (STSM) and the Foreground Refinement Module (FRM), the authors attempt to recover the suppressed foreground confidence from the semantic features of the classification branch, thereby improving the recall rate of small objects. Experiments on the VisDrone and UAVDT datasets demonstrate consistent performance improvements. 1. Clear problem definition and motivation: The paper accurately identifies the key bottleneck in current small object detection: the foreground score is suppressed by the large background. The analytical formulation is insightful, theoretically revealing the root cause of confidence suppression in detectors. 2. Simple and highly compatible design: The combination of STSM and FRM is easy to integrate into existing detection frameworks (such as YOLOX) without modifying the backbone architecture, making it suitable for large-scale deployment. 1. The paper does not provide a visual comparison between classification scores and regression scores, which leaves the motivation insufficiently supported. 2. STSM is essentially a top-K selection based on score ranking; it does not introduce a new feature learning mechanism or adaptive selection strategy. It is suggested to explore a learnable token selection strategy for STSM rather than simple ranking, or to include comparative experiments with alternative approaches. 3. The attention weighting mechanism in FRM ($A = \lambda A_{reg} + (1-\lambda) A_{cls}$) is highly similar to existing cross-attention or gating mechanisms. 4. The paper claims that the semantic features from the classification branch are more stable, but this is not quantitatively verified. Please refer to the Weaknesses. Moderately AI-edited
SpEmoC: Large-Scale Multimodal Dataset for Speaking Segment Emotion Insights Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper introduces SpEmoC, a large-scale multimodal dataset for emotion recognition in spoken conversational segments. The dataset consists of 306,544 raw clips extracted from 3,100 English-language movies and TV series. These clips were refined to a high-quality, class-balanced subset of 30,000 clips that are annotated for seven Ekman-based emotion categories. The authors propose an automated annotation pipeline that leverages pretrained DistilRoBERTa (for text) and Wav2Vec 2.0 (for audio) models. Human validation is also used to ensure the accuracy of the annotations. Additionally, the authors introduce a lightweight baseline model based on CLIP-HuBERT-MLP and a novel Extended Reweighted Multimodal Contrastive (ERMC) loss to align cross-modal emotion embeddings. The model is evaluated on SpEmoC as well as on two other datasets: MELD and CAER. 1. SpEmoC significantly outperforms previous benchmarks (e.g., MELD, CAER) in terms of scale and emotional balance. The use of synchronized video, audio, and text from various cinematic sources in real-world settings (e.g., with low lighting and variable resolution) improves the ecological validity of multimodal emotion recognition. 2. The authors use a two-step annotation strategy: first, they use pseudo-labels generated from pretrained unimodal emotion classifiers (DistilRoBERTa and Wav2Vec 2.0) to create a large dataset of labeled clips. They then use KL-divergence regularization to ensure consistency between different modalities. Second, they have 20 experts validate 50,000 clips, achieving a Fleiss' Kappa score of 0.62, which balances scalability and reliability. 3. The proposed Extended Reweighted Multimodal Contrastive Loss incorporates sentiment-based reweighting using KL divergence between unimodal emotion distributions. This aligns emotionally consistent embeddings across modalities, which significantly improves performance compared to using cross-entropy alone. 4. The lightweight model was trained and evaluated not only on SpEmoC but also on MELD and CAER. This allows for a direct comparison of the quality of the datasets through consistent modeling, strengthening the claim that the balanced design of SpEmoC yields more equitable performance for minority emotion classes, such as Fear and Disgust. 1. While the dataset includes 3,100 videos, 85% originate from scripted films/TV shows, limiting generalizability to spontaneous, real-world affect. Moreover, the paper provides no information on speaker-level demographics (e.g., gender, age), only coarse video-level ethnicity estimates. This omission hinders fairness and bias analysis, which is critical in affective computing. 2. The pipeline uses YOLOv8 for face/human detection, but the paper does not explain how the target speaker is chosen when multiple individuals are present. Without explicit speaker diarization or face-voice alignment, the emotion label (based on text/audio) may not match the visual subject, especially in group conversations. 3. The proposed model combines frozen CLIP-ViT (with AIM adapters), HuBERT, and a simple MLP fusion head - a standard architecture in multimodal learning. 
Although efficient (~8.68M trainable parameters), it does not offer any architectural innovation, and serves primarily as a validation tool for the dataset, rather than a significant contribution to the field of modeling. 4. The authors report F1 scores for MELD and CAER, but they do not compare them with existing state-of-the-art models. Therefore, it is unclear whether the performance differences are due to the quality of the dataset or the capability of the model. Consequently, the claim of "strong results" is not supported by the current literature. 5. The ERMC loss is only compared to vanilla cross-entropy and not to other contrastive, focal, or rebalancing losses. Without these comparisons, the additional value of ERMC is uncertain. 6. Although the authors use a movie-level split, emotional expressions in acted content can be stereotypical or dependent on the genre (e.g., fear in horror). If certain genres dominate the splits, the model may learn correlations between genres and emotions rather than genuine affective cues, leading to inflated generalization metrics. 1. In clips with multiple speakers, how is the subject of visual analysis aligned with the audio/text transcript? Has any form of facial recognition or voice identification been used? 2. Can the authors provide speaker-level metadata, such as gender, age, and ethnicity, for the 30,000 refined clips? If not, how do they ensure that their model is not biased against certain groups in terms of representation? 3. Why was the proposed model not compared to recent state-of-the-art (SoTA) systems on MELD and CAER datasets? The authors could have included such comparisons to determine whether performance improvements are due to improved data quality or the model design. 4. Have the authors explored the use of alternative loss functions to address class imbalance and modality alignment in their experiments? If so, how does the ERMC model compare to these other approaches? 5. Given that 85% of the clips used for training were acted, how confident can the authors be that models trained on SpEmoC will be able to generalize to real-life, spontaneous conversations (e.g., interviews, customer service calls)? Lightly AI-edited
SpEmoC: Large-Scale Multimodal Dataset for Speaking Segment Emotion Insights Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper introduces SpEmoC: a large-scale, multimodal corpus for the recognition of emotions in spoken, conversational segments. It is derived from 3,100 English-language films and television series. The dataset comprises 306,544 raw clips that have been refined into 30,000 high-quality samples that are balanced across seven Ekman emotions. The authors propose an automated annotation pipeline that uses pre-trained DistilRoBERTa (for text) and Wav2Vec 2.0 (for audio) models. These are fused via a KL-divergence-regularized logit fusion strategy and then validated by humans. They also present a lightweight CLIP-based baseline model with an Extended Re-weighted Multimodal Contrastive (ERMC) loss function for aligning cross-modal emotion embeddings. 1) The SpEmoC is the largest publicly available multimodal emotion corpus featuring class balancing, which enables fair evaluation across all seven emotions. 2) Multi-stage refinement (thresholding and human validation) and a movie-level split ensure the creation of high-quality, generalizable benchmarks. 1) Around 85% of the data originates from feature films and TV series, in which emotions are typically acted out and often exaggerated. 2) 60% of participants are from the Western/white ethnic group. This calls into question whether models can be generalized to global populations, and it may exacerbate inequalities in affective systems. 3) Although it is critical for multimodal systems, it is unclear how cultural norms influence the expression of emotions in data. 4) Pseudo-labels are generated by DistilRoBERTa and Wav2Vec 2.0, which are trained using social media and actor speech corpora. However, these models can carry their own biases (e.g. associating anger with aggressive vocabulary), which can distort the 'true' labels. 5) The architecture is a standard combination of CLIP-ViT, HuBERT and MLP, with no new fundamental components. 6) In real-life scenarios, emotions are usually complex, but the corpus assumes only one dominant category, which makes the task easier but reduces its practical value. 1) How did you verify that no actors were duplicated between splits? 2) Why wasn't ERMC compared with modern methods? 3) What measures have been taken to address the cultural bias of casting 60% white actors? 4) How were cases handled where the actors spoke without emotion, even though the scene was emotional? 5) How did you deal with mixed emotions? 6) Has an analysis of model errors been conducted by demographic group (gender and ethnicity)? Fully human-written
SpEmoC: Large-Scale Multimodal Dataset for Speaking Segment Emotion Insights Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents SpEmoC, a relatively large-scale multimodal dataset designed for emotion recognition in conversational speech segments. Curated from 3,100 English-language films and television series, the dataset comprises 306,544 raw clips as well as 30,000 refined clips. Each of these refined clips integrates synchronized visual, audio, and textual modalities, and all clips are annotated with labels corresponding to 7 basic emotion categories. It is anticipated that this dataset will provide a valuable contribution to advancing research in the relevant field. The paper makes valuable contributions to multimodal emotion recognition (MER) research: 1. SpEmoC directly addresses the problem of emotion imbalance in existing MER datasets: it achieves a relatively balanced distribution across 7 emotions, enabling robust recognition of minority classes. 2. The scale and source diversity (thousands of films/TV series) provide rich acoustic, visual, and linguistic variability, potentially improving generalization beyond lab-recorded datasets. 1. The empirical validation of the dataset’s effectiveness is too limited. In the main paper, only a single table (Table 4) reports results, and although some ablations are deferred to the appendix, there is no exploration of how SpEmoC benefits other downstream tasks. For a dataset contribution, readers expect broader evidence of utility, such as transfer to related tasks, pretraining gains for unimodal and multimodal backbones, or improvements in low-resource settings. 2. If I understand correctly, comparisons across datasets are conducted on each dataset’s own train/test split, without cross-dataset evaluation. This setup prevents a clear assessment of generalization. For a large-scale dataset claim, the community typically prioritizes evidence of out-of-domain robustness over in-domain performance on the new dataset. High in-domain scores on same-source data may simply indicate that the collection is not particularly challenging. 3. Results reported for the other two datasets in your tables are substantially below those in the literature. Stronger baselines should be selected and reproduced under comparable settings. 4. The dataset is potentially useful for the community; however, the novelty appears limited considering the data processing pipeline is quite standard. 1. Beyond Table 4 and the appendix ablations, what additional evidence can you provide to demonstrate SpEmoC’s utility? Have you evaluated pretraining/fine-tuning on SpEmoC and transferring to related downstream tasks? 2. Do models pretrained on SpEmoC yield consistent gains in low-resource regimes on external benchmarks? Have you conducted cross-dataset evaluations to assess out-of-domain robustness? 3. How sensitive are the results to clip segmentation and alignment errors (in the data processing pipeline)? Moderately AI-edited
Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper provides a theoretical analysis of one-layer transformers trained via gradient descent to learn from a class of teacher models with bilinear structure. The teacher models covered include convolutional layers with average pooling, graph convolutional layers on regular graphs, sparse token selection models, and group-sparse linear predictors. The authors prove that transformers with simplified "position-only" attention can recover all parameter blocks of these teacher models, achieving optimal population loss with a convergence rate of $\Theta(1/T)$. They also establish out-of-distribution generalization bounds and validate their theory through experiments on both synthetic and real-world (MNIST) data. 1. Unified theoretical framework: The paper identifies a fundamental bilinear structure shared across diverse learning tasks, enabling unified learning guarantees. This is a significant conceptual contribution that connects previously studied disparate settings. 2. Tight convergence guarantees: The paper establishes matching upper and lower bounds for the convergence rate of $\Theta(1/T)$, improving upon prior work. 3. Empirical alignment with theory: Synthetic experiments match predicted slopes and show early directional alignment between the learned value matrix and the teacher weights. 1. **Limited model complexity**: The analysis is restricted to one-layer transformers with simplified "position-only" attention. While the authors justify this simplification empirically, the gap between this architecture and practical multi-layer transformers with full attention is substantial. 2. **Notation density**: The paper is extremely dense with notation, making it difficult to follow. A notation table and more intuitive explanations would improve accessibility. 3. **Scalability concerns**: The convergence time $T^*$ grows quadratically with the sequence length $D$, which may be prohibitive for longer sequences. This scalability issue is not discussed. 4. **Gap between theory and practice (positional encodings)**: Assuming a fixed $D \times D$ orthogonal positional matrix departs from learned or sinusoidal encodings used in practice; further clarification is needed to justify this assumption. See weaknesses. Heavily AI-edited