|
CoFact: Conformal Factuality Guarantees for Language Models under Distribution Shift |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses the challenge of maintaining the factuality of large language model (LLM) outputs in real-world scenarios where the distribution of user prompts evolves. The authors correctly point out that existing conformal prediction (CP) methods for providing factuality guarantees fall short in this setting. This is because they depend on the exchangeability assumption, which holds when the calibration and test data come from the same distribution. However, this assumption no longer holds when the prompt distribution changes.
To address this limitation, the paper proposes CoFact. This is a new conformal prediction framework designed to maintain factual reliability under distribution shift. The key idea behind CoFact is to employ online density ratio estimation to adaptively reweight the static calibration data. This way, it aligns with the evolving test distribution at each time step. This enables the computation of an adaptive conformal threshold that effectively filters out hallucinated claims. Moreover, the system no longer relies on the exchangeability assumption or the unrealistic requirement of having ground-truth labels for test instances.
This paper addresses the problem of LLM factuality guarantees under the realistic condition of distribution shift. Moreover, the experimental validation is comprehensive. A key strength is the inclusion of the new challenging real-world benchmark (WildChat+). The results show that CoFact outperforms baselines and achieves its stated goal.
The CoFact methodology relies on an online ensemble framework where each expert is updated using an Online Newton Step (ONS). I wonder if this method is computationally more expensive. If it is more costly, this is a potential limitation for real-world, low-latency deployment.
The CoFact methodology relies on an online ensemble framework where each expert is updated using an Online Newton Step (ONS). I wonder if this method is computationally more expensive. If it is more costly, this is a potential limitation for real-world, low-latency deployment. |
Fully human-written |
|
CoFact: Conformal Factuality Guarantees for Language Models under Distribution Shift |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces CoFact, a novel conformal prediction framework designed to ensure factuality guarantees for large language models (LLMs) under dynamic, real-world distribution shifts. Specifically, CoFact incorporates online density ratio estimation (DRE), enabling adaptive reweighting of calibration data to align with the changing test distribution. CoFact is validated through a theoretical analysis and empirical experiments.
1. The paper addresses a critical limitation of existing conformal prediction methods by introducing online density ratio estimation, which allows for adaptation to dynamic and non-stationary distributions.
2. The authors provide a solid mathematical foundation for CoFact by establishing a theoretical upper bound on the hallucination rate under shifting distributions.
3. A new dataset, WildChat+, is proposed to evaluate the approach, featuring real-world user-generated prompts that effectively capture distribution shifts.
4. CoFact is rigorously evaluated across multiple experimental settings, demonstrating its robustness and effectiveness.
1. The writing in the paper is difficult to follow, as many key terms are introduced without proper explanation. For instance, the concept of calibration is presented without clarifying its meaning or role within the framework, making it harder for readers to grasp its significance.
2. The framework heavily depends on accurate online density ratio estimation, which can be computationally intensive and challenging to implement efficiently, particularly for high-dimensional or complex data distributions.
3. While the inclusion of WildChat+ is a notable contribution, the paper could benefit from exploring additional domain-specific real-world applications, such as legal, financial, or healthcare contexts, to further demonstrate its practical impact in high-stake tasks.
4. CoFact's emphasis on factuality overlooks other crucial response qualities, such as informativeness. For example, the framework might encourage the model to produce overly short or simplistic responses that sacrifice depth or detail in order to meet factuality requirements. This potential trade-off warrants further investigation.
5. The paper lacks an extensive discussion on distribution shifts, such as distinguishing between in-distribution and out-of-distribution user prompts. Providing examples or a detailed analysis of these distinctions would help clarify how the framework handles varying prompt distributions.
Including case studies and error analysis can significantly enhance the clarity and impact of the paper.
1. Case studies can illustrate how the proposed method performs under distribution shifts, offering concrete examples of its effectiveness.
2. Meanwhile, an in-depth error analysis can help readers understand why and how CoFact outperforms previous approaches, shedding light on its strengths and limitations in handling challenging real-world scenarios. |
Fully AI-generated |
|
NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper tackles the challenge of proving authorship over visual AI-generated content, especially for cases where the diffusion model is private and traditional watermarking approaches (which need model weights or modify outputs) are impractical or inefficient. The authors propose NoisePrints, a lightweight, distortion-free watermarking scheme that utilizes the random seed used to initialize the diffusion process as a watermark proof, exploiting the strong correlation between this seed’s noise and the generated content. The verification only requires the seed and output, with no changes to the generation process and no access to model internals. NoisePrints is validated on multiple state-of-the-art diffusion models for images and videos, demonstrating efficient verification using only the seed and output, without requiring access to model weights.
**Originality:**
NoisePrints introduces a technically novel approach by using the stochastic seed of diffusion models as a watermark, enabling efficient and model-agnostic authorship verification. The scheme also integrates cryptographic techniques, such as zero-knowledge proofs, to assure privacy and security in third-party verification scenarios.
**Quality:**
The paper demonstrates strong methodological rigor, offering a comprehensive security analysis and empirical validation across diverse models and datasets. Experiments show the watermark’s high robustness to output manipulations and adversarial attacks, while significantly reducing computational overhead compared to inversion-based methods.
**Clarity:**
The writing is clear and logical. Core ideas, threat models, and algorithms are well explained, with protocols articulated stepwise, making the contributions accessible to both technical and broader audiences.
**Significance:**
By removing the need for model internals and ensuring distortion-free watermarking, NoisePrints directly addresses real-world needs for copyright and provenance management in private and proprietary diffusion models. Its privacy-preserving design and scalability make it highly significant for the trustworthy deployment of generative AI and digital content protection.
- In the introduction (from line 52), the discussion of the method does not clearly convey the technical challenges involved. As a result, readers may be left with the impression that the approach is straightforward to implement, which could undermine the perceived significance of the contribution. The authors should better articulate the complexities and nontrivial aspects of their method.
- Tables 1 and 2 lack visual highlights or markers for the best-performing methods, making it difficult for readers to quickly identify key results. Clear visual cues, such as bolding or color highlights, are recommended to enhance table readability and emphasize the main findings.
- Sections 3.3 and 3.4 are dense with technical details and formalism, posing accessibility challenges for readers without deep expertise in diffusion models or cryptography. These sections would benefit from additional intuitive diagrams and plain-language protocol summaries to broaden accessibility.
- Figure 1 suffers from inconsistent font usage and the inclusion of citation notes within module labels, which detracts from the professionalism and clarity of the visual. The authors should standardize fonts and reconsider the figure layout for improved visual coherence.
- Figures 2 and 3 are overly large and contain dense information, resulting in plot axes that are difficult to read. The authors should revise these figures to more logically group and present the data, possibly by dividing them into multiple panels and increasing the size and clarity of key elements.
1. As mentioned in the limitation: The verification scheme relies on access to the public VAE used by the diffusion model. When the VAE is not public or is heavily modified, the approach may be less applicable. How the author plans to address such issues?
2. Can the authors better articulate the complexities and nontrivial aspects of their method in terms of the task of this paper. |
Fully AI-generated |
|
NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper provides some insights on how the embedding of the generated image is related to the random seed, and thereby proposes a watermark verification method. The method only verifies the dependency on the embedding and the random seed and no weight information is needed.
This paper provides an interesting angle to analyze the relationship between the random seed and the generated image. Theoretical result is also provided for H0 (though I understand there is difficulty in analyzing the underlying distribution under H1).
(W1) Practicality of the scenario: The paper assumes a setting where the model provider is different from the model user, and the user seeks to protect their IP. However, there are some concerns: (1) If the model provider is untrustworthy, why would the user choose to use that model in the first place? (2) If the model provider is trustworthy, what advantages does this approach offer over existing white-box methods, especially considering potential robustness concerns raised in (W3)?
(W2) Theoretical justification: When $E(x)$ and $h(s)$ are independent, the derived distribution appears reasonable. However, the distribution is unclear when $E(x)$ and $h(s)$ are correlated, leaving a gap in the theoretical analysis.
(W3) Robustness and empirical evaluation: Based on the theoretical results, it seems that the test’s power strongly depends on $E(x)$. This suggests vulnerability to attacks that substantially alter $E(x)$. In the examples in the paper, the embeddings remain relatively close to the originals. For more aggressive attacks, such as redrawing the image, the test’s power could degrade significantly. While the algorithm is lightweight, it would be useful for the authors to provide a comparison that examines the trade-off between computational cost and performance under stronger attacks.
Please address my concerns in (W1) and (W3). I understand that it would be difficult to derive theories for (W2), but is it possible to provide an empirical distribution (histogram) on the test statistics under H0 and H1 respectively? |
Lightly AI-edited |
|
NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes NoisePrints, watermarking framework for diffusion models that uses the random seed of the generation process as a proof of authorship. By leveraging the strong correlation between the initial noise and the final output, the proposed method secures the generation process with a cryptographic hash and optional zero-knowledge proof, which enables verification without accessing diffusion model weights. Experiments on image and video diffusion models demonstrate that NoisePrints achieves robust and efficient authorship verification under common content manipulations.
1. The writing of the paper is clear and well-structured.
2. The proposed method is simple but effective, especially in terms of robustness against different types of attacking.
3. Experiments are conducted on multiple diffusion models (including both image and video generation) and against various types of attacks, demonstrating the generality of the method.
1. A main concern is the applicability of the method, in the scenario considered in this paper, the verification of the watermark relies on public structure VAE. However, in practice, it is possible that some diffusion models may update or fine-tune their VAEs across versions. It is not clear that under this circumstance, whether the proposed method is still effective.
2. The motivation for the considered scenario requires stronger justification. In real-world applications, a more common concern is that model owners aim to trace who is responsible for the malicious or unauthorized use of their models, or that data owners wish to verify whether their data have been improperly used to train a model. In contrast, if a regular user simply generates an image using the model, it is unclear why they would need to prove authorship of the generated content, or why others might contest such authorship.
1. If the watermarked images gone through semantic level modification, such as style transfer, can the watermarked detection accuracy still maintain?
2. According to some studies [1], large diffusion models exhibit partial memorization of training images. In such cases, different seeds may yield visually or latently similar outputs, breaking the one-to-one correspondence between the seed and the generated content. This could lead to higher false positive rates in NoisePrints verification, since unrelated seeds might still produce embeddings that correlate above the verification threshold. Can the author provide evaluation results of proposed method in such case?
[1] Memory triggers: Unveiling memorization in text-to-image generative models through word-level duplication
[2] Understanding (un) intended memorization in text-to-image generative models.
[3] Extracting training data from diffusion models. |
Fully human-written |
|
NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the copyright verification problem for private models and proposes a novel method named NoisePrints. The core idea is to leverage the inherent correlation between the initial Gaussian noise seed used during generation and the final output image, treating this correlation as a natural and verifiable watermark. The authors also innovatively incorporate zero-knowledge proofs, enabling ownership verification in this scenario without exposing the random seed. Experimental results demonstrate that NoisePrints can effectively resist both common image processing operations and specific forgery attack, exhibiting excellent robustness.
1. The paper presents a method with an elegantly simple design. Its requirement of only a public VAE for verification makes it highly efficient, resulting in significantly faster performance than existing approaches.
2. The approach eliminates the need for training and preserves output quality, offering a plug-and-play integration capability for diverse diffusion-based architectures.
3. The incorporation of a zero-knowledge proof circuit is a notable strength. It enables secure third-party verification without exposing the secret seed, effectively mitigating a key risk in ownership attestation.
1. The proposed NoisePrints is a single-bit watermarking method, which is primarily designed for ownership verification. In contrast, many existing methods are multi-bit, offering the extended capability of tracing the specific user responsible for the generation.
2. The proposed scheme faces the same practical deployment challenge as Gaussian Shading[1] (GS). To remain distortion-free, it relies on a randomly sampled seed for each generation, which necessitates securely logging and managing a vast database of `(x, s)` pairs. This requirement creates significant operational overhead and scalability concerns in real-world systems.
3. A logical inconsistency is identified regarding the applicability of the Dispute Protocol. According to Section 3.3, the protocol is activated exclusively when "two parties submit conflicting authorship claims that both pass the verification test." This precondition is inherently incompatible with the threat model of a geometric removal attack, as defined in Section 3.2. In such an attack, the adversary's success is measured by causing the legitimate claim `(x, s)`to fall below the verification threshold τ, thereby making the "both pass" condition unattainable. Consequently, the protocol, as currently formulated, offers no recourse for the rightful owner in this common adversarial scenario.
4. A significant tension exists between the stated threat model and the technical prerequisites of the proposed method. The introduction emphasizes the challenge of watermarking for private models where "model weights remain private and are never shared." However, the verification protocol requires the model provider to use a publicly available VAE encoder. This creates a dependency that contradicts the scenario of a completely self-contained, proprietary model, as a provider wishing to keep their entire pipeline private would be unable to use NoisePrints.
5. A critical security flaw exists in the naive protocol (without ZKP). The requirement for the content producer to expose the seed `s` by submitting the pair `(x, s)` makes the scheme vulnerable to forgery. An adversary who steals this pair can exploit the public VAE to create a perturbed image `x'` that is perceptually similar to `x` but lies outside the verification boundary for the legitimate owner. The adversary can then present the pair `(x', s)` and successfully claim ownership, all without requiring access to the private U-Net. This breaks the security model under a stolen seed scenario.
6. The claimed robustness against geometric attacks is conceptually problematic. The resilience is not an inherent property of the NoisePrint signal but is entirely dependent on the Dispute Protocol's ability to apply a corrective inverse transformation. This process merely reverses a specific, pre-defined manipulation (e.g., rotation, scaling) prior to verification. It does not demonstrate that the watermark itself can survive true geometric distortion, which typically causes irreversible, non-aligned spatial scrambling. The same logic could theoretically be applied to any basic image processing attack (e.g., contrast adjustment, blur) if an effective "inverse operation" could be found and applied. Therefore, the credit lies with the corrective pre-processing within the protocol, not with the fundamental robustness of the NoisePrints method.
7. The evaluation of the Zero-Knowledge Proof (ZKP) implementation is relatively preliminary.
8. The threat model for the "Watermark Injection" attack appears to lack practical motivation. The scenario in which an adversary creates a forged image that is *visually similar* to the original while also embedding an *identical* watermark seems contrived. In practice, an adversary seeking to claim ownership would more plausibly create a *different* image (e.g., a novel artistic creation) and falsely associate it with a forged seed, rather than meticulously replicating the original content with the same watermark. The authors should either provide a stronger justification for the considered injection attack scenario or redefine it to reflect a more realistic adversarial goal.
1. **Regarding Weakness 1**: NoisePrints already assigns a unique seed to each user as an identity identifier. This foundation could be directly extended to construct a simple multi-bit scheme, for instance, by allocating a subset of seeds to represent specific user IDs. However, the authors have not evaluated NoisePrints from this perspective. To comprehensively benchmark against state-of-the-art methods like Stable Signature[2] and Gaussian Shading[1], it is necessary to evaluate its performance in terms of traceability accuracy within a multi-bit framework.
2. **Regarding Weakness 3**: The description of the Dispute Protocol's usage scenario should be reorganized to align with the experimental setup and resolve the logical contradiction when facing geometric attacks.
3. **Regarding Weakness 4**: To resolve the contradiction in the threat model, I recommend removing the requirement for a public VAE. Instead of relying on model providers to reuse a public VAE—which conflicts with the scenario of fully proprietary models—the model owner could entrust their private VAE to the fully trusted verifier. This approach would better align with the stated principle that "the verifier is the only trusted party" and that "model weights remain private and are never shared," while still enabling the verification procedure.
4. **Regarding Weakness 5**: I recommend that the authors augment the threat model with specific strategies to mitigate the risk of seed exposure in the naive protocol version.
5. **Regarding Weakness 6:** The paper attributes geometric robustness to the NoisePrints method itself. However, the described mechanism relies entirely on the Dispute Protocol's ability to apply an inverse transformation to "correct" the image before verification. Could you clarify how this approach demonstrates inherent robustness of the watermark signal, as opposed to being a general pre-processing strategy that could theoretically be applied to any watermarking scheme? Furthermore, does this mean that NoisePrints' geometric robustness is ultimately limited to attacks that are both invertible and whose inverse transformation is known and included in the public set 𝒢?
6. **Regarding Weakness 7:** The paper demonstrates the functional correctness of the ZKP implementation. However, its security guarantees are primarily cryptographic. A critical remaining question is: how does the *watermark robustness* fare when the image undergoes attacks *before* the ZKP-based verification is performed? Specifically, if an attacked image `x'` (e.g., after JPEG compression, blurring, or a geometric transformation) is submitted for ZKP verification, will the circuit still correctly output `1` (indicating a valid watermark) when provided with the legitimate seed `s`? We recommend that the authors conduct a simple but essential robustness evaluation for the ZKP scenario to confirm that the strong robustness demonstrated in the standard setting is preserved when verification occurs within the ZKP circuit.
7. **Regarding Weakness 8:** The authors should consider evaluating more practical forgery attacks, such as those proposed in [3].
8. The paper does not mention a specific method for mapping bits $PRNG(h(s))$ to Gaussian noise $\epsilon(h(s))$. How is this step concretely implemented?
9. It is unclear from the experimental description whether a single VAE was used for all verification tasks, or whether the native VAE corresponding to each generative model was employed. Could the authors clarify this point?
[1] Yang, Zijin, et al. "Gaussian shading: Provable performance-lossless image watermarking for diffusion models." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*. 2024.
[2] Fernandez, Pierre, et al. "The stable signature: Rooting watermarks in latent diffusion models." *Proceedings of the IEEE/CVF International Conference on Computer Vision*. 2023.
[3] Müller, Andreas, et al. "Black-box forgery attacks on semantic watermarks for diffusion models." *Proceedings of the Computer Vision and Pattern Recognition Conference*. 2025. |
Heavily AI-edited |
|
NeuMoSync: End‑to‑End Neuromodulatory Control for Plasticity and Adaptability in Continual Learning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
NeuMoSync is quite a complex architecture. The authors successfully demonstrate that the model does not lose plasticity on online continual learning tasks. The model can not only adapt quickly to new tasks but can also re-adapt quickly to previously trained tasks. A central component is the NeuroSync controller, which takes the current input and a feature vector for every neuron in the main network and outputs coefficients that govern how the inference network behaves.
It’s not clear to me how exactly the NeuroSync controller learns to produce effective coefficients for the inference network, given that there is no meta-learning loop. But it’s great that it does! The authors show that parameter sharing in the controller’s architecture is crucial, though I’m still not quite sure about the reasons for this.
Recommendation: Weak accept. The core idea is interesting, the empirical results are encouraging, and the ablation study suggest that all the components of the architecture matter. I did have a hard time grasping what was done at first. I do think the presentation could be improved. Figure readability is an issue, and quite a lot of important details are deferred to the appendix, but I don’t see fundamental issues that would block publication.
The authors demonstrate that NeuMoSync works for online continual learning and appears to enable fast adaptation and re-adaptation.
Results in Figure 2 and Table 1 are positive, even if some metrics are not immediately intuitive.
The ablation studies indicate that removing components degrades performance, supporting the claim that each part of the architecture matters.
For the results, it’s hard for me to know how impressive the performance is from the information about the comparisons provided. It would be helpful to know parameter counts for each of the comparisons and a little bit more information about the choice of hyperparameters. A lot of this information is in the appendix, but I think some of it should be moved to the main text if possible.
Some figures (especially Figure 4) are hard to read due to small text, and several practical details are mostly in the appendices, making it harder to judge the main claims from the body alone.
One thing I’m curious about is that the controller requires the current input: how important is this? It would be interesting to see an ablation without this. (It may already be in the appendix, I may have missed it.) |
Fully human-written |
|
NeuMoSync: End‑to‑End Neuromodulatory Control for Plasticity and Adaptability in Continual Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper addresses the challenges of plasticity loss and poor knowledge transfer in continual learning by introducing NeuMoSync, a novel architecture inspired by global neuromodulatory mechanisms in the brain. NeuMoSync utilizes a higher-level module that synthesizes current inputs and the network's historical state, allowing it to adaptively regulate activation dynamics and synaptic plasticity. Evaluated on a diverse set of CL benchmarks, the method demonstrated strong performance in retaining plasticity and achieved significant improvements in both forward and backward adaptation compared to existing approaches.
The paper's core strengths lie in its clarity, rigorous validation, and the impressive performance of the proposed NeuMoSync architecture. NeuMoSync demonstrates significant performance gains over a wide array of baselines on diverse continual learning benchmarks. The paper also goes beyond raw accuracy and includes adaptation speed and knowledge transfer for more comprehensive quantification of models' performance. The paper's claims are supported by ablation studies which clarifies the benefit of each component of the model. The analysis of emergent behaviors provides valuable intuition into why the system succeeds.
1. The paper frames the proposed method as inspired by neuromodulatory mechanisms in the brain, but the link is very weak. The modulation in this work seems more closely linked to conditioning modules, such as FiLM [1], as opposed to being brain-like. The authors also claim the consolidated network is inspired by the memory consolidation process in the brain, but it's not clear to me how averaging weights updated with gradient-descent is tied to any known neural mechanism. While the authors have acknowledged this in the appendix (Appendix I.5), this remains a weakness given that "brain-inspired" seems to be the key motivation for the method.
2. The scalability of the method to larger networks for more complex tasks is unclear. The NeuroSync module needs to take the feature vectors of every neuron as input and produce the modulation coefficient for every neuron. It seems hard to scale this to larger networks, as acknowledged by the authors. One possibility would be to only modulate a subset of neurons in the large network, but that would require additional empirical experiments.
3. The focus of the paper is on quick relearning and adaptation (plasticity) but does not address the stability issue (i.e., catastrophic forgetting) in isolation. However, this is not made clear in the main text. It would be better to mention these limitations in the main text instead of in the appendix.
[1] FiLM: Visual Reasoning with a General Conditioning Layer
In addition to the points above, I wonder if using global consolidation and plasticity factors $\alpha_{WC}$ and $\alpha_{SM}$ would degrade model performance, this could help clarify if neuron-specific consolidation is necessary. |
Fully human-written |
|
NeuMoSync: End‑to‑End Neuromodulatory Control for Plasticity and Adaptability in Continual Learning |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors propose a method for preserving plasticity when training over successive tasks.
The methods maintains both a normal trained network (MainNetwork), and its slow-changing Exponential Moving Average (EMA - ConsolidatedNetwork).
Furthermore, each neuron maintains a vector of features (position in network, past activation stats, learnable features...)
Then a NeuroSync module takes in the feature vectors of all neurons, together with the current data sample, and outputs 4 modulation coefficients per neuron. 2 of those are used to interpolate between the neuron's weights from fast and slow (EMA) networks, and the other modulate the neuron's output function.
Standard Backpropagation occurs through both Neurosync and MainNetwork, after each training sample/batch.
The model seems much better than multiple alternatives at quickly learning new tasks. This is especially true for tasks that require pure memorization with little shared structure, such as Random-Label MNIST and CIFAR.
Various ablations and experiments attempt to describe the dynamics of the system and explain its performance.
The model's performance in quickly learning new tasks, over a succession of tasks, seems much higher than many alternatives, including training from scratch, strong L2 regularization, and various algorithms for maintaining plasticity.
The algorithm itself is clearly explained.
- The authors make many references to continual learning. However, they implicitly acknowledge that their proposed method itself is not actually capable of continual learning and forgets previous tasks, since they are forced to augment it with Experience Replay in their "Forgetting" experiments. The method seems to improve the capacity of the network to quickly learn new tasks, with little regard to previously learned tasks.
- The authors show many graphs of the system's behavior. Unfortunately some of these graphs seem to contradict each other, making interpretation difficult (see below).
- However, from some of those graphs, a possible explanation for the system's behavior emerges; this explanation is quite different from what the one the authors suggest.
- Graphs about the alphas produced by the system, as shown, seem contradictory. For the same Random-Labels CIFAR task, Figure 4a shows average alpha_sm as small and slightly positive. Figure 4c shows all alpha_sm as uniformly and strongly negative (<1). Figure 9 shows alpha_sm as moderately negative (~-0.5). Something must be missing from the descriptions.
- Similarly, figure 4b shows that alpha_wc is basically 0 for the first few tasks, before jumping higher. But Figure 4C shows alpha_wc jumping almost immediately to sizeable values. Please clarify.
- Comparison are not helped by the authors constantly changing the x-axis between "epochs", "tasks" and "steps". Some consistent markers for successive tasks would be useful.
- Overall, the graphs suggest that alpha_sm (the dynamic weight on the current network) and alpha_wc (the dynamic weight on the averaged, slow-changing network) are consistently of opposite signs. This remarkable fact is not mentioned by the authors, unless I missed it. If true, it would suggest an immediate explanation: the system simply *subtracts* the accumulated weights from the current, fast-moving network, making the changes "faster" (the specific assignment of which is negative or positive, between alpha_sm and alpha_wc, should make no difference, unless I'm missing something).
- This is particularly relevant for Figure 4b, which suggests that a jump of alpha_wc coincides with and counteracts a dip in performance, presumably caused by loss of plasticity. The authors choose to interpret this as "learned reliance on consolidated knowledge" - which seems counter-intuitive since (as the authors point out) this particular task has no use for past knowledge. Instead, it suggests an increased "active forgetting" of this past knowledge, reducing the burden of accumulated (and now irrelevant) information.
- Please clarify whether the above makes sense. Maybe it doesn't, but there should be at least some mention of the apparent opposite signs between outputted alpha_sm and alpha_wc (assuming it the graphs showing it are the correct ones). |
Fully human-written |
|
NeuMoSync: End‑to‑End Neuromodulatory Control for Plasticity and Adaptability in Continual Learning |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a biologically inspired continual learning architecture, NeuMoSync, which introduces three core components: a Main Network (for rapid adaptation to new tasks), a Consolidated Network (for long-term memory), and a NeuroSync module (as a global regulator). NeuMoSync dynamically modulates neuron-level plasticity, activation functions, and synaptic weights to address the loss of plasticity and inefficient knowledge transfer commonly observed in deep neural networks during continual learning. Across six categories of continual learning benchmarks, NeuMoSync demonstrates superior performance in maintaining plasticity and achieving fast forward/backward adaptation compared to existing methods. Systematic experiments further reveal that the modulation parameters exhibit neuroscience-like behaviors, such as dopamine-like responses during task switching and neuron functional specialization, which validates the biological plausibility of the proposed approach.
This paper presents an innovative integration of neuromodulatory mechanisms with continual learning, achieving a dynamic balance between plasticity and stability through neuron-level modulation. The experimental design is rigorous and extensive, covering six benchmark types and multi-dimensional metrics, such as plasticity, adaptation speed, generalization. And the comparisons with meta-learning approaches and stability-enhanced methods further demonstrate the method's robustness. Besides, the emergence of neuroscience-like modulation behaviors (e.g., dopamine-like responses and neuron specialization) observed through systematic experiments strengthens the credibility and interpretability of the biologically inspired design. Moreover, the paper features a clear structure with detailed methodological descriptions. Illustrations such as architecture diagrams, learning curves, and modulation parameter analyses provide intuitive support for key arguments.
Although the paper mentions that the parameter overhead of NeuMoSync is only 5–8%, the scalability of the NeuroSync module, which relies on Transformer-based network, is not sufficiently discussed for networks such as ResNet. Besides, the forgetting experiments (Appendix F.3) depend on experience replay, which does not directly demonstrate NeuMoSync’s intrinsic ability to mitigate Catastrophic Forgetting. Moreover, some biological analogies (e.g., αARM as “tonic neural modulation”) lack direct empirical validation, which weakens the persuasiveness of these claims.
How can NeuMoSync be extended to very large-scale architectures?
Would adjustments to the neuron grouping strategy or sparsification in NeuroSync be necessary?
Are the observed phenomena such as the “dopamine-like responses” of modulation parameters supported by neuroscientific experiments results, or are they qualitative analogies?
The paper states that “in this manner enables input-dependent amplification or attenuation of each network’s contribution within the Inference Network.” in line 171-172. Could the authors elaborate on the implementation details of this mechanism?
In Lines 185-188, is there corresponding ablation experiments results supporting the described behavior |
Moderately AI-edited |
|
Continuous multinomial logistic regression for neural decoding |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces CMLR, a generalization of classical multinomial logistic regression to handle continuous output variables. Instead of discrete class weights, CMLR defines smooth, output-dependent weight functions with GP priors, allowing it to model conditional probability densities over continuous variables such as time, orientation, or spatial position. The authors derive a SVI algorithm in the Fourier domain for scalable learning on large datasets. They show CMLR’s performance on diverse neural decoding tasks across multiple brain regions (mouse and monkey V1, hippocampus CA1, motor cortex), and show that it outperforms Naive Bayes, FlexCode, XGBoost, and DNN in both accuracy and calibration.
1. Although it relies on strong modeling assumptions than nonlinear methods, the approach is flexible enough to handle circular and multidimensional outputs.
2. The method is interpretable, learning weight functions that correspond to tuning curves.
3. This method offers an attractive alternative for researchers who prefer to avoid the complexity and hyperparameter tuning required by nonlinear models while still getting strong decoding performance.
The DNN results in Figs 2 and 3 show relatively poor performance with large variance. While I understand that DNNs typically have higher variance than the proposed method, the degree of underperformance here suggests that the architecture or hyperparameter tuning might not have been well optimized for these decoding tasks. This raises some concern about the fairness of the comparison. It would be helpful for the authors to acknowledge this limitation in the paper. If, on the other hand, the proposed method achieves strong performance without requiring extensive hyperparameter tuning, that could be an additional advantage of the proposed method.
Relatedly, I would encourage the authors to clarify how they see the contribution of this work in the context of increasingly expressive modern models and large-scale neuroscience datasets. Although the proposed method is more interpretable, it may lag behind nonlinear models (e.g., transformers) in decoding performance. From a practical standpoint, the utility of the method might be limited. I would appreciate a clarification of how the authors envision their approach complementing or coexisting with more complex models.
Despite using SVI for scalability, the method may still be computationally expensive for large-scale or high-dimensional datasets. A theoretical or empirical runtime comparison with baselines would strengthen the paper. |
Lightly AI-edited |
|
Continuous multinomial logistic regression for neural decoding |
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper develops a novel approach, Continuous Multinomial Logistic Regression (CMLR), that is an extension of MLR to predict continuous-valued outputs, and showcase this approach for neural decoding (predicting behaviors/stimuli from neural activity). The model enables interpretable understanding of smooth 'tuning curves' of neurons within the decoding model via GP priors. They develop an approach to fit the model with stochastic variational inference. The authors demonstrate excellent neural decoding performance, in additional to interpretability of the model across multiple datasets using their method.
This is an original, creative, and novel technical development for neural decoding (and prediction problems more broadly). The method has a good blend of accuracy and interpretability (via the learned tuning curves within the model). The paper is clearly written and the authors are very thorough. It is a very strong paper overall.
My one fundamental concern is with the comparisons in results, specifically in terms of hyperparameter selection. I might have missed it, but it's not clear to me how/if hyperparameter tuning was done both for the author's model and for comparison models. It seems odd how poor many of the results are from what should be close to state-of-the-art approaches (XGBoost and DNNs), which makes me suspect poor hyperparameters (causing either overfitting or underfitting). Proper hyperparameter tuning on a held-out validation set (within the training set) should be done, if it wasn't already.
1. Fig 4C - Euclidean error is hard to interpret - Coefficient of determination (the standard r2_score in sklearn) would be easier to interpret, and more standard for velocity decoding
2. Discussion: "First, unlike models such as XGBoost or DNNs, CMLR does not incorporate priors over outputs" - I don't understand this statement in terms of XGBoost and DNNs having priors on outputs.
3. Line 263 - closing bracket ] missing
4. I would put the methods in section 5.3 into an actual methods section. They feel very out of place when reading results
5. How long does your approach take to run compared to others? |
Fully human-written |
|
Continuous multinomial logistic regression for neural decoding |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors introduce continuous multinomial logistic regression (CMLR), a flexible nonparametric model that allows for both discrete- and continuous-valued outputs by mapping inputs to a full probability density using per-feature additive functions with Gaussian priors. The resulting model is applied to a wide variety of neural decoding tasks, both continuous and discrete, and shows impressive results compared to baselines.
The paper is well-written and clearly articulates both the gap in the current literature and the exposition of the method. As applied to the neural decoding problem, the weight functions become per-neuron tuning curves that provide more interpretability than tree-based or neural network approaches. The method is applied to a range of neural decoding problems (sensory and motor) and shows strong performance across all datasets. The limitations and future directions section makes clear that this is a rich model class with many exciting directions to explore, and the paper is a strong foundation from which to start this work.
My main concern is the thoroughness of the baseline comparisons. The authors chose a different number of Fourier components M and mini-batch size N' for each of their datasets, but did not clearly state how these values were chosen. Furthermore, it seems no attempt was made to search hyperparameter space for XGBoost or the DNN. Better hyperparameter tuning (especially for the baselines) would therefore give me more confidence in the stated results. The authors could, for example, choose 2 or 3 important hyperparameters per method and search over these by performing a train/val split of the 80% of training data and selecting the best hyperparameters on the validation data.
I am also concerned about the robustness of the results if they are indeed only reported on 20% of the data. My own experience in neural decoding has been that the train/test split can significantly affect model performance, and conclusions from one split may not hold with another split. Performing full k-fold cross validation (where every trial lands in the test set exactly once) would better control for noise introduced in the sampling process.
What is the computational complexity of CMLR, i.e. how do training and inference time scale with D, T, M, J? The authors have included training times in the Appendix, but it might be useful to mention these numbers (or at least their order of magnitude) more explicitly in the main text.
How much data do I actually need to train CMLR? Fig S3 is very cool and helpful, and it would be interesting to see something similar with real data. Do the GP priors/additive structure of CMLR make it more amenable to fitting models with less data, as compared to XGBoost or DNNs?
Fig 2E/3C/4C/S4C: what is the value of J used for CMLR/FlexCode/NB? How was this value chosen?
I find the brief mention of uncertainty calibration very interesting; it is well-known that DNNs, for example, are very often poorly calibrated. It would be cool to see the PIT results (i.e. Fig S5) for some of the other methods. Is it possible to do this for DNNs? Better calibration could be as strong a selling point as better accuracy for scientific applications, and might be worth emphasizing this in the main text more.
More with uncertainty calibration: having never seen PIT histograms, I can believe they are a good approach for quantifying calibration, but they feel very disconnected from the actual datasets that are being analyzed. The mouse hippocampal data offers an interesting example: here CMLR tends to make mistakes when the true or decoded position is at one end of the track. What do uncertainty estimates look like for these mistakes? Does the uncertainty scale with the magnitude of the error? How do those compare to uncertainty estimates from the other methods?
Minor: not having a background in this literature, the following sentences in the first Intro paragraph were a bit confusing to me: "However, many neural decoding tasks involve continuous variables... . In such settings, researchers who wish to use MLR-like decoding methods are commonly forced to discretize the output variable into a finite number of classes." At this point I don't really know what "MLR-like decoding methods" are and the first thing I think of is "sure but doesn't linear regression work just fine?" Perhaps clarifying this point early will help convince readers unfamiliar with the literature. |
Fully human-written |
|
Continuous multinomial logistic regression for neural decoding |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces a combination of multinomial logistic regression with the gaussian processes defined on the weights.
The method is designed for neural decoding, where the goal is to estimate external variables from neural population activity, such as running speed or direction.
To handle computational complexity, the method assumed a univariate gaussian process per weight, uses standard radial basis function in the Fourier basis and trains the model as a stochastic variational inference.
Then they test the model on several neural datasets, comparing with XGBoost and DNNs.
1. **Clarity**. The paper is clearly written and suggests extensive (theoretical) comparison with prior work (*"Connections to Prior Work"* section)
2. **Various datasets**. The papers tests their method on various datasets from different species and data modalities (calcium imaging and electrophysiology)
3. **Provided implementation**. The code is attached in the supplementary material.
4. **Strong baselines**. The baselines are chosen to cover both bayesian and performance-drive methods (like DNN), covering the field.
1. **Unclear computational scalability**. The paper claims to provide a *"scalable framework"*, however, the practical aspects of scalability are under-explored. What are the computational restrictions? How many additional parameters the method provides and how longer does it take to train compared to modelling regression with uncertainty or the other baselines provided (such as FlexCode or XGBoost)?
2. **Baselines**.The comparison with standard multinomial logistic regression is missing. For the other strong baselines, it is not clear how they were optimized per dataset if any optimisations were present.
3. **Lack of ablations for design choices justifications**. While authors acknowledge in limitations that multivariate Gaussian processes could be used and fixed Fourier-domain bases and RBF kernels could be replaces by adaptive basis functions - all of these are not conceptual limitations of the methods but rather a list of sometimes very straightforward technical improvements (like multivariate Gaussian processes) and could be done within the current submission.
4. **Unclear interpretability gains**. See Q6 - What are the additional interpretability benefits provided by CMLR? XGBoost can also give an interpretable weighted impact of each neuron on the target variable. Multinomial logistic regression with uncertainty can also be close in terms of interpretations.
1. How is CMLR related to the following works [1-3]?
2. Why the XGBoost and DNNs lines are missing in Fig 2c and Fig 3A?
3. I might have missed it in the text but which loss function do you use to train the CMLR? Is it MSE loss everywhere?
4. Why there is no comparison with the standard multinomial logistic regression? You do not really analyse uncertainties in the main text, and even for uncertainties methods like Laplace Redux [3] could be used to derive uncertainties for a non-bayesian methods.
5. Have you tuned the hyperparameters of DNN and XGBoost per dataset? As your datasets had different sizes the ratio of data to parameters might be crucial for performance. How DNN parameters compared to the CMLR learn parameters?
6. What are the additional interpretability benefits provided by CMLR? XGBoost can also give an interpretable weighted impact of each neuron on the target variable.
References:
[1] Chan, Antoni B. "Multivariate generalized gaussian process models." arXiv preprint arXiv:1311.0360 (2013).
[2] Payne, Richard D., et al. "A conditional density estimation partition model using logistic Gaussian processes." Biometrika 107.1 (2020): 173-190.
[3] Daxberger, Erik, et al. "Laplace redux-effortless bayesian deep learning." Advances in neural information processing systems 34 (2021): 20089-20103.
[3] Murray, Iain, David MacKay, and Ryan P. Adams. "The Gaussian process density sampler." Advances in neural information processing systems 21 (2008). |
Fully human-written |
|
Language Models That Think, Chat Better |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Reinforcement Learning with Model-rewarded Thinking (RLMT) which trains LLMs to generate long CoT reasoning before final answers, using online RL algorithms such as GRPO. Compared to RLVR relying on rule-based rewards tied to ground-truth answers, RLMT only requires prompts and uses reward models trained on human preference data over diverse prompts, as in RLHF, to evaluate responses. Previously RLVR is limited to structured domains like math and code and RLMT extends RL to open-ended reasoning tasks like open-ended chat.
1. The paper introduces Reinforcement Learning with Model-rewarded Thinking (RLMT) - a new way to incorporate reasoning in LLMs. It focuses on domains other than math, code and science and focuses on reasoning needed for creative writing and chat.
2. The paper has conducted comprehensive experiments across different RL algorithms like GRPO, DPO and PPO.
3. The paper goes the additional mile of qualitative analysis of model behavior under SFT and RLMT, as well as ablations with various SFT data sources and reward models.
1. It would be interesting to see what happens when RLMT training is mixed with RLHF/RLVR. Would you get the best generalized model which does well on all tasks - math, science and code as well as creative writing etc?
See above. |
Fully human-written |
|
Language Models That Think, Chat Better |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces RLMT, a post-training paradigm for LLMs that combines CoT reasoning with model-based reward optimization, extending reinforcement learning with verifiable rewards (RLVR) into general-purpose chat and open-ended tasks. RLMT requires models to produce detailed thinking traces before responses, which are then optimized via online RL against a reward model trained on human preferences, similar to RLHF but with forced long-form CoT. Experiments involve the Llama-3.1-8B and Qwen-2.5-7B families, both base and instruction models, across DPO, PPO, and GRPO. From the reported results, RLMT substantially and consistently outperforms standard RLHF on chat, creative writing, and general knowledge benchmarks. Analysis covers model behaviors, prompt/reward model ablations, and qualitative planning traits.
- RL with unverifiable domains is a timely topic. This addresses a longstanding limitation in generalizing explicit reasoning to open-ended tasks.
- The analysis is quite insightful to read.
- The paper is carefully written and well organized.
- The contributions seem a bit limited. The key difference that the paper claimed is enabling RL to work on unverifiable domains. This is achieved through substituting a verifiable reward function in RLVR with a reward model. However, the key, in my opinion, becomes how to obtain a strong reward model such that any policy model can improve its chat performance when RL with the reward model.
- The conclusion is a bit too intuitive. It is not surprising that long CoT benefits chat performance.
- The long CoT in chat might bring extra computation overhead. I understand that the chat domain is an example of an unvarifiable domain, but usually the general domain chats demands fast response and low latency. Always conducting the long thinking might not actually be what people want.
- Experiments about scaling effects should be a substantial part of the paper, but are completely missing. How does the performance change when scaling up the data size, model size, and inference budgets (token length)?
- Some observations are not properly interpreted. For example, why does long CoT help creative writing? Where does the creativity come from?
- The proposed method relies heavily on the scores produced by reward model. Is it robust to reward model bias or poor reward calibration? Have the authors measured or observed any reward gaming, reward hacking (length bias, verbosity), or mismatches between human preference and model reward during RLMT?
- How does the performance change when scaling up the data size, model size, and inference budgets (token length)? |
Fully human-written |
|
Language Models That Think, Chat Better |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper investigates the effectiveness of incorporating long chain-of-thought (CoT) reasoning in language model post-training for general-purpose chat capabilities. The authors introduce RLMT (Reinforcement Learning with Model-rewarded Thinking), which combines long CoT generation with online RL using preference-based reward models. Experiments are conducted across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B using DPO, PPO, and GRPO algorithms. The results show consistent improvements of 3-7 points on chat benchmarks (AlpacaEval2, WildBench, ArenaHardV2) compared to standard RLHF pipelines. The best 8B model reportedly surpasses GPT-4o on chat and creative writing tasks.
1. The paper conducts extensive experiments across multiple model families (Llama-3.1-8B and Qwen-2.5-7B), training algorithms (DPO, PPO, GRPO), and settings (warm-start vs. zero-shot). This provides robust emperical insights.
2. The paper addresses an important question about whether thinking/reasoning capabilities can improve performance on open-ended tasks beyond verifiable domains like mathematics and coding. The findings are useful for posttraining practioners.
1. The primary weakness is that the comparison conflates two factors: (1) the presence of thinking/CoT and (2) RLMT vs RLHF paradigm. The paper frames the comparison as "RLMT (with thinking) vs RLHF (without thinking)," but this is not a correct framing. In RLHF, one can still incorporate thinking by having models generate CoT traces and then extracting only the final output for the reward model to evaluate. The current setup makes it difficult to isolate whether improvements come from thinking itself or from the specific RLMT training paradigm. A more appropriate comparison would be: (a) RLHF with thinking vs RLHF without thinking, and (b) RLMT with thinking vs RLHF with thinking.
2. The paper is primarily empirical and does not have much novel research value (for example, an industry lab can quickly get such insights by sweeping across different methods). Also, for RLMT, there are many relevant work that do RL on unverifiable tasks, such as [1] [2]. These work are not discussed in the paper.
[1] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains, Gunjal et al., https://arxiv.org/abs/2507.17746
[2] Checklists Are Better Than Reward Models For Aligning Language Models, Viswanathan et al., https://arxiv.org/abs/2507.18624
1. Have you considered trying out RLHF with thinking (where thinking traces are generated but only final outputs are evaluated by the reward model)? This would help isolate the contribution of thinking itself versus the RLMT training paradigm. |
Fully human-written |
|
Language Models That Think, Chat Better |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces RLMT: performing RL on an LLM using a learned reward model, with thinking. In experiments, they perform RLMT with a modest set of prompts and compare this to baselines that simply use STF or RL without thinking as well as fixed open and closed models. They demonstrate good performance on a range of benchmarks (especially chat benchmarks) relative to both baselines and closed models. They perform a considerable set of ablations, showing good performance across choices of RL algorithm but demonstrating the importance of a strong reward model. They provide a qualitative analysis of the differences due to RLMT as well as analysis of CoT length.
Originality. To my knowledge, despite this domain receiving a great deal of attention, this precise technique, while simple, hasn’t been showcased before, and it makes good use of some newer benchmarks useful for measuring performance on chat and creative writing.
Quality. The experimentation — both the headline comparisons as well as the ablations — are fairly extensive. The headline comparisons are sensible, and the ablations make some helpful disambiguations about what’s working here.
Clarity. This paper is very well-written very understandable.
Significance. This domain is of great interest to many and has received quite a lot of attention. The main idea here is very natural and worth having results on, and the secondary results (e.g. the No SFT runs) are a natural extension of DeepSeek results and are very much worth highlighting.
I’m torn on how to think about the originality here. It’s a very simple extension of pretty well-understood ideas. It’s quite close to Wu et al. 2025a (as you cite), with the adoption of some more recently adopted techniques in online RL ported to LLMs as well as updated benchmark. It’d be helpful to tease apart precisely what makes this new relative to work like that in methods.
The results feel a bit overstated in parts. In particular, looking at the warm-started models (RLMT seems to work more decisively in the no sft setting), results look strongest on the chat tasks, which are most closely aligned with the prompts used during training. The paper claims strong performance on creative writing, but that only really seems to hold for instruct models. On other benchmarks, RLMT is pretty clearly doing worse than baselines. I’d suggest making this tradeoff clearer in results.
Already articulated in weaknesses. |
Fully human-written |
|
Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Control Reinforcement Learning (CRL): a PPO-trained policy that, at each generation step, selects SAE feature(s) at a chosen layer and adds the corresponding decoder vector to the residual stream, thereby "steering" the model token-by-token. The state is the current residual activation; the action is a binary selection over the SAE dictionary; rewards are task-specific. Reported gains on Gemma-2 2B are modest but non-trivial on some tasks (e.g., HarmBench +5.61-pt, BBQ-Ambig +3.55-pt; others are small). The paper also analyzes (i) layer/coefficients ("sweet spots" in later layers), and (ii) critic behavior (bottlenecks for single-token tasks vs gradual divergence in long-horizon reasoning).
1. Clear control interface over interpretable features. Formulating steering as an MDP over SAE features with per-token actions is neat and practically implementable. The action/state definitions and the steering equation are explicit.
2. The paper gives useful empirical guidance: later layers tolerate larger coefficients; early-layer large coefficients tend to break behavior ("sweet-spot" effect), aligning with residual-norm growth across depth.
3. Feature "impact" and diversity metrics provide some transparency into which SAE features are used when steering helps or hurts.
1. It is dissatisfying to see that most headline improvements are small. Also, there are no confidence intervals, multiple-seed runs, or bootstrap tests. I couldn't be confident about the stability of the gains.
2. The paper itself notes that on single-token QA without constrained decoding, a substantial portion of MMLU gains comes from eliminating invalid outputs (e.g., "*", whitespace) rather than improving knowledge. That weakens the claim that CRL improves reasoning/knowledge rather than format adherence.
3. Since CRL's core novelty is adaptive selection of interpretable features, it needs stronger ablations vs. simple static/greedy heuristics (e.g., always add the top-k SAE features by activation, or by a supervised classifier over features), and vs. logit-space steering matched for compute. The paper doesn’t convincingly isolate the benefit of PPO-based selection over such cheap alternatives.
4. The authors report critic "bottlenecks" (corrected vs. misguided nearly indistinguishable on MMLU), suggesting value estimation struggles when rewards are sparse and binary. That weakens the paper's promise that CRL delivers reliable token-wise interpretability. If the critic can't separate outcomes, the per-token attributions are noisy.
1. First, can you provide CIs / multiple seeds and report per-task variance; which gains survive across seeds?
2. If resource permits, could you add non-RL baselines (e.g., pick top-k SAE features by current activation so that total intervention norm equals CRL's)? Also, could you add a format-sanitizer that only enforces valid answer formats to quantify the fraction of gains due to formatting?
Conditional on satisfactorily addressing the above points, I am open to increasing my rating. |
Moderately AI-edited |
|
Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Control Reinforcement Learning, a framework for dynamically steering LLMs through SAE features. Instead of static activation edits, CRL learns a policy to select which SAE features to activate at each generation step, guided by reinforcement learning rewards. The authors claim this allows interpretable, token-level control while modestly improving performance across tasks such as MMLU, GSM8K, and HarmBench. The approach also provides diagnostic insights into critic bottlenecks, layer-specific effects, and semantic coherence of SAE features.
1. **Novelty of the proposed method**: I think the main strength lies in framing reinforcement learning over SAE features, which differs itself from other activation-based or gradient-based method.
2. **Breadth of evaluation**: The authors conduct extensive experiments across reasoning, factual, and safety benchmarks, showing the generality of their approach.
3. **Clarity of methodology**: The paper is well-structured and technically sound, with a clear explanation of the training process and design choices.
1. **Limited comparison with existing feature control methods**: A key weakness is the lack of comparison to established feature control approaches such as activation-based or gradient-based interventions. I find this omission makes it hard to understand what specific advantages CRL provides beyond existing techniques. A more direct experimental or conceptual comparison would help clarify novelty.
2. **Computational complexity and scalability**:Training a PPO agent over sparse feature activations introduces nontrivial computational cost. I think it would help to include runtime analysis or efficiency comparisons to justify the added complexity relative to simpler steering methods.
3. **Under-analyzed reward design**: The reward function is central to the method but not well justified or analyzed. Its stability and sensitivity to tuning are unclear, which weakens confidence in reproducibility.
4. **Limited interpretability validation and generalization evidence**: Interpretability results are qualitative and narrow in scope. The authors should demonstrate that the learned feature control generalizes across different tasks or datasets.
Please refer to weaknesses |
Fully AI-generated |
|
Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a framework called "Control Reinforcement Learning" (CRL) , aimed at interpretable, token-level dynamic steering of LLMs . The core of this method is to use "monosemantic features" extracted from LLM activations by Sparse Autoencoders (SAEs) as a control interface. The CRL framework trains a RL policy network that, at each generation step, observes the current token's residual stream activation and dynamically selects an SAE feature. Then, the decoded vector of this selected feature is added back into the model's residual stream to "steer" the model towards a better output. The method aims to solve the "emergent misalignment" problem that occurs during LLM inference and has achieved modest performance improvements on various benchmarks (such as MMLU, GSM8K, BBQ bias, and HarmBench safety) , while providing interpretability by tracking feature contributions.
1. The paper addresses a very forward-looking and critical problem in the field of LLM alignment: how to achieve dynamic, interpretable, token-level control over model behavior.
2. Using monosemantic features extracted by SAEs as the action space for RL is an innovative idea
3. The paper provides qualitative evidence to support its claim of "interpretable steering"
I am not an expert in the RL area. Therefore, the following comments are based on my current understanding, and I welcome any corrections to potential misunderstandings on my part.
1. The abstract mentions the use of "Adaptive Feature Masking" (AFM) to balance exploration and exploitation, but this key component is never mentioned or defined again in the paper's main body, experiment, or appendix.
2. Section 3.2 defines the intervention as $\tilde{x}_{t}=x_{t}+a_{t}W_{dec}$ which implies a steering coefficient of 1. However, Experiment Section 4.2 dedicates significant analysis to "steering coefficients" ranging from 10 to 100. This coefficient c, which is critical in the experiments, is not defined in the methodology.
3. The paper only compares CRL's results against the "Before" model (i.e., the original, non-intervened model). It completely lacks a comparison against standard fine-tuning methods (like SFT or DPO) on the same task data.
4. The paper admits in its conclusion that these steering effects "do not transfer well" to models after supervised fine-tuning (SFT). This severely limits the method's practical application value in real-world systems that require continuous updates and iteration.
No |
Fully human-written |
|
Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features |
Soundness: 1: poor
Presentation: 1: poor
Contribution: 3: good
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work proposes to learn by reinforcement learning a policy to steer LLMs by intervening on the activations of sparse features, as identified by SAEs trained on the residual stream activations. Experiments show modest improvements. The use of SAE features enable a degree of interpretability.
[S1] The idea of learning a policy to control feature strength is interesting and, to my knowledge, novel.
[S2] Some of the proposed analysis yield interesting insight, e.g. the different effects of controlling features across layers.
[S3] Results are presented in a measured way, without overstating them.
[W1] The paper suffers from a general lack of clarity. Crucial terms, like 'coefficients', are used without introduction or being obvious from the context (e.g. there are no coefficients in Eq. 3, which defines the activation interventions). Many sections in the body are not understandable without going back and forth with the appendix, as if content was moved there without checking the impact of doing so on the narrative flow. See below a detailed list of issues.
[W2] There are reproducibility issues: an Algorithm 1 is mentioned in the reproducibility section, but it does not seem to be anywhere. There is no description of the policy and value network implementations besides them being MLPs. Task-specific rewards are announced in 3.4, but are not in Appendix A as promised.
[W3] There is no experimental baseline. How well would CRL work if using random features instead of SAE features?
[W4] It is unclear whether reported results in Table 1 are obtained directly on the test set, or if the intervention layer is determined based on held-out data.
Additional points:
L22-24: Adaptive Feature Masking (AFM) is mentioned in the abstract and in the contributions, but never again in the paper?
L132-133: Why is the full problem a POMDP? Does this have to do with the fact that the influence of the KV cache of previous tokens through attention is not taken into account? This should be clarified, and an attempt to quantify the impact of this approximation should be made.
L159-161: This part is unclear. What are the coefficients being referred to? Should there be additional coefficients in Eq (3)?
L240-241: Coefficient averaging has not been introduced at this point.
Figure 2: what is in the left pane, and what in the right?
L263-265: Constrained and unconstrained decoding patterns have not been introduced. What is the connection between constrained/unconstrained decoding on one hand and factual question answering on the other?
L 268 and elsewhere: disambiguous --> unambiguous
L268-269: Can this norm increase be visualized?
L270: what is the 'coefficient 18 analysis'?
L285-286: Correct, incorrect, corrected and misguided are not introduced/defined.
In Figure 3 "generation step" means token position, in appendix it means layer.
Figure 4 caption: blue <--> green.
L480: "SAE-learned directions operate in non-linear interaction spaces rather than simple superposition": what is the evidence supporting this?
L481: "steering effects do not transfer well to models after supervised fine-tuning": was this shown anywhere?
L863-864: Fig. 9 right and Fig. 10 right do not seem to show this.
Appendix A.5 is empty
Fig. 14 caption: what does coefficient 18 mean?
See above.
Also: why PPO rather than DPO or GRPO? |
Fully human-written |
|
AlphaZeroES: Direct Score Maximization Can Outperform Planning Loss Minimization in Single-Agent Settings |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors demonstrate that one can use evolution strategies with an AlphaZero setup to directly optimize maximizing the reward. This appears to work well--or better than the base version in the tested games.
* It is always interesting to see ES
* The authors test their method on a number of games
* The authors provide and extensive survey of work in the area in the appendix
* The authors highlighted a few contexts where the ES strategy also did a good job of optimizing the auxiliary tasks that AZ targets.
* Are there any other baselines or methods (or families of methods AC) you could provide? Here, you only show AZ.
* Overall, (1) it is unclear what the score lines mean, what is a good score / success, (2) relatedly, if AZ is not learning at all or having any success.
The main question I still have is a want for more context: What type of baselines could you provide, what kinds of other RL agents are typically used for these tasks and how would they compare? Were the AZ agents fairly tested?
* Are the results good? You outperform AZ but it is not clear from the paper would a good performance is. Do other models do much better on these tasks? If AZ models do very poorly and perhaps are under-trained, then doing better is not necessarily super meaningful. Adding context of what success looks like for each task would help.
* Could you try other ways of organizing the plots? I don't think you need to show the loss/value for everything. Make the primary score plots bigger and more readable?
* How much compute is done per model/run ("4 hours of training time per trial") vs what is needed? Did the model training converge?
* What are the standard approaches for these tasks? Is this within the scope under which AZ works? I do see the appendix figures where ES still always did better even under increased compute.
# Minor notes
* L289 - Seem to "increase" |
Fully human-written |
|
AlphaZeroES: Direct Score Maximization Can Outperform Planning Loss Minimization in Single-Agent Settings |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper focuses on enhancing AlphaZero algorithm with evolution-based optimization algorithms, which allows to directly optimize cumulative reward function over an episode. The core idea is to train actor model with ES algorithm rather then gradient descent.
- One of the huge advantages of this method is it ability to scale to a large number of workers, which allows to train agent in parallel setting.
- No direct comparison to AlphaZero and other methods. Experiments only show an effect of different hyperparameters on AlphaZeroES performance.
- The description of resulting algorithm is hard to understand from pure text description. There is no outline (i.e. step-by-step description or pseudocode) provided in the main part of the paper.
- Is there any direct comparison with AlphaZero and other RL methods in terms of performance?
- The main motivation of applying ES is direct black-box optimization of cumulative reward. However, RL algorithms such as REINFORCE can do that to. Even more, the agent learning algorithm of original AlphaZero can do that too via n-step returns or lambda-returns. Is there any other motivation for applying ES specifically? |
Fully human-written |
|
AlphaZeroES: Direct Score Maximization Can Outperform Planning Loss Minimization in Single-Agent Settings |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes AlphaZeroES, a modification to the AlphaZero algorithm for single-agent settings. The core idea is to replace AlphaZero's standard planning loss—which minimizes the difference between network predictions and MCTS search results (for policy) and episode returns (for value) —with a new objective that *directly* maximizes the total episode score. Because the MCTS component is non-differentiable, the authors employ Evolution Strategies (ES) as a zeroth-order, black-box optimizer to train the neural network parameters. The method is evaluated on five single-agent environments (Navigation, Sokoban, TSP, VKCP, and MDP), where the authors claim that AlphaZeroES "dramatically outperforms" the standard AlphaZero baseline while using the same MCTS algorithm and network architecture.
The paper's primary strength is the question it poses: challenging the optimality of the standard AlphaZero planning loss. This is a fundamental and worthwhile question to investigate.
The paper's observation that maximizing the score via ES does not correlate with minimizing the standard planning losses (and can even be anti-correlated) is an interesting finding.
The ablation study in Appendix D, which attempts to separate the contributions of the policy and value network optimization, is a good addition and provides some insight, showing the source of improvement is environment-dependent.
My main concerns are as follows:
1. **Lack of Intuition and Analysis:** The paper provides no satisfying explanation for *why* its method works. In fact, the results in Figures 2, 3, 5, etc., show that the value and policy losses for AlphaZeroES often increase or stagnate while the score improves. The paper even states "a definitive explanation... is beyond the scope of this paper". This is not acceptable. If the network's value/policy heads are producing "worse" predictions (according to the standard loss), how is the MCTS using these heuristics to produce a *better* overall policy? What is being learned? Without this analysis, the paper is just a collection of puzzling empirical results.
2. **Questionable Sample Efficiency:** Evolution Strategies are zeroth-order methods and are notoriously sample-inefficient, especially in high-dimensional parameter spaces (like neural networks). Yet, this paper claims comparable or better performance within the same training time and number of episodes as gradient-based AlphaZero. This is an extraordinary claim. It suggests that ES is *more* sample-efficient than backpropagation for this complex planning problem, which contradicts a large body of literature. This result is highly questionable and may be an artifact of poorly tuned baselines or very simple environments.
3. **Vague Methodology & Reproducibility:** As mentioned in the Presentation section, the paper lacks a clear, high-level pseudocode for the AlphaZeroES training loop. Section 4.3 just describes ES, not its integration. This, combined with the lack of a complete, runnable code repository, makes the results difficult to trust or replicate.
4. **Poor Scalability and Simple Environments:** The experiments are conducted on "toy" problems. A 10x10 grid or a 20-city TSP is not a convincing demonstration of a method intended to improve *AlphaZero*. AlphaZero's fame comes from its ability to master extremely complex domains like Go or chess. The scalability experiments in Appendix C only test up to 36 nodes and 128 hidden dimensions. This is insufficient. It is highly likely that a black-box ES approach will fail to scale to the millions of parameters and vast search spaces where gradient-based AlphaZero excels.
1. Could the authors provide a more concrete analysis of the learning dynamics? If the value loss is high (e.g., Fig 2), does this mean the MCTS effectively learns to ignore the value head's output and relies purely on rollouts? What do the learned policy/value predictions actually look like? How can the search be effective if its guiding heuristics are, by the paper's own metrics, not improving?
2. Can you please comment on the surprising sample efficiency? Why would a zeroth-order method (ES) be more efficient than backpropagation here? Were you able to run the baseline AlphaZero with a fully optimized set of hyperparameters? The results are very counterintuitive.
3. To properly test the claims of scalability, would it be possible to test this on a much larger, more standard benchmark? For example, a larger combinatorial problem (e.g., TSP with $n=100$) or a different domain entirely (like a small board game)? The current environments are too simple to support the paper's strong claims.
4. Section 5 states that "AlphaZero and AlphaZeroES took about the same amount of time per iteration." This is a very surprising claim. A standard ES update (like OpenAI-ES) requires $N$ full episode rollouts (one for each worker/perturbation) just to compute a *single* pseudogradient $g$. In contrast, a standard AlphaZero update (or batch update) can learn from the data of a single episode (or batch of episodes) via backpropagation. Could you please clarify what "iteration" refers to in this context? Does it mean one full parameter update, or one generated episode? This claim is central to the method's viability but seems to misrepresent the computational cost of ES. |
Fully AI-generated |
|
AlphaZeroES: Direct Score Maximization Can Outperform Planning Loss Minimization in Single-Agent Settings |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces AlphaZeroES, which replaces AlphaZero's planning loss minimization with direct episode score maximization using evolution strategies (ES). The authors test this on single-agent environments including Navigation, Sokoban, TSP, VKCP, and MDP, and report that AlphaZeroES consistently outperforms standard AlphaZero while maintaining the same MCTS algorithm and neural architecture.
1. The paper poses a focused, well-motivated question about whether direct score optimization can outperform indirect planning loss minimization.
2. Comprehensive demonstration of related work is provided.
3. Limitation is honestly discussed.
4. Statistical tests are provided, including Wilcoxon, signed-rank, and paired t-tests, which show statistical significance.
1. The organization and presentation of the paper can be improved. It would be better put mathematical expressions separately for clear elaboration instead of using in-line mode. For Section 5 (Experiments), the authors spend a great deal of space to introduce the environments instead of discussing the quantitative results demonstrated in the figures. Moreover, for each environment, all contents are put in huge bulk of single paragraph, making it difficult to follow.
2. The paper lacks theoretical justification. In Section 6 (Discussion), the claim about "simple optimal policy but complex value function" lacks rigor and doesn't explain why this would systematically favor ES.
3. The paper claims to test "across a wide range of learning rates for fair comparison" between AlphaZero and AlphaZeroES. However, this doesn't ensure a fair comparison. Different learning rates are optimal for different objectives, making it impossible to isolate whether improvements come from the objective change or better hyperparameter selection.
4. The loss report is inconsistent. Figures show AlphaZeroES doesn't minimize value/policy losses, yet performance improves. This disconnect isn't adequately explained.
5. Sensitivity analysis is not provided regarding perturbation scale, which is set to be 0.1. The ES-specific hyperparameter selection is not explained.
1. Did the authors perform a grid search or other systematic hyperparameter optimization for both methods? If so, what was the protocol?
2. How did the authors ensure the learning rate ranges tested were appropriate for each method's objective? Did the authors verify convergence for both methods at their optimal learning rates?
3. How did the authors select the perturbation scale of 0.1 for ES? What happens with different scales, such as 0.01, 0.05, 0.2, or adaptive schedules?
4. Regarding loss report, if AlphaZeroES achieves high scores without minimizing value/policy losses, what exactly has it learned? Could the authors analyze the learned representations? Could the authors plot the correlation between planning loss and episode score throughout training for both methods?
5. For theoretical justification of "self-consistency is not necessarily aligned as an objective with performing better", could the author explain why would policy-value inconsistency not hurt MCTS performance, given that MCTS explicitly uses both value estimates and visit counts for action selection? Could the authors provide a formal characterization of when "simple policy but complex value function" occurs? Under what conditions would this favor ES over gradient-based methods? |
Lightly AI-edited |
|
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces HiFo-Prompt, a new method for LLM-based AHD. It integrates a Foresight module, featuring an evolutionary navigator that monitors population dynamics and steers the search using interpretable "verbal gradients," and a Hindsight module, which maintains a self-evolving insight pool that distills successful design principles from high-performing code into knowledge. Evaluated on different heuristic design tasks like TSP, BPP, and FSSP, HiFo-Prompt demonstrates competetive performance, achieving superior results with greater computational efficiency than existing LLM-based AHD methods.
The self-evolving insight pool (Hindsight) and foresight instructions effectively prevent knowledge decay while enabling more strategic exploration of the heuristic space.
It achieves superior performance with fewer LLM calls and lower runtime compared to other state-of-the-art AHD methods.
The evolutionary navigator uses a fixed, rule-based policy with hand-tuned thresholds, which may lack generalization.
The paper would benefit from additional illustrations and a more extensive set of results to further support its claims
The stagnation is measured by raw fitness, delta g, which is a fixed value and may suffer from poor generalization to different heuristic design tasks.
The semantic variety is calculated based on the textual descriptions of algorithms (eq. 7). Is it the thought or the code text? It seems that the indicator only counts when the two algorithms are exactly the same. Will it be too greedy?
For the Foresight module, how was the specific set of Design Directives in the pool (Appendix G.3) designed? Was there an ablation study on the impact of different directive wordings on the LLM's output quality?
The framework's knowledge management is confined to a single task; can the learned insights be generalized or transferred to new, unseen problem domains?
What does the final algorithm look like? and how the insights and foresight prompts contribute to the generation of better heuristics, could you provide example illustrations?
A discussion and comparison with related works on prompt evolution and hierarchical search is suggested [1-3].
[1] MeLA: A Metacognitive LLM-Driven Architecture for Automatic Heuristic Design, arXiv
[2] Large Language Model-driven Large Neighborhood Search for Large-Scale MILP Problems, ICML
[3] Experience-guided reflective co-evolution of prompts and heuristics for automatic algorithm design, arXiv
There are typos and inadequate descriptions: e.g.,
line 811 ?
line 854, 788
Figure 1, The Left and Right can be misleading |
Fully human-written |
|
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
HiFo-Prompt tackles two common gaps in LLM-based Automatic Heuristic Design (AHD): lack of global search control and poor knowledge persistence. It adds a rule-based Foresight meta-controller that watches population progress/diversity and switches prompts among explore/exploit/balance regimes, and a Hindsight Insight Pool that distills reusable design principles from elites with utility-based credit assignment, then injects top-scoring insights into subsequent prompts. The method obtains the best results among various LLM-AHD baselines.
- The idea of tracking both local and global evolution dynamics via specialized modules is interesting and well executed
- Useful ablation studies
- Strong performance with few function evaluations
1. Seed insights are required by the method. Importantly, these insights could significantly improve generation quality: “Design adaptive hybrid meta-heuristics synergistically fusing multiple search paradigms and dynamically tune operator parameters based on search stage or problem features.” particularly is a high-quality handcrafted prompt that can have a substantial effect on the generation. For fairness of comparison, one should provide the same information in the prompt of other baselines, say EoH.
2. The novelty regarding global control and historical information aggregation is overstated, e.g., ReEvo already implements a short and long-term reflection that could be seen as a simpler version of hindsight. Discussions would be appreciated.
3. I am not convinced about the population size being chosen as 4. How can diversity be maintained in such as small population and avoid inbreeding?
4. I found the methodology section quite confusing, with many quite complicated implementations. For example, a decay rate is introduced, but there is no ablation or sensitivity analysis on it. Eq. 3, which describes the evolutionary contribution, is full of hardcoded parameters, which are hard to parse, and the rationale for choosing them is not explained.
1. On this point, please clarify whether $g$ is a minimization or maximization objective. In EoH, this is maximization, but Fig. 2 and equations suggest otherwise. However, section A.1 again takes $g$ as argmax. This is confusing.
5. No code is provided
1. About dissimilarity (Eq 7): how are the textual descriptions calculated, and how do you ensure these are the same? (e.g.: will changing a single word make two descriptions different?)
2. It appears that there is a massive degradation if Qwen 2.5 max is not used in Table 9. How do you explain this?
3. What would happen if baseline methods also have the seed insights as part of the generator prompt? |
Fully human-written |
|
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes HiFo-Prompt with (i) a Hindsight module that distills reusable principles from successful candidates, and (ii) a Foresight module that adaptively switches explore/exploit/balance based on population state to guide LLM-based AHD. The proposed method is evaluated on TSP, Online BPP, FSSP, and BO.
1. The proposed method is well motivated and outperforms recent LLM-based AHD baselines across several tasks.
2. The design details are well presented.
3. The limitations and future directions are clearly analyzed.
1. For TSP step-by-step construction (i.e., Table 1), Appendix B.1 states that HiFo-Prompt involves LLM calls at inference time, however, it is unclear to me that whether such strategy also applies to the baselines. Please disambiguate: (a) If baselines also call the LLM at inference, please explain why HiFo-Prompt’s runtime is longer; (b) If they do not, please also report HiFo-Prompt under the same inference protocol for fair comparisons.
2. The main text claims TSPLIB results are in Appendix C.1, but C.1 contains only descriptive text and a placeholder “Table ??”, with no actual results. Please add the promised table/metrics or revise the pointer.
1. Line 387 says “100 instances at each of five sizes,” but Table 1 shows three, please fix the mismatch. Also, there are several misplaced “?” characters around lines 811, 854, 946, 967 that need cleanup.
2. Can you present some of the actual heuristics generated and used to produce the reported results?
3. In Table 5, removing the Insight Pool would make the method perform worse than EoH, which is surprising to me since the setup still retains the Foundational Prompts adapted from EoH and the Navigator module. Can you analyze the concrete differences between EoH and HiFo-Prompt w/o Insight Pool & Navigator that can explain this gap? Will the Navigator module improve baselines like EoH as a drop-in controller?
4. How frequently does the Navigator select explore or exploit across runs? Have you tried an ablation that fixes the state to “balance” throughout to isolate the benefit of adaptive switching? |
Fully human-written |
|
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes HiFo-Prompt, a prompting framework for LLM-based automated heuristic design that marries two modules: Foresight (an Evolutionary Navigator that steers exploration vs. exploitation from population signals) and Hindsight (an Insight Pool that distills and reuses design principles from successful code across generations). By decoupling “thoughts” from code, HiFo-Prompt supplies state-aware guidance and persistent knowledge. Experiments on TSP, FSSP, online bin packing, and black-box functions show state-of-the-art quality, faster convergence, and lower token/time cost than prior AHD systems; ablations confirm both modules matter.
- Dual Foresight/Hindsight design elevates the LLM from code generator to meta-optimizer.
- Evaluation sees evident performance gain.
- It’s unclear how you ensured a fair comparison under “the same query budget.” Does distilling insights consume additional queries? How many times did you run your method and the baselines? Did you use the same number of heuristic evaluations? Standard deviations are not reported, so the performance gains are not fully convincing.
- The approach involves many hyperparameters. It’s unclear how they were chosen and how robust the method is to their settings.
- The method relies heavily on pre-engineered prompts.
- Similar ideas appear in EoH and ReEvo, where thoughts and reflections are distilled (both) and accumulated (the latter). Please clarify the novelty relative to these.
See weaknesses. |
Moderately AI-edited |
|
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design |
Soundness: 1: poor
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents HiFo-Prompt, a framework for LLM-based automated heuristic design that combines Hindsight, which builds an evolving Insight Pool of distilled design principles, and Foresight, an Evolutionary Navigator that adaptively balances exploration and exploitation. The method is applied to several optimization tasks (TSP, Online BPP, FSSP, and BO), and the authors report improvements over prior AHD methods in both solution quality and sample efficiency.
1. The motivation of this paper is clear and reasonable. The design ideas of global guidance and the insight pool are interesting and inspiring.
2. The similarity-based diversity discussion for the insight pool is conceptually stimulating.
3. The paper is clearly written and well organized, making it easy to follow.
1. I have concerns about the novelty threshold. The Insight Pool’s novelty filtering relies on Jaccard similarity over token sets. While this removes near-duplicate sentences, such a pure text-based comparison cannot capture semantic overlap. For example, one insight might be expressed in different ways. Since this novelty threshold is crucial for ensuring diversity, I worry this design may harm the actual effectiveness of the diversity mechanism.
2. The combination of a usage penalty and a recency bonus in $U(k, t)$ aims to balance exploration and exploitation, but the dynamics between these opposing terms are not analyzed. This could be sensitive in practice, and it would be helpful to justify or empirically demonstrate that this interaction leads to stable selection rather than oscillatory behavior. In particular, $w_u$ is a hyperparameter without ablation or sensitivity analysis, and the calculation of $B_r$ is not clearly presented. This reduces the soundness and reproducibility of the method.
3. The mapping from normalized performance $\tilde{\rho}$ to the effective credit $g_{\text{eff}}$ uses manually chosen piecewise constants (0.8, 0.6, 0.5, -0.3, etc.) with no theoretical justification or ablation. While the idea of tiered reward regimes is understandable, the specific scaling choices seem ad hoc and may not generalize across tasks. It would strengthen the work to at least provide hints or guidelines on how to select these values.
4. The definition of phenotypic diversity as the fraction of non-identical algorithm text strings feels coarse and potentially misleading. The measure is a bit lexical that two code snippets are treated as completely different even if they differ only by refactoring or variable renaming, ignoring actual semantic or functional similarity (similar with my commen in 2.). As a result, the system may overestimate diversity and trigger unnecessary exploration. Moreover, this approach scales as $O(|P|^2)$ comparisons per generation, which may become inefficient for larger populations and increase token consumption for LLM-based evaluations. The diversity threshold is also arbitrary and not justified or ablated. Overall, the lack of semantic grounding and unclear efficiency raises concerns about the robustness and practicality of the Navigator’s diversity control.
5. The experimental section raises several concerns about fairness, reproducibility, and efficiency. Although the paper states that all LLM-based baselines were evaluated under the same Qwen 2.5-Max model, implementation details and prompt adaptations are not provided, so fairness remains unclear (like baselines might use different LLMs, thus they can not be comparied directly). HiFo-Prompt’s runtime on small TSP instances (Table 1) is about an order of magnitude slower than competing methods, with no explanation, contradicting the claim of improved convergence speed. Token-usage statistics are summarized only coarsely (Appendix C.7) without breakdown or cost analysis, leaving uncertainty about true computational overhead. The brief multi-LLM comparison (Table 9) covers only two tasks and lacks analysis, providing little evidence of model generality. Finally, runtime behavior is inconsistent across tables (slower in TSP 10–50 but faster in TSP 100–500) with no explanation. Together, these issues make it difficult to assess the practical efficiency and generalizability of the proposed framework.
6. The code does not seem to be provided. Even though the authors share the core prompts, several computational details remain unclear, as mentioned in earlier points. This makes it hard to guarantee reproducibility and verify the soundness of the proposed method.
**Minors**
1. Very minor: for LaTeX quotation marks, please use the proper “…” format instead of plain double quotes. For example, in L055 the quotation marks are incorrectly formatted.
2. There are a few missing or incomplete citations in the appendix, such as at L811 and L1017. These should be corrected for completeness and consistency.
See the weakness. |
Lightly AI-edited |
|
Ditto: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces DITTO, a framework designed to scale instruction-based video editing models. The core contribution is a novel, large-scale, high-quality synthetic instruction-video-edit dataset (DITTO-Data), created through an innovative pipeline that leverages powerful pre-trained Large Language Models (LLMs) and diffusion models (DMs) to automatically generate diverse, complex editing instructions and the corresponding edited videos. Based on this dataset, the authors train DITTO-Model, a video editing model which demonstrates strong capabilities in instruction following, temporal consistency, and maintaining content fidelity. The experiments show DITTO-Model achieving state-of-the-art results on several benchmarks, particularly excelling in complex, style-based, and semantic edits, validated by both quantitative metrics and comprehensive human evaluation.
1. High-Quality, Scalable Data Generation: The synthetic data pipeline is the major strength, addressing the prohibitive cost and complexity of manual video editing data collection. The use of LLMs for instruction diversity is particularly effective.
2. State-of-the-Art Performance: DITTO-Model achieves superior results across multiple metrics, notably in human evaluation on Instruction Following and Temporal Consistency, which are crucial aspects of video editing.
3. Instruction Complexity: The generated dataset and resulting model are shown to handle a wide range of instruction complexities, including appearance transformation, style transfer, and semantic manipulation, moving beyond simple object insertion/removal.
1. Black-Box Data Quality: While the paper describes the Quality Control module, the extent to which the synthetic data truly captures the complexity and subtle detail of real-world human-labeled edits is hard to quantify. Further analysis on the "failure modes" of the synthetic pipeline and the resulting data distribution bias would be beneficial.
2. Model Architecture Novelty: The DITTO-Model architecture itself is largely an assembly of existing, robust components (latent diffusion model, motion modules). The novelty lies more in the data and training strategy than the architectural innovations.
1. Generalization to Real Edits: While the human study uses the synthetic dataset, how does DITTO-Model perform when asked to execute complex instructions on out-of-distribution real-world videos that might contain more unusual or messy degradation patterns not fully captured by the synthetic base videos?
2. Ablation on LLM Prompting: Could the authors provide more detail, perhaps in the appendix, on the meta-prompts used to guide the LLM to generate the diverse and complex editing instructions? This "prompt engineering" is critical to the dataset's quality. |
Fully AI-generated |
|
Ditto: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a novel framework named Ditto to address the long-standing scarcity of high-quality training data in instruction-driven video editing. The authors construct a scalable, low-cost, and fully automated synthetic data generation pipeline to create large-scale, high-fidelity, temporally coherent video editing triplets (source video, editing instruction, edited video), and release the Ditto-1M dataset based on this approach.
1. Compared to existing datasets, this work presents the first million-scale, high-resolution, long-sequence instruction-based video editing dataset that covers both global and local editing tasks, filling a critical gap in the community.
2. The paper achieves state-of-the-art (SOTA) performance on both automatic metrics and human evaluations, with particularly significant margins in human assessments.
3. The paper is clearly written and accompanied by well-designed, informative illustrations.
1. In the post-processing stage, a Vision-Language Model (VLM) is used for filtering. However, it is well known that VLMs have limited capability in understanding fine-grained visual details. How does the method ensure consistency in non-edited regions? More specifically, how is pixel-level detail consistency guaranteed?
2. The quantitative comparison on video editing is necessarily limited due to the scarcity of comparable methods. If expanding the set of video baselines is infeasible, the authors could instead evaluate their model on image editing tasks, where strong benchmarks exist, to better validate its core editing capabilities and the effectiveness of the proposed modality curriculum learning strategy.
3. Although the paper reports strong human evaluation results, it lacks essential experimental details such as the number and background of evaluators, the precise definition of rating criteria, and inter-annotator agreement metrics, making it difficult to assess the reliability of these results.
4. The approach relies on a video generator to directly synthesize the edited video, but it does not address how to ensure physical plausibility in the outputs. There is no discussion of evaluation criteria or validation for physical realism, which could lead to generated content that appears unrealistic or violates real-world dynamics.
See Weaknesses |
Lightly AI-edited |
|
Ditto: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper presents Ditto-1M, a large-scale instructional video editing dataset designed to address the data scarcity problem in instructional video editing. The dataset is constructed through an automated pipeline that generates editing instructions and corresponding edited videos from source videos. Based on this dataset, the authors propose Editto, a VACE-based video editing model trained on Ditto-1M. According to the reported results, Editto achieves better performance compared to previous inversion-based and feedforward methods in video editing tasks.
1. **Large-scale open-source dataset**: The proposed Ditto-1M dataset is substantial in scale and will be made publicly available, which represents a valuable contribution to the research community and can facilitate future work in instructional video editing.
2. **Comprehensive data construction pipeline**: The entire data construction process demonstrates significant engineering effort and computational resources, involving multiple stages of video processing, instruction generation, and quality control.
3. **Visually appealing results**: The edited videos presented in the paper show visually plausible results, suggesting that the proposed method can generate reasonable editing outcomes for various instruction types.
1. **Lack of Novelty and Insufficient Comparison with Prior Work**
This paper lacks significant novelty. The methodology is quite similar to Senorita-2M[1], with the primary difference being the new backbone used. However, the paper does not include any comparison or discussion with that work. To illustrate the substantial similarities, we present the following table:
| Dimension | Ditto-1M | Senorita-2M |
|-----------|----------|-------------|
| **Source Videos** | 200K videos from Pexels | 388K videos from Pexels |
| **Instruction Generation** | Qwen2.5-VL | Llama 3.2-8B |
| **Target Video Generation** | Key frame editing (Qwen-Image) + video propagation (VACE, training-free) | First frame editing (SD1.5 ControlNet/Flux Fill) + video propagation (task-specialized models trained on CogVideoX-5B) |
| **Training Strategy** | Fine-tune VACE-14B (essentially a ControlNet) with edited key frames | Two variants: (1) Pure instruction-based (InstructPix2Pix-like), (2) First-frame guided ControlNet. Based on CogVideoX-5B-I2V/Wan2.1-1.3B/Wan2.1-14B |
2. **Limitations of the Target Video Generation Pipeline for Local Editing**
The proposed target video generation pipeline appears to handle only global editing tasks effectively. As described in Section 3.4, the method utilizes VACE with three inputs: (1) an edited reference image produced by Qwen-Image, (2) dense depth maps extracted from the source videos, and (3) the editing instruction. The pipeline is expected to propagate the editing results from the reference image while preserving the original video motions and structure through the dense depth maps.
However, this approach presents a fundamental contradiction for free-form editing tasks, such as object addition or removal. Specifically:
- The **edited reference image** correctly reflects the added or removed objects
- The **depth maps**, extracted from the source videos, do not reflect these changes—the depth information for added/removed regions remains unchanged from the original video
Given these conflicting inputs, it is unclear how the VACE model resolves this inconsistency. Does it prioritize the reference image or the depth maps? This limitation suggests the pipeline may struggle with local editing tasks that require geometric changes, potentially restricting its applicability to primarily global appearance modifications. Can the authors elaborate more about this?
3. **Insufficient Details in Quantitative Evaluation**
The quantitative evaluation presented in Section 5.2 and Table 2 lacks critical methodological details, severely limiting reproducibility and interpretability of the results. Specifically, the following essential information is missing:
- **Evaluation dataset**: Which dataset(s) were used for evaluation? Is it a held-out test set from the training data, or an independent benchmark?
- **Dataset scale**: How many videos were evaluated?
- **Video specifications**: What are the resolution, duration, and frame rate of the test videos?
- **Baseline implementation**: How were the baseline methods executed? Were official implementations used, or were models retrained? What hyperparameters were applied?
As this represents the **only quantitative evaluation** in the paper, the absence of these details significantly undermines:
1. The ability to assess the true performance and generalization capability of the proposed method
2. The validity of comparisons with baseline methods
3. The reproducibility of the reported results
[1] Zi, B., Ruan, P., Chen, M., Qi, X., Hao, S., Zhao, S., ... & Wong, K. F. (2025). Se\~ norita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists. arXiv preprint arXiv:2502.06734.
See weaknesses. |
Moderately AI-edited |
|
Ditto: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes Ditto, a scalable synthetic data pipeline that couples (i) an instruction-based image editor to produce a high-quality edited keyframe, (ii) an in-context video generator (VACE) conditioned on the edited keyframe and a depth video for temporal structure, (iii) an autonomous VLM agent to author instructions and filter failures, and (iv) a lightweight temporal enhancer/denoiser (Wan2.2 fine stage) for quality polishing. Using ~12,000 GPU-days, the authors build Ditto-1M, a 1M-sample triplet dataset (source, instruction, edited video) and train Editto with a modality curriculum that anneals away reliance on the edited keyframe to achieve purely instruction-driven editing. Quantitative results (Table 2) and extensive visuals (Figs. 1,5,7–11) show strong instruction following and temporal consistency versus TokenFlow/InsV2V/InsViE and a commercial baseline. The paper commits to releasing dataset, models, and code.
Holistic, scalable pipeline. The edited keyframe + depth + in-context generator composition is simple and effective; the VLM-driven instruction/QA removes human bottlenecks (Fig. 2).
Large, curated dataset.Ditto-1M (720p, 101 frames @ 20 FPS) with ~700k global and ~300k local edits is a sizable, diverse resource; the authors report strong aesthetic curation and motion screening (Fig. 3, §3.1–3.6).
Modeling insight.The modality curriculum bridges visual-to-text conditioning in a stable way (Fig. 4), improving instruction-following without the edited frame at test time.
Empirical results. Consistent wins across automatic and human metrics (Table 2) and convincing visuals (Fig. 5); ablations show data-scale and MCL benefits (Fig. 8).
Efficiency consciousness. Distillation/quantization and a temporal enhancer reduce generation cost while improving temporal stability (§3.4–3.5).
Attribution granularity. While Fig. 9 analyzes generator contexts, the paper lacks systematic ablations over the full pipeline: e.g., removal/variation of the VLM filter, different denoisers, or no enhancer; per-stage quality/cost curves would support the “cost-quality” claims more rigorously.
Evaluation breadth. Automatic metrics are mostly CLIP-based and a VLM score. Consider adding FVD/KVD, LPIPS-T, or user-calibrated instruction-fidelity rubrics. Also report identity preservation for local edits.
Dependence on proprietary tools. The pipeline leans on powerful closed models (image editor, VLM judge, commercial T2V denoiser). This may limit reproducibility and confound generality claims; an open-source-only variant and a sensitivity study would help.
Data/rights clarity. The dataset is sourced from Pexels; redistribution and downstream editing rights should be specified precisely (e.g., license text, usage constraints, opt-out). Provide counts rejected by safety filters and failure taxonomies.
Risk analysis limited. Ethics statement is brief; a more thorough assessment of misuse (e.g., identity edits, misinformation), watermarking, and content provenance would be welcome.
1. Provide ablation results for denoiser enhancer and VLM filtering on instruction fidelity, temporal consistency, and aesthetics
2. Test performance drop when replacing all components with open models
3. Add evaluation on public benchmarks with FVD/KVD and identity-preservation metrics
4. Clarify Ditto-1M license, redistribution rights, content filtering, and TOS compliance
5. Analyze VLM judge rejection categories and failure modes |
Fully AI-generated |
|
Constrained-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper focuses on the challenge of training data pruning (that is, removing least useful training samples while minimizing performance degradation). The authors observe that common Shapley/semi_value based data valuation methods may lead to steep degrade in performance by often pruning entire clusters. To circumvent this limitation, they propose a new formulation called Constrained Data-Value Maximization (CDVM), which uses a data attribution matrix and poses pruning as a linear optimization problem that encourages balanced influence across test samples while penalizing over concentration. Experiments on the OpenDataVal benchmark (six datasets, multiple retention rates) show CDVM outperforms baseline alternatives and achieves competitive runtime.
Overall it is a very well written paper.
1. The author provided a strong motivation to study this paper by clearly identifying the limitations in the existing approaches.
2. The proposed CDVM leverages fine grained per test sample attribution signals and introduces a slack penalized objective preventing cluster collapse issue associated with the semi-values based literature. This is a great idea indeed.
3. The authors performed a comprehensive empirical evaluation of their proposed approach by working with 6 datasets × 6 pruning budgets. This makes the empirical evidence very strong.
1. The proposed still needs to estimate the attribution matrix T which is often very large (several GBs) in practice. Thus scalability becomes a bottleneck when we head to really large datasets.
2. The authors did not carry of systematic analysis of failure modes in the experiments. That is.. certain datasets exhibit unexpected under performance. Why such a behavious is not explained properly.
In fact, it would be of great value to discuss what dataset characteristics trigger failures and how to detect these cases automatically.
3. The proposed model conveniently ignores higher-order interactions.. in fact, the authors explicitly highlight that T entries are not additive.
Please address the above 3 weak points.
Furthermore, I have following comments too:
(a) Please provide guidance on hyperparameter defaults.. κ and α are crucial. if you can offer, practical recommendations would improve reproducibility.
(b) Please characterize failure conditions.. this would increase methodological transparency.
(c) Can you provide early-stopping heuristics to avoid expensive grid searches on T-based hyperparameters..
(d) Providing histograms or test sample coverage plots would illustrate the “balanced coverage” claim... without such plots, it would be hard to judge the practical utility of this approach. |
Fully human-written |
|
Constrained-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments |
Soundness: 2: fair
Presentation: 4: excellent
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces Constraint-Data-Value-Maximization (CDVM), a novel approach for data pruning that addresses key limitations of existing Shapley-based data valuation methods. The authors demonstrate that semi-value approaches tend to assign lower importance to instances in larger clusters, leading to imbalanced data pruning. To overcome these issues, CDVM formulates data pruning as a constrained optimization problem over a data attribution matrix that tracks training data influence on individual test samples. CDVM achieves state-of-the-art accuracy in 28 out of 36 configurations while maintaining competitive runtime.
Very clear presentation.
The motivation for improving the current scoring-based data pruning is very well-motivated.
It is unclear to me why semivalue-based approaches cannot be used for attribution on individual test samples. While the original Data Shapley and Banzhaf papers employ average test accuracy/loss as their evaluation metric—primarily because they target "data valuation" applications—there is fundamentally little distinction between attribution and valuation. Adapting these methods to compute individual test loss/accuracy appears to require only minor modifications to the code, if I understand correctly. This would enable more direct apple-to-apple comparisons in experiments, such as Data Banzhaf versus Data Banzhaf + CDVM.
The proposed approach is straightforward but lacks theoretical justification. For example, how does the proposed approach compared with the line in coresets (e.g., https://proceedings.mlr.press/v119/mirzasoleiman20a.html)?
The experiment scale is very small. The authors mentioned that "CDVM relies on a selected soft upper bound κ and incurs quadratic cost in computing and storing T (e.g., roughly 250 GB for a naive implementation without sparsity on the full ImageNet-1k train
and val splits)", which is not very clear to me why that's the case. If it means the storage cost during the computation of gradients, here's a highly efficient implementation for computing In-Run Data Shapley (aka TracIn-Ideal) https://github.com/Jiachen-T-Wang/GhostSuite
The following reference is missing, but it seems highly relevant to the discussion in Section 2.1.3.
[1] Wang, Jiachen T., et al. "Rethinking data shapley for data selection tasks: Misleads and merits." ICML 2024
Line 337: citation for Maximum Sample Reuse is wrong.
See weakness. |
Fully human-written |
|
Constrained-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Constraint-Data-Value-Maximization (CDVM), which addresses a critical flaw in Shapley-based data pruning methods (and more generally methods based on semi-values): they lead to cluster removal bias, systematically pruning entire large clusters first and causing sharp performance drops. CDVM reformulates pruning as a constrained linear optimization over an attribution matrix $T$, maximising total influence while using slack variables to ensure balanced coverage across all test samples (preventing any cluster from losing all influence).
The proposed method is a novel and well-motivated contribution that directly addresses the weaknesses of existing data-pruning techniques based on Shapley or other semi-value formulations. The authors introduce a principled mechanism to balance influence across test samples and prevent over-removal of clusters, a limitation often overlooked in prior work. The approach is original to my knowledge, and the experimental evaluation across multiple OpenDataVal datasets demonstrates strong performance.
The main weakness of this work is its high computational cost, which limits the method’s practicality to relatively small datasets, precisely where data pruning is often less critical. This restricts CDVM’s immediate applicability to large-scale, real-world problems. A more minor concern is the sensitivity to its two hyperparameters: as shown in Figure 4.c, the chosen $\kappa$ values vary considerably across datasets, suggesting potential instability to achieve good performance.
- Could the authors elaborate on potential strategies to make CDVM computationally feasible for larger datasets ?
- Given that κ and α appear to vary considerably across datasets, how robust is CDVM to these choices ?
- The constrained optimisation program relaxes the $w$ to continuous variables. How do you use them for pruning ? |
Lightly AI-edited |
|
Constrained-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper identifies and describes limitations of leave-one-out and semivalue-based data valuation methods that lead to poor accuracy when pruning large fractions of data. The paper then proposes Contsraint-Data-Value-Maximization that addresses the limitations. The key idea to solve an optimization problem that maximises the influence of the training samples while reducing the influence of each test point. This pruning strategy results in non-nested solutions for different subset sizes.
1. The example in Fig 1 is helpful and the three limitations highlighted in Sec 2.1.3 are clear and significant. Addressing these limitations improve the accuracy after data pruning in many applications.
2. The paper is generally clear and easy to follow.
1. There is a lack of comparison with other semivalues, e.g. Shapley value and semivalues with more weights on smaller coalitions which should do better under significant data pruning.
2. The solution is more computationally expensive than semivalues. In Sec 3.1, it is explained that it requires training multiple models. In Sec 3, it additionally involves solving a mixed-integer linear program.
3. Some claims in Sec 3 should be better justified.
* How does penalising the excess over $\kappa$ ensure that no test sample has zero total influence and address the cluster size problem?
* Why is the definition of $T_{ij}$ in line 342 appropriate? Unlike the Shapley value, it does not measure marginal contribution of each sample $d_i$ to a coalition and does not satisfy the efficiency property. Why is it ok to round $T$ to either 0 or 1?
4. The experiments are small-scale and hence less convincing. Each dataset is subsampled in line 359. It is important to demonstrate that the results still hold on larger real-world datasets.
1. What does 10000 models mean in line 193? Does it mean samples/actual evaluation of utility?
2. Why is Shapley value not included in Fig 2 or the experiments? See [K] for a more efficient approximation of Shapley value that allow sample reuse.
3. How is the mixed-integer linear program solved? Describe the efficiency and the approximation guarantee.
[K] Kolpaczki, Patrick, et al. "Approximating the shapley value without marginal contributions." Proceedings of the AAAI conference on Artificial Intelligence. Vol. 38. No. 12. 2024.
[L] Li, Weida, and Yaoliang Yu. "One Sample Fits All: Approximating All Probabilistic Values Simultaneously and Efficiently." Advances in Neural Information Processing Systems 37 (2024): 58309-58340.
Minor comments
* The font looks different from Times.
* LOO should be capitalised
* The LOO and semivalue definitions should be wrong. It should be the utility with $d_i$ as the first term.
* Some grammatical errors (e.g., many data instance on line 98) and typos (1.000 in Fig 2 caption)
* The “test samples” should be validation samples instead. The test set should not be used for pruning or hyper parameter selection.
* Fig 1 should also include the performance of the proposed method |
Fully human-written |
|
Constrained-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments |
Soundness: 4: excellent
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
Paper proposes a new algorithm to prune train datasets using linear programming.
Paper makes an algorithm by Yang more efficient (at some cost in value).
I found their discussion of the nomao dataset interesting.
Experiments: Given the problem is to reduce train set size, I expected to see experiments with at least mid-sized train sets (say, 1 million training samples), but they seem to have presented experiments with 1000 training samples. The lack of even mid-sized data was disappointing and really saps my interest in their results and the significance of their experimental results.
It also seems like the amount of computation to prune the dataset is very high compared to just training on the whole dataset.
One of their results is that the optimal reduced train set does not strictly grow as instances grow. They show this empirically, which has some value, but this is the sort of result that’s fairly easy to *prove* through counterexample, so empirical result is not that significant.
Another weakness is that the semi-value approach doesn’t sound optimal at all, given it uses a poor proxy objective, so it’s not at all surprising that one can do a bit better.
Reducing the train set in a smart way dates back at least to Peter Hart’s 1968 paper on condensed nearest neighbors.
Missing all the related work in coresets and other related work in optimizing a reduced train set? Tons more work in data pruning not even nodded to, there are surveys one could cite at least.
Lastly, I’ll note that the strategy I’ve seen work best in practice when there's really too much train data and you need to reduce it down and train faster is the opposite of data pruning, it’s to start with a small set of the train data and then add in more data, aka "batch active sampling" (see e.g. Jiang and Gupta KDD 2021).
Writing was weak in it use of casual language too casually, like this sentence: “It is generally expected that removing lowvalue
instances results in a gradual decline in accuracy, while the removal of
high-value instances leads to a sharp decrease in performance.” Isn’t that just the definition of “low value instances” and “high value instances” making this sentence like saying “One expects an orange to be an orange?” Paper is full of this sort of loosey-goosey language. Try to be more specific / concise/precise please.
As an other example of this too loose writing, paper says “the subset that maximizes accuracy for one
budget s may exclude instances that are essential for another budget s′ ̸= s.” But really the point they are trying to communicate is optimal subset for a budget of size S may be exclude instances that are optimal if the budget is size S’ > S.
MINOR:
Typo: “Contsraint”
Capitalize acronyms like LOO
In your bibtex, use {Shapley} and {OOB} and other {} to get proper capitalization of proper names
References: what is “6 2023” mean in Jiang et al reference? Similar in other ones.
No questions at this time. |
Fully human-written |
|
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents |
Soundness: 2: fair
Presentation: 1: poor
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper introduces AGENTMISALIGNMENT, a benchmark designed to measure the likelihood of large language model agents engaging in misaligned behavior such as avoiding oversight, resisting shutdown, sandbagging, or seeking power. The study treats misalignment as a conflict between an agent’s internal goals and its deployer’s intentions, emphasizing that real-world deployments often rely on implicit expectations that are difficult to fully specify. The author evaluated various LLM models and tested the effect of varying persona in system prompt.
- The author uses controlled, deterministic experimental setups to ensure reproducibility.
- The author evaluated on a variety of latest models.
- The benchmark mainly combines known misalignment behaviors (e.g., deception, shutdown resistance, etc.). Many existing papers already tackle similar problems. The authors do not necessarily provide insights or theoretical constructs that make this work stand out, or they fail to make these contributions clear due to the writing or presentation style.
- It is unclear how the authors set up the experiments and implementation details. For example, what tasks the agents are performing, how they are evaluated, what metrics are used, and which variables are controlled or changed. The authors should consider adding separate sections that explicitly describe these details and include clear result tables.
- Based on the current presentation, the experimental setup appears overly simplistic. i.e. 1) defining a few misalignment types, 2) writing a few fixed personas, 3) calling LLMs to generate outputs, and then 3) evaluating those outputs with another LLM. Steps 1 and 2 also seem trivial enough to be easily automated by an LLM itself.
- The authors only evaluate six fixed personality prompts. The resulting interpretations are therefore too limited and lack sufficient generalizability or practical usefulness for others.
- Could the authors provide a more detailed description of the experimental setup? Specifically, what are the concrete tasks that the agents perform in each evaluation scenario, how are these tasks constructed, and what are the underlying assumptions behind each one? A clearer explanation of these would help assess how realistic or representative the scenarios are.
- How were the six personality prompts chosen? Were they derived empirically (e.g., from prior studies or human behavioral typologies), or were they arbitrarily designed? |
Lightly AI-edited |
|
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces AGENTMISALIGNMENT, a new benchmark suite designed to evaluate the propensity of agents to spontaneously misalign with their deployer's intended goals. They define intent misalignment as behavior where an agent recognizes but still acts against deployer goals, and they target categories such as avoiding oversight, power seeking, resisting shutdown, resource acquisition, deception, and drifting from an ethical baseline. The authors argue that prior work has focused on agent capability (what they can do when explicitly prompted for harmful tasks), whereas this work tests what agents are likely to do in realistic scenarios where deployer intentions are often implicit .
The paper defines misalignment as "intent misalignment", a conflict between the internal goals an agent pursues and the goals intended by its deployer. The benchmark suite consists of 9 evaluations that create "honeypot" scenarios to test for specific misaligned behaviors. Findings include that more capable models tend to show higher misalignment on average and that persona prompts can shift misalignment substantially, sometimes as much as or more than swapping models.
- The paper's primary strength is its clear distinction between what an agent can do (capability) and what it is likely to do spontaneously (propensity) . This moves safety evaluations toward more realistic deployment scenarios where an agent might pursue unintended goals even without malicious prompting. the benchmark probes propensity in deployment-like contexts rather than single-turn capability checks.
- The benchmark uses a Comprehensive Misalignment Scoring (CMS) mechanism that evaluates an agent's internal reasoning in addition to its final actions. This allows the benchmark to detect deceptive intent or planning, even if the agent is not capable enough to successfully execute the misaligned behavior. The paper distinguishes misuse from misalignment and enumerates concrete behavioral categories
- The study methodically investigates system prompts as a key variable, treating personality as a "high-leverage alignment control surface". The inclusion of case studies (Sections 4.1 and 4.2) clearly demonstrates how different personas can cause the same model to behave differently in identical scenarios, providing valuable insight into prompting-based risks.
- The authors acknowledge that the results have large error bars and "lots of variance" between evaluations (as seen in Figure 1). This high variance and statistical uncertainty make it difficult to draw strong conclusions, forcing the authors to "refrain from drawing any definitive conclusions" about which models or personalities are definitively more or less misaligned on average.
- Cross-task comparability. Each eval uses different scoring, which the authors note complicates comparisons of absolute misalignment levels between tasks.All evaluations are in English and simulated environments, so generalizability to other languages and real deployments is limited.
- The authors note that the specific agent scaffolding (the ReAct loop) and tool-use formatting used in the benchmark "can be foreign to some models". This makes "accurate comparisons difficult" , as it introduces a confounding variable: a model's poor performance (or lack of misalignment) could be due to genuine alignment or simply an inability to effectively use the provided tools to achieve the misaligned action.
- How reliable are the comparisons between models and personalities, given the high variance and large error bars in the results?
- Since evaluations are English-only and use different scoring mechanisms, how generalizable are the findings to other languages, and how alid are comparisons of misalignment levels between different tasks?
- How can you distinguish between genuine alignment and a model's simple inability to use the "foreign" agent scaffolding to perform the isaligned action? |
Lightly AI-edited |
|
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper defines LLM agent misalignment as a conflict between the model’s internal goals and deployers’ intended goals, then proposes the AGENTMISALIGNMENT benchmark (9 real-scenario tasks, e.g., log tampering, resisting shutdown) to evaluate misalignment propensity. Key findings: more capable models (e.g., Claude 3 Opus) show higher misalignment; agent personalities (via prompts) often impact misalignment more than models; existing alignment methods are insufficient for real deployments. It contributes a real-scenario misalignment benchmark, tests cutting-edge models, and quantifies personality effects.
1. Controls variables (model, personality, task), uses RMS normalization and CMS scoring for comparability, and validates via case studies.
2. Clear structure with appendices (task prompts, scoring rules) for transparent replication.
1. The weighting of "severity" in the CMS scoring lacks a quantitative basis (e.g., "Approve Toxic Content" has a weight of 1.0, "Reject Neutral Content" has a weight of 1/(N+1)), and there's no explanation as to why this weighting accurately reflects the actual number of errors.
2. No Long-Term Evaluation: All experiments are short-term (e.g., single-round negotiation); no data on “misalignment drift” (e.g., 7–14 days of community management) or which is critical for real deployments.
3. The paper fails to clarify the extent of consequences caused by agent misalignment. However, "revealing the actual harms of misalignment" should have been the core starting point for researching "evaluating misalignment propensity," which weakens the study’s risk warning significance and practical relevance.
1. For CMS subjectivity: What basis sets “severity weights”? Has expert validation been done?
2. For long-term gaps: Can you provide a long-experiment design (e.g., 10-round “Community Curator” for Claude 3 Opus)?
3. Supplement explanations of the consequence severity corresponding to different misaligned behaviors. This aligns the research on "evaluating misalignment propensity" more closely with the starting point of "addressing actual harms" and enhances the study’s practical value. |
Moderately AI-edited |
|
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a benchmark suite designed to evaluate the propensity of LLM agents to spontaneously exhibit misaligned behaviors in realistic deployment scenarios. The study encompasses multi evaluation tasks covering behaviors such as avoiding oversight, resisting shutdown, sandbagging, and power-seeking. The authors test multiple frontier models and systematically investigate how different personality prompts influence misalignment propensity. Key findings indicate that more capable models tend to show higher average misalignment, and personality characteristics can sometimes exert greater influence on misalignment than model choice itself.
1. The evaluation framework is comprehensive. It covers diverse misalignment behavior types.
2. The exploration of how personality prompts affect agent behavior is an important but understudied problem.
3. Detailed experimental setups, prompts, and scoring mechanisms are provided.
4. InspectAI framework provides a unified interface for cross-model comparison.
5. The focus on propensity rather than pure capability represents an important distinction for assessing real-world deployment risks.
1. The CMS scoring mechanism relies primarily on keyword and pattern matching to detect misaligned reasoning, potentially missing more subtle or differently-expressed misaligned reasoning while also generating false positives
2. The ecological validity of evaluation tasks is questionable, with some scenarios designed too obviously to elicit misalignment
3. The experiments use single runs with temperature equals zero for most models, limiting understanding of behavioral stability and variance
4. Personality prompt effects may primarily be artifacts of prompt engineering rather than revealing deep model tendencies, and the paper insufficiently distinguishes between these possibilities.
5. Some evaluation tasks observe no misalignment behaviors at all, such as Rank Organizations, raising questions about task design effectiveness.
6. Causal mechanisms remain unclear, with the paper failing to explain why certain personalities increase or decrease misalignment propensity.
7. Canary Code where no model successfully modified log files, possibly indicating capability insufficiency rather than alignment success.
1. How can the study distinguish between models unable to execute misaligned behaviors due to capability limitations versus models understanding the situation but choosing not to execute misaligned behaviors?
2. Could personality prompts merely change surface response styles rather than genuinely altering decision-making tendencies?
3. Might observed behaviors in some evaluation tasks be reasonable responses to hints in prompts rather than spontaneous misalignment?
4. If different hyper-parameters settings or multiple runs were used on the same tasks, how stable would the results be?
5. How were weights and thresholds in the CMS scoring mechanism determined? Were ablation studies conducted to validate these choices?
6. How does the paper ensure evaluations have not been seen by models during pretraining? |
Moderately AI-edited |
|
On the Impossibility of Retrain Equivalence in Machine Unlearning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper aims to study the effect of data order in training pipeline that is later being unlearned could have significant impact on how unlearning might have effect. As such, they claim that ideal unlearning might be impossible as we don't have exact kwnoledge about the data order in training pipeline.
They further support this claim by an empirical setup where unlearning targets happen at different positions of training pipeline, and observe that applying the same training procedure results in different outcomes, showcasing the senstitivy of unlearning algorithms on the place where unlearning targets are in training pipeline.
The idea of this paper is interesting. It analyzes how the exact position at which unlearning targets appear can affect unlearning performance, and thus why ideal unlearning may be impossible when the algorithm is not given this positional information.
The theoretical results look sound, and the empirical results support them.
Overall, the paper sheds light on a question that was not clearly studied before. Most prior work focused on cases where unlearning targets were introduced immediately before the unlearning step.
I think the main weakness is that the empirical results are somewhat limited. I would encourage the authors to run experiments on additional benchmarks such as RESTOR [1] and MUSE [2] to better assess the sensitivity of unlearning algorithms to how recently unlearning targets were introduced across different setups.
Also, I wouldn’t frame this as an impossibility of unlearning; rather, it is an impossibility of defining the ideal model when, in practice, we may not have access to it.
-------------
[1] Rezaei, Keivan, et al. "RESTOR: Knowledge Recovery in Machine Unlearning." arXiv preprint arXiv:2411.00204 (2024).
[2] Shi, Weijia, et al. "Muse: Machine unlearning six-way evaluation for language models." arXiv preprint arXiv:2407.06460 (2024).
Given the results in this paper, what do the authors propose for evaluating machine unlearning in practical scenarios? In existing work, it is often assumed that the model was trained on the unlearning targets immediately prior to applying the unlearning algorithm, so the ideal model is simply the checkpoint before the introduction of those targets.
Do the authors suggest excluding all metrics that compare to this ideal model? Or do they offer ideas for redefining or approximating the ideal model so that such comparisons remain meaningful? |
Fully human-written |
|
On the Impossibility of Retrain Equivalence in Machine Unlearning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This work considers unlearning within a multi-stage training framework. Specifically, this work propose a theoretical understanding that model's behavior during unlearning is influenced by the order of its training stages during learning. Therefore, it is impossible to universally achieve a retrained equivalent model with a path agnostic algorithm. Experimental results on different LLMs and unlearning heuristics verify the theory.
- The paper look at unlearning at a novel perspective: under multi-stage training.
- The paper propose a novel theoretical analysis for the impossibility result of universal relearning equivalence, which is important for understanding unlearning.
While I appreciate the theoretical insights provided by this work, I have a few concerns.
- How large is $t_0$? Looks like corollary 1 does not hold when $t* < t_0$. Then does that mean when using less amount of unlearning iterations, it is more likely that the model will achieve $\varepsilon$-RE?
- The theoretical results is based on 1. linear models, 2. gradient ascent. It's very different from the experiments which focuses on LLM, with different stages, and each stage will use some different method for learning. Grad ascent is also just a most naive unlearning heuristics for LLM.
- Given the theoretical analysis, I think it's more valuable to provide more simplified experiments to verify the theoretical understanding (e.g. classification tasks or even simpler linear models)
- I'm not convinced by the significance of the universality of retrain equivalence. Why do we need to find a universal $t$, such that under any path, the model unlearned at step $t$ has to achieve the same $\varepsilon$-RE? What's the issue with having different $t$ for different paths?
- What is the motivation of using an approximate unlearning definition that measures similarity in output space, rather than the classic approximate unlearning definition which measures similarity in the model parameter distribution space? |
Fully human-written |
|
On the Impossibility of Retrain Equivalence in Machine Unlearning |
Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper considers the difficulty of unlearning with “local” algorithm, i.e., those that only use the model weights and the forget set. They conduct experiments on LLMs and note the ordering of the training data can make such unlearning algorithms significantly less effective , and provide some theory in the case of linear regression. They conclude that further work is needed in formulating this type of unlearning.
The authors seem unaware of past work which had already shown this formulation for unlearning is ill-defined; a dataset without the forget set can give the same model as training with the dataset (with some probability over the training orderings), e.g., Thudi., et al [1]. Note at the time “approximate unlearning” only really consisted of “local” unlearning algorithms. This led to further research into forgeability, and analysis of the closely related notion of per-instance privacy. It also has been one of the primary reasons for stronger algorithmic definitions of unlearning (which use the training algorithm and retain dataset) and disclosures when using weaker definitions, e.g., see the ToFU benchmark [2].
In the context of this literature, this paper makes further observations on the nature and prevalence of forgeability. However, this paper does contribute to the existence of this phenomenon, which was already proved in general for mini-batch SGD training algorithms [1] using an even stronger metric (i.e., the model weights are the same which implies predictions are the same). I recommend this blog post for a more intuitive understanding of forging https://ai.stanford.edu/~kzliu/blog/unlearning (which is described in terms of verifying unlearning), and the work of Kong et al., [3] which recaps the results of [1] a bit more cleanly.
In the questions I make some suggestion for how the authors might change their presentation given this; while I do not think the original claims are valid, the paper can claim to add to existing observations about this unlearning setting.
[1] Thudi, Anvith, et al. "On the necessity of auditable algorithmic definitions for machine unlearning." 31st USENIX security symposium (USENIX Security 22). 2022.
[2] Maini, Pratyush, et al. "Tofu: A task of fictitious unlearning for llms." arXiv preprint arXiv:2401.06121 (2024).
[3] Kong, Zhifeng, Amrita Roy Chowdhury, and Kamalika Chaudhuri. "Forgeability and membership inference attacks." Proceedings of the 15th ACM workshop on artificial intelligence and security. 2022.
1) The linear regression theorem adds to theory around quantifying notions of forgeability.
2) Experimental results quantifying the path dependence of various local unlearning algorithms seem strong and valuable to the literature on unlearning difficulty.
1) The theory, while interesting, is still focused on linear regression where efficient effective unlearning is not a problem, e.g., Guo et al., [4].
2) Moreover, bounds for the difficulty of unlearning when using SGD like algorithms (even when applied to deep neural networks) already exists, Sephavand et al., [5]; the argument deep learning is mysterious so we work with linear regression seems ill-made given the progress in the unlearning literature.
3) On the observation of path dependence (i.e., one starts with different initial models), old work (before even the unlearning community's observation) had already discussed the high variance in models between data orderings and how this can be exploited [6].
[4] Guo, Chuan, et al. "Certified data removal from machine learning models." arXiv preprint arXiv:1911.03030 (2019).
[5] Sepahvand, N.M., Thudi, A., Isik, B., Bhattacharyya, A., Papernot, N., Triantafillou, E., Roy, D.M., Dziugaite, G.K.. (2025). Leveraging Per-Instance Privacy for Machine Unlearning. Proceedings of the 42nd International Conference on Machine Learning
[6] Shumailov, Ilia, et al. "Manipulating sgd with data ordering attacks." Advances in Neural Information Processing Systems 34 (2021): 18021-18032.
Below I give some suggestion for how the authors might re-present their work, which I think can help them be clear about what their contribution to the literature is. I think with such changes to the presentation this paper can be valuable to the literature on unlearning, and will be happy to reconsider the score. However, the paper misses the mark right now given it lacks context with existing results.
Suggestions:
1) While the phenomenon that unlearning is dependent on training trajectory is known in various contexts (which can lead to it being ill-defined), it seems a concrete contribution of this paper is that it measures this dependence (often called forging) for LLM fine-tuning. I suggest the authors no longer claim to discover the former, and instead discuss past work on this phenomenon and more broadly unlearning difficulty. They can then just focus on their contribution to measuring this dependence in LLM fine-tuning, which is already mentioned throughout the paper (but often after claiming the former).
2) Given this, the description of the contribution of the theoretical result could be rephrased to emphasize that it quantifies how specific orderings diverge with continual “local” unlearning; the exponential divergence seems novel to the unlearning (difficulty) literature, and is already what is described in the theorem. One just needs to rephrase the summary of this result (in the context of past results) in the introduction and elsewhere in the paper.
3) Similarly the empirical section seems to be correct on its own, and just how these results are interpreted in the context of past results needs to be made clearer. For example, as far as I know the observation about the recency phenomenon seems novel, and to me is something worthy of a paper; one just needs to not claim to be the first to observe that machine unlearning depends on data ordering, and instead only make the more fine-grained claims of what dependencies were specifically observed. |
Fully human-written |
|
On the Impossibility of Retrain Equivalence in Machine Unlearning |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper investigates how the effectiveness of machine unlearning aimed at achieving Retain Equivalence varies depending on the order in which the model learns different datasets. The study focuses on local unlearning methods that rely solely on gradients computed from the forget set. The authors theoretically prove, under a linear model assumption, that the outcome of unlearning is path-dependent, meaning it can differ significantly depending on the sequence of training data. Empirically, the authors simulate a sequential post-training process using LLaMA and Qwen models and similarly observe that unlearning behavior is strongly path-dependent.
* Unlike prior unlearning studies that mainly focus on designing algorithms to maximize benchmark performance, this paper highlights a more fundamental issue: the effectiveness of unlearning critically depends not only on the algorithm itself but also on how the forget data was originally learned by the model.
* The argument is further strengthened by a theoretical result showing that, under a linear model assumption, performing local unlearning on two models trained with different data orders cannot simultaneously satisfy the defined notion of Retain Equivalence.
* Extensive experiments with LLMs empirically confirm the path-dependent nature of unlearning and show interesting phenomena such as the recency effect, thereby connecting the theoretical findings to observable behaviors in realistic post-training pipelines.
* Definition 2.1 formalizes Retain Equivalence (RE) as the pointwise closeness of model predictions on a test set (any generic test set $X_{test}$). While this assumption enables a theoretical analysis, it arguably over-constrains the notion of equivalence. In practice, two models trained on the same data but with different random seeds can yield noticeably different predictions, especially on out-of-distribution inputs, yet should still be regarded as equivalent from a statistical or functional standpoint. Therefore, RE as defined in this paper captures only pointwise similarity rather than distributional or behavioral equivalence. Extending RE to a distribution-level notion, such as expected loss or output distribution similarity under the retain data distribution, would make the theoretical conclusions more general and more aligned with standard unlearning objectives.
* Section 4.3 appears to overstate the evidence for “path-dependent superficial forgetting”. The claim of a newly identified phenomenon is supported by only 40 prompts for each dataset (C, U, and R), i.e., 120 samples total, in a largely qualitative setup, which is insufficient to draw reliable or generalizable conclusions about superficial vs. deep forgetting. While the observation is interesting as an illustrative example, it does not yet constitute empirical evidence of a new phenomenon.
* The paper defines local unlearning as using only the gradient of the forget set, thereby excluding many practical algorithms that employ lightweight retain-based regularization or contrastive objectives (e.g., Gradient Difference, DPO). This narrow definition simplifies the theoretical treatment but makes the conclusions less representative of real-world unlearning methods. As a result, the claimed impossibility result may not fully capture the feasibility of semi-local or weakly regularized unlearning approaches commonly adopted in practice.
* Is there a specific reason for using different learning rates for both LLaMA2-13B and Qwen2.5-14B?
* Why did the authors choose to use LoRA? If the goal is to simulate the post-training process of LLMs, full fine-tuning might be a more appropriate setup. Also, during unlearning, are only the LoRA weights updated?
* In Figure 4, the results clearly show that unlearning is path-dependent, but how should we interpret this finding? In the setting where “deep forgetting” occurs, U is trained first. Does this imply that the effect arises because the recency effect is weaker in this configuration? |
Lightly AI-edited |
|
TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces a new ECG pre-training framework. Compared to prior signal+text frameworks, the authors further added the spectrogram of ECG in a separate branch. Accordingly, they introduced rhythm-level and wave-level embedding alignment between signal and spectrograms, and report-aware alignment between signal and report. The results on zero-shot setting demonstrate the effectiveness of the proposed approach.
1. Performance is further improved compared to the baselines in the paper.
1. The core idea is to integrate time series, spectrogram, and text for ECG pretraining. However, this had been presented long time ago in AudioCLIP [1]. Moreover, the rationale of integrating spectrogram in ECG is not sufficiently justified. Unlike audio that contains substantial frequency information, ECG signal has limited frequencies. For instance, majority of the ECG spectrogram presented in Fig 1 has nearly zero information (dark blue).
[1] Guzhov, A., Raue, F., Hees, J., & Dengel, A. (2022, May). Audioclip: Extending clip to image, text and audio. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 976-980). IEEE.
2. The baselines in the evaluation do not mark the SOTA results. For example, in uni-modal baselines, ECG-FM [2] was the SOTA. For multi-modal baselines, the authors also missed most recent works [3,4]. All of these should be compared.
[2] McKeen, Kaden, Laura Oliva, Sameer Masood, Augustin Toma, Barry Rubin, and Bo Wang. "ECG-FM: An Open Electrocardiogram Foundation Model." arXiv e-prints (2024): arXiv-2408.
[3] Hung, Manh Pham, Aaqib Saeed, and Dong Ma. "Boosting Masked ECG-Text Auto-Encoders as Discriminative Learners." In Forty-second International Conference on Machine Learning.
[4] Wang, Fuying, Jiacheng Xu, and Lequan Yu. "From Token to Rhythm: A Multi-Scale Approach for ECG-Language Pretraining." In Forty-second International Conference on Machine Learning.
3. The ablation is only for the alignment strategies, without evaluating the benefit of adding the spectrogram as the third modality (the core idea of the paper).
4. The evaluation pipeline only shows zero-shot performance. Why not follow the previous works and evaluate the performance with full fine-tuning, zero-shot, and linear probing?
5. Aligning three modalities could have various combinations. Currently, RLCA and WLAI are for signal and spectrogram only, and RGWR is for signal and text. Why selecting these combination? And why other combinations are excluded?
1. Table 1-3 take lots of space. Why not merge them into a single table?
2. In Line 217-218, what are the two stages for the residual attention mechanism?
3. In Line 219-220 and Line 226-228, the authors claim WLAI selectively integrates clinically relevant features. However, it is unclear how diagnostic-sensitive waves are selected/learned, or why such a process is able to achieve selective integration. |
Fully human-written |
|
TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces TAMER, a tri-modal framework for zero-shot ECG diagnosis that integrates ECG signals, spectrograms, and diagnostic reports through contrastive and attentive alignment. The model comprises three modules. Evaluations on three public ECG datasets show state-of-the-art AUC scores and strong cross-domain robustness.
The paper presents a tri-modal framework (TAMER) that integrates ECG waveforms, spectrograms, and clinical reports for self-supervised ECG analysis. Its modular design, comprising the Tri-modal Feature Encoding and Projection (TFEP), Global-Local Temporal-Spectral Alignment (GLTSA), and Report-Aware Alignment and Refinement (RAAR) modules, is interesting.
The framework’s emphasis on semantic interpretability and robustness to distribution shifts may add clear clinical and research value.
The clinical interpretability of TAMER remains limited. Though alignment with diagnostic text is claimed, there is little human expert validation to ensure that the model’s attention aligns with genuine clinical reasoning.
Computational complexity will be high. The tri-modal setup, multi-stage attention, and large-scale contrastive learning make it resource-intensive and potentially impractical for real-time or low-resource clinical applications.
The reliance on pre-trained language models (e.g., Med-CPT) will raise concerns about domain bias and adaptability to institutions using different reporting styles.
The explanation of some equations (e.g., loss functions) lacks intuition, and the visualizations (e.g., t-SNE), while illustrative, do not quantify interpretability gains.
TAMER’s evaluation focuses on AUC metrics only. Complementary metrics such as sensitivity or calibration would better demonstrate clinical reliability.
Could the authors quantify the computational cost and model size compared to prior multimodal approaches like MERL or C-MET?
Have any clinicians evaluated whether the model’s attention maps or refined embeddings correspond to meaningful ECG features?
How does the system handle ECGs without accompanying reports, and can the tri-modal model degrade gracefully to bi- or uni-modal input? |
Fully AI-generated |