ICLR 2026 - Reviews

Reviews

EditLens Prediction: Fully AI-generated Heavily AI-edited Moderately AI-edited Lightly AI-edited Fully human-written All

Rating: 1 2 3 4 5 6 7 8 9 10 All

Confidence: 1 2 3 4 5 All

Summary Statistics

EditLens Prediction	Count	Avg Rating	Avg Confidence	Avg Length (chars)
Fully AI-generated	15899 (21%)	4.43	3.58	3687
Heavily AI-edited	3233 (4%)	4.22	3.59	2990
Moderately AI-edited	7082 (9%)	4.20	3.61	2722
Lightly AI-edited	16648 (22%)	4.15	3.68	2746
Fully human-written	32938 (43%)	4.13	3.62	2917
Total	75800 (100%)	4.21	3.62	3026

Title	Ratings	Review Text	EditLens Prediction
CoRGI: GNNs with Convolutional Residual Global Interaction for Lagrangian Simulation	Soundness: 3: good Presentation: 4: excellent Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	The paper addresses a key challenge - How to handle global fluid information in highly dynamic fluid flows. they propose a hybrid Euler-Lagrange scheme, wherein they leverage a residual CNN layer to perform global convolution. With extensive experiments, they show that the architecture outperforms GNS. The motivation for using global feature processing seems relevant in context to capturing dynamic fluid flows. The writing is very good. All the components are clearly explained. Experiments are extensive. Convolution is not discretization agnostic, whereas GNN is discretization invariant. Unlike standard global convolution module, (i.e. FNO, GNOT, UPT, Transolver, FigConvNet CODANO), this approach seems to be applying "Local convolution" repeatedly for information aggregation. This is almost equivalent to applying GNN layers repeatedly, as the information aggregation would eventually propagate to the ends of the graph (oversmoothing). Can the authors show how many GNN layers are required to achieve the same performance as using U-Net autoencoder? GNS seems to be faster, but it is not clear if increasing the number of layers/training steps would improve GNS accuracy. The notion of projecting to a grid to propagate is not new. Several models already do that (GINO, GIOROM, UPT, Transolver) and these models are discretization agnostic, and enable latent feature learning through global kernel integral transforms and fourier neural operators. The difference here, however, is that this is used as an aggregator, rather than as a latent-space propagation module. Additionally, combining eulerian and Lagrangian schemes is not new - "Hybrid Neural-MPM for Interactive Fluid Simulations in Real-Time, Jingxuan Xu" "A Neural Material Point Method for Particle-based Emulation, Omer Rochman-Sharabi" My primary question is the notion of using local convolutions repeatedly, as opposed to using GNN layers repeatedly. It seems like in Table 1, GNS is faster, so I fail to see the advantage of CNN, unless the addition of more GNN layers would slow down GNS, but how would that affect the accuracy of GNS? Could the authors justify how those two are not equivalent? Moreover, many latent space methods project to regular grids. The reduction aspect of U-Net, serves as a form of model-order reduction. But in this case, it's not clear if the reduced latents are used to timestep, or simply aggregate and lift back to original space. Wouldn't multi-pole graph kernel network serve the same purpose (V-cycle aggregation)? Why have the authors not considered global convolutions for global feature learning? (i.e GNOT, UPT, Transolver, GIOROM, GINO, FNO, CoDANO, multipole graph kernel operator, deeponet etc.). UPT does not explicitly model Lagrangian datasets, as it cannot model autoregression, but all of the other models are capable of performing global convolutions, so why use local convolutions (U-NET)?	Fully human-written
CoRGI: GNNs with Convolutional Residual Global Interaction for Lagrangian Simulation	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper proposed CoRGI, a hybrid GNN-CNN simulators: particle features are scattered to a grid, processed by a multi-resolution CNN/U-Net, then gathered back to particles and decoded by the GNN, which captures long-range/global interactions with modest overhead. On LagrangeBench, CORGI improves rollout accuracy versus GNS/SEGNN while remaining efficient in training/inference time. 1. The paper is clear written and the design choice is well-motivated. 2. The proposed architecture balanced accuracy–time trade-off quite well. Gains in rollout error come with modest runtime/memory overhead, under matched budgets it can outperform pure GNN baselines. 3. By using CFL condition to determine a effective spatial length scale, the paper argues that too few message passing steps can be detrimental to the performance and shows multiscale CNN/U-Net layers achieve much faster global information mixing than GNN message passing at fixed depth. The core idea: projecting unstructured grid to a uniform Eulerian grid to exploit efficient convolutions/spectral operaions—has been explored (e.g., [1] uses graph kernel network to project input onto a equi-spaced grid for effcient Fourier transform), and U-shaped/UNet-style GNN designs for simulation are also established (e.g., [2]). In addition to CNN/GNN hybrid architecture or U-shape architecture, there is also Transformer-based architecture like [3] that can handle Eulerian/Lagrangian simulation with global interaction captured efficiently. As a result, the contribution currently feels incremental, and the specific conceptual advance isn’t fully clear. It would be helpful to more explicitly differentiate from prior works. [1] Li, Zongyi, et al. "Geometry-informed neural operator for large-scale 3d pdes." [2] Cao, Yadi, et al. "Efficient learning of mesh-based physical simulation with bsms-gnn." [3] Alkin, Benedikt, et al. "Universal physics transformers: A framework for efficiently scaling neural operators." What happens under large deformations or domain growth/shrinkage, is there support for adaptive grid resolution? Like re-mesh/re-parameterize the uniform grid each step. This is quite important, since Lagrangian simulations are typically used in settings with large deformations and moving boundaries	Fully human-written
CoRGI: GNNs with Convolutional Residual Global Interaction for Lagrangian Simulation	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	CoRGI is an extension to message passing networks that allows global interactions. It is motivated by hybrid numerical methods such as particle in cell and FLIP, which use both eulerian and lagrangian in a particle-mesh solver. The authors validate their model on LagrangeBench, a particle-based and GNN-focused benchmark for hydrodynamics. CoRGI can be applied on top of existing models, such as GNS, achieving a relatively cheap ~50% performance improvement. - Writing is generally easy to understand and follow, the figurework is also helpful and well designed. Motivation for the method is clearly stated and supported by "legacy" choices in traditional numerics (eg PIC). - Results are impressive, with only marginal added inference cost CoRGI significantly improves long term performance. Result presentation could be improved, as table 1 is rather large and not highlighted. Figure 3 is also interesting, but could be displayed better, as some metrics are semilog-y and others are linear. Generally I find that semilog plots are hard to read. - Interpretable results: sinkhorn is a distribution distance, and the significant improvement if it reflects the explicit global nature of the method. - The method is understandable and not overstated, section 3.2 is welcome and fits well to support the method. I did not check the correctness of the proof. - Extensive ablations in the appendix on interpolation schemes, graph and CNN architectures. My main concern with the paper can be summarized with a mismatch between motivation, model architecture and experiments supporting the claims. While developed in the realm of neural surrogates, it also fails to account for very similar graph neural network research. For this reason I don't think the paper is ready to be accepted. If the authors can provide preliminary evaluations on larger systems, and especially systems with long range / global interactions, I will happily raise my score. ### Dataset choice Despite proposing a local-global method, the authors only verify it on a lagrangian dataset generated with SPH, an local-only (particle) method. It represents fluids as particles, and field quantities are computed through (smoothing) kernels over radial neighborhoods, with the assumption of small enough timesteps. It works very well for local variations and free surfaces, but it is not explicitly global. Admittedly lagrangebench datasets are time-coarsened with a rate of 100x, breaking the strict locality of SPH. Regardless, it would be very beneficial to velidate CoRGI on different settings, namely purely eulerian datasets (e.g. turbulent PDEs on grids) and graph datasets with explicit global/long range interactions (e.g. molecular dynamics like rMD17, generic graph benchmark such as LRGB https://arxiv.org/abs/2309.00367). Finally, data generated with hybrid solverslike PIC would be the ideal setup, although I am not aware of benchmarks on the latter. ### Model comparisons Comparison to other message passing + global interaction techniques is missing, for instance Ewald-MP (https://arxiv.org/abs/2303.04791) or MFN (https://arxiv.org/abs/2310.10434). Due to the similarity of application areas, a combination with NeuralSPH would also be interesting, as mentioned in the related works (line 140). I do not entirely agree with lines 454-456: _While MSE20 and MSEEkin are primarily local errors [...], Sinkhorn divergence [...] is global in nature._. It is true that Sinkhorn is global in nature, but because of its 2000 timestep horizon, MSE20 while pointwise also contains some long range, non-local information about the interaction of particles. My interpretation is that on long time windows an autoregressive neural surrogate will inevitably accumulate errors and particle pairings will become meaningless; in turn, Sinkhorn does not have node pairs and looks at the distribution, which is meaningful even with mismatched nodes. MSE can be very high if the individual particles are relocated, even if the predictions are stable. ### Writing The introduction could be restructured to be more concise: for instance, lines 047-049 are reworded and repeated in line 071-072. Also there are some minor inaccuracies, for example line 064 "Despite their expressiveness, GNNs are computationally limited". The word expressiveness is misused here, as on one hand it makes the sentence sound like a contradiction, and on the other message passing is strictly upper bounded by the 1-WL test, making it non-expressive by definition. A more in-depth result discussion would be beneficial: for example, why is dam break seemingly the best performing dataset for CoRGI? ### Scalability LagrangeBench only goes up to ~10K particles, as GNNs are not trivial to scale to significantly larger systems due to computational (memory) requirements. Similarly, CNNs can be applied to large grids, but ViTs mostly overtook convolution in performance scalability on large grids and large amount of data (e.g. in scientific ML https://arxiv.org/abs/2411.09678, https://arxiv.org/abs/2502.02414). ### Limitations Limitations are at the end of the appendix, and I think it's important to have them in the main body. Regardless, the first aspect (resolution coupling) is extremely relevant in application: all datasets in lagrangebench except DAM are fully packed grids with SPH-relaxed particle distribution, meaning that this weakness is not explored in this work. I agree that an adaptive / sparse approach is a valuable future work. - the most impressive promotion is seen on dam break. do you have an interpretation as to why? - what do the authors mean by "which we address in CORGI by increasing the receptive field of each node" (line 129)? to me this sounds like the radius of the graph was increased for CoRGI, was the same done for GNS/SEGNN? - Is there any investigation conducted in the direction of scalability? - SPH is inherently a local method. I understand why machine learning requires such long horizons in a framework such as lagrangebench (time coarsening -> large interaction radius), but why is in general a global method required for modeling SPH simulations?	Fully human-written
Training Multi-Layer Transformers in Almost Linear Time	Soundness: 2: fair Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper tries to prove the theoretical possibility of training a transformer model in almost linear time, more specifically, calculating both the forward and backward passes of the attention layer in almost linear time, while having a small error of the order of 1/poly(n), where n represents the context length (or sequence length of the model). To do this, they use results from Alman & Song (2023,2024) as well as reordering the gradient calculations to stay in almost linear time. Alman, Josh, and Zhao Song. "Fast attention requires bounded entries." Advances in Neural Information Processing Systems 36 (2023): 63117-63135. Alman, Josh, and Zhao Song. "The fine-grained complexity of gradient computation for training large language models." Advances in Neural Information Processing Systems 37 (2024): 61419-61454. - The paper extends the results from Alman & Song to the full transformer (including multihead, MLP layers, and residual connections) to attempt to show that it is possible to train a transformer in almost linear time. - The paper is not very well-written and at times hard to follow or read. Additionally, the paper is difficult to understand without reading a substantial portion of the appendix. All proofs of the paper are relegated to the appendix, and the informal versions of the theorems in the paper state that they show that different parts of the model can be calculated in near-linear time without any intuition of strong reasoning as to why. It is hard to say whether the paper is the main paper or the appendix itself. - The paper claims multiple times that the results either "transcend" or represent a significant leap forward; however, most of the paper seems to be an application of two papers, which they cite and a reordering of the gradient calculations such that the operations stay near-linear. - In the abstract, the authors claim that they have done numerical experiments to show that their theory works; however, these are nowhere to be found. In addition, the two papers that reduce the calculation of attention to near linear time (forward and backwards) state that this could be applicable to highly quantised models (4 bits or less), but in non-quantised the U and V matrices would be infeasibly large (the $k_1$, would be in the order of millions, or billions, which would make them slower than quadratic calculations when n = 128k. - In multiple proofs in the appendix, the authors claim poly(n) / poly(n) <= 1/poly(n) (the poly(n)s are not derived from the same place and are in practice different) because the matrices can be written in log(n) bits; this assumption is very strong and not justified. Without it, the paper does not show that the error does not grow. - The last two weaknesses can be explained by a perceived misunderstanding of the related works, the central part of getting the attention operation to be near linear time with a small error is that the weights of the weight matrices can be bounded by $o(\sqrt{log(n)})$ this however never shows in this paper (outside of lemma C.13). - Were any practical experiments done, as claimed in the abstract?	Fully human-written
Training Multi-Layer Transformers in Almost Linear Time	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposes a theoretical framework and algorithm to train multi-layer Transformers in almost linear time reducing the usual quadratic complexity of attention to near-linear. It extends prior work on fast single-layer attention to handle multi-layer backprop, including practical components like multi-head attention, residual connections, and causal masking. The authors provide formal proofs with provable approximation error but offer no experiment validation. - The paper presents a highly ambitious and conceptually original approach to reducing Transformer training complexity. It extends theoretical results from single-layer attention to full multi-layer Transformer architectures, which is a non-trivial and technically deep contribution. - The theoretical framework is comprehensive and covers practically important architectural components, including multi-head attention, residual connections, and causal masking. This level of completeness makes the theory relevant to real Transformer models rather than being confined to simplified abstractions. - The proofs and algorithmic structure are clearly presented and mathematically rigorous. The authors successfully derive polynomially bounded errors while maintaining almost-linear time complexity, showing careful control of approximation quality across layers. - The potential long-term impact of this work is very high. If the approach can be made practical, it could drastically lower the cost of training large models, make long-context training feasible, and influence how future large-scale sequence models are built. - The paper lacks any empirical validation or implementation evidence. There are no runtime benchmarks (GPU, TPU test) , or experiments demonstrating that the proposed algorithm actually accelerates Transformer training on real hardware. - The asymptotic claims hide important constants and assumptions. The requirement that the embedding dimension or number of heads grows only logarithmically with sequence length is unrealistic for large models, and the polynomial factors may dominate at practical scales. - The algorithm’s reliance on polynomial kernel approximations and low-rank projections raises serious concerns about hardware efficiency. These operations may not map well to GPUs or TPUs, where dense matrix multiplications dominate performance. - Numerical stability could be problematic. High-degree polynomial expansions are known to suffer from rounding errors under FP16 or BF16 arithmetic, which could make the approach unstable during training. 1. Have the authors implemented or simulated the proposed algorithm, even at small scale, to verify that it produces correct gradients and leads to measurable runtime improvements over standard attention? Without any empirical results, how can we be confident the asymptotic gains translate into practical speedups? 2. Could the authors make the hidden factors in the $n^{1+o(1)}$ complexity explicit, particularly the dependence on embedding dimension d, number of heads h, and number of layers L? These constants are crucial to assess whether the claimed efficiency holds for real-world Transformer sizes. 3. How do the authors envision mapping the polynomial kernel and low-rank computations onto GPU/TPU hardware, which favors dense matrix multiplications? Are there steps in the algorithm that become sequential or memory-bound, and how would those affect parallelism and wall-clock speed?	Fully AI-generated
Training Multi-Layer Transformers in Almost Linear Time	Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 0: Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The authors propose to study gradient descent approximation of self-attention, such that per-step gradient update of transformers can be done in time almost linear in the context length. The authors highlight the importance of low rank approximation in achieving such a speed-up in training. The paper argues that training can approach linear complexity in context length, with minimal error accumulation. However, the work currently lacks empirical support for this claim. In practice, the real-world value of the proposed method depends critically on whether the theoretical savings translate into meaningful wall-clock speedups and whether the approximation error remains negligible for real world training of models. Including controlled experiments that quantify these effects would significantly strengthen the paper. This paper’s presentation is extremely weak and not currently suitable for a conference submission. The authors repeatedly defer key ideas to the appendix, referencing terms and components without providing intuition for why they matter or how they fit together. As written, the reader has no clear narrative path to understand the algorithm or proofs, only a sequence of lemmas with no guiding intuition. For a theoretical paper claiming impactful practical consequences, this is a major issue. There is a clever idea here (low-rank approximation in gradient computation), but it is buried. Even one well-motivated proof sketch in the main text (e.g., for Lemma 4.1) explaining why the decomposition arises and how low-rank structure is exploited would help readers understand what’s going on. At present, the paper reads as if the appendix contains the “real” content and the main text is only a table of contents to it. This is not appropriate for ICLR. Concerns About Theoretical Claims Lemma 4.5 states that the approximation error scales as $n^m/poly(n)$. $m$ is assumed to be a constant. But current 7B models have $m$ as $32$. This means the error explodes as $n^{32}/poly(n)$. This isn't a small and reasonable error accumulation! “We further validate our approach through numerical experiments…” However, there are no experiments in the paper. No figures, no tables, no empirical section. This is misleading. If this was a draft produced hastily with LLMs, it should have been cleaned and verified before submission. Please respect reviewer time. If you promise experiments, they must actually be present. I don't have any questions, as the paper has been very poorly presented.	Lightly AI-edited
Training Multi-Layer Transformers in Almost Linear Time	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper provides a theoretical framework and algorithm for approximating gradient computation in multi-layer transformer architectures with almost linear time complexity, $n^{1+o(1)}$, and polynomially small approximation error $1/poly(n)$. The approach extends earlier results on single-layer acceleration to the backward pass of full multi-layer transformers. - This paper provides important theoritical contributions, extending single-layer to module-wise gradient computation, which allows the gradient of each self-attention layer to be approximated in almost linear time. - The paper is well-structured, with clear definitions, lemmas, and algorithmic pseudocode. - This paper potentially can have significant impact, if empirically validated. And it can dramatically reduce the training cost of long-context models. The paper’s assumptions constrain real-world applicability. - The results assume a fixed number of layers m=O(1) so that error propagation stays bounded. Real LLMs contain tens or hundreds of layers, where accumulated approximation errors may become non-negligible. - The hidden dimension is assumed to grow only as d=O(logn), which simplifies proofs but does not reflect actual transformer configurations. - The claim of polynomially small error lacks clarifing the constants and polynomial degrees, and the true error magnitude and runtime overhead for real-world is unknown. - The paper provides no experiments, even on toy setups, to validate gradient accuracy, runtime scaling, or convergence behavior. - Although the abstract claims “numerical experiments,” the paper contains no explicit experimental section, quantitative tables, or runtime evaluations? - What is the per-layer error accumulation?	Moderately AI-edited
Understanding Federated Unlearning through the Lens of Memorization	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The manuscript revisits the problem of federated unlearning from a memorization perspective, arguing that only the unique memorized information within the forgetting dataset should be removed, while the shared patterns should be retained. The authors propose a metric for distinguishing between memorized and shared knowledge at the instance level: Grouped Memorization Evaluation, and introduce Federated Memorization Eraser, with experiments validating the effectiveness of their method. However, there remain several issues that merit further attention: 1. The manuscript redefines the federated unlearning problem from the perspective of memorization and demonstrates that overlapping or shared information should not be unlearned. 2. The manuscript proposes Grouped Memorization Evaluation, a novel metric that can measure memorization information at the example level, thereby enabling a fine-grained assessment of unlearning efficacy. 1. The effectiveness of the FedMemEraser method relies on the pruning ratio. The manuscript notes that this ratio needs to be determined empirically, which might increase the difficulty and tuning cost of applying the method in different scenarios. 2. The method's core assumption is that redundant parameters relative to the remaining dataset D_r primarily carry the unique memorization information of the unlearning dataset D_u. The universality and completeness of this link require further theoretical or experimental validation. 3. The proposed "Grouped Memorization Evaluation" metric requires retraining the model multiple times to compute the memorization score for each example. This could introduce significant computational overhead in practice. 4. The results in Fig. 1 are too densely presented, which negatively impacts readability. 1. In Section 6.2, during Stage 1, the manuscript uses the average gradient magnitude to evaluate the amount of memorized information in the parameters. However, the authors do not provide sufficient justification and evidence for the superiority and necessity of this step. Moreover, the rationale behind the selection of the threshold hyperparameter for filtering is not thoroughly explained—how is the predetermined percentage of initialized parameters determined? In addition, the method does not appear to account for sensitivity differences across layers. 2. The manuscript discusses the issue of overlapping and non-overlapping information. Since the proposed method aims to remove non-overlapping information while preserving overlapping information, would the presence of retained clients whose data distributions are similar to that of the forgetting client affect the evaluation results? 3. The core assumption that “memorized information is equivalent to non-overlapping information” appears to be overly idealized. In Equation (3), $F_m=F_u-F_g$ is defined as the memorized information, but the manuscript does not explain how to concretely distinguish between “overlapping” and “non-overlapping” components in the feature space. Given that the feature distributions are highly nonlinear, an empirical definition alone cannot guarantee that $F_m$ truly corresponds to the memorized portion. 4. The assumption regarding redundant parameters in FedMemEraser lacks validation in Stage 1. The key premise of the algorithm is that small-gradient parameters correspond to memorized information. This assumption is too absolute and lacks theoretical justification or empirical support. 5. FedMemEraser consists of three phases: positioning, resetting, and fine-tuning. If only positioning and reset phases are performed without fine-tuning, what would happen to the forgetting effect and generalization performance of the model? How much forgetting effect does the reset operation itself contribute, and how much does the subsequent fine-tuning contribute? 6. The manuscript's experiments mainly focus on client level forgetting. How will the FedMemEraser method apply to sample-level or category-level forgetting in federated learning?	Lightly AI-edited
Understanding Federated Unlearning through the Lens of Memorization	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper explores federated unlearning from the perspective of memorization and proposes FedMemEraser, a lightweight method combining gradient-threshold-based pruning and fine-tuning to remove memorized information while retaining shared knowledge. The study provides both theoretical insight into the connection between memorization and unlearning effectiveness and empirical results - Writing is good and easy to follow and understand. - It provides a clear theoretical formulation linking memorized knowledge to model parameters, which helps explain the trade-off between unlearning effectiveness and performance retention. - Why [1] can not become a baseline of this paper? Please demonstrate the experiment results. - I have carefully checked, and some relevant studies are not discussed in this paper. For example [1] and so on. Please do a comprehensive investigation. - The proposed FedMemEraser essentially follows a combination of gradient-threshold-based redundant-parameter pruning and a fine-tuning procedure. Compared with existing unlearning approaches based on weight importance or influence functions, its novelty is a problem. - This paper conducts experiments with a fixed setup of 10 clients and does not evaluate the method under different numbers of clients, leaving its scalability and stability unverified. [1] Not:Federated unlearning via weight negation See in weakness	Fully human-written
Understanding Federated Unlearning through the Lens of Memorization	Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper re-examines federated unlearning through the lens of memorization, arguing that common metrics cannot verify whether client-specific information is truly removed. It introduces Grouped Memorization Evaluation (GME): estimate per-sample memorization via multiple retrainings, group the to-be-forgotten samples by score, and contrast “unlearning vs. retraining” within groups for fine-grained assessment. The method FedMemEraser identifies potentially memorization-bearing redundant parameters using the remaining clients’ average gradients, resets them, and fine-tunes only on remaining clients to restore generalization. Experiments on CIFAR-10/100 and EMNIST evaluate GME effectiveness, test performance, local fairness, and time/communication cost. Clear problem reframing and definitions. The decomposition into shared Fg vs. memorized Fm information provides a precise target for Fu and explains why removing overlap harms generalization and fairness. Additionally, the formal “federated memorization unlearning” definition is clean and intuitive. Evaluation innovation. Grouped Memorization Evaluation (GME) directly probes whether high-memorization examples in Du are “forgotten,” enabling fine-grained, content-aware assessment beyond global accuracy or distance metrics—addressing known shortcomings of prior FU evaluations. Simple, FL-compatible method. FedMemEraser requires only server-side aggregation of client gradients already present in FL to locate low-update (redundant) parameters, followed by reinit + fine-tuning. This simplicity is attractive and appears robust to both IID and non-IID data. Broad evaluation coverage. Beyond Grouped Memorization Evaluation and test accuracy, it also reports local fairness and time/convergence behavior. Cost and practicality of Grouped Memorization Evaluation (GME). The memorization score requires training J retrained models without Du to estimate probabilities, which is computationally heavy and may be impractical for large-scale FL. The paper should clarify how J is set. Hyperparameter sensitivity and selection protocol. The method’s main knob ρ is dataset-specific (e.g., different ρ for CIFAR-10 vs. CIFAR-100), and ablations show strong effects. The paper may be should specify a principled, data-independent selection rule and report sensitivity curves. Coverage of federated learning in Related Work. The Related Work section barely discusses FL and FU beyond a brief definition in Preliminaries. Please expand the FL-related literature review: position your work against core FL/FU lines (e.g., client heterogeneity, privacy/unlearning, fairness in FL, server-side vs. client-side unlearning), and clarify what is genuinely new here relative to these threads. Terminology and dataset naming consistency. Please standardize technical terms and dataset names across the paper. For example, consistently use “CIFAR-10/100” (with a hyphen and capitalization) instead of variants like “CIFAR10,” and ensure line 425’s “CIFAR10” is corrected to “CIFAR-10.” GME overhead and configuration. What value of JJJ was used to compute memorization scores? How costly is GME relative to a full retrain, and is it strictly an offline evaluation tool (no influence on hyperparameter selection)? Hyperparameter ρ selection. How should practitioners choose ρ without access to Du? Did you fix ρ per dataset or per run? Precise definitions in Eqs. (9) and (11). In Eq. (9), what exactly is Pr? Is it the same quantity as M in Eq. (11), or is M derived from Pr (e.g., an average over runs/clients/examples)? Please give a rigorous, self-contained definition , and state the intended direction: in Eq. (11), does lower M indicate better unlearning (i.e., less memorization), or the opposite? Threshold selection in Eq. (10). How are the thresholds in Eq. (10) chosen in practice (fixed a priori, validated on remaining-client data, or tuned per dataset/model)? Please describe the protocol used in experiments and provide a sensitivity analysis or at least the chosen values and their rationale. Subgroup mapping and high scores in Table 1. Do the subgroups in Table 1—for example, (95%,100%]—correspond exactly to the thresholds defined in Eq. (10)? If so, why do low-memorization bins show extremely high scores (e.g., EMNIST IID achieves 99.97% in Group (0%,80%])	Moderately AI-edited
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper proposes XGRAG, an explainable framework designed to identify the most influential nodes and edges that contributes to the outputs of GraphRAG models. The XGRAG framework consists of four components: a GraphRAG model for indexing and retrieving relevant subgraphs, an entity deduplication module for consolidating semantically similar entities, a perturber that systematically perturbs subgraphs, and an explaner that quantifies the importance of individual graph components to the model’s final response. 1. The paper proposes XGRAG to generate fine-grained explanations that identify the most influential nodes and edges contributing to an LLM’s response. 2. The paper conducts experiments on the NarrativeQA dataset and the results indicate the the proposed XGRAG outperforms a baseline RAG-Ex. 3. The paper is well-written and easy to follow. 1. The paper is incremental since the main idea of XGRAG, i.e., perturbing retrieved content to identify the most important components, has already been explored in previous works like RAG-Ex. 2. The paper only compares against RAG-Ex, which is designed for text-based RAG models, and lacks comprehensive comparisons with more relevant baselines, such as KGRAG-Ex. 3. The paper lacks sufficient justification for the proposed approach. For example, a simple baseline could involve using GraphRAG to generate predictions and then computing the similarity between graph components and the predicted answer to identify important elements. Additionally, the paper notes that XGRAG's importance scores align with structural properties such as Degree and PageRank. This raises the question of why these existing graph metrics are not used directly for identifying influential components. 4. The proposed method raises efficiency concerns, as it requires perturbing each graph component and generating outputs with the LLM. However, the paper does not provide statistics on the size of the retrieved subgraphs, making it difficult to assess the practical efficiency and scalability of the approach. 5. The paper only conducts experiments on the NarrativeQA dataset, raising concerns about whether the proposed approach can generalise to other datasets or domains. 1. How does XGRAG differ from prior work like RAG-Ex and KGRAG-Ex, beyond being applied to GraphRAG? 2. Why are graph-based baselines such as KGRAG-Ex not included in the experimental comparison? 3. Can the authors provide an efficiency analysis to quantify the computational cost of XGRAG, particularly given the need to perturb and repeatedly invoke the LLM for each graph component?	Fully human-written
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation	Soundness: 1: poor Presentation: 3: good Contribution: 3: good Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper aims to explain the outputs of Graph-based Retrieval-Augmented Generation (GraphRAG) systems. The authors propose a perturbation-based explanation approach, extended from existing work RAG-EX. The proposed method XGRAG uses LightRAG as the backbone and introduces graph-based perturbations such as node and edge removal. The experiments include only one baseline and one dataset. 1. The paper is well-written and clearly articulates its core idea. 2. The proposed approach reasonably integrates LightRAG with RAG-EX, introducing two key modifications: improved entity deduplication and a graph perturbation mechanism. 1. The experimental evaluation lacks comprehensiveness, with only one dataset and one baseline included. 2. An important evaluation relies on a hypothesis that is neither convincingly argued nor adequately justified in the paper. 1. The evaluation includes only one baseline, RAG-Ex. The authors should justify why KGRAG-Ex (Balanos et al., 2025) was not included as a baseline comparison. 2. The use of a single dataset for evaluation undermines the generalizability of the experimental results. 3. Figures 3 and 4 employ an inappropriate visualization method: line plots are used despite the x-axis representing discrete variables, when bar plots would be the correct choice for non-continuous data. 4. In Table 3, the purpose of evaluating across different LLMs is to demonstrate that the conclusion, XGRAG outperforms RAG-Ex, remains consistent regardless of the LLM used. However, to validate this claim, the results for RAG-Ex should also be reported alongside XGRAG for each LLM, rather than showing only XGRAG performance. 5. The hypothesis stated in Line 339 is questionable. In KG retrieval, triples relevant to a query do not necessarily correspond to those with high structural importance. Furthermore, standard PageRank assumes homogeneous edge semantics and may not perform effectively on multi-relational KGs [A]. Given that this hypothesis is central to validating the quality of graph explanations, the authors should provide more rigorous justification and clarification. 6. In Line 450, the similarity threshold is not defined, nor is the selection methodology explained. Please provide this information. 7. The paper lacks error analysis. Specifically, the authors should investigate which questions can be successfully explained by RAG-Ex, XGRAG, both methods, or neither, and analyze the underlying reasons for these outcomes. Furthermore, the necessity of KGs for explanation should be empirically demonstrated through comparative analysis. [A] Li, X., Ng, M. K., & Ye, Y. (2012, April). HAR: hub, authority and relevance scores in multi-relational data for query search. In Proceedings of the 2012 SIAM International Conference on Data Mining (pp. 141-152). Society for Industrial and Applied Mathematics.	Lightly AI-edited
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper introduces an explainability method for GraphRAG that assigns importance scores to entities and relations by perturbing them. S1. The paper is well-written and the figures are clear, making the proposed method easy to follow and understand. W1. The connection of this work to "explainability" is tenuous. The output consists of importance scores for entities and relations, but these scores do not truly explain the inner workings of the llms or why it generated a specific answer. For example, in the case study from Figure 1, knowing that the three entities (Gold Watch, Delta, Jim) all have non-zero importance scores does not explain how the model utilized these entities to formulate its response. In my view, this is more akin to "importance attribution" or "evidence identification" rather than genuine explainability. W2. The experimental evaluation is insufficient. The entire study is conducted on a single dataset, NarrativeQA (2017), which is no longer considered a challenging benchmark for modern RAG systems and may not be representative of common GraphRAG application scenarios. The authors should supplement their experiments with more complex, practical use cases where GraphRAG is specifically needed and standard RAG would be insufficient. Otherwise, the practical significance of this work is questionable. W3. The method's design appears to be more of an engineering effort and lacks principled innovation. The main contributions seem to be demonstrating the value of operations like deduplication and clustering. However, these are standard pre-processing techniques in graph manipulation and can hardly be considered novel contributions of this work. W4. The practical utility of the proposed method is ambiguous. Once the importance scores for entities and relations are obtained, what is their explicit use case? Can they be used for debugging the knowledge graph, refining the retrieval process, or improving model factuality? The authors need to clearly articulate the downstream applications and benefits of their method. In addition to the points in the "Weaknesses" section, I have the following questions: Q1. The knowledge graph (KG) is a relatively traditional method of information representation and has several inherent limitations. For instance, it is challenging to represent complex content (such as the equations in your paper) using a KG, whereas text can be seen as a more general-purpose information carrier. In fact, most existing KGs are constructed by extracting information from text. This raises a fundamental question: why do we need to revert from text back to the structure of a knowledge graph? Historically, due to the limitations in text processing and representation capabilities, KGs were utilized to simplify textual information and remove redundancy, thereby facilitating better representation and retrieval. However, modern Large Language Models (LLMs) can now effortlessly handle the complexities of text representation. Therefore, I believe the remaining advantages of knowledge graphs in the current era of LLMs warrant a more in-depth analysis and discussion. Q2. I observed in your experiments that the effectiveness of your proposed method is primarily benchmarked against RAG-Ex. However, your method utilizes a graph structure, while RAG-Ex operates directly on text. Does this comparison introduce a potential unfairness in terms of the data modality? After all, the conversion of text into a graph is a necessary prerequisite for any knowledge graph-based approach. Consequently, it is difficult to consider the mere utilization of a graph structure as a novel contribution of your method. Could you elaborate on this?	Lightly AI-edited
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation	Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper proposes XGRAG, an explainability framework for Graph-based Retrieval-Augmented Generation (GraphRAG) systems. Unlike previous explainability methods that operate on unstructured text, XGRAG performs graph-native perturbations — removing nodes, edges, or injecting synonyms — to quantify the causal influence of each graph component on the final LLM answer. The framework integrates (1) entity deduplication to merge semantically equivalent nodes, (2) perturb-generate-compare evaluation to compute importance scores, and (3) alignment with graph structural measures to assess validity. Experiments on NarrativeQA show that XGRAG improves explanation accuracy, robustness across story types and question complexities, and generalization across multiple open-source LLMs (LLaMA 3.1-8B, Mistral-7B, LLaVA-7B, etc.). Ablation studies confirm that entity deduplication and node-level perturbations are key to performance gains. 1. Clear motivation. The paper identifies a genuine gap: existing XAI methods for RAG cannot interpret reasoning grounded in structured graph data. XGRAG directly addresses this with a graph-native perturbation approach. 2. Novel methods to perturb the graph. The “Perturb-Generate-Compare” paradigm is adapted elegantly to graphs, combining semantic and structural importance in a unified explanation measure. 3. Comprehensive experiments. Authors include strong empirical validation spans multiple LLMs, question types, and story structures. Ablation studies show the necessity of entity deduplication and tests three perturbation strategies. 1. Limited experimental domain. All experiments are conducted on NarrativeQA, evaluation on other domains (scientific, biomedical or factual QA/KGs) will better demonstrate the ability to generalize. 2. Potential biased evaluation. When building the ground truth, the authors take the assumption that "graph components semantically similar to the final answer are the most relevant pieces of evidence." This assumption can be ungrounded especially when there is no exact information that can directly solve the query, relevant (semantically similar) information in this case could cause hallucination [1]. 3. Scalability issues. The framework requires multiple GraphRAG invocations per perturbation. While LightRAG mitigates some cost, scalability to very large KGs or multi-hop queries remains uncertain. [1] GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation 1. How does XGRAG scale to industrial-scale KGs (millions of entities)? Would LightRAG still be computationally feasible for large perturbation batches? 2. The evaluation relies on similarity to the final answer. Have the authors considered other evaluation metrics such as task-specific annotation to confirm faithfulness? 3. How sensitive is performance to the similarity threshold (θ_sim) used in entity deduplication? Can the authors include an ablation study for that?	Fully AI-generated
DyME: Dynamic Multi-Concept Erasure in Diffusion Models with Bi-Level Orthogonal LoRA Adaptation	Soundness: 1: poor Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper investigates dynamic multi-concept erasure by pretraining LoRA adapters and composing the required concepts at inference. To mitigate crosstalk among LoRA modules, the authors propose Bi-Level Orthogonal Constraints that minimize overlap in LoRA-induced representation shifts and enforce orthogonality in the parameter space. Experimental evaluation employs a CLIP-based classifier to assess both erasure effectiveness and utility preservation. The results indicate that the proposed method maintains the original utility while achieving the lowest accuracy for the erased concepts. 1. Unlike existing concept erasure methods, this paper focuses on dynamic, on-demand multi-concept erasure, enabling separate LoRA adapters to be combined dynamically at inference time. 2. The paper is well-structured and clearly written. The proposed DyME can mitigate LoRA crosstalk. 1. For the motivation, this paper addresses the dynamic combination of trained concept-specific LoRA modules to enable flexible concept erasure in diffusion models. It specifically tackles scenarios where certain concepts may need to be removed from existing LoRA sets. However, the proposed method is limited in that it can add one new concept at a time, potentially requiring retraining across all LoRA modules. The approach’s scalability is restricted if it cannot seamlessly incorporate new concepts without retraining. 2. For the evaluation metrics, the authors adopt a CLIP-based classifier to assess erasure effectiveness and utility preservation. However, existing works typically employ the CLIP score for erasure effectiveness and FID for utility preservation. It remains unclear what advantages the CLIP-based classifier provides over these established metrics. Moreover, LoRA-based weight modifications inevitably affect untargeted concept generation, even if only minimally. Yet, the results measured by $Acc_{UP}$ suggest that the method fully preserves the original performance on untargeted concepts. Therefore, $Acc_{UP}$ alone may not sufficiently capture potential degradation in untargeted concept generation. 3. As the number of erased concepts increases, the difficulty of training LoRA correspondingly rises. Therefore, it remains unclear whether the proposed method can still perform effectively when dealing with a larger set of concepts to erase. 1. How flexible is the system in incorporating new concepts after initial training? 2. Can the authors clarify whether DyME remains effective when erasing a large number of concepts? 3. What is the rationale for choosing this particular evaluation metric?	Lightly AI-edited
DyME: Dynamic Multi-Concept Erasure in Diffusion Models with Bi-Level Orthogonal LoRA Adaptation	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper presents DYME, a modular and scalable framework for dynamic multi, concept erasure in text, to, image diffusion models. Instead of retraining or editing a model for each new erasure request, DYME trains one lightweight LoRA adapter per concept and dynamically composes relevant adapters during inference. To handle interference among multiple adapters, the authors introduce a bi, level orthogonality mechanism consisting of (1) an input, aware orthogonality constraint on induced representation shifts and (2) an input, agnostic parameter, space regularizer derived from a theoretical sufficient condition (Theorem 1). The paper also contributes ERASUREBENCH, H, a hierarchical benchmark organized into brand–series–character levels. Experiments on CIFAR, 100, Imagenette, and ERASUREBENCH, H demonstrate improved erasure efficacy and reduced interference compared to prior static erasure baselines. 1. Addresses real, world dynamic takedown scenarios where multiple concept erasures must coexist flexibly. 2. Per, concept LoRA adapters make updates lightweight and composable. 3. Provides a theoretically backed mechanism to reduce interference among multiple LoRAs. 4. ERASUREBENCH, H offers a structured and hierarchical testbed for compositional erasure. 5. Empirical results: Demonstrate consistent gains across datasets and include several informative ablations. 1. Missing comparison to contemporaneous SOTA (e.g., Receler). The work omits comparisons with key 2025 approaches that address similar problems. This omission weakens claims of novelty and superiority. A direct experimental or analytical comparison with Receler and other strong 2025 methods should be added. 2. Theorem 1 vs implementation inconsistency. The theorem assumes q and k are frozen while LoRA updates only v/o, yet the implementation adapts q/k/v/o. The authors should either align the implementation with the theorem’s assumptions or extend the theoretical justification accordingly. 3. Scalability and cost reporting. With hundreds of unit concepts, training one LoRA per concept can be computationally expensive. The paper should report adapter size, total storage, per, adapter training time, and inference latency when composing many adapters. 4. Baseline fairness. Static baselines were restricted to small erasure scopes due to collapse; stronger tuning or staged training might alleviate this. A justification or expanded baseline study is required. Questions* 1. Why is Receler (and other 2025 SOTA methods) not compared? If unavailable, can you provide a reasoned analysis or reimplementation? 2. Do you freeze q/k during LoRA training? If not, how does Theorem 1 apply in practice? 3. What are the per, adapter parameter counts, total storage, training time, and inference latency at different composition sizes? 4. How were static baselines tuned, and are there hyperparameter settings that allow them to scale further?	Fully AI-generated
DyME: Dynamic Multi-Concept Erasure in Diffusion Models with Bi-Level Orthogonal LoRA Adaptation	Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper proposes a dynamic concept erasure framework for multi-concept unlearning. The proposed framework first learns concept-specific LoRA for each concept. Then the framework learns to dynamically compose the LoRAs corresponding to the target concepts during inference. The experiments are conducted on various commonly used benchmarks. In addition, this paper also proposes an ErasureBench-H benchmark for more comprehensive multi-concept evaluation. 1. Multi-concept erasure is crucial for real-world trustworthy generative AI deployments. 2. The proposed method is reasonable to handle the dynamic scenarios during inference. 3. The overall paper is easy to follow. 1. The proposed framework applies LoRA adapters to encode the specific concepts. However, in the real-world multi-concept erasure setting, the number of concepts would be large. It would cause the memory issues that the users cannot store such a large amount of adapters. 2. Step 3 of the proposed method requires the framework to first train all concept LoRAs jointly. However, in real-world scenarios, it is possible that the user wants to add novel concepts beyond the training set. It will require re-training once novel concepts appear, which might not be practical. 3. While this paper enforces each concept to be non-overlapping, it remains hard to handle the rephrased prompts that would cause robustness issues, allowing attackers to easily recover the concept by paraphrasing the target prompts. 1. This paper contains no qualitative visualization in the main paper. It would be beneficial to demonstrate the visualization for concept erasure on a large number of multiple concepts (e.g., 5 or more concepts).	Fully human-written
DyME: Dynamic Multi-Concept Erasure in Diffusion Models with Bi-Level Orthogonal LoRA Adaptation	Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This paper proposes DYME, a dynamic multi-concept erasure framework for text-to-image diffusion models that allows on-demand suppression of copyrighted or sensitive concepts. Instead of static fine-tuning, DYME trains lightweight, concept-specific LoRA adapters and dynamically composes only those needed at inference. To mitigate interference among multiple erased concepts, it introduces bi-level orthogonality constraints at both the feature and parameter levels. The authors also present ERASUREBENCH-H, a hierarchical benchmark reflecting real-world brand–series–character relationships. Extensive experiments on CIFAR-100, Imagenette, and ERASUREBENCH-H show that DYME achieves superior erasure fidelity and utility preservation compared to prior static approaches, effectively scaling to large and overlapping concept sets. 1. The paper introduces a new concept erasure setting, where multiple target concepts can be erased simultaneously during inference, reflecting more realistic real-world scenarios. 2. It proposes bi-level orthogonality constraints, which effectively enable the stable composition of multiple concept-specific LoRA modules at inference without interference. 3. The paper presents ERASUREBENCH-H, a hierarchical and semantically structured benchmark that enables comprehensive and realistic evaluation of multi-concept erasure performance 1. In Figure 1, the performance of MACE appears inconsistent with its original paper, where it successfully erased 100 concepts (e.g., 100 celebrities) while maintaining high harmonic accuracy. Here, only up to 20 concepts are erased, and the results for MACE seem unexpectedly poor. This might be due to differences in the CIFAR-100 setting, where protected concepts may not have been properly included in the retention set. The authors are encouraged to revisit their reproduction settings. In addition, SPM should be categorized under Dynamic Erasure (DE), as described in Section 3.3 of its paper. 2. In line 73, the claim that static approaches “reduce diversity and degrade fidelity” lacks supporting evidence. In contrast, MACE demonstrates good fidelity preservation even under multi-concept erasure, which contradicts this statement. 3. In the Introduction, the paper does not clearly define the difference between static erasure paradigm and dynamic erasure paradigm, even though these are the core concepts that underpin the paper’s motivation. This omission makes it difficult for readers to fully grasp the main idea. 4. The major concern lies in the motivation of the proposed task setup. The authors emphasize that the weakness of static erasure lies in its inability to dynamically select concepts during inference, while all concepts are erased during training. However, the fundamental goal of concept erasure is to permanently remove target semantics from the model parameters without affecting others. The proposed dynamic erasure is unrealistic in white-box settings (e.g., Stable Diffusion), since LoRA adapters are external and not merged into model weights — an attacker could easily bypass DYME by modifying the inference code directly. 5. Although the proposed method might have value in black-box deployment, similar behavior could be achieved with simpler mechanisms. Since DYME selects LoRA modules based on keyword matching, a trivial baseline could simply detect forbidden keywords in prompts and return a blank or noisy image otherwise generating normally, achieving equivalent or even better results (i.e., AccEE = 0 and unchanged AccUP). 6. The proposed Dynamic Composition at Inference (Sec. 4.2) relies on explicit keyword matching, which fails to address implicit concepts, such as those in the I2P benchmark [1]. Handling NSFW or implicit concepts is a critical aspect of practical concept erasure that this method does not consider. 7. In line 106, the paper claims “this is the first work to systematically investigate multi-concept erasure scalability in diffusion models,” but prior studies such as [2, 3] have already discussed similar topics. 8. In line 241, the symbol j is undefined. 9. In Equation (3), the loss formulation seems to encourage opposite directions (cosine similarity = –1) rather than orthogonality (cosine similarity = 0), which contradicts the intended behavior. 10. The experimental comparisons include only SPM and MACE (both CVPR 2024). More recent and relevant baselines with similar motivations are missing and should be added for completeness. 11. The experiments in Section 5.2 only cover up to 20 erased concepts, while MACE has demonstrated the ability to erase 100 concepts. Although the dynamic approach may scale better, expanding the number of erased conceptswould strengthen the motivation and claims. 12. The ERASUREBENCH-H benchmark lacks sufficient detail and description in the main text; it is hard to understand its structure and usage without referring to the appendix. 13. In Table 4 (Config 1), it is unclear why AccUP remains 70.50 after removing LoRA-C. This inconsistency needs clarification or additional explanation. [1] Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models, CVPR23 [2] Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation, CVPR25 [3] Erasing Concept Combination from Text-to-Image Diffusion Model, ICLR25 See the weakness aprt.	Moderately AI-edited
Unlocking Decoder-LLMs for Text Embedding with Instructions, Soft Supervision and Curriculum Learning	Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper proposes to improve general-purpose text embeddings through (i) in-context learning (ICL), (ii) knowledge distillation with soft scores, (iii) curriculum learning, and (iv) hard-negative mining in an adaptive manner. Experiments on the MTEB v2 English demonstrate that the models trained with these methods achieve SOTA results as of June 2025. (1) The paper focus on the important problem of developing general-purpose text embedding models. (2) The developed model achieves SOTA results on MTEB v2 English as of June 2026. (3) Detailed ablation studies are conducted on each component of the model. (1) The techniques studied in this paper have been explored in prior works, with some methods simply reusing existing approaches without modification. Specifically, the ICL ifollows the same approach as in [1], and hard-negative mining is based on [2]. Knowledge distiallation using soft supervision is also well studied in the literature by applying KL loss using soft reranker scores. A two-stage curriculum learning is also explored in [3], although the detailed config of the curriculum differs. (2) The two-stage approach (STS followed by Retrieval) is only slightly better than using retrieval data alone (73.60 vs. 73.43). Using 2-shot demonstrations leads to worse performance compared to no demonstrations (73.12 vs. 74.14). (3) The paper highlights the advantage of avoiding architectural changes and full fine-tuning. However, these characteristics are already common in existing models (e.g., BGE-EN-ICL, Qwen3-Embe, etc.). (4) There are some missing descriptions. Only the InfoNCE loss is mentioned in the paper. It is not clear how the soft scores are used in the model training. [1] Making Text Embedders Few-Shot Learners. [2] Nv-retriever: Improving text embedding models with effective hard-negative mining. [3] NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. (1) The citation format in the paper is incorrect. There should be a whitespace betwen the text and the citation.	Fully human-written
Unlocking Decoder-LLMs for Text Embedding with Instructions, Soft Supervision and Curriculum Learning	Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper presents a framework for learnign embeddings from LLMs using data selection with BM25 and rerankers and a fine-tuning step. A third step can be added to the pipeline leveraging ICL. The framework is tested using Msitral-7B and evaluated on MTEB(english, v2). As mentionned in the next section, there are more concerns than strengths to this work. See section below. Comments on the format: This paper is poorly written and poorly presented, it looks like it abused AI-generation for many reasons, I cite few of them here: * The related work are really poor, 2 paragraphs only contain 1 citation each, with an abuse of m-dashes. Many papers are also missing concerning the embedding generation with LLMs, like GritLM-7B paper. * All figure look like they were generated by ChatGPT, no effort was made to improve them. A figure that showcases an example is more appreciated. * Citations not formated correctly, for example: ..by Moreira et al. (Moreira et al., 2024),..., it should be replaced just by \citet{paperef}. * The formatting of Table 1 is random, only the summarization line is bolded. Comments on the technical content: * The model is only evaluated on English, while other models are multilingual, it is not sure how the proposed approach would perform in a multilingual setting. * The paper presents a framework for extracting embeddings from LLMs but only shows how it works with Mistral-7B. Other LLMs have been developed since then, Mistral-7B may not be a good reference. Showing that the framework works with other LLMs is needed. * Authors employ the term "Knowledge Distillation" for describing their sample selection strategy but no KD is really done. * The method is poorly explained, it's unclear what the "soft label distillation" does during the training. Better detailing and showcasing an example of a step would have helped understanding. In genreal, the work is very similar to what already has been done with LLMs for representation learning, the novelty and originality of this work is not clear. It needs major updates to match ICLR standards and be published. 1 - Why not test the framework on other backbone models? 2 - Why limit the evaluation on English only? Did you observe similar improvments on MTEB(multilingual, v2)?	Fully human-written
Unlocking Decoder-LLMs for Text Embedding with Instructions, Soft Supervision and Curriculum Learning	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper proposes a practical recipe to convert a decoder‑only LLM (Mistral‑7B) into a strong, general‑purpose text embedder without architectural changes. The system combines: (i) instruction‑shaped inputs with few‑shot demonstrations and [EOS] pooling to get context‑aware embeddings (§4.2; formatting examples in Table 8, p. 14); (ii) soft supervision via a hybrid retrieval teacher (BM25 + dense retrieval fused with RRF and reranked by a cross‑encoder; Fig. 2 on p. 5) to produce continuous relevance scores (§4.1); (iii) adaptive, margin‑based hard‑negative mining with a “95% of positive” threshold (§4.3, p. 6); and (iv) a two‑stage curriculum that first learns semantic textual similarity (STS), then retrieval (RET) (§5.4, Tables 3–6, pp. 8–9). On MTEB (English v2; 41 tasks) the model reports a mean of 74.12 and states a higher Borda rank than some models with slightly higher mean (Table 1, p. 6). Ablations claim gains from soft labels (Table 2), the STS to RET ordering (Tables 4–6), and the margin‑based mining (Table 10, p. 15). Fine‑tuning uses LoRA; hyperparameters are in §5.1 and Appendix B.1 (Table 7). • Modular training recipe that practitioners can replicate: instruction‑shaped inputs, a hybrid teacher (BM25 + dense + RRF + cross‑encoder), and LoRA fine‑tuning, illustrated in Fig. 2 (p. 5) and Fig. 1 (p. 4). • Curriculum insight: STS pretraining followed by RET fine‑tuning gives consistent gains over reverse ordering and multi‑task mixing (Tables 4–6, pp. 8–9). • Ablations: soft‑label on/off (Table 2), ICL shot counts (Table 9), and negative sampling (Table 10). The margin‑based rule is empirically best (Table 10). • Balanced MTEB performance across categories (Table 1), with standout Summarization (38.93) and strong Retrieval/STS/Pair‑Classification. 1. Core objective unspecified (major). The paper repeatedly claims “soft labels” but instantiates only InfoNCE, which is typically hard-labeled. To substantiate the thesis, the authors must state and test a genuine soft‑label loss. Also, this choice needs to be properly ablated and justified, as it seems that their core contribution is a general training recipe for text embeddings. Choices of ablations are other objectives that make use of soft labels. For instance: BiXSE (pointwise BCE) [1], LambdaLoss (pairwise BCE with nDCG weighting) [2], RankNet/PairDistill (pairwise BCE) [3-4], and a soft InfoNCE [5] approach. 2. Negative‑mining direction. The text likely flips the inequality; please correct and fully specify top‑K, sampling, and the margin 0.95 used in Table 10. 3. Add more controlled baseline using the same data/mining/curriculum with more strong open base models. 4. Teacher specificity & data hygiene. Name the exact BM25 variant, dense encoder, reranker checkpoint, RRF k (text mentions “typically 60”), and report deduplication against MTEB sources. Greater detail about the data collection and curation process helps with reproducibility and further more with better claims regarding generalization. 5. Novelty concerns. Any individual aspect beyond maybe curriculum has been already examined in great detail. References ---- [1] BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation. Tsirigotis et al. 2025. Treats graded relevance as probabilistic targets and optimizes BCE. [2] The LambdaLoss Framework for Ranking Metric Optimization. Wang et al. 2018. Metric‑driven pairwise weighting approximating nDCG@k. [3] Learning to Rank using Gradient Descent (RankNet). Burges et al. 2005. Classic pairwise logistic objective on score differences. [4] PairDistill: Pairwise Relevance Distillation for Dense Retrieval. Huang et al. 2024. Distills pairwise preferences from a strong reranker into a dense retriever. [5] Rethinking Negative Pairs in Code Search (EMNLP 2023) proposes Soft‑InfoNCE by re‑weighting negatives; conceptually applicable to graded labels after normalization. 1. ICL vs LoRA. Which results are “ICL‑only” (no parameter updates) versus LoRA‑tuned? Clarify whether the backbone is frozen and only adapters are trained. 2. Pooling & length. You fix [EOS] pooling; prior work sometimes finds pooling choice important. Add a brief pooling (average, average on top of non-instruction tokens, last token, attention-based pooling) and sequence‑length sensitivity study.	Moderately AI-edited
Unlocking Decoder-LLMs for Text Embedding with Instructions, Soft Supervision and Curriculum Learning	Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This paper presents a unified, instruction-based framework that adapts decoder-only Large Language Models into general-purpose text encoders without requiring architectural modifications. The authors propose a four-step learning pipeline: (i) in-context learning to generate context-aware embeddings, (ii) soft supervision through knowledge distillation, (iii) adaptive margin-based hard-negative mining, and (iv) a two-stage curriculum learning strategy (Semantic Textual Similarity (STS) then Retrieval (RET)). The authors then compare their LoRA-finetuned model with this training pipeline against various baselines on the MTEB benchmark, achieving performance on par with the best existing systems. * State-of-the-Art Performance on MTEB: The paper introduces an efficient training pipeline that enables decoder-only LLMs to achieve performance on par with the best existing systems on the MTEB benchmark (designed to comprehensively assess the capabilities of sentence representation models across a diverse set of tasks). * Thorough Ablation Study: The authors systematically evaluate the impact of their proposed design choices through comprehensive ablation studies conducted across the full MTEB benchmark. This effectively demonstrates the contribution of each component to the overall performance. * Limited Novelty of Core Techniques: While the proposed pipeline demonstrates effective integration, its individual components are based on well-established and widely experimented techniques (e.g., in-context learning, knowledge distillation, hard-negative mining). Consequently, the system achieves performance on par with other leading models (that could be explained as many of them also leverage similar foundational training methodologies). * Marginal Gains from the Proposed Curriculum Learning: While the authors emphasize that their two-stage curriculum learning strategy consistently outperforms conventional multi-task learning and more complex curriculum variants, the presented results seem to suggest a relatively small performance improvement. Specifically, the observed performance gap between single-step training (RET-only) and the proposed two-stage curriculum training (STS -> RET) is approximately 0.11 points (as indicated in Tables 3 and 4). This modest gain raises questions about its significance or training noise, especially when compared to the more substantial contributions from other well-established techniques employed in the pipeline, such as soft label distillation (yielding a 0.63 point gain, Table 2) or hard-negative mining (contributing 0.39 points, Table 10). * Limited Generalizability to Out-of-Domain Data: While other leading models also leverage MTEB training data for evaluation, the proposed pipeline training data appears specifically tailored to this distribution while other systems often incorporate a broader range of data sources. This raises concerns about its generalizability and robustness when confronted with out-of-domain data, potentially limiting its practical applicability beyond the MTEB benchmark. * Clarity: The paper seems to suffer from several ambiguities and inconsistencies in its methodological description. Role of In-Context Learning: The introduction claims the use of in-context learning and few-shot examples to generate specialized embeddings without updating model weights. However, Table 9 contradicts this by showing that the inclusion of few-shot examples consistently decreases the average model performance. Furthermore, the subsequent description of the training pipeline, which involves LoRA finetuning with soft label distillation, obscures the contribution and integration of in-context learning without updating model weights and instruction-following, raising questions about their overall impact and purpose in the final system. * Misinterpretation of Hard-Negative Mining Strategy: In Section 4.3, the authors state they discard "Negative candidates with scores falling below this threshold are excluded during training" (maximum negative score threshold), citing Moreira et al. This directly contradicts the technique described by Moreira et al., which specifically advocates for using candidates below this threshold to effectively avoid the inclusion of false negatives and improve mining efficiency. * The pipeline used soft label knowledge distillation, which teacher model(s) is/are used? * Table 2: soft-labeling give the best model performance of the paper, this model has been also trained with hard negative mining? Two curriculum steps? Only soft-labeling? * Table 3: What is the baseline model? is it trained or is it the base mistral? * Concerning Hard-Negative Mining Strategy, is the Author used the strategy from Moreira et al. or is the described one correct (in this case, the citation seems to be irrelevant)? * This pipeline aims to be computationally efficient. It would be interesting to include the total GPU hours required, and to break down the proportion of GPU usage between the LoRA fine-tuning (to reduce training compute) and the generation of soft labels. * The primary evaluation was conducted on MTEB, using the model's in-domain training dataset. To further validate the presented results, it would be beneficial to evaluate on out-of-domain distributions using RTEB.	Lightly AI-edited
Evaluating SAE interpretability without generating explanations	Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This paper proposes two explanation-free methods for evaluating the interpretability of sparse autoencoders (SAEs): intruder detection and example embedding scoring. The authors test the proposed methods on SmolLM2 135M across 56 latents and find a strong correlation between human and LLM evaluators in intruder detection. The intruder detection method successfully bypasses natural language explanation generation while maintaining interpretability assessment; however, the embedding method shows limited correlation with human judgments. Higher activation deciles prove more interpretable across both methods, and the evaluation reveals that most SAE latents demonstrate interpretability without requiring explicit verbal descriptions. 1. Figure 1 effectively illustrates the conceptual shift from explanation-based to activation-based evaluation, and the writing is generally accessible. 2. The paper introduces evaluation methods that bypass natural language explanation generation, addressing a significant limitation in existing sparse autoencoders' interpretability assessment. This is a significant contribution that streamlines the evaluation pipeline and minimizes the impact of confounding factors. 3. Example embedding scoring offers a computationally lightweight alternative using small embedding models, making large-scale SAE evaluation more feasible. 1. The evaluation focuses exclusively on SmolLM2 135M across only 4 layers with 56 total latents. This narrow scope raises questions about generalizability to larger models, different architectures, or other SAE training approaches beyond TopK. 2. Example embedding scores do not correlate as strongly with human intruder scores (r = 0.48), and AUROC are close to random, which limits the practical utility of the proposed scoring method. 3. The paper lacks a discussion of failure modes or which types of latents are poorly captured by the proposed methods. 4. Limited investigation of why LLMs consistently underestimate interpretability compared to humans 1. Why does the example embedding score not correlate as strongly with human intruder scores as it does with LLM intruder scores? Authors say example embedding scores tend to underestimate the interpretability of latents due to the small size of the embedding. Does the correlation improve when the embedding size is increased? 2. How sensitive are the intruder detection results to the highlighting strategy? Have you tested alternative approaches, such as not highlighting any tokens or using attention-based highlighting to focus on the most relevant tokens? 3. What is the rationale for randomly selecting a single decile and sampling all activating examples from it? 4. Proposed SAEs use TopK activation with $k=32$. How do results change with different $k$ values, different activation functions, or different sparsity levels? 5. Can you provide examples of latents that score poorly on intruder detection but might still be considered interpretable by other measures?	Lightly AI-edited
Evaluating SAE interpretability without generating explanations	Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper proposes a novel evaluation approach to assess the interpretability of sparse autoencoders (SAEs). Instead of generating natural language explanations as an intermediate step, the authors introduce two explanation-free methods: intruder detection and example embedding scoring. The paper demonstrates that direct assessment of latent interpretability is viable and correlates well with human judgments when using an LLM-as-a-judge approach. - This paper demonstrates the feasibility of the proposed method intruder detection, achieving strong correlation between human and LLM assessments. - The methods used in this paper are straightforward and easy to understand. - The paper examines interpretability across different activation deciles, providing nuanced insights into how interpretability varies with activation strength. - However, the performance of embedding score is not promising. The AUROC scores are barely above random (0.5-0.7), and correlation with human judgments is weak (r=0.48). This undermines one of the paper's main contributions, as this method was proposed as a fast, scalable alternative. - Lack of direct performance comparison with traditional interpretability evaluation methods. - Results are presented on very small set of latents (56) and small models. So we don't know if this holds when dataset scales up. - The bottleneck of evaluation seems to be extensive data collecting process, why avoiding natural language explanation is a critical problem? - Despite claiming to simplify evaluation, intruder detection still relies heavily on LLM queries, contradicting the motivation of reducing computational costs. - Line 34, the coefficients is not non-negative necessarily, if this refers to activation value of latents. Please verify with examples from Neuronpedia. - Line 44 - 46, the conclusion on natural language explanations introduced additional hyper parameters and prompts can be expanded further. It’s not very clear how they introduce additional parameters, which might refer to simulations. But it’s important to explain this clearly at the beginning of the paper. I feel the authors should spend more time polishing the introduction section to stand out their motivation and make it accessible. The last paragraph of the introduction is hard to follow. The introduction to their own methods is not clear at all. - Heatmap in Figure 2 is not very illustrative. What does "All latents which have less than that 0.2 accuracy are considered non interpretable, and different degrees of interpretability are assigned to the other 4 bins of 0.2." mean? Would overlapping histogram be more illustrative here?	Fully human-written
Evaluating SAE interpretability without generating explanations	Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The authors introduce two novel methods for evaluating interpretability of SAE latents without the requirement for generative methods. The authors construct an intruder detection task as the first approach and compare performance of human and LLM detectors, showing a high correlation albeit at a small sample size of 56 latents. Example embedding scoring, on the other hand, measures proximity of positive and negative sentence samples in the latent space. Example embedding scoring reports a moderate correlation with human scores, which could be caused by the fact that sentence embedders might poorly reflect individual token relevances. The problem of evaluating SAE interpretability is an important one, and the proposed methods have merit. The high correlation between human and LLMs on the intruder detection task is promising. The presentation of the paper, however could be improved upon. I find more extensive experiments lacking, such as adding more latents in the intruder detection task or performing subsequent analses on what causes the low correlation between human and example embedding scores — is the embedder quality a factor driving this gap? Furthermore, it is not clear to me where the data used for positive and negative SAE samples is sourced from, which is a crucial detail. An interesting question would also be also how do the methods fare across different data domains of source text? - The authors study an interesting problem of interpreting SAE latents without the use of generative LMs - The authors propose two methods, one of which exhibits a high correlation with human annotators - The presentation of the paper would benefit from improvement - Some important experimental details are missing: where is the data used as positive/negative samples for SAE latents sourced from? - Experimental limitations: increasing the number of latents, or analysing the cause of poor correlation between the example embedding method and human scoring would be interesting. See above	Fully human-written
Evaluating SAE interpretability without generating explanations	Soundness: 4: excellent Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper introduces a new method for evaluating Sparse Autoencoders (SAEs). It argues that explaining latent directions in the SAE’s latent space through short textual descriptions is suboptimal for two main reasons. First, this approach complicates the evaluation process by adding hyperparameters and prompt-related variability. Second, a latent factor can be interpretable even if it cannot be concisely expressed in words. As an alternative, the paper proposes an intruder detection framework. For each latent, four activating examples and one non-activating “intruder” example are sampled. Interpretability is then assessed based on how effectively humans, large language models (LLMs), and an embedding-based algorithm can identify the intruder. This approach emphasizes intuitive recognition of the pattern a latent does or does not encode. The results show strong agreement between human and LLM performance in intruder detection, indicating that LLMs may be well-suited for automating SAE interpretability evaluation. Proposes new method for evaluating SAEs that is more permissive in the types of interpretability it allows for (interpretable, but not easily expressible in words). The proposed method looks very promising, with LLM accuracies tracking those of humans. Multiple approaches toward the task are evaluated (LLM vs. embedding). I think the presentation could be significantly improved. On line 155, it is explained that 'We randomly select one of the ten deciles of the activation distribution, then sample all of our activating examples from the same decile.', but this is then not at all motivated. I found it quite difficult to understand why we would want to do this, and it wasn't until re-reading some of the results section for the second time that I understand the point. Specifically, the paragraph on lines 295-304 goes into the different ways we might (not) assign meaning to the activation strengths of the latent. I think that goes a long way towards motivating why we care about deciles, but it appears in the results section, rather than in an earlier section, where I would expect it. Looking at the interpretability of distributions of activations, the LLM's results are very far from symmetric: it is much better at detecting a low-activating intruder among highly-activating samples than vice versa. This is something I could not have predicted, do you have any intuition for why this is? And, do you have any data on how symmetrical humans are? Have you compared your interpretability scores to the explanation-centered approach you contrast to in the introduction? Can you find examples of latents which would be deemed uninterpretable according to other methods, but are considered interpretable under your framework?	Fully human-written
BEYOND IMITATION: RECOVERING DENSE REWARDS FROM DEMONSTRATIONS	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper introduces a novel idea that Supervised Fine-Tuning (SFT) can be viewed as a special case of Inverse Q-Learning, suggesting that SFT does not merely imitate expert policies but implicitly learns a dense, token-level reward model. The authors then recover this implicit reward from the SFT model and propose Dense-Path REINFORCE (DPR), which leverages the recovered reward to more efficiently optimize large language models (LLMs). Experimental results demonstrate that the proposed method outperforms standard SFT across multiple benchmarks. 1. The paper presents a novel idea that Supervised Fine-Tuning (SFT) can be viewed as a special case of Inverse Q-Learning, offering a new perspective for understanding SFT. 2. The authors provide a comprehensive theoretical analysis to support the proposed formulation. 3. Experimental results show considerable improvements over traditional large language model (LLM) training methods. 1. In the theoretical analysis, the authors make several strong assumptions about the setting—for example, assuming a deterministic token sequence and a fixed discount factor of $\gamma=1$, rather than a value smaller than 1 as typically used in RL. 2. In the proposed DPR method, the reference policy $\pi_\text{ref}$ is not formally defined. The authors state that it is an SFT checkpoint trained with half of the training samples; however, if the dataset is sufficiently large, wouldn’t this reference policy also become fully trained, thereby reducing the meaningful difference between $\pi_\text{ref}$ and $\pi_\text{SFT}$? 1. The reviewer is not an expert in LLMs, but I question whether the current training paradigm of LLMs is primarily driven by SFT. Wouldn’t RLHF have a greater overall impact on model alignment and performance? I therefore have some concerns about the potential contribution and significance of this work to the broader LLM literature. 2. If SFT is theoretically equivalent to IQL, would it be possible to directly apply IQL methods to learn the reward function instead of recovering the reward from SFT? 3. The authors choose REINFORCE for policy optimization. Could the authors clarify why this choice was made instead of using more advanced RL algorithms, such as actor–critic or PPO-based methods?	Moderately AI-edited
BEYOND IMITATION: RECOVERING DENSE REWARDS FROM DEMONSTRATIONS	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper proposes that supervised fine-tuning (SFT) can be viewed as a special case of inverse soft-Q learning under a deterministic token MDP with γ = 1 and a linear conjugate. Building on this, it defines a baseline-relative dense reward and introduces Dense-Path REINFORCE (DPR), a token-level REINFORCE update using log-probability differences between the SFT model and an earlier checkpoint as reward signals. Experiments on AlpacaEval, MT-Bench, LIMA, and Arena-Hard show modest but consistent gains over SFT and competitive results with SPIN and GSIL. 1. Theoretical link between SFT and inverse RL is clear and well-motivated. 2. The telescoping argument makes the equivalence intuitive under γ = 1. 3. DPR is simple, reproducible, and the dense reward idea is easy to understand. 4. Ablations on γ, checkpoint choice, and baseline removal are informative and thoughtfully designed. 1. Narrow validity: The equivalence holds only for γ = 1, deterministic token transitions, and a linear conjugate. The paper does test γ < 1, showing expected degradation, which confirms the limitation rather than extending the theory. 2. Inconsistent assumptions: The equivalence relies on a linear conjugate, while the later stability theorem assumes strong convexity. These are mathematically incompatible but presented as part of one framework. 3. Evaluation design: DPR is trained and tested on the same prompt set, so gains could reflect continued fine-tuning instead of genuine reward recovery. 4. Limited robustness: The temperature ablation is minimal and doesn’t analyze stochastic effects or variance. 5. Weak empirical evidence: All evaluations rely on GPT-4 judges without error bars, human checks, or multiple seeds; gains over SFT are small (a few percent). 6. Missing controls: No baseline comparing DPR to simply extending SFT training, making attribution unclear. 1. Can the same ψ be both linear (for equivalence) and strongly convex (for contraction)? 2. Why does the halfway checkpoint consistently give the best result? 3. Did you test multiple seeds or new prompts to confirm robustness? 4. Could a continued-SFT baseline match the reported gains?	Fully AI-generated
BEYOND IMITATION: RECOVERING DENSE REWARDS FROM DEMONSTRATIONS	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper reframes SFT under a token-MDP view (γ≈1) and shows that the SFT objective is equivalent to an inverse soft-Q objective. This motivates extracting a dense proxy reward as a log-likelihood ratio between a final SFT model and a reference checkpoint, followed by a short, critic-free RL step (REINFORCE with KL to SFT). The method is closed-loop (no environment reward, no preference data, no reward model), simple to implement, and reports consistent gains over plain SFT across several backbones/evals. Dense credit assignment: Per-token signals are actionable and address SFT’s plateau on long sequences and intermediate steps. Low engineering overhead: Uses existing SFT artifacts (final SFT + ref checkpoint). No external judges or reward models. Theory ↔ practice alignment: The SFT ↔ inverse-soft-Q connection and the safe-improvement style argument justify using the recovered dense signal for a small policy-gradient step. Stable optimizer: REINFORCE + KL to SFT (no critic) keeps the pipeline robust and easy to reproduce. Reference sensitivity: The approach assumes π_SFT is meaningfully better than π_ref. If ref is too weak (noisy signal) or too strong (vanishing signal), the log-ratio reward becomes brittle or tiny. Log-ratio gaming: The policy may drift toward stylistic artifacts that inflate log π_SFT − log π_ref without improving task quality. Domain narrowness: Because SFT and ref share training data, the proxy reward is intrinsically domain-tied; out-of-domain generalization of the dense signal is unclear. Eval reliance: Gains are primarily shown via LLM-as-judge; stronger human or task-grounded metrics would strengthen the case. minor S1. Reference selection by evaluation, not step count: – Choose π_ref via validation metrics (MT-Bench, small human slice, task-specific set) to target the “elbow” where the signal-to-noise of the log-ratio is highest. – Alternatively use an EMA of SFT weights as π_ref to smooth noise. S2. Multi-reference ensemble: – Define log π̄_ref = logsumexp_i(log π_ref,i) − log k (geometric mean). Reward becomes r̂ = log π_SFT − log π̄_ref. This damps idiosyncrasies of any single checkpoint and makes “progress” less gameable. S3. Dual-KL regularization and reward shaping to reduce gaming: – Keep KL(π_θ \|\| π_SFT) and add a small KL(π_θ \|\| π_ref). – Clip/normalize the per-token log-ratio (e.g., cap magnitude or z-score by position). – Penalize tokens where both π_SFT and π_ref assign low probability (both uncertain) even if the difference is large. – Add light style/fluency guards (repetition rate, perplexity bounds under a separate LM). S4. Correlation-gated updates: – On each batch, compute the correlation between r̂-improvement and a cheap proxy (exact-match on small QA, code unit tests, math verifier). If correlation drops below a threshold, reduce step size or increase KL. This is a simple “reward sanity check.” S5. Leverage the re-forward pass to incorporate useful rewards (when available), without changing the core method: – Hybrid reward: use R = α·r̂ + (1−α)·r_ext, where r_ext can be any lightweight verifier signal (unit tests for code, arithmetic checker, safety filter, format validator). α can be annealed from 1→0.8. – Doubly-robust/token-aware AWR: advantage-weight the SFT tokens by r̂ (and r_ext if present), i.e., reweight the teacher-forced loss with w_t = exp(β·A_t) to unify imitation and RL in one pass. – Counterfactual filtering: when the re-forward reveals contradictory beams (both low confidence), zero out r̂ for those tokens to avoid amplifying noise. S6. Report a reference sweep and ablations: – Show downstream metrics vs. ref placement (early/mid/late) to directly address “is π_SFT actually better than π_ref?” – Include ablations for single-ref vs. multi-ref, with/without dual-KL, and with/without correlation gate. How is the reference checkpoint chosen? Is it purely by training step or by validation metrics? What is the sensitivity curve (early/mid/late)? Q2. Can an ensemble of references reduce variance/bias? (Geometric mean of checkpoints as a smoother ref.) Q3. How do you detect or mitigate reward-hacking (odd outputs that maximize the log-ratio)? Q4. What is the robustness out of domain (math/code/safety) where SFT confidence calibration differs? Q5. Since the method already performs fresh forward passes, can those passes be leveraged to incorporate additional rewards or verifiers when available (see Suggestions)?	Fully AI-generated
On the Interaction of Compressibility and Adversarial Robustness	Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper investigates the fundamental relationship between network compressibility and adversarial robustness. The claim is that their interaction remains poorly understood. They show theoritical bounds and also empirical evaluation on architectures (FCNs, CNNs, VIT) and multiple datasets. Results show that Increased neuron or spectral compressibility consistently reduces adversarial robustness, even under adversarial training. - Paper is well motivated and well written - Provides a well-explained theoretical contribution between compressibility and adversarial robustness, tying together concepts from pruning, low-rankness, and Lipschitz theory. - Empirical analysis covers diverse architectures and datasets, including FCN, convolutional and transformer families, and multiple attack settings - The theory uses global operator norm–based Lipschitz bounds. and The bounds rely on scale-normalized parameters (using ∥W∥_F) and strict (q, k, ε)-compressibility. Can this reflect practical training dynamics with normalization layers or adaptive scaling or deep non-linear networks. - How does it position itself with other works exploring the same paradigm [1][2], As some works claim that some sparsity helps robustness - The claim that compressibility fosters universal adversarial examples is intriguing but briefly demonstrated [1] Lipschitz Constant Meets Condition Number: Learning Robust and Compact Deep Neural Networks [2] Robust low-rank training via approximate orthonormal constraints - Why focus exclusively on structured compressibility (neuron/spectral)? Would unstructured or other forms behave differently? - In Fig1,2 did not understand how to interpret these new directions? How were they visualized? - The alignment equation notations could be explain bit better - Line 325, how does Fig4 prove the dominant singular directions claim?	Fully human-written
On the Interaction of Compressibility and Adversarial Robustness	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper develops a framework to investigate the effect of structured sparsity on adversarial robustness through its effect on parameter norms and the network's Lipschitz constant. Compressibility can induce a set of highly sensitive directions in the representation space. 1. This paper is in general well-written and presents results clearly. 2. The motivating hypothesis is very interesting and described clearly in Figure 2. 3. Abundant numerical results are provided to testify the paper's theoretical results. 1. One central claim of this paper is that the compressibility may result to a few potent direction that increases the sensitivity to perturbations, and the adversarial attacks might exploit these directions. However, I cannot picture when and how the advesaries might be able to figure out these directions. Is the neural network and the compressibility totally white-box to the adversaries, which could hardly happen? 2. The evaluation of adversarially robustness of NN models seems to be dependent on the attack itself. 1. Can you please offer a motivation example of how compressibility might be taken advantage of by adversaries? Especially how would adversaries figure out the "adversarial directions"?	Fully human-written
On the Interaction of Compressibility and Adversarial Robustness	Soundness: 4: excellent Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposes a systematic theoretical framework for analyzing how structured compressibility — specifically, neuron-level compressibility and spectral compressibility, affect model’s adversarial robustness. The authors propose to characterize l∞ and l2 operator norms of the parameters by an upper bound that decomposes into (compressibility × Frobenius norm) terms. Building on this formulation, they further derive and analyze an upper bound on the network’s overall Lipschitz constant. They show that compression introduces a few highly sensitive directions that can significantly amplify perturbations, which can then be exploited by attackers, ultimately leading to degraded robustness. The experimental section covers a wide range of architectures, such as FCNs, CNNs and Transformers, validating the theoretical predictions that models with higher structured compressibility are more vulnerable to adversarial perturbations. The results also demonstrate that the vulnerabilities induced by compression persists even under adversarial training and transfer learning, and facilitates the emergence of universal adversarial perturbations. 1. Provides a unified norm-based framework connecting structured compressibility and adversarial robustness. 2. Characterizing the l∞ and l2 operator norms of the parameters by decomposing the effects into compressibility and Frobenius norm terms, thereby further formalizing an upper bound on the model’s Lipschitz constant. 3. The analysis shows that the impact of compressibility on robustness persists in adversarial training and transfer learning, and it can facilitate the emergence of universal adversarial perturbations. 4. The theoretical analysis, mathematical derivation, and experimental process are relatively complete. Although the theoretical analysis is mathematically sound and logically consistent with prior robustness theory, the overall reasoning builds on well-known intuitions (ideas such as model structured compression concentrates sensitivity along a small number of directions in representation space, which in turn results in decreased robustness), and the upper bound mainly formalize this intuition rather than uncover new mechanisms. The experiments, though thorough, largely confirm expected behaviors without surprising counterexamples or deeper causal probes. The theoretical results appear to just extensions, restating known connections in a more formalized way. In addition, the two interventions mentioned in the paper for improving robustness also seem to have been studied. 1. The main theoretical results seem to formalize known connections between compressibility, operator norms, and robustness. Could the authors further clarify the new theoretical insights or perspectives provided by their analysis beyond existing results? 2. While the paper formalizes an upper bound on adversarial robustness in terms of compressibility and Frobenius norm, the bound appears relatively loose—being an order of magnitude above the empirical robustness gap as the authors describe in appendix. As such, its practical utility for predicting or guiding robust model design seems limited. 3. The theoretical analysis appears focused on ℓ₂ and ℓ∞ perturbations. Does the same framework extend to other robustness notions (e.g., ℓ₁, distributional robustness, or label noise)?	Moderately AI-edited
On the Interaction of Compressibility and Adversarial Robustness	Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	To understand the impact of model compression on adversarial robustness, this paper presents an adversarial robustness bound that interprets how structured and spectral compression induce adversarial vulnerability through their effects on the model's Lipschitz constant. Extensive experiments further demonstrate the detrimental impact of model compression on adversarial robustness. Valuable insight into model compressibility in terms of adversarial robustness This work provides a unified theoretical analysis that helps to clarify the relationship between model compressibility and adversarial vulnerability without being constrained to a specific norm-based perturbation. Extensive experiments across various settings This paper presents extensive experiments across different model architectures, learning mechanisms, and adversarial example generation methods, offering a thorough and comprehensive analysis of the relationship between structured compressibility and adversarial robustness. W1: The scenario of generating model compression is unclear. Model compression can be achieved either through a fine-tuning procedure using the full or partial training dataset, or in a data-free manner. In particular, for the former case, model compressibility is closely related to the feature representations of the training dataset in terms of model capacity. I would assume that the considered compressibility is restricted to a fine-tuning procedure; however, a clearer introduction to the model compression setup is needed before the discussion of compressibility. W2: Representation of sensitivity along a small number of directions. The authors claim that structured compression induces high sensitivity to adversarial perturbations along a small number of directions. However, this conclusion cannot be directly drawn merely from the observed correlation between compressibility and adversarial robustness. Lines 216–223 explain that the potent attack directions are determined by interlayer alignment; however, these directions are neither visualized nor formally defined. W3: Unclear explanation for the amplification of adversarial attacks in the representation space. The authors attribute the high adversarial vulnerability of compressed models to amplification along certain sensitive directions. However, it remains unclear how the adversarial attack is amplified and what mechanisms cause this amplification to occur. W4: Lack of analysis on robustness-aware model pruning techniques. This work discusses compressibility mainly after model training. However, many prior approaches achieve better preservation of both standard accuracy and adversarial robustness simultaneously during structured model pruning [1,2,3]. Despite the observed proportional relationship between compressibility and adversarial vulnerability, an additional analysis is needed to investigate the impact of adversarially robust model pruning. --- ### References [1] Zhao and Wressnegger, "Holistic Adversarially Robust Pruning", ICLR 2023. [2] Sehweg et al., "Hydra: Pruning Adversarially Robust Neural Networks," NeurIPS 2020. [3] Ye et al., "Adversarial robustness vs. model compression, or both?" ICCV 2019. Q1: The rationale of interlayer alignment I acknowledge the importance of the alignment between consecutive layers. However, in Lines 216–236, it is not clear what interlayer alignment exactly refers to. Moreover, I am curious about the rationale behind the formal definitions of $A^{\ast}_{\inf}$ and $A^{\ast}_2$. How do these two terms represent interlayer alignment? Q2: The impact on robustness improvement with small compressibility For CNNs (Figure 5) and Transformers (Figure 6), a relatively small spectral compressibility can result in a slight improvement in adversarial robustness. How can this positive effect be explained according to the proposed theory?	Fully human-written
LoRAGen: Structure-Aware Weight Space Learning for LoRA Generation	Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This paper proposes LoRAGen, a structure-aware framework for generating LoRA adapters from natural language, addressing the need for scalable and efficient model customization. The methodology is divided into two stages: Stage 1: Learning a Structured LoRA Latent Space. A LoRA weight autoencoder is trained to map adapter weights to and from a latent space. Its key contribution is a module-aware Mixture-of-Experts (MoE) decoder, where different experts specialize in generating weights for different parts of the network architecture (e.g., attention vs. feed-forward layers). To overcome the issue that multiple low-rank matrices can produce the same adapter, training is supervised directly on the full adapter matrix, ensuring a more robust and meaningful latent space. Stage 2: Text-Conditioned Latent Generation. A diffusion model is trained as a conditional prior over the learned latent space. At inference time, this model takes a natural language task description, encodes it, and generates a corresponding latent vector, which is then decoded into a full set of LoRA weights by the frozen decoder from Stage 1. The primary contributions are the novel module-aware architecture for weight generation, the robust adapter-level supervision strategy, and the demonstration of strong zero-shot performance, which significantly outperforms existing baselines. Originality The originality of this work extends beyond just using a Mixture-of-Experts decoder. Its primary innovation lies in designing a module-aware MoE decoder specifically tailored to the structural properties of Transformer networks. Rather than using a monolithic decoder, this approach allows different experts to specialize in generating weights for distinct components (e.g., attention vs. feed-forward layers), which is a novel and highly effective concept. Furthermore, the paper introduces a unique adapter-level supervision strategy (using direction and spectral losses) to directly address the ambiguity of low-rank matrix factorization—a subtle but critical problem that previous methods have largely overlooked. Quality The paper demonstrates high quality through its rigorous methodology and strong empirical evidence. The entire approach is well-grounded, with its core design principles directly motivated by clear empirical analysis of the LoRA weight space. The experimental validation is comprehensive and convincing, using multiple model architectures and strong baselines. The inclusion of a thorough ablation study, which isolates the contribution of each key component, further attests to the quality and soundness of the research. Clarity The paper is written with exceptional clarity. The authors do an excellent job of building a logical narrative, starting with two core empirical observations, proposing solutions directly tailored to them, and then validating these solutions through experimentation. Complex concepts are broken down and explained in a way that is easy to follow, making the paper highly accessible despite its technical depth. Strengths A Novel and Robust Generation Framework: The paper significantly improves upon previous weight generation methods by introducing a more robust and structure-aware framework. The combination of the module-aware MoE decoder and adapter-level supervision provides a more principled way to learn the geometry of the LoRA weight space. Principled Design Grounded in Strong Empirical Analysis: The method isn't just a collection of techniques; it's a carefully designed solution based on a solid analysis of the problem space. This analytical rigor is a key strength and provides a strong foundation for the paper's claims. Comprehensive and Convincing Empirical Validation: The experiments are a major strength. The method achieves performance on par with, or even outperforming, task-specific LoRAs in some cases. Crucially, it demonstrates excellent generalization in both in-distribution and challenging zero-shot (out-of-distribution) settings, proving that it learns a meaningful mapping from language to weights rather than just memorizing trained adapters. Scalability. Evidence is limited to FLAN-T5-Large (~780M) and Gemma-2-2B. The method’s practical value at the scales where LoRA matters most (7B–70B+) is not demonstrated; compute, memory footprint, and stability characteristics at those sizes remain unclear. Baselines and citation coverage. The evaluation focuses on diffusion-based generators while omitting recent non-diffusion prompt-to-LoRA approaches that also leverage pretrained knowledge, notably Drag-and-Drop LLMs and LoRA-Gen. This gap weakens novelty positioning and makes robustness claims harder to interpret relative to the broader literature. [1] Drag-and-Drop LLMs: Zero-Shot Prompt-to-Weights [2] LoRA-Gen: Specializing Large Language Model via Online LoRA Generation Reproducibility and configuration transparency. Key implementation details are not fully specified in the paper: the autoencoder’s exact topology, latent dimensionalities, MoE settings (E, top-K, routing temperature), per-module head parameterization, training schedules for both stages, and per-module LoRA ranks. Although code is provided, the manuscript itself lacks a consolidated description of these hyperparameters, limiting reproducibility from the text alone. See weaknesses * Why diffusion (vs other generative priors)? What concrete gains do you observe from a diffusion prior over simpler/cheaper alternatives (e.g., MLP/linear prior, VAE, normalizing flows, consistency/CM/rectified flow) on the same latent space? * Sampling cost and scaling to larger adapters. Diffusion can be slow when sampling many adapter locations for deeper network. What is the end-to-end latency per full adapter set, and how does it scale with (i) number of LoRA locations and (ii) LoRA rank (r)? If LoRA rank or the number of locations increases, does the autoencoder need redesign (latent size (d_z), decoder depth/width), or does performance remain stable? * On missing baselines (Drag-and-Drop). Could you clarify why Drag-and-Drop LLMs (prompt→weights hypernetwork) was not included in benchmarks? Was this due to incompatibility, unavailable code, or scope? Given its direct relevance, a brief comparison or discussion of expected differences would help position your contribution. * Decoder architecture/topology details. Please specify the module-aware MoE decoder precisely: number of experts (E), top-(K), shared vs per-module expert pools, router temperature, load-balancing objective, structural embeddings (module/layer dims), expert MLP widths/depths, normalization/activation/residual scheme, and per-module head parameterization (predicting full (\Delta W) vs low-rank factors). A parameter-count and FLOPs breakdown per component would also clarify capacity vs. performance.	Fully AI-generated
LoRAGen: Structure-Aware Weight Space Learning for LoRA Generation	Soundness: 3: good Presentation: 2: fair Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	In this paper, the authors proposed LoRAGen, a structure-aware method that directly synthesizes LoRA parameters from natural language descriptions, to address a critical limitation of traditional LoRA workflows: the need for costly, task-specific training to generate LoRA parameters. LoRAGen can eliminate the need for task-specific data collection and training and is grounded in two key empirical observations about LoRA weight spaces. 1. The motivation of this paper is strong. It provides two observations to inspire the method. 2. LoRAGen generates LoRA parameters directly from task descriptions, bypassing the need for task-specific data collection, annotation, and training. 3. LoRAGen tackles LoRA parameter generation from two key properties of LoRA spaces: 1. the non-uniqueness of Low-rank decomposition and the heterogeneous weight distributions across modules. 4. The empirical results are promising. 1. The writing should be improved. For example, in Line 189, “the the cosine” => “the cosine”. In Line 188 “the adapter similarity similarity” => “the adapter similarity” 2. For decoder-only architecture LLMs, the authors did not present weight distribution analysis. 3. The experiments are only conducted on two small models (FLAN-T5-large and Gemma-2-2B). The authors should involve more popular LLMs, like Qwen3-8B and Llama2-7B, to make the empirical findings more convincing. Please see Weaknesses.	Fully human-written
LoRAGen: Structure-Aware Weight Space Learning for LoRA Generation	Soundness: 1: poor Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The paper tries to directly predict LoRA adaptation weights from fine-tuning task descriptions with a DiT. With empirical findings that $\Delta W$ correlates with task but not $A$ or $B$, and weights at different layers have different spectral entropy, the model is supervised by $\Delta W$ and conditioned on the layer position. An interesting exploration of weight space learning is presented. The empirical observations on which the assumption is made are unclear. The experiment is only carried out on a specific base model with concerns about cross-model generalization. The presentation is not very clear. The meaning of Fig 1 (Left) is quite unclear, and the explanation in L184-187 is ambiguous. "Similarity" is something to be calculated in pairs, but the text doesn't really make it clear. What is the "representative task"? I assume that the authors compute the similarity between 112 other tasks and a single "representative task", i.e. duorc_gqba. Why was this task selected? Then to what extent they are correlated? Is the rho in the Figure Spearman's coefficient? Please state that more explicitly. From the figure I can't really see the correlation clearly, possibly because the two series are at quite different scales; possibly separating the (A, B) and $\Delta W$ series into two figures will make it clearer. Particularly, the correlation looks dominated by certain samples with the largest task similarity; then are the observations applicable to other tasks? Also, $\Delta W$ is not unique either. What will happen if you collect more samples per task? Discussions on empirical observations are repeated in page 2 and page 4, and it's difficult to check Fig 1 when reading page 4. I suggest only briefly mentioning the empirical conclusions in the Intro part and moving Fig 1 to page 4. L357: What is "element-wise averaged LoRA"? L372-373: How is the model applied to a decoder-only base model? I assume that the model is trained on FLAN-T5, including the module/layer embeddings. Also, the target model Gemma-2-2B-Instruct itself can already get good performance on many tasks. Hence I have concerns about the cross-model generalization. I think this is critical for the practical use of this series of methods, as training hypernetwork is expensive, LoRA-tuning itself is already cheap, and there is still a considerable performance gap between generated and trained LoRA. I also have doubts about FLAN-T5's low performance on the benchmarks. Isn't FLAN-T5 already trained on those tasks? L135-136; L365: Should use \citep here L170: observations L348: Team? L361: Missing label? L710: full?	Fully human-written
LoRAGen: Structure-Aware Weight Space Learning for LoRA Generation	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This paper introduces LoRAGen, a method for generating Low-Rank Adaptation (LoRA) parameters directly from natural language task descriptions. The authors argue that LoRA parameter spaces have unique structural properties that are ignored by general-purpose weight-space learning methods. They identify two key properties from an empirical analysis: (1) the non-uniqueness of low-rank decomposition, where task similarity correlates with the full adaptation matrix $\Delta W = BA$ but not with the individual $A, B$ matrices, and (2) the heterogeneity of weight distributions, where different modules (e.g., encoder, decoder) exhibit different spectral properties. To address these, LoRAGen introduces two main innovations: 1. Adapter-level supervision and 2. Module-aware MoE decoder. The overall framework uses a LoRA Weight Autoencoder (LAE) and a conditional latent diffusion model to generate latents from task descriptions, which are then passed to the MoE decoder. Experiments on FLAN-T5 and Gemma-2 models show that LoRAGen achieves performance close to task-specific (oracle) LoRAs on in-distribution tasks and, more importantly, outperforms the T2L (Text-to-LoRA) baseline by nearly 5% on zero-shot generation for unseen tasks. 1. The paper is grounded in a clear and compelling empirical analysis (Figure 1). The two observations (non-uniqueness and heterogeneity) are well-demonstrated and provide a strong "why" for the proposed method. 2. The proposed solutions map directly to the identified problems. The adapter-level supervision (direction and spectral loss) is a very clever way to address the non-uniqueness issue. The authors also correctly note the importance of an efficient implementation (Appendix A.3), which avoids materializing the $d \times d$ matrix and makes the approach practical. 3. The primary goal of such a model is to generalize to new tasks. The 5% absolute improvement on zero-shot generation (Table 3) over the T2L baseline is a significant and meaningful result, demonstrating the value of the structure-aware approach. 4. The method is shown to be effective on both encoder-decoder (FLAN-T5) and decoder-only (Gemma-2) architectures, suggesting the principles are general. 5. The paper is well-organized, and the progression from observation to method to results is logical and easy to follow. 1. The main weakness is the ablation study in Table 4. The model with just the MoE decoder and a standard reconstruction loss ("X", "X", "✓") achieves 95.2% average accuracy. The full model, with the novel adapter-level losses ("✓", "✓", "✓"), achieves 96.0%. This 0.8% difference on in-distribution tasks seems to undermine the importance of the adapter-level supervision ($L_{ang}$ and $L_{spec}$), which is presented as a primary contribution. 2. Related to Weakness #1, the paper argues that the adapter-level losses are critical for generalization and avoiding "memorizing" specific decompositions, which is key for zero-shot performance. However, the ablation study in Table 4 is only performed on in-distribution tasks. The most critical ablation—showing the zero-shot performance of the "MoE + reconstruction" model (the 95.2% one)—is missing. Without this, the central claim that the novel losses are necessary for zero-shot generalization is not fully substantiated. 3. The "Average LoRA" baseline results in Table 2 are clearly an error. The values appear to be copy-pasted from Table 1, and they show scores (e.g., 96.8 on ArcC) that are vastly higher than the "Task-specific LoRAs" (the oracle, 76.7). This is a sloppy error that should have been caught. 1. The ablation in Table 4 suggests your novel adapter-level losses ($L_{ang}$, $L_{spec}$) provide only a marginal (0.8%) benefit on in-distribution tasks over a standard reconstruction loss when the MoE decoder is used. You motivate these losses as being essential for zero-shot generalization. To support this claim, please provide the zero-shot ablation results for the seven unseen tasks (the same setup as Table 3). Specifically, what is the zero-shot performance of the model with only the MoE decoder and a reconstruction loss (the one that scored 95.2% in Table 4)? This is crucial to validate your central contribution. 2. Please correct the "Average LoRA" baseline results in Table 2. The current values are nonsensical, as they significantly outperform the task-specific oracle. 3. Have you analyzed the expert specialization in your MoE decoder? The routing is based on module and layer embeddings, motivated by Obs-2 (heterogeneity). Do the experts indeed learn to specialize on specific module types (e.g., encoder vs. decoder) or spectral-entropy profiles as hypothesized? 4. What is the training time and parameter-count overhead of LoRAGen (for Stage 1) compared to the T2L baseline? How much do the adapter-level losses and MoE decoder add to the computational cost?	Fully AI-generated
Contrastive Negative Preference Optimization for Machine Unlearning in LLMs	Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	This work proposed contrastive negative preference optimization (CNPO) for LLM unlearning. The authors adopted the idea from noisy contrastive estimation (NCE) to derive a novel loss function for unlearning. Mathematically, the proposed loss can be viewed as NPO+a weighted GD-type retain loss, with the weights depending on the similarity between the retain and forget samples. The authors compared CNPO with standard baselines (NPO, SimNPO) on TOFU, MUSE and a new PII benchmark, and showed that CNPO performs the best at balancing between forgetting and utility preservation. I find the idea of contrasting between forget and retain sample for unlearning quite novel and interesting. The authors have conducted extensive experiments to demonstrate the effectiveness of their method. They also curated a new benchmark PII for evaluating unlearning performance, which may benefit future research on LLM unlearning. The writing of the paper is less clear. For example, the notation $x$ is used to denote both the prompt and the forget data in some places. In the main text (eq. 4), there are notations like $r(y_r,y_i)$ and $r(x,y)$, and it is unclear to me what the meaning of $x$ is in these places. Please consider using different notations for prompt and response. Also, I cannot find the definition of $d(x,y)$. 1. The main issue is that in the current eq. 6 (assuming no typo), the softmax weight in the second term does not actually contribute, since when summing over $i = 1, \dots, k$, the weights sum to one and thus cancel out. In this case, the proposed algorithm is close to NPO + GD retain loss, and there is less novelty. 2. In the PII experiment (Figure 3), what are the baselines used for comparison? Are they the original NPO and SimNPO, or the variants that include a retain loss? If they are the original versions, would incorporating a retain loss lead to better trade-offs?	Fully human-written
Contrastive Negative Preference Optimization for Machine Unlearning in LLMs	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This paper proposes Contrastive Negative Preference Optimization (CNPO), a new algorithm for machine unlearning in large language models (LLMs). The method aims to balance forgetting target data and retaining general utility by introducing a contrastive preference-based loss that jointly considers relationships between forget and retain samples. The authors derive CNPO from Noisy Contrastive Estimation (NCE) and preference optimization frameworks, showing that it generalizes Gradient Ascent (GA) and Negative Preference Optimization (NPO) under certain limits. 1. Clear motivation and theoretical grounding. The paper identifies a real gap in LLM unlearning: existing methods (GA/NPO) overlook structural relationships between forget and retain sets. The proposed CNPO provides a principled derivation based on contrastive preference learning, establishing connections to both NCE and NPO through rigorous theoretical analysis (Theorem 3.1, Proposition 1) 2. The synthetic PII dataset is a valuable contribution to the community, simulating privacy-sensitive unlearning tasks while controlling for linguistic entanglement. 3. Experiments on TOFU, MUSE, and PII datasets show consistent improvements, which demonstrates the effectiveness of proposed method. 1. While CNPO’s performance is reported, statistical variance and significance are missing. 2. Limited interpretability of metrics. The metrics for fluency, coherence, and PII leakage rely heavily on GPT-4o evaluation, but the setup lacks calibration details (e.g., number of samples, inter-rater consistency). 3. While the paper evaluates CNPO on TOFU, MUSE, and the proposed synthetic PII dataset, it omits key community-standard benchmarks such as WMDP (Wang et al., 2024), which specifically assess malicious-use forgetting and safety retention. The benchmarks included in the paper primarily focus on sequence-level unlearning, where the goal is to remove specific text patterns or samples. In contrast, WMDP emphasizes knowledge-level unlearning, targeting the removal of factual or harmful knowledge while preserving general capabilities. Including such benchmarks would provide a more comprehensive evaluation of CNPO’s effectiveness in real-world unlearning scenarios and demonstrate its robustness beyond pattern-level forgetting. Please refer to the Weakness section.	Moderately AI-edited
Contrastive Negative Preference Optimization for Machine Unlearning in LLMs	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	The paper proposes Contrastive Negative Preference Optimization (CNPO), a new preference-based unlearning method for large language models. CNPO integrates contrastive learning into the unlearning objective, jointly considering forget and retain data: retain samples act as positives and forget samples as negatives. Using semantic similarity as a proxy preference signal, CNPO adaptively adjusts unlearning strength. The method unifies prior approaches, reducing to NPO with many negatives and to GA in the high-temperature limit, ensuring gradient stability. Experiments on TOFU, MUSE, and a PII dataset show that CNPO achieves strong forgetting while better preserving utility compared with GA, NPO, and SimNPO. 1. The paper conducts experiments on three unlearning benchmarks, including MUSE, TOFU, and a newly constructed PII dataset, providing a comprehensive evaluation of the proposed method. 2. The mathematical formulations are clear and easy to understand. 1. The motivation of the paper is insufficient. After introducing the preliminaries of NPO and SimNPO, the authors directly present CNPO without clearly explaining why NCE is needed or how it addresses the limitations of previous methods. An empirical motivation or ablation study demonstrating the necessity of NCE would make the contribution more convincing. 2. The paper mentions SimNPO, which removes the reference model and achieves better performance than NPO. However, CNPO reintroduces the reference model, even though SimNPO shows that it may not be essential. It would be helpful if the authors could clarify why CNPO requires a reference model and whether a reference-free version of CNPO could achieve even better results. 3. Recently, several works have pointed out that LLM unlearning lacks robustness. For instance, relearning attacks using a small portion of the forget set or jailbreaking prompts can easily recover the forgotten knowledge [1]. Moreover, there are emerging robust unlearning methods that combine unlearning with meta-learning [2] or adversarial training [3,4]. It would strengthen the paper if the authors could evaluate the robustness of CNPO under such attacks and compare it with existing robust unlearning approaches. > [1] Łucki, Jakub, et al. "An adversarial perspective on machine unlearning for ai safety." arXiv preprint arXiv:2409.18025 (2024). > > [2] Tamirisa R, Bharathi B, Phan L, et al. Tamper-resistant safeguards for open-weight llms[J]. arXiv preprint arXiv:2408.00761, 2024. > > [3] Fan, Chongyu, et al. "Towards llm unlearning resilient to relearning attacks: A sharpness-aware minimization perspective and beyond." arXiv preprint arXiv:2502.05374 (2025). > > [4] Sheshadri, Abhay, et al. "Latent adversarial training improves robustness to persistent harmful behaviors in llms." arXiv preprint arXiv:2407.15549 (2024). See weakness.	Lightly AI-edited
Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.	This study points out that existing dataset distillation methods using dense representations, which is hard to capture the relative importance of pixels effectively, leading to redundancy. To achieve efficient sparse representations, this study proposes Gaussian splatting dataset distillation (GSDD), which utilizes Gaussian splatting to parameterize the distilled synthetic images. Through the experimental results, GSDD shows the efficiency and effectiveness improvement. 1. To my knowledge, this is the first work that employs Gaussian splatting in the dataset distillation. Gaussian splatting is widely researched and utilized across diverse fields, thus possessing high potential for extension. 2. The detailed analysis of the Gaussian representation described in Section 3.3 is highly beneficial as it aids in a more thorough understanding of the GSDD. In particular, the significant variation in performance depending on large opacity and size is of particular interest. 3. GSDD achieves strong performance with significant gap on some benchmark datasets. 1. There is a lack of clear justification for the motivation to apply Gaussian splatting to dataset distillation. I believe the current manuscript focuses more on methodologies for efficiently using Gaussian splatting in Dataset distillation than on unique characteristics of Gaussian splatting compared to existing parameterization methods. The issues of redundancy and computational overhead highlighted by the authors as problems in prior research have already been noted in several previous studies[1,2,3,4]. Specifically, while the paper highlights efficiency concerns regarding time and memory for DDiF[5] (the core baseline adopted in this study), this does not sufficiently justify shifting the framework to Gaussian splatting without improving INR-based parameterization. Therefore, an in-depth analysis is required to present new limitations of existing prior research and demonstrate how Gaussian splatting's inherent characteristics can address these. 2. There is insufficient evidence to support the claim that GSDD achieves higher performance than prior research. The contributions of this study are believed to be the first application of Gaussian splatting in the field of dataset distillation and the enhancement of efficiency through various techniques. While this approach achieves high efficiency in terms of time and memory, the lack of evidence and explanation as to why it outperforms prior research makes it difficult to accept. For instance, DDiF argued that the introduction of INR yields higher expressiveness despite using a smaller budget, through theoretical analysis and reconstruction experiments. Furthermore, experiments with a fixed number of decoded instances supported the notion that DDiF's high performance stems from its high expressiveness and diversity. Similarly, a deeper analysis is required to understand the mechanism through which the proposed idea, GSDD, achieves its high performance. 3. There is no analysis of whether GSDD can be applied across diverse application scenarios and achieve high performance. Parameterization methods effective only in specific application area have limitations in their applicability. Several prior studies report experimental results on corrupted datasets[2,3,5] and cross-resolution[5] to evaluate the out-of-domain generalization of each methods. Furthermore, as the primary baseline DDiF is a unified framework for grid-based data across various modalities, it provided performance comparison results across image, video, audio, and 3D voxel datasets. Demonstrating GSDD's superior performance across diverse application scenarios would significantly enhance the impact and applicability of this research. [1] A Comprehensive Survey of Dataset Distillation [2] Frequency Domain-based Dataset Distillation [3] Sparse Parameterization for Epitomic Dataset Distillation [4] Generalizing Dataset Distillation via Deep Generative Prior [5] Distilling Dataset into Neural Field 1. I am curious about the performance of Gaussian splatting-based parameterization without applying the various techniques for improving efficiency (Section 3.2). I am also curious about the performance improvement when adding each technique, extending Table 3. 2. I am also curious whether GSDD also demonstrates high performance and efficiency for image datasets with resolutions greater than 128. 3. Recently, research into soft labels has also been actively explored as an alternative to one-hot labels[6,7,8]. While this work defined by one-hot label encoding (Line 98), I would be interested to see the performance of GSDD when soft label is applied 4. Quantization using bf16 directly influences the final image per class calculation (Line 402–408), so it affects diversity and playing a key role in GSDD performance (Table 3). However, this technique is not limited to GSDD and can also be applied to existing parameterization methods. Therefore, I would like to know the performance comparison when quantization is applied to existing parameterization methods. 5. I understood that each Gaussian primitive contributes to a single distilled image. I am curious whether this could be extended to allow a Gaussian primitive to contribute to multiple distilled images. If possible, this could further enhance efficiency. [6] A Label is Worth A Thousand Images in Dataset Distillation [7] GIFT: Unlocking Full Potential of Labels in Distilled Dataset at Near-zero Cost [8] Heavy Labels Out! Dataset Distillation with Label Space Lightening	Fully human-written
Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper proposes GSDD to address the redundancy and poor scalability issues inherent in previous dataset distillation methods that rely on dense pixel-level representations. Specifically, GSDD introduces a sparse representation based on 2D Gaussian primitives to encode distilled images. Each Gaussian captures region-level discriminative information with only a few parameters optimization. This proposed Gaussian-based representation reduces storage overhead, increases dataset diversity and improves optimization stability. Also, to ensure efficiency and scalability, this paper implements a CUDA-based differentiable rasterizer for parallel rendering of multiple distilled images. Experiments demonstrate that GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets with efficiency gains. 1. The paper is well written with clear and informative figures and well-motivated for essential efficiency issues in dataset distillation field. 2. The proposed GSDD is simple yet effective, with a novel and intuitive idea. The transformation from per-pixel to per-region representation is well motivated and directly addresses the inefficiency of dense pixel-level encoding. 3. The proposed GSDD can be integrated into other DD methods while improving the efficiency and performance. 4. The performance of the proposed method is good with less computational cost. 1. [major] The efficiency of the proposed method relies heavily on CUDA-based parallel rasterizer, which raises concerns about portability and reproducibility. Different GPU hardware-specific optimizations can lead to slightly different acceleration behaviors, the performance and results may vary between different GPU architectures or operator implementations. 2. [major] Since distilled images are generated from sets of Gaussian primitives rather than explicit pixels, this representation may have weaker semantic visualization and interpretability, potentially limiting the model generalization ability across different architectures. As shown in Tables 4 and 10, GSDD performance improvement across different architectures is relatively limited compared to performance on the same architecture. Furthermore, this paper only reports the average results for each architecture without providing detailed comparisons per architecture, making the cross-architecture performance results less convincing and verifiable. 3. [minor] While the proposed method demonstrates efficiency and scalability, experiments have primarily focused on relatively low-resolution datasets, such as subsets of CIFAR and ImageNet (low-resolution version). To better verify whether the method remains effective on high-resolution or more complex datasets, it could be extended to ImageNet full size with full resolution. 1. As mentioned in weakness major 1, it would be helpful if the authors could provide some explanation or empirical evidence regarding how consistent the results are across different GPU architectures. 2. For weakness major 2, it might be helpful if the authors could provide further explanation or analysis on how the Gaussian-based representation affects generalization across different architectures. 3. It might be helpful if the authors could provide some discussion on the scalability of the proposed method to higher-resolution datasets, e.g., full size and resolution ImageNet-1K. Also, could the same Gaussian-based representation be extended to detection or segmentation tasks?	Fully human-written
Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper tackles the problem of data parameterization in dataset distillation (DD). It argues, correctly, that conventional dense pixel-grid representations are redundant, inefficient, and scale poorly. The authors propose GSDD (Gaussian Splatting Dataset Distillation), a novel and efficient parameterization that represents each distilled image as a sparse set of 2D Gaussian primitives. Each Gaussian is defined by 9 parameters (position, shape, color, opacity). These primitives are rendered into images using a highly parallelized, differentiable CUDA-based rasterizer. The central claims are that this sparse representation (1) is more storage-efficient, allowing for greater dataset diversity (more images per class) under a fixed budget , (2) enables a smoother optimization landscape for faster convergence , and (3) is computationally efficient. The method is evaluated on CIFAR-10, CIFAR-100, and ImageNet subsets , where it is shown to achieve state-of-the-art performance and significantly outperform the INR-based DDiF in speed and memory usage. 1. Novel and Clever Parameterization: The core idea of using a sparse set of 2D Gaussians to represent distilled images is highly original in the DD context. It elegantly sidesteps the high-frequency noise issues of pixel optimization (Fig 4) and the computational bottleneck of INR's per-pixel querying. 2. Efficiency vs. INRs: The paper provides compelling quantitative evidence (Figure 5) that GSDD is dramatically faster (both forward and backward) and more memory-efficient than the state-of-the-art functional parameterization, DDiF . This is a strong practical advantage. 3. Performance (on tested benchmarks): The method achieves state-of-the-art results on all tested benchmarks (CIFAR-10, CIFAR-100, and ImageNet subsets), outperforming numerous baselines across various IPC settings. 1. Critical Omission of ImageNet-1K: The most significant weakness is the failure to evaluate on full ImageNet-1K. The paper is titled "Efficient Dataset Distillation" and repeatedly claims scalability. However, it avoids the standard large-scale benchmark where scalability is truly tested. Recent SOTA methods in scalable DD (e.g., SRe2L, RDED, etc.) all report 1K results. Without this comparison, the central claim of scalability is unsubstantiated. The results on "ImageNet subsets" are insufficient. 2. Confounding Initialization: The initialization procedure is a major methodological flaw. The model is "pre-trained" to match real images via MSE loss. This raises two problems: - Performance Confound: How much of the final SOTA performance is simply due to this high-fidelity, real-data-based "warm start" rather than the superiority of the GSDD parameterization within the distillation process? - Privacy Contradiction: This initialization method directly leaks data from the original dataset, which contradicts the paper's (and the field's) stated motivations for privacy preservation. A method that requires fitting to real images cannot be straightforwardly used in privacy-sensitive scenarios. 3. Hyperparameter Complexity: The method introduces a critical new hyperparameter trade-off: the number of Gaussians per image ($M$) vs. the number of Gaussian Images Per Class (GPC). While the paper explores this trade-off (Fig 3c, 3d), it's unclear how to optimally set these values for a new dataset and budget, making the method less of a simple drop-in replacement than advertised. 1. ImageNet-1K: The most pressing question is the lack of full ImageNet-1K experiments. Given the paper's focus on efficiency and scalability, why was this benchmark omitted? Can you provide results for GSDD on ImageNet-1K, comparing it to scalable SOTA methods like SRe2L and RDED? 2. Initialization Ablation: What is the performance of GSDD if the Gaussians are initialized randomly (e.g., random positions/colors, small isotropic covariances) instead of being pre-fit to real images? This is a crucial ablation to understand the true source of the performance gains and to validate the method's feasibility for privacy-preserving applications. 3. Privacy: How do you reconcile the claim of supporting "privacy-preserving research" with an initialization method that explicitly fits the synthetic data to real data samples? 4. Gaussian Escape: In your ablation (Table 3), the "w/o boundary" setting shows a performance drop. Is this drop due to removing the loss term (Eq 10) or the hard clipping (Eq 11)? What happens if you only use the loss, or only the clipping? I am willing to raise my score if the authors can provide a comprehensive comparison on ImageNet-1K against relevant SOTA methods (like SRe2L and RDED) and address the confounding effect of their real-image-based initialization in their rebuttal.	Fully AI-generated
Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation	Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.	The paper proposes GSDD, a dataset distillation method using a sparse 2D Gaussian representation to replace dense pixels and costly implicit neural representations, aiming to enhance efficiency and scalability. According to the reported results, it achieves notable speed and memory gains over INR-based baselines (e.g., DDiF) while maintaining comparable or better performance across different DD algorithms and architectures. 1. This is the first work to introduce a parametric Gaussian framework into the field of dataset distillation. 2. The method employs customized CUDA operators to significantly improve computational efficiency. Major: 1. The authors emphasize in the introduction and motivation that DDiF struggles with larger and higher-resolution datasets, yet the experiments in this paper do not substantiate this claim. Specifically, DDiF is evaluated on the ImageNet subset with a resolution of 256×256, while GSDD experiments are limited to 128×128 or even 32×32. To convincingly demonstrate GSDD’s superior generalization, it should at least show improvements on datasets of comparable scale and resolution. 2. The method section provides extensive background on parametric Gaussian modeling but lacks theoretical justification for why Gaussian functions are inherently more effective than INRs. There is no formal reasoning or analysis showing that 2D Gaussian bases offer intrinsic advantages over other differentiable bases (e.g., wavelet or trigonometric) for gradient matching in dataset distillation. As a result, the choice of Gaussian appears heuristic and engineering-driven rather than grounded in theoretical insight. 3. The paper does not theoretically formalize the superiority of this representation. Most arguments focus on engineering aspects such as CUDA-based rasterization being faster than neural network querying, rather than proving that sparse explicit primitives are fundamentally more suitable for encoding discriminative knowledge than implicit continuous functions. 4. The paper lacks sensitivity or ablation studies on key hyperparameters, particularly the number of Gaussian primitives per image and their initialization strategy. Since these parameters critically affect performance, the absence of such analysis makes it difficult to assess the robustness and practical applicability of the method. 5. The paper claims to evaluate cross-architecture transfer from ConvNet to ResNet and VGG. However, Table 4 only shows the performance of different DD methods on ImageNet subsets and does not present any cross-architecture results, so the claimed generalization across architectures is not supported. Minor: 1. The notation $R$ in Equation (5) is undefined. 2. The reference “Sequential subset matching for dataset distillation” has inconsistent author formatting compared to other entries. See weakness.	Lightly AI-edited
Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models	Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper proposes Meta Policy Optimization (MPO) for the RLAIF setting, inspired by the psychological concept of metacognition. In this framework, a meta-reward model (MRM) dynamically refines the reward model’s prompt throughout training. The MRM monitors the evolving training landscape by processing the prompt instructions, reference solutions (if available), policy generations, the reward model’s evaluation rubric, and the scores assigned to those generations. Using this information, the MRM identifies weaknesses in the current rubric that the policy may be exploiting (or is likely to exploit) and modifies the rubric to make it more targeted and fine-grained. This helps the reward model resist reward hacking by the policy and promotes more stable policy optimization. MPO reduces the need for manual prompt design and proves effective across diverse tasks without requiring specialized reward designs. Experimentally, the authors show that MPO outperforms PPO with expert or auto prompting across four different domains. The paper proposes Meta Policy Optimization (MPO) for the RLAIF setting, inspired by the psychological concept of metacognition. MPO addresses reward hacking in RLAIF by introducing a meta-reward model (MRM) that periodically refines the reward model’s prompt during training. This ensures that the evaluation rubric remains granular, targeted, and resistant to exploitation by the policy. MPO is a timely contribution toward mitigating reward hacking in RLAIF and promotes more stable policy optimization. Furthermore, the prompts used for the MRM are general and task-agnostic, enabling their usage across diverse domains. MPO demonstrates strong effectiveness compared to approaches that rely on static, hand-crafted prompts, even those designed by domain experts, across diverse tasks such as essay writing, summarization, ethical reasoning, and mathematical reasoning, showcasing its versatility. Additionally, MPO reduces the burden of prompt engineering by automatically refining the reward model’s prompt throughout training based on the observed training context. Finally, the paper is clearly written and well-organized, making it easy to follow. The sample selection process for rubric refinement is completely random. Samples drawn from much earlier stages of training may not be informative, as the training context and policy behavior could have evolved significantly. Given this, it may be more effective to prioritize recent samples or those with higher estimated informativeness when updating the rubric. Such an approach could make the refinement process more adaptive to the model’s current failure modes. As the MRM continuously evolves the reward model’s rubric, the rubric appears to become increasingly complex and fine-grained over the course of training. This process resembles inferring highly detailed reward functions that fit the observed training context but may not generalize well to unseen or downstream tasks. In light of this, it might be useful to regularize the inferred rubric, for instance, by penalizing excessive complexity or enforcing smoothness constraints, to improve generalization and stability. Another concern is that the scoring scale of the rubric can change dynamically during training. At one point, the maximum score might be 30, whereas at a later stage it could increase to 50. This variability may lead to inconsistent reward magnitudes for the policy. To address this, it would make sense to use a normalized reward score, $s\in [0,1]$, ensuring a consistent and comparable reward scale across training iterations. Finally, the experiments are primarily conducted on smaller models (e.g., Qwen2-1.5B-Instruct), with limited evaluation on larger LLMs due to resource constraints. This leaves open questions about scalability, in particular, whether MPO remains effective and stable as model size increases and training dynamics become more complex. 1) Does the rubric becoming more complex over the course of training, as it is evolved by the MRM, affect the generalization performance of the LLM aligned via MPO? Wouldn't regularizing the rubric help improve generalization without sacrificing MPO's ability to curtail reward hacking? 2) Since the rubric scoring scale can change over the course of training, wouldn't it be better to use a normalized score as the training signal for the RL algorithm? 3) Do you have experimental results for other model scales (3B, 7B, 13B, etc.) and potentially other models (e.g., LLaMA) for the policy? 4) In Section 3.3.1, PPO-aligned 32B_AP receives the highest rating in pairwise Elo evaluations. The hypothesis was that the GPT-4o judge favors outputs from models aligned using evaluation rubrics it helped generate. Why was this not observed in the results for the essay writing task? 5) Why were 72B RM and MRM sizes used only for the essay writing task and not for the other three domains?	Fully human-written
Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models	Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper introduces Meta Policy Optimization (MPO), a novel framework that tackles two persistent challenges in Reinforcement Learning from AI Feedback (RLAIF) for large language models: vulnerability to reward hacking, where models exploit flaws in the reward signal, and the heavy reliance on brittle, manual prompt engineering for reward models. Inspired by metacognition and evaluative thinking, MPO augments the standard RLAIF pipeline with a meta-reward model (MRM) that dynamically refines the reward model's prompt throughout training. Empirically, MPO demonstrates significant advantages across diverse tasks spanning the depth-breadth spectrum of evaluative thinking, such as essay writing, summarization, ethical reasoning, and mathematics. It achieves performance on par with or superior to models using extensively hand-crafted prompts, while crucially preventing policy collapse due to reward hacking, as observed in fixed-prompt setups. - MPO directly addresses two of the most significant pain points in RLAIF: reward hacking and the immense burden of manual prompt engineering. - A major strength is the extensive empirical validation across four distinct tasks, each representing different challenges on the spectrum of evaluative thinking. - The paper goes beyond simply reporting results to analyze how MPO works. The discussion on the evolution of the rubric's linguistic structure provides valuable insights into the framework's inner workings. - The entire MPO process hinges on the quality of the MRM's analysis and refinements. The paper does not deeply explore what happens if the MRM itself is flawed, generates poor rubrics, or introduces new biases. - While the paper reports an 11% compute overhead and argues it is modest, this is a critical factor for adoption. - The paper demonstrates strong results on single-turn generation tasks. How would you envision and potentially adapt the MPO framework for multi-turn interactive tasks, such as dialogue or long-horizon instruction following? - Your results show that performance improves with the size of both the RM and MRM. Could you discuss the interplay between the policy model size and the effectiveness of MPO? What are the optimal scaling relationships between these three components?	Moderately AI-edited
Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper proposes Meta Policy Optimization (MPO), a framework that dynamically refines the evaluation rubric used by a reward model (RM) during reinforcement learning from AI feedback (RLAIF). The core idea is to introduce a meta-reward model (MRM) that periodically analyzes the RM’s scoring behavior and updates its rubric to mitigate reward hacking and reduce reliance on manual prompt engineering. The authors evaluate MPO on four tasks—essay writing, summarization, ethical reasoning, and mathematical reasoning—reporting improvements over fixed-prompt baselines, including hand-crafted and AutoPrompt-generated rubrics. - The motivation—addressing reward hacking and prompt brittleness in RLAIF—is well-articulated and practically relevant. - The MPO framework is conceptually clean and integrates naturally into existing PPO-based pipelines. 1. Limited experimental scope and reliability: The evaluation is conducted exclusively on a single small policy model (Qwen2-1.5B) and only with PPO. This raises concerns about the generalizability of the findings. In the current RLAIF literature, standard benchmarks such as Arena-Hard-v2 or Alpaca-Eval are expected for alignment claims, yet these are entirely absent. Without results on more representative models (e.g., 7B+ scale) or alternative RL algorithms (e.g., GRPO), it is unclear whether MPO’s benefits are robust or merely artifacts of a narrow setup. 2. Rubric design appears misaligned with task heterogeneity: The paper implies that a single evolving rubric is shared across all queries within a task (e.g., all essay prompts use one rubric). However, it is natural that different queries may require distinct evaluation criteria (e.g., creativity vs. factuality in essays). The current design risks oversimplifying the complexity of human preferences. 3. Oracle rubric and evaluation protocol lack rigor: The “oracle” rubric is derived from 60+ PPO runs on the same small model—an ad hoc and non-standard baseline. More convincingly, the paper could compare MPO’s evolved RM against a much stronger fixed judge (e.g., Qwen-2507-235B or GPT-4o) to assess whether dynamic rubric refinement truly closes the gap with top-tier static evaluators. As it stands, the claim that MPO “surpasses human-engineered prompts” is overstated given the weak oracle baseline and reliance on GPT-4o as the sole judge (which may favor its own prompt styles). seed weakness	Fully AI-generated
Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models	Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	The paper proposes Meta Policy Optimization (MPO), which augments standard RLAIF by adding a meta‑reward model that periodically refines the reward model’s evaluation prompt/rubric using recent training context. This evolving rubric aims to reduce reward hacking, lessen manual prompt engineering, and provide more discriminative rewards over time. The method is instantiated with small policy models and larger RM/MRM pairs, and evaluated across essay writing, summarization, ethical reasoning, and math reasoning. Results show consistent gains over fixed‑prompt PPO baselines and robustness against typical reward‑hacking failures; on essay writing, MPO surpasses even heavily hand‑engineered “oracle” prompts under similar compute. Practicality: Turning reward design into an evolving rubric reduces brittle prompt engineering and mitigates fixed‑prompt reward hacking. Generality: The framework is task‑agnostic and plugs into PPO‑style RLHF/RLAIF pipelines; evaluations span writing, summarization, ethics, and math. Evidence of robustness: Qualitative/quantitative analyses show MPO detecting and correcting gaming behaviors (e.g., title‑only outputs), while fixed‑prompt PPO can collapse. Analysis: Tracks rubric growth and stricter scoring over iterations, explaining how finer evaluation criteria can yield more informative gradients. 1. Model scaling: It is unclear why experiments focus on Qwen2‑1.5B and Llama‑3.1‑8B only; a systematic sweep across Qwen2.5 scales (e.g., 0.5B→7B) under matched setups would better reveal scaling trends and improvement curves. 2. Benchmarks: Important, contemporary alignment/reasoning suites are missing; adding AlpacaEval 2.0 (length‑controlled) and Arena‑Hard variants would strengthen generalization and robustness claims. 3. Role mapping: The conceptual mapping between reward models and LLM‑as‑judge is muddled. Recent work often finds LLM‑as‑judge competitive or superior to trained reward models; the “student/junior/senior instructor” analogy would be clearer if the policy is the student, the LLM‑as‑judge the instructor/judge, and the reward model a distilled proxy of that judgment. Clarify roles and justify terminology. 4. Objective choice: The training objective is under‑motivated. Given a focus on reward shaping and rubric evolution, comparisons with DPO or GRPO would isolate whether gains come from MPO itself or from PPO specifics. 5. Baselines: Please add strong recent verifiable‑reward pipelines (e.g., RLVR‑style systems) and widely used public suites for instruction‑following/reasoning. Self‑configured evaluations are valuable but less convincing without head‑to‑head comparisons against recognized baselines. See weakness section.	Heavily AI-edited
Protein as a Second Language for LLMs	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This paper introduces the "Protein-as-Second-Language" framework, which aims to enable large language models (LLMs) to interpret protein (amino acid) sequences as if they were acquiring a second symbolic language. By curating a bilingual dataset of almost 80k protein-question-answer triples and implementing an adaptive context construction mechanism, the approach supplies protein sequence–question–answer exemplars as in-context learning cues for frozen LLMs. Extensive experiments on multiple protein-text QA benchmarks demonstrate that LLMs guided by this framework substantially outperform their zero-shot counterparts, and in some cases, even domain-specialized, fine-tuned protein LLMs. 1. The paper cleverly reframes protein sequences as a "second language," allowing general-purpose LLMs to build protein-function mappings in a zero-shot regime. This paradigm bridges symbolic biological and natural languages, bypassing the need for task-specific fine-tuning or retraining and creatively leverages LLMs' in-context learning strengths. 2. A substantial and well-constructed bilingual (protein-natural language) dataset is curated, addressing redundancy at both sequence and annotation levels. Figures 1 and 3 illustrate a careful, step-wise reduction of redundancy with diverse coverage across species, protein families, superfamilies, and ontology categories. 3. The method is evaluated on multiple datasets (ProtDescribe, Protein2Text-QA, Mol-Instructions) with frozen, general-purpose LLMs (Qwen, GPT-4o, Mistral, Kimi) and compared against strong protein-LLM baselines (BioT5-plus, ProLLaMA). Main results (Table 1) and multiple figures demonstrate robust performance gains in both automatic (ROUGE-L) and human evaluations, frequently surpassing or matching fine-tuned specialized models. Weaknesses 1. The methodology’s underpinnings, particularly the bilingual context construction mechanism (Section 3.2), lack precise mathematical formalization. While Equation-based thresholds (for sequence and annotation deduplication, Section 3.1.1/3.1.2) are provided, the procedure for query-to-context matching (how candidate exemplars are scored and integrated, and any ranking or aggregation formulae) is only described at a high level. For instance, does selection use hard thresholds or similarity-weighted composition? What is the precise mathematical form of the query-context similarity metric for textual and sequence components? Without a formal definition of, for example, the joint scoring function or aggregation, it is difficult for others to re-implement or to affirm the reproducibility or generalizability of the context construction process. Explicit equations or algorithms are expected for a submission at this technical level. 2. While the empirical evaluation is extensive by the standards of current protein-language model research, critical baselines are missing: - Direct comparison with leading analogy/reasoning-augmented LLM paradigms, such as those leveraging knowledge graphs (e.g., ANALOGYKB[1]), hierarchical retrieval (e.g., BeamAggR[2]), or reinforcement-learning-based self-correction (e.g., SeRL[3], AlphaEdit[4], Self-Correct RL[5]), is not present. - There is also a missed opportunity to benchmark efficiency: the computational cost of in-context "second language" understanding (which requires substantial prompt assembly with many exemplars) is never compared directly to the one-time fine-tuning cost of protein LLMs, or to parameter-efficient adaptation approaches (e.g., S-LoRA[6], MoDeGPT[7]). Without this, claims of scalability remain qualitative. 3. The central claim—LLMs can generalize protein understanding more efficiently through contextual exemplars—is only validated empirically under a restricted set of Q/A regimes. There is no attempt to analyze theoretical properties: e.g., under what assumptions does the contextual analogy paradigm guarantee generalization or compositional reasoning for proteins? What are failure modes if distributional shifts exist (out-of-distribution proteins/annotations not covered by exemplars)? The paper could benefit from at least some basic analysis or a discussion of the limitations. references [1] Yuan, S., Chen, J., Sun, C. (2024): "ANALOGYKB: Unlocking Analogical Reasoning of Language Models with A Million-scale Knowledge Base" [2] Chu, Z., Chen, J., Chen, Q. (2024): "BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering" [3] Fang, W., Liu, S., Zhou, Y. (2025): "SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data" [4] Fang, J., Jiang, H., Wang, K. (2025): "AlphaEdit: Null-Space Constrained Model Editing for Language Models" [5] Kumar, A., Zhuang, V., Agarwal, R. (2025): "Training Language Models to Self-Correct via Reinforcement Learning" [6] Wu, Y., Piao, H., Huang, L. (2025): "S-LoRA: Scalable Low-Rank Adaptation for Class Incremental Learning" [7] Lin, C., Gao, S., Smith, J. S. (2025): "MoDeGPT: Modular Decomposition for Large Language Model Compression" 1. Can the authors formalize the context construction mechanism, especially the mathematical definitions of exemplar ranking, weighting, or aggregation (e.g., is there a score function for context selection, or are choices made heuristically)? Please clarify with explicit pseudocode or equations. 2. What is the full breakdown of annotation QA pass/fail in the 5% discarded portion of the dataset? Are there any systematic biases or edge cases in the rejected data? 3. Did the authors attempt to benchmark model inference/prompt assembly time versus fine-tuned protein LLMs, or estimate resource requirements (memory/latency) for large context window assembly in practical use? 4. Could the authors compare directly with analogy-driven or multi-hop retrieval LLM frameworks (such as ANALOGYKB, BeamAggR) and efficient adaptation methods (S-LoRA, MoDeGPT)? 5. Can the authors clarify the apparent error in the human-in-the-loop Krippendorff's $\alpha$ (is 0.72% a typo?), and provide more detail on the distribution of human evaluations by task/model?	Fully AI-generated
Protein as a Second Language for LLMs	Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.	This work introduces a novel question-answering (QA) dataset focused on protein expression, localization, mechanism, and interaction. The authors also propose a retrieval-based framework that enables pretrained, generic large language models (LLMs) to analyze unknown protein sequences using an in-context learning approach. By including similar proteins and their corresponding descriptions in the prompt, the paper reports an average 7% improvement in the ROUGE-L score on QA tasks for the target unknown protein. * This work proposes a framework that eliminates the need to train or fine-tune a task-specific LLM for protein sequence analysis. * The paper employs dual criteria to retrieve similar proteins for augmenting the prompt, considering both sequence and text similarity. * The overall novelty of the paper appears limited. Prior works, particularly ProtEx from Google DeepMind, already explored the possibility of in-context learning for biological entity analysis. This submission however, neither mentions nor compares with them. * The similarity-based retrieval likely limits the generalization ability of the model on unseen protein sequences that are very different from existing ones in the database. A 70% threshold is applied on sequence similarity when constructing the dataset and retrieving for the prompt, but novel or orphan proteins often share <40% identity to the existing database. Please address my two major concerns in the “Weaknesses” section first. I will reassess after the rebuttal. Other miscellaneous questions are as follows: * Were the quality and correctness of the augmented QA from DeepSeek-R1 verified somehow? Teacher LLMs are known to hallucinate, especially on complex scientific topics like biology. * Are GO annotations an effective criterion for grouping and redundancy reduction? As far as I understand, a group of proteins might be very different from each other even though they share similar GO annotations. This is especially the case when some proteins are not very well annotated and have very few GO annotations. * The automatic evaluation solely relying on the ROUGE-L score is most likely not sufficient. Other metrics like accuracy (particularly for True or False QA), BLEU score, and BERTScore might provide a better understanding of the improvement in performance. * Besides, a human evaluation is conducted in the paper, but the domain knowledge or expertise of these evaluators is not mentioned (maybe I missed it), making the reliability of this evaluation questionable. * The term “zero-shot” used multiple times in the paper is a bit misleading, because retrieved similar proteins in the prompt may provide contextual information. The framework described in the paper is more likely a “few-shot” one. * Numbers in Figure 4 (left) are not consistent with the bars. I would recommend a thorough double-check of all the numerical results presented in the paper.	Fully human-written

PreviousPage 29 of 1516 (75800 total rows)Next