ICLR 2026 - Reviews

SubmissionsReviews

Reviews

Summary Statistics

EditLens Prediction Count Avg Rating Avg Confidence Avg Length (chars)
Fully AI-generated 15899 (21%) 4.43 3.58 3687
Heavily AI-edited 3233 (4%) 4.22 3.59 2990
Moderately AI-edited 7082 (9%) 4.20 3.61 2722
Lightly AI-edited 16648 (22%) 4.15 3.68 2746
Fully human-written 32938 (43%) 4.13 3.62 2917
Total 75800 (100%) 4.21 3.62 3026
Title Ratings Review Text EditLens Prediction
TruncProof: LL(1)-Constrained Generation in Large Language Models with Maximum Token Limitations Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents TruncProof. It allows for strict LLM generation adherence to the token-budget, while still generating syntactically correct outputs. The constrained generation approach is based on LL(1) parsing of context free grammars. The authors performed experiments on Text-to-JSON instructions and code generation tasks, improving over the existing approaches such as Outlines, Syncode, and XGrammar. The approach enhances the semantic robustness of the JSON and C output by leveraging decoding strategies such as Beam Search and Monte Carlo Tree Search. * Interesting and potentially useful technique for controlling LLM output with strict token budget. * Technical approach is intuitive and effective. * The experimental results show that TruncProof is effective on challenging function calling/coding tasks under strict token budget. * The paper is well written. * The experiments are performed only for two budget threshold. A better understanding of the tradeoff between the number of tokens and the syntactic&semantic correctness would be welcome. * The approach may over constrain the semantic generation. The paper presents an interesting practical technique for controlling the output of LLMs. While existing constrained decoding algorithms have shown how to ensure the LLMs only generate legal tokens, they don’t have means to ensure that the output will be properly generated in a fixed amount of time (tokens). Thus, many LLM generations in JSON/coding tasks have occurred because the model was not able to close all scopes etc. The approach is intuitive, yet demonstrated to be effective. The algorithm computes the approximate shortest token length derivable from each nonterminal ahead of time. It further track the number of tokens consumed and can adjust the rest of generation to follow the path until guaranteed completion. The authors show that this approach has reasonable worst-case time and space complexity, although I would have appreciated if it also included the estimates of expected time/space consumption (and relate the algorithm properties with the grammars they used in the evaluation). The experiments compare TruncProof with several state-of-the-art tools for constrained decoding. The results show the significant improvement in syntactically correct codes under a tight token budget. However, the paper presents only two (the aggressive in the paper; the permissive in the appendix) limit. It is important to understand the tradeoffs for multiple token budget thresholds, between the two extremes presented in the paper and in the appendix. Showing these additional results with even one selected decoding strategy would be valuable. Further it would be interesting to see the behavior of the approach on a reasoning LLMs and how token budgets impact their performance. Overall, with strict constraining it may be possible the model yields suboptimal results regarding semantic correctness. Finally, I enjoyed reading the paper and found it well-written. Fully human-written
TruncProof: LL(1)-Constrained Generation in Large Language Models with Maximum Token Limitations Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper augments constrained-decoding for LL(1) grammars to take into account a length constraint on the numbers of tokens. Given a prefix and a next token, one wants to only continue with that token if there exists a valid grammatical completion that uses only as many tokens as the length constraint allows. In general this problem is decidable but one has to make some tradeoffs with efficiency to avoid expensive computations that will cause masking to be expensive. The key observation of the paper is that for LL(1) computing a bound on how many tokens are needed to complete a sequence is pretty easy as the stack of an LL(1) parser tells us exactly how to do that (we just need to take the sum of all the shortest completions of each nonterminal/terminal on the stack). The approach is evaluated on very restrictive grammars to illustrate how normal constrained decoding would (unsurprisingly) fail to abide to the limited number of allowed tokens. - I think the problem of dealing with a bounded number of tokens is a very good problem to study as this is the main failure mode of constrained decoding approaches; they tend to go on and on generating non-sense just to follow the constraints. - I like the observation that for LL(1) whether one can stay within the bound is easier to compute - The results are encouraging at showing that this approach provides some flexibility in setting bounds on numbers of tokens - LL(1) is a very limited set of grammars. Most of the grammars used in existing GCD papers (e.g., https://icml.cc/virtual/2025/poster/45613) are not LL(1). - It is a bit disappointing that the paper, despite targeting a small fragment (LL(1)) still computes an overapproximation of how many tokens a completion will result in (specifically the work doesn't consider that LLM tokens may span across different PL tokens (e.g., if an LLM uses the token *b")*, which spans a literal, a quote, and a parenthesis, eq(5) will count this token at least twice, once for completing the literal, and once for the tree completion - Setting the token budget to 1.1X the ground truth is a bit of an arbitrary evaluation. - The paper doesn't report the preprocessing cost of their tool (other papers who have done so in the past, had very high preprocessing costs) - The tokens/sec reported in Table 1 are not for Sota tool. Should consider, for example, llguidance Other comments: The paper is missing many of the more current works for constrained decoding. For SOTA methods, they should really cite LLGuidance (https://github.com/guidance-ai/llguidance) since it squarely defeats XGrammar and other tools across all metrics, and GreatGramma (https://icml.cc/virtual/2025/poster/45613) since it supports the most complex grammars while guaranteeing reasonable processing times. I don't understand the point at line 121-122 distinguishing top down and bottom up parsers. I feel like the problem won't be much harder for LR(1) parser. One can use some dynamic programming to compute the shortest completion possible. - What is the preprocessing cost of the tool? - Line 374 mentions "As discussed in... TruncProof with grreedy decdoing does not fully account for semantic correctness". What does this mean? This aspect is not discussed - Table 1 mentions 100 syntax correct for Greedy/Ours, but not 100 for schema/exact. Shouldn't Greedy generate always the same output? - The fact that TruncProof proposes the Exact-match so many times is a bit suspicious and makes one think the search space is designed for the tool to do so. What happens with e=2? Fully human-written
LONGSHIELD: SCALABLE DISTRIBUTED DIFFERENTIALLY PRIVATE TRAINING FOR LONG-CONTEXT LLMS Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents LongShield, a distributed framework for differentially private (DP) training of large language models (LLMs) under long-context settings. The method integrates context parallelism (CP) with per-sample gradient computation to achieve long-sequence scalability while maintaining DP guarantees. 1. Addresses an underexplored but important problem: long-context DP training. 2. Demonstrates strong empirical results with practical improvements in scalability and throughput. 3. Provides a clear implementation pathway compatible with Opacus and TorchTitan frameworks. 1. The core idea (computing per-sample gradients under CP) is a direct extension of standard CP, replacing the local (p, d) gradient computation with (B, p, d) per-sample gradients. There is no new parallelism algorithm or architectural innovation beyond this dimensional extension. 2. The “input-stationary vs. output-stationary” trade-off is not new; it mirrors prior analyses in FlashAttention-2 and Megatron-Ulysses. The overlap strategy is a routine engineering optimization rather than a conceptual advance. 3. The paper emphasizes that FSDP cannot shard per-sample gradients while CP can, but this follows trivially from the existing data partitioning dimensions. It is a property of CP’s tensor layout, not a new design. 4. The hook management fix for Opacus checkpointing is more of an implementation detail than a contribution; it does not introduce a new checkpointing strategy or fundamental compatibility solution. 5. There is no formal analysis of privacy-utility trade-offs or communication complexity, and most claims rely purely on empirical evidence. 6. The authors do not justify why long-context DP training is necessary or beneficial. While long-context modeling is relevant for general LLMs, it remains unclear whether the same motivation applies to private training. The experiments focus on throughput and context length scaling but omit any analysis of model utility, privacy-utility trade-offs, or downstream performance (e.g., perplexity, zero-shot tasks). 1. Could the authors clarify whether any communication primitives or kernel fusion were re-implemented, or are all collectives reused from Megatron CP/Ulysses? 2. How does LongShield differ in actual code structure from integrating DP-SGD directly into existing CP frameworks? 3. Would privacy accounting or gradient clipping strategies change under model-parallel settings, or is the approach orthogonal to parallel dimension? 4. Why is long-context DP training practically needed? Are there real-world datasets or use cases that specifically require both privacy and long context? Fully AI-generated
LONGSHIELD: SCALABLE DISTRIBUTED DIFFERENTIALLY PRIVATE TRAINING FOR LONG-CONTEXT LLMS Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces LongShield which integrates context parallelism to DP (differentially private) training method aka DP-SGD. The main contribution of the paper is to integrate context parallelism on top of FSDP (ZERO-DP+) for training LLMs with large context using differentially private algorithms such as DP-SGD. The paper also enables DP-safe activation checkpointing to extend context further. On Llama 3.1 8B with 4× NVIDIA H100 GPUs, LongShield scales sequence length from 4k to 16k compared to the state-of-the-art ZeRO-DP, achieves linear sequence-length scaling, shrinks the throughput gap from 67% to 8.9% while matching non-DP memory usage, and reaches a 64k context length with activation checkpointing. 1. The paper identifies that for large context lengths, ghost clipping introduces significant memory overhead and shits to computing pure grad sample to integrate context parallelism to ZERP-DP+ 2. LongShield keeps per-sample gradients shards local to each GPU to avoid full materialization, overlaps per-sample gradient aggregation with backward computation to sustain throughput. 3. The paper enables DP-safe activation checkpointing to extend context further. 4. The paper presents experimental results on Llama 3.2 1B, Llama 3.2 3B, and Llama 3.1 8B over 4× H100 GPU and show that LongShield scales sequence length from 4k to 16k compared to the state-of-the-art ZeRO-DP. 1. The main weakness of the paper is that LongShield computes per-gradient sample which is notorious for preserving the per-sample gradient over the entire model instead of using Fast Gradient Clipping (FGC) to avoid ghost clipping's $T^2$ overhead. This choice has not been justified if there is indeed some benefit (in-terms of throughput) in computing per-sample gradient of the entire model instead of FGC with two backward passes. 2. The paper was hard to follow, please add a pseudo-code similar to algorithm 1 for LongShield, ZERO-DP and ZERO-DP+. Please explain technical details of CP and how it integrates along with FSDP for DP-SGD more clearly. 3. No codebase is provided. Questions: 1. "We adopt the pure gradient-sample (GS) approach to avoid ghost overhead." -- This is unclear to me. I understand that Ghost Clipping introduces memory overhead proportional to $T^2$ but this can be eradicated by shifting to Fast Gradient Clipping. Why do we need to store the entire per-sample gradient instead of using Fasting Gradient Clipping + Two backwards passes if memory proportional to $T^2$ was indeed the bottleneck? 2. ZERO-DP works with per-layer gradient clipping which doesn't yield good utility in terms of test accuracy. For ZeRO-DP+ results, are the experiments conducted for flat clipping instead of per-layer? 3. In my opinion, integrating CP and FSDP into DP-SGD with Fast Gradient Clipping is trivial. If this is not the case, then can the authors explain why? 4. "For example, mixed ghost norm choose ghost for Llama 3.1 8B final linear layer up to T= 16k. However, the ghost norm is 4× more FLOPs than directly evaluating the per-sample gradient, and the final dot product between two large intermediate tensors (𝑂(𝐵𝑇2)) causes a similar time, according to our profiling, due to the reduction nature." -- this statement is not clear to me. Mixed GC will pick GC or FGC based on minimum overhead condition and GC has more FLOPs than FGC because 1) FGC includes one matmul of BxTxp and BxTxd, where as 2) GC has two matmuls (BxTxp with BxpxT, BxTxd with BxdxT) and one dot product ($B \times T^2$ with $B \times T^2$). The statement says that GC has similar time according to the profile but causes 8x slowdown, this part is not clear. 5. Table 4 shows the results for ZERO-DP+ when it OOMs. How does LongShield perform in terms of throughput and peak memory when ZERO-DP+ doesn't OOM. Can we utilize LongShield to improve throughput even when ZERO-DP+ doesn't OOM? Suggestions 1. In figure 1, mention the per-device batch-size and global batch-size. For example, in this particular figure, FSDP supports GBS of 2*MBS and a per-device batch-size of MBS. But with CP, we can support larger context length at the cost of per-device batch-size reduced to 1/2 and GBS of 1. This is not clear in the figure or in the paragraph below. 2. Figure 4 and text below is not consistent. The text says "we can all-to-all (A2A) exchange the activation tensor (shape changing from (MBS, T/2, p) to (MBS, T, p/2)) and then all-gather (AG) the activation gradient tensor (shape transferred from (MBS, T/2, d) to (MBS, T, d))." but the figure shows (MBS, T, p) and (MBS, T, d/2). Please correct it. Fully human-written
LONGSHIELD: SCALABLE DISTRIBUTED DIFFERENTIALLY PRIVATE TRAINING FOR LONG-CONTEXT LLMS Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. DP-SGD is the most used algorithm for training ML models with differential privacy. Prior literature (DP-ZeRO) has scaled DP-SGD training to very large models, but fails to scale to longer sequence lengths. Through a series of optimizations this paper scales from max sequence length of 4k (prior work) to 16k on the Llama-3 8B model. They use optimizations from the exisiting literature on LLM scaling such as (1) context parallelism, (3) gradient sharding (3) activation checkpointing, and adapt these to work with DP-SGD. - The paper achieves significant results in terms of scaling DP-SGD training to longer contexts. - Experimental results are comprehensive (although a bit hard to compare between methods given the separate tables for each method). - The paper provides important contributions in adapting popular scaling techniques such as context parallel and activation checkpointing to DP-SGD training, thus enabling further research in private LLM training. - I liked the insight on the limitations of ghost clipping for scaling to longer contexts. - The contributions are mainly engineering-oriented versus conceptual/algorithmic since existing methods from the non-private literature are extended somewhat straightforwardly to the private case. - The paper could be better self-contained. Several concepts are used without much explanation such as ghost clipping/ghost overhead, FSDP, context extension continue pre-training, the overlap of communication and computation in the input-stationary pattern. - Will there be open-source code for this paper? That would be very important given that a lot of the contributions are engineering-oriented and would enable ongoing research in this area. More minor comments: - What is the meaning of "large fragmentation" in in the current Opacus imlementation? - Another relevant work/baseline might be "Scaling Private Deep Learning with Opacus: Advances for Large Language Models" which discuss FSDP with the ghost clipping approach. Fully human-written
LONGSHIELD: SCALABLE DISTRIBUTED DIFFERENTIALLY PRIVATE TRAINING FOR LONG-CONTEXT LLMS Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes LongShield, a system approach that combines context parallelism (CP) with existing differentially private (DP) training frameworks (e.g., Opacus + ZeRO-DP) to enable long-context DP training for LLMs. The key contributions include per-sample gradient sharding, communication overlap, and DP-compatible activation checkpointing. Experiments on Llama-3 models (1B–8B) show improved throughput and reduced memory over ZeRO-DP, scaling context length up to 64k tokens on 4×H100 GPUs. 1. Clear and reproducible system engineering. 2. Demonstrates that DP-SGD can scale to longer contexts using modest hardware. 3. Addresses practical compatibility issues (e.g., checkpointing with DP hooks). 4. Experimental evaluation on real Llama-3 models is thorough for throughput and memory. 1. Weak novelty. The method essentially combines context parallelism with DP-SGD under existing frameworks. It does not introduce new algorithms, optimizers, or privacy accounting techniques. 2. Lack of multi-dimensional parallelism. The system is confined to single-node CP. It does not explore or support other critical dimensions of LLM parallelism — such as Tensor Parallel (TP), Pipeline Parallel (PP), Expert Parallel (EP), or ZeRO data sharding. The paper even notes communication challenges would “become more critical” beyond one node, but never verifies cross-node scalability. Consequently, the proposed method’s scalability under realistic distributed setups remains unclear. 3. No integration with established frameworks. CP is already well-implemented in Megatron, which also provides integrated TP/PP/EP interfaces. The authors should compare their CP-DP implementation against a baseline that simply adds DP-SGD into Megatron. Moreover, they should explain why they did not directly integrate into Megatron — doing so would have allowed evaluation under real multi-dimensional parallelism (ZeRO + TP + PP + CP + EP) and would strengthen the engineering contribution. 4. Unconvincing motivation. The paper claims that DP is critical for long-context LLMs because “long sequences may contain sensitive data,” but provides no empirical evidence (e.g., no memorization or membership-inference study) to support that assumption. 5. No privacy evaluation. The paper lacks ε/δ reporting, attack-resistance analysis, or privacy–utility trade-off curves. As such, it demonstrates feasibility, not utility. 6. Limited scientific insight. Improvements (gradient sharding, hook fix, communication overlap) are incremental engineering optimizations that could be implemented in existing systems with minor effort. 1. What privacy budgets (ε, δ) were achieved in your experiments? 2. How does DP noise affect model accuracy or perplexity? 3. Have you compared your CP implementation with Megatron’s sequence-parallel engine? 4. Could LongShield be integrated with Megatron to combine TP/PP/CP/DP? 5. How does performance scale across nodes with slower interconnects (e.g., RDMA instead of NVLink)? Fully AI-generated
Forget Forgetting: Continual Learning in a World of Abundant Memory Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. Brief Summary: The paper tackles the task of continual learning. The authors suggest gpu compute cost is the main constraint and find the problem is with plasticity where bottleneck is on struggling to learn new tasks. To address this, authors propose Weight Space Consolidation involving rank-based parameter resets and weight averaging. Experiments are conducted on both image-classificaiton tasks (cifar-100) and llm instruction tuning (trace) showing improvement at significantly reduced costs. Pros: 1. The overall point about gpu compute constraints being more than storage is good. Exploration of such middle-ground strategies makes sense to me. The core idea of reseting certain parameters for plasticity also is interesting. 2. In the high-sample regime, the proposed method almost always outperforms existing baselines by 2-3 points. 3. The paper has nice ablation experiments such as comparing replay with reset (table 3), reset strategies (table 5). The plasticity loss experiment (Fig2) in particular suggest the issue with lower plasticity with increasing size of past examples. The additional experiments in the supplementary are appreciated. Cons: 1. My main concern is with the framing of the paper for continual learning and its potential application. The point about GPU cost being dominant with storage is fine, but the main problem is that the data itself might not be available in the first place. So storage was never the real issue, it is access to data. For a practical example, assume we have llama-3.2b instruct model but we don't know what data was used in the base to instruct training. Here, the authors are essentially assuming we have access to the underlying data, which is not the case. A few other points: (i) While storage cost is not a problem, the cost of downloading setting up s3 buckets high throughput speed for gpu access are all real costs. These need to be highlighted. (ii) If the authors are motivating it based on cost, some experiments on exact cost saved for practical reference should be provided. This is not to say the original point about GPU vs storage cost isn't correct, but that storage is not the only factor, data access itself is a big one. The entire argument essentially dilutes the author's novelty. 2. It seems a naive full-fine-tuning baseline is missing in Table 2 ? Difference between proposed method and full-fine-tuned on the same set would be very helpful. 3. Authors primarily explore task-based paradigm only, but it seems the method might be more impactful in task-free settings [Ref1]. 4. For LLM datasets, authors only consider TRACE dataset, so tough to know the generalizability.. Given that authors are exploring image-based and llm-based separately, it might be worth repeating experiments on multi-modal datasets as well such as on VisCOLL [Ref2]. 5. (Minor) Some qualitative visualizations would be very helpful. 6. (Minor) For Trace experiments, the training hyper-parameters are underspecified. What is the fine-tuning method for llama-3.2-1B? Some experiments using LoRA adaptors would also be interesting. --- [Ref1]: Aljundi, Rahaf, Klaas Kelchtermans, and Tinne Tuytelaars. "Task-free continual learning." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11254-11263. 2019. [Ref2]: Jin, Xisen, Junyi Du, Arka Sadhu, Ram Nevatia, and Xiang Ren. "Visually grounded continual learning of compositional phrases." arXiv preprint arXiv:2005.00785 (2020). --- Overall Rating: 4/10 The main framing of gpu vs storage cost is not really novel. While the authors get 2-3 points improvement in high-sample regime, it is unclear how interesting that is. Authors miss key experiments such as full-fine-tuning baselines, only considering task-based not task-free, and only one trace dataset for llm experiments. Q1. Can plasticity be measured as a metric on an eval set? Currently, plasticity loss is provided in the graph. Fully human-written
Forget Forgetting: Continual Learning in a World of Abundant Memory Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. Memory-based methods in Continual Learning focus on learning a sequence of tasks with help of a buffer, which store samples from previous tasks and use them when learning the current tasks. Normally, this buffer is limited in the amount of samples they can store due to some constraints of the environment. In this paper, the authors challenge this constraint mainly because storage costs much less than training time. However, as we increase the samples we store, models could face a different challenge, and instead of having forgetting (too much plasticity) this shifts to a high stability and not learning the current task enough. Considering all this, this paper proposes Weight Space Consolidation, which combines a reset strategy to improve plasticity and weight averaging to enhance stability. Experiments on multiple modalities show good performance of the proposed method as they increase the memory budget. Also, multiple ablation experiments help understand why and how the method performs. - The paper is well motivated and written. The authors raise the challenge that, in current systems, memory is not the most important constraint, as GPU time is more costly. Presenting references and numbers, the paper encourages researchers not to focus on minimising memory size. - However, there are scenarios where having a small memory is required, such as when resource constraints are imposed or when the privacy of previously learned data is an issue. These are not mentioned or explained in the paper. - Along with raising the challenge and presenting a new scenario, the authors analyse the limitations of naively increasing memory size and propose a new approach to increase plasticity and enhance stability in this new scenario. - The experiments and results clearly help to demonstrate what is discussed in the text. Experiments across multiple benchmarks and modalities provide robustness to the results, and ablation experiments help understand the method's limitations and provide further insights into how it works. - A common approach to increase the plasticity of memory-based methods is to concatenate buffer samples with current-task samples at the batch level. This is different from what is shown in the paper, where the full batch is sampled from the intersection of the current data and the buffer. As shown in the paper, this last approach suffers from plasticity because it is least likely to sample data from the current task; however, by concatenating at the batch level, there is no plasticity problem. - How does this sampling method affect the scenario, results and the proposed method? - Concatenating at the batch level can be more costly (in terms of GPU time), but it may mean not keeping a copy of the model in memory, which can increase the batch size. - Figure 2b is unclear. The orange line makes it impossible to see the blue lines and compare them. - Why not compare against better memory-based methods, like for example: - Buzzega, Pietro, et al. "Dark experience for general continual learning: a strong, simple baseline." Advances in neural information processing systems 33 (2020): 15920-15930. - It has been shown to scale well with the number of tasks. Fully human-written
Forget Forgetting: Continual Learning in a World of Abundant Memory Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors challenge the assumption of traditional CL. Instead, they argue that, instead of the storage, the GPU time is the main bottleneck. They investigated the scenario where memory is sufficient enough to mitigate forgetting, but full retraining from scratch remains the main challenge. As the authors have discovered that models become biased toward prior tasks and struggle to learn new tasks, they propose Weight Space Consolidation, a lightweight method that combines rank-based parameter resets to restore plasticity with weight averaging to enhance stability. + The paper investigates a new paradigm/scenario that challenges the previous assumptions. Such new thinking is always valuable. + The mathematical formulations are rigorous and I discovered zero mistakes there. + The new paradigm is not just an "assumption," they have evidence (Section 3) to empirically demonstrate that the new assumption is valid. - While the mathematical formulation is rigorous, the theoretical foundation on why this new approach would work is lacking. - The proposed approach reads like A (rank-based parameter resets) + B (weight averaging). A + B isn't always necessarily bad, but both A and B are the results of prior work, so what's the technical contribution here with this approach? See "Weaknesses." What are the technical contributions (s) of this approach, if the A and B components were previously proposed and employed by existing works? Fully human-written
Forget Forgetting: Continual Learning in a World of Abundant Memory Soundness: 3: good Presentation: 4: excellent Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. Traditionally, continual learning has been studied under the memory-restricted setting (i.e., completely exemplar-free or only limited buffer sizes allowed). Authors argue that this is unrealistic under modern standards: compute, and not memory-costs, are the main drivers for the high costs related to model training. Thus, authors study the setting where larger-than-common buffer sizes are used, and where instead the compute availability is the driving factor for comparison. They find that, as the memory buffer grows and thus allows storing more samples of prior tasks, the challenge in CL shift from forgetting to plasticity. Models are stable wrt old tasks, but not plastic enough to new tasks. To alleviate, they propose using stochastic weight averaging and resetting weights with relatively little gradient activity. They study their method, termed Weight Space Consolidation, under the class-incremental setting on CIFAR100, ImageNet-100, and text-based TRACE benchmark, where competitors are outperformed. Strengths: + novel take on memory buffers in CL + simple and effective method to improve plasticity under performance-stable regimes + image and text modalities tested + easy-to-follow description of algorithm - l 311 to 319: this is a central part of the proposed method, but no experimental results support this thesis - Paper claims less compute overhead from their method, but no tabular results indicating the overhead versus naive replay are given - if we can store samples of old tasks, then we could, in turn, learn longer task sequences. Paper should integrate longer task sequences (e.g., TinyImageNet split into 20 or 40 tasks) - Full re-training is often used as comparison, but missing from the tables - Its unclear how "from scratch" is interpreted. Is that an entirely fresh model that's trained on increasing unions of memory buffer? Then, no knowledge would be transferred by design - Hyperparameter tuning on ResNet32 and then using tuned params on ResNet18 seems strange (l 364). Please elaborate on this. - I do not agree with the definition of post-hoc merging versus in-training merging (l292ff). Post-hoc merging would be, in my understanding, merging after Training on a task, which is thus before a next task -- which is still "during training". I have several questions/ideas for improvement - You use a rank-based approach --> where are the ranks (i.e., matrix ranks) used here? Do you mean "ranking based approach"? - Equation 2 and 5 both use $\alpha$ for different jobs - Equation 2: how is alpha annealed (l 232) - Equation 5: describe that the [l] are the weights to be reset - Figure 1: are the memory sizes per class or total? Additionally give percentage - L 460: Q versus q (in the top-q%) - Figure 3 and 405ff: how is VRAM usage measured? Do not modern frameworks pre-allocate as much GPU memory as possible (i.e., aiming at 100% GPU memory utilization) Fully human-written
Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors investigate the phenomenon of misevolution, indicating the unintended degradation of alignment or safety in self-evolving LLM agents. The authors analyze four pathways through which such degradation emerges: (1) model self-training, (2) memory accumulation, (3) tool evolution, and (4) workflow optimization. Using several frameworks (AgentGen, AFlow, SEAgent, AutoGPT-Zero, etc.), the paper shows consistent safety decay across these dimensions: refusal rate drops, harmful or manipulative actions increase, and even high-end models (GLM-4.5, GPT-5, Gemini 2.5) exhibit drift after iterative adaptation. Introduces a timely and original concept (i.e., misevolution) capturing long-term safety drift in self-evolving LLM agents. Provides a clear four-pathway taxonomy (model, memory, tool, workflow) that organizes an otherwise diffuse research space with empirical depth. 1. The paper frames misevolution as a "temporal" process yet only reports before/after metrics after a small number of evolution rounds (e.g., 20). The choice of “the number of iterations” appears inherited from the prior literature (e.g., N=20 from AFlow setup), but more fine-grained information regarding temporality of degrading safety could be useful. For instance, it is unclear whether degradation accumulates linearly, saturates, or oscillates over time. Even a small-scale longitudinal analysis could be useful to understand the dynamics of safety decay more fine-grained way. 2. The results show pronounced domain-level differences (e.g., lower safety in Finance and Science vs. higher in Service), so I hoped to see the interpretation of why misevolution manifests unevenly across domains. It was unclear whether these differences arise from the task structure, feedback signal design, or domain semantics. 1. Could you provide a more fine-grained longitudinal analysis of misevolution, even on a small subset of agents (e.g., tracking safety metrics over all 20 evolution steps)? This would help clarify whether safety decay is cumulative or episodic. 2. The inter-domain variation in unsafe rates (Finance/Science vs. Service) is intriguing. Could you show a few qualitative examples or analyses that illustrate why certain domains are more prone to safety degradation? Lightly AI-edited
Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper systematically analyzes evolution risks (referred to as misevolution) from four aspects: model, memory, tool, and workflow. It evaluates various agents, including state-of-the-art models, and identifies significant misevolution risks across all four aspects and models. Finally, the paper discusses potential mitigation strategies supported by preliminary experiments. - They first systematically investigate misevolution risks, whereas existing work has primarily focused on static risks. - Their analysis is comprehensive, covering four aspects and evaluating both open-weight and closed-weight models. The safety evaluation is conducted using several existing safety benchmarks. - They not only identify novel risks but also propose potential mitigation strategies, supported by preliminary experiments. I don't see many critical weaknesses in this paper, but one weakness that I think is the lack of explanation regarding performance. It is well known that self-evolution or the use of a long context window may lead to performance degradation. The misevolution observed in the paper could also be a result of such degradation. The paper should discuss whether the evolved agents showed a performance drop and, if so, how the decreasing safety alignment might relate to the observed performance drop. Did you observe any performance drop in the evolved agents? Lightly AI-edited
Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents Soundness: 2: fair Presentation: 1: poor Contribution: 3: good Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper introduces the concept of "Misevolution" where self-evolving agents develop unintended harmful behaviors or vulnerabilities during their autonomous improvement process. Misevolution is evaluated across four evolutionary pathways: model, memory, tool, and workflow. The authors perform experiments for each of the evolutionary axis on frontier models and empirical evidence for safety risks. The experiments focus on both coding and non coding tasks. The results show that the refusal rate dropped significantly with an increase in unsafe tools. Finally, the paper also briefly discusses mitigation strategies. - The discussions around the risks of evolutionary agents has been sparse, hence this paper focuses on the much needed research gap. The results could be important to lead discussions in safety of evolutionary agents. - Four-pathway taxonomy (model, memory, tool, workflow) provides pedagogical risk landscape coverage. - Multiple benchmarks from literature were used to evaluate the risks. - While I think the paper is important to the community, I think there are aspects of the paper that need to be improved, mainly the presentation and experimental details. Since the paper doesnt propose novel theory (which is fine) and is mainly an empirical paper, it seems to be lack information about the experiments. Instead of describing their experimental setup, metrics, the reader is guided to "Section 2" which often lacks details. The appendix is extensive, however it is difficult for a reader to switch between the appendix and the main paper. Additionally L243, the reader to signaled to read the appendix for _all models, benchmarks, metrics, and evaluation protocols_ which is a 7 page appendix. Given the structure I find the details around the setup such as temperature used for reproduciblity. - The paper performs extensive experiments, I would have appreciated some qualitative understanding of the results. For eg: in the evolutionary process, which node/parent cascaded the effect of unsafe behaviors. - Minor: Similar risks and mitigation have been discussed in many papers that are important to be included to better contextualize the paper: [1] DeChant, Chad. "Episodic memory in ai agents poses risks that should be studied and mitigated." 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 2025. [2] Hammond, Lewis, et al. "Multi-agent risks from advanced ai." arXiv preprint arXiv:2502.14143 (2025). [3] Ecoffet, Adrien, Jeff Clune, and Joel Lehman. "Open questions in creating safe open-ended AI: Tensions between control and creativity." Artificial Life Conference Proceedings 32. One Rogers Street, Cambridge, MA 02142-1209, USA journals-info@ mit. edu: MIT Press, 2020. [4] Sheth, Ivaxi, et al. "Safety is Essential for Responsible Open-Ended Systems." arXiv preprint arXiv:2502.04512 (2025). - The paper shows safety degrades after evolution, but doesn't always clearly establish that evolution caused the degradation vs. other factors i.e. if there is a causal effect. - Why is there a disconnect between the experiments in the way different models are evaluated for each paradigm? What is the insight? Would have been good to point out that a particular model is more susceptible to lets say workflow misevolution vs memory misevolution. - Could you ablate the trajectory of misevolution and what caused it? - What evolutionary algorithms were used? Fully human-written
Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces "misevolution" as a novel safety risk in self-evolving LLM agents, where autonomous improvement leads to unintended harmful behaviors. The authors systematically evaluate four evolutionary pathways (model, memory, tool, workflow) and demonstrate that even state-of-the-art LLMs exhibit safety degradation, including alignment decay, reward hacking, and vulnerability introduction during self-evolution. **Novel and timely research direction**: First systematic study of safety risks in self-evolving agents, addressing a critical gap as these systems become more prevalent. **Comprehensive empirical evaluation**: Thorough assessment across four distinct evolutionary pathways with multiple LLM backends, providing strong evidence of widespread risks. **Clear conceptualization**: Well-defined characteristics distinguishing misevolution from existing safety concerns (temporal emergence, self-generated vulnerability, limited data control, expanded risk surface). **Practical demonstrations**: Concrete examples (Figure 1) and detailed case studies effectively illustrate how misevolution manifests in real scenarios. **Insufficient mechanistic understanding**: While statistics demonstrate misevolution occurrence, the paper lacks deep analysis of root causes. Section 6 briefly hypothesizes three factors but doesn't provide experimental validation. For instance, why does SEAgent lose refusal ability after self-training (Figure 4)? Is it due to data distribution shift or optimization pressure? **Limited mitigation validation**: Mitigation strategies (Section 4) are mostly theoretical. Only memory and tool mitigations have preliminary experiments (Appendix D), and even these show incomplete recovery. No experimental validation for model or workflow mitigation. **Narrow safety focus**: Evaluation exclusively targets safety metrics. Missing analysis of whether self-evolution affects general task performance, robustness to distribution shifts, or other capabilities. Does fixing safety issues compromise the benefits of self-evolution? **No analysis of combined evolution**: Real systems likely combine multiple evolutionary pathways. What happens when an agent evolves both memory and tools simultaneously? Do risks compound or interact in unexpected ways? **Limited discussion of detection methods**: How can we detect when misevolution is occurring in deployed systems? The paper focuses on post-hoc evaluation but lacks online monitoring strategies. **Root cause analysis**: Can you provide ablation studies to isolate which specific aspects of self-training cause safety degradation? For example, is it the self-generated data quality, optimization objectives, or training dynamics? **Performance-safety tradeoffs**: How do the proposed mitigations affect the agent's core capabilities? Table 1 in Appendix D shows memory mitigation, but what's the impact on SWE-Bench performance? **Cross-pathway interactions**: Have you tested agents that evolve through multiple pathways simultaneously? How do risks interact when combining memory + tool evolution? **Temporal dynamics**: How quickly does misevolution occur? Is there a critical point or gradual degradation? Figure 2 shows outcomes but not the progression. **Generalization of findings**: The evaluation focuses on specific implementations (SE-Agent, AFlow, etc.). How confident are you these risks generalize to other self-evolving frameworks? **Benchmark limitations**: Many evaluations use LLM judges (e.g., Gemini-2.5-Pro in Section 3.3). How reliable are these judgments? What's the inter-rater agreement with human evaluation? **Deployment implications**: What monitoring mechanisms would you recommend for production systems? Can misevolution be detected before harmful outcomes occur? **Comparison with supervised evolution**: How does autonomous self-evolution compare to human-supervised iterative improvement in terms of safety risks? Fully AI-generated
Ensemble Prediction of Task Affinity for Efficient Multi-Task Learning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Summary This paper addresses a central challenge in multi-task learning (MTL): identifying groups of tasks that mutually improve each other’s performance when trained together. Existing methods typically fall into two categories — white-box and black-box approaches. The paper introduces a hybrid method that integrates both. The proposed two-step framework first estimates pairwise task affinities by training a single MTL model on all tasks within a group, inferring affinities for all possible combinations. In the second step, a non-linear mapping from task affinity to performance gain is constructed and refined using residual predictions. The method is evaluated against several state-of-the-art task grouping algorithms, with ablation studies highlighting the contribution of each component. Results demonstrate the benefits of applying non-linear mappings and residual predictions. Overall, the paper presents a robust and well-conceived methodology that successfully combines the strengths of existing approaches. Strengths Tackles an important challenge in the MTL literature — the high computational overhead of existing task-grouping algorithms. Through comprehensive ablation studies, the paper provides valuable insights into the role of gradient similarities (e.g., via comparison of affine vs. non-linear mappings). Design choices (e.g., use of B-splines and regression techniques) are justified through ablation analyses and comparative experiments. The paper is clearly structured and effectively relates its contributions to prior work, ensuring coherence and contextual grounding. The experimental evaluation goes beyond measuring performance gains, also analyzing the correlation between predicted affinities and ground-truth transfer gains. Weaknesses While computational efficiency is claimed as a key advantage, it would be valuable to include results for the complete approach (including hyperparameter tuning in the second stage) or to explicitly state that the additional cost is negligible. Comparing against an additional data-driven baseline would strengthen the evaluation. The performance gains over naive MTL are relatively modest; a discussion of their practical significance would help contextualize their value. Some design choices lack clear theoretical justification (e.g., why averaging gradient similarities yields effective affinity estimates, or why B-splines are particularly suitable). The proposed method introduces several additional hyperparameters (for the B-spline expansion, regression method, and residual prediction), which may complicate tuning and reproducibility. Questions for the Authors What is the rationale for time-averaging the cosine similarity over the K K training steps in Equation (6)? How stable is this measure across training, and how much variability is typically observed? Since TAG’s affinity scores are on a different scale than observed gains, how does your method correct for this discrepancy? You mention that GRAD-TAE performs well for groups — could you include these results in Table 3 for completeness? Could you provide additional details on the computational requirements of the different methods, particularly in terms of the number of backward passes and the extent of hyperparameter tuning involved? Recommendation Although some design choices could be better motivated and the performance gains over naive MTL are modest, the paper provides valuable insights into the problem of task grouping. The authors conduct detailed and well-structured experiments, including informative ablation studies that clarify the role of different components within the proposed framework. Additional Feedback Figure 1 is somewhat difficult to interpret, as it presents too many elements at once. Consider splitting it into two complementary figures: one conceptual illustration to convey the overall idea, and another outlining the algorithmic steps in more detail. Section 4 could be improved for clarity and consistency. Ensure that all discussed results appear in the corresponding tables or figures, and avoid repetition of statements like “TAG achieves higher correlation but incurs significant computational overhead.” The term “training cost” could be replaced with “computational cost” for improved clarity. In Section 3.1.3, explicitly explain the difference with TAG and refer to the appendix where both affinity measures are compared. The discussion of limitations could be expanded. As shown in the appendix, the method performs comparably to naive MTL in some cases, indicating room for improvement and further exploration in future work. Lightly AI-edited
Ensemble Prediction of Task Affinity for Efficient Multi-Task Learning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. Proposed ETAP builds a scalable predictor by computing a gradient-alignment affinity score for pairs and groups in shared parameters, then refining it with learned nonlinear transformations and residual corrections. Across benchmarks, ETAP improves MTL gain prediction and enables more effective task grouping, outperforming used baselines. It combines gradient-based affinity with learned non-linear relationship modeling to efficiently and accurately capture task relationships. And it includes thorough component-wise ablations that clarify each contribution and improve interpretability. Dividing learning tasks into groups based on similarity is a long-standing area [1]. The paper introduces new measures of task affinity for MTL, but I am not fully convinced that the proposed methods are superior to prior work. The baselines used are relatively dated, and a comparison of computational cost and predictive performance with stronger recent baselines, such as [2], would strengthen the claims. Efficient group-wise tracking of task affinity is also not new, as [3] tracks inter-task affinity in a group-wise manner during multi-task optimization. Finally, I am not convinced that the proposed methods clearly improve over other inter-task affinity tracking approaches, including [4]. [1] Which tasks should be learned together in multi-task learning? [2] Task Grouping for Automated Multi-Task Machine Learning via Task Affinity Prediction [3] Selective Task Group Updates for Multi-Task Optimization [4] Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity It would help to clarify the concrete differences from prior work, especially how your methods distinctively improve on gradient-based affinity and group-wise tracking approaches. Practical guidance on when to prefer your method over more recent baselines would make the contribution clearer. Lightly AI-edited
Ensemble Prediction of Task Affinity for Efficient Multi-Task Learning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes ETAP (Ensemble Task Affinity Predictor), a framework for predicting multi-task learning (MTL) gains to enable efficient task grouping. The approach combines white-box gradient-based affinity scoring with data-driven ensemble prediction. 1. The two-stage ensemble design that uses gradient-based affinity scores as foundation and refining with data-driven models is reasonable, it combines white-box gradient-based affinity scoring with data-driven ensemble prediction. 2. ETAP achieves impressive runtime reduction while maintaining or improving correlation with ground-truth gains. This is a meaningful practical contribution. 1. The gradient-based affinity score (Equation 5) is quite similar to existing work (TAG), just removing the auxiliary forward/backward passes. The B-spline transformation and ridge regression are standard techniques. The main novelty seems to be in combining these pieces, which feels somewhat incremental. Can you clarify what is fundamentally new here beyond engineering different existing methods together? 2. While you claim ETAP is "scalable," all experiments use relatively small task sets (n=7-10). What happens when n=50 or n=100? The affinity score computation still requires training one MTL model with all tasks, and the pairwise scores grow as O(n²). The paper doesn't really demonstrate scalability to large task pools that appear in real applications. 3. Section 3.2.3 uses branch-and-bound from prior work. So the contribution is really just the gain prediction, not the actual grouping algorithm. This should be made more clear in the contribution claims. See weaknesses. Fully AI-generated
Bridging Unsupervised and Semi-Supervised Anomaly Detection: A Provable and Practical Framework with Synthetic Anomalies Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes to add synthetic anomalies to diversify the collected anomalies under the semi-supervised setting. Authors connect anomaly detection with binary classification and introduces synthetic anomalies to mitigate two issues in semi-supervised AD: false negative modeling and insufficient regularity of learning. Some theoretical analyses are provided to justify the effectiveness of incorporating synthetic anomalies. Experiments across tabular, image, and text datasets demonstrate the applicability of the proposed framework. 1. Some theoretical analyses are conducted on the effectiveness of introducing synthetic anomalies for semi-supervised anomaly detection. 2. The experiments span diverse modalities (tabular, image, text) and multiple AD methods, showing general applicability of the “synthetic anomaly” principle. 1. Formulating anomaly task as binary classification is fundamentally inappropriate. Since the type of anomaly is uncountable, anomaly detection is usually formulated as one-class classification to model the distribution of normal data or to learn the pattern of them. Using binary classifier may learn a unreliable decision boundary. 2. The theoretical analysis of convergence is narrow to the network using ReLU as activation function. Extending the theoretical guarantees to broader architectures or activation functions would significantly strengthen the generality and impact of the results. 3. The novelty of this paper is weak. While the theoretical framing is elegant, the core idea is adding synthetic anomalies, which is not new. The main contribution lies in extending this idea to a semi-supervised setting, which feels incremental and does not substantially push the frontier of anomaly detection research. 4. Synthetic anomaly generation is overly simplistic. The use of uniformly random noise as synthetic anomalies is questionable, especially for complex or high-dimensional data. This weakens the practical significance of the framework and may not generalize to high-dimensional or structured data. There is no comparison with more informative or adaptive anomaly generation methods. 5. The presented ablation resembles a sensitivity analysis rather than a comprehensive investigation. 6. The writing of this work is terrible and should be significantly improvoed. 1. How sensitive is the framework to the way synthetic anomalies are generated? Would a more structured generator improve performance? 2. It is possible to generalize the theoretical guarantees to broader architectures or activation functions? Lightly AI-edited
Bridging Unsupervised and Semi-Supervised Anomaly Detection: A Provable and Practical Framework with Synthetic Anomalies Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 0: Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper deals with semi-supervised anomaly detection. It states that anomaly detection and semi-supervised anomaly detection can be approached as binary classification. It proposes as algorithmical contribution in section 4.1 to sample from a uniform distribution background to create artificial anomaly samples. They cite several theoretical results on excess risk convergence over a function class for binary classification. They prove certain special cases. They perform experiments using relu networks. Negligible in the light of the below weaknesses. The algorithmical proposal of this paper, sampling anomaly background , is long known (e.g. the paper by Sipple 2020, but it is known before), is trivial and has no novelty. There has been advantages over that, see eg SMOTE, Chawla et al from 2002 . - The claim that Semi-supervised anomaly detection can be treated as classification is not novel. Steinwart 2005 has established this formally for anomaly detection in general (Corollary 3 and theorem 4 in Steinwart 2005). Were it not for the old suggestion of sampling negatives and the many mistakes, this paper would feel like a recapitulation of Steinwart 2005, or one tries to confuse the readers over the simplicity of the algorithmical content by citing convergence bounds. - The paper has a number of severe theoretical mistakes: 1. line 148-149 they state that if a non-negative lower bound converges to zero, then the term which it bounds from below must also converge to zero. Their statement is: $a >= b$, $b>=0$, $b \rightarrow 0$ implies $a \rightarrow 0$ for evidence see "from [4], we see ..." 2. a similar wrong conclusion occurs in lines 240-244 "From Proposition 3.1 and Theorem 3.3, we can see that if the regression function is discontinuous, the approximation error is high (at least 1), which may lead to vacuous excess risk bounds (i.e., excess risk can be high and is not guaranteed to converge). Lacking theoretical guarantees, the Bayes classifier cannot be effectively learned." If an upper bound diverges, it does not mean that the quantity bounded by it would have to diverge as well. Same kind of logical mistake as in 1., but now with an upper bound. 3.Proposition 4.2 is obviously wrong. They claim continuitity, however if $h_-(X)$ is discontinuous, then $f_P(X)$ can be discontinuous, too. E.g. choose $s=0.5, \tilde{s}=0.5, h_1 =c$, then $f_{P}(x) = \frac{0.5c -0.25 h_-(X) -0.25}{0.5c + 0.25h_-(X) + 0.25}$ 4. eq (4) is proven in Steinwart (2005) as an upper bound, see Theorem 10 in Steinwart (2005). Proving the exact same result a lower bound would be very surprising. They use exactly the same argument as Steinwart 2005 in the proof of theorem 10, but arrive at the opposite direction of inequality. If this is corrected to the correct direction of inequality, their extension is straightforward. Steinwart 2005 assumes for the anomaly density to be $\mu$. They assume that it has density $h_2$ with respect to $\mu$ . There is no technical effort in doing this change. btw, line 1101 makes a lower bound (it should use 3/5 as constant but this is minor) . 5. Proposition 3.1 is wrong because they do not ensure that $\mu(X_1) >0$ and $\mu(X_-) >0$ . One can choose closed sets such that $\mu(X_1) =0$ and $\mu(X_-) =0$ Then one can get a zero $\ell_{\infty}$-norm to $f(x)=0$ but even if one would fix that, it would be of no consequence, see point 7 6. It could be that Theorem 4.5 has an unfavourable rate $O( (log n) ^4 / n ) ^{ (c+\alpha) / (c+d) }$ For $c = \alpha q$ is typically small compared to the input dimensionality $d$ if one wants smooth settings as they state it for even moderate input dimensionalities d the bound is worse than the typical $O(n^{-1/2})$ results . 7. the "insufficient regularity of learning" problem as they state it is no problem for training a classifier: assume $P[Y=1|X]$ makes a jump in direction orthogonal to the decision boundary, but the decision boundary is a standard hyperplane. This is trivially learnable with 1 layer. - Ironically, Tsybakovs noise condition, which is repeatedly cited by the submitters of this paper, requires a steepness of $\eta(x) = P(Y=1|X=x )$ around 0.5 for faster convergence rates. They state that this steepness would a problem for learning. This is a direct contradiction to the results from Tsybakov and Steinwart. - Overall, the paper has very poor readability. none Fully human-written
Bridging Unsupervised and Semi-Supervised Anomaly Detection: A Provable and Practical Framework with Synthetic Anomalies Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper demonstrates that using synthetic anomalies improves the performance of semi-supervised anomaly detection. It first formulates anomaly detection as a binary classification problem, then shows why training a model using only normal data and known anomalies is difficult. The proposed method theoretically resolves this issue by incorporating synthetic anomalies generated from a uniform distribution in addition to known anomalies. Experiments on tabular, image, and text data show that adding synthetic anomalies enhances the performance of semi-supervised anomaly detection. - By defining a unified anomaly detection framework based on binary classification, this paper can handle both unsupervised and semi-supervised settings. Based on this, this paper also theoretically justify the use of synthetic anomalies. - The experimental results are very strong. Across a variety of methods and datasets, incorporating synthetic anomalies leads to improved performance. Please see the Questions section. - In the proposed method, noise drawn from a uniform distribution is used as synthetic anomalies. However, since normal data also is a subset of a uniform distribution, would not the synthetic anomalies contain normal data as well? If so, I would expect the detection performance for normal data to drop. Why is the proposed method able to avoid this? (For example, DROCC also uses synthetic anomalies, but it includes mechanisms to avoid overlapping with normal data. That approach feels more natural to me; yet for images and text, adding uniform noise to DROCC actually improves performance.) - This paper uses autoencoders (AEs) in the experiments, but how about trying DeepSVDD? For tabular data, AEs may be better, but for image data I expect DeepSVDD to yield stronger results. I am interested in how the proposed method would perform within DeepSVDD-based variants such as DROCC, ABC, and DeepSAD. Fully human-written
How Does Fine-Tuned Foundation Models Help for Long-Tailed Data Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper systematically studies how classic long-tailed learning methods perform when applied to fine-tuning foundation vision models. It evaluates seven categories of techniques (re-sampling, augmentation, class-balanced losses, classifier design, etc.) under both full and parameter-efficient fine-tuning. The authors find that many traditional methods do not transfer well, while a combination of cosine classifier, square-root sampling, Balanced Softmax/logit adjustment, and label smoothing works reliably. They conclude with a unified fine-tuning framework that outperforms prior long-tail methods across multiple benchmarks. 1 The paper provides a comprehensive and systematic empirical study of seven major categories of long-tailed learning methods on foundation models 2 Based on extensive experiments, the authors deliver a well-validated unified fine-tuning framework that consistently improves long-tailed performance across multiple datasets and backbones (1) I do not fully agree with the authors’ claim in the introduction that “to the best of our knowledge, there has not been a systematic study on how to fine-tune foundation models under a long-tailed distribution.” In fact, LIFT [a] has already provided a systematic analysis of imbalance issues under long-tailed settings and explored various strategies. Works such as [b] and [c] have also examined biases in foundation models under long-tailed distributions, and LPT [d] offers a deeper investigation as well. I recommend that the authors reconsider the positioning of their contribution and more precisely articulate the gap their work aims to fill, rather than relying on an overly broad claim. (2) The results on several benchmarks do not appear to surpass LIFT, which achieves competitive performance with only 10 training epochs and minimal additional techniques. From this standpoint, it is difficult to assess the practical significance and novelty of the proposed method. (3) Many of the examined techniques and strategies depend heavily on training hyperparameters such as the number of epochs, learning rate. More experiments are required to understand how sensitive the proposed tricks are to these hyperparameters and to more fully validate their robustness. [a] Shi, Jiang-Xin, et al. "Long-tail learning with foundation model: Heavy fine-tuning hurts." arXiv preprint arXiv:2309.10019 (2023). ICML2024 [b] Chen, Jiahao, et al. "Rethinking the Bias of Foundation Model under Long-tailed Distribution." arXiv preprint arXiv:2501.15955 (2025). ICML2025 [c] Wen, Xin, et al. "What makes clip more robust to long-tailed pre-training data? a controlled study for transferable insights." Advances in Neural Information Processing Systems 37 (2024): 36567-36601. [d] Dong, Bowen, et al. "Lpt: Long-tailed prompt tuning for image classification." arXiv preprint arXiv:2210.01033 (2022). ICLR see weakness Lightly AI-edited
How Does Fine-Tuned Foundation Models Help for Long-Tailed Data Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper investigates how classic long-tailed (LT) learning techniques fare when fine-tuning foundation models (CLIP, ViT) under both full fine-tuning (FFT) and parameter-efficient fine-tuning (PEFT). It benchmarks seven families of methods (re-sampling, data augmentation, class-sensitive losses, balanced classifiers, and “other tricks”), and synthesizes an “ultimate framework” combining Cosine classifier + Square-root sampling + Balanced Softmax (BS) + Label Smoothing (LS). This framework yields consistent gains across ImageNet-LT, Places-LT, CIFAR100-LT, and iNaturalist-2018, sometimes surpassing recent LT methods, while noting that naive data augmentation have inconsistent effects on performance across different datasets and models. 1: The paper systematically reviews and tests representative techniques (e.g., ROS/RUS/Square-root, Rand/AutoAugment, Focal/LDAM/CB/BS/LA/LADE, Cosine/τ-norm, mixup/LS) across FFT/PEFT and backbones 1: While this study presents ample empirical results, the findings remain largely superficial and fail to uncover the intrinsic mechanisms by which long-tailed data distributions influence fine-tuning. Prior works, such as [1] and [2], provide theoretical insights into the geometric properties of contrastive representations learned from balanced versus imbalanced datasets. Other studies [3, 4] investigate the underlying mechanisms of long-tailed learning from an empirical perspective. In the context of CLIP, [5] offers a promising direction for exploring how fine-tuning with long-tailed data impacts downstream performance. Incorporating such theoretical or representational analyses would substantially deepen the paper’s contribution and explanatory power. [1] Dissecting supervised contrastive learning, Graf et al., ICML 2021. [2] Geometry of Long-Tailed Representation Learning: Rebalancing Features for Skewed Distributions, Yi et al., ICLR 2025. [3] Imbalance trouble: Revisiting neural-collapse geometry, Thrampoulidis et al., NeurIPS 2022. [4] What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights, Wen et al., NeurIPS 2024. [5] Decipher the Modality Gap in Multimodal Contrastive Learning: From Convergent Representations to Pairwise Alignment, Yi et al., arxiv. 1: Why does RUS×N negatively affect tail performance? Beyond reporting the empirical trend, the paper should try to provide a deeper analysis of the underlying mechanism. For instance, can the authors examine representation drift or head-biased margin dynamics as N increases? Such analyses would clarify whether the degradation arises from overfitting to majority classes, loss of feature diversity in tails, or instability in the learned decision boundaries. 2: Please provide a deeper mechanistic analysis of why the combination of Cosine normalization, BS, and LS exhibits robustness under long-tailed (LT) fine-tuning for foundation models (FMs). For instance, an examination of weight norms, feature angular distributions, and class-wise margins before and after fine-tuning would help explain the underlying dynamics contributing to this robustness. Moderately AI-edited
How Does Fine-Tuned Foundation Models Help for Long-Tailed Data Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper systematically evaluates long-tail learning methods, including re-sampling and loss functions for fine-tuning foundation models, i.e., CLIP and ViT. A unified framework that outperforms existing approaches on imbalanced datasets has been proposed, and empirical guidelines have been provided for the long-tailed learning community. However, evaluations have been carried out on limited model architectures, and the proposition is mainly validated through accuracy and efficiency. The impact on the learned representations remains undiscovered. - The paper provides a systematic empirical study of long-tail learning methods and offers actionable insights, e.g., recommending Balanced Softmax and Square-root sampling for fine-tuning foundation models on imbalanced data, which is valuable for the whole long-tailed learning community. - Proposes a novel framework combining optimal methods to achieve trade-offs between performance and computational cost. - Tests on 4 datasets with detailed observations, e.g., hyperparameter sensitivity analysis, examine robustness of methods like LADE, noting instability with improper hyperparameters. - Only CLIP/ViT are considered. Extending to other architectures (e.g., DINO) could strengthen the generalizability. - Mentions potential leakage between ImageNet and IN21K-ViT, but doesn’t quantify how its impact was mitigated. Additionally, the BALLAD baseline is omitted in Table 10, which makes superiority claims less convincing. - The motivation is intuitive, and the work's unified framework combines existing methods, but groundbreaking algorithmic novelty lacks emphasis. - Some tables (e.g., resampling results across datasets) could be consolidated for brevity. Moreover, the style of the tables is not unified. Some are full-bordered, and some are three-line tables. 1. What is the impact brought by the new design to the learned representations? Qualitative results and more detailed ablations are preferred to indicate the effectiveness of the proposed method on learned representations. 2. Why were only CLIP and ViT tested? How about DINO? 3. LADE collapses in Fig. 2. Does this reveal fundamental limitations of logit adjustment for foundation models, or is it fixable via hyperparameter tuning? 4. Beyond acknowledging potential leakage between ImageNet and IN21K-ViT, what specific steps were taken to ensure contamination didn't inflate results? Lightly AI-edited
How Does Fine-Tuned Foundation Models Help for Long-Tailed Data Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper presents a systematic study of how classical long-tail learning methods perform when applied to foundation models (CLIP and ViT) instead of training from scratch. The research categorizes existing methods into 7 groups: (1) Re-sampling, (2) Data Augmentation, (3) Class-sensitive Loss, (4) Balanced Classifier, (5) Knowledge Distillation, (6) Ensemble Learning, and (7) Other tricks. Through extensive experiments across two fine-tuning paradigms (Full Fine-Tuning and Parameter-Efficient Fine-Tuning) and four standard datasets (CIFAR100-LT, Places-LT, ImageNet-LT, iNaturalist 2018), they provide empirical guidelines for practitioners. The authors then propose a unified framework combining the most effective methods and demonstrate competitive performance compared to state-of-the-art approaches. - The paper provides a timely revisit of how existing methods perform when pre-trained models are adopted, which is practically beneficial. - The experimental setup is well-structured and thorough seven method categories covering major long-tail learning approaches over four datasets. The paper also provides detailed hyperparameter specifications and comprehensive ablation studies, facilitating reproducibility and follow-up research. - The work provides actionable insights, clearly showing which methods work best in different settings. - This work also considers training costs, computational efficiency, and hyperparameter sensitivity. This practical consideration is crucial for real-world deployment. - The work is purely empirical. While systematic evaluation has value, the contribution is relatively modest. The proposed ultimate framework is also simply a combination of best-performing existing methods without deeper insight into why these combinations work synergistically. - No statistical significance testing is reported. It would be beneficial to rule out experimental noise with multiple independent runs and increase the reliability of results. - Do the authors expect these findings to generalize to other foundation models beyond CLIP and ViT (e.g., DINOv2, MAE, SigLIP2)? - What properties of pre-trained representations make them more suitable to certain long-tail learning techniques or hyper-parameters? Fully AI-generated
How to Teach Label to Understand Decisions: A Decision-aware Label Distribution Learning Framework Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a novel framework for decision-aware learning. It is based on LDL (label distribution learning), which in a nutshell in this context performs label enhancement in a decision driven way: I.e. the label space is expanded to accommodate a certain number of decision-aware labels with their support depending on the “transfer cost” from the decision that would be made if one vs another label was presented for a data-point at hand. The method is quite performant, achieving consistently the best overall regret, and the best robustness, compared to a variety of vanilla and more advanced decision-focused benchmark methods; this was tested on both synthetic and real world data on classic and substantively nontrivial combinatorial decision tasks, so I overall found the selection of tasks for the empirical evaluation to be good. In particular, there is a pronounced effect of low-data regime superiority compared to other methods. On the methodological side, the proposed method exhibits an intuitive multi-stage structure that lets it generate learned label distributions that can then be used to solve a large variety of decision-oriented problems; the distributions are learned in a way that leverage the downstream decisions and how those are correlated across samples; the usage of such correlations on the decision/label level, as opposed to earlier approaches that embraced adaptivity at the feature level, appears in certain ways more principled in decision-focused settings than these prior methods with local feature-based adaptivity. To start, the proposed method appears to be very computationally demanding compared to the alternatives it is benchmarked against. I could not find a runtime comparison in the pdf manuscript, so in the interests of transparency I’d like to ask for it to be disclosed. Overall, the expansion of the label space into custom “label-distribution” space involves a lot of matrix computations, together with many parallel neural networks with non-trivial architecture. Furthermore, it appears that due to the heavy parameterization of the method, there is a risk of non-robustness to misspecification that could be more pronounced than with the other methods. Thus, it would have been important to see how well this method does for noisy/drift-prone settings where the decision maps and/or labels could be misspecified. Also, examining the plots, I would agree that the proposed method does exhibit substantial regret benefit over the benchmark methods on aggregate (as well as that the proposed method is more robustly performant, with box widths smaller than the rest), but I would be more moderate in making the performance gains claims given that there is still substantial overlap between the boxes in most cases. Thus, what we can deduce with a lot of certainty is that the proposed method obtains much better regret than the naive benchmark in almost all evaluations, which other methods by and large cannot consistently achieve in the sense of box-plots. However, what I believe we cannot claim with absolute certainty is the superiority of the proposed method over all of the benchmarks at once: For instance, the KNN based benchmark is usually in the same ballpark. Moreover, on real-world multi-item newsvendor (Figure 3), the performance of most methods looks quite evenly matched, modulo the variation/box width. As another meta-issue, while I appreciated the logical nature of the proposed pipeline, I was not as convinced about the variety of heuristic choices that went into it at most junctures (many of these choices are not ablated against and would in fact be difficult to ablate). This relates to neural net architectures, hyperparameters like the neighbor count M when deciding on the largest transfer components; and this also relates to other subtler design choices that could be made, but were not made and weren’t usually discussed. Just as an example, when finding the M highest transfer cost samples, the hyperparameter M first of all sounds like the performance could be quite sensitive to it; so it appears that similar samples could be clustered together at first before performing this step, as the optimal M would then be found in a smaller, more robust range. Please see above. Furthermore, I have some additional questions, which if the authors are able to address them would likely require some different plots from the ones displayed. First, the adaptively chosen support is mentioned quite a few times, but there are no illustrations that specifically showcase the adaptivity/variation in support throughout the instances on any of the tasks, so I’d request for this to be provided. Second, there is a lot of mention of the difficulties, related to non-differentiability, of the standard existing predict-then-optimize approaches that are based on designing customized decision-aware losses. Yet, the comparison in the experimental section remains high-level, and doesn’t focus on exhibiting the favorable contrast as it specifically pertains to non-smoothness issues: I could imagine a dataset where decisions are intentionally set to be very discontinuous, and showing the benefits that the current framework has over decision-loss-based ones, locally. Currently, based on the results of this paper, it appears that the added benefit may be in the extra stability of the proposed method, as I imagine it to be quite computationally demanding compared to any decision-loss-optimizing method, smooth or nonsmooth. Second, returning to the point of the KNN method being one of the most closely matched, this raises the question as to whether the feature-level similarity may in any way have translated to decision-level similarity. If is there a way to display whether that is or is not the case empirically, that would be great; else, a qualitative discussion in the case of each of the two studied settings would suffice. Fully human-written
How to Teach Label to Understand Decisions: A Decision-aware Label Distribution Learning Framework Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a decision-aware Label Distribution Learning (LDL) framework for Contextual Stochastic Optimization (CSO). By incorporating decision information into data representations via a decision-aware similarity matrix and predicting full label distributions, the method avoids modifying loss functions while naturally reducing risk in high-cost regions. Experiments on synthetic and real-world datasets, including comparisons with SAA, prescriptive analytics, and feature-based LDL, show consistent regret reduction, particularly in low-data settings, demonstrating its effectiveness for decision-focused learning tasks. 1. The proposed decision-aware LDL framework is novel, introducing a similarity matrix that explicitly incorporates decision information. 2. Experiments show that LDL achieves consistently lower regret and higher stability than baselines across both synthetic and real-world datasets. 1. There is a lack of analysis of hyper-parameters, i.e., P, M, $\alpha$, $\lambda$. 2. Although the proposed decision-aware LDL framework is novel and achieves strong performance, the manuscript does not discuss computational efficiency. LDL involves multiple steps, which may be slower than simpler baselines such as SAA or KNN. It is better to include a table showing average training and inference times for all methods. 3. The decision-aware similarity matrix S is a key innovation, but no visualization or interpretability analysis is provided. It would help to show a heatmap comparing S with a feature-based similarity matrix. 4. Experiments only consider small-scale problems (K=2 or 4), so the method’s scalability is unclear. It would be beneficial to include an experiment on larger-scale problems to assess performance and computational feasibility. 5. The manuscript lacks comparison with recent representative learning-to-decision or end-to-end decision learning frameworks. Including such baselines would better contextualize the performance of the proposed method. Overall, the idea is novel, but the experiments are insufficient, so I give a score of 4. I will base my final score on other reviewers’ comments and the author’s responses. Please refer to the details in the weaknesses. Lightly AI-edited
How to Teach Label to Understand Decisions: A Decision-aware Label Distribution Learning Framework Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper studies the problem of contextual stochastic optimization, where the setup is as follows: there is a joint distribution $P$ over an observed context $x$ and unobserved problem parameters $y$, and a cost function $c(z, y)$ that associates a cost with each decision $z$ and problem parameters $y$. Given a context $x$, we would like to solve the optimization problem $$ \min_{z \in \mathcal{Z}} \mathbb{E}_{y \sim P(y \mid x)}[c(z, y)]. $$ The distribution $P$ is not known, but we have a dataset of samples $(x_1, y_1), \ldots, (x_m, y_m)$ drawn from $P$. The approach explored in this paper and in prior work roughly involves using the data to learn a model that predicts for each context $x$ the distribution over outcomes $y$. Then the optimization problem is solved using the predicted distribution in place of the true conditional distribution of $y$ given $x$. A recent line of work studies "Integrated Learning and Optimization", where the loss function used when learning to predict the distribution of $y$ given $x$ is informed by the down-stream task (instead of simply being some generic measure of distributional similarity). However, the authors point out that a weakness of this approach is that these loss functions need to be designed bespoke for each downstream task, and are often difficult to work with due to being non-differentiable or discontinuous. This paper proposes a new approach called Label Distribution Learning (LDL) that does not require per-problem loss derivations and argue that it generally achieves decision-awareness (i.e., works well for most down-stream tasks). The high-level idea of the proposal is follows: 1. From the training data of $(x_i,y_i)$ pairs, construct a distribution $p_i$ over $\mathcal{Y}$ for each example. The distribution $p_i$ is a product distribution over the coordinates of $\mathcal{Y}$ where each coordinate's marginal distribution is supported on a finite set of values. The support of each marginal distribution and the weights associated with each value are determined from the training data by incorporating the data manifold as well as the decision objectives. 2. Next, train a two-branch neural network to simultaneously predict the support and weights from the context $x$. The authors then carry out experiments comparing their proposed method against several baselines on two separate tasks with both real-data and simulated data. The experiments show that the proposed method works well on these tasks. The problem studied by the paper is interesting, the approach seems quite different from prior work and is innovative and interesting. The experimental results are somewhat limited (only two decision tasks) but show promise for the proposed approach. At a high level, the label enrichment process described in the paper makes intuitive sense. However, at the same time, there are no formal guarantees or arguments suggesting that the approach will always result in predicted conditional distributions over problem parameters that work well for the down-stream decision task. While a formal guarantee is not required, the experiments section is limited to two decision tasks, so it remains unclear if the proposed approach would continue to work well across a wide range of tasks. My main concern with the paper is further justification for the details of the approach, either with theory arguing that the specific approach will capture important problem-specific structures, or with a broader experimental evaluation. To give one example of a situation where the proposed approach might go wrong, suppose that the dataset of $(x,y)$ pairs has the property that every pair it contains appears at least $M$ times (so that there are many duplicates of every example). In this case, the set of top-$M$ neighbors that have maximal transferrability to a given $(x,y)$ pair (defined on line 242) will be the $M$ copies of $(x,y)$. As a result, the support for each of the marginal distributions over the coordinates of $y$ will contain a single value: the one that was present in $y$. After this, my understanding is that the label enrichment process will associate each training $(x,y)$ pair with a distribution $p_i$ that is a point mass on $y$, which seems to undermine the goals of the process. While this is an extreme example, it seems that softer versions of this could cause the LDL method to behave poorly. A second weakness (which is acknowledged by the authors in the conclusion) is that LDL always fits a product distribution for $y$, ignoring any correlations between the problem parameters. In some cases a product distribution might work well enough, but it seems like this is a significant simplifying assumption. 1. In equation 7 distances are measured according to a distance metric $d$, but in equation (8) you switch to using norm notation. Is the norm meant to be a different measure of distance, or are these the same distance? 2. How is the specific form of the objective in equation 10 motivated? Intuitively it makes sense to find distributions that are similar both for other data points that are similar either in their context $x$ or where their problem parameters $y$ have exchangeable optimal decisions, but it seems like there are many choices. 2. In the special case where the training data contains $M$ identical copies of each $(x,y)$ pair, does the label enrichment process result in $p_i$ being a point mass on $y_i$? Fully human-written
How to Teach Label to Understand Decisions: A Decision-aware Label Distribution Learning Framework Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes a decision‑aware Label Distribution Learning (LDL) pipeline for contextual stochastic optimization (CSO). It constructs a decision‑aware similarity from optimization transfer costs to build per‑sample discrete label supports, estimates mixture weights via a manifold objective combining feature and task graphs, and trains dual‑branch networks with an MMD loss to predict mixture positions and weights. Joint distributions are factored over marginals; downstream decisions minimize expected cost over the learned discrete mixtures. The authors perform empirical evaluation on a multi‑item newsvendor and a small quadratic network flow and report lower regret than baselines. - The design of the framework seems novel - Limited information on experimental protocol (especially on how baselines were trained and applied) raises concerns about reproducibility. - It seems to me that the paper lacks some important baselines from families of differentiable solvers, direct task loss optimization with some soft surrogates, etc. - Random 80:20 split on newsvendor? Is it, in fact, time-series data? Is there potential leakage/test contamination possible here? - NIT: only plots, without reporting mean/stds in some table. Please see the weaknesses section. Lightly AI-edited
Online Test-Time Adaptation in Tabular Data with Minimal High-Certainty Samples Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. - This paper introduces OT3A (Online Test-Time Tabular Adaptation), a framework designed for test-time adaptation in tabular domains—a setting largely overlooked. - OT3A dynamically adapts models during inference using high-confidence pseudo-labels to correct label shift and entropy minimization to handle covariate shift. - The method targets real-world tabular challenges such as class imbalance and multi-type shifts, showing consistent improvements across benchmarks like HELOC, Voting, and Diabetes. - This paper tackles online test-time adaptation specifically for tabular data, an underexplored and practically important problem in fields like healthcare and finance. - Combines pseudo-labeling for label shift and entropy minimization for covariate shift in a unified and interpretable way. - This paper also addresses real-world issues such as class imbalance and complex mixed-type shifts, improving model reliability in deployment scenarios. ### **Motivation** - The idea of test-time adaptation for tabular data is not new. Prior works such as TabLog [1], FTAT [2], AdapTable [3], and [4] have already investigated similar directions. The current paper fails to cite AdapTable despite its clear conceptual overlap, raising concerns about potential plagiarism. - Moreover, the proposed components—uncertainty estimation, label distribution correction, and entropy minimization—have all been explored in existing tabular TTA frameworks (e.g., AdapTable, FTAT, AdaTab). In particular, using entropy minimization for covariate shift and label-shift calibration for distribution correction is standard practice in many previous works, and thus provides no technical novelty here. - The paper also claims novelty in handling the co-existence of covariate and label shifts, but this is a well-known issue already discussed multiple times in the vision TTA literature. The contribution in this respect is incremental at best. --- ### **Methodology** - The proposed method overlaps heavily with AdapTable. The use of the maximum difference between the top two logits as an uncertainty measure is identical, and the label distribution correction in line 198 replicates the same mechanism. These similarities appear to go beyond inspiration and verge on textual or conceptual plagiarism. - Additionally, the use of Euclidean distance to measure differences between high-dimensional tabular features is inappropriate. Tabular data often mix heterogeneous feature types and scales, making Euclidean metrics mathematically and practically meaningless without normalization or feature weighting. --- ### **Experiments** - All experiments are performed only on binary classification datasets from TableShift. This setup is insufficient to validate the method’s effectiveness. Binary tasks make label-shift correction artificially easy, since biasing predictions toward the majority class can inflate accuracy. Evaluation on multi-class datasets is essential to demonstrate robustness. - The reported batch sizes (250–2000) are unrealistic for tabular adaptation experiments. Standard batch sizes rarely exceed 128 or 256. Using such large batches distorts learning dynamics and makes results incomparable to prior baselines. - The component-wise ablation is also weakly analyzed. Several ablations even outperform the full method, suggesting the model’s design lacks internal consistency. --- ### **Writing** - The notation of the Affinity Matrix (\hat{s}_{ij}) contains a typo. - The ablation table misuses bold formatting, highlighting configurations that are not the best-performing. --- ### **References** - [1] Ren et al. "Test-Time Adaptation for Tabular Data Using Logic Rules." ICML, 2025. - [2] Zhou et al. "Fully Test-time Adaptation for Tabular Data." AAAI, 2025. - [3] Kim et al. "AdapTable: Test-Time Adaptation for Tabular Data via Shift-Aware Uncertainty Calibrator and Label Distribution Handler." NeurIPS Workshop on Table Representation Learning, 2024. - [4] Zeng et al. "LLM Embeddings Improve Test-time Adaptation to Tabular Y|X-Shifts." NeurIPS Workshop on Table Representation Learning, 2025. * How exactly was Figure 1 computed? The paper provides no details on the data or metric used to construct this figure. Moderately AI-edited
Online Test-Time Adaptation in Tabular Data with Minimal High-Certainty Samples Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. * Tabular datasets often exhibit distribution shifts (label/covariate shift, class imbalance) when deployed in real-world environments, causing significant drops in model performance. Existing TTA methods from vision and NLP domains are suboptimal when applied to tabular data. * Proposes OT3A method - which selects a subset of locally consistent and highly confident subset of test data. Then, it is calibrated, by aligning the estimated label distributions between source and target. * Evaluated upon multiple dataset, showing performance gains upon prior methodologies. . Questions regarding core contribution * Most of the statements made in the paper align with AdapTable(https://arxiv.org/abs/2407.10784). I cannot see the difference in its statements made in 2.2 Problem Analysis, showing the exact same findings once more. Questions regarding novelty of proposed method. * This also tracks shifted label distributions using high-confidence predictions, and uses acovariance matrix to correct the bias in label distribution estimation. Why does it not compare it with this methodology? (https://arxiv.org/html/2412.10871v1). Utilization of uncertainty is a well-founded method(mixture of it as well). Are the authors claim solely rise from the application of these methods to the tabular domain? Questions regarding source model used. * The AdamW optimizer is used with a learning rate of 0.01 and a weight decay of 0.01. -> this seems to be an EXTREME fixation upon the source model. I wish the authors could train a variety of source models, select one based upon the best validation performance and use it for test-time adaptation. * I believe most of the performance gains are due to the mitigation of different label distributions. for a good comparison, the authors should train their source models upon a "balanced" dataset, and log their performance gains. For datasets where minorities are under-represented, there are a multiple serires of upsampling techniques to ensure balanced training. (https://arxiv.org/abs/2012.01696) refer to above. Fully human-written
Online Test-Time Adaptation in Tabular Data with Minimal High-Certainty Samples Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper studies online test-time adaptation (OTTA) for tabular classification and proposes OT3A, which (i) selects consistency-confidence points (C2P) via per-class confidence quantiles and local agreement, (ii) estimates/corrects label shift by combining batch predictions with label propagation over an affinity graph, and (iii) adapts to covariate shift using pseudo-label self-training plus entropy minimization; updated per incoming batch. Experiments on six Tableshift datasets with MLP/TabTransformer backbones show sizable gains in balanced accuracy and macro-F1 over several TTA baselines. 1. Consistent empirical gains across multiple datasets/architectures. 1. Missing tabular-TTA baselines and prior art that already tackle label+covariate shift. The paper compares against generic TTA methods (PL, TTT, TENT, EATA, SAR, LAME, ODS) but omits tabular-specific TTA that are close in spirit: AdaptTable [1], FTAT [2]. 1. Novelty claims around “co-existence of label & covariate shift + class imbalance” are not new. Both AdapTable and FTAT emphasize exactly these phenomena for tabular TTA and build methods to handle them; the current paper should reposition its contribution as a specific C2P-driven estimation/propagation strategy rather than the first to recognize the setting. Minor weaknesses 1. The label shift correction method requires source label distribution. 1. Figure axis legends have small font sizes. [1] Kim, Changhun, et al. "Adaptable: Test-time adaptation for tabular data via shift-aware uncertainty calibrator and label distribution handler." arXiv preprint arXiv:2407.10784 (2024). [2] Zhou, Zhi, et al. "Fully test-time adaptation for tabular data." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 39. No. 21. 2025. 1. What is the rationale of the loss function design, consisting both CE and Entropy loss? 1. Would this method applicable to vision dataset with label+covariate shift? Minor 1. Line 180: Is this a typo? "where τc is the value corresponding to the *τc quantile* in the consistency distribution" Lightly AI-edited
Online Test-Time Adaptation in Tabular Data with Minimal High-Certainty Samples Soundness: 1: poor Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposed a Test-time adaptation method named OT3A. Based on the analysis of the imbalanced class, it leverages high-confidence and domain-consistent pseudo-labels to estimate and correct for target label distribution shifts. Additionally, it employs self-training and entropy minimization, guided by these confident samples, to adapt the model to the out-of-distribution test data. The experiment shows the effectiveness of OT3A 1. The analysis 2 is interesting and gives insights into tabular machine learning. 2. The experimental design is comprehensive and well-structured 1. This paper lacks novelty. Here is the detail: - In the introduction section, line 42, this conclusion is proposed early in Observation 4 in [1]. - In the preliminary section, lines 85-88 represent the same as section 2.1 in [1]. Analysis 1, from lines 103 to 130, has already been reported in previous studies, and the analyses, experiments, and descriptions in this paper are consistent with prior research ( [1] Observation 1). - In the method part, the usage of the neighbor message and the correction of prediction in this study has been mentioned and applied in previous research [1] [2]. This paper only makes some minor adjustments in specific details, while the overall methodological framework remains similar to prior studies. **The description is also similar to earlier works (line 186-188 is similar to section 3.1 in [1], line 195-201 is similar to section 3.3 in [2] ), indicating limited novelty.** - **The paper shows a high degree of similarity to previously published work, with portions of the text being identical, which lacks references.** Here are the details besides the ones mentioned above: a. Line 263-267 is similar to section 4.1-Evaluation Protocol in [1]. b. Line 274-280 is similar to section 4.1 in [2] c. Line 698-715 **is the same as** E.3 in [2] d. Line 726-731 **is the same as** E.1 MLP in [2] --- 2. Some parts of the paper are not clearly articulated. For example, in lines 60–64, the discussion shifts directly from *high/low-entropy samples* to *high-consistency samples* without explaining the connection between them. You only mention that *low-entropy samples* are beneficial for model convergence, but it remains unclear how *low-entropy samples* relate to *consistency samples*. --- 3. The symbol is not clear. It is hard to follow the methods with unclear mathematical symbols. For example, in section 3.2 line 224, what does S mean, and how does S come from? In addition, it is not clear whether kNN and RBF function jointly or independently in the proposed method. --- In summary, this paper has a high similarity to previously published work, which lacks novelty, not meet the standard of ICLR. --- [1] Zhou Z, Yu K Y, Guo L Z, et al. Fully test-time adaptation for tabular data. In: AAAI 2025 [2] Kim C, Kim T, Woo S, et al. Adaptable: Test-time adaptation for tabular data via shift-aware uncertainty calibrator and label distribution handler. In: NeurIPS Workshop on Table Representation Learning (NeurIPSW-TRL), 2024 See Weakness above Fully human-written
Don't Trust any Distilled Dataset! Model Hijacking with the Fewest Samples Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes a novel model hijacking method that leverages dataset distillation to embed malicious hijacking tasks into victim models using very few samples. The method employs a U-Net–based Transporter to generate visually and semantically blended “osmosis samples” between benign and hijacking datasets, followed by a distillation stage that compresses these samples while preserving hijacking effectiveness through key patch selection, label reconstruction, and training trajectory matching. Extensive experiments across CIFAR-10/100, MNIST, SVHN, and Tiny-ImageNet demonstrate high attack success rates (ASR) (>95%) while maintaining high utility (model accuracy on original task). The authors argue that this exposes new security risks when using third-party synthetic datasets in transfer learning. - This work brings attention to a real and under-explored risk in the rapidly growing use of third-party distilled datasets for transfer learning. Highlighting this vulnerability is of clear significance to the community. - The two-stage Osmosis + Distillation pipeline is conceptually elegant and technically plausible, combining semantic-visual embedding with trajectory matching for stealthy attack transfer. - Systematic experiments are performed on multiple common datasets (CIFAR-10, CIFAR-100, MNIST, SVHN, Tiny-ImageNet) and two architectures (ResNet18, VGG16), examining not just the attack’s effectiveness, but also its stealth, sample efficiency, and transferability. - The ablation (see Figure 6) concretely demonstrates the value of their trajectory loss for improving attack potency without degrading benign accuracy. 1. The paper raises a critical alarm about model hijacking but does very little to discuss possible detection or defense strategies, mitigation mechanisms, or even basic analyses on how one could screen for osmosis/hijacking samples. This significantly lessens its practical impact for practitioners. 2. Only CAMH (He et al., 2025) is compared. Other modern data poisoning, model hijacking, or backdoor-in-distillation baselines (e.g., Liu et al., NDSS 2023) are missing, which weakens claims of superiority. 3. For Eq. 6, $\phi_{h}$ is vaguely described as a “human observer” but this is undefined and impractical; the implementation details of $\phi_h$ and whether it is simulated, learned, or manual are absent. 4. The paper lists relevant prior work (especially Salem et al. NDSS 2022, Si et al. USENIX 2023), but does not cite or discuss two recent, closely related studies: - Chung et al. (2024), “Rethinking Backdoor Attacks on Dataset Distillation: A Kernel Method Perspective” — this offers a theoretical basis for understanding dataset distillation vulnerabilities and should be discussed in Section 2 and referenced when motivating the kernel/matching used here. - Ge et al. (2024), “Hijacking Attacks against Neural Network by Analyzing Training Data” — this study’s attack methods and analysis of attack mechanisms are relevant benchmarks and should be positioned against OD. 1. Is there any baseline detection mechanism (e.g., outlier detection, dataset forensics, pattern analysis) that can partially or fully mitigate the OD attack? Have the authors attempted any preliminary analysis (visual/algorithmic) of the osmosis samples’ separability from real data? 2. In Eq. 6, what precisely constitutes $\phi_h$? Is it a perceptual metric, a learned classifier simulating human preference, or a simulated metric? Please provide implementation details. 3. For fine-grained datasets (e.g., CIFAR-100, Tiny-ImageNet), what is the minimum number of per-class hijacking samples required to achieve a given ASR, and how does it scale with class count? Can the authors offer any theoretical or empirical guidance? Fully AI-generated
Don't Trust any Distilled Dataset! Model Hijacking with the Fewest Samples Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. In this manuscript, the authors introduce Osmosis Distillation (OD) attack, a novel model hijacking attack method. OD attack integrates model hijacking with dataset distillation, leveraging distilled osmosis samples to significantly reduce the requirement of poisoned samples. The authors' experiments demonstrate that OD attack can successfully execute the hijacking task while minimizing the impact on the performance of the original task. - Leveraging dataset distillation enables significant gains in both the efficiency and the stealthiness of the attack. - The comparison in Fig. 3 may not be fully suitable, as CAMH is not designed to optimize attack success under extremely low-sample situation. I recommend the authors additionally report CAMH’s best achievable performance. This would help clarify whether the proposed method achieves efficiency at the cost of attack success or model utility. - All experiments are restricted to ResNet and VGG16. This narrow architectural scope weakens claims about transferability and generality. - The manuscript does not report the time cost of dataset distillation. While the proposed method substantially reduce the requirement of sample size at the attack stage, it is unclear how much additional cost is required during the training stage. - It is unclear how the method handles cases where the hijacking dataset and the original dataset have different class numbers, especially for hijack has more classes than the original. - The proposed method can ensure stealthiness, as distilled osmosis samples exhibit a high degree of visual similarity to original samples. However, beyond human visual inspection, are there potential defense mechanisms for detecting such samples? For example, could OOD detectors, feature-space anomaly detection, or confidence/energy-based scoring be effective in filtering distilled or hijacking samples? - Would stronger fine-tuning stragies beyond vallina cross-entropy (e.g., alternative loss functions, label smoothing, data augmentation) can reduce attack success rate? Please see the weaknesses. Fully human-written
Don't Trust any Distilled Dataset! Model Hijacking with the Fewest Samples Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces a novel model hijacking attack method called Osmosis Distillation. The proposed approach combines dataset distillation with data poisoning–based model hijacking, enabling the embedding of malicious tasks into benign distilled datasets using only a few attack samples. The OD attack consists of two stages. In the Osmosis stage, a U-Net-based Transporter model fuses the information of original and hijacking samples. By jointly optimizing visual loss and semantic loss, the Transporter generates osmosis samples that are visually similar to the original samples but semantically close to the hijacking samples. In the Distillation stage, key patch selection, soft label reconstruction, and training trajectory matching are applied to the osmosis samples to obtain a compact yet high-fidelity Distilled Osmosis Dataset. When a victim model is fine-tuned on this DOD, it performs normally on the original task but activates the attacker-defined hidden task when encountering specific malicious inputs. Experimental results show that OD achieves over 97% attack success rate on CIFAR-10, CIFAR-100, SVHN, MNIST, and Tiny-ImageNet, while maintaining nearly unchanged performance on the original task, outperforming the current state-of-the-art CAMH attack. - The paper is the first to identify a realistic and novel threat model by combining dataset distillation with model hijacking, exposing previously unexplored vulnerabilities in the distillation pipeline. - Extensive experiments across multiple datasets and architectures (e.g., ResNet-18, VGG-16) demonstrate that the OD attack achieves very high Attack Success Rates while leaving original-task accuracy virtually unchanged, indicating strong stealthiness. - Introduces and validates a novel loss term that, for the first time, uses training-trajectory alignment as an explicit loss in the context of dataset distillation with embedded malicious tasks. This causes victim models fine-tuned on the distilled data to follow learning trajectories similar to those produced by the original osmosis samples, thereby concealing the attack. - The OD attack requires only a few attack samples to succeed, highlighting that the attack remains effective in resource-constrained settings and substantially lowers the cost and barrier for real-world exploitation. - Experiments are conducted only on small-scale vision datasets; the paper lacks evaluation on larger-scale datasets (e.g., ImageNet), leaving questions about scalability and real-world applicability. - The attack model assumes the victim will use a distilled dataset produced or published by the attacker, which may be an optimistic assumption in some practical deployments. - Although the paper exposes a real security threat, it does not provide a thorough investigation of detection or mitigation strategies for this class of attacks. -The paper does not experimentally demonstrate whether OD can evade or be detected by common poisoning detectors, leaving open how resilient the attack is to standard defenses. - The paper does not analyze what specific semantic information is embedded in the osmosis samples or how that information propagates through the network during fine-tuning; it also lacks feature-level visualizations that could explain the underlying mechanism. - Could the authors evaluate OD attack using established backdoor detection techniques to assess whether it can be detected by existing defenses? - Can OD attack be extended to larger-scale or non-vision domains such as language or diffusion models? If so, how does its performance and stealthiness vary across modalities? - If the Distilled Osmosis Dataset is combined with a fraction of genuine training data (e.g., 10 % real + 90 % DOD), does the Attack Success Rate drop significantly? - What are the ablation results for patch size, N (number of key patches per class), and the image-concatenation strategy? How do these hyperparameters influence the DOD size and the overall attack effectiveness? Fully AI-generated
Don't Trust any Distilled Dataset! Model Hijacking with the Fewest Samples Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces OD (Osmosis Distillation) attack, which embeds an additional “hijacking” task into the victim model while preserving original-task utility. The method relies on visual + semantic losses during generation, patch-based distillation with soft labels, and trajectory-matching to preserve the hijacking signal in a very small distilled set (e.g., IPC ≈ 50). 1. This paper is well-organized and easy to read. 2. This paper enhances the stealth of the attack by reducing the amount of data and maintaining high utility on source tasks. 3. The experimental results demonstrate that the attack transferability across different model architectures within the same task. 1. The manuscript does not clearly bridge the gap between “model contains a hidden classification ability” and “this constitutes a practical, exploitable real-world threat.” As a result, the severity of the threat in practice remains speculative. Specifically: * The paper argues stealthiness because the original task utility remains high, but leaves open how an attacker benefits operationally from a model that silently encodes a second task (e.g., remote extraction, covert telemetry, or third-party abuse) * Experiments are conducted on common small vision benchmarks (CIFAR/MNIST/SVHN/Tiny-ImageNet). While these validate feasibility, they do not demonstrate the attack at scales or task modalities where legal/ethical harm would actually occur. 2. The OD formulation and experiments assume that the hijacking task shares the same (or mappable) output space as the source task, raising two concerns: * The attack is primarily demonstrated for same-type tasks (classification → classification) and requires an explicit mapping between source and hijack labels. The paper does not experimentally or analytically evaluate the difficulty of cross-task attacks (e.g., classification → object detection, regression tasks), where output dimensionality or semantics differ. The extent to which OD generalizes across heterogeneous task types (and architectures) is therefore unclear. * When the hijacking label space is larger than the source label space, the paper does not provide a principled mapping or capacity analysis, which leaves open whether OD is practical for richer hijacking tasks. In general, without a clearer exploitation narrative or a demonstration on a more realistic downstream task, I consider this work an interesting proof-of-concept rather than evidence of an immediate, practical security crisis. And the application limitations reduce the generality of the claims. If OD only works for a narrow class of same-task, same-output-space scenarios, that should be stated more clearly. None Fully AI-generated
Trade-off in Estimating the Number of Byzantine Clients in Federated Learning Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 6: marginally above the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper provides a systematic theoretical analysis of the impact of estimating the number of Byzantine clients in Federated Learning (FL). The authors consider a scenario where an aggregator is chosen with a robustness parameter $\hat{f}$ (the estimated maximum number of Byzantine clients it can tolerate), while the true number is $f$. Their key contributions are: Underestimation Risk: They rigorously prove that underestimation ($\hat{f} < f$) can lead to arbitrarily poor performance and divergence, even under favorable conditions like the Polyak-Łojasiewicz (PL) inequality. Minimax Optimal Bounds for Non-Underestimation: For the non-underestimation case ($\hat{f} \geq f$), they establish matching lower and upper bounds (i.e., minimax optimal rates) for both the aggregation error and the convergence rate of the Federated Robust Averaging (FedRo) algorithm. These bounds are proportional to $\frac{\hat{f}}{n - f - \hat{f}}$. Fundamental Trade-off: This bound reveals a fundamental trade-off: while an aggregator with a larger $\hat{f}$ can tolerate a wider range of attacks (any $f \leq \hat{f}$), its performance deteriorates when the actual number of attackers $f$ is small, as the error bound increases with $\hat{f}$. Optimal Algorithm: They propose a novel composite aggregator ($\hat{f}$-Krum $\circ$ $\hat{f}$-NNM) that is proven to achieve the order-optimal upper bound without prior knowledge of the true $f$. Empirical Validation: Experiments on CIFAR-10 validate the theoretical trade-off, showing performance collapse when $\hat{f} < f$ and performance degradation as $\hat{f}$ increases beyond $f$. Novel Problem Formulation: Systematically studying the effect of the estimated number of Byzantine clients ($\hat{f}$) is a highly original and important direction. Theoretical Completeness: The paper provides a complete minimax analysis, with tight lower and upper bounds for both aggregation error and algorithm convergence, under both underestimation ($\hat{f} < f$) and non-underestimation ($\hat{f} \geq f$) scenarios. The bounds, proportional to $\frac{\hat{f}}{n-f-\hat{f}}$, are rigorously derived. Practical Relevance: The theoretical trade-off is clearly demonstrated and validated through experiments on a standard benchmark (CIFAR-10), connecting theory with practice. Algorithmic Insight: The analysis of the composite aggregator ($\hat{f}$-Krum $\circ$ $\hat{f}$-NNM) provides a constructive method to achieve the order-optimal bound. Computational Complexity of Optimal Aggregator: While the composite aggregator ($\hat{f}$-Krum $\circ$ $\hat{f}$-NNM) is theoretically optimal, it is computationally expensive. The per-round cost involves nearest neighbor searches for all clients, which may be prohibitive for very large-scale systems or high-dimensional models. The paper does not discuss efficient approximations or the practical scalability of this aggregator. Limited Empirical Scope: The experiments, while supportive, are limited to one dataset (CIFAR-10), one model (ResNet-20), and one type of attack (Gaussian noise). Broader experimentation with more datasets (e.g., CIFAR-100, FEMNIST), model architectures, and sophisticated Byzantine attacks (e.g., label-flipping, backdoor) would strengthen the empirical validation. Assumption of PL Condition: The convergence analysis for the upper bound (Theorem 5) relies on the Polyak-Lojasiewicz (PL) condition to achieve a global convergence rate. While the PL condition holds for some machine learning problems, it is a relatively strong assumption. Discussing the plausibility of this assumption in FL settings or exploring convergence under weaker conditions (e.g., just smoothness) would be valuable. The theoretically optimal aggregator ($\hat{f}$-Krum $\circ$ $\hat{f}$-NNM) appears computationally heavy for large $n$ and $d$, involving $O(n^2 d)$ operations per round. What are your thoughts on developing more computationally efficient aggregators (e.g., using approximate nearest neighbor search or sampling) that can still preserve, or nearly preserve, the theoretical guarantees? Would such approximations fit into your current theoretical framework? Your experiments use a simple Gaussian noise attack. How do you expect your theoretical findings and the observed trade-off to hold under more complex and adaptive Byzantine attacks, such as those designed to mimic honest updates or perform model poisoning? Could such attacks potentially alter the established lower or upper bounds, for example, by affecting the heterogeneity constant $G^2$? The convergence upper bound (Theorem 5) is derived under the PL condition, leading to an asymptotic error floor of $\mathcal{O}\big(\frac{\hat{f}G^2}{n-f-\hat{f}}\big)$. Can you provide any intuition or discussion on whether the core trade-off—that the asymptotic error floor scales with $\frac{\hat{f}}{n-f-\hat{f}}$—would still hold in non-PL settings, perhaps under different or weaker assumptions? Fully AI-generated
Trade-off in Estimating the Number of Byzantine Clients in Federated Learning Soundness: 4: excellent Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. The paper studies Byzantine-robust federated learning when the server only knows an estimate \(\hat f\) of the number of Byzantine clients (the true number is \(f\)). Main takeaways: (i) if we underestimate (\(\hat f < f\)), aggregation / training can become arbitrarily bad; (ii) if we do not underestimate (\(\hat f \ge f\)), the best-possible error (and convergence floor) scales like \[ \kappa = \Theta\!\left(\frac{\hat f}{\,n - f - \hat f\,}\right), \] with matching lower/upper bounds and a simple construction that achieves this rate up to constants. - Clear split between two regimes: underestimation can be catastrophic; otherwise there is a tight, quantified floor. - Matching lower and upper bounds - The dependence on both \(f\) and \(\hat f\) is explicit, which clarifies the cost of being conservative. - The underestimation impossibility is not really new: it is closely tied to the **breakdown point** in robust statistics. If the true contamination $f/n$ exceeds the method's breakdown (here essentially $\hat f/n$), you can get unbounded error. So this part feels more like transferring a known concept, rather than introducing a new phenomenon. - Novelty in the *order* of the bound is modest. Away from edges, it matches the trivial (previously known) bound up to constants. Precisely, for a constant $c \in (0, \tfrac{1}{2})$ such that $$ \hat f \le \big(\tfrac{1}{2} - c\big)\,n \quad \text{and} \quad f + \hat f \le (1 - c)\,n, $$ both denominators are $\Theta(n)$, and $$ \frac{\hat f/(n - f - \hat f)}{\hat f/(n - 2\hat f)} \;=\; \frac{n - 2\hat f}{\,n - f - \hat f\,} \;=\; \frac{1}{\,1 + \frac{\hat f - f}{\,n - 2\hat f\,}\,}. $$ This ratio stays bounded between two positive constants (and $<1$ when $\hat f > f$), so the two expressions are of the *same order*. - They only separate near the **edges**: - $f + \hat f \to n$: the paper’s denominator $n - f - \hat f$ collapses $\Rightarrow$ larger floor than the trivial count. - $\hat f \to n/2$: the trivial denominator $n - 2\hat f$ collapses $\Rightarrow$ trivial bound blows up sooner. - **Underestimation** $\hat f < f$: the paper (consistent with breakdown) shows no finite guarantee, while the trivial bound misleadingly remains finite. - Experiments could be broader; I appreciate the paper is from a theoretical nature so this is a minor issue for me. - Since the underestimation result follows the breakdown-point logic, what is the extra conceptual significance here? see weaknesses Heavily AI-edited
Trade-off in Estimating the Number of Byzantine Clients in Federated Learning Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper investigates the effect of underestimation and overestimation of the number of Byzantine clients and highlights the importance of accurately estimating the Byzantine size. The authors theoretically prove and empirically validate that underestimation can lead to divergence under Byzantine attacks, while overestimation results in a trade-off: a larger estimated Byzantine size enhances robustness under Byzantine attacks but degrades performance when there are no or fewer Byzantine clients. 1. The investigation of the impact of under- and over-estimating the number of Byzantine clients in the context of Byzantine attacks is a relevant and important topic for the field of Byzantine-robust federated learning. 2. This paper is well-written, clear, and theoretically sound. 1. The theoretical results in Theorems 1 and 3 do not fully support the conclusion that underestimation leads to arbitrarily large aggregation and convergence errors, as they apply only to a specific robust aggregator, not all such aggregators. If a different robust aggregator is used, the effect of underestimation may not hold. This represents a notable limitation of the paper. Can the authors generalize the results in Theorems 1 and 3 to all robust aggregators to address this issue? 2. There is a discrepancy between the theoretical analysis and the empirical results, as the former is conducted in a deterministic setting, while the latter is based on a stochastic setting. This gap is significant, as stochastic noise can substantially affect the performance of FedRo and lead to different convergence properties. Can the authors extend the analysis to address this gap? 3. In Table 1, the robustness coefficient $\kappa$ for the combination of $\hat{f}$-Krum and $\hat{f}$-NNM matches the lower bound in order. Why does the authors not use that aggregator in experiments? The reviewer would like to see the results of using $\hat{f}$-Krum $\circ$ $\hat{f}$-NNM in experiments. 4. In Line 48, the authors claim that "Existing works typically require estimating the actual number $f$ or the fraction of the Byzantine clients to select the maximum number $\hat{f}$ that the aggregator can tolerate." To my knowledge, the tolerable maximum number of Byzantine clients (i.e., the breakdown point) for some robust aggregators is independent of the actual number $f$ or fraction of Byzantine clients, meaning that these terms do not need to be estimated in advance for such aggregators. For instance, $\frac{n}{2}$ for CWMed, $\frac{n-2}{2}$ for CWTM, and $\frac{n}{10}$ for centered clipping, with respect to the breakdown point. The authors should revise this statement to make it more precise and rigorous. 5. In line 237, the authors claim that "since $\kappa$ of the commonly used aggregators like GM, CWTM, CWMed, and Krum do not match the lower bound even in the simple special case of $f = \hat{f}$." This statement is incorrect. As noted in Remark 2 of Allouah et al. (2023), CWTM does match the lower bound in order. The reviewer suggests that the authors carefully verify this statement. 6. There are several typos in this paper. For instance, in line 225, $\frac{\hat{f}}{n - 2f}$ should be $\frac{\hat{f}}{n - 2\hat{f}}$, and in line 226, $\frac{\hat{f}}{n - 2\hat{f}}$ should be $\frac{\hat{f}}{n - \hat{f}}$. The authors are advised to carefully review the paper to address these and prevent any similar issues. My detailed questions are outlined in the section above; please refer to them. If the authors can fully address my concerns, I will consider adjusting my score accordingly. Lightly AI-edited
Explicitly Bounding Q‑Function Estimates for Offline-to-Online Reinforcement Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a new method, BOTO, for offline-to-online (O2O) reinforcement learning. To address the problem of Q-value misestimation due to OOD actions, the authors propose to explicitly bound Q-value estimates across the entire action space, which is achieved by injecting structured noise into the dataset actions. This process mitigates both overestimation and underestimation and improves the performance on standard O2O RL benchmarks. - Q-value misestimation is a well-known limitation of O2O fine-tuning. The motivation is clear and the proposed $\alpha$-NAMDP makes sense. - The results look good, the proposed method performs more stable compared to other baselines. - The idea of injecting noise into the dataset actions is native. An illustration of how it mitigates Q-value misestimation should be provided, e.g., through a toy example or some ablation experiments. - The tunable parameter $\alpha$ is so important and sensitive to this framework. I found from the Table 2 that you choose different $alpha$ values for different tasks using grid research (from -1 to 1), which looks like this parameter is over-tuned. It would be good to have a detailed analysis or ablation study on the $\alpha$. - The model is evaluated on few D4RL benchmarks. However, as in the CQL paper, the proposed method should also be evaluated on diverse datasets of the same task. For example, 1) the "-random", "-expert" and "-medium" in Gym domains; 2) "-umaze", "-medium", and "-large" settings in AntMaze. - The ablation experiments on warmup phase is missing. - How do you get the illustration of Q-value misestimation cases in Figure 1 (from a real task or a synthetic dataset)? It would be better to provide more details. Fully human-written
Explicitly Bounding Q‑Function Estimates for Offline-to-Online Reinforcement Learning Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper targets the Offline-to-Online (O2O) RL setting, where value misestimation for out-of-distribution (OOD) actions inherited from the offline phase can derail early online fine-tuning. It proposes BOTO, which injects structured noise into actions while adjusting the Bellman targets to explicitly bound Q-values across the entire action space. A tunable parameter αtrades off optimism vs. conservatism, yielding an α-Noisy Action MDP interpretation with theoretical bounds on Qfor both in-distribution and OOD actions. Empirically, the method improves stability and final performance on D4RL domains (AntMaze, Kitchen, Adroit) under a Warm Start RL protocol. The paper formalizes noise-perturbed Q-learning targets and proves equivalence to learning in an α-NAMDP, including upper/lower bounds on Q that cover OOD actions The single parameter αgives a transparent way to bias OOD estimates and thereby shape early online behavior. Figure 3 and Theorem 2 illustrate the induced bounds. On AntMaze/Kitchen/Adroit, learning curves and Table 1 show faster adaptation and stronger final success rates vs. CalQL, CQL, IQL, RLPD, and WSRL. Limited novelty. The core idea closely follows and extends Offline RL with Penalized Action Noise Injection (Oh & Lee, 2025) by adding the αbias control; the paper should more sharply articulate what is genuinely new in analysis or practice beyond that prior. The conclusion acknowledges that choosing the bias controller may require per-environment tuning, which could undercut the algorithm-agnostic appeal. In what ways (theory or practice) does BOTO substantively differ from penalized action-noise injection beyond introducing the $\alpha$ controller and the $\alpha$-NAMDP lens? Fully AI-generated
Explicitly Bounding Q‑Function Estimates for Offline-to-Online Reinforcement Learning Soundness: 3: good Presentation: 3: good Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents an algorithm-agnostic approach to address challenges in the O2O reinforcement learning paradigm. In O2O RL, agents are pretrained on static datasets and refined through limited online interactions, but this process is hindered by Q-value misestimations for OOD actions, leading to underestimation or overestimation that destabilizes fine-tuning. The proposed method, BOTO, injects structured action noise during a pre-fine-tuning phase to regularize Q-values across the entire action space. This is formalized via α-NAMDP, where a tunable parameter α balances conservatism and optimism in Q-value bounds. Key contributions include: - Introduction of a Q-value bounding technique through action noise injection that explicitly regularizes OOD actions while mitigating both underestimation and overestimation. - Theoretical analysis establishing equivalence to Q-learning in the α-NAMDP and deriving explicit bounds on Q-value estimates. - Empirical demonstrations on standard O2O RL benchmarks. 1. The writing is logically coherent. A feasible improvement plan is proposed to address the inaccuracy of Q estimation of OOD actions. 2. The BOTO algorithm seems to achieve state-of-the-art performance in experiments compared to other algorithms. 3. The author constructed a reliable theoretical framework for this algorithm: α-NAMDP and proved the global boundedness of Q under this framework. A substantive assessment of the weaknesses of the paper. Focus on constructive and actionable insights on how the work could improve towards its stated goals. Be specific, avoid generic remarks. For example, if you believe the contribution lacks novelty, provide references and an explanation as evidence; if you believe experiments are insufficient, explain why and exactly what is missing, etc. 1. This study appears to have only introduced a hyperparameter from previous research[1], which severely undermines its contribution. 2. The author does not seem to have designed experiments separately to verify the performance of policies in the O2O process, although the author repeatedly emphasizes that their method affects this process. 3. At present, the scope of this algorithm seems to be limited to the O2O process, and it does not appear to benefit subsequent online RL. [1]JunHyeok Oh and Byung-Jun Lee. Offline reinforcement learning with penalized action noise injection. arXiv preprint arXiv:2507.02356, 2025. 1. Apart from introducing hyperparameters, could you further clarify the difference between this study and [1]? 2. Can the experiment contain the entire process of policy performance (Offline Pre-training, O2O, Online fine-tuning)? 3. Directly performing online pre-training on Q also seems feasible. How does this technique compare with online pre-training? Training time or computational cost? [1]JunHyeok Oh and Byung-Jun Lee. Offline reinforcement learning with penalized action noise injection. arXiv preprint arXiv:2507.02356, 2025. Lightly AI-edited
Explicitly Bounding Q‑Function Estimates for Offline-to-Online Reinforcement Learning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses Q-value misestimation in Offline-to-Online Reinforcement Learning (O2O RL) by proposing BOTO (Bounding Q-function Estimates for Offline to Online RL), an algorithm-agnostic method that explicitly bounds Q-value estimates across the entire action space. The core contribution is a regularization approach that injects structured noise into dataset actions during Q-learning prior to online fine-tuning. The authors formalize their approach through the α-Noisy Action MDP (α-NAMDP) framework, showing that their method corresponds to Q-learning in a modified MDP with a tunable bias parameter α that controls the degree of conservatism versus optimism in Q-value estimates. The method consists of three stages: (1) offline pre-training using standard offline RL algorithms, (2) BOTO training that optimizes the Q-function using action noise injection with a warmup dataset, and (3) online fine-tuning using standard online RL algorithms. Empirical evaluations on D4RL benchmarks (Antmaze, Kitchen, and Adroit domains) demonstrate that BOTO outperforms strong baselines including WSRL, CalQL, RLPD, CQL, and IQL by mitigating both underestimation and overestimation issues during the online fine-tuning phase. Principled theoretical framework: The paper provides rigorous theoretical analysis by introducing the α-NAMDP formulation with two key theorems: Theorem 1 establishes equivalence between minimizing the proposed objective and Q-learning in the α-NAMDP, while Theorem 2 derives provable bounds on Q-values across the entire action space, including out-of-distribution (OOD) actions. This theoretical grounding distinguishes the work from heuristic approaches and provides formal guarantees on the boundedness of learned Q-functions. Clear problem motivation and analysis: The paper systematically analyzes the failure modes of Q-value misestimation in O2O RL, distinguishing between global underestimation, local underestimation, and overestimation scenarios. Figure 1 effectively illustrates these failure modes, and the discussion clearly articulates why existing methods suffer during online fine-tuning. This thorough problem characterization strengthens the motivation for the proposed solution. Algorithm-agnostic design: BOTO is designed to work with any offline RL algorithm, which is demonstrated empirically through experiments combining BOTO with both CalQL and CQL as base methods. Table 1 shows consistent improvements regardless of the underlying offline algorithm, highlighting the generality and practical applicability of the approach. This flexibility is valuable for practitioners who may have existing offline RL pipelines. Comprehensive experimental evaluation: The empirical evaluation covers six benchmark tasks across multiple domains (Antmaze, Kitchen, Adroit) and compares against five strong baselines. The results consistently demonstrate that BOTO achieves faster adaptation and higher final performance, particularly in mitigating the "unlearning" phenomenon that causes performance drops at the onset of fine-tuning. The inclusion of standard error bars across five random seeds strengthens the reliability of the results. Effective visualization of the bounding mechanism: Figure 2 provides compelling visual evidence of how BOTO produces bounded Q-values across the action space compared to baseline methods (CQL, CalQL, IQL), clearly showing how different α values modulate the degree of conservatism and optimism. Figure 3 further validates the theoretical bounds derived in Theorem 2 against empirical results in a controlled bandit setting. Limited novelty over prior work. The paper builds heavily on Oh & Lee (2025), extending the Noisy Action MDP to α-NAMDP by introducing a tunable parameter α. While this generalization provides useful control over bias, the core mechanism of action noise injection for Q-value bounding is not novel. The incremental nature of the contribution relative to the base NAMDP framework should be more explicitly acknowledged, and the paper would benefit from a more detailed comparison highlighting what specific advantages the α parameter provides beyond the original formulation. Hyperparameter sensitivity not thoroughly addressed. The paper acknowledges that "selecting an appropriate value for the bias controller may require tuning for specific environments", but provides insufficient guidance on how to set α in practice. Table 2 shows that optimal α values vary substantially across tasks (ranging from -1.0 to 1.0), yet the paper lacks principled strategies for hyperparameter selection. The discretization interval of 0.1 across the range [-1, 1] requires evaluating 21 different values, which could be computationally expensive. A sensitivity analysis or adaptive selection method would strengthen the practical applicability. Insufficient ablation studies. The paper lacks important ablation studies to isolate the contributions of different components. Specifically: (1) What is the individual contribution of the warmup phase versus the BOTO training objective? (2) How sensitive is performance to the warmup dataset size (Dwarmup)? (3) What is the impact of different noise distributions beyond the hybrid noise distribution adopted from Oh & Lee (2025)? (4) How does the BOTO training duration affect results? While Table 2 provides some hyperparameter values, systematic ablation studies would clarify which design choices are critical. Limited theoretical insights on α selection. While Theorems 1 and 2 characterize the learned Q-function under α-NAMDP, they do not provide guidance on how to choose α for a given task. The bounds in Theorem 2 depend on the reward function and value function, but it's unclear how these quantities can be estimated a priori to inform α selection. Developing a theoretical connection between task characteristics (e.g., sparsity of rewards, dataset quality) and optimal α values would significantly enhance the practical utility of the framework. Missing comparisons with recent methods. While the paper compares against several strong baselines, it omits comparisons with other recent O2O RL methods that also address Q-value estimation issues, such as SO2 (Zhang et al., 2024; https://arxiv.org/pdf/2312.07685) which uses perturbed value updates, or ENOTO (Zhao et al., 2024; https://arxiv.org/pdf/2306.06871) which employs Q-ensembles. Including these comparisons would better position BOTO within the current landscape of O2O RL research. Q1: How should α be selected in practice? Could the authors provide more principled guidance or an adaptive method for selecting α? For instance, could α be estimated based on properties of the offline dataset (e.g., coverage, quality, diversity) or initial Q-value statistics? Q2: How does BOTO perform on suboptimal or mixed-quality datasets? All experiments use standard D4RL datasets. How does BOTO behave when the offline dataset is of very poor quality or contains multi-modal behavior policies? Does the method provide robustness advantages in these more challenging scenarios? Q3: Why does BOTO use the offline dataset in addition to warmup data during BOTO training? The paper constructs DBOTO = D_offline ∪ D_warmup, but it's unclear why the offline dataset is retained during the BOTO training phase. What would happen if BOTO training used only D_warmup? Does this design choice relate to preventing distributional shift or maintaining coverage? Q4: Can the bounds in Theorem 2 be tightened? The bounds depend on the infimum and supremum over the support of the dataset distribution. In practice, are these bounds tight, or is there significant slack? Could empirical analysis on the benchmark tasks characterize how tight the bounds are in practice? Q5: How does BOTO compare to ensemble-based methods? Methods like ENOTO use Q-ensembles to address similar issues in O2O RL. What are the relative advantages of the noise injection approach versus ensemble methods in terms of performance, computational cost, and implementation complexity? Fully AI-generated
TBG-Driven Minimization of Noise-Resistant Adaptive Sharpness Awareness Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 0: Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper is poorly written, with many arbitrarily defined names and techniques. The overall flow is difficult to follow for beginners, and the exposition lacks clarity and rigor. While more mature readers can roughly infer the authors’ intended ideas, the lack of precision and structure is already a red flag for a venue such as ICLR. While the topic may be of potential interest, I did not identify any clear technical or empirical strengths in the current version of the paper. This is mainly because the presentation, logical flow, and definitions are unclear, making it difficult to discern the intended contributions. I generally avoid evaluating a paper based on assumptions or guesswork about the authors’ intent. W1. The paper motivates the addition of noise to gradients from a privacy perspective in the Introduction, and adds Gaussian noise to gradients in experiment. However, the proposed algorithm does not actually address privacy concerns. It neither employs standard differential privacy (DP) techniques nor provides any formal privacy guarantees. This leaves adding noise to gradients not well-justified. The authors should consider presenting more convincing scenarios or motivations where gradient noise naturally arises during training to better justify their experimental setup. W2. It is unclear what NRSAM refers to in the experiments. This algorithm does not appear to be introduced or defined anywhere in the paper. Similarly, TBG-NRASAM is mentioned in the first paragraph of Section 5 without explanation. The authors should clearly define these methods and describe how they relate to the proposed approach. Some experiments have TBG-NRASAM, and some not. W3. The presentation does not clearly explain how TBG is applied to NRASAM or why this combination is expected to improve performance. A more detailed and structured description of the integration mechanism and its intended benefits would greatly improve the clarity of this section. W4. The theoretical analysis is difficult to follow, as the problem setting, assumptions, and algorithms are not clearly or consistently explained. The lack of precise definitions and organization makes it challenging to assess the correctness and significance of the theoretical results. I believe that a thorough revision and careful polishing of the writing, organization, and technical exposition could substantially improve the overall quality and readability of the paper. Moderately AI-edited
TBG-Driven Minimization of Noise-Resistant Adaptive Sharpness Awareness Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose a new optimization method, TBG-NRASAM, which is reported to be more robust to Gaussian noise than SAM. In particular, leveraging the time-based generator theorem, they reuse the gradient of the previous weights to enhance robustness against Gaussian perturbations. The experimental results appear to support the effectiveness of the proposed algorithm under Gaussian noise. The idea of combining a time-based generator with sharpness-aware minimization is interesting and, to the best of my knowledge, has not been explicitly explored in prior work. Moreover, it is noteworthy that the proposed method enhances robustness to Gaussian noise while relying solely on previous gradients. This direction could be further investigated from the broader perspective of robustness in AI models. (Major) There are several issues that should be addressed before publication. Sections 2 (“Sharpness-Aware Minimization”) and 3 (“Time-Based Generator”) should be merged into a preliminaries section. These sections essentially introduce existing methods and concepts, but in the current form, they are presented as if they were the authors’ contributions. Important related work is also missing from these sections, which may mislead readers into thinking that the algorithms described there are original to this paper. The authors should clearly separate (i) background / existing techniques and (ii) their own contribution. The main concern about the proposed method is that the paper mostly emphasizes performance under noisy (Gaussian) settings, while giving less attention to performance on clean examples. Even if the goal is robustness, a practically useful optimizer should not severely degrade performance on clean data. In Table 5, the authors should discuss in more detail the performance of baselines such as SAM, ASAM, and SGD-M, not only under noise but also on clean data. In addition, several recent variants and analyses of SAM should at least be cited and, if possible, included as comparison methods: - Ji, Jie, et al. “A single-step, sharpness-aware minimization is all you need to achieve efficient and accurate sparse training.” NeurIPS 37 (2024): 44269–44290. - Li, Bingcong, Liang Zhang, and Niao He. “Implicit regularization of sharpness-aware minimization for scale-invariant problems.” NeurIPS 37 (2024): 44444–44478. - Mueller, Maximilian, et al. “Normalization layers are all that sharpness-aware minimization needs.” NeurIPS 36 (2023): 69228–69252. Since the proposed method builds on the SAM family, omitting these works weakens the contribution. Furthermore, in Algorithm 1, the proposed method uses the gradient of $L_B(w_{t-1}^{adv})$ and the gradient of $L_B(w^{adv}_{t})$. It seems necessary to use the same mini-batch across these steps in each iteration. However, the paper does not explain this implementation detail. If the same batch is used, please state this explicitly and discuss the computational implications. If different batches were actually used in the experiments, then the implemented method is not exactly the one described in the algorithm, and this discrepancy should be explained. While the proposed algorithm shows better performance under random Gaussian noise, the paper should also demonstrate that it maintains reasonable accuracy on other types of noisy data. Authors should consider CIFAR-10-C and CIFAR-100-C. Otherwise, the improvement may come from overfitting to the Gaussian noisy setting. A discussion about this trade-off, ideally with additional tables/plots, would strengthen the paper. (Minor) - Figures are not easy to read; please increase their resolution and/or font sizes. - Tables should have consistent significant figures/decimal places (e.g., Table 1). Inconsistent formatting makes it harder to compare methods. Refer to Weaknesses. Lightly AI-edited
TBG-Driven Minimization of Noise-Resistant Adaptive Sharpness Awareness Soundness: 2: fair Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. - This paper proposes noise-resistant adaptive sharpness-awareness minimization (NRASAM) variants. - The proposed method is based on the time-based generator and a boundedness theorem in a discrete system. - The authors seem to conduct image classification experiments to demonstrate the superiority of their proposed method. - The proposed methods are theoretically grounded. - The paper needs to be revised for a better understanding of the readers. - As far as I understand, the proposed NRASAM is a combination of ASAM and NRSAM. - It is hard to intuitively understand the concept of the discrete system and the proposed TBG-NRASAM. Please supplement the conceptual relation. - No error bars in the experimental results. - Please interpret the meanings of Theorems 3 and 4. Also, in Line 267, ... the rule for updating (2) is restructured to: ... eq_num (2)? I do not understand this part. Maybe a typo? - So, what is the proposed TBG-NRASAM? Please properly provide the algorithm. - Please specify the difference among ASAM, NRSAM, and NRASAM, including the algorithms and the contributions as well. - The authors proposed TBG for the discrete system, but ended up with an image classification task, right? What is the definition of a discrete system? How does it connect to the image classification models? Moreover, the experiments were the image classification experiments, right? In the paper, it just says "experiments on CIFAR-10 dataset", etc. Fully human-written
TBG-Driven Minimization of Noise-Resistant Adaptive Sharpness Awareness Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses the problem of degraded model generalization under noisy training scenarios and proposes a Time-Base Generator (TBG) and noise-resistant adaptive sharpness-aware minimization methods (NRASAM, TBG-NRASAM). The key contributions are: proposing a TBG based on discrete systems and its boundedness theorem, providing a stability analysis tool for non-exponentially decaying systems; designing NRASAM, which integrates adaptive sharpness measurement and historical gradient integration mechanism to simultaneously suppress intrinsic parameter scaling interference and extrinsic noise perturbation; optimizing parameters via TBG to propose TBG-NRASAM, improving algorithm convergence. Theoretical proofs verify the algorithm's convergence and noise resistance, and experiments show that the method outperforms SAM-series methods across multiple datasets, model architectures, and noise intensities, providing an effective optimization solution for noisy training scenarios such as differential privacy protection and federated learning. 1.The theoretical derivation is rigorous, and the algorithm's convergence and noise resistance are fully verified through 4 theorems; the experimental design is comprehensive, covering multiple datasets, model architectures, and noise intensities. Ablation experiments validate the effectiveness of TBG parameter adjustment, with highly credible conclusions. 2. The paper has a coherent structure, progressing layer by layer from problem formulation, theoretical foundation, algorithm design to experimental verification; the formula derivation is detailed, the appendix supplements complete theorem proofs, and figures intuitively show the algorithm's convergence process and performance advantages, facilitating reader understanding. 3. Focusing on the core pain points of practical scenarios such as differential privacy and federated learning, the proposed method can be directly integrated into existing optimization frameworks, which is of great practical significance for model training requiring noise protection. 1. Limited coverage of noise types: Experiments only verify the anti-interference performance of Gaussian gradient noise, not involving common noise types in practical scenarios such as label noise and data input noise; the performance boundary of the algorithm under extreme noise intensities (e.g., σ>0.02) is not evaluated. 2. Lack of computational overhead analysis: NRASAM introduces a historical gradient integration mechanism, which may increase additional computational and storage overhead. However, the paper does not compare its parameter count, FLOPs, and training/inference time with SAM and ASAM, lacking efficiency-performance trade-off analysis. 1. What is the performance of the algorithm under other common noise types such as label noise and data input noise? Can supplementary experiments be provided to verify its generalization ability in diverse noise scenarios? 2. Does the historical gradient integration mechanism of NRASAM increase computational and storage overhead? Fully AI-generated
DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents Soundness: 4: excellent Presentation: 3: good Contribution: 3: good Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose an end-to-end agentic reinforcement learning framework for autonomous travel planning. It enables a LLM to plan itineraries by calling travel-related tools inside a sandboxed environment and refining its reasoning through RL rather than static prompts. * Self-expanding sandbox: built from cached real API data (DiDi flights, hotels, maps). It starts almost empty and grows as the agent explores new queries. Cached responses are replayed deterministically, while a daily refresh partially updates data to mimic real-world dynamics. * Hierarchical reward modeling: A trajectory-level reward checks global consistency (route feasibility, timing), and a turn-level reward, implemented via DeepSeek-R1, ensures each step’s tool use and reasoning match tool outputs. If any turn fails, the entire trajectory is penalized. * Training pipeline: 1. Cold-start SFT: 1K verified trajectories distilled from DeepSeek-R1; fine-tune Qwen to learn the `<think>`/`<tool_call>` structure, masking `<tool_response>` tokens. 2. RL phase (GRPO-style): roll-out groups of 8 trajectories per query in the sandbox; compute verifier rewards; update with GRPO loss and replay previously failed queries for continual improvement. * Evaluated on 6 K + real user queries and synthetic benchmarks. DeepTravel-32B outperforms OpenAI-o1/o3 and DeepSeek-R1 while cutting hallucinations by > 50 %. * Introduction of an Agentic RL framework for travel planning with real deployment. * Sandbox design: stable, replayable environment that solves API inconsistency and rate-limit problems. * Hierarchical reward: reward system combining high-level feasibility with low-level factual consistency. * Practical impact: deployed inside a commercial platform (DiDi) * Scalable idea: structure could extend to other multi-tool reasoning domains * Reward quality relies on DeepSeek-R1 judgments; could introduce bias. * No comparison to human or optimization-based planners * Agentic behavior is constrained by predefined setting; not full open-world * Evaluation metric is binary and may miss aspects like personalization or cost-tradeoffs * How consistent are DeepSeek-R1 verifier judgments vs human annotators? * How are failed queries sampled when training? * Can the agent handle unseen cities or updated APIs without re-training? Fully AI-generated
DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper focuses on the practical business scenario of "travel planning" and designs an automated optimization algorithm based on Large Language Models (LLMs). The method is tested on different types of datasets and in a production environment, showing some improvements compared to open-source models. However, I do not believe this work is a good fit for this conference. It would be more appropriate for a conference with an applied track, such as KDD. - This paper is easy to follow - The algorithm design is simple and straightforward - This work is too engineering-focused and lacks significant academic contribution. - The learning curves presented in Figure 4 are weird; they show no clear evidence of policy learning because there is no significant reward improvement. N/A Lightly AI-edited
DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper DeepTravel presents an end-to-end agentic reinforcement learning framework for autonomous travel planning. It trains agents to plan, execute, and refine itineraries through a sandboxed environment, hierarchical reward modeling, and replay-augmented learning. Deployed on DiDi’s app, DeepTravel enables smaller LLMs to outperform larger models like OpenAI-o1/o3 and DeepSeek-R1 in travel planning tasks. The paper’s main strength lies in its innovative integration of agentic reinforcement learning for autonomous travel planning, combining a robust sandbox environment, hierarchical reward modeling, and replay-augmented learning to enable effective tool interaction and reasoning. It demonstrates promising empirical performance, showing that smaller LLMs can surpass larger state-of-the-art models, and provides real-world validation through deployment on DiDi’s platform. While the proposed sandbox environment effectively stabilizes training by mitigating API inconsistencies, it also raises concerns about closed-world limitations. By caching transportation, accommodation, and POI data, the framework may train agents on a static or outdated representation of the travel environment, potentially limiting generalization to real-world, dynamically changing conditions. Although the authors mention a daily update mechanism, the paper lacks detailed analysis on how frequently and extensively this data is refreshed, or how well the sandbox replicates real-world variability. Consequently, the practical usefulness and robustness of the trained agent outside this controlled setting remain uncertain, and further empirical validation with live, dynamic data sources would strengthen the paper’s claims. The design of the trajectory-level and turn-level verifiers appears to be highly domain-specific and handcrafted, relying on travel-oriented rubrics and prompt templates. This raises concerns about generalizability, reliability, and reproducibility. Since these verifiers directly determine the reward signals, the overall agent performance may critically depend on their specific design choices, such as rubric formulation, prompt phrasing, or the underlying reward model’s calibration. However, the paper appears to provide limited ablation or sensitivity analysis to assess how changes in verifier design impact learning outcomes. Without clearer justification, it is difficult to evaluate whether the reported performance gains stem from genuine improvements in agentic reasoning or from carefully tuned, domain-dependent reward mechanisms, which could limit reproducibility and broader applicability. The necessity and novelty of the proposed Replay-Augmented Reinforcement Learning (RARL) algorithm are not sufficiently justified. While the idea of replaying failed experiences is presented as a key contribution, it appears conceptually similar to existing experience replay or replay buffer mechanisms widely used in RL, including prior work like GRPO. The paper does not clearly explain what makes this replay-augmented approach fundamentally different or superior beyond applying it to the travel-planning domain. Moreover, it remains unclear why existing RL algorithms could not achieve similar effects with appropriate tuning or data sampling strategies. Without a deeper theoretical motivation or comparative analysis isolating the benefits of the proposed replay mechanism, the contribution risks appearing incremental rather than novel, and its necessity in achieving the reported performance improvements remains questionable. The comparison with state-of-the-art reasoning LLMs (e.g., OpenAI-o1/o3, DeepSeek-R1) may be unfair, as these models are not specifically trained or optimized for travel planning. Their weaker performance could reflect domain mismatch rather than genuine inferiority. Moreover, since evaluation relies on DeepTravel’s own reward verifiers and limited human checks, the setup may favor the proposed system, calling into question the fairness and validity of the comparisons. All experiments and evaluations appear to rely heavily on DiDi’s proprietary platform, APIs, and datasets, which may not be publicly accessible. This dependence significantly limits reproducibility and independent verification of the reported results. While the authors provide training details and claim to release prompts, the potential absence of open-source data, sandbox implementation, and evaluation environment means that other researchers may find it difficult to replicate or validate the findings. As a result, the scientific transparency and reproducibility of the work ay be at the weak side. How does the cached sandbox data reflect real-world dynamics, and can the trained agent generalize beyond this controlled environment? How sensitive is the agent’s performance to the design of the handcrafted verifiers, and how is their reliability and reproducibility validated? What distinguishes the proposed replay-augmented RL from existing methods, and how are fairness and reproducibility ensured given reliance on DiDi’s proprietary setup? Could the authors provide stronger evidence or ablation results showing that this new algorithm yields distinct and essential improvements compared to standard approaches? Fully AI-generated
PreviousPage 32 of 1516 (75800 total rows)Next