ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 15899 (21%) | 4.43 | 3.58 | 3687 |
| Heavily AI-edited | 3233 (4%) | 4.22 | 3.59 | 2990 |
| Moderately AI-edited | 7082 (9%) | 4.20 | 3.61 | 2722 |
| Lightly AI-edited | 16648 (22%) | 4.15 | 3.68 | 2746 |
| Fully human-written | 32938 (43%) | 4.13 | 3.62 | 2917 |
| Total | 75800 (100%) | 4.21 | 3.62 | 3026 |
Title Ratings Review Text EditLens Prediction
Value-Anchored Group Policy Optimization for Flow Models Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper targets two issues when applying Group Relative Policy Optimization (GRPO) to flow-matching text-to-image models: temporal miscredit from using a single terminal reward at every denoising step, and vanishing optimization signal when group rewards lose variance. There are two methodological contributions: first, a Temporal Cumulative Reward Mechanism (TCRM) that converts the terminal reward into per-step "instant" rewards by projecting each intermediate latent forward one ODE step to an approximate final image and computing a reward on it, then forms discounted action values $Q_t$; and second, Adaptive Dual Advantage Estimation (ADAE), which mixes a relative term with an absolute-value term so that the advantage does not collapse when the group reward standard deviation goes to zero (an illustrative sketch of both mechanisms follows this review). The method is evaluated on GenEval, OCR text rendering, and PickScore alignment using SD-3.5 as the base model. The headline results claim modest improvements over Flow-GRPO on in-distribution task metrics and on quality metrics, with the largest relative gains occurring when KL regularization is absent. Key equations and the training loop are clearly stated in Section 3 and Algorithm 1. I think there is a strong motivation for this work, as naively applying GRPO to flow-based models presents some issues, most prominently diversity of generated samples. I also like the simplicity of the proposed methods, which makes them easy to implement and evaluate. The presentation of the paper is very clear as well, making it easy for readers to digest. **Surprising lack of recent baselines for empirical comparison (even though they are mentioned in the work).** In the related work section the authors clearly cite TempFlow-GRPO [1], Pref-GRPO [2], MixGRPO [3], and BranchGRPO [4]. In particular, TempFlow-GRPO also incorporates a temporal reward signal, and I think it should be included in the empirical evaluations. While I think an improvement over Flow-GRPO is great, **I believe some of the empirical gains are small and sometimes negative on quality**. The effectiveness of the proposed method is somewhat overclaimed. An example of minimal gain is in Table 1 on page 8: with KL, GenEval improves from 0.95 to 0.96, OCR from 0.92 to 0.94, and PickScore from 23.31 to 23.41. Several aesthetic or DeQA scores are within 0.01 to 0.16. Without KL, the human-alignment setting shows a drop in Aesthetic from 6.15 to 5.97. "Reward hacking mitigation" is asserted, yet some quality metrics degrade or move minimally. Quantifying the variance across seeds and reporting effect sizes would substantiate the claims. The method also incurs **a non-trivial computational overhead** compared to Flow-GRPO: Flow-GRPO evaluates the reward once per generated image, whereas TCRM evaluates it $T$ times per image. A wall-clock or FLOPs comparison against Flow-GRPO is needed, and comparisons with efficiency-focused variants like MixGRPO or BranchGRPO would be appropriate. This last point is trickier, and I will classify it as a minor weakness. The theory is limited to a limiting case. Appendix B proves that ADAE does not vanish when $\sigma \to 0$.
This is a desirable property, but it does not address the finite-$\sigma$ regime, bias, stability, or KL-regularized convergence. [1] Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, and Bo Zhang. Tempflow-grpo: When timing matters for grpo in flow models, 2025. [2] Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning, 2025a. [3] Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde, 2025a. [4] Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models, 2025b. See weaknesses, mostly about missing baselines and empirical benchmarks. 1. What is the wall-clock or GPU-hour overhead of TCRM relative to Flow-GRPO? 2. How sensitive is ADAE to reward-scale changes across reward models? Fully human-written
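For concreteness, below is a minimal sketch of how the TCRM action values and the ADAE advantage described in the review above could look in code. The discount factor, the anchor value, and the mixing weight are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def tcrm_action_values(instant_rewards, gamma=0.9):
    """Accumulate per-step 'instant' rewards (obtained by scoring a one-ODE-step
    projection of each intermediate latent) into discounted action values Q_t.
    The discount factor gamma is a placeholder assumption."""
    Q = np.zeros(len(instant_rewards))
    running = 0.0
    for t in reversed(range(len(instant_rewards))):
        running = instant_rewards[t] + gamma * running
        Q[t] = running
    return Q

def adae_advantage(Q_group, anchor=0.0, lam=0.5, eps=1e-8):
    """ADAE-style advantage across a group of rollouts for one prompt:
    mix a group-relative (standardized) term, which vanishes as the group
    reward std goes to 0, with an absolute term measured against a fixed
    anchor value, which does not. The anchor and lam are assumptions."""
    mu, sigma = Q_group.mean(), Q_group.std()
    relative = (Q_group - mu) / (sigma + eps)
    absolute = Q_group - anchor
    return lam * relative + (1.0 - lam) * absolute
```

Under this sketch the reward model is queried once per denoising step (hence the $T$-fold overhead noted above), and the advantage keeps an absolute signal relative to the anchor even when all rollouts in a group receive identical rewards.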
LLMs Can Generate a Better Answer by Aggregating Their Own Responses Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper studies how LLMs can be used to answer questions better in one shot by post-processing their own generated responses. Previous such techniques include self-consistency, self-correction, and choose-from-N methods. They require either verifiable answers (e.g., math reasoning tasks) or discriminatively capable LLMs, which may be lacking. To address this, the paper proposes the GSA method, which first generates multiple LLM responses and then passes all of them in context to the LLM (along with the question), prompting it to generate a solution to the question. The authors conduct experiments using 4 LLMs and 8 benchmark datasets, showing performance improvements by GSA compared to previous techniques. The paper proposes a simple response-aggregation-based technique for improving an LLM's answering capabilities, which can be effective and more generally applicable than previous methods. 1. The method is fairly straightforward, and there seems to be little insight into why it supposedly works. 2. The empirical performance gains on the various benchmarks are not very substantial and seem inconsistent across models and datasets. In particular, the performance of GSA is similar to that of self-consistency on the non-open-ended datasets. 3. Error bars are not provided. 4. Some parts of the paper require more explanation (see questions to authors). 1. Which model is used to predict the index mentioned in line 271? 2. How is the Best-of-N oracle implemented for the open-ended tasks? 3. While the experiments fix the number of model calls, one ablation could be to also account for the larger context length of GSA. Have the authors considered this aspect? Fully human-written
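Since the two-stage procedure is the core of this submission, a minimal sketch of the pipeline as the reviews describe it may help. The `generate` callable and the prompt wording below are placeholder assumptions, not the paper's templates.

```python
def generative_self_aggregation(question, generate, n_samples=3, temperature=1.0):
    """GSA as described in the reviews: (1) sample several diverse responses,
    (2) feed them all back as context and ask the model to synthesize a new,
    improved answer (no voting or judging step).
    `generate(prompt, temperature)` is an assumed text-generation callable."""
    # Stage 1: diverse response generation
    candidates = [generate(question, temperature) for _ in range(n_samples)]

    # Stage 2: context-enriched response synthesis
    context = "\n\n".join(
        f"Response {i + 1}:\n{r}" for i, r in enumerate(candidates)
    )
    aggregation_prompt = (
        f"Question: {question}\n\n"
        f"Here are several candidate responses:\n\n{context}\n\n"
        "Drawing on the useful parts of these responses, write a single improved answer."
    )
    return generate(aggregation_prompt, 0.0)  # greedy final synthesis
```

With `n_samples=3` this costs four model calls in total, matching the fixed N=4 call budget used in the comparisons the reviews discuss.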
LLMs Can Generate a Better Answer by Aggregating Their Own Responses Soundness: 3: good Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a novel prompting method, Generative Self-Aggregation (GSA), which samples multiple diverse responses from an LLM and then generates an improved answer based on the sampled responses. - GSA only adds one aggregation prompt; no reward model, no verifier, no extra model. This makes it easy to drop into existing pipelines that already sample multiple candidates. - Because it aggregates by generation rather than by voting over final tokens, GSA handles open-ended and code-generation tasks where voting is ill-defined; the paper shows gains on MT-Bench, AlpacaEval, and MBPP. - Novelty is incremental. The core idea is a scope extension of self-consistency / universal SC to open-ended tasks, not a fundamentally new test-time reasoning paradigm. The paper even reuses the same candidate pool as SC. - With 4 calls, GSA frequently ties or barely beats SC or even greedy: e.g., Llama-3-8B MMLU 65.62 vs SC 65.62; GPT-4o-mini MATH 78.25 vs SC 77.26; Llama-3-8B MATH 32.46 vs SC 31.68. These are real but slim deltas. - The ablation (fix N=3, vary temperature) shows a trend, but the paper does not define or report an actual diversity metric. - The temperature ablation fixes N=3 and focuses on GSM8K/GPQA with Llama-3-8B. Why N=3? Do you observe interaction effects between N and temperature/diversity? - The authors attribute GSA's gains to "increased response diversity," but there is no quantitative measure of diversity in the paper. Lightly AI-edited
LLMs Can Generate a Better Answer by Aggregating Their Own Responses Soundness: 2: fair Presentation: 2: fair Contribution: 1: poor Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. While Large Language Models (LLMs) are highly capable, they often need help with complex problems. Popular methods like self-correction and selection fail when the LLM is tasked with evaluating its own work, due to a lack of training in discriminative judgment. This paper introduces Generative Self-Aggregation (GSA), a new prompting technique that improves answers without needing the model to judge them. Instead, GSA: 1. Generates multiple diverse responses. 2. Aggregates them by synthesizing a new, improved answer based on all the generated text. Unlike self-consistency, which relies on majority voting for specific answers, GSA is more flexible and works on open-ended tasks. Tests show it boosts performance in areas like math, knowledge questions, code, and dialogue. 1. Simplicity and efficiency: As a prompting method, GSA is simple to implement and does not require additional training, fine-tuning, or external models, making it highly efficient and accessible. 2. Generality and flexibility: Unlike Self-Consistency (SC), which is restricted to tasks with verifiable outputs (like multiple-choice), GSA's generative aggregation mechanism makes it applicable to a wide range of open-ended tasks, such as code synthesis and conversational responses. 1. Dependence on generation quality: The effectiveness of GSA is contingent on the quality and diversity of the initially sampled responses. If the model consistently generates flawed or homogeneous outputs for a given prompt, the aggregation process may be unable to synthesize a correct or improved solution (a "garbage-in, garbage-out" risk). 2. Lack of explicit error correction: While bypassing the need for discriminative judgment is a strength, it is also a limitation. The method does not explicitly identify or correct errors from the sampled responses; it relies on the generative process to implicitly overcome them, which may not be as reliable for specific, factual inaccuracies. 3. Limited contribution: The proposed method is straightforward and lacks sufficient innovative points. See the weaknesses. Heavily AI-edited
LLMs Can Generate a Better Answer by Aggregating Their Own Responses Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces Generative Self-Aggregation (GSA), a two-stage prompting strategy intended to improve an LLM's response quality without asking the model to explicitly judge or rank candidates. Step (1) samples several diverse responses; step (2) feeds those responses back as context and asks the model to synthesize a new, improved response, leveraging generative next-token prediction rather than discriminative scoring. The authors position GSA as a generalization of self-consistency to open-ended tasks, and as an alternative to self-refine/choose-from-N methods that rely on an LLM's judging ability. They evaluate across math (GSM8K, MATH, SVAMP), knowledge (MMLU, GPQA), and open-ended tasks (MT-Bench, AlpacaEval, MBPP) using Llama-3-8B, Gemma-2-9B, Qwen-2.5-14B, and GPT-4o-mini. GSA typically improves over greedy decoding, Self-Refine, and Universal Self-Consistency, matches or beats Self-Consistency where applicable, and extends to open-ended tasks where SC is not applicable. S1. Simple idea with broad applicability: The paper's core insight, aggregate by generation rather than select by judgment, is crisply articulated and easy to implement. The method is defined as "diverse response generation" followed by "context-enriched response synthesis," with concrete prompt templates provided for each benchmark in the appendix, aiding reproducibility. Figure 1 and the worked examples make the intuition tangible (e.g., synthesizing from multiple solution paths to fix earlier missteps). S2. Comprehensive Evaluation: The authors report their main results across four models and 7-8 diverse domains (math, knowledge, science, and open-ended tasks). It is great that the authors also perform a cost-aware comparison by fixing model calls (N=4) for fairness and further provide an "adjusted calls" comparison in Table 4. The authors perform meaningful ablations on the number of responses (Figure 2), temperature (Figure 3), and sampling strategies (template variation and multilingual sampling in Table 3) that demonstrate robustness to these knobs. It is also a useful detail that GSA can succeed even when all individual candidates are wrong, which selection-based methods cannot. S3. Reproducibility aids: The appendix lists full prompts, implementation details (vLLM, temperatures, max tokens), and case studies. This level of detail is helpful for practitioners to try GSA quickly. W1: My main concern with the paper is that its contribution is very thin. The improvement is minimal (< 1% in most cases compared to SC; also seen in Fig. 2 and Fig. 3). Further, numbers reported without error bars on smaller datasets (like GPQA) make the improvement of GSA even harder to discern. The authors say that SC's "application to open-ended tasks remains challenging due to the lack of clear voting mechanisms." While I acknowledge this, there are some indirect ways to go about it, such as embedding the responses and choosing the cluster with the highest count (or using an LLM-as-a-judge to assess semantic equivalence between two responses and then doing self-consistency); a rough sketch of the embedding-based variant appears after this review.
W2: Apart from greedy decoding, evaluation at the recommended temperature, top_p, etc. (e.g., Qwen recommends temperature=0.7 and top_p=0.8) should also be done and added to Tables 1/2. I also noticed some incorrect numbers. Following the Qwen2.5 technical report (Table 7), the numbers reported in your paper seem to be lower (i.e., the official report's standard evaluation has higher numbers than your proposed method):

| Benchmark | Official Report | Your Reporting (Greedy) |
|-----------|----------------:|------------------------:|
| MBPP | 82.0 | 72.00 |
| GPQA | 45.5 | 39.51 |

This might be due to differences in sampling by the Qwen team. Thus, I would also recommend adding an evaluation of each model in its recommended configuration (as suggested by the model developers). Please fix this. W3: Benchmark sampling choices and statistical rigor: For MMLU, only 10% of the test set is evaluated; for MATH a 500-sample subset is used. There are no confidence intervals or statistical tests across runs/seeds. Please report mean ± std. err. W4: Missing order-sensitivity analysis: Aggregation may be sensitive to the ordering of candidates. Was this factor studied? How sensitive is GSA to candidate ordering? I would request the authors add more empirical analysis of candidate order, such as testing best-candidate-dropped vs. worst-candidate-dropped, or measuring what fraction of the final response is composed from each candidate (in their order). W5: For AlpacaEval, why is GPT-4 used as the judge for evaluating GPT-4o-mini (which came out long after GPT-4)? GPT-4 is old now (by today's LLM standards) and may even be worse than Qwen2.5-14B on some tasks. At least GPT-4o should be used, or a similarly capable open-source model like DeepSeek-V3.1 if proprietary cost is an issue. ### Presentation: 1. In L228-230, what is wolves, cougars, etc.? The whole paragraph is unreadable. 2. Minor grammar: "generate an new response", "we experiment our method" → "experiment with our method", "number of class" → "calls"; a few curly quotes (Let‘s) and spacing inconsistencies. 3. Consistency: "MT-bench" vs "MTbench"; "Jame" vs "James" in the appendix example (if that's not from the dataset verbatim). Q1: MBPP metric: Is MBPP reported as pass@1 using the official unit tests? If not, please detail the exact measurement. Fully human-written
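A rough sketch of the embedding-plus-clustering vote mentioned in W1 above; the sentence-transformers model name and the distance threshold are placeholder assumptions, and any comparable semantic-clustering setup would serve the same purpose.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def cluster_vote(responses, threshold=0.25):
    """Approximate self-consistency for open-ended outputs: embed the candidate
    responses, cluster them by cosine distance, and return a representative of
    the largest cluster (its most central member)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
    emb = model.encode(responses, normalize_embeddings=True)
    labels = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=threshold,
    ).fit_predict(emb)
    biggest = np.bincount(labels).argmax()           # most populated cluster
    idx = np.where(labels == biggest)[0]
    centroid = emb[idx].mean(axis=0)
    return responses[idx[np.argmax(emb[idx] @ centroid)]]
```

Note that on scikit-learn versions before 1.2 the `metric` argument of `AgglomerativeClustering` is named `affinity`.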
Self-Guidance: Training VQ-VAE Decoders to be Robust to Quantization Artifacts for High-Fidelity Neural Speech Codec Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The authors propose a new self-guidance loss to improve the training of neural speech codecs. Motivated by the observation that decoders can produce better reconstructions when using pre-quantized encoder outputs, the authors introduce a feature mapping loss that aligns the decoder's intermediate features with those produced from the pre-quantized encoder outputs. The proposed loss is evaluated on the XCodec-2 baseline (Ye et al., 2025b), showing improved low-bitrate reconstruction performance on the LibriSpeech dataset. In addition, results indicate that with the self-guidance loss, XCodec-2 can maintain reconstruction fidelity when the codebook size is reduced by 4x, and the downstream TTS performance is also improved. 1. The motivation for the self-guidance loss is clear, i.e., decoders reconstruct better when using the pre-quantized encoder features. 2. The proposed loss is simple to implement and introduces negligible computational overhead. 3. Empirical results demonstrate that the self-guidance loss improves reconstruction quality within the XCodec-2 framework. 1. The improvements in reconstruction quality appear marginal, e.g., gains of only about 0.1 in PESQ and 0.04 in UTMOS (Table 2). 2. The experiments are not sufficient. The proposed loss is only evaluated on the XCodec-2 framework. Given the simplicity of the idea, it should be tested on additional single-codebook codecs to better demonstrate its general effectiveness. 3. In Table 4, the WER slightly increases after applying the self-guidance loss, suggesting semantic information loss. The authors should provide a stronger justification or analysis for this observation. 1. Is the feature mapping loss in Equation 2 applied at each decoder resolution level? The current equation seems ambiguous. 2. As in Weakness 2, does the self-guidance loss depend specifically on XCodec-like architectures, or is it generally applicable to single-codebook neural speech codecs? 3. Since the reported improvements are relatively small, have the authors considered simpler ablation studies? For example, could you report the gain from the self-guidance loss alone, without the semantic and adversarial losses, to isolate its contribution? 4. The authors claim that the self-guidance loss enables a smaller codebook, benefiting downstream speech LLMs. However, large codebooks are not necessarily a problem for LLMs; sometimes larger vocabularies can enhance performance. I wonder why the authors do not frame this claim from a compression perspective, emphasizing that a 4x reduction in codebook size implies a 4x increase in compression rate. Fully human-written
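A minimal PyTorch sketch of the self-guidance idea as described in this and the following reviews: run the decoder once on the pre-quantized encoder output (without gradients) and once on the quantized latents, and add an MSE feature-mapping loss between matched intermediate decoder features. The module interfaces, layer choice, and loss weighting are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def self_guidance_loss(encoder, quantizer, decoder, x, lambda_guide=1.0):
    """Sketch of a self-guidance feature-mapping loss for a single-codebook codec.
    `decoder(z, return_features=True)` is an assumed interface that also returns
    a list of intermediate feature maps; `quantizer` returns (codes, vq_loss)."""
    z_e = encoder(x)                      # continuous (pre-quantization) latents
    z_q, vq_loss = quantizer(z_e)         # discrete (post-quantization) latents

    with torch.no_grad():                 # teacher pass: one extra forward, no gradients
        _, feats_pre = decoder(z_e, return_features=True)

    x_hat, feats_post = decoder(z_q, return_features=True)

    guide = sum(F.mse_loss(fp, ft) for fp, ft in zip(feats_post, feats_pre))
    recon = F.l1_loss(x_hat, x)           # placeholder reconstruction term
    return recon + vq_loss + lambda_guide * guide, x_hat
```

At inference nothing changes: the decoder simply consumes the quantized latents, which is why the reviews describe the method as adding negligible overhead.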
Self-Guidance: Training VQ-VAE Decoders to be Robust to Quantization Artifacts for High-Fidelity Neural Speech Codec Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper introduces "self-guidance," a novel and elegant training mechanism for VQ-VAE-based neural speech codecs. The core idea is to improve the decoder's robustness to quantization artifacts by introducing an auxiliary feature-mapping loss that encourages the decoder to produce similar intermediate representations for both quantized tokens and their continuous pre-quantization counterparts. The method is simple to implement, adds negligible computational overhead during training, and requires no changes at inference time. Through extensive experiments on the state-of-the-art XCodec2 model, the authors demonstrate significant and consistent improvements in reconstruction quality across various metrics, codebook sizes, and quantization methods. A key finding is that self-guidance enables a 4x reduction in codebook size while maintaining comparable fidelity, which is shown to directly benefit downstream autoregressive TTS tasks by simplifying the token modeling space. 1. The proposed self-guidance mechanism is a simple, intuitive, and novel training objective. It addresses the core problem of quantization error by enhancing the decoder directly, rather than adding complexity to the quantizer or architecture. Its minimal overhead (a single extra forward pass during training with no gradient computation) makes it highly efficient and practical. 2. The method is applied to XCodec2 and evaluated on a standard benchmark (LibriSpeech), demonstrating consistent state-of-the-art performance. The ablation studies convincingly show the method's effectiveness across different codebook sizes and vector quantizer types (FSQ, SimVQ), proving its generalizability. 3. The paper provides a crucial analysis in Figure 3, demonstrating that the performance gain comes from improved decoder robustness, not from a reduction in the quantization error itself. This confirms the authors' central hypothesis and provides a clear understanding of the method's mechanism of action. 1. The paper mentions selecting the loss weight λ_guide, but a sensitivity analysis showing how performance varies with this weight would enhance the experimental rigor and provide practical guidance for future work. 2. The method's effectiveness is demonstrated exclusively on single-codebook models. Its applicability and potential benefits in common multi-codebook architectures (e.g., RVQ) remain entirely unexplored, which significantly limits the proven scope and generalizability of the proposed technique. 3. The proposed "self-guidance" is essentially a form of self-distillation, where a network branch with privileged information (pre-quantized latents) acts as a teacher. The paper fails to acknowledge this strong connection to existing paradigms, thereby overstating its novelty and positioning the contribution more as a clever engineering refinement than a fundamental advance. N/A Fully AI-generated
Self-Guidance: Training VQ-VAE Decoders to be Robust to Quantization Artifacts for High-Fidelity Neural Speech Codec Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper studies the audio tokenizer task, aiming to improve the quality of audio codecs. The authors propose an approach that aligns hidden embeddings to better reconstruct fine-grained audio information. - The proposed method is well-motivated and appears conceptually sound, leading to consistent performance improvements across multiple downstream tasks. - Extensive experiments on various downstream applications demonstrate the effectiveness of the proposed approach. - The experiments on Hidden Feature Alignment MSE provide a clear comparison of different methods in terms of information reconstruction capability. This analysis is valuable and could serve as a useful reference for future work on improving reconstruction quality in audio tokenizers. - Could the authors provide more comprehensive experimental results in Table 5? For example, results for XCodec2 with a 16,384-sized codebook and XCodec2+SG with a 65,536-sized codebook would offer a fuller view of the model’s behavior under different configurations. - While the proposed method is effective, its novelty appears limited. It remains unclear why this approach can mitigate quantization artifacts. The paper mentions decoder robustness, but the conceptual difference between decoder robustness and quantization error is not clearly articulated. Providing a deeper explanation of this relationship would help readers better understand the core contribution. I would appreciate it if the authors could elaborate on the conceptual and practical differences between decoder robustness and quantization error. How does improving robustness directly contribute to reducing quantization artifacts? A more detailed discussion or visualization would strengthen the theoretical understanding of the proposed approach. Moderately AI-edited
Self-Guidance: Training VQ-VAE Decoders to be Robust to Quantization Artifacts for High-Fidelity Neural Speech Codec Soundness: 2: fair Presentation: 4: excellent Contribution: 1: poor Rating: 4: marginally below the acceptance threshold Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This manuscript proposes a technique to enhance the reconstruction fidelity of discrete speech codecs. The core idea involves a modification to the training objective: during training, both the continuous encoder-output features and the subsequent discrete quantized representations are passed to the decoder. An auxiliary feature-mapping loss (MSE) is then applied within the decoder to minimize the discrepancy between the internal representations generated from these two distinct inputs. The authors report performance on several standard reconstruction metrics (e.g., STOI, MCD) and simple downstream TTS tasks (e.g., WER, SIM). The underlying motivation for this approach is sound. In VQ-based architectures, ensuring that the decoder is robust to the information bottleneck of the quantizer—by training it to map both pre- and post-quantization features to a similar internal manifold—is a reasonable objective beyond simply optimizing the codebook itself. 1. The primary concern is the limited methodological contribution (only an easy trick). The proposed method amounts to a straightforward technical adjustment (i.e., an auxiliary loss) rather than a substantial new approach. The introductory discussion in Section 3, which focuses on the well-established concept of information loss in discrete quantization, offers little new insight and feels remedial. The method itself is not particularly insightful or heuristic. 2. The most critical weakness is the lack of compelling empirical evidence. The results presented in Table 1 indicate that the proposed method yields minimal to no improvement over the baseline across the majority of metrics (STOI, MCD, WER, SIM, UTMOS). The PESQ metric, which the authors specifically highlight, demonstrates only a marginal improvement (reportedly a 0.3 change), falling short of demonstrating meaningful practical utility. 3. Given that the method is presented as a general "trick" adaptable to the VQGAN framework, its evaluation is too narrow. To substantiate its effectiveness, the technique should have been validated across a much broader spectrum of modern audio codec models, not just the single architecture presented. Furthermore, to claim generalizability within the VQGAN context, validation on other relevant modalities (e.g., image, video) would be essential. This work may meet the threshold for ICASSP/Interspeech, but it currently lacks the requisite novelty and impact expected for ICLR. Lightly AI-edited
Efficient Algorithms for Adversarially Robust Approximate Nearest Neighbor Search Soundness: 1: poor Presentation: 2: fair Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper considers the problem of building adversarially robust approximate near neighbor (ANN) data structures. In the classical ANN setting, the goal is to build a data structure $D$ which, when instantiated with a dataset of $n$ $d$-dimensional datapoints, a target distance $r > 0$, and an approximation constant $c > 1$, supports the following queries: given a query point $q$, return a point $x$ in the dataset with $\|x - q\| \leq cr$ if there exists any point $y$ in the dataset with $\|q - y\| \leq r$. Classical work has produced data structures with query times scaling *sub-linearly* in $n$ and space complexities scaling almost linearly in the size of the dataset. Unfortunately, these classical works often assume that the sequence of queries given to the data structure is independent of the randomness used to instantiate it. In realistic scenarios, where the query sequence may potentially depend on answers to prior queries (and hence, the internal randomness of the data structure), this assumption breaks down and the correctness guarantees of the data structure no longer hold. Consequently, there has been much recent interest in remedying these shortcomings by developing data structures resilient to these effects. This paper operates in the strong adversarial setting where *both* the dataset and the sequence of queries are assumed to be chosen by a potentially malicious adversary. The paper contains several results applying to different regimes: 1) In the high-dimensional setting ($d = \omega(\sqrt{Q})$), the paper constructs a series of data structures culminating in a data structure with space complexity $\tilde{O} (\sqrt{Q} n^{1 + \beta})$ and query complexity $d n^{\beta}$ for $\beta \approx \log \log (c) / \log (c)$, and 2) In the low-dimensional setting, the paper mildly improves on prior results, providing a stronger *for-all* query guarantee with an increased cost in space complexity $\tilde{O} (d n^{1 + \rho})$ (note that when $d$ is large, this space complexity dominates the query-dependent results of the high-dimensional setting). Technically, the main insight of the paper is that algorithms for *Fair*-ANN (a recently developed ANN notion which requires that any answer returned by the data structure be uniformly chosen from the set of correct answers, if that set is non-empty) inherently provide resilience to adversarially chosen inputs. Unfortunately, the query times of prior data structures for Fair ANN rely on the density ratio of two neighborhoods around $q$ (specifically, the neighborhoods within $cr$ and $r$ of $q$), which may be as large as $n$. The main technical contribution of the paper addresses this challenge by breaking the range $[r, cr]$ into a series of annuli and observing that not all radii in this series can feature large density ratios (a reconstruction of this argument is sketched after this review). Hence, an alternative approach is to try different data structures for different annuli and only use ones which terminate within a chosen time complexity threshold (one exists since at least one annulus features a small density ratio).
Unfortunately, this idea only results in a relaxed Fair ANN data structure which no longer straightforwardly yields robustness, as the answer may potentially leak the precise choice of annulus used, which is correlated with the internal randomness. The paper then uses standard techniques from differential privacy to hide only the *choice* of annulus while incurring increased space complexity scaling with $\sqrt{Q}$. The combination of these techniques yields the final result. Overall, the paper considers the natural problem of building adversarially robust data structures for search problems like ANN. The paper draws a connection to the Fair-ANN problem and expands on these ideas to build data structures with sub-linear query times and space complexities scaling independently of $d$. However, the remainder of the paper largely relies on standard techniques previously explored in the literature for robust *estimation*. Furthermore, the approach in the paper has runtimes scaling as $n^\beta$ (as opposed to $n^\rho$), representing a substantial degradation. It is not clear that such a degradation is necessary, and this somewhat dampens the main result of the paper. Finally, the writing and organization of the paper need to be substantially improved. For instance, Theorem 1.3 as stated only concerns Fair-ANN and not robust ANN, and the proof of Claim 3.3 (the main insight of the paper) is missing some steps: how can one condition on the correctness of the algorithm? what is $R$? why does $\mathcal{A}_{fair}$ not being adversarially robust mean the corresponding random variables are not independent? In its current state, I cannot recommend acceptance. See main review See main review See main review Fully human-written
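One way to make the "not all radii can feature large density ratios" observation above concrete is the following telescoping argument; this is a reconstruction of the likely reasoning, not an excerpt from the paper. Define the density ratio $D(q) = |P \cap B(q, cr)| / |P \cap B(q, r)|$, which can be as large as $n$. Splitting $[r, cr]$ into $k$ annuli via radii $r_i = c^{i/k} r$ for $i = 0, \ldots, k$ and setting $D_i(q) = |P \cap B(q, r_i)| / |P \cap B(q, r_{i-1})|$, the product telescopes: $\prod_{i=1}^{k} D_i(q) = D(q) \leq n$. Hence some scale $i$ satisfies $D_i(q) \leq n^{1/k}$, i.e., at least one annulus admits a fast (low density ratio) Fair-ANN search, which is the structure the concentric-annuli construction exploits by trying all scales and keeping only those that terminate within the chosen time threshold.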
Efficient Algorithms for Adversarially Robust Approximate Nearest Neighbor Search Soundness: 2: fair Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper designs data structures for the ANN problem against an adaptive adversary, who selects a worst-case, size-$n$ dataset and $Q$ *adaptive queries* based on past responses of the data structure. The authors develop a progression of algorithms with provable robustness and efficiency guarantees: they first (i) connect adversarially robust ANN to fair ANN (where outputs are uniformly random among near neighbors), showing that the latter implies the former, then (ii) reduce search to a decision ANN and solve the latter with a DP mechanism on top of an LSH, and finally (iii) introduce a concentric-annuli LSH construction that privatizes which annulus is predicted to terminate quickly, and then runs a fair ANN only within that annulus. On efficiency, (i) yields a query time that depends on the "density ratio" (i.e., the number of points in the $cr$-ball relative to the $r$-ball for a query $q$), which can be large for a worst-case dataset. In comparison, (ii) removes this data dependence; however, it partitions the data into $\sqrt{n}$-sized buckets, so the query time is at least $\sqrt{n}$ (i.e., it does not diminish as $c$ grows). Finally, (iii) mitigates this issue and achieves space and query complexities that are data-independent and whose dependence on $n$ diminishes as $c$ increases. - I find it appealing that Theorems 1.2 and 1.3 achieve runtime and space bounds independent of dataset-specific quantities (i.e., $s$ in [Feng'25] or $D$ in Theorem 1.1). This makes performance predictable on worst-case datasets and avoids hidden inefficiency due to dense neighborhoods. The search-to-decision reduction that enables this isolates the leakage channel and patches it with a DP mechanism, which feels natural and standard, but is executed nicely. - (Subject to correctness,) the fairness-to-robustness connection feels neat: framing robustness as a consequence of returning a uniformly random near neighbor yields a simple, reusable principle. As the authors note, this argument extends to any algorithm which is required to answer adaptive queries by picking from a discrete set of candidate values. In my view, this offers an alternative methodology beyond the now-standard DP-for-robustness recipe and could lead to further interesting results. - The presentation is clear. I appreciate how the authors explain the inefficiencies of Theorems 1.1 and 1.2, progressively motivating Theorem 1.3. The efficiency guarantees and trade-offs of the three algorithms are clearly discussed and compared with [Feng'25]. I only checked the first proof (fairness implies robustness) and I'm confused about the following point: In Definition 2.1, both $R$ and $R_{setup}$ are used to denote the randomness used to *initialize* the data structure. So I assume that $R = R_{setup}$ and write $W_i := (R_{setup}, R_1, \cdots , R_i)$. My question concerns Definition 3.1: is the ith answer $a_i$ independent of $(a_1, \cdots, a_{i-1})$ and $R$, or is $a_i$ independent of $(a_1, \cdots, a_{i-1})$ and $W_{i-1}$?
Definition 3.1 seems to assert the former, and Definition 2 in [Aumuller et al.'21] only asserts independence from $(a_1, \cdots, a_{i-1})$; however, it appears to me that the proof of Claim 3.3 assumes independence of $(a_1, \cdots, a_{i-1})$ and $W_{i-1}$. In particular, Line 266 states: "since the $R_i$ are fixed, suppose that $f(R) \subset M$ is the set of queries for which A wrongfully answers $\bot$." This sentence makes sense to me if $f(R)$ here is really $f(W_Q)$. Then, to use DPI, line 269 would need to read "$(a_1, \cdots, a_{i-1})$ is independent of $f(W_{i-2})$", which does not seem to be guaranteed by the definition of a fair NN, per the concern above. Alternatively, if $f(R)$ on Line 266 truly means $f(R) = f(R_{setup})$, then I think $f(R_{setup})$ alone does not define the set of queries for which A wrongfully answers. (This could be the case for a specific fair NN construction, e.g. if $R_1, \cdots, R_Q$ are used in a limited way that does not affect the set of incorrect queries. But from the current description I don't think this is generally the case, especially since Definition 2.1 explicitly says that the failure probability is "taken over the algorithm’s entire internal randomness $ (R_{setup}, R_1, \cdots , R_i)$".) - Could you address the question raised in the Weaknesses section? - In [Feng'25], in addition to the space and the query time, the authors also analyze update time and preprocessing time: 1. Both this paper's definition of robust NN and the fair NN definition in [Aumuller et al.'21] appear to assume a static setting where the dataset is fixed up front. Do the algorithms here support (or admit natural modifications to support) dynamic updates, i.e., insertions and deletions? If so, how do the update times compare to [Feng'25]? 2. What are the preprocessing times of the algorithms in this paper? Fully human-written
Efficient Algorithms for Adversarially Robust Approximate Nearest Neighbor Search Soundness: 3: good Presentation: 4: excellent Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper studies the Approximate Nearest Neighbor (ANN) problem in the presence of an adaptive adversary. In the model considered, the adversary first fixes the dataset but can subsequently select each query point adaptively based on previously observed outputs. Classical ANN algorithms such as locality-sensitive hashing (LSH) assume an oblivious adversary and may fail under adaptive attacks. The authors seek to design algorithms that remain correct and efficient even against a powerful adaptive adversary. The paper introduces several algorithms and provides corresponding theoretical guarantees. First, the authors show that fairness in ANN search implies adversarial robustness, establishing a formal connection between fair selection among valid near neighbors and adaptive security. They then present a bucketing-based robust search algorithm by reducing ANN search to a weak decision problem and leveraging differential privacy machinery to ensure robustness. Finally, they introduce a concentric-annuli LSH construction that overcomes the $\sqrt{n}$ query-time barrier present in their bucketing framework, again utilizing differential privacy techniques to carefully control information leakage. These algorithms yield provable guarantees across both high-dimensional settings $(d = \omega(\sqrt{Q}))$ and low-dimensional ones $(d = O(\sqrt{Q}))$, and the paper establishes the corresponding bounds in Theorems 1–4. The contributions rely on a combination of tools including fair ANN sampling, differential privacy, and careful runtime analysis over concentric geometric partitions. 1. The observation that fairness implies robustness is conceptually elegant and powerful. Its applicability extends beyond the ANN problem and may motivate further exploration of fairness-based defenses in other algorithmic settings. 2. Theorem 3, in particular, presents a strong result for adversarially robust ANN, improving upon prior work and breaking the $\sqrt{n}$ barrier under mild assumptions. 3. In addition to high-dimensional settings, the authors also provide results for low-dimensional cases, strengthening the completeness of the theoretical contributions. 4. The paper is built on solid mathematical foundations and includes rigorous proofs, carefully addressing subtle issues related to privacy, adaptivity, and randomized algorithm behavior. 1. Both Theorems 2 and 3 include a $\sqrt{Q}$ factor, which can be significant when the number of adaptive queries is large. While Theorem 1 avoids this factor, it introduces dependence on data density, and it remains unclear whether the $\sqrt{Q}$ dependence can be eliminated in the general case. 2. The adversary is assumed to fix the dataset in advance but may adaptively choose queries throughout execution. The paper does not provide sufficient justification for this threat model, and it is not obvious that this form of adversary naturally arises in practice. 3. There is a typo on page 4, line 204. 4. 
Although the theoretical results are compelling, the paper does not include experiments to assess the algorithms’ performance in practical settings or illustrate their behavior under realistic adversarial scenarios. 1. Can the authors provide concrete examples or motivating applications where an adversary selects the dataset a priori but adaptively selects queries during execution? This would help clarify the practical relevance of the adversary model. 2. Could alternative adversary models also be considered? For instance, in an online learning setting where data points arrive sequentially and both data and queries may be chosen adaptively, would the proposed techniques still apply? 3. The conclusion mentions adversaries with “more information,” such as timestamps. Could the authors elaborate on what forms of additional information are considered and how such leakage might affect their guarantees? 4. Do the authors intend to conduct empirical experiments to validate the performance of their algorithms or to evaluate robustness under practical adaptive attack strategies? Heavily AI-edited
Efficient Algorithms for Adversarially Robust Approximate Nearest Neighbor Search Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors address the challenge of making approximate nearest neighbor search (NNS) robust to adversarially constructed queries. In particular, they consider the case where, for a fixed randomness, the query designer can issue a sequence of $Q$ queries. They make the observation that fairness implies robustness, which allows fair algorithms to be applied for robustness. To overcome the distributional assumption of this fairness-based algorithm, the paper subsequently introduces a bucketing approach. To break the $\sqrt{n}$ query time of this approach, they introduce their main algorithm, which uses concentric annuli. They conclude by giving an algorithm that provides guarantees for any query, not just a fixed one. - Their approach appears to improve over the prior query time of $\sqrt{n}$ when $c$ is large, while maintaining robustness to adaptive adversarial queries. - The algorithm using concentric annuli, with the guarantee that one of the annuli will finish quickly, is interesting and appears new. - In general, I am not a fan of how the paper is structured and presented. The paper is presented as a story or a textbook chapter, so that it is unclear what the main results are. In particular, the final section, giving the algorithm with guarantees for all queries, seems fairly disconnected from the previous ones. At several points in the paper, ideas are presented that sound interesting but are not clearly explained. One such example is the notion of 'timestamps' on line 146 that is repeated in the conclusion. - It is not clear why the reader should care about the main result of improving over the $\sqrt{n}$ barrier. It might be nice, for example, to see experiments that verify a setting where the proposed algorithm is faster. - Other than the concentric annuli algorithm, the proposed approaches/connections between fairness and robustness do not seem surprising. - How does your work differ from [1], which also provides guarantees for worst-case queries? Is it that the guarantees are for interactive querying in your work? They also develop an adversarially robust algorithm for approximate NNS based on LSH. In particular, their algorithm optimizes performance for the worst possible query, which seems to be quite similar to the goal of this work. [1] "Learning to Hash Robustly, Guaranteed". Andoni, Beaglehole. ICML 2022. None. Fully human-written
Efficient Algorithms for Adversarially Robust Approximate Nearest Neighbor Search Soundness: 3: good Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper tackles ANN under adversarial query sequences where both the dataset and queries are adversarially chosen. For high-dimensional settings, it proposes a progression of algorithms: leveraging fairness to achieve robustness, using differential privacy-based robust deciders with bucketing, and ultimately introducing a concentric-annuli LSH construction to break inherent query time lower bounds. For lower dimensions, it presents for-all algorithms based on metric coverings that guarantee correctness for all queries. The paper introduces new theoretical tools and extends known connections between fairness, robustness, and privacy in ANN search. 1. This work establishes and proves (Claim 3.3) the core theoretical result that exact fair ANN algorithms are adversarially robust. 2. The paper presents a thoughtful progression from fairness-induced robustness (Theorem 1.1) to assumption-free methods via bucketing (Theorem 1.2), culminating in the concentric annuli construction (Theorem 1.3) that achieves sublinear query time even on worst-case datasets. 3. Table 1 provides a summary of algorithmic tradeoffs (query time, space) under varying assumptions and makes the advances over rival works, notably Feng et al. (2025), concrete. 1. This work is entirely theoretical. It lacks experiments to support the theoretical results. I would suggest the authors include some experiments to support the claims. 2. Claims are ambiguous. For instance, Theorem 1.1 and Theorem 1.2 offload much of the complexity into density ratios, and it remains unclear how these density ratios behave in real application scenarios. 3. In the for-all algorithms (sec 1.1.2), how severe is the intractability when $d$ is only modestly large (e.g., $d = 30$ or $100$)? Are there adaptations that would make these algorithms relevant in practice, beyond toy cases? 4. Table 1 has many notations. I would suggest the authors draw a figure or simplify the notations to highlight the changes between different algorithms. For example, the definitions of D, p, beta, and s are unknown to readers at that point. See weaknesses. Fully human-written
$\textbf{SDPose}$: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents SDPose, a fine-tuning framework that leverages Stable Diffusion (SD) as a vision backbone for human pose estimation. Instead of modifying cross-attention layers or using learned embeddings, SDPose directly predicts keypoint heatmaps in the SD U-Net’s latent space, preserving the generative priors of diffusion models. The authors also construct COCO-OOD, a style-transferred variant of COCO for systematic robustness evaluation. SDPose matches or surpasses Sapiens-1B/2B on COCO with only one-fifth of the training cost and achieves new state-of-the-art performance on HumanArt and COCO-OOD. The model further demonstrates zero-shot pose annotation capabilities for pose-guided image and video generation, establishing SDPose as an efficient, generalizable, and versatile framework for structured prediction using diffusion priors - Novel approach: Exploits Stable Diffusion’s latent features directly, preserving generative priors instead of introducing ad-hoc conditioning modules. - Strong performance: Matches SoTA on COCO and achieves new records on HumanArt and COCO-OOD, with significant training efficiency (1/5 of Sapiens training time) - Benchmark contribution: Introduces COCO-OOD, enabling standardized OOD robustness evaluation for pose models. - Analysis of computational cost: The paper highlights training efficiency but provides limited discussion of inference-time latency and memory overhead compared to conventional backbones like ViTPose. - Ablation breadth: Some architectural design choices (e.g., selection of diffusion timestep or use of SD-v1.5 vs v2) could benefit from more justification or quantitative analysis. - How would SDPose perform under multi-person scenarios or crowded scenes, where top-down cropping might limit context? - Would fine-tuning from more recent foundation diffusion models (e.g., SDXL) further improve generalization, or is the performance already saturated? Fully AI-generated
$\textbf{SDPose}$: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes SDPose, a fine-tuning framework built upon Stable Diffusion (SD) to adapt pre-trained diffusion priors for 2D human pose estimation, particularly focusing on out-of-domain robustness. SDPose operates entirely in the latent space of the SD U-Net without modifying attention modules. The authors add a lightweight heatmap head to map multi-scale latent features to keypoint heatmaps and an auxiliary RGB reconstruction branch for generative regularization, preserving domain-transferable visual semantics. Additionally, the authors introduce COCO-OOD, a style-transferred version of COCO (via CycleGAN), to benchmark robustness under domain shift. Experiments show SDPose achieves SOTA performance on COCO, HumanArt, and COCO-OOD. It also demonstrates zero-shot usability as a pose annotator for controllable image/video generation. 1. The paper, unlike prior adaptations (e.g., fine-tuned cross-attention or learned condition embeddings), leverages pre-trained diffusion features with minimal architectural change, preserving SD's representational power. The auxiliary RGB reconstruction task is a simple yet effective regularization strategy for maintaining generative semantics during fine-tuning. 2. The authors show SDPose almost matches Sapiens-1B/2B on COCO while requiring 1/5 of the training epochs and a smaller backbone than the 2B model. The paper also shows SOTA performance on other in-the-wild datasets - HumanArt and COCO-OOD. Ablations validating the contribution of diffusion priors and auxiliary reconstruction are helpful. 3. Experimental tables are well-organized and reproducibility details (hardware, hyperparameters, dataset splits) are appreciated. 1. The core idea of repurposing SD U-Net features for downstream vision is shared with prior works like Marigold, Lotus, and GenLoc. The proposed approach's novelty lies mainly in its architectural restraint (no attention tuning) and addition of a reconstruction branch. 2. The paper does not deeply analyze why diffusion priors confer robustness, e.g., what specific latent feature properties (multi-scale, texture invariance, semantic richness) contribute most. 3. OOD tests (HumanArt, COCO-OOD) are style-based, and geometric or camera-domain shifts (e.g., occlusion, lighting, or viewpoint) are not considered. 1. Would SDPose generalize to other OOD types such as blur, occlusion, or synthetic-to-real transfer? Could the authors test or discuss applicability in non-artistic domain shifts? 2. Did the authors explore multi-scale fusion from different U-Net layers, rather than selecting a single feature map? Heavily AI-edited
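To make the adaptation recipe described in these reviews concrete, below is a minimal sketch of a heatmap head plus auxiliary RGB-reconstruction objective on top of SD U-Net latent features. Module names, channel counts, and the loss weighting are assumptions for illustration, not the authors' code.

```python
import torch.nn as nn
import torch.nn.functional as F

class HeatmapHead(nn.Module):
    """Lightweight convolutional head mapping U-Net latent features to K keypoint
    heatmaps. Channel counts and upsampling factor are illustrative assumptions."""
    def __init__(self, in_ch=1280, num_keypoints=17, up=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.SiLU(),
            nn.Upsample(scale_factor=up, mode="bilinear", align_corners=False),
            nn.Conv2d(256, num_keypoints, 1),
        )

    def forward(self, feats):
        return self.net(feats)

def sdpose_style_loss(unet_feats, head, recon_decoder, target_heatmaps, target_rgb,
                      lambda_recon=0.1):
    """Pose heatmap regression plus an auxiliary RGB reconstruction branch that
    regularizes fine-tuning so the diffusion features keep their generative prior.
    `recon_decoder` is an assumed small decoder back to image space."""
    pose_loss = F.mse_loss(head(unet_feats), target_heatmaps)
    recon_loss = F.l1_loss(recon_decoder(unet_feats), target_rgb)
    return pose_loss + lambda_recon * recon_loss
```

Under this sketch, the U-Net's attention modules are left untouched and only the added heads (and whatever subset of the backbone is unfrozen) receive task gradients, which is the "minimal architectural change" the reviews emphasize.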
$\textbf{SDPose}$: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces SDPose, a novel framework that adapts a pre-trained Stable Diffusion (SD) model for human pose estimation. The core idea is to leverage the rich, general-purpose visual features learned by the SD model's U-Net to achieve superior robustness, especially on out-of-domain (OOD) data like artistic images. Instead of modifying the cross-attention mechanisms or adding new embeddings, SDPose makes minimal changes: it adds a lightweight convolutional head to predict keypoint heatmaps directly from the U-Net's intermediate features and uses an auxiliary RGB reconstruction task to prevent overfitting and preserve the model's generative priors. - It targets the OOD task and finds a simple way to use the pretrained model for it. The application of SDPose as a zero-shot pose annotator for ControlNet image generation and video generation provides tangible evidence of its superior qualitative performance over baselines like DWPose. - The paper is well organized and the evaluations are comprehensive. - My major concern is the novelty of this work, which is quite marginal. It just uses the pretrained model for a pose estimation task; nothing seems special or challenging. - The paper just reports the experiments but does not provide much insight. It would be better to provide some explanation of why the proposed method works better. Refer to strengths and weaknesses Fully human-written
$\textbf{SDPose}$: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. SDPose is a fine-tuning framework on Stable Diffusion that repurposes pre-trained diffusion priors for human pose estimation. While diffusion models offer rich multi-scale features and strong cross-domain robustness, their use for structured outputs like pose estimation is underexplored, and existing pose methods often degrade under domain shift and demand heavy fine-tuning. SDPose asks whether SD U-Net latent features alone can produce reliable heatmaps. Adopting the x₀-prediction design and the Lotus “Detail Preserver” strategy preserves fine detail and avoids overfitting, enabling efficient adaptation with minimal architectural changes. With one-fifth of Sapiens’s training schedule, SDPose matches Sapiens-1B/2B on COCO, sets new SOTA on HumanArt and COCO-OOD. 1. By fine-tuning pre-trained diffusion models, the method achieves state-of-the-art performance in OOD pose estimation benchmark with significantly fewer training steps. 2. It also demonstrates that the intermediate features of pre-trained diffusion models are highly beneficial for generalized pose estimation. 1. Although prior works, as cited in the paper, have shown that the last and penultimate layers of the SD U-Net provide strong features, this study focuses on exploring features specifically for pose estimation, so conducting ablation only on these two layers seems insufficient. 2. The architecture appears to be largely borrowed from the Lotus paper, suggesting a lack of novelty. 3. For COCO-OOD generation, applying only Monet-style paintings seems limited. While Monet’s style is indeed out-of-domain, it would have been more convincing if other artistic styles were also used for style transfer. 4. The paper’s readability needs improvement. 1. Much of the design seems to be inspired by the Lotus paper[1]. Could the key differences or novel contributions be clarified beyond the task difference? 2. Since it seems necessary to find an appropriate training epoch that balances pose estimation performance and generalization from the pretrained model, are there any experiments showing performance variation across training epochs? [1] He, Jing, et al. "Lotus: Diffusion-based visual foundation model for high-quality dense prediction." arXiv preprint arXiv:2409.18124 (2024). Lightly AI-edited
Unsupervised Behavioral Tokenization and Action Quantization via Maximum Entropy Mixture Policies with Minimum Entropy Components Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors present an unsupervised method based on online RL to learn an efficient tokenisation of actions in a continuous control problem, by learning a mixture policy that maximises the entropy of the actions while the sub-policies in the mixture minimise their entropy. The authors then use their method to derive quantised actions that can be optimised through an actor-critic algorithm to solve control problems more efficiently, and they show how the same action quantisation generalises to different tasks. They provide empirical results showing that the method is comparable to or improves over existing work. - The problem of how to compress complex RL problems into interpretable, efficient sub-policies or skills is very relevant. - The solution proposed by the authors (the maxi-mix-mini-com objective) is interesting, and I think unsupervised methods are a reasonable approach to this problem. - The empirical evaluation and the baselines compared are sufficient to demonstrate the efficacy of the method. - I am not completely convinced by the method as is; see the box below for questions. - The method seems to rely on quite a few heuristics, which are not always clearly explained. For example, the authors state in the limitations that they need to downscale the variance of the learned component policies a posteriori. It is not very clear how much engineering the method needs to work at scale. I write the questions in order of appearance in the text (not in order of relevance). 1. What does long-lived or short-lived mean, intuitively and formally? 2. Would the objective in (2) (sketched after this review, as I read it) be maximised by a uniform mixture of deterministic policies? 3. If this is the case (and please correct me if I'm wrong), I’m not sure why this requires learning. Couldn't you construct these policies online with zero learning cost? At each step, you could construct the sub-policies as deterministic (or very low variance) actions, chosen sequentially so that each new policy is sufficiently different from the ones already generated, and then randomise with equal probability for the mixture. 4. Continuing on this line, what prevents the method from converging to trivial (useless) policies? How can you be sure that the sub-policies learned are of any interest? A trivial solution to the objective (if I understand correctly), with e.g. 2 policies, would be for each policy to pick one action deterministically and for both policies to be mixed with equal weights, but this is not necessarily interesting in many cases. 5. Shouldn't the unsupervised ‘tokenisation’ be somehow linked to other policy metrics like obtained rewards? In line 193 you mention that the method leads to a compression of the actions, but can this lead to an ‘arbitrary’ compression that does not take rewards into account (and in fact it will not take them into account)? So is it possible that the quantisation learned completely stops the agent from being able to get high rewards? I can think of toy examples where this would happen and I’m not convinced this does not happen in general. If this is correct, then perhaps the quantised policies work in the examples you tested because the reward function is expressive enough. Please correct me if I misunderstood some part. 6. What is the intuition for the method doing much better than PPO? I would expect an unconstrained algorithm (in terms of the sub-policies) to still do better, but at a higher complexity cost. Fully human-written
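For reference, a plausible schematic form of the maxi-mix-mini-com objective questioned above, assuming a mixture policy $\pi_{m,\theta}(a\mid s)=\sum_k w_{k,\theta}(s)\,\pi_{k,\theta}(a\mid s)$ and a trade-off coefficient $\beta$; the weighting and discounting in the paper's Eq. (2) may differ:

$$\max_{\theta}\;\mathbb{E}_{\pi_{m,\theta}}\Big[\sum_{t}\gamma^{t}\Big(\mathcal{H}\big(\pi_{m,\theta}(\cdot\mid s_t)\big)-\beta\sum_{k} w_{k,\theta}(s_t)\,\mathcal{H}\big(\pi_{k,\theta}(\cdot\mid s_t)\big)\Big)\Big]$$

Under this reading, a uniform mixture over well-separated, near-deterministic components maximises the first term while driving the second to zero, which is exactly the degenerate solution that questions 2, 3, and 4 above are probing.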
Unsupervised Behavioral Tokenization and Action Quantization via Maximum Entropy Mixture Policies with Minimum Entropy Components Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes an online unsupervised approach for action quantization. This approach has the potential to reduce the complexity of the action space by learning a relatively small but useful action set, so that downstream tasks can use this more compact action set to learn more efficiently. In the first step, the agent maximizes the entropy of a mixture policy while minimizing the entropy of the individual components of the mixture, which yields distinct but more focused components. In the second step, a discrete-action algorithm treats the learned and fixed components, or their modes, as actions and maximizes rewards. On the theory side, the paper 1) shows that the learned action modes of a mixture policy with enough capacity are lossless (cover the whole action space) in the discrete setting, and 2) derives the gradient estimators for the unsupervised learning of mixture policies. Empirically, the paper illustrates the impact of the strength of the entropy regularization and the capacity of the mixture, and investigates the performance of this approach in continuous-control tasks. This paper has the following strengths: 1. Learning useful sub-policies in reinforcement learning (RL) is an important problem and has the potential of scaling up RL to solve more complex problems. 2. The proposed unsupervised learning objective for learning diverse mixture components is novel and demonstrated to be effective. 3. It provides useful theoretical characterizations of the proposed method. While the theorems do not appear to be difficult to prove technically, they are useful for describing some of the fundamental properties of learning the mixture policy with the unsupervised learning objective. 4. The paper is well-written. It’s easy to follow and to find specific details in the appendix. Despite the strengths, the paper has some weaknesses to be addressed: 1. The benefit of the proposed approach is not well demonstrated. Since pretraining is performed to extract a behavioral prior, the paper does not demonstrate that such an approach improves efficiency compared to methods that train from scratch (SAC / PPO). 2. The current empirical evaluation is quite limited. 1) Experiments are only performed in a subset of MuJoCo environments and a few other standalone tasks. It’d be beneficial to increase the coverage of the tested tasks, for example with more MuJoCo locomotion and navigation environments. 2) There is no comparison to the natural, trivial uniform quantization baseline (a sketch of such a baseline is given after this review). 3. Some limitations of the proposed approach are not discussed: 1) One limitation of the proposed unsupervised learning approach is that it requires a state-dependent action space to encourage meaningful learned components. If such “available action sets” are not available, the learned components will likely just scatter randomly across the action space. 2) Another limitation is that the paper does not investigate how the components could be fine-tuned in downstream tasks when the pretrained components are not optimal or are even limiting. Here are questions that might impact the rating: 1. The paper mentions the learned components generalize across environment layouts (Line 96). Could the authors clarify if the mentioned result is in Table 1? Further, could the authors clarify the observation spaces in pretraining and downstream training? 2. Could the authors provide learning curves for the experiment results in Section 4? 3. Could the authors provide comparisons to the trivial uniform quantization? 4. Is DQN or PPO used for each experiment? It’s unclear from the discussion in Lines 408-409. 5. As mentioned in the appendix, the proof of Theorem 4 appears to be similar to that of a theorem in He et al. (2025). In addition, I also found some parallels between Section 3 and Section 4 of He et al. Could the authors clarify the difference between the two and highlight the contributions in this paper? Other minor suggestions: 1. Line 362: It’s confusing to see the acronym SD. Similarly, it might be better to spell out USD in the caption of Table 2. 2. Standard errors are not proper confidence intervals. It’s difficult to tell the statistical significance of the reported results in the tables. It’d be better to use proper confidence intervals. Fully human-written
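To make the uniform-quantization baseline requested above concrete, here is a minimal sketch, assuming a box-shaped continuous action space; the bin count and dimensionality are illustrative:

```python
import numpy as np

def uniform_action_grid(low, high, bins_per_dim):
    """Trivial uniform quantization: a fixed, state-independent grid of discrete
    actions covering the continuous action box [low, high]."""
    axes = [np.linspace(l, h, bins_per_dim) for l, h in zip(low, high)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    return grid.reshape(-1, len(low))  # shape: (bins_per_dim ** dim, dim)

# e.g. a 2-D action space in [-1, 1]^2 with 5 levels per dimension -> 25 discrete actions
actions = uniform_action_grid(np.array([-1.0, -1.0]), np.array([1.0, 1.0]), 5)
```

Such a grid requires no pretraining, so it is a natural lower bar against which the learned, state-dependent quantization could be compared.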
Unsupervised Behavioral Tokenization and Action Quantization via Maximum Entropy Mixture Policies with Minimum Entropy Components Soundness: 4: excellent Presentation: 3: good Contribution: 4: excellent Rating: 8: accept, good paper Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes an algorithm called maxi-mix-mini-com that can learn transferable behaviour tokens in an unsupervised manner. Maxi-mix-mini-com aims to maximise the discounted sum of future entropies while minimising the entropy of each behaviour token. By cleverly introducing a new optimisation variable $r$, the authors present a provably convergent algorithm that can learn diverse behaviours. The authors also empirically demonstrate that by learning a high-level policy based on the learned behaviour tokens, they can easily obtain a high-performing policy, manifesting the reusability and diversity of the learned behaviour tokens. The paper is well-written and easy to understand. In particular, the introduction section provides a great overview of existing research and effectively articulates the significance of their work, making it one of the best introductions I've read recently. Also, the proposed algorithm is backed by rigorous mathematical theorems and extensive empirical analysis on various benchmarks. Although the paper is, in general, easy to follow, there are some parts that need further clarification. 1. Denoting the right-hand side of (4) by Q is a bit misleading, because technically speaking, it is not a Q function. 2. Please add a cross-reference to Figure 4 in Section 2.1. 3. Lines 315-316 are difficult to understand. How is $\pi_{k,\theta}(a\mid s)$ defined? 4. Figures 3(c) and 3(d) need more explanation. The action distribution would be different for each state. I suppose training upon the learned behaviour tokens would drastically speed up the high-level training process. Could you provide learning curves for the **Transfer of learned behavioral tokens across tasks** experiments? Fully human-written
Unsupervised Behavioral Tokenization and Action Quantization via Maximum Entropy Mixture Policies with Minimum Entropy Components Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes an unsupervised method to quantize the action space to learn ‘tokens’, specialized one-step policies, which are used by downstream policies to tackle a wide range of tasks. The key idea behind the proposal is the maxi–mix–mini–com entropy objective: maximize the entropy of the mixture policy to ensure coverage, while minimizing the entropy of each component for specialization. A tractable iterative algorithm is proposed and theoretically shown to converge to a unique solution in the tabular setting. Empirical results indicate that the learned tokens are task-agnostic and can be used to achieve performance comparable to methods explicitly designed for single-task optimization. The paper is well-motivated and clearly structured for the most part. The experiments in the tabular domain effectively illustrate the core idea, while those in continuous control environments demonstrate the scalability of the proposed approach. Furthermore, the supplementary videos help convey the qualitative behavior of the learned representations. The paper presents four theorems that substantiate its main claims and experimental findings. All four proofs appear to be sound. I was able to follow and did not identify any errors in the proofs of Theorems 1 and 2. The proofs of Theorems 3 and 4 are comparatively straightforward, as they build directly upon the results established in [1] and [2]. However, the paper has three issues: An important advantage of unsupervised behavioral tokenization is its potential to improve sample efficiency in downstream tasks. As stated in Lines 42–43, “By focusing on core representative tokens, behavioral tokenization can improve sample efficiency, accelerate convergence, and avoid wasteful exploration of irrelevant continuous actions”. However, the paper reports only the final downstream performance after 3 million training steps, without intermediate evaluations. Presenting learning curves or periodic evaluations for the proposed method and the baselines would provide clearer empirical support for this claim. The paper does not specify how the hyperparameters were chosen for both the proposed method and the baselines. Were defaults used? If any hyperparameters were tuned, did each method get a fair opportunity to tune the same number of hyperparameters? Were hyperparameters tuned across environments? The reported results are based on only five random seeds. Strong claims cannot be made based on such a small number. You could either increase the number of seeds, and then justify why it is sufficient, or you could aggregate across environments and only make claims at the aggregate level. You could then just show individual runs per environment, to qualitatively show behavior. Finally, though not strictly a weakness, it was unclear why the modes of the component Gaussian policies were used as discrete actions. To quote: “Action space is quantized using the modes of the learned component Gaussians, a method that provides the first unsupervised online AQ algorithm. 
Next, state-dependent quantized actions are used as discrete actions in a DQN (Mnih et al., 2013) or discrete PPO (Schulman et al., 2017). The learned components and quantization are not allowed to change during the optimization of cumulative reward using the discrete controller” Couldn’t the component policies themselves be considered the tokens, with DQN learning to take the action of sampling from one of the component policies (both designs are sketched after this review)? This is not obviously better, but it would be useful to discuss this choice more. (Putting Minor Points here, as there is no separate box. These are not major issues) 1. It was not immediately clear what the state- or action-centric perspective referred to. The concept only became clear after reading Section A.1 (Lines 799–809), which appears too late in the paper. I would suggest moving parts of this explanation to the introduction for better clarity and accessibility. 2. The term $d_{\pi_{m,\theta}}$ is introduced in Theorem 3 but defined only later in Theorem 4. 3. Line 329: citation formatting should be corrected to “from Jang et al., 2016” (without brackets). 4. The references section requires proper formatting. 5. Line 1309 – It appears that $M$ in $\mathbb{R}^M$ has not been defined earlier. Based on the context, it seems to correspond to $|\mathcal{A}|$. 6. In Section A.5, the paragraph on “Downstream Tasks” is missing the word “Table” when referring to the results. References: [1] Sutton, Richard S., and Andrew G. Barto. "Reinforcement learning: an introduction, 2nd edn. Adaptive computation and machine learning." (2018) [2] He, Jiamin, et al. "Investigating Mixture Policies in Entropy-Regularized Actor-Critic." Questions, summarized from the above weaknesses. 1. Could you provide results on learning efficiency? 2. Could you explain how hyperparameters were chosen? 3. Could you justify claims based on the current number of seeds, or provide updated results to support claims? 4. Can you explain the choice of using modes of the Gaussian component policies? I have currently put scores reflecting my uncertainty on these points. For example, I cannot assess soundness without understanding how hyperparameters were chosen, so even though the theory is sound, I had to put a lower rating there. I will adjust this based on responses to questions. Fully human-written
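To make the design choice questioned above concrete, here is a minimal sketch of the two options for turning a learned mixture into a discrete action set: taking the mode of each Gaussian component (the design described in the review) versus treating "sample from component k" as the action. The component parameters below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned mixture for one state: K = 3 one-dimensional Gaussian components.
means = np.array([[-0.8], [0.0], [0.9]])   # component modes
stds = np.array([[0.05], [0.05], [0.05]])  # component standard deviations

def act_mode(k):
    # Option 1 (as described in the review): discrete action k maps to the mode of component k.
    return means[k]

def act_sample(k):
    # Option 2 (the alternative raised above): discrete action k means
    # "sample from component k", keeping some within-token stochasticity.
    return rng.normal(means[k], stds[k])
```

The first option gives a deterministic, replayable action set; the second preserves exploration inside each token at the cost of a noisier value target for the discrete controller.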
CAREFL: Context-Aware Recognition of Emotions with Federated Learning Soundness: 2: fair Presentation: 1: poor Contribution: 1: poor Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes a federated learning framework for emotion recognition from images, designed to balance contextual reasoning, privacy, and computational efficiency. The system operates in two stages: (1) a large vision–language model (LLaVA 1.5) generates contextual captions for each image, and (2) a lightweight vision–language model (SMOLVLM2) is fine-tuned with Quantized Low-Rank Adaptation (QLoRA) in a federated setting. This design enables decentralized training without sharing raw data while leveraging semantic context from the larger model. Experiments on EMOTIC and CAER-S datasets show that CAREFL achieves higher mean average precision and F1-scores compared to larger centralized models such as GPT-4o and LLaVA, while reducing memory usage and model size. The paper’s contributions include: (1) proposing a novel two-phase federated framework combining large-model context generation with small-model adaptation, (2) introducing an efficient QLoRA-based fine-tuning scheme for lightweight federated training, and (3) providing comparative and ablation studies across datasets, client numbers, aggregation methods, and quantization settings. Despite its technical framing, the paper appears conceptually weak and executionally shallow: (1) The link between “context awareness” and federated learning is not clearly articulated. Context generation is performed offline using an existing large model, not integrated dynamically into the FL process. This makes the “context-aware” claim superficial. (2) Illustrations and explanations lack clarity. Figures 1 and 2 are schematic and omit crucial architectural or algorithmic details; the paper mostly reuses known components (YOLO, LLaVA, QLoRA) with limited methodological innovation. (3) Evaluations rely on narrow datasets (EMOTIC, CAER-S) without broader benchmarking or meaningful statistical analysis; performance comparisons against massive centralized models seem unfair and are not deeply analyzed. (4) Many claims (e.g., “context improves emotion recognition”) are intuitive but not theoretically supported or quantitatively dissected. Overall, the presentation feels more like a system demonstration than a rigorous ICLR-level contribution; key insights or innovations are missing. 1. How exactly does “context awareness” influence the federated learning process? Does context affect model aggregation or only data preprocessing? Why was context generation performed offline instead of integrated dynamically during training? 2. How does the framework generalize to other tasks beyond emotion recognition? 3. How are biases or errors from LLaVA-generated captions mitigated during federated fine-tuning? Fully AI-generated
CAREFL: Context-Aware Recognition of Emotions with Federated Learning Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper presents CAREFL, a framework designed for efficient and privacy-preserving emotion recognition. First, a large vision-language model (LLaVA 1.5) generates contextual descriptions of images to enrich semantic information. Second, a lightweight model (SMOLVLM2) is fine-tuned using QLoRA within a federated learning setup. This method allows distributed training without sharing raw data. Experiments on the EMOTIC and CAER-S datasets show that CAREFL achieves high accuracy and F1-scores while significantly reducing computational and memory requirements. 1. The two-phase design, which cleverly combines large VLMs for context generation with lightweight models for federated learning, is reasonable. 2. Experiments show strong performance, surpassing larger centralized models like GPT-4o and LLaVA. 3. The paper is well-written and easy to read. 1. The proposed two-phase design relies on rich contextual descriptions generated offline using LLaVA 1.5. However, in real-world or real-time emotion recognition scenarios, such offline pre-generation is impractical due to latency, computational overhead, and privacy constraints. This is inconsistent with the authors' claims. 2. The experimental setup overlooks realistic aspects of federated learning, such as heterogeneous client data distributions, communication latency, and device variability. 3. Could you show examples of successful and failed predictions for discussion? Please see Weaknesses. Lightly AI-edited
CAREFL: Context-Aware Recognition of Emotions with Federated Learning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes CAREFL, a two-phase framework for multimodal emotion recognition that (1) uses a large frozen VLM (LLaVA-1.5) offline to generate rich scene/subject contextual descriptions, and (2) fine-tunes a small, efficient SVLM (SMOLVLM2) on client devices in a federated manner using quantized low-rank adapters (QLoRA). Experiments on EMOTIC (multi-label) and CAER-S (7 classes) show large gains in mAP and varying gains in F1/Recall. 1. This paper proposes a lightweight training approach, which is shown to be effective at achieving promising model performance. 2. This paper conducts a comprehensive evaluation covering different aggregation algorithms (FedDyn, FedAvg, FedProx, FedAdam), LoRA ranks, and quantization settings (4-bit QLoRA vs full LoRA). 3. Large performance improvement on the EMOTIC benchmark. 1. Claims of outperforming huge baselines need more careful parity. The paper states that CAREFL outperforms GPT-4o, LLaVA and other heavy models, but many of these baselines are used in zero-shot or prompting setups while CAREFL is fine-tuned (and in federated settings). 2. Lack of evaluation benchmarks. The proposed models and baselines are mostly evaluated on EMOTIC. The results of the proposed model on CAER-S are not compared with any baselines. 1. For results on EMOTIC, why is mAP so high while recall and F1 are modest? Fully human-written
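For readers unfamiliar with the setup described above, here is a minimal sketch of what QLoRA-style client-side fine-tuning of a small VLM could look like with the Hugging Face peft/bitsandbytes stack. The auto-model class, checkpoint id, target modules, and hyperparameters are assumptions for illustration, not details taken from the paper.

```python
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantized base weights (the "Q" in QLoRA); values are illustrative.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",  # placeholder checkpoint id
    quantization_config=bnb_cfg,
)

# Low-rank adapters are the only trainable parameters on each client,
# and the only weights that need to be communicated to the server.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)

# In a FedAvg-style round, the server would average only the adapter tensors, e.g.:
# global_lora = {k: sum(c[k] for c in client_adapters) / len(client_adapters)
#                for k in client_adapters[0]}
```

Since only the adapters leave the device, the per-round communication and memory cost is small, which is consistent with the reduced memory and model-size footprint described above.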
When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This submission studies how LLMs learn to balance exploration and exploitation when trained to solve multi-armed bandit (MAB) problems, using both supervised fine-tuning and reinforcement learning. They train LLMs on various MAB environments using RL with three forms of reward design (sketched after this review): original reward (i.e. the raw environment rewards), strategic reward (a normalized, regret-shaped reward that reflects how close the agent’s action is to the optimal arm at each timestep), and algorithmic reward (a signal that rewards the agent for matching the decisions of an expert oracle, specifically the UCB algorithm). For supervised fine-tuning, they train the LLM on synthetic data generated from UCB expert trajectories, including step-by-step chain-of-thought demonstrations of UCB value calculations. They evaluate the performance of these methods (when fine-tuning Qwen models) on Gaussian and Bernoulli MAB environments. They find that the fine-tuned models outperform vanilla models on these tasks, and that policies trained with UCB imitation perform best. These methods also perform comparably to UCB and Thompson sampling. Upon further analysis, the authors find that these gains often come from the models adopting greedy behaviors and under-exploring. Using LLMs to solve decision-making tasks under uncertainty is an important next frontier, so the problem being studied is well-motivated. While the authors are not the first to fine-tune LLMs to solve these sorts of bandit tasks, their proposed reward designs are novel, from what I can tell. The finding that gains often stem from more greedy behavior is also interesting, and largely matches the findings of Krishnamurthy et al. for vanilla models which are not fine-tuned. I am not entirely sure what this paper is trying to accomplish. The typical motivation for studying LLMs’ ability to solve bandit tasks is that we want to design AI models capable of solving complex, text-based, decision-making tasks under uncertainty. This is because we don’t currently have good algorithms for solving these types of tasks. Bandits get to the essence of the exploration/exploitation trade-off, and so they can be a useful abstraction. However, I am not sure what utility we get from fine-tuning LLMs to be more like existing algorithms, without evaluating their performance on some complex task that existing algorithms do not do well in (or are not even applicable to). (We don't need to reinvent the wheel to solve MAB tasks, as we have good algorithms for these types of tasks already.) What are you trying to show when evaluating these fine-tuned models on MAB tasks? It is not very surprising that a transformer will behave like UCB if you train it to. Fully human-written
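A minimal sketch of what the three per-step training signals described above could look like for a single bandit step; the exact normalization and credit assignment used in the paper may differ, so treat the formulas as illustrative only:

```python
import numpy as np

def step_rewards(action, env_reward, mu, ucb_action):
    """action: arm pulled by the LLM agent; env_reward: stochastic reward observed;
    mu: (estimated) mean reward per arm; ucb_action: arm a UCB oracle would pull."""
    r_og = env_reward                                      # original: raw environment reward
    r_str = (mu[action] - mu.max()) / (np.ptp(mu) + 1e-8)  # strategic: normalized negative per-step regret
    r_alg = float(action == ucb_action)                    # algorithmic: 1 if the agent matches the UCB oracle
    return r_og, r_str, r_alg
```

The strategic signal is deterministic given the arm means, which removes reward noise, while the algorithmic signal replaces the environment reward entirely with agreement with the expert.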
When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training Soundness: 4: excellent Presentation: 4: excellent Contribution: 4: excellent Rating: 10: strong accept, should be highlighted at the conference Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper investigates how Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training paradigms shape the exploration strategies of Large Language Models (LLMs) when solving Multi-Armed Bandit (MAB) tasks, treating the LLMs as meta-bandit agents. I currently think this paper provides a lot of value to the "learning to learn" / "learning to explore" community. However, I have raised several questions, and based on the authors' rebuttal response, I reserve the right to change/update my score. 1. The paper has extremely clear writing. 2. The authors disambiguated different training strategies and setups very well. 3. This paper introduces a token-level reward derived from GAE, which improves upon prior work. 4. Each reward design is principled and reasonable, incorporating all possible learning signals. 5. The paper is well situated relative to prior works, investigating areas where prior works didn't and adding critical knowledge/understanding to the broader effort of teaching LLMs to explore. The paper doesn't have obvious flaws/weaknesses, but here are some nitpicking weaknesses: 1. As a field, the learning-to-explore community needs to move beyond toy settings. [1] proposed and [2] included the MovieLens environment, which is a good first step. The behaviors of RL vs SFT, and whether we need $\pi^*$ (an optimal exploration policy), need to be answered in slightly more realistic settings. Many algorithms (like SGD) might be flawed in constructed setups, but they work really well in practice. Even though I think this paper has enough value without exploring those more realistic settings, I would encourage the authors to extend the paper to cover them. 2. The conclusion from the experiments seems a bit unclear (see the question part). [1] Nie, Allen, et al. "Evolve: Evaluating and optimizing llms for exploration." arXiv preprint arXiv:2410.06238 (2024). [2] Schmied, Thomas, et al. "Llms are greedy agents: Effects of rl fine-tuning on decision-making abilities." arXiv preprint arXiv:2504.16078 (2025). Typo 1: In Figure 3, it says RL-STG; it should be RL-STR. 1. RL-STR is not significantly better than RL-OG. Just comparing the averages (Figure 3), it is better in 3 out of 8 domains, worse in 3 out of 8, and about the same in the remaining 2. For people who might want to train an exploration policy in a realistic setting, how would you advise them on which reward design to use? 2. In Table 2, you did some behavioral analysis citing a prior work. Krishnamurthy et al. (2024) also proposed MinFrac. Do you think this is a metric worth including? 3. [1] also proposed fitting a functional form over steps $T$ of the average cumulative regret to describe the slope/speed of learning/unlearning. Would such an analysis yield additional insights for your setting? 4. What future directions are opened up by your analysis, besides moving to more realistic domains? In your GRPO training (with RL-OG), do you see length hacking (i.e., as training continues, the number of thinking tokens keeps growing)?
In the appendix, we see that RL with the ALG reward learns to compute UCB (the standard UCB1 rule is written out after this review for reference). Can you: - Provide sampled generations from the base model, but sampled using your prompt/pipeline? Do they already know that they should calculate UCB values? - Provide sampled generations from RL-OG and RL-STR so we know whether these also learned (on their own) to compute UCB values? Fully human-written
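For reference, the standard UCB1 index that "computing UCB values" refers to, where $\hat{\mu}_t(a)$ is the empirical mean reward of arm $a$, $N_t(a)$ its pull count, and $c$ an exploration coefficient (the oracle in the paper may use a slightly different variant):

$$\mathrm{UCB}_t(a) \;=\; \hat{\mu}_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}},\qquad a_t=\arg\max_a \mathrm{UCB}_t(a)$$

Checking whether RL-OG and RL-STR generations spell out this computation, as requested above, would show whether UCB-like behavior is rediscovered rather than merely imitated.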
When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training Soundness: 2: fair Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper studies how different fine-tuning paradigms shape the exploration and exploitation behaviors of LLMs in multi-armed bandit (MAB) settings. The authors train Qwen-2.5 3B and 7B models on MAB tasks with different reward signals: RL-OG (standard stochastic rewards), RL-STG (regret-shaped strategic rewards), and RL-ALG (imitation of the UCB oracle via algorithmic rewards). The work demonstrates that both SFT and RL reduce regret and achieve performance comparable to theoretical baselines like UCB and Thompson Sampling, with generalization to unseen environments and 6$\times$ longer horizons. However, behavioral analysis shows that these gains arise from emergent exploitative tendencies, since trained agents become greedier and sometimes abandon exploration prematurely. RL-ALG agents trained to imitate UCB outperform their teacher by adopting more exploitative variants of it. The paper concludes that while both SFT and RL enhance decision-making, they also introduce a bias toward short-term exploitation, highlighting the need for reward and data designs that explicitly sustain exploration. - **Originality and significance**: This work provides a unified framework comparing SFT and RL fine-tuning for LLM agents on the same controlled MAB setup. The strategic and algorithmic rewards are well-motivated and address variance and credit-assignment issues in RL with LLMs. - **Quality**: This paper includes experiments across Gaussian and Bernoulli bandit families, including cross-distribution generalization and longer horizons, which strengthen the empirical claims. Metrics like suffix failure and greedy frequency are used to reveal qualitative behavioral shifts along with the quantitative evaluation. - **Clarity**: The paper is generally well-written with clear implementation details and code release commitments. - Limited task diversity: Experiments are confined to simple bandit environments; therefore, claims about general agentic learning remain somewhat speculative. - Model scalability: There is no discussion of whether the observed training and inference dynamics hold with larger models and across different model families. - While the paper uncovers behavioral tendencies, it stops short of offering causal or mechanistic explanations of why greediness emerges under RL/SFT objectives. - Compared to the analysis of RL approaches in LLM training, the SFT analysis is less thorough, especially regarding data sampling effects and robustness to arithmetic errors. - No explicit exploration-promoting baseline: Methods like intrinsic reward shaping or information gain are not compared, though they would provide valuable context. Exploration bias mitigation – Could future training include explicit exploration bonuses (e.g., curiosity or entropy regularization) to test if the emergent greediness persists? Do the authors have an intuition for how the findings might transfer to contextual or Markovian tasks where exploration requires memory and state estimation? The finding that RL-ALG outperforms its UCB teacher is quite interesting.
Could the authors analyze whether this over-exploitation emerges from token-level credit assignment or from the episodic reward structure? Fully AI-generated
Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates the security vulnerabilities of large-scale multimodal pre-trained models to targeted adversarial attacks, with a focus on cross-modal matching tasks. The authors identify two major unresolved issues in existing methods: poor generalizability (i.e., adversarial examples only work on an exact known target and fail on semantically similar but different targets) and limited undetectability (i.e., adversarial examples can be easily flagged as outliers by simple anomaly detection). To address these, they introduce the Proxy Targeted Attack (PTA), which leverages both source-modal and target-modal proxies in its optimization (a schematic form of this kind of objective is sketched after this review). Theoretical analysis is provided to formalize a trade-off between generalizability and undetectability, and comprehensive experiments on several state-of-the-art multimodal models and tasks show improvements over prior approaches in attack success, generalization, and stealthiness. **Comprehensive ablations**: The roles of key hyperparameters are dissected in Figure 4 and Figure 5, providing practical insights for real use. **Broadened evaluation**: The method is also tested in black-box settings and on text and audio modalities, and remains potent, even outperforming strong baselines. **Limited conceptual novelty**: Although the proposed Proxy Targeted Attack (PTA) framework is presented in a novel form, its core idea, leveraging proxy samples to enhance the generalizability and stealthiness of adversarial examples, is essentially an incremental extension of existing targeted attack paradigms. The main contributions lie in the combination of loss components and the introduction of proxy sets, which, while systematically analyzed both theoretically and empirically, offer limited conceptual or methodological innovation. **Placement and accessibility of key details**: Many important methodological and theoretical clarifications, such as the proxy selection strategy, detection metric assumptions, and additional defense results, are included in the appendix rather than the main body. While this is understandable given space constraints, some of these points (e.g., the proxy sampling and update policy, the sensitivity analysis to detection thresholds) are central to understanding the method. Moving key portions into the main text or providing a concise summary section would substantially enhance readability. **Limited theoretical validation beyond simplified assumptions**: Although the appendix provides further explanation of Theorem 1 and the trade-off formalization, the assumptions (notably the fixed anomaly threshold β and the L₂-based distance metric) remain rather idealized. **Defense coverage and interpretation**: The defense evaluation (Table 7 and Appendix E) includes adversarial training and data augmentation, but it still lacks certified or adaptive detection defenses such as MMCert. Additionally, while PTA maintains high success rates under existing defenses, the paper could benefit from a more detailed analysis of why: for instance, whether the proxy mechanism intrinsically avoids gradient masking by the defenses or shifts feature-space distributions differently. **Discussion of model-specific behavior**: In Table 4, One-PEACE exhibits smaller degradation compared to other multimodal models. The appendix briefly mentions architectural differences, but a more explicit discussion in the main text would improve interpretability and generalization insights. 1. Can the authors elaborate on how proxies are actually selected: random sampling, clustering, or some other heuristic? 2. How sensitive are the theoretical guarantees (Theorems 1 and 2) to the specific choice of anomaly detection method and distance metric? 3. Why were key certified defenses like MMCert not included in the defense benchmark (Table 7)? Are there practical or conceptual challenges preventing such experiments, or would PTA fundamentally fail under those defense strategies? 4. In Table 4, One-PEACE appears less affected by injection of PTA-crafted AEs than other models, especially at low injection rates. Can the authors offer an explanation: is this due to architectural differences, training diversity, or other factors? Is this generalizable to other multimodal architectures? 5. Could you detail the practical cost (runtime, memory) of optimizing PTA with large proxy sets across high-dimensional modalities? Fully AI-generated
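A schematic reading of the proxy-based objective described above, written with target-modal proxies $\mathcal{T}$ (for generalizability), source-modal proxies $\mathcal{S}$ (for undetectability), a trade-off weight $\alpha$, and a perturbation budget $\epsilon$; the exact similarity/distance choices and constraint form are assumptions here, not the paper's stated loss:

$$\min_{\|\delta\|_\infty\le\epsilon}\;-\,\alpha\,\frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}\mathrm{sim}\big(f(x+\delta),\,g(t)\big)\;+\;(1-\alpha)\,\frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}\big\|f(x+\delta)-f(s)\big\|_2$$

Here $f$ and $g$ denote the source- and target-modality encoders: the first term pulls the adversarial example toward the whole set of potential targets, while the second keeps its embedding close to clean source-modal samples so that simple outlier detectors do not flag it; this is also where the trade-off parameter $\alpha$ raised in the questions enters.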
Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes Proxy Targeted Attack (PTA), a method to improve both the generalizability and undetectability of targeted adversarial examples on multimodal pre-trained models such as ImageBind and LanguageBind. PTA introduces source- and target-modal proxies to craft adversarial examples that remain stealthy while aligning with multiple semantically related targets. Theoretical analyses reveal the trade-off between the two goals, and extensive experiments across classification, retrieval, and multimodal settings show PTA’s superior attack success rate. 1. This paper proposes a novel and well-motivated approach addressing two key weaknesses in adversarial attacks on multimodal pre-trained models. 2. This paper provides comprehensive experiments across models, modalities, and defense scenarios. 3. The paper is well organized, with a clear problem statement and a detailed method description. 1. Given that LLaVA employs a vision encoder (e.g., CLIP) pretrained on large-scale vision-language data, it would strengthen the paper to investigate whether the proposed attack can be effectively adapted to such popular large vision-language models (LVLMs). 2. The cross-model transferability of the proposed attack should be benchmarked against established baselines such as SGA [1] under the experimental conditions defined in the original SGA paper. 3. The practicality of PTA under fully blind attacks (no target prior) is not explored; the authors should quantify the performance degradation. 4. The core idea of the proposed method relies on cross-modal interaction, which appears conceptually similar to prior work (e.g., SGA [1]). The authors should clearly articulate how their approach differs from existing methods. 5. The parameter α governs the trade-off between generalizability and undetectability; the authors should further clarify how α is selected and provide a detailed analysis of its impact on performance. [1] Lu D., Wang Z., Wang T., et al. Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models. ICCV 2023. Please address the weaknesses above. Lightly AI-edited
Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper introduces a targeted adversarial attack against multimodal encoders, characterized by its generalizability and undetectability. The authors propose a Proxy Targeted Attack (PTA), which leverages multiple source- and target-modal proxies to optimize targeted adversarial examples (AEs), enabling them to evade defenses while aligning with multiple potential targets. The paper further provides theoretical analyses to elucidate the relationship between generalizability and undetectability, ensuring optimal generalizability under the specified undetectability constraints. Experimental results demonstrate the effectiveness of the proposed method. 1. The motivation of the paper is clearly articulated, making the overall narrative easy to follow. 2. The authors provide theoretical analyses to highlight the relationship between generalizability and undetectability. 3. The release of open-source code ensures the reproducibility of the work. 1. The technical novelty of this work is quite limited. The proposed method essentially aligns the adversarial example’s features with multiple injected features, a strategy that has become rather common in recent multimodal attack studies. For instance, SGA [1] (ICCV 2023) and AttackVLM [2] (NeurIPS 2023) both adopt similar attack paradigms. Notably, the paper incorrectly cites AttackVLM as a NeurIPS 2024 paper. The only notable difference here is that the proposed method introduces benign image features as constraints to enhance undetectability, which is also a widely adopted practice in adversarial attack design. In fact, many recent works have already begun exploring stealthiness in the digital, feature, and frequency domains. Therefore, I find it difficult to see how this work provides any substantial advancement to the current state of multimodal adversarial attack research. 2. The adversarial detection baselines considered in this paper are quite outdated, mostly predating 2009. How does the proposed method perform against more recent and advanced adversarial detection techniques, such as frequency-based methods [3,4,5], spatial-based methods [5,6], or feature-based methods [7]? I strongly recommend the authors review and discuss the latest progress summarized in the comprehensive survey [8], and conduct experiments comparing their approach with these more recent detection defenses. Such an evaluation would significantly strengthen the paper’s contribution. 3. The black-box attack evaluation is weak: only one baseline from 2020 is considered, which is unconvincing. The authors should include more recent black-box targeted attack methods for comparison and discussion, such as [9,10,11]. 4. In the defense evaluation section, the authors show that the proposed PTA method remains effective against defenses like TeCoA (adversarial training), Gaussian Blur (data augmentation), and DiffPure. However, the method is not explicitly designed to counter these defense mechanisms. Why does PTA still succeed under these settings? This is especially puzzling in the case of DiffPure, which is well known for its effectiveness in purifying adversarial noise. The authors should provide a more detailed explanation of these results and clarify the underlying mechanism that enables PTA to bypass such defenses. 5. The experimental evaluation mainly focuses on white-box settings. It remains unclear how well the generated adversarial samples transfer across different models. 6. The paper lacks a discussion and comparison with recent state-of-the-art multimodal adversarial attacks [12,13,14,15]. The paper lacks technical novelty, as it is largely based on existing work without substantial innovation. Moreover, the experiments are limited and outdated, lacking comparisons with recent adversarial detection methods, state-of-the-art multimodal attacks, and thorough black-box and transferability evaluations. References: [1] Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models, ICCV 2023. [2] On Evaluating Adversarial Robustness of Large Vision-Language Models, NeurIPS 2023. [3] Automated Detection System for Adversarial Examples with High-Frequency Noises Sieve, International Symposium on Cyberspace Safety and Security 2019. [4] Detecting AutoAttack Perturbations in the Frequency Domain, ICML 2021 workshop. [5] Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks, NDSS 2017. [6] Adversarial Example Detection Using Latent Neighborhood Graph, ICCV 2021. [7] Detecting Adversarial Examples through Image Transformation, AAAI 2018. [8] Adversarial example detection for DNN models: a review and experimental comparison, Artificial Intelligence Review 2022. [9] Improving Transferable Targeted Adversarial Attacks with Model Self-Enhancement, CVPR 2024. [10] Boosting the Transferability of Adversarial Examples via Local Mixup and Adaptive Step Size, ICASSP 2025. [11] Enhancing targeted transferability via feature space fine-tuning, ICASSP 2024. [12] One perturbation is enough: On generating universal adversarial perturbations against vision-language pre-training models, CVPR 2025. [13] Exploring transferability of multimodal adversarial samples for vision-language pre-training models with contrastive learning, TMM 2025. [14] AnyAttack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models, CVPR 2025. [15] Modality-Specific Interactive Attack for Vision-Language Pre-Training Models, TIFS 2025. It is also recommended that the authors include attack results on large multimodal models, such as LLaVA, MiniChatGPT, and GLM, to further validate the method’s generalization and practicality. Lightly AI-edited
Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper investigates adversarial attacks that are both generalizable and undetectable, focusing on multimodal pre-trained models such as ImageBind. The authors argue that existing attacks lack both properties, limiting their practical effectiveness. To address this issue, the paper proposes a method called Proxy Targeted Attack (PTA), which leverages multiple source-modal and target-modal proxies to generate adversarial examples (AEs). By using multiple target-modal proxies, AEs are optimized to align with a target-modal distribution, thereby improving generalization. By using source-modal proxies, AEs are constrained to remain within the clean sample distribution, improving undetectability. Furthermore, the paper provides a theoretical analysis showing that there exists a fundamental trade-off between the generalizability and undetectability of AEs. - Attacking multimodal pre-trained models such as ImageBind and LanguageBind, beyond the typical vision-language setting, is an important and novel topic. - Identifying the challenge of achieving both generalizability and undetectability in the multimodal settings, and demonstrating that both can be simultaneously improved through appropriate optimization, is valuable. The improvement in attack capability over baselines is significant. - The proposed method is conceptually simple. - The experiments are extensive, covering both image classification and retrieval tasks, and involving not only image-text but also text-audio modalities. - The trade-off between adversarial attack strength and undetectability has already been discussed in prior work [Frederickson et al., 2018], and is also somewhat intuitive. Therefore, the theoretical conclusion in Section 2.4 has limited novelty. - The idea of improving generalizability using multiple data points is not new; a similar approach was proposed in SGA-Attack [Lu et al., 2023]. The paper should acknowledge and discuss this connection. - The idea of modifying the objective function to jointly optimize attack capability and undetectability was also discussed in [Frederickson et al., 2018], and this relationship should be mentioned for completeness. [Frederickson+2018] Frederickson, Christopher, et al. "Attack strength vs. detectability dilemma in adversarial machine learning." 2018 international joint conference on neural networks (IJCNN). IEEE, 2018. [Lu+2023] Lu, Dong, et al. "Set-level guidance attack: Boosting adversarial transferability of vision-language pre-training models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. see weaknesses Lightly AI-edited
Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper proposes the cross-modality competency problem, in which a multimodal large language model (MLLM) fails to properly evaluate all modalities. A concrete example of this problem is modality interference, where MLLMs often use spurious information from irrelevant modalities when they are expected to rely solely on one-modality data. As a result, MLLMs often underperform on pure visual recognition and textual reasoning. The paper designs a perturbation-based experiment to verify and quantify this problem. Then, it proposes to fine-tune MLLMs with perturbation-based data augmentations. Experiments on image-heavy, text-heavy and multimodal tasks and multiple model families verify the effectiveness of the proposed approach in boosting unimodal reasoning ability while enhancing performance on multimodal tasks. - The paper demonstrates that MLLMs are not robust under modality interference, where different modalities are not aligned and only one modality is relevant to the task. This highlights an important robustness issue in MLLMs. - The paper shows that using modality-misaligned data for fine-tuning can mitigate modality interference and is effective in boosting both unimodal and multimodal reasoning abilities. - Experiments show that the proposed method is effective with different model families at different scales. - The paper is well-written and easy to follow. - The technical contributions of the paper are limited. The proposed perturbation-based data augmentations are not novel, and performance improvements are expected when data with modality interference is incorporated. - Choice of the datasets: Why do the authors choose Mini-ImageNet and Caltech-101 as image-heavy datasets and OpenBookQA and MMLU as text-heavy datasets? Would different choices of these datasets affect the performance of fine-tuned models on VQA tasks? - Concern about the generalizability of the proposed method: While performance improvements on image-heavy and text-heavy datasets are expected when those datasets are incorporated into the fine-tuning data, it is unclear how this would mitigate modality interference on unseen image-heavy and text-heavy datasets. - The proposed adversarial perturbation to the latent embeddings is not as effective as the perturbation in the input space, as shown in Table 2. Is this component essential to mitigating modality interference? - How should the image-heavy and text-heavy datasets be chosen? Would different choices of these datasets affect the performance of fine-tuned models on VQA tasks? - Can the proposed method generalize to unseen image-heavy and text-heavy datasets? - Lines 264-265: How should $N_{img}$, $N_{text}$, and $N_{vqa}$ be chosen in practice to ensure effective modality interference mitigation? How do different choices of these values affect the performance? Fully human-written
Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models Soundness: 2: fair Presentation: 3: good Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes the cross-modality competency problem, where existing Multimodal Large Language Models (MLLMs) are susceptible to misleading inputs, especially in modality-specific tasks such as image classification or pure text question answering, where models are expected to rely solely on one modality. The paper first benchmarks this across a range of models using a perturbation-based causal diagnostic setup. The perturbations include, for image-heavy tasks, unrelated scientific facts and misleading descriptions, and, for text-heavy tasks, a real but irrelevant image or a full black/white canvas image. Next, to improve upon this shortcoming, a novel framework to finetune MLLMs is presented which includes adversarial losses and a consistency regularization strategy at the input and representation level. Experiments on multiple datasets and models demonstrate significant improvements in robustness and cross-modality competency, indicating the method’s effectiveness in boosting unimodal reasoning ability while enhancing performance on multimodal tasks. 1. The definition of modality interference is clear, and the finding that model performance goes down due to sub-optimal integration of information across modalities is interesting. 2. The motivation behind the proposed losses is well explained. 3. The paper is well organized and easy to follow. 1. The proposed losses are not very effective: as shown in the ablations in Table 2, FFT with the VQA/AUG data performs better than the proposed losses in some cases, e.g., LLaVA-1.5-13B with FFT on $D^{AUG}$ for ScienceQA-IMG, and Qwen2.5-VL-3B with FFT on $D^{VQA}$ for MM-Bench-EN. 2. Consistency of results: in Table 1, the drop in performance on Caltech-101 is quite high, for example LLaVA-1.5-7B goes from 97.0 to 57.4, but in Table 4 the drop is much smaller on OCR images (97.0 to 92.8); this raises questions about i) whether modality interference is really a concern and/or ii) whether the generalization results in Table 4 are correct. 3. The provided results are on 3B/7B/13B models; results on newer families of thinking models would help solidify the claims made about failure modes in the paper. 4. To truly evaluate the effectiveness of the proposed losses, evaluation against other adversarial attacks should also be presented. 1. It seems from Table 1 that the drop in performance is much larger on vision-heavy tasks (such as Mini-ImageNet) than on language-heavy tasks (such as OpenBookQA) - why might this be the case? 2. What is the reasoning behind choosing these perturbations: i) unrelated scientific facts or misleading descriptions for image-heavy tasks, and ii) semantically meaningful real images / a full black canvas / a full white canvas for language-heavy tasks? 3. Does the current evaluation of modality interference require causal modeling? 4. An ablation with/without the modality-specific binary mask for the adversarial loss would be interesting to see. Fully human-written
Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a framework to diagnose and mitigate modality interference in multimodal large language models (MLLMs), a phenomenon where irrelevant or misleading modality signals degrade model performance. The authors define the broader cross-modality competency problem, identifying modality interference as a concrete instance. They propose (1) a perturbation-based causal evaluation that quantifies interference by injecting irrelevant signals into one modality, and (2) a fine-tuning strategy combining heuristic and adversarial perturbations with a consistency regularization objective. Experiments on image-heavy, text-heavy, and multimodal tasks (Mini-ImageNet, Caltech-101, OpenBookQA, MMLU, ScienceQA, MM-Bench, Seed-Bench) demonstrate significant robustness gains and improved unimodal reasoning without harming multimodal performance. The paper is technically solid and the framing is clear, though the conceptual advance is moderate. 1. The identification of modality interference as a measurable phenomenon and its connection to cross-modality competency is insightful. 2. The perturbation-based causal evaluation is well designed and empirically grounded, revealing meaningful vulnerability patterns. 3. The fine-tuning framework combining heuristic and adversarial perturbations with consistency regularization is practically effective. 4. The experiments are comprehensive across datasets, model sizes, and modalities, showing consistent improvements in robustness and generalization. 1. The theoretical depth is limited. The work largely integrates known ideas from causal probing and adversarial robustness without deeper theoretical analysis. 2. The causal effect metric ($\delta_{cp}$) is intuitive but somewhat heuristic. It only captures prediction flips, not probabilistic changes or feature-level shifts. 3. The perturbation design may unintentionally change semantic content rather than purely isolate modality relevance. 4. The method introduces additional computation from adversarial training, and the efficiency trade-off is not discussed. 5. The paper emphasizes robustness metrics but provides little mechanistic insight into why the proposed perturbation strategy improves cross-modality alignment. 1. How sensitive are the robustness gains to the perturbation strength $\epsilon$ and the number of adversarial steps? 2. Can the proposed approach generalize to generative tasks or multimodal reasoning beyond multiple-choice formats? 3. How does the model behave under simultaneous perturbations in both modalities? 4. Could the causal effect be defined more continuously (e.g., using KL divergence on prediction distributions) to better quantify partial modality reliance (a sketch follows after this review)? 5. Have the authors compared to other causal fine-tuning strategies such as counterfactual supervision or gradient orthogonalization? Fully AI-generated
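The KL-based variant raised in question 4 of the review above could look like the following minimal sketch. The function and variable names are hypothetical, and it assumes the MLLM exposes logits over the multiple-choice answer options for both the clean and the perturbed input.

```python
import torch
import torch.nn.functional as F

def continuous_causal_effect(logits_clean, logits_perturbed):
    """Mean KL(P_clean || P_perturbed) over a batch of multiple-choice items.

    Both inputs: (batch, num_options) logits restricted to the answer options,
    e.g. obtained by scoring each option letter with the MLLM (hypothetical setup).
    """
    logp_clean = F.log_softmax(logits_clean, dim=-1)
    logp_pert = F.log_softmax(logits_perturbed, dim=-1)
    # F.kl_div(input, target) computes KL(target || input) when input is log-probs
    kl = F.kl_div(logp_pert, logp_clean.exp(), reduction="none").sum(dim=-1)
    return kl.mean().item()
```

Averaging this over a dataset would give a continuous counterpart to the flip-based $\delta_{cp}$ metric described in the review.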
Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models Soundness: 3: good Presentation: 3: good Contribution: 4: excellent Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper investigates modality interference in MLLMs, particularly in tasks like Visual Question Answering. It defines modality interference as the negative impact of irrelevant modalities on unimodal tasks and quantifies this issue through perturbation-based causal diagnostics. To mitigate this interference, the authors introduce a new fine-tuning framework that incorporates data augmentation and consistency regularization strategies to improve model stability across different inputs. Experimental results demonstrate significant enhancements in robustness and overall performance. Innovative Concept: The paper introduces the notion of the Cross-Modality Competency Problem, providing a fresh perspective on modality interference in multimodal large language models. This innovative approach contributes new insights to the field. Systematic Analysis: By designing a perturbation-based causal diagnostic experiment, the authors quantify the impact of modality interference, providing empirical evidence that enhances the scientific rigor and validity of the research. Effective Solution: The proposed fine-tuning framework combines data augmentation with consistency regularization strategies, offering a practical solution to mitigate modality interference. This approach has been validated through significant improvements in robustness and performance across multiple benchmark datasets. Potential Overfitting Risks: The use of perturbation-based data augmentation may introduce noise into the training process. While it aims to enhance robustness, there is a risk that the model might overfit to these perturbed examples, resulting in poorer generalization on clean, real-world data. Lack of Comparative Baselines: The paper does not provide a comprehensive comparison against a wider variety of existing methods or models that address modality interference. Without robust baseline comparisons, it is difficult to ascertain the relative effectiveness of the proposed framework. Limited Experimental Diversity: The experiments primarily focus on a small set of benchmark datasets, which may not capture the full range of conditions under which modality interference could occur. This limited range could restrict the generalizability of the findings to other real-world scenarios. How does the performance of the model vary with different configurations of the augmentations or regularization strategies? Fully AI-generated
Optimizing the Ineffable: Generative Policy Learning for Human-Centered Decision-Making Soundness: 1: poor Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The paper proposes GenNOP (Generative Near-Optimal Policy learning) for human-centered decision making. The goal is to train a conditional generative model that samples actions from a target policy $\pi^*_{\epsilon}(\cdot\mid x)$ defined as uniform over the $\epsilon$-optimal set with respect to the measurable outcome $Y$: $\Omega^*_{\epsilon}(x)=\{a:\; |y^*(x)-\mathbb{E}[Y^a\mid X=x]|\le \epsilon\},\quad \pi^*_{\epsilon}(a\mid x)\propto \mathbf{1}\{a\in\Omega^*_{\epsilon}(x)\}.$ Training uses importance-like reweighting of an observational dataset with weights $w(x,a,y;\epsilon)=\frac{g_{\epsilon}(y,x)}{p(a\mid x)},$ where $p(a\mid x)$ is a generalized propensity score (GPS), and $g_{\epsilon}(y,x)=\mathbb{P}\{y^*(x) < y+\epsilon\}$ is estimated via a max-stable/GEV process regression on block maxima over nearest neighbors. A conditional diffusion model is then trained on the reweighted data to generate actions (a sketch of this reweighting follows after this review). Synthetic experiments illustrate regret vs. $\epsilon$ and sample budget $m$; real-data case studies (e.g., ICU dosing) provide qualitative visual comparisons between generated actions and filtered clinician actions. - Timely problem & motivation. Framing human-in-the-loop selection over a diverse near-optimal set is compelling and relevant to human-centered AI. - Engineering effort. A non-trivial pipeline (GEV nets for $g_{\epsilon}$, VAE-style GPS, conditional diffusion) is implemented; synthetic experiments are thoughtfully designed. - Synthetic results. Plots relating regret to $\epsilon$ and sample budget $m$ illustrate the intended trade-offs and potential practical utility when assumptions hold. - The paper claims uniform sampling over $\Omega^*_{\epsilon}(x)$ but trains with weights $g_{\epsilon}(y,x)/p(a\mid x)$; no argument shows the induced generator is uniform (or even calibrated) over the $\epsilon$-optimal set. This undermines the central "diverse without historical bias" claim. - Visual PCA/KDE alignment to filtered clinician actions (derived using the same filter) does not evaluate counterfactual utility. No off-policy evaluation, CIs, or sensitivity analyses are provided. - Guarantees are developed near the $V$-optimizer, while the method operates near the $Y$-optimizer. Can you add off-policy evaluation for continuous actions (with uncertainty), sensitivity to $\epsilon$ and to the block/neighbor choices, and comparisons to learned stochastic baselines beyond Gaussian perturbations? Moderately AI-edited
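For the reweighting step summarized in the review above, a minimal sketch of the weight computation is given below. The estimator interfaces (`gps_density`, `g_eps`) and the clipping are assumptions for illustration rather than the paper's actual implementation; the sketch only instantiates the formula $w(x,a,y;\epsilon)=g_{\epsilon}(y,x)/p(a\mid x)$ quoted in the review.

```python
import numpy as np

def near_optimal_weights(x, a, y, eps, gps_density, g_eps, clip=(1e-3, 50.0)):
    """Importance-style weights w = g_eps(y, x) / p(a | x).

    gps_density(a, x) -> estimated generalized propensity score p(a | x)
    g_eps(y, x, eps)  -> estimated P(y*(x) < y + eps), e.g. from a GEV regression
    Both estimators are assumed fitted elsewhere (hypothetical interfaces);
    clipping is a common stabilization heuristic, not necessarily used by the paper.
    """
    w = g_eps(y, x, eps) / np.clip(gps_density(a, x), 1e-6, None)
    return np.clip(w, *clip)
```

The weights would then multiply the per-sample loss of the conditional generator, e.g. `loss = (w * per_sample_denoising_loss).mean()` for a diffusion model.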
Optimizing the Ineffable: Generative Policy Learning for Human-Centered Decision-Making Soundness: 3: good Presentation: 1: poor Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper proposes GenNOP, an algorithmic decision-making method that tries to align with hard-to-specify human values. They split the value function into two components, measurable and unmeasurable. Then, the algorithm optimizes for the measurable utilities, and proposes $\epsilon$-optimal candidates for the human to choose from. The candidates that the human chooses are optimal in the unmeasurable sense in that they align the most with the human, and by their construction, are also value-optimal. The authors conduct both synthetic and real-world experiments to show the efficacy of their method. * Interesting proposal to allocate roles to human and algorithm in a way that reduces human load (compared to an approach such as RLHF). * Flexible problem formulation allows for optimality to be defined on a per-person basis, enabling satisfaction for different stakeholders. * The experimental results support the claims in the paper. * I feel the overall presentation of information in this paper can be improved. For example, Figure 1 is informative about how the whole algorithm works, but it feels like 3 figures crammed into 1. As another example, Figure 3 seems like a generic plot that is hard to relate to Sec. 3.2 or the rest of the paper. See questions for more specific followups. * Quite a bit of material needed for the main paper is introduced only in the Appendix. For example, the baselines (GP-UCB, DDOM, DRPolicyForest), key mathematical components ($y^*, g$), etc. As a reader, I would appreciate being introduced to these concepts in the main paper without being required to refer to the appendix as a dependency. * Assumption 1 plays a key role in much of the theoretical analysis, with supporting evidence from Figure 8, but does this strict concavity trend always hold? If it does not, how does this affect the $\epsilon$-optimality? I feel this should at least be mentioned more explicitly as a limitation. Minor: * Line 294: GEV abbrev. used before defining it (Line 1025). Same with VAE on Line 295. * What do the drawings from the zoomed-in A's on the left side mean? Should the top-right plot be its own figure? * In Figure 3, where are the datapoints coming from? * In Figure 4, what settings of $m$ and $\epsilon$ are you using for the left and right plots, respectively? Fully human-written
Optimizing the Ineffable: Generative Policy Learning for Human-Centered Decision-Making Soundness: 3: good Presentation: 2: fair Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper proposes a new framework for algorithm-assisted decision-making for problems which they term human-centered decision-making problems: problems where the overall utility $V$ of the decision can be separated between a quantifiable component $Y$ and a subjective, non-measurable component $U$ that can only be evaluated by human judgement. They propose GenNOP, a method to learn policies that return not only the best-performing action on $Y$, but a set of near-optimal actions, from which humans can choose in order to factor their evaluation of $U$ into the decision process. - The setting of human-centered decision-making problems is interesting and well motivated - The method proposed is sensible and principled - Experiments on synthetic and real-world datasets show good performance Major: - The formalism in Section 2 is hard to follow. Assumptions 1-6 are only laid out in the appendix and not even all of them are mentioned in the main text. Propositions 1-3 are presented without much context or explanation as to what these results mean for the framework, and are not used or mentioned a single time in the rest of the paper. As it stands only Definition 1 is actually used in the rest of the paper. I encourage the authors to rework this section to better integrate the formalism with the discussion around it, and discuss thoroughly what each assumption and proposition means in practice. - In the experiments (e.g., the first table in Section 4, Figure 6), why not compare with model-free offline RL methods, for example estimating a Q-function and then picking actions uniformly within $\{a : |\max_{a'} Q(x,a') - Q(x,a)| < \epsilon\}$ (see the sketch after this review)? This seems to me like a natural baseline to compare GenNOP to, so I was surprised not to see it in the experiments. - The choice of $\epsilon$ is essential (as noted by the authors and illustrated in Figure 4). Can the authors provide some discussion about how this would be done by practitioners in practice? Given the overall outcome $V$ is never measured, does it necessarily have to be chosen "blindly" by practitioners, or can it be refined over time after collecting data from a GenNOP-like process? Minor: - The table of results in Section 4 is missing a caption - Typo l.296 "an re-weighted" - In Section 3.2: I'm not familiar with the theory but as I understand it, the estimation relies on a form of continuity between the covariates $X$ and the optimal outcomes $y^*(x) = \max_a Y_a \mid X=x$. That is, "close" covariates are expected to lead to similar optimal outcomes. However, in practice $X$ can be complex and high-dimensional, e.g., with categorical features, and sparsely sampled in the available data. Can the authors discuss these assumptions and give a sense of how robust this estimation might be in real-world datasets like the two MIMIC-IV extracts? Fully human-written
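The offline-RL baseline suggested in the review above (fit a Q-function, then sample uniformly from its $\epsilon$-optimal set) could be sketched as follows. This is purely illustrative of the reviewer's suggestion, not anything from the paper; it assumes continuous actions are handled by evaluating a fitted $\hat{Q}$ on a grid or candidate set of actions.

```python
import numpy as np

def sample_eps_optimal_action(q_values, action_grid, eps, rng=None):
    """Sample uniformly from {a : |max_a' Q(x,a') - Q(x,a)| < eps} for one context x.

    q_values:    shape (num_candidates,), fitted Q-hat(x, a) for each candidate action.
    action_grid: shape (num_candidates, action_dim), the candidate actions
                 (a discretization or sampling of the continuous action space).
    """
    rng = rng or np.random.default_rng()
    near_optimal = np.abs(q_values.max() - q_values) < eps
    idx = rng.choice(np.flatnonzero(near_optimal))
    return action_grid[idx]
```

Comparing the diversity and regret of such samples against GenNOP's generated actions would directly address the baseline gap the reviewer points out.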
Optimizing the Ineffable: Generative Policy Learning for Human-Centered Decision-Making Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper presents GenNOP, a generative model framework for generating sets of actions from which a decision-maker can pick their preferred one to complete a task, thus balancing normative notions and unknown user preferences. Under several theoretical assumptions, the method proposes $\epsilon$-optimal actions that stay on the Pareto frontier generated by the user's and task's optimal values. Lastly, on both synthetic and realistic datasets, the authors evaluate the regret-minimizing ability of their approach and the capability of GenNOP to generate actions similar to those taken by physicians. The paper tackles a very relevant problem in today's literature. Indeed, we still lack decision support systems that enable human decision makers to achieve complementarity (i.e., the human+AI team achieves better performance than the human alone). Further, the paper also considers a relevant issue stemming from potential trade-offs between the task objective (e.g., effectively curing a patient) and some personal preferences of the decision maker. - In general, the paper is very dense (which makes it harder to assess the impact of its core contribution). The proposed solution uses many different techniques (e.g., extreme value theory, VAEs, causality), but their contributions are only briefly sketched or relegated to the Appendix. Given that GenNOP has many moving parts (and all of them come with their own assumptions), I believe this method could have trouble generalizing beyond the very synthetic scenario of the lab (or at least it would require a stronger experimental evaluation than the one provided). - I believe assumptions should be stated within the main text, or at least summarized textually in layman's terms to make the reader understand the limits (and strengths) of the proposed theoretical formulation. Propositions 1, 2, and 3 seem to require many assumptions (6!), but they are buried within the Appendix. I believe they need to be made more explicit in the main text. - I do not fully understand the experimental support of the real dataset experiment (line 428). GenNOP can generate a distribution of actions similar to the observed one, but I believe that is because GenNOP is a generative model trained on real-world data. Thus, it is somewhat trivial that it will learn "$\epsilon$-optimal" actions close to the physician policies (furthermore, we cannot even evaluate counterfactual effects in this case). For example, it would have been nice to see the comparison between $m$ actions generated by GenNOP and $m'$ actions generated by a simple VAE, to understand the benefits of the GenNOP architecture. - The paper does not mention other alternative forms of human-AI complementarity that have appeared in recent years. These are important to contextualize this work within the human-AI collaboration literature.
Notably, decision support systems that restrict human agency, thus preventing overreliance and leading to provable improvements in performance [1,2], and _"learning to defer"_ approaches, which learn when and how to optimally defer decisions to a human decision maker on uncertain instances [3]. See this survey for further pointers [4]. - The code provided in the supplementary comprises only the GenNOP implementation, without any code to run the experiments or the evaluations in the paper. Therefore, I believe that it is not useful in assessing reproducibility (or in clearing up any of my doubts about the setup). [1] Straitouri, E., Wang, L., Okati, N., & Rodriguez, M. G. (2023, July). Improving expert predictions with conformal prediction. In International Conference on Machine Learning (pp. 32633-32653). PMLR. [2] De Toni, G., Okati, N., Thejaswi, S., Straitouri, E., & Rodriguez, M. (2024). Towards human-AI complementarity with prediction sets. Advances in Neural Information Processing Systems, 37, 31380-31409. [3] Madras, D., Pitassi, T., & Zemel, R. (2018). Predict responsibly: improving fairness and accuracy by learning to defer. Advances in Neural Information Processing Systems, 31. [4] Ruggieri, Salvatore, and Andrea Pugnana. "Things machine learning models know that they don't know." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 39. No. 27. 2025. - Can you better state which standard causal assumptions underlie your weight function (lines 286-287)? In general, the counterfactual distribution is not identifiable from observational data, unless we have interventional data too (or unless the underlying causal generative process is invertible to some extent). - How did you evaluate the quality of your GEV estimate (lines 315-316)? Fitting the parameters with neural networks can be brittle and give unreliable estimates depending on the data splits. - Over how many runs did you compute your standard deviation (lines 330-331)? - Can you better describe the relevance of the "real dataset" experiment (line 428)? Can you provide a comparison between GenNOP actions and those of a simple generative model (e.g., a CVAE)? Fully human-written
Certifying Robustness of Agent Tool-Selection Under Adversarial Attacks Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper presents CATS, a statistical framework for verifying the robustness of agent tool selection. CATS models tool selection as a Bernoulli process. By simulating multiple rounds of adaptive attacks (where attackers can iteratively optimize malicious tools based on the agent's previous selections), it produces a high-confidence lower bound on accuracy, thereby quantifying the agent's performance in the worst case. Experiments show that under multiple rounds of attacks, the certified robustness of multiple SOTA LLM agents drops sharply to nearly zero, revealing serious security threats in the tool-selection process. + The issue is of practical significance: with the widespread application of LLM agents in tool invocation, the robustness of tool selection is indeed a critical and understudied security problem. + The experimental evaluation is extensive: multiple models and various attack types are evaluated, and the vulnerabilities of the retriever and the selector are analyzed in depth through ablation experiments. - The emphasis of the method section is unbalanced: the core contribution of this paper is the robustness-certification framework, but the method section devotes a considerable amount of space to detailing the classification and implementation of attack methods (such as Top-N Saturation, Abstention Trigger, etc.). As a result, the core mechanisms of the certification framework itself (such as how the multi-round experiments are composed, how the Bernoulli process is defined, and how the confidence intervals are calculated) are harder to follow, and the priorities are unclear. - Key method details are missing: although the paper presents various attack types, it does not elaborate on how these attacks are specifically implemented in the system. For example, how exactly are attack types such as Top-N Saturation and Abstention Trigger implemented? - Lack of a clear threat model: the paper does not explicitly define the attacker's specific capabilities, knowledge boundaries, and constraints (for example, whether the attacker can access the retriever, or whether they can modify existing tools), which casts doubt on the plausibility and generality of the attack scenario. - What is the core mechanism of the certification framework? - What are the specific settings of the threat model? Lightly AI-edited
Certifying Robustness of Agent Tool-Selection Under Adversarial Attacks Soundness: 4: excellent Presentation: 4: excellent Contribution: 3: good Rating: 8: accept, good paper Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. The authors study the setting of adversarially attacking the tool-selection process for agentic LLMs. They devise a small collection of prompts which they use to get a model to generate attacks. The attacks focus on two areas: (1) the slate selection phase (choosing which tools to consider in an initial narrowing-down phase), and (2) tool selection (choosing which tool to use from the narrowed-down slate). They find that models are susceptible to both families of attack. ## originality As far as I know, this is the first work to explicitly study the threat model of attacking the tool pool. ## quality The paper is well-written and the experiments are compelling. ## clarity The paper is well written and overall quite clear. ## significance It's not obvious exactly how important a threat model this is, given that it assumes no steps are taken to curate or moderate the tool pool. However, it is still interesting and a worthwhile addition to the literature. It would be nice if the authors could explain a bit more clearly why their threat model is realistic. Or, if it's not realistic, to acknowledge it or explain how it could still be a problem in specific situations. My default intuition is that most of these issues go away if there is moderation of the tool pool: if it's a private company, then all their tools will be internal; if it's a public setting, then I would assume there would be some maintainers as in open-source projects who would check for this kind of malicious tool. Would tool certification/curation just solve this problem? I would also like more discussion of what other defenses would look like, beyond tool certification/curation. It would be good if the authors could be more clear in the paper (top and middle of page 6) that their attacks are in fact simply different prompts given to an LLM as input (which I later understood by looking at the Appendix). It was not clear just from reading that page. ## Questions 137: I would really like citations supporting the idea that these tools can be authored by anyone. This is an important point for your paper, since the importance of the setting hinges on this being true. 202: what is semantic manipulation? Could you please explain it? 247: regarding the adaptive update, how much better is this than a simple best-of-N attack approach? 291: are you allowing k >= N? It seems like you're not, but this should be made very clear. 313: how do you choose $r$ and $k$? These seem pretty important. 349: "stability" is a weird thing to say here. It's just "to reduce noise". 360: in the multi-turn setting, what is k? 375: for top-N Saturation, is k < N? It seems that it should be. However, "saturation" and "flooding" give the image of all the N tools in the slate being malicious. Maybe you can change the language here to make it more clear. Figure 2: what happened to the gemma3 orange bars in the first two plots? Figure 3: can you also simply report what proportion of the time the correct tool wasn't in the slate? that seems like a much simpler way of answering this question, and you've already done the experiments for it. 
## Suggestions line 26: "severe" feels a bit strong line 48: please provide a justifying citation to support the notion that anyone can publish malicious tools. line 65: citet -> citep line 68: citet -> citep 87: agentic systems -> agentic tool-calling systems 138: same as previous comment 189: consider putting a \quad after the comma before the t 191: excludes -> contains no 192: misleading -> wrong 256: I don't really understand this sentence top of page 6: I didn't understand what exactly these attacks were until I looked at the appendix 301: at this point, it wasn't obvious to me how Privilege Escalation is different from Adversarial Selection. I think explaining clearly that these are all just different prompts would help a lot. 354: the Gemma3 citation is messed up 379: "attacker model" this is literally the first time I understood there was an attacking model. Please make it clear earlier Figure 2: please make all plots have the same x axis order Figure 3: this is a table, not a figure 436: while -> While Fully human-written
Certifying Robustness of Agent Tool-Selection Under Adversarial Attacks Soundness: 2: fair Presentation: 2: fair Contribution: 2: fair Rating: 2: reject Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper studies adversarial attacks on the tool-selection stage of agentic systems. Specifically, the attacker could inject malicious tools and mislead agents into selecting them. It formalizes robustness as the probability that the agent still picks an intent-satisfying tool even when an adaptive adversary can inject up to $k$ malicious tools and iteratively refine them over $R$ rounds. The proposed framework, CATS (Certification of Agentic Tool Selection), treats each full multi-round interaction as a Bernoulli trial and uses Clopper–Pearson intervals to produce a high-confidence lower bound on "robust accuracy." In particular, to study the worst-case setting, the paper introduces an adaptive attacker that can dynamically refine its attacking policy based on the agent's previous choices. Experimental results show that under multi-round attacks the certified lower bound can collapse to near zero, and even with forced inclusion of the correct tool, certified robustness stays <50%, indicating both retrieval and selection are vulnerable. 1. Studying adversarial attacks on the tool-selection stage of agentic systems is well motivated. 2. Formalizing a multi-round problem instead of a single round is more realistic, which unlocks the potential to study advanced red-team strategies (e.g., the adaptive attacks studied in this paper). 3. Results are evaluated across multiple attack strategies (Top-N Saturation, Adversarial Selection, Privilege Escalation). The near-zero certified lower bound convincingly shows that this is a real problem. 1. Overclaimed novelty. As far as I know, this is not the **first** paper to study adversarial attacks on the tool-selection stage; see https://arxiv.org/pdf/2412.10198 and https://arxiv.org/pdf/2508.02110v1. 2. Lack of experiments with more defense methods. For example, how easy is it to catch these injected malicious tools? Could the blue team easily catch them with an additional monitor before running the retriever and selector? 3. Studying the worst-case setting of adaptive attacks is reasonable. However, in practice, the attacker might not really have access to the agents' detailed trajectories, since these are often hidden by the deploying company. 1. What was $n$ (number of trials) per model/attack in practice? How sensitive were your lower bounds to halving $n$? A small table of "trials → CI width" would make the "certified" claim more concrete (a minimal sketch follows after this review). 2. The current paper tests defender LLaMA-3.1 vs. attacker Gemma-3 as a "representative" strong attacker (P7). Did you try mismatched or weaker attackers? Do we still see near-zero bounds when the attacker LLM is strictly smaller or older than the defender? 3. The current paper focuses on single-tool tasks. How does the proposed attack adapt to multi-tool tasks? Fully human-written
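As a rough illustration of the "trials → CI width" question above, a one-sided Clopper–Pearson lower bound can be computed in a few lines. This is a generic sketch of the standard exact binomial bound (the paper's exact parameterization, e.g. one- vs. two-sided intervals and the choice of alpha, may differ); the mapping of attack episodes to Bernoulli trials follows the review's summary.

```python
from scipy.stats import beta

def cp_lower_bound(successes, trials, alpha=0.05):
    """One-sided Clopper-Pearson lower confidence bound on a Bernoulli success rate.

    Here each multi-round attack episode would count as one trial, and a "success"
    means the agent still selected an intent-satisfying tool.
    """
    if successes == 0:
        return 0.0
    return float(beta.ppf(alpha, successes, trials - successes + 1))

# Rough "trials -> certified lower bound" illustration at a fixed empirical accuracy of 0.9
for n in (50, 100, 300, 1000):
    k = round(0.9 * n)
    print(n, round(cp_lower_bound(k, n), 3))
```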
Certifying Robustness of Agent Tool-Selection Under Adversarial Attacks Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. The paper studies robustness of LLM agent tool selection, the two‑stage process in which a retriever surfaces a top‑N slate of tools and the agent chooses one to execute. It introduces CATS (Certification of Agentic Tool Selection), a statistical framework that treats each adversarial interaction as a Bernoulli trial and computes a high‑confidence lower bound on robust accuracy via Clopper–Pearson intervals. The attacker can inject up to k tools and adapt across R rounds using feedback from the agent’s previous choices; the refinement is modeled as a Markov process over tool metadata. Experiments use BFCL single‑tool tasks with M=300 tools, top‑N=10 retrieval (MiniLM-L6-v2), and several models as selectors (Llama‑3.1‑8B, Gemma‑3‑4B, Mistral‑7B, Phi‑4‑14B; Gemini‑2.5 Flash appears as an attacker). The paper defines five attack families (Adversarial Selection, Top‑N Saturation, Privilege Escalation, Abstention Trigger, Intent Shifting) and reports that the certified bound collapses toward zero under strong adaptive attacks (e.g., R=10), even when clean accuracy is high. A causal ablation shows low robustness (<0.5) even with forced inclusion of the correct tool in the slate, implicating both retriever and selector (Figure 3, p.8). The work argues CATS is the first formal statistical certification tailored to discrete tool selection rather than continuous perturbations or output text. - First certification framework tailored to discrete tool selection with iterative adaptive attacks; clean formalization of pipeline and adversary space. - Comprehensive experiments across models and attacks; informative ablations on rounds, budgets, retrievers, and frameworks - Clear motivation (Figure 1, p.2), self‑contained algorithms (App. A.2), and compact visual summaries (Figure 2, p.8; Figures 4–8, pp.18–20). - Reveals large robustness gaps between clean and certified performance; provides a reusable evaluation harness that can guide defense development for agentic systems. - The lower bound is computed against the authors’ ∆adv class (templated, LLM‑generated, Markovian refinement). Claims of “worst‑case performance” should be qualified; results do not imply bounds over all possible adversaries or non‑Markov strategies. Provide an explicit statement of scope in §3.6 - Retrieval uses a single embedding model with Top‑N=10; more realistic settings include Unicode normalization, near‑duplicate clustering, homoglyph canonicalization, and slate diversification/quotas. Including at least one defended retriever baseline would strengthen conclusions about systemic vulnerability. - The paper assumes a privilege field and compares to a user budget, but the user privilege model and enforcement are not specified; clarify how πuser is set and judged in experiments - Results are on BFCL single‑tool calls with synthetic narrative context. It would be valuable to test on real tool stores or MCP/OpenAPI‑derived corpora and to vary M and N systematically to show scaling trends. - How generalizable is this certification to other adversarial strategies (e.g., non-Markov or non-templated attacks)? 
Could you explicitly clarify this scope in §3.6? - The experiments use a single embedding-based retriever with Top-N = 10. Have you considered evaluating more realistic retrieval settings, such as Unicode normalization, near-duplicate clustering, homoglyph canonicalization, or slate diversification, to simulate defended retrievers? Including one defended retriever baseline could strengthen the claim of systemic vulnerability. - In the Privilege Escalation attack, the paper assumes a privilege field π(t) and compares it to a user privilege πuser. How is πuser defined and enforced in your experiments? Are privilege mismatches detected through metadata rules or simulated policy constraints? - All evaluations are conducted on BFCL single-tool tasks with synthetic narrative context. Have you tested or do you plan to test CATS on real tool stores (e.g., MCP/OpenAPI-derived corpora) or vary M and N systematically to analyze scaling behavior and generalizability? Fully AI-generated
Gaussian Belief Propagation Network for Depth Completion Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses the depth completion task by developing the Gaussian Belief Propagation Network (GBPN). The GBPN consists of a Graphical Model Construction Network (GMCN) for constructing a scene-specific MRF over dense depth variables and a Gaussian Belief Propagation (GBP) strategy that infers the dense depth on the learned MRF. The GMCN models the potentials of the MRF and its structure by predicting non-local edges to capture the complex, long-range spatial dependencies guided by image content. The GBP strategy uses a serial & parallel message-passing scheme to enhance information flow. Experiments on KITTI and NYUv2 show that the proposed method achieves SOTA performance. The authors also conduct comprehensive ablations to validate the effectiveness of the proposed modules and the robustness. The proposed method achieves SOTA performance on public benchmarks, KITTI and NYUv2. The authors validate the effectiveness of the proposed modules and the robustness across sparsity levels. The authors also provide detailed information regarding the method, such as the model structure, proofs, and parameters. The idea of using Gaussian Belief Propagation for inference is interesting, with strong motivation from previous methods. - The strategy, MRF for depth estimation, has been explored before [1][2]. The authors should provide some discussion. - What are the advantages of the MRF for depth completion (GMCN & GBP) in comparison with previous propagation-based methods? - The authors claim that the design is "allowing the model to adaptively capture complex, long-range spatial dependencies guided by image content". Maybe it would be better if some example cases were provided. - As shown in Tab. 9 of the Supplementary, the proposed method has a higher running time than BP-Net, while their performance is very close. Therefore, what advantages does the method have in comparison with BP-Net (apart from fewer parameters)? - Discussion of the serial & parallel propagation scheme should be provided, e.g., its efficacy: how does the strategy improve performance, and at what computational cost (a toy sketch of the message-passing mechanics follows after this review)? - In which scenarios does the proposed method perform better, and what issues can it solve? Please give more examples and analysis. Since the performance is close to the latest methods, the authors should give more evidence for the effectiveness of the proposed method. [1] Chen et al., Fast MRF Optimization with Application to Depth Reconstruction. [2] Liu et al., Deep Convolutional Neural Fields for Depth Estimation from a Single Image. See the weaknesses. Please give feedback on each point. Fully human-written
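To make the GBP mechanics discussed in this review concrete, below is a minimal, generic sketch of scalar Gaussian belief propagation on a 1-D chain with sparse observations and a smoothness prior. It is a toy analogue of MRF-based depth completion, not the paper's GBPN (which learns its potentials and non-local edges); the precisions and names are illustrative. It also shows why a serial sweep propagates information across the whole chain in one forward/backward pass, whereas a fully parallel (flooding) schedule would need roughly one iteration per node.

```python
import numpy as np

def gabp_chain(obs, obs_prec=100.0, smooth_prec=10.0):
    """Scalar Gaussian BP on a chain MRF in information form (toy sketch).

    obs: 1-D array with observed depths at some positions and np.nan elsewhere.
    Node potentials come from the observations; pairwise potentials encourage
    neighbouring depths to be similar. On a chain (a tree), one serial forward
    sweep plus one backward sweep yields the exact posterior means.
    """
    n = len(obs)
    observed = ~np.isnan(obs)
    deg = np.full(n, 2.0)
    deg[0] = deg[-1] = 1.0
    J_diag = obs_prec * observed + smooth_prec * deg   # node precisions J_ii
    h = np.where(observed, obs_prec * np.nan_to_num(obs), 0.0)
    J_off = -smooth_prec                               # neighbour coupling J_ij

    dJ_r = np.zeros(n); dh_r = np.zeros(n)  # messages i -> i+1, stored at i
    dJ_l = np.zeros(n); dh_l = np.zeros(n)  # messages i -> i-1, stored at i
    for i in range(n - 1):                  # serial left-to-right sweep
        Jhat = J_diag[i] + (dJ_r[i - 1] if i > 0 else 0.0)
        hhat = h[i] + (dh_r[i - 1] if i > 0 else 0.0)
        dJ_r[i] = -(J_off * J_off) / Jhat
        dh_r[i] = -(J_off * hhat) / Jhat
    for i in range(n - 1, 0, -1):           # serial right-to-left sweep
        Jhat = J_diag[i] + (dJ_l[i + 1] if i < n - 1 else 0.0)
        hhat = h[i] + (dh_l[i + 1] if i < n - 1 else 0.0)
        dJ_l[i] = -(J_off * J_off) / Jhat
        dh_l[i] = -(J_off * hhat) / Jhat

    J_post = J_diag + np.r_[0.0, dJ_r[:-1]] + np.r_[dJ_l[1:], 0.0]
    h_post = h + np.r_[0.0, dh_r[:-1]] + np.r_[dh_l[1:], 0.0]
    return h_post / J_post                  # posterior means = completed depths

print(gabp_chain(np.array([1.0, np.nan, np.nan, np.nan, 3.0])))
```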
Gaussian Belief Propagation Network for Depth Completion Soundness: 2: fair Presentation: 1: poor Contribution: 2: fair Rating: 4: marginally below the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper introduces a hybrid framework, termed the Gaussian Belief Propagation Network (GBPN), for depth completion using sparse depth points and color images. The core idea is to use a deep network (GMCN) to dynamically construct a Markov Random Field (MRF) for each scene, learning both its potential function and graph structure by predicting non-local edges. Subsequently, the Gaussian Belief Propagation (GBP) algorithm is employed to infer dense depth from the constructed MRF. The authors report that the proposed method achieves state-of-the-art performance on the NYUv2 and KITTI datasets and exhibits strong robustness to input sparsity. 1. Framing depth completion as probabilistic inference on dynamically constructed graph models offers a theoretically sound approach for handling sparse and irregular inputs. Extensive evaluations under varying sparsity levels, noise conditions, and cross-dataset settings show that the proposed framework achieves stronger robustness than pure end-to-end regression models. 2. The proposed method not only learns the MRF parameters but also infers the graph structure by predicting non-local edges, which represents a novel contribution. This design allows the model to adaptively capture long-range dependencies from image content, thereby overcoming the fixed-neighborhood constraints of traditional MRFs. 1. The ablation study (Table 2) is not presented clearly, making it difficult to verify the contribution of each model component. 2. Although the paper provides a thorough empirical comparison with competitors such as BP-Net and demonstrates clear advantages in accuracy and robustness, the discussion does not move beyond empirical evidence and lacks a compelling conceptual justification. The authors do not clearly explain why their MRF+GBP paradigm is theoretically or conceptually superior to the direct propagation-learning paradigm represented by BP-Net. The contributions appear to represent a highly successful and well-designed paradigm instantiation rather than a fundamental conceptual advancement. 3. The inference time of this method is considerably longer than that of its main competitors (nearly 80% slower than BP-Net on KITTI), yet the accuracy improvement is negligible (only about 0.4%). This trade-off is unacceptable for real-time applications such as autonomous driving. 4. This paper employs loopy belief propagation on dynamically generated graphs, a method that lacks formal convergence guarantees. However, the paper does not discuss or analyze the potential stability and convergence issues associated with this setting. 1. Could you provide a clearer version of Table 2 that explicitly lists the configuration details for each ablation study variant? 2. For practical applications such as autonomous driving, how do you justify a considerable increase in inference latency in exchange for only a marginal gain in accuracy? 3. What conceptual advantages does your approach offer over methods that directly learn propagation operators?
4. When applying loopy belief propagation to dynamically generated graphs, have you observed any cases of non-convergence or oscillation (a generic convergence-screening sketch follows after this review)? Lightly AI-edited
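On the convergence question above: loopy Gaussian BP has a well-known sufficient (but not necessary) convergence condition, walk-summability (Malioutov, Johnson & Willsky, 2006), which could in principle be used to screen dynamically constructed MRFs. The sketch below is generic and not something the paper claims to implement; $J$ stands for the precision (information) matrix of the constructed Gaussian MRF.

```python
import numpy as np

def is_walk_summable(J):
    """Check walk-summability of a Gaussian MRF precision matrix J.

    The model is walk-summable iff the spectral radius of |R| is < 1, where
    R = I - D^{-1/2} J D^{-1/2} and D = diag(J). Walk-summability guarantees
    that loopy Gaussian BP converges to the correct means.
    """
    d = np.sqrt(np.diag(J))
    R = np.eye(J.shape[0]) - J / np.outer(d, d)
    rho = np.max(np.abs(np.linalg.eigvals(np.abs(R))))
    return rho < 1.0, float(rho)
```

Logging the fraction of constructed graphs that fail this check would be one way to quantify how often convergence is actually at risk.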
Gaussian Belief Propagation Network for Depth Completion Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. This paper addresses the depth completion task by introducing a Graphical Model Construction Network (GMCN), which constructs a scene-specific graph utilized by a Markov Random Field to optimize sparse depth through Gaussian Belief Propagation. Experimental results on the KITTI DC and NYU datasets demonstrate state-of-the-art performance, highlighting the effectiveness of the proposed approach. 1. The proposed method achieves state-of-the-art performance on both indoor and outdoor datasets. 2. It shows superior robustness across varying depth sparsity levels compared to existing approaches. 3. The paper provides a comprehensive analysis and extensive experimental results in the supplementary material, which further supports the validity of the proposed approach. 1. In Figure 1, it is recommended to add essential legends for better clarity, such as explaining the meaning of “T” in the top-middle and the significance of the green, blue, and orange lines. 2. The Method section currently occupies a substantial portion of the paper, leaving limited space for the Experiment section. It is suggested to compress the Method section to allow more room for presenting additional experimental results. 3. The influence of local edges and GBP iterations should be analyzed individually. Table 2 appears cluttered, making it difficult to identify corresponding variants. A clearer presentation or separate analysis would improve readability and understanding. 1. In Table 1, could you clarify whether the entry labeled GBPN corresponds to the GBPN-1 or GBPN-2 variant? Moderately AI-edited
Gaussian Belief Propagation Network for Depth Completion Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes the Gaussian Belief Propagation Network (GBPN) for depth completion. GBPN leverages a learned Markov Random Field (MRF) structure, constructed dynamically from RGB and sparse depth inputs, and performs inference via Gaussian Belief Propagation (GBP). The paper introduces a hybrid message-passing scheme and evaluates the method on NYUv2 and KITTI benchmarks, reporting competitive results. 1. Hybrid Learning-Inference Integration: The paper attempts a meaningful integration of graphical model reasoning with deep learning. Learning the MRF structure dynamically from images represents a conceptual advance over fixed priors or fully feed-forward architectures. 2. Principled Treatment of Sparse Inputs: The method embeds sparse depth directly into the probabilistic inference process, rather than processing it as part of standard CNN input, offering a more theoretically grounded approach to the sparsity challenge. 1. Lack of Computational Analysis: The proposed approach introduces significant inference overhead due to iterative GBP and dynamic graph construction. However, the paper does not report any runtime statistics, GPU memory usage, or scalability discussion. Given the growing importance of efficiency in practical systems, this omission is concerning. 2. Limited Empirical Gain: While the method shows SOTA iRMSE on KITTI, it underperforms on other key metrics (RMSE, MAE), suggesting the gain may not be consistent. On NYUv2, although the reported RMSE is strong, the deltas are small and the competitive landscape is already saturated. 3. Weak Justification of Components: Ablation studies show marginal gains (~3mm RMSE difference on NYUv2), raising questions about the necessity of the complex model components, including non-local edge prediction, dynamic updates, and dual-pass U-Nets. 4. Insufficient Analysis on Iterative Behavior: The authors claim that more iterations improve performance (line 290), but Table 2 only reports 3 and 5 iterations. It remains unclear whether the performance plateaus or continues to improve, and at what computational cost. 5. Unclear Sparsity Robustness Comparison: In Figure 2, the RMSE of some methods (e.g., GuideNet, CFormer) increases as input becomes denser, which is counter-intuitive and not explained. Additionally, curves for many methods converge from 500 points onward, making relative robustness claims less persuasive. 6. Presentation and Layout: Several pages are cluttered with tightly packed text and figures (notably pages 6 and 9), negatively impacting readability. A key concern is that the ablation results are unconvincing, as the reported gain is only about 3 mm. For the same model, it is quite common that retraining multiple times yields variations of around 5 mm, which means such a small improvement is either unsuitable for ablation analysis or insufficient to demonstrate the effectiveness of individual modules. It is also puzzling that the model’s accuracy degrades when the density of depth points increases, yet the explanations provided fail to address this anomaly satisfactorily. 
Even after this issue was highlighted by reviewers in earlier **NeurIPS comments**, the authors showed no intention of taking concrete steps to rigorously validate the effectiveness of their method or to improve its interpretability. The current revision remains insufficient for publication, and the reviewer maintains a negative overall assessment of the paper. Moderately AI-edited
All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning Soundness: 3: good Presentation: 3: good Contribution: 2: fair Rating: 2: reject Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. This paper proposes two principles for AI-generated image (AIGI) detection: "All Patches Matter" and "More Patches Better". The authors contend that existing detectors suffer from "Few-Patch Bias," relying excessively on a minimal number of highly discriminative patches. To address this issue, they propose the Panoptic Patch Learning (PPL) framework, comprising Randomized Patch Reconstruction (RPR) and Patch-wise Contrastive Learning (PCL). The method achieves superior cross-model generalization performance. - The methodology is well-aligned with the motivation: RPR corresponds to the principle of "More Patches Better," while PCL reflects the idea that "All Patches Matter." Both methods are clearly motivated and logically consistent with the proposed principles. - The experiments are comprehensive: performance is reported on benchmarks such as GenImage, DRCT-2M, and Chameleon, and robustness under various corruptions (e.g., compression and blur) is demonstrated. The study also includes ablation studies and hyperparameter analysis. - The paper presents “All Patches Matter / More Patches Better” as a primary principle, but ideas such as “any local patch of an AI-generated image contains artifacts, and even a single patch can be sufficient for reliable discrimination” have already been explored in prior patch-based detection work discussed in the related work section. This makes the proposed patch-level principles feel incremental rather than conceptually new. - Similarly, although the proposed PPL framework is shown to be effective, it is essentially an engineering-level refinement rather than a fundamentally new paradigm. The core idea of RPR is to move the DRCT-style diffusion reconstruction from the image level down to the patch level, while PCL is a straightforward application of supervised contrastive learning at the patch-token level. - The paper relies on TDE for attribution. However, TDE does not appear to be a commonly adopted attribution method in deep learning. Why not use more standard interpretability approaches, such as Grad-CAM and its variants, or LRP (Layer-wise Relevance Propagation) [1], to explain what the detector is actually using? - The paper focuses primarily on CLIP, DINOv2, and other ViT-style backbones, while largely ignoring CNN-based detectors. This raises an important question: do the core principles proposed in this paper still hold under CNN-based architectures? - Going further, [2] uses Guided-GradCAM and LRP as attribution methods, extracts transferable forensic features from different layers of a CNN-based detector, and maps them back to the input image patches. The results indicate that color statistics are a key signal for CNN-based forgery detectors, rather than an extreme reliance on a few spatial patches. Different layers highlight different input regions. This behavior is not the same type of bias that the authors call “Few Patch Bias.” Therefore, the bias analysis in this paper may not generalize to CNNs, and it is not yet convincing that the claimed principles are architecture-agnostic. 
[1] Transformer Interpretability Beyond Attention Visualization [2] Discovering Transferable Forensic Features for CNN-generated Images Detection 1. In Section 5, the citation of SAFE is incorrect. 2. In Table 6, the row labeled "Infonce/tau=0.5" is not aligned with the notation used in the main text. 3. In Section 3.1, the text uses "donot," which is a typographical error. It should be "do not." 4. In Algorithm 1, the classification loss is denoted as $L_{ce}$ in Step 3, but it appears as $L_{bce}$ in Step 5 when forming the total loss. 5. The terminology for the proposed bias is inconsistent. The Introduction (in the italicized part) uses "Few Patch Bias," while other parts of the paper use "Few-Patch Bias." Lightly AI-edited
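For readers unfamiliar with the PCL component discussed in the review above, a minimal sketch of what a patch-wise supervised contrastive objective could look like is given below. This is an assumed instantiation (standard supervised contrastive loss over ViT patch tokens with per-patch real/generated labels and the tau=0.5 temperature mentioned in the ablation), not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def patch_supcon_loss(patch_tokens, patch_labels, tau=0.5):
    """Supervised contrastive loss over patch tokens.

    patch_tokens: (N, D) patch embeddings pooled across a batch of images.
    patch_labels: (N,) with 0 = real patch, 1 = generated/reconstructed patch
                  (assumed labeling; under RPR, inpainted patches of real images
                  would count as generated).
    """
    z = F.normalize(patch_tokens, dim=-1)
    sim = z @ z.t() / tau                                       # pairwise similarities
    pos_mask = (patch_labels[:, None] == patch_labels[None, :]).float()
    pos_mask.fill_diagonal_(0.0)                                # exclude self-pairs
    logits = sim - 1e9 * torch.eye(len(z), device=z.device)     # mask self in denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1.0)
    return -(pos_mask * log_prob).sum(dim=1).div(pos_count).mean()
```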
All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning Soundness: 3: good Presentation: 3: good Contribution: 3: good Rating: 6: marginally above the acceptance threshold Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. This paper identifies a Few-Patch Bias in current AI-generated image detectors and proposes a Panoptic Patch Learning (PPL) framework consisting of (1) Randomized Patch Reconstruction (RPR), reconstructing random real-image patches using Stable Diffusion v1.4 inpainting, and (2) Patch-wise Contrastive Learning (PCL) applied to ViT patch tokens. The method aims to enforce equal attention to all patches and enhance artifact distribution information. Extensive experiments across several benchmarks show substantial gains. 1. The Few-Patch Bias is demonstrated visually (attention/TDE) and via patch-mask counterfactuals. The fix (RPR+PCL) is straightforward and well-motivated. 2. The combination of RPR and PCL provides measurable improvements over existing works. 3. Experiments are conducted on both toy and in-the-wild datasets and demonstrate SOTA performance. 4. There are rich ablation studies examining the design choices of the proposed method. 1. RPR reconstructs patches of real images using Stable Diffusion v1.4 inpainting (empty prompt). This introduces a major domain-specific bias. The authors could train additional variants of RPR using GAN-based reconstruction and observe the performance changes; this would further examine the generality of the PPL framework. Also, since some baselines, like SAFE, train on GAN-generated images and test on all, such an experiment would make the comparison fairer. 2. In Table 9, Midjourney and ADM accuracies with dropout = 0.15 (94.3% and 87.9%, mAcc=94.2%) are much higher than at 0.1, 0.2, or 0.25 (≈70%), and close to the proposed PPL result (mAcc=97.2%). This raises several questions: (1) Why does such a narrow dropout range produce a sudden performance surge? Is there randomness in patch masking or seed selection? (2) If a properly tuned dropout rate (0.15) nearly matches PPL while avoiding diffusion reconstruction and PCL fine-tuning, the claimed advantage of PPL becomes less convincing. 3. The paper fine-tunes CLIP and DINOv2 with PPL but does not clearly report the results of directly training CLIP and DINOv2. 1. In terms of Weakness-1, (1) have you tried using GAN-based reconstruction or other diffusion models within RPR? (2) How does the performance change when training RPR with different reconstruction backbones? 2. In terms of Weakness-2, please see the questions in that comment. 3. Could you report the results of directly training or evaluating CLIP/DINOv2 on the same datasets to quantify the actual gain from PPL fine-tuning? Fully human-written