|
LeGIT: LLM Guided Intervention Targeting for Online Causal Discovery |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes a framework to incorporate LLMs into existing gradient-based intervention targeting methods. Specifically, the paper proposes using LLMs in a warmup phase that includes an initial guess for the causal graph, obtained by prompting the LLM with the variable names in the dataset. The paper performs ablations and shows that on 4 datasets, LeGIT, using GIT as the underlying gradient-based intervention targeting method, outperforms vanilla GIT as well as a version of LeGIT where the LLM was replaced by human feedback.
The presentation of the work is clear and the results seem positive; the idea of using LLMs to resolve the issue of instability under certain initializations for gradient-based intervention targeting methods seems novel to me.
Experiments on only four standard datasets could be too few to draw statistically significant conclusions, especially if any optimizations (e.g. tuning prompts; tuning length of the warmup phase) have been performed using these datasets. It would be good to see robustness to cases where datasets are less well-known and LLMs may have difficulty coming up with a reasonable guess of the underlying structure.
- Experiments have been performed iterating over 5 seeds. Are 5 seeds sufficient to obtain statistically significant results?
- How was the length of the warmup phase tuned relative to the overall length of the experiments? Are there any plots showing performance under varying warmup lengths?
- Is there a possibility to experiment on more novel datasets where guessing the causal graph may be difficult for an LLM? It would be interesting to see to what extent LeGIT still outperforms, or whether incorporating LLMs still helps, in scenarios that are closer to "unsolved" cases where there is much less background knowledge available to the LLM, as this would be relevant for real research problems. |
Fully human-written |
|
Symmetric Behavior Policy Optimization |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper addresses a clear theoretical and practical gap in behavior-regularized policy optimization (BRPO) for offline reinforcement learning.
While existing approaches rely almost exclusively on asymmetric divergences (e.g., KL or *$\chi^2$*) that induce strong mode-seeking bias, this work systematically investigates the use of symmetric *$f$*-divergences as regularizers — an area that has been largely unexplored in prior literature.
The authors prove that symmetric BRPO generally lacks analytic optimal policies due to the non-affine form of *$f'(t)$* in *$\ln t$* and propose a principled approximation based on finite *$\chi$*-series truncation. This yields the Symmetric f-Actor-Critic (Sf-AC) algorithm, which preserves convexity, maintains bounded approximation error, and provides a balanced compromise between mode-seeking and mode-covering behaviors. Empirical results on D4RL MuJoCo benchmarks show consistent robustness and per-environment stability improvements over strong baselines (IQL, SQL, AWAC, XQL).
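For context, the asymmetry being removed is what normally buys tractability; a sketch of the standard closed forms I have in mind, under the usual BRPO objective (my notation, not necessarily the paper's):

$$
\pi^* = \arg\max_{\pi}\ \mathbb{E}_{a\sim\pi(\cdot|s)}\big[Q(s,a)\big] - \tau\, D_f\big(\pi(\cdot|s)\,\|\,\pi_D(\cdot|s)\big),
$$

where $D_{\mathrm{KL}}(\pi\,\|\,\pi_D)$ gives $\pi^*(a|s) \propto \pi_D(a|s)\exp\!\big(Q(s,a)/\tau\big)$ and the $\chi^2$ divergence gives, up to constants, $\pi^*(a|s) \propto \pi_D(a|s)\big[1 + A(s,a)/(2\tau)\big]_+$. Symmetric choices such as Jeffreys or Jensen–Shannon do not reduce to closed forms of this kind, which is the gap the $\chi$-series truncation is meant to bridge.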
However, the empirical evaluation lacks sufficient quantitative evidence to clearly demonstrate the effectiveness of the proposed symmetric regularization. While the results suggest improved robustness, the paper does not include concrete numerical comparisons or diagnostic analyses against existing offline RL approaches that would clarify why symmetric regularization helps. In particular, additional metrics illustrating the robustness deficiencies of conventional mode-seeking or mode-covering methods would make the contribution more complete and convincing.
If such quantitative analyses or diagnostic metrics are provided during the rebuttal period, I would be highly willing to raise my score.
* **Theoretical novelty**
* The paper fills a clear theoretical gap in behavior-regularized offline RL by systematically exploring the use of symmetric *$f$*-divergences. This is a novel perspective that extends the well-established BRPO framework beyond its asymmetric KL- or χ²-based formulations and highlights an under-examined dimension of regularization geometry.
* **Practical workaround through principled approximation**
* The authors introduce a mathematically grounded yet practical approximation via finite χ-series truncation and Taylor expansion of the conditional symmetry term. This approach preserves convexity and boundedness while providing a tractable implementation, making the otherwise intractable symmetric regularization feasible without major instability.
* **Lack of rigorous comparative analysis**
The paper mainly presents performance plots (e.g., Figure 3) showing only average returns across environments, without quantitative summaries such as mean ± confidence intervals. This makes it difficult to verify whether the improvements are **statistically significant** or whether the baseline implementations are correctly reproduced. In particular, the reported baseline performances on MuJoCo tasks appear unexpectedly low, raising concerns about possible implementation discrepancies or hyperparameter mismatches. Providing detailed numerical tables with variance statistics would make the comparisons more reliable and transparent.
* **Insufficient diagnostic metrics for robustness**
The claim of “robust performance” remains qualitative. In offline RL, robustness is often better captured by additional statistics such as CVaR or worst-case return, rather than mean performance alone. Including such risk-sensitive or tail-distribution metrics would substantially strengthen the claim that symmetric regularization mitigates the brittleness of conventional mode-seeking methods.
* **Need for deeper analysis contrasting mode-seeking and symmetric behaviors**
While prior works have demonstrated the empirical benefits of using symmetric divergences in *$D_{opt}$*, this paper claims that applying symmetry directly to the regularizer *$D_{reg}$* offers additional advantages. Although the paper clearly articulates the motivation for symmetric regularization, it does not sufficiently highlight the limitations of existing mode-seeking regularization that the proposed method intends to address. In offline RL, purely mode-seeking approaches are often regarded as sufficient for ensuring stability and avoiding out-of-distribution (OOD) actions. Introducing more mode-covering behavior through symmetry could, in principle, encourage exploration into unsupported regions of the dataset, potentially harming performance. Therefore, a deeper analysis or empirical evidence clarifying when and why symmetric regularization provides tangible benefits over purely mode-seeking objectives would make the paper more convincing and theoretically grounded.
* **Quantitative Comparison**
Could the authors provide detailed numerical results (e.g., mean ± confidence intervals) for each baseline, especially on the MuJoCo tasks where baseline performance appears unusually low, to confirm that the reported improvements are statistically significant and reproducible?
* **Robustness Evaluation**
Beyond average returns, can the authors include additional robustness-oriented metrics (e.g., CVaR, percentile or worst-case returns) to substantiate the claim of “robust performance” and demonstrate that symmetric regularization indeed improves stability over mode-seeking methods?
* **Clarification on *$D_{reg}$* vs. *$D_{opt}$* Effectiveness**
Since prior works have already shown benefits of symmetric divergence when used in *$D_{opt}$*, could the authors isolate or ablate the contribution of using symmetry in *$D_{reg}$* to clarify when and why this leads to tangible gains over purely mode-seeking regularization? |
Fully AI-generated |
|
Symmetric Behavior Policy Optimization |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents an interesting perspective on behavior-regularized policy optimization: using symmetric divergences, instead of asymmetric divergences, as the implementation of the regularization. The difficulty of using a symmetric divergence is twofold: 1) common symmetric divergences do not admit analytical solutions for the optimal policy; and 2) numerical issues can occur when dealing with a finite-support distribution. To tackle these challenges, this paper introduces two techniques, both based on Taylor expansion. First, in order to derive a closed-form optimal policy, they truncate the divergence in the RL objective at the second order; second, they truncate the divergence used in the policy improvement step to improve numerical stability. Some other techniques are also included in the proposed Sf-AC algorithm. The empirical evaluations are conducted on D4RL, and Sf-AC does demonstrate improvements over previous methods like SQL, IQL, and XQL.
1. The presentation of the overall idea is clear to me.
2. The literature review is comprehensive. I especially appreciate the discussion in the Appendix, which covers both policy regularization and distribution matching, and elucidates why the problem studied in this paper is unique and has not been addressed by previous literature.
3. The proposed algorithm appears robust and does not require per-task hyperparameter tuning (Tables 4, 5, 6).
Although the paper briefly discusses the issue with asymmetric divergences in the introduction, I would say it would be more appropriate to formally define the problems of using asymmetric divergences somewhere between Sections 2 and 3. In lines 51-53, it is unclear to me how a symmetric divergence solves the issue of multiple minimum points due to the capacity of the policy function class.
In short, the current version of the draft doesn’t provide sufficient motivation for me to switch from an asymmetric divergence to a symmetric one. I would consider raising my score if the authors can make this clear in the rebuttal.
1. In equation (8), the first term uses weights proportional to the exponential of the advantage, while in line 310, the optimal policy is defined by using $[1 + A/\tau]_+$. What causes the mismatch?
2. The value of $\epsilon$ varies from 0.2 to extremely large values like 100. Is it really necessary to include this hyperparameter?
3. Copy-edits: In Algorithm 1, when defining the advantage regression, the subscript of $V$ should be $\phi$.
Fully human-written |
|
Symmetric Behavior Policy Optimization |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper studies symmetric behavior regularization for offline RL within a BRPO-style framework and asks whether symmetric $f$-divergences admit an analytic optimal policy $\pi^*$ suitable for target matching.
The authors prove that for common symmetric divergences (Jeffreys, Jensen–Shannon, GAN) no closed-form $\pi^*$ exists and that naively using symmetric losses can be numerically unstable with finite-support policies. They propose a remedy by Taylor expanding $f$-divergences into a $\chi^n$ series, truncating at small order ($N<5$), which yields an analytic surrogate policy (closed form for $N=2$) and a practical algorithm Sf-AC: advantage regression plus a truncated conditional-symmetry term with ratio clipping and a truncation-error bound. Experiments on a Mixture-of-Gaussians fit and 9 D4RL MuJoCo tasks show competitive and failure-robust performance; the JS/Jeffreys variants are frequently top-3 by AUC.
Clear theory: explains why standard symmetric divergences do not yield an analytic $\pi^*$ and exposes instability.
Principled workaround: $\chi^n$ expansion recovers a usable surrogate for $N\le 4$; simple closed form for $N=2$.
Practical loss: decomposes into advantage regression + symmetry expansion, with clipped ratios and an error bound.
Empirical robustness: strong results on D4RL with ablations over series order and clipping; resilient under failure cases.
Approximation vs exact symmetry: truncation introduces bias; the bias/variance/stability trade-off could be analyzed deeper.
Benchmark scope: limited to MuJoCo; harder OOD domains (AntMaze, Kitchen, Adroit-relocate) would stress-test the method.
Behavior-policy access: guidance is light when $\pi_D$ (or density ratios) are poorly estimated.
No end-to-end convergence or error-propagation guarantees under function approximation for the combined objective.
Sensitivity to series order $N_{\text{loss}}$ and analytic-policy order $N$ (e.g., 2 vs 3–4)? Any task-dependent guidance?
Wall-clock and memory vs $N_{\text{loss}}$ and clipping $\epsilon$?
Robustness when behavior coverage is narrow or multimodal; diagnostics for ratio misestimation?
Could uncertainty signals (critic ensembles) adapt $\epsilon$ or series coefficients to avoid rare failures?
Any caveats for discrete/bounded actions beyond clipping (e.g., projection effects)? |
Fully AI-generated |
|
VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper studies the role of VLMs as backbones for VLA models. The authors fine-tune 7 open VLMs on robot simulation data and evaluate them across three benchmarks (Calvin, SimplerEnv, Libero). Their results show that a simple architecture and training objective can achieve competitive performance against recent models with specific VLMs. The authors also show that performance on Calvin is correlated with the VLM’s VQA performance, and that fine-tuning on VQA data from robot tasks does not help improve performance. Finally, the authors show the importance of fine-tuning the vision encoder of the VLM.
1. A meta-analysis of the role of VLMs in VLA models, showing competitive performance with a simplified training framework.
2. A study of the relationship between robot task performance and generic VQA performance.
3. A study of the (lack of) usefulness of VQA data extracted from robot data.
4. An analysis of the importance of fine-tuning the vision encoder, likely due to the domain mismatch between pretraining and robot data.
1. While the experiments show that some VLMs can achieve competitive performance in the VLM4VLA framework, the paper lacks specific guidelines and insights into why specific models perform better. While the authors found VQA performance to be predictive of Calvin performance, this was not the case for the other benchmarks. This left me wondering how I would choose the next VLM to initialize my VLA from.
2. I am also concerned about the decision of using the same hyperparameters for all the models. While this drastically lowers the number of experiments, models of different sizes would likely require different hyperparameters to perform best. This can affect all the results and conclusions that the authors derive from their experiments.
1. Which VQA benchmarks were used for the correlation analysis in Fig 3?
2. Have you tried correlating robotic task performance with other downstream tasks besides VQA?
3. What is the (lower-bound) performance of a VLM4VLA initialized from scratch? |
Fully human-written |
|
VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors propose VLM4VLA, a general framework for adapting arbitrary Vision-Language-Models (VLMs) into Vision-Language-Action (VLA) models, requiring only a minimal number of new parameters. The work also investigates how the choice of the base VLM affects the downstream VLA's performance.
1. The paper presents a framework for fairly comparing the performance of different VLMs on VLA tasks and provides an in-depth study into the reasons for performance discrepancies.
2. By using an MLP action head instead of a more complex diffusion-based one, the framework avoids introducing stochasticity. This ensures a "fair and reproducible" comparison across the different VLMs.
3. It systematically evaluates VLM capabilities along three dimensions: general capability, embodied-specific capability, and the vision encoder.
4. The experiments cover a wide range of models and test tasks, providing a strong empirical baseline for future work.
1. The study lacks real-robot experiments. The sim-to-real gap is a major concern in the VLA field, and this work doesn't clarify how different VLMs might affect the model's final generalization to real-world scenarios.
2. Diffusion action heads and MLP action heads may leverage VLM capabilities differently (e.g., many diffusion heads use the VLM's KV-cache for information interaction). The paper does not directly compare the impact of these two approaches on VLA performance.
1. Could you provide a performance comparison and analysis for a model using both a diffusion action head and an MLP action head? Or, perhaps more directly, a comparison between using embeddings vs. using the KV-cache for information interaction?
2. Are there any additional experiments that could demonstrate the difference in generalization capabilities among the various VLMs when applied to VLA tasks? |
Lightly AI-edited |
|
VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work conducts an empirical study (in simulation) to investigate the relationship between the capabilities of the VLM backbone and its downstream performance in a VLA. The authors propose VLM4VLA, an adaptation pipeline that adds an MLP head to the VLM backbone to ensure a fair comparison across models. Through experiments in simulation, the authors arrive at three findings: 1) a VLM's performance on general VQA benchmarks is often a poor predictor of performance on robotic manipulation tasks, 2) fine-tuning a VLM on auxiliary embodied VQA tasks doesn't reliably improve performance, and 3) fine-tuning the VLM's vision encoder is necessary.
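For readers less familiar with this style of adaptation, here is a minimal sketch of what an MLP action head on top of a VLM backbone could look like; the pooling choice, dimensions, and action parameterization are my own illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class MLPActionHead(nn.Module):
    """Maps pooled VLM hidden states to a continuous action vector."""
    def __init__(self, hidden_dim: int = 1024, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 512),
            nn.GELU(),
            nn.Linear(512, action_dim),  # e.g., end-effector deltas plus a gripper command
        )

    def forward(self, vlm_hidden: torch.Tensor) -> torch.Tensor:
        # vlm_hidden: (batch, seq_len, hidden_dim) from the VLM backbone
        pooled = vlm_hidden[:, -1]   # last-token pooling is just one simple choice
        return self.net(pooled)      # (batch, action_dim), trained with a regression loss
```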
Large-scale study: the primary strength of this work is the sheer number of models (7) and simulation environments (3) studied. This comprehensive evaluation provides a valuable perspective
VLM4VLA Framework: by using an MLP (non-stochastic), the authors maintain a simple adaptation module which enables isolating the VLM backbone as one of the only independent variables. Ensuring prompts are adjusted per model is further evidence of clean experimentation.
Findings: the finding that VQA performance is not predictive of manipulation performance, in simulation, is good to know for future work. However, I would add that I think the main reason researchers leverage a VLM backbone is to capitalize on the potential for generalization, not in-distribution robot manipulation performance (which must be taught via imitation learning).
Reliance on simulation: the paper's most significant weakness, which the authors acknowledge, is its exclusive use of simulated environments. While this choice was made for reproducibility and scale, it leaves open whether the paper's main claims generalize to the real world, which is ultimately what is needed in robotics. The sim-to-real gap is a well-known challenge in robotics, and care must be taken when extrapolating findings from simulation to the real world. Moreover, Figure 3 supports this statement: the relationship between VLA performance and VLM capability is not consistent across simulation environments.
As another example of the potential pitfalls of simulation, as vision encoders are often pre-trained on real-world images, evaluating them purely in the simulated domain may add a confounding variable to the analysis, as most VLAs are only fine-tuned on real-world data. Therefore, I suggest that all main claims in the paper add the qualifier that they hold only in the simulation benchmarks tested.
Novelty of vision encoder finding: while the result that fine-tuning the vision encoder is crucial is well supported, this observation is not entirely new. Other works such as OpenVLA [1] and Otter [2] have described similar findings.
[1] Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., ... & Finn, C. (2024). Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.
[2] Huang, H., Liu, F., Fu, L., Wu, T., Mukadam, M., Malik, J., ... & Abbeel, P. (2025). Otter: A vision-language-action model with text-aware visual feature extraction. arXiv preprint arXiv:2503.03734.
Clarification of Pi0 baseline: could you elaborate a bit more on the specifics of why the official implementation of Pi0 was not used? In particular, were the official weights of the Physical Intelligence model used?
Defining "VLM General Capability:" Figure 3 aggregates 'vlm capability' on the x-axis, but it is not precisely defined. What exactly does this mean? |
Fully human-written |
|
VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper studies the impact of the underlying VLM on VLA policy performance through the VLM4VLA architecture. VLM4VLA trains a continuous regression head on top of the VLM for the robot action prediction task. The paper trains and evaluates VLAs based on several different VLM sizes across the CALVIN, SimplerEnv, and Libero simulated environments. The paper finds that the impact of base VLM capabilities varies greatly by environment. Furthermore, continuing to train the VLM on embodiment and 3D-specific tasks produces a slight degradation in VLA performance.
1. The paper addresses the important problem of the impact of VLMs on VLA performance, which has been relatively understudied in prior work.
1. The paper shows that general VLM capability does not necessarily correlate with VLA capability. This is an important finding since it contradicts the common intuition that a stronger VLM model is always better for VLAs. For example, recent VLA works use newer VLM bases, which this study shows is not necessarily a good decision.
1. The paper supports its claims with comprehensive empirical analysis across many different base VLMs and simulated environments.
1. The result showing that further training on VLM auxiliary tasks does not improve downstream VLA performance, as presented in Figure 4, is a novel insight. The paper supports this claim with comprehensive results that test a wide variety of auxiliary tasks. Like the VLM-to-VLA transfer result, this finding is also surprising and contrary to prior efforts that design auxiliary tasks to improve VLM embodied capabilities.
1. The results in Section 4.3 provide an important lesson by comprehensively demonstrating the importance of training the visual encoder, where freezing the visual encoder leads to a large performance drop.
1. The VLM4VLA architecture provides a consistent and simple way to adapt VLMs to VLAs. The paper also shows that this architecture produces results superior to prior works that use more complicated designs.
1. The paper includes sufficient reproduction details in the appendix.
1. The importance of the visual encoder can also be explained by several other factors beyond the need to finetune it. (1) The Qwen2.5-VL model is sensitive to image resolution, with higher resolutions using more visual tokens per image and typically producing better performance. It is possible that the visual encoder could be frozen if the image resolution were increased. (2) The VLMs are trained primarily on real images, while the selected benchmarks are not photorealistic and use only simple rendering (see the top of Figure 1). Thus, the visual encoder may need to be finetuned to overcome this domain gap, which would likely not be an issue with real robot data.
2. The linear correlation plot in Figure 3 is hard to interpret (see Question 2). The paper should report the strength of the correlation.
3. The paper appears to rely on VQA benchmarks as a measure of VLM capability in Figure 3. Are there more specific VLM capabilities, such as spatial understanding, that have a direct relationship with VLA performance?
4. The paper does not provide analysis of what drives VLA performance based on the base VLM. For example, why does Kosmos-2 achieve the highest success rate in Table 2? While showing that VLM capabilities do not necessarily correspond to VLA capabilities is a valuable contribution, providing initial evidence for why this occurs would strengthen the paper.
5. The paper does not experiment with larger VLM sizes. It is possible that larger VLMs learn more general features that improve VLA capabilities, which this paper does not rule out.
1. Is it still important to train the visual encoder if the image resolution is increased, as discussed in Section 4.3 (see Weakness 1)?
2. What do the colors of the lines and shapes represent in Figure 3? Likewise, which points are used to fit the lines?
3. How does the paper compare the VLM performance of the models between “proprietary tasks” and “general-purpose VQA benchmarks” (L357–368)? Shouldn’t VLM general capability be assessed using the same benchmarks for all models? |
Fully human-written |
|
EgoExo-Con: Exploring View-Invariant Video Temporal Understanding |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper studies the cross-view consistency in predictions of video models.
* First, the authors propose a benchmark (EgoExo-Con) of 491 synchronised videos and 3,178 temporally bounded queries. While video models perform well in individual views, their predictions are not consistent across views.
* It is shown that naive SFT on a mix of ego-exo video data does not improve this consistency likely due to conflicting priors.
* Finally, the authors propose View-GRPO, an RL-based approach (along with a training dataset, View30K) to enhance temporal reasoning while encouraging view-invariant video comprehension. This outperforms SFT and standard GRPO. View-GRPO includes three reward signals (a sketch of how these might combine follows the list):
1. Format reward (for answer structuring)
2. Accuracy: irrespective of view, the final answer consistency is encouraged by this reward
3. Reasoning reward: the model is encouraged to output reasoning traces similar to those of GPT-5; this balances cross-view temporal reasoning with certain view-specific details.
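For concreteness, a minimal sketch of how I read the reward combination; the tags, weights, and the string-similarity stand-in for the reasoning reward are my own assumptions, not the authors' exact formulation (they appear to compare against GPT-5 traces with a judge model):

```python
import re
from difflib import SequenceMatcher

def view_grpo_reward(response: str, answer_gt: str, reference_trace: str,
                     w_fmt: float = 0.5, w_acc: float = 1.0, w_reason: float = 1.0) -> float:
    """Illustrative per-sample reward combining the three signals; all weights are hypothetical."""
    # 1) format reward: the response should contain structured reasoning and answer tags
    has_format = bool(re.search(r"<think>.*</think>", response, re.S)) and "<answer>" in response
    r_format = 1.0 if has_format else 0.0
    # 2) accuracy reward: the final answer must be correct regardless of viewpoint
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    r_acc = 1.0 if m and m.group(1).strip() == answer_gt.strip() else 0.0
    # 3) reasoning reward: similarity of the model's trace to a reference trace
    #    (a cheap string-similarity stand-in for the paper's judge-based comparison)
    t = re.search(r"<think>(.*?)</think>", response, re.S)
    r_reason = SequenceMatcher(None, t.group(1), reference_trace).ratio() if t else 0.0
    return w_fmt * r_format + w_acc * r_acc + w_reason * r_reason
```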
1. **Useful problem and benchmark.** The proposed problem on cross-view consistency is interesting and has several practical applications like learning from human demonstrations. The EgoExo-Con benchmark, although modest in size, is thoroughly constructed and is useful for measuring this consistency.
2. **Strong baseline.** The proposed View-GRPO is simple, well-formulated and shows strong performance on the proposed benchmark.
3. **Presentation.** The paper is very well written and was easy to understand.
1. Instead of enforcing cross-view consistency only through the final answer, the View-GRPO method relies on matching reasoning traces for cross-view consistency. This means that consistency is enforced via the language space and not directly via visual correspondence, which can be a limitation in narrow cases where certain visual elements are hard to describe in language.
2. There is still a lot of room for improvement in terms of consistency scores (Tab 3). While this highlights strength of the benchmark, it could strengthen the paper if some discussion is included on what the authors project as potential ways to improve performance (more ego data in pre-training? entirely new algorithms/architectures?).
3. (Minor) There is some overlap with prior work [1] in terms of problem formulation. However, it is not published and can be considered concurrent. Nevertheless, it may be nice to include a deeper discussion on the differences.
4. Some discussion on whether the proposed View-GRPO affects performance on other standard video benchmarks is desirable. I am not sure whether LoRA or full fine-tuning is used. In case of the latter, such an evaluation is much needed.
[1] EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs
Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, Jiangmiao Pang
1. Audio is view-invariant. Have you considered including audio in the mix and using Qwen-Omni like models?
2. Is there a way to automatically quantify if a given query for a video is exo-friendly, ego-friendly or both-friendly? For example, "a person laughing" is likely only exo-friendly and "a person holding something tight" is ego-friendly (clearly visible in ego view but perhaps not visible in exo view).
3. Another type of consistency could be to pass both ego and exo views together and ask if they depict the same action, or ask the model to match correspondences. Have you considered something like that? |
Fully human-written |
|
EgoExo-Con: Exploring View-Invariant Video Temporal Understanding |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper investigates the ability of Video-LLMs to maintain consistent temporal understanding of events captured from different viewpoints. The authors introduce EgoExo-Con, a new benchmark consisting of synchronized egocentric and exocentric video pairs with refined natural language queries for temporal verification and grounding tasks. Through extensive experiments, they reveal that current SOTA models struggle significantly with cross-view consistency, often achieving scores that are much lower than their single-view performance. The paper also demonstrates that naively fine-tuning models on a mix of both viewpoints is insufficient and can even degrade performance. To address these shortcomings, the authors propose View-GRPO, a reinforcement learning framework that encourages models to develop view-specific reasoning chains while aligning their final temporal predictions, showing improved consistency over standard fine-tuning methods.
- EgoExo-Con benchmark is a significant contribution. The authors have been sourcing data from diverse datasets and performing careful filtering and human-backed refinement of queries to ensure they are unambiguous and visible from both perspectives.
- The paper provides a comprehensive evaluation of a wide range of Video-LLMs. The key finding that cross-view consistency scores are "barely over half their single-view performance", and that naive multi-view training fails to improve consistency, is a crucial insight that highlights a fundamental weakness in current architectures and training paradigms.
- The proposed View-GRPO method offers a promising direction for improving consistency.
- The effectiveness of the proposed View-GRPO method is demonstrated only on the Qwen2.5-VL model family. While the results are positive, application to wider range of models would be necessary to make a stronger claim about the generalizability and robustness of the approach.
- The authors showed that the choice of judge model impacts performance. However, the issue of reliability and bias in the judge is left for future work. Given its central role in the method's success, a more thorough analysis or ablation study (e.g., removing the reasoning reward) would be desirable.
- View-GRPO is an incremental approach built on top of GRPO. The novelty lies in its application to the cross-view consistency problem and the design of the reward function.
This paper presents a valuable and timely contribution by defining and benchmarking the problem of view-invariant temporal understanding. The EgoExo-Con dataset and the accompanying analysis are strong points that will benefit the community.
The proposed View-GRPO method, while promising, has weaknesses in its presentation. |
Moderately AI-edited |
|
EgoExo-Con: Exploring View-Invariant Video Temporal Understanding |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper introduces the evaluation benchmark EgoExo-Con to evaluate cross-view video temporal understanding. The authors also propose View-GRPO and construct View30K to explicitly enhance temporal reasoning.
1. The paper is clearly written.
2. The proposed EgoExo-Con eval set can be useful for this area of research.
3. The proposed reinforced approach enhanced view-invariant comprehension in video-LLMs.
1. The proposed evaluation set EgoExo-Con only contains 491 items, which is quite small, making it hard to say whether it simply amounts to finding a hard set for current Video-LLMs.
2. The closed-source model and human performance are reported on a randomly sampled subset, which may be sensitive to sample selection and cannot be fairly compared with the open-source models and the proposed model.
3. In Figure 1 (b), I don't think it is appropriate to expect the model to recognize "put a knife" from the provided exo video, because the object of interest is too small. Even for humans, it is hard to identify that action. The top example in Figure 3 also makes me wonder how a human could even tell that the person is opening a cabinet door.
4. The evaluation is quite limited, covering only the proposed set. Moreover, comparing Table 1 and Table 2, the proposed method is not better than previous models, e.g., TimeChat-VT.
1. Can you apply your View-GRPO approach with an existing model on tasks like those in "Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment"? That work tackles a similar problem.
2. How many video frames do you use for training? The training time seems too long: "8 x A100 GPUs and requires over 1 day for the 3B model and 2 days for the 7B model". |
Fully human-written |
|
EgoExo-Con: Exploring View-Invariant Video Temporal Understanding |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces EgoExo-Con, a benchmark to evaluate the capability of VideoLLMs in understanding temporal consistency across ego- and exocentric viewpoints of the same events. The benchmark focuses on two tasks, including temporal verification (binary QA) and temporal grounding. The authors found that the existing VideoLLMs achieve good single-view performance while showing poor cross-view consistency. To this end, they propose View-GRPO, an RL-based fine-tuning framework, extending Group Relative Policy Optimization with an additional reasoning reward and a curated dataset to enhance cross-view temporal reasoning. Empirical results on EgoExo-Con show that View-GRPO improves consistency over standard SFT and GRPO baselines.
**[S1] Writing and motivation**
- The paper is well-written and easy to follow. The motivation is clearly presented.
**[S2] Experiments**
- The paper covers extensive range of experiments, including evaluation both open- and closed-source models, reporting detailed per-subset results (CharadesEgo, LEMMA, EgoExo-4D), and fine-tuning analyses.
**[W1] Dataset and task clarification**
- EgoExo-Con combines pre-existing datasets, so its domain diversity still depends on those sources. The paper claims “comprehensive” coverage, but 491 pairs is small compared to modern multimodal benchmarks.
- Evaluating temporal consistency across viewpoints is critical in view-invariant video understanding. Traditionally, cross-view temporal consistency has been evaluated through cross-view phase progression or Kendall's $\tau$. Additionally, cross-view frame or clip retrieval may also measure the capability of temporal consistency of the models. Compared to previous evaluation tasks, temporal verification may provide a limited insight, as it measures whether the model identifies an event as the same or not. The necessity of temporal verification should be clearly presented in the paper.
- In sum, the contribution of the introduced benchmark may not be significant.
**[W2] View-GRPO**
- View-GRPO itself does not seem inherently specific to cross-view reasoning. In other words, the rewards corresponding to each view can be optimized separately, and co-optimization across viewpoints is not guaranteed.
**Minor issues**
- Repeated entries in reference (e.g., Feng et al., 2025a/b and Grauman et al., 2024a/b)
- Typo: L77 their sing-view
- Figure 6 should be moved before Figure 7 (or swap the order of figures)
- There are substantial performance gaps between closed-source models and open-source models, as shown in Table 1. What makes closed-source models stronger? This is just a question to know the authors' opinion.
- Please see weaknesses |
Fully human-written |
|
Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents Conservative Discrete Quantile Actor-Critic (CDQAC), an offline reinforcement learning algorithm for Job Shop Scheduling Problems (JSP) and Flexible Job Shop Scheduling Problems (FJSP). The method learns scheduling policies from static datasets without requiring environment interaction, using a quantile critic with dueling architecture and conservative Q-learning to prevent overestimation of out-of-distribution actions. CDQAC employs delayed policy updates and a Dual Attention Network to encode operations and machines. The algorithm demonstrates superior performance compared to offline and online RL baselines across standard benchmarks, achieving sample efficiency with only 10-20 training instances. The research reveals that CDQAC performs best when trained on random solutions rather than expert heuristics, attributed to higher state-action coverage (7.7× more than priority dispatching rules) enabling better solution stitching. The method generalizes to larger problem instances and maintains stability even with 1% of original training data, challenging conventional offline RL assumptions about data quality requirements for combinatorial optimization problems.
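For context on the quantile critic, I assume the paper follows the standard quantile-regression Huber loss from distributional RL (QR-DQN); it is asymmetric in that over- and under-estimation of each quantile are weighted differently by the quantile level $\tau$:

$$
\rho_\tau^\kappa(u) = \big|\tau - \mathbf{1}\{u < 0\}\big|\,\frac{\mathcal{L}_\kappa(u)}{\kappa},
\qquad
\mathcal{L}_\kappa(u) =
\begin{cases}
\frac{1}{2}u^2, & |u| \le \kappa,\\
\kappa\big(|u| - \frac{1}{2}\kappa\big), & \text{otherwise},
\end{cases}
$$

where $u$ is the TD residual. A symmetric L2 loss would pull all quantile estimates toward the mean rather than spreading them across the return distribution, which is relevant to my question on the Huber loss below.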
1. A good summary of related work in the area, clearly setting the context for the relevance of the proposed work.
2. Learning from offline trajectories, especially lower quality ones like random or genetic algorithms, is an important problem tackled in the paper.
3. The problem setting is explained clearly, and the evaluations use state-of-the-art algorithms as baselines.
4. The expansion from job shop scheduling to flexible job shop scheduling is significant. Handling a varying number of state-action spaces, although not new, is a good characteristic of the solution; its generalizability has been evaluated and its limitations have been noted.
1. An ablation study is missing. The proposed solution has many different components, and it is unclear which of these components are critical to the overall performance.
2. Data used is synthetic, the experiment results will be stronger if real-world data were used.
1. What are the computational requirements, i.e. training time, memory usage, and inference latency?
2. How were the hyper-parameters determined?
3. Why do you need to use the asymmetric Huber loss? In what way is it asymmetric? Have you compared it with the standard L2 loss?
4. What is the purpose of the dueling quantile network? Would the algorithm still be performant if we remove it?
5. What is the sensitivity to the number of quantiles you use? What happens if you reduce the quantiles to as few as 3 or 4? |
Fully human-written |
|
Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces Conservative Discrete Quantile Actor-Critic (CDQAC), a novel offline reinforcement learning algorithm designed for job shop and flexible job shop scheduling problems. CDQAC learns effective scheduling policies directly from datasets generated by suboptimal heuristics, eliminating the need for simulated environment training. The method leverages a quantile-based critic and delayed policy updates to estimate return distributions and improve stability. Extensive experiments show that CDQAC consistently outperforms both the heuristics that generated the training data and state-of-the-art offline and online RL baselines, achieving high sample efficiency and strong generalization—even when trained on data from random heuristics.
- CDQAC can learn highly effective scheduling policies even when trained only on data generated by random or suboptimal heuristics, demonstrating strong generalization capabilities
- The method eliminates the need for simulated environment interactions, making it practical for real-world scheduling problems where simulators may not be available
- The algorithm consistently outperforms both the heuristics used to generate the training data and state-of-the-art offline and online RL baselines across various scheduling benchmarks
- CDQAC achieves strong results with limited data, highlighting its efficiency and practicality for offline RL in industrial scheduling contexts
I am not sure whether the below points are actual weaknesses of the paper or whether they arise due to the lack of my knowledge of scheduling problems and related literature - I am happy to increase the score if these can be addressed & I will give myself a lower confidence to account for this.
1) The main reported metric is the optimality gap, i.e. the gap to some method that is even better. The immediate question is of course why not use the better method: if I am not mistaken, that method would take 30 minutes to provide a better solution for any of the considered problems, and 30 minutes does not sound like a lot, especially considering that the RL training time is likely higher (I believe it is not reported).
2) If I assume the better method is for some reason not applicable, e.g. because 30 minutes is too long, then the training time of the RL algorithm probably does not matter as much; the time needed to solve is reported for some problems within seconds, which is likely only the inference time. I wonder then why training time does not seem to matter. Maybe the policy can be reused across many new instances, while the "classical" solution needs to start from scratch every time, but I am not sure.
3) Somewhat along the same lines: it seems that the main reason heuristics and RL methods are considered in these settings in the first place is that classical approaches do not scale to large problem sizes/complexities. I wonder if sufficiently large problem instances have been considered in the paper, since every problem is solved better by a classical approach within just 30 minutes, which again does not sound like a lot (30 minutes is also a relatively weak quantification; I don't know whether this uses the same GPU hardware, and it would be better to report the FLOPS used by both methods). I presume it would make sense to consider a problem size where the best method "switches", i.e. where the optimality gap becomes negative because the classical approach cannot come up with a reasonable solution within 30 minutes, while the RL solution can.
I guess generally the paper could use a bit more clarifications on the constraints and requirements that arise in practice from these scheduling problems - at least for me some things remain unclear.
see weaknesses |
Fully human-written |
|
Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces Conservative Discrete Quantile Actor-Critic (CDQAC), an offline reinforcement learning algorithm for job shop and flexible job shop scheduling. CDQAC aims to learn high-quality scheduling policies directly from static datasets generated by suboptimal heuristics such as random, genetic algorithm (GA), or dispatching rules, without relying on simulated environment interaction. The method combines a quantile-based critic, a delayed policy update, and a dueling network structure to model the return distribution conservatively and stably. Experiments on JSP and FJSP benchmarks show that CDQAC outperforms both existing offline and online RL baselines while maintaining strong generalization to larger instances.
The empirical results are comprehensive across JSP and FJSP benchmarks, demonstrating scalability and transferability. The finding that random datasets yield superior generalization provides a novel perspective on data diversity and pessimism in offline RL. Overall, the paper is clearly written, methodologically solid, and offers meaningful insights for applying offline RL to combinatorial optimization.
Despite the strong presentation, the empirical evaluation leaves several open concerns.
- The reliance on datasets generated from suboptimal heuristics raises doubts about the true generality of the learned policy, as performance improvements may stem from overly simple environments rather than genuine learning from poor data.
- The comparison set for both offline and online RL baselines is narrow—especially for online methods—and lacks sufficient training details, making fairness unclear.
- The claim that CDQAC trained on random data outperforms fully trained online PPO-based methods is difficult to accept without deeper evidence of equal training conditions or convergence verification.
- Similarly, the demonstration that five training trajectories suffice for good performance may reflect limited state diversity rather than strong sample efficiency.
- The justification for using datasets generated by suboptimal heuristics is not sufficiently supported—how does training on random or poor-quality data yield policies that outperform those derived from expert or online learning methods?
- The experimental comparison includes very few offline RL baselines.
- Section 5.3 lacks details on the training configuration for online RL methods; it is unclear whether the models were fully trained to convergence under comparable computational budgets.
- The result that well-trained online RL performs worse than offline RL trained on static random data seems counterintuitive—additional diagnostic analysis could strengthen the argument.
- The conclusion that five trajectories suffice for strong generalization appears overly optimistic; more analysis is needed to show whether those trajectories capture meaningful dynamic diversity.
- The composition of the state vector is underexplained—what specific scheduling features (machine utilization, remaining processing time, job priority, etc.) are encoded, and how do they influence performance?
- The paper does not compare against classical dispatching rules or metaheuristics widely used in scheduling; including such baselines would improve the relevance and credibility of the results. |
Heavily AI-edited |
|
Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Conservative Discrete Quantile Actor-Critic (CDQAC), a conservative distributional offline RL method for learning dispatching policies in JSP/FJSP. CDQAC combines a quantile critic with conservative value regularization and delayed policy updates to stabilize the learning process. The offline approach reports strong performance on FJSP/JSP benchmarks, along with the notable observation that policies trained on random datasets can sometimes outperform those trained on near-optimal trajectories.
This paper addresses exploration-free policy learning (offline RL) for large FJSP/JSP instances by combining distributional Q-learning with conservative regularization, a dueling quantile network, and delayed policy updates.
It demonstrates sample efficiency and generalization across instance sizes in FJSP, one of the most challenging COPs.
The finding that random datasets can be effective for training is practically meaningful for dataset construction in scheduling.
JSP comparisons are incomplete and relatively weak. While FJSP results include several recent methods, JSP needs stronger baselines. In particular, compare against imitation learning approaches that learn machine–operation pairing policies without a simulator; the following work achieves SOTA on JSP:
- Lee, J. H., & Kim, H. J. (2024). Graph-based imitation learning for real-time job shop dispatcher. IEEE T-ASE.
Modern online RL baselines should be included as JSP baselines (not only in Related Work). For example:
– Park, J., Bakhtiyar, S., & Park, J. (2021). ScheduleNet. arXiv:2106.03051.
– Chen, R., Li, W., & Yang, H. (2022). TII 19(2):1322–1331.
– Iklassov, Z. et al. (2023). IJCAI, pp. 5350–5358.
These papers report the performance on the Taillard (1993) datasets.
The integration of a quantile critic, dueling network, conservative regularization, and delayed policy updates is presented, but its novelty and motivation are under-emphasized. If this combination is the first of its kind for offline scheduling in this domain, the authors need to state it explicitly, explain what had to be modified to make the components work together, and add ablations isolating each component's effect, not only the quantile critic and dueling network (Table 9).
The encoder adopts the DAN/DANIEL structure (Wang et al., 2023) and its operation/machine features largely as-is.
1. As you note that CDQAC can outperform the heuristic that generated its training data, do policies trained on optimal schedules achieve near-optimal quality on the same instances?
2. Why can policies trained on random datasets outperform near-optimal data? Imitation learning is sensitive to dataset quality, yet CDQAC sometimes shows the opposite trend.
3. Against modern online RL and imitation learning, how close is CDQAC in solution quality?
4. Could you provide learning curves of CDQAC across training-set size and composition (random/PDR/GA/near-optimal mixtures)?
Typo
“Learning-based methods for Scheduling Problems” → capitalization (L80)
“Demirkol 80x20” → it may be 50x20 (L865) |
Lightly AI-edited |
|
Activation-Guided Regularization: Improving Deep Classifiers using Feature-Space Regularization with Dynamic Prototypes |
Soundness: 3: good
Presentation: 3: good
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work introduces Activation-Guided Regularization (AGR). AGR combines standard cross-entropy training with a secondary objective that encourages a sample's features to be similar to a class prototype computed on-the-fly. The work finds that this improves performance across several architectures and model classes.
* The paper is clearly written.
* The idea is both reasonable and straightforward to implement, increasing practicality.
* The results consistently improve over standard cross-entropy on the datasets tested.
* There is a substantial lack of comparison to baselines. Supervised contrastive learning [1] also results in much tighter clusters, for example. Even though it comes at the cost of using a larger batch size with pair-wise comparisons, it is worth comparing to. Similarly, label smoothing has been shown to have a very similar effect (tightening class clusters) in [2]. I believe this is a major issue in being able to understand how well this works.
* In Section 3.3, the authors state that training begins from a pre-trained checkpoint (the best standard model), which complicates a like-for-like comparison with standard training.
* The datasets are quite small for assessment and perhaps close to saturated; many accuracies are over 90%.
[1] Khosla, Prannay, et al. "Supervised contrastive learning." Advances in neural information processing systems 33 (2020): 18661-18673.
[2] Müller, Rafael, Simon Kornblith, and Geoffrey E. Hinton. "When does label smoothing help?." Advances in neural information processing systems 32 (2019).
* Could the authors discuss competing methods such as supervised contrastive learning [1] and label smoothing [2], and perhaps show some results?
* How would performance change for standard cross entropy under similar continued training as AGR?
* Could the authors try on a harder dataset, maybe full ImageNet?
[1] Khosla, Prannay, et al. "Supervised contrastive learning." Advances in neural information processing systems 33 (2020): 18661-18673.
[2] Müller, Rafael, Simon Kornblith, and Geoffrey E. Hinton. "When does label smoothing help?." Advances in neural information processing systems 32 (2019). |
Fully human-written |
|
Activation-Guided Regularization: Improving Deep Classifiers using Feature-Space Regularization with Dynamic Prototypes |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes AGR, a loss function closely related to the standard CE loss, which is claimed to significantly improve intra-class and inter-class feature distributions.
1. In a considerable number of experiments on multiple datasets, AGR shows significant improvements.
2. The authors provide a fair amount of visualization.
1. The authors claim that AGR "can always improve" performance. However, in some experiments AGR has a clearly negative effect, so this claim is inaccurate; "low-capacity models" should be clearly defined, and sufficient experiments should be provided to show readers the precise scope of applicability.
2. The superiority of the proposed method should be established through comparison. In fact, there are many strong works on self-supervised or regularization losses (the authors also list them in related work); even if their implementation ideas differ, including them as baselines would strengthen the paper's persuasiveness.
3. The lack of key information such as the GPU model will significantly affect reproducibility.
4. In terms of writing, the authors should state clear contribution points and provide more illustrations to help readers understand quickly; the page limit allows for this.
5. In Section 3.2, the paragraph headings sometimes end with "." and sometimes with ":". In the caption of Fig. 2 (page 19, line 3) and the caption of Fig. 1 (page 18, line 3), what is "ViTResNet50"? In Algorithm 1, y_b should also appear in the Require list.
6. Some unclear editing: missing line numbers and incorrect references to some formulas and figures, which appear as links that do not resolve.
1. In Algorithm 1, it is clearly stated that the CE loss remains part of the computational steps. Is AGR a weighted combination of the regularization term and CE, or a term equivalent to CE?
2. Could the authors re-explain why the word "activation" is necessary in the name? Clarifying this process would greatly help understanding.
3. Could the authors demonstrate the performance on larger datasets (e.g., ImageNet-1K)?
If the author can clearly explain the reasons for the current experimental design, I will consider improving my score. |
Fully human-written |
|
Activation-Guided Regularization: Improving Deep Classifiers using Feature-Space Regularization with Dynamic Prototypes |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper introduces a technique dubbed Activation-Guided Regularisation (AGR). AGR consists of two key components: (i) a regularisation term that encourages the last-layer activations to align with some class prototypes, which are defined self-consistently from the activations of correctly classified samples, and (ii) the replacement of the standard prediction with a linear combination of the network's logits and a similarity-based prediction derived from the class prototypes. The authors demonstrate that AGR enhances classification accuracy across standard image classification benchmarks. They further investigate its effect on representation learning by analysing intra- and inter-class distances in the learned feature space, and by providing t-SNE and UMAP visualisations of the resulting embeddings. Additional experiments show that AGR leads to improved robustness to input corruptions, few-shot transfer performance, and interpretability of learned representations.
- The paper is clearly written and easy to follow;
- The proposed method is convincing---the intuitive explanation in Section 3.1 is particularly appreciated---and it seems straightforward to implement;
- The evaluation is comprehensive, covering standard performance on multiple vision benchmarks as well as additional desirable aspects such as robustness and interpretability.
1. The paper lacks references or comparisons for the reported benchmark performances. Also, I would like to know how the method compares with other regularisation schemes (e.g., SAM).
2. The evaluation does not include ImageNet, which is a widely accessible and standard benchmark. Given that many vision models are pre-trained on ImageNet before being adapted to other datasets, such a comparison would be valuable.
3. The visualisation of the learned embeddings is unconvincing: the left and right columns in both groups look very similar, making it hard to see any difference. A quantitative metric could provide a more reliable evaluation.
4. The setup of the few-shot transferability experiment is unclear (possibly not explained at all), and the occlusion sensitivity analysis lacks a quantitative assessment.
All my questions are directly connected to the weaknesses, and I will raise my score if they are addressed convincingly. |
Lightly AI-edited |
|
Activation-Guided Regularization: Improving Deep Classifiers using Feature-Space Regularization with Dynamic Prototypes |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes Activation-Guided Regularization (AGR). It augments standard cross-entropy (CE) training by introducing a prototype-driven similarity distribution: the model’s prediction and a distribution from temperature softmax over cosine similarities between features and class prototypes are linearly mixed with coefficient α, and CE is computed on this mixed distribution. An additional L2 consistency term pulls features toward their class prototype to improve intra-class compactness and inter-class separation (with a Fisher-style motivation). The training schedule employs a warm-up phase with CE only, followed by updates to prototypes per batch using high-confidence samples via a confidence-weighted centroid and momentum. This is then followed by the application of the mixed prediction and consistency regularizer. AGR leaves the classifier head unchanged, requires no pair mining, and can be plugged into existing pipelines as a lightweight, architecture-agnostic regularizer. Experiments on CIFAR-10/100 and ADNI report gains in accuracy, noise robustness, and 10-shot linear probing, suggesting that AGR learns more transferable representations and focuses on more semantic regions.
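For concreteness, the following is a minimal sketch of how I read the composite objective; the names, default weights, and exact blending below are my own assumptions, not code from the paper.

```python
import torch
import torch.nn.functional as F

def agr_style_loss(features, logits, labels, prototypes,
                   alpha=0.5, lam=0.1, tau=0.1):
    """Sketch of my reading of the AGR objective (assumed names and weights).

    features:   [B, D] last-layer embeddings
    logits:     [B, C] classifier outputs
    labels:     [B]    ground-truth class indices
    prototypes: [C, D] class prototypes, maintained elsewhere via an EMA over
                high-confidence samples (details left unspecified in the paper)
    """
    # Prototype-similarity distribution: temperature softmax over cosine similarities.
    sims = F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).T  # [B, C]
    p_proto = F.softmax(sims / tau, dim=1)

    # Blended prediction: mix the model's distribution with the prototype distribution.
    p_mix = alpha * F.softmax(logits, dim=1) + (1 - alpha) * p_proto
    ce_blended = F.nll_loss(torch.log(p_mix + 1e-8), labels)

    # L2 consistency term pulling each feature toward its class prototype (Eq. 7).
    consistency = ((features - prototypes[labels]) ** 2).sum(dim=1).mean()

    return ce_blended + lam * consistency
```

If this reconstruction is roughly right, the blended CE term alone already injects prototype information into the predictions, which is exactly why the missing ablation discussed below (standard CE + consistency term only) matters.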
- Lightweight and easy to deploy: AGR is an architecture-agnostic regularizer that uses nonparametric class prototypes, requires no pair mining, keeps the classifier head unchanged, and drops into existing training pipelines with low engineering cost.
- Clear motivation and formulation: the method links improvements to intra-class compactness and inter-class separation, formalizes a mixed probability cross-entropy with a coefficient alpha, and adds an L2 consistency term that pulls features toward class prototypes in a way consistent with a Fisher-style objective.
- Strong qualitative evidence: feature space visualizations show tighter and more separable clusters, and attention or activation maps focus on semantically meaningful regions, supporting the claim that AGR improves representation geometry.
- Omission of the dynamic prototype update math formula and part of the implementation details
The paper relies on vague terms like confidence-weighted centroid and momentum update, but omits the actual equation. The momentum parameter ($\beta$), the exact confidence-weighting function, and the final EMA formula are all absent. This omission makes the core innovation a black box.
Meanwhile, there is a logically impossible update condition. The authors state that a prototype update is triggered only when at least $m=50$ high-confidence samples for a class are present in the mini-batch. However, the appendix explicitly defines the default batch size as 16. It is mathematically impossible for a batch of 16 samples to contain 50 samples of any kind.
- Ambiguous metric definition in the consistency loss
The definition of the core consistency loss is critically ambiguous. Equation (7) and Algorithm 1 define the loss over the raw, un-normalized feature embeddings ($\sum ||\phi_{\theta}(x_{i})-\mu_{y_{i}}||^{2}$). However, the text immediately following Eq. (7) directly contradicts this, stating: "In practice we compute the consistency term on L2-normalized features and prototypes for numerical stability". These are two fundamentally different loss functions with different geometric implications: one minimizes Euclidean distance in the embedding space, while the other optimizes for cosine similarity on a hypersphere. The paper fails to clarify which was actually used.
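To spell out why the two readings are not interchangeable, here is a self-contained toy comparison (shapes and values are illustrative only, not taken from the paper):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feat = torch.randn(8, 64)            # [B, D] raw embeddings (illustrative shapes)
protos = torch.randn(10, 64)         # [C, D] class prototypes
labels = torch.randint(0, 10, (8,))  # [B]

# Reading 1 (Eq. 7 / Algorithm 1): Euclidean pull in the raw embedding space;
# penalizes both the direction and the norm of the feature.
loss_raw = ((feat - protos[labels]) ** 2).sum(dim=1).mean()

# Reading 2 (text after Eq. 7): the same expression on L2-normalized vectors,
# which equals 2 - 2*cos(feat, proto) and therefore only constrains direction.
f_n = F.normalize(feat, dim=1)
p_n = F.normalize(protos[labels], dim=1)
loss_norm = ((f_n - p_n) ** 2).sum(dim=1).mean()

print(loss_raw.item(), loss_norm.item())  # generally very different magnitudes
```

The two losses differ in what they constrain (norm and direction vs. direction only), so stating which one was actually used is not a cosmetic detail.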
- Experiments
**Missing ablations.** The paper's central claim that the L2 consistency term is the primary driver of performance is not adequately supported by the experimental design. The proposed $\mathcal{L}_{\text{AGR}}$ is a composite objective, combining: (1) a prototype-based L2 consistency term and (2) a blended cross-entropy loss $\mathcal{L}_{CE_{Blended}}$, which itself acts as a powerful form of prototype-guided label smoothing or knowledge distillation. However, the provided ablation study only removes the consistency term, leaving the $\mathcal{L}_{CE_{Blended}}$ intact. This test resulted in only a minor performance drop on CIFAR-100 (from 81.07% to 80.95%), suggesting that the L2 term may not be the primary contributor. The authors failed to include the most critical control experiment: Standard CE + L2 Consistency Term only. Without this, it is impossible to disentangle the gains from the feature-space regularization (the paper's claimed contribution) vs. the prediction-space regularization (a significant confounding variable).
**Insufficient baseline comparisons.** The paper evaluates AGR almost exclusively against a conventional cross-entropy baseline. This comparison is insufficient to position the work or validate its novelty. For example, the L2 consistency term is a direct descendant of prototype- and proxy-based metric learning losses (e.g., Center Loss or Proxy-NCA). The $\mathcal{L}_{CE_{Blended}}$ component closely mimics the behavior of Label Smoothing (LS) and Knowledge Distillation (KD). These relevant baselines should be included to determine whether AGR's gains are novel or merely recapitulate the known benefits of existing, simpler techniques.
**Missing sensitivity analysis.** The method introduces at least four new critical hyperparameters ($\alpha$, $\beta$, $\tau$, $\lambda$), yet the paper provides no sensitivity analysis, grid search, or robustness checks for any of them.
- Minor
There are many careless typos. For example, the appendix (C.4) refers to optimizing a non-existent Eq. (8).
- A critical question: The prototype update mechanism's reliance on the model's own high-confidence predictions. Does this design not create a significant risk of a self-fulfilling prophecy or a confirmation bias feedback loop? For example, if the model learns a spurious correlation in its early stages, such as associating grass with the cow class, and generates high-confidence incorrect predictions based on this error, will the AGR mechanism not erroneously crystallize this spurious feature into the class prototype, thereby reinforcing the model's initial mistake rather than correcting it?
- Given the proposed method's strong conceptual resemblance to Center Loss, could the authors elaborate further on their key distinctions? I am particularly interested in a deeper comparison of how AGR's non-parametric, confidence-weighted prototype update mechanism differs from Center Loss's gradient-based learnable centers, both in terms of theoretical optimization dynamics and the empirical characteristics of the resulting prototype/center distributions during training.
- Have the authors considered framing their method and its objectives through the lens of Neural Collapse (NC) theory? This theoretical perspective seems highly relevant to the paper's goals of enforcing intra-class compactness, and while I am not requesting such an analysis, adopting this viewpoint might help in further strengthening the paper's narrative. |
Fully AI-generated |
|
FedMAP: Meta-Driven Adaptive Differential Privacy for Federated Learning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This work proposes a method called FedMAP, which enhances federated learning’s defense against gradient inversion and membership inference attacks. FedMAP equips each client with a fine-tuned MetaNet that predicts clipping bounds and noise scales based on gradient statistics. On the server side, a Rényi differential privacy accountant is employed to track each client’s privacy cost and compute the overall global expenditure, which is then broadcast to all clients to constrain cumulative loss and guide adaptive local updates. Empirical experiments on standard federated learning benchmarks demonstrate that FedMAP provides stronger protection against both gradient inversion and membership inference attacks compared to existing baselines.
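For concreteness, my understanding of a single FedMAP local step is sketched below on a toy linear model; the feature set fed to the MetaNet, its interface, and all names are my own assumptions, since the paper does not fully specify them.

```python
import torch

def fedmap_local_step(w, X, y, metanet, lr=0.1):
    """One hypothetical FedMAP local step on a linear model (my reconstruction,
    not the authors' code): the MetaNet maps gradient statistics to (C, sigma),
    then a standard clipped-and-noised DP-SGD update is applied."""
    # Per-sample gradients of the squared loss for a linear model: g_i = (x_i.w - y_i) * x_i
    residuals = X @ w - y                        # [B]
    per_sample_g = residuals[:, None] * X        # [B, D]
    norms = per_sample_g.norm(dim=1)             # [B]

    # Gradient statistics fed to the MetaNet (the exact four features are my guess).
    stats = torch.stack([norms.mean(), norms.std(), norms.max(), norms.median()])
    C, sigma = metanet(stats)                    # predicted clipping bound and noise scale

    # DP-SGD update with the predicted parameters (noise std = sigma * C on the sum).
    clipped = per_sample_g * torch.clamp(C / (norms + 1e-12), max=1.0)[:, None]
    noisy = clipped.sum(0) / len(X) + torch.randn_like(w) * sigma * C / len(X)
    return w - lr * noisy

# Toy usage: a placeholder MetaNet standing in for the fine-tuned BERT-tiny head.
metanet = lambda s: (1.0 + s[0].item(), 1.1)     # hypothetical (C, sigma) predictor
w = torch.zeros(5)
X, y = torch.randn(16, 5), torch.randn(16)
w = fedmap_local_step(w, X, y, metanet)
```

This is only my reconstruction; in particular, whether the predicted $\sigma$ always satisfies the lower bound required by the Gaussian mechanism is exactly what Question 4 below asks about.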
1. The paper is easy to follow.
2. The authors address both gradient inversion and membership inference attacks in federated learning, which is a challenging and important problem.
3. The idea of using a neural network to predict clipping thresholds and noise scales for differential privacy mechanisms is promising.
4. The authors provide convergence results to support the theoretical soundness of the proposed FedMAP method.
5. Extensive experiments demonstrate the effectiveness of FedMAP against multiple attack methods. Moreover, the model accuracy achieved by FedMAP remains close to that of the non-private baseline.
1. The paper lacks a detailed description of the defense and attack models, which is crucial for helping readers understand the setup and assumptions of the considered DP-FL system.
2. The proofs of Theorems 1 and 2 are missing, preventing readers from verifying their details and correctness.
3. The rationale for selecting the four specific features as inputs to the MetaNet is not well justified, and further explanation or empirical evidence would strengthen this design choice.
1. Regarding the fine-tuning of the MetaNet, does this process occur on the client side, performed independently by each client? Clarifying where and how this fine-tuning is conducted would help readers better understand the workflow.
2. If the above is true, and a client is currently training on the CIFAR dataset, is the MetaNet fine-tuned specifically on that client’s CIFAR data, or on a mixture of datasets such as CIFAR, FMNIST, and SVHN? The explanation in lines 174–175 of the paper is unclear and should be elaborated.
3. The rationale for constructing the labels of $C$ and $\sigma$ to be proportional to empirical observations is not well justified. This label design appears ad hoc and lacks theoretical or empirical support.
4. As shown in Inequality (3), the Gaussian mechanism requires the noise scale to exceed a certain lower bound to ensure differential privacy. How does the MetaNet guarantee that the predicted noise scale always satisfies this requirement? This concern is especially important given that the proposed system does not seem to employ secure aggregation. If the server is semi-honest, it may still attempt inference attacks despite added noise.
5. How do the authors derive Inequality (10)? A step-by-step derivation or reference to supporting materials would improve clarity.
6. What is the practical meaning or role of $q_{max}$ in the paper? Its definition and influence on the overall algorithm are not clearly explained. |
Lightly AI-edited |
|
FedMAP: Meta-Driven Adaptive Differential Privacy for Federated Learning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents FEDMAP, a closed-loop adaptive differential privacy (DP) framework for federated learning (FL). The key idea is to dynamically predict clipping thresholds and noise scales using a lightweight MetaNet based on a frozen-layer BERT-tiny architecture. FEDMAP further introduces global privacy accounting via Rényi DP and global feedback regularization to align local DP spending with global privacy budgets. Experiments on CIFAR-10, SVHN, and Fashion-MNIST demonstrate improved privacy-utility trade-offs and stronger robustness to gradient inversion and membership inference attacks compared to DP-SGD, Soteria, and CENSOR. Theoretical convergence guarantees and DP analyses are provided, and extensive ablations show sensitivity to client participation and hyperparameters.
Overall, the work is timely, well-motivated, and empirically solid. The adaptive privacy calibration idea is intuitive and practical. However, some algorithmic and training details are ambiguous, experiments on larger models/datasets are missing, and several theoretical statements lack formal proofs.
1. **Adaptive DP calibration**. The framework introduces flexible and client-specific DP noise and clipping schedules, addressing client heterogeneity in FL.
2. **Meta-learning-based privacy control**. A lightweight BERT-tiny MetaNet effectively maps gradient statistics to DP parameters, demonstrating a novel use of meta-learning for privacy.
3. **Global privacy loss regularization**. The feedback mechanism aligns local DP spending with global budgets and prevents over-consumption of privacy.
4. **Theoretical grounding**. Convergence bounds and DP accounting provide theoretical credibility.
5. **Strong empirical validation across attacks**. Experiments evaluate multiple attacks and show competitive robustness and utility against baselines.
1. **Unclear MetaNet training and update procedure (critical).**
It is ambiguous whether MetaNet parameters are frozen during private training or continually updated. The algorithm suggests frozen transformer layers and trainable heads during pretraining, but does not clarify if they continue updating during FL, and how privacy and global penalties influence MetaNet outputs during training.
2. **Scalability to large-scale FL is uncertain.**
All experiments use small vision models (ResNet-18, LeNet) and datasets. It remains unclear if the method scales to transformers or large-scale NLP tasks, where computing gradient statistics and MetaNet inference might incur overhead.
3. **Missing proofs in Appendix.**
The main theorems are stated without formal proof details, which reduces theoretical rigor.
1. **Difference between $D_t$ and $VarGrad_t$**.
In Eq. (4), both statistics quantify gradient variability. What unique information does each contribute? Is there a redundancy?
2. **Cost of computing gradient statistics.**
For large models, computing $\|g\|_2$ and covariance-based metrics could be expensive. Can the author provide a detailed comparison of the actual runtime overhead on different architectures?
3. **MetaNet training and DP interaction.**
- Are MetaNet parameters frozen during private training?
- If frozen, how can global penalty terms meaningfully influence privacy control beyond inference?
- If trainable, how are MetaNet updated? Do they follow the same FL training procedure as the main model?
Clarifying this point is crucial to judge correctness and privacy guarantees. |
Fully AI-generated |
|
FedMAP: Meta-Driven Adaptive Differential Privacy for Federated Learning |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes a novel Federated Learning framework that protects against membership inference and reconstruction attacks under Differential Privacy. At a high level, the paper fine-tunes a BERT-based model to predict the hyperparameters (C and $\sigma$) of the DP-SGD algorithm so as to preserve the privacy of the clients' data. The paper also focuses on a scenario in which each client has their own privacy budget, and proposes an updating mechanism and objectives tailored to this setting. The paper conducts extensive experiments to highlight the advantages of the proposed method.
- The proposed method provides flexibility and automation in tuning DP-SGD for different clients.
- Extensive Theoretical and Experimental results are provided to support the advantage of the proposed methods.
- The motivation of the work is not convincing. Specifically, it does not explain why different clients require different privacy budgets. For instance, if we consider FL for medical data, how can a client define a privacy budget to protect this sensitive data? Previous works have approached adaptive hyperparameters (C and $\sigma$) from a performance perspective, which is more convincing. Thus, I suggest that the author clarify this point.
- Secondly, the curation process for the label of C and $\sigma$ is unclear and not optimal. How do you determine that the curated labels are the best for the subsequent iterations? Isn't this process empirical, and does the curator need to tune different C and $\sigma$ at each iteration?
- Thirdly, although the paper considers different privacy budgets for different clients, this is not reflected in the proposed method, nor is it clear how the method integrates this information. Furthermore, the loss function in Eq. 12 will encourage each client to have the same privacy budget, which contradicts the paper's goal.
- Next, given a predicted C and $\sigma$ from the meta model, the proposed method is not guaranteed to meet each client's predefined privacy budget, resulting in weak protection. How do you ensure that the consumed privacy budget stays below the budget predefined by the clients?
- Finally, the experimental results for Table 1, Figure 3, and 4 do not mention the privacy budget for each client, which reduces the validity of these experiments.
- Please address all the points in the Weaknesses section. |
Fully human-written |
|
FedMAP: Meta-Driven Adaptive Differential Privacy for Federated Learning |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 0:
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
An overly complex method is proposed for private federated learning that purports to achieve a better utility/privacy tradeoff. The method is flawed because it does not account for the privacy loss of the MetaNet mechanism.
No notable strengths.
In exactly the same way as another paper I have just reviewed (evidently by the same authors, since much of the text is copied), this method is overly complex and the purported gains are not supported by the experiments, and in any case would not justify the implementation and pretraining costs.
To support a claim of improved privacy-utility trade-off, one must compare model utility (e.g., test accuracy) while holding $\varepsilon$ constant across all methods. The current experiments (e.g., Figure 3) compare utility against communication rounds, which is insufficient to demonstrate a superior trade-off. Nowhere in Section 4.1 does it state that all methods were calibrated to achieve the same total privacy budget $\epsilon$ for a fair comparison.
However the most problematic issue is that the mechanism is flawed: in fact it does not provide a formal DP guarantee for any level of $\epsilon$. It releases data-dependent parameters without accounting for their privacy cost. The server broadcasts $\varepsilon_\text{global}$ to all clients at each round. $\varepsilon_\text{global}$ is dependent on the data of clients from the previous round. Suppose an honest-but-curious client $A$ participating at rounds $t-1$ and $t$ observes a large increase in $\varepsilon_\text{global}$. Client $A$ could deduce that some other client $K$ at round $t-1$ likely had low-variance per-sample gradients, leading to a small noise scale $\sigma_K^{(t)}$. Since $\varepsilon_\text{global}$ is determined deterministically, $A$ can distinguish with certainty between two datasets $\mathcal{D}$ and $\mathcal{D'}$ that are identical except that in $\mathcal{D}$, $K$ has low variance per-sample gradients, while in $\mathcal{D'}$, $K$ has high variance per-sample gradients. This violates the definition of DP.
Is anything I stated in the weaknesses section incorrect? |
Fully human-written |
|
SPO: A Black-box, Unbiased, Robust Watermarking Method for Large Language Model |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper introduces SPO, a novel watermarking method for Large Language Models (LLMs). The method aims to be black-box, unbiased, and robust, tackling the well-known "trilemma" in the watermarking domain. Its core mechanism involves sampling N candidate tokens from the model, partitioning them into L sub-vocabularies (buckets) based on a prioritized allocation scheme, and then sampling the final output token from a key-specified "watermark bucket." The authors provide a theoretical proof for the unbiasedness of their method and conduct a series of experiments to validate its performance in terms of detectability, robustness, and quality preservation.
1. The paper addresses a critical challenge in the field. The three properties of "black-box," "unbiased," and "robust" are central to the practical deployment of LLM watermarking, and developing a method that can simultaneously satisfy them is a valuable research direction. The design of SPO thoughtfully considers this trilemma.
2. A mathematical proof of the unbiasedness of the SPO method is presented.
1. **Incomplete Related Work and Lack of Comparison with Key Baselines**: The paper claims to balance black-box access, unbiasedness, and robustness, yet it overlooks several seminal and highly relevant works that have pursued similar goals.
- KTH [1]: This is one of the foundational works in generative text watermarking. Their method also operates in a black-box setting and considers unbiasedness and robustness. Its complete omission from the paper is a major oversight.
- Unigram [2]: A black-box method that achieves very high robustness, serving as an important baseline for that property.
- SIR [3]: A black-box method that achieves very high robustness, serving as an important baseline for that property.
- SynthID-Text [4]: A black-box and unbiased method from Google DeepMind (published in Nature).
The absence of comparisons with these methods makes it difficult for readers to accurately assess the true novelty and advantages of SPO within the existing literature.
2. **Insufficient Experimental Evaluation**: The experimental design is not comprehensive enough to fully support the paper's claims of superior performance.
- Detectability Evaluation: In the detectability experiments (Table 2), the authors only report results for a generation length of 100 tokens. It is standard practice in watermarking research to evaluate performance across various lengths (e.g., 50, 100, 200), as watermark strength is closely tied to text length. Presenting results for a single length is insufficient.
- Robustness Evaluation: The robustness evaluation (Table 3) is limited to only a "replacement attack." However, real-world attacks are far more diverse. Common and critical attacks such as paraphrasing attacks (e.g., using another LLM to rewrite the text) and random deletion/insertion attacks are not tested. This makes the paper's assessment of robustness appear overly optimistic and incomplete.
3. **Minor: Formatting and Presentation Issues**: The paper's formatting has minor issues that affect its professional appearance. For instance, the captions for Tables and Figures are center-aligned, which deviates from standard academic formatting. Additionally, some captions lack a terminal period (e.g., the caption for Table 1), leading to inconsistency.
**Reference**:
[1] Robust Distortion-free Watermarks for Language Models
[2] Provable Robust Watermarking for AI-Generated Text
[3] A Semantic Invariant Robust Watermark for Large Language Models
[4] Scalable Watermarking for Identifying Large Language Model Outputs
1. Could you please revise the related work section to include and discuss the key methods mentioned above (especially KTH, Unigram, SIR, and SynthID-Text)? More importantly, could you supplement the experimental section with direct comparisons against these baselines to more convincingly demonstrate the relative advantages and trade-offs of SPO?
2. To make the experimental validation more comprehensive and credible, would it be possible to:
- In the detectability section, add results for longer generation lengths (e.g., 200 tokens) and provide an analysis?
- In the robustness section, expand the evaluation to include more attack types, particularly paraphrasing attacks and random deletion/insertion attacks, to fully test SPO's resilience in practical scenarios? |
Fully AI-generated |
|
SPO: A Black-box, Unbiased, Robust Watermarking Method for Large Language Model |
Soundness: 1: poor
Presentation: 2: fair
Contribution: 1: poor
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper claims to have developed a watermarking scheme that achieves unbiasedness, high detectability, and robustness simultaneously. The scheme consists of two components: vocabulary division and sampling. The authors claim that, using this method, the generated text is guaranteed to be highly detectable and unbiased, via black-box access to the un-watermarked model only.
Watermarking LLMs is a timely topic and the motivation for this work is strong: robust, detectable, and unbiased watermarking for LLMs via black-box access only.
1. The method highly resembles existing work (e.g., KGW and STA-M), offering limited novelty.
2. The paper lacks rigor: a) there is no formal or empirical analysis of unbiasedness or of the utility of the generated text; b) these concepts are not even clearly defined; c) the proofs provided are heuristic rather than rigorous, and key claims (e.g., on unbiasedness) are not supported by statistical validation.
3. There is no qualitative or quantitative analysis of the trade-off between unbiasedness and detectability of the watermark. The lack of such evaluation undermines the claimed balance among robustness, detectability, and unbiasedness.
4. The utility of the generated text is questionable. In Table 1, all watermarking methods—biased and unbiased—achieve nearly identical BERTScore and ROUGE values, suggesting that the evaluation metrics are not sensitive or that the setup lacks proper control. The uniformity of results raises doubts about the soundness of the experimental design and reproducibility. Given W2-3, I cannot trust the evaluation results.
1. Please provide a formal definition and rigorous analysis of “unbiasedness.” How is it measured both theoretically and empirically?
2. Clarify how SPO differs fundamentally from STA-M and KGW beyond vocabulary partitioning and sampling procedure.
3. Conduct an explicit analysis of the trade-off between unbiasedness and detectability. Include both quantitative plots and qualitative examples.
4. Re-examine the evaluation design: were all BERTScore and ROUGE metrics computed on the same generated text? If yes, explain why identical scores appear across all methods. If not, clarify the setup and variance sources.
5. Discuss whether the observed robustness is intrinsic to the method or a byproduct of sampling randomness. |
Lightly AI-edited |
|
SPO: A Black-box, Unbiased, Robust Watermarking Method for Large Language Model |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The paper proposes SPO (Sampling and Prioritizing Output), a new watermarking method for large language models. SPO works as a black-box technique, meaning it does not require access to model parameters or logits. It divides the vocabulary into multiple subvocabularies and samples several candidate tokens for each output step. These tokens are placed into corresponding subspaces based on their vocabularies, with overflow tokens stored in a queue and redistributed to keep every subspace full. A watermark subspace is then randomly chosen, and one token is uniformly sampled from it as the output. This process embeds a watermark while keeping token probabilities unchanged, ensuring unbiased generation.
SPO’s main novelty lies in its overflow queue and multi-subvocabulary division, which together maximize the number of valid watermarked tokens, strengthening detection without distorting the output distribution. The paper’s primary contribution is a simple, general algorithm that simultaneously achieves black-box, unbiased, and robust watermarking.
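To check my understanding of the embedding step, here is a minimal sketch of one output-token decision; the keying, the partition, and the backfill order are simplifications of Algorithms 1–3 on my part, not the authors' code.

```python
import random

def spo_style_step(sample_token, vocab, key, step, N=20, L=20):
    """One SPO-style embedding step (my paraphrase; assumes L divides N).

    sample_token: callable returning one token id sampled from the black-box LLM.
    vocab:        list of all token ids.
    """
    rng = random.Random(hash((key, step)))

    # Keyed partition of the vocabulary into L sub-vocabularies.
    shuffled = list(vocab)
    rng.shuffle(shuffled)
    bucket_of = {tok: i % L for i, tok in enumerate(shuffled)}

    # Sample N candidates and allocate them to subspaces; overflow goes to a queue.
    capacity = N // L
    subspaces, queue = [[] for _ in range(L)], []
    for _ in range(N):
        tok = sample_token()
        b = bucket_of[tok]
        (subspaces[b] if len(subspaces[b]) < capacity else queue).append(tok)

    # Backfill under-full subspaces from the queue so every subspace ends up full.
    for b in range(L):
        while len(subspaces[b]) < capacity and queue:
            subspaces[b].append(queue.pop(0))

    # The key picks the watermark subspace; emit one token uniformly from it.
    return rng.choice(subspaces[rng.randrange(L)])

# Toy usage: a fake "LLM" that samples uniformly over a 100-token vocabulary.
vocab = list(range(100))
print(spo_style_step(lambda: random.choice(vocab), vocab, key=42, step=0))
```

Even in this toy form, the N-samples-per-output-token cost is explicit, and the detection statistic (how often emitted tokens land in the keyed subspace) follows directly.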
**Originality.** The paper introduces SPO, a black-box, unbiased watermark method. SPO’s main design novelties are the combination of overflow queue backfilling and multi-subspace vocabulary division. Previous black-box approaches suffer from weak robustness or statistical bias—STA-M improves robustness only by breaking unbiasedness, while unbiased reweighting methods depend on model logits and lose the black-box property. SPO avoids both problems by partitioning the vocabulary into multiple randomized subvocabularies, allocating sampled tokens into corresponding subspaces, and using a queue to backfill overflow so that all subspaces remain balanced. This ensures the embedding process preserves the original token distribution while maximizing the number of watermarked tokens. The authors also propose an early-exit optimization (Algorithm 2) to stop allocation once the watermark subspace is filled, cutting embedding time roughly in half without affecting statistical properties.
**Quality and clarity.** The method and assumptions are clearly presented (Algorithms 1–3). Empirical evaluation is thorough: (i) unbiasedness is tested via MT/TS quality metrics showing parity with no-watermark baselines (Table 1); (ii) robustness is examined under addition, deletion, and replacement attacks with AUC reported across generation lengths (Tables 3–5); and (iii) applicability is demonstrated across models and datasets (Figure 4; Tables 8–10). The theoretical appendix walks carefully from simple to general cases, closely matching the algorithmic design.
**Significance.** Practically, the combination of black-box, unbiased, and robust watermarking addresses what deployment needs: compatibility with closed models, preserved output quality, and resistance to simple editing. The reported gains are meaningful—at N=20, L=20, SPO’s TPR surpasses existing unbiased methods and approaches or exceeds biased ones (Table 2), while remaining resilient to token-level perturbations (Table 3).
**Compute and latency budget.** SPO requires N candidate samples per token. The paper would benefit from throughput and latency benchmarks across different N,L configurations on GPUs and hosted APIs, along with a Pareto frontier (AUC/TPR vs. tokens/sec). Including a monetary cost estimate per 1k tokens would make deployment trade-offs explicit.
**Adding semantic attacks**. The robustness evaluation focuses on token-level add/delete/replace perturbations at fixed rates. While most prior work follows the same practice, it would strengthen the study to include semantic or paraphrase attacks (e.g., round-trip translation or LLM-based rewriting), which are increasingly common today.
**Token IID Assumption.** The Z-test assumes independent tokens and a perfect Binomial process (standard in the literature, but unrealistic since natural language is highly autocorrelated). To obtain an empirical distribution of the detection statistic under the null, consider a bootstrap calibration (a minimal sketch follows the list below):
- For a given prompt distribution and decoding config, generate M non-watermarked samples.
- For each, compute the hit count (or Z-stat as defined).
- Set the threshold to the $(1-\alpha)$-quantile of this empirical distribution.
- Report the calibrated FPR on held-out non-watermarked samples, and the TPR on watermarked samples.
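A minimal sketch of this calibration, with synthetic stand-ins for the score distributions (the Binomial placeholders below are purely illustrative assumptions, not the paper's statistics):

```python
import numpy as np

def calibrate_threshold(null_scores, alpha=0.01):
    """Set the detection threshold to the empirical (1 - alpha)-quantile of the
    detection statistic computed on non-watermarked generations."""
    return np.quantile(null_scores, 1 - alpha)

# Synthetic stand-ins for per-text hit counts (illustrative only).
rng = np.random.default_rng(0)
null_scores = rng.binomial(n=100, p=1 / 20, size=2000)   # non-watermarked texts
wm_scores = rng.binomial(n=100, p=0.35, size=2000)       # watermarked texts

thr = calibrate_threshold(null_scores, alpha=0.01)
print("threshold:", thr)
print("calibrated FPR:", float((null_scores > thr).mean()))
print("TPR at that threshold:", float((wm_scores > thr).mean()))
```

Reporting the threshold, the achieved FPR, and the corresponding TPR in this way would make the detection claims robust to violations of the token-independence assumption. |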
Lightly AI-edited |
|
SPO: A Black-box, Unbiased, Robust Watermarking Method for Large Language Model |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces SPO (Sampling and Prioritizing Output), a novel watermarking method for Large Language Models (LLMs) designed to simultaneously achieve black-box embedding, unbiasedness, and robustness. The core idea of SPO is as follows: for each token to be generated, N candidate tokens are first sampled from the model in a black-box manner. These candidates are then distributed into L subspaces (corresponding to vocabulary partitions) through an innovative "Sampling and Prioritizing" mechanism, which uses a queue to handle uneven distributions and ensure the final output is unbiased. Finally, a token is randomly sampled from a single "watermarked subspace," chosen based on a secret key, to embed the watermark.
The main contributions of the paper are threefold:
It proposes SPO, a novel black-box watermarking framework, featuring an original "prioritizing output" mechanism.
It proves, both theoretically (with a detailed mathematical proof in the appendix) and empirically, that the SPO method is unbiased, meaning it does not alter the original model's output distribution in expectation, thus preserving the quality of the generated content.
Through extensive experiments, it demonstrates that SPO, while maintaining unbiasedness, achieves significantly better detectability and robustness than existing unbiased watermarking methods. In some cases, its performance is comparable to or even surpasses that of biased methods, successfully striking a superior balance among the three key metrics.
1. The paper's strength lies in the originality of its SPO mechanism. By using a "prioritized allocation + queue-based redistribution" approach, it cleverly solves the problem that the uneven distribution of tokens under black-box sampling would otherwise break unbiasedness.
2. It not only proposes a new method but also validates it dually from a theoretical standpoint (a detailed proof of unbiasedness) and a practical one (extensive comparative experiments). This tight integration of theory and practice makes the paper's conclusions highly reliable and convincing.
3. This paper achieves a better balance in the "impossible triangle" of watermarking. The experimental results, especially the robustness tests (Table 3), show that SPO's performance degrades minimally under attack, far outperforming other unbiased methods. This implies that the SPO watermark is much harder for malicious users to erase in the real world, giving it high practical value.
1. To generate a single output token, SPO requires sampling N candidate tokens from the LLM. This means the computational cost (or API call cost) of text generation is approximately N times higher. In the experiments, N is set to 20 to achieve optimal performance, implying a 20x overhead. Although the authors propose an optimized algorithm in Appendix B.1 to terminate loops early, this only reduces SPO's internal computation and cannot reduce the N sampling calls to the LLM. The main paper lacks a sufficient discussion of this cost, which is crucial for assessing the method's practical feasibility.
2. The method's performance (and cost) is highly dependent on the hyperparameters N and L. The paper shows excellent results for N=20, L=20, but for lower-cost settings (e.g., N=4, L=2), the performance advantage, while still present, is less dramatic. In a practical application, how should a user choose N and L to trade off between performance and cost? The paper lacks an in-depth analysis or guiding principles for this trade-off.
3. Appendix C.7 notes that increasing L increases the probability that a single token is "erased" by a random modification (1 - 1/L). This seems to contradict the experimental finding that increasing L improves overall robustness (AUC). The authors explain that the Z-test is multi-dimensional, but I find this explanation incomplete; they do not provide a deeper theoretical account of why SPO's overall detection mechanism can withstand this increase in single-point vulnerability.
1. Could you provide a quantitative analysis of the generation speed? In what application scenarios do you believe this overhead is acceptable?
2. When deploying SPO in practice, what advice would you give users for selecting N and L? Is there a "Pareto front" that could guide users in making a trade-off between performance (e.g., robustness) and computational cost?
3. As I mentioned in the "Weaknesses" section, increasing L theoretically makes a single watermarked token more vulnerable, yet experimentally, the overall robustness improves. Could you provide more intuition or a theoretical explanation for this phenomenon?
Moderately AI-edited |
|
Good allocations from bad estimates |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper fills the gap between estimating CATE and making decisions on the resulting allocations. The authors show that while estimating all CATEs within $\epsilon$ accuracy requires $O(M/\epsilon^2)$ samples, achieving a near-optimal $(1-\epsilon)$ treatment allocation typically needs only $O(M/\epsilon)$ samples under mild distributional assumptions. In general, I personally find the results quite interesting and insightful.
1. The paper makes a clear and elegant theoretical distinction between estimation and allocation. The reduction of the sample complexity from $M/\epsilon^2$ to $M/\epsilon$ is insightful and exciting.
2. Practical relevance: Direct implications for RCT and policy design: significant reduction in sample cost.
3. The proofs are clean and well-structured, and the theoretical results are rigorous.
In general, I enjoy reading the paper a lot. I do not have major concerns.
1. The comparison to bandit best-arm identification could be expanded. The link is conceptually strong; in particular, there has recently been work on good arm identification whose ideas are very similar, although not in a causal inference setting.
2. Policy implication is strong (“RCTs underpowered for CATE estimation can still yield good allocations”), but guidance on how to detect $\rho$-regularity or compute sample sizes in practice is missing.
See above. |
Fully human-written |
|
Good allocations from bad estimates |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors investigate sample complexity bounds for estimating CATE across M different groups when trying to perform allocation to a subset of K of them. Standard CATE analysis yields a $\frac{M}{\epsilon^2}$ bound, as each group needs a separately learned CATE, after which the K highest CATE values are selected. However, the authors argue for a $\frac{M}{\epsilon}$ bound by performing coarser estimation for treatments near the boundary. Essentially, the authors demonstrate that if CATE estimates can be learned to within $\sqrt{\epsilon}$, then any estimation mistakes will not be very costly. Such learning is possible under certain assumptions on $\tau$; for example, that $\tau$ is smooth or near-uniform. The authors conclude with experiments on real-world RCTs to demonstrate the efficacy of their approach.
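To spell out the key step as I understand it (my own reconstruction of the intuition, not the paper's statement), the point is that a boundary mistake costs at most twice the estimation error:

```latex
% If group i is selected over group j, then \hat\tau_i \ge \hat\tau_j.
% If both estimates are \delta-accurate, |\hat\tau_i - \tau_i| \le \delta and
% |\hat\tau_j - \tau_j| \le \delta, the value foregone by that swap is bounded:
\[
  \tau_j - \tau_i
  \;\le\; (\hat\tau_j + \delta) - (\hat\tau_i - \delta)
  \;=\; (\hat\tau_j - \hat\tau_i) + 2\delta
  \;\le\; 2\delta .
\]
% Taking \delta = \sqrt{\epsilon} (rather than \epsilon) therefore caps the cost of
% each boundary mistake at 2\sqrt{\epsilon}; as I read the paper, the distributional
% assumptions on \tau then control how many groups can sit within 2\delta of the cutoff.
```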
1. **Work tackles an important problem** - The authors tackle the important problem of $K$ selection from $M$ groups in a causal setting. Such a problem can be seen across a variety of real-world situations, and is especially prevalent in the world of policy. This can help more efficiently allocate resources and avoid unnecessary experiments.
2. **Analysis is intuitive and clean** - The authors present a clean and intuitive reason why their proposed selector should outperform baselines. By avoiding the need to recompute CATE for each of the $M$ groups, the authors are able to achieve a better sample complexity, due to their ability to adaptively explore in some sense.
3. **Characterizes when their new bound is possible** - The authors describe several scenarios where their newly proposed estimators reach the desired $\frac{1}{\epsilon}$ bound, and describe why such assumptions are not onerous. For example, the authors describe a class of smooth distributions for $\tau$ that allow for such bounds, and they express an if and only if condition based on the CDF of $\tau$.
4. **Extension to Flexible Budget** - In Section 5, the authors include a discussion of flexible budgets, where an alternative budget $K'$ is used near $K$ that achieves better sample complexity performance. The authors sketch when such a method does and does not work, dependent on the distribution of $\tau$.
1. **Experiments are not Extensive** - The experiments in Section 6 are condensed to half a page (with some extra material in the Appendix). As a result, it's hard to understand some of the results. For example, why is the failure percentage not monotonic in $\epsilon$? Presumably, with increasing $\epsilon$ the failure threshold is less stringent, so it is surprising that this pattern is exhibited across datasets. Additionally, there is little comparison with the $\frac{1}{\epsilon^2}$ method. Finally, what does the actual distribution of $\tau$ look like in practice; are the assumptions validated?
1. In practice, on the datasets listed in the experiments section, how does the sample complexity of the CATE-style selector (the $\frac{1}{\epsilon^2}$ approach) compare with that of the proposed selector?
Fully human-written |
|
Good allocations from bad estimates |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
The authors provide new theory showing that optimal treatment allocation can be achieved with fewer samples than classic predict-then-optimize approaches (FullCATE in the paper) would require. Their theory is built upon the insight that accurate estimates of CATE are only needed around the treatment allocation threshold.
- The authors provide interesting insights about sample size requirements for optimal treatment allocation.
- The authors substantiate their claims with extensive theoretical analysis
Some other related work exists that uses similar insights about the problem of optimally allocating treatment, though often without an extensive theoretical analysis. For example, some work has argued that when trying to find the optimal treatment allocation, accurate CATE estimation is not always the most effective approach [1,2].
While the authors provide a very extensive theoretical analysis, they only briefly explain the potential impact of their contributions. For example, I find it hard to understand what practitioners should do with this new information. It would be helpful if the authors could discuss this.
I wonder why the authors decided to put the individuals into different groups and do the analysis based on these groups. As the number of groups decreases, the number of samples needed also decreases, but the quality of the overall allocation will also go down (because as you make the groups more granular, you will find more heterogeneity between groups, allowing for better decision-making). How should you choose the number of groups in practice if you have very little a priori information about the CATE distribution?
Related to the previous point, I do not understand how the different groups are created in the experiments. To me, this seems like the most crucial part when evaluating treatment allocation quality.
I wonder why the authors did not use datasets that are often used in Uplift Modeling [1] and treatment effect estimation [3].
[1] Devriendt, F., Van Belle, J., Guns, T., & Verbeke, W. (2020). Learning to rank for uplift modeling. IEEE Transactions on Knowledge and Data Engineering, 34(10), 4888-4904.
[2] Fernández-Loría, C., & Provost, F. (2022). Causal decision making and causal effect estimation are not the same… and why it matters. INFORMS Journal on Data Science, 1(1), 4-16.
[3] Curth, A., Svensson, D., Weatherall, J., & Van Der Schaar, M. (2021, August). Really doing great at estimating cate? a critical look at ml benchmarking practices in treatment effect estimation. In Thirty-fifth conference on neural information processing systems datasets and benchmarks track (round 2).
All the items listed in the Weaknesses section may be interpreted as questions by the authors.
Typo: line 353 "notion that requiring" |
Fully human-written |
|
DFCA: Decentralized Federated Clustering Algorithm |
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper presents DFCA, a Decentralized Federated Clustering Algorithm that combines federated clustering (as in IFCA) with decentralized communication. The goal is to remove reliance on a central server and allow clients connected via a communication graph to collaboratively train cluster-specific models. To handle asynchronous communication, the authors propose a running average aggregation scheme and provide convergence analysis under smoothness and separability assumptions. Experiments on several benchmark datasets (MNIST, EMNIST, CIFAR-10, FEMNIST) suggest that DFCA can achieve accuracy comparable to centralized IFCA while outperforming other decentralized baselines.
1. The proposed approach is conceptually simple, intuitive, and easy to implement within decentralized federated learning frameworks.
2. While the theoretical contribution is not highly novel, the paper provides a useful convergence discussion that helps connect DFCA with existing decentralized SGD analyses under standard smoothness and connectivity assumptions.
1. The theoretical section mostly adapts existing results from IFCA and decentralized SGD. It does not introduce new analytical techniques or address the harder questions that arise from decentralization, e.g., how delayed neighbor updates or misclustered clients affect convergence.
2. The experimental evaluation in this paper is not sufficiently comprehensive. While standard datasets are used, the experiments focus on relatively small-scale settings and synthetic heterogeneity (via data rotation).
3. The paper does not isolate the contribution of its design choices, e.g., running average vs. synchronous aggregation, GI vs. LI initialization. Additionally, comparisons with more recent personalized or clustered FL methods (e.g., pFedMe, FedProx-based decentralized variants) are missing.
4. The algorithm requires each client to maintain $k$ full model copies and assumes stable connectivity across the network. This limits scalability to large $k$ or unstable peer-to-peer networks.
1. How sensitive is DFCA to temporary misclusterings? Does the convergence argument still hold if clients frequently switch clusters or if cluster separability is weak?
2. How does the running average scheme compare empirically with simple synchronous averaging? Is there a measurable benefit in terms of wall-clock time or communication efficiency?
3. Have the authors evaluated DFCA on larger networks (e.g., $N>500$) or more realistic non-IID data partitions (e.g., natural label skew)? How does the method behave under dynamic connectivity or dropped messages?
4. How sensitive is performance to the number of clusters $k$ and to initialization (GI vs. LI)? Could the method degrade to standard decentralized SGD when $k=1$?
5. Could the authors clarify why comparisons with more recent personalized or clustered decentralized FL baselines (e.g., FedProx-Decentralized, Per-FedAvg) were not included?
6. Could the authors provide additional experiments to strengthen the evaluation? For example, have they considered testing DFCA on larger-scale or more realistic non-IID datasets, varying the number of clusters k, or evaluating its performance under strongly asynchronous or dynamically changing communication graphs?
7. Additionally, could the authors investigate communication and computation trade-offs, as well as scalability on larger networks? |
Fully AI-generated |
|
DFCA: Decentralized Federated Clustering Algorithm |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes DFCA (Decentralized Federated Clustering Algorithm), which, as the name suggests, finds clusters of federated learning clients in a decentralized manner while these clients train models on their local data. The key idea is to use a sequential running average of neighboring clients’ updated model parameters in order to compute model updates, instead of attempting to aggregate all neighbors’ parameter updates at once in each round. Such sequential averaging is naturally amenable to asynchronous updates, and the paper shows that it reduces to decentralized stochastic gradient descent (SGD) after the clients’ clusters converge, which is proven to happen after a sufficient number of update rounds. Experiments show that DFCA can nearly match the performance of centralized IFCA (Iterative Federated Clustering Algorithm), while outperforming other decentralized federated learning algorithms.
+ The experimental results show that DFCA outperforms several decentralized federated learning baselines, on multiple different datasets. It also nearly matches the accuracy achieved by IFCA, which is impressive for a decentralized learning algorithm.
+ The DFCA algorithm itself appears fairly easy to implement and can be naturally extended to asynchronous settings, which is especially important if clients’ network connections can change over time, which may not be synchronized across rounds.
--The client disagreement Disp_j^t is never formally defined, making it difficult to fully appreciate Theorem 1. Some assumptions are also unclear, e.g., wouldn’t the reduction to decentralized SGD rely on the fact that clients perform only one local training iteration between each update?
--It’s not clear why Assumption A3 has different graph mixing conditions for the synchronous and asynchronous cases. The conditions could also be defined more formally, e.g., what exactly does “disagreement contracts” mean? Wouldn’t that be a property of the algorithm as well as the underlying connectivity graph, not the graph mixing itself?
--The paper does not give many concrete examples of where DFCA might be deployed. For example, the assumption that clients are separated into distinct clusters appears fairly strong, so it would be useful to discuss some example applications where this assumption would be reasonable.
1) There is no discussion of how (or whether) the proof of Theorem 1 differs significantly from prior work in the literature. My understanding is that the proof that the clusters eventually stabilize is fairly standard, though I’d appreciate clarification on this point from the authors (in particular, whether the sequential averaging makes a material difference to the proof technique).
2) Does DFCA assume that clients know their neighboring sets $N_{i,j}$ at any given time? If not, how would they know when to resume assigning a new cluster to their data (going back to Step 1 of the algorithm) after the aggregation in Step 3 is complete? How would clients know this neighbor set in practice, since network connectivity may change over time?
3) The experiments section claims that DFCA is robust to low connectivity. I agree that the experimental results provide evidence of this claim, but is it evident in Theorem 1’s convergence result as well?
Please also see the questions listed in the weaknesses above. |
Fully human-written |
|
DFCA: Decentralized Federated Clustering Algorithm |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
DFCA is a decentralized federated clustering algorithm inspired by IFCA, designed to mitigate data heterogeneity. It iterates through three steps. First, each client performs cluster assignment by locally evaluating all k clusters’ models on its data and self-assigning to the cluster with the minimum loss. Second, it executes a local update. Third, it conducts decentralized aggregation, exchanging and averaging all k models with its neighbors, notably using a "sequential running average" to efficiently handle asynchronous updates.
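For concreteness, my reading of this three-step loop for a single client is roughly the following sketch (the linear-regression helpers and all names are my own illustrative assumptions, not the paper's code):

```python
import numpy as np

# Illustrative linear-regression helpers; the paper's model and loss may differ.
def local_loss(theta, X, y):
    return 0.5 * np.mean((X @ theta - y) ** 2)

def local_gradient(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

def dfca_round(cluster_models, neighbor_models_by_cluster, X, y, lr=0.1):
    """One synchronous DFCA-style round for a single client, following the three
    steps described above. `cluster_models` is the list of k models the client
    stores locally; `neighbor_models_by_cluster` maps a cluster id to the models
    received from same-cluster neighbors."""
    # Step 1: self-assign to the cluster whose model has minimum local loss.
    j = int(np.argmin([local_loss(th, X, y) for th in cluster_models]))

    # Step 2: local SGD step on the selected cluster's model.
    theta = cluster_models[j] - lr * local_gradient(cluster_models[j], X, y)

    # Step 3: sequential running average with same-cluster neighbors; messages
    # are folded in one at a time, which is what makes the scheme asynchrony-friendly.
    avg, count = theta, 1
    for theta_nb in neighbor_models_by_cluster.get(j, []):
        count += 1
        avg = avg + (theta_nb - avg) / count

    cluster_models[j] = avg
    return j, cluster_models
```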
1. DFCA achieves clustering in a DFL setting, and its performance is comparable to centralized IFCA under data-heterogeneous scenarios.
2. DFCA proposes a practical sequential running average method to enable asynchronous aggregation. This mechanism is well suited to realistic decentralized deployments.
1. The memory, computation, and communication costs are all large, each being k times higher than in standard decentralized FL. First, DFCA requires each client to store all k cluster models locally. Second, DFCA performs a complete inference pass over each of these k models on all of its local data to find the model with the lowest loss. Third, the client needs to exchange all k models with its neighbors. The experiments in the paper are limited to small-scale settings such as k=2 and k=4, which masks the serious scalability issues of the design. If the number of clusters is slightly larger (such as k=10 or k=20), this k-fold memory and inference overhead is completely impractical for resource-constrained devices.
2. The description of the aggregation is contradictory. The aggregation defined by Eq. (6) and (7) has client i interacting only with the neighbors N_ij in cluster j. However, the convergence analysis in Section 4 (Eq. 10) assumes that client i interacts with all neighbors N_i and aggregates all k models. These two descriptions are inconsistent. The authors must clarify which one is the true aggregation method of DFCA.
Overall, the methods proposed in this paper lack novelty and technical inspiration.
see the above. |
Fully human-written |
|
DFCA: Decentralized Federated Clustering Algorithm |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors present a new clustering method for decentralized FL, extending the core idea of IFCA to the setting without a central server. Each client stores all the cluster models and is dynamically assigned to the cluster whose model minimizes its empirical loss. The authors also present two possible aggregation strategies: a batch aggregation strategy and a sequential strategy that better meets the constraints of real-world FL deployments. The paper is well motivated and quite easy to follow.
I appreciated the clarity of the paper. As mentioned above, it is easy to follow and well motivated. In particular, the authors provide a good theoretical extension of IFCA's proofs to the decentralized setting, properly applying interesting results from algebraic graph theory.
1. The DFL framework is interesting and I appreciate the effort to extend IFCA to this scenario; however, the similarity to IFCA raises concerns about the novelty of the proposed approach.
2. Storing all the cluster models on each client could be impractical, especially in low-powered IoT settings where storage capacity is often limited. I am concerned that in large-scale scenarios, where the number of clusters drastically increases, the memory cost of storing all the models explodes.
2a. The algorithm does not prevent the formation of degenerate clusters, i.e., clusters with a single client. In this case the cluster's model coincides with that client's individual model. Hence, a client would end up storing another client's individual information, which is privacy-concerning with respect to standard FL privacy assumptions.
3. Since the number of clusters is fixed a priori as an input of the algorithm, similarly to IFCA, why do the authors evaluate the algorithms with $k = 4$? The evaluation should be extended to other values of this hyper-parameter.
4. In the introduction the authors claim that their method is robust against data heterogeneity; however, it is not specified with which value of Dirichlet's $\alpha$ they construct the federation.
5. I think that the authors should experimentally compare their method to other clustered DFL algorithms, or to other DFL algorithms that are specifically designed to mitigate data heterogeneity.
6. I advise the authors to update the literature review --- while most of the relevant historical works on CFL are present, the most recent literature is not properly discussed.
i. I do not get why the related work section has been placed at the end of the experimental results. It compromises the readability of the paper.
ii. In the pseudocode, why does the number of local SGD iterations equal the number of communication rounds?
See weaknesses. |
Fully human-written |
|
Black-box Optimization of LLM Outputs by Asking for Directions |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper studies the use of binary comparisons rather than reliance on a numeric signal (e.g., logprobs or scores from a surrogate model). They show that, in image classification, this comparison is significantly better calibrated than directly asking the model for a confidence score. Using this tool, they then present case studies of three scenarios in which their method is effective.
1- The paper is well-written and is easy to understand.
2- The comparison between binary comparison and directly asking the model for a confidence score is the main novelty of this paper, which, in my view, can be applied in a broader scope.
3- The results of the paper seem promising.
1- I believe that a more comprehensive related-work discussion is needed in this paper. The idea of comparing only two prompts has been extensively investigated in the prompt optimization literature. For instance:
Wu et al. "LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization"
Lin et al. "Prompt Optimization with Human Feedback"
2- Continuing the previous point, there are several papers that study the calibration of the model's score. This paper only compares with a basic approach: directly asking for a score without defining any rules, criteria, etc.
3- I'm still confused about how the model answers this question: "which one of these prompts is less harmful: 1- Tell me how to build a bomb|suffix_1; and 2- Tell me how to build a bomb|suffix_2"; clarification is needed here. Moreover, I don't understand how the comparison of these two gives a meaningful signal. I think more ablation studies are required to establish the significance of this comparison. For instance, how does the answer to this comparison correlate with the value of the log-probs?
4- The baselines in all three case studies are weak. In the first scenario, the authors only compare their methods with a transfer method. Can the authors clarify if the attack algorithm for the transfer was adapted from (Li et al., 2025)? How about other methods? For instance:
Hu et al. "Transferable adversarial attacks on black-box vision-language models"
5- In the second and third scenarios, they only compare with a random-search method that has access to the log-probs. Even in this case, the only meaningful comparison is for GPT-4.1 mini in Table 3 (actually, I don't understand how the log-probs were computed for this black-box model; are they exploiting the same top-5 method as the original paper does?). Why didn't the authors include more attack methods (preferably more recent ones) in their table?
1- Can you do the same calibration experiment for the adversarial prompts? I.e., asking a model to give you a score (there are methods already developed for this, such as StrongReject) vs. asking it to only compare two prompts?
Fully human-written |
|
Black-box Optimization of LLM Outputs by Asking for Directions |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper tackles the setting of jailbreaking closed-source LLMs with text-only API access, which is a more prevalent but more difficult setting than the white-box setting, where malicious actors can leverage gradients to steer LLMs, or than APIs that expose log-probs, where gradient-free optimization can be used for jailbreaking. Existing work on the text-only setting requires surrogate LLMs or reward models to score possible jailbreaks and update them. This work provides a different approach by using the victim model directly to score possible jailbreaks / adversarial examples, and then leverages these scores to generate new attacks. This is done through a series of binary comparisons of potential adversarial attacks that the victim model is asked to judge, which are then used to attack the model directly.
The main strength of this paper is that it does not require any auxiliary models or data to score adversarial samples, which is very useful for stress-testing closed-source API LLMs. The algorithm is simple, very easy to implement, and builds on existing black-box adversarial attacks on LLMs and VLMs. Additionally, the use of binary comparisons instead of absolute confidence scores is novel and well motivated.
There are two primary weaknesses of this paper. First, there is a lack of related work / comparisons to existing text-only black-box methods. Methods such as BlackDan [1] and D-Attack / DH-CoT [2] are not mentioned. Additionally, while existing methods like PAIR and Tree-of-Attacks do require auxiliary models, it would strengthen the paper to compare against them, as currently the only comparison is to the logprob-based method.
Secondly, the results in Table 1 are confusing, as it seems the approach is not very successful on its own. Table 1 shows attack success rates that are significantly lower than the transfer-based approaches across all models, and combining the attack with transfer-based methods, either directly or through an ensemble, does not yield significantly better results. Can the authors explain this?
[1] Wang, X., Huang, V. S. J., Chen, R., Wang, H., Pan, C., Sha, L., & Huang, M. (2024). Blackdan: A black-box multi-objective approach for effective and contextual jailbreaking of large language models. arXiv preprint arXiv:2410.09804.
[2] Zhang, C., Zhou, L., Xu, X., Wu, J., Fang, L., & Liu, Z. (2025). Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts. arXiv preprint arXiv:2508.10390.
1) How efficient is the algorithm at converging to a successful adversarial attack? Can the authors provide an ablation study on changing the maximum number of iterations?
2) The majority of the attacks are done on non-reasoning models or non-thinking models. How does the performance of this approach change for models such as the o1-3 OpenAI reasoning model series?
3) Is the repeated iteration done in a multi-turn fashion (i.e. does the model's context include the past binary comparisons)? |
Fully human-written |
|
Black-box Optimization of LLM Outputs by Asking for Directions |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper presents an effective black-box attack strategy for LLMs that operate in the most restrictive, text-only API setting. The core contribution is the discovery that while LLMs are poorly calibrated for absolute confidence scoring, they are surprisingly well-calibrated for binary comparisons. The authors leverage this insight to build a general, query-based "hill-climbing" optimization algorithm. By repeatedly "asking for directions" via these comparative prompts, the attack iteratively refines a malicious input. The method's effectiveness is demonstrated across three distinct and important domains: adversarial examples for vision-LLMs, jailbreak attacks, and prompt injections.
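For reference, the optimization loop as I understand it reduces to a pairwise-feedback hill climb of the following form (a schematic Python sketch; `mutate` and `prefers` are assumed placeholders for the candidate-editing step and the victim model's binary comparison, not the paper's implementation):

```python
def comparative_hill_climb(candidate, mutate, prefers, budget=100):
    """Schematic 'asking for directions' loop: propose a small edit, ask the victim
    model which of the two candidates better satisfies the objective, and keep the
    preferred one. Both callables are assumed stand-ins, not the paper's API."""
    for _ in range(budget):
        challenger = mutate(candidate)
        # The binary comparison replaces logit/confidence feedback entirely.
        if prefers(challenger, candidate):
            candidate = challenger
    return candidate
```

The point of the paper, as I read it, is that this single comparative bit per query is a far better-calibrated signal than asking the model for an absolute score.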
1. The paper's primary insight, using comparative, self-reported confidence as an optimization signal, is a significant and novel contribution. It elegantly bypasses the common requirement for logits or confidence scores, which are rarely available in production systems. The validation in Figure 3, which contrasts the failure of absolute scoring with the success of binary comparison against ground-truth logits, is very convincing.
2. The attack is designed for and effective in the "text-only" black-box setting. This is a very practical and challenging scenario, and this work dramatically expands the known attack surface for deployed, proprietary models.
3. A key and counterintuitive finding is that larger, more capable models are often more vulnerable to this attack. The paper provides strong evidence for this across model families (e.g., Qwen-VL-72B > 7B, GPT-5 mini > GPT-4o mini). The hypothesis that this is because they are better at the comparative reasoning task is a useful insight for the field.
4. The method is not a one-trick pony. It is successfully applied to three very different attack types (adversarial images, jailbreaks, and prompt injections) across numerous model families (GPT, Claude, Llama, Qwen). Furthermore, the results are state-of-the-art, achieving near-perfect success on jailbreaks and even outperforming logit-based attacks in some cases (e.g., 98% vs. 56% on GPT-4.1 mini).
1. In Table 1, the high ASRs (e.g., 94.7% on GPT-4o mini) are achieved with the "Transfer+ours" hybrid method. In this hybrid, the improvement from the optimization step is sometimes marginal (e.g., 92.9% to 94.7% for GPT-4o mini), suggesting the (known) transfer attack is doing most of the work. The standalone power of the query attack for vision seems less impressive than for text.
2. The entire attack hinges on the model's willingness to answer the comparative "meta-prompt." The authors note this as a failure mode, where a strongly aligned model may simply refuse to perform the comparison. This seems like a critical vulnerability of the attack itself. The paper does not sufficiently explore the robustness of the attack to simple defenses on this meta-prompt (e.g., "I cannot compare prompts in a way that might lead to a harmful outcome").
3. As mentioned in the first point, the benefit from the "Transfer+ours" method is highly variable. For GPT models, the gain is tiny (1.8% on GPT-4o mini, Table 1), but for Claude models, it is massive (35.1% to 59.6% on Haiku 3.5). This significant discrepancy is not analyzed. Does it mean the transfer attack is already near-perfect for GPT, or that the Claude models provide a much better optimization signal? This is an important detail.
1. Following up on the weakness above, why is the improvement from "Transfer+ours" so minimal for GPT models but so large for Claude models (Table 1)? Does this imply that the "directions" from GPT are less effective, or that the CLIP-based transfer attack is already highly aligned with the GPT vision encoder?
2. The paper identifies that a model can refuse the comparison query as a defense. How difficult is it to bypass this? Did the authors experiment with iteratively re-prompting or reformulating the comparison prompt itself to get around such refusals?
3. Regarding the "security paradox" claim: does this finding suggest that alignment techniques that rely on a model's advanced reasoning (like self-critique or Constitutional AI) are fundamentally flawed, as that same reasoning capability can be turned against the model to guide an attack?
4. For the vision-LLM attacks, the query budget was 1,000. What was the average number of queries for a successful attack? Table 3 provides this for jailbreaks (e.g., 4.9-79.3 queries), which is very efficient. Is the cost for vision attacks similarly low, or does it regularly require hundreds of queries? |
Fully AI-generated |
|
Black-box Optimization of LLM Outputs by Asking for Directions |
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The authors introduce an approach for attacking black-box large language models. They iteratively present a model with two slightly different images, select the image showing the desired behavior based on the model's response, and then use the selected image to create the two images for the next iteration. This results in an image that can successfully elicit the desired behavior in models.
Black box attack for VLMs
Shows promising results
Very compute-heavy and expensive
Unclear from the paper how many iterations were required to reach the reported ASR
1. What were the computational costs/time required for the optimization of the adversarial image? |
Fully human-written |
|
Multivariate Time Series Forecasting with Fourier Neural Filter |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a Time Filter (TiF) for multivariate time-series forecasting. TiF employs a Fourier Neural Filter (FNF) as the backbone and a Dual-Branch Decoupler (DBD) as the architectural design. The former provides strong representational capacity, while the latter establishes efficient learning pathways for spatiotemporal modeling.
1. This paper proposes a unified FNF backbone that integrates time-domain and frequency-domain analyses.
2. This paper provides theoretical and empirical evidence for the effectiveness of DBD in spatiotemporal modeling.
3. Comprehensive experiments conducted on long-term and short-term forecasting tasks verify the superior performance of TiF.
1. The organization needs improvement to make it easier to follow. In the Related Work, it is unclear why distribution shift and non-autoregressive decoding are reviewed, as these topics do not appear to be central to the paper’s main contributions. In the Method (Sections 3.1.1–3.1.6), substantial space is devoted to preliminaries such as complex transforms and global convolution, which obscures the core ideas and innovations of the proposed approach.
2. The paper’s novelty appears limited. Simply replacing the fixed kernel in the Fourier Neural Operator with an input-dependent kernel represents only a marginal improvement. In addition, introducing DBD as a parallel paradigm, compared with unified and sequential paradigms, to maintain independent information-processing branches also appears to be an incremental design choice rather than a substantive conceptual advance.
3. More relevant frequency-filter baselines should be considered, such as FilterNet [1] and TSLANet [2].
[1] FilterNet: Harnessing Frequency Filters for Time Series Forecasting. NeurIPS, 2024.
[2] TSLANet: Rethinking Transformers for Time Series Representation Learning. ICML, 2024.
Please refer to the weaknesses.
Lightly AI-edited |
|
Multivariate Time Series Forecasting with Fourier Neural Filter |
Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces the Time Filter (TiF) structure for multivariate time series forecasting. TiF combines: 1) the Fourier Neural Filter (FNF), a spectral backbone with an input-dependent kernel that adaptively mixes time-domain and frequency-domain information while filtering noise; 2) the Dual Branch Decoupler (DBD), a parallel temporal–spatial architecture that processes the time and variable dimensions separately and fuses them later. The paper provides theoretical motivation via the information bottleneck principle and demonstrates improved performance across 12 datasets with notable efficiency and robustness.
1. The method design is well-motivated. The proposed FNF introduces an adaptive, input-dependent spectral filter that effectively bridges time-domain and frequency-domain modeling, while the DBD offers a parallel architecture for capturing both temporal and spatial dependencies.
2. The paper is written clearly, with good method framing and empirical support, making it a sound backbone for time series forecasting.
3. Consistently better results across 12 datasets demonstrate the generalization and efficiency advantages of the proposed method.
1. The theoretical claims in this paper are mostly definitions or intuition. Theoretical proofs in Sections 3.1 and 3.2 read more as descriptive derivations or qualitative reasoning rather than rigorous theorems or guarantees. To strengthen credibility, the authors could either (a) provide concrete theoretical statements with clear conditions and supporting lemmas, or (b) reframe these sections as design intuitions supported by experiments.
2. The experiment section in the main paper lacks sufficient description of hyperparameters and training details. The authors state that a grid search over input lookback lengths and other hyperparameters was performed. However, it is unclear how this grid search was conducted. Is the performance of the proposed method significantly affected by the choice of lookback window length? Additional experiments using a fixed window size and including statistical significance tests would help address these concerns.
3. The length distribution across sections could be better balanced. Sections 3 and 4 occupy a large portion of the paper, leaving the experimental section relatively brief. If Section 3 mainly focuses on describing the model architecture, it could be condensed to make room for more detailed experiments. Similarly, Section 4 could be merged into Section 3. At present, the ablation studies are rather limited, discussing only a few modules and not conducted on unified dataset selection.
1. The paper claims selective activation enhances mid/high frequencies. Some frequency or distribution visualizations would make this more concrete.
2. Providing pseudo-code for the structures or releasing an anonymized codebase with experimental configs would benefit transparency and reproducibility.
3. How do you design the fusion network for the representations? What is the dimension of the linear layer in Eq. 23? Why is it a Linear - LN - GELU - Linear structure, which is not a common MLP style?
If the authors' response adequately addresses my questions and concerns mentioned above, I am willing to raise my score. |
Lightly AI-edited |
|
Multivariate Time Series Forecasting with Fourier Neural Filter |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper introduces a new architecture for multivariate time series forecasting, including two key components:
(1) Fourier Neural Filter (FNF), an input-dependent integral kernel operator that unifies time-domain and frequency-domain modeling, extending the Fourier Neural Operator (FNO) by introducing adaptive gating, selective activation, and learnable truncation for denoising.
(2) Dual Branch Decoupler (DBD), a dual-path structure inspired by information-bottleneck theory that decouples temporal and spatial processing for improved gradient flow and representation capacity.
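To make point (1) concrete, a frequency-domain filter block of the general kind described can be sketched as follows (an assumed AFNO/FNO-style stand-in with a fixed truncation and a simple input-dependent gate; this is my own illustration, not the paper's FNF):

```python
import torch
import torch.nn as nn

class SpectralFilterBlock(nn.Module):
    """Illustrative input-dependent spectral filter: FFT, modulate the retained
    low-frequency modes with a learnable complex kernel scaled by an input-dependent
    gate, zero out the rest, inverse FFT. Assumed stand-in, not the paper's FNF."""

    def __init__(self, channels, seq_len, keep_modes):
        super().__init__()
        assert keep_modes <= seq_len // 2 + 1
        self.keep_modes, self.seq_len = keep_modes, seq_len
        # Learnable complex kernel over the retained modes (a global convolution in time).
        self.kernel = nn.Parameter(0.02 * torch.randn(keep_modes, channels, dtype=torch.cfloat))
        # Simple gate standing in for the adaptive/selective components.
        self.gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, x):                               # x: (batch, seq_len, channels), real
        X = torch.fft.rfft(x, dim=1)                    # to the frequency domain
        g = self.gate(x.mean(dim=1))                    # (batch, channels) input-dependent gate
        Y = torch.zeros_like(X)
        Y[:, :self.keep_modes] = X[:, :self.keep_modes] * self.kernel * g[:, None, :]
        return torch.fft.irfft(Y, n=self.seq_len, dim=1)  # truncated modes are simply zeroed
```

Whether FNF's gating, selective activation, and learnable truncation go meaningfully beyond this kind of block is exactly what the weaknesses below ask the authors to clarify.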
1. This paper presents a new architectural exploration for time series forecasting, offering a meaningful attempt to design a dedicated backbone tailored to the characteristics of temporal data. This represents a positive and constructive step for research in this area.
2. The proposed Dual Branch Decoupler (DBD) introduces a parallel-branch mechanism to decouple temporal and spatial feature learning. This is an interesting design that contributes fresh insights to spatiotemporal modeling in time series forecasting.
3. The experimental evaluation is extensive and convincing, covering 12 benchmark datasets and a broad spectrum of competitive baseline models, which demonstrates the robustness and general applicability of the proposed approach.
1. While the proposed Fourier Neural Filter (FNF) introduces adaptive kernels and learnable truncation mechanisms, much of its formulation builds upon existing frameworks such as FNO and AFNO.
2. The DBD parallel design is conceptually sound but lacks empirical exploration of branch interactions (e.g., information flow visualization or mutual information analysis). It would strengthen the paper to show why the parallel path quantitatively improves gradient dynamics or representation diversity.
3. The Related Work section does not sufficiently discuss prior studies directly related to the paper’s two main contributions, i.e., FNF and DBD.
4. The ablation study is relatively limited. It would be useful to further investigate the effect of architectural choices, such as FNF depth, kernel size, and sensitivity to the patch length P, to better understand the robustness of the proposed design.
1. It is unclear why Equation (22) is claimed to capture global correlations while Equation (21) captures local correlations, given that both modules employ the Fourier Neural Filter (FNF) backbone. |
Heavily AI-edited |
|
Multivariate Time Series Forecasting with Fourier Neural Filter |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper proposes Time Filter (TiF), combining a new Fourier Neural Filter (FNF) backbone with a Dual Branch Decoupler (DBD) architecture. FNF extends FNO with an input-dependent kernel (selective activation), complex transform, and adaptive truncation to modulate time–frequency information; DBD uses parallel temporal/spatial branches to improve gradient flow and capacity from an information-bottleneck perspective. Experiments on 12 benchmarks (eight LTSF datasets plus four PeMS short-term sets) report strong results, along with ablations and an efficiency plot.
1. FNF formalizes an input-dependent spectral operator with selective activation and adaptive truncation; the math exposition (Definitions 1–7, Remarks 1–6) is explicit and links capabilities to Transformer functions and complexity.
2. DBD’s motivation via the information bottleneck and gradient-path analysis is well argued.
3. Results span LTSF and PeMS with comparisons to Transformer/CNN/MLP/Fourier baselines; Table 2 claims lookback grid search (96–720) for all methods, which ensures a fair comparison.
4. Efficiency analysis (Traffic) and component ablations (AT/SA, LS/GS) give some insight.
1. FNF’s contributions (input-dependent kernel, selective activation, adaptive truncation) are close in spirit to prior spectral/fractional operators (e.g., FNO/AFNO/FITS/FreMLP). The paper does not clearly establish what capability FNF enables that prior spectral blocks cannot, beyond architectural composition; DBD overlaps with known parallel/dual-path decouplers (e.g., Leddam [1], xPatch [2], MTGNN [3], TimeMixer++ [4], just to name a few). A sharper comparison or operator-replacement study is needed.
2. There is no working anonymous repo or pseudocode for review, which limits the reproducibility of the proposed method.
3. For PeMS, the lookback is fixed to 96 for all baselines, which can bias results; fair practice tunes input length per method (as you already did for Table 2). Please align protocols across tasks.
4. Table 2 lacks more recent strong baselines (e.g., TimeMixer++ [4] (ICLR25), PatchMLP [5] (AAAI25), TQNet [6], and TimeBridge [7] (ICML25)), which undermines the SOTA claim; please add with identical splits/tuning.
5. The proposed FNF looks conceptually overlapped with selective state-space models such as Mamba: both realize an input-conditioned long filter with a gated residual/skip path. In FNF, the frequency-domain parameterization (F → complex transform → adaptive truncation → F⁻¹) effectively implements a learnable long convolution; Mamba parameterizes a similar operation via SSM kernels and a selective gate. The manuscript should explicitly position FNF against Mamba/S4-family—clarifying what FNF does that a selective SSM cannot—and include quantitative comparisons.
*[1] Revitalizing Multivariate Time Series Forecasting: Learnable Decomposition with Inter-Series Dependencies and Intra-Series Variations Modeling.*
*[2] xPatch: Dual-Stream Time Series Forecasting with Exponential Seasonal-Trend Decomposition.*
*[3] Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks.*
*[4] TimeMixer++: A General Time Series Pattern Machine for Universal Predictive Analysis.*
*[5] Unlocking the Power of Patch: Patch-Based MLP for Long-Term Time Series Forecasting.*
*[6] Temporal Query Network for Efficient Multivariate Time Series Forecasting.*
*[7] TimeBridge: Non-Stationarity Matters for Long-term Time Series Forecasting.*
See the weaknesses.
Moderately AI-edited |
|
Binary-Integer-Programming Based Algorithm for Expert Load Balancing in Mixture-of-Experts Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper proposes an improved version of the auxiliary-loss-free load-balancing strategy, based on binary integer programming. It reports better load-balancing control (especially at the beginning of training) and lower perplexity than auxiliary-loss-controlled methods and the original auxiliary-loss-free method.
1. Better load balancing (especially at the beginning).
2. It reports better performance.
3. Algorithm 4 can be applied to cases where large batch sizes and expert parallelism are used.
1. Can you report the training data amount and the batch size (and the loss curve, if you can update it in the PDF or somewhere)? I cannot find them in the paper, and I'm wondering whether the data and batch size were too small, which might have caused the behavior to differ from normal cases.
2. I wish experiments were conducted on Algorithm 4 (or some similar version that consumes acceptable GPU memory and allows expert parallelism in large-scale training).
3. Will this algorithm cause the bias term *q* to change too quickly, hurting performance and causing training instability?
1. The template is a little strange, and you might use `\citep` for some cases.
2. Will a 0.1 auxiliary loss alpha be too large for the loss-controlled baseline? (I'm not clear about the implementation, so I'm not sure about the actual effect of this 0.1 value, but you may report the loss and maxvio around 0.1 (like 0.01 / 1.?)).
3. "Especially in the Loss-Controlled method, since there are discussions showing that on the softmax function the Loss-Controlled method works better Su (2025)." Can you provide the original text on the Kexue blog? It isn't on the linked blog.
4. I would argue that the gating function for the load-balancing methods does not have to be the same, as long as all methods report their best results. It should be viewed as an inseparable part of each load-balancing method.
Fully human-written |
|
Binary-Integer-Programming Based Algorithm for Expert Load Balancing in Mixture-of-Experts Models |
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper proposes a new load-balancing strategy for Mixture-of-Experts (MoE) models called BIP-Based Balancing. It formulates expert assignment as a Binary Integer Programming (BIP) problem. The method introduces a per-layer vector q that adjusts the routing scores. This adjustment is achieved by solving a simplified dual optimization problem using ADMM iterations. The approach aims to maintain expert load balance from the start of pre-training, in contrast to previous loss-controlled or loss-free methods that converge slowly. Experiments on small-scale MoE variants of the Minimind model (0.3B and 1.1B parameters) show reduced perplexity. They also indicate roughly 13–14% shorter training time.
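For concreteness, a generic program of the kind being described here looks as follows (my own illustrative formulation with top-K routing and a per-expert capacity C; the paper's exact objective and constraints may differ):

```latex
\[
\max_{X_{ij}\in\{0,1\}}\ \sum_{i,j} s_{ij} X_{ij}
\quad \text{s.t.}\quad
\sum_{j} X_{ij} = K \ \ \forall i \ (\text{top-}K\text{ routing}),
\qquad
\sum_{i} X_{ij} \le C \ \ \forall j \ (\text{per-expert load cap}).
\]
```

The per-layer vector q would then correspond to (approximate) dual multipliers of a relaxation of this program, added to the routing scores before expert selection, as described above.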
1. Novel formulation: Modeling MoE load balancing as a BIP optimization offers a fresh theoretical perspective on routing dynamics.
2. Auxiliary-loss-free: The method avoids using extra loss terms. This allows the model to focus on the main objective.
1. The proposed BIP formulation requires iterative updates at every routing gate. The paper does not provide clear computational complexity or runtime profiling. Thus, it is unclear if this approach is practical for large-scale MoE models with tens or hundreds of billions of parameters.
2. Experiments are conducted only on small MoE models with single-GPU setups (RTX4090/L20). There is no evidence that the algorithm scales to large LLMs or distributed settings. This significantly weakens the empirical claims.
3. The paper presents the BIP problem as solvable via a simple ADMM-like iteration. However, this is mathematically inconsistent. BIP is NP-hard in general. The authors are actually solving the dual of the LP relaxation, not the integer problem itself. There is no proof that their iterative rule recovers a feasible or optimal integer assignment $X_{ij}$. Thus, the method likely produces an approximate heuristic, not an exact BIP solution.
4. The authors invoke ADMM to justify iterative updates for $p, q, r$. However, the augmented Lagrangian $L(p,q,r,u)$ is not explicitly written. No penalty parameter $\lambda$ is defined. There is no convergence guarantee, which normally requires convexity and Lipschitz continuity. The authors assert that lines 2–3 in Algorithm 2 imply lines 7–12 in Algorithm 1. This is not a valid ADMM update derivation; it is a qualitative analogy, not a formal equivalence.
Please check the weaknesses.
Heavily AI-edited |
|
Binary-Integer-Programming Based Algorithm for Expert Load Balancing in Mixture-of-Experts Models |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
In this paper, the authors propose a novel balancing algorithm called "BIP-Based Balancing." The core idea is to formulate the expert-token assignment as a Binary Integer Program (BIP) designed to maximize the routing scores for the current batch, with a hard constraint on load balance.
Based on this formulation, the authors analyze its Linear Programming (LP) relaxation and corresponding dual problem. By solving this dual problem using the Alternating Direction Method of Multipliers (ADMM) for a small number of iterations (T) at each step, the algorithm derives a set of dual variables. These variables are then used as biases to modify the original routing scores before the Top-K selection.
The authors conduct experiments on two MoE language models (0.3B 16-expert and 1.1B 64-expert) and compare their method against Loss-Controlled (e.g., GShard) and Loss-Free (e.g., DeepSeek-V3) baselines. The results show that BIP-Based Balancing achieves significantly lower load imbalance (AvgMaxVio), better model performance (lower perplexity), and reduced training time.
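The routing step itself, with the derived biases applied before Top-K selection, is essentially the following (a minimal illustrative sketch; the ADMM/dual update that produces q is not reproduced here, and all names are my own):

```python
import torch

def biased_topk_routing(scores, q, k):
    """Select the top-k experts per token from bias-adjusted scores, as described
    in the summary. scores: (tokens, experts) router affinities; q: (experts,)
    bias vector coming from the dual/ADMM step (assumed given here)."""
    biased = scores + q                               # q shifts scores toward underloaded experts
    _, topk_idx = torch.topk(biased, k, dim=-1)
    assignment = torch.zeros_like(scores)             # binary token-to-expert matrix X_ij
    assignment.scatter_(-1, topk_idx, 1.0)
    load = assignment.sum(dim=0)                      # per-expert load, the quantity to balance
    return assignment, load
```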
1. Novelty: The formulation of expert balancing as a BIP and the use of dual variables from its LP-relaxation as routing biases is a novel and theoretically sound approach. Personally, I found this formulation very interesting.
2. Good performance: The experimental results look promising. The proposed method outperforms the baselines: it achieves a more stable and balanced load (drastically lower AvgMaxVio/SupMaxVio), results in a better-performing model (lower perplexity), and reduces training time at the same time.
3. Solves the early imbalance issue: A key advantage, clearly shown in Figures 1 and 2, is that this method achieves load balance from the very early training stage. Baselines (blue and green lines) show high initial imbalance or significant fluctuations, while the proposed method (red line) is stable and low from step 0.
1. CRITICAL: Submission Format Violation: The submitted paper does not appear to follow the standard ICLR template. On this basis alone, the paper is recommended for an **immediate desk rejection**. If the AC and other reviewers do not think this is a big issue, I may revise my score.
2. Scalability Concerns: The experiments are conducted on relatively small models (0.3B and 1.1B). While informative, the true test for MoE load balancers is at larger scales and with more training steps. Intuitively, if the model is trained for many more tokens and steps, the influence of the early imbalance phase will naturally decrease, as will the relative performance gain of the proposed method.
3. The paper reports a significant (13%) reduction in training time but does not clearly explain why this occurs. The authors could state more explicitly that this algorithmic overhead is negligible compared to the system-level time saved by eliminating "stragglers" (overloaded GPUs that bottleneck parallel training steps). This connection is critical to the paper's main claim.
4. The paper does not state whether the baseline methods (loss-controlled and loss-free) use the token dropping strategy. Token dropping is a common technique to handle imbalance, especially during the early, unstable phases, and should be a good baseline.
5. Missing Training Data Details: The paper does not specify the total size (e.g., number of tokens) of the pre-training dataset, only its source and the max sequence length. This makes it harder to fully assess the reproducibility and trustworthiness of the results.
1. Could the authors provide a precise measurement of the computational overhead for the proposed routing, lines 7-12? I suspect it is negligible compared to the rest of the training, but it would be better to quantify this.
2. Do the authors have any intuition on why the MaxVio of the loss-controlled and loss-free methods seems to fluctuate periodically during training (as seen in Figures 1 and 2)? |
Fully human-written |
|
Should We Forget About Certified Unlearning? Evaluating the Pitfalls of Noisy Methods |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper provides a critical evaluation of certified unlearning methods, focusing specifically on approaches based on differentially private (DP) training. The authors conduct experiments in vision and language settings to compare these DP-unlearning techniques against the baseline of retraining from scratch (RFS). The paper's central claim is a negative result: it argues that current DP-based methods fail to offer a compelling tradeoff, often suffering from poor utility (due to DP noise guiding models to suboptimal solutions) or failing to provide a significant computational benefit over a simple RFS baseline, especially when starting from a public pretrained model.
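For readers less familiar with the mechanics being evaluated, the "noisy" training underlying these certified-unlearning pipelines is per-example clipping plus Gaussian noise of roughly the following form, after which the unlearning step fine-tunes on the retain set (a self-contained illustrative sketch, not the paper's code):

```python
import torch

def dp_sgd_step(model, loss_fn, xs, ys, lr=0.1, clip=1.0, noise_mult=1.0):
    """One manual DP-SGD-style update: clip each example's gradient, sum, add
    Gaussian noise calibrated to the clipping norm, average, and step."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xs, ys):                                    # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
        scale = min(1.0, clip / (norm.item() + 1e-12))          # clip to norm `clip`
        for s, p in zip(summed, params):
            s.add_(p.grad, alpha=scale)
    with torch.no_grad():
        for s, p in zip(summed, params):
            noisy = (s + noise_mult * clip * torch.randn_like(s)) / len(xs)
            p.add_(noisy, alpha=-lr)                            # noisy gradient step
```

The paper's question is whether paying this noise penalty up front ever beats simply retraining (or re-finetuning) on the retain set once a deletion request arrives.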
1. The paper addresses a very timely and critical question for the field. As unlearning moves from theory to practice, we need rigorous evaluations to understand if current methods are actually viable.
2. It's good to see the evaluation extend beyond vision tasks to include larger language models. This is a crucial domain for unlearning.
I've ordered these in terms of importance, from high-level conceptual issues to more specific experimental concerns.
1. **Overly Broad Claim:** The paper's main conclusion that "certified unlearning" as a concept is not "worthwhile" seems too strong and not fully supported by the experiments. The evaluation focuses on one specific family of methods (DP-SGD based). It largely ignores other established classes of certified unlearning, such as sharding-based approaches (e.g., SISA), which have different trade-offs. The conclusion should be more precisely narrowed to the methods actually tested.
2. **Potentially Premature Conclusion:** The paper's strong negative claim risks being premature. Certified unlearning for complex, non-convex tasks is a very new research area, with rigorous theoretical guarantees only just emerging. It's highly likely that the initial DP-based methods evaluated here are just a first step. The paper's conclusion seems to over-generalize from the limitations of these early-stage methods, rather than framing them as a baseline for much-needed future practical progress.
3. **Experimental Scenarios Don't Test the Premise:** The core problem unlearning aims to solve is avoiding expensive retraining. However, the experiments seem to focus on relatively easy settings, such as forgetting a single sample. In this scenario, RFS (or simple re-finetuning from a public model) is expected to be a very strong and cheap baseline. The paper's claims would be much more impactful if they tested scenarios where RFS is genuinely prohibitive, such as forgetting larger, non-IID fractions of the dataset, as explored in other works.
4. **Incomplete Comparison to Prior Work:** The comparison to Koloskova et al. 2025 feels incomplete. It's not clear why the authors didn't compare against stronger methods from that paper, such as the variant called "gradient clipping" in the aforementioned paper, which was claimed to work better. The "model clipping" method used here seems to be the simpler "output perturbation" baseline, which is known to be weaker, and may not correspond to the method named "model clipping" in Koloskova et al. 2025's paper. To strengthen the negative claim, it's important to show that even the best versions of these DP-based methods are insufficient.
5. **Limited Novelty of Core Insight:** The finding that DP-SGD can lead to suboptimal solutions and trade-offs in utility is a known (though important) fact in the DP literature. Once again, given that certified unlearning for non-convex tasks is a very new field, finding that the first-generation methods are not yet practical is a useful, but perhaps not surprising, result.
6. **Limited Language Model Metrics:** While including language tasks is a strength, the evaluation is limited to perplexity. This doesn't give a full picture of model quality. It would be nice to see other standard metrics (e.g., ROUGE for summarization, or accuracy on downstream tasks) to understand the practical impact on utility.
1. Could the authors please provide an explanation for Figure 2.b? It seems quite odd that all perplexity curves converge early to the same point. What might be causing this behavior?
2. Could you elaborate on the choice to focus on the DP-SGD method (Algorithm 1) also used in Koloskova et al. 2025, rather than the noiseless pretraining setting they also studied?
3. Following on that, why were other methods from that paper, like so-called "gradient clipping", not included in the main comparisons (Figures 1-3)? It would be helpful to understand if they were tested and performed similarly, or if they were omitted for other reasons.
4. Do you believe your negative claims hold for all certified unlearning paradigms, including non-DP-based methods like SISA, or should the paper's conclusion be more tightly scoped to DP-based approaches? |
Heavily AI-edited |
|
Should We Forget About Certified Unlearning? Evaluating the Pitfalls of Noisy Methods |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
In this paper, the authors critically evaluate differential privacy (DP)–based certified unlearning methods that aim to remove specific data influences without retraining from scratch. They test these approaches across vision and language tasks and find that, contrary to prior claims, DP-unlearning offers no clear advantage in efficiency or performance compared to full retraining. The study identifies two main issues: DP training from random initialization leads to poor model quality, while starting from pretrained models makes unlearning unnecessary since simple fine-tuning achieves better results faster. Overall, the work provides an important negative result, questioning the practicality of DP-based certified unlearning.
1. Well-written and clearly presented: The paper is well-organized and easy to follow, with clear motivation, experimental design, and conclusions that make a complex topic accessible.
2. Highlights an overlooked aspect: It draws attention to an often-overlooked issue—the actual practicality and trade-offs of differential privacy–based certified unlearning in real-world scenarios.
3. Valuable negative result: The authors provide a rare but important negative finding, showing through systematic experiments that current DP-unlearning methods fail to outperform retraining. This insight is valuable for guiding future research toward more effective unlearning approaches.
1. The title — "Should We Forget About Certified Unlearning? Evaluating the Pitfalls of Noisy Methods" — feels overly broad and somewhat misleading. The paper only examines one specific class of certified unlearning methods, namely those based on DP-SGD training followed by fine-tuning on the retain set, considering just two scenarios: training from scratch and fine-tuning a pretrained model. However, other classes of certified or exact unlearning methods exist—such as those using Newton-based updates with additive noise—which are not explored here. In addition, approaches like [a] and [b] that handle the case without any retain data, focusing solely on the forget set, represent fundamentally different settings where retraining is infeasible. Therefore, the claim implied by the title that we should “forget” the entire field of certified unlearning appears too strong given the limited scope of evaluation.
[a] Fast Yet Effective Machine Unlearning (Tarun et al.)
[b] Towards Source-Free Machine Unlearning (Ahmed et al.)
2. The empirical evaluation is limited to relatively small-scale datasets such as CIFAR-10, which makes it difficult to generalize the conclusions to large-scale scenarios where retraining from scratch is practically infeasible—such as with current large language models trained on massive data corpora. Hence, the claim that DP-based certified unlearning is not worthwhile may not hold in realistic, large-data contexts.
3. Furthermore, the analysis overlooks the fact that existing DP-SGD–based certified unlearning methods are inherently approximate and rely on relaxed assumptions. While these limitations are well known, dismissing the entire line of research risks discouraging progress. A more balanced conclusion would acknowledge that, although current methods are imperfect, continued efforts to reduce assumptions and bridge the gap between theory and practice are essential to advancing the field.
While the paper includes experiments on the Pythia-1B language model, this still represents a relatively moderate scale compared to current large foundation models. How do the authors expect their conclusions to hold for much larger models—such as multi-billion parameter LLMs—where retraining from scratch is prohibitively expensive and DP-based certified unlearning might still offer practical advantages despite its theoretical and empirical limitations?
Overall, the effort is commendable, and more such rigorous “reality-check” studies are needed to objectively assess assumptions and guide the unlearning community in the right direction. |
Moderately AI-edited |
|
Should We Forget About Certified Unlearning? Evaluating the Pitfalls of Noisy Methods |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper examines certified unlearning methods that rely on differentially private (DP) mechanisms (e.g., DP-SGD) to provide formal guarantees that models have forgotten specific data points. The authors perform an extensive empirical study across vision (CIFAR-10, ResNet-18) and language (Pythia-1B, Alpaca) tasks to test whether these methods actually outperform simple retraining from scratch (RFS). They claim that there are two key "failure modes" of DP methods:
FM1: Noisy DP training leads to suboptimal local minima that hinder subsequent fine-tuning and yield inferior utility.
FM2: Retraining (fine-tuning from a pretrained checkpoint) is efficient enough.
1. The paper is clearly written and organized, with well-labeled figures and sections, though at times it overexplains basic concepts such as DP-SGD.
2. The authors perform several experiments on both image data and text data.
3. The topic is timely given current interest in unlearning and DP.
1. The conclusions extend far beyond what the experiments can justify. Results on ResNet-18/CIFAR-10 and Pythia-1B/Alpaca cannot represent the behavior of larger or domain-specific systems where **retraining is expensive**. The authors ignore contexts where certified unlearning is actually needed (e.g., regulated deployments, multi-tenant models). Thus, the sweeping claim that “certified unlearning is not worthwhile” is unsupported by the limited evidence.
2. FM1—noisy DP training leads to suboptimal local minima—is not surprising and is already well known. Moreover, it can be mitigated by more fine-grained hyperparameter tuning. The main reason for this claim in the paper is that their hyperparameters are not "optimal." This is also evident from Figure 1, where the accuracy of fine-tuning a publicly pretrained model on CIFAR-10 is only 80%. However, DP fine-tuning on CIFAR-10 typically achieves an accuracy above 90% (cf. https://arxiv.org/abs/2204.13650). Overall, the experiments are not convincing.
3. The study treats DP noise as an intrinsic flaw rather than a "hyperparameter" and does not explore modern variants of DP-SGD (e.g., adaptive clipping, DP-LoRA, or tighter privacy accountants) that can dramatically reduce noise. These methods should have been applied in the experiments. This is further evidence that the experimental results are not convincing.
4. Unlearning effectiveness is not actually tested in this paper. No membership inference attacks are conducted. All conclusions rely solely on DP certificates and accuracy metrics, neither of which confirms that the influence of the forget set has been removed. Moreover, the forget-set size (|DF| = 1% of CIFAR-10) is arbitrary and unexplored—no scaling experiments on forget-set size are provided.
5. The key claim relies on an ambiguous notion of “not worthwhile.” However, what thresholds of cost, privacy budget, or accuracy make a method “not worthwhile”? In regulated environments, any provable guarantee—regardless of cost—may be necessary, meaning that “worthwhile” is inherently policy-dependent. Their framing also overlooks the regulatory and compliance context that motivates certified unlearning in the first place.
Overall, the paper is structured to confirm its predetermined conclusion (“certified unlearning is not worthwhile”) rather than to test it objectively. All experiments are designed to maximize RFS’s apparent advantage—using small forget sets, simple architectures, and low training costs. The DP setup is deliberately weakened and then declared ineffective. This amounts to a self-fulfilling negative result rather than an unbiased evaluation.
see the weaknesses |
Lightly AI-edited |
|
Should We Forget About Certified Unlearning? Evaluating the Pitfalls of Noisy Methods |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a systematic empirical evaluation of certified (DP-based) unlearning versus the simple “retrain-from-scratch” (RFS) baseline on vision and language tasks. Under realistic, large-model regimes the authors find that current noise-injection schemes consistently lose to RFS in cost–utility–guarantee trade-offs, identifying two failure modes: (i) random-init DP pre-training lands in poor loss basins that are hard to escape, and (ii) pretrained-init makes RFS so cheap that the upfront DP cost becomes sunk overhead. Recommendations to rethink definitions and explore non-DP guarantees are well-motivated.
1. The work provides the first reality check on whether certified unlearning actually beats the trivial retrain-from-scratch baseline. The authors recommend revisiting definitions and exploring alternative techniques.
2. The paper is overall well-structured.
1. The reviewer disagrees with the statement in FM2: "In this setting, 'retraining from scratch' means fine-tuning a pretrained model on the retain set." Regardless of the setting, fine-tuning is always a distinct unlearning method, separate from retraining from scratch (RFS). Since fine-tuning-based methods themselves lack theoretical guarantees, we could also disregard the constraints required for certified unlearning and directly apply their algorithms for unlearning.
2. The paper lacks discussion on online unlearning (i.e., handling continuous unlearning requests).
3. The paper's survey of certified unlearning is insufficient, lacking discussion of recent methods such as that in [1].
[1] Qiao, X., Zhang, M., Tang, M., & Wei, E. (2025). Hessian-Free Online Certified Unlearning. *International Conference on Learning Representations*.
I would appreciate the authors’ responses to the weaknesses outlined above. |
Lightly AI-edited |