BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper introduces a benchmark for evaluating the robustness of dataset distillation approaches against adversarial examples. The authors wrap several models, datasets, and attacks together, and provide a set of metrics to evaluate adversarial robustness. The wide range of experimental results is also collected in a publicly released leaderboard (together with the implementation code).
- The paper is clear and well written
- The addressed problem is relevant, and I think this benchmark and its codebase are very valuable for the research community and can serve as a baseline for both attacks and defenses
- The authors put a lot of effort into wrapping together models, datasets, and attacks, and ran a considerable number of experiments
- I am concerned about the contribution, as it appears weak (particularly considering this venue) from both a technical and a novelty point of view. The authors (although I recognize the hard work that has been put in) "simply" wrap together existing works, and the most novel contribution appears to be the proposed metrics (on which I have some concerns, see below). Additionally, there is a non-negligible overlap with the competing DD-RobustBench work, with only incremental improvements over it.
- I don't understand the reason for defining relative metrics (RR and AE), as the maximum ASR and AST, which serve as baselines for them, are strongly influenced by several factors (the model pool, the attack performance, which in turn depends on many parameters, etc.). Why not use absolute metrics, such as an average?
- Using GPU time to measure computational cost is not reliable, as it can be affected by multiple side effects. A more suitable measure of attacker cost is the number of model inferences and gradient computations, which excludes both the model itself and other overheads unrelated to the attack (a minimal counting sketch is given after the question below).
- I also have concerns about the attack hyperparameters. Unlike AutoAttack, which is parameter-free (and thus suitable for benchmarking different models), the other approaches require careful tuning of the hyperparameters for each attacked model (e.g., iterations, step size) to achieve the best results. For this reason, the results of those attacks might not be reliable. Moreover, using 10 iterations for PGD is unlikely to lead to convergence of the optimization.
- Could you please justify the use of relative metrics (RR and AE), especially considering that adding other models/attacks to the benchmark might lead to recomputing the entire results?
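To make the cost point concrete, here is a minimal sketch, assuming a PyTorch setup (`model` and `pgd_attack` are placeholders, not BEARD's actual API), of how forward passes and gradient computations could be counted with module hooks instead of timing GPU execution:

```python
import torch.nn as nn

class CostCounter:
    """Count forward passes and gradient computations performed on a model."""

    def __init__(self, model: nn.Module):
        self.forwards = 0
        self.backwards = 0
        model.register_forward_hook(self._on_forward)
        model.register_full_backward_hook(self._on_backward)

    def _on_forward(self, module, inputs, output):
        # One inference per sample in the batch.
        self.forwards += inputs[0].shape[0]

    def _on_backward(self, module, grad_input, grad_output):
        # One gradient computation per sample in the batch.
        self.backwards += grad_output[0].shape[0]

# Usage sketch (`model`, `pgd_attack`, `x`, `y` are hypothetical placeholders):
# counter = CostCounter(model)
# x_adv = pgd_attack(model, x, y)
# print(counter.forwards, counter.backwards)
```

Reporting these counts alongside (or instead of) wall-clock AST would make the efficiency comparison independent of hardware and implementation overheads.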
Fully human-written

BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper presents the BEARD benchmarking framework to evaluate the robustness of models trained via dataset distillation against adversarial attacks. The authors point out that existing benchmarks such as DD-RobustBench and RobustBench fail to sufficiently reflect the actual performance of dataset distillation techniques in adversarial scenarios. To address this issue, BEARD integrates multiple dataset distillation techniques, attack methods, and datasets, while introducing three new metrics: RR, AE, and CREI. These metrics build on a game-theoretic formulation to jointly evaluate attack effectiveness and efficiency. The benchmarking platform is publicly available, featuring leaderboards and curated model and dataset pools to support reproducible research. Experiments demonstrate that dataset distillation significantly enhances adversarial robustness, particularly when combined with adversarial training.
1. The paper presents the first unified benchmark for adversarial robustness in dataset distillation, introducing a novel adversarial game framework and three tailored metrics (RR, AE, CREI).
2. As dataset distillation gains traction in resource-constrained settings, understanding its robustness is critical. BEARD provides a standardized platform for comparative evaluation.
3. The paper is well-structured, with clear descriptions of the framework, metrics, and experimental setup. The public release of code and leaderboard enhances transparency and usability.
4. The benchmark covers multiple DD methods, attack types, and datasets, offering a comprehensive evaluation landscape.
1. Completeness is slightly insufficient. The paper systematically expands and deepens DD-RobustBench by introducing unified evaluation metrics, incorporating more attack types, and proposing a game-theoretic framework, yet it does not validate its claims on larger datasets, more complex architectures, or additional algorithms, and remains confined to relatively simple scenarios.
2. In Section 3.10, the CREI metric locks α at 0.5 without explanation. Giving robustness and efficiency equal weight might not suit every task; an ablation on α or a data-driven reason for this choice would help (a small sensitivity sketch is given after the questions below).
3. In Section 5.1, the claim that “dataset distillation improves adversarial robustness” is counter-intuitive and lacks a mechanistic explanation. The observed CREI drop with increasing IPC is noted but not interpreted. Include a discussion on why distilled datasets may enhance robustness (e.g., whether they filter out non-robust features or reduce overfitting), and analyze the IPC–robustness trade-off more deeply.
4. In Section 5.1, the performance differences among DD methods (e.g., why DSA/DM/BACON perform better) are reported but not explained. The analysis remains descriptive. Provide hypotheses or further experiments (e.g., feature analysis, robustness curvature) to explain why certain methods excel.
5. In Figures 4 and 5, the paper claims that “distilled datasets improve adversarial robustness,” a conclusion that runs counter to intuition (smaller datasets are usually expected to yield more fragile models), yet no convincing explanation is provided. Discuss the interaction between dataset scale, distillation method, and adversarial training to provide more actionable insights.
1. Why was α=0.5 chosen for CREI? Have you experimented with other values, and how sensitive are the rankings to this parameter?
2. Can you provide a deeper explanation for why some DD methods (e.g., DSA, DM, BACON) exhibit stronger adversarial robustness? Is it related to their distillation objectives or synthetic data diversity?
3. The conclusion that “distillation improves robustness” contradicts the common belief that smaller datasets lead to weaker models. Can you discuss potential reasons for this phenomenon?
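A small sketch of the ablation suggested in weakness 2 and question 1, under the working assumption that CREI is a convex combination of RR and AE (the paper's exact Section 3.10 definition may differ); the per-method scores below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical RR / AE scores per method; replace with the benchmark's real values.
methods = ["DC", "DSA", "DM", "MTT", "IDM", "BACON"]
rr = np.array([0.31, 0.42, 0.40, 0.35, 0.38, 0.43])  # Robustness Ratio
ae = np.array([0.55, 0.48, 0.52, 0.50, 0.47, 0.51])  # Attack Efficiency

def crei(alpha: float) -> np.ndarray:
    # Working assumption: CREI is a convex combination of RR and AE.
    return alpha * rr + (1.0 - alpha) * ae

for alpha in (0.1, 0.3, 0.5, 0.7, 0.9):
    tau, _ = kendalltau(crei(0.5), crei(alpha))
    order = [methods[i] for i in np.argsort(-crei(alpha))]
    print(f"alpha={alpha:.1f}  ranking={order}  Kendall tau vs alpha=0.5: {tau:.2f}")
```

If the ranking stays essentially unchanged (Kendall's τ close to 1) across α, fixing α = 0.5 is defensible; otherwise the leaderboard should report this sensitivity.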
Fully AI-generated

BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper introduces BEARD, an open and unified benchmark designed to systematically evaluate the adversarial robustness of models trained via dataset distillation methods. The benchmark covers multiple DD algorithms, adversarial attacks, and widely used image datasets. The authors formalize an adversarial game framework and employ three key evaluation metrics: the Robustness Ratio (RR), the Attack Efficiency Ratio (AE), and the Comprehensive Robustness-Efficiency Index (CREI).
1. The code, leaderboard, and data pools are open-sourced, which can help facilitate future research.
2. The adversarial game formalism is thoughtfully articulated.
1. The empirical results do not directly benchmark against some newer strategies for adversarial training (e.g., [1]), adversarial attacks (transformation-based attacks [2] and generative approaches [3]), or other widely used datasets (e.g., CINIC-10, ImageNet, and MNIST).
2. Section 5 reports trends but lacks deeper causal explanations (e.g., why DM improves CREI).
3. Section 3 introduces too many mathematical definitions, but provides limited experimental interpretation or discussion later.
4. Typos and grammatical errors: 1) "DISCUSSION THE DIFFERENCES BETWEEN BEARD AND OTHER BENCHMARKS" (B.5) should be "THE DIFFERENCES BETWEEN BEARD AND OTHER BENCHMARKS"; 2) missing space between "IDM" and "BACON" in Figure 3; 3) "TinyImageNet" and "Tiny-ImageNet" should be used consistently.
5. From the current description, BEARD appears conceptually similar to DD-RobustBench in both purpose and experimental scope, though the authors claim they provide a more holistic assessment. However, RRM does not provide substantial novelty beyond existing robustness evaluation metrics, and it would be easy to integrate the target settings into DD-RobustBench. Furthermore, the DD methods and attack methods provided in BEARD are also limited. The paper reads more like an engineering consolidation than a fundamentally new contribution. I am not sure I understand it correctly.
[1] Yang, Zhuolin, et al. "TRS: Transferability Reduced Ensemble via Promoting Gradient Diversity and Model Smoothness." Advances in Neural Information Processing Systems 34 (2021): 17642-17655.
[2] Yun, Zebin, et al. "The Ultimate Combo: Boosting Adversarial Example Transferability by Composing Data Augmentations." Proceedings of the 2024 Workshop on Artificial Intelligence and Security. 2024.
[3] Wei, Zhipeng, et al. "Enhancing the Self-Universality for Transferable Targeted Attacks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
What is the meaning of $|\epsilon|$? Why do the authors use $\epsilon = 8/255$ and $|\epsilon| = 8/255$ interchangeably?
Fully human-written

BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces BEARD, a unified benchmark for evaluating the adversarial robustness of models trained on distilled datasets. It proposes an adversarial game framework and three new metrics (RR, AE, CREI) to systematically assess robustness across multiple datasets, distillation methods, and attack strategies. The benchmark includes a dataset pool, model pool, and leaderboard, with extensive experiments showing that dataset distillation can improve robustness, especially when combined with adversarial training.
- Novel and well-defined evaluation framework and metrics.
- Comprehensive experiments across datasets, methods, and attacks.
- Open-source code and leaderboard support reproducibility and community adoption.
- Limited to image classification; does not cover other modalities.
- Does not include very large-scale datasets like ImageNet.
- Some results (e.g., robustness improvements) are not thoroughly analyzed or compared to non-distilled baselines.
- Some metrics (e.g., AST) may be sensitive to hardware and implementation details.
1. Why does dataset distillation often improve adversarial robustness? Is it due to implicit regularization or reduced capacity?
2. How does BEARD compare to training on random subsets of the original data? (A sketch of such a baseline is given after these questions.)
3. Could the benchmark include more recent distillation methods (e.g., SRe2L, CAFE)?
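A minimal sketch of the random-subset control suggested in question 2, assuming a PyTorch/torchvision setup (the dataset, IPC value, and downstream training loop are placeholders, not the paper's configuration). The idea is to match the distilled set's images-per-class budget so that the effect of distillation can be separated from the effect of dataset size:

```python
import random
from collections import defaultdict

from torch.utils.data import Subset

def random_subset_with_ipc(dataset, ipc: int, seed: int = 0) -> Subset:
    """Pick `ipc` random examples per class, matching a distilled set's budget."""
    rng = random.Random(seed)
    per_class = defaultdict(list)
    for idx, (_, label) in enumerate(dataset):
        per_class[label].append(idx)
    chosen = [i for idxs in per_class.values() for i in rng.sample(idxs, ipc)]
    return Subset(dataset, chosen)

# Usage sketch: train the same architecture on this subset and on the distilled
# set, then run the identical attack suite on both.
# import torchvision
# cifar = torchvision.datasets.CIFAR10(root="data", train=True, download=True)
# baseline = random_subset_with_ipc(cifar, ipc=10)
```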
Fully AI-generated |