|
Trade-off in Estimating the Number of Byzantine Clients in Federated Learning |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 4: excellent
Rating: 6: marginally above the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper provides a systematic theoretical analysis of the impact of estimating the number of Byzantine clients in Federated Learning (FL). The authors consider a scenario where an aggregator is chosen with a robustness parameter $\hat{f}$ (the estimated maximum number of Byzantine clients it can tolerate), while the true number is $f$. Their key contributions are:
Underestimation Risk: They rigorously prove that underestimation ($\hat{f} < f$) can lead to arbitrarily poor performance and divergence, even under favorable conditions like the Polyak-Łojasiewicz (PL) inequality.
Minimax Optimal Bounds for Non-Underestimation: For the non-underestimation case ($\hat{f} \geq f$), they establish matching lower and upper bounds (i.e., minimax optimal rates) for both the aggregation error and the convergence rate of the Federated Robust Averaging (FedRo) algorithm. These bounds are proportional to $\frac{\hat{f}}{n - f - \hat{f}}$.
Fundamental Trade-off: This bound reveals a fundamental trade-off: while an aggregator with a larger $\hat{f}$ can tolerate a wider range of attacks (any $f \leq \hat{f}$), its performance deteriorates when the actual number of attackers $f$ is small, as the error bound increases with $\hat{f}$.
Optimal Algorithm: They propose a novel composite aggregator ($\hat{f}$-Krum $\circ$ $\hat{f}$-NNM) that is proven to achieve the order-optimal upper bound without prior knowledge of the true $f$.
Empirical Validation: Experiments on CIFAR-10 validate the theoretical trade-off, showing performance collapse when $\hat{f} < f$ and performance degradation as $\hat{f}$ increases beyond $f$.
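For concreteness, the composite aggregator can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the neighbor counts ($n - \hat{f}$ neighbors for NNM, $n - \hat{f} - 2$ closest scores for Krum) follow the standard definitions from the literature and may differ from the paper in small details.

```python
import numpy as np

def nnm(updates, f_hat):
    """Nearest Neighbor Mixing: replace each update by the mean of its
    n - f_hat nearest neighbors (each point counts itself, distance 0)."""
    n = len(updates)
    dists = np.linalg.norm(updates[:, None, :] - updates[None, :, :], axis=-1)
    mixed = np.empty_like(updates)
    for i in range(n):
        neighbors = np.argsort(dists[i])[: n - f_hat]
        mixed[i] = updates[neighbors].mean(axis=0)
    return mixed

def krum(updates, f_hat):
    """Krum: return the update with the smallest sum of squared distances
    to its n - f_hat - 2 closest other updates."""
    n = len(updates)
    sq_dists = np.linalg.norm(updates[:, None, :] - updates[None, :, :], axis=-1) ** 2
    scores = []
    for i in range(n):
        d = np.sort(np.delete(sq_dists[i], i))
        scores.append(d[: n - f_hat - 2].sum())
    return updates[int(np.argmin(scores))]

def krum_nnm(updates, f_hat):
    """Composite aggregator: f_hat-Krum applied on top of f_hat-NNM."""
    updates = np.asarray(updates, dtype=float)
    return krum(nnm(updates, f_hat), f_hat)
```

The $O(n^2 d)$ cost mentioned above is visible here: both stages materialize the full pairwise-distance computation over $d$-dimensional updates.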
Novel Problem Formulation: Systematically studying the effect of the estimated number of Byzantine clients ($\hat{f}$) is a highly original and important direction.
Theoretical Completeness: The paper provides a complete minimax analysis, with tight lower and upper bounds for both aggregation error and algorithm convergence, under both underestimation ($\hat{f} < f$) and non-underestimation ($\hat{f} \geq f$) scenarios. The bounds, proportional to $\frac{\hat{f}}{n-f-\hat{f}}$, are rigorously derived.
Practical Relevance: The theoretical trade-off is clearly demonstrated and validated through experiments on a standard benchmark (CIFAR-10), connecting theory with practice.
Algorithmic Insight: The analysis of the composite aggregator ($\hat{f}$-Krum $\circ$ $\hat{f}$-NNM) provides a constructive method to achieve the order-optimal bound.
Computational Complexity of Optimal Aggregator: While the composite aggregator ($\hat{f}$-Krum $\circ$ $\hat{f}$-NNM) is theoretically optimal, it is computationally expensive. The per-round cost involves nearest neighbor searches for all clients, which may be prohibitive for very large-scale systems or high-dimensional models. The paper does not discuss efficient approximations or the practical scalability of this aggregator.
Limited Empirical Scope: The experiments, while supportive, are limited to one dataset (CIFAR-10), one model (ResNet-20), and one type of attack (Gaussian noise). Broader experimentation with more datasets (e.g., CIFAR-100, FEMNIST), model architectures, and sophisticated Byzantine attacks (e.g., label-flipping, backdoor) would strengthen the empirical validation.
Assumption of PL Condition: The convergence analysis for the upper bound (Theorem 5) relies on the Polyak-Łojasiewicz (PL) condition to achieve a global convergence rate. While the PL condition holds for some machine learning problems, it is a relatively strong assumption. Discussing the plausibility of this assumption in FL settings or exploring convergence under weaker conditions (e.g., just smoothness) would be valuable.
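For reference, the PL condition requires that the global loss $F$ with minimum value $F^\star$ satisfies, for some $\mu > 0$,
$$
\frac{1}{2}\,\|\nabla F(x)\|^2 \;\ge\; \mu\,\big(F(x) - F^\star\big) \quad \text{for all } x,
$$
which holds for strongly convex objectives and for certain over-parameterized models, but is generally hard to verify for deep networks trained on heterogeneous client data.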
The theoretically optimal aggregator ($\hat{f}$-Krum $\circ$ $\hat{f}$-NNM) appears computationally heavy for large $n$ and $d$, involving $O(n^2 d)$ operations per round. What are your thoughts on developing more computationally efficient aggregators (e.g., using approximate nearest neighbor search or sampling) that can still preserve, or nearly preserve, the theoretical guarantees? Would such approximations fit into your current theoretical framework?
Your experiments use a simple Gaussian noise attack. How do you expect your theoretical findings and the observed trade-off to hold under more complex and adaptive Byzantine attacks, such as those designed to mimic honest updates or perform model poisoning? Could such attacks potentially alter the established lower or upper bounds, for example, by affecting the heterogeneity constant $G^2$?
The convergence upper bound (Theorem 5) is derived under the PL condition, leading to an asymptotic error floor of $\mathcal{O}\big(\frac{\hat{f}G^2}{n-f-\hat{f}}\big)$. Can you provide any intuition or discussion on whether the core trade-off—that the asymptotic error floor scales with $\frac{\hat{f}}{n-f-\hat{f}}$—would still hold in non-PL settings, perhaps under different or weaker assumptions? |
Fully AI-generated |
|
Trade-off in Estimating the Number of Byzantine Clients in Federated Learning |
Soundness: 2: fair
Presentation: 3: good
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper investigates the effect of underestimation and overestimation of the number of Byzantine clients and highlights the importance of accurately estimating the Byzantine size. The authors theoretically prove and empirically validate that underestimation can lead to divergence under Byzantine attacks, while overestimation results in a trade-off: a larger estimated Byzantine size enhances robustness under Byzantine attacks but degrades performance when there are no or fewer Byzantine clients.
1. The investigation of the impact of under- and over-estimating the number of Byzantine clients in the context of Byzantine attacks is a relevant and important topic for the field of Byzantine-robust federated learning.
2. This paper is well-written, clear, and theoretically sound.
1. The theoretical results in Theorems 1 and 3 do not fully support the conclusion that underestimation leads to arbitrarily large aggregation and convergence errors, as they apply only to a specific robust aggregator, not all such aggregators. If a different robust aggregator is used, the effect of underestimation may not hold. This represents a notable limitation of the paper. Can the authors generalize the results in Theorems 1 and 3 to all robust aggregators to address this issue?
2. There is a discrepancy between the theoretical analysis and the empirical results, as the former is conducted in a deterministic setting, while the latter is based on a stochastic setting. This gap is significant, as stochastic noise can substantially affect the performance of FedRo and lead to different convergence properties. Can the authors extend the analysis to address this gap?
3. In Table 1, the robustness coefficient $\kappa$ for the combination of $\hat{f}$-Krum and $\hat{f}$-NNM matches the lower bound in order. Why do the authors not use this aggregator in the experiments? The reviewer would like to see results for $\hat{f}$-Krum $\circ$ $\hat{f}$-NNM in the experiments.
4. In Line 48, the authors claim that "Existing works typically require estimating the actual number $f$ or the fraction of the Byzantine clients to select the maximum number $\hat{f}$ that the aggregator can tolerate." To my knowledge, the tolerable maximum number of Byzantine clients (i.e., the breakdown point) for some robust aggregators is independent of the actual number $f$ or fraction of Byzantine clients, meaning that these terms do not need to be estimated in advance for such aggregators. For instance, the breakdown point is $\frac{n}{2}$ for CWMed, $\frac{n-2}{2}$ for CWTM, and $\frac{n}{10}$ for centered clipping. The authors should revise this statement to make it more precise and rigorous.
5. In line 237, the authors claim that "since $\kappa$ of the commonly used aggregators like GM, CWTM, CWMed, and Krum do not match the lower bound even in the simple special case of $f = \hat{f}$." This statement is incorrect. As noted in Remark 2 of Allouah et al. (2023), CWTM does match the lower bound in order. The reviewer suggests that the authors carefully verify this statement.
6. There are several typos in this paper. For instance, in line 225, $\frac{\hat{f}}{n - 2f}$ should be $\frac{\hat{f}}{n - 2\hat{f}}$, and in line 226, $\frac{\hat{f}}{n - 2\hat{f}}$ should be $\frac{\hat{f}}{n - \hat{f}}$. The authors are advised to carefully review the paper to address these and prevent any similar issues.
My detailed questions are outlined in the section above; please refer to them. If the authors can fully address my concerns, I will consider adjusting my score accordingly. |
Lightly AI-edited |
|
Trade-off in Estimating the Number of Byzantine Clients in Federated Learning |
Soundness: 4: excellent
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
The paper studies Byzantine-robust federated learning when the server only knows an estimate \(\hat f\) of the number of Byzantine clients (the true number is \(f\)).
Main takeaways: (i) if we underestimate (\(\hat f < f\)), aggregation / training can become arbitrarily bad; (ii) if we do not underestimate (\(\hat f \ge f\)), the best-possible error (and convergence floor) scales like
\[
\kappa = \Theta\!\left(\frac{\hat f}{\,n - f - \hat f\,}\right),
\]
with matching lower/upper bounds and a simple construction that achieves this rate up to constants.
- Clear split between two regimes: underestimation can be catastrophic; otherwise there is a tight, quantified floor.
- Matching lower and upper bounds.
- The dependence on both \(f\) and \(\hat f\) is explicit, which clarifies the cost of being conservative.
- The underestimation impossibility is not really new: it is closely tied to the **breakdown point** in robust statistics.
If the true contamination $f/n$ exceeds the method's breakdown (here essentially $\hat f/n$), you can get unbounded error.
So this part feels more like transferring a known concept, rather than introducing a new phenomenon.
- Novelty in the *order* of the bound is modest. Away from the edges, it matches the trivial (previously known) bound $\frac{\hat f}{n - 2\hat f}$ up to constants.
Precisely, for a constant $c \in (0, \tfrac{1}{2})$ such that
$$
\hat f \le \big(\tfrac{1}{2} - c\big)\,n
\quad \text{and} \quad
f + \hat f \le (1 - c)\,n,
$$
both denominators are $\Theta(n)$, and
$$
\frac{\hat f/(n - f - \hat f)}{\hat f/(n - 2\hat f)}
\;=\; \frac{n - 2\hat f}{\,n - f - \hat f\,}
\;=\; \frac{1}{\,1 + \frac{\hat f - f}{\,n - 2\hat f\,}\,}.
$$
This ratio stays bounded between two positive constants (and $<1$ when $\hat f > f$),
so the two expressions are of the *same order*.
- They only separate near the **edges**:
- $f + \hat f \to n$: the paper’s denominator $n - f - \hat f$ collapses $\Rightarrow$ larger floor than the trivial count.
- $\hat f \to n/2$: the trivial denominator $n - 2\hat f$ collapses $\Rightarrow$ trivial bound blows up sooner.
- **Underestimation** $\hat f < f$: the paper (consistent with breakdown) shows no finite guarantee, while the trivial bound misleadingly remains finite.
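  The separation (and the lack of it away from the edges) is easy to check numerically; a toy comparison with illustrative values at $n = 100$:

  ```python
  def paper_bound(n, f, f_hat):
      """Error floor established in the paper: f_hat / (n - f - f_hat)."""
      return f_hat / (n - f - f_hat)

  def trivial_bound(n, f_hat):
      """Trivial (previously known) bound: f_hat / (n - 2 * f_hat)."""
      return f_hat / (n - 2 * f_hat)

  n = 100

  # Away from the edges: same order (ratio bounded between constants).
  ratio = paper_bound(n, 10, 20) / trivial_bound(n, 20)
  print(ratio)  # (n - 2*f_hat) / (n - f - f_hat) = 60/70, about 0.857

  # Edge f + f_hat -> n: the paper's floor blows up first.
  print(paper_bound(n, 60, 35), trivial_bound(n, 35))  # 7.0 vs about 1.17

  # Edge f_hat -> n/2: the trivial bound blows up first.
  print(paper_bound(n, 40, 49), trivial_bound(n, 49))  # about 4.45 vs 24.5
  ```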
- Experiments could be broader; I appreciate the paper is from a theoretical nature so this is a minor issue for me.
- Since the underestimation result follows the breakdown-point logic, what is the extra conceptual significance here?
see weaknesses |
Heavily AI-edited |