Rethinking Regularization in Federated Learning: An Initialization Perspective
Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper re-examines the role of regularization in federated learning (FL) from an initialization perspective. Through a comparative analysis of client gradients and learned features, the authors find that FedDyn is the most effective regularization method for mitigating client drift caused by data heterogeneity. However, they argue that all practical regularization methods, including FedDyn, are imperfect approximations of an ideal scheme, leading to side effects and additional overhead that diminish their benefits in later training stages. Based on the observation that FL is less sensitive to heterogeneity when well-initialized, the paper proposes a two-stage training strategy: using FedDyn for pre-training to provide a strong initialization, followed by standard FedAvg for fine-tuning. Experiments in both cross-silo and cross-device settings demonstrate that this approach achieves faster convergence and higher accuracy compared to using regularization throughout the entire training process.
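For concreteness, the proposed strategy reduces to a schedule like the following minimal sketch; `feddyn_round`, `fedavg_round`, and the fixed switch point are my own illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of the two-stage schedule described above.
# `feddyn_round` and `fedavg_round` are hypothetical callables that each
# run one communication round of the respective algorithm and return the
# updated global model.

def two_stage_training(global_model, clients, total_rounds, switch_round,
                       feddyn_round, fedavg_round):
    for t in range(total_rounds):
        if t < switch_round:
            # Stage 1: FedDyn pre-training to mitigate client drift.
            global_model = feddyn_round(global_model, clients)
        else:
            # Stage 2: plain FedAvg fine-tuning from the strong initialization.
            global_model = fedavg_round(global_model, clients)
    return global_model
```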
* The paper provides a good observational analysis of how different regularization methods impact client gradients and learned features in federated learning.
* The writing is clear and the paper is well-structured.
* The analysis is supported by abundant experimental figures, providing a multi-faceted view of the regularization effects.
* The paper claims that regularization encourages local models to learn features that better align with the global model, but this claim is not supported by any theoretical convergence analysis.
* The "side effects" of regularization are not clearly explained. The paper asserts that the server control variate in FedDyn "does not accurately approximate the gradient of the global objective function", but the reasoning is not fully developed. Furthermore, the claim of significant "additional computational cost" is not quantified. The overhead of adding a regularization term, which is often just a vector operation, seems marginal compared to the cost of model training.
* The algorithms discussed (FedAvg, FedDyn, etc.) are all well-established. The main contribution appears to be the proposed two-stage training strategy, which is a combination of existing methods. The novelty of this contribution seems limited.
* The paper provides a formal criterion for switching from FedDyn to FedAvg, but its practical application is unclear. The criterion depends on the term Ct, which seems difficult to compute in a real experiment. How is the switching point determined in the actual experiments? Computing Ct in practice appears infeasible. A sensitivity analysis on the switching point would be beneficial; one possible plateau-based heuristic is sketched at the end of this review.
* In Section 3, why is the analysis based on the "pseudo-gradient" instead of the exact gradient?
* How should Figure 7(c) and (d) be interpreted? The proposed algorithm does not appear competitive there.
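For reference on the overhead point above: as I recall, the FedDyn local objective (Acar et al., 2021) is, up to notation,

$$\theta_k^{t} = \arg\min_{\theta}\; f_k(\theta) - \big\langle \nabla f_k(\theta_k^{t-1}),\, \theta \big\rangle + \frac{\alpha}{2}\,\big\|\theta - \theta^{t-1}\big\|^2,$$

so the extra terms cost one inner product and one squared norm over the parameters per local step, i.e., linear in model size and typically negligible next to the forward/backward passes. The claimed "additional computational cost" should be quantified against this baseline.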
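On the switching point, a minimal sketch of one practical alternative to computing Ct, a validation-plateau rule that is entirely my own suggestion and not from the paper:

```python
def should_switch(val_acc_history, patience=5, min_delta=1e-3):
    """Hypothetical heuristic: switch from FedDyn to FedAvg once global
    validation accuracy has not improved by `min_delta` for `patience`
    consecutive rounds. `val_acc_history` is a list of per-round accuracies."""
    if len(val_acc_history) <= patience:
        return False
    recent_best = max(val_acc_history[-patience:])
    earlier_best = max(val_acc_history[:-patience])
    return recent_best - earlier_best < min_delta
```

A sensitivity analysis could then sweep `patience` and `min_delta` instead of sweeping the switch round directly.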
Lightly AI-edited

Rethinking Regularization in Federated Learning: An Initialization Perspective
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper reframes the use of regularization in federated learning, arguing that its primary benefit is in the early training phase. Through novel gradient and feature-level analyses, the authors show that while FedDyn effectively reduces client drift, its benefits diminish over time and it introduces side effects. They propose a two-stage strategy: use FedDyn for pre-training to find a strong initialization, then switch to standard FedAvg for fine-tuning. Experiments show this FedDyn -> FedAvg approach often achieves superior accuracy and faster convergence than using either method alone, while also reducing computational overhead.
1. This paper shifts the focus from finding the "best" single algorithm to understanding when an algorithm is most beneficial, introducing a valuable "initialization vs. fine-tuning" paradigm.
2. This paper provides deep, quantitative insights into how regularization methods work and where they fail by considering additional metrics such as gradient diversity and feature interaction tensors (a standard definition of the former is sketched after this list).
3. The proposed two-stage strategy is simple, effective, and computationally efficient.
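For reference, a minimal sketch of gradient diversity as commonly defined (Yin et al., 2018), assuming client pseudo-gradients are available as flat vectors; the paper's exact variant may differ:

```python
import numpy as np

def gradient_diversity(client_grads):
    """Gradient diversity (Yin et al., 2018): the sum of squared per-client
    gradient norms divided by the squared norm of the summed gradient.
    Higher values indicate more dissimilar client updates (more drift).
    `client_grads` is a list of 1-D numpy arrays, one per client."""
    sum_sq_norms = sum(float(np.dot(g, g)) for g in client_grads)
    total = np.sum(client_grads, axis=0)
    return sum_sq_norms / float(np.dot(total, total))
```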
1. Lack of Theoretical Proof: The paper lacks a formal convergence proof for the proposed two-stage FedDyn -> FedAvg algorithm, relying instead on empirical results and an impractical switching criterion.
2. Switching Point: The effectiveness of the method depends critically on the switching point, which is chosen heuristically in the experiments. The paper offers no practical guidance on how to determine this crucial hyperparameter.
3. Limited Generality: The conclusion that only FedDyn serves as a good initializer is not fully explained, which limits the generality of the "regularization for initialization" principle beyond this single method.
4. This paper reads more like an experimental report than an academic paper. Its contribution and innovation seem somewhat weak for a conference like ICLR.
1. How can the optimal switching point be determined in a principled and adaptive way for new tasks?
2. What is the fundamental reason that FedDyn succeeds as an initializer while a similar method like SCAFFOLD fails?
3. Is the two-stage training principle generalizable to other combinations of FL algorithms beyond FedDyn -> FedAvg?
Heavily AI-edited

Rethinking Regularization in Federated Learning: An Initialization Perspective
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper presents a compelling argument for rethinking the role of regularization in FL. Its core thesis is that while regularization methods like FedDyn are highly effective at mitigating client drift, they are computationally expensive and their benefits diminish after the model is well-initialized. The authors' key proposal—a two-stage training strategy that uses FedDyn only for pre-training before switching to standard FedAvg—is novel, practical, and well-supported by experimental evidence from gradient and feature-learning perspectives.
1. The paper goes beyond mere accuracy plots. The analysis from the gradient perspective (diversity, cosine similarity) and the feature perspective (interaction tensor) provides a much deeper, mechanistic understanding of why FedDyn works better than other methods; a sketch of the alignment measure follows this list. This is a major strength.
2. The paper is well-structured and easy to follow.
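For concreteness, the gradient-alignment measure presumably amounts to something like the mean pairwise cosine similarity of client updates; the sketch below is my own reading, assuming flattened update vectors, and the paper's exact computation may differ:

```python
import numpy as np
from itertools import combinations

def mean_pairwise_cosine(client_updates):
    """Average cosine similarity over all pairs of client updates.
    Values near 1 suggest well-aligned clients; values near 0 suggest
    strong client drift. `client_updates` is a list of 1-D numpy arrays."""
    sims = [float(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))
            for u, v in combinations(client_updates, 2)]
    return float(np.mean(sims))
```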
1. The paper states that FedDyn has "side effects" (e.g., the server control variate inaccurately approximates the global gradient, negatively impacting features). However, this is not demonstrated as clearly as its benefits.
2. The analysis focuses heavily on FedDyn, with SCAFFOLD, MOON, and FedNTD as comparisons. While justified by FedDyn's performance, a broader discussion of why this two-stage strategy might or might not work for other state-of-the-art methods (e.g., FedProx) would strengthen the generalizability of the claim.
Please refer to weaknesses for details.
Moderately AI-edited

Rethinking Regularization in Federated Learning: An Initialization Perspective
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper points out that existing SoTA methods, such as FedDyn, may impose regularization that is too strong, impeding the model from reaching better global minima. The authors propose an intriguing trick: switch back to vanilla FedAvg. Through analyses of gradient diversity, cosine similarity, and feature clusters, this work shows improvements in a constrained setup (only Table 1 and Figure 1 seem to be serious quantitative results to me). I encourage the authors to provide additional experiments on this algorithm-switch phenomenon to better understand and strengthen their observations.
1. The algorithm-switch phenomenon is interesting, and Figure 1 is particularly convincing to me.
2. The comparison with learning rate decay is important to ensure the performance gain is not merely an artifact of the learning rate schedule.
1. Limited algorithm choice
All the main quantitative analysis focuses on FedDyn, and there are no corresponding metrics for FedNTD or SCAFFOLD. The authors should also consider more recent algorithms such as FedAlign, FedExp, FedProto, FedMD, or others. Without a thorough comparison, it is hard to demonstrate how robust, consistent, and reproducible this method is empirically. I strongly encourage the authors to include an additional quantitative table comparing at least three additional federated learning algorithms using the same setup as in Table 1.
2. What is the main contribution of algorithm-switching?
I am not fully convinced that "regularization impedes global feature learning" is the best explanation for the working mechanism of the proposed method. My intuition is that changing (or switching) the minimization objective prevents the model from getting stuck in a local minimum, because that is effectively what the proposed method does. I would appreciate it if the authors could conduct additional experiments to test this hypothesis. For example, instead of switching from FedDyn to FedAvg, can the authors try switching from FedAvg to FedDyn? Another interesting setup would be to alternate between FedDyn and FedAvg every 20 rounds and evaluate the performance; a sketch of this schedule follows.
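A minimal sketch of the alternating schedule I have in mind; the 20-round period and the helper name are my own illustration:

```python
def pick_algorithm(round_idx, period=20):
    """Alternate between FedDyn and FedAvg every `period` communication
    rounds: rounds 0-19 use FedDyn, 20-39 use FedAvg, and so on."""
    return "feddyn" if (round_idx // period) % 2 == 0 else "fedavg"

# Example: rounds 0 and 19 use FedDyn, round 20 switches to FedAvg,
# and round 40 switches back.
print([pick_algorithm(t) for t in (0, 19, 20, 40)])
# -> ['feddyn', 'feddyn', 'fedavg', 'feddyn']
```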
1. Can the authors explain why Figure 1b shows some abrupt drops in diversity while the cosine similarities in Figure 1c increase monotonically?
Fully human-written