Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rankness
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Matrix completion by deep linear networks is an important model problem for understanding a variety of phenomena in the training of deep neural networks. It was previously observed empirically that such models have a tendency to discover low-rank solutions. In the case of two-layer networks, this was shown to occur if and only if the observations are "connected" in a suitable sense.
The reviewed work attempts to extend this understanding to deeper networks (which are known to produce low-rank solutions even for disconnected observations). As a starting point, it uses the fact that the characterization of low-rankness through connected observations rests on the observation that disconnected observations lead to decoupled gradient dynamics across different degrees of freedom.
To this end, the reviewed work defines a notion of "coupled gradient flow dynamics" and shows that, under diagonal (and thus disconnected) observations and a structured initialization scheme, the training dynamics are coupled if and only if the initialization is diagonal. It implicitly characterizes the singular values of the converged matrices in this case.
Finally, the authors aim to elucidate the loss of plasticity phenomenon by showing that fast convergence of a pretrained network prevents significant decay of the stable rank.
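For concreteness, the low-rank mechanism described above is easy to reproduce. The following is my own minimal sketch (dimensions, step size, and initialization scale are illustrative, not the authors' setup), comparing depths 2 and 3 on a diagonally observed 2x2 matrix:

```python
# Minimal sketch: gradient descent on a depth-L linear factorization
# W = W_1 W_2 ... W_L for completion of a 2x2 matrix observed only on
# the diagonal (a disconnected observation pattern).
import numpy as np

def mdot(mats, d):
    # product of a (possibly empty) list of d x d matrices
    out = np.eye(d)
    for M in mats:
        out = out @ M
    return out

def train(L, d=2, lr=0.1, steps=50000, init_scale=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    target = np.diag([1.0, 2.0])      # observed diagonal entries
    mask = np.eye(d, dtype=bool)      # diagonal observation pattern
    Ws = [init_scale * rng.standard_normal((d, d)) for _ in range(L)]
    for _ in range(steps):
        resid = np.where(mask, mdot(Ws, d) - target, 0.0)  # masked residual
        grads = [mdot(Ws[:j], d).T @ resid @ mdot(Ws[j + 1:], d).T
                 for j in range(L)]
        for W, g in zip(Ws, grads):
            W -= lr * g
    return mdot(Ws, d)

for L in (2, 3):
    print(L, np.linalg.svd(train(L), compute_uv=False).round(3))
# Per the paper's claims, depth 3 should approach a (near) rank-1
# completion, while depth 2 leaves the off-diagonal degrees of freedom
# decoupled under this disconnected pattern.
```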
- This work addresses the important and thorny question of implicit bias in deep learning
- Generalizes the independent-observations mechanism beyond the two-layer case
- Provides a potentially useful perspective on the plasticity phenomenon
Perhaps due to the challenging nature of the questions studied, the connection between the highly restricted theoretical setting and the claims about the behavior of the real algorithm (deep learning, or even just deep matrix completion) is somewhat tenuous at times. I am especially doubtful about the additional insight offered by the implicit characterization (8), (9). If it is possible to derive additional insight from it, rather than just solving it numerically, this would significantly strengthen the paper.
Typos:
Line 306 "of of"
1. You solve the equations (8), (9) numerically. Can you show that this system has a unique solution? If not, how do you know that the solutions of this system of equations characterize the training behavior? (A multistart check along the lines of the sketch below is what I have in mind.)
2. On a related note, what is the benefit of solving (8), (9) over directly observing the training trajectories under the (fairly restrictive) conditions where Theorem 3.3 holds?
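For question 1, the kind of check I have in mind is a multistart root search. The residual below is a made-up stand-in for (8), (9) (I do not reproduce the paper's actual equations); only the procedure matters:

```python
# Multistart probe for uniqueness of an implicit system. F is a stand-in
# with the flavor of (8)-(9) (coupled equations in limiting singular
# values); substitute the paper's actual equations.
import numpy as np
from scipy.optimize import fsolve

def F(s):
    s1, s2 = s
    return [s1**3 - s1 * s2 - 1.0,   # hypothetical placeholder equations
            s2**3 - s1 * s2 - 2.0]

roots = set()
for seed in range(200):
    x0 = np.random.default_rng(seed).uniform(0.0, 5.0, size=2)
    sol, _, ok, _ = fsolve(F, x0, full_output=True)
    if ok == 1 and np.allclose(F(sol), 0.0, atol=1e-8):
        roots.add(tuple(np.round(sol, 6)))
print(roots)  # more than one root would undermine the characterization
```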
Fully human-written
---
Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rankness
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper uses deep linear matrix factorization as a controlled setting to explain why depth strengthens the implicit bias toward low rank. It identifies coupled vs. decoupled training dynamics as the underlying mechanism: in depth-2, coupling depends on observation-graph connectivity; in depth $\ge3$, coupling arises generically and drives stronger rank-1 attraction. Using diagonal observations and a family of initializations, the authors derive limiting singular-value formulas and connect the mechanism to loss of plasticity (LoP) under warm starts.
1. A clear, mechanism-first account that unifies several scattered observations about low-rank bias and depth.
2. Nontrivial analytical traction via diagonal observations and limiting SVD characterizations, with numerics that mirror the theory.
3. Bridges the mechanism to LoP with formal statements (stable-rank lower bounds) rather than just empirical anecdotes.
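For reference, the stable rank in those bounds is $\mathrm{srank}(A) = \|A\|_F^2 / \|A\|_2^2$; a quick numerical check of its behavior (my own snippet):

```python
# Stable rank: srank(A) = ||A||_F^2 / ||A||_2^2 (squared Frobenius norm
# over squared spectral norm); it equals 1 exactly for rank-1 matrices.
import numpy as np

def stable_rank(A):
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s**2) / s[0]**2)

print(stable_rank(np.diag([1.0, 2.0])))               # 1.25
print(stable_rank(np.outer([1.0, 2.0], [3.0, 4.0])))  # ~1.0 (rank 1)
```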
1. Scope of formalism: The most rigorous theorems rely on diagonal or highly structured observation patterns and gradient-flow analysis. This leaves a substantial gap to practical regimes, for example finite-step SGD with noise/momentum and unstructured sparsity. Without non-asymptotic bounds in these regimes, it’s unclear how predictive the theory is for typical training runs.
2. Initialization dependence: Several results hinge on a specific initialization family that tunes coupling. While the intuition is compelling, robustness across standard random inits is argued more heuristically than proved. This creates ambiguity about when the rank-1 bias reliably manifests.
3. Transfer to nonlinear/realistic setups: The LoP story is precise for depth-2 linear models and toy expansions of the observation set. Generalization to deep nonlinear networks, noisy completion, and rectangular cases is suggested but not established.
See weaknesses above.
Fully AI-generated
---
Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rankness
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper studies implicit low-rank bias in deep linear matrix factorization for matrix completion and connects it to loss of plasticity when warm-starting from partially trained solutions. The core thesis is that while $L=2$ models exhibit coupling only when the observation pattern connects entries, depth-$\ge 3$ architectures induce structural coupling regardless of observation connectivity, yielding a stronger implicit bias toward low rank. The paper provides theoretical results on the coupled/decoupled regimes and on plasticity loss under warm starts, supported by toy experiments.
* Provides an intuitive coupled vs. decoupled framework explaining how depth alone can produce low-rank bias.
* Theoretical results are mathematically careful, with complete proofs and well-presented figures.
* The paper connects depth-induced bias to plasticity loss, an emerging topic in optimization and continual learning.
* Writing and presentation are clear and well-formatted.
The main theoretical contributions rely heavily on diagonal observation structures and specific initializations, which limits the generalizability of the findings; it is also unclear how the results extend to generic or noisy data patterns. Much of the clean spectral characterization is confined to this stylized regime. Additionally, the finding that deeper (linear) networks inherently exhibit a stronger low-rank bias than shallow networks is not entirely novel, though I appreciate that the novelty here is the coupling mechanism + solvable diagonal case + plasticity proofs. For other weaknesses and more details, see the questions below.
* Would it be possible to go beyond diagonals? Can the authors extend the analysis/conclusions to non-diagonal sparse patterns (even banded or block-diagonal) to demonstrate the generality of the depth-induced bias? The main theorems only seem to cover an extremely stylized regime, much narrower than generic matrix completion settings, so the conclusions currently hinge on that regime.
* How robust are the results to generic random initializations and to noisy/unequal diagonal entries? Can the authors provide formal continuity/stability arguments for these cases?
* Given that many works nowadays, both in theoretical analysis and empirical methods, focus on adaptive optimizers (e.g., Adam and its variants), I wonder how these findings change under such optimizers. Many works ([1], [2], to name just a few, with [1] also focusing on matrix completion) have explored the implicit bias of these optimizers, some focusing on how adaptive optimizers provide preconditioning in tasks/testbeds similar to those of this paper.
* Can the related works section be made clearer in distinguishing the implicit regularization/bias of first-order optimizers from that of adaptive and second-order optimizers? Works focusing on the implicit bias of adaptive optimizers (beyond just GD), such as [1] and [2], are worth contextualizing (the paper already cites "The implicit bias of adam on separable data", but it is grouped with several other works under a generic discussion of implicit regularization for GD); this would better frame the paper's relevance and contribution while improving the flow. Alternatively, the related work could be structured by testbed/problem (e.g., compressed sensing, matrix factorization). I am also not sure why the related works section is in the appendix.
* Adding optimizer ablations (Adam/Adagrad/SGD) would clarify whether depth-induced coupling is robust across optimizers (a minimal version of such an ablation is sketched after the references below). For instance, the core results (Prop. 3.2, Thm. 3.3; Thms. 4.2/4.3) are proved under gradient flow, but the practical claims and CNN experiments implicitly target regimes where training with Adam etc. is more standard. It is also unclear whether the depth-induced coupling and the ensuing low-rank spectra hold beyond the GF/SGD limit. I view this optimizer-dependence gap as a key reason to temper the claims of practical significance (despite the practical tests using CNNs etc.).
* For Figs. 10-13 in the appendix (the CNN/ResNet experiments), could the authors detail how convolutional kernels are reshaped, state which layers are measured, and provide seeds/variances/CIs to support the trends?
* The plasticity theorems are insightful but tend to be confined to toy regimes.
[1] "Combining Explicit and Implicit Regularization for Efficient Learning in Deep Networks" Zhao et al. (2022)
[2] "On the Implicit Bias of Adam" Cattaneo et al. (2024) |
Fully human-written
---
Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rankness
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
This paper explores the use of deep matrix factorization for matrix completion tasks, with a focus on how network depth influences training dynamics and the implicit bias of neural networks. Building on the prior work of Bai et al. (2024), the paper emphasizes that coupled dynamics are the core mechanism behind the low-rank bias. For deep networks, the dynamics are coupled, meaning that even if the observed entries are decoupled, the training process strongly tends toward a low-rank solution. This bias arises from the inherent interdependence between layers in deep factorizations, which naturally favors low-rank solutions regardless of the observation pattern. The paper also discusses the phenomenon of plasticity loss in matrix completion tasks, where models struggle to adapt to new information after initial training, a problem particularly evident in shallow networks. The paper shows that deep models mitigate plasticity loss through their low-rank bias, enabling them to maintain good adaptability even when trained on limited data.
- This paper is well-written and has a clear overall structure, making it relatively easy to read.
- The paper addresses a key issue in matrix completion, specifically the role of network depth in shaping training dynamics and the implicit bias toward low-rank solutions. This focus provides valuable insights into how deeper networks perform better in this context.
- The paper builds on and extends the work of Bai et al. (2024), with an emphasis on the coupling dynamics and their relation to low-rank solutions. The paper does well to highlight the phenomenon of low-rank bias in deep networks, showing how deeper networks (L ≥ 3) exhibit stronger tendencies to converge to low-rank solutions, even when trained with limited data. The theoretical framework surrounding coupled and decoupled dynamics is a solid foundation for exploring the behavior of deep networks in matrix completion tasks.
- The analysis of plasticity loss, especially the argument that deeper networks avoid this issue better than shallower networks, provides new insights to this phenomenon.
- While the paper's theoretical results are solid, some claims feel weak or underexplored. For instance, Theorem 3.1 focuses on a very specific case (2x2 matrices), and while it provides some insight, it lacks generality. Additionally, Theorem 3.3 reduces to an implicit equation that is not analyzed in depth.
- Although numerical experiments (e.g., Figure 2) are used to validate the low-rank results, the theoretical argumentation feels limited. The paper would benefit from a more detailed analysis of how the implicit equation correlates with low-rank solutions, particularly for deeper networks (L ≥ 3).
- The definitions of "couple" and "decouple" (in the context of training dynamics) are important for understanding the paper’s core argument, but they have been discussed in similar terms in Bai et al. (2024) (Definition A.5). The paper would benefit from clearer citations to this previous work to avoid redundancy and ensure clarity.
- The discussion of special initializations, especially in Eq. (7), is quite restrictive and doesn't align well with practical network initialization strategies, which are typically random. This narrow focus on a specific initialization limits the applicability of the results.
1. Theorem 3.1 seems to focus on a very specific case (2x2 matrices), and while it provides a detailed characterization, it is weak overall. Additionally, I believe Theorem 2 of Bai et al. (2024) offers a more general result that already characterizes the explicit relationship between initialization scale and alignment. Could the authors clarify what Theorem 3.1 adds over this more general result?
2. The definitions of couple and decouple seem to have similar characterizations in Bai et al. (2024) (Definition A.5). Would it be possible to cite this source to avoid redundancy and clarify the connection?
3. Why does Eq. (7) consider such a special initialization? In practice, we usually consider random initializations rather than the specific initialization used here. Could the authors provide a justification for using this specific initialization in their analysis?
4. What is the relationship between the implicit equation and low rank? Why does satisfying the implicit equation imply a low-rank solution? Could the authors analyze how this implicit equation precisely corresponds to low rank, especially for the case $L = 3$? I think this is the core argument of the paper (that coupling leads to low-rank solutions), but Theorem 3.3 reduces it to an implicit equation without further analysis; the low-rank results are instead verified through numerical experiments (e.g., Figure 2), which weakens the theoretical contribution. Could the authors provide a more detailed theoretical explanation of this?
5. Regarding Line 306: I think the argument that increasing depth promotes low rank might have a trivial reason behind it. For example, in the depth-2 case the origin is a strict saddle point, and a very small initialization is required for the escape time to be long enough to enable alignment. In deeper networks, however, the origin is no longer a strict saddle point, and the escape time increases significantly (see the balanced scalar sketch after these questions). This suggests that stacking more layers inherently acts like a smaller initialization scale, so comparing ranks under the same initialization scale might not be fair.
6. How do the authors argue the differences between matrix completion tasks and real-world neural network tasks in terms of data characteristics? Real neural networks typically sample continuous data, which is almost surely connected. In contrast, matrix completion tasks involve discrete sampling. Is this discreteness unique to matrix completion tasks?
7. In the loss-of-plasticity paper (Kleinman et al., 2024), the authors compare networks of different depths and find that deeper networks show more pronounced phenomena at critical learning stages (Fig. 1). This seems to contradict the claim that 3-layer networks exhibit more coupling than 2-layer networks, leading to low rank early on and thus avoiding plasticity loss. Could the authors discuss this discrepancy further?
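For question 5, here is the balanced scalar calculation I have in mind (my own sketch, not from the paper): take all layers equal, $w_1 = \dots = w_L = w$, with scalar target $y > 0$, so the loss is $\ell(w) = \tfrac{1}{2}(y - w^L)^2$ and gradient flow per layer gives
$$\dot w = w^{L-1}(y - w^L) \approx y\, w^{L-1} \quad \text{near the origin.}$$
Starting from $w(0) = \delta$, depth $L = 2$ gives $w(t) = \delta e^{yt}$ and an escape time of order $y^{-1}\log(1/\delta)$, while depth $L \ge 3$ gives $w(t)^{-(L-2)} = \delta^{-(L-2)} - (L-2)\,y\,t$ and an escape time of order $\delta^{-(L-2)}/((L-2)\,y)$. So at a fixed $\delta$, deeper models linger near the origin polynomially rather than logarithmically long in $1/\delta$, which is why rank comparisons at matched initialization scales may conflate depth with an effectively smaller initialization.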
Heavily AI-edited