ICLR 2026 - Reviews


Reviews

Summary Statistics

| EditLens Prediction | Count | Avg Rating | Avg Confidence | Avg Length (chars) |
|---|---|---|---|---|
| Fully AI-generated | 0 (0%) | N/A | N/A | N/A |
| Heavily AI-edited | 0 (0%) | N/A | N/A | N/A |
| Moderately AI-edited | 0 (0%) | N/A | N/A | N/A |
| Lightly AI-edited | 0 (0%) | N/A | N/A | N/A |
| Fully human-written | 4 (100%) | 5.00 | 4.00 | 4138 |
| Total | 4 (100%) | 5.00 | 4.00 | 4138 |
Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper argues that sufficiently wide neural networks achieve linear mode connectivity without permutation alignment. While permutation-based methods require networks to have a minimum width, the authors claim that for a wide enough network the permutation search itself becomes unnecessary, and they provide insights into why this emerges. The authors extend layerwise linear feature connectivity (LLFC) by introducing layerwise exponentially weighted connectivity (LEWC), positing that the intermediate activations of the merged model can be expressed as an exponentially weighted average of those of the individual models. They explain the emergence of LEWC with weak additivity of ReLU activations (as in LLFC) and reciprocal orthogonality.

Strengths:
- The exponential derivation and the LEWC framework represent a meaningful extension of layerwise linear feature connectivity, novel to the best of my knowledge.
- The paper contributes to the emerging literature questioning whether permutation alignment is necessary for mode connectivity, challenging a mainstream assumption in the field.
- The paper provides a principled explanation for why LMC is more easily achieved in wider networks through the lens of reciprocal orthogonality, moving beyond the intuition that "more width provides more permutations to search". This offers a different perspective on the role of overparameterization in mode connectivity.
- The empirical framework and justification are sound.

Weaknesses:
- The paper makes it sound like the networks are widened post hoc after training. The title and framing suggest a novel widening procedure (potentially post-hoc expansion of trained models, which would be very exciting), but the paper appears to simply experiment with networks that are already wide from initialization; this should be made clear. I believe post-hoc widening could be a very interesting addition to impose reciprocal orthogonality and low-rank structure.
- The experiments inherit the limited scope of prior work and need expansion; some suggestions are included in the questions.
- I think the paper could benefit from an expanded discussion in relation to over-parameterized networks (e.g. Simsek et al., 2021).

Questions:
- Could you provide more analysis of the temperature parameter in Section 3.1? Is there a relationship between the optimal temperature and network width/depth/alpha? How sensitive are the results to this choice? (See also the sketch at the end of this review.)
- I find the low-rank argument insightful. Given that LMC is sensitive to optimization hyper-parameters (Altıntaş et al., 2024), I would be interested to see a comparison between first- and second-order methods along this line. Another experiment could use no weight decay, since that is also a common practice in training CIFAR-scale ResNets.
- Have the authors investigated how the permutation-based methods affect the overlapping dimensions? Though not critical, this could be a good insight of the paper. I think an analysis along this line could support the claims in the paper.
- I would be interested to see whether the connectivity breaks gradually over layers, and hence whether there is a relationship between the width expansion factor and layer depth.
- I think it would be useful if the authors commented on how normalization layers and residual connections interfere with reciprocal orthogonality. The authors could experiment with this in the MLP architecture.
- Even though the authors elaborate on LLFC (and on commutativity not holding in their setting) in the appendix, I believe the paper could benefit from a broader discussion of the implications of LLFC vs. LEWC, as well as of the emergence of LEWC in the spawning setting (Frankle et al., 2018). This could be an ablation in the paper.

I find this line of research and the authors' contribution quite interesting, and I am open to increasing my score if the authors can better situate the implications of their work within the existing literature.

EditLens Prediction: Fully human-written
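A note on the temperature question above: one standard way to fit a softmax temperature post hoc is temperature scaling on held-out data. The sketch below is a generic, hypothetical implementation (names such as `merged_model` and `val_loader` are placeholders), not the paper's actual recalibration procedure.

```python
# Hypothetical sketch: fit a single softmax temperature T for the merged model
# on held-out data, as in standard temperature scaling. merged_model and
# val_loader are placeholders, not code from the paper.
import torch
import torch.nn.functional as F

def fit_temperature(merged_model, val_loader, device="cpu", steps=200, lr=0.05):
    merged_model.eval()
    logits_list, labels_list = [], []
    with torch.no_grad():
        for x, y in val_loader:
            logits_list.append(merged_model(x.to(device)))
            labels_list.append(y.to(device))
    logits = torch.cat(logits_list)
    labels = torch.cat(labels_list)

    # Optimize log T so that T stays positive.
    log_t = torch.zeros(1, device=device, requires_grad=True)
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()  # calibrated temperature
```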
Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity

Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
The paper tackles the task of merging different checkpoints of the same architecture trained from scratch with different seeds. In particular, the paper investigates why linear mode connectivity improves simply by increasing the width of the model, without the need for permutations. To explain this, the authors propose Layerwise Exponentially Weighted Connectivity (LEWC): the intermediate output of the merged model can be expressed as a weighted sum of those of the endpoint models, with the interpolation coefficients decaying exponentially with the widths. The paper derives two sufficient conditions for this phenomenon to arise: ReLU weak additivity and reciprocal orthogonality. The experiments then empirically confirm that these conditions hold in several controlled settings, including MLPs, VGGs and ResNets on CIFAR-10, MNIST and FMNIST.

Strengths:
- The paper is clearly written and easy to read. The exposition is well structured, and the theoretical derivations are both intuitive and sound.
- The problem is important, both for providing further understanding of loss landscape connectivity and for its strong implications for model merging, editing and alignment.
- The proposed theoretical framework is novel and interesting. Layerwise Exponentially Weighted Connectivity (LEWC) offers a way to generalize the "averaging as ensembling" point of view, which holds for linear models, to non-linear deep architectures. The sufficient conditions for LEWC to arise are formally rigorous and make intuitive sense.

Weaknesses:
- I have some reservations about the main claim of the paper. The observed permutation-free linear connectivity may stem from overparameterization-induced robustness rather than genuine connectivity. In very wide networks and on easy tasks (e.g. CIFAR-10, MNIST), independent checkpoints can lie in flat regions of the loss landscape where large parameter perturbations (like naive weight averaging) don't affect accuracy. From this perspective, the reported effect might be explained by functional redundancy and flatness rather than what is stated in the paper. It would be interesting to see the following experiments to better understand whether this is the case:
  - Train seeds with constraints that violate the sufficient conditions for LEWC, by setting zero weight decay and using spectral regularizers against low rank. Measure ReLU weak additivity, reciprocal orthogonality and merge success. If merging still works despite the models not exhibiting ReLU weak additivity and reciprocal orthogonality, then it doesn't have to do with LEWC and is rather just functional redundancy.
  - Compute the norm of the difference between A and B, then traverse a random direction *u* by that same distance and plot the loss barrier throughout (see the sketch at the end of this review). The loss should clearly increase in that direction, but if it's still low at $\lambda = 0.5$ then we might again be observing functional redundancy.
- Limited practical utility. The effect appears only for very wide models trained on simple datasets, under settings that are not those used in realistic large-scale systems.
  Moreover, the need for temperature calibration, strong weight decay, and substantial width multipliers means that the reported phenomenon does not affect practical model-merging pipelines.
- Experimental evidence is limited to toy models and datasets. While this is somewhat common in the prior literature, the work's applicability and theoretical impact would greatly benefit from considering more complex architectures such as ViTs. This could be done by following e.g. [1].

Overall, I find the analysis and insights interesting and worth sharing, but they require the proposed additional experiments to ensure we are not just observing spurious explanations for unknown, possibly trivial phenomena.

[1] Theus, Alexander, et al. "Generalized Linear Mode Connectivity for Transformers." NeurIPS 2025.

Questions:
- To what extent are the observed results a consequence of extreme overparameterization and the smoothness of the loss landscape, rather than a new structural mechanism such as LEWC?
- Do the experiments suggested in Weakness 1 confirm LEWC? If they cannot be performed within the rebuttal time frame, what is the expected outcome?
- Modern architectures have moved beyond ReLU. How does the treatment differ if one employs a more modern variant (e.g. SwiGLU)?
- What are the practical implications? Given the requirement of strong weight decay and large width multipliers, does this actually inform or help any practical merging or federated learning pipeline?

EditLens Prediction: Fully human-written
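Regarding the random-direction control proposed in the weaknesses above (the "sketch at the end of this review"): a minimal, hypothetical PyTorch sketch, assuming placeholder names `model_a`, `model_b`, and an `eval_loss` callable that returns a scalar loss on held-out data. Batch-norm running statistics would additionally need to be recomputed (e.g. via REPAIR) at each probed point.

```python
# Hypothetical sketch of the random-direction baseline: starting from model A,
# move the same Euclidean distance that separates A and B, but in a random
# direction, and record the loss along the path.
import copy
import torch

def flat_params(model):
    return torch.cat([p.detach().flatten() for p in model.parameters()])

def set_flat_params(model, flat):
    i = 0
    for p in model.parameters():
        n = p.numel()
        p.data.copy_(flat[i:i + n].view_as(p))
        i += n

def random_direction_scan(model_a, model_b, eval_loss, num_points=11):
    theta_a, theta_b = flat_params(model_a), flat_params(model_b)
    dist = (theta_b - theta_a).norm()

    u = torch.randn_like(theta_a)
    u = u / u.norm() * dist          # random direction, same length as ||B - A||

    probe = copy.deepcopy(model_a)
    losses = []
    for lam in torch.linspace(0.0, 1.0, num_points):
        set_flat_params(probe, theta_a + lam * u)
        losses.append(eval_loss(probe))  # e.g. mean loss on a held-out set
    return losses
```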
Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity

Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
In the context of model merging and parameter permutations, this work shows that, contrary to prevailing thought, increasing network width is sufficient in and of itself (without permutation alignment of the parameters) to make networks more linearly connected (more mergeable via weight interpolation). The authors show that (a) increasing width reduces barriers (loss barriers require softmax recalibration), (b) activations become linearly connected (remain more similar when linearly interpolated) with depth, dependent on two conditions (weak additivity and reciprocal orthogonality), and (c) activations are lower rank with less "overlap" as width increases. Reducing weight decay (which increases rank) counteracts this effect.

Strengths:
The work presents a novel and quite interesting finding, and backs it up with a predictive theory along with empirical evidence. The structure of the paper is well thought out and covers many of the gotchas and questions one may have going in; e.g., batch norm statistics are handled properly using REPAIR, and many efforts are made to distinguish reciprocal orthogonality from commutativity in Zhou et al. (2023). Overall there are many good things going on, and I would like the authors to push a bit further beyond establishing that width-induced LMC exists.

Weaknesses:
I have two issues with this work. One issue is that the theory and experiments are somewhat detached, as the theoretical setup and some predictions (depth in particular) don't have solid connections with the empirical evidence. Some of the definitions/conditions (e.g. low-rank weights leading to LMC) feel arbitrary and accidental rather than clearly supported by intuition and evidence. The other issue is that I'm not sure what to conclude: can we predict when LMC will arise due to width alone? What are the implications (practical or otherwise) of the findings?

Currently, the empirical results cannot rule out the possibility that there is a lower floor on how small the barrier becomes with increased width. First, for the higher width scales, it is important to report the actual loss or accuracy barriers (such as in a table, a zoomed-in plot, or a plot of width vs. max barrier), since "LMC" in the permutation literature (e.g. Ainsworth et al.) requires "essentially no barrier". Along this line, I don't find the random permutations experiment (Figure 13) useful: in a sense, two randomly initialized and trained networks are already randomly permuted with respect to each other, since the probability of initializing $\theta$ is equal to that of $\pi(\theta)$ for any permutation $\pi$. Instead, I would like to know whether, as width increases, the difference between barriers from a known good permutation and from a random one goes to 0. For the known good permutation, one could use permutation-finding algorithms, or networks from the same initialization that are LMC (Frankle et al. 2020).
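For concreteness, the loss barrier referred to above is commonly computed as the maximum gap between the loss of the interpolated model and the linear interpolation of the endpoint losses. The following is a minimal sketch under assumed placeholder names (`model_a`, `model_b`, `eval_loss`), not the paper's own code; batch-norm statistics should be reset (e.g. via REPAIR) before each evaluation.

```python
# Hypothetical sketch of the interpolation barrier:
# max over lambda of L(lam*A + (1-lam)*B) minus the linear baseline.
import copy
import torch

def interpolate_state_dicts(sd_a, sd_b, lam):
    # Interpolate floating-point entries only; keep integer buffers
    # (e.g. num_batches_tracked) from model A.
    return {
        k: lam * sd_a[k] + (1.0 - lam) * sd_b[k]
        if torch.is_floating_point(sd_a[k]) else sd_a[k]
        for k in sd_a
    }

def loss_barrier(model_a, model_b, eval_loss, num_points=25):
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    loss_a, loss_b = eval_loss(model_a), eval_loss(model_b)

    probe = copy.deepcopy(model_a)
    barrier = 0.0
    for lam in torch.linspace(0.0, 1.0, num_points):
        lam = lam.item()
        probe.load_state_dict(interpolate_state_dicts(sd_a, sd_b, lam))
        linear_baseline = lam * loss_a + (1.0 - lam) * loss_b
        barrier = max(barrier, eval_loss(probe) - linear_baseline)
    return barrier
```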
I find the argument in 5.2 (low-rank structure reduces overlap) somewhat unconvincing. First, the intuitive example is coordinate-aligned, and I don't see how the same applies to arbitrary low-rank transformations that are not coordinate-aligned. It's possible that activations are coordinate-aligned in practice due to ReLU being a coordinate-aligned transform, which Figure 6 supports. In general, I think a statement like "width leads to sparser activations, which induces weak additivity" would be stronger than the current "low rank" explanation.

Small problems with the presentation:
- Missing citations for line 59: "In addition, several studies have shown that widening ResNet-50 trained on the ImageNet dataset improves the test accuracy of the merged model when interpolating the weights of two trained models with permutation".
- The plot colors are hard to read for the different layer widths; maybe consider using a graduated palette (e.g. viridis)?
- The "standard deviation with respect to 0" (defined in Figure 5) should be called the square root of the second (uncentered) moment (see the note at the end of this review).
- I suggest moving Figure 5 to the appendix in favour of Figure 9.
- Section 5.4 (reduced weight decay stops the effect) adds a much-needed ablation in my opinion. Actually, more ablations would really help establish that weak additivity and reciprocal orthogonality are necessary and sufficient conditions for LMC sans permutation.

Questions:
An interesting implication predicted by LEWC is the exponential decay in activation magnitude as depth increases. This point isn't really touched on in the main text, and I would like to know if the exponential decay in the magnitude of activations from intermediate layers occurs as predicted:
- What are the actual temperature values in Figure 2 after calibration? Is LEWC predictive of the magnitude of these temperature values (i.e. does increasing depth make the scale of the logits smaller)?
- Similarly, for Figure 3, what does a plot of depth versus the magnitude of the interpolated vector look like?

It isn't immediately obvious from the main text alone what the distinction is between the definition of reciprocal orthogonality and commutativity from Zhou et al. (2023); in particular, what is the significance of the differences between the experiments for Figure 8 in Section 5.3 (empirically verifying reciprocal orthogonality) and Figure 10 (empirically verifying that commutativity does not hold)?

Could the authors make a connection between LMC with increasing width and the neural tangent kernel at the infinite-width limit?

EditLens Prediction: Fully human-written
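To spell out the terminology point about Figure 5 above: the quantity described is the root mean square (the square root of the second uncentered moment), which coincides with the standard deviation only when the mean is zero:

$$
\mathrm{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2}}
\qquad\text{vs.}\qquad
\mathrm{std}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(x_i-\bar{x}\bigr)^{2}},
\quad \bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i .
$$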
Do We Really Need Permutations? Impact of Width Expansion on Linear Mode Connectivity

Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Summary:
This paper explores the narrative around permutation symmetries and linear mode connectivity (LMC) from a new angle, finding that permutation symmetries are not necessary to achieve LMC for very wide models if one is allowed to recalibrate the temperature of the softmax (a much smaller modification). The paper then introduces layerwise exponentially weighted connectivity (LEWC), a nonlinear form of connectivity which is satisfied when the activations of the model decay exponentially as the layer index increases. The key observation is that low-rank weights reduce the conflict between merged neurons, and (relatively) large settings of weight decay promote low-rank weights.

Strengths:
1. The key observation of the paper is very scientifically interesting and portrays linear mode connectivity and permutation symmetries in a new light.
2. The paper also contains theoretical analysis justifying why LEWC can be expected to emerge for very wide models. The analysis is based on two properties:
   - Weak Additivity for ReLU Activations (roughly, interpolating the inputs of ReLUs is the same as interpolating their outputs)
   - Reciprocal Orthogonality (roughly, each model's features live in the "null space" of the other)
3. The above properties are tested in practice and found to emerge as the width increases. Both properties get stronger as the width increases!

Weaknesses:
1. The main experiments use a slightly stronger than normal choice of weight decay (3e-3). I believe the typical values are roughly 1e-4 for MNIST and 5e-4 for CIFAR (although I am not familiar with the history of these choices). This is not a major weakness, but it does complicate the narrative, since permutation-based methods still work when the weight decay is not so high.

Questions:
1. Do permutation methods like Git Re-Basin continue to work with typical choices of weight decay? If so, is the mechanism that Git Re-Basin "fixes" the calibration of the model? Why doesn't LEWC continue to appear between the permuted models?
2. Do you think the re-calibration step could further reduce the barriers for models merged with permutation methods in settings where LEWC does not arise organically (low weight decay)?
3. Can you record the (midpoint) accuracies/losses of the models in numerical form? It is useful to see the exact numbers, since differences in accuracy can be quite small while staying meaningful (for example).
4. The larger setting of weight decay is important to establish the low-rank structure which allows the permutation-free merging to succeed. Given that merging does not succeed when the weight decay is set to a lower value of 1e-4, I think it would be interesting to see how the loss/accuracy barriers change as the weight decay decreases (see the sketch after this review).

EditLens Prediction: Fully human-written
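For the weight-decay sweep suggested in Question 4 above, one generic way to quantify the accompanying "low rank" structure is the stable rank of each weight matrix, $\|W\|_F^2 / \|W\|_2^2$. The sketch below is a hypothetical diagnostic, not the paper's own rank measure; `model` stands in for any trained checkpoint, and the results could be reported alongside the measured barriers for each weight-decay value.

```python
# Hypothetical sketch: layerwise stable rank as a proxy for the low-rank
# structure promoted by weight decay. This is a generic diagnostic, not
# necessarily the rank measure used in the paper.
import torch

def stable_rank(weight: torch.Tensor) -> float:
    w = weight.detach().flatten(start_dim=1)     # treat conv kernels as matrices
    fro_sq = w.pow(2).sum()
    spectral = torch.linalg.matrix_norm(w, ord=2)  # largest singular value
    return (fro_sq / spectral.pow(2)).item()

def layerwise_stable_ranks(model):
    return {
        name: stable_rank(p)
        for name, p in model.named_parameters()
        if p.dim() >= 2  # weight matrices / conv kernels only
    }

# e.g. compare layerwise_stable_ranks(model_trained_with_wd_3e3) against
# layerwise_stable_ranks(model_trained_with_wd_1e4) alongside their barriers.
```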