NeuMoSync: End‑to‑End Neuromodulatory Control for Plasticity and Adaptability in Continual Learning
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper addresses the challenges of plasticity loss and poor knowledge transfer in continual learning by introducing NeuMoSync, a novel architecture inspired by global neuromodulatory mechanisms in the brain. NeuMoSync utilizes a higher-level module that synthesizes current inputs and the network's historical state, allowing it to adaptively regulate activation dynamics and synaptic plasticity. Evaluated on a diverse set of CL benchmarks, the method demonstrates strong performance in retaining plasticity and achieves significant improvements in both forward and backward adaptation compared to existing approaches.
The paper's core strengths lie in its clarity, rigorous validation, and the impressive performance of the proposed NeuMoSync architecture. NeuMoSync demonstrates significant performance gains over a wide array of baselines on diverse continual learning benchmarks. The paper also goes beyond raw accuracy and includes adaptation speed and knowledge transfer for a more comprehensive quantification of model performance. The paper's claims are supported by ablation studies, which clarify the benefit of each component of the model. The analysis of emergent behaviors provides valuable intuition into why the system succeeds.
1. The paper frames the proposed method as inspired by neuromodulatory mechanisms in the brain, but the link is very weak. The modulation in this work seems more closely related to conditioning modules such as FiLM [1] than to anything brain-like. The authors also claim the consolidated network is inspired by the memory consolidation process in the brain, but it's not clear to me how averaging weights updated with gradient descent is tied to any known neural mechanism. While the authors acknowledge this in the appendix (Appendix I.5), it remains a weakness given that being "brain-inspired" appears to be the key motivation for the method.
2. The scalability of the method to larger networks and more complex tasks is unclear. The NeuroSync module must take the feature vectors of every neuron as input and produce modulation coefficients for every neuron, which seems hard to scale to larger networks, as the authors acknowledge. One possibility would be to modulate only a subset of neurons in a large network, but that would require additional empirical validation.
3. The paper focuses on quick relearning and adaptation (plasticity) and does not address stability (i.e., catastrophic forgetting) in isolation. However, this is not made clear in the main text. It would be better to state these limitations in the main text instead of only in the appendix.
[1] Perez et al., "FiLM: Visual Reasoning with a General Conditioning Layer," AAAI 2018.
In addition to the points above, I wonder whether using global consolidation and plasticity factors $\alpha_{WC}$ and $\alpha_{SM}$ would degrade model performance; this could help clarify whether neuron-specific consolidation is necessary.
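To state the suggested ablation precisely: assuming the per-neuron weight mixing has the linear form below (my notation, inferred from the paper's description; the exact equation may differ), the question is whether replacing the neuron-indexed coefficients with two global scalars hurts performance:

$$\tilde{w}_i = \alpha_{SM,i}\, w_i^{\text{main}} + \alpha_{WC,i}\, w_i^{\text{cons}} \quad \text{(per neuron, as proposed)} \qquad \text{vs.} \qquad \tilde{w}_i = \alpha_{SM}\, w_i^{\text{main}} + \alpha_{WC}\, w_i^{\text{cons}} \quad \text{(global scalars).}$$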
Fully human-written |
NeuMoSync: End‑to‑End Neuromodulatory Control for Plasticity and Adaptability in Continual Learning
Soundness: 3: good
Presentation: 2: fair
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
NeuMoSync is quite a complex architecture. The authors successfully demonstrate that the model does not lose plasticity on online continual learning tasks. The model can not only adapt quickly to new tasks but can also re-adapt quickly to previously trained tasks. A central component is the NeuroSync controller, which takes the current input and a feature vector for every neuron in the main network and outputs coefficients that govern how the inference network behaves.
It’s not clear to me how exactly the NeuroSync controller learns to produce effective coefficients for the inference network, given that there is no meta-learning loop. But it’s great that it does! The authors show that parameter sharing in the controller’s architecture is crucial, though I’m still not quite sure about the reasons for this.
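For concreteness, my current reading is sketched below (a minimal, hypothetical setup that ignores the weight-mixing and per-neuron feature details; `controller`, `base`, `head`, and `neuron_feats` are placeholder names): because the coefficients enter the inference network’s forward pass, ordinary backpropagation of the task loss also trains the controller, with no inner/outer meta-learning loop needed.

```python
import torch
import torch.nn as nn

# Hypothetical minimal setup: `controller` maps (input, neuron-feature summary) to
# per-neuron coefficients; `base`/`head` form the inference network whose hidden
# activations those coefficients scale.
controller = nn.Sequential(nn.Linear(28 * 28 + 16, 64), nn.ReLU(), nn.Linear(64, 32))
base, head = nn.Linear(28 * 28, 32), nn.Linear(32, 10)
neuron_feats = torch.randn(16)  # stand-in for the per-neuron feature vectors

opt = torch.optim.Adam(
    list(controller.parameters()) + list(base.parameters()) + list(head.parameters()),
    lr=1e-3,
)

x, y = torch.randn(8, 28 * 28), torch.randint(0, 10, (8,))
coeffs = controller(torch.cat([x, neuron_feats.expand(8, -1)], dim=-1))
logits = head(coeffs * torch.relu(base(x)))    # modulation sits inside the forward pass,
loss = nn.functional.cross_entropy(logits, y)  # so the task loss back-propagates into
loss.backward()                                # the controller as well
opt.step()
```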
Recommendation: Weak accept. The core idea is interesting, the empirical results are encouraging, and the ablation studies suggest that all components of the architecture matter. I did have a hard time grasping what was done at first, and I do think the presentation could be improved: figure readability is an issue, and quite a lot of important details are deferred to the appendix. Still, I don’t see fundamental issues that would block publication.
The authors demonstrate that NeuMoSync works for online continual learning and appears to enable fast adaptation and re-adaptation.
Results in Figure 2 and Table 1 are positive, even if some metrics are not immediately intuitive.
The ablation studies indicate that removing components degrades performance, supporting the claim that each part of the architecture matters.
It’s hard for me to judge how impressive the performance is from the information provided about the comparison methods. It would be helpful to know parameter counts for each of the comparisons and a bit more about the choice of hyperparameters. A lot of this information is in the appendix, but I think some of it should be moved to the main text if possible.
Some figures (especially Figure 4) are hard to read due to small text, and several practical details are mostly in the appendices, making it harder to judge the main claims from the body alone.
One thing I’m curious about is the controller’s dependence on the current input: how important is this? It would be interesting to see an ablation without it. (This may already be in the appendix; I may have missed it.)
Fully human-written |
NeuMoSync: End‑to‑End Neuromodulatory Control for Plasticity and Adaptability in Continual Learning
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The authors propose a method for preserving plasticity when training over successive tasks.
The method maintains both a normally trained network (MainNetwork) and its slow-changing exponential moving average (EMA), the ConsolidatedNetwork.
Furthermore, each neuron maintains a vector of features (position in the network, past activation statistics, learnable features, ...).
A NeuroSync module then takes the feature vectors of all neurons, together with the current data sample, and outputs four modulation coefficients per neuron. Two of these interpolate between the neuron's weights from the fast and slow (EMA) networks, and the other two modulate the neuron's output function.
Standard backpropagation occurs through both NeuroSync and the MainNetwork after each training sample/batch.
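For my own understanding, here is a minimal per-neuron sketch of the mechanism as summarized above; the gain/bias form of the output modulation and the exact EMA update are my assumptions, not taken from the paper:

```python
import torch

def neuron_forward(x, w_main, w_cons, b, coeffs):
    """One modulated neuron, as I read the scheme (coefficient semantics assumed)."""
    a_sm, a_wc, a_gain, a_bias = coeffs       # the four per-neuron NeuroSync outputs
    w_eff = a_sm * w_main + a_wc * w_cons     # mix fast and slow (EMA) incoming weights
    pre = x @ w_eff + b
    return a_gain * torch.relu(pre) + a_bias  # reshape the neuron's output function

x = torch.randn(5, 8)                         # a batch of inputs to this neuron
w_main, w_cons, b = torch.randn(8), torch.randn(8), torch.tensor(0.1)
out = neuron_forward(x, w_main, w_cons, b, torch.tensor([0.9, 0.2, 1.0, 0.0]))

# After each update, the consolidated weights track the main ones, e.g.
# w_cons = decay * w_cons + (1 - decay) * w_main.detach()
```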
The model seems much better than multiple alternatives at quickly learning new tasks. This is especially true for tasks that require pure memorization with little shared structure, such as Random-Label MNIST and CIFAR.
Various ablations and experiments attempt to describe the dynamics of the system and explain its performance.
The model's performance in quickly learning new tasks, over a succession of tasks, seems much higher than that of many alternatives, including training from scratch, strong L2 regularization, and various algorithms for maintaining plasticity.
The algorithm itself is clearly explained.
- The authors make many references to continual learning. However, they implicitly acknowledge that their proposed method itself is not actually capable of continual learning and forgets previous tasks, since they are forced to augment it with Experience Replay in their "Forgetting" experiments. The method seems to improve the capacity of the network to quickly learn new tasks, with little regard for previously learned tasks.
- The authors show many graphs of the system's behavior. Unfortunately some of these graphs seem to contradict each other, making interpretation difficult (see below).
- However, from some of those graphs, a possible explanation for the system's behavior emerges; this explanation is quite different from the one the authors suggest.
- Graphs of the alphas produced by the system, as shown, seem contradictory. For the same Random-Labels CIFAR task, Figure 4a shows average alpha_sm as small and slightly positive. Figure 4c shows all alpha_sm as uniformly and strongly negative (< -1). Figure 9 shows alpha_sm as moderately negative (~ -0.5). Something must be missing from the descriptions.
- Similarly, Figure 4b shows that alpha_wc is essentially 0 for the first few tasks, before jumping higher. But Figure 4c shows alpha_wc jumping almost immediately to sizeable values. Please clarify.
- Comparisons are not helped by the authors constantly changing the x-axis between "epochs", "tasks" and "steps". Consistent markers for successive task boundaries would be useful.
- Overall, the graphs suggest that alpha_sm (the dynamic weight on the current network) and alpha_wc (the dynamic weight on the averaged, slow-changing network) are consistently of opposite signs. This remarkable fact is not mentioned by the authors, unless I missed it. If true, it would suggest an immediate explanation: the system simply *subtracts* the accumulated weights from the current, fast-moving network, making the changes "faster" (the specific assignment of which coefficient is negative and which is positive should make no difference, unless I'm missing something; see the sketch after this list).
- This is particularly relevant for Figure 4b, which suggests that a jump in alpha_wc coincides with, and counteracts, a dip in performance, presumably caused by loss of plasticity. The authors choose to interpret this as "learned reliance on consolidated knowledge", which seems counter-intuitive since (as the authors point out) this particular task has no use for past knowledge. Instead, it suggests increased "active forgetting" of this past knowledge, reducing the burden of accumulated (and now irrelevant) information.
- Please clarify whether the above makes sense. Maybe it doesn't, but there should be at least some mention of the apparent opposite signs between the outputted alpha_sm and alpha_wc (assuming the graphs showing this are the correct ones).
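To make the sign observation concrete, this is the algebra I have in mind, assuming the mixing has the linear per-neuron form $w^{\text{eff}} = \alpha_{sm}\, w^{\text{main}} + \alpha_{wc}\, w^{\text{cons}}$ (my notation; the paper's exact equation may differ):

$$w^{\text{eff}} = \alpha_{sm}\, w^{\text{main}} + \alpha_{wc}\, w^{\text{cons}} = \alpha_{sm} \left( w^{\text{main}} - \frac{|\alpha_{wc}|}{\alpha_{sm}}\, w^{\text{cons}} \right) \quad \text{when } \alpha_{sm} > 0 > \alpha_{wc},$$

i.e. the consolidated weights are actively subtracted from the fast-moving ones. Flipping the signs of both coefficients merely flips the sign of $w^{\text{eff}}$, which the modulated output function could in principle absorb, so which of the two coefficients ends up negative should carry little meaning on its own.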
Fully human-written |
NeuMoSync: End‑to‑End Neuromodulatory Control for Plasticity and Adaptability in Continual Learning
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper proposes a biologically inspired continual learning architecture, NeuMoSync, which introduces three core components: a Main Network (for rapid adaptation to new tasks), a Consolidated Network (for long-term memory), and a NeuroSync module (as a global regulator). NeuMoSync dynamically modulates neuron-level plasticity, activation functions, and synaptic weights to address the loss of plasticity and inefficient knowledge transfer commonly observed in deep neural networks during continual learning. Across six categories of continual learning benchmarks, NeuMoSync demonstrates superior performance in maintaining plasticity and achieving fast forward/backward adaptation compared to existing methods. Systematic experiments further reveal that the modulation parameters exhibit neuroscience-like behaviors, such as dopamine-like responses during task switching and neuron functional specialization, which validates the biological plausibility of the proposed approach.
This paper presents an innovative integration of neuromodulatory mechanisms with continual learning, achieving a dynamic balance between plasticity and stability through neuron-level modulation. The experimental design is rigorous and extensive, covering six benchmark types and multi-dimensional metrics such as plasticity, adaptation speed, and generalization, and the comparisons with meta-learning approaches and stability-enhanced methods further demonstrate the method's robustness. In addition, the emergence of neuroscience-like modulation behaviors (e.g., dopamine-like responses and neuron specialization) observed through systematic experiments strengthens the credibility and interpretability of the biologically inspired design. The paper also features a clear structure with detailed methodological descriptions; illustrations such as architecture diagrams, learning curves, and modulation parameter analyses provide intuitive support for key arguments.
Although the paper mentions that the parameter overhead of NeuMoSync is only 5–8%, the scalability of the NeuroSync module, which relies on a Transformer-based network, is not sufficiently discussed for larger backbones such as ResNet. In addition, the forgetting experiments (Appendix F.3) depend on experience replay, which does not directly demonstrate NeuMoSync’s intrinsic ability to mitigate catastrophic forgetting. Moreover, some biological analogies (e.g., $\alpha_{ARM}$ as “tonic neural modulation”) lack direct empirical validation, which weakens the persuasiveness of these claims.
How can NeuMoSync be extended to very large-scale architectures?
Would adjustments to the neuron grouping strategy or sparsification in NeuroSync be necessary? (A sketch of the kind of grouping I have in mind follows after these questions.)
Are the observed phenomena such as the “dopamine-like responses” of the modulation parameters supported by neuroscientific experimental results, or are they qualitative analogies?
The paper states that “in this manner enables input-dependent amplification or attenuation of each network’s contribution within the Inference Network” (lines 171-172). Could the authors elaborate on the implementation details of this mechanism?
In lines 185-188, are there corresponding ablation results supporting the described behavior?
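On the grouping question above, a minimal sketch of the kind of scheme I have in mind (hypothetical names, not the paper's implementation): NeuroSync would emit one coefficient per group of channels rather than per neuron, keeping its input/output size fixed as the backbone grows.

```python
import torch

def grouped_coeffs(per_group, n_neurons):
    """Expand per-group modulation coefficients to per-neuron ones.

    per_group: tensor of shape (n_groups,); neurons are assigned to groups
    contiguously for simplicity.
    """
    n_groups = per_group.shape[0]
    assert n_neurons % n_groups == 0
    return per_group.repeat_interleave(n_neurons // n_groups)

# e.g. a ResNet stage with 512 channels but only 32 modulated groups
coeffs = grouped_coeffs(torch.rand(32), n_neurons=512)
print(coeffs.shape)  # torch.Size([512])
```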
Moderately AI-edited |