A theory of parameter identifiability in data-constrained recurrent neural networks
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 6: marginally above the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This paper seeks to characterize when the parameters of an RNN can be reliably recovered from neural data, with applications to data-constrained modeling of neural dynamics. To do so, the authors establish conditions under which parameter subspaces are constrained by data, and conditions under which they are not, supported by theoretical results and numerics. In particular, they argue that parameter combinations aligned with the top components of the Gram matrix of observed neural activity are identifiable, whereas those associated with near-zero eigenmodes are not. They then show that FORCE learning fails to constrain parameters to identifiable subspaces, and argue that non-identifiable parameters are optimally recovered with perturbations aligned with the modes of the Gram matrix associated with small eigenvalues.
The writing is clear and organized. The numerics in this paper and appendix are quite extensive, touching on many of the more realistic conditions that neuroscientists may be concerned with (correlated noise, unobserved neurons, etc.). The observations made about FORCE necessarily learning non-identifiable parameters should be of interest to those working on or fitting data-constrained models.
Many of the claims feel overstated. Overall, there seems to be a large gap between what is actually proven in the theorems/propositions and what is then concluded about the identifiability of models from data. For example, Theorem 1, along with most of the subsequent results, characterizes parameter identifiability in terms of a projection matrix derived from the concatenated observed neural activity/external inputs, yet this framing strictly holds only in the noiseless dynamics setting, with matched student/teacher architecture and no unobserved influences. The fact that the identifiability criterion of Theorem 1 is identical to that of noiseless LTI systems/linear RNNs speaks to how restrictive this setting is. Throughout the paper, the stated conditions for parameter identifiability seem to be primarily of the necessary kind, whereas the abstract/title/discussion would lead one to think the contents also substantially cover realistic sufficient conditions for identifiability.
I acknowledge that extensions of theorems to more realistic settings are considered, but they still feel underexplored. For example, Proposition S1 regarding recovery under partial observations seems like merely a restatement of Theorem 1, with an addendum to restrict to the case where recovery of parameters associated with unobserved neurons can be safely ignored. I don't see how that result meaningfully characterizes how partial observations can corrupt identifiability.
I am still leaning towards acceptance due to the extensive empirics, which I think would be useful in their own right to the neuroscience community, but I feel the paper would be much stronger if the theory spoke to the empirics better.
Finally, I find it strange that this paper develops a framework around parameter subspaces defined by the top eigenvectors of $X^\top X$, but makes no reference to PCA. Isn't this identical to (or at least closely related to) projecting parameters onto the top PCs of the activity+external inputs?
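To make the correspondence concrete, here is a minimal numpy sketch of what I have in mind (toy data, names mine): up to mean-centering, the top eigenvectors of the Gram matrix $X^\top X$ are exactly the top right singular vectors of $X$, i.e. the (uncentered) PCs of the stacked activity+inputs matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
TM, N_X = 500, 50                        # rows: time points, cols: units + inputs
X = rng.standard_normal((TM, N_X))

# Top-10 eigenvectors of the Gram matrix X^T X (eigh returns ascending order).
_, evecs = np.linalg.eigh(X.T @ X)
V_gram = evecs[:, -10:]

# Top-10 right singular vectors of X, i.e. the (uncentered) PCs.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_pca = Vt[:10].T

# Same subspace: the orthogonal projectors agree up to numerical error.
print(np.allclose(V_gram @ V_gram.T, V_pca @ V_pca.T))   # True
```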
Other comments:
1. A minor point: throughout, $P \in \mathbb{R}^{N_X \times N_X}$ is stated to project onto the column space of $X \in \mathbb{R}^{TM \times N_X}$, but by shape this must be the row-space projector, i.e. it projects onto the subspace of $\mathbb{R}^{N_X}$ spanned by the rows of $X$ (see the snippet after this list).
2. The dynamics noise scale used in the empirics on estimation with noise (the more relevant/realistic case) feels absurdly small. For example, in Fig. S3, $\epsilon_{in} \sim \mathcal{N}(0,10^{-6})$. Since this is precisely the noise that corrupts the input $\theta x(t)$ to the nonlinearity, which is the part of the dynamics invoked for Theorem 1, it would seem to be the stress point that should be tested most thoroughly.
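For reference, a small numpy sketch of the shape argument in point 1 (rank-deficient toy data standing in for a short recording; variable names mine):

```python
import numpy as np

rng = np.random.default_rng(0)
TM, N_X, r = 200, 30, 10
X = rng.standard_normal((TM, r)) @ rng.standard_normal((r, N_X))  # rank-r data matrix

P_row = np.linalg.pinv(X) @ X    # (N_X, N_X): projects onto the row space of X
P_col = X @ np.linalg.pinv(X)    # (TM, TM):   projects onto the column space of X

# Only P_row has the N_X x N_X shape stated in the paper.
print(P_row.shape, P_col.shape)  # (30, 30) (200, 200)
```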
1. Can the learning of non-identifiable parameter combinations by algorithms like FORCE be rectified by simply projecting the learned parameters post hoc onto the identifiable subspace, as estimated from the empirical Gram matrix? (A sketch of what I mean follows this list.)
2. Very minor: In Fig. 2, presumably, the top modes of the Gram matrix change/wiggle as longer neural trajectories are observed, beyond just overall increases in rank. Consequently, the parameter subspaces evaluated need not be exactly comparable across curves in A, B, and C, no?
3. How is the accuracy of parameter recovery defined, e.g. in Fig. 2 and Fig. 3? I'm assuming it is something like $1 - \|\hat{\theta} - \theta\|_F / \|\theta\|_F$, i.e. one minus the relative error in Frobenius norm, but this should be stated somewhere.
4. Regarding the effects of partial observations on identifiability: the paragraph starting at line 1345 seems to state that recovery of parameters associated with top spectral components is poor for partially observed systems, even under very long recording sessions, yet the following paragraph seems to contradict it in spirit and instead strikes an optimistic tone. A very minor point, but I think this disconnect should be clarified, as this discussion will be of interest to the neuroscience community.
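Regarding question 1, the post-hoc fix I have in mind would look something like the following (hypothetical helper, names mine; `theta_hat` holds the learned weights row-wise, `X` the stacked activity+inputs):

```python
import numpy as np

def project_to_identifiable(theta_hat, X, tol=1e-8):
    """Project learned parameters onto the identifiable subspace,
    i.e. the row space of the observed data matrix X."""
    # Right singular vectors with non-negligible singular values span
    # the top modes of the empirical Gram matrix X^T X.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    V_id = Vt[s > tol * s.max()]     # (k, N_X) basis of the identifiable subspace
    P = V_id.T @ V_id                # (N_X, N_X) orthogonal projector
    return theta_hat @ P             # zero out the non-identifiable components

# e.g.: theta_force_projected = project_to_identifiable(theta_force, X)
```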
Fully human-written
---
A theory of parameter identifiability in data-constrained recurrent neural networks
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The manuscript presents a mostly theoretical treatment of parameter identifiability in fits of recurrent neural networks (RNNs) to neural data. RNNs are increasingly fitted to experimentally measured neural data as a way to extract the relevant dynamics and potentially gain insights into the underlying mechanisms. Yet it is not fully understood to what extent the parameters of an RNN are in principle identifiable from finite, noisy data. The authors present several theoretical insights into this question and propose experimental approaches that might mitigate issues of identifiability.
The premise of the paper is clearly outlined. Several interesting connections to past work (even outside neuroscience) are presented and discussed.
The focus on relatively simple, tractable settings allows the authors to gain precise insights into which parameters of their models are identifiable and which are not (in the form of various theorems and corresponding proofs).
The work appears technically sound.
While the paper provides some interesting insights into which parameters of data constrained RNNs are identifiable or not, the practical relevance of these insights is not clear. One prominent application of data-constrained RNNs in neuroscience is to obtain smooth/denoised estimates of low-dimensional, latent dynamics from high-dimensional, noisy observations. For such applications, presumably it does not matter if multiple RNN weight matrices exist that can explain the dynamics equally well. Likewise, many mechanistic insights into the function of the fitted RNNs (like the topology of fixed points) are probably possible even if the RNN weight matrix cannot be identified uniquely. In fact, it is known that the same type of dynamics can be implemented even by different classes of RNNs (work on “Universality” by Sussillo and colleagues, 2019).
The authors should also clarify the connection of their work to past studies that have found a close relationship between dimensions of the weight matrix and dimensions of the dynamics. This relationship has been described in detail in low-rank networks (work by Ostojic et al.), and even nominally high-rank RNNs have been found to be functionally low-rank (Krause et al., 2022), whereby only a low-dimensional subspace of the weight matrix is sufficient to explain the corresponding low-dimensional dynamics. These lines of work seem closely related to those in this manuscript, which also finds that the subspace of identifiable weights is closely linked to the subspace explored by the dynamics.
I found some sections of the paper rather dense and difficult to read. It would help if the authors could at times provide more intuition about the insights gained from their theorems.
What types of insight are affected by the non-identifiability presented by the authors? If the goal of fitting an RNN to data is to infer latent dynamics, or generate hypotheses about the underlying topology of the dynamics, does it matter that the RNN parameters are not fully identifiable?
The causal interventions proposed by the authors to alleviate non-identifiability seem focused on characterizing the components of the weight matrix that are not identifiable because they are not sufficiently constrained by the measured data. But why would it even be desirable to constrain dynamics that are not explored in the “natural” operation of a neural circuit?
Fully human-written
---
A theory of parameter identifiability in data-constrained recurrent neural networks
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
This work studies the identifiability of RNN parameters from observations of their inputs and dynamics, motivated by the use of these models in neuroscience. Their main theorem states that components of the parameters outside the subspace spanned by the observations - equivalently, outside the span of the eigenvectors of the empirical observability Gramian with non-zero eigenvalues - are not identifiable.
As the authors clearly explain, this paper addresses an important topic. The main theoretical results of the paper are intuitive, though perhaps unsurprising. The range of experiments presented is interesting and mostly convincing.
The authors present their work in very sweeping terms, but as a reader I found that the results didn't live up to the expectations set by the title and abstract. First, I do not find the main result (Thm. 1) particularly surprising; see my comment below about the study of LTI systems in control theory. Second, I am confused by the fact that the authors defer the extension of their main results to the partially observed case to the appendix. That case seems central to the point they're trying to make, so even though it is admittedly a rather deflationary result, it is odd not to discuss it clearly in the main text. I'd suggest bringing it into the main text and moderating the tone of the paper overall. I also have a series of more specific concerns, questions, and suggestions, which I list under **Questions**.
- There's a long line of work in control theory on identifiability in system identification for linear dynamical systems; see, for instance, recent works by [Simchowitz et al. (2018)](https://proceedings.mlr.press/v75/simchowitz18a.html) or [Geadah et al. (2024)](https://ieeexplore.ieee.org/abstract/document/10886179) and the references they cite. It would be useful to make contact with this literature, as my impression is that some analog of their Thm. 1 is folklore there. Also, the Gram matrices discussed here coincide with the classical observability Gramian; it would be useful to make this connection explicit.
- It'd be interesting to extend these results to nonlinearities that are not strictly monotone. For instance, the paper you cite by Biswas and Fitzgerald focuses on some of the degeneracies that arise from using a threshold-linear function.
- Thm. 2 is (up to the error term) a consequence of the fact that the parameter updates induced by gradient descent lie within the span of the observed covariates (see the sketch after this list). This seems well known to me, so I think you might consider citing a reference, or at least commenting on the conceptual content.
- Related to this aspect of gradient-descent-type algorithms, I don't think it's surprising that FORCE would retain initialization-dependence, and thus non-identifiable components. Is your aim here primarily to show that CORNN compares favorably to FORCE in this regard?
- I think Cor. 2 could be made more mathematically precise; when is this quadratic approximation reliable? In the appendix you don't consider the remainder term.
- The authors discuss regularization at length. However, thinking naively about neuroscience experiments, it's unclear to me how one should choose a regularization parameter in a principled way, as the data is presumably almost always nonstationary. Can you elaborate on this? For instance, you show in Figure S10 an example where you can get good estimation by choosing a good $\ell_2$ penalty; how do you choose this?
- The results on interventions are interesting, but could you at least speculate on the informativeness of experimentally feasible interventions? I suppose those would be limited by partial observation.
- I'm curious about the generality of the claim in Figure S5 - this correspondence between variance and task-relevance should depend on the nature of the solution found by the RNN, right? For example, some solutions based on feedforward amplification - like Karel Svoboda's group has recently suggested are at play in ALM, in [Daie et al. (2023)](https://www.biorxiv.org/content/10.1101/2023.08.04.552026v2.full) - would seem to violate this.
- There is a long discussion in the appendix (starting around line 1360) about long-term recordings, but it's not clear to me that such recordings would necessarily resolve the challenges documented in the paper, because on those timescales there can clearly be substantial changes due to plasticity and other factors. I'm confused about why you'd cite Driscoll et al. (2017) here without commenting on the fact that that paper shows substantial drift in responses over time, which seems contrary to the idea that such long recordings would help constrain an RNN model with fixed weights.
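To spell out the comment on Thm. 2: for a one-step loss of the form $L(\theta) = \sum_t \ell(\phi(\theta x_t), y_t)$ (my notation, roughly matching the paper's), the chain rule gives, for each row $\theta_i$,
$$\nabla_{\theta_i} L = \sum_t \ell'\big(\phi(\theta_i^\top x_t), y_{t,i}\big)\, \phi'(\theta_i^\top x_t)\, x_t,$$
which is a linear combination of the observed covariates $x_t$. Every gradient step therefore lies in the row space of $X$, and by induction so does the accumulated update away from the initialization.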
Fully human-written
---
A theory of parameter identifiability in data-constrained recurrent neural networks
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The authors consider the problem of fitting RNNs to data, something often done in neuroscience. They characterise identifiability in the cleanest case: within model class, one-step prediction on fully observed, noiseless data, with monotonic nonlinearities. They find an intuitive result: the weights are identifiable only within the span of the 'input datapoints' (meaning the span of the concatenations of the previous timestep's activity vector and the current timestep's input vector). They show some theoretical and empirical results regarding which estimators will and won't recover these identifiable parameters, and present empirical results showing that FORCE learning does not set non-identifiable weights to zero. Finally, they show how to design interventions to enlarge the set of identifiable weights, and that, if the activity stays within the identifiable span, the dynamics will generalise.
- The main theorem, Theorem 1, was intuitive and interesting
- The writing was clear, barring some AI-like verbosity
- The question was interesting, and well-framed
- The comparison to another common training method, FORCE, was cool.
- I liked the analysis of low-rank networks in S2.
Overall, the framing of the paper got me very excited, but technical concerns led me to see the contributions as smaller than I initially thought. I will list them here, and the authors can likely correct some of my mistaken understandings.
- First, since the RNN is trained on one-step prediction, and assuming complete observability, the problem reduces to a zero-layer feed-forward network, i.e. a generalised linear model [linear regression followed by a nonlinear link function]. Then the result is simply: if the nonlinearity is monotonic, identifiability is the same as for linear regression, namely identifiability on the span of the input data. (A) Emphasising this simplicity seems good? (B) Surely this is already well known? Googling identifiability of generalised linear models provides many results. In this particular setting it may be new, but it seems very closely related to existing ideas?
- Then I had some concerns with the unobserved-data case. Firstly, unless I'm confused, it is wrongly signposted in the text (it's at the end of Appendix B.2, not C as advertised?). Then, for some reason, $\phi$ becomes arbitrary rather than monotonic? Finally, and more importantly, the result's framing seemed odd. It showed that, even if you know the parameters relating to the unobserved neurons, you keep the non-identifiability of the parameters outside the span of the observed data. So far so good. It did not show that "even if $P=I$, RNN parameters may remain non-identifiable due to the hidden influence of unobserved neurons or redundancy introduced by non-monotonic activation functions". It just showed, exactly as in the fully observed case, that there exists a class of non-identifiable parameters living outside the span of the observed data. By assuming that the parameters tied to unobserved neurons are known, you remove all the interesting parts of the problem? And you don't discuss the role of the nonlinearity at all? So why should I draw the conclusion you suggest from the theorem?
- Next, I was concerned by the claims about $\ell_2$ regularisation. In the simplest (noiseless) case, the estimator that fits the data with minimal $\ell_2$ norm will clearly select the weights with no projection onto the nullspace of the data matrix, solving the identifiability problem (see the sketch after this list). Yet the authors claim that this is insufficient. They justify this claim by showing that FORCE learning recovers non-zero non-identifiable weights - certainly an interesting result on a shortcoming of FORCE learning. But I don't see the link: they claim that FORCE learning effectively performs $\ell_2$ regularisation, but I don't see how. I read through the original algorithm and couldn't find where it minimises error plus an $\ell_2$ penalty; in fact, I view the authors' results as evidence in the opposite direction. If FORCE were minimising such a loss, it would not retain these non-identifiable components upon convergence!
- Theorem 2 seemed solid [I did not check this proof], showing that if you start within the identifiable weights and effectively regularise the weights, you will stay there. I was surprised that Corollary 1 is a local claim, about the loss near a minimum where second-order Taylor expansions are relevant. This is a severe restriction for a general loss and should be acknowledged as such (for example, perhaps name the corollary local identifiability in nonlinear regression). My take-away was still that $\ell_2$ regularisation saves the day.
- I found the discussion of noise confusing. The model is introduced as noisy, but none of the analytics address that setting. The only discussion of noise is in 4.3, where suddenly the data have noise added to them. I did not get why the important quantity is the span of the noiseless component - that's only true if both the real dynamics and the fitting are applied to the noiseless data, which is not the case? (Unless the added noise is just observation noise, not the input or conversion noise introduced in Section 2.) I agree that estimating the rank of data in the presence of noise is interesting, and likely relevant, but (a) surely this is well studied? (b) the link to the rest of the work is very unclear to me.
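A toy numpy sketch of the minimum-norm point from the $\ell_2$ bullet above (linear setting, i.e. after inverting the monotone nonlinearity; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
TM, N_X, r = 100, 40, 15
X = rng.standard_normal((TM, r)) @ rng.standard_normal((r, N_X))  # rank-deficient data
theta_true = rng.standard_normal(N_X)
y = X @ theta_true                       # noiseless targets

# The minimum-l2-norm solution (pseudoinverse) has zero component
# in the nullspace of X, i.e. in the non-identifiable subspace.
theta_min = np.linalg.pinv(X) @ y
P_row = np.linalg.pinv(X) @ X            # projector onto the identifiable subspace
print(np.linalg.norm(theta_min - P_row @ theta_min))  # ~1e-14: no nullspace part
```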
Overall, despite thinking this is an interesting question, I felt that the more interesting parts weren't tackled. In the noiseless case with full observability, the problem comes down to a very simple result - the same span-of-the-data statement as in linear regression - and the writing made this simplicity hard to see. A large part of the surprises in this problem seems to come from unobserved parts of the model, noise, and model mismatch; none of these were robustly tackled. Add to this the additional confusions listed above, and I'm afraid I am currently leaning reject.
Further, stylistically, there was simply a lot of material, which made it hard to digest. Appendix B was basically a continuation of the paper (8 pages!), including B.3, one of my favourite bits. Some more digesting by the authors, and much punchier writing (it's very verbose), would likely help the presentation. But this is vague advice, so not something I can reasonably request be changed in a rebuttal.
Clear from the above, I think.
Fully human-written |