Reinforced Data-Driven Estimation for Spectral Properties of Koopman Semigroup in Stochastic Dynamical Systems
Soundness: 2: fair
Presentation: 2: fair
Contribution: 3: good
Rating: 2: reject
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
In the present paper, the authors introduce an amendment to stochastic dynamic mode decomposition in which reinforcement learning is utilized in an outer loop to optimally select trajectory sample points so as to maximize a Koopman-based reward function. Evaluation is performed on a set of three stochastic dynamical systems, namely the double-well potential, the stochastic Duffing oscillator, and the FitzHugh-Nagumo model.
The paper is well motivated and very well written in its construction of the individual algorithm variants, relating the background theory of the Koopman operator to their subsequent derivation.
As is, the paper suffers from a number of inherent flaws; in short, at a high level:
* Insufficient relation to existing work on optimal/intelligent sampling, and especially to active sampling, into whose broader purview this paper falls
* Lack of a coherent design of evaluation
* Approach seemingly confined to a single DMD algorithm. Broader utility to the field not readily apparent
#### Insufficient relation to related work
* The paper is not set in relation to other work that utilizes reinforcement learning for optimal sampling strategies, such as Zhao and Pillai [1] and Shen et al. [2].
* The core idea of the paper, moving away from random sampling for SDMD, is not confined to DMD but extends to the wider scientific computing and machine-learning-based design literature. Some use GP surrogates to sample from in order to alleviate the cost of expensive individual sampling trajectories [3], others train dedicated sampling models [4], yet all of them can be broadly summarized under the larger umbrella of _Active Learning_ [5]. The paper sadly fails to draw these connections.
* Lines 54-56: the choice of dictionary parallels the issue faced in symbolic regression / SINDy-based approaches. Drawing this connection would greatly aid in embedding this work in the existing literature.
* Line 58: some have started utilizing LLMs for learning the dictionary; see e.g. [6].
#### Design of Evaluation
* The current evaluation does not permit drawing conclusions on the performance of the introduced algorithms, as only the potentials and eigenvalues of the stochastic dynamical systems are shown. Moreover, it is not apparent which algorithm(s) were used for the construction of the eigenvalues.
* There is no actual performance evaluation, such as evaluating each of the three algorithms on each of the stochastic dynamical systems for its sampling efficiency in reaching a predetermined quality. This would also be a natural point to introduce ablations.
* The authors perform no ablations. To properly motivate the use of reinforcement learning for optimal sampling, an ablation against random sampling and, e.g., importance sampling would be highly desirable to actually establish the performance advantage of the introduced algorithms. As is, it is not apparent whether the new algorithms outperform a random sampling baseline or not (a minimal sketch of such a baseline is given below).
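For concreteness, here is a minimal sketch of the kind of random-sampling baseline meant above, assuming an overdamped 1D double-well SDE with potential $V(x)=(x^2-1)^2$ and Euler-Maruyama integration; the potential, noise level, step size, and trajectory counts are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def drift(x):
    # Assumed double-well drift -V'(x) for V(x) = (x**2 - 1)**2 (illustrative choice).
    return -4.0 * x * (x**2 - 1.0)

def collect_random_baseline(n_traj=100, n_steps=200, dt=1e-2, sigma=0.5,
                            x_range=(-2.0, 2.0), seed=0):
    """Euler-Maruyama rollouts from uniformly random initial conditions.

    Returns snapshot pairs (X, Y) with Y[i] one step ahead of X[i], i.e. the
    data a random-sampling SDMD baseline would consume.
    """
    rng = np.random.default_rng(seed)
    X, Y = [], []
    for _ in range(n_traj):
        x = rng.uniform(*x_range)  # random initial condition instead of an RL policy
        for _ in range(n_steps):
            x_next = x + drift(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
            X.append(x)
            Y.append(x_next)
            x = x_next
    return np.array(X), np.array(Y)

X, Y = collect_random_baseline()
print(X.shape, Y.shape)  # (20000,) (20000,)
```

Feeding such snapshot pairs into the same SDMD estimator would make for a direct, cheap point of comparison.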
#### References
1. Zhao, D., & Pillai, N. S. (2024). Policy Gradients for Optimal Parallel Tempering MCMC. arXiv:2409.01574.
2. Shen, W., Dong, J., & Huan, X. (2023). Variational Sequential Optimal Experimental Design using Reinforcement Learning. arXiv:2306.10430.
3. Jones, A., Cai, D., Li, D., et al. (2024). Optimizing the design of spatial genomic studies. Nature Communications, 15, 4987. https://doi.org/10.1038/s41467-024-49174-4
4. Fannjiang, C., & Listgarten, J. (2020). Autofocused oracles for model-based design. Advances in Neural Information Processing Systems, 33, 12945-12956.
5. Hsu, D. J. (2010). Algorithms for Active Learning.
6. Grayeli, A., Sehgal, A., Costilla-Reyes, O., Cranmer, M. D., & Chaudhuri, S. (2024). Symbolic Regression with a Learned Concept Library. arXiv:2409.09359.
* Why are the algorithms not evaluated on more challenging (stochastic) environments? As is, it is hard to assess the limits of the approach.
* Has there been any quantitative comparison to commonly used sampling algorithms? Training a reinforcement learning agent is not cheap, and especially in such a highly specific application it is not readily apparent to the reviewer how the expended compute is to be amortized later on. A potential comparison would be to take one of the existing environments on which the agent is trained, provide the same compute budget to a random-sampling-based SDMD, and then compare the two on key metrics.
Fully human-written

Reinforced Data-Driven Estimation for Spectral Properties of Koopman Semigroup in Stochastic Dynamical Systems
Soundness: 3: good
Presentation: 1: poor
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper integrates Reinforcement Learning (RL) with Stochastic Dynamic Mode Decomposition (SDMD) to improve data-driven Koopman spectral estimation for stochastic dynamical systems. The method is named Reinforced Stochastic Dynamic Mode Decomposition (Reinforced SDMD).
It is well understood that the capacity of Koopman-based methods to approximate the spectral decomposition of the (possibly stochastic) evolution operator crucially depends on the distribution of samples, that is, on where and how trajectories are obtained. Poorly chosen initial conditions or the long time scales needed to escape metastable regions lead to inaccurate estimation of eigenfunctions and spectra.
To address this issue, Reinforced SDMD casts data collection as an RL problem: the agent's policy determines the initial sampling locations of trajectories, which are then obtained by numerically solving the known SDE over some time horizon. The reward is based on spectral consistency, that is, on how well the estimated eigenpairs predict the system evolution, combined with an exploration term encouraging coverage of the state space. The paper explores sequential decision-making algorithms including Bandit, DQN, and PPO, showing that the agent identifies dynamically informative regions.
A theoretical analysis that links the quality of the learned policy to the Koopman operator estimation accuracy is provided. Experiments on low-dimensional canonical stochastic systems (double-well, Duffing oscillator, FitzHugh–Nagumo) show efficient discovery of coherent regions without prior domain knowledge.
(1) The proposal to use an RL framework in combination with numerical SDE solvers to estimate the spectral decomposition of the corresponding semigroup of Markov transfer operators, and hence to build an efficient ML-based solver, is, to my knowledge, novel and interesting.
(2) The choice of analysing different sequential decision-making algorithms (Bandits, DQN, PPO) is appreciated.
(3) The choice of canonical SDEs is appropriate for low-dimensional problems, and the experiments support the claim that Reinforced SDMD can obtain good estimates.
(1) The paper fails to report on a large body of work on learning transfer operators of stochastic systems, related not only to the proposed SDMD but also to the problem of sampling from complex distributions via data-driven SDEs. To name just a few directly related works:
- Christof Schütte and collaborators have a long track record on learning stochastic systems and, in particular, on treating the problem this paper tackles; see e.g. the 150-page review Overcoming the timescale barrier in molecular dynamics, Acta Numerica 2023, and references therein.
- Frank Noé and collaborators have also had a significant impact on this topic; see e.g. VAMPnets for deep learning of molecular kinetics, Nature Communications 2018.
- Massimiliano Pontil and collaborators have made significant contributions to understanding learning algorithms for transfer operators associated with SDEs; see e.g. Learning dynamical systems via Koopman operator regression in reproducing kernel Hilbert spaces, NeurIPS 2022.
- More recent papers provide methods for learning continuous semigroups of stochastic processes and also contain statistical learning bounds:
- Hou et al., Sparse learning of dynamical systems in RKHS: An operator-theoretic approach. ICML 2023
- Devergne et al., From biased to unbiased dynamics: An infinitesimal generator approach. NeurIPS 2024
- Kostic et al., Laplace transform based low-complexity learning of continuous Markov semigroups. ICML 2025
(2) The proposed method is not adequately compared to the vanilla performance of numerical SDE solvers. Namely, the reader cannot judge whether the overall computational cost of Reinforced SDMD, which carries a large overhead from using sequential decision-making algorithms to obtain new samples, has any advantage over randomly sampling initial points in the state space. This is all the more important since the current paper only works in small state dimensions and its scalability is not clear.
(3) The authors make a strong assumption on the core transfer-operator estimation method (SDMD) and then proceed with standard theoretical arguments for the convergence of the considered sequential decision-making algorithms. Hence, the main novelty of the paper is methodological, with weak theoretical novelty at best.
(4) From my perspective, many aspects of the paper are not clarified enough; please see the questions for details.
(5) Given the variety of methods for learning representations with neural networks (VAMPnets, DPNets, LoRA, ...) that could be used as the subspace on which SDMD is run, as well as the many competitors of SDMD, at least a broader discussion of the different approaches that can be coupled with the RL formulation is needed to appreciate the authors' proposed approach.
My current score reflects the identified weaknesses; however, I am ready to revise it depending on the authors' clarifications and revision of the paper.
(1) Your choice of reward implicitly assumes that the noise level (diffusion term) is much smaller than the signal of the drift, making the forecasting of states a reasonable task. But isn't the forecasting of distributions a more adequate reward for stochastic systems? What happens if the diffusion is stronger, and how does it impact the experiments?
(2) Looking at Eq. (2.7), one expects the estimator to have a large number of eigenvalues close to one when the dictionary is sufficiently large. Is this the case in your experiments? If yes, how do you choose the eigenvalue-eigenfunction pairs that approximate the true leading non-trivial ones (typically just a few close to one)? For concreteness, a sketch of one such selection heuristic is given after the questions.
(3) In Assumption 5.1, which norm is used? If it is the operator norm on $\mathcal{F}=L^2(\mathcal{M},\rho)$, how do you formally define the estimator acting on this domain? If it is the norm w.r.t. the finite-dimensional subspace of $\mathcal{F}$ given by the dictionary, the transfer operator is typically not well-defined on it, so the assumption becomes unreasonable. I believe that the formal presentation of the method needs to improve significantly in order for the proofs to be verifiable.
(4) For learning the transfer operator, or the generator, the samples need to come from the invariant distribution so that one can guarantee learning of the object on the domain $\mathcal{F}$. However, in your proposal we are getting samples from adequate supports of the invariant distribution (dense in the meta-stable states), but I don't see how the samples are distributed according to $\rho$. What am I missing?
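Regarding question (2), below is a minimal sketch of the kind of selection heuristic one would expect to be spelled out, assuming access to the estimated eigenvalues and to the eigenfunction values evaluated on the snapshots; the constant-eigenfunction filter and the ranking by modulus are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def select_leading_pairs(eigvals, eigfun_on_data, k=3, const_tol=1e-6):
    """Pick the k leading non-trivial eigenpairs of an EDMD/SDMD-style estimate.

    eigvals        : (m,) array of estimated eigenvalues.
    eigfun_on_data : (n_samples, m) array of eigenfunction values on the data.
    Heuristic: discard (near-)constant eigenfunctions (the trivial eigenvalue ~ 1),
    then rank the remaining eigenvalues by modulus in decreasing order.
    """
    variances = np.asarray(eigfun_on_data).var(axis=0).real
    nontrivial = np.where(variances > const_tol)[0]
    order = nontrivial[np.argsort(-np.abs(eigvals[nontrivial]))]
    return order[:k]

# Toy usage with synthetic values (not taken from the paper's experiments).
rng = np.random.default_rng(0)
eigvals = np.array([1.0, 0.97, 0.95, 0.40])
eigfuns = np.column_stack([np.ones(500)] + [rng.standard_normal(500) for _ in range(3)])
print(select_leading_pairs(eigvals, eigfuns, k=2))  # -> [1 2]
```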
Fully human-written

Reinforced Data-Driven Estimation for Spectral Properties of Koopman Semigroup in Stochastic Dynamical Systems
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper introduces a novel technique for learning the Koopman operator via reinforcement learning (RL). The proposed framework, called Reinforced Stochastic Dynamic Mode Decomposition (Reinforced SDMD), integrates RL with SDMD to actively guide the data acquisition process in stochastic dynamical systems. The method leverages three RL algorithms (Multi-Armed Bandit, Deep Q-Network (DQN), and Proximal Policy Optimization (PPO)) to generate well-behaved trajectories that enhance the robustness of Koopman operator estimation. The reward signal is based on a spectral consistency criterion, designed to encourage the agent to collect informative trajectories while maintaining adequate exploration. The authors validate their approach on three synthetic systems (the double-well potential, stochastic Duffing oscillator, and FitzHugh–Nagumo model), showing that the agent can identify dynamically relevant regions. They also provide a theoretical convergence analysis that connects estimation accuracy to the quality of the learned sampling policy.
The integration of RL and Koopman operator estimation is conceptually appealing and addresses an important limitation of existing data-driven spectral methods, i.e., their dependence on data quality and sampling. I appreciate that this approach is systematically evaluated using three distinct RL algorithms and tested across three representative dynamical systems. Overall, the paper is clearly written, well structured, and technically sound.
- While the qualitative illustrations are convincing, it remains unclear how much improvement RL sampling yields compared to random or uniform sampling strategies. Including quantitative metrics, for instance the eigenvalue distance between estimated and ground-truth spectra (e.g., using the directed Hausdorff distance), would significantly strengthen the empirical validation; a minimal sketch of such a metric is given after these points.
- The experiments focus exclusively on 2D systems; even a moderate increase in dimensionality (e.g., 5-10 dims) would help demonstrate the method’s scalability and computational feasibility.
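A minimal sketch of the metric suggested above, treating each spectrum as a point set in the complex plane; the eigenvalue arrays are placeholders for illustration, not results from the paper.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def spectral_hausdorff(est_eigs, true_eigs):
    """Directed Hausdorff distance from estimated to ground-truth eigenvalues,
    with complex eigenvalues embedded as points (Re, Im) in the plane."""
    u = np.column_stack([np.real(est_eigs), np.imag(est_eigs)])
    v = np.column_stack([np.real(true_eigs), np.imag(true_eigs)])
    return directed_hausdorff(u, v)[0]

# Placeholder spectra for illustration only.
true_eigs = np.array([1.0, 0.92, 0.75 + 0.05j, 0.75 - 0.05j])
est_eigs = np.array([1.0, 0.90, 0.74 + 0.06j, 0.74 - 0.06j, 0.30])
print(spectral_hausdorff(est_eigs, true_eigs))
```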
See Weaknesses
Fully AI-generated

Reinforced Data-Driven Estimation for Spectral Properties of Koopman Semigroup in Stochastic Dynamical Systems
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper proposes Reinforced Stochastic Dynamic Mode Decomposition, which introduces reinforcement learning (RL) to automatically guide the data collection process for Stochastic Dynamic Mode Decomposition (SDMD). The authors shape a new reward signal to guide the agent based on spectral consistency, which measures how well the Koopman operator has been estimated. The method is validated using three different RL algorithms on canonical systems. Moreover, a theoretical analysis of the proposed algorithm is provided under strong assumptions.
- They introduce a new reward function to guide the agent in collecting the data. The function combines an exploitation term, defined as the spectral consistency, with an exploration bonus measured with a Gaussian kernel (a sketch of one plausible form is given after these points).
- The data-collection method is shown to work with three different RL algorithms.
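For reference, a minimal sketch of one plausible form of such a reward, consistent with the description above but not necessarily identical to the paper's exact definition; the one-step spectral-consistency residual, the Gaussian bandwidth, and the weighting beta are assumptions.

```python
import numpy as np

def spectral_consistency(eigvals, phi_x, phi_y):
    """Exploitation term: how well each estimated eigenpair satisfies
    phi(x_{t+1}) ~ lambda * phi(x_t) on the collected snapshot pairs.
    phi_x, phi_y: (n_samples, m) eigenfunction values at x_t and x_{t+1}."""
    residual = np.abs(phi_y - eigvals[None, :] * phi_x)
    return -residual.mean()  # higher is better

def exploration_bonus(new_points, visited_points, bandwidth=0.5):
    """Gaussian-kernel novelty: low kernel similarity to previously visited states.
    new_points: (p, d), visited_points: (q, d)."""
    diffs = new_points[:, None, :] - visited_points[None, :, :]
    k = np.exp(-np.sum(diffs**2, axis=-1) / (2.0 * bandwidth**2))
    return float(1.0 - k.max(axis=1).mean())

def reward(eigvals, phi_x, phi_y, new_points, visited_points, beta=0.1):
    # Assumed additive combination of exploitation and exploration (illustrative).
    return spectral_consistency(eigvals, phi_x, phi_y) \
        + beta * exploration_bonus(new_points, visited_points)
```

Even this rough form makes the later questions (e.g. the role of the baseline R_0) concrete.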
1. Lack of baselines: the paper is missing baselines to show that the method brings benefits. For example, randomly initializing the agent's position at each rollout and collecting data from there would be a simple yet effective comparison.
2. Inconsistency between the theory and the experimental results: the theoretical results are built on the strong assumption that the Q and V functions can be linearly expressed in the DQN and PPO algorithms, respectively. However, these assumptions are not tested to see whether they hold in practice (a simple check is sketched after this list).
3. Missing reward-learning results: the paper does not demonstrate whether the proposed reward function can actually be learned by the agent in practice. The authors should include figures relating the agent's performance to the reward convergence, to illustrate how expert the agent used for data collection actually is.
4. Computational costs: The proposed method appears to be computationally intensive. Without a comparison to baseline methods, it remains unclear whether the computational costs are justified.
5. Unclear role of R_0: you do not justify the role of this baseline term, nor state whether it is a hyperparameter to be tuned. The term is left unexplained throughout the paper.
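Regarding point 2 above, one simple empirical check the authors could report: regress the learned values onto a fixed feature map and report the explained variance. The toy "Q-values" and polynomial features below are placeholders for the trained agent and the paper's dictionary, i.e. purely illustrative assumptions.

```python
import numpy as np

def linearity_score(states, q_values, features):
    """R^2 of a least-squares fit of Q-values on a fixed feature map.
    Values close to 1 would support the linear-Q assumption used in the theory."""
    Phi = features(states)  # (n, d) design matrix
    coef, *_ = np.linalg.lstsq(Phi, q_values, rcond=None)
    resid = q_values - Phi @ coef
    return 1.0 - resid.var() / q_values.var()

# Placeholders: random states, a toy stand-in for the trained DQN, polynomial features.
rng = np.random.default_rng(0)
states = rng.uniform(-2, 2, size=(1000, 2))
q_values = np.sin(states[:, 0]) + 0.5 * states[:, 1] ** 2   # stand-in for Q(s, a*)
features = lambda s: np.column_stack([np.ones(len(s)), s, s**2, s[:, :1] * s[:, 1:]])
print(linearity_score(states, q_values, features))
```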
Minor
1. In Line 356, you mention that Figure 5 shows the first eigenfunction, but you are showing just the second one.
2. Typo in line 483 -> “Essentiall”
Unclear meaning of the figures and doubts about the evaluation process:
1. You show in Figures 2, 3, and 5 that the Koopman eigenfunctions are learnt better on the data points collected using an agent that improves over training. Are these eigenfunctions learnt using a fixed number of points? Do these points come from the different policies obtained along the training, in an “off-policy” way?
2. Where do the points on which you evaluate the eigenfunctions come from? Are the eigenfunctions learnt using all of those points or only a portion of them?
Fully human-written