Duality and Policy Evaluation in Distributionally Robust Bayesian Diffusion Control
Soundness: 3: good
Presentation: 2: fair
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes a framework aiming to incorporate prior-level distributional robustness into continuous-time Bayesian control through a quotient-space dual formulation. A strong duality result (Thm 2) is derived via a quotient-space construction where $\tfrac{dQ}{dP}=\tfrac{d\nu}{d\mu}(B)$, allowing the objective to be expressed over $\sigma(B)$ and linked to smooth ambiguity models. The paper also develops a randomized multilevel Monte Carlo (rMLMC) estimator for unbiased policy evaluation with an asymptotic $\mathcal{O}_p(n^{-1/2})$ convergence rate (Thms 11--13) and proposes two policy-learning procedures: a KL finite-prior optimization and a neural-network-based learning objective (Thm 17). Experiments on synthetic and financial data show reduced conservatism compared with standard DRC.
- While prior-only robustness has been explored in static DRO and Bayesian optimization, extending it to a linear-diffusion model is interesting, although the setting is quite specific and the resulting time-inconsistency issue is not addressed.
- The quotient-space construction and strong duality (Thm 2) are neat, and the asymptotic $\mathcal{O}_p(n^{-1/2})$ convergence rate for the rMLMC estimator is well motivated.
- The quotient-space duality and the rMLMC-based unbiased evaluation are appealing ideas that could inspire follow-up work on robust Bayesian control. That said, the overall impact feels limited since everything remains tied to a one-dimensional linear-Gaussian SDE with constant drift and volatility.
The analysis and algorithm are developed entirely under a specific linear diffusion model with constant drift and volatility. The “policy learning” step optimizes terminal wealth rather than a general state-action mapping, and there is no evidence that the ideas extend beyond this setup. As such, it is not clear how the approach contributes to the broader ICLR community, which typically values methods applicable to generic stochastic control, reinforcement learning, or optimization under uncertainty.
Specific points:
- The strong duality (Thm 2) relies on a quotient-space construction where $\frac{dQ}{dP}=\frac{d\nu}{d\mu}(B)$ and $W$ remains Brownian. This holds only for the 1-D linear-Gaussian SDE; for nonlinear or state-dependent dynamics, the assumption fails and the duality cannot be applied.
- The light-tailed prior and compact-support assumptions (Assumptions 7 and 8) are important for bounded multipliers and the rMLMC CLT, but they seem unrealistic in higher-dimensional or heavy-tailed settings.
- The KL dual is non-strictly concave in $(\lambda,\beta)$ (Thm 38), producing flat regions and saddle points that worsen with dimension (a generic form of this dual is sketched after this list for context).
- The smooth ambiguity loss (Thm 17) requires the closed-form Girsanov transform $W_t^{b_1}=W_t+\frac{B-b_1}{\sigma}t$, so it doesn’t generalize to nonlinear or non-Gaussian systems.
- Each prior draw requires $2^{N^b}$ inner samples; in high dimensions cost scales as $O(2^{N^b}d^2)$. Even in 1-D, full evaluation reportedly takes about 40 hours.
- Unbiased sampling from $P^b$ is only possible in the linear-Gaussian case, so the unbiased-CLT results may not hold elsewhere.
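For context, if the adversary minimizes the agent's value over a $\phi$-divergence ball around the prior, the generic dual from the DRO literature (my notation; the paper's Theorem 38 may differ in details) reads
$$\inf_{\nu:\,D_\phi(\nu\|\mu)\le\delta} \mathbb{E}_\nu\big[Z^B\big] \;=\; \sup_{\lambda\ge 0,\;\beta\in\mathbb{R}} \Big\{ \beta - \lambda\delta - \lambda\,\mathbb{E}_\mu\Big[\phi^*\Big(\tfrac{\beta - Z^B}{\lambda}\Big)\Big] \Big\},$$
where the multipliers enter through a perspective function that is jointly concave but generally not strictly concave in $(\lambda,\beta)$, consistent with the flat regions noted above.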
- Can the strong duality extend beyond the 1-D linear-Gaussian diffusion? Under what conditions on dynamics or noise would the quotient-space argument still apply?
- How sensitive are the results to the light-tail and compact-support assumptions? Have you tried priors with heavier tails or unbounded support, and if so, how does the optimization behave?
- In practice, how severe is the non-strict concavity issue for the KL dual? Do the plug-in or univariate-$\lambda$ variants fully stabilize optimization?
- Could the smooth-ambiguity loss be reformulated for discrete-time or nonlinear systems without relying on the Girsanov transform?
- What is the observed scaling of rMLMC cost and variance with respect to dimension and level $N^b$? Any chance of using control-variates or sample-reuse to cut runtime?
- If unbiased simulation from $P^b$ is infeasible, how would biased simulators affect convergence and the CLT?
- Are there plans to test the approach on non-financial control or RL problems? Even a small toy example would help gauge generality.
Fully AI-generated
---
Duality and Policy Evaluation in Distributionally Robust Bayesian Diffusion Control
Soundness: 3: good
Presentation: 2: fair
Contribution: 1: poor
Rating: 4: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
This paper proposes a Bayesian distributionally robust control setup for a diffusion-control problem in finance.
An agent must decide how to allocate investments between a risk-free asset and a risky asset whose value is determined by a diffusion process combining a partially known drift term $B$ with a standard Brownian motion.
The agent starts with some prior guess $\mu$ of the distribution of $B$.
Before the game begins, an adversary chooses the true distribution $\nu$ of $B$ from a divergence-ball around $\mu$.
The agent's goal is, given its prior belief and a known bound $\delta$ on the divergence $D(\nu \| \mu)$, to select a policy that allocates investment between the risk-free and risky assets over time based on the observed history of the risky asset's price.
The primal distributionally-robust minimax optimization is infinite-dimensional due to both the agent's and the adversary's action spaces.
The authors extend a prior strong duality result to the (inner) adversary's optimization.
To make this result useful in the outer agent's optimization, additional assumptions on the solution (Assumptions 4-8) are needed.
I am not familiar enough with the topic to comment on their strength.
After the duality transformation, the computational problem remains hard, but the adversary's problem is no longer infinite-dimensional.
The authors propose an alternating scheme between adversary and agent, where the adversary's step is called "policy evaluation", and the agent's step is called "policy learning" (although the scheme is not a straightforward instance of policy iteration from MDPs).
To solve the "policy evaluation" step, the authors assume access to a simulator that can, given a fixed control $\pi$, generate samples from which to estimate the expected terminal utility $Z^b$.
This simulator is used in a randomized multi-level Monte Carlo estimator, from which a stochastic estimate of the gradient/Hessian can be computed (although this is only discussed in any detail in the appendix).
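For readers who, like me, are new to rMLMC, a minimal Blanchet--Glynn-style sketch of the idea is below; it is only a generic stand-in for the paper's actual estimator, and `simulate`, `g`, `p`, and `n0` are hypothetical placeholders rather than the authors' interface.

```python
import numpy as np

def rmlmc_draw(simulate, g, p=0.6, n0=8, rng=None):
    """One unbiased draw of g(E[Y]) via randomized multilevel Monte Carlo.

    simulate(n, rng): n i.i.d. samples of Y (e.g. terminal utilities from the
    diffusion simulator under a fixed control and a fixed prior draw).
    g: the smooth outer nonlinearity applied to the mean.
    """
    rng = rng or np.random.default_rng()
    N = rng.geometric(p) - 1                  # random level, P(N = n) = p(1-p)^n
    pN = p * (1.0 - p) ** N
    y = simulate(n0 * 2 ** (N + 1), rng)      # fine-level batch
    fine = g(y.mean())
    coarse = 0.5 * (g(y[::2].mean()) + g(y[1::2].mean()))  # antithetic halves
    base = g(simulate(n0, rng).mean())        # coarsest-level term
    # Randomizing the level makes the telescoping sum unbiased for g(E[Y])
    # (under moment and smoothness conditions), at finite expected cost.
    return base + (fine - coarse) / pN

# toy check: Y ~ N(1, 1) and g = exp, so the target g(E[Y]) = e
rng = np.random.default_rng(0)
print(np.mean([rmlmc_draw(lambda n, r: r.normal(1.0, 1.0, n), np.exp, rng=rng)
               for _ in range(5000)]))
```

Averaging many such unbiased draws is what yields the CLT-type guarantees, which is presumably the role the estimator plays in the policy-evaluation step.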
To solve the "policy learning" step, the authors handle two cases separately: 1) KL-divergence with a finite prior, and 2) General divergences and priors.
- In case 1), the problem is transformed from a plain optimization over $\pi$ into another minimax, and Sion's minimax theorem is then applied to obtain the form in Theorem 16.
This seems unhelpful, since we have now introduced another infinite-dimensional optimization over probability measures.
However, with the KL-divergence and the finite prior, it becomes tractable, and the inner policy optimization becomes expressible in closed form (although I am not sure how easy it is to compute; a generic note on why KL plus a finite prior helps appears after this list).
- In case 2), the authors instead transform the policy optimization problem into a form of supervised learning using a neural network to predict the terminal wealth $X_T$ as a function of the terminal Brownian motion state $W_T$ and the drift $B$.
It was hard for me to figure out if this gives us a way to recover the policy $\pi$.
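As background for case 1) (not necessarily the paper's exact derivation): with a finite prior $\mu=\sum_{i=1}^m \mu_i\,\delta_{b_i}$, the KL ball reduces to a finite-dimensional constraint $\sum_i \nu_i \log(\nu_i/\mu_i)\le\delta$ over the simplex, and worst-case reweightings take the exponential-tilt form $\nu_i^*\propto \mu_i\, e^{-V(b_i)/\lambda}$ for the relevant value $V(b_i)$ and a scalar multiplier $\lambda$; I assume a closed form of this Gibbs type is what makes the KL/finite-prior combination tractable here.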
The authors evaluate their approach in simulation against the more conservative DRC, the non-robust Bayesian solution, and the classical Merton ($B$ constant) solution, showing significant improvements in Sharpe ratio, a risk-aware measure of performance relative to the risk-free asset.
The problem formulation seems to capture an important special case of the distributionally-robust control problem, where the uncertainty is confined to the drift term in a diffusion process.
The mathematical approach is highly sophisticated, with some nontrivial results from the authors and some usage of recent advanced Monte Carlo estimation tools. I read the theorem statements in the main body, but did not check the proofs in the appendix. However, the tools used seem appropriate.
I am not familiar enough with economics/finance to judge the paper's novelty in that area.
The paper is rigorous about stating its assumptions (although sometimes they are highly technical and more discussion would be helpful).
The simulated comparison against baselines shows significantly better performance in terms of Sharpe ratio when evaluated on historical data.
The introduction to this paper seemed to suggest general-purpose implications in learning-based control. For example, the statement:
*"Our motivating application is continuous-time control with unknown dynamics"*,
or the list of related work citing contextual bandits, policy gradient, $Q$-learning, etc.
However, as far as I can tell, the contributions are highly specialized to the specific finance-inspired diffusion problem considered here.
As a reviewer from the RL/control side of things: if this paper has any broader implications outside of economics/finance, they were not apparent to me. For the highly diverse ICLR audience, it would help if the authors pointed out how the ideas in this paper might be generalized.
I defer to the expertise of the Area Chair and other reviewers with economics/finance backgrounds to judge the novelty and impact of this work within those fields.
Paragraph at end of Section 2 contrasting against DRC: this connects back to the "Adversary's power" issue from the intro (see Questions), but for the reader not already familiar with DRC, it remains hard to understand exactly how your approach is different. Pushing a few technical details to the appendix to make room for a more in-depth comparison against DRC would be a better use of the main paper space, in my opinion.
Lines 265-292: As a reader not familiar with rMLMC already, I felt the mathematical algorithm definitions here are too dense to help me understand the method for the first time, and the contrast against Blanchet & Glynn (2015) and Blanchet et al. (2019) is not meaningful to me.
I suggest moving the details on the estimator to the appendix, and presenting a higher-level summary of the Policy Evaluation approach (including the outer optimization part) in the main paper instead.
Similarly, the policy learning steps include dense detail in the main body, but leave out some key concepts, such as how one actually extracts the policy $\pi$ from the proposed solutions (see Questions). I again suggest pushing a few details in the appendix to make room for a high-level summary.
The experiment in Section 6.1 and Table 1 feels inconclusive to me. We can see that the ordering of the learning results matches that of the closed-form, but the differences are large. A plot over a wider set of $b$ and/or $r$ values would be more helpful for evaluating the learning -- for example, it could capture that the learned result is order-preserving despite having large mean squared errors.
Lines 2136-2137: please use `\displaystyle` to make the math legible.
Lines 048, 096: The phrase "Adversary's power is replenished" is hard to understand; I think we need a more clear idea of the adversary's decision space and what "power" means first.
Section 2.1: For someone with RL/control background but no economics/finance background, the language about "risk", "asset", "interest rate", "stock price", etc. makes this section hard to understand. Please try to state the setup in an abstract mathematical form first, then treat the finance application as an example separately.
What kind of mathematical object is $\pi_t$? Given Equation 2 and the description "amount of money invested in the risky asset", I thought it was a real number. But then in the definition of admissibility, we take $||\pi_t||_2$, which implies by convention that $\pi_t$ is a vector in general.
Line 137: Why is this without loss of generality?
Line 146: Isn't this normally called an $f$-divergence?
Line 198: Defines $Z^b$ as an expectation of a real-valued random variable, hence a real number.
But then, Line 263 says we are using the simulator to "generate unbiased samples from $Z^b$", implying $Z^b$ is a distribution.
Line 296: How strong is Assumption 10?
Line 385-386: What is a "blue case"?
Section 5.2: The initial problem (10) is written as an optimization over $\pi$, but after the transformation we are using a neural network to optimize $X_T$. But isn't the policy $\pi$ the object we ultimately want, if we are using the approach to do real-life trading? How do we convert from the neural network back into a policy $\pi$?
The conclusion implies that there are some other divergence families outside the proposed $\phi$-divergences that are worthy of attention -- what are they?
Fully human-written
---
Duality and Policy Evaluation in Distributionally Robust Bayesian Diffusion Control
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
- This paper investigates the DRO formulation of the Bayesian diffusion control problem, where a continuous-time stochastic system is driven by a drift, Brownian motion, and a prior distribution over the drift (a schematic form of the game is given after this summary).
- The inner minimization problem (policy evaluation) is reformulated into a dual form, and a strong duality result is established for general $\phi$-divergence ambiguity sets. This problem is solved using the rMLMC algorithm, which comes with a provable $\mathcal{O}_p(n^{-1/2})$ convergence guarantee.
- The outer maximization problem (policy learning) admits a tractable formulation, to which a general deep learning algorithm can be applied.
- Empirical results demonstrate that the proposed algorithm matches the closed-form solution and outperforms the DRC baselines, which tend to be overly pessimistic.
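Schematically (my notation, which may differ from the paper's), the game studied is of the form
$$\sup_{\pi\in\mathcal{A}}\;\inf_{\nu:\,D_\phi(\nu\|\mu)\le\delta}\;\mathbb{E}_{B\sim\nu}\Big[\mathbb{E}\big[U(X_T^{\pi})\,\big|\,B\big]\Big],$$
i.e., the ambiguity acts only on the prior over the drift $B$, not on the path-level law of the diffusion.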
The proposed formulation and algorithmic solution has a clear conceptual novelty. The algorithm is supported by a strong theoretical foundation, including a tractable reformulation with the strong duality, and established sample complexity results. Although I have not verified the proofs in the appendix, the results appear mathematically sound and build upon established prior work.
The advantages of the proposed setting and algorithm over prior methods could have been explained more clearly. The authors claim that DRBC improves upon standard DRC by avoiding overly conservative policies through a single DRO formulation that does not depend on DP. However, this advantage is not entirely clear to me and appears to be insufficiently explored. The empirical evaluation also appears limited. Please see the questions below.
- In DRC, the degree of conservativeness can be controlled by simply adjusting the radius of the ambiguity set. Could the authors provide more intuition on how the proposed algorithm mitigates overpessimism beyond this?
- Are there any similar settings or approaches in the DRC literature that avoid DP (or HJB) by directly optimizing over a single ambiguity set defined on the distribution of model parameters? Additionally, what trade-offs or limitations does the proposed framework introduce by avoiding DP?
- Could the authors clarify how the ambiguity set and its radius were selected in their experiments (Section 6.2) to ensure fair comparisons?
- In line 311, $E \rightarrow \mathcal{E}$.
Lightly AI-edited
---
Duality and Policy Evaluation in Distributionally Robust Bayesian Diffusion Control
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
The paper studies a distributionally robust Bayesian control (DRBC) formulation for continuous-time diffusion processes under model uncertainty, motivated by scenarios common in portfolio optimization. The authors focus on misspecification in the prior distribution of the unknown drift, formulating the DRBC game as an adversarial problem over divergence neighborhoods of the prior. They establish strong duality, derive efficient evaluation and learning methods, notably an unbiased rMLMC estimator, and provide both theoretical and empirical support, including neural-network-based policy learning and real-world stock data experiments.
The paper tackles an important challenge in robust control: mitigating over-conservatism of classic distributionally robust control by localizing the adversarial game to the Bayesian prior.
1. The mathematical formulation and the derivation of strong duality results provide theoretical insight, connecting financial mathematics and the ambiguity literature. The unbiased rMLMC estimator for evaluating the robust value function, along with its statistical guarantees, is technically sophisticated and practically motivated.
2. The neural policy learning strategy, guided by nontrivial stochastic analysis, pushes forward robust policy computation without relying on dynamic programming. The discussion around computational tractability is forthright, and the application to the KL-divergence case delivers semi-closed form structure.
3. Real data experiments on S&P 500 stocks demonstrate improved Sharpe Ratios with DRBC over both Bayesian and classic robust baselines.
1. Lack of systematic robustness ablations. The empirical story does not yet demonstrate robustness through ablations. Claims about “less pessimism” and better risk-return trade-offs hinge on how performance evolves as the ambiguity radius $\delta$ increases, the prior is increasingly misspecified, or the adversary is strengthened. At present, results are largely point evaluations rather than sensitivity curves. The paper should introduce grid-based sweeps over $\delta$ and controlled prior shifts, plot risk-return frontiers and policy-behavior heatmaps, and show how action distributions and terminal-wealth quantiles change as robustness knobs are turned (a minimal sketch of the kind of sweep I have in mind appears after the minor-issues list below).
2. The strength of the contribution rests on duality/minimax interchanges and properties of the penalized objective (e.g., monotonicity/concavity of $\Phi(\lambda,\beta)$), yet the paper stops short of giving concrete, checkable sufficient conditions and diagnostics that practitioners can apply to their own settings. Readers need to know when these assumptions plausibly hold (or fail), what quick tests to run, and what numerical safeguards to use when they do not. As it stands, correctness and numerical stability feel contingent on assumptions that are described as “mild” but not empirically stress-tested or made verifiable.
3. While the framework is positioned as covering general $\phi$-divergences, the only fully implemented and evaluated case is KL. This makes the practical scope narrower than advertised and leaves open whether the method’s stability and performance persist beyond KL. The paper should deliver at least one additional end-to-end instance with implementation details, timing, convergence behavior, and head-to-head performance versus KL to substantiate generality claims.
4. Given the rapid pace in robust control and adversarial continuous control, the paper needs a tighter comparison to recent lines such as [1, 2] and a discussion of related DRO control/RL work. This undercuts the “why this approach now” case.
5. Minor issues
- The construction of the unbiased rMLMC estimators, the role of plug-in approximations, and the variance/cost trade-offs live across sections and appendices. Consolidating them into a single, self-contained subsection with defaults for $R, n_0$, and clear guidance on when to prefer unbiased vs. biased estimators would make the paper far easier to reproduce.
- Core notions such as admissible controls, the quotient-space view of uncertainty, and the parallel use of $P^b$ vs. $Q^b$ are correct but presented abstractly.
- The paper would benefit from a frank discussion of when assumptions fail (e.g., heavy-tailed priors, volatility uncertainty, discretization error), where dual-multiplier optimization becomes numerically unstable, and how rMLMC cost scales in practice.
- A few targeted plots, such as adversarial posterior tilts, policy trajectories as $\delta$ varies, and sample wealth paths under different risk profiles, would significantly enhance interpretability.
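To make the ablation suggested in point 1 concrete, here is a minimal sketch of the kind of sweep I have in mind; `train_drbc_policy` and `evaluate` are toy stubs standing in for the authors' training pipeline and simulator, not their actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a real ablation would call the paper's DRBC training
# routine and diffusion simulator here instead of these stubs.
def train_drbc_policy(delta, prior_shift):
    return {"risky_fraction": 1.0 / (1.0 + 5.0 * delta + prior_shift)}

def evaluate(policy, prior_shift, n_paths):
    drift = 0.05 - 0.03 * prior_shift            # toy misspecified drift
    returns = rng.normal(drift, 0.2, n_paths)    # toy risky-asset returns
    return 1.0 + policy["risky_fraction"] * returns  # toy terminal wealth

deltas = np.logspace(-3, 0, 8)   # ambiguity radii to sweep
shifts = [0.0, 0.5, 1.0]         # controlled prior-misspecification levels
risk_free = 1.0                  # terminal value of the risk-free benchmark

rows = []
for shift in shifts:
    for delta in deltas:
        policy = train_drbc_policy(delta, shift)
        wealth = evaluate(policy, shift, n_paths=10_000)
        excess = wealth - risk_free
        rows.append((shift, delta, excess.mean(), excess.std(),
                     excess.mean() / excess.std(),      # Sharpe-style ratio
                     np.quantile(wealth, 0.05)))        # terminal-wealth tail
# Plotting mean vs. std for each shift traces the risk-return frontier as
# delta grows; the 5% quantile tracks how conservatism changes.
```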
[1] Wang, S., Si, N., Blanchet, J.: Statistical Learning of Distributionally Robust Stochastic Control in Continuous State Spaces
[2] Park, H., Zhou, D., Hanasusanto, G. A., & Tanaka, T: Distributionally robust path integral control
Please see the weakness. I will change my score if the concerns are resolved.
Fully AI-generated
---
Duality and Policy Evaluation in Distributionally Robust Bayesian Diffusion Control
Soundness: 3: good
Presentation: 3: good
Contribution: 3: good
Rating: 6: marginally above the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
The paper considers robustifying the prior distribution in the context of Bayesian diffusion control. Strong duality results are derived for tractability, and numerical experiments are presented to showcase the efficacy of the proposed approach against various benchmarks.
- The paper's motivation is well-grounded. The detrimental effect of an imprecise prior is also illustrated through numerical experiments.
- Both theoretical tractability and numerical performance are considered in the paper, and both are treated in well-motivated settings with convincing results.
- Assumptions and experimental setups are accompanied by proper discussions.
- The motivation for using phi-divergence beyond the simple strong duality formula seems inadequate. For example, phi-divergence ambiguity set requires that the distribution in the first argument is absolutely continuous with respect to the second argument. It also cannot capture the geometry of the support of the distribution. For finance applications this would be less of an issue through normalization but for most control tasks the geometry of the support does matter.
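For concreteness, the definition behind this point: for convex $\phi$ with $\phi(1)=0$,
$$D_\phi(\nu\,\|\,\mu)\;=\;\int \phi\!\Big(\frac{d\nu}{d\mu}\Big)\,d\mu \quad \text{for } \nu\ll\mu,$$
and for divergences such as KL the value is $+\infty$ otherwise, so every distribution in the ambiguity ball shares the support of the nominal prior $\mu$. An optimal-transport ball $\{\nu: W_p(\nu,\mu)\le\delta\}$, by contrast, lets mass move to nearby points outside that support, which is what gives it the support-geometry awareness mentioned above.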
- How would the results generalize to e.g. optimal transport discrepancies where the distance carries a more physical meaning?
- How would the assumption of having access to i.i.d. samples (simulator access) be relaxed? How can this be guaranteed in real-world applications?
Fully human-written |