The central insight of the deep BSDE method is that solving a semilinear parabolic PDE can be recast as a stochastic optimal control problem. Given the BSDE formulation Y_t = g(X_T) + ∫_t^T f(s, X_s, Y_s, Z_s) ds - ∫_t^T Z_s^T dW_s, the unknowns are the initial value Y_0 and the control process {Z_t}_{0 ≤ t ≤ T}. The variational problem becomes: inf_{Y_0, {Z_t}} E|g(X_T) - Y_T|^2, subject to the forward SDE for X_t and the forward-propagated BSDE dynamics for Y_t.
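In a time discretisation 0 = t_0 < t_1 < … < t_N = T, this amounts to choosing Y_0 and a rule for Z_{t_n}, then propagating both processes forward and measuring the terminal mismatch. A minimal NumPy sketch of that forward pass (the Euler scheme, the toy coefficients mu = 0 and sigma = 1, and all function names here are illustrative assumptions, not part of the original method's code):

```python
import numpy as np

def terminal_loss(y0, z_fn, g, f, x0, T=1.0, n_steps=20, batch=4096, seed=0):
    """Monte Carlo estimate of E|g(X_T) - Y_T|^2 for a given (Y_0, Z) pair."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(batch, float(x0))
    y = np.full(batch, float(y0))
    for n in range(n_steps):
        t = n * dt
        dw = rng.normal(0.0, np.sqrt(dt), size=batch)
        z = z_fn(t, x)
        # Forward-propagated BSDE dynamics: dY = -f(t, X, Y, Z) dt + Z dW
        y = y - f(t, x, y, z) * dt + z * dw
        # Forward SDE (toy case mu = 0, sigma = 1): dX = dW
        x = x + dw
    return np.mean((g(x) - y) ** 2)
```

For the heat-equation toy case f = 0, g(x) = x^2 the exact solution is u(t, x) = x^2 + (T - t), so the optimal choices are Y_0 = x_0^2 + T and Z_t = 2 X_t; plugging these in drives the loss close to zero (up to discretisation and Monte Carlo error), while a wrong Y_0 leaves a large terminal mismatch.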

This reformulation has a natural interpretation: given the stochastic process X_t, one chooses a starting point Y_0 and controls the evolution of Y_t through Z_t such that Y_T matches the terminal condition g(X_T). At the minimiser, Y_0 coincides with the PDE solution u(0, X_0) and Z_t recovers sigma^T grad u(t, X_t); conversely, the PDE solution attains the minimum. In financial applications, X_t represents the asset dynamics, Y_t is the portfolio value, Z_t encodes the hedging strategy, and the terminal condition g(X_T) is the payoff. The stochastic control problem then amounts to finding the initial portfolio value and optimal hedging strategy that replicate the payoff.

This variational structure is what makes the deep BSDE method compatible with the stochastic gradient descent paradigm of deep learning. The loss function E|g(X_T) - Y_T|^2 does not require pre-generated training data — the initial conditions and Brownian paths are generated on-the-fly, providing effectively infinite training data. The gradient of the PDE solution plays a role analogous to the policy function in deep reinforcement learning (Han, Jentzen, E 2018). This framework extends naturally to Hamilton-Jacobi-Bellman equations via actor-critic methods, Nash equilibria in mean-field games, and XVA computation where the control represents the hedging strategy for counterparty risk.
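The infinite-data SGD setup can be made concrete with a deliberately simplified sketch: train Y_0 and a per-step control Z_{t_n} = a_n X_{t_n} + b_n by gradient descent on freshly sampled Brownian increments at every iteration. The linear-in-X parametrisation stands in for the neural sub-networks of the full method, the analytic gradients stand in for backpropagation (valid here because the loss is quadratic in the parameters), and the heat-equation toy problem (f = 0, g(x) = x^2, true Y_0 = x_0^2 + T) is an assumption for illustration:

```python
import numpy as np

def train_deep_bsde_toy(x0=1.0, T=1.0, n_steps=20, batch=2000,
                        n_iters=800, lr=0.3, seed=0):
    """Toy deep-BSDE training: f = 0, g(x) = x**2, so u(t, x) = x**2 + (T - t).

    Each iteration draws fresh Brownian increments (data generated on-the-fly),
    propagates Y forward with Z_n = a[n]*X_n + b[n], and takes a gradient step
    on E|g(X_T) - Y_T|^2 using the analytic gradients of this quadratic loss.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    y0 = 0.0                  # trainable initial value Y_0
    a = np.zeros(n_steps)     # trainable slope of Z_n in X_n
    b = np.zeros(n_steps)     # trainable intercept of Z_n
    for _ in range(n_iters):
        dw = rng.normal(0.0, np.sqrt(dt), size=(n_steps, batch))
        x = np.full(batch, x0)
        y = np.full(batch, y0)
        xs = []               # X_n along the path, needed for the gradients
        for n in range(n_steps):
            xs.append(x)
            y = y + (a[n] * x + b[n]) * dw[n]   # dY = Z dW  (f = 0)
            x = x + dw[n]                        # dX = dW   (mu = 0, sigma = 1)
        resid = y - x ** 2                       # Y_T - g(X_T)
        # Gradient steps; Y_T is linear in (y0, a, b), so these are exact.
        y0 -= lr * 2.0 * resid.mean()
        for n in range(n_steps):
            a[n] -= lr * 2.0 * np.mean(resid * xs[n] * dw[n])
            b[n] -= lr * 2.0 * np.mean(resid * dw[n])
    return y0, a, b
```

With these settings Y_0 should approach the true value x_0^2 + T = 2, and the learned controls approximate Z_t = 2 X_t. Replacing the linear maps with small neural networks trained by backpropagation recovers the method of Han, Jentzen and E.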

Key Details

  • Variational problem: inf_{Y_0, {Z_t}} E|g(X_T) - Y_T|^2, where Y_t is propagated forward through discretised BSDE dynamics
  • Reinforcement learning analogy: Z_t = sigma^T grad u acts as the policy function; the terminal mismatch is the reward signal
  • No training data needed: X_0 and {W_{t_n}} serve as data generated on-the-fly, suitable for infinite-data SGD training
  • Financial interpretation: Y_0 = initial price, Z_t = hedging strategy, g(X_T) = payoff to replicate
  • Extensions: actor-critic for HJB equations, Nash equilibria via deep fictitious play, recursive XVA computation via Deep xVA Solver
  • Sub-networks can share or have independent parameters without affecting the applicability of SGD

concept