1. Introduction
The financial crisis of 2007–2008 firmly underscored the importance of quantifying counterparty credit risk (CCR): the risk that a counterparty defaults and fails to fulfill its contractual obligations. Important indicators used to measure and price CCR include expected exposure (EE), potential future exposure (PFE), and various valuation adjustments (xVAs), which reflect credit, funding, and capital costs related to OTC derivative trading (Gregory 2015). Most of these metrics depend on the distribution of the potential future losses resulting from a credit event. Due to the complex nature of these distributions, practitioners resort to numerical methods such as Monte Carlo (MC) simulation to approximate the relevant quantities. Typically, this involves scenario generation for the underlying risk factors and subsequent valuation of the contract at each time-step on each path (Zhu and Pykhtin 2007). The latter is generally considered the most involved aspect, because it needs to be carried out for full portfolios. This poses a major computational challenge to financial institutions. Efficient numerical methods for derivative valuation, both on spot and future simulation dates, are therefore highly relevant.
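As a small illustration of how EE and PFE profiles are extracted once simulated portfolio values are available, consider the following sketch; the value matrix here is random toy data standing in for the output of a scenario generator and valuation routine.

```python
import numpy as np

# `values[i, j]` plays the role of the simulated netting-set value on
# path i at monitor date j; producing it is the expensive valuation step
# discussed above. Here it is toy random-walk data.
rng = np.random.default_rng(0)
values = rng.normal(size=(10_000, 50)).cumsum(axis=1)

exposure = np.maximum(values, 0.0)            # only positive value is at risk
ee = exposure.mean(axis=0)                    # expected exposure per date
pfe_97 = np.quantile(exposure, 0.97, axis=0)  # 97% potential future exposure
```

The same path-wise exposure matrix also feeds CVA-type integrals, which is why fast re-valuation on every path and date matters.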
To address this problem, we extend the concept of (semi-)static replication, which has been extensively studied for, for example, equity derivatives, to interest rate derivatives. A traditional dynamic replication, such as a delta hedge, is achieved by constructing an asset portfolio that is rebalanced continuously through time as the market moves. A static replication, on the other hand, is an asset portfolio that mirrors the value of the derivative without the need for rebalancing; the portfolio weights are, so to speak, static. In this work, we consider a semi-static hedge: a replicating portfolio that needs to be updated at only a finite number of instances. Considering a replication in terms of vanilla products, instead of the exotic derivative itself, can greatly simplify risk assessment, as ample machinery is available to analyze vanilla instruments, including closed-form prices and sensitivities.
In the equity world, the static replication problem has been addressed in the literature by, for example, Breeden and Litzenberger (1978), Carr and Bowie (1994), Carr et al. (1999), and Carr and Wu (2014). The main concept is to construct an infinite portfolio of short-dated European options with a continuum of different strike prices. A different but comparable approach is proposed in Derman et al. (1995), where a portfolio of European options with a continuum of different maturities is constructed to replicate the boundary and terminal conditions of exotic derivatives, such as knock-out options. The replication of an American-style option is challenging, as it involves a time-dependent exercise boundary, giving rise to a free boundary problem. In Chung and Shih (2009), this is addressed by composing a portfolio of European options with multiple strikes and maturities, and, in Lokeshwar et al. (2022), a semi-static hedge is constructed using shallow neural network approximations. In the field of interest rate (IR) modeling, however, this topic has received little attention, and the static replication of exotic IR derivatives remains largely an open problem. Where equity options depend on the realization of a single stock, IR derivatives depend on the realization of a full term structure of interest rates, compounding the complexity of the hedge. The articles of Pelsser (2003) and Hagan (2005) are among the few contributions to the literature, treating the static replication of guaranteed annuity options, and of CMS swaps, caps, and floors, respectively, with a portfolio of European swaptions.
In this work, we study the replication problem for Bermudan swaptions under an affine term structure model, possibly multi-factor. Bermudan swaptions are a class of exotic interest rate derivatives that are heavily traded in the OTC market. We show that such a contract can be semi-statically replicated by a portfolio of short-maturity options, such as discount bond options. We propose a regress-later approach, introduced in Lokeshwar et al. (2022) for callable equity options, which combines the approximation power of artificial neural networks (ANNs) with the computational benefits of regress-later schemes. In traditional regress-now schemes, such as that of Longstaff and Schwartz (2001), sampled realizations of the continuation value are regressed against realizations of the risk factors at the preceding monitor date. Advanced variations of this algorithm, in which the polynomial regression functions are replaced by ANNs, include the work of Kohler et al. (2010), Lapeyre and Lelong (2019), and Becker et al. (2020). In contrast, in regress-later schemes, the sampled realizations of the continuation value are regressed against realizations of the risk factors at the same date. The continuation value at the preceding monitor date is then obtained by evaluating the conditional expectation of this regression. An analysis and discussion of the benefits of this approach can be found in Glasserman and Yu (2004), and an example of such a scheme is presented in Jain and Oosterlee (2015).
Novel pricing algorithms that replace costly valuation functions with ANN-based approximations have been the subject of many recent papers. An early attempt to approximate option prices in the Black–Scholes model can be attributed to Hutchinson et al. (1994). Since then, a great number of variations on this approach have been investigated; a comprehensive overview of articles devoted to this topic can be found in the literature review of Ruf and Wang (2020). An accessible introduction to neural networks and an application to derivative valuation is given in, for example, the work of Ferguson and Green (2018). A drawback of directly replacing value functions with ANNs is that the method continues to rely on external pricing methodologies to provide input to the training process. In that sense, it can accelerate, but not fully substitute, traditional valuation routines.
Other approaches in the literature consider an indirect use of ANNs and therefore do not depend on classical benchmarks for training. A noteworthy example is the development of deep backward SDE solvers, which, in a financial context, have been introduced by Henry-Labordere (2017). Where the dynamics of financial risk factors are typically captured by forward SDEs, option prices tend to be the solution to backward SDEs. An application to Bermudan swaption valuation is treated in Wang et al. (2018) and a generalization to a CCR management framework is proposed in Gnoatto et al. (2020). Another example is the development of the deep optimal stopping (DOS) algorithm by Becker et al. (2019). They propose an ANN-based method by directly learning the optimal stopping strategy of callable options, without depending on the approximation of continuation values. In the work of Andersson and Oosterlee (2021), the DOS algorithm is applied to compose exposure profiles for Bermudan contracts.
Our contribution to the existing literature is threefold. First, we propose a semi-static replication method for Bermudan swaptions under a multi-factor short-rate model. In the one-factor case, we argue that replication can be achieved with an options portfolio written on a single discount bond. In the multi-factor case, replication can be achieved with an options portfolio written on a basket of discount bonds. As such, we generalize the Black–Scholes-embedded method presented in Lokeshwar et al. (2022) to an interest rate modeling framework. Additionally, we propose an alternative ANN design, such that a replication with vanilla options (as opposed to basket options) can also be achieved in the multi-factor case. This facilitates highly efficient pricing, which is essential for credit risk applications, such as exposure, VaR, and xVA computations, all of which rely on frequent re-evaluation of the portfolio.
Second, we propose a direct estimator, as well as lower and upper bound estimators, of the contract’s value implied by the semi-static replication. The lower bound results from applying the (non-optimal) exercise strategy on an independent set of Monte Carlo paths. The upper bound is based on the dual formulation of Haugh and Kogan (2004) and Rogers (2002) and, in contrast to other work, can be obtained without resorting to expensive nested simulations. We complement the study of Lokeshwar et al. (2022) by deriving analytical error margins for the lower and upper bound estimators. This provides direct insight into the approximation quality of the proposed estimators and proves their convergence as the regression errors of the ANNs diminish.
Third, we prove that any desired level of accuracy can be achieved in the replication, due to the universal approximating power of ANNs. We support this theoretical result with a range of representative numerical experiments. We demonstrate the pricing accuracy of the proposed algorithm by benchmarking it against the established least-squares method of Longstaff and Schwartz (2001). The regression error and convergence of the method are presented for different contract specifications. Lastly, we study the replication performance for different ANN designs.
The paper is organized as follows: Section 2 introduces the mathematical setting, describes the modeling framework, and provides the problem formulation. Section 3 provides a thorough introduction to the algorithm, motivates the use and interpretation of neural networks, and treats the fitting procedure. Section 4 introduces the lower bound and upper bound estimates to the true option price. In Section 5, we introduce the error bounds on the direct, lower bound, and upper bound estimates brought forth by the algorithm. We finalize the paper by illustrating the method through several numerical examples in Section 6 and providing a conclusion in Section 7.
2. Mathematical Background
In this section, we describe the general framework for our computations and give a detailed introduction to the Bermudan swaption pricing problem.
2.1. Model Formulation
We consider a continuous-time financial market defined on a finite time horizon. We additionally consider a probability space (Ω, F, P), where Ω represents all possible states of the economy, and let the filtration {F_t} represent all information generated by the economy up to time t. The market is assumed to be frictionless and we ignore any transaction costs.
We let B(t) denote the time-t value of the bank account. Investments in the money market are assumed to compound at a continuous, risk-free interest rate r(t), which we refer to as the short rate. B(t) corresponds to the time-t value of a unit of currency invested in the money market at time zero, and we assume it is given by the following expression (see Andersen and Piterbarg 2010a or Brigo and Mercurio 2006):

B(t) = exp( ∫₀ᵗ r(s) ds )
We denote by Q the risk-neutral measure equivalent to P, associated with the bank account B as the numéraire. Attainable claims denominated by the numéraire are assumed to be martingales under Q, which guarantees the absence of arbitrage (Harrison and Pliska 1981).

We assume that the dynamics of the short rate r are captured by an affine term structure model, in accordance with the set-up introduced in Duffie and Kan (1996) and Dai and Singleton (2000). The short rate itself is therefore considered to be an affine function of a—possibly multi-dimensional—latent factor X_t, i.e.,
r(t) = a(t) + b(t)ᵀ X_t    (1)
with a(t) denoting a scalar and b(t) a vector of time-dependent coefficients, respectively. We furthermore assume that the stochastic process X is a bounded Markov process taking values in ℝᵈ, representing all market influences affecting the state of the short rate. Let the dynamics of X be governed by an SDE of the form

dX_t = μ(t, X_t) dt + σ(t, X_t) dW_t    (2)
where W denotes a d-dimensional Brownian motion under Q, adapted to the filtration {F_t}. The measurable functions μ and σ are taken to satisfy the standard regularity conditions under which the SDE in Equation (2) admits a strong solution.

We let P(t, T) denote the time-t value of a zero-coupon bond that matures at T. A zero-coupon bond guarantees the holder one unit of currency at maturity, i.e., P(T, T) := 1. Within the class of affine term structure models, zero-coupon bond prices are exponential-affine in the factor (Andersen and Piterbarg 2010b; Duffie and Kan 1996). Therefore, the value of P(t, T) can be expressed as

P(t, T) = exp( A(t, T) + B(t, T)ᵀ X_t )
where the deterministic coefficients A(t, T) and B(t, T) can be found by solving a system of ODEs of the well-known Riccati type; see Duffie and Kan (1996) or Filipovic (2009) for details. We consider this framework as it is still intensively used for risk management purposes. High-dimensional models, such as Libor market models, can be intractable for quantifying credit risk for large portfolios, particularly in a multi-currency setting. Multi-factor short-rate models are therefore popular amongst practitioners, providing a solid compromise between modeling flexibility and analytical tractability.

For simplicity, we will assume that the collateral rate used for discounting and the instantaneous rate used to derive term rates are both implied by the same short rate r. Thus, we consider a classic single-curve model environment. As term rates, we consider simply compounded rates, which we refer to as LIBOR (Brigo and Mercurio 2006):

L(t, T) = (1 / τ(t, T)) · (1 / P(t, T) − 1)
where τ(t, T) denotes the year fraction between the dates t and T.
2.2. The Bermudan Swaption Pricing Problem
We consider the pricing problem of a Bermudan swaption. A Bermudan swaption is a contract that gives the holder the right, at a number of predefined monitor dates, to enter a swap with fixed maturity. Should the holder decide to exercise the option at any of the monitor dates, the holder immediately enters the underlying swap. The lifetime of this swap is assumed to be equal to the time between the exercise date and a fixed maturity date.
As an underlying, we take a standard interest rate swap that exchanges fixed versus floating cashflows. For simplicity, we will assume that the contract is priced in a single-curve framework and that the cashflow schedules of both legs coincide, yielding a single sequence of fixing dates and payment dates. However, we stress that the algorithm is applicable to any industry-standard contract specifications and is not limited to the simplifying assumptions made here. The time fraction between two consecutive dates is denoted τ_i. Let N be the notional and K the fixed rate of the swap. Assuming that the holder of the option exercises at a given monitor date, the payments of the swap will occur at the remaining payment dates.
We consider the class of pricing problems where the value of the contract is completely determined by the Markov process X as defined in Section 2.1. For each monitor date, let a measurable pay-off function denote the immediate pay-off of the option if exercised at that date. Although the methodology holds for any generalization of these pay-off functions, we will consider those in accordance with the contract specifications described above. This means that the functions are assumed to be given by
where an indicator δ ∈ {1, −1} distinguishes a payer from a receiver swaption. The swap rate and the annuity are defined in the same fashion as in Brigo and Mercurio (2006), with the function F denoting the simply compounded forward rate; for details, we refer to Brigo and Mercurio (2006). Now, consider the set of all discrete stopping times with respect to the filtration, taking values on the grid of monitor dates, and define the value function as
(3)
In this notation, a stopping time beyond the last monitor date indicates that the option is not exercised at all. We aim to approximate the time-zero value of the Bermudan swaption, which satisfies the following equation: (4)
Finding the optimal exercise strategy is typically a non-trivial task. Numerical approximations can, however, be computed by considering the dynamic programming formulation given below, which is shown to be equivalent to (4) in, for example, Glasserman (2013). Let t be one of the monitor dates and consider the value of the option at t, conditioned on the fact that it has not yet been exercised prior to t. This value satisfies the equation (see Glasserman 2013)
(5)
We refer to these random variables as the hold or continuation values. They represent the expected value of the contract if it is not exercised up until t but continues to follow the optimal policy thereafter. Approximations of the dynamic formulation are typically obtained by a backward iteration based on simulations of the underlying risk factors. The objective is then to determine the continuation values as a function of the state of the risk factor. Popular numerical schemes based on regression have been introduced in, for example, Carriere (1996) and Longstaff and Schwartz (2001).

Based on approximations of the continuation values, the optimal policy can be computed as follows. Assume that, for a given scenario, the risk factor takes a particular path through the monitor dates. Then, the holder should continue to hold the option as long as the continuation value exceeds the immediate pay-off, and exercise as soon as the pay-off is at least the continuation value. In other words, the exercise strategy can be determined as
Should, for some scenario, the continuation value be bigger than the immediate pay-off at every monitor date, then the option is never exercised and expires worthless.

3. A Semi-Static Replication for Bermudan Swaptions
The main concept of our method is to construct static hedge portfolios that replicate the dynamic programming formulation in Equation (5) between two consecutive monitor dates. In this section, we introduce the algorithm for a Bermudan swaption that is priced under a multi-factor affine term structure model. The methodology is inspired by the algorithm presented in Lokeshwar et al. (2022) and utilizes a regress-later technique in which the intermediate option values are regressed against simple IR assets, such as discount bonds. The regression model is chosen deliberately to represent the pay-off of an options portfolio written on these assets. An important consequence is that the hedge can be valued in closed form. Throughout this work, we will use the terms semi-static hedge and semi-static replication interchangeably. A hedge in general refers to a trading strategy that reduces the exposure to market risk of an outstanding position. A replication refers to an asset portfolio that mirrors the value of a derivative, which is a common means to set up a hedge. As we regard efficient valuation in the context of credit risk quantification, rather than actual hedging, as the main application, we will put emphasis on the term replication.
3.1. The Algorithm
The regress-later algorithm is executed in an iterative manner, backward in time. The outcome is a set of option portfolios written on pre-selected IR assets. To be more precise, the algorithm determines the weights and strikes of each portfolio, such that it closely mirrors the Bermudan swaption from its composition at one monitor date until its expiry at the next. The pay-off of each portfolio exactly meets the cost of composing the next portfolio, or the Bermudan’s pay-off in case it is exercised. The methodology yields a semi-static hedging strategy, as the portfolio compositions are constant between two consecutive monitor dates. Hence, there is no need for continuous rebalancing, as is the case for a dynamic hedging strategy. The algorithm can roughly be divided into three steps, presented below. Algorithm 1 summarizes the method.
Algorithm 1: The semi-static replication algorithm for a Bermudan swaption.
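A toy version of this backward loop, with ordinary polynomial regression standing in for the neural networks of Section 3.2 and a hypothetical `price_portfolio` stub in place of the closed-form continuation-value formula, might look as follows:

```python
import numpy as np

# Structural sketch of the backward regress-later loop. All ingredients
# are toy stand-ins: `bond` plays the role of the regression asset, and
# `fit_portfolio`/`price_portfolio` are hypothetical placeholders for the
# ANN fit and the closed-form portfolio valuation, respectively.
rng = np.random.default_rng(1)
n_paths, n_monitor = 5_000, 4

x = rng.normal(size=(n_paths, n_monitor))       # sampled risk factors
bond = np.exp(-0.05 - 0.3 * x)                  # toy regression asset per date
payoff = np.maximum(1.0 - bond, 0.0)            # toy exercise values

def fit_portfolio(z, v):
    # Stand-in regression: a cubic polynomial instead of a shallow ANN.
    return np.polyfit(z, v, deg=3)

def price_portfolio(coef, z_prev):
    # Hypothetical stub: a real implementation evaluates the conditional
    # expectation of the fitted pay-off, pricing each hidden node as a
    # bond option in closed form.
    return np.polyval(coef, z_prev)

value = payoff[:, -1]                           # option value at the last date
for m in range(n_monitor - 2, -1, -1):
    coef = fit_portfolio(bond[:, m + 1], value) # regress-later at date m+1
    cont = price_portfolio(coef, bond[:, m])    # continuation value at date m
    value = np.maximum(payoff[:, m], cont)      # dynamic-programming step
direct_estimator = value.mean()
```

The three steps of the algorithm map onto the sampling of `x`, the call to `fit_portfolio`, and the call to `price_portfolio`, respectively.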
3.1.1. Sample the Independent Variables
We start by sampling N realizations of the risk factor on the grid of monitor dates. These realizations will serve as input for the regression data. Different sampling methodologies could be used, such as:
Take a standard quadrature grid for each monitor date , associated with the transition density of the risk factor. For example, if has Gaussian dynamics, one could consider the Gauss–Hermite quadrature scaled and shifted in accordance with the mean and variance of . See, for example, Xiu (2010).
Discretize the SDE of the risk factor and sample by means of an Euler or Milstein scheme. Make sure that a sufficiently fine time-stepping grid is used, which includes the M monitor dates. See, for example, Kloeden and Platen (2013) for details.
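As an illustration of the second sampling approach, a minimal Euler scheme for a one-factor Ornstein–Uhlenbeck-type factor, dX_t = −a X_t dt + σ dW_t, might look as follows; the parameters and the monthly grid are illustrative choices only.

```python
import numpy as np

# Euler discretization of dX = -a*X dt + sigma dW on a monthly grid.
# In practice the grid must contain the M monitor dates of the contract.
rng = np.random.default_rng(7)
a, sigma = 0.05, 0.01                 # illustrative mean reversion and vol
n_paths, n_steps, dt = 10_000, 120, 1.0 / 12.0

x = np.zeros((n_paths, n_steps + 1))  # X starts at zero on every path
for k in range(n_steps):
    dw = rng.normal(scale=np.sqrt(dt), size=n_paths)
    x[:, k + 1] = x[:, k] - a * x[:, k] * dt + sigma * dw
```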
In addition, a regression asset must be selected for each monitor date, subject to the following conditions. The asset should be a square-integrable random variable that is measurable with respect to the information available at that date.
The risk-neutral price of the asset should depend only on the current state of the risk factor and be almost surely unique; that is, the mapping from the state to the price should be continuous and injective. This is required to guarantee a well-defined parametrization of the option value.
3.1.2. Regress the Option Value against an IR Asset
In this phase, we compose the replication portfolios by fitting M regression functions, one per monitor date. Each function assigns a real value to each realization of the selected asset. Fitting is performed recursively, starting at the last monitor date and moving backwards in time until the first exercise opportunity. Approximations of the Bermudan swaption value at each monitor date serve as the dependent variable. At the final monitor date, the value of the contract (given that it has not been exercised) is known to be
Now, assume that, for some monitor date, we have an approximation of the contract value. Let θ denote the vector of unknown regression parameters. The objective is to determine θ such that the regression function reproduces the contract value with the smallest possible error. This is carried out by formulating and solving a related optimization problem. In this case, we choose to minimize the expected square error, given by (6)
There is no exact analytical expression available for the expectation in Equation (6). However, it can be approximated using the sampled regression data, giving rise to an empirical loss function L given by (7)
The parameters are then the result of the fitting procedure. If the regression model is chosen accordingly, the fitted function represents the pay-off at that date of a derivative portfolio written on the selected asset. Details on suggested functional forms, asset selection, and fitting procedures are the subject of Section 3.2.

3.1.3. Compute the Continuation Value
Once the regression is completed, the last step is to compute the continuation value and, subsequently, the option value at the preceding monitor date. For each scenario, we approximate the continuation value as
(8)
As the regression function is chosen to represent the pay-off of a derivative portfolio written on the asset, we argue that computing the conditional expectation in Equation (8) is in fact equivalent to the risk-neutral pricing of this portfolio. In Section 3.2, we treat examples for which this price can be computed in closed form. Finally, the option value at the preceding monitor date is given by
The steps are repeated recursively until we have a representation of the option value at the first monitor date. Discounting this value back to today yields an estimator of the time-zero option value, which we refer to as the direct estimator.

3.2. A Neural Network Approach to the Regression Functions
In this section, we propose to represent the regression functions as shallow, artificial neural networks. The choices that are presented here are adapted to a framework of Gaussian risk factors, such as that presented in Section 2. The method, however, lends itself to be generalized to a broader class of models by considering an appropriate adjustment to the input or structure.
3.2.1. The 1-Factor Case
First, we discuss the one-factor case, d = 1. As a regression function, we consider a fully connected, feed-forward neural network with one hidden layer. The design with only a single hidden layer is graphically represented in Figure 1 and is chosen deliberately to facilitate the network’s interpretation. As an input to the network (the asset), we select a zero-coupon bond, which pays one unit of currency at its maturity date.
-
The first layer consists of a single node and corresponds to the discount bond price, which serves as input. It is represented by the left node in Figure 1. The hidden layer has q hidden nodes, represented by the center layer in Figure 1. The affine transformation acting between the first two layers is of the form
As an activation function acting on the hidden layer, we take the ReLU-function, given by
Note that the ReLU function corresponds to the pay-off function of a European option.
-
The output of the network estimates the contract value and therefore takes values in the real numbers. It is represented by the right node in Figure 1. We consider a linear transformation acting between the hidden and output layers, given by
On top of that, we apply the linear activation, which comes down to an identity function, mapping x to itself.
Combined, the network is specified as the composition of these transformations and activations, and the trainable parameters consist of the hidden-layer weights and biases together with the output weights.

3.2.2. Interpretation of the Neural Network
Now that we have specified the structure of the neural network, we will discuss how each fitted function can be interpreted as a portfolio. In the one-dimensional case, the network output can be expressed as a sum of q terms of the form ω_i max(w_i P + b_i, 0), where P denotes the bond price, w_i and b_i the hidden-layer weights and biases, and ω_i the output weights. We can regard this as the pay-off of a derivative portfolio written on the bond: the portfolio contains q derivatives whose terminal values equal the outcomes of the hidden nodes. In total, we can recognize four types of products, depending on the signs of w_i and b_i.

If w_i > 0 and b_i ≥ 0, we have ω_i max(w_i P + b_i, 0) = ω_i (w_i P + b_i), which is the pay-off of a forward contract on ω_i w_i units of the bond plus ω_i b_i units of currency.

If w_i > 0 and b_i < 0, we have ω_i max(w_i P + b_i, 0) = ω_i w_i max(P − (−b_i/w_i), 0), which is the pay-off corresponding to ω_i w_i units of a European call option written on the bond, with strike price −b_i/w_i.

If w_i < 0 and b_i > 0, we have ω_i max(w_i P + b_i, 0) = −ω_i w_i max(−b_i/w_i − P, 0), which is the pay-off corresponding to −ω_i w_i units of a European put option written on the bond, with strike price −b_i/w_i.

If w_i ≤ 0 and b_i ≤ 0, we have max(w_i P + b_i, 0) = 0, which clearly represents a worthless contract.
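The case distinction above can be automated: given fitted weights, the replicating portfolio is read off row by row. The numbers below are illustrative, not taken from a trained network; each tuple holds the product type, the number of units, and the strike (for the forward, the cash amount instead of a strike).

```python
import numpy as np

w1 = np.array([2.0, -1.5, 1.0, -0.5])   # hidden-layer weights
b1 = np.array([0.3,  1.2, -0.8, -0.1])  # hidden-layer biases
w2 = np.array([0.5,  0.7,  0.4,  0.2])  # output weights

portfolio = []
for w, b, omega in zip(w1, b1, w2):
    if w > 0 and b >= 0:      # w*P + b > 0 for all P > 0: a forward position
        portfolio.append(("forward", omega * w, omega * b))
    elif w > 0 and b < 0:     # omega*w calls on the bond, struck at -b/w
        portfolio.append(("call", omega * w, -b / w))
    elif w < 0 and b > 0:     # omega*|w| puts on the bond, struck at -b/w
        portfolio.append(("put", -omega * w, -b / w))
    else:                     # w <= 0 and b <= 0: pay-off identically zero
        portfolio.append(("worthless", 0.0, 0.0))
```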
3.2.3. The Multi-Factor Case
In the case d > 1, we propose that a basket of d zero-coupon bonds, all maturing at different dates, is required as input to the regression. If the risk factor space is d-dimensional, it can only be parametrized by an asset vector of dimension at least d.
To see why the above statement is true, consider n bonds and note that each log-bond price is an affine function of the d-dimensional factor. The Jacobian of the mapping from the factor to the n bond prices therefore has rank at most min(n, d). It follows that, if n < d, the image of the bond vector does not span the whole risk factor space, whereas, if n > d, the image is still equal to that of the case n = d.

Concluding from the argument above, it would be an obvious choice to take a d-dimensional vector of bonds as the input and generalize the architecture by increasing the input dimension (i.e., the number of nodes in the first layer) from 1 to d. However, in that case, the network represents a derivatives portfolio written on a basket of bonds, by which the tractability of pricing would be lost. Therefore, we suggest two alternative designs, intended to preserve the analytical valuation potential of the replicating portfolio.
The basic specifications of the neural network will remain similar to the one-factor case. We consider a feed-forward neural network with one hidden layer.
-
The first layer consists of d nodes and the hidden layer has q hidden nodes. The affine transformation and activation acting between the first two layers are the direct analogues of the one-factor case, now mapping d inputs to q hidden nodes.
-
The output contains a single node. A linear transformation, together with the linear activation, acts between the hidden and output layers.
-
The network is again given by the composition of these maps.
3.2.4. Suggestion 1: A Locally Connected Neural Network
The outcome of each node in the hidden layer represents the terminal value of a derivative written on the input assets, which together compose the portfolio. In the d-dimensional case, the outcome of a node can be expressed as the positive part of an affine combination of the d bond prices, which corresponds to the pay-off of an arithmetic basket option with a vector of weights and a strike price. Such an exotic option is difficult to price. To overcome this issue, we constrain the first-layer weight matrix to admit only a single non-zero value in each row. The architecture of this suggestion is graphically depicted in Figure 2a. Let the number of hidden nodes be a multiple of the input dimension, i.e., q = p·d for some positive integer p, and let each hidden node be connected to exactly one of the d bonds, with p nodes per bond. As a result, none of the hidden nodes is connected to more than one input node (see Figure 2a). Therefore, the outcome of each node again represents a European option or forward written on a single bond, which can be priced in closed form (see Appendix A.1).
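One way to impose this constraint in an implementation is through a fixed binary mask on the first-layer weights; the cyclic assignment below is one possible arrangement, and the dimensions are illustrative.

```python
import numpy as np

d, p = 3, 4          # input bonds and nodes per bond (illustrative)
q = p * d            # hidden nodes, a multiple of d

# Binary mask with exactly one non-zero entry per row; during training,
# weights (or their gradients) are multiplied by this mask so each hidden
# node stays connected to a single bond.
mask = np.zeros((q, d))
for row in range(q):
    mask[row, row % d] = 1.0   # cyclic assignment of nodes to bonds

rng = np.random.default_rng(3)
w1 = mask * rng.normal(size=(q, d))   # masked hidden-layer weight matrix
```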
We can recognize two drawbacks to this approach. First, the number of trainable parameters for a fixed number of hidden nodes is much lower than in the fully connected case. This can simply be overcome by increasing q. Second, as the network is not fully connected, the universal approximation theorem no longer applies. Therefore, we have no guarantee that the approximation error can be reduced to any desired level. Our numerical experiments, however, indicate that the approximation accuracy of this design is not inferior to that of a fully connected counterpart of the same dimensions; see Section 6.
3.2.5. Suggestion 2: A Fully Connected Neural Network
Our second approach does not entail altering the structure or weights of the network, but suggests taking a different input. We hence consider a fully connected feed-forward neural network with one hidden layer. The architecture is graphically depicted in Figure 2. As a consequence, each hidden node is connected to each input node. However, as an input, we use the logarithms of the n bond prices.
Each node can therefore be compared to the pay-off of a geometric basket option, as a weighted sum of log-bond prices is the logarithm of a weighted geometric average of the bonds. Under the assumption that the dynamics of the risk factor are Gaussian, these options can be priced explicitly, as we show in Appendix A.2. An advantage of this approach is that it employs a fully connected network which, by virtue of the universal approximation theorem (Hornik et al. 1989), can yield any desired level of accuracy. A drawback is that the financial interpretation of the network as a replicating portfolio is not as strong as in suggestion 1, due to the required log in the pay-off.
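To make the closed-form pricing step concrete, the sketch below evaluates one hidden node under an assumed Gaussian model: if the node's affine combination of log-bond prices, say S, is normal with mean m and standard deviation s under the pricing measure, its expected pay-off E[max(S, 0)] follows a Bachelier-type formula. The moments m and s are illustrative inputs here, not outputs of a calibrated model, and discounting is omitted.

```python
import math

# Standard normal CDF and PDF via the error function (stdlib only).
def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def hidden_node_value(m, s):
    # E[max(S, 0)] for S ~ N(m, s^2): the Bachelier formula.
    d = m / s
    return m * normal_cdf(d) + s * normal_pdf(d)

price = hidden_node_value(m=0.02, s=0.10)
```

In the full method, m and s would be the conditional moments of the weighted log-bond sum given the factor state at the preceding monitor date, and each hidden node contributes one such term, scaled by its output weight.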
3.3. Training of the Neural Networks
In this section, we specify some of the main considerations related to the fitting procedure. The method requires the training of M shallow feed-forward networks as specified in Section 3.2, one per monitor date. Our numerical experiments indicated that normalization of the training set strongly improved the networks’ fitting accuracy. Details on pre-processing the regression data are treated in Appendix B.
Optimization
The training of each network is performed in an iterative process, starting at the final monitor date and working backwards to the first. The effectiveness of the process depends on several standard choices related to neural network optimization, some of which are listed below.
As an optimizer, we apply AdaMax (Kingma and Ba 2014), a variation of the commonly used Adam algorithm. This is a stochastic, first-order, gradient-based optimizer that updates weights inversely proportionally to the infinity-norm of their current and past gradients, whereas Adam is based on the 2-norm. Our experiments indicate that AdaMax slightly outperforms comparable algorithms in the scope of our objectives.
The batch size, i.e., the number of training points used per weight update, is set to a standard 32. The learning rate, which scales the step size of each update, is kept in the range 0.0001–0.0005.
For the initial network, we use random initialization of the parameters. If the considered contract is a payer Bermudan swaption, the (non-zero) hidden-layer weights and the biases are initialized i.i.d. uniformly on small ranges; for a receiver contract, the roles of the two ranges are interchanged. The output weights are likewise initialized i.i.d. uniformly.
For the subsequent networks, each network is initialized with the final set of weights of the previously trained network.
As a training set for the optimizer, we use a collection of 20,000 data-points.
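For reference, the AdaMax update rule mentioned above can be written out in a few lines; the sketch below applies it to a toy quadratic objective, using the default decay rates from Kingma and Ba (2014) and a learning rate in the range quoted above.

```python
import numpy as np

# One AdaMax parameter update (Kingma and Ba, 2014) in plain NumPy.
def adamax_step(theta, grad, m, u, t, alpha=0.0005, beta1=0.9, beta2=0.999):
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment estimate
    u = np.maximum(beta2 * u, np.abs(grad))     # infinity-norm estimate
    theta = theta - (alpha / (1.0 - beta1**t)) * m / (u + 1e-12)
    return theta, m, u

# Toy objective f(theta) = ||theta||^2, so grad = 2*theta.
theta = np.array([1.0, -2.0])
m = np.zeros(2)
u = np.zeros(2)
for t in range(1, 101):
    grad = 2.0 * theta
    theta, m, u = adamax_step(theta, grad, m, u, t)
```

In the actual method these updates act on the network parameters of Section 3.2, with mini-batches of the regression data supplying the gradients.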
4. Lower and Upper Bound Estimates
The algorithm described in Section 3.1 gives rise to a direct estimator of the true option price V. The accuracy of this estimator depends on the approximation performance of the neural networks at each monitor date. Should each regression yield a perfect fit, the estimation error would automatically be zero. In practice, however, the loss function, defined in Equation (7), never fully converges to zero. As the networks are trained against closed-form exercise and continuation values, error measures such as the mean squared error (MSE) and mean absolute error (MAE) are easily obtained. In particular, the mean absolute errors provide a strong indication of the error bounds on the direct estimator (see Section 5).
Although convergence errors put solid bounds on the accuracy of the estimator, they are typically quite loose. Therefore, they give rise to non-tight confidence bounds. To overcome this issue, we introduce a numerical approximation to a tight lower and upper bound to the true price, in the same spirit as Lokeshwar et al. (2022). These should provide a better indication of the quality of the estimate.
4.1. The Lower Bound
We compute a lower bound approximation by considering the non-optimal exercise strategy implied by the continuation value estimates introduced in Section 3.1. We define as
(9)
where refers to the approximated continuation value given in Equation (8). A strict lower bound is now given by
(10)
where corresponds to the definition given in Equation (3). The term on the right is obtained by changing the measure from the risk-neutral measure to the forward measure Geman et al. (1995). Under the forward measure, the lower bound can be estimated by simulating a fresh set of scenarios of the risk factor . Denote by the zero-coupon bond realization corresponding to . Then, the lower bound can be approximated accordingly.
4.2. The Upper Bound
We compute an upper bound by considering a dual formulation of the price expression Equation (4) as proposed in Haugh and Kogan (2004) and Rogers (2002). Let denote the set of all martingales adapted to such that . An upper bound to the true price is obtained by observing that the following inequality holds (see Haugh and Kogan 2004):
(11)
for any . To find a suitable martingale that yields a tight bound, we consider the Doob–Meyer decomposition of the true discounted option price process . As the price process is a supermartingale, we can write where denotes a martingale and is a predictable, strictly decreasing process such that . Note that Equation (11) attains an equality if we set , i.e., the martingale part of the option price process. The bound will hence be tight if we consider a martingale that is close to the unknown . Let denote the neural networks induced by the algorithm. In the spirit of Andersen and Broadie (2004) and Lokeshwar et al. (2022), we construct a martingale on the discrete time grid as follows:
(12)
Clearly, the process yields a discrete martingale. Furthermore, the process as defined above will coincide with if the approximation errors in equal zero, hence yielding an equality in Equation (11). Note that the recursive relation in Equation (12) can be rewritten as
(13)
We can now estimate the upper bound by again simulating a set of scenarios of the risk factor and approximating under the risk-neutral measure. The upper bound can alternatively be approximated under the forward measure; in that case, the risk factor should be simulated under and the numéraire should be replaced by . By carrying this out, we avoid the need to approximate the numéraire on a coarse simulation grid. Note that, by the deliberate choice of , all the conditional expectations appearing in Equation (13) can be computed in closed form (see Appendix A). Hence, there is no need to resort to nested simulations, in contrast to, for example, Andersen and Broadie (2004) and Becker et al. (2020). Especially if simulations are performed under the forward measure, both lower and upper bound estimations can be obtained at minimal additional computational cost.
5. Error Analysis
In this section, we analyze the errors of the semi-static hedge, the direct estimator, the lower bound estimator, and the upper bound estimator, which are induced by the imprecision of the regression functions . We show that, for a sufficiently large hedging portfolio, the replication error will be arbitrarily small. Furthermore, we provide error margins for the price estimators in terms of the regression imprecision. We thereby show that the direct estimator, lower bound, and upper bound converge to the true option price as the accuracy of the regressions increases. The cornerstone of the subsequent theorems is the universal approximation theorem, as presented in, for example, Hornik et al. (1989). Given that is a continuous function on the compact set , it guarantees that, for each , there exists a neural network such that
for arbitrary . In other words, the regression error can be kept arbitrarily small on any compact domain of the risk factor.
5.1. Accuracy of the Semi-Static Hedge
Let denote the set of monitor dates. For the following theorem, we assume that for some compact set . As can be arbitrarily large, this assumption is loose enough to account for a vast majority of the risk factor scenarios in a standard Monte Carlo sample. On top of that, can be chosen as sufficiently large such that approaches zero. For the proof, we refer to Appendix D.
Let and . Denote by the value of the replication portfolio for a Bermudan swaption, conditional on the fact that it is not exercised prior to time t. Assume that there exist M networks such that
Then, for any , we have that
5.2. Error of the Direct Estimator
Theorem 1 bounds the hedging error of the semi-static hedge in terms of the maximum regression errors. This implicitly provides an error margin to the direct estimator under the aforementioned assumptions. Although the universal approximation theorem guarantees that the supremum errors can be kept at any desired level, in practice, they are substantially higher than, for example, the MSEs or MAEs of the regression function. This is due to inevitable fitting imprecision outside or near the boundaries of the finite training sets. In the following theorem, we propose that the error of the direct estimator can be bounded in terms of the discounted MAEs of the neural networks. These quantities are generally much tighter than the supremum errors and are typically easier to estimate.
The proof of the theorem follows a similar line of thought as the proof of Theorem 1. As the direct estimator at time-zero depends on the expectation of the continuation value at , we can show by an iterative argument that the overall error is bounded by the sum of the mean absolute fitting errors at each monitor date. The error bound in the direct estimator therefore scales linearly with the number of exercise opportunities. For a complete proof, we refer to Appendix E.
Let and assume that . Denote by the time-zero direct estimator for the price of a Bermudan swaption V. Assume that, for each , there is a neural network approximation such that
where denotes the estimator at date . Then, the error in is bounded as given below:
5.3. Tightness of the Lower Bound Estimate
A lower bound to the true price can be computed by considering the non-optimal exercise strategy, implied by the direct estimator (see Section 4.1). This relies on the stopping time
(14)
In the following theorem, we propose that the tightness of can be bounded by the discounted MAEs of the neural network approximations.
The proof of the theorem relies on the fact that, conditioned on any realization of and , the expected difference between and is bounded by the sum of the mean absolute fitting errors at the monitor dates between and . In the proof, we therefore distinguish between the events and . Then, by an inductive argument, we can show that the bound on the spread between and the true price scales linearly with the number of exercise opportunities. For a complete proof, we refer to Appendix F.
Let and assume that . Denote by the lower bound on the true Bermudan swaption price as defined in Equation (10). Assume that, for each , there is a neural network approximation , such that
where denotes the estimator at date . Then, the spread between and is bounded as given below:
5.4. Tightness of the Upper Bound Estimate
An upper bound to the true price can be computed by considering a dual formulation of the dynamic pricing equation Haugh and Kogan (2004); see Section 4.2. From a practical point of view, the difference between the upper bound and the true price can be interpreted as the maximum loss that an investor would incur due to hedging imprecision resulting from the algorithm Lokeshwar et al. (2022). The overall hedging error at some monitor date is the result of all incremental hedging errors occurring from rebalancing the portfolio at preceding monitor dates. As the incremental hedging errors can be bounded by the sum of the expected absolute fitting errors, we propose that the tightness of can be bounded by the discounted MAEs of the neural networks and scales at most quadratically with the number of exercise opportunities.
The proof follows a similar line of thought as that presented in Andersen and Broadie (2004), where it is noted that the difference between the dual formulation of the option and its true price is difficult to bound. Here, we make a similar remark and propose a theoretical maximum spread between and that is relatively loose. Our numerical experiments, however, indicate that the upper bound estimate is much tighter in practice. For a complete proof, we refer to Appendix G.
Let and assume that . Denote by the upper bound on the true Bermudan swaption price as defined in Equation (11). Assume that, for each , there is a neural network approximation , such that
where denotes the estimator at date . Then, the spread between and is bounded as given below:
6. Numerical Experiments
In this section, we treat several numerical examples to illustrate the convergence, pricing, and hedging performance of our proposed method. We will start by considering the price estimate of a vanilla swaption contract in a one-factor model. This is a toy example by which we can demonstrate the accuracy of the direct estimator in comparison to exact benchmarks. We continue with price estimates of Bermudan swaption contracts in a one-factor and a two-factor framework. The performance of the direct estimator will be compared to the established least-square regression method (LSM) introduced in Longstaff and Schwartz (2001), fine-tuned to an interest rate setting as described in Oosterlee et al. (2016). Additionally, we will approximate the lower and upper bound estimates as described in Section 4 and show that they are well inside the error margins introduced in Section 5. Finally, we will illustrate the performance of the static hedge for a swaption in a one-factor model and a Bermudan swaption in a two-factor model. For the one-factor case, we can benchmark the performance by the analytic delta hedge for a swaption, provided in Henrard (2003).
A contract (either a European swaption or a Bermudan swaption) refers to an option written on a swap with a notional amount of 100 and a lifetime between and . This means that and are the first and last monitor dates, respectively, in the case of a Bermudan. The underlying swaps are set to exchange annual payments, yielding year fractions of 1 and annual exercise opportunities. All examples illustrated here have been implemented in Python, using the QuantLib library Ametrano and Ballabio (2003) for standard pricing routines and Keras with a TensorFlow backend Chollet et al. (2015) for constructing, fitting, and evaluating the neural networks.
6.1. 1-Factor Swaption
We start by considering a swaption contract under a one-dimensional risk factor setting. The direct estimator of the true swaption price is computed similarly to that of a Bermudan swaption, but with only a single exercise opportunity at . Therefore, only a single neural network per option needs to be trained to compute the option price. We have used 64 hidden nodes and 20,000 training points, generated through Monte Carlo sampling. We assume the risk factor to be captured by the Hull–White model with constant mean reversion parameter a and constant volatility . The dynamics of the shifted mean-zero process Brigo and Mercurio (2006) are hence given by
(15)
For simplicity, we consider a flat time-zero instantaneous forward rate . The risk-neutral scenarios are generated using a discrete Euler scheme of the process above. Parameter values used in the numerical experiments are summarized in Table 1.
Figure 3a,b show the time-zero option values in basis points (0.01%) of the notional for a and a payer swaption as a function of the moneyness. The moneyness is defined as , where K denotes the fixed strike and S the time-zero swap rate associated with the underlying swap. The exact benchmarks are computed by an application of Jamshidian's decomposition Jamshidian (1989). The relative estimation errors are shown in Figure 3c,d. We observe a close agreement between the estimates and the reference prices, with errors in the order of several basis points of the true option price. In the current setting, the results serve mostly as a validation of the estimator. We point out, however, that the algorithm for swaptions is applicable in general frameworks, such as multi-factor, dual-curve, or non-overlapping payment schemes, for which exact routines are no longer available.
6.2. 1-Factor Bermudan Swaption
As a second example, we consider a Bermudan swaption contract. The same dynamics for the underlying risk factor are assumed as in the previous paragraph, using the parameter settings of Table 1. Monte Carlo scenarios are generated based on a discretized Euler scheme associated with the SDE in Equation (15), taking weekly time-steps.
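A minimal sketch of such a discretized scenario generation is given below, for the mean-zero Hull–White factor with dynamics of the form dx(t) = −a x(t) dt + σ dW(t). The function name, interface, and path/step counts are illustrative choices of ours.

```python
import numpy as np

def simulate_hull_white_x(a=0.01, sigma=0.01, T=5.0, steps=260,
                          n_paths=10_000, seed=42):
    """Euler scheme for the mean-zero Hull-White factor
    dx(t) = -a x(t) dt + sigma dW(t), x(0) = 0, on a weekly grid.
    Returns an (n_paths, steps + 1) array of simulated paths."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    x = np.zeros((n_paths, steps + 1))
    for i in range(steps):
        dw = rng.standard_normal(n_paths) * np.sqrt(dt)
        x[:, i + 1] = x[:, i] - a * x[:, i] * dt + sigma * dw
    return x

paths = simulate_hull_white_x()
# The exact terminal variance, sigma^2 (1 - exp(-2 a T)) / (2 a), can be
# used as a sanity check on the discretization.
```

For the small mean reversion and weekly step size used here, the Euler discretization bias in the terminal variance is negligible relative to the Monte Carlo noise.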
We first demonstrate the convergence property of the direct estimator, which is implied by the replication portfolio. We consider a Bermudan swaption with strike . This strike is selected as it is close to ATM, the moneyness level that is most likely to be liquid in the market. For this analysis, the neural networks were trained on a set of 2000 Monte-Carlo-generated training points. Figure 4a shows the direct estimator as a function of the number of hidden nodes in each neural network, alongside an LSM-based benchmark. In Figure 4b, the error with respect to the LSM estimate is shown on a log scale. We observe that the direct estimator converges to the LSM confidence interval or slightly above it, which is in accordance with the fact that LSM is biased low by definition. The analysis indicates that a portfolio of 16 discount bond options is sufficient to achieve a replication of similar accuracy to the LSM benchmark.
Table 2 depicts numerical pricing results for a , and receiver Bermudan swaption. For each contract, we consider different levels of moneyness, setting the fixed rate K of the underlying swap to, respectively, 80%, 100%, and 120% of the time-zero swap rate. The estimations of the direct, the upper bound, and the lower bound statistics are again reported alongside LSM-based benchmarks. Here, the neural networks have 64 hidden nodes and are fitted using a training set of 20,000 points. The lower and upper bound estimates, as well as the LSM estimates, are based on simulation runs of 200,000 paths each. The given lower and upper bounds are Monte Carlo estimates of the statistics defined in Equations (10) and (11) and are therefore subject to standard errors, which are reported in parentheses. The reference LSM results have been generated using as regression basis functions for approximating the continuation values. The standard errors and confidence intervals are obtained from ten independent Monte Carlo runs. The choice for hyperparameter settings is motivated by the analysis of Appendix C.
The spreads between the lower and upper bound estimates provide a good indication of the accuracy of the method. For the current setting, we obtain spreads in the order of several basis points up to a few dozen basis points. The lower bound estimate is typically very close to the LSM estimate, which itself is also biased low. Their standard errors are of the same order of magnitude. The upper bound estimates prove to be very stable and show a variance that is roughly two orders of magnitude smaller than that of the lower bound. The direct estimate is occasionally slightly less accurate. This can be explained by the fact that it depends on the accuracy of the regression over the full domain of the risk factor, whereas, for the lower bound, only high accuracy near the exercise boundaries is required. In Figure 5, the mean absolute error of each neural network after fitting is presented as a function of the network's index. The errors are displayed in basis points of the notional. We observe that the errors are smallest at maturity and tend to increase with each iteration backward in time. That the errors at the final monitor date are virtually zero can be explained by the fact that the pay-off at is given by
which can be exactly captured by a network with only a single hidden node. With each step backwards, the target function is harder to fit, yielding larger errors. We observe MAEs up to one basis point of the notional amount. The empirical lower–upper bound spreads remain well within the theoretical error margins provided in Section 4.1 and Section 4.2. The spreads are mostly much lower than the sum of the MAEs, indicating that the bound estimates are in practice significantly tighter than their theoretical maximum spread.
6.3. 2-Factor Bermudan Swaption
As a final pricing example, we consider a Bermudan swaption contract under a two-factor model. The dynamics of the underlying risk factors are assumed to follow a G2++ model Brigo and Mercurio (2006). Monte Carlo scenarios are generated based on a discretized Euler scheme, taking weekly time-steps, based on the SDE below:
where and are correlated Brownian motions with . Parameter values used in the numerical experiments are summarized in Table 3.
We again start by demonstrating the convergence property of the direct estimator for both the locally connected and the fully connected neural network designs, as specified in Section 3.2.3. The same Bermudan swaption with strike is used and the networks are each fitted to a set of 6400 training points. Figure 6a shows the direct estimator as a function of the number of hidden nodes in each neural network, alongside an LSM-based benchmark. In Figure 6b, the error with respect to the LSM estimate is shown on a log scale. We observe a similar convergence behavior, where the direct estimators approach the LSM benchmark within the 95% confidence range. Here, a portfolio of eight discount bond options is already sufficient to achieve a replication of similar accuracy to the LSM estimator.
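The two-factor scenario generation can be sketched analogously to the one-factor case: two mean-reverting Euler schemes driven by correlated Brownian increments. The parameter names follow standard G2++ notation (mean reversions a, b, volatilities σ, η, correlation ρ) and the values below are illustrative; the interface is ours.

```python
import numpy as np

def simulate_g2pp(a=0.07, b=0.08, sigma=0.015, eta=0.008, rho=-0.6,
                  T=1.0, steps=52, n_paths=100_000, seed=7):
    """Euler scheme for two correlated mean-reverting G2++ factors:
    dx = -a x dt + sigma dW1,  dy = -b y dt + eta dW2,  d<W1, W2> = rho dt.
    Correlated increments are built from independent normals via a
    Cholesky-style construction.  Returns the terminal factor values."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    x = np.zeros(n_paths)
    y = np.zeros(n_paths)
    for _ in range(steps):
        z1 = rng.standard_normal(n_paths)
        z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_paths)
        x += -a * x * dt + sigma * np.sqrt(dt) * z1
        y += -b * y * dt + eta * np.sqrt(dt) * z2
    return x, y

x, y = simulate_g2pp()
```

For the small mean reversions used here, the terminal correlation of the two factors stays very close to the instantaneous correlation ρ, which provides a simple sanity check on the correlated sampling.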
In Table 4, numerical results for a , , and receiver Bermudan swaption are depicted for different levels of moneyness. We again report the direct, the upper bound, and the lower bound estimates for both neural network designs. In this case, all networks have 64 hidden nodes and are fitted to training sets of 20,000 points. As before, the lower bound, the upper bound, and the LSM estimates are the result of 10 independent Monte Carlo simulations of 200,000 scenarios.
For the LSM algorithm, we used as basis functions. Note that the number of monomials grows quadratically with the dimension of the state space and, with that, the number of free parameters. For our method, this number grows at a linear rate. Choices for the hyperparameters are again based on the analysis of Appendix C. The results under the two-factor case share several features with the one-factor results. We observe spreads between the lower and upper bounds ranging from several basis points up to a few dozen basis points of the option price. The lower bound estimates turn out to be very close to the LSM estimates and the same holds for their standard errors. The upper bounds are again very stable with low standard errors and the direct estimator appears as slightly less accurate. If we compare the locally connected to the fully connected case, we observe that the results are overall in close agreement, especially the lower and upper bound estimates. This is remarkable given that the fully connected case gives rise to more trainable parameters, by which we would expect a higher approximation accuracy. In the two-factor setting, the ratio of free parameters for the two designs is 3:4.
In Figure 7, the mean absolute errors of the neural networks after fitting are shown; the MAEs for the locally connected networks are in blue and those for the fully connected networks in red. All are expressed in basis points of the notional amount. We observe that the errors are mostly of the same order of magnitude as in the one-dimensional case. The figures indicate that the locally connected networks slightly outperform the fully connected networks in terms of accuracy, although this does not appear to materialize in tighter estimates of the lower and upper bounds. For the locally connected case, we again observe that the errors are virtually zero at the last monitor date, for the same reasons as in the one-factor setting. In the fully connected representation, an exact replication might not exist, resulting in larger errors. We conjecture that this effect partially carries over to the networks at preceding monitor dates. The empirical lower–upper bound spreads remain well within the theoretical error margins, as the spreads are in all cases lower than the sum of the MAEs. Hence, also in the two-factor setting, we find that the bound estimates are tighter in practice than their theoretical maximum spreads.
6.4. Performance of the Semi-Static Hedge
Finally, we consider the hedging problem of a vanilla swaption under the one-factor model and a Bermudan swaption under the two-factor model.
6.4.1. 1-Factor Swaption
Here, we compare the performance of a static hedge versus a dynamic hedge in the one-factor model. As an example, we take a European receiver swaption at different levels of moneyness. The model set-up is similar to that in Section 6.2, using the same set of parameters reported in Table 1. In the static hedge case, the writer of the option contract aims to hedge the risk using a static portfolio of zero-coupon bond options and discount bonds. The replicating portfolio is composed using a neural network with 64 hidden nodes, optimized using 20,000 training points generated through Monte Carlo sampling. The portfolio is composed at time-zero and kept until the expiry of the option at year. In the dynamic hedge case, the delta-hedging strategy is applied. The replicating portfolio consists of units of the underlying forward-starting swap and an investment in the money market. The dynamic hedge involves periodic rebalancing of the portfolio. The delta for a receiver swaption under the Hull–White model (see Henrard 2003) is given by
(16)
where is the solution of and where denotes the CDF of a standard normal distribution, for , and . The function denotes the instantaneous volatility of a discount bond maturing at T, which, under Hull–White, is given by . We validated the analytic expression above with numerical approximations of the delta obtained by bumping the yield curve. Within the simulation, the dynamic hedge portfolio is rebalanced on a daily basis between time-zero and the expiry of the option. In this experiment, that means it is updated on 255 instances at equidistant monitor dates.
The performance of both hedging strategies is reported in Table 5. The results are based on 10,000 risk-neutral Monte Carlo paths. The hedging error refers to the difference between the option's pay-off at expiry and the replicating portfolio's final value. The quantities are reported in basis points of the notional amount. The empirical distribution of the hedging error is shown in Figure 8. We observe that, overall, the static hedge outperforms the dynamic hedge in terms of accuracy, even though it involves only about a quarter (64 versus 255) of the trades. Although it is not visible in Figure 8b, the static strategy does give rise to occasional outliers in terms of accuracy. These are associated with scenarios that reach or exceed the boundary of the training set. These errors are typically of a similar order of magnitude as the errors observed in the dynamic hedge. The impact of outliers can be reduced by enlarging the training set and thereby broadening the regression domain.
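The daily rebalancing described above follows the standard self-financing bookkeeping. The sketch below shows this accounting for a single path; the swap values, hedge ratios, and money-market growth factors are supplied as arrays, which is our own simplified interface rather than the paper's code.

```python
def delta_hedge_error(v0, swap, delta, growth, payoff):
    """Self-financing delta hedge along one path.
    swap[i]:   value of the hedge instrument at rebalance date i
               (the last entry is the value at expiry);
    delta[i]:  units held over the interval [t_i, t_{i+1});
    growth[i]: money-market growth factor over the same interval;
    v0:        initial option premium received;
    payoff:    option pay-off at expiry.
    Returns the final portfolio value minus the pay-off (the hedge error)."""
    cash = v0 - delta[0] * swap[0]                   # invest the remainder
    for i in range(1, len(delta)):
        cash = cash * growth[i - 1]                  # accrue the cash account
        cash -= (delta[i] - delta[i - 1]) * swap[i]  # self-financing rebalance
    final = cash * growth[-1] + delta[-1] * swap[-1]
    return final - payoff

# Sanity check: holding one unit throughout replicates a forward-type claim
# exactly, so the hedge error is zero.
err = delta_hedge_error(2.0, [2.0, 2.5, 1.8], [1.0, 1.0], [1.0, 1.0], 1.8)
```

In the experiment above, the hedge ratio at each date would be the Hull–White swaption delta of Equation (16), evaluated on the simulated yield curve.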
6.4.2. 2-Factor Bermudan Swaption
Here, we demonstrate the performance of the semi-static hedge for a receiver Bermudan swaption under a two-factor model. We compare the accuracy of the hedging strategy utilizing a locally connected network versus a fully connected neural network. In the former, the replication portfolio consists of zero-coupon bonds and zero-coupon bond options. In the latter, the Bermudan is replicated with options written on hypothetical assets with a pay-off equal to the log of a zero-coupon bond (see Section 3.2.3). The model set-up is similar to that in Section 6.3, using the same set of parameters reported in Table 3. Both networks are composed with 64 hidden nodes and optimized using 20,000 training points generated through Monte Carlo sampling. The portfolio is set up at time-zero and updated at each monitor date of the Bermudan until it is either exercised or expired. We assume that the holder of the Bermudan swaption follows the exercise strategy implied by the algorithm, i.e., the option is exercised as soon as . When a monitor date is reached, the replication portfolio matures with a pay-off equal to . In case the Bermudan is continued, the price to set up a new replication portfolio is given by , which contributes to the hedging error. In case the Bermudan is exercised, the holder will claim , which also contributes to the hedging error. The total error of the semi-static hedge (HE) is therefore computed as
where denotes the direct estimator at date and denotes the stopping time, as defined in Equation (9).
The performance of the strategies related to the locally and fully connected neural networks is reported in Table 6. The results are based on 10,000 risk-neutral Monte Carlo paths and reported in basis points of the notional amount. The empirical distribution of the hedging error is shown in Figure 9. We observe that both approaches yield an accuracy of the same order of magnitude, although the locally connected case slightly outperforms the fully connected case. This is in line with expectations, as the fitting performance of the locally connected networks is generally higher. For similar reasons to the one-factor case, the hedging experiments give rise to occasional outliers in terms of accuracy. These outliers can be in the order of several dozens of basis points. Again, the impact of outliers can be reduced by broadening the regression domain.
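The hedging-error accounting just described can be sketched pathwise as follows. The arrays of matured portfolio pay-offs, new-portfolio set-up costs, and exercise values are assumed to come from the trained networks and the simulated scenarios; the interface is a hypothetical simplification of ours.

```python
def semi_static_hedge_error(portfolio_payoff, setup_cost, exercise_value,
                            exercised, discount):
    """Accumulate the incremental hedging errors of a semi-static hedge
    along one path.  At monitor date j the maturing replication portfolio
    pays portfolio_payoff[j]; if the option is continued this must fund
    setup_cost[j] (the new portfolio), and if it is exercised it must fund
    exercise_value[j] (the holder's claim).  exercised[j] flags the first
    exercise date; discount[j] discounts date j to time zero."""
    he = 0.0
    for j in range(len(portfolio_payoff)):
        claim = exercise_value[j] if exercised[j] else setup_cost[j]
        he += discount[j] * (portfolio_payoff[j] - claim)
        if exercised[j]:
            break  # the hedge terminates at the exercise date
    return he

# Perfect replication at every monitor date yields a zero hedging error.
err = semi_static_hedge_error([2.0, 3.0], [2.0, 9.0], [5.0, 3.0],
                              [False, True], [0.99, 0.97])
```

Each term of the sum corresponds to one incremental hedging error from rebalancing, consistent with the decomposition used in the upper-bound analysis of Section 5.4.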
7. Conclusions
In this paper, we have proposed a semi-static replication algorithm for Bermudan swaptions under an affine term structure model. We have shown that Bermudan swaptions, an exotic interest rate derivative that is heavily traded in the OTC market, can be semi-statically replicated with an options portfolio written on a basket of discount bonds. The static portfolio composition is obtained by regressing the target option's value using a shallow artificial neural network. The choice of the regression basis functions is motivated by their representation of an option portfolio's pay-off, implying an interpretable neural network structure. Leveraging the approximating power of ANNs, we proved that the replication can achieve any desired level of accuracy, given that the portfolio is sufficiently large. We derived a direct estimator of the contract price; an upper bound and a lower bound estimate to this price can be computed at minimal additional computational cost.
The algorithm we presented is inspired by the work of Lokeshwar et al. (2022), which proposes a semi-static replication approach for callable equity options in the Black–Scholes model. We contribute to the literature by extending the concept of (semi-)static replication to the field of interest rate modeling. In addition to the direct, lower bound, and upper bound estimators, we have derived analytical error margins for these statistics. This proves their convergence as the regression error diminishes and provides direct insight into the accuracy of the estimates. Additionally, we propose an alternative ANN design, which constrains the replication to a portfolio of vanilla bond options, even in the case of a multi-factor model. This guarantees efficiency in the portfolio valuation, which is key to many applications in credit risk management.
The performance of the method was demonstrated through several numerical experiments. We focused on Bermudan swaptions under a one- and a two-factor model, which are popular amongst practitioners. The pricing accuracy of the method was determined through a benchmark against the established least-square method of Longstaff and Schwartz (2001). This reference is approached with basis point precision. A convergence analysis showed that a portfolio of 16 bond options suffices to achieve a replication of similar accuracy to the LSM. Finally, the replication performance was studied through an in-model hedging experiment. This showed that the semi-static hedge outperforms a traditional dynamic replication in terms of hedging error.
As an outlook for further research, we consider applying the algorithm to the computation of credit risk measures and various value adjustments (xVAs). These metrics typically rely on generating forward value and sensitivity profiles of (exotic) derivative portfolios. We see the semi-static replication approach, combined with the simple error analysis, as an effective tool for addressing the computational challenges associated with these risk measures. The performance of the method in the context of quantifying CCR will therefore be studied in a forthcoming companion paper.
Conceptualization, J.H., S.J. and D.K.; Formal analysis, J.H., S.J. and D.K.; Investigation, J.H.; Writing—original draft, J.H.; Writing—review and editing, S.J. and D.K.; Visualization, J.H.; Supervision, S.J. and D.K.; Project administration, D.K. All authors have read and agreed to the published version of the manuscript.
No new data were created or analyzed in this study. Data sharing is not applicable to this article.
The authors declare no conflict of interest.
The opinions expressed in this work are solely those of the authors and do not represent in any way those of their current and past employers.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 2. Suggested neural network designs for [Formula omitted. See PDF.]. (a) Locally connected neural network. (b) Fully connected neural network.
Figure 3. Accuracy of the direct estimator for vanilla swaptions. [Formula omitted. See PDF.].
Figure 4. Convergence of the direct estimator for the [Formula omitted. See PDF.] Bermudan swaption price as a function of hidden node count, with respect to the LSM benchmark under a 1-factor model.
Figure 5. Mean absolute errors of neural network fit per monitor date under a 1-factor model.
Figure 6. Convergence of the direct estimator for the 1Y × 5Y Bermudan swaption price as a function of hidden node count, with respect to the LSM benchmark under a 2-factor model.
Figure 7. Accuracy of neural network fit per monitor date under a 2-factor model. Blue lines represent the locally connected (l.c.) case and red lines the fully connected (f.c.) case. The legend in Figure (c) applies to all three graphs.
Figure 8. Hedge error distribution for a [Formula omitted. See PDF.] receiver swaption, based on [Formula omitted. See PDF.] MC paths. [Formula omitted. See PDF.].
Figure 9. Hedge error distribution for a [Formula omitted. See PDF.] receiver Bermudan swaption, based on [Formula omitted. See PDF.] MC paths. [Formula omitted. See PDF.].
Parameters 1F Hull–White model.
| Parameter | a | σ | f(0, t) |
|---|---|---|---|
| Value | 0.01 | 0.01 | 0.03 |
Results of the 1-factor model.

| Type | K/S | Dir. est. | Lower bnd | Upper bnd | UB−LB | LSM est. | LSM 95% CI |
|---|---|---|---|---|---|---|---|
| 1Y × 5Y | 80% | 1.527 | 1.521 (0.001) | 1.528 (0.000) | 0.007 | 1.521 (0.001) | [1.518, 1.523] |
| 1Y × 5Y | 100% | 2.543 | 2.534 (0.002) | 2.542 (0.000) | 0.008 | 2.534 (0.002) | [2.531, 2.538] |
| 1Y × 5Y | 120% | 4.015 | 4.016 (0.002) | 4.018 (0.000) | 0.002 | 4.016 (0.002) | [4.012, 4.021] |
| 3Y × 7Y | 80% | 3.296 | 3.293 (0.002) | 3.295 (0.000) | 0.002 | 3.293 (0.002) | [3.290, 3.296] |
| 3Y × 7Y | 100% | 4.767 | 4.755 (0.004) | 4.761 (0.000) | 0.006 | 4.755 (0.004) | [4.747, 4.762] |
| 3Y × 7Y | 120% | 6.625 | 6.629 (0.004) | 6.631 (0.000) | 0.002 | 6.629 (0.004) | [6.621, 6.638] |
| 1Y × 10Y | 80% | 3.950 | 3.945 (0.005) | 3.960 (0.000) | 0.015 | 3.945 (0.005) | [3.935, 3.955] |
| 1Y × 10Y | 100% | 5.818 | 5.811 (0.003) | 5.818 (0.000) | 0.007 | 5.811 (0.003) | [5.805, 5.816] |
| 1Y × 10Y | 120% | 8.346 | 8.354 (0.005) | 8.360 (0.000) | 0.006 | 8.353 (0.005) | [8.344, 8.362] |
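The bracketed numbers next to the bound estimators are standard errors, and the final column is a 95% confidence interval; both are standard Monte Carlo statistics. A minimal sketch of how such statistics are computed (the function name is an illustrative assumption, not from the paper):

```python
import math

def mc_estimate(samples):
    """Return (mean, standard error, 95% confidence interval)
    for a sequence of Monte Carlo samples."""
    n = len(samples)
    mean = sum(samples) / n
    # Unbiased sample variance; standard error of the sample mean.
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    se = math.sqrt(var / n)
    return mean, se, (mean - 1.96 * se, mean + 1.96 * se)
```

In the tables, a tight gap between the lower and upper bounds (the UB−LB column) relative to these standard errors is what signals an accurate replication.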
Parameters of the 2F G2++ model.

| Parameter | [symbol omitted] | [symbol omitted] | [symbol omitted] | [symbol omitted] | [symbol omitted] | [symbol omitted] |
|---|---|---|---|---|---|---|
| Value | 0.07 | 0.08 | 0.015 | 0.008 | −0.6 | 0.03 |
Results of the 2-factor model for the locally connected and fully connected neural network cases.

Locally connected NN:

| Type | K/S | Dir. est. | Lower bnd | Upper bnd | UB−LB | LSM est. | LSM 95% CI |
|---|---|---|---|---|---|---|---|
| 1Y × 5Y | 80% | 1.617 | 1.617 (0.002) | 1.619 (0.000) | 0.002 | 1.617 (0.002) | [1.614, 1.621] |
| 1Y × 5Y | 100% | 2.652 | 2.650 (0.002) | 2.654 (0.000) | 0.004 | 2.650 (0.002) | [2.646, 2.654] |
| 1Y × 5Y | 120% | 4.128 | 4.127 (0.003) | 4.131 (0.000) | 0.004 | 4.127 (0.003) | [4.121, 4.132] |
| 3Y × 7Y | 80% | 3.073 | 3.076 (0.004) | 3.078 (0.000) | 0.002 | 3.077 (0.004) | [3.069, 3.085] |
| 3Y × 7Y | 100% | 4.554 | 4.553 (0.004) | 4.553 (0.000) | 0.000 | 4.552 (0.004) | [4.545, 4.559] |
| 3Y × 7Y | 120% | 6.444 | 6.448 (0.004) | 6.451 (0.000) | 0.003 | 6.446 (0.005) | [6.435, 6.456] |
| 1Y × 10Y | 80% | 3.616 | 3.624 (0.002) | 3.626 (0.000) | 0.002 | 3.622 (0.002) | [3.618, 3.627] |
| 1Y × 10Y | 100% | 5.508 | 5.509 (0.002) | 5.514 (0.000) | 0.005 | 5.508 (0.002) | [5.503, 5.512] |
| 1Y × 10Y | 120% | 8.128 | 8.123 (0.005) | 8.130 (0.000) | 0.007 | 8.121 (0.005) | [8.110, 8.132] |

Fully connected NN:

| Type | K/S | Dir. est. | Lower bnd | Upper bnd | UB−LB | LSM est. | LSM 95% CI |
|---|---|---|---|---|---|---|---|
| 1Y × 5Y | 80% | 1.617 | 1.617 (0.002) | 1.619 (0.000) | 0.002 | 1.617 (0.002) | [1.614, 1.621] |
| 1Y × 5Y | 100% | 2.651 | 2.650 (0.002) | 2.654 (0.000) | 0.004 | 2.650 (0.002) | [2.646, 2.654] |
| 1Y × 5Y | 120% | 4.129 | 4.127 (0.003) | 4.131 (0.000) | 0.004 | 4.127 (0.003) | [4.121, 4.132] |
| 3Y × 7Y | 80% | 3.076 | 3.077 (0.004) | 3.078 (0.000) | 0.001 | 3.077 (0.004) | [3.069, 3.085] |
| 3Y × 7Y | 100% | 4.553 | 4.553 (0.004) | 4.554 (0.000) | 0.001 | 4.552 (0.004) | [4.545, 4.559] |
| 3Y × 7Y | 120% | 6.451 | 6.447 (0.005) | 6.451 (0.000) | 0.004 | 6.446 (0.005) | [6.435, 6.456] |
| 1Y × 10Y | 80% | 3.616 | 3.624 (0.002) | 3.626 (0.000) | 0.002 | 3.622 (0.002) | [3.618, 3.627] |
| 1Y × 10Y | 100% | 5.506 | 5.509 (0.002) | 5.514 (0.000) | 0.005 | 5.508 (0.002) | [5.503, 5.512] |
| 1Y × 10Y | 120% | 8.124 | 8.123 (0.005) | 8.130 (0.000) | 0.007 | 8.121 (0.005) | [8.110, 8.132] |
Hedging errors for the static and dynamic hedging strategies for a [product omitted. See PDF.].

| Hedge Error (bps) | K/S | Static Hedge | Dyn. Hedge |
|---|---|---|---|
| Mean | 80% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| Mean | 100% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| Mean | 120% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| St. dev. | 80% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| St. dev. | 100% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| St. dev. | 120% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| 95%-percentile | 80% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| 95%-percentile | 100% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| 95%-percentile | 120% | [value omitted. See PDF.] | [value omitted. See PDF.] |
Hedging errors of the semi-static hedging strategy for a [product omitted. See PDF.].

| Hedge Error (bps) | K/S | Loc. conn. NN | Fully conn. NN |
|---|---|---|---|
| Mean | 80% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| Mean | 100% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| Mean | 120% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| St. dev. | 80% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| St. dev. | 100% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| St. dev. | 120% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| 95%-percentile | 80% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| 95%-percentile | 100% | [value omitted. See PDF.] | [value omitted. See PDF.] |
| 95%-percentile | 120% | [value omitted. See PDF.] | [value omitted. See PDF.] |
Appendix A. Evaluation of the Conditional Expectation
In this section, we will explicitly compute the conditional expectations related to the continuation values. We will distinguish two approaches associated with the two proposed network structures, i.e., the locally connected case (suggestion 1) and the fully connected case (suggestion 2).
For ease of computation, we will use a simplified, yet equivalent representation of the risk factor dynamics discussed in [reference omitted. See PDF.].
Appendix A.1. The Continuation Value with Locally Connected NN
We consider the network [Formula omitted. See PDF.]
The map [Formula omitted. See PDF.]
If, on the other hand, [Formula omitted. See PDF.]
In the expression above, the function [Formula omitted. See PDF.]
Appendix A.2. The Continuation Value with Fully Connected NN
Once again, we consider the network [Formula omitted. See PDF.]
Let [Formula omitted. See PDF.]
Appendix B. Pre-Processing the Regression Data
A procedure that significantly improves the fitting performance of the neural networks is the normalization of the training data. The linear rescaling of the input to the optimizer is a common form of data pre-processing
Another argument for pre-processing the input is that large data values typically induce large weights. Large weights can lead to exploding network outputs in the feed-forward process
In practice, we propose the following rescaling of the data. Denote by [notation omitted. See PDF.]

The locally connected NN case: Consider the outcome of the [Formula omitted. See PDF.] hidden node and denote the input of the network as [Formula omitted. See PDF.]. Then [Formula omitted. See PDF.], where k is the index of the only non-zero entry of [Formula omitted. See PDF.], the row of the weight matrix [Formula omitted. See PDF.]. The transformation implies that [Formula omitted. See PDF.]. As a consequence, in the analysis of Appendix A.1, the transformations [Formulas omitted. See PDF.] should be taken into account. Additionally, the transformation [Formula omitted. See PDF.] is required to account for the scaling of [Formula omitted. See PDF.].

The fully connected NN case: Again, consider the outcome of the [Formula omitted. See PDF.] hidden node. This time, the transformation implies that [Formula omitted. See PDF.]. As a consequence, in the analysis of Appendix A.2, the transformations [Formulas omitted. See PDF.] should be taken into account. And, again, the transformation [Formula omitted. See PDF.] is required to account for the scaling of [Formula omitted. See PDF.].
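The normalization that Appendix B describes can be sketched as follows. The exact affine maps are given by the omitted formulas, so the function name and the choice of z-score (zero mean, unit standard deviation) scaling below are illustrative assumptions; the key point is that the scaling parameters must be returned, since they have to be undone when mapping the fitted network weights back to the original risk-factor units.

```python
import numpy as np

def normalize_training_data(x, y):
    """Linearly rescale regression inputs x (n_samples x n_factors)
    and targets y to zero mean and unit standard deviation.

    Returns the scaled arrays plus the affine parameters needed to
    invert the scaling after the network has been fitted."""
    x_mu, x_sd = x.mean(axis=0), x.std(axis=0)
    y_mu, y_sd = y.mean(), y.std()
    x_scaled = (x - x_mu) / x_sd
    y_scaled = (y - y_mu) / y_sd
    return x_scaled, y_scaled, (x_mu, x_sd, y_mu, y_sd)
```

Scaling each risk factor separately keeps the optimizer's gradients well conditioned and avoids the large-weight, exploding-output behavior mentioned above.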
Appendix C. Hyperparameter Selection
The accuracy of the neural network fitting procedure depends on the choice of several hyperparameters. For the numerical experiments reported in [reference omitted. See PDF.]:
- Hidden node count: see Figure A1;
- Size of the training set: see Figure A2;
- Learning rate: see Figure A3.
Figure A1. Impact of hidden node count: accuracy of the neural network fit per monitor date under a 2-factor model. # training points = 5000. Learning rate = 0.0002.
Figure A2. Impact of training set size: accuracy of the neural network fit per monitor date under a 2-factor model. # hidden nodes = 64. Learning rate = 0.0002.
Figure A3. Impact of learning rate: accuracy of the neural network fit per monitor date under a 2-factor model. # hidden nodes = 64. # training points = 10,000.
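The selection procedure behind Figures A1–A3 amounts to scanning each hyperparameter and comparing the resulting fit error per monitor date. A minimal sketch of such a scan; the function names, the candidate grids, and the exhaustive-search strategy are assumptions for illustration (the paper reports per-parameter sensitivity studies, not necessarily a joint grid search):

```python
import itertools

def select_hyperparameters(fit_and_score,
                           hidden_nodes=(16, 32, 64, 128),
                           train_sizes=(5000, 10000, 20000),
                           learning_rates=(0.0002, 0.001, 0.005)):
    """Scan the three hyperparameters studied in Appendix C and
    return the configuration with the lowest error reported by
    fit_and_score (e.g., mean absolute error of the network fit)."""
    best_err, best_cfg = float("inf"), None
    for h, n, lr in itertools.product(hidden_nodes, train_sizes, learning_rates):
        err = fit_and_score(h, n, lr)
        if err < best_err:
            best_err, best_cfg = err, {"hidden_nodes": h,
                                       "train_size": n,
                                       "learning_rate": lr}
    return best_cfg
```

Here `fit_and_score` would train the replication network once per configuration and return a validation error, so the scan cost grows with the product of the grid sizes.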
Appendix D. Proof of Theorem 1
We prove by induction on m. At the last exercise date of the Bermudan, i.e., [Formula omitted. See PDF.], the following case distinctions hold: [Formulas omitted. See PDF.]
Appendix E. Proof of Theorem 2
First, we fix some notation.
- Let [Formula omitted. See PDF.] denote the true price of the Bermudan swaption at [Formula omitted. See PDF.], conditioned on the fact that it is not yet exercised.
- Let [Formula omitted. See PDF.] denote the estimator of the continuation value at [Formula omitted. See PDF.].
- Let [Formula omitted. See PDF.] denote the estimator of [Formula omitted. See PDF.].
- Let [Formula omitted. See PDF.] denote the neural network approximation of [Formula omitted. See PDF.].
- Let [Formula omitted. See PDF.] denote the numéraire at [Formula omitted. See PDF.].
- Let [Formula omitted. See PDF.].

For the final step, note that if [Formula omitted. See PDF.]
Appendix F. Proof of Theorem 3
We consider the following three events: [Formulas omitted. See PDF.]
Bounding [Formula omitted. See PDF.]
Bounding [Formula omitted. See PDF.]
Appendix G. Proof of Theorem 4
The discounted true price process is a supermartingale under [Formula omitted. See PDF.].
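For completeness, the supermartingale property invoked here can be written out. The symbols below are assumptions chosen to echo the notation of Appendix E: $V_t$ for the true Bermudan price, $N_t$ for the numéraire, and $\mathcal{F}_t$ for the filtration, with the expectation taken under the pricing measure associated with $N$:

```latex
% Discounted true price process is a supermartingale:
\frac{V_t}{N_t} \;\ge\; \mathbb{E}\!\left[\left.\frac{V_s}{N_s}\,\right|\,\mathcal{F}_t\right],
\qquad t \le s.
```

This follows from the dynamic programming principle: at each exercise date, the true price dominates the continuation value, so discounting by the numéraire can only lose value in expectation.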
References
Ametrano, Ferdinando; Ballabio, Luigi. Quantlib—A Free/Open-Source Library for Quantitative Finance. 2003; Available online: https://github.com/lballabio/QuantLib (accessed on 1 March 2020).
Andersen, Leif; Broadie, Mark. Primal-dual simulation algorithm for pricing multidimensional American options. Management Science; 2004; 50, pp. 1222-34. [DOI: https://dx.doi.org/10.1287/mnsc.1040.0258]
Andersen, Leif B. G.; Piterbarg, Vladimir V. Interest Rate Modeling, Volume I: Foundations and Vanilla Models; Atlantic Financial Press: London, 2010a.
Andersen, Leif B. G.; Piterbarg, Vladimir V. Interest Rate Modeling, Volume II: Term Structure Models; Atlantic Financial Press: London, 2010b.
Andersson, Kristoffer; Oosterlee, Cornelis W. A deep learning approach for computations of exposure profiles for high-dimensional Bermudan options. Applied Mathematics and Computation; 2021; 408, 126332. [DOI: https://dx.doi.org/10.1016/j.amc.2021.126332]
Becker, Sebastian; Cheridito, Patrick; Jentzen, Arnulf. Deep optimal stopping. Journal of Machine Learning Research; 2019; 20, 74.
Becker, Sebastian; Cheridito, Patrick; Jentzen, Arnulf. Pricing and hedging American-style options with deep learning. Journal of Risk and Financial Management; 2020; 13, 158. [DOI: https://dx.doi.org/10.3390/jrfm13070158]
Beyna, Ingo. Interest Rate Derivatives: Valuation, Calibration and Sensitivity Analysis; Springer Science & Business Media: Berlin/Heidelberg, 2013.
Bishop, Christopher M. Neural Networks for Pattern Recognition; Oxford University Press: Oxford, 1995.
Breeden, Douglas T.; Litzenberger, Robert H. Prices of state-contingent claims implicit in option prices. Journal of Business; 1978; 51, pp. 621-51. [DOI: https://dx.doi.org/10.1086/296025]
Brigo, Damiano; Mercurio, Fabio. Interest Rate Models-Theory and Practice: With Smile, Inflation and Credit; Springer: Berlin/Heidelberg, 2006; vol. 2.
Carr, Peter; Bowie, Jonathan. Static simplicity. Risk; 1994; 7, pp. 45-50.
Carr, Peter; Ellis, Katrina; Gupta, Vishal. Static hedging of exotic options. Quantitative Analysis in Financial Markets: Collected Papers of the New York University Mathematical Finance Seminar; World Scientific: Singapore, 1999; pp. 152-76.
Carr, Peter; Wu, Liuren. Static hedging of standard options. Journal of Financial Econometrics; 2014; 12, pp. 3-46. [DOI: https://dx.doi.org/10.1093/jjfinec/nbs014]
Carriere, Jacques F. Valuation of the early-exercise price for options using simulations and nonparametric regression. Insurance: Mathematics and Economics; 1996; 19, pp. 19-30. [DOI: https://dx.doi.org/10.1016/S0167-6687(96)00004-2]
Chollet, François. Keras. 2015; Available online: https://keras.io (accessed on 1 May 2020).
Chung, San-Lin; Shih, Pai-Ta. Static hedging and pricing American options. Journal of Banking & Finance; 2009; 33, pp. 2140-49.
Dai, Qiang; Singleton, Kenneth J. Specification analysis of affine term structure models. The Journal of Finance; 2000; 55, pp. 1943-78. [DOI: https://dx.doi.org/10.1111/0022-1082.00278]
Derman, Emanuel; Ergener, Deniz; Kani, Iraj. Static options replication. Journal of Derivatives; 1995; 2, [DOI: https://dx.doi.org/10.3905/jod.1995.407927]
Duffie, Darrell; Kan, Rui. A yield-factor model of interest rates. Mathematical Finance; 1996; 6, pp. 379-406. [DOI: https://dx.doi.org/10.1111/j.1467-9965.1996.tb00123.x]
Ferguson, Ryan; Green, Andrew. Deeply learning derivatives. arXiv; 2018; arXiv: 1809.02233
Filipovic, Damir. Term-Structure Models. A Graduate Course; Springer: Berlin/Heidelberg, 2009.
Geman, Helyette; Karoui, Nicole El; Rochet, Jean-Charles. Changes of numeraire, changes of probability measure and option pricing. Journal of Applied Probability; 1995; 32, pp. 443-58. [DOI: https://dx.doi.org/10.2307/3215299]
Glasserman, Paul. Monte Carlo Methods in Financial Engineering; Springer Science & Business Media: Berlin/Heidelberg, 2013; vol. 53.
Glasserman, Paul; Yu, Bin. Simulation for American options: Regression now or regression later?. Monte Carlo and Quasi-Monte Carlo Methods 2002; Springer: Berlin/Heidelberg, 2004; pp. 213-26.
Gnoatto, Alessandro; Reisinger, Christoph; Picarelli, Athena. Deep xVA solver—A neural network based counterparty credit risk management framework. SIAM Journal on Financial Mathematics; 2023; 14, pp. 314-352. [DOI: https://dx.doi.org/10.1137/21M1457606]
Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. Deep Learning; MIT Press: Cambridge, 2016; vol. 1.
Gregory, Jon. The xVA Challenge: Counterparty Credit Risk, Funding, Collateral and Capital; John Wiley & Sons: Hoboken, 2015.
Hagan, Patrick S. Convexity conundrums: Pricing CMS swaps, caps, and floors. The Best of Wilmott; 2005; 305. [DOI: https://dx.doi.org/10.1002/wilm.42820030211]
Harrison, J. Michael; Pliska, Stanley R. Martingales and stochastic integrals in the theory of continuous trading. Stochastic Processes and Their Applications; 1981; 11, pp. 215-60. [DOI: https://dx.doi.org/10.1016/0304-4149(81)90026-0]
Haugh, Martin B.; Kogan, Leonid. Pricing American options: A duality approach. Operations Research; 2004; 52, pp. 258-70. [DOI: https://dx.doi.org/10.1287/opre.1030.0070]
Henrard, Marc. Explicit bond option formula in Heath–Jarrow–Morton one factor model. International Journal of Theoretical and Applied Finance; 2003; 6, pp. 57-72. [DOI: https://dx.doi.org/10.1142/S0219024903001785]
Henry-Labordere, Pierre. Deep Primal-Dual Algorithm for BSDEs: Applications of Machine Learning to CVA and IM. 2017; Available online: https://ssrn.com/abstract=3071506 (accessed on 1 October 2020).
Hornik, Kurt; Stinchcombe, Maxwell; White, Halbert. Multilayer feedforward networks are universal approximators. Neural Networks; 1989; 2, pp. 359-66. [DOI: https://dx.doi.org/10.1016/0893-6080(89)90020-8]
Hutchinson, James M.; Lo, Andrew W.; Poggio, Tomaso. A nonparametric approach to pricing and hedging derivative securities via learning networks. The Journal of Finance; 1994; 49, pp. 851-89. [DOI: https://dx.doi.org/10.1111/j.1540-6261.1994.tb00081.x]
Jain, Shashi; Oosterlee, Cornelis W. The stochastic grid bundling method: Efficient pricing of bermudan options and their greeks. Applied Mathematics and Computation; 2015; 269, pp. 412-31. [DOI: https://dx.doi.org/10.1016/j.amc.2015.07.085]
Jamshidian, Farshid. An exact bond option formula. The Journal of Finance; 1989; 44, pp. 205-209. [DOI: https://dx.doi.org/10.1111/j.1540-6261.1989.tb02413.x]
Kingma, Diederik P.; Ba, Jimmy. Adam: A method for stochastic optimization. arXiv; 2014; arXiv: 1412.6980
Kloeden, Peter E.; Platen, Eckhard. Numerical Solution of Stochastic Differential Equations; Springer Science & Business Media: Berlin/Heidelberg, 2013; vol. 23.
Kohler, Michael; Krzyżak, Adam; Todorovic, Nebojsa. Pricing of high-dimensional American options by neural networks. Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics; 2010; 20, pp. 383-410. [DOI: https://dx.doi.org/10.1111/j.1467-9965.2010.00404.x]
Lapeyre, Bernard; Lelong, Jérôme. Neural network regression for Bermudan option pricing. arXiv; 2019; arXiv:1907.06474. [DOI: https://dx.doi.org/10.1515/mcma-2021-2091]
Lokeshwar, Vikranth; Bharadwaj, Vikram; Jain, Shashi. Explainable neural network for pricing and universal static hedging of contingent claims. Applied Mathematics and Computation; 2022; 417, 126775. [DOI: https://dx.doi.org/10.1016/j.amc.2021.126775]
Longstaff, Francis A.; Schwartz, Eduardo S. Valuing American options by simulation: A simple least-squares approach. The Review of Financial Studies; 2001; 14, pp. 113-47. [DOI: https://dx.doi.org/10.1093/rfs/14.1.113]
Musiela, Marek; Rutkowski, Marek. Martingale Methods in Financial Modelling; Springer Finance: Berlin/Heidelberg, 2005.
Oosterlee, Kees; Feng, Qian; Jain, Shashi; Karlsson, Patrik; Kandhai, Drona. Efficient computation of exposure profiles on real-world and risk-neutral scenarios for Bermudan swaptions. Journal of Computational Finance; 2016; 20, pp. 139-72. [DOI: https://dx.doi.org/10.21314/JCF.2017.337]
Pelsser, Antoon. Pricing and hedging guaranteed annuity options via static option replication. Insurance: Mathematics and Economics; 2003; 33, pp. 283-96. [DOI: https://dx.doi.org/10.1016/S0167-6687(03)00154-9]
Rogers, Leonard C. G. Monte Carlo valuation of American options. Mathematical Finance; 2002; 12, pp. 271-86. [DOI: https://dx.doi.org/10.1111/1467-9965.02010]
Ruf, Johannes; Wang, Weiguan. Neural networks for option pricing and hedging: A literature review. Journal of Computational Finance; 2020; in press [DOI: https://dx.doi.org/10.21314/JCF.2020.390]
Shreve, Steven E. Stochastic Calculus for Finance II: Continuous-Time Models; Springer Science & Business Media: Berlin/Heidelberg, 2004; vol. 11.
Wang, Haojie; Chen, Han; Sudjianto, Agus; Liu, Richard; Shen, Qi. Deep learning-based BSDE solver for LIBOR market model with application to Bermudan swaption pricing and hedging. arXiv; 2018; arXiv:1807.06622. [DOI: https://dx.doi.org/10.2139/ssrn.3214596]
Xiu, Dongbin. Numerical Methods for Stochastic Computations: A Spectral Method Approach; Princeton University Press: Princeton, 2010.
Zhu, Steven H.; Pykhtin, Michael. A guide to modeling counterparty credit risk. GARP Risk Review, July/August; 2007; Available online: https://ssrn.com/abstract=1032522 (accessed on 10 November 2020).
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
We present a semi-static replication algorithm for Bermudan swaptions under an affine, multi-factor term structure model. In contrast to dynamic replication, which needs to be continuously updated as the market moves, a semi-static replication needs to be rebalanced on just a finite number of instances. We show that the exotic derivative can be decomposed into a portfolio of vanilla discount bond options, which mirrors its value as the market moves and can be priced in closed form. This paves the way toward the efficient numerical simulation of xVA, market, and credit risk metrics for which forward valuation is the key ingredient. The static portfolio composition is obtained by regressing the target option’s value using an interpretable, artificial neural network. Leveraging the universal approximation power of neural networks, we prove that the replication error can be made arbitrarily small for a sufficiently large portfolio. A direct, a lower-bound, and an upper-bound estimator for the Bermudan swaption price are inferred from the replication algorithm. Additionally, closed-form error margins for the price statistics are determined. We study the accuracy and convergence of the method through several numerical experiments. The results indicate that the semi-static replication approaches the LSM benchmark with basis-point accuracy and provides tight, efficient error bounds. For in-model simulations, the semi-static replication outperforms a traditional dynamic hedge.
1 Informatics Institute, University of Amsterdam, Science Park 904, 1098XH Amsterdam, The Netherlands;
2 Indian Institute of Science, Department of Management Studies, Bangalore 560012, India;