1. Introduction
Importance sampling is a mechanism to approximate expectations with respect to a target distribution using independent weighted samples from a proposal distribution. The variance of the weights—quantified by the χ²-divergence between target and proposal—gives both necessary and sufficient conditions on the sample size to achieve a desired worst-case error over large classes of test functions. This paper contributes to the understanding of importance sampling to approximate the Bayesian update, where the target is a posterior distribution obtained by conditioning the proposal to observed data. We consider illustrative examples where the χ²-divergence between target and proposal admits a closed formula and it is hence possible to characterize explicitly the required sample size. These examples showcase the fundamental challenges that importance sampling encounters in high dimension and small noise regimes where target and proposal are far apart. They also facilitate a direct comparison of standard and optimal proposals for particle filtering.
We denote the target distribution by μ and the proposal by π and assume that both are probability distributions in Euclidean space R^d. We further suppose that the target is absolutely continuous with respect to the proposal and denote by g the un-normalized density between target and proposal so that, for any suitable test function φ,

$$\int_{\mathbb{R}^d} \varphi(u)\,\mu(du) = \frac{\int_{\mathbb{R}^d} \varphi(u)\,g(u)\,\pi(du)}{\int_{\mathbb{R}^d} g(u)\,\pi(du)}.$$

We write this succinctly as μ(φ) = π(φg)/π(g). For simplicity of exposition, we will assume that g is positive π-almost surely. Importance sampling approximates μ(φ) using independent samples {u^(n)}_{n=1}^N from the proposal π, computing the numerator and denominator in (1) by Monte Carlo integration,

$$\mu(\varphi) \approx \frac{\frac{1}{N}\sum_{n=1}^{N} \varphi(u^{(n)})\,g(u^{(n)})}{\frac{1}{N}\sum_{n=1}^{N} g(u^{(n)})} = \sum_{n=1}^{N} w^{(n)} \varphi(u^{(n)}), \qquad w^{(n)} := \frac{g(u^{(n)})}{\sum_{\ell=1}^{N} g(u^{(\ell)})}.$$

The weights w^(n)—called autonormalized or self-normalized since they add up to one—can be computed as long as the un-normalized density g can be evaluated point-wise; knowledge of the normalizing constant π(g) is not needed. We write (2) briefly as μ(φ) ≈ μ^N(φ), where μ^N is the random autonormalized particle approximation measure

$$\mu^N := \sum_{n=1}^{N} w^{(n)}\,\delta_{u^{(n)}}, \qquad u^{(n)} \overset{\mathrm{i.i.d.}}{\sim} \pi.$$
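To make the estimator (2) concrete, here is a minimal Python sketch of the autonormalized estimator; the one-dimensional Gaussian prior and likelihood are hypothetical choices made purely for illustration and are not part of the examples studied later in the paper.

```python
import numpy as np

def autonormalized_is(phi, g, proposal_sampler, N, rng):
    """Self-normalized importance sampling estimate of mu(phi), where
    mu(du) is proportional to g(u) pi(du) and pi is the proposal."""
    u = proposal_sampler(N, rng)          # u^(1), ..., u^(N) i.i.d. from pi
    gu = g(u)                             # un-normalized density evaluations
    w = gu / gu.sum()                     # autonormalized weights w^(n)
    return np.sum(w * phi(u))

# Hypothetical example: pi = N(0, 1) and g the likelihood of y = u + N(0, gamma^2).
y, gamma = 1.0, 0.5
estimate = autonormalized_is(
    phi=lambda u: u,                                    # posterior mean of u
    g=lambda u: np.exp(-0.5 * ((y - u) / gamma) ** 2),  # Gaussian likelihood
    proposal_sampler=lambda N, rng: rng.standard_normal(N),
    N=10_000,
    rng=np.random.default_rng(0),
)
print(estimate)  # close to the exact posterior mean y / (1 + gamma^2) = 0.8
```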
This paper is concerned with the study of importance sampling in Bayesian formulations of inverse problems, data assimilation and machine learning tasks [1,2,3,4,5], where the relationship μ(du) ∝ g(u)π(du) arises from application of Bayes' rule P(u|y) ∝ P(y|u)P(u); we interpret u ∈ R^d as a parameter of interest, π ≡ P(u) as a prior distribution on u, g(u) ≡ g(u;y) ≡ P(y|u) as a likelihood function which tacitly depends on observed data y ∈ R^k, and μ ≡ P(u|y) as the posterior distribution of u given y. With this interpretation and terminology, the goal of importance sampling is to approximate posterior expectations using prior samples. Since the prior has fatter tails than the posterior, the Bayesian setting imposes further structure on the analysis of importance sampling. In addition, there are several specific features of the application of importance sampling in Bayesian inverse problems, data assimilation and machine learning that shape our presentation and results.
First, Bayesian formulations have the potential to provide uncertainty quantification by computing several posterior quantiles. This motivates considering a worst-case error analysis [6] of importance sampling over large classes of test functions φ or, equivalently, bounding a certain distance between the random particle approximation measure μ^N and the target μ, see [1]. As we will review in Section 2, a key quantity in controlling the error of importance sampling with bounded test functions is the χ²-divergence between target and proposal, given by

$$d_{\chi^2}(\mu \,\|\, \pi) = \frac{\pi(g^2)}{\pi(g)^2} - 1.$$
Second, importance sampling in inverse problems, data assimilation and machine learning applications is often used as a building block of more sophisticated computational methods, and in such a case there may be little or no freedom in the choice of proposal. For this reason, throughout this paper we view both target and proposal as given and we focus on investigating the required sample size for accurate importance sampling with bounded test functions, following a perspective similar to that of [1,7,8]. The complementary question of how to choose the proposal to achieve a small variance for a given test function is not considered here. This latter question is of central interest in the simulation of rare events [9] and has been widely studied since the introduction of importance sampling in [10,11], leading to a plethora of adaptive importance sampling schemes [12].
Third, high dimensional and small noise settings are standard in inverse problems, data assimilation and machine learning, and it is essential to understand the scalability of sampling algorithms in these challenging regimes. The curse of dimension of importance sampling has been extensively investigated [1,13,14,15,16,17]. The early works [13,14] demonstrated a weight collapse phenomenon, by which unless the number of samples is scaled exponentially with the dimension of the parameter, the maximum weight converges to one. The paper [1] also considered small noise limits and further emphasized the need to define precisely the dimension of learning problems. Indeed, while many inverse problems, data assimilation models and machine learning tasks are defined in terms of millions of parameters, their intrinsic dimension can be substantially smaller since (i) not all parameters may be equally important; (ii) a priori information about some parameters may be available; and (iii) the data may be lower dimensional than the parameter space. If the intrinsic dimension is still large, which occurs often in applications in geophysics and machine learning, it is essential to leverage the correlation structure of the parameters or the observations by performing localization [18,19,20]. Local particle filters are reviewed in [21] and their potential to beat the curse of dimension is investigated from a theoretical viewpoint in [16]. Localization is popular in ensemble Kalman filters [20] and has been employed in Markov chain Monte Carlo [22,23]. Our focus in this paper is not on localization but rather on providing a unified and accessible understanding of the roles that dimension, noise-level and other model parameters play in approximating the Bayesian update. We will do so through examples where it is possible to compute explicitly the χ²-divergence between target and proposal, and hence the required sample size.
Finally, in the Bayesian context the normalizing constant π(g) represents the marginal likelihood and is often computationally intractable. This motivates our focus on the autonormalized importance sampling estimator in (2), which estimates both π(gφ) and π(g) using Monte Carlo integration, as opposed to un-normalized variants of importance sampling [8].
Main Goals, Specific Contributions and Outline
The main goal of this paper is to provide a rich and unified understanding of the use of importance sampling to approximate the Bayesian update, while keeping the presentation accessible to a large audience. In Section 2 we investigate the required sample size for importance sampling in terms of the χ²-divergence between target and proposal. Section 3 builds on the results in Section 2 to illustrate through numerous examples the fundamental challenges that importance sampling encounters when approximating the Bayesian update in small noise and high dimensional settings. In Section 4 we show how our concrete examples facilitate a new direct comparison of standard and optimal proposals for particle filtering. These examples also allow us to identify model problems where the advantage of the optimal proposal over the standard one can be dramatic.
Next, we provide further details on the specific contributions of each section and link them to the literature. We refer to [1] for a more exhaustive literature review.
-
Section 2 provides a unified perspective on the sufficiency and necessity of having a sample size of the order of the χ²-divergence between target and proposal to guarantee accurate importance sampling with bounded test functions. Our analysis and presentation are informed by the specific features that shape the use of importance sampling to approximate Bayes' rule. The key role of the second moment of the χ²-divergence has long been acknowledged [24,25], and it is intimately related to an effective sample size used by practitioners to monitor the performance of importance sampling [26,27]. A topic of recent interest is the development of adaptive importance sampling schemes where the proposal is chosen by minimizing—over some admissible family of distributions—the χ²-divergence with respect to the target [28,29]. The main original contributions of Section 2 are Proposition 2 and Theorem 1, which demonstrate the necessity of suitably increasing the sample size with the χ²-divergence along singular limit regimes. The idea of Proposition 2 is inspired by [7], but adapted here from relative entropy to χ²-divergence. Our results complement sufficient conditions on the sample size derived in [1] and necessary conditions for un-normalized (as opposed to autonormalized) importance sampling in [8].
-
In Section 3, Proposition 4 gives a closed formula for the χ²-divergence between posterior and prior in a linear-Gaussian Bayesian inverse problem setting. This formula allows us to investigate the scaling of the χ²-divergence (and thereby the rate at which the sample size needs to grow) in several singular limit regimes, including small observation noise, large prior covariance and large dimension. Numerical examples motivate and complement the theoretical results. Large dimension and small noise singular limits were studied in [1] in a diagonal setting. The results here are generalized to a nondiagonal setting, and the presentation is simplified by using the closed formula in Proposition 4. Moreover, we include singular limits arising from large prior covariance. In an infinite dimensional setting, Corollary 1 establishes an equivalence between absolute continuity, finite χ²-divergence and finite intrinsic dimension. A similar result was proved in more generality in [1] using the advanced theory of Gaussian measures in Hilbert space [30]; our presentation and proof here are elementary, while still giving the same degree of understanding.
-
In Section 4 we follow [1,13,14,15,31] and investigate the use of importance sampling to approximate Bayes' rule within one filtering step in a linear-Gaussian setting. We build on the examples and results in Section 3 to identify model regimes where the performance of standard and optimal proposals can be dramatically different. We refer to [2,32] for an introduction to standard and optimal proposals for particle filtering and to [33] for a more advanced presentation. The main original contribution of this section is Theorem 2, which gives a direct comparison of the χ²-divergence between target and standard/optimal proposals. This result improves on [1], where only a comparison between the intrinsic dimension was established.
2. Importance Sampling and χ²-Divergence
The aim of this section is to demonstrate the central role of the χ²-divergence between target and proposal in determining the accuracy of importance sampling. In Section 2.1 we show how the χ²-divergence arises in both sufficient and necessary conditions on the sample size for accurate importance sampling with bounded test functions. Section 2.2 describes a well-known connection between the effective sample size and the χ²-divergence. Our investigation of importance sampling to approximate the Bayesian update—developed in Section 3 and Section 4—will make use of a closed formula for the χ²-divergence between Gaussians, which we include in Section 2.3 for later reference.
2.1. Sufficient and Necessary Sample Size
Here we provide general sufficient and necessary conditions on the sample size in terms of
$$\rho := d_{\chi^2}(\mu \,\|\, \pi) + 1.$$

We first review upper-bounds on the worst-case bias and mean-squared error of importance sampling with bounded test functions, which imply that accurate importance sampling is guaranteed if N ≫ ρ. The proof of the bound for the mean-squared error can be found in [1] and the bound for the bias in [2].
Proposition 1
(Sufficient Sample Size). It holds that
$$\sup_{|\varphi|_\infty \le 1} \bigl|\mathbb{E}\,\mu^N(\varphi) - \mu(\varphi)\bigr| \le \frac{4}{N}\rho, \qquad \sup_{|\varphi|_\infty \le 1} \mathbb{E}\bigl|\mu^N(\varphi) - \mu(\varphi)\bigr|^2 \le \frac{4}{N}\rho.$$
The next result shows the existence of bounded test functions for which the error may be large with a high probability if N ≪ ρ. The idea is taken from [7], but we adapt it here to obtain a result in terms of the χ²-divergence rather than relative entropy. We denote by ḡ := g/π(g) the normalized density between μ and π, and note that ρ = π(ḡ²) = μ(ḡ).
Proposition 2
(Necessary Sample Size). Let U ∼ μ. For any N ≥ 1 and α ∈ (0,1) there exists a test function φ with |φ|_∞ ≤ 1 such that

$$\mathbb{P}\Bigl(\bigl|\mu^N(\varphi) - \mu(\varphi)\bigr| = \mathbb{P}\bigl(\bar g(U) > \alpha\rho\bigr)\Bigr) \ge 1 - \frac{N}{\alpha\rho}.$$
Proof.
Observe that for the test function φ(u) := 1{ḡ(u) ≤ αρ}, we have μ(φ) = P(ḡ(U) ≤ αρ). On the other hand, μ^N(φ) = 1 if and only if ḡ(u^(n)) ≤ αρ for all 1 ≤ n ≤ N, in which case |μ^N(φ) − μ(φ)| = 1 − μ(φ) = P(ḡ(U) > αρ). This implies that

$$\mathbb{P}\Bigl(\bigl|\mu^N(\varphi) - \mu(\varphi)\bigr| = \mathbb{P}\bigl(\bar g(U) > \alpha\rho\bigr)\Bigr) \ge 1 - N\,\mathbb{P}\bigl(\bar g(u^{(1)}) > \alpha\rho\bigr) \ge 1 - \frac{N}{\alpha\rho},$$

where the first inequality follows from a union bound and the second from Markov's inequality, since π(ḡ) = 1.
□
The power of Proposition 2 is due to the fact that in some singular limit regimes the distribution of ḡ(U) concentrates around its expected value ρ. In such a case, for any fixed α ∈ (0,1) the probability of the event ḡ(U) > αρ will not vanish as the singular limit is approached. This idea will become clear in the proof of Theorem 1 below.
In Section 3 and Section 4 we will investigate the required sample size for importance sampling approximation of the Bayesian update in various singular limits, where target and proposal become further apart as a result of reducing the observation noise, increasing the prior uncertainty or increasing the dimension of the problem. To formalize the discussion in a general abstract setting, let {(μ_θ, π_θ)}_{θ>0} be a family of targets and proposals such that ρ_θ := d_{χ²}(μ_θ ‖ π_θ) + 1 → ∞ as θ → ∞. The parameter θ may represent for instance the size of the precision of the observation noise, the size of the prior covariance or a suitable notion of dimension. Our next result shows a clear dichotomy in the performance of importance sampling along the singular limit depending on whether the sample size grows sublinearly or superlinearly with ρ_θ.
Theorem 1.
Suppose that ρ_θ → ∞ and that

$$V := \sup_{\theta} \frac{\mathbb{V}\bigl[\bar g_\theta(U_\theta)\bigr]}{\rho_\theta^2} < 1,$$

where U_θ ∼ μ_θ and ḡ_θ denotes the normalized density between μ_θ and π_θ. Let δ > 0.
i
If N_θ = ρ_θ^{1+δ}, then

$$\lim_{\theta\to\infty}\,\sup_{|\varphi|_\infty \le 1} \mathbb{E}\bigl|\mu_\theta^{N_\theta}(\varphi) - \mu_\theta(\varphi)\bigr|^2 = 0.$$
ii
If N_θ = ρ_θ^{1−δ}, then there exists a fixed c ∈ (0,1) such that

$$\lim_{\theta\to\infty}\,\sup_{|\varphi|_\infty \le 1} \mathbb{P}\Bigl(\bigl|\mu_\theta^{N_\theta}(\varphi) - \mu_\theta(\varphi)\bigr| > c\Bigr) = 1.$$
Proof.
The proof of (i) follows directly from Proposition 1. For (ii) we fix α ∈ (0, 1−V) and c ∈ (0, 1 − V/(1−α)²). Let φ_θ(u) := 1{ḡ_θ(u) ≤ αρ_θ} as in the proof of Proposition 2. Then, since μ_θ(ḡ_θ) = ρ_θ, Chebyshev's inequality gives

$$\mathbb{P}\bigl(\bar g_\theta(U_\theta) > \alpha\rho_\theta\bigr) \ge 1 - \mathbb{P}\bigl(|\rho_\theta - \bar g_\theta(U_\theta)| \ge (1-\alpha)\rho_\theta\bigr) \ge 1 - \frac{\mathbb{V}\bigl[\bar g_\theta(U_\theta)\bigr]}{(1-\alpha)^2\rho_\theta^2} \ge 1 - \frac{V}{(1-\alpha)^2} > c.$$

The bound in (5) implies that

$$\mathbb{P}\Bigl(\bigl|\mu_\theta^{N_\theta}(\varphi_\theta) - \mu_\theta(\varphi_\theta)\bigr| > c\Bigr) \ge \mathbb{P}\Bigl(\bigl|\mu_\theta^{N_\theta}(\varphi_\theta) - \mu_\theta(\varphi_\theta)\bigr| = \mathbb{P}\bigl(\bar g_\theta(U_\theta) > \alpha\rho_\theta\bigr)\Bigr) \ge 1 - \frac{N_\theta}{\alpha\rho_\theta}.$$

This completes the proof, since if N_θ = ρ_θ^{1−δ} the right-hand side goes to 1 as θ → ∞. □
Remark 1.
As noted in [1], the bound in Proposition 1 is sharp in the asymptotic limit N → ∞. This implies that, for any fixed θ, the bound 4ρ_θ/N becomes sharp as N → ∞. We point out that this statement does not provide direct understanding of the joint limit θ, N_θ → ∞ analyzed in Theorem 1.
The assumption that V < 1 can be verified for some singular limits of interest, in particular for the small noise and large prior covariance limits studied in Section 3 and Section 4; details will be given in Example 1. While the assumption V < 1 may fail to hold in high dimensional singular limit regimes, the works [13,14] and our numerical example in Section 4.4 provide compelling evidence of the need to suitably scale N with ρ along those singular limits in order to avoid a weight-collapse phenomenon. Further theoretical evidence was given for un-normalized importance sampling in [8].
2.2.χ2-Divergence and Effective Sample Size
The previous subsection provides theoretical nonasymptotic and asymptotic evidence that a sample size larger than ρ is necessary and sufficient for accurate importance sampling. Here we recall a well-known connection between the χ²-divergence and the effective sample size

$$\mathrm{ESS} := \frac{1}{\sum_{n=1}^{N} \bigl(w^{(n)}\bigr)^2},$$

widely used by practitioners to monitor the performance of importance sampling. Note that always 1 ≤ ESS ≤ N; it is intuitive that ESS = 1 if the maximum weight is one and ESS = N if the maximum weight is 1/N. To see the connection between ESS and ρ, note that

$$\frac{\mathrm{ESS}}{N} = \frac{1}{N\sum_{n=1}^{N} \bigl(w^{(n)}\bigr)^2} = \frac{\bigl(\sum_{n=1}^{N} g(u^{(n)})\bigr)^2}{N\sum_{n=1}^{N} g(u^{(n)})^2} = \frac{\Bigl(\frac{1}{N}\sum_{n=1}^{N} g(u^{(n)})\Bigr)^2}{\frac{1}{N}\sum_{n=1}^{N} g(u^{(n)})^2} \approx \frac{\pi(g)^2}{\pi(g^2)}.$$

Therefore, ESS ≈ N/ρ: if the sample-based estimate of ρ is significantly larger than N, ESS will be small, which gives a warning sign that a larger sample size N may be needed.
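As an illustration, the weights from any importance sampling run can be turned into the ESS diagnostic and a sample-based estimate of ρ. The following minimal sketch reuses the hypothetical Gaussian prior and likelihood of the earlier snippet; it is not tied to any particular example in the paper.

```python
import numpy as np

def diagnostics(gu):
    """ESS and a sample-based estimate of rho = pi(g^2)/pi(g)^2, computed
    from un-normalized density evaluations gu = g(u^(n))."""
    w = gu / gu.sum()
    ess = 1.0 / np.sum(w ** 2)                        # effective sample size
    rho_hat = np.mean(gu ** 2) / np.mean(gu) ** 2     # estimate of rho
    return ess, rho_hat

rng = np.random.default_rng(1)
N, y, gamma = 5_000, 1.0, 0.5
u = rng.standard_normal(N)                            # proposal pi = N(0, 1)
gu = np.exp(-0.5 * ((y - u) / gamma) ** 2)            # hypothetical Gaussian likelihood
ess, rho_hat = diagnostics(gu)
print(ess, N / rho_hat)   # the two numbers are close: ESS is approximately N / rho
```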
2.3.χ2-Divergence between Gaussians
We conclude this section by recalling an analytical expression for the χ²-divergence between Gaussians. In order to make our presentation self-contained, we include a proof in Appendix A.
Proposition 3.
Let μ = N(m, C) and π = N(0, Σ). If 2Σ ≻ C, then

$$\rho = \frac{|\Sigma|}{\sqrt{|2\Sigma - C|\,|C|}}\,\exp\bigl(m'(2\Sigma - C)^{-1}m\bigr).$$

Otherwise, ρ = ∞.

It is important to note that nondegenerate Gaussians μ = N(m, C) and π = N(0, Σ) in R^d are always equivalent. However, ρ = ∞ unless 2Σ ≻ C. In Section 3 and Section 4 we will interpret μ as a posterior and π as a prior, in which case automatically Σ ≻ C (and hence 2Σ ≻ C) and ρ < ∞.
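The closed formula in Proposition 3 is easy to evaluate numerically. The following sketch, with arbitrary illustrative choices of m, C and Σ satisfying 2Σ ≻ C, compares it against a Monte Carlo estimate of π(ḡ²) = ρ.

```python
import numpy as np

def rho_gaussian(m, C, Sigma):
    """Closed-form rho = chi^2(N(m, C) || N(0, Sigma)) + 1 from Proposition 3,
    valid when 2*Sigma - C is positive definite."""
    B = 2 * Sigma - C
    det_term = np.linalg.det(Sigma) / np.sqrt(np.linalg.det(B) * np.linalg.det(C))
    return det_term * np.exp(m @ np.linalg.solve(B, m))

# Illustrative check against a Monte Carlo estimate of pi(gbar^2) with gbar = dmu/dpi.
rng = np.random.default_rng(2)
d = 3
Sigma = np.eye(d)
C = 0.5 * np.eye(d)                    # 2*Sigma - C is positive definite
m = np.array([0.3, -0.2, 0.1])

def log_pdf(X, mean, cov):
    # Vectorized Gaussian log-density, evaluated at the rows of X.
    diff = X - mean
    sol = np.linalg.solve(cov, diff.T).T
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (np.sum(diff * sol, axis=1) + logdet + d * np.log(2 * np.pi))

u = rng.multivariate_normal(np.zeros(d), Sigma, size=200_000)   # samples from pi
log_gbar = log_pdf(u, m, C) - log_pdf(u, np.zeros(d), Sigma)
print(rho_gaussian(m, C, Sigma), np.mean(np.exp(2 * log_gbar)))  # the two agree
```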
3. Importance Sampling for Inverse Problems
In this section we study the use of importance sampling in a linear Bayesian inverse problem setting where the target and the proposal represent, respectively, the posterior and the prior distribution. In Section 3.1 we describe our setting and we also derive an explicit formula for the χ²-divergence between the posterior and the prior. This explicit formula allows us to determine the scaling of the χ²-divergence in small noise regimes (Section 3.2), in the limit of large prior covariance (Section 3.3) and in a high dimensional limit (Section 3.4). Our overarching goal is to show how the sample size for importance sampling needs to grow along these limiting regimes in order to maintain the same level of accuracy.
3.1. Inverse Problem Setting andχ2-Divergence between Posterior and Prior
Let A ∈ R^{k×d} be a given design matrix and consider the linear inverse problem of recovering u ∈ R^d from data y ∈ R^k related by

$$y = Au + \eta, \qquad \eta \sim N(0, \Gamma),$$

where η represents measurement noise. We assume henceforth that we are in the underdetermined case k ≤ d, and that A is full rank. We follow a Bayesian perspective and set a Gaussian prior on u, u ∼ π = N(0, Σ). We assume throughout that Σ and Γ are given symmetric positive definite matrices. The solution to the Bayesian formulation of the inverse problem is the posterior distribution μ of u given y. We are interested in studying the performance of importance sampling with proposal π (the prior) and target μ (the posterior). We recall that under this linear-Gaussian model the posterior distribution is Gaussian [2], and we denote it by μ = N(m, C). In order to characterize the posterior mean m and covariance C, we introduce standard data assimilation notation

$$S := A\Sigma A' + \Gamma, \qquad K := \Sigma A' S^{-1},$$
where K is the Kalman gain. Then we have
$$m = Ky, \qquad C = (I - KA)\Sigma.$$
Proposition 3 allows us to obtain a closed formula for the quantity ρ = d_{χ²}(μ ‖ π) + 1, noting that (10) implies that

$$2\Sigma - C = (I + KA)\Sigma = \Sigma + \Sigma A' S^{-1} A\Sigma \succ 0.$$
The proof of the following result is then immediate and therefore omitted.
Proposition 4.
Consider the inverse problem (9) with prior u ∼ π = N(0, Σ) and posterior μ = N(m, C) with m and C defined in (10). Then ρ = d_{χ²}(μ ‖ π) + 1 admits the explicit characterization

$$\rho = \bigl(|I + KA|\,|I - KA|\bigr)^{-\frac{1}{2}}\exp\Bigl(y' K'\bigl[(I + KA)\Sigma\bigr]^{-1}Ky\Bigr).$$
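The characterization can be evaluated directly from the problem data (A, Γ, Σ) and the observation y. Below is a small sketch with arbitrary illustrative matrices; the specific values are assumptions for demonstration only.

```python
import numpy as np

def rho_inverse_problem(A, Gamma, Sigma, y):
    """rho = chi^2(posterior || prior) + 1 for y = A u + eta, eta ~ N(0, Gamma),
    u ~ N(0, Sigma), evaluated via the closed formula of Proposition 4."""
    d = Sigma.shape[0]
    S = A @ Sigma @ A.T + Gamma
    K = Sigma @ A.T @ np.linalg.inv(S)      # Kalman gain
    KA = K @ A
    det_term = (np.linalg.det(np.eye(d) + KA) * np.linalg.det(np.eye(d) - KA)) ** -0.5
    quad = y @ K.T @ np.linalg.solve((np.eye(d) + KA) @ Sigma, K @ y)
    return det_term * np.exp(quad)

# Illustrative example: k = 2 observations of a d = 3 dimensional parameter.
rng = np.random.default_rng(3)
A = rng.standard_normal((2, 3))
Gamma, Sigma = 0.1 * np.eye(2), np.eye(3)
y = rng.standard_normal(2)
print(rho_inverse_problem(A, Gamma, Sigma, y))
```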
In the following two subsections we employ this result to derive by direct calculation the rate at which the posterior and prior become further apart—in χ²-divergence—in small noise and large prior regimes. To carry out the analysis we use parameters γ², σ² > 0 to scale the noise covariance, γ²Γ, and the prior covariance, σ²Σ.
3.2. Importance Sampling in Small Noise Regime
To illustrate the behavior of importance sampling in small noise regimes, we first introduce a motivating numerical study. A similar numerical setup was used in [13] to demonstrate the curse of dimension of importance sampling. We consider the inverse problem setting in Equation (9) with d = k = 5 and noise covariance γ²Γ. We conduct 18 numerical experiments with a fixed data y. For each experiment, we perform importance sampling 400 times and report in Figure 1 a histogram with the largest autonormalized weight in each of the 400 realizations. The 18 experiments differ in the sample size N and the size of the observation noise γ². In both Figure 1a,b we consider three choices of N (rows) and three choices of γ² (columns). These choices are made so that in Figure 1a it holds that N = γ^{−4} along the bottom-left to top-right diagonal, while in Figure 1b N = γ^{−6} along the same diagonal.

We can see from Figure 1a that N = γ^{−4} is not a fast enough growth of N to avoid weight collapse: the histograms skew to the right along the bottom-left to top-right diagonal, suggesting that weight collapse (i.e., one weight dominating the rest, and therefore the variance of the weights being large) is bound to occur in the joint limit N → ∞, γ → 0 with N = γ^{−4}. In contrast, the histograms in Figure 1b skew to the left along the same diagonal, suggesting that the probability of weight collapse is significantly reduced if N = γ^{−6}. We observe a similar behavior with other choices of dimension d by conducting experiments with sample sizes N = γ^{−d+1} and N = γ^{−d−1}, and we include the histograms with d = k = 4 in Appendix C. Our next result shows that these empirical findings are in agreement with the scaling of the χ²-divergence between target and proposal in the small noise limit.
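A minimal sketch of one cell of such an experiment is given below. The identity design matrix and data vector are illustrative assumptions and do not reproduce the exact configuration behind Figure 1; the snippet only records the largest autonormalized weight over repeated importance sampling runs.

```python
import numpy as np

def max_weights(gamma, N, d=5, reps=400, seed=0):
    """For y = u + eta, eta ~ N(0, gamma^2 I), u ~ N(0, I), return the largest
    autonormalized weight observed in each of `reps` importance sampling runs."""
    rng = np.random.default_rng(seed)
    y = np.ones(d)                          # illustrative fixed data vector
    out = np.empty(reps)
    for r in range(reps):
        u = rng.standard_normal((N, d))                          # prior samples
        logg = -0.5 * np.sum((y - u) ** 2, axis=1) / gamma ** 2  # log-likelihoods
        w = np.exp(logg - logg.max())
        out[r] = w.max() / w.sum()
    return out

for gamma in (0.5, 0.25, 0.125):
    print(gamma, np.median(max_weights(gamma, N=round(gamma ** -4))))
# In line with Figure 1, growing N like gamma^(-4) does not prevent the largest
# weight from drifting towards one as gamma decreases; N = gamma^(-6) (not run
# here, as it is costly) keeps the largest weight small.
```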
Proposition 5.
Consider the inverse problem setting
$$y = Au + \eta, \qquad \eta \sim N(0, \gamma^2\Gamma), \qquad u \sim \pi = N(0, \Sigma).$$

Let μ_γ denote the posterior and let ρ_γ = d_{χ²}(μ_γ ‖ π) + 1. Then, for almost every y,

$$\rho_\gamma \sim O(\gamma^{-k})$$

in the small noise limit γ → 0.
Proof.
Let K_γ = ΣA'(AΣA' + γ²Γ)^{−1} denote the Kalman gain. We observe that K_γ → ΣA'(AΣA')^{−1} under our standing assumption that A is full rank. Let U'ΞV be the singular value decomposition of Γ^{−1/2}AΣ^{1/2} and {ξ_i}_{i=1}^{k} be the singular values. Then we have

$$K_\gamma A \sim \Sigma^{\frac12}A'\Gamma^{-\frac12}\bigl(\Gamma^{-\frac12}A\Sigma A'\Gamma^{-\frac12}+\gamma^2 I\bigr)^{-1}\Gamma^{-\frac12}A\Sigma^{\frac12} = V'\Xi' U\bigl(U'\Xi V V'\Xi' U+\gamma^2 I\bigr)^{-1}U'\Xi V \sim \Xi'\bigl(\Xi\Xi'+\gamma^2 I\bigr)^{-1}\Xi,$$

where here "∼" denotes matrix similarity. It follows that I + K_γ A converges to a finite limit, and so does the exponent y' K_γ' Σ^{−1}(I + K_γ A)^{−1}K_γ y in Proposition 4. On the other hand,

$$\bigl(|I + K_\gamma A|\,|I - K_\gamma A|\bigr)^{-\frac{1}{2}} \asymp \prod_{i=1}^{k}\Bigl(\frac{\gamma^2}{\xi_i^2+\gamma^2}\Bigr)^{-\frac{1}{2}} \sim O(\gamma^{-k})$$

as γ → 0, since |I + K_γ A| converges to a finite nonzero limit. The conclusion follows. □
Remark 2.
The scaling of ρ with γ² obtained in Proposition 5 agrees with the lower bound reported in Table 1 in [1], which was derived in a diagonalized setting.
3.3. Importance Sampling and Prior Scaling
Here we illustrate the behavior of importance sampling in the limit of large prior covariance. We start again with a motivating numerical example, similar to the one reported in Figure 1. The behavior is analogous to the small noise regime, which is expected since the ratio of prior and noise covariances determines the closeness between target and proposal. Figure 2 shows that when d = k = 5 weight collapse is observed frequently when the sample size N grows as σ⁴, but not so often with sample size N = σ⁶. Similar histograms with d = k = 4 are included in Appendix C. These empirical results are in agreement with the theoretical growth rate of the χ²-divergence between target and proposal in the limit of large prior covariance, as we prove next.
Proposition 6.
Consider the inverse problem setting
$$y = Au + \eta, \qquad \eta \sim N(0, \Gamma), \qquad u \sim \pi_\sigma = N(0, \sigma^2\Sigma).$$

Let μ_σ denote the posterior and ρ_σ = d_{χ²}(μ_σ ‖ π_σ) + 1. Then, for almost every y,

$$\rho_\sigma \sim O(\sigma^{k})$$

in the large prior limit σ → ∞.
Proof.
Let Σ_σ = σ²Σ and let K_σ = Σ_σ A'(AΣ_σ A' + Γ)^{−1} be the Kalman gain. Observing that K_σ = K_γ with γ = 1/σ, where K_γ is the Kalman gain in the proof of Proposition 5, we apply Proposition 5 and deduce that when σ → ∞:
-
K_σ → ΣA'(AΣA')^{−1};
-
I + K_σ A has a well-defined and invertible limit;
-
|I − K_σ A|^{−1/2} ∼ O(σ^k).
On the other hand, we notice that the quadratic term

$$K_\sigma'\,\Sigma_\sigma^{-1}\bigl(I + K_\sigma A\bigr)^{-1}K_\sigma = \sigma^{-2}\,K_\sigma'\,\Sigma^{-1}\bigl(I + K_\sigma A\bigr)^{-1}K_\sigma$$

vanishes in the limit. The conclusion follows by Proposition 4. □
3.4. Importance Sampling in High Dimension
In this subsection we study importance sampling in high dimensional limits. To that end, we let {a_i}_{i=1}^∞, {γ_i²}_{i=1}^∞ and {σ_i²}_{i=1}^∞ be infinite sequences and we define, for any d ≥ 1,

$$A_{1:d} := \mathrm{diag}\bigl(a_1,\ldots,a_d\bigr) \in \mathbb{R}^{d\times d}, \qquad \Gamma_{1:d} := \mathrm{diag}\bigl(\gamma_1^2,\ldots,\gamma_d^2\bigr) \in \mathbb{R}^{d\times d}, \qquad \Sigma_{1:d} := \mathrm{diag}\bigl(\sigma_1^2,\ldots,\sigma_d^2\bigr) \in \mathbb{R}^{d\times d}.$$

We then consider the inverse problem of reconstructing u ∈ R^d from data y ∈ R^d under the setting

$$y = A_{1:d}\,u + \eta, \qquad \eta \sim N(0, \Gamma_{1:d}), \qquad u \sim \pi_{1:d} = N(0, \Sigma_{1:d}).$$

We denote the corresponding posterior distribution by μ_{1:d}, which is Gaussian with a diagonal covariance. Given observation y, we may find the posterior distribution μ_i of u_i by solving the one dimensional linear-Gaussian inverse problem

$$y_i = a_i u_i + \eta_i, \qquad \eta_i \sim N(0, \gamma_i^2), \qquad 1 \le i \le d,$$

with prior π_i = N(0, σ_i²). In this way we have defined, for each d ∈ N ∪ {∞}, an inverse problem with prior and posterior

$$\pi_{1:d} = \prod_{i=1}^{d}\pi_i, \qquad \mu_{1:d} = \prod_{i=1}^{d}\mu_i.$$
In Section 3.4.1 we include an explicit calculation in the one dimensional inverse setting (12), which will be used in Section 3.4.2 to establish the rate of growth of ρ_d = d_{χ²}(μ_{1:d} ‖ π_{1:d}) + 1 and thereby how the sample size needs to be scaled along the high dimensional limit d → ∞ to maintain the same accuracy. Finally, in Section 3.4.3 we establish from first principles and our simple one dimensional calculation the equivalence between (i) a certain notion of dimension being finite; (ii) ρ_∞ < ∞; and (iii) absolute continuity of μ_{1:∞} with respect to π_{1:∞}.
3.4.1. One Dimensional Setting
Let a ∈ R be given and consider the one dimensional inverse problem of reconstructing u ∈ R from data y ∈ R, under the setting

$$y = au + \eta, \qquad \eta \sim N(0, \gamma^2), \qquad u \sim \pi = N(0, \sigma^2).$$
By defining
$$g(u) := \exp\Bigl(-\frac{a^2}{2\gamma^2}u^2 + \frac{ay}{\gamma^2}u\Bigr),$$

we can write the posterior μ(du) as μ(du) ∝ g(u)π(du). The next result gives a simplified closed formula for ρ = d_{χ²}(μ ‖ π) + 1. In addition, it gives a closed formula for the Hellinger integral

$$H(\mu,\pi) := \frac{\pi\bigl(g^{\frac{1}{2}}\bigr)}{\pi(g)^{\frac{1}{2}}},$$

which will facilitate the study of the case d = ∞ in Section 3.4.3.
Lemma 1.
Consider the inverse problem in (14). Let λ := a²σ²/γ² and z² := y²/(a²σ² + γ²). Then, for any ℓ > 0,

$$\frac{\pi(g^{\ell})}{\pi(g)^{\ell}} = \frac{(\lambda+1)^{\frac{\ell}{2}}}{\sqrt{\ell\lambda+1}}\exp\Bigl(\frac{(\ell^2-\ell)\lambda}{2(\ell\lambda+1)}z^2\Bigr).$$

In particular,

$$\rho = \frac{\lambda+1}{\sqrt{2\lambda+1}}\exp\Bigl(\frac{\lambda}{2\lambda+1}z^2\Bigr),$$

$$H(\mu,\pi) = \frac{\sqrt{2}\,(\lambda+1)^{\frac{1}{4}}}{\sqrt{\lambda+2}}\exp\Bigl(-\frac{\lambda z^2}{4(\lambda+2)}\Bigr).$$
Proof.
A direct calculation shows that
$$\pi(g) = \frac{1}{\sqrt{\lambda+1}}\exp\Bigl(\frac{1}{2}\,\frac{\lambda y^2}{a^2\sigma^2+\gamma^2}\Bigr).$$

The same calculation, but replacing γ² by γ²/ℓ and λ by ℓλ, gives a similar expression for π(g^ℓ), which leads to (15). The other two equations follow by setting ℓ to be 2 and 1/2. □
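The one-dimensional formulas are easy to sanity-check by numerical quadrature; the following sketch uses arbitrary illustrative values of a, σ, γ and y.

```python
import numpy as np

a, sigma, gamma, y = 1.5, 2.0, 0.7, 1.0
lam = a**2 * sigma**2 / gamma**2
z2 = y**2 / (a**2 * sigma**2 + gamma**2)

# Closed formulas from Lemma 1.
rho_formula = (lam + 1) / np.sqrt(2 * lam + 1) * np.exp(lam * z2 / (2 * lam + 1))
H_formula = (np.sqrt(2) * (lam + 1) ** 0.25 / np.sqrt(lam + 2)
             * np.exp(-lam * z2 / (4 * (lam + 2))))

# Quadrature check of pi(g^l) with pi = N(0, sigma^2) and g the likelihood factor.
u = np.linspace(-20 * sigma, 20 * sigma, 400_001)
du = u[1] - u[0]
pi_pdf = np.exp(-0.5 * (u / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
g = np.exp(-0.5 * a**2 * u**2 / gamma**2 + a * y * u / gamma**2)
moment = lambda l: np.sum(pi_pdf * g**l) * du

print(rho_formula, moment(2) / moment(1) ** 2)        # rho = pi(g^2) / pi(g)^2
print(H_formula, moment(0.5) / np.sqrt(moment(1)))    # Hellinger integral
```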
Lemma 1 will be used in the two following subsections to study high dimensional limits. Here we show how this lemma also allows us to verify directly that the assumption V < 1 in Theorem 1 holds in small noise and large prior limits.
Example 1.
Consider a sequence of inverse problems of the form (14) with λ = a²σ²/γ² approaching infinity. Let {(μ_λ, π_λ)}_{λ>0} be the corresponding family of posteriors and priors and let ḡ_λ be the normalized density. Lemma 1 implies that

$$\frac{\pi_\lambda(\bar g_\lambda^3)}{\pi_\lambda(\bar g_\lambda^2)^2} = \frac{2\lambda+1}{\sqrt{(3\lambda+1)(\lambda+1)}}\exp\Bigl(\frac{\lambda}{(2\lambda+1)(3\lambda+1)}z^2\Bigr) \to \frac{2}{\sqrt{3}} < 2,$$

as λ → ∞. This implies that, for λ sufficiently large,

$$\frac{\mathbb{V}\bigl[\bar g_\lambda(U_\lambda)\bigr]}{\rho_\lambda^2} = \frac{\pi_\lambda(\bar g_\lambda^3)}{\pi_\lambda(\bar g_\lambda^2)^2} - 1 < 1.$$
3.4.2. Large Dimensional Limit
Now we investigate the behavior of importance sampling in the limit of large dimension, in the inverse problem setting (11). We start with an example similar to the ones in Figure 1 and Figure 2. Figure 3 shows that for λ = 1.3 fixed, weight collapse happens frequently when the sample size N grows polynomially as d², but not so often if N grows at rate $O\bigl(\prod_{i=1}^{d}\frac{\lambda+1}{\sqrt{2\lambda+1}}e^{\frac{\lambda z_i^2}{2\lambda+1}}\bigr)$. These empirical results are in agreement with the growth rate of ρ_d in the large d limit.
Proposition 7.
For any d ∈ N ∪ {∞},

$$\rho_d = \prod_{i=1}^{d}\frac{\lambda_i+1}{\sqrt{2\lambda_i+1}}\,e^{\frac{\lambda_i z_i^2}{2\lambda_i+1}}, \qquad \mathbb{E}_{z_{1:d}}\,\rho_d = \prod_{i=1}^{d}(\lambda_i+1).$$
Proof.
The formula forρd is a direct consequence of Equation (16) and the product structure. Similarly, we have
$$\mathbb{E}_{z_i}\Bigl[\frac{\lambda_i+1}{\sqrt{2\lambda_i+1}}\,e^{\frac{\lambda_i z_i^2}{2\lambda_i+1}}\Bigr] = \frac{\lambda_i+1}{\sqrt{2\lambda_i+1}}\int_{\mathbb{R}}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{z_i^2}{2}+\frac{\lambda_i z_i^2}{2\lambda_i+1}}\,dz_i = \frac{\lambda_i+1}{\sqrt{2\lambda_i+1}}\int_{\mathbb{R}}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{z_i^2}{2(2\lambda_i+1)}}\,dz_i = \lambda_i+1.$$
□
Proposition 7 implies that, for d ∈ N ∪ {∞},

$$\sup_{|\varphi|_\infty\le 1}\mathbb{E}\bigl|\mu_{1:d}^{N}(\varphi)-\mu_{1:d}(\varphi)\bigr|^2 \le \frac{4}{N}\prod_{i=1}^{d}\frac{\lambda_i+1}{\sqrt{2\lambda_i+1}}\,e^{\frac{\lambda_i z_i^2}{2\lambda_i+1}}, \qquad \mathbb{E}\,\sup_{|\varphi|_\infty\le 1}\mathbb{E}\bigl|\mu_{1:d}^{N}(\varphi)-\mu_{1:d}(\varphi)\bigr|^2 \le \frac{4}{N}\prod_{i=1}^{d}(\lambda_i+1).$$
Note that the outer expected value in the latter equation averages over the data, while the inner one averages over sampling from the prior π_{1:d}. This suggests that

$$\log\,\mathbb{E}\,\sup_{|\varphi|_\infty\le 1}\mathbb{E}\bigl|\mu_{1:d}^{N}(\varphi)-\mu_{1:d}(\varphi)\bigr|^2 \lesssim \sum_{i=1}^{d}\lambda_i - \log N.$$

The quantity τ := Σ_{i=1}^{d} λ_i has been used as an intrinsic dimension of the inverse problem (11). This simple heuristic together with Theorem 1 suggests that increasing N exponentially with τ is both necessary and sufficient to maintain accurate importance sampling along the high dimensional limit d → ∞. In particular, if all coordinates of the problem play the same role, this implies that N needs to grow exponentially with d, a manifestation of the curse of dimension of importance sampling [1,13,14].
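The heuristic is easy to explore numerically. The sketch below uses two assumed sequences of λ_i (one flat, one summable) and computes τ together with the data-averaged quantity E_z ρ_d = ∏(λ_i + 1) from Proposition 7.

```python
import numpy as np

def tau_and_averaged_rho(lam):
    """Intrinsic dimension tau = sum(lambda_i) and E_z rho_d = prod(1 + lambda_i)."""
    return np.sum(lam), np.prod(1.0 + lam)

d = 30
flat = np.full(d, 1.3)                        # all coordinates equally informative
decaying = 1.3 / (1.0 + np.arange(d)) ** 2    # only a few informative coordinates

for name, lam in [("flat", flat), ("decaying", decaying)]:
    tau, avg_rho = tau_and_averaged_rho(lam)
    print(name, tau, avg_rho)
# For the flat sequence the averaged rho (and hence the required sample size)
# grows exponentially with d; for the summable sequence it stays bounded.
```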
3.4.3. Infinite Dimensional Singularity
Finally, we investigate the case d = ∞. Our goal in this subsection is to establish a connection between the effective dimension, the quantity ρ_∞ and absolute continuity. The main result, Corollary 1, was proved in more generality in [1]. However, our proof and presentation here require minimal technical background and are based on the explicit calculations obtained in the previous subsections and in the following lemma.
Lemma 2.
It holds that μ_{1:∞} is absolutely continuous with respect to π_{1:∞} if and only if

$$H(\mu_{1:\infty},\pi_{1:\infty}) = \prod_{i=1}^{\infty}\frac{\pi_i\bigl(g_i^{\frac{1}{2}}\bigr)}{\pi_i(g_i)^{\frac{1}{2}}} > 0,$$

where g_i is an un-normalized density between μ_i and π_i. Moreover, we have the following explicit characterizations of the Hellinger integral H(μ_{1:∞}, π_{1:∞}) and its average with respect to data realizations,

$$H(\mu_{1:\infty},\pi_{1:\infty}) = \prod_{i=1}^{\infty}\frac{\sqrt{2}\,(\lambda_i+1)^{\frac{1}{4}}}{\sqrt{\lambda_i+2}}\,e^{-\frac{\lambda_i z_i^2}{4(\lambda_i+2)}}, \qquad \mathbb{E}_{z_{1:\infty}}\,H(\mu_{1:\infty},\pi_{1:\infty}) = \prod_{i=1}^{\infty}\frac{2\,(\lambda_i+1)^{\frac{1}{4}}}{\sqrt{3\lambda_i+4}}.$$
Proof.
The formula for the Hellinger integral is a direct consequence of Equation (17) and the product structure. On the other hand,
$$\mathbb{E}_{z_i}\Bigl[\frac{\sqrt{2}\,(\lambda_i+1)^{\frac{1}{4}}}{\sqrt{\lambda_i+2}}\,e^{-\frac{\lambda_i z_i^2}{4(\lambda_i+2)}}\Bigr] = \frac{\sqrt{2}\,(\lambda_i+1)^{\frac{1}{4}}}{\sqrt{\lambda_i+2}}\int_{\mathbb{R}}\frac{1}{\sqrt{2\pi}}\,e^{-\frac{\lambda_i z_i^2}{4(\lambda_i+2)}-\frac{z_i^2}{2}}\,dz_i = \frac{2\,(\lambda_i+1)^{\frac{1}{4}}}{\sqrt{3\lambda_i+4}}.$$
The proof of the equivalence between finite Hellinger integral and absolute continuity is given in Appendix B. □
Corollary 1. The following statements are equivalent:
i
τ = Σ_{i=1}^∞ λ_i < ∞;
ii
ρ_∞ < ∞ for almost every y;
iii
μ_{1:∞} ≪ π_{1:∞} for almost every y.
Proof.
Observe that λ_i → 0 is a direct consequence of each of the three statements, so we will assume λ_i → 0 from now on.
(i) ⇔ (ii): By Proposition 7,

$$\log\,\mathbb{E}_{z_{1:\infty}}\rho_\infty = \sum_{i=1}^{\infty}\log(1+\lambda_i) = O\Bigl(\sum_{i=1}^{\infty}\lambda_i\Bigr),$$

since log(1 + λ_i) ≈ λ_i for large i.
(i) ⇔ (iii): Similarly, we have

$$\log\,\mathbb{E}_{z_{1:\infty}}H(\mu_{1:\infty},\pi_{1:\infty}) = -\frac{1}{4}\sum_{i=1}^{\infty}\log\frac{(3\lambda_i+4)^2}{16(\lambda_i+1)} = -\frac{1}{4}\sum_{i=1}^{\infty}\log\Bigl(1+\frac{9\lambda_i^2+8\lambda_i}{16\lambda_i+16}\Bigr) = -\frac{1}{4}\,O\Bigl(\sum_{i=1}^{\infty}\lambda_i\Bigr).$$
The conclusion follows from Lemma 2. □
4. Importance Sampling for Data Assimilation
In this section, we study the use of importance sampling in a particle filtering setting. Following [13,14,15] we focus on one filtering step. Our goal is to provide a new and concrete comparison of two proposals, referred to as standard and optimal in the literature [1]. In Section 4.1 we introduce the setting and both proposals and show that the χ²-divergence between target and standard proposal is larger than the χ²-divergence between target and optimal proposal. Section 4.3 and Section 4.4 identify small noise and large dimensional limiting regimes where the sample size for the standard proposal needs to grow unboundedly to maintain the same level of accuracy, but the required sample size for the optimal proposal remains bounded.
4.1. One-Step Filtering Setting
Let M and H be given matrices. We consider the one-step filtering problem of recoveringv0,v1from y, under the following setting
$$v_1 = Mv_0 + \xi, \qquad v_0 \sim N(0, P), \qquad \xi \sim N(0, Q),$$
$$y = Hv_1 + \zeta, \qquad \zeta \sim N(0, R).$$
Similar to the setting in Section 3.1, we assume that P, Q, R are symmetric positive definite and that M and H are full rank. From a Bayesian point of view, we would like to sample from the target distribution P(v_0, v_1 | y). To achieve this, we can either use π_std = P(v_1 | v_0) P(v_0) or π_opt = P(v_1 | v_0, y) P(v_0) as the proposal distribution.
The standard proposal π_std is the prior distribution of (v_0, v_1) determined by the prior v_0 ∼ N(0, P) and the signal dynamics encoded in Equation (19). Then assimilating the observation y leads to an inverse problem [1,2] with design matrix, noise covariance and prior covariance given by

$$A_{\mathrm{std}} := H, \qquad \Gamma_{\mathrm{std}} := R, \qquad \Sigma_{\mathrm{std}} := MPM' + Q.$$

We denote by π_std = N(0, Σ_std) the prior distribution of this inverse problem and by μ_std the corresponding posterior distribution.
The optimal proposal π_opt samples from v_0 and the conditional kernel v_1 | v_0, y. Then assimilating y leads to the inverse problem [1,2]

$$y = HMv_0 + H\xi + \zeta,$$
where the design matrix, noise covariance and prior covariance are given by
$$A_{\mathrm{opt}} := HM, \qquad \Gamma_{\mathrm{opt}} := HQH' + R, \qquad \Sigma_{\mathrm{opt}} := P.$$

We denote by π_opt = N(0, Σ_opt) the prior distribution of this inverse problem and by μ_opt the corresponding posterior distribution.
4.2.χ2-Divergence Comparison between Standard and Optimal Proposal
Here we show that
$$\rho_{\mathrm{std}} := d_{\chi^2}(\mu_{\mathrm{std}} \,\|\, \pi_{\mathrm{std}}) + 1 > d_{\chi^2}(\mu_{\mathrm{opt}} \,\|\, \pi_{\mathrm{opt}}) + 1 =: \rho_{\mathrm{opt}}.$$
The proof is a direct calculation using the explicit formula in Proposition 4. We introduce, as in Section 3, standard Kalman notation
$$K_{\mathrm{std}} := \Sigma_{\mathrm{std}} A_{\mathrm{std}}' S_{\mathrm{std}}^{-1}, \qquad S_{\mathrm{std}} := A_{\mathrm{std}}\Sigma_{\mathrm{std}}A_{\mathrm{std}}' + \Gamma_{\mathrm{std}}, \qquad K_{\mathrm{opt}} := \Sigma_{\mathrm{opt}} A_{\mathrm{opt}}' S_{\mathrm{opt}}^{-1}, \qquad S_{\mathrm{opt}} := A_{\mathrm{opt}}\Sigma_{\mathrm{opt}}A_{\mathrm{opt}}' + \Gamma_{\mathrm{opt}}.$$
It follows from the definitions in (21) and (22) that
$$S_{\mathrm{std}} = H(MPM' + Q)H' + R = HMPM'H' + HQH' + R = S_{\mathrm{opt}}.$$

Since S_std = S_opt we drop the subscripts in what follows and denote both simply by S.
Theorem 2.
Consider the one-step filtering setting in Equations (19) and (20). If M and H are full rank and P, Q, R are symmetric positive definite, then, for almost every y,

$$\rho_{\mathrm{std}} > \rho_{\mathrm{opt}}.$$
Proof.
By Proposition 4 we have
$$\rho_{\mathrm{std}} = \bigl(|I - K_{\mathrm{std}}A_{\mathrm{std}}|\,|I + K_{\mathrm{std}}A_{\mathrm{std}}|\bigr)^{-\frac{1}{2}}\exp\Bigl(y' K_{\mathrm{std}}'\bigl[(I + K_{\mathrm{std}}A_{\mathrm{std}})\Sigma_{\mathrm{std}}\bigr]^{-1}K_{\mathrm{std}}\,y\Bigr),$$
$$\rho_{\mathrm{opt}} = \bigl(|I - K_{\mathrm{opt}}A_{\mathrm{opt}}|\,|I + K_{\mathrm{opt}}A_{\mathrm{opt}}|\bigr)^{-\frac{1}{2}}\exp\Bigl(y' K_{\mathrm{opt}}'\bigl[(I + K_{\mathrm{opt}}A_{\mathrm{opt}})\Sigma_{\mathrm{opt}}\bigr]^{-1}K_{\mathrm{opt}}\,y\Bigr).$$
Therefore, it suffices to prove the following two inequalities:
$$|I - K_{\mathrm{std}}A_{\mathrm{std}}|\,|I + K_{\mathrm{std}}A_{\mathrm{std}}| < |I - K_{\mathrm{opt}}A_{\mathrm{opt}}|\,|I + K_{\mathrm{opt}}A_{\mathrm{opt}}|,$$
$$K_{\mathrm{std}}'\bigl[(I + K_{\mathrm{std}}A_{\mathrm{std}})\Sigma_{\mathrm{std}}\bigr]^{-1}K_{\mathrm{std}} \succ K_{\mathrm{opt}}'\bigl[(I + K_{\mathrm{opt}}A_{\mathrm{opt}})\Sigma_{\mathrm{opt}}\bigr]^{-1}K_{\mathrm{opt}}.$$
We start with inequality (24). Note that
$$(I + K_{\mathrm{std}}A_{\mathrm{std}})\Sigma_{\mathrm{std}} = \Sigma_{\mathrm{std}} + \Sigma_{\mathrm{std}}A_{\mathrm{std}}'S^{-1}A_{\mathrm{std}}\Sigma_{\mathrm{std}}, \qquad (I + K_{\mathrm{opt}}A_{\mathrm{opt}})\Sigma_{\mathrm{opt}} = \Sigma_{\mathrm{opt}} + \Sigma_{\mathrm{opt}}A_{\mathrm{opt}}'S^{-1}A_{\mathrm{opt}}\Sigma_{\mathrm{opt}}.$$
Using the definitions in (21) and (22) it follows that
$$K_{\mathrm{std}}'\,\Sigma_{\mathrm{std}}^{-1}\bigl(I + K_{\mathrm{std}}A_{\mathrm{std}}\bigr)^{-1}K_{\mathrm{std}} = S^{-1}H\bigl[(MPM'+Q)^{-1}+H'S^{-1}H\bigr]^{-1}H'S^{-1} \succ S^{-1}H\bigl[(MPM')^{-1}+H'S^{-1}H\bigr]^{-1}H'S^{-1} = K_{\mathrm{opt}}'\,\Sigma_{\mathrm{opt}}^{-1}\bigl(I + K_{\mathrm{opt}}A_{\mathrm{opt}}\bigr)^{-1}K_{\mathrm{opt}}.$$
For inequality (23), we notice that
$$K_{\mathrm{std}}A_{\mathrm{std}} = (MPM'+Q)H'S^{-1}H = M\widetilde{P}M'H'S^{-1}H \sim \bigl(H'S^{-1}H\bigr)^{\frac12}M\widetilde{P}M'\bigl(H'S^{-1}H\bigr)^{\frac12},$$
$$K_{\mathrm{opt}}A_{\mathrm{opt}} = PM'H'S^{-1}HM \sim \bigl(H'S^{-1}H\bigr)^{\frac12}MPM'\bigl(H'S^{-1}H\bigr)^{\frac12},$$

where P̃ := P + M†Q(M')†. Therefore

$$K_{\mathrm{opt}}A_{\mathrm{opt}} \prec K_{\mathrm{std}}A_{\mathrm{std}},$$

which, together with K_std A_std ≺ I, implies that

$$|I - K_{\mathrm{opt}}A_{\mathrm{opt}}|\,|I + K_{\mathrm{opt}}A_{\mathrm{opt}}| - |I - K_{\mathrm{std}}A_{\mathrm{std}}|\,|I + K_{\mathrm{std}}A_{\mathrm{std}}| = \bigl|I - (K_{\mathrm{opt}}A_{\mathrm{opt}})^2\bigr| - \bigl|I - (K_{\mathrm{std}}A_{\mathrm{std}})^2\bigr| > 0,$$
as desired. □
Remark 3.
It is well known that if the signal dynamics are deterministic, i.e., if Q = 0 in (19), then the standard and optimal proposals agree, and therefore ρ_opt = ρ_std. Theorem 2 shows that ρ_std > ρ_opt provided that Q is positive definite. Further works that have investigated theoretical and practical benefits of the optimal proposal over the standard proposal include [1,2,31,34]. In particular, [1] shows that use of the optimal proposal reduces the intrinsic dimension. Theorem 2 compares directly the χ²-divergence, which is the key quantity that determines the performance of importance sampling.
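The comparison in Theorem 2 can also be checked numerically by evaluating ρ_std and ρ_opt through Proposition 3 applied to the two inverse problems in (21) and (22). The model matrices below are arbitrary illustrative choices.

```python
import numpy as np

def rho_via_posterior_moments(A, Gamma, Sigma, y):
    """chi^2(posterior || prior) + 1, computed from the posterior moments (10)
    and the closed formula of Proposition 3."""
    S = A @ Sigma @ A.T + Gamma
    K = Sigma @ A.T @ np.linalg.inv(S)
    m = K @ y
    C = (np.eye(Sigma.shape[0]) - K @ A) @ Sigma
    B = 2 * Sigma - C
    det_term = np.linalg.det(Sigma) / np.sqrt(np.linalg.det(B) * np.linalg.det(C))
    return det_term * np.exp(m @ np.linalg.solve(B, m))

# Illustrative one-step filtering model (assumed matrices, d = k = 3).
rng = np.random.default_rng(4)
d = 3
M, H = np.eye(d), np.eye(d)
P, Q, R = np.eye(d), 0.5 * np.eye(d), 0.1 * np.eye(d)
y = rng.standard_normal(d)

rho_std = rho_via_posterior_moments(H, R, M @ P @ M.T + Q, y)        # standard proposal
rho_opt = rho_via_posterior_moments(H @ M, H @ Q @ H.T + R, P, y)    # optimal proposal
print(rho_std, rho_opt)   # rho_std > rho_opt, in agreement with Theorem 2
```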
4.3. Standard and Optimal Proposal in Small Noise Regime
It is possible that along a certain limiting regime,ρ diverges for the standard proposal but not for the optimal proposal. This has been observed in previous work [1,31], and here we provide some concrete examples using the scaling results from Section 3. Precisely, consider the following one-step filtering setting
$$v_1 = Mv_0 + \xi, \qquad v_0 \sim N(0, P), \qquad \xi \sim N(0, Q), \qquad y = Hv_1 + \zeta, \qquad \zeta \sim N(0, r^2 R),$$

where we will let r → 0. Let μ_opt(r), μ_std(r) be the optimal/standard targets and π_opt(r), π_std(r) be the optimal/standard proposals. We assume that M ∈ R^{d×d} and H ∈ R^{k×d} are full rank.
Proposition 8.
If r → 0, then we have

$$\rho_{\mathrm{opt}}(r) < \infty, \qquad \rho_{\mathrm{std}}(r) \sim O(r^{-k}).$$
Proof.
Consider the two inverse problems that correspond to (μ_opt(r), π_opt(r)) and (μ_std(r), π_std(r)), with design matrices, prior and noise covariances given by (21) and (22) with R replaced by r²R. Let Γ_opt(r) and Γ_std(r) denote the noise covariances in those two inverse problems. When r goes to 0, we observe that

$$\Gamma_{\mathrm{opt}}(r) = r^2 R + HQH' \to HQH', \qquad \Gamma_{\mathrm{std}}(r) = r^2 R \to 0.$$

Therefore ρ_opt(r) converges to a finite value, while Proposition 5 implies that ρ_std(r) diverges at rate O(r^{-k}). □
4.4. Standard and Optimal Proposal in High Dimension
The previous subsection shows that the standard and optimal proposals can have dramatically different behavior in the small noise regime r → 0. Here we show that both proposals can also lead to dramatically different behavior in high dimensional limits. Precisely, as a consequence of Corollary 1 we can easily identify the exact regimes where both proposals converge or diverge in the limit. The notation is analogous to that in Section 3.4, and so we omit the details.
Proposition 9. Consider the sequence of particle filters defined as above. We have the following convergence criteria:
1.
μ_opt(1:∞) ≪ π_opt(1:∞) and ρ_opt < ∞ if and only if $\sum_{i=1}^{\infty}\frac{h_i^2 m_i^2 p_i^2}{h_i^2 q_i^2 + r_i^2} < \infty$;
2.
μ_std(1:∞) ≪ π_std(1:∞) and ρ_std < ∞ if and only if $\sum_{i=1}^{\infty}\frac{h_i^2 m_i^2 p_i^2}{r_i^2} < \infty$ and $\sum_{i=1}^{\infty}\frac{h_i^2 q_i^2}{r_i^2} < \infty$.
Proof.
By direct computation, we have
$$\lambda_{\mathrm{std}}(i) = \frac{h_i^2 m_i^2 p_i^2 + h_i^2 q_i^2}{r_i^2} = \frac{h_i^2 m_i^2 p_i^2}{r_i^2} + \frac{h_i^2 q_i^2}{r_i^2}, \qquad \lambda_{\mathrm{opt}}(i) = \frac{h_i^2 m_i^2 p_i^2}{h_i^2 q_i^2 + r_i^2}.$$

Corollary 1 gives the desired result. □
Example 2.
As a simple example where absolute continuity holds for the optimal proposal but not for the standard one, let h_i = m_i = p_i = r_i = 1. Then ρ_std = ∞, but ρ_opt < ∞ provided that $\sum_{i=1}^{\infty}\frac{1}{q_i^2+1} < \infty$.
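A tiny sketch, with the hypothetical choice q_i = i (which satisfies the summability condition above), makes the dichotomy concrete by comparing the partial sums of λ_opt(i) and λ_std(i).

```python
import numpy as np

i = np.arange(1, 100_001)
q = i.astype(float)                   # hypothetical growing signal noise q_i = i
lam_opt = 1.0 / (q ** 2 + 1.0)        # with h_i = m_i = p_i = r_i = 1
lam_std = 1.0 + q ** 2

print(np.sum(lam_opt))                        # partial sums converge: rho_opt < infinity
print(np.sum(lam_std[:10]), np.sum(lam_std))  # partial sums blow up: rho_std = infinity
```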
Author Contributions
Funding acquisition, D.S.-A.; Investigation, Z.W.; Supervision, D.S.-A.; Writing-original draft, Z.W.; Writing-review & editing, D.S.-A. and Z.W. All authors have read and agreed to the published version of the manuscript.
Funding
The work of DSA was supported by NSF and NGA through the grant DMS-2027056. DSA also acknowledges partial support from the NSF grant DMS-1912818/1912802.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. χ²-Divergence between Gaussians
We recall that the distribution P_θ parameterized by θ belongs to the exponential family EF(Θ) over a natural parameter space Θ, if θ ∈ Θ and P_θ has density of the form

$$f(u;\theta) = e^{\langle t(u),\theta\rangle - F(\theta) + k(u)},$$

where the natural parameter space is given by

$$\Theta = \Bigl\{\theta : \int e^{\langle t(u),\theta\rangle + k(u)}\,du < \infty\Bigr\}.$$
The following result can be found in [35].
Lemma A1.
Suppose θ_1, θ_2 ∈ Θ are parameters for probability densities f(u; θ_{1,2}) = e^{⟨t(u),θ_{1,2}⟩ − F(θ_{1,2}) + k(u)} with 2θ_1 − θ_2 ∈ Θ. Then,

$$d_{\chi^2}\bigl(f(\cdot;\theta_1)\,\|\,f(\cdot;\theta_2)\bigr) = e^{F(2\theta_1-\theta_2)-2F(\theta_1)+F(\theta_2)} - 1.$$
Proof.
By direct computation,
$$d_{\chi^2}\bigl(f(\cdot;\theta_1)\,\|\,f(\cdot;\theta_2)\bigr) + 1 = \int\frac{f(u;\theta_1)^2}{f(u;\theta_2)}\,du = \int e^{\langle t(u),2\theta_1-\theta_2\rangle - (2F(\theta_1)-F(\theta_2)) + k(u)}\,du = e^{F(2\theta_1-\theta_2)-2F(\theta_1)+F(\theta_2)}\int f(u;2\theta_1-\theta_2)\,du = e^{F(2\theta_1-\theta_2)-2F(\theta_1)+F(\theta_2)}.$$

Note that ∫ f(u; 2θ_1 − θ_2) du = 1 since 2θ_1 − θ_2 ∈ Θ by assumption. □

Using Lemma A1 we can compute the χ²-divergence between Gaussians. To do so, we note that d-dimensional Gaussians N(μ, Σ) belong to the exponential family over the parameter space R^d ⊕ R^{d×d} by letting θ = [Σ^{−1}μ; −½Σ^{−1}] and F(θ) = ½μ'Σ^{−1}μ + ½log|Σ|. In the context of Gaussians, an exponential parameter θ = [Σ^{−1}μ; −½Σ^{−1}] belongs to the natural parameter space Θ if and only if Σ is symmetric and positive definite. Indeed, the integral ∫ exp(−½(u−μ)'Σ^{−1}(u−μ)) du is finite if and only if Σ ≻ 0.
Proof of Proposition 3.
Let θ_μ, θ_π be the exponential parameters of μ, π. Then 2θ_μ − θ_π corresponds to a Gaussian with mean (2C^{−1} − Σ^{−1})^{−1}(2C^{−1}m) and covariance (2C^{−1} − Σ^{−1})^{−1}. We have

$$F(2\theta_\mu-\theta_\pi) - 2F(\theta_\mu) + F(\theta_\pi) = \tfrac{1}{2}\log\bigl|(2C^{-1}-\Sigma^{-1})^{-1}\bigr| - \log|C| + \tfrac{1}{2}\log|\Sigma| + \tfrac{1}{2}(2C^{-1}m)'(2C^{-1}-\Sigma^{-1})^{-1}(2C^{-1}m) - m'C^{-1}m$$
$$= \log\sqrt{\frac{|\Sigma|}{|2C^{-1}-\Sigma^{-1}|\,|C|^2}} + m'\bigl(C^{-1}(2C^{-1}-\Sigma^{-1})^{-1}2C^{-1}\bigr)m - m'\bigl(C^{-1}(2C^{-1}-\Sigma^{-1})^{-1}(2C^{-1}-\Sigma^{-1})\bigr)m$$
$$= \log\frac{|\Sigma|}{\sqrt{|2\Sigma-C|\,|C|}} + m'\bigl(C^{-1}(2C^{-1}-\Sigma^{-1})^{-1}\Sigma^{-1}\bigr)m = \log\frac{|\Sigma|}{\sqrt{|2\Sigma-C|\,|C|}} + m'(2\Sigma-C)^{-1}m.$$

Applying Lemma A1 gives

$$d_{\chi^2}(\mu\,\|\,\pi) = \exp\bigl(F(2\theta_\mu-\theta_\pi)-2F(\theta_\mu)+F(\theta_\pi)\bigr) - 1 = \frac{|\Sigma|}{\sqrt{|2\Sigma-C|\,|C|}}\exp\bigl(m'(2\Sigma-C)^{-1}m\bigr) - 1,$$

if 2θ_μ − θ_π ∈ Θ; in other words, if the corresponding covariance matrix (2C^{−1} − Σ^{−1})^{−1} is positive definite, which holds precisely when 2Σ ≻ C. □
Remark A1.
By translation invariance of Lebesgue measure, we can obtain the more general formula for the χ²-divergence between two Gaussians with nonzero means by replacing m with the difference between the two mean vectors:

$$d_{\chi^2}\bigl(N(m_1,C)\,\|\,N(m_2,\Sigma)\bigr) = \frac{|\Sigma|}{\sqrt{|2\Sigma-C|\,|C|}}\,e^{(m_1-m_2)'(2\Sigma-C)^{-1}(m_1-m_2)} - 1.$$
Appendix B. Proof of Lemma 2
Proof.
Dividing each g_i by its normalizing constant, we may assume without loss of generality that g_i is exactly the Radon-Nikodym derivative dμ_i/dπ_i, so that π_i(g_i) = 1 and H(μ_{1:∞}, π_{1:∞}) = ∏_{i=1}^∞ π_i(√g_i).

If μ_{1:∞} ≪ π_{1:∞}, then the Radon-Nikodym derivative g_{1:∞} cannot be π_{1:∞}-a.e. zero since π_{1:∞} and μ_{1:∞} are probability measures. As a consequence, ∏_{i=1}^∞ π_i(√g_i) = π_{1:∞}(√g_{1:∞}) > 0 by the product structure of μ_{1:∞} and π_{1:∞}.

Now we assume ∏_{i=1}^∞ π_i(√g_i) > 0. It suffices to show that g_{1:∞} is well-defined, i.e., that ∏_{i=1}^L g_i converges in L¹_π as L → ∞. For this it is enough to prove that the sequence is Cauchy, in other words

$$\lim_{L,\ell\to\infty}\pi_{1:\infty}\bigl(|g_{1:L+\ell}-g_{1:L}|\bigr) = 0.$$
We observe that
$$\|g_{1:L+\ell}-g_{1:L}\|_1 \le \bigl\|\sqrt{g_{1:L+\ell}}-\sqrt{g_{1:L}}\bigr\|_2\,\bigl\|\sqrt{g_{1:L+\ell}}+\sqrt{g_{1:L}}\bigr\|_2 \le \bigl\|\sqrt{g_{1:L+\ell}}-\sqrt{g_{1:L}}\bigr\|_2\Bigl(\bigl\|\sqrt{g_{1:L+\ell}}\bigr\|_2+\bigl\|\sqrt{g_{1:L}}\bigr\|_2\Bigr) = 2\,\bigl\|\sqrt{g_{1:L+\ell}}-\sqrt{g_{1:L}}\bigr\|_2.$$

Expanding the square of the right-hand side gives

$$\pi_{1:\infty}\Bigl(\bigl(\sqrt{g_{1:L+\ell}}-\sqrt{g_{1:L}}\bigr)^2\Bigr) = \pi_{1:\infty}\Bigl(g_{1:L+\ell}+g_{1:L}-2\sqrt{g_{1:L+\ell}\,g_{1:L}}\Bigr) = 2-2\,\pi_{1:L}(g_{1:L})\,\pi_{L+1:\infty}\Bigl(\sqrt{g_{1:L+\ell}/g_{1:L}}\Bigr) = 2\Bigl(1-\frac{\pi_{1:L+\ell}\bigl(\sqrt{g_{1:L+\ell}}\bigr)}{\pi_{1:L}\bigl(\sqrt{g_{1:L}}\bigr)}\Bigr).$$

Therefore, it is enough to show

$$\lim_{L,\ell\to\infty}\frac{\pi_{1:L+\ell}\bigl(\sqrt{g_{1:L+\ell}}\bigr)}{\pi_{1:L}\bigl(\sqrt{g_{1:L}}\bigr)} = 1.$$

By Jensen's inequality, for any two probability measures μ ≪ π with density g, we have

$$\pi\bigl(\sqrt{g}\bigr) \le \sqrt{\pi(g)} = 1.$$
Combining with our assumption, we deduce that
$$0 < \prod_{i=1}^{\infty}\pi_i\bigl(\sqrt{g_i}\bigr) \le 1,$$
which is equivalent to
$$-\infty < \sum_{i=1}^{\infty}\log\bigl(\pi_i(\sqrt{g_i})\bigr) \le 0.$$

The partial sums of this series are decreasing by (A1) and bounded below, so the series converges and satisfies

$$\lim_{L,\ell\to\infty}\frac{\pi_{1:L+\ell}\bigl(\sqrt{g_{1:L+\ell}}\bigr)}{\pi_{1:L}\bigl(\sqrt{g_{1:L}}\bigr)} = \lim_{L,\ell\to\infty}e^{\sum_{i=L+1}^{L+\ell}\log(\pi_i(\sqrt{g_i}))} = 1.$$
□
Appendix C. Additional Figures
Figure A1. Noise scaling with d = k = 4.
Figure A2. Prior scaling with d = k = 4.
Abstract
Importance sampling is used to approximate Bayes' rule in many computational approaches to Bayesian inverse problems, data assimilation and machine learning. This paper reviews and further investigates the required sample size for importance sampling in terms of the χ²-divergence between target and proposal. We illustrate through examples the roles that dimension, noise-level and other model parameters play in approximating the Bayesian update with importance sampling. Our examples also facilitate a new direct comparison of standard and optimal proposals for particle filtering.




