1. Introduction
Classical optimal transport (OT) has received a lot of attention recently, in particular in machine learning for tasks such as generative networks [1] or domain adaptation [2], to name a few. It generally relies on the Wasserstein distance, which builds an optimal coupling between distributions given a notion of distance between their samples. Yet, this metric cannot be used directly whenever the distributions lie in different metric spaces, and it lacks potentially important properties, such as translation or rotation invariance of the supports of the distributions, which can be useful when comparing shapes or meshes [3,4]. In order to alleviate those problems, custom solutions have been proposed, such as [5], in which invariances are enforced by optimizing over some class of transformations, or [6], in which distributions lying in different spaces are compared by optimizing over the Stiefel manifold to project or embed one of the measures.
Apart from these works, another meaningful OT distance to tackle these problems is the Gromov–Wasserstein (GW) distance, originally proposed in [3,7,8]. It is a distance between metric spaces and has several appealing properties, such as geodesics or invariances [8]. Yet, the price to be paid lies in its computational complexity, which requires solving a nonconvex quadratic optimization problem with linear constraints. A recent line of work aims to compute approximations or relaxations of the original problem in order to spread its use in more data-intensive machine learning applications. For example, Peyré et al. [9] rely on entropic regularization and Sinkhorn iterations [10], while recent methods impose low-rank constraints on the coupling [11] or rely on a sliced approach [12] or on mini-batch estimators [13] to approximate the Gromov–Wasserstein distance. In Chowdhury et al. [4], the authors propose to partition the space and to solve the optimal transport problem between a subset of points before finding a coupling between all the points.
In this work, we study the subspace detour approach for Gromov–Wasserstein. This class of methods was first proposed for the Wasserstein setting in Muzellec and Cuturi [14] and consists of (1) projecting the measures onto a wisely chosen subspace and finding an optimal coupling between them, and (2) constructing a nearly optimal plan of the measures on the whole space using disintegration (see Section 2.2). Our main contribution is to generalize the subspace detour approach to different subspaces and to apply it to the GW distance. We derive some useful properties as well as closed-form solutions of this transport plan between Gaussian distributions. On the practical side, we provide a novel closed-form expression of the one-dimensional GW problem that allows us to efficiently compute the subspace detour transport plan when the subspaces are one-dimensional. Illustrations of the method are given on a shape matching problem, where we show good results with a cheaper computational cost compared to other GW-based methods. Interestingly enough, we also propose a separable quadratic cost for the GW problem that can be related to a triangular coupling [15], hence bridging the gap with Knothe–Rosenblatt (KR) rearrangements [16,17].
2. Background
In this section, we introduce all the necessary material to describe the subspace detours approach for classical optimal transport and relate it to the Knothe–Rosenblatt rearrangement. We show how to find couplings via the gluing lemma and measure disintegration. Then, we introduce the Gromov–Wasserstein problem for which we will derive the subspace detour in the next sections.
2.1. Classical Optimal Transport
Let $\mu \in \mathcal{P}(X)$ and $\nu \in \mathcal{P}(Y)$ be two probability measures. The set of couplings between $\mu$ and $\nu$ is defined as:
$$\Pi(\mu,\nu) = \big\{\gamma \in \mathcal{P}(X\times Y) \mid \pi^1_\#\gamma = \mu,\ \pi^2_\#\gamma = \nu\big\},$$
where $\pi^1$ and $\pi^2$ are the projections on the first and second coordinate (i.e., $\pi^1(x,y) = x$ and $\pi^2(x,y) = y$), respectively, and $\#$ is the push-forward operator, defined such that, for any measurable set $A$, $T_\#\mu(A) = \mu\big(T^{-1}(A)\big)$.
2.1.1. Kantorovitch Problem
There exist several types of couplings between probability measures, for which a non-exhaustive list can be found in [18] (Chapter 1). Among them, the so-called optimal coupling is the minimizer of the following Kantorovitch problem:
$$\inf_{\gamma \in \Pi(\mu,\nu)} \int_{X\times Y} c(x,y)\,\mathrm{d}\gamma(x,y), \quad (1)$$
with $c$ being some cost function. The Kantorovitch problem (1) is known to admit a solution when $c$ is non-negative and lower semi-continuous [19] (Theorem 1.7). When $c(x,y) = \|x-y\|_2^p$ with $p \geq 1$, it defines the so-called $p$-Wasserstein distance:
$$W_p^p(\mu,\nu) = \inf_{\gamma\in\Pi(\mu,\nu)} \int \|x-y\|_2^p\,\mathrm{d}\gamma(x,y). \quad (2)$$
When the optimal coupling is of the form $\gamma = (\mathrm{Id}\times T)_\#\mu$ with $T$ some deterministic map such that $T_\#\mu = \nu$, $T$ is called the Monge map. In one dimension, with $\mu$ atomless, the solution to (2) is a deterministic coupling of the form [19] (Theorem 2.5):
$$T = F_\nu^{-1}\circ F_\mu, \quad (3)$$
where $F_\mu$ is the cumulative distribution function of $\mu$ and $F_\nu^{-1}$ is the quantile function of $\nu$. This map is also known as the increasing rearrangement map.
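For empirical distributions with the same number of samples and uniform weights, the map (3) amounts to matching sorted samples. The following is a minimal sketch under that assumption; the function name is ours, not from [19].

```python
import numpy as np

def increasing_rearrangement(x, y):
    """Map each sample of x to a sample of y by matching quantile levels,
    i.e., T = F_nu^{-1} o F_mu on the samples (requires len(x) == len(y))."""
    order_x = np.argsort(x)      # ranks of the source samples (F_mu)
    y_sorted = np.sort(y)        # quantiles of the target (F_nu^{-1})
    T = np.empty_like(y_sorted)
    T[order_x] = y_sorted        # the i-th smallest x goes to the i-th smallest y
    return T

x = np.random.randn(100)
y = 2.0 * np.random.randn(100) + 1.0
Tx = increasing_rearrangement(x, y)  # Tx[i] is the image of x[i]
```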
2.1.2. Knothe–Rosenblatt Rearrangement
Another interesting coupling is the Knothe–Rosenblatt (KR) rearrangement, which takes advantage of the increasing rearrangement in one dimension by iterating over the dimensions and using the disintegration of the measures. Concatenating all the increasing rearrangements between the conditional probabilities, the KR rearrangement produces a nondecreasing triangular map $T_K$ (i.e., $T_K(x) = \big(T_1(x_1), T_2(x_1,x_2), \dots, T_d(x_1,\dots,x_d)\big)$ for all $x \in \mathbb{R}^d$, and, for all $j$, $T_j(x_1,\dots,x_{j-1},\cdot)$ is nondecreasing), and a deterministic coupling (i.e., $\gamma_K = (\mathrm{Id}\times T_K)_\#\mu$) [18,19,20].
Carlier et al. [21] made a connection between this coupling and optimal transport by showing that it can be obtained as the limit of optimal transport plans for a degenerated cost:
$$c_t(x,y) = \sum_{i=1}^d \lambda_i(t)\,(x_i - y_i)^2,$$
where, for all $t > 0$ and all $i$, $\lambda_i(t) > 0$, $\lambda_1(t) = 1$, and, for all $i \geq 2$, $\lambda_i(t)/\lambda_{i-1}(t) \to 0$ as $t \to 0$. This cost can be recast as in [22] as $c_t(x,y) = \sum_{i=1}^d t^{i-1}(x_i - y_i)^2$, where $t \to 0$. This formalizes into the following theorem:
Theorem 1 ([19,21]). Let μ and ν be two absolutely continuous measures on $\mathbb{R}^d$, with compact supports. Let $\gamma_t$ be an optimal transport plan for the cost $c_t$, let $T_K$ be the Knothe–Rosenblatt map between μ and ν, and $\gamma_K = (\mathrm{Id}\times T_K)_\#\mu$ the associated transport plan. Then, we have $\gamma_t \rightharpoonup \gamma_K$ weakly as $t \to 0$. Moreover, if the $\gamma_t$ are induced by transport maps $T_t$, then $T_t$ converges in $L^2(\mu)$ when $t$ tends to zero to the Knothe–Rosenblatt rearrangement.
2.2. Subspace Detours and Disintegration
Muzellec and Cuturi [14] proposed another OT problem by optimizing over the couplings that share a given plan on a subspace. More precisely, they defined subspace-optimal plans, for which the shared plan is the OT plan between the projected measures.
(Subspace-Optimal Plans [14], Definition 1). Let $\mu,\nu \in \mathcal{P}(\mathbb{R}^d)$ and let $E \subset \mathbb{R}^d$ be a $k$-dimensional subspace. Let $\gamma_E^*$ be an OT plan for the Wasserstein distance between $\mu_E = (P_E)_\#\mu$ and $\nu_E = (P_E)_\#\nu$ (with $P_E$ as the orthogonal projection on $E$). Then, the set of $E$-optimal plans between μ and ν is defined as $\Pi_E(\mu,\nu) = \{\gamma \in \Pi(\mu,\nu) \mid (P_E \times P_E)_\#\gamma = \gamma_E^*\}$.
In other words, the subspace OT plans are the transport plans of $\Pi(\mu,\nu)$ that agree on the subspace $E$ with the optimal transport plan on this subspace. To construct such a coupling $\pi \in \Pi_E(\mu,\nu)$, one can rely on the gluing lemma [18] or use the disintegration of the measure, as described in the following section.
2.2.1. Disintegration
Let $(Y,\mathcal{Y})$ and $(Z,\mathcal{Z})$ be measurable spaces, and $(X,\mathcal{X}) = (Y\times Z, \mathcal{Y}\otimes\mathcal{Z})$ the product measurable space. Then, for $\mu \in \mathcal{P}(X)$, we denote $\mu_Y = (\pi_Y)_\#\mu$ and $\mu_Z = (\pi_Z)_\#\mu$ as the marginals, where $\pi_Y$ (respectively $\pi_Z$) is the projection on $Y$ (respectively $Z$). Then, a family $\big(K(y,\cdot)\big)_{y\in Y}$ is a disintegration of $\mu$ if, for all $y \in Y$, $K(y,\cdot)$ is a measure on $Z$; for all $A \in \mathcal{Z}$, $y \mapsto K(y,A)$ is measurable; and:
$$\forall f \in C(X),\quad \int_{Y\times Z} f(y,z)\,\mathrm{d}\mu(y,z) = \int_Y \int_Z f(y,z)\,K(y,\mathrm{d}z)\,\mathrm{d}\mu_Y(y),$$
where $C(X)$ is the set of continuous functions on $X$. We can note $\mu = \mu_Y \otimes K$. $K$ is a probability kernel if, for all $y \in Y$, $K(y,Z) = 1$. The disintegration of a measure actually corresponds to conditional laws in the context of probabilities. This concept will allow us to obtain measures on the whole space from marginals on subspaces. In the case where $X = \mathbb{R}^d$, which is our setting of interest, we have existence and uniqueness of the disintegration (see Box 2.2 of [19] or Chapter 5 of [23] for the more general case).
2.2.2. Coupling on the Whole Space
Let us note $\mu_{E^\perp|E}$ and $\nu_{E^\perp|E}$ as the disintegrated measures on the orthogonal spaces (i.e., such that $\mu = \mu_E \otimes \mu_{E^\perp|E}$ and $\nu = \nu_E \otimes \nu_{E^\perp|E}$; if we have densities, $p_\mu(x_E, x_{E^\perp}) = p_{\mu_E}(x_E)\,p_{\mu_{E^\perp}|x_E}(x_{E^\perp})$). Then, to obtain a transport plan between the two original measures on the whole space, we can look for another coupling between the disintegrated measures $\mu_{E^\perp|x_E}$ and $\nu_{E^\perp|y_E}$. In particular, two such couplings are proposed in [14]: the Monge–Independent (MI) plan:
$$\pi_{MI} = \gamma_E^* \otimes \big(\mu_{E^\perp|x_E} \otimes \nu_{E^\perp|y_E}\big),$$
where we take the independent coupling between $\mu_{E^\perp|x_E}$ and $\nu_{E^\perp|y_E}$ for $\gamma_E^*$-almost every $(x_E, y_E)$, and the Monge–Knothe (MK) plan:
$$\pi_{MK} = \gamma_E^* \otimes \gamma_{E^\perp}^{x_E,y_E},$$
where $\gamma_{E^\perp}^{x_E,y_E}$ is an optimal plan between $\mu_{E^\perp|x_E}$ and $\nu_{E^\perp|y_E}$ for $\gamma_E^*$-almost every $(x_E, y_E)$. Muzellec and Cuturi [14] observed that MI is more adapted to noisy environments since it only computes the OT plan of the subspace, while MK is more suited for applications where we want to prioritize some subspace but where all the directions still contain relevant information [14]. This subspace detour approach can be of much interest following the popular assumption that two distributions on $\mathbb{R}^d$ differ only on a low-dimensional subspace, as in the spiked transport model [24]. However, it is still required to find the adequate subspace. Muzellec and Cuturi [14] propose to either rely on a priori knowledge to select the subspace (by using, e.g., a reference dataset and a principal component analysis) or to optimize over the Stiefel manifold. A discrete sketch of this construction is given below.
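The following is a hedged sketch of a discrete subspace detour using the POT library: the choice of a one-dimensional PCA subspace for each cloud and all names are our own assumptions for this example, not prescriptions from [14]. For continuous-valued samples, the 1D projections are almost surely distinct, so the plan between projections already determines the coupling of the full points (see also the discussion in Section 5).

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

def first_pca_axis(Z):
    """Leading principal direction of a point cloud."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # source samples
Y = rng.normal(size=(300, 3)) + 2.0    # target samples

xE = X @ first_pca_axis(X)             # 1D projections of the source
yE = Y @ first_pca_axis(Y)             # 1D projections of the target
a = np.full(len(X), 1 / len(X))        # uniform weights
b = np.full(len(Y), 1 / len(Y))
M = (xE[:, None] - yE[None, :]) ** 2   # 1D squared-distance cost
gamma_E = ot.emd(a, b, M)              # OT plan between the projected measures
# gamma_E couples the original points X and Y through the subspace.
```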
2.3. Gromov–Wasserstein
Formally, the Gromov–Wasserstein distance allows us to compare metric measure spaces (mm-spaces), i.e., triplets $(X, d_X, \mu_X)$ and $(Y, d_Y, \mu_Y)$, where $(X, d_X)$ and $(Y, d_Y)$ are complete separable metric spaces and $\mu_X$ and $\mu_Y$ are Borel probability measures on $X$ and $Y$ [8], respectively, by computing:
$$GW(X,Y) = \inf_{\gamma\in\Pi(\mu_X,\mu_Y)} \iint L\big(d_X(x,x'), d_Y(y,y')\big)\,\mathrm{d}\gamma(x,y)\,\mathrm{d}\gamma(x',y'),$$
where $L$ is some loss on $\mathbb{R}\times\mathbb{R}$. It has actually been extended to other spaces by replacing the distances by cost functions $c_X$ and $c_Y$, as, e.g., in [25]. Furthermore, it has many appealing properties, such as invariances (which depend on the costs).
Vayer [26] notably studied this problem in the setting where $X$ and $Y$ are Euclidean spaces, with $L(a,b) = (a-b)^2$ and $c_X(x,x') = \|x-x'\|_2^2$ or $c_X(x,x') = \langle x,x'\rangle$. In particular, let $\mu \in \mathcal{P}(\mathbb{R}^p)$ and $\nu \in \mathcal{P}(\mathbb{R}^q)$; the inner-GW problem is defined as:
$$\inf_{\gamma\in\Pi(\mu,\nu)} \iint \big(\langle x,x'\rangle - \langle y,y'\rangle\big)^2\,\mathrm{d}\gamma(x,y)\,\mathrm{d}\gamma(x',y'). \quad (4)$$
For this problem, a closed form in one dimension can be found when one of the distributions admits a density w.r.t. the Lebesgue measure:
Theorem 2 ([26], Theorem 4.2.4). Let $\mu, \nu \in \mathcal{P}(\mathbb{R})$, with μ being absolutely continuous with respect to the Lebesgue measure. Let $F_\mu(x) = \mu\big((-\infty,x]\big)$ be the cumulative distribution function and $\bar F_\mu(x) = \mu\big([x,+\infty)\big)$ the anti-cumulative distribution function. Let $T_{\mathrm{asc}} = F_\nu^{-1}\circ F_\mu$ and $T_{\mathrm{desc}} = F_\nu^{-1}\circ \bar F_\mu$. Then, an optimal solution of (4) is achieved either by $\gamma = (\mathrm{Id}\times T_{\mathrm{asc}})_\#\mu$ or by $\gamma = (\mathrm{Id}\times T_{\mathrm{desc}})_\#\mu$.
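Numerically, Theorem 2 suggests building both candidate maps from empirical CDF/quantile functions and keeping whichever has the lower inner-GW cost. The sketch below assumes uniform weights; the helper names are ours, not from [26].

```python
import numpy as np

def t_asc(x, y):
    """F_nu^{-1}(F_mu(x)) on the samples: match quantile levels."""
    ranks = (np.argsort(np.argsort(x)) + 0.5) / len(x)  # F_mu(x_i)
    return np.quantile(y, ranks)                        # F_nu^{-1}

def t_desc(x, y):
    """F_nu^{-1}(1 - F_mu(x)): the decreasing rearrangement."""
    ranks = (np.argsort(np.argsort(x)) + 0.5) / len(x)
    return np.quantile(y, 1.0 - ranks)

def inner_gw_cost(x, tx):
    """Inner-GW cost of the deterministic coupling x[i] -> tx[i]."""
    return float(((np.outer(x, x) - np.outer(tx, tx)) ** 2).mean())

x = np.random.randn(200)
y = np.random.randn(300) * 3.0  # different sample sizes are fine here
best_map = min((t_asc(x, y), t_desc(x, y)), key=lambda t: inner_gw_cost(x, t))
```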
3. Subspace Detours for GW
In this section, we propose to extend the subspace detours of Muzellec and Cuturi [14] to Gromov–Wasserstein costs. We show that we can even take subspaces of different dimensions and still obtain a coupling on the whole space using the Monge–Independent or the Monge–Knothe coupling. Then, we derive some properties analogous to those of Muzellec and Cuturi [14], as well as some closed-form solutions between Gaussians. We also provide a new closed-form expression of the inner-GW problem between one-dimensional discrete distributions and provide an illustration on a shape-matching problem.
3.1. Motivations
First, we adapt the definition of subspace optimal plans to allow for different subspaces. Indeed, since the GW distance is adapted to distributions that have their own geometry, we argue that if we project both measures on the same subspace, then the resulting coupling is likely not coherent with that of GW. To illustrate this point, we use as a source distribution one moon of the two-moons dataset and obtain a target by rotating it (see Figure 1). As GW with the squared Euclidean cost is invariant with respect to isometries, the optimal coupling is diagonal, as recovered on the left side of the figure. However, when choosing one common subspace on which to project both the source and target distributions, we completely lose the optimal coupling between them. Nonetheless, by choosing one subspace for each measure more wisely (using here the first component of the principal component analysis (PCA) decomposition of each distribution), we recover the diagonal coupling. This simple illustration underlines that the choice of both subspaces is important. A way of choosing the subspaces could be to project each dataset on the subspace containing the most information, using, e.g., PCA independently on each distribution. Muzellec and Cuturi [14] proposed to optimize the optimal transport cost with respect to an orthonormal matrix with a projected gradient descent, which could be extended to an optimization over two orthonormal matrices in our context.
By allowing for different subspaces, we obtain the following definition of subspace optimal plans:
Let $\mu \in \mathcal{P}(\mathbb{R}^p)$, $\nu \in \mathcal{P}(\mathbb{R}^q)$, $E$ be a $k$-dimensional subspace of $\mathbb{R}^p$, and $F$ a $k'$-dimensional subspace of $\mathbb{R}^q$. Let $\gamma^*_{E\times F}$ be an optimal transport plan for GW between $\mu_E = (P_E)_\#\mu$ and $\nu_F = (P_F)_\#\nu$ (with $P_E$ (resp. $P_F$) the orthogonal projection on $E$ (resp. $F$)). Then, the set of $(E,F)$-optimal plans between μ and ν is defined as $\Pi_{E,F}(\mu,\nu) = \{\gamma \in \Pi(\mu,\nu) \mid (P_E \times P_F)_\#\gamma = \gamma^*_{E\times F}\}$.
Analogously to Muzellec and Cuturi [14] (Section 2.2), we can obtain from $\gamma^*_{E\times F}$ a coupling on the whole space by either defining the Monge–Independent plan $\pi_{MI} = \gamma^*_{E\times F} \otimes \big(\mu_{E^\perp|x_E} \otimes \nu_{F^\perp|y_F}\big)$ or the Monge–Knothe plan $\pi_{MK} = \gamma^*_{E\times F} \otimes \gamma_\perp^{x_E,y_F}$, where the OT plans $\gamma_\perp^{x_E,y_F}$ between the conditionals are taken with respect to some OT cost, e.g., $W_2$.
3.2. Properties
Let $\mu \in \mathcal{P}(\mathbb{R}^p)$ and $\nu \in \mathcal{P}(\mathbb{R}^q)$ and denote:
$$GW_{E,F}(\mu,\nu) = \inf_{\gamma\in\Pi_{E,F}(\mu,\nu)} \iint L\big(c_X(x,x'), c_Y(y,y')\big)\,\mathrm{d}\gamma(x,y)\,\mathrm{d}\gamma(x',y'), \quad (5)$$
the Gromov–Wasserstein problem restricted to the subspace optimal plans of the previous definition. In the following, we show that Monge–Knothe couplings are optimal plans of this problem, which is a direct transposition of Proposition 1 in [14].
Proposition 1. Let $\mu \in \mathcal{P}(\mathbb{R}^p)$, $\nu \in \mathcal{P}(\mathbb{R}^q)$, and $\pi_{MK} = \gamma^*_{E\times F}\otimes\gamma_\perp^{x_E,y_F}$, where $\gamma^*_{E\times F}$ is an optimal coupling between $\mu_E$ and $\nu_F$, and, for $\gamma^*_{E\times F}$-almost every $(x_E,y_F)$, $\gamma_\perp^{x_E,y_F}$ is an optimal coupling between $\mu_{E^\perp|x_E}$ and $\nu_{F^\perp|y_F}$. Then, $\pi_{MK}$ achieves the infimum in (5).
Proof. Let $\gamma \in \Pi_{E,F}(\mu,\nu)$, and disintegrate it with respect to the subspace coordinates as $\gamma = \gamma^*_{E\times F}\otimes\gamma^{x_E,y_F}$. However, for $\gamma^*_{E\times F}$-a.e. $(x_E,y_F)$, the conditional coupling of $\pi_{MK}$ achieves a cost no greater than that of $\gamma^{x_E,y_F}$, by definition of the Monge–Knothe coupling. By integrating with respect to $\gamma^*_{E\times F}$, we obtain that the cost of $\pi_{MK}$ is no greater than that of $\gamma$. Therefore, $\pi_{MK}$ is optimal among subspace optimal plans. □
The key properties of GW that we would like to keep are its invariances. We show in two particular cases that we conserve them on the orthogonal spaces (since the measure on $E\times F$ is fixed).
Proposition 2. Let $\mu \in \mathcal{P}(\mathbb{R}^p)$, $\nu \in \mathcal{P}(\mathbb{R}^q)$, $E \subset \mathbb{R}^p$, $F \subset \mathbb{R}^q$.
For $c_X(x,x') = \|x-x'\|_2^2$ or $c_X(x,x') = \langle x,x'\rangle$ (and analogously for $c_Y$), (5) is invariant with respect to maps of the form $\tilde T: x \mapsto \big(x_E, T(x_{E^\perp})\big)$ (resp. $\tilde T': y \mapsto \big(y_F, T'(y_{F^\perp})\big)$), with $T$ an isometry on $E^\perp$ (resp. $T'$ an isometry on $F^\perp$) with respect to the corresponding cost ($\|\cdot\|_2^2$ or $\langle\cdot,\cdot\rangle$).
Proof. We propose a sketch of the proof; the full proof can be found in Appendix A.1. Let $\gamma \in \Pi_{E,F}(\mu,\nu)$, let $T$ be an isometry w.r.t. $c_X$ on $E^\perp$, and let $\tilde T$ be defined such that, for all $x$, $\tilde T(x) = \big(x_E, T(x_{E^\perp})\big)$.
By using Lemma 6 of [27], we show that $(\tilde T\times\mathrm{Id})_\#\Pi_{E,F}(\mu,\nu) = \Pi_{E,F}(\tilde T_\#\mu,\nu)$. Hence, for all $\gamma' \in \Pi_{E,F}(\tilde T_\#\mu,\nu)$, there exists $\gamma \in \Pi_{E,F}(\mu,\nu)$ such that $\gamma' = (\tilde T\times\mathrm{Id})_\#\gamma$. By disintegrating with respect to the subspace coordinates and using the properties of the pushforward, we can show that the objectives of $\gamma$ and $\gamma'$ coincide.
Finally, by taking the infimum with respect to $\gamma$, we find $GW_{E,F}(\tilde T_\#\mu,\nu) = GW_{E,F}(\mu,\nu)$. □
3.3. Closed-Form between Gaussians
We can also derive explicit formulas between Gaussians in particular cases. Let $\mu = \mathcal{N}(m_\mu,\Sigma) \in \mathcal{P}(\mathbb{R}^p)$ and $\nu = \mathcal{N}(m_\nu,\Lambda) \in \mathcal{P}(\mathbb{R}^q)$ be two Gaussian measures with $\Sigma \in \mathbb{R}^{p\times p}$ and $\Lambda \in \mathbb{R}^{q\times q}$. As previously, let $E$ and $F$ be $k$- and $k'$-dimensional subspaces, respectively. Following Muzellec and Cuturi [14], we represent $\Sigma$ in an orthonormal basis of $E\oplus E^\perp$ and $\Lambda$ in an orthonormal basis of $F\oplus F^\perp$, i.e.,
$$\Sigma = \begin{pmatrix}\Sigma_E & \Sigma_{EE^\perp}\\ \Sigma_{E^\perp E} & \Sigma_{E^\perp}\end{pmatrix},\qquad \Lambda = \begin{pmatrix}\Lambda_F & \Lambda_{FF^\perp}\\ \Lambda_{F^\perp F} & \Lambda_{F^\perp}\end{pmatrix}.$$
Now, let us denote the following:
$$\Sigma/\Sigma_E = \Sigma_{E^\perp} - \Sigma_{E^\perp E}\,\Sigma_E^{-1}\,\Sigma_{EE^\perp}$$
as the Schur complement of $\Sigma$ with respect to $\Sigma_E$. We know that the conditionals of Gaussians are Gaussians and that their covariances are the Schur complements (see, e.g., [28,29]). For GW, we have for now no certainty that the optimal transport plan is Gaussian. Let $\mathcal{N}_{p+q}$ denote the set of Gaussian measures on $\mathbb{R}^{p+q}$. By restricting the minimization problem to Gaussian couplings, i.e., by solving:
$$\inf_{\gamma\in\Pi(\mu,\nu)\cap\mathcal{N}_{p+q}} \iint \big(\|x-x'\|_2^2 - \|y-y'\|_2^2\big)^2\,\mathrm{d}\gamma(x,y)\,\mathrm{d}\gamma(x',y'), \quad (6)$$
Salmona et al. [30] showed that there is a solution of the form $\gamma^* = (\mathrm{Id}\times T)_\#\mu$, with
$$T(x) = m_\nu + P_\nu\,A\,P_\mu^\top (x - m_\mu), \quad (7)$$
where $\Sigma = P_\mu D_\mu P_\mu^\top$ and $\Lambda = P_\nu D_\nu P_\nu^\top$ are spectral decompositions of the covariances and $A$ is a rectangular matrix, diagonal up to signs, matching the sorted eigenvalues of $\Lambda$ with those of $\Sigma$. By combining the results of Muzellec and Cuturi [14] and Salmona et al. [30], we obtain the following closed form for Monge–Knothe couplings:
Proposition 3. Suppose $\mu = \mathcal{N}(0,\Sigma)$ and $\nu = \mathcal{N}(0,\Lambda)$. For the Gaussian-restricted GW problem (6), a Monge–Knothe transport map between $\mu$ and $\nu$ is, for all $x$, $T_{MK}(x) = Bx$, where, in the chosen bases of $E\oplus E^\perp$ and $F\oplus F^\perp$:
$$B = \begin{pmatrix} T_E & 0\\ C & T_\perp \end{pmatrix},$$
with $T_E$ being an optimal transport map between $\mu_E$ and $\nu_F$ (of the form (7)), $T_\perp$ an optimal transport map between $\mathcal{N}(0,\Sigma/\Sigma_E)$ and $\mathcal{N}(0,\Lambda/\Lambda_F)$, and $C$ satisfying the linear compatibility equation, made explicit in Appendix A.2.1, which ensures $B_\#\mu = \nu$.
Proof. See Appendix A.2.1. □
Proposition 4. Suppose that $\mu = \mathcal{N}(0,\Sigma)$ and $\nu = \mathcal{N}(0,\Lambda)$, and let $T$ be an optimal transport map between $\mu_E$ and $\nu_F$ (of the form (7)). We can derive a formula for the Monge–Independent coupling, for both the inner-GW problem and the Gaussian-restricted GW problem:
$$\pi_{MI} = \mathcal{N}\left(0, \begin{pmatrix}\Sigma & K\\ K^\top & \Lambda\end{pmatrix}\right),$$
where $K$ is the cross-covariance between the two marginals induced by $T$ on the subspaces (its blocks are made explicit in Appendix A.2.2), and where $T$ is an optimal transport map either for the inner-GW problem or for the Gaussian-restricted problem.
Proof. See Appendix A.2.2. □
3.4. Computation of Inner-GW between One-Dimensional Empirical Measures
In practice, computing the Gromov–Wasserstein distance from samples of the distributions is costly. From a computational point of view, the subspace detour approach provides an interesting alternative with better computational complexity when choosing 1D subspaces. Moreover, we have the intuition that the GW problem between measures lying on lower-dimensional subspaces has a better sample complexity than between the original measures, as is the case for the Wasserstein distance [31,32].
Below, we show that when both E and F are one-dimensional subspaces, the resulting GW problem between the projected measures can be solved in linear time. This relies on a new closed-form expression of the GW problem in 1D. Vayer et al. [12] provided a closed form for GW with the quadratic loss in one dimension between discrete measures containing the same number of points and with uniform weights. However, in our framework, the 1D projection of $\mu$ may not have uniform weights, and we would also like to be able to compare distributions with different numbers of points. We provide in the next proposition a closed-form expression for the inner-GW problem between arbitrary unidimensional discrete probability distributions:
Proposition 5. Let $\Sigma_n = \{a \in \mathbb{R}_+^n \mid \sum_{i=1}^n a_i = 1\}$ denote the probability simplex with $n$ bins. For a vector $x$, we denote $x^\downarrow$ as the vector with values sorted decreasingly, i.e., $x^\downarrow_1 \ge \dots \ge x^\downarrow_n$. Let $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$ with $a \in \Sigma_n$, $b \in \Sigma_m$. Suppose that $x = x^\downarrow$ and $y = y^\downarrow$. Consider the problem:
$$\min_{\gamma\in\Pi(a,b)} \sum_{i,j,k,l} (x_i x_k - y_j y_l)^2\,\gamma_{ij}\gamma_{kl}. \quad (8)$$
Then, there exists $s \in \{-1,+1\}$ such that $\gamma = NW(a,b)$ is an optimal solution of (8) for the points $(s\,x, y)$ sorted decreasingly, where $NW(a,b)$ is the North-West corner rule defined in Algorithm 1. As a corollary, an optimal solution of (8) can be found in $O(n+m)$ operations (once the points are sorted).
Algorithm 1 North-West corner rule $NW(a,b)$: set $i = j = 1$ and $\gamma = 0$; while $i \le n$ and $j \le m$, set $\gamma_{ij} = \min(a_i, b_j)$, subtract this mass from both $a_i$ and $b_j$, and increment $i$ (resp. $j$) whenever $a_i$ (resp. $b_j$) reaches zero.
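A minimal sketch of Algorithm 1 could look as follows; the function name and the numerical tolerance are our choices.

```python
import numpy as np

def north_west_corner(a, b):
    """North-West corner rule NW(a, b): greedily fill the coupling from the
    top-left corner, moving right/down as each weight is exhausted.
    Runs in O(n + m)."""
    a, b = a.astype(float).copy(), b.astype(float).copy()
    n, m = len(a), len(b)
    gamma = np.zeros((n, m))
    i = j = 0
    while i < n and j < m:
        mass = min(a[i], b[j])
        gamma[i, j] = mass
        a[i] -= mass
        b[j] -= mass
        if a[i] <= 1e-15:  # source weight exhausted: move to the next row
            i += 1
        if b[j] <= 1e-15:  # target weight exhausted: move to the next column
            j += 1
    return gamma
```

Following Proposition 5, one would apply it to supports sorted decreasingly, once for $x$ and once for $-x$, and keep the coupling with the lower inner-GW cost.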
Proof. Let $\gamma \in \Pi(a,b)$. Then:
$$\sum_{i,j,k,l}(x_i x_k - y_j y_l)^2\gamma_{ij}\gamma_{kl} = \sum_{i,k} x_i^2 x_k^2\,a_i a_k - 2\Big(\sum_{i,j} x_i y_j\,\gamma_{ij}\Big)^2 + \sum_{j,l} y_j^2 y_l^2\,b_j b_l.$$
However, the first and last terms depend only on the marginals $a$ and $b$, so they do not depend on $\gamma$. Hence, the problem (8) is equivalent (in terms of the OT plan) to $\max_{\gamma\in\Pi(a,b)} \big|\sum_{i,j} x_i y_j \gamma_{ij}\big|$, which is also equivalent to solving $\max_\gamma \sum_{i,j} x_i y_j\gamma_{ij}$ or $\max_\gamma \sum_{i,j} (-x_i) y_j\gamma_{ij}$ and keeping the larger value, or equivalently:
$$\min_{\gamma\in\Pi(a,b)} \sum_{i,j} c_{ij}\,\gamma_{ij},\quad \text{with } c_{ij} = \mp x_i y_j. \quad (9)$$
We have two cases to consider. If the maximum is attained by the first problem, we have to solve (9) with $c_{ij} = -x_i y_j$. Since the points are sorted decreasingly, the matrix $c$ satisfies the Monge property [33]:
$$\forall i < i',\ j < j',\quad c_{ij} + c_{i'j'} \le c_{ij'} + c_{i'j}. \quad (10)$$
To see this, check that:
$$c_{ij} + c_{i'j'} - c_{ij'} - c_{i'j} = -(x_i - x_{i'})(y_j - y_{j'}) \le 0. \quad (11)$$
In this case, the North-West corner rule defined in Algorithm 1 is known to produce an optimal solution to the linear problem (9) [33]. If the maximum is attained by the second problem, then changing $x$ to $-x$ concludes. □
We emphasize that this result is novel and generalizes [12] in the sense that the distributions do not need to have uniform weights and the same number of points. In addition, Theorem 2 is not directly applicable to this setting since it requires absolutely continuous distributions, which is not the case here. Both results are, however, related, as the solution obtained by using the NW corner rule on the sorted samples is the same as that obtained by considering the coupling obtained from the quantile functions. The previous result could also be used to define tractable alternatives to GW in the same manner as the Sliced Gromov–Wasserstein [12]; we sketch this idea below.
3.5. Illustrations
We use the Python Optimal Transport (POT) library [34] to compute the different optimal transport problems involved in this illustration. We are interested here in solving a 3D mesh registration problem, which is a natural application of Gromov–Wasserstein [3] since it enjoys invariances with respect to isometries such as permutations and can also naturally exploit the topology of the meshes. For this purpose, we selected two base meshes from the FAUST dataset [35] and used the Fiedler vector [36] of each mesh graph as a one-dimensional subspace for the subspace detour, comparing the resulting registration against Gromov–Wasserstein mappings computed from adjacency, heat kernel, and geodesic distance matrices (see Figure 2).
4. Triangular Coupling as Limit of Optimal Transport Plans for Quadratic Cost
Another interesting property of the Monge–Knothe coupling derived in Muzellec and Cuturi [14] is that it can be obtained as the limit of classical optimal transport plans, similar to Theorem 1, using a separable cost of the form:
$$c_t(x,y) = \|P_E(x-y)\|_2^2 + t\,\|P_{E^\perp}(x-y)\|_2^2,$$
with $t \to 0$ and where the projections are expressed in an orthonormal basis of $E \oplus E^\perp$. However, this property is not valid for the classical Gromov–Wasserstein cost (e.g., with $c_X(x,x') = \|x-x'\|_2^2$ or $c_X(x,x') = \langle x,x'\rangle$), as the cost is not separable. Motivated by this question, we ask in the following whether we can derive a quadratic optimal transport cost for which we would have this property.
Formally, we derive a new quadratic optimal transport problem using the Hadamard product. We show that this problem is well defined and that it has interesting properties, such as invariance with respect to axial symmetries. We also show that it can be related to a triangular coupling in a similar fashion to the classical optimal transport problem with the Knothe–Rosenblatt rearrangement.
4.1. Construction of the Hadamard–Wasserstein Problem
In this part, we define the "Hadamard–Wasserstein" problem between $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ as:
$$HW^2(\mu,\nu) = \inf_{\gamma\in\Pi(\mu,\nu)} \iint \|x\odot x' - y\odot y'\|_2^2\,\mathrm{d}\gamma(x,y)\,\mathrm{d}\gamma(x',y'), \quad (12)$$
where ⊙ is the Hadamard product (element-wise product). This problem differs from the Gromov–Wasserstein problem in the sense that we no longer compare intra-space distances, but rather the Hadamard products between vectors of the two spaces (in the same fashion as the classical Wasserstein distance). Hence, we need the two measures to lie in the same Euclidean space. Let us note $L$ as the cost defined as:
$$L(x, x', y, y') = \|x\odot x' - y\odot y'\|_2^2. \quad (13)$$
We observe that it coincides with the inner-GW loss (4) in one dimension, since the Hadamard product of scalars is the ordinary product. Therefore, by Theorem 2, we know that we have a closed-form solution in 1D.
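A quick sanity check of the cost (13) and of its 1D coincidence with the inner-GW loss; the helper name is ours.

```python
import numpy as np

def hadamard_cost(x, xp, y, yp):
    """L(x, x', y, y') = ||x (.) x' - y (.) y'||_2^2, i.e., cost (13)."""
    return float(np.sum((x * xp - y * yp) ** 2))

# In 1D the Hadamard product is the scalar product, so L is (<x,x'> - <y,y'>)^2:
x, xp, y, yp = 1.5, -0.3, 0.7, 2.0
assert np.isclose(hadamard_cost(np.array([x]), np.array([xp]),
                                np.array([y]), np.array([yp])),
                  (x * xp - y * yp) ** 2)
```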
4.2. Properties
First, we derive some useful properties of (12), which are usual for the regular Gromov–Wasserstein problem. Formally, we show that the problem is well defined and that $HW$ is a pseudometric with invariances with respect to the axes.
Proposition 6. Let $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$.
-
The problem (12) always admits a minimizer.
-
$HW$ is a pseudometric (i.e., it is symmetric, non-negative, satisfies $HW(\mu,\mu) = 0$, and satisfies the triangle inequality).
-
$HW$ is invariant to reflections with respect to the axes.
Proof. The map $(x,x',y,y') \mapsto \|x\odot x' - y\odot y'\|_2^2$
is continuous; therefore, $L$ is lower semi-continuous. Hence, by applying Lemma 2.2.1 of [26], we observe that $\gamma \mapsto \iint L\,\mathrm{d}\gamma\,\mathrm{d}\gamma$ is lower semi-continuous for the weak convergence of measures.
Now, as $\Pi(\mu,\nu)$ is a compact set (see the proof of Theorem 1.7 in [19] for the Polish space case and of Theorem 1.4 for the compact metric space case) and the objective is lower semi-continuous for the weak convergence, we can apply the Weierstrass theorem (Memo 2.2.1 in [26]), which states that (12) always admits a minimizer.
For the pseudometric properties, see Theorem 16 in [25].
For the invariances, we first look at the properties that must be satisfied by a map $T$ in order to have $HW(T_\#\mu, T_\#\nu) = HW(\mu,\nu)$, namely that $T$ preserves Hadamard products: for all $x, x'$, $T(x)\odot T(x') = x\odot x'$.
We find that each coordinate map must satisfy $T_i(x)^2 = x_i^2$ because, denoting $(e_1,\dots,e_d)$ as the canonical basis and testing the product relation against the basis vectors, we obtain coordinate-wise constraints,
which implies that $T_i(x) = \pm x_i$, with a sign that cannot depend on $x$ (otherwise the product relation fails for some pairs of points); therefore, $T$ must flip a fixed subset of coordinates.
If we take for $T$ a reflection with respect to the axes, then it indeed satisfies these conditions. Moreover, this defines a proper equivalence relation, and therefore we have a distance on the quotient space. □
$HW$ loses some properties compared to GW. Indeed, it is only invariant with respect to the axes, and it can only compare measures lying in the same Euclidean space in order for the distance to be well defined. Nonetheless, we show in the following that we can derive some links with triangular couplings in the same way as the Wasserstein distance with KR.
Indeed, the cost $L$ (13) is separable and reduces to the inner-GW loss in 1D, for which we have a closed-form solution. We can therefore define a degenerated version of it:
$$L_t(x,x',y,y') = \sum_{i=1}^d \lambda_i(t)\,(x_i x'_i - y_i y'_i)^2, \quad (14)$$
with, for all $t > 0$ and all $i$, $\lambda_i(t) > 0$, $\lambda_1(t) = 1$, and, for all $i \ge 2$, $\lambda_i(t)/\lambda_{i-1}(t) \to 0$ as $t \to 0$. We denote $HW_t$ as the problem (12) with the degenerated cost (14). Therefore, we will be able to decompose the objective coordinate by coordinate and to use the same induction reasoning as [21]. Then, we can define a triangular coupling different from the Knothe–Rosenblatt rearrangement in the sense that each map will not necessarily be nondecreasing. Indeed, following Theorem 2, the solution of each 1D problem:
$$\inf_{\gamma\in\Pi(\mu_1,\nu_1)} \iint (x x' - y y')^2\,\mathrm{d}\gamma\,\mathrm{d}\gamma$$
is either the increasing or the decreasing rearrangement. Hence, at each step $k$, if we disintegrate the joint law of the $k$ first variables, the $k$-th map is chosen as the better of the increasing and decreasing rearrangements between the conditional one-dimensional measures. We now state the main theorem, where we show that the limit of the OT plans obtained with the degenerated cost is the triangular coupling we just defined.
Theorem 3. Let μ and ν be two absolutely continuous measures on $\mathbb{R}^d$ with compact supports. Let $\gamma_t$ be an optimal transport plan for $HW_t$, let $T_K$ be the alternate Knothe–Rosenblatt map between μ and ν as defined in the last paragraph, and let $\gamma_K = (\mathrm{Id}\times T_K)_\#\mu$ be the associated transport plan. Then, we have $\gamma_t \rightharpoonup \gamma_K$ weakly as $t \to 0$. Moreover, if the $\gamma_t$ are induced by transport maps $T_t$, then $T_t \to T_K$ in $L^2(\mu)$.
Proof. See Appendix B.2. □
However, we cannot extend this theorem to the subspace detour approach. Indeed, by expressing the cost in an orthonormal basis of $E \oplus E^\perp$ and degenerating the coordinates, we would project coordinate-wise on $E$ (respectively on $E^\perp$), which is generally different from the couplings between the measures projected on the subspaces (respectively on their orthogonal complements).
4.3. Solving Hadamard–Wasserstein in the Discrete Setting
In this part, we derive formulas to solve (12) numerically. Let $x_1,\dots,x_n \in \mathbb{R}^d$, $y_1,\dots,y_m \in \mathbb{R}^d$, $p \in \Sigma_n$, $q \in \Sigma_m$, and let $\mu = \sum_{i=1}^n p_i\delta_{x_i}$ and $\nu = \sum_{j=1}^m q_j\delta_{y_j}$ be two discrete measures in $\mathbb{R}^d$. The Hadamard–Wasserstein problem (12) becomes, in the discrete setting:
$$\min_{\gamma\in\Pi(p,q)} E(\gamma) = \sum_{i,j,k,l} \|x_i\odot x_k - y_j\odot y_l\|_2^2\,\gamma_{ij}\gamma_{kl},$$
with $\Pi(p,q) = \{\gamma \in \mathbb{R}_+^{n\times m} \mid \gamma\mathbb{1}_m = p,\ \gamma^\top\mathbb{1}_n = q\}$. As noted in [9], if we note $L(X,Y)\otimes\gamma$ as the tensor-matrix multiplication, then we have $E(\gamma) = \langle L(X,Y)\otimes\gamma, \gamma\rangle$, where ⊗ is defined as:
$$\big(L(X,Y)\otimes\gamma\big)_{ij} = \sum_{k,l} L(x_i, x_k, y_j, y_l)\,\gamma_{kl}.$$
We show in the next proposition a decomposition of $L(X,Y)\otimes\gamma$, which allows us to compute this tensor product more efficiently.
Proposition 7. Let $L(x,x',y,y') = \|x\odot x' - y\odot y'\|_2^2$, where $x,x',y,y' \in \mathbb{R}^d$. Let us note $X = (x_1,\dots,x_n)^\top \in \mathbb{R}^{n\times d}$, $Y = (y_1,\dots,y_m)^\top \in \mathbb{R}^{m\times d}$, $A = X\odot X$, $B = Y\odot Y$, $p = \gamma\mathbb{1}_m$, $q = \gamma^\top\mathbb{1}_n$, and $M \in \mathbb{R}^d$ with $M_s = \sum_{k,l} X_{ks}\,\gamma_{kl}\,Y_{ls}$. Then:
$$L(X,Y)\otimes\gamma = A(A^\top p)\,\mathbb{1}_m^\top + \mathbb{1}_n\,\big(B(B^\top q)\big)^\top - 2\,X\,\mathrm{diag}(M)\,Y^\top.$$
Proof. First, we can start by writing:
$$\big(L(X,Y)\otimes\gamma\big)_{ij} = \sum_{k,l}\gamma_{kl}\sum_{s=1}^d \Big((X_{is}X_{ks})^2 - 2\,X_{is}X_{ks}Y_{js}Y_{ls} + (Y_{js}Y_{ls})^2\Big).$$
We cannot directly apply Proposition 1 from [9] (as the middle term couples the coordinates through a scalar product), but by performing the same type of computation, we obtain: the first term equals $\big(A(A^\top p)\big)_i$ since it depends only on the marginal $p$; by symmetry, the third term equals $\big(B(B^\top q)\big)_j$; and the middle term equals $-2\sum_s X_{is}Y_{js}M_s$ with $M_s = \sum_{k,l}X_{ks}\gamma_{kl}Y_{ls}$. Summing the three terms gives the stated decomposition. □
From this decomposition, we can compute the tensor product with a complexity of $O(nmd)$ using only matrix multiplications (instead of $O(n^2m^2d)$ for a naive computation).
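A sketch of this computation, under our reconstruction of the proposition above (variable names A, B, M are ours); the naive reference is included only for checking on tiny inputs.

```python
import numpy as np

def hw_tensor_product(X, Y, gamma):
    """Compute L(X, Y) (x) gamma in O(nmd) via the decomposition above."""
    p = gamma.sum(axis=1)               # first marginal, shape (n,)
    q = gamma.sum(axis=0)               # second marginal, shape (m,)
    A, B = X ** 2, Y ** 2               # element-wise squares
    term1 = (A @ (A.T @ p))[:, None]    # (A (A^T p))_i, constant in j
    term3 = (B @ (B.T @ q))[None, :]    # (B (B^T q))_j, constant in i
    M = (X * (gamma @ Y)).sum(axis=0)   # M_s = x_{:,s}^T gamma y_{:,s}
    term2 = (X * M) @ Y.T               # sum_s X_is M_s Y_js
    return term1 + term3 - 2.0 * term2

def hw_tensor_naive(X, Y, gamma):
    """Naive O(n^2 m^2 d) reference, for correctness checks only."""
    n, m = gamma.shape
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for k in range(n):
                for l in range(m):
                    out[i, j] += gamma[k, l] * ((X[i] * X[k] - Y[j] * Y[l]) ** 2).sum()
    return out
```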
For the degenerated cost function (14), we just need to replace $X$ and $Y$ by $X\,\mathrm{diag}\big(\lambda(t)\big)^{1/4}$ and $Y\,\mathrm{diag}\big(\lambda(t)\big)^{1/4}$ in the previous proposition.
To solve this problem numerically, we can use the conditional gradient algorithm (Algorithm 2 in [41]). This algorithm only requires computing the gradient:
$$\nabla E(\gamma) = 2\,L(X,Y)\otimes\gamma$$
at each step, together with solving a classical linear OT problem. This is more efficient than solving the quadratic problem directly and, while the problem is nonconvex, it actually converges to a local stationary point [42]. A minimal sketch of such a solver is given below.
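This sketch reuses hw_tensor_product from the previous snippet; the fixed Frank–Wolfe step size and iteration budget are our simplifications (the conditional gradient of [41] additionally uses a line search).

```python
import numpy as np
import ot  # POT, for the linear OT subproblem at each iteration

def hadamard_wasserstein_cg(X, Y, p, q, n_iter=100):
    """Frank-Wolfe on gamma -> <L(X,Y) (x) gamma, gamma> over Pi(p, q)."""
    gamma = np.outer(p, q)                            # feasible initialization
    for it in range(1, n_iter + 1):
        grad = 2.0 * hw_tensor_product(X, Y, gamma)   # gradient of the quadratic
        direction = ot.emd(p, q, grad)                # linear minimization oracle
        step = 2.0 / (it + 2.0)                       # classical FW schedule
        gamma = (1.0 - step) * gamma + step * direction
    return gamma
```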
In Figure 3, we generated 30 points from 2 Gaussian distributions and computed the optimal coupling of $HW_t$ for several values of $t$. These points have the same uniform weight. We plot the couplings between the points on the second row, and between the points projected on their first coordinate on the first row. Note that for discrete points, the Knothe–Rosenblatt coupling amounts to sorting the points with respect to the first coordinate if there is no ambiguity (i.e., if the first coordinates are pairwise distinct), as it comes back to performing the optimal transport in one dimension [43] (Remark 2.28). For our cost, the optimal coupling in 1D can be either the increasing or the decreasing rearrangement. We observe on the first row of Figure 3 that the optimal coupling when $t$ is close to 0 corresponds to the decreasing rearrangement, which corresponds well to the alternate Knothe–Rosenblatt map we defined in Section 4.2. This illustrates the result of Theorem 3.
5. Discussion
We proposed in this work to extend the subspace detour approach to different subspaces and to other optimal transport costs such as Gromov–Wasserstein. Being able to project on different subspaces can be useful when the data are not aligned and do not share the same axes of interest, as well as when we are working between different metric spaces, as is the case, for example, with graphs. However, a question that arises is how to choose these subspaces. Since the method is mostly interesting when we choose one-dimensional subspaces, we proposed to use a PCA and to project on the first directions for data embedded in Euclidean spaces. For more complicated data such as graphs, we projected onto the Fiedler vector and obtained good results in an efficient way on a 3D mesh registration problem. More generally, Muzellec and Cuturi [14] proposed to perform a gradient descent on the loss with respect to orthonormal matrices. This approach is non-convex and is only guaranteed to converge to a local minimum. Designing such an algorithm, which would minimize alternately between two transformations in the Stiefel manifold, is left for future work.
The subspace detour approach for transport problems is meaningful whenever one can identify subspaces that gather most of the information of the original distributions, while making the estimates more robust and endowed with a better sample complexity, since the dimensions are lower. On the computational complexity side, when we only have access to discrete data, the subspace detour approach brings a better computational complexity solely when the subspaces are chosen to be one-dimensional; otherwise, we have the same complexity for solving the subspace detour as for solving the OT problem directly (since the complexity only depends on the number of samples). In this case, the 1D projection often gives distinct values for all the samples (for continuous-valued data), and hence the Monge–Knothe coupling is exactly the coupling in 1D. As such, information is lost on the orthogonal spaces. It can be artificially recovered by quantizing the 1D values (as experimented in practice in [14]), but the added value is not clear and deserves broader studies. For distributions that are absolutely continuous w.r.t. the Lebesgue measure, however, this limitation does not exist, but this setting comes with the extra cost of being able to compute the projected measures onto the subspaces efficiently, which might require discretization of the space and is therefore not practical in high dimension.
We also proposed a new quadratic cost that we call Hadamard–Wasserstein, which allows us to define a degenerated cost for which the optimal transport plan converges to a triangular coupling. However, this cost loses many properties compared to $W_2$ or GW, which limits its use. Indeed, while $HW$ is a quadratic cost, it uses a Euclidean norm between the Hadamard products of vectors and requires the two spaces to be the same (in order for the distance to be well defined). A workaround in the case $\mu \in \mathcal{P}(\mathbb{R}^p)$ and $\nu \in \mathcal{P}(\mathbb{R}^q)$ with $p < q$ would be to "lift" the vectors of $\mathbb{R}^p$ into vectors of $\mathbb{R}^q$ with padding, as proposed in [12], or to project the vectors of $\mathbb{R}^q$ onto $\mathbb{R}^p$, as in [6]. Yet, for some applications where only the distance/similarity matrices are available, a different strategy still needs to be found. Another concern is the limited invariance properties (only with respect to axial symmetry in our case). Nevertheless, we expect that such a cost can be of interest in cases where invariance to symmetry is a desired property, such as in [44].
Author Contributions: Methodology, C.B., N.C., T.V., F.S., and L.D.; software, C.B. and N.C.; writing—original draft, C.B., N.C., T.V., F.S., and L.D. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by project DynaLearn from Labex CominLabs and Région Bretagne ARED DLearnMe. N.C. acknowledges funding from the ANR OTTOPIA AI chair (ANR-20-CHIA-0030). T.V. was supported in part by the AllegroAssai ANR project (ANR-19-CHIA-0009) and by the ACADEMICS grant of the IDEXLYON, project of the Université de Lyon, PIA operated by ANR-16-IDEX-0005.
Data Availability Statement: The FAUST dataset might be found at
Conflicts of Interest: The authors declare no conflict of interest.
The following abbreviations are used in this manuscript:
OT | Optimal Transport |
GW | Gromov–Wasserstein |
KR | Knothe–Rosenblatt |
MI | Monge–Independent |
MK | Monge–Knothe |
PCA | Principal Component Analysis |
POT | Python Optimal Transport |
Figure 1. From left to right: Data (moons); OT plan obtained with GW for the squared Euclidean cost; Data projected on the first axis; OT plan obtained between the projected measures; Data projected on their first PCA component; OT plan obtained between the projected measures.
Figure 2. Three-dimensional mesh registration. (First row) source and target meshes, color code of the source, ground truth color code on the target, result of subspace detour using Fiedler vectors as subspace. (Second row) After recalling the expected ground truth for ease of comparison, we present results of different Gromov–Wasserstein mappings obtained with metrics based on adjacency, heat kernel, and geodesic distances.
Figure 3. Degenerated coupling. On the first row, the points are projected on their first coordinate and we plot the optimal coupling. On the second row, we plot the optimal coupling between the original points.
Appendix A. Subspace Detours
Appendix A.1. Proofs
We first deal with the quadratic cost $c_X(x,x') = \|x-x'\|_2^2$. Let $T$ be an isometry on $E^\perp$ and $\tilde T: x \mapsto \big(x_E, T(x_{E^\perp})\big)$.
From Lemma 6 of Paty and Cuturi [27], the map $\gamma \mapsto (\tilde T\times\mathrm{Id})_\#\gamma$ is a bijection between $\Pi_{E,F}(\mu,\nu)$ and $\Pi_{E,F}(\tilde T_\#\mu,\nu)$.
Now, for all $\gamma' \in \Pi_{E,F}(\tilde T_\#\mu,\nu)$, there exists $\gamma \in \Pi_{E,F}(\mu,\nu)$ such that $\gamma' = (\tilde T\times\mathrm{Id})_\#\gamma$.
For such couplings, since $\tilde T$ leaves the $E$ component unchanged and acts as an isometry on $E^\perp$, we have $\|\tilde T(x)-\tilde T(x')\|_2^2 = \|x-x'\|_2^2$, so the integrand of the GW objective is unchanged.
By integrating with respect to the couplings, the objectives of $\gamma$ and $\gamma'$ coincide. Now, by taking the infimum with respect to the couplings, we obtain $GW_{E,F}(\tilde T_\#\mu,\nu) = GW_{E,F}(\mu,\nu)$.
For the inner product case, we can do the same proof for linear isometries on $E^\perp$, which preserve $\langle x, x'\rangle$.
Appendix A.2. Closed-Form between Gaussians
Let $\mu = \mathcal{N}(m_\mu,\Sigma) \in \mathcal{P}(\mathbb{R}^p)$ and $\nu = \mathcal{N}(m_\nu,\Lambda) \in \mathcal{P}(\mathbb{R}^q)$.
Let $E \subset \mathbb{R}^p$ and $F \subset \mathbb{R}^q$ be $k$- and $k'$-dimensional subspaces.
We represent $\Sigma$ and $\Lambda$ in orthonormal bases of $E\oplus E^\perp$ and $F\oplus F^\perp$, as in Section 3.3, with the Schur complements $\Sigma/\Sigma_E$ and $\Lambda/\Lambda_F$ as the covariances of the conditionals.
Appendix A.2.1. Quadratic GW Problem
For the Gaussian-restricted problem (6), Salmona et al. [30] showed that an optimal solution is of the form (7).
Since the problem is translation invariant, we can always solve the problem between the centered measures.
In the following, we suppose that $m_\mu = 0$ and $m_\nu = 0$.
We know that the Monge–Knothe transport map will be a linear map $T_{MK}(x) = Bx$, with $B$ block triangular in the chosen bases, whose diagonal blocks are the optimal maps $T_E$ (between $\mu_E$ and $\nu_F$) and $T_\perp$ (between the conditionals, whose covariances are the Schur complements).
First, we verify that $B$ pushes $\mu$ forward to $\nu$, which reduces to the block equation $B\Sigma B^\top = \Lambda$. The diagonal blocks of this equation are satisfied by the choice of $T_E$ and $T_\perp$; the two remaining (off-diagonal) equations determine $C$.
Then, $T_{MK}$ is well defined and is the Monge–Knothe map announced in Proposition 3. □
Appendix A.2.2. Closed-Form between Gaussians for Monge–Independent
Suppose $\mu = \mathcal{N}(0,\Sigma)$ and $\nu = \mathcal{N}(0,\Lambda)$, and let $T$ be an optimal transport map between $\mu_E$ and $\nu_F$.
For the Monge–Independent plan, $\pi_{MI} = \gamma^*_{E\times F}\otimes\big(\mu_{E^\perp|x_E}\otimes\nu_{F^\perp|y_F}\big)$, with $\gamma^*_{E\times F} = (\mathrm{Id}\times T)_\#\mu_E$.
Let us assume that all blocks are expressed in coordinates adapted to $E\oplus E^\perp$ and $F\oplus F^\perp$.
Hence, the plan is Gaussian, and computing it amounts to computing the cross-covariance $K$ between the two marginals, the conditional covariances being given by the Schur complements.
We also have that the conditional means are linear in the subspace variables, which gives the off-diagonal blocks of $K$ in terms of $T$, $\Sigma$, and $\Lambda$.
Finally, we find the covariance $\Gamma = \begin{pmatrix}\Sigma & K\\ K^\top & \Lambda\end{pmatrix}$ announced in Proposition 4.
By taking orthogonal bases of $E\oplus E^\perp$ and $F\oplus F^\perp$, the expressions simplify.
To check the result, just expand the terms and see that the marginals of $\mathcal{N}(0,\Gamma)$ are $\mu$ and $\nu$ and that its projection on $E\times F$ is $\gamma^*_{E\times F}$. □
Appendix B. Knothe–Rosenblatt
Appendix B.1. Properties of (12)
In a slightly more general setting, let the loss be any continuous quadratic interaction cost as in (12); the pseudometric properties of (12) then follow from the general results on Gromov-type distances between networks [25].
See Theorem 3.1 in [25]. □
Appendix B.2. Proof of Theorem 3
We first recall a useful theorem on weak convergence (Theorem 2.8 in Billingsley [45]), which allows us to deduce the weak convergence of a sequence of measures from the convergence of its subsequences.
The following proof is mainly inspired by the proof of Theorem 1 in [21].
Let $\gamma_t$ be an optimal plan for $HW_t$ and let $\gamma_K$ be the plan associated with the alternate Knothe–Rosenblatt map. First, let us denote by $\gamma$ the weak limit of a subsequence of $(\gamma_t)_t$.
Part 1:
First, let us notice that, by optimality, the cost of $\gamma_t$ is no greater than that of $\gamma_K$. Moreover, as $\lambda_1(t) = 1$ while the other weights $\lambda_i(t)$ vanish when $t \to 0$, the term on the first coordinates dominates. By denoting the projections of the plans on the first coordinates and passing to the limit, we obtain that the projection of $\gamma$ must be optimal for the 1D problem between the first marginals; however, so is the projection of $\gamma_K$.
Part 2:
We know that, for any $t$, the cost of $\gamma_t$ is sandwiched between the optimal value and the value of $\gamma_K$. Therefore, we have the following inequality chain, from which we can subtract the first term and factorize by $\lambda_2(t)$. By dividing by $\lambda_2(t)$ and letting $t \to 0$, the two remaining terms depend only on the second coordinates, conditionally on the first ones. We can rewrite the previous inequality in terms of the disintegrations. Now, we will assume at first that the marginals of the disintegrations converge; we deduce that the disintegration of $\gamma$ with respect to the first coordinates is optimal between the conditional measures. Now, we still have to show that the marginals of the disintegrations indeed converge; first, we can use the projections together with the theorem of Billingsley recalled above.
Part 3:
Now, we can proceed the same way by induction over the coordinates. For this part of the proof, we rely on [21]. Hence, we have, at each order, optimality of the conditional couplings; as before, this follows by subtracting the dominant terms, dividing by the next weight, and passing to the limit. For the continuity arguments, we can apply the corresponding results of [21]. By taking the limit $t \to 0$, we can now disintegrate with respect to the first $k$ coordinates and identify the conditionals of $\gamma$ at every order.
Part 4:
Therefore, we have indeed $\gamma = \gamma_K$ for every weak limit, which proves that $\gamma_t \rightharpoonup \gamma_K$. When the $\gamma_t$ are induced by maps $T_t$, the convergence of $T_t$ to $T_K$ in $L^2(\mu)$ follows as in [21]. □
References
1. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. Proceedings of the International Conference on Machine Learning, PMLR; Sydney, Australia, 6–11 August 2017; pp. 214-223.
2. Courty, N.; Flamary, R.; Tuia, D.; Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell.; 2016; 39, pp. 1853-1865. [DOI: https://dx.doi.org/10.1109/TPAMI.2016.2615921] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27723579]
3. Mémoli, F. Gromov–Wasserstein distances and the metric approach to object matching. Found. Comput. Math.; 2011; 11, pp. 417-487. [DOI: https://dx.doi.org/10.1007/s10208-011-9093-5]
4. Chowdhury, S.; Miller, D.; Needham, T. Quantized Gromov–Wasserstein. Proceedings of the Machine Learning and Knowledge Discovery in Databases, Research Track—European Conference, ECML PKDD 2021; Bilbao, Spain, 13–17 September 2021; Proceedings, Part III Oliver, N.; Pérez-Cruz, F.; Kramer, S.; Read, J.; Lozano, J.A. Springer: Berlin/Heidelberg, Germany, 2021; Volume 12977, pp. 811-827. [DOI: https://dx.doi.org/10.1007/978-3-030-86523-8_49]
5. Alvarez-Melis, D.; Jegelka, S.; Jaakkola, T.S. Towards optimal transport with global invariances. Proceedings of the The 22nd International Conference on Artificial Intelligence and Statistics, PMLR; Naha, Japan, 16–18 April 2019; pp. 1870-1879.
6. Cai, Y.; Lim, L.H. Distances between probability distributions of different dimensions. arXiv; 2020; arXiv: 2011.00629
7. Mémoli, F. On the use of Gromov-Hausdorff Distances for Shape Comparison. Proceedings of the 4th Symposium on Point Based Graphics, PBG@Eurographics 2007; Prague, Czech Republic, 2–3 September 2007; Botsch, M.; Pajarola, R.; Chen, B.; Zwicker, M. Eurographics Association: Geneva, Switzerland, 2007; pp. 81-90. [DOI: https://dx.doi.org/10.2312/SPBG/SPBG07/081-090]
8. Sturm, K.T. The space of spaces: Curvature bounds and gradient flows on the space of metric measure spaces. arXiv; 2012; arXiv: 1208.0434
9. Peyré, G.; Cuturi, M.; Solomon, J. Gromov–Wasserstein averaging of kernel and distance matrices. Proceedings of the International Conference on Machine Learning, PMLR; New York, NY, USA, 19–24 June 2016; pp. 2664-2672.
10. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Adv. Neural Inf. Process. Syst.; 2013; 26, pp. 2292-2300.
11. Scetbon, M.; Peyré, G.; Cuturi, M. Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs. arXiv; 2021; arXiv: 2106.01128
12. Vayer, T.; Flamary, R.; Courty, N.; Tavenard, R.; Chapel, L. Sliced Gromov–Wasserstein. Adv. Neural Inf. Process. Syst.; 2019; 32, pp. 14753-14763.
13. Fatras, K.; Zine, Y.; Majewski, S.; Flamary, R.; Gribonval, R.; Courty, N. Minibatch optimal transport distances; analysis and applications. arXiv; 2021; arXiv: 2101.01792
14. Muzellec, B.; Cuturi, M. Subspace Detours: Building Transport Plans that are Optimal on Subspace Projections. Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019; Vancouver, BC, Canada, 8–14 December 2019; Wallach, H.M.; Larochelle, H.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E.B.; Garnett, R. 2019; pp. 6914-6925.
15. Bogachev, V.I.; Kolesnikov, A.V.; Medvedev, K.V. Triangular transformations of measures. Sb. Math.; 2005; 196, 309. [DOI: https://dx.doi.org/10.1070/SM2005v196n03ABEH000882]
16. Knothe, H. Contributions to the theory of convex bodies. Mich. Math. J.; 1957; 4, pp. 39-52. [DOI: https://dx.doi.org/10.1307/mmj/1028990175]
17. Rosenblatt, M. Remarks on a multivariate transformation. Ann. Math. Stat.; 1952; 23, pp. 470-472. [DOI: https://dx.doi.org/10.1214/aoms/1177729394]
18. Villani, C. Optimal Transport: Old and New; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008; Volume 338.
19. Santambrogio, F. Optimal Transport for Applied Mathematicians; Birkhäuser: New York, NY, USA, 2015; 55, 94.
20. Jaini, P.; Selby, K.A.; Yu, Y. Sum-of-Squares Polynomial Flow. Proceedings of the 36th International Conference on Machine Learning, PMLR; Long Beach, CA, USA, 9–15 June 2019; pp. 3009-3018.
21. Carlier, G.; Galichon, A.; Santambrogio, F. From Knothe’s transport to Brenier’s map and a continuation method for optimal transport. SIAM J. Math. Anal.; 2010; 41, pp. 2554-2576. [DOI: https://dx.doi.org/10.1137/080740647]
22. Bonnotte, N. Unidimensional and Evolution Methods for Optimal Transportation. Ph.D. Thesis; Université Paris-Sud: Paris, France, 2013.
23. Ambrosio, L.; Gigli, N.; Savaré, G. Gradient Flows: In Metric Spaces and in the Space of Probability Measures; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008.
24. Niles-Weed, J.; Rigollet, P. Estimation of Wasserstein distances in the spiked transport model. arXiv; 2019; arXiv: 1909.07513
25. Chowdhury, S.; Mémoli, F. The Gromov–Wasserstein distance between networks and stable network invariants. Inf. Inference A J. IMA; 2019; 8, pp. 757-787. [DOI: https://dx.doi.org/10.1093/imaiai/iaz026]
26. Vayer, T. A Contribution to Optimal Transport on Incomparable Spaces. Ph.D. Thesis; Université de Bretagne Sud: Vannes, France, 2020.
27. Paty, F.P.; Cuturi, M. Subspace robust wasserstein distances. Proceedings of the 36th International Conference on Machine Learning, PMLR; Long Beach, CA, USA, 9–15 June 2019; pp. 5072-5081.
28. Rasmussen, C.E. Gaussian processes in machine learning. Summer School on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2003; pp. 63-71.
29. Von Mises, R. Mathematical Theory of Probability and Statistics; Academic Press: Cambridge, MA, USA, 1964.
30. Salmona, A.; Delon, J.; Desolneux, A. Gromov–Wasserstein Distances between Gaussian Distributions. arXiv; 2021; arXiv: 2104.07970
31. Weed, J.; Bach, F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli; 2019; 25, pp. 2620-2648. [DOI: https://dx.doi.org/10.3150/18-BEJ1065]
32. Lin, T.; Zheng, Z.; Chen, E.; Cuturi, M.; Jordan, M. On projection robust optimal transport: Sample complexity and model misspecification. Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR; Virtual, 13–15 April 2021; pp. 262-270.
33. Burkard, R.E.; Klinz, B.; Rudolf, R. Perspectives of Monge Properties in Optimization. Discret. Appl. Math.; 1996; 70, pp. 95-161. [DOI: https://dx.doi.org/10.1016/0166-218X(95)00103-X]
34. Flamary, R.; Courty, N.; Gramfort, A.; Alaya, M.Z.; Boisbunon, A.; Chambon, S.; Chapel, L.; Corenflos, A.; Fatras, K.; Fournier, N. et al. POT: Python Optimal Transport. J. Mach. Learn. Res.; 2021; 22, pp. 1-8.
35. Bogo, F.; Romero, J.; Loper, M.; Black, M.J. FAUST: Dataset and evaluation for 3D mesh registration. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Columbus, OH, USA, 23–28 June 2014; pp. 3794-3801.
36. Fiedler, M. Algebraic connectivity of graphs. Czechoslov. Math. J.; 1973; 23, pp. 298-305. [DOI: https://dx.doi.org/10.21136/CMJ.1973.101168]
37. Hagberg, A.; Swart, P.; Chult, D.S. Exploring Network Structure, Dynamics, and Function Using NetworkX; Technical Report Los Alamos National Lab. (LANL): Los Alamos, NM, USA, 2008.
38. Wu, J.P.; Song, J.Q.; Zhang, W.M. An efficient and accurate method to compute the Fiedler vector based on Householder deflation and inverse power iteration. J. Comput. Appl. Math.; 2014; 269, pp. 101-108. [DOI: https://dx.doi.org/10.1016/j.cam.2014.03.018]
39. Xu, H.; Luo, D.; Carin, L. Scalable Gromov–Wasserstein learning for graph partitioning and matching. Adv. Neural Inf. Process. Syst.; 2019; 32, pp. 3052-3062.
40. Chowdhury, S.; Needham, T. Generalized spectral clustering via Gromov–Wasserstein learning. Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR; Virtual, 13–15 April 2021; pp. 712-720.
41. Vayer, T.; Courty, N.; Tavenard, R.; Flamary, R. Optimal transport for structured data with application on graphs. Proceedings of the 36th International Conference on Machine Learning, PMLR; Long Beach, CA, USA, 9–15 June 2019; pp. 6275-6284.
42. Lacoste-Julien, S. Convergence rate of Frank–Wolfe for non-convex objectives. arXiv; 2016; arXiv: 1607.00345
43. Peyré, G.; Cuturi, M. Computational optimal transport: With applications to data science. Found. Trends® Mach. Learn.; 2019; 11, pp. 355-607. [DOI: https://dx.doi.org/10.1561/2200000073]
44. Nagar, R.; Raman, S. Detecting approximate reflection symmetry in a point set using optimization on manifold. IEEE Trans. Signal Process.; 2019; 67, pp. 1582-1595. [DOI: https://dx.doi.org/10.1109/TSP.2019.2893835]
45. Billingsley, P. Convergence of Probability Measures; John Wiley & Sons: Hoboken, NJ, USA, 2013.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
In the context of optimal transport (OT) methods, the subspace detour approach was recently proposed by Muzellec and Cuturi. It consists of first finding an optimal plan between the measures projected on a wisely chosen subspace and then completing it in a nearly optimal transport plan on the whole space. The contribution of this paper is to extend this category of methods to the Gromov–Wasserstein problem, which is a particular type of OT distance involving the specific geometry of each distribution. After deriving the associated formalism and properties, we give an experimental illustration on a shape matching problem. We also discuss a specific cost for which we can show connections with the Knothe–Rosenblatt rearrangement.
1 Laboratoire de Mathématiques de Bretagne Atlantique, Université Bretagne Sud, CNRS UMR 6205, 56000 Vannes, France
2 ENS Lyon, CNRS UMR 5668, LIP, 69342 Lyon, France
3 Department of Computer Science, Université Bretagne Sud, CNRS UMR 6074, IRISA, 56000 Vannes, France
4 IMT Atlantique, CNRS UMR 6285, Lab-STICC, 29238 Brest, France