Learning Neural Representations and Local

Full text

Turn on search term navigation

1. Introduction

Nonlinear dimensionality reduction (NDR) mapping [1,2,3,4] addresses transforming high dimensional observations to an embedded lower dimensional manifold. NDR mapping has attracted many attentions for analyzing large volume of high dimensional observations, such as genomics [5], images [6,7], video [8] and audio signals. The goal is to preserve and visualize neighborhood relations of observations by displaying transformed images in the low dimensional output space. Principle component analysis (PCA) [9,10] extracts orthogonal eigenvectors, termed as principle components, which serve as internal representations of given training observations for linearly transforming high-dimensional observations to an output space. Linear projections of observations on selected principle components can be determined without reserving original training observations. Linear transformation by selected principle components can operate as an online process that transforms one observation at a time, but it has been shown infeasible for topology preserving [1,2,3] and can’t be directly applied for NDR mapping.

Locally linear embedding (LLE) [1] has been presented for NDR mapping and data visualization. LLE is a batch process that simultaneously determines images of all given observations in a training set $X$ . Applying the $k$ -nearest neighboring method, it recruits k closest observations to form the neighborhood $N_{k} (x)$ of each observation $x$ . It assumes a locally linear relation within $N_{k} (x)$ such that observation $x$ is a linear combination of observations in $N_{k} (x)$ . For topology preserving, images of observations in $N_{k} (x)$ are regarded as neighbors of the image $r_{x}$ of $x$ . After optimizing coefficients $c_{x}$ of a correspondent linear relation that expresses each $x$ , LLE further poses a linear relation within images of $k$ observations in $N_{k} (x)$ . Based on the assumption that $r_{x}$ is a linear combination of images of observations in $N_{k} (x)$ using $c_{x}$ , solving linear relations that express all $r_{x}$ simultaneously attains images of all observations.

In LLE, expressing $r_{x}$ by a linear relation makes use of the neighborhood and coefficients of the linear relation that expresses $x$ . Inferring the image of any novel observation during execution phase hence needs neighbors defined over all training observations. LLE cannot operate with only internal representations extracted from $X$ for image inference during testing phase. This limits portability and computational efficiency of LLE due to massive memory access to all training observations. To overcome the difficulty, this work extends LLE to locally nonlinear embedding (LNE) for NDR mapping. LNE adopts nonlinear relations for inferring images of novel observations during execution phase. LNE stands within a larger scope than LLE and can operate with only extracted internal representations for neural approximation of NDR mapping.

Similar to LLE, Isomap [2] and Laplacian Eigenmaps [11,12] maintain $N_{k} (x)$ for each $x$ . Isomap [2] applies the $k$ nearest neighboring method to calculate geodesic distances and applying the traditional multi-dimensional scaling method [13,14,15], equivalently PCA, to infer images of all observations. Laplacian Eigenmaps sketch the k-nearest-neighbor graph based on $N_{k} (x)$ for all $x$ and solve the generalized eigenvalue problem for inference of images of all observations. Both Isomap and Laplacian Eigenmaps require reserving all training observations for inferring images of novel observations during execution phase.

Self-organization maps (SOM) [16,17,18,19] as well as elastic nets (EN) [20,21] use grid-organized receptive fields as adaptable internal representations for inferring images of observations. Unsupervised learning is a process that extracts internal representations subject to training observations. Equipped with well extracted receptive fields, SOM emulates a cortex-like map, and attains a two-dimensional embedding for topology preserving mapping. It activates one and only one node in response to an observation following the winner-take-all principle. The active neural node must have a receptive field that is closest to the given observation and its geometrical location on a grid refers to the inferred image in the low-dimensional embedding. SOM infers images of novel observations during execution phase without reserving training observations. Since unsupervised learning of SOM makes use of updating operations, which directly adopt Euclidean distances among observations, it needs further improvement for NDR mapping.

The NDR mapping proposed in this work ensures properties of extracting essential internal representations and recovering the low-dimensional embedded manifold. Based on the extracted internal representations and locally nonlinear embedding, the NDR mapping infers images of novel observations during testing phase, requiring no reservation of training observations.

This work proposes graph-organized data supports to scope training observations. The union of graph-organized data supports well sketches the underlying global density support of raw observations. Internal representations of the proposed NDR mapping contain a set of receptive fields and built-in parameters of adalines (adaptive linear elements) [22,23], where receptive fields are related to represent centroids of distributed data supports. The scope of each local data support is a $K$ -dimensional regular box, where $K$ is less than or equals the dimension of the input space. A neural module consisting of $K$ pairs of adalines is employed to determine the membership of observations to a correspondent data support. An adaline neural module is an indicator to the scope of a correspondent data support. There are $M$ neural modules, respectively determining individual scopes of $M$ data supports as well as their neighboring relations. Neighboring relations among data supports are related to edges of a graph. Derived from training observations, the graph configuration describes neighboring relations among data supports. All neural modules are further extended for NDR mapping. The extension simply equips every neural module with a posterior weight that represents the image of the centroid of a correspondent data support. The image of every centroid and images of its neighboring centroids following the property of locally nonlinear embedding induce nonlinear constraints for optimizing all posterior weights.

Internal representations extracted from training observations include features well characterizing the membership to every data support. Based on extracted internal representations of $M$ neural modules as well as posterior weights, the NDR mapping following locally nonlinear embedding can infer images of novel observations during testing phase without reserving original training observations. This property highly increases portability of the proposed NDR mapping. The size of adaptable built-in parameters for the proposed NDR mapping depends on the number of neural modules and the dimension of every data support. Massive training observations are no more required during execution of the proposed NDR mapping for testing.

The challenge is to optimize adaptable built-in parameters and posterior weights of joint adaline neural modules for the proposed NDR mapping. The union of graph-organized data supports sketches a bounded domain of the proposed neural approximation for NDR mapping. The NDR mapping explored in this work transforms high dimensional observations to images in the output space that recovers the manifold embedded within the input space. It is realized by adaline neural modules extended with posterior weights. The learning process mainly contains stages respectively constructing graph-organized cluster supports and optimizing posterior weights by the Levenberg-Marquardt algorithm [24,25,26,27]. The first learning stage is aimed to optimize centroids, bulit-in parameters of adaline modules and graph interconnections for representing graph-organized data supports. The second stage is to determine posterior weights by solving a nonlinear system that characterizes distance preserving mapping of centroids to images in the output space. The proposed neural approximation realizes NDR mapping without reserving training observations, depending only on adaptable feature representations or built-in parameters. Equipped with well trained built-in parameters, the proposed NDR mapping can determine the image of a novel testing observation during execution phase by resolving constraints of locally nonlinear embedding.

2. Materials and Methods

2.1. Adaline Neural Modules for Representing Distributed Data Supports

Adaline neural modules are proposed for NDR mapping from the space of high dimensional observations to the output space of images. Those neural modules are with lateral interconnections and posterior weights for performing composite linear and nonlinear transformation. Every neural module is equipped with a receptive field as well as $K$ pairs of adalines [22,23], where $K$ denotes the projection dimension, in reception of observations. An NDR mapping model is composed of many neural modules. Like SOM and EN, there exist lateral interconnections among neural modules. The NDR mapping is realized by graph-organized neural modules, where the graph configuration represents neighboring relations among extracted data supports with interconnections dynamically derived subject to training observations.

Let $x$ denote an observation in the input space $R^{L}$ and $μ \in R^{L}$ denote a receptive field. The deference $x - μ$ is a result of de-mean, as $μ$ is related to the centroid of a data cluster. The difference $x - μ$ propagates forward through $K$ projective fields of adalines as shown in Figure 1, where $a_{k} \in R^{L}$ denotes a projective field and matrix $A = [a_{k}^{T}]$ collects $K$ receptive fields. The projection $h_{k} = {(x - μ)}^{T} a_{k}$ forms an external field to paired threshold elements of adalines,

$θ_{l} (h) = {\begin{matrix} 1, & i f h \geq l \\ 0, & o t h e r w i s e \end{matrix}$

and

$θ^{u} (h) = {\begin{matrix} 1, & i f h \leq u \\ 0, & o t h e r w i s e \end{matrix}$

when both threshold elements are active in response to

h

, it means that

h \in [l, u]

Each adaline neural module for receipting observations is equipped with $K$ projective fields and $K$ pairs of threshold elements in additional to a receptive field. Let $I_{k} = [l_{k}, u_{k}]$ , where $l_{k}$ and $u_{k}$ respectively denote the lower and upper bounds of projection $h_{k} = a_{k}^{T} (x - μ)$ . When all $K$ pairs of threshold elements are active, the tuple $(h_{1}, \dots h_{K})$ is within a K-dimensional cuboid or box defined by the Cartesian product of $K$ projection intervals, $I = I_{1} \times \dots \times I_{K}$ , expressed by

$I = {(h_{1}, \dots, h_{K}) | h_{k} = a_{k}^{T} (x - μ) \in I_{k}, k = 1, \dots, K}$

For $K = L$ , in the original observation space, a bounded region termed as a data support is expressed by

$Λ = {x | h = A (x - μ) \in I, x \in R^{L}}$

A neural module is equipped with a threshold element that has a lower bound $2 K$ for indicating membership to $Λ$ as shown in Figure 1. This threshold element activates if all of $2 K$ threshold elements contribute positive responses to given $x$ . Internal representations of a neural module in Figure 1 include a receptive field $μ$ , $K$ projective fields and $K$ pairs of lower and upper bounds, for reception of observations.

When $x \in Λ$ , equivalently $A (x - μ) \in I$ , an adaline module summarizes the membership to the cluster support $Λ$ and is activated if the external field reaches the lower threshold level, $2 K$ . If the membership threshold element is active, the attached posterior weight $r$ produces a non-zero image in the output space $R^{d}$ , where $d$ is less than $L$ for dimensionality reduction, otherwise the output is a zero vector.

The membership threshold element in an adaline neural module activates to indicate the membership of the input to a correspondent data support. The proposed NDR model consists of multiple graph-organized neural modules, each possessing its own data support. The union of all cluster supports presents an approximation to the global density support underlying training observations in the original input space for modeling the embedded manifold. An NDR neural model contains not only receptive fields for input perception but also posterior weights for output generation.

The NDR neural model shown in Figure 2 is composed of $M$ adaline neural modules. Each neural module is with internal representations, including a receptive field $μ_{m}$ , $K$ projective fields in matrix form $A_{m}$ , lower and upper bounds, respectively denoted by ${l_{m k}}_{k}$ and ${u_{m k}}_{k}$ , and a posterior weight $r_{m}$ . Figure 3 shows training observations from the Swiss roll and edges of vertices in a graph. Each vertex $m$ in the graph has a set of neighboring vertices, denoted by $N B_{m}$ , according to the graph configuration derived by the learning process. It is notable that neighboring relations of nodes exactly concise with those among correspondent data supports and lateral interconnections of joint adaline neural modules.

2.2. NDR Model Learning

The proposed neural model for NDR mapping inherits receptive fields and graph-organized neural nodes of SOM and EN. The model design further recruits neural organization of dynamical graph configuration, projective fields, threshold elements and posterior weights. For data analysis, an NDR model has adaptable built-in parameters for representing distributed cluster supports, neighboring relations of local supports, and images of centroids. NDR model learning is with objectives of well approximating the global density support underlying training observations by the union of distributed cluster supports, structuring neighborhood relations of distributed cluster supports by the dynamic graph configuration and determining images of centroids based on distance preserving mapping.

2.2.1. Clustering Analysis for Learning Receptive Fields

Given high dimensional observations are patterns without labels for NDR mapping. Let $S = {x_{i} \in R^{L}}_{i}$ be a training set. NDR model learning is subject to unlabeled training observations and is hence unsupervised. An NDR model transforms an observation to an image that is not explicitly provided in advance. This work relates receptive fields to centroids of clusters derived by clustering analysis subject to training observations in $S$ .

Mathematical modeling [28,29,30] for clustering analysis involves formulation of constraints and the objective using mixed integer and continuous variables. The mixed integer programming leads to the annealed clustering algorithm [28,29] in Appendix A, where $M$ receptive fields eventually partition training observations in $S$ to $M$ non-overlapping subsets. Let $S_{i}$ collect observations that are closest to $μ_{i}$ among $M$ receptive fields,

(1) $S_{i} = {x | i = a r g \underset{m}{m i n} ∥ x - μ_{m} ∥, x \in S} .$

The annealed clustering algorithm attains not only all $μ_{m}$ but also the exclusive membership of each $x_{t}$ to $M$ subsets, denoted by a unitary vector of binary elements, $δ [t] = (δ_{1} [t], \dots, δ_{M} [t])$ , where $δ_{m} [t] \in {0, 1}$ . There is one only one active bit among $M$ binary bits in $δ [t]$ . It is ensured by the annealed clustering algorithm that $δ_{i} [t]$ is the only active bit if and only if $x_{t}$ belongs to $S_{i}$ . $δ [t]$ represents the membership of $x_{t}$ to $M$ subsets partitioned by $M$ receptive fields.

The objective function [28,29] for optimizing continuous receptive fields and discrete memberships could have different forms under variant computation objectives. The objective function is commonly not differentiable with respect to discrete $δ_{i} [t]$ . Minimization of the objective function with respect to discrete and continuous variables by the annealed clustering algorithm has been extensively verified in previous works [28,29]. Appendix A gives a simplified version of the annealed clustering algorithm. Numerical simulations that show its effectiveness and reliability for clustering analysis have been extensively given in previous works [28,29].

2.2.2. Deriving Cluster Supports by Optimizing Adaline Modules

Each adaline module in Figure 2 plays a role of indicating the membership to a cluster support. Clustering analysis partitions training observations in $S$ to $M$ disjoint subsets. It follows $S_{i} \cap S_{j} = \emptyset$ and $S = \cup_{m} S_{m}$ . Both SOM and EN make use of receptive fields to determine the exclusive membership of an observation following the winner-take-all (WTA) principle. This work further constructs cluster supports by optimizing adaline modules. A local cluster support is related to a region $Λ_{m}$ that is characterized by $K$ orthogonal projective fields derived from training observations in $S_{m}$ , where $K \leq L$ . The membership to $Λ_{m}$ is determined by lower and upper thresholds of projections on $K$ projective fields.

The receptive field $μ_{m}$ derived by clustering analysis well represents the mean of training observations in $S_{m}$ . Subtracting $μ_{m}$ attains demeaned observations in $S_{m}$ . Let matrix $C_{m}$ denote the covariance matrix of demeaned observations in $S_{m}$ . Orthogonal projective fields can be obtained by solving the following eigenvalue problem,

$C_{m} a_{m} = λ a_{m}$

Orthogonal projective fields, denoted by ${a_{m k}}_{k}$ , are set to eigenvectors corresponding to $K$ largest eigenvalues. Let $l_{m k}$ and $u_{m k}$ respectively denote the lower bound and the upper bound of projections on $a_{m k}$ over $S_{m}$ , where

${\begin{matrix} l_{m k} & = m i n_{x \in S_{m}} {(x - μ_{m})}^{T} a_{m k} \\ u_{m k} & = m a x_{x \in S_{m}} {(x - μ_{m})}^{T} a_{m k} \end{matrix}$

Built-in parameters in an adaline module including $K$ projective fields and $K$ pairs of lower and upper thresholds have been determined for constructing cluster supports. The projection interval $I_{m k} = [l_{m k}, u_{m k}]$ denotes the minimal interval that covers projections of all demeaned observations in $S_{m}$ on $a_{m k}$ .

Let $f_{m} (x)$ denote the result of transferring $x$ to vector ${(h_{m 1}, \dots, h_{m K})}^{T}$ , where $h_{m k} = {(x - μ_{m})}^{T} a_{m k}$ . The local support $Λ_{m}$ is defined by

(2) $Λ_{m} = {x | f_{m} (x) \in I_{m 1} \times \dots \times I_{m K}} .$

If $f_{m} (x)$ belongs to the Cartesian product $I_{m 1} \times \dots \times I_{m K}$ , $x$ belongs to $Λ_{m}$ .

If $x$ belongs to $S_{m}$ , by definition of $l_{m k}$ and $u_{m k}$ , the component $h_{m k}$ of $f_{m} (x)$ must belong to $I_{m k}$ for any $k$ , and $f_{m} (x)$ belongs to $Λ_{m}$ . If $K$ equals $L$ , $Λ_{m}$ is simply a geometry box with orthogonal edges and is centered at $μ_{m}$ , containing training observations in $S_{m}$ . The union of all $Λ_{m}$ is an approximation to the real global density support that covers all training observations in $S$ . The volume of the local support corresponding to $S_{m}$ can be calculated by

(3) $V_{m} = \prod_{k} (u_{m k} - l_{m k})$

The ratio of cluster size to the volume can be calculated by $\frac{| S_{m} |}{V_{m}}$ that represents the uniform density of $Λ_{m}$ and the global uniform density is estimated by $\frac{| S |}{\sum_{m} V_{m}}$ .

It is notable that the membership of the projection $f_{m} (x)$ to $I_{m 1} \times \dots \times I_{m K}$ is calculated by $K$ pairs of threshold elements shown in Figure 1. The top threshold element in Figure 1 that has a lower bound, $2 K$ , is active if $K$ pairs of threshold elements are active.

2.2.3. Graph Configuration and Neighboring Relations of Cluster Supports

The membership of $x$ to $Λ_{m}$ can be determined by the adaline neural module in Figure 2. It takes one subtraction and $K$ projections on projective fields and $2 K + 1$ comparisons by threshold elements to determine the membership. It is extended to determine the membership of a line segment between two points in the space of $R^{L}$ to $Λ_{m}$ in case of $K = L$ . The membership of a line segment to a cluster support is significant for calculating the distance between centroids of $Λ_{m}$ and $Λ_{n}$ . The line segment between $μ_{m}$ and $μ_{n}$ is expressed by

$μ_{m n} (t) = μ_{m} + t (μ_{n} - μ_{m})$

where

t \in [0, 1]

. By linearly spacing the interval

[0, 1]

to a partition, one can obtain a lot of equally spaced points on the segment. If all sampled points on the segment belong to

Λ_{m} \cup Λ_{n}

, it is inferred that the line segment belongs to

Λ_{m} \cup Λ_{n}

. as shown in Figure 4C. In the occasion, the distance between

μ_{m}

and

μ_{n}

is defined by

(4) $D_{m n} = \sqrt{{(μ_{m} - μ_{n})}^{T} (μ_{m} - μ_{n})}$

If there exists some t such that $u_{m n} (t)$ does not belong to $Λ_{m} \cup Λ_{n}$ , the above definition is not feasible for determining the distance between centroids of $Λ_{m}$ and $Λ_{n}$ .

When $K$ equals $L$ , the mapping $f_{m} (x)$ is invertible. $Λ_{m}$ is a geometrical box with $2^{K}$ vertices defined by the Cartesian product ${l_{m 1}, u_{m 1}} \times \dots \times {l_{m K}, u_{m K}}$ . Two vertices with $K - 1$ same coordinates and only one different coordinate are neighboring. Applying the inverse of $f_{m}$ to two neighboring vertices induces two end points of an edge of $Λ_{m}$ in $R^{L}$ . Again the membership of each edge of $Λ_{m}$ to $Λ_{n}$ can be examined by the neural circuit in Figure 2. If there exists one vertex or one point sampled from an edge of $Λ_{m}$ belonging to $Λ_{n}$ , or inversely, from an edge of $Λ_{n}$ belonging to $Λ_{m}$ , $Λ_{m} \cap Λ_{n}$ is not empty. Two cluster supports are neighboring, if their intersection is not empty. The neighboring relations are maintained by a symmetric interconnection matrix $G$ with element $G_{m n} \in {0, 1}$ . Both $G_{m n}$ and $G_{n m}$ equal one if $Λ_{m}$ and $Λ_{n}$ are neighboring, and are zero otherwise.

The distance between $μ_{m}$ and $μ_{n}$ can be also determined if some part in the whole segment does not belong to $Λ_{m} \cup Λ_{n}$ as shown in Figure 4D and $G_{m n}$ equals one. The following definition is employed if $μ_{m n} (t) \notin Λ_{m} \cup Λ_{n}$ for some t and $Λ_{m} \cap Λ_{n} \neq \emptyset$ ,

(5) $D_{m n} = m i n_{{z \in Λ_{m} \cap Λ_{n}}} ∥ μ_{m} - z ∥ + ∥ μ_{n} - z ∥$

Calculation of $D_{m n}$ is approximated by searching for the best $z$ among vertices and edges of two neighboring cluster supports for simplification. For $K < L$ , subject to given training observations from $R^{L}$ , the vertices of $Λ_{m}$ can not be determined from the Cartesian product ${l_{m 1}, u_{m 1}} \times \dots \times {l_{m K}, u_{m K}}$ , since $f_{m}$ is no more invertible. If $K$ is less than $L$ , the nearest distance between $S_{m}$ and $S_{n}$ is employed to determine the neighboring relation of $Λ_{m}$ and $Λ_{n}$ .

When $K < L$ , $G_{m n} = 1$ if $d_{m n}$ is less than a predetermined small positive number $ϵ$ , and $G_{m n} = 0$ , otherwise, where $d_{m n}$ denotes the nearest distance between $S_{m}$ and $S_{n}$ defined by the minimal distance between any $x_{m} \in S_{m}$ and $x_{n} \in S_{n}$ . The two observations, respectively belonging to $S_{m}$ and $S_{n}$ , whose distance induces $d_{m n}$ less than $ϵ$ could be reserved for recalculating $d_{m n}$ during testing phase. The distance $D_{m n}$ between centroids $μ_{m}$ and $μ_{n}$ , if $G_{m n} = 1$ , is determined by the sum of three distances, respectively between $μ_{m}$ and $x_{m}$ , $μ_{n}$ and $x_{n}$ , and $x_{m}$ and $x_{n}$ , where $x_{m} \in S_{m}$ , $x_{n} \in S_{n}$ , and the distance between $x_{m}$ and $x_{n}$ is less than $ϵ$ . Both $x_{m}$ and $x_{n}$ in accompany with $G_{m n}$ for $K < L$ , are reserved in order to calculate the distance between two observations respectively belonging to two neighboring local supports.

For the case with $K = L$ , the connectivity $G_{m n}$ is one if intersection of $Λ_{m}$ and $Λ_{n}$ is non-empty and is zero otherwise. Since the projection function $f_{m}$ is invertible, vertices and edges of each local support can be sketched and their memberships to other supports can be checked directly. However, in case with $K < L$ , the projection function $f_{m}$ is not invertible. The connectivity $G_{m n}$ is checked by comparing the set distance between $S_{m}$ and $S_{n}$ with a predetermined threshold. Figure 3B shows the connectivity derived subject to the Swiss roll data. The graph configuration, denoted by $G$ , is not predetermined and is a result of checking neighboring relations between cluster supports. Neighboring relations among training observations in LLE have been extended to neighboring relations of data supports.

The distance between centroids of two cluster supports that are not neighboring is calculated based on given graph configuration and all $D_{m n}$ with $G_{m n} = 1$ . The problem now is to determine $D_{m n}$ between $μ_{m}$ and $μ_{n}$ for $G_{m n} = 0$ . Generally, $G$ denotes edges among $M$ nodes on a graph such that nodes $m$ and $n$ are connected if and only if $G_{m n} = 1$ . The distance $D_{m n}$ corresponding to an edge that connects nodes $m$ and $n$ is well defined. Two arbitrary nodes on a graph are path-connected if there exists a path between them. The distance between two path-connected nodes can be determined by the shortest path algorithm of Dijkstra [31].

If $M$ nodes on the graph are path-connected, all distances among $M$ nodes or all entries in matrix D are determined.

2.3. Levenberg-Marquardt Learning for Distance-Preserving Mapping and Optimal Posterior Weights

The distance matrix D provides clues for determining images of centroids of cluster supports, where images of centroids are regarded as optimal posterior weights of graph-organized neural modules for NDR mapping. The dimension of the output space is typically less than or equals three for data visualization by computer graphics. If the output dimension equals one, without losing generality, the output space is related to the interval $[0, 2 π]$ for visualization and the posterior weight $r_{m}$ refers to the radian of a unitary circle. The problem of determining all posterior weights turns to find images on a unitary circle based on the distance matrix $D$ . On a unitary circle, each image has two neighbors. The goal is to minimize the total distance of neighboring images subject to given $D$ . The neighboring relation defines a cycle path that visits each node exactly once and returns to where it starts. The task is different from a typical TSP (traveling salesman problem) that is provided with city cites within a Euclidean space.

The Hopfield neural network [32,33] and Potts neural networks [34,35] for solving TSP with only the distance matrix $D$ can be directly applied for optimizing images of centroids, equivalently posterior weights. Given distance matrix D, neural approaches [34,35] attain a circular sequence that visits $M$ centroids. Let $p_{i}$ denote the index of a node at stop $i$ on the determined circular sequence, where $1 \leq i \leq M$ +1 and $p_{M + 1} = p_{1}$ for circular representation. Let $q_{i}$ denote the difference between images of node $p_{i}$ and $p_{i + 1}$ . Let $q_{i}$ be proportional to $D_{p_{i} p_{i + 1}}$ and the sum of all $q_{i}$ be $2 π$ . Then

(6) $q_{i} = \frac{2 π D_{p_{i} p_{i + 1}}}{\sum_{j} D_{p_{j} p_{j + 1}}} .$

By setting the image of node $p_{1}$ to zero, one can determine the image of node $p_{i + 1}$ by adding $q_{i}$ to the image of node $p_{i}$ for all $i$ .

For $K = L$ , the matrix $D$ contains distances defined inside the union of cluster supports with the non-negative, symmetric and triangle properties for distance measure. The first two properties are trivial. The third triangle property holds for distances in D. Let $Q_{i k}$ denote the shortest path from node $i$ to node k. According to the shortest path algorithm, path $Q_{i k}$ contains no cycle. Let $Q_{i j}$ and $Q_{j k}$ respectively denote the shortest path between node i and node j, and node j and node k. Let node j be absent in path $Q_{i k}$ . The triangle property holds if $D_{i k} \leq D_{i j} + D_{j k}$ . Assume $D_{i k}$ greater than $D_{i j} + D_{j k}$ . It follows that path $Q_{i k}$ is not the shortest one, since it could be improved by the path that concatenates $Q_{i j}$ with $Q_{j k}$ . This leads to a contradiction. So, the length of $Q_{i k}$ must be less than or equal to $D_{i j} + D_{j k}$ .

Now the task extends to find images of centroids or posterior weights, where the dimension of the output space is two or three and $K = L$ . The purpose is to find and display the image $r_{m}$ of every $μ_{m}$ . The distance $D_{m n}$ is determined by Equations (4) and (5) if $G_{m n} = 1$ , and by the shortest path algorithm otherwise. For distance preserving mapping, distances in D propose the following constraint for determining $r_{m}$ and $r_{n}$ in the output space,

(7) $∥ r_{m} - r_{n} ∥ - D_{m n} = 0$

The constraint (7) for any path-connected node $m$ and $n$ constitutes a nonlinear system. There are $\frac{1}{2} M (M - 1)$ constraints. The LM (Levenberg-Marquardt) algorithm has been extensively applied for learning multilayer neural networks [26,27] and solving nonlinear system. This work applies the LM algorithm for solving the nonlinear system (7) and determining images of centroids for distance preserve mapping. The following criterion is formulated for least square fitting,

$E (r) = \sum_{m, n} {(∥ r_{n} - r_{m} ∥ - D_{n m})}^{2}$

where

r

denotes a collection of all posterior weights. The outstanding performance of the LM algorithm has been extensively explored in previous works [26,27] for learning neural networks. The LM algorithm is applied to find optimal

r_{o p t}

defined by

$r_{o p t} = a r g \underset{r}{m i n} E (r)$

If $G$ contains unconnected subgraphs, solving the nonlinear system (7) by the LM algorithm can be separately applied to every unconnected subgraph. There will be multiple isolated subgraphs and multiple sets of images, each corresponding to one subgraph whose nodes are path-connected.

2.4. Locally Nonlinear Embedding for NDR Mapping

Learning the proposed neural model subject to training observations in $S$ achieves optimal built-in parameters, including receptive fields, projective fields, projection bounds and posterior weights. This section presents locally nonlinear embedding (LNE) for NDR mapping based on a neural model that has been equipped with optimal built-in parameters and posterior weights.

In response to an observation in the training set $S$ , a well-trained NDR model is expected to generate an image in the output space. Let $x$ belong to some subset $S_{m}$ . Then $x$ belongs to cluster support $Λ_{m}$ and a correspondent neural module is active. Let $N B_{m} = {n | G_{m n} = 1}$ and $l_{n}$ with $n \in N B_{m}$ denote the distance between $x$ and $μ_{n}$ similar to distance measure described in Equation (5). If the line segment between $x$ and $μ_{n}$ belongs to $Λ_{m} \cup Λ_{n}$ , since $x \in Λ_{m}$ , $l_{n}$ is calculated by the Euclidean distance between $x$ and $μ_{n}$ , otherwise by the following equation

(8) $l_{n} = m i n_{{z \in Λ_{m} \cap Λ_{n}}} ∥ x - z ∥ + ∥ μ_{n} - z ∥$

The calculation of $l_{n}$ is approximated by seeking the best $z$ on vertices and edges of $Λ_{m}$ for simplicity. Similar to Equation (7), the LNE distance-preserving constraint is given by

(9) $∥ r - r_{n} ∥ = l_{n}$

where

n \in N B_{m} \cup {m}

. Equation (9) poses locally nonlinear constraints for inferring image

r

subject to given distances. Following locally nonlinear embedding, the constraint (9) specifies a nonlinear system, where the only unknown is

r

and the constraint number is only

| N B_{m} | + 1

. By the strategy of distance preservation, optimal

r

can be trivially resolved by the LM algorithm with random initialization near

r_{m}

. A well trained NDR mapping can be applied to determine the image of each observation in

S

A well trained NDR model is employed to generate locally nonlinear constraints (9) for mapping a novel observation. The graph-organized neural modules simultaneously operate to encode a novel observation $x$ . If these is no active module that generates an image in the output space, it is implied that $x$ is out of all local supports and is not within the domain of NDR mapping, otherwise there exists at least one local support that covers $x$ . Let $H$ denote a set that collects indices of local supports commonly covering $x$ . These local supports must be overlapping and $H$ is a small set only with several elements. Let $α$ denote a vector with element $l_{n}$ that measures the distance between $x$ and $μ_{n}$ according to Equation (8), and the image vector $β$ consist of $r_{n}$ , where $n \in N B_{m} \cup {m}$ for each $m \in H$ . Locally nonlinear constraints (9) are well defined by given vectors $α$ and $β$ . The calculation of $l_{n}$ in case of $K = L$ and $K < L$ has been given in contexts of Section 2.2.3.

Inferring the image $r$ of $x$ is based on distances in $α$ and images in $β$ . According to Equation (9), locally nonlinear constraints are in number identical to the length of vector $α$ or $β$ , or the sum of $| N B_{m} | + 1$ over $m \in H$ . Since $H$ is a small set and there is only one unknown, locally nonlinear constraints (9) constitute a nonlinear system that can be easily solved by the LM algorithm.

3. Results

Numerical simulations verify the proposed neural model and learning process for NDR mapping of the 3-dimensional Swiss-roll data [1,2]. Given Swiss-roll data contain N = 5000 3-dimensional points, $S = {x_{i} \in R^{3}}_{i = 1}^{N}$ . The training data S are embedded within a 2-dimensional surface as shown in Figure 3A. The Swiss-roll data are suitable for evaluating the effectiveness of the proposed neural model for NDR mapping.

The first step applies the annealed clustering algorithm [28,29] in Appendix A to partition S to M non-overlapping subsets as described in Section 2.2.1. This step obtains receptive fields ${μ_{m}}_{m}$ as centroids of cluster supports.

The second step completes cluster support construction for optimal projective fields and projection intervals as described in Section 2.2.2. The projection dimension K is set to three for current example. The centroid of a correspondent local support $Λ_{m}$ is set to the mean of a cluster. Figure 4A and Figure 5A show all centroids. Subtracting $μ_{m}$ from observations in $S_{m}$ leads to demeaned observations, subsequently a covariance matrix $C_{m}$ . Solving the eigenvalue problem corresponding to $C_{m}$ attains K orthogonal projective fields, ${a_{m k}}_{k}$ , and K projection intervals, each being denoted by $I_{m k}$ . As shown in Figure 4B, these parameters sketch a local support as a 3-dimensional box. The union of all local supports approximates the global density support underlying observations in S.

If M = 1, there is only one cluster support. Let $S_{t e s t}$ collect novel 20,000 data points randomly sampled from the sole regular cluster support that covers all observations in S. A well-trained neural model with M = 64 can evaluate memberships of newly sampled observations in $S_{t e s t}$ to the union of M cluster supports. The membership is verified if there exists at least one activated module in the neural model. Reserving those inside the union of cluster supports attain results in Figure 5B, which displays similar structure to the original Swiss-roll.

The third step figures out lateral interconnections of neural modules, equivalently neighboring relations of cluster supports. For current situation, $f_{m}$ is invertible such that vertices and edges of cluster support $Λ_{m}$ can be efficiently tracked. Figure 6A shows neighborhood relations of cluster supports in the input space. On the basis, the distance between centers of two neighboring cluster supports has two different ways for calculation, depending on the straight line between two centroids $μ_{i}$ and $μ_{j}$ totally within two neighboring cluster supports as shown in Figure 4C or partially within $Λ_{i} \cup Λ_{j}$ as shown in Figure 4D. Based on distances of centroids of neighboring cluster supports, as described in Section 2.2.3, the shortest path algorithm of Dijkstra [12] is further applied to determine the distance between every pair of path-connected nodes m and n, where $G_{m n} = 0$ . Now all entries in matrix D have been determined for inferring images of all centroids.

Based on distances of centroids in matrix D, distance preserving constraints in Equation (7) constitute a large scaled nonlinear system, where all images of centroids in $r$ are unknown. As described in Section 2.3, the LM algorithm minimizes the least square criterion for optimizing all posterior weights. As a result, images of all $μ_{m}$ on a 2D plane are shown in Figure 6B. The centroids in Figure 6A are mapped to images according to the distance preserving criterion. This illustrates significance of applying the LM algorithm for topology-preserving dimensionality reduction.

A well trained NDR neural model is employed to map each training or testing observation. As described in Section 2.4, locally nonlinear constraints (9) are given for each training or testing observation, where the only unknown denotes the generated image in the output space. Figure 6C shows the generated images of all observations. The LM algorithm is effective and reliable for seeking the image of each testing observation that satisfies locally nonlinear constraints (9). It is notable that a well-trained NDR neural model is semi-parametric. It operates without reserving full training observations and is able to map novel testing observations to an embedded low dimensional manifold faithfully.

4. Discussion

Let V(M) be the sum of all V_i, where V_i denotes the volume of cluster support $Λ_{i}$ determined by Equation (3). V(M) is expected to be related to an upper bound of the volume of the the global density support that covers all data points in S. However, as shown in Figure 5A, V(M) decreases as M increases. Hence minimal V(M) cannot serve as the only criterion of setting M. For current example, two local supports are neighboring if $Λ_{i} \cap Λ_{j} \neq \emptyset$ . The connectivity of the graph G tends to increase as M increases. Following the minimal volume and maximal connectivity principle, the optimal number of cluster supports derived by advanced clustering analysis is determined by

(10) $M_{o p t} = \min_{m} (V (m) - λ C (m))$

where C(M) is the ratio of the sum of elements in G to the number,

\frac{1}{2}

M(M-1), of full connections. Figure 5A shows empirical volume V(M) and connectivity C(M), where the horizontal axis denotes the number of clusters. Here λ is set to V(1) for balancing two quantities. As the number of clusters increases to M = 64, it tends to balance volume and connectivity. Advanced clustering analysis partitions S into M clusters, as shown in Figure 5B. For this case, nodes on the graph are connected and there is no isolated node.

The proposed NDR model is applied for neighborhood preservation mapping. Consisting of two hidden layers in additional to input and output layers, the proposed NDR model is a deep neural network [36,37,38]. For generalization, learning an NDR model subject to training observations is verified by a testing set that is not provided during learning phase. Let $S_{test}$ denote the testing set. The first step is to check if given testing observations are within the union of local supports defined by the trained NDR model. Let $S_{test}^{'}$ collect screened observations whose images could be inferred by the trained NDR model. The reservation ratio of testing observations is expressed by $\frac{| S_{test} |}{| S_{test}^{'} |}$ .

By a well-trained NDR model and locally nonlinear embedding, each observation $x$ in $S_{test}^{'}$ is transformed to an image $r$ in the output space. Let ${NB}_{k} (x)$ denote $k$ nearest neighbors in the input space and ${NB}_{k^{'}} (r)$ denote $k^{'}$ nearest neighbors of the image $r$ in the output space. For $z \in {NB}_{k} (x)$ , an indicator $hit (z)$ is set to one if the image of $z$ belongs to ${NB}_{k^{'}} (r)$ , and zero otherwise. An error percentage for testing is defined by,

(11) $ε (S_{test}^{'}) = 1 - \frac{1}{| S_{test}^{'} | \times k} \sum_{x \in S_{test}^{'}} \sum_{z \in {NB}_{k} (x)} hit (z)$

Determining images of testing observations can be realized by parallel computation. Inferences of two testing observations to images are independent and can be performed simultaneously. Numerical simulations employ a notebook equipped with four 2.3 GHz Intel Core i5 workers that can simultaneously execute to speed up NDR mapping of testing observations under Matlab environment.

Numerical simulations independently generate two “brokenswiss” datasets respectively for training and testing. Each dataset consists of $N = 5000$ three dimensional observations. The testing “brokenswiss” dataset is shown in Figure 7. Learning an NDR model with M = 91 internal nodes subject to the training set ten times attains entries in the first row of Table 1, where mean and variance of reservation ratio and error percentage over ten executions are listed. Low variances in Table 1 reflect high reliability of the proposed learning process. For current example, $k = 12$ and $k^{'} = 36$ . The mean of reservation ratio is near 90% and the mean of error percentage is less than 4%, which shows effectiveness of the proposed learning process.

Both LLE and Isomap methods are not model based. Without further improvement, these two methods are unable to produce models for testing, hence inducing neither reservation ratio nor error percentage for testing in Table 1. LLE and Isomap codes [39] induce execution results that miss images of half training observations in the output space for this example. Both LLE and Isomap are not able to reduce the training error percentage for current example. Their training error percentage is about 50%.

Figure 8A shows observations sampled from the two-colored surface of a ball. The training set contains four-dimensional observations, since except for 3D coordinates an attribute is recruited for representing two different colors. The dimension $L$ of the input space is four for this example. Figure 8B shows a graph with edges in the output space, where the positions of nodes refer to images of centroids of local supports. These edges represent neighboring relations of extracted local supports. The graph contains two isolated subgraphs. It is observed that two nodes respectively belonging to two isolated subgraphs are not connected. It is verified that local supports corresponding to nodes on each isolated subgraph contain observations of the same color and two subgraphs separately reflect two different colors of observations. The proposed NDR mapping successfully translates 4D observations to an embedded low-dimensional manifold. The encouraging results motive further applications of the proposed NDR neural model to high dimensional observations oriented from spectral features of sounds and handwritten patterns of characters.

The second row of Table 1 lists numerical results of evaluating the proposed learning process for the testing dataset in Figure 8. Either the training set $S$ or the testing set $S_{test}$ contains $N = 8000$ four dimensional observations. For this example, $k = 12$ , $k^{'} = 48$ . The proposed learning process executes ten times for training an NDR model with $M = 81$ subject to $S$ . Each time the trained NDR model is verified with $S_{test}$ . Statistics over ten executions are summarized in Table 1 and Table 2, where the reservation ratio is 95% and the error percentage is 16.51% in average. For the same reason stated previously, testing results of LLE and Isomapp are not listed in Table 1. As shown in Table 2, the proposed learning process in average takes 166 s to train an NDR model, and 190 s to map 8000 testing observations to images through parallel computation. Matlab environment can further extend parallel computation with more cores. Both LLE and Isomap are also unable to reduce training error percentage effectively for this example. It is notable that Isomap takes 1336 s to process training observations for this example. The proposed learning process significantly improves computational efficiency for training and accuracy for mapping testing observations.

5. Conclusions

Numerical simulations show the proposed NDR neural model feasible for nonlinear dimensionality reduction and visualization of the embedded low dimensional manifold. A well trained NDR neural model needs no more training observations for mapping novel testing observations during execution phase. Traditional LLE and Isomap require reserving full training observations for mapping a novel testing observation. Compared with LLE and Isomap, the proposed NDR neural model possesses better portability for mapping testing observations during execution phase. It only requires built-in parameters for further execution.

All built-in parameters of a well-trained NDR model are comprehensible and meaningful for abstractive data structures. The optimized built-in parameters subject to training observations constitute extracted local supports, serving as internal representations of the centroid and projected PCA box of each local support. The obtained posterior weights refer to images of all centroids of local supports. The dynamic graph configuration states neighboring relations of local supports. Edges on the graph refer to lateral interconnections of adaline neural modules in a well-trained NDR model.

The LM algorithm has been shown effective for solving a large scaled nonlinear system toward transferring centroids to images in the output space. The nonlinear system is derived following the principle of distance preserving mapping. The LM algorithm has been also shown for seeking the image of a novel observation that satisfies locally nonlinear embedding constraints (9) during execution phase. Since there is only one unknown in a small set of locally nonlinear constraints (9), the LM algorithm is fast for NDR mapping of a single testing observation. This work has successfully extended LLE to locally nonlinear embedding constraints. The LNE constraints hold within a larger scope in comparison with traditional LLE, extensively enhancing quality of NDR mapping. Learning an NDR neural model is a process of integrating advanced clustering analysis, optimizing adaline neural modules and solving large-scale distance-preserving nonlinear constraints. The integrating learning process has been shown effective and reliable for optimizing built-in parameters of the proposed NDR neural model. An NDR neural model performs a standalone approximation that infers images of testing observations without maintaining huge volume of training observations.

Author Contributions

Conceptualization of locally nonlinear embedding, K.H. and J.-M.W.; methodology, S.-S.W. and J.-M.W.; software, S.-S.W. and J.-M.W.; validation, S.-J.J., K.H. and J.-M.W.; data curation, S.-S.W. and S.-J.J.; writing—original draft preparation, S.-S.W. and J.-M.W.; writing—review and editing, S.-S.W., K.H. and J.-M.W.; visualization, S.-S.W. and J.-M.W.; supervision, J.-M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not Applicable.

Data Availability Statement

The data presented in this study are openly available in [39].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

A simplified version of the annealed clustering algorithm [28,29] is summarized by the following iterative steps.

Randomly set all $μ_{j}$ near the mean of all data points, ${x_{i}}_{i}$ , initialize the inverse temperature parameter β to sufficiently low value, and set $ε$ to a small positive value.
Calculate the distance $d_{i j}$ between each point $x_{i}$ and $μ_{j}$ .
Update the expectation of each element in exclusive memberships. $q_{j} [i] = \frac{e x p (- β d_{i j})}{\sum_{k} e x p (- β d_{i k})}$ .
Set stability to the mean of $λ_{i} = \sum_{j} q_{j}^{2} [i]$ over $i$ . If stability is less than the sum of $\frac{1}{M}$ and $ε$ , add each $q_{j} [i]$ with a small andom noise for perturbation.
Fix all $q_{k} [i]$ and minimize the $\frac{1}{N} \sum_{i} \sum_{k} q_{k} [i] ∥ x_{i} - μ_{k} ∥^{2}$ with respect to all $μ_{m}$ .
If stability < 0.98, set β to β/0.995 and repeat Step ii-v, otherwise halt.

As in previous works [28,29], the algorithm operates under a physical-like annealing process. Annealing schedules a temperature-like parameter $\frac{1}{β}$ from sufficiently high to low values. The expectation of each element in stochastic membership variables and each adaptive centroid $μ_{j}$ are maintained during the physical-like annealing process. At each intermediate temperature, the algorithm iteratively updates each $q_{j} [i]$ and each centroid $μ_{j}$ . At the end of the annealing process, the temperature-like parameter is scheduled to sufficiently low values, where the algorithm eventually attains optimal exclusive memberships and centroids. The mean of $λ_{i}$ over $i$ denotes the stability. This quantity is calculated at step iv. If it is less than the sum of $\frac{1}{M}$ and a pre-determined small positive value $ε$ , each $q_{j} [i]$ is added with a noise to escape a trivial state at step iv. If it approaches one at step vi, the algorithm halts.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures and Tables

Figure 1. A feedforward neural module for translation of a high-dimensional observation to an image in the output space.

Figure 2. A deep neural network consisting of multiple neural modules for dimensionality reduction mapping.

Figure 3. (A) Swiss-roll data. (B) A graph with derived edges for organizing neural modules.

View Image - Figure 4. (A) Centroids of partitioned subsets. (B) A regular box for representing a local support. (C) The line segment that connects two centers totally belongs to union of two local supports. (D) The line segment that connects two centers partially belonged to union of two local supports.

Figure 4. (A) Centroids of partitioned subsets. (B) A regular box for representing a local support. (C) The line segment that connects two centers totally belongs to union of two local supports. (D) The line segment that connects two centers partially belonged to union of two local supports.

Figure 5. (A) Volume V(M) and connectivity C(M) empirically change as cluster number M increases. (B) Centroids.

View Image - Figure 6. (A) Edges on a graph and neighboring relations of cluster supports in the input space. (B) Images of centroids and training observations. (C) Images of all observations in the output space.

Figure 6. (A) Edges on a graph and neighboring relations of cluster supports in the input space. (B) Images of centroids and training observations. (C) Images of all observations in the output space.

Figure 7. (A) A broken Swiss dataset for testing. (B) Isolated images of centroids and their edges. (C) Images of testing observations.

Figure 8. (A) Observations sampled from the two-colored ball surface for testing. (B) Isolated sub-graphs and their edges.

Table 1

Mean and variance of reservation ratio and error percentage for testing.

	Reservation Ratio (Testing)		Error Percentage (Testing)
Dataset	Mean	Var	Mean	Var
Brokenswiss	89.78%	1.73 × 10⁻⁵	3.92%	6.03 × 10⁻⁵
Colorball	95.00%	5.35 × 10⁻⁶	16.51%	2.83 × 10⁻⁵

Table 2

Mean and variance of execution time for training and testing.

	Execution Time
	Training (Seconds)		Testing (Seconds)
Dataset	Mean	Var	Mean	Var
Colorball	166.0	66.0	190.0	43.7

Word count: 7396

Show less

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

This work explores neural approximation for nonlinear dimensionality reduction mapping based on internal representations of graph-organized regular data supports. Given training observations are assumed as a sample from a high-dimensional space with an embedding low-dimensional manifold. An approximating function consisting of adaptable built-in parameters is optimized subject to given training observations by the proposed learning process, and verified for transformation of novel testing observations to images in the low-dimensional output space. Optimized internal representations sketch graph-organized supports of distributed data clusters and their representative images in the output space. On the basis, the approximating function is able to operate for testing without reserving original massive training observations. The neural approximating model contains multiple modules. Each activates a non-zero output for mapping in response to an input inside its correspondent local support. Graph-organized data supports have lateral interconnections for representing neighboring relations, inferring the minimal path between centroids of any two data supports, and proposing distance constraints for mapping all centroids to images in the output space. Following the distance-preserving principle, this work proposes Levenberg-Marquardt learning for optimizing images of centroids in the output space subject to given distance constraints, and further develops local embedding constraints for mapping during execution phase. Numerical simulations show the proposed neural approximation effective and reliable for nonlinear dimensionality reduction mapping.

Details

Title

Learning Neural Representations and Local Embedding for Nonlinear Dimensionality Reduction Mapping

Author

Wu, Sheng-Shiung; Sing-Jie Jong; Hu, Kai

First page

1017

Publication year

2021

Publication date

2021

Publisher

MDPI AG

e-ISSN

22277390

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/math9091017

ProQuest document ID

2530157246

Learning Neural Representations and Local Embedding for Nonlinear Dimensionality Reduction Mapping

Jump to:

Full text

Abstract

Details

Suggested sources