1. Introduction
Extreme learning machine (ELM) was proposed by Huang et al. [1–3] as a promising learning algorithm for single-hidden-layer feedforward neural networks (SLFNs): it randomly chooses the input weights and biases of the hidden nodes and analytically determines the output-layer weights using the Moore-Penrose (MP) generalized inverse [4]. By avoiding iterative parameter adjustment and time-consuming weight updating, ELM achieves an extremely fast learning speed and has therefore attracted considerable attention. However, the random initialization of input-layer weights and hidden biases may generate suboptimal parameters, which have a negative impact on its generalization performance and prediction robustness.
To alleviate this weakness, many works have been proposed to further improve the generalization capability and stability of ELM, among which ELM ensemble algorithms are representative. Several of them are summarized as follows. The earliest ensemble-based ELM (EN-ELM) method was presented by Liu and Wang in [5]. EN-ELM introduced a cross-validation scheme into its training phase, where the original training dataset was partitioned into several subsets for cross-validation. The voting-based ELM (V-ELM) [6] trains multiple independent ELMs and determines the final output by majority voting. Many other ELM ensembles have also been developed, for example, based on genetic algorithms [7], heterogeneous individual learners [8], and application-specific designs [9–13].
As for ensembles of traditional neural networks, the most prevailing approaches are Bagging and Boosting. The Bagging scheme [14] generates several training datasets from the original training dataset and then trains a component neural network on each of them. The Boosting mechanism [15] generates a series of component neural networks whose training datasets are determined by the performance of the former ones. There are also many other approaches for training the component neural networks. Hampshire and Waibel [16] utilize different objective functions to train distinct component neural networks. Xu et al. [17] introduce the stochastic gradient boosting ensemble scheme to bioinformatics applications. Yao and Liu [18] regard all the individuals in an evolved population of neural networks as component networks.
In this paper, a new ELM ensemble scheme called Stochastic Gradient Boosting-based Extreme Learning Machine (SGB-ELM), which makes use of the mechanism of stochastic gradient boosting [19, 20], is proposed. SGB-ELM constructs an ensemble model by training a sequence of ELMs, where the output weights of each individual ELM are learned by optimizing a regularized objective in an additive manner. More specifically, we design an objective function based on the training mechanism of the boosting method and, to alleviate overfitting, simultaneously introduce a regularization item that controls the complexity of the ensemble model. The derivation formula for the output weights of the newly added ELM is then obtained by optimizing this objective with a second-order approximation. As the output weights of the newly added ELM at each iteration are difficult to calculate analytically from the derivation formula, we take the output weights learned from the pseudo-residual-based training dataset as an initial heuristic item and obtain the optimal output weights by using the derivation formula to update this heuristic item iteratively. Because the regularized objective favors functions that are both predictive and simple, and because a randomly selected subset rather than the whole training set is used to minimize the training residuals at each iteration, SGB-ELM continually improves the generalization capability of ELM while effectively avoiding overfitting. Experimental results in comparison with Bagging ELM, Boosting ELM, EN-ELM, and V-ELM show that SGB-ELM obtains better classification and regression performance, which demonstrates the feasibility and effectiveness of the SGB-ELM algorithm.
The rest of this paper is organized as follows. In Section 2, we briefly summarize the basic ELM model as well as the stochastic gradient boosting method. Section 3 introduces our proposed SGB-ELM algorithm. Experimental results are presented in Section 4. Finally, we conclude this paper and make some discussions in Section 5.
2. Preliminaries
In this section, we briefly review the basic ELM model and the stochastic gradient boosting method to provide the necessary background for the development of the SGB-ELM algorithm in Section 3.
2.1. Extreme Learning Machine
ELM is a special learning algorithm for SLFNs, which randomly selects the weights linking the input layer to the hidden layer and the biases of the hidden nodes, and analytically determines the output weights (linking the hidden layer to the output layer) by using the MP generalized inverse. Suppose we have a training dataset with $N$ distinct samples $\{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N}$; the hidden-layer output matrix $\mathbf{H}$ is computed from the randomly assigned input weights and biases, and the output weights $\boldsymbol{\beta}$ are then obtained as $\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{T}$, where $\mathbf{H}^{\dagger}$ denotes the MP generalized inverse of $\mathbf{H}$ and $\mathbf{T}$ is the target matrix.
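As a minimal illustration, the following Python sketch (using only NumPy; the function names, the sigmoid activation, and the $[-1, 1]$ range for the random parameters are illustrative choices) trains a basic ELM by generating the hidden-layer parameters at random and solving the output weights with the MP pseudoinverse:

```python
import numpy as np

def train_elm(X, T, n_hidden, rng=np.random.default_rng(0)):
    """Basic ELM sketch: random hidden layer + pseudoinverse output weights.
    X: (N, d) inputs, T: (N, m) targets."""
    d = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(d, n_hidden))   # random input weights (never trained)
    b = rng.uniform(-1.0, 1.0, size=n_hidden)        # random hidden biases (never trained)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))           # hidden-layer output matrix (sigmoid)
    beta = np.linalg.pinv(H) @ T                     # output weights via MP generalized inverse
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```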
By avoiding iterative adjustment of the input-layer weights and hidden biases, ELM's training speed can be thousands of times faster than that of traditional gradient-based learning algorithms [2]. Meanwhile, ELM also produces good generalization performance; it has been verified that ELM can achieve generalization performance comparable to that of the typical support vector machine algorithm [3].
2.2. Stochastic Gradient Boosting
The stochastic gradient boosting scheme was proposed by Friedman in [20] as a variant of the gradient boosting method presented in [19]. Given a training set, gradient boosting constructs an additive ensemble model in a stagewise fashion: at each iteration, a weak individual learner is fitted to the pseudo-residuals, i.e., the negative gradient of the loss function evaluated at the current model's predictions, and is then added to the ensemble.
As gradient boosting constructs the additive ensemble model by sequentially fitting a weak individual learner to the current pseudo-residuals of the whole training dataset at each iteration, it costs much training time and may suffer from overfitting. In view of that, a minor modification named stochastic gradient boosting was proposed to incorporate some randomization into the procedure. Specifically, at each iteration a randomly selected subset, instead of the full training dataset, is used to fit the individual learner and compute the model update for the current iteration. Namely, a random subsample of size $\tilde{N} < N$ is drawn without replacement from the $N$ training instances at each iteration, and only this subsample enters the computation of the pseudo-residuals and the fitting of the new learner.
Stochastic gradient boosting can also be viewed as a special line-search optimization algorithm, which makes the newly added individual learner fit the steepest-descent direction of the partial training loss at each learning step.
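A compact sketch of this procedure under squared loss is given below; `fit_weak_learner` is a hypothetical placeholder for any base-learner training routine, and `subsample` controls the fraction of instances drawn at each iteration:

```python
import numpy as np

def stochastic_gradient_boosting(X, y, fit_weak_learner, n_rounds=50,
                                 learning_rate=0.1, subsample=0.5,
                                 rng=np.random.default_rng(0)):
    """Generic stochastic gradient boosting under squared loss (a sketch).
    fit_weak_learner(X_sub, r_sub) must return a callable h with h(X) -> predictions."""
    N = X.shape[0]
    f0 = y.mean()                      # initial constant model
    F = np.full(N, f0)
    learners = []
    for _ in range(n_rounds):
        residuals = y - F              # pseudo-residuals = negative gradient of the squared loss
        idx = rng.choice(N, size=int(subsample * N), replace=False)  # random subset, not the full data
        h = fit_weak_learner(X[idx], residuals[idx])                 # fit the weak learner on the subset only
        F += learning_rate * h(X)                                    # additive model update
        learners.append(h)
    return f0, learners
```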
3. Stochastic Gradient Boosting-Based Extreme Learning Machine (SGB-ELM)
SGB-ELM is a novel hybrid learning algorithm that introduces the stochastic gradient boosting method into the ELM ensemble procedure. As the boosting mechanism focuses on gradually reducing the training residuals at each iteration and ELM is a special multi-parameter network (particularly for classification tasks), instead of combining ELM and stochastic gradient boosting directly, we design an enhanced training scheme to alleviate possible overfitting in our proposed SGB-ELM algorithm. The detailed implementation of SGB-ELM is presented in Algorithm 2, and the determination of the optimal output weights for each individual ELM learner is illustrated in Algorithm 1.
Algorithm 1: Determination of the optimal output weights of the newly added individual ELM.
Input: the modified (pseudo-residual-based) training dataset, the randomly assigned input weights and hidden biases of the newly added ELM, and the derivation formula.
Output: the optimal output weights of the newly added ELM, obtained by taking the output weights fitted to the pseudo-residuals as the initial heuristic item and updating it iteratively with the derivation formula.

Algorithm 2: SGB-ELM.
Input: the training dataset, the loss function, the number of boosting iterations, and the learning parameters of each individual ELM.
Step 1. Randomly generate the input weights and hidden biases of the initial base learner within a fixed range; its output weights are determined analytically by the MP generalized inverse, which gives the initial base learner.
Step 2. For each boosting iteration: (a) a stochastic subset of the whole training dataset is defined; (b) compute the first-order gradients (pseudo residuals) with regard to the predicted output of the current ensemble model for each training instance in the subset; (c) compute the second-order gradients with regard to the predicted output of the current ensemble model for each training instance in the subset; (d) randomly assign the input weights and hidden biases of the newly added ELM and solve its output-layer weights for the derivation formula based on the modified training dataset formed from the pseudo residuals (Algorithm 1); (e) add the newly trained ELM to the current ensemble learning model.
Output: the final ensemble model.
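The following simplified sketch illustrates the overall training flow in Python for regression, reusing `train_elm` and `predict_elm` from the sketch in Section 2.1. It assumes a squared loss and a fixed learning rate, and it uses the output weights fitted to the pseudo-residuals directly, omitting the iterative refinement of Algorithm 1; it is an illustrative sketch rather than an exact reproduction of Algorithm 2.

```python
import numpy as np

def sgb_elm_regression(X, y, n_hidden=50, n_rounds=50, subsample=0.5,
                       learning_rate=0.1, rng=np.random.default_rng(0)):
    """Simplified SGB-ELM training loop for regression (an illustrative sketch only)."""
    N = X.shape[0]
    # Initial base learner: a plain ELM fitted to the targets.
    W, b, beta = train_elm(X, y.reshape(-1, 1), n_hidden, rng)
    F = predict_elm(X, W, b, beta).ravel()
    ensemble = [(W, b, beta, 1.0)]
    for _ in range(n_rounds):
        idx = rng.choice(N, size=int(subsample * N), replace=False)   # stochastic subset
        residuals = (y - F)[idx]                                      # pseudo-residuals on the subset
        # Output weights fitted to the pseudo-residuals play the role of the heuristic item;
        # the full algorithm would further refine them with the derivation formula (Algorithm 1).
        Wm, bm, beta_m = train_elm(X[idx], residuals.reshape(-1, 1), n_hidden, rng)
        F += learning_rate * predict_elm(X, Wm, bm, beta_m).ravel()
        ensemble.append((Wm, bm, beta_m, learning_rate))
    return ensemble
```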
There are many existing second-order approximation methods, including sequential quadratic programming (SQP) [21] and the majorization-minimization (MM) algorithm [22]. SQP is an effective method for nonlinearly constrained optimization that solves a sequence of quadratic subproblems. MM optimizes a local surrogate objective that is easier to solve than the original cost function. Instead of using a second-order approximation directly, SGB-ELM designs an optimization criterion for the output-layer weights of each individual ELM; quadratic approximation is merely employed as an optimization tool in SGB-ELM.
In SGB-ELM, the key issue is to determine the optimal output-layer weights of each weak individual ELM, which are expected to further decrease the training loss while keeping a simple network structure. Consequently, we design a learning objective that considers not only the fitting ability on the training instances but also the complexity of the ensemble model.
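A generic form of this objective, measuring the complexity of the newly added ELM $h_m$ by a squared-norm penalty on its output-layer weights $\boldsymbol{\beta}_m$ (the penalty term shown here is an illustrative assumption rather than the exact formulation of (8)), is

```latex
% Generic regularized boosting objective (illustrative form; the exact objective is (8)):
% training loss of the updated ensemble plus a complexity penalty on the new output weights.
\mathcal{L}^{(m)} = \sum_{i=1}^{N} L\bigl(y_i,\, f_{m-1}(\mathbf{x}_i) + h_m(\mathbf{x}_i)\bigr)
                  + \frac{\lambda}{2}\,\lVert \boldsymbol{\beta}_m \rVert^{2}.
```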
Following the boosting training mechanism, each individual ELM is greedily added to the current ensemble model in sequence so that it most improves the model according to (8). Specifically, the loss term of (8) is expanded around the predictions of the current ensemble model $f_{m-1}$ using a second-order Taylor approximation before the output weights of the newly added ELM $h_m$ are solved.
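Concretely, writing $g_i$ and $s_i$ for the first- and second-order gradients of the loss with respect to the model output, evaluated at $f_{m-1}(\mathbf{x}_i)$ for the instances in the stochastic subset (illustrative notation), the approximation reads

```latex
% Second-order (Newton-type) expansion of the loss around the current ensemble prediction,
% with h_m(x) = h(x) * beta_m the output of the newly added ELM (h(x): hidden-layer output vector).
\mathcal{L}^{(m)} \approx \sum_{i}\Bigl[ L\bigl(y_i, f_{m-1}(\mathbf{x}_i)\bigr)
   + g_i\, h_m(\mathbf{x}_i) + \tfrac{1}{2}\, s_i\, h_m(\mathbf{x}_i)^{2} \Bigr]
   + \frac{\lambda}{2}\,\lVert \boldsymbol{\beta}_m \rVert^{2}.
```

Minimizing this quadratic with respect to $\boldsymbol{\beta}_m$ yields the derivation formula; since it is hard to solve analytically in general, the heuristic iterative update of Algorithm 1 is applied instead.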
In Algorithm 2, all the input weights and hidden biases of the individual ELMs are randomly chosen within a fixed range (typically $[-1, 1]$ for ELM) and kept frozen, so that only the output-layer weights of each newly added ELM need to be learned.
4. Performance Validation
In this section, a series of experiments is conducted to validate the feasibility and effectiveness of our proposed SGB-ELM algorithm, and we compare the generalization performance and prediction stability of several typical ensemble learning methods (EN-ELM [5], V-ELM [6], Bagging [14], and Adaboost [15]) on 4 KEEL [25] regression and 5 UCI [26] classification datasets. For all of the above-mentioned ensemble methods, the basic ELM model proposed in [2] is used as the individual learner, where the sigmoid function $g(x) = 1/(1+\exp(-x))$ is adopted as the activation function.
4.1. Performance Evaluation of SGB-ELM
For the regression problem, the performance of SGB-ELM and the comparative algorithms is measured by the Root Mean Square Error (RMSE), which reveals the difference between the predicted values and the targets: given $N$ testing instances with targets $y_i$ and predictions $\hat{y}_i$, $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2}$. Additionally, in this paper we take the squared loss as the loss function of SGB-ELM for the regression task.
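For the squared loss, the quantities entering the boosting update are particularly simple (a standard fact, stated here for completeness):

```latex
% Squared loss and its first/second derivatives with respect to the model output f:
L(y, f) = \tfrac{1}{2}(y - f)^{2}, \qquad
-\frac{\partial L}{\partial f} = y - f \;(\text{pseudo-residual}), \qquad
\frac{\partial^{2} L}{\partial f^{2}} = 1.
```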
The performances of the traditional ELM, simple ensemble ELM, Bagging ELM, Adaboost ELM, and our proposed SGB-ELM are compared on 4 representative regression datasets selected from the KEEL repository [25]. In our experiments, all the input attributes of each dataset are normalized to a common range before training; the characteristics of these datasets are summarized in Table 1.
Table 1
Details of 4 KEEL regression datasets.
| No. | Datasets | Condition attributes | Training samples | Testing samples |
|---|---|---|---|---|
| 1 | Laser | 4 | 695 | 298 |
| 2 | Friedman | 5 | 840 | 360 |
| 3 | Mortgage | 15 | 734 | 315 |
| 4 | Wizmir | 9 | 1023 | 438 |
Table 2
The comparison results between SGB-ELM and other representative algorithms on 4 regression datasets.
| Dataset | Algorithm | Training time | Training RMSE (Dev) | Testing RMSE (Dev) | Hidden nodes | Iterations |
|---|---|---|---|---|---|---|
| Laser | ELM | 0.0097 | 12.3791 | 12.7783 | 80 | N/A |
| | Simple ensemble | 0.1547 | 12.1216 | 12.8794 | 80 | 10 |
| | Bagging | 0.7991 | 12.1707 | 13.4085 | 80 | 50 |
| | Adaboost | 0.1591 | 11.4460 | 12.0666 | 50 | Max = 50 |
| | SGB-ELM | 3.4853 | 7.6354 | 8.4170 | 50 ( | 50 |
| Friedman | ELM | 0.0141 | 1.4220 | 1.5124 | 100 | N/A |
| | Simple ensemble | 0.2081 | 1.4005 | 1.4791 | 100 | 10 |
| | Bagging | 1.0144 | 1.4111 | 1.4906 | 100 | 50 |
| | Adaboost | 0.5219 | 1.2551 | 1.3342 | 60 | Max = 50 |
| | SGB-ELM | 4.8853 | 1.0627 | 1.1581 | 60 ( | 50 |
| Mortgage | ELM | 0.0200 | 0.0855 | 0.0961 | 150 | N/A |
| | Simple ensemble | 0.3044 | 0.0843 | 0.0947 | 150 | 10 |
| | Bagging | 1.4544 | 0.0834 | 0.0937 | 150 | 50 |
| | Adaboost | 0.4778 | 0.0785 | 0.0885 | 80 | Max = 50 |
| | SGB-ELM | 6.2434 | 0.0607 | 0.0759 | 80 ( | 50 |
| Wizmir | ELM | 0.0128 | 1.0906 | 1.1263 | 100 | N/A |
| | Simple ensemble | 0.2066 | 1.0869 | 1.1203 | 100 | 10 |
| | Bagging | 1.0366 | 1.0859 | 1.1165 | 100 | 50 |
| | Adaboost | 0.4331 | 1.0622 | 1.1091 | 60 | Max = 50 |
| | SGB-ELM | 5.6525 | 1.0148 | 1.1032 | 60 ( | 50 |
For the classification problem, like other typical feedforward neural networks (for instance, BP neural networks [28]), SGB-ELM evaluates the predicted output by calculating the sum of squared errors between the network outputs and the one-hot-encoded targets. Specifically, the predicted label of an instance is taken as the index of the largest entry of the output vector produced by the ensemble model.
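A small sketch of this evaluation in Python, assuming one-hot-encoded targets `T` and an ensemble output matrix `F` of shape `(N, C)` (the names are illustrative):

```python
import numpy as np

def squared_error_and_accuracy(F, T):
    """F: (N, C) predicted outputs, T: (N, C) one-hot targets (a sketch)."""
    sse = np.sum((F - T) ** 2)            # sum of squared errors used as the training criterion
    y_pred = np.argmax(F, axis=1)         # predicted label = index of the largest output entry
    y_true = np.argmax(T, axis=1)
    accuracy = float(np.mean(y_pred == y_true))
    return sse, accuracy
```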
Similarly, we select 5 popular classification datasets from the UCI Machine Learning Repository [26] to verify the performance of the proposed SGB-ELM algorithm. For each dataset, all the decision attributes are encoded by the one-hot scheme [29]. The characteristics of these datasets are described in Table 3, where each original dataset is equally divided into two groups: a training set and a testing set of equal size.
Table 3
Details of 5 UCI classification datasets.
| No. | Datasets | Condition attributes | Decision attributes | Training samples | Testing samples |
|---|---|---|---|---|---|
| 1 | Image segmentation | 19 | 7 | 1155 | 1155 |
| 2 | Texture | 40 | 11 | 2750 | 2750 |
| 3 | Spambase | 57 | 2 | 2295 | 2294 |
| 4 | Banana | 2 | 2 | 2650 | 2650 |
| 5 | Ring | 20 | 2 | 3700 | 3700 |
Table 4
The comparison results between SGB-ELM and other representative algorithms on 5 classification datasets.
| Dataset | Algorithm | Training time | Training accuracy (Dev) | Testing accuracy (Dev) | Hidden nodes | Iterations |
|---|---|---|---|---|---|---|
| Segmentation | ELM | 0.0431 | 0.9465 | 0.9351 | 180 | N/A |
| | V-ELM | 0.4487 | 0.9463 | 0.9374 | 180 | 7 |
| | EN-ELM | 43.9234 | 0.9472 | 0.9353 | 180 ( | 50 |
| | Bagging | 3.1853 | 0.9474 | 0.9353 | 180 | 50 |
| | Adaboost | 3.5372 | 0.9853 | 0.9466 | 100 | Max = 100 |
| | SGB-ELM | 134.2969 | 0.9761 | 0.9558 | 100 ( | 100 |
| Texture | ELM | 0.0338 | 0.9954 | 0.9945 | 100 | N/A |
| | V-ELM | 0.4275 | 0.9965 | 0.9950 | 100 | 7 |
| | EN-ELM | 44.4969 | 0.9963 | 0.9946 | 100 ( | 50 |
| | Bagging | 3.0959 | 0.9965 | 0.9957 | 100 | 50 |
| | Adaboost | 10.3628 | 0.9996 | 0.9972 | 60 | Max = 100 |
| | SGB-ELM | 193.2019 | 0.9992 | 0.9982 | 60 ( | 100 |
| Spambase | ELM | 0.0459 | 0.9174 | 0.9080 | 150 | N/A |
| | V-ELM | 0.4213 | 0.9192 | 0.9115 | 150 | 7 |
| | EN-ELM | 62.6000 | 0.9183 | 0.9071 | 150 ( | 50 |
| | Bagging | 2.9869 | 0.9219 | 0.9145 | 150 | 50 |
| | Adaboost | 7.6875 | 0.9620 | 0.9234 | 100 | Max = 100 |
| | SGB-ELM | 129.0922 | 0.9522 | 0.9222 | 100 ( | 100 |
| Banana | ELM | 0.0550 | 0.6838 | 0.6787 | 180 | N/A |
| | V-ELM | 0.4906 | 0.6860 | 0.6848 | 180 | 7 |
| | EN-ELM | 67.5578 | 0.6821 | 0.6780 | 180 ( | 50 |
| | Bagging | 3.3253 | 0.6808 | 0.6777 | 180 | 50 |
| | Adaboost | 7.5100 | 0.7457 | 0.7448 | 100 | Max = 100 |
| | SGB-ELM | 133.0791 | 0.7610 | 0.7563 | 100 ( | 100 |
| Ring | ELM | 0.0897 | 0.9492 | 0.9418 | 200 | N/A |
| | V-ELM | 0.7609 | 0.9532 | 0.9466 | 200 | 7 |
| | EN-ELM | 114.0641 | 0.9517 | 0.9418 | 200 ( | 50 |
| | Bagging | 5.5241 | 0.9539 | 0.9468 | 200 | 50 |
| | Adaboost | 17.2109 | 0.9940 | 0.9524 | 150 | Max = 100 |
| | SGB-ELM | 363.7976 | 0.9750 | 0.9567 | 150 ( | 100 |
[figures omitted; refer to PDF]
Tables 2 and 4 present the comparison results, including training time, training RMSE/accuracy, and testing RMSE/accuracy, for the regression and classification tasks, respectively. SGB-ELM obtains better generalization capability in most cases without significantly increasing the training time. At the same time, SGB-ELM tends to have smaller training and testing deviations (Dev) than the comparative learning algorithms, which validates the robustness and stability of the proposed SGB-ELM algorithm. In particular, since SGB-ELM adopts a training mechanism similar to that of Adaboost, which integrates multiple weak individual learners sequentially, the number of hidden nodes is set to a smaller value for both SGB-ELM and Adaboost. It is worth noting that SGB-ELM achieves better performance than the existing methods with fewer hidden nodes and outperforms Adaboost with the same number of hidden nodes.
From Figures 1 and 2, we can see that SGB-ELM is more stable than the traditional ELM, simple ensemble, Bagging, and Adaboost.R2 on the regression problems and also exhibits better robustness than V-ELM, EN-ELM, Bagging, and Adaboost.SAMME on the classification problems. This shows that SGB-ELM not only focuses on reducing the prediction bias, as other boosting-like methods do, but also generates a robust ensemble model with low variance. As observed in Figure 2, although Adaboost.SAMME produces higher training accuracy than SGB-ELM during most of the 50 trials, SGB-ELM obtains better generalization capability (testing accuracy). This can mainly be attributed to two factors: the regularization item in the objective controls the complexity of the ensemble model, and the stochastic subsampling at each iteration reduces overfitting to the training instances.
Figure 3 shows the training RMSE/accuracy and testing RMSE/accuracy of Adaboost (Adaboost.R2 for regression and Adaboost.SAMME for classification) and SGB-ELM with regard to the number of iterations. The fixed reference line denotes the training and testing performance of a traditional ELM equipped with many more hidden nodes. As shown in Figure 3, SGB-ELM clearly improves the generalization capability of the initial base ELM in both the regression and classification tasks. From Figure 3(a), we can see that the training and testing RMSE decline gradually as the number of iterations increases; similarly, both the training and testing accuracy curves show an increasing trend in Figure 3(b). Multiple random initializations of the parameters of the initial base learner are conducted in these trials.
From the experimental results on both the regression and classification problems, we conclude that the proposed SGB-ELM algorithm not only achieves better generalization capability (lower prediction bias) than the typical existing variants of ELM, but also yields a sufficiently robust ELM ensemble learning model (lower prediction variance).
4.2. Impact of Learning Parameters on Training SGB-ELM
To achieve good generalization performance, three learning parameters of SGB-ELM, including the number of hidden nodes of each individual ELM, need to be chosen appropriately; their impact on training SGB-ELM is analyzed in this subsection.
For the basic ELM model, the number of hidden nodes is the key parameter controlling the capacity of the network: too few hidden nodes limit the fitting ability, whereas too many may lead to overfitting.
[figures omitted; refer to PDF]
Figure 4 illustrates how the training and testing performance of SGB-ELM change as the value of this learning parameter is varied.
From Figure 5, it is obvious that randomization improves the performance of SGB-ELM substantially. As each weak individual ELM is learned on a randomly selected subset of the whole training dataset, the diversity among the individual learners increases. On the other hand, randomization introduces a noisy estimate of the total training loss; as a result, it slows down convergence and may even make the learning curve fluctuate (higher variance) if the randomly selected subset is too small.
5. Conclusions
In this paper, we proposed a novel ensemble model named Stochastic Gradient Boosting-based Extreme Learning Machine (SGB-ELM). Instead of combining ELM and stochastic gradient boosting directly, we construct a sequence of weak ELMs in which the output-layer weights of each ELM are determined by optimizing a regularized objective additively. First, by minimizing the objective with a second-order approximation, the derivation formula for the output-layer weights of each individual ELM is obtained. Then we take the output-layer weights learned from the current pseudo-residuals as a heuristic item and obtain the optimal output-layer weights by updating this heuristic item iteratively. The performance of SGB-ELM was evaluated on 4 regression and 5 classification datasets. In comparison with several typical ELM ensemble methods, SGB-ELM obtained better performance and robustness, which demonstrates the feasibility and effectiveness of the SGB-ELM algorithm.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Authors’ Contributions
Hua Guo and Jikui Wang contributed equally to this work.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (61503252 and 61473194), the China Postdoctoral Science Foundation (2016T90799), and the Natural Science Foundation of Gansu Province (17JR5RA177).
[1] G. B. Huang, Q. Y. Zhu, C. K. Siew, "Extreme learning machine: a new learning scheme of feedforward neural networks," Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, pp. 985-990, DOI: 10.1109/IJCNN.2004.1380068, 2004.
[2] G. B. Huang, Q. Y. Zhu, C. K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70 no. 1–3, pp. 489-501, DOI: 10.1016/j.neucom.2005.12.126, 2006.
[3] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42 no. 2, pp. 513-529, DOI: 10.1109/TSMCB.2011.2168604, 2012.
[4] R. Penrose, "A generalized inverse for matrices," Mathematical Proceedings of the Cambridge Philosophical Society, vol. 51 no. 3, pp. 406-413, DOI: 10.1017/S0305004100030401, 1955.
[5] N. Liu, H. Wang, "Ensemble based extreme learning machine," IEEE Signal Processing Letters, vol. 17 no. 8, pp. 754-757, DOI: 10.1109/LSP.2010.2053356, 2010.
[6] J. Cao, Z. Lin, G.-B. Huang, N. Liu, "Voting based extreme learning machine," Information Sciences, vol. 185, pp. 66-77, DOI: 10.1016/j.ins.2011.09.015, 2012.
[7] X. Xue, M. Yao, Z. Wu, J. Yang, "Genetic ensemble of extreme learning machine," Neurocomputing, vol. 129, pp. 175-184, DOI: 10.1016/j.neucom.2013.09.042, 2014.
[8] A. O. M. Abuassba, D. Zhang, X. Luo, A. Shaheryar, H. Ali, "Improving Classification Performance through an Advanced Ensemble Based Heterogeneous Extreme Learning Machines," Computational Intelligence and Neuroscience, vol. 2017, 2017.
[9] M. Han, B. Liu, "Ensemble of extreme learning machine for remote sensing image classification," Neurocomputing, vol. 149, pp. 65-70, DOI: 10.1016/j.neucom.2013.09.070, 2015.
[10] H.-J. Lu, C.-L. An, E.-H. Zheng, Y. Lu, "Dissimilarity based ensemble of extreme learning machine for gene expression data classification," Neurocomputing, vol. 128, pp. 22-30, DOI: 10.1016/j.neucom.2013.02.052, 2014.
[11] B. Mirza, Z. Lin, N. Liu, "Ensemble of subset online sequential extreme learning machine for class imbalance and concept drift," Neurocomputing, vol. 149, pp. 316-329, DOI: 10.1016/j.neucom.2014.03.075, 2015.
[12] D. Wang, M. Alhamdoosh, "Evolutionary extreme learning machine ensembles with size control," Neurocomputing, vol. 102, pp. 98-110, DOI: 10.1016/j.neucom.2011.12.046, 2013.
[13] X.-Z. Wang, R. Wang, H.-M. Feng, H.-C. Wang, "A new approach to classifier fusion based on upper integral," IEEE Transactions on Cybernetics, vol. 44 no. 5, pp. 620-635, DOI: 10.1109/TCYB.2013.2263382, 2014.
[14] L. Breiman, "Bagging predictors," Machine Learning, vol. 24 no. 2, pp. 123-140, 1996.
[15] Y. Freund, R. Schapire, "A short introduction to boosting," Journal of Japanese Society For Artificial Intelligence, vol. 14, pp. 771-780, 1999.
[16] J. B. Hampshire, A. H. Waibel, "Novel objective function for improved phoneme recognition using time-delay neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 1 no. 2, pp. 216-228, DOI: 10.1109/72.80233, 1990.
[17] Q. Xu, Y. Xiong, H. Dai, K. M. Kumari, Q. Xu, H.-Y. Ou, D.-Q. Wei, "PDC-SGB: Prediction of effective drug combinations using a stochastic gradient boosting algorithm," Journal of Theoretical Biology, vol. 417, DOI: 10.1016/j.jtbi.2017.01.019, 2017.
[18] X. Yao, Y. Liu, "Making use of population information in evolutionary artificial neural networks," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 28 no. 3, pp. 417-425, DOI: 10.1109/3477.678637, 1998.
[19] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29 no. 5, pp. 1189-1232, DOI: 10.1214/aos/1013203451, 2001.
[20] J. H. Friedman, "Stochastic gradient boosting," Computational Statistics & Data Analysis, vol. 38 no. 4, pp. 367-378, DOI: 10.1016/s0167-9473(01)00065-2, 2002.
[21] P. T. Boggs, J. W. Tolle, "Sequential quadratic programming," Acta Numerica, vol. 4, DOI: 10.1017/S0962492900002518, 1995.
[22] M. A. Figueiredo, J. M. Bioucas-Dias, R. D. Nowak, "Majorization-minimization algorithms for wavelet-based image restoration," IEEE Transactions on Image Processing, vol. 16 no. 12, pp. 2980-2991, DOI: 10.1109/TIP.2007.909318, 2007.
[23] R. Battiti, "First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method," Neural Computation, vol. 4 no. 2, pp. 141-166, DOI: 10.1162/neco.1992.4.2.141, 1992.
[24] P. L. Bartlett, "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network," Institute of Electrical and Electronics Engineers Transactions on Information Theory, vol. 44 no. 2, pp. 525-536, DOI: 10.1109/18.661502, 1998.
[25] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, "KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework," Journal of Multiple-Valued Logic and Soft Computing, vol. 17 no. 2-3, pp. 255-287, 2011.
[26] M. Lichman, UCI Machine Learning Repository, 2013.
[27] H. Drucker, "Improving regressors using boosting techniques," Proceedings of the International Conference on Machine Learning, vol. 97, pp. 107-115, 1997.
[28] D. E. Rumelhart, G. E. Hinton, R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323 no. 6088, pp. 533-536, DOI: 10.1038/323533a0, 1986.
[29] A. Coates, A. Y. Ng, "The importance of encoding versus training with sparse coding and vector quantization," Proceedings of the 28th International Conference on Machine Learning (ICML '11), pp. 921-928, 2011.
[30] J. Zhu, H. Zou, S. Rosset, T. Hastie, "Multi-class AdaBoost," Statistics and Its Interface, vol. 2 no. 3, pp. 349-360, DOI: 10.4310/SII.2009.v2.n3.a8, 2009.
Copyright © 2018 Hua Guo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/
Abstract
A novel ensemble scheme for the extreme learning machine (ELM), named Stochastic Gradient Boosting-based Extreme Learning Machine (SGB-ELM), is proposed in this paper. Instead of incorporating the stochastic gradient boosting method into the ELM ensemble procedure directly, SGB-ELM constructs a sequence of weak ELMs in which each individual ELM is trained additively by optimizing a regularized objective. Specifically, we design an objective function based on the boosting mechanism and simultaneously introduce a regularization item to alleviate overfitting. Then the derivation formula for the output-layer weights of each weak ELM is determined using a second-order optimization. As the derivation formula is hard to solve analytically and the regularized objective favors simple functions, we take the output-layer weights learned from the current pseudo-residuals as an initial heuristic item and obtain the optimal output-layer weights by using the derivation formula to update the heuristic item iteratively. In comparison with several typical ELM ensemble methods, SGB-ELM achieves better generalization performance and prediction robustness, which demonstrates the feasibility and effectiveness of SGB-ELM.