1. Introduction
Extreme learning machine (ELM) [1] is a training algorithm for single-hidden-layer feedforward neural networks (SLFNs) that aims to minimize the training error while ensuring the minimum norm of output weights [2]. Its distinctive feature lies in the random initialization of input layer weights and hidden layer biases, which remain fixed throughout the training process.
Although ELM’s randomly generated network parameters guarantee the universal approximation capability of SLFNs [3], research has shown that ELM requires a larger number of hidden nodes than traditional SLFNs to achieve similar performance [2,4]. Studies addressing this issue primarily follow two directions. One is determining the optimal number of hidden nodes to construct a compact network structure [1]. The other focuses on setting effective hidden node parameters; the most common approaches employ optimization algorithms to search for effective input layer weights and hidden layer biases [5–8]. However, these methods require multiple ELM training iterations, resulting in high computational overhead, which contradicts ELM’s design principle.
Kernel ELM (KELM) [1] obviates the need to set the number of hidden nodes and their parameters, exhibiting improved stability and generalization ability compared to ELM. Reduced KELM (RKELM) [9] further enhances training efficiency by randomly selecting a subset of training samples as support vectors. The effectiveness of kernel methods stems from the ability of kernel functions to compute complex similarities between different samples [10]. Kärkkäinen [11] adopted the Euclidean distance to measure the relationship between arbitrary training samples and selected reference points in ELM, demonstrating that similarity-function-based hidden layer outputs preserve the universal approximation capability of ELM. Therefore, from the perspective of similarity functions, it is possible to reinterpret the mechanism of kernel mapping and guide the selection of kernel functions.
On the other hand, different kernel functions possess varying data mapping capabilities. To fully exploit the advantages of different kernel functions, ELM based on multiple kernel functions has been proposed. The most common multi-kernel model involves linearly combining multiple kernel functions [12,13]. However, it is difficult to predict the contribution of different kernel functions to classification performance, and determining appropriate weights for each kernel function to maximize the model’s generalization performance remains a significant challenge in multi-kernel learning. While some researchers have proposed using optimization methods to determine the optimal weights [14], the training process incurs high computational costs. Consequently, designing multi-kernel fusion strategies remains an open problem.
Multilayer ELM (ML-ELM) [14] achieves deep representation of input features by stacking multiple ELM autoencoders (ELM-AE). Similar approaches include hierarchical ELM (H-ELM) [15] and KELM-based autoencoders (KELM-AE) [16]. These methods construct deep network models by vertically stacking multiple autoencoders, enabling layer-by-layer representation of the input features. In contrast, Wang [17] arranged multiple autoencoders horizontally, fusing the encoding results into a single vector. In addition, [18] replaces traditional predefined kernel functions with the multi-layer complex mappings of a DNN. Overall, while deep learning based on ELM-AE has yielded fruitful results, horizontal expansion of networks based on ELM-AE remains an under-explored area, and research on autoencoders based on multiple kernels or similarity functions is still lacking.
Cost-sensitive learning has drawn great attention in both theoretical research and engineering practice. Theoretical research can be divided into two categories, direct cost-sensitive learning and indirect cost-sensitive learning [19]. The former uses cost information to design new variants of basic classifiers, while the latter treats traditional classification algorithms as black boxes and performs preprocessing on the training data or post-processing on the outputs using the cost information [20]. In engineering practice, cost-sensitive learning has been widely used in imbalanced classification [1], biomedicine [21], fault diagnosis [22], etc. Traditional ELM-based cost-sensitive learning has mainly depended on the original ELM model [1]. Therefore, exploration based on other new variants of ELM is still needed.
Based on the aforementioned analysis, this paper proposes two multi-kernel cost-sensitive ELM models based on expected kernel functions. Reinterpreting the significance of kernel functions from the perspective of similarity, we present a simplified kernel autoencoder model by randomly selecting a subset of samples from the input data as reference points. Subsequently, inspired by the theory of expected kernel ELM, we add a random mapping layer after the input layer, designing a simplified expected kernel autoencoder that effectively combines random mapping and similarity mapping. Finally, we define four similarity kernel functions and utilize the simplified expected kernel autoencoder to design two multi-kernel ELM models, converting the classifier output into posterior probabilities and implementing cost-sensitive decision-making based on the minimum risk criterion.
2. Related work
2.1 ELM and KELM
Let a training set consisting of N training samples be denoted as $\{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ is a d-dimensional input vector and $\mathbf{t}_i \in \mathbb{R}^{m}$ is the corresponding class vector, with m the number of classes. For a single-hidden-layer neural network with L hidden nodes and m output nodes, the output function of ELM can be represented as:
$$f_i(\mathbf{x}) = \sum_{j=1}^{L} \beta_{ij}\, h_j(\mathbf{w}_j, b_j, \mathbf{x}), \quad i = 1, \dots, m \qquad (1)$$
where $\beta_{ij}$ represents the output weight connecting the j-th hidden node to the i-th output node, $h_j(\mathbf{w}_j, b_j, \mathbf{x})$ is the output of the j-th hidden node for the input sample $\mathbf{x}$, $\mathbf{w}_j \in \mathbb{R}^{d}$ is the randomly generated weight vector connecting the input to the j-th hidden node, and $b_j$ is the bias of the j-th hidden node.
Let $\mathbf{H} = \left[h_j(\mathbf{w}_j, b_j, \mathbf{x}_i)\right]_{N \times L}$ be the hidden layer output matrix, $\boldsymbol{\beta} = [\beta_{ij}]_{L \times m}$ the output weight matrix, and $\mathbf{T} = [\mathbf{t}_1, \dots, \mathbf{t}_N]^{T}$ the target matrix. Eq (1) can then be rewritten in matrix form as:
$$\mathbf{H}\boldsymbol{\beta} = \mathbf{T} \qquad (2)$$
The solution of Eq (2) can be expressed in two forms:
$$\boldsymbol{\beta} = \mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T} \quad \text{or} \quad \boldsymbol{\beta} = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T} \qquad (3)$$
where C is the regularization parameter. When the form of $h_j(\cdot)$ is unknown, a kernel matrix can be defined for the ELM using the Mercer criterion:
$$\boldsymbol{\Omega} = \mathbf{H}\mathbf{H}^{T}, \quad \Omega_{ij} = h(\mathbf{x}_i)\cdot h(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j) \qquad (4)$$
Replacing $\mathbf{H}\mathbf{H}^{T}$ with the kernel matrix $\boldsymbol{\Omega}$ of Eq (4), the output function of KELM can be expressed as:
$$f(\mathbf{x}) = \left[K(\mathbf{x}, \mathbf{x}_1), \dots, K(\mathbf{x}, \mathbf{x}_N)\right]\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}\right)^{-1}\mathbf{T} \qquad (5)$$
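As an illustration of Eqs (2)-(5), the following NumPy sketch computes the ELM and KELM solutions; the sigmoid activation, the RBF kernel, and all function and parameter names are illustrative assumptions rather than the exact implementation used in this paper.

```python
import numpy as np

def elm_train(X, T, L=200, C=1.0, seed=0):
    """Basic ELM (Eqs 2-3): random hidden layer followed by a ridge solution."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.uniform(-1, 1, size=(d, L))        # random input weights
    b = rng.uniform(-1, 1, size=L)             # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # N x L hidden layer output
    # First form of Eq (3): beta = H^T (I/C + H H^T)^{-1} T
    beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return W, b, beta

def kelm_train_predict(X, T, X_test, C=1.0, gamma=1.0):
    """KELM (Eqs 4-5) with an assumed RBF kernel: no hidden nodes to tune."""
    sq = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    Omega = np.exp(-gamma * sq(X, X))          # N x N kernel matrix (Eq 4)
    alpha = np.linalg.solve(np.eye(len(X)) / C + Omega, T)
    K_test = np.exp(-gamma * sq(X_test, X))    # kernel values between test and training samples
    return K_test @ alpha                      # Eq (5)
```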
ELM-AE [14] (as shown in Fig 1A) is essentially an ELM whose input and output are the same. Let $\boldsymbol{\Gamma}$ be the output weights of the autoencoder and $\mathbf{X}$ the input matrix. The solution of ELM-AE can then be expressed as:
$$\boldsymbol{\Gamma} = \mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{X} \qquad (6)$$
[Figure omitted. See PDF.]
(a) ELM Autoencoder. (b) KELM Autoencoder.
Based on KELM, Wong [16] replaced the random mapping function in the autoencoder with a kernel function and proposed KELM-AE (as shown in Fig 1B), whose output layer encoding matrix can be represented as:
$$\boldsymbol{\Gamma} = \left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}\right)^{-1}\mathbf{X} \qquad (7)$$
It has been demonstrated that composite kernels can efficiently combine multi-level information and achieve excellent performance [12]. In [12], composite kernels are used to map contextual and spectral information:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \mu K_1(\mathbf{x}_i, \mathbf{x}_j) + (1-\mu)\, K_2(\mathbf{x}_i, \mathbf{x}_j) \qquad (8)$$
where μ is a balance factor and $K_1(\mathbf{x}_i, \mathbf{x}_j)$, $K_2(\mathbf{x}_i, \mathbf{x}_j)$ are the radial basis function and polynomial kernel, respectively. The same methodology was adopted in [13]. However, this approach introduces a hyper-parameter, the balance factor, which is difficult to set. Therefore, [23] integrated multiple kernel matrices by summing them directly. Unlike other works that adopt traditional kernel functions, [18] redefined the kernel matrix as:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i)^{T}\Phi(\mathbf{x}_j) \qquad (9)$$
where $\Phi(\mathbf{x})$ is the hidden layer output of a neural network. Instead of integrating multiple kernel functions into one super kernel, [16] proposed ML-KELM by stacking multiple KELM-AEs, which does not need to tune the parameters of all layers as in ML-ELM. In ML-KELM, each encoder has a distinct kernel function.
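A minimal sketch of the two fusion strategies discussed above is given below; the function names and the fixed value of the balance factor are assumptions for illustration only.

```python
import numpy as np

def composite_kernel(K1, K2, mu=0.5):
    """Convex combination of two precomputed kernel matrices (Eq 8);
    mu is the balance factor that must be tuned by hand in [12,13]."""
    return mu * K1 + (1.0 - mu) * K2

def summed_kernel(kernel_matrices):
    """The alternative of [23]: sum the kernel matrices directly,
    avoiding the balance factor."""
    return sum(kernel_matrices)
```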
2.2 Cost-sensitive learning based on ELM
ELM has attracted extensive attention from scholars in the field of cost-sensitive learning, and many cost-sensitive variations of ELM have been proposed.
For class-imbalanced problems, the most intuitive approach is to directly weight the samples according to the class sizes [21]:
$$W_{ii} = \frac{N_{y_{\min}}}{N_{y_i}} \qquad (10)$$
where $W_{ii}$ is the squared-error weight of $\mathbf{x}_i$, $N_{y_{\min}}$ is the number of samples in the smallest class, and $N_{y_i}$ is the number of samples in class $y_i$. The cost-sensitive ELM model (CELM) [24] uses misclassification cost information to weight the classification errors of the samples:
$$W_{ii} = c_i \qquad (11)$$
where $c_i$ is the misclassification cost of the i-th sample. Zhu [25] instead uses the sum of the misclassification costs over all classes in the cost matrix as the weight:
$$W_{ii} = \sum_{j=1}^{m} c_{y_i j} \qquad (12)$$
When constructing a cost-sensitive ELM model, Zhang [26] takes both the class imbalance of the training data and the misclassification costs into account; in Eq (13) the error weight is given by a cost information vector B determined jointly by the class weight matrix and the misclassification cost matrix. Similarly, Daneshfar [27,28] proposed a hierarchical ELM (H-ELM) classification method consisting of two successive stages, an unsupervised hierarchical feature representation and a supervised feature classification, in which the weights of the different classes are defined as in Eq (14).
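To make the weighting schemes of Eqs (10)-(12) concrete, the sketch below builds class-ratio weights and plugs per-sample weights into a weighted ridge solution for the output weights; the closed form used here is the generic weighted ELM solution and is an assumption for illustration, not necessarily the exact formulation of each cited model.

```python
import numpy as np

def class_ratio_weights(y):
    """Eq (10): weight each sample by N_ymin / N_yi."""
    classes, counts = np.unique(y, return_counts=True)
    ratio = counts.min() / counts
    lookup = dict(zip(classes, ratio))
    return np.array([lookup[c] for c in y])

def weighted_elm_beta(H, T, sample_weights, C=1.0):
    """Weighted ELM output weights: each sample's squared error is scaled by
    W_ii, e.g. the class ratios of Eq (10) or the costs of Eqs (11)-(12)."""
    W = np.diag(sample_weights)
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / C + H.T @ W @ H, H.T @ W @ T)
```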
In addition, indirect cost-sensitive methods based on ELM mainly involve ensemble of multiple ELM classifiers [29], or adjusting the classifier output thresholds [22].
Although the above methods have achieved some results in specific application areas, the vast majority of cost-sensitive models mentioned above utilize the original ELM as the base classifier, with little research on multi-layered or parallel cost-sensitive ELM models. Additionally, the use of ensemble methods to estimate posterior probabilities can fully utilize cost matrix information, but requires training multiple classifiers. For example, Lu’s method [30] requires training thirty classifiers each time, resulting in high time overhead when dealing with large training data. Therefore, further research is needed in cost-sensitive learning based on ELM.
3. Double hidden layer autoencoder based on simplified expected kernel function
In this section, the reduced kernel extreme learning machine (RKELM) is first introduced, and an RKELM-based autoencoder (RKELM-AE) is proposed. Subsequently, a reduced expectation kernel autoencoder (REKELM-AE) is proposed by adding a random mapping layer after the input layer of RKELM-AE according to the expectation kernel theory.
3.1 RKELM-AE
In KELM, each training sample is measured for similarity with all other training samples [10]. For any training sample $\mathbf{x}_i$, KELM essentially creates new features for $\mathbf{x}_i$ by measuring its similarity with samples of different classes. Let $\mathbf{r}_i$ denote a reference point used for measuring similarity with the training samples, and let $R = \{\mathbf{r}_1, \mathbf{r}_2, \dots\}$ be the collection of all reference points. In KELM, R is the training set, and each reference point in R corresponds to one dimension of the kernel space. However, when two reference points in R are close, the similarity between the training samples and these two reference points is approximately the same, so the features corresponding to these two reference points in the mapped kernel space will be similar. Hence, when the number of training samples is large or the distribution is compact, it is unnecessary to select all samples as reference points. Kärkkäinen [11] has validated, from different perspectives, the effectiveness of randomly selecting a subset of samples from the training data as reference points for KELM.
Based on the analysis above, the proposed RKELM-AE builds upon KELM by randomly selecting a subset of $\tilde{N}$ samples ($\tilde{N} \ll N$) from the N training samples as reference points and mapping the input data by similarity to them. Similar to ELM-AE, the similarity matrix of the hidden layer of RKELM-AE can be represented as:
$$\tilde{\boldsymbol{\Omega}} = \left[K(\mathbf{x}_i, \mathbf{r}_j)\right]_{N \times \tilde{N}} \qquad (15)$$
where each row represents the similarity of one sample to the $\tilde{N}$ reference points. The output matrix can then be represented as:
$$\boldsymbol{\Gamma} = \left(\frac{\mathbf{I}}{C} + \tilde{\boldsymbol{\Omega}}^{T}\tilde{\boldsymbol{\Omega}}\right)^{-1}\tilde{\boldsymbol{\Omega}}^{T}\mathbf{X} \qquad (16)$$
From Eq (7), the computational complexity of KELM-AE is O(N³). When the number of samples is large, solving for the output matrix still takes significant time. In RKELM-AE, however, selecting only $\tilde{N} \ll N$ samples as reference points can keep the reconstruction error within a small range. Therefore, the computational complexity of RKELM-AE is much smaller than that of KELM-AE.
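The following sketch illustrates RKELM-AE as described by Eqs (15)-(16); the RBF similarity, the number of reference points, and all names are illustrative assumptions.

```python
import numpy as np

def rkelm_ae(X, n_ref=50, C=1.0, gamma=1.0, seed=0):
    """RKELM-AE sketch: similarity to a few random reference points replaces
    the full N x N kernel matrix of KELM-AE (Eqs 15-16)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_ref, len(X)), replace=False)
    R = X[idx]                                           # reference points
    dist2 = ((X[:, None, :] - R[None, :, :]) ** 2).sum(-1)
    Omega = np.exp(-gamma * dist2)                       # N x n_ref similarity matrix (Eq 15)
    # Ridge reconstruction of X from the similarity features (Eq 16)
    Gamma = np.linalg.solve(np.eye(Omega.shape[1]) / C + Omega.T @ Omega,
                            Omega.T @ X)
    return R, Gamma
```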
3.2 REKELM-AE
The main feature of ELM is the random generation of hidden layer weights. However, in ELM models based on kernel functions, the mapping of the hidden layer is deterministic or semi-deterministic (such as RKELM), and ELM can also learn effective information from this stable mapping. By combining these two models, one can simultaneously leverage the advantages of both mapping behaviors. To achieve this, Zhang [10] proposed the concept of the expectation kernel (EK) to study the relationship between random mapping and kernel functions. The linear EK is defined as:
$$K_{EK}(\mathbf{x}_i, \mathbf{x}_j) = E_{\mathbf{v}}\!\left[(\mathbf{v}^{T}\mathbf{x}_i)(\mathbf{v}^{T}\mathbf{x}_j)\right] = \int (\mathbf{v}^{T}\mathbf{x}_i)(\mathbf{v}^{T}\mathbf{x}_j)\, p(\mathbf{v})\, d\mathbf{v} \qquad (17)$$
where $\mathbf{v}^{T}\mathbf{x}_i = \mathbf{w}^{T}\mathbf{x}_i + b$ and $p(\mathbf{v})$ is the probability distribution function of the weights and biases. Sampling $\mathbf{v}$ according to $p(\mathbf{v})$, the linear EK in Eq (17) can be approximated as:
$$K_{EK}(\mathbf{x}_i, \mathbf{x}_j) \approx \frac{1}{L}\sum_{l=1}^{L}(\mathbf{v}_l^{T}\mathbf{x}_i)(\mathbf{v}_l^{T}\mathbf{x}_j) \qquad (18)$$
where $\mathbf{V} = [\mathbf{v}_1, \cdots, \mathbf{v}_L]$ is the weight matrix randomly sampled according to $p(\mathbf{v})$. However, the linear EK and the standard ELM mapping process are essentially equivalent; both can be viewed as randomly initializing the hidden layer weights and mapping the input randomly. Therefore, Zhang built upon the linear EK and defined a non-linear EK using a traditional kernel function K(⋅):
$$K_{NEK}(\mathbf{x}_i, \mathbf{x}_j) = K\!\big(h(\mathbf{x}_i),\, h(\mathbf{x}_j)\big) \qquad (19)$$
where $h(\cdot)$ denotes the random hidden layer mapping built from $\mathbf{V}$.
According to Eq (19), the non-linear EK can be divided into two independent processes: randomly mapping the original data, and using the results of the random mapping for kernel mapping.
Following the concept of EK, a random mapping layer consisting of L1 hidden nodes is added after the input layer of RKELM-AE, allowing a high-dimensional random mapping of the original data. The structure of the resulting reduced expectation kernel autoencoder, REKELM-AE, is shown in Fig 2.
[Figure omitted. See PDF.]
REKELM-AE can be seen as a combination of traditional ELM and RKELM: the first part utilizes the random hidden layers of ELM to map the input to ELM space, while the second part uses the simplified kernel ELM to compute the similarity of the random mapping results.
For the training samples, the mapping result after the first hidden layer of REKELM-AE is:
$$\mathbf{H} = \left[h(\mathbf{w}_l, b_l, \mathbf{x}_i)\right]_{N \times L_1} \qquad (20)$$
Then, $\tilde{N}$ reference points $\{\mathbf{r}_1, \dots, \mathbf{r}_{\tilde{N}}\}$ are randomly selected from the N rows of $\mathbf{H}$, and the simplified kernel matrix is calculated:
$$\tilde{\boldsymbol{\Omega}} = \left[K(\mathbf{h}_i, \mathbf{r}_j)\right]_{N \times \tilde{N}} \qquad (21)$$
where $\mathbf{h}_i$ is the i-th row of $\mathbf{H}$.
Finally, the output matrix Γ can be calculated according to Eq (16). The training procedure of REKELM-AE is presented as follows.
Algorithm 1. Pseudocode for REKELM-AE.
Input: training samples $\mathbf{X}$, distribution $p(\mathbf{v})$, number of random mapping nodes $L_1$, number of reference points $\tilde{N}$, regularization parameter C
1. Randomly generate W, b according to $p(\mathbf{v})$
2. Generate the random mapping matrix H using W, b
3. Randomly select $\tilde{N}$ reference points from the rows of H
4. Calculate the simplified kernel matrix $\tilde{\boldsymbol{\Omega}}$ according to Eq (21)
5. Calculate the output matrix Γ according to Eq (16)
Output: Γ
It should be noted that although REKELM-AE introduces a random mapping process compared to RKELM-AE, the complexity of this process is only related to the number of selected reference points after simplification and is independent of the dimensionality of the samples after dimensionality augmentation. Therefore, REKELM-AE retains the efficiency of RKELM-AE.
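As a companion to Algorithm 1, the sketch below implements the five steps with NumPy; the sigmoid activation, the RBF similarity, and the default parameter values are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def rekelm_ae(X, L1=100, n_ref=20, C=1.0, gamma=1.0, seed=0):
    """REKELM-AE (Algorithm 1): random mapping layer + reduced kernel mapping."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, L1))                     # step 1: W, b ~ p(v)
    b = rng.standard_normal(L1)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))               # step 2: random mapping H (Eq 20)
    idx = rng.choice(len(H), size=min(n_ref, len(H)), replace=False)
    R = H[idx]                                           # step 3: reference rows of H
    dist2 = ((H[:, None, :] - R[None, :, :]) ** 2).sum(-1)
    Omega = np.exp(-gamma * dist2)                       # step 4: reduced kernel matrix (Eq 21)
    Gamma = np.linalg.solve(np.eye(Omega.shape[1]) / C + Omega.T @ Omega,
                            Omega.T @ X)                 # step 5: output weights Gamma (Eq 16)
    return W, b, R, Gamma
```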
4. Cost-sensitive ELM with multi-kernel autoencoder
This section first proposes two classification models based on REKELM-AE, namely multi-kernel parallel ELM (MKP-ELM) and multi-kernel residual ELM (MKR-ELM). Then, the cost-sensitivity models are realized according to the minimum risk decision theory.
4.1 MKP-ELM and MKR-ELM
Most existing AE-based neural network models achieve deep representation of inputs by vertically stacking multiple AEs. However, this layer-by-layer learning strategy accumulates reconstruction errors during feature reconstruction, so the learned features cannot represent the original input information well [16]. On the other hand, different kernel functions have different expressive capabilities for different data. By combining multiple kernel functions, the advantages of each can be leveraged, yielding better flexibility and adaptability. Based on the REKELM-AE proposed in Section 3, this paper presents two multi-kernel autoencoder-based ELM models, multi-kernel parallel ELM (MKP-ELM) and multi-kernel residual ELM (MKR-ELM), as shown in Fig 3.
[Figure omitted. See PDF.]
Both models consist of two separate processes: feature extraction based on multi-kernel parallel autoencoders, and ELM classification based on the extracted features. For MKP-ELM, given k different kernel functions, each corresponding to a REKELM-AE, the output weights $\boldsymbol{\Gamma}^{(i)}, i = 1, 2, \cdots, k$ of each autoencoder are first calculated using Algorithm 1, and the abstract features extracted by each encoder are then obtained as:
$$\mathbf{X}^{(i)} = \mathbf{X}\left(\boldsymbol{\Gamma}^{(i)}\right)^{T} \qquad (22)$$
After obtaining the k abstract feature sets extracted by the REKELM-AEs, all the results are combined into a feature matrix $\mathbf{X}_{final} = \left[\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \dots, \mathbf{X}^{(k)}\right]$, and $\mathbf{X}_{final}$ is finally used as the input to train an ELM classifier $\mathbf{H}_{final}\boldsymbol{\beta} = \mathbf{T}$.
Residual connections have been proved to be an effective and efficient principle in deep neural networks [2,3]. In order to further exploit the effectiveness of the proposed REKELM-AE, another deep model is designed by stacking multiple REKELM-AEs, with a residual connection between consecutive autoencoders, as illustrated in Fig 3. The output $\mathbf{Y}^{(i)}$ of one REKELM-AE is:
$$\mathbf{Y}^{(i)} = \mathbf{X}_{input} + \mathbf{X}_{output} \qquad (23)$$
where $\mathbf{X}_{input}$ and $\mathbf{X}_{output}$ are the input and output of the encoder. Each REKELM-AE has a different kernel function.
Stacking multiple REKELM-AEs yields multiple kernel mappings, while the residual connections ease the information loss during transmission. In this paper, 4 different kernel functions (or similarity functions) are selected, namely the radial basis function (K1(⋅)), Euclidean distance (K2(⋅)), Manhattan distance (K3(⋅)), and cosine similarity (K4(⋅)):
$$K_1(\mathbf{x}, \mathbf{r}) = \exp\!\left(-\gamma\,\|\mathbf{x} - \mathbf{r}\|^{2}\right),\quad K_2(\mathbf{x}, \mathbf{r}) = \|\mathbf{x} - \mathbf{r}\|_2,\quad K_3(\mathbf{x}, \mathbf{r}) = \sum_{l=1}^{d}\left|x_l - r_l\right|,\quad K_4(\mathbf{x}, \mathbf{r}) = \frac{\mathbf{x}^{T}\mathbf{r}}{\|\mathbf{x}\|\,\|\mathbf{r}\|} \qquad (24)$$
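A small sketch of the four similarity functions of Eq (24), applied row-wise between the data and the reference points, is given below; the parameterization of the RBF width (gamma) and the MKP-style concatenation shown in the closing comment are assumptions for illustration.

```python
import numpy as np

def similarity_matrix(X, R, kind, gamma=1.0):
    """Similarity of each row of X to each reference point in R (Eq 24)."""
    diff = X[:, None, :] - R[None, :, :]
    if kind == "rbf":                                    # K1: radial basis function
        return np.exp(-gamma * (diff ** 2).sum(-1))
    if kind == "euclidean":                              # K2: Euclidean distance
        return np.sqrt((diff ** 2).sum(-1))
    if kind == "manhattan":                              # K3: Manhattan distance
        return np.abs(diff).sum(-1)
    if kind == "cosine":                                 # K4: cosine similarity
        norms = np.linalg.norm(X, axis=1)[:, None] * np.linalg.norm(R, axis=1)[None, :]
        return (X @ R.T) / np.maximum(norms, 1e-12)
    raise ValueError(f"unknown kernel: {kind}")

# MKP-ELM style fusion: concatenate the features obtained with each kernel, e.g.
# X_final = np.hstack([encode_with(k) for k in ("rbf", "euclidean", "manhattan", "cosine")])
```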
4.2 Cost-sensitive classification
For a cost-sensitive classification problem with m classes, the cost matrix is:
$$\begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1m} \\ c_{21} & c_{22} & \cdots & c_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mm} \end{bmatrix} \qquad (25)$$
where $c_{ij}$ is the cost of classifying the i-th class as the j-th class. Then, for any sample x, the class assigned under the minimum risk decision can be expressed as:
$$y(\mathbf{x}) = \arg\min_{j \in \{1,\dots,m\}} \sum_{i=1}^{m} P(i \mid \mathbf{x})\, c_{ij} \qquad (26)$$
The key to implementing the minimum risk decision is estimating the posterior probability. Therefore, the hard output of ELM needs to be converted into a probability output. This paper adopts the method from [31], using the Sigmoid function to convert the output of MKP-ELM into posterior probabilities:
$$P(i \mid \mathbf{x}) = \frac{1}{1 + \exp\left(-f_i(\mathbf{x})\right)} \qquad (27)$$
where $f_i(\mathbf{x})$ is the output of the classifier for class i. When the number of categories is greater than 2, Eq (27) is not guaranteed to sum to 1 over all classes. Therefore, the estimates of Eq (27) are further normalized:
$$\hat{P}(i \mid \mathbf{x}) = \frac{P(i \mid \mathbf{x})}{\sum_{j=1}^{m} P(j \mid \mathbf{x})} \qquad (28)$$
Once the estimates for an unknown sample x are obtained, the minimum risk decision criterion in Eq (26) can be used for classification.
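A compact sketch of the decision rule of Eqs (26)-(28) is shown below; the function name and the assumed array shapes are illustrative only.

```python
import numpy as np

def min_risk_decision(scores, cost_matrix):
    """Convert raw classifier outputs (n_samples x m) into posteriors via the
    sigmoid (Eq 27), normalize them (Eq 28), and pick the class with minimum
    expected cost (Eq 26); cost_matrix[i, j] is the cost of deciding class j
    when the true class is i."""
    p = 1.0 / (1.0 + np.exp(-scores))          # Eq (27)
    p = p / p.sum(axis=1, keepdims=True)       # Eq (28)
    risk = p @ cost_matrix                     # expected cost of each decision
    return risk.argmin(axis=1)
```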
5. Experiment and analysis
This section validates the effectiveness of MKP-CSELM. The experiments consist of three parts: first, verifying the impact of the number of nodes in the first hidden layer (L1) and the number of randomly sampled reference points ($\tilde{N}$) in the second hidden layer on the performance of the REKELM-AE autoencoder; then, validating the effectiveness of MKP-CSELM in handling cost-sensitive problems on 21 UCI datasets; and finally, further validating the performance of the proposed method on a cost-sensitive pulmonary pathology recognition dataset.
5.1 Implementation design
There are three evaluation metrics for the classifiers: accuracy rate acc, total misclassification cost $T_c$, and relative performance $r_{\alpha}$, defined as follows:
$$acc = 1 - \frac{\sum_{i=1}^{m} err_i}{N_{test}}, \qquad T_c = \sum_{i=1}^{m}\sum_{j \neq i} err_{ij}\, c_{ij} \qquad (29)$$
where $err_i$ is the number of misclassified samples in class i, $err_{ij}$ is the number of samples of class i that are misclassified as class j, $N_{test}$ is the number of test samples, and the relative performance $r_{\alpha}$ is computed from $T_{c\alpha}$, the misclassification cost of algorithm α.
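For concreteness, the two cost-related metrics can be computed as in the short sketch below; integer-encoded class labels are assumed.

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def total_cost(y_true, y_pred, cost_matrix):
    """Total misclassification cost T_c: sum of c_ij over every sample of
    class i that is predicted as class j (i != j)."""
    return sum(cost_matrix[i, j] for i, j in zip(y_true, y_pred) if i != j)
```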
In addition to MKP-CSELM, seven methods were selected for comparison: two cost-sensitive ELM methods, cost-sensitive ELM (CELM) [24] and cost-sensitive voting ELM (CSVELM) [30]; H-ELM [27]; two cost-sensitive Bayesian methods, cost-sensitive naïve Bayes (CSNB) [32] and cost-sensitive Bayesian network (CSBN) [33]; multi-class cost-sensitive k-nearest neighbor (mcKNN) [34]; and one cost-sensitive neural network method, cost-sensitive neural network (CSNN) [34]. For CELM, sample weights were calculated from the misclassification costs using the method in Eq (11); the number of base classifiers for CSVELM was set to 30; the number of neighbors for mcKNN was set to 3 as in reference [34]. For CELM and CSVELM, the hidden layer activation function is set to the Sigmoid function. The number of hidden nodes in the classifier, L, and the regularization parameter, C, are selected from the sets {200, 400, ⋯, 2000} and {2⁻⁶, 2⁻⁴, ⋯, 2¹²}, respectively, to find the combination that minimizes the classification cost. Apart from the random encoding layer weights of MKP-CSELM, which are drawn from a Gaussian distribution with mean 0 and variance 1 [10], random weights at other locations are drawn from a uniform distribution on (−1, 1). For MKP-CSELM, the number of random mapping layer nodes and the number of reference points in the similarity mapping layer are the same for each autoencoder.
The experimental data includes 5 binary datasets and 16 multiclass datasets from UCI repository, as shown in Table 1. All features were normalized to [0,1]. Due to the large sample sizes of the Page Blocks, Sat, Segmentation, and Shuttle datasets, memory constraints were encountered during experimentation. Therefore, 1000 random samples were selected from each dataset for training and testing. For each dataset, 60% of the samples were randomly selected as the training set, and the remaining 40% were used as the test set. The experiments were conducted using MATLAB R2022a on a desktop computer with an Intel Core i9 CPU and 16GB RAM.
[Figure omitted. See PDF.]
5.2 Ablation study
To examine the effectiveness of each component in the proposed framework, a series of ablation experiments is performed on the UCI datasets.
a) Evaluation on different hidden layer nodes and reference points.
The representational capability of REKELM-AE is mainly influenced by the number of random hidden layer nodes L1 and the number of reference points $\tilde{N}$ of the similarity hidden layer. The first hidden layer can be seen as an extension of the original input dimensions, and the similarity mapping layer can be viewed as a feature compression of the former's output. Two parameters are therefore defined to measure the degree of extension and compression:
$$R_1 = \frac{L_1}{d}, \qquad R_2 = \frac{\tilde{N}}{N}$$
where d represents the original input feature dimension and N is the number of training samples.
This section only compares the impact of different parameters on the representational capability of REKELM-AE, hence the misclassification cost is neglected. Fixing C = 500 and L = 2000 for the ELM classifier, with R1 varying from 5 to 100 and R2 ranging from 0.1 to 1, Fig 4 shows the average error rates of MKP-ELM over 10 independent runs.
[Figure omitted. See PDF.]
From Fig 4, it can be observed that the classification error rate of REKELM-AE does not vary significantly with different values of R1, indicating that the classifier is not sensitive to the number of nodes in the first hidden layer. On the other hand, in 7 out of the 9 datasets (Balance, Diabetes, Ecoli, Glass, Heart statlog, Iris, and Letter), there is a clear downward trend in the classification error rate as R2 decreases. Only in one dataset (Breast-w) does the classification error rate increase as R2 decreases. This suggests that an effective similarity mapping of the input data can be achieved using only a small number of reference points; utilizing a larger $\tilde{N}$ (i.e., a larger R2) and selecting more training data as reference points may yield redundant features after similarity mapping, resulting in a decrease in classification performance. Therefore, it is unnecessary to compute the similarity matrix using a large portion or the entire training set.
To further discuss the impact of different values of R2 on classification performance, Fig 5 provides results for different values of R1 while R2 ranges from 0.01 to 0.1. It can be seen that the error rates of 4 out of the 9 datasets (Balance, Breast-w, Diabetes, Ecoli) further decrease as R2 decreases, indicating that for these four datasets, only a very small number of reference points are needed to achieve similarity mapping. On the other hand, four datasets (Glass, Ionosphere, Iris, Letter) achieve optimal values around R2 = 0.02, and as R2 is further reduced, the classification error rate increases. This is mainly because when R2 is very low, only one or a few reference points may be selected, which is insufficient to achieve similarity mapping of the input data.
[Figure omitted. See PDF.]
b) Evaluation on different models.
In this part, the effectiveness of the two proposed models is investigated. For MKP-ELM and MKR-ELM, the kernel functions, the number of reference points $\tilde{N}$, and the number of first hidden layer nodes L1 are all the same. The cost matrix is randomly generated with values in the range [1, 20]. The average results on all datasets are presented in Table 2.
[Figure omitted. See PDF.]
The results show that the performance of MKR-ELM is slightly better than that of MKP-ELM when there is only one kernel function, indicating that the residual connection is effective for improving performance. When the number of kernel functions is increased, the performance gain of MKP-ELM is smaller than that of MKR-ELM, indicating, firstly, that increasing the number of kernel functions can improve performance and, secondly, that vertically stacking multiple autoencoders with residual connections is more effective than simple horizontal expansion.
c) Evaluation on different kernel combinations.
In this part, the effectiveness of different kernel function combinations is investigated. In order to thoroughly analyze the influence of the kernel functions, all possible combinations under different numbers of kernel functions are exhausted. For example, when the number of kernel functions is 2, six sets of experiments are conducted. The average results are shown in Table 3.
[Figure omitted. See PDF.]
The results in Tables 2 and 3 both show that with the increase of the number of kernel functions, the performance of both models is improved. When the number of kernel functions is increased from 1 to 2, the performance increases significantly. Models with two and three kernel functions have similar performance, but when the number of kernel functions is increased from 3 to 4, the performance increases significantly. These indicate that different kernel functions have different feature mapping ability.
5.3 Results and analysis on UCI datasets
In this section, the effectiveness of MKP-CSELM is validated on 21 UCI datasets. To reduce computational complexity, the regularization factor and the number of hidden nodes are fixed at C = 100 and L = 1000, respectively. Both the first hidden layer of the autoencoder and the hidden layer of the classifier use the Sigmoid activation function. As analyzed in Section 5.2, R1 has a minor impact on the classifier, while the performance of MKP-ELM fluctuates significantly around R2 = 0.1. For MKP-CSELM, R1 = 10 is kept fixed, and the value of R2 is searched within the range [0.01, 0.2] with a step size of 0.01 to minimize the classification cost. For the cost matrix, three types (Type-a, Type-b, Type-c) are generated with a maximum value of 10 for each type. The training and testing sets are the same for all methods.
Every method runs 10 independent repetitions. Due to space limitations, only the results on Type-a are presented. Table 4 shows the average accuracy and misclassification cost. The “↓”, “↑” and “≈” marks indicate that the result of MKP-CSELM is significantly better than, worse than, or similar to the compared result at the 0.05 significance level, respectively. In Table 4, in terms of accuracy, MKP-CSELM obtained the largest number of best values (6), followed by mcKNN (4). On three datasets (Segmentation, Vehicle and Vowel), the accuracy of MKP-CSELM is significantly higher than that of the other methods: 3.0%, 3.3% and 7.3% higher than the second-best method, and more than 20% higher than the worst method. For the misclassification cost, MKP-CSELM achieved the best values on 10 datasets in total, nearly half of all datasets. Similarly, the misclassification cost of MKP-CSELM on the Segmentation, Vehicle and Vowel datasets is significantly lower than that of the other methods, indicating that the overall recognition performance of MKP-CSELM on these three datasets is greatly improved.
[Figure omitted. See PDF.]
Table 5 summarizes the results of rank-sum tests under the three cost matrices. Each column indicates the numbers of datasets on which MKP-CSELM is significantly worse than, similar to, and better than the compared method at the 0.05 significance level. For example, the first column under Type-a, 3, 10, 8, indicates that MKP-CSELM is significantly worse than, similar to, and better than CELM on 3, 10, and 8 datasets, respectively. The results show that under all three cost matrices, MKP-CSELM significantly outperforms the other 7 methods in both indicators on the majority of datasets. Table 6 provides the average ranks from the Friedman test. Overall, it can be observed that the rank of MKP-CSELM is the best for both indicators under the three cost matrices, followed by CELM. The performance of CSNB is the worst, followed by CSVELM, indicating that multi-classifier voting cannot estimate posterior probabilities well. The relative performance rα is calculated for each method on each dataset, and the cumulative rα of each method over all datasets under the different cost matrices is shown in Fig 6, where each color layer corresponds to a dataset. The figure indicates that MKP-CSELM performs best, followed by CELM, while CSNB shows the poorest average performance. These experimental results demonstrate the effectiveness of MKP-CSELM.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
5.4 Case study
In this section, the proposed method is applied to a realistic cost-sensitive problem. LungHist700 [35] is a dataset of histological images in pulmonary pathology. It consists of 691 images from 45 patients, each with a resolution of 1200 × 1600 pixels and stored in .jpg format. The images are captured at either 20x or 40x magnification and are categorized into seven classes (see Fig 6). An accompanying .csv file links each image to the associated patient ID. All patients have been anonymized, and the file includes an identifier to match images from the same patient.
All images are resized to 300 × 400 pixels, with 70% used for training and 30% for testing. A ResNet50 is trained and its last feature map is used as the input of MKR-ELM. For simplicity, the samples are divided into two classes, normal and abnormal. The misclassification costs for the normal and abnormal classes are 1 and 10, respectively. The results are presented in Table 7, where 'Normal' indicates MKR-ELM without cost-sensitive classification. The results show that cost-sensitive classification leads to a 4.76% reduction in classification accuracy but a 42.18% reduction in misclassification cost.
[Figure omitted. See PDF.]
6. Conclusion
This paper proposes two multi-kernel cost-sensitive ELM models based on the expectation kernel. Firstly, the kernel function is reinterpreted from the perspective of similarity, and, based on kernel ELM, a simplified kernel autoencoder is presented by randomly selecting a subset of samples from the input data as reference points. According to the expectation kernel ELM theory, a random mapping layer is then added after the input layer to design a simplified expectation kernel autoencoder, which effectively combines random mapping and similarity mapping. Four types of similarity kernel functions are defined, and two multi-kernel ELM models are designed using the simplified expectation kernel autoencoder. The classifier output is transformed into posterior probabilities, and cost-sensitive decisions are made based on the minimum risk criterion. Comparative analysis with seven cost-sensitive methods on 21 UCI datasets shows that the proposed method achieves better generalization performance than the comparative methods while selecting only a few reference points. A case study on realistic pulmonary pathology classification further demonstrates the effectiveness of the proposed approach.
References
1. Rajput Brajendra Singh, et al. "ELM-Based Imbalanced Data Classification-A Review." Informatica 48.2 (2024).
2. Zhang Z, Cai Y, Gong W. Semi-supervised learning with graph convolutional extreme learning machines. Expert Systems with Applications, 2023, 213: 119164.
3. Wu C, Khishe M, Mohammadi M, et al. RETRACTED ARTICLE: Evolving deep convolutional neural network by hybrid sine–cosine and extreme learning machine for real-time COVID19 diagnosis from X-ray images. Soft Computing, 2023, 27(6): 3307–3326.
4. Gao Q, Ai Q, Wang W. Intuitionistic Fuzzy Extreme Learning Machine with the Truncated Pinball Loss. Neural Processing Letters, 2024, 56(2): 116.
5. Bacanin N., Stoean C., Zivkovic M., Jovanovic D., Antonijevic M., Mladenovic D. Multi-Swarm Algorithm for Extreme Learning Machine Optimization. Sensors, 2022, 22(11): 1–34. pmid:35684824
6. Bacanin N., Zivkovic M., Antonijevic M., Venkatachalam K., Lee J., Nam Y., Marjanovic M., Strumberger I., Abouhawwash M. Addressing feature selection and extreme learning machine tuning by diversity-oriented social network search: an application for phishing websites detection. Complex & Intelligent Systems, 2023: 1–36.
7. Laifi A., Benmohamed E., Ltifi H. Xavier-PSO-ELM-based EEG signal classification method for predicting epileptic seizures. Multimedia Tools and Applications, 2024, 83: 30675–30696.
8. Wang F, Liang Y, Lin Z, Zhou J, Zhou T. SSA-ELM: A Hybrid Learning Model for Short-Term Traffic Flow Forecasting. Mathematics, 2024, 12(12): 1895.
9. Zhu C, Yang H, ** X, et al. Multilayer Online Sequential Reduced Kernel Extreme Learning Machine-Based Modeling for Time-Varying Distributed Parameter Systems. IEEE Transactions on Cybernetics, 2023.
10. Zhang W., Zhang Z., Wang L., Chao H., et al. Extreme learning machine with expectation kernels. Pattern Recognition, 2019, 96: 106960.
11. Kärkkäinen T. Extreme minimal learning machine: Ridge regression with distance-based basis. Neurocomputing, 2019, 342: 33–48.
12. Li L., Wang C., Li W., et al. Hyperspectral image classification by Adaboost weighted composite kernel extreme learning machine. Neurocomputing, 2018, 275: 1725–1733.
13. Guo X, Zhu C, Hao J, et al. Multi-step wind speed prediction based on an improved multi-objective seagull optimization algorithm and a multi-kernel extreme learning machine. Applied Intelligence, 2023, 53(13): 16445–16472.
14. Kaur R, Roul R K, Batra S. Multilayer extreme learning machine: a systematic review. Multimedia Tools and Applications, 2023, 82(26): 40269–40307.
15. Ma S, Cheng G, Li Y, et al. Research on multi-granularity imbalanced knowledge condition monitoring for mechanical equipment based on hierarchical ELM in multi-entropy space. Expert Systems with Applications, 2024, 238: 121817.
16. Wong C. M., Vong C. M., Wong P. K., et al. Kernel-based multilayer extreme learning machines for representation learning. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(3): 757–762.
17. Wang Y., Xie Z., Xu K., et al. An efficient and effective convolutional auto-encoder extreme learning machine network for 3d feature learning. Neurocomputing, 2016, 174: 988–998.
18. Zhang Meng, et al. An enhancing multiple kernel extreme learning machine based on deep learning. 2024 39th Youth Academic Annual Conference of Chinese Association of Automation (YAC). IEEE, 2024.
19. Zhao Zhenchong, et al. "Cost-sensitive sample shifting in feature space." Pattern Analysis and Applications, 2020, 23: 1689–1707.
20. Zhao Zhenchong, Wang Xiaodan. "Cost-sensitive SVDD models based on a sample selection approach." Applied Intelligence, 2018, 48(11): 4247–4266.
21. Araf Imane, Idri Ali, Chairi Ikram. Cost-sensitive learning for imbalanced medical data: a review. Artificial Intelligence Review, 2024, 57(4): 80.
22. Wang Z, Wang N, Zhang H, et al. Segmentalized mRMR features and cost-sensitive ELM with fixed inputs for fault diagnosis of high-speed railway turnouts. IEEE Transactions on Intelligent Transportation Systems, 2023.
23. Guan Shan, et al. A single-joint multi-task motor imagery EEG signal recognition method based on Empirical Wavelet and Multi-Kernel Extreme Learning Machine. Journal of Neuroscience Methods, 2024, 407: 110136. pmid:38642806
24. Chen Zhen, Xiao Xianyong, Li Changsong, et al. Real-time transient stability status prediction using cost-sensitive extreme learning machine. Neural Computing and Applications, 2016, 27: 321–331.
25. Zhu Hongyu, Wang Xizhao. A cost-sensitive semi-supervised learning model based on uncertainty. Neurocomputing, 2017, 251: 106–114.
26. Zhang Lei, Zhang David. Evolutionary Cost-Sensitive Extreme Learning Machine. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(12): 3045–3060.
27. Daneshfar Fatemeh, Kabudian Seyed Jahanshah. "Speech Emotion Recognition Using Multi-Layer Sparse Auto-Encoder Extreme Learning Machine and Spectral/Spectro-Temporal Features with New Weighting Method for Data Imbalance." 2021 11th International Conference on Computer Engineering and Knowledge (ICCKE). IEEE, 2021.
28. Daneshfar Fatemeh, Kabudian Seyed Jahanshah. "Speech Emotion Recognition Using Deep Sparse Auto-Encoder Extreme Learning Machine with a New Weighting Scheme and Spectro-Temporal Features Along with Classical Feature Selection and A New Quantum-Inspired Dimension Reduction Method." arXiv preprint arXiv:2111.07094, 2021.
29. Duan J, Gu Y, Yu H, et al. ECC++: An algorithm family based on ensemble of classifier chains for classifying imbalanced multi-label data. Expert Systems with Applications, 2024, 236: 121366.
30. Lu Huijuan, Zheng Enhui, Lu Yi, et al. ELM-based gene expression classification with misclassification cost. Neural Computing and Applications, 2014, 25: 525–531.
31. Lai J, Wang X, **ang Q, et al. Multilayer discriminative extreme learning machine for classification. International Journal of Machine Learning and Cybernetics, 2023, 14(6): 2111–2125.
32. Ibáñez A., Bielza C., Larrañaga P. Cost-sensitive selective naive Bayes classifiers for predicting the increase of the h-index for scientific journals. Neurocomputing, 2014, 135: 42–52.
33. Jiang L., Li C., Wang S. Cost-sensitive Bayesian network classifiers. Pattern Recognition Letters, 2014, 45: 211–216.
34. Zhang Y., Zhou Z.-H. Cost-sensitive face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(10): 1758–1769.
35. Diosdado J., Gilabert P., Seguí S., et al. LungHist700: A dataset of histological images for deep learning in pulmonary pathology. Scientific Data, 2024, 11: 1088. pmid:39368979
Citation: Yixuan L (2025) Cost-sensitive multi-kernel ELM based on reduced expectation kernel auto-encoder. PLoS ONE 20(2): e0314851. https://doi.org/10.1371/journal.pone.0314851
About the Authors:
Liang Yixuan
Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft
E-mail: [email protected]
Affiliations: School of Science, Xi'an University of Technology, Xi'an, Shaanxi, P. R. China; The University of Melbourne, Parkville, Victoria, Australia
ORCID: https://orcid.org/0009-0002-5532-3061
© 2025 Liang Yixuan. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
ELM (extreme learning machine) has drawn great attention due to its high training speed and outstanding generalization performance. To address the long training time of the kernel ELM auto-encoder and the difficulty of setting kernel function weights in existing multi-kernel models, a multi-kernel cost-sensitive ELM method based on an expectation kernel auto-encoder is proposed. Firstly, from the view of similarity, a reduced kernel auto-encoder is defined by randomly selecting reference points from the input data; then, a reduced expectation kernel auto-encoder is designed according to the expectation kernel ELM, realizing the combination of random mapping and similarity mapping. On this basis, two multi-kernel ELM models are designed, and the output of the classifier is converted into posterior probabilities. Finally, the cost-sensitive decision is realized based on the minimum risk criterion. Experimental results on public and realistic datasets verify the effectiveness of the method.