1. Introduction
Extreme learning machine (ELM) [1] is a training algorithm for single-hidden-layer feedforward neural networks (SLFNs) that aims to minimize the training error while ensuring the minimum norm of output weights [2]. Its distinctive feature lies in the random initialization of input layer weights and hidden layer biases, which remain fixed throughout the training process.
Although ELM’s randomly generated network parameters guarantee the universal approximation capability of SLFNs [3], research has shown that ELM requires a larger number of hidden nodes than traditional SLFNs to achieve similar performance [2,4]. Studies addressing this issue primarily follow two directions. One is determining the optimal number of hidden nodes to construct a compact network structure [1]. The other focuses on setting effective hidden node parameters; the most common approaches employ optimization algorithms to search for effective input layer weights and hidden layer biases [5–8]. However, these methods require multiple ELM training iterations, resulting in high computational overhead, which contradicts ELM’s design principle.
Kernel ELM (KELM) [1] obviates the need to set the number of hidden nodes and their parameters, exhibiting improved stability and generalization ability compared to ELM. Reduced KELM (RKELM) [9] further enhances training efficiency by randomly selecting a subset of training samples as support vectors. The effectiveness of kernel methods stems from the ability of kernel functions to compute complex similarities between different samples [10]. Kärkkäinen [11] adopted the Euclidean distance to measure the relationship between arbitrary training samples and selected reference points in ELM, demonstrating that similarity-function-based hidden layer outputs preserve the universal approximation capability of ELM. Therefore, from the perspective of similarity functions, it is possible to reinterpret the mechanism of kernel mapping and guide the selection of kernel functions.
On the other hand, different kernel functions possess varying data mapping capabilities. To fully exploit the advantages of different kernel functions, ELM based on multiple kernel functions has been proposed. The most common multi-kernel model involves linearly combining multiple kernel functions [12,13]. However, it is difficult to predict the contribution of different kernel functions to classification performance, and determining appropriate weights for each kernel function to maximize the model’s generalization performance remains a significant challenge in multi-kernel learning. While some researchers have proposed using optimization methods to determine the optimal weights [14], the training process incurs high computational costs. Consequently, designing multi-kernel fusion strategies remains an open problem.
Multilayer ELM (ML-ELM) [14] achieves deep representation of input features by stacking multiple ELM autoencoders (ELM-AE). Similar approaches include hierarchical ELM (H-ELM) [15] and KELM-based autoencoders (KELM-AE) [16]. These methods construct deep network models by vertically stacking multiple autoencoders, enabling layer-by-layer representation of the input features. In contrast, Wang [17] arranged multiple autoencoders horizontally, fusing the encoding results into a single vector. In addition, [18] replaces traditional predefined kernel functions with the multi-layer complex mappings of a DNN. Overall, while deep learning based on ELM-AE has yielded fruitful results, horizontal expansion of networks based on ELM-AE remains an under-explored area, and research on autoencoders based on multiple kernels or similarity functions is still lacking.
Cost-sensitive learning has drawn great attention in both theoretical research and engineering practice. Theoretical research can be divided into two categories, direct cost-sensitive learning and indirect cost-sensitive learning [19]. The former uses cost information to design new variants of basic classifiers, while the latter treats traditional classification algorithms as black boxes and performs preprocessing on the training data or post-processing on the outputs using the cost information [20]. In engineering practice, cost-sensitive learning has been widely used in imbalanced classification [1], biomedicine [21], fault diagnosis [22], etc. Traditional ELM-based cost-sensitive learning has mainly depended on the original ELM model [1]. Therefore, exploration based on other new variants of ELM is still needed.
Based on the aforementioned analysis, this paper proposes two multi-kernel cost-sensitive ELM models based on expected kernel functions. Reinterpreting the significance of kernel functions from the perspective of similarity, we present a simplified kernel autoencoder model by randomly selecting a subset of samples from the input data as reference points. Subsequently, inspired by the theory of expected kernel ELM, we add a random mapping layer after the input layer, designing a simplified expected kernel autoencoder that effectively combines random mapping and similarity mapping. Finally, we define four similarity kernel functions and utilize the simplified expected kernel autoencoder to design two multi-kernel ELM models, converting the classifier output into posterior probabilities and implementing cost-sensitive decision-making based on the minimum risk criterion.
2. Related work
2.1 ELM and KELM
Let a training set consisting of N training samples be denoted as $\{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ is a d-dimensional input vector and $\mathbf{t}_i \in \mathbb{R}^{m}$ is the corresponding class vector, with m the number of classes. For a single-hidden-layer neural network with L hidden nodes and m output nodes, the output function of ELM can be represented as:
$$f_i(\mathbf{x}) = \sum_{j=1}^{L} \beta_{ij}\, h_j(\mathbf{w}_j, b_j, \mathbf{x}), \quad i = 1, \dots, m \qquad (1)$$
where $\beta_{ij}$ represents the output weight connecting the j-th hidden node to the i-th output node, $h_j(\mathbf{w}_j, b_j, \mathbf{x})$ is the output of the j-th hidden node for the input sample $\mathbf{x}$, $\mathbf{w}_j \in \mathbb{R}^{d}$ is the randomly generated weight vector connecting the input to the j-th hidden node, and $b_j$ is the bias of the j-th hidden node.
Let $\mathbf{H} = \left[h_j(\mathbf{w}_j, b_j, \mathbf{x}_i)\right]_{N \times L}$ be the hidden layer output matrix, $\boldsymbol{\beta} = [\beta_{ij}]_{L \times m}$ the output weight matrix, and $\mathbf{T} = [\mathbf{t}_1, \dots, \mathbf{t}_N]^{T}$ the target matrix. Eq (1) can then be rewritten in matrix form as:
$$\mathbf{H}\boldsymbol{\beta} = \mathbf{T} \qquad (2)$$
The solution of Eq (2) can be expressed in two forms:
$$\boldsymbol{\beta} = \mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T} \quad \text{or} \quad \boldsymbol{\beta} = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T} \qquad (3)$$
where C is the regularization parameter. When the form of $h_j(\cdot)$ is unknown, a kernel matrix can be defined for the ELM using the Mercer criterion:
$$\boldsymbol{\Omega} = \mathbf{H}\mathbf{H}^{T}, \quad \Omega_{ij} = h(\mathbf{x}_i)\cdot h(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j) \qquad (4)$$
Replacing $\mathbf{H}\mathbf{H}^{T}$ with the kernel matrix $\boldsymbol{\Omega}$ of Eq (4), the output function of KELM can be expressed as:
$$f(\mathbf{x}) = \left[K(\mathbf{x}, \mathbf{x}_1), \dots, K(\mathbf{x}, \mathbf{x}_N)\right]\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}\right)^{-1}\mathbf{T} \qquad (5)$$
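As an illustration of Eqs (2)-(5), the following NumPy sketch computes the ELM and KELM solutions; the sigmoid activation, the RBF kernel, and all function and parameter names are illustrative assumptions rather than the exact implementation used in this paper.

```python
import numpy as np

def elm_train(X, T, L=200, C=1.0, seed=0):
    """Basic ELM (Eqs 2-3): random hidden layer followed by a ridge solution."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.uniform(-1, 1, size=(d, L))        # random input weights
    b = rng.uniform(-1, 1, size=L)             # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # N x L hidden layer output
    # First form of Eq (3): beta = H^T (I/C + H H^T)^{-1} T
    beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return W, b, beta

def kelm_train_predict(X, T, X_test, C=1.0, gamma=1.0):
    """KELM (Eqs 4-5) with an assumed RBF kernel: no hidden nodes to tune."""
    sq = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    Omega = np.exp(-gamma * sq(X, X))          # N x N kernel matrix (Eq 4)
    alpha = np.linalg.solve(np.eye(len(X)) / C + Omega, T)
    K_test = np.exp(-gamma * sq(X_test, X))    # kernel values between test and training samples
    return K_test @ alpha                      # Eq (5)
```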
ELM-AE [14] (as shown in Fig 1A) is essentially an ELM whose input and output are the same. Let $\boldsymbol{\Gamma}$ be the output weights of the autoencoder and $\mathbf{X}$ the input matrix. The solution of ELM-AE can then be expressed as:
$$\boldsymbol{\Gamma} = \mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{X} \qquad (6)$$
[Figure omitted. See PDF.]
(a) ELM Autoencoder. (b) KELM Autoencoder.
Based on KELM, Wong [16] replaced the random mapping function in the autoencoder with a kernel function and proposed KELM-AE (as shown in Fig 1B), whose output layer encoding matrix can be represented as:
$$\boldsymbol{\Gamma} = \left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}\right)^{-1}\mathbf{X} \qquad (7)$$
It has been demonstrated that composite kernels can efficiently combine multi-level information and achieve excellent performance [12]. In [12], composite kernels are used to map contextual and spectral information:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \mu K_1(\mathbf{x}_i, \mathbf{x}_j) + (1-\mu)\, K_2(\mathbf{x}_i, \mathbf{x}_j) \qquad (8)$$
where μ is a balance factor and $K_1(\mathbf{x}_i, \mathbf{x}_j)$, $K_2(\mathbf{x}_i, \mathbf{x}_j)$ are the radial basis function and polynomial kernel, respectively. The same methodology was adopted in [13]. However, this approach introduces a hyper-parameter, the balance factor, which is difficult to set. Therefore, [23] integrated multiple kernel matrices by summing them directly. Unlike other works that adopt traditional kernel functions, [18] redefined the kernel matrix as:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i)^{T}\Phi(\mathbf{x}_j) \qquad (9)$$
where $\Phi(\mathbf{x})$ is the hidden layer output of a neural network. Instead of integrating multiple kernel functions into one super kernel, [16] proposed ML-KELM by stacking multiple KELM-AEs, which does not need to tune the parameters of all layers as in ML-ELM. In ML-KELM, each encoder has a distinct kernel function.
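A minimal sketch of the two fusion strategies discussed above is given below; the function names and the fixed value of the balance factor are assumptions for illustration only.

```python
import numpy as np

def composite_kernel(K1, K2, mu=0.5):
    """Convex combination of two precomputed kernel matrices (Eq 8);
    mu is the balance factor that must be tuned by hand in [12,13]."""
    return mu * K1 + (1.0 - mu) * K2

def summed_kernel(kernel_matrices):
    """The alternative of [23]: sum the kernel matrices directly,
    avoiding the balance factor."""
    return sum(kernel_matrices)
```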
2.2 Cost-sensitive learning based on ELM
ELM has attracted extensive attention from scholars in the field of cost-sensitive learning, and many cost-sensitive variations of ELM have been proposed.
For class-imbalanced problems, the most intuitive approach is to directly weight the samples according to the class sizes [21]:
$$W_{ii} = \frac{N_{y_{\min}}}{N_{y_i}} \qquad (10)$$
where $W_{ii}$ is the squared-error weight of $\mathbf{x}_i$, $N_{y_{\min}}$ is the number of samples in the smallest class, and $N_{y_i}$ is the number of samples in class $y_i$. The cost-sensitive ELM model (CELM) [24] uses misclassification cost information to weight the classification errors of the samples:
$$W_{ii} = c_i \qquad (11)$$
where $c_i$ is the misclassification cost of the i-th sample. Zhu [25] instead uses the sum of the misclassification costs over all classes in the cost matrix as the weight:
$$W_{ii} = \sum_{j=1}^{m} c_{y_i j} \qquad (12)$$
When constructing a cost-sensitive ELM model, Zhang [26] takes both the class imbalance of the training data and the misclassification costs into account; in Eq (13) the error weight is given by a cost information vector B determined jointly by the class weight matrix and the misclassification cost matrix. Similarly, Daneshfar [27,28] proposed a hierarchical ELM (H-ELM) classification method consisting of two successive stages, an unsupervised hierarchical feature representation and a supervised feature classification, in which the weights of the different classes are defined as in Eq (14).
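To make the weighting schemes of Eqs (10)-(12) concrete, the sketch below builds class-ratio weights and plugs per-sample weights into a weighted ridge solution for the output weights; the closed form used here is the generic weighted ELM solution and is an assumption for illustration, not necessarily the exact formulation of each cited model.

```python
import numpy as np

def class_ratio_weights(y):
    """Eq (10): weight each sample by N_ymin / N_yi."""
    classes, counts = np.unique(y, return_counts=True)
    ratio = counts.min() / counts
    lookup = dict(zip(classes, ratio))
    return np.array([lookup[c] for c in y])

def weighted_elm_beta(H, T, sample_weights, C=1.0):
    """Weighted ELM output weights: each sample's squared error is scaled by
    W_ii, e.g. the class ratios of Eq (10) or the costs of Eqs (11)-(12)."""
    W = np.diag(sample_weights)
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / C + H.T @ W @ H, H.T @ W @ T)
```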
In addition, indirect cost-sensitive methods based on ELM mainly involve ensemble of multiple ELM classifiers [29], or adjusting the classifier output thresholds [22].
Although the above methods have achieved some results in specific application areas, the vast majority of cost-sensitive models mentioned above utilize the original ELM as the base classifier, with little research on multi-layered or parallel cost-sensitive ELM models. Additionally, the use of ensemble methods to estimate posterior probabilities can fully utilize cost matrix information, but requires training multiple classifiers. For example, Lu’s method [30] requires training thirty classifiers each time, resulting in high time overhead when dealing with large training data. Therefore, further research is needed in cost-sensitive learning based on ELM.
3. Double hidden layer autoencoder based on simplified expected kernel function
In this section, the reduced kernel extreme learning machine (RKELM) is first introduced, and an RKELM-based autoencoder (RKELM-AE) is proposed. Subsequently, a reduced expectation kernel autoencoder (REKELM-AE) is proposed by adding a random mapping layer after the input layer of RKELM-AE according to the expectation kernel theory.
3.1 RKELM-AE
In KELM, each training sample is measured for similarity with all other training samples [10]. For any training sample $\mathbf{x}_i$, KELM essentially creates new features for $\mathbf{x}_i$ by measuring its similarity with samples of different classes. Let $\mathbf{r}_i$ denote a reference point used for measuring similarity with the training samples, and let $R = \{\mathbf{r}_1, \mathbf{r}_2, \dots\}$ be the collection of all reference points. In KELM, R is the training set, and each reference point in R corresponds to one dimension of the kernel space. However, when two reference points in R are close, the similarity between the training samples and these two reference points is approximately the same, so the features corresponding to these two reference points in the mapped kernel space will be similar. Hence, when the number of training samples is large or the distribution is compact, it is unnecessary to select all samples as reference points. Kärkkäinen [11] has validated, from different perspectives, the effectiveness of randomly selecting a subset of samples from the training data as reference points for KELM.
Based on the analysis above, the proposed RKELM-AE builds upon KELM by randomly selecting a subset of $\tilde{N}$ samples ($\tilde{N} \ll N$) from the N training samples as reference points and mapping the input data by similarity to them. Similar to ELM-AE, the similarity matrix of the hidden layer of RKELM-AE can be represented as:
$$\tilde{\boldsymbol{\Omega}} = \left[K(\mathbf{x}_i, \mathbf{r}_j)\right]_{N \times \tilde{N}} \qquad (15)$$
where each row represents the similarity of one sample to the $\tilde{N}$ reference points. The output matrix can then be represented as:
$$\boldsymbol{\Gamma} = \left(\frac{\mathbf{I}}{C} + \tilde{\boldsymbol{\Omega}}^{T}\tilde{\boldsymbol{\Omega}}\right)^{-1}\tilde{\boldsymbol{\Omega}}^{T}\mathbf{X} \qquad (16)$$
From Eq (7), the computational complexity of KELM-AE is O(N³). When the number of samples is large, solving for the output matrix still takes significant time. In RKELM-AE, however, selecting only $\tilde{N} \ll N$ samples as reference points can keep the reconstruction error within a small range. Therefore, the computational complexity of RKELM-AE is much smaller than that of KELM-AE.
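The following sketch illustrates RKELM-AE as described by Eqs (15)-(16); the RBF similarity, the number of reference points, and all names are illustrative assumptions.

```python
import numpy as np

def rkelm_ae(X, n_ref=50, C=1.0, gamma=1.0, seed=0):
    """RKELM-AE sketch: similarity to a few random reference points replaces
    the full N x N kernel matrix of KELM-AE (Eqs 15-16)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_ref, len(X)), replace=False)
    R = X[idx]                                           # reference points
    dist2 = ((X[:, None, :] - R[None, :, :]) ** 2).sum(-1)
    Omega = np.exp(-gamma * dist2)                       # N x n_ref similarity matrix (Eq 15)
    # Ridge reconstruction of X from the similarity features (Eq 16)
    Gamma = np.linalg.solve(np.eye(Omega.shape[1]) / C + Omega.T @ Omega,
                            Omega.T @ X)
    return R, Gamma
```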
3.2 REKELM-AE
The main feature of ELM is the random generation of hidden layer weights. However, in ELM models based on kernel functions, the mapping of the hidden layer is deterministic or semi-deterministic (such as RKELM), and ELM can also learn effective information from this stable mapping. By combining these two models, one can simultaneously leverage the advantages of both mapping behaviors. To achieve this, Zhang [10] proposed the concept of the expectation kernel (EK) to study the relationship between random mapping and kernel functions. The linear EK is defined as:
$$K_{EK}(\mathbf{x}_i, \mathbf{x}_j) = E_{\mathbf{v}}\!\left[(\mathbf{v}^{T}\mathbf{x}_i)(\mathbf{v}^{T}\mathbf{x}_j)\right] = \int (\mathbf{v}^{T}\mathbf{x}_i)(\mathbf{v}^{T}\mathbf{x}_j)\, p(\mathbf{v})\, d\mathbf{v} \qquad (17)$$
where $\mathbf{v}^{T}\mathbf{x}_i = \mathbf{w}^{T}\mathbf{x}_i + b$ and $p(\mathbf{v})$ is the probability distribution function of the weights and biases. Sampling $\mathbf{v}$ according to $p(\mathbf{v})$, the linear EK in Eq (17) can be approximated as:
$$K_{EK}(\mathbf{x}_i, \mathbf{x}_j) \approx \frac{1}{L}\sum_{l=1}^{L}(\mathbf{v}_l^{T}\mathbf{x}_i)(\mathbf{v}_l^{T}\mathbf{x}_j) \qquad (18)$$
where $\mathbf{V} = [\mathbf{v}_1, \cdots, \mathbf{v}_L]$ is the weight matrix randomly sampled according to $p(\mathbf{v})$. However, the linear EK and the standard ELM mapping process are essentially equivalent; both can be viewed as randomly initializing the hidden layer weights and mapping the input randomly. Therefore, Zhang built upon the linear EK and defined a non-linear EK using a traditional kernel function K(⋅):
$$K_{NEK}(\mathbf{x}_i, \mathbf{x}_j) = K\!\big(h(\mathbf{x}_i),\, h(\mathbf{x}_j)\big) \qquad (19)$$
where $h(\cdot)$ denotes the random hidden layer mapping built from $\mathbf{V}$.
According to Eq (19), the non-linear EK can be divided into two independent processes: randomly mapping the original data, and using the results of the random mapping for kernel mapping.
Following the concept of EK, a random mapping layer consisting of L1 hidden nodes is added after the input layer of RKELM-AE, allowing a high-dimensional random mapping of the original data. The structure of the resulting reduced expectation kernel autoencoder, REKELM-AE, is shown in Fig 2.
[Figure omitted. See PDF.]
REKELM-AE can be seen as a combination of traditional ELM and RKELM: the first part utilizes the random hidden layers of ELM to map the input to ELM space, while the second part uses the simplified kernel ELM to compute the similarity of the random mapping results.
For the training samples, the mapping result after the first hidden layer of REKELM-AE is:
$$\mathbf{H} = \left[h(\mathbf{w}_l, b_l, \mathbf{x}_i)\right]_{N \times L_1} \qquad (20)$$
Then, $\tilde{N}$ reference points $\{\mathbf{r}_1, \dots, \mathbf{r}_{\tilde{N}}\}$ are randomly selected from the N rows of $\mathbf{H}$, and the simplified kernel matrix is calculated:
$$\tilde{\boldsymbol{\Omega}} = \left[K(\mathbf{h}_i, \mathbf{r}_j)\right]_{N \times \tilde{N}} \qquad (21)$$
where $\mathbf{h}_i$ is the i-th row of $\mathbf{H}$.
Finally, the output matrix Γ can be calculated according to Eq (16). The training procedure of REKELM-AE is presented as follows.
Algorithm 1. Pseudocode for REKELM-AE.
Input: training samples $\mathbf{X}$, distribution $p(\mathbf{v})$, number of random mapping nodes $L_1$, number of reference points $\tilde{N}$, regularization parameter C
1. Randomly generate W, b according to $p(\mathbf{v})$
2. Generate the random mapping matrix H using W, b
3. Randomly select $\tilde{N}$ reference points from the rows of H
4. Calculate the simplified kernel matrix $\tilde{\boldsymbol{\Omega}}$ according to Eq (21)
5. Calculate the output matrix Γ according to Eq (16)
Output: Γ
It should be noted that although REKELM-AE introduces a random mapping process compared to RKELM-AE, the complexity of this process is only related to the number of selected reference points after simplification and is independent of the dimensionality of the samples after dimensionality augmentation. Therefore, REKELM-AE retains the efficiency of RKELM-AE.
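As a companion to Algorithm 1, the sketch below implements the five steps with NumPy; the sigmoid activation, the RBF similarity, and the default parameter values are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def rekelm_ae(X, L1=100, n_ref=20, C=1.0, gamma=1.0, seed=0):
    """REKELM-AE (Algorithm 1): random mapping layer + reduced kernel mapping."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, L1))                     # step 1: W, b ~ p(v)
    b = rng.standard_normal(L1)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))               # step 2: random mapping H (Eq 20)
    idx = rng.choice(len(H), size=min(n_ref, len(H)), replace=False)
    R = H[idx]                                           # step 3: reference rows of H
    dist2 = ((H[:, None, :] - R[None, :, :]) ** 2).sum(-1)
    Omega = np.exp(-gamma * dist2)                       # step 4: reduced kernel matrix (Eq 21)
    Gamma = np.linalg.solve(np.eye(Omega.shape[1]) / C + Omega.T @ Omega,
                            Omega.T @ X)                 # step 5: output weights Gamma (Eq 16)
    return W, b, R, Gamma
```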
4. Cost-sensitive ELM with multi-kernel autoencoder
This section first proposes two classification models based on REKELM-AE, namely multi-kernel parallel ELM (MKP-ELM) and multi-kernel residual ELM (MKR-ELM). Then, the cost-sensitivity models are realized according to the minimum risk decision theory.
4.1 MKP-ELM and MKR-ELM
Most existing AE-based neural network models achieve deep representation of inputs by vertically stacking multiple AEs. However, this layer-by-layer learning strategy accumulates reconstruction errors during feature reconstruction, so the learned features cannot represent the original input information well [16]. On the other hand, different kernel functions have different expressive capabilities for different data. By combining multiple kernel functions, the advantages of each can be leveraged, yielding better flexibility and adaptability. Based on the REKELM-AE proposed in Section 3, this paper presents two multi-kernel autoencoder-based ELM models, multi-kernel parallel ELM (MKP-ELM) and multi-kernel residual ELM (MKR-ELM), as shown in Fig 3.
[Figure omitted. See PDF.]
Both models consist of two separate processes: feature extraction based on multi-kernel parallel autoencoders, and ELM classification based on the extracted features. For MKP-ELM, given k different kernel functions, each corresponding to a REKELM-AE, the output weights $\boldsymbol{\Gamma}^{(i)}, i = 1, 2, \cdots, k$ of each autoencoder are first calculated using Algorithm 1, and the abstract features extracted by each encoder are then obtained as:
$$\mathbf{X}^{(i)} = \mathbf{X}\left(\boldsymbol{\Gamma}^{(i)}\right)^{T} \qquad (22)$$
After obtaining the k abstract feature sets extracted by the REKELM-AEs, all the results are combined into a feature matrix $\mathbf{X}_{final} = \left[\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \dots, \mathbf{X}^{(k)}\right]$, and $\mathbf{X}_{final}$ is finally used as the input to train an ELM classifier $\mathbf{H}_{final}\boldsymbol{\beta} = \mathbf{T}$.
Residual connections have been proved to be an effective and efficient principle in deep neural networks [2,3]. In order to further exploit the effectiveness of the proposed REKELM-AE, another deep model is designed by stacking multiple REKELM-AEs, with a residual connection between consecutive autoencoders, as illustrated in Fig 3. The output $\mathbf{Y}^{(i)}$ of one REKELM-AE is:
$$\mathbf{Y}^{(i)} = \mathbf{X}_{input} + \mathbf{X}_{output} \qquad (23)$$
where $\mathbf{X}_{input}$ and $\mathbf{X}_{output}$ are the input and output of the encoder. Each REKELM-AE has a different kernel function.
Stacking multiple REKELM-AEs yields multiple kernel mappings, while the residual connections ease the information loss during transmission. In this paper, 4 different kernel functions (or similarity functions) are selected, namely the radial basis function (K1(⋅)), Euclidean distance (K2(⋅)), Manhattan distance (K3(⋅)), and cosine similarity (K4(⋅)):
$$K_1(\mathbf{x}, \mathbf{r}) = \exp\!\left(-\gamma\,\|\mathbf{x} - \mathbf{r}\|^{2}\right),\quad K_2(\mathbf{x}, \mathbf{r}) = \|\mathbf{x} - \mathbf{r}\|_2,\quad K_3(\mathbf{x}, \mathbf{r}) = \sum_{l=1}^{d}\left|x_l - r_l\right|,\quad K_4(\mathbf{x}, \mathbf{r}) = \frac{\mathbf{x}^{T}\mathbf{r}}{\|\mathbf{x}\|\,\|\mathbf{r}\|} \qquad (24)$$
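A small sketch of the four similarity functions of Eq (24), applied row-wise between the data and the reference points, is given below; the parameterization of the RBF width (gamma) and the MKP-style concatenation shown in the closing comment are assumptions for illustration.

```python
import numpy as np

def similarity_matrix(X, R, kind, gamma=1.0):
    """Similarity of each row of X to each reference point in R (Eq 24)."""
    diff = X[:, None, :] - R[None, :, :]
    if kind == "rbf":                                    # K1: radial basis function
        return np.exp(-gamma * (diff ** 2).sum(-1))
    if kind == "euclidean":                              # K2: Euclidean distance
        return np.sqrt((diff ** 2).sum(-1))
    if kind == "manhattan":                              # K3: Manhattan distance
        return np.abs(diff).sum(-1)
    if kind == "cosine":                                 # K4: cosine similarity
        norms = np.linalg.norm(X, axis=1)[:, None] * np.linalg.norm(R, axis=1)[None, :]
        return (X @ R.T) / np.maximum(norms, 1e-12)
    raise ValueError(f"unknown kernel: {kind}")

# MKP-ELM style fusion: concatenate the features obtained with each kernel, e.g.
# X_final = np.hstack([encode_with(k) for k in ("rbf", "euclidean", "manhattan", "cosine")])
```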
4.2 Cost-sensitive classification
For a cost-sensitive classification problem with m classes, the cost matrix is:
$$\begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1m} \\ c_{21} & c_{22} & \cdots & c_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mm} \end{bmatrix} \qquad (25)$$
where $c_{ij}$ is the cost of classifying the i-th class as the j-th class. Then, for any sample x, the class assigned under the minimum risk decision can be expressed as:
$$y(\mathbf{x}) = \arg\min_{j \in \{1,\dots,m\}} \sum_{i=1}^{m} P(i \mid \mathbf{x})\, c_{ij} \qquad (26)$$
The key to implementing the minimum risk decision is estimating the posterior probability. Therefore, the hard output of ELM needs to be converted into a probability output. This paper adopts the method from [31], using the Sigmoid function to convert the output of MKP-ELM into posterior probabilities:
$$P(i \mid \mathbf{x}) = \frac{1}{1 + \exp\left(-f_i(\mathbf{x})\right)} \qquad (27)$$
where $f_i(\mathbf{x})$ is the output of the classifier for class i. When the number of categories is greater than 2, Eq (27) is not guaranteed to sum to 1 over all classes. Therefore, the estimates of Eq (27) are further normalized:
$$\hat{P}(i \mid \mathbf{x}) = \frac{P(i \mid \mathbf{x})}{\sum_{j=1}^{m} P(j \mid \mathbf{x})} \qquad (28)$$
Once the estimates for an unknown sample x are obtained, the minimum risk decision criterion in Eq (26) can be used for classification.
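A compact sketch of the decision rule of Eqs (26)-(28) is shown below; the function name and the assumed array shapes are illustrative only.

```python
import numpy as np

def min_risk_decision(scores, cost_matrix):
    """Convert raw classifier outputs (n_samples x m) into posteriors via the
    sigmoid (Eq 27), normalize them (Eq 28), and pick the class with minimum
    expected cost (Eq 26); cost_matrix[i, j] is the cost of deciding class j
    when the true class is i."""
    p = 1.0 / (1.0 + np.exp(-scores))          # Eq (27)
    p = p / p.sum(axis=1, keepdims=True)       # Eq (28)
    risk = p @ cost_matrix                     # expected cost of each decision
    return risk.argmin(axis=1)
```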
5. Experiment and analysis
This section validates the effectiveness of MKP-CSELM. The experiments consist of three parts: first, verifying the impact of the number of nodes in the first hidden layer (L1) and the number of randomly sampled reference points ($\tilde{N}$) in the second hidden layer on the performance of the REKELM-AE autoencoder; then, validating the effectiveness of MKP-CSELM in handling cost-sensitive problems on 21 UCI datasets; and finally, further validating the performance of the proposed method on a cost-sensitive pulmonary pathology recognition dataset.
5.1 Implementation design
There are three evaluation metrics for the classifiers: accuracy rate acc, total misclassification cost $T_c$, and relative performance $r_{\alpha}$, defined as follows:
$$acc = 1 - \frac{\sum_{i=1}^{m} err_i}{N_{test}}, \qquad T_c = \sum_{i=1}^{m}\sum_{j \neq i} err_{ij}\, c_{ij} \qquad (29)$$
where $err_i$ is the number of misclassified samples in class i, $err_{ij}$ is the number of samples of class i that are misclassified as class j, $N_{test}$ is the number of test samples, and the relative performance $r_{\alpha}$ is computed from $T_{c\alpha}$, the misclassification cost of algorithm α.
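For concreteness, the two cost-related metrics can be computed as in the short sketch below; integer-encoded class labels are assumed.

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def total_cost(y_true, y_pred, cost_matrix):
    """Total misclassification cost T_c: sum of c_ij over every sample of
    class i that is predicted as class j (i != j)."""
    return sum(cost_matrix[i, j] for i, j in zip(y_true, y_pred) if i != j)
```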
In addition to MKP-CSELM, seven methods were selected for comparison: two cost-sensitive ELM methods, cost-sensitive ELM (CELM) [24] and cost-sensitive voting ELM (CSVELM) [30]; H-ELM [27]; two cost-sensitive Bayesian methods, cost-sensitive naïve Bayes (CSNB) [32] and cost-sensitive Bayesian network (CSBN) [33]; multi-class cost-sensitive k-nearest neighbor (mcKNN) [34]; and one cost-sensitive neural network method, cost-sensitive neural network (CSNN) [34]. For CELM, sample weights were calculated from the misclassification costs using the method in Eq (11); the number of base classifiers for CSVELM was set to 30; the number of neighbors for mcKNN was set to 3 as in reference [34]. For CELM and CSVELM, the hidden layer activation function is set to the Sigmoid function. The number of hidden nodes in the classifier, L, and the regularization parameter, C, are selected from the sets {200, 400, ⋯, 2000} and {2⁻⁶, 2⁻⁴, ⋯, 2¹²}, respectively, to find the combination that minimizes the classification cost. Apart from the random encoding layer weights of MKP-CSELM, which are drawn from a Gaussian distribution with mean 0 and variance 1 [10], random weights at other locations are drawn from a uniform distribution on (−1, 1). For MKP-CSELM, the number of random mapping layer nodes and the number of reference points in the similarity mapping layer are the same for each autoencoder.
The experimental data includes 5 binary datasets and 16 multiclass datasets from UCI repository, as shown in Table 1. All features were normalized to [0,1]. Due to the large sample sizes of the Page Blocks, Sat, Segmentation, and Shuttle datasets, memory constraints were encountered during experimentation. Therefore, 1000 random samples were selected from each dataset for training and testing. For each dataset, 60% of the samples were randomly selected as the training set, and the remaining 40% were used as the test set. The experiments were conducted using MATLAB R2022a on a desktop computer with an Intel Core i9 CPU and 16GB RAM.
[Figure omitted. See PDF.]
5.2 Ablation study
To examine the effectiveness of each component in the proposed framework, a series of ablation experiments is performed on the UCI datasets.
a) Evaluation on different hidden layer nodes and reference points.
The representational capability of REKELM-AE is mainly influenced by the number of random hidden layer nodes L1 and the number of reference points $\tilde{N}$ of the similarity hidden layer. The first hidden layer can be seen as an extension of the original input dimensions, and the similarity mapping layer can be viewed as a feature compression of the former's output. Two parameters are therefore defined to measure the degree of extension and compression:
$$R_1 = \frac{L_1}{d}, \qquad R_2 = \frac{\tilde{N}}{N}$$
where d represents the original input feature dimension and N is the number of training samples.
This section only compares the impact of different parameters on the representational capability of REKELM-AE, hence the misclassification cost is neglected. Fixing C = 500 and L = 2000 for the ELM classifier, with R1 varying from 5 to 100 and R2 ranging from 0.1 to 1, Fig 4 shows the average error rates of MKP-ELM over 10 independent runs.
[Figure omitted. See PDF.]
From Fig 4, it can be observed that the classification error rate of REKELM-AE does not vary significantly with different values of R1, indicating that the classifier is not sensitive to the number of nodes in the first hidden layer. On the other hand, in 7 out of the 9 datasets (Balance, Diabetes, Ecoli, Glass, Heart statlog, Iris, and Letter), there is a clear downward trend in the classification error rate as R2 decreases. Only in one dataset (Breast-w) does the classification error rate increase as R2 decreases. This suggests that an effective similarity mapping of the input data can be achieved using only a small number of reference points; utilizing a larger $\tilde{N}$ (i.e., a larger R2) and selecting more training data as reference points may yield redundant features after similarity mapping, resulting in a decrease in classification performance. Therefore, it is unnecessary to compute the similarity matrix using a large portion or the entire training set.
To further discuss the impact of different values of R2 on classification performance, Fig 5 provides results for different values of R1 while R2 ranges from 0.01 to 0.1. It can be seen that the error rates of 4 out of the 9 datasets (Balance, Breast-w, Diabetes, Ecoli) further decrease as R2 decreases, indicating that for these four datasets, only a very small number of reference points are needed to achieve similarity mapping. On the other hand, four datasets (Glass, Ionosphere, Iris, Letter) achieve optimal values around R2 = 0.02, and as R2 is further reduced, the classification error rate increases. This is mainly because when R2 is very low, only one or a few reference points may be selected, which is insufficient to achieve similarity mapping of the input data.
[Figure omitted. See PDF.]
b) Evaluation on different models.
In this part, the effectiveness of the two proposed models is investigated. For MKP-ELM and MKR-ELM, the kernel functions, the number of reference points $\tilde{N}$, and the number of first hidden layer nodes L1 are all the same. The cost matrix is randomly generated with values in the range [1, 20]. The average results on all datasets are presented in Table 2.
[Figure omitted. See PDF.]
The results show that the performance of MKR-ELM is slightly better than that of MKP-ELM when there is only one kernel function, indicating that the residual connection is effective for improving performance. When the number of kernel functions is increased, the performance gain of MKP-ELM is smaller than that of MKR-ELM, indicating, firstly, that increasing the number of kernel functions can improve performance and, secondly, that vertically stacking multiple autoencoders with residual connections is more effective than simple horizontal expansion.
c) Evaluation on different kernel combinations.
In this part, the effectiveness of different kernel function combinations is investigated. In order to thoroughly analyze the influence of the kernel functions, all possible combinations under different numbers of kernel functions are exhausted. For example, when the number of kernel functions is 2, six sets of experiments are conducted. The average results are shown in Table 3.
[Figure omitted. See PDF.]
The results in Tables 2 and 3 both show that with the increase of the number of kernel functions, the performance of both models is improved. When the number of kernel functions is increased from 1 to 2, the performance increases significantly. Models with two and three kernel functions have similar performance, but when the number of kernel functions is increased from 3 to 4, the performance increases significantly. These indicate that different kernel functions have different feature mapping ability.
5.3 Results and analysis on UCI datasets
In this section, the effectiveness of MKP-CSELM is validated on 21 UCI datasets. To reduce computational complexity, the regularization factor and the number of hidden nodes are fixed at C = 100 and L = 1000, respectively. Both the first hidden layer of the autoencoder and the hidden layer of the classifier use the Sigmoid activation function. As analyzed in Section 5.2, R1 has a minor impact on the classifier, while the performance of MKP-ELM fluctuates significantly around R2 = 0.1. For MKP-CSELM, R1 = 10 is kept fixed, and the value of R2 is searched within the range [0.01, 0.2] with a step size of 0.01 to minimize the classification cost. For the cost matrix, three types (Type-a, Type-b, Type-c) are generated with a maximum value of 10 for each type. The training and testing sets are the same for all methods.
Every method runs 10 independent repetitions. Due to space limitations, only the results on Type-a are presented. Table 4 shows the average accuracy and misclassification cost. The “↓”, “↑” and “≈” marks indicate that the result of MKP-CSELM is significantly better than, worse than, or similar to the compared result at the 0.05 significance level, respectively. In Table 4, in terms of accuracy, MKP-CSELM obtained the largest number of best values (6), followed by mcKNN (4). On three datasets (Segmentation, Vehicle and Vowel), the accuracy of MKP-CSELM is significantly higher than that of the other methods: 3.0%, 3.3% and 7.3% higher than the second-best method, and more than 20% higher than the worst method. For the misclassification cost, MKP-CSELM achieved the best values on 10 datasets in total, nearly half of all datasets. Similarly, the misclassification cost of MKP-CSELM on the Segmentation, Vehicle and Vowel datasets is significantly lower than that of the other methods, indicating that the overall recognition performance of MKP-CSELM on these three datasets is greatly improved.
[Figure omitted. See PDF.]
Table 5 summarizes the results of rank-sum tests under the three cost matrices. Each column indicates the numbers of datasets on which MKP-CSELM is significantly worse than, similar to, and better than the compared method at the 0.05 significance level. For example, the first column under Type-a, 3, 10, 8, indicates that MKP-CSELM is significantly worse than, similar to, and better than CELM on 3, 10, and 8 datasets, respectively. The results show that under all three cost matrices, MKP-CSELM significantly outperforms the other 7 methods in both indicators on the majority of datasets. Table 6 provides the average ranks from the Friedman test. Overall, it can be observed that the rank of MKP-CSELM is the best for both indicators under the three cost matrices, followed by CELM. The performance of CSNB is the worst, followed by CSVELM, indicating that multi-classifier voting cannot estimate posterior probabilities well. The relative performance rα is calculated for each method on each dataset, and the cumulative rα of each method over all datasets under the different cost matrices is shown in Fig 6, where each color layer corresponds to a dataset. The figure indicates that MKP-CSELM performs best, followed by CELM, while CSNB shows the poorest average performance. These experimental results demonstrate the effectiveness of MKP-CSELM.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
5.4 Case study
In this section, the proposed method is applied to a realistic cost-sensitive problem. LungHist700 [35] is a dataset of histological images in pulmonary pathology. It consists of 691 images from 45 patients, each with a resolution of 1200 × 1600 pixels and stored in .jpg format. The images are captured at either 20x or 40x magnification and are categorized into seven classes (see Fig 6). An accompanying .csv file links each image to the associated patient ID. All patients have been anonymized, and the file includes an identifier to match images from the same patient.
All images are resized to 300 × 400 pixels, with 70% used for training and 30% for testing. A ResNet50 is trained and its last feature map is used as the input of MKR-ELM. For simplicity, the samples are divided into two classes, normal and abnormal. The misclassification costs for the normal and abnormal classes are 1 and 10, respectively. The results are presented in Table 7, where 'Normal' indicates MKR-ELM without cost-sensitive classification. The results show that cost-sensitive classification leads to a 4.76% reduction in classification accuracy but a 42.18% reduction in misclassification cost.
[Figure omitted. See PDF.]
6. Conclusion
This paper proposes two multi-kernel cost-sensitive ELM models based on the expectation kernel. Firstly, the kernel function is reinterpreted from the perspective of similarity, and, based on kernel ELM, a simplified kernel autoencoder is presented by randomly selecting a subset of samples from the input data as reference points. According to the expectation kernel ELM theory, a random mapping layer is then added after the input layer to design a simplified expectation kernel autoencoder, which effectively combines random mapping and similarity mapping. Four types of similarity kernel functions are defined, and two multi-kernel ELM models are designed using the simplified expectation kernel autoencoder. The classifier output is transformed into posterior probabilities, and cost-sensitive decisions are made based on the minimum risk criterion. Comparative analysis with seven cost-sensitive methods on 21 UCI datasets shows that the proposed method achieves better generalization performance than the comparative methods while selecting only a few reference points. A case study on realistic pulmonary pathology classification further demonstrates the effectiveness of the proposed approach.
References
1. Rajput Brajendra Singh, et al. "ELM-Based Imbalanced Data Classification-A Review." Informatica 48.2 (2024).
2. Zhang Z, Cai Y, Gong W. Semi-supervised learning with graph convolutional extreme learning machines. Expert Systems with Applications, 2023, 213: 119164.
3. Wu C, Khishe M, Mohammadi M, et al. RETRACTED ARTICLE: Evolving deep convolutional neural network by hybrid sine–cosine and extreme learning machine for real-time COVID19 diagnosis from X-ray images. Soft Computing, 2023, 27(6): 3307–3326.
4. Gao Q, Ai Q, Wang W. Intuitionistic Fuzzy Extreme Learning Machine with the Truncated Pinball Loss. Neural Processing Letters, 2024, 56(2): 116.
5. Bacanin N., Stoean C., Zivkovic M., Jovanovic D., Antonijevic M., Mladenovic D. Multi-Swarm Algorithm for Extreme Learning Machine Optimization. Sensors, 2022, 22(11): 1–34. pmid:35684824
6. Bacanin N., Zivkovic M., Antonijevic M., Venkatachalam K., Lee J., Nam Y., Marjanovic M., Strumberger I., Abouhawwash M. Addressing feature selection and extreme learning machine tuning by diversity-oriented social network search: an application for phishing websites detection. Complex & Intelligent Systems, 2023: 1–36.
7. Laifi A., Benmohamed E., Ltifi H. Xavier-PSO-ELM-based EEG signal classification method for predicting epileptic seizures. Multimedia Tools and Applications, 2024, 83: 30675–30696.
8. Wang F, Liang Y, Lin Z, Zhou J, Zhou T. SSA-ELM: A Hybrid Learning Model for Short-Term Traffic Flow Forecasting. Mathematics, 2024, 12(12): 1895.
9. Zhu C, Yang H, ** X, et al. Multilayer Online Sequential Reduced Kernel Extreme Learning Machine-Based Modeling for Time-Varying Distributed Parameter Systems. IEEE Transactions on Cybernetics, 2023.
10. Zhang W., Zhang Z., Wang L., Chao H., et al. Extreme learning machine with expectation kernels. Pattern Recognition, 2019, 96: 106960.
11. Kärkkäinen T. Extreme minimal learning machine: Ridge regression with distance-based basis. Neurocomputing, 2019, 342: 33–48.
12. Li L., Wang C., Li W., et al. Hyperspectral image classification by Adaboost weighted composite kernel extreme learning machine. Neurocomputing, 2018, 275: 1725–1733.
13. Guo X, Zhu C, Hao J, et al. Multi-step wind speed prediction based on an improved multi-objective seagull optimization algorithm and a multi-kernel extreme learning machine. Applied Intelligence, 2023, 53(13): 16445–16472.
14. Kaur R, Roul R K, Batra S. Multilayer extreme learning machine: a systematic review. Multimedia Tools and Applications, 2023, 82(26): 40269–40307.
15. Ma S, Cheng G, Li Y, et al. Research on multi-granularity imbalanced knowledge condition monitoring for mechanical equipment based on hierarchical ELM in multi-entropy space. Expert Systems with Applications, 2024, 238: 121817.
16. Wong C. M., Vong C. M., Wong P. K., et al. Kernel-based multilayer extreme learning machines for representation learning. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(3): 757–762.
17. Wang Y., Xie Z., Xu K., et al. An efficient and effective convolutional auto-encoder extreme learning machine network for 3d feature learning. Neurocomputing, 2016, 174: 988–998.
18. Zhang Meng, et al. An enhancing multiple kernel extreme learning machine based on deep learning. 2024 39th Youth Academic Annual Conference of Chinese Association of Automation (YAC). IEEE, 2024.
19. Zhao Zhenchong, et al. "Cost-sensitive sample shifting in feature space." Pattern Analysis and Applications, 2020, 23: 1689–1707.
20. Zhao Zhenchong, Wang Xiaodan. "Cost-sensitive SVDD models based on a sample selection approach." Applied Intelligence, 2018, 48(11): 4247–4266.
21. Araf Imane, Idri Ali, Chairi Ikram. Cost-sensitive learning for imbalanced medical data: a review. Artificial Intelligence Review, 2024, 57(4): 80.
22. Wang Z, Wang N, Zhang H, et al. Segmentalized mRMR features and cost-sensitive ELM with fixed inputs for fault diagnosis of high-speed railway turnouts. IEEE Transactions on Intelligent Transportation Systems, 2023.
23. Guan Shan, et al. A single-joint multi-task motor imagery EEG signal recognition method based on Empirical Wavelet and Multi-Kernel Extreme Learning Machine. Journal of Neuroscience Methods, 2024, 407: 110136. pmid:38642806
24. Chen Zhen, Xiao Xianyong, Li Changsong, et al. Real-time transient stability status prediction using cost-sensitive extreme learning machine. Neural Computing and Applications, 2016, 27: 321–331.
25. Zhu Hongyu, Wang Xizhao. A cost-sensitive semi-supervised learning model based on uncertainty. Neurocomputing, 2017, 251: 106–114.
26. Zhang Lei, Zhang David. Evolutionary Cost-Sensitive Extreme Learning Machine. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(12): 3045–3060.
27. Daneshfar Fatemeh, Kabudian Seyed Jahanshah. "Speech Emotion Recognition Using Multi-Layer Sparse Auto-Encoder Extreme Learning Machine and Spectral/Spectro-Temporal Features with New Weighting Method for Data Imbalance." 2021 11th International Conference on Computer Engineering and Knowledge (ICCKE). IEEE, 2021.
28. Daneshfar Fatemeh, Kabudian Seyed Jahanshah. "Speech Emotion Recognition Using Deep Sparse Auto-Encoder Extreme Learning Machine with a New Weighting Scheme and Spectro-Temporal Features Along with Classical Feature Selection and A New Quantum-Inspired Dimension Reduction Method." arXiv preprint arXiv:2111.07094, 2021.
29. Duan J, Gu Y, Yu H, et al. ECC++: An algorithm family based on ensemble of classifier chains for classifying imbalanced multi-label data. Expert Systems with Applications, 2024, 236: 121366.
30. Lu Huijuan, Zheng Enhui, Lu Yi, et al. ELM-based gene expression classification with misclassification cost. Neural Computing and Applications, 2014, 25: 525–531.
31. Lai J, Wang X, **ang Q, et al. Multilayer discriminative extreme learning machine for classification. International Journal of Machine Learning and Cybernetics, 2023, 14(6): 2111–2125.
32. Ibáñez A., Bielza C., Larrañaga P. Cost-sensitive selective naive Bayes classifiers for predicting the increase of the h-index for scientific journals. Neurocomputing, 2014, 135: 42–52.
33. Jiang L., Li C., Wang S. Cost-sensitive Bayesian network classifiers. Pattern Recognition Letters, 2014, 45: 211–216.
34. Zhang Y., Zhou Z.-H. Cost-sensitive face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(10): 1758–1769.
35. Diosdado J., Gilabert P., Seguí S., et al. LungHist700: A dataset of histological images for deep learning in pulmonary pathology. Scientific Data, 2024, 11: 1088. pmid:39368979
Citation: Yixuan L (2025) Cost-sensitive multi-kernel ELM based on reduced expectation kernel auto-encoder. PLoS ONE 20(2): e0314851. https://doi.org/10.1371/journal.pone.0314851
About the Authors:
Liang Yixuan
Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft
E-mail: [email protected]
Affiliations: School of Science, Xi'an University of Technology, Xi'an, Shaanxi, P. R. China; The University of Melbourne, Parkville, Victoria, Australia
ORCID: https://orcid.org/0009-0002-5532-3061
© 2025 Liang Yixuan. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
ELM (extreme learning machine) has drawn great attention due to its high training speed and outstanding generalization performance. To address the long training time of the kernel ELM auto-encoder and the difficulty of setting kernel function weights in existing multi-kernel models, a multi-kernel cost-sensitive ELM method based on an expectation kernel auto-encoder is proposed. Firstly, from the view of similarity, a reduced kernel auto-encoder is defined by randomly selecting reference points from the input data; then, a reduced expectation kernel auto-encoder is designed according to the expectation kernel ELM, realizing the combination of random mapping and similarity mapping. On this basis, two multi-kernel ELM models are designed, and the output of the classifier is converted into posterior probabilities. Finally, the cost-sensitive decision is realized based on the minimum risk criterion. Experimental results on public and realistic datasets verify the effectiveness of the method.