This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
As basic constituents of organisms, proteins play a critical role in life activities such as metabolism, reproduction, growth, and development. Apoptosis proteins in particular are crucial in proteomics. Since the function of an apoptosis protein is closely related to its subcellular location, and different kinds of apoptosis proteins can function only in specific subcellular locations, predicting the subcellular location of a given apoptosis protein is important: it helps us not only to understand the interactions and properties of apoptosis proteins but also to elucidate the biological pathways involved [1–3]. With the application of high-throughput sequencing techniques and the explosion of sequence data volumes, developing an accurate and reliable computational method to predict apoptosis protein subcellular location has become a great challenge for bioinformaticians, which in turn has promoted the development of machine learning in this field [4–8].
A survey of the literature shows that improvements in machine-learning-based prediction of apoptosis protein subcellular location over the past decade fall roughly into two classes: sequence feature extraction and prediction models [5–10]. The widely used feature extraction methods currently include amino acid composition (AAC) [11, 12], pseudo amino acid composition (PseAAC) [13, 14], gene ontology (GO) [15, 16], position-specific scoring matrix (PSSM) [17, 18], and feature fusion [19, 20]. For example, Zhou et al. combined the covariant discriminant function, based on Mahalanobis distance and Chou's invariance theorem, with the traditional AAC feature to predict apoptosis protein subcellular location, achieving about 72.5% accuracy in the Jackknife test on data set ZD98 [21]; Wan et al. proposed the GOASVM algorithm, which uses GO term frequencies and distant homologs to represent a protein in the general form of PseAAC, and obtained high accuracy [22]; Chen et al. used the increment of diversity to fuse N-terminal, C-terminal, and hydrophobic features of apoptosis protein sequences, reaching accuracies of 90.8% and 82.7% on ZD98 and CH317, respectively [23]; Zhao et al. combined the bag-of-words model with the PseAAC method, using K-means to construct the dictionary of sequence features, and obtained good predictive performance [24]. There have also been many efforts to develop prediction models. For example, Wan et al. proposed an adaptive-decision support vector machine classifier based on annotation information from the GO database and predicted membrane proteins as well as their multifunctional types [25]; Ali et al. extracted PseAAC features of protein sequences and combined location voting, k-nearest neighbor, and probabilistic neural network to predict the subcellular localization of membrane proteins [26].
Other prediction models have also been used in this field, such as logistic regression, Bayesian classifiers, and long short-term memory networks [27–29].
As pointed out in a recent review [30], a number of web servers have been developed over the last decade or so for predicting the subcellular localization of proteins with both single and multiple sites [31–36]. In general, a protein can exist in multiple sites simultaneously. In this study, however, we did not consider multilabel proteins, for two reasons. First, the number of multilabel proteins in the existing apoptosis protein databases is too small to construct a statistically meaningful benchmark data set. Second, the sequence information of proteins with multiple locations is more complex and varied than that of single-location proteins, so using an oversampling approach to copy sequence features may produce inaccurate results.
Summarizing previous research, the prediction accuracy is relatively low when only a simple method such as AAC or PseAAC is used to extract sequence features for classification. Other feature extraction methods, such as PSSM or feature fusion, predict better, but their extraction process is too complicated and time-consuming for practical application. Many previous studies have shown that the support vector machine is one of the best classifiers for predicting protein subcellular localization [5, 9, 10, 14, 17, 22]. In this study, we therefore focus on obtaining higher prediction accuracy for apoptosis protein subcellular localization using a simple feature extraction method and a support vector machine. The key problem to be solved in this paper is thus to find an efficient approach to optimizing traditional sequence-based features.
In this study, we propose a sparse coding method combined with a traditional sequence feature extraction algorithm to extract low-level features of apoptosis protein sequences, using multilayer pooling based on dictionaries of different sizes to integrate the local and holistic features of the sparse representation. A support vector machine is then used to complete the final prediction. Because the adopted benchmark data sets are unbalanced, which may degrade the classification performance of the support vector machine [37], we used an oversampling approach to balance them. Compared with other experimental results obtained with the same support vector machine classifier, our results show that the proposed method not only simplifies the feature extraction process and reduces the time and space complexity of the classifier but also reflects the sequence features more comprehensively and improves classification performance. Detailed descriptions are given in the following sections.
2. Materials and Methods
2.1. Datasets
Two widely used benchmark data sets are adopted in this study: ZD98 and CH317. The data set ZD98 was constructed by Zhou and Doctor [21]. It contains 98 apoptosis protein sequences divided into four subcellular locations: cytoplasmic proteins (Cy), mitochondrial proteins (Mi), membrane proteins (Me), and other proteins (Other). The data set CH317 was constructed by Chen and Li [23] and contains a total of 317 apoptosis protein sequences in six subcellular locations: secreted proteins (Se), nuclear proteins (Nu), cytoplasmic proteins (Cy), endoplasmic reticulum proteins (En), membrane proteins (Me), and mitochondrial proteins (Mi). Because these data sets are old, we updated ZD98 and CH317 with reference to Wang et al. [38] and removed some duplicate and erroneous sequences; the specific procedure is not repeated here. After processing, 96 protein sequences remained in ZD98 and 314 in CH317. All protein sequences in the two data sets are from the latest version of the UniProt database (Release 2018_12), and the number of protein sequences in each class of the two data sets is shown in Table 1.
Table 1
Numbers of protein sequences in each class of the two data sets.
Dataset | Cy | En | Me | Mi | Nu | Se | Other | Total |
---|---|---|---|---|---|---|---|---|
ZD98 | 43 | - | 30 | 13 | - | - | 12 | 98 |
CH317 | 112 | 47 | 55 | 34 | 52 | 17 | - | 317 |
2.2. Feature Extraction
To establish a more accurate mapping between each protein sequence and its feature vector, this study introduces multilayer sparse coding to find the most essential features of the original protein sequence on top of a simple feature extraction method. The algorithm consists of three main steps: local feature extraction, sparse coding, and pooling. Sparse coding is itself divided into two stages: dictionary learning and sparse representation. First, the protein sequence is segmented into fragments, and a traditional protein feature extraction algorithm is used to extract the features of these fragments, which serve as input to dictionary learning. These local features are then used to train a dictionary, with which the feature representation of the original sequence is sparsely reconstructed. Mean pooling is used to reduce the dimensions of the feature matrix, and finally the pooled vectors obtained with different dictionary sizes are integrated as the final features of the protein sequence. The flow chart of the extraction process is shown in Figure 1.
[figure omitted; refer to PDF]
2.2.1. Local Feature Extraction
Before sparse coding, the local features of each protein sequence must be extracted to form a training sample set for dictionary learning. Because protein sequences differ in length and the critical features may be distributed at different positions in the sequence, we adopted a sliding window segmentation method inspired by Noor et al. [39] to cut each protein sequence into a number of fragments. The size of the sliding window determines the length of each fragment, and the reference formula is
After segmentation, an existing sequence feature extraction method is used to statistically analyze the composition of the sequence fragments and to transform the character sequences into numerical vectors, which serve as the local features of the protein. An effective feature extraction method can markedly increase the final prediction accuracy. Nakashima and Nishikawa [47] first associated amino acid composition (AAC) with the prediction of protein subcellular location in 1994. The AAC coding method counts the occurrence frequency of each amino acid in the protein sequence, described as follows:
By using AAC to calculate the fragment features of protein sequence P, we can obtain a feature matrix for each original protein sequence constituted by all the AAC features of corresponding fragments. The matrix is shown in
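The segmentation and AAC encoding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the window step of 1, the handling of sequences shorter than the window, and the choice to ignore non-standard residue symbols are all assumptions made for the example.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def sliding_fragments(sequence: str, window: int, step: int = 1) -> List[str]:
    """Cut a sequence into fragments of length `window` (step is an assumption)."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(sequence) < window:
        return [sequence]  # assumption: a short sequence is kept as one fragment
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, step)]


def aac(fragment: str) -> List[float]:
    """Occurrence frequency of each of the 20 amino acids in one fragment."""
    counts = [fragment.count(a) for a in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols such as 'X' are ignored (assumption)
    return [c / total if total else 0.0 for c in counts]


def aac_matrix(sequence: str, window: int, step: int = 1) -> List[List[float]]:
    """One 20-dimensional AAC row per fragment of the sequence."""
    return [aac(f) for f in sliding_fragments(sequence, window, step)]
```

Calling `aac_matrix(seq, window=35)` on a sequence of length n would give a matrix with n - 34 rows and 20 columns under these assumptions; each row sums to 1.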
2.2.2. Sparse Coding
Sparse coding is a representation learning technique closely related to neural networks, and it consists of two main steps: dictionary learning and sparse representation [48]. It can extract the detailed features of the original data set and decompose each input sample into a linear combination of multiple primitives. The coefficients of the primitives are the features of the input sample. The description can be formulated as
(1) Initialize the dictionary
(2) Fix
(3) Fix
(4) Perform steps
After the dictionary is obtained, the orthogonal matching pursuit (OMP) algorithm is used to compute the sparse representation of the fragment features of the original protein sequence [53]. The basic idea of OMP is to select from the dictionary the primitive that best matches the original sample, form a sparse approximation with it, and compute the residual between them. It then selects the next primitive that best matches this residual, and iterates in this way until the residual and the sparsity meet the preset termination conditions. Each sample can then be approximately represented by a linear combination of the selected primitives. At every step, the selected primitives are orthogonalized first, which speeds up convergence [54]. Constituting the sparse features of all the encoded fragments, we can obtain an
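The OMP iteration described above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function name `omp`, the stopping tolerance, and the convention that dictionary atoms are stored as columns are all hypothetical. The orthogonalization step is realized here by a least-squares re-fit on the selected atoms, which is a common way to implement it.

```python
import numpy as np


def omp(D: np.ndarray, x: np.ndarray, n_nonzero: int, tol: float = 1e-8) -> np.ndarray:
    """Sparse code of sample x over dictionary D (one atom per column)."""
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    support = []                      # indices of the atoms selected so far
    coef = np.zeros(D.shape[1])
    sol = np.zeros(0)
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) <= tol:
            break                     # residual small enough: stop early
        # Atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break                     # no new atom improves the fit
        support.append(j)
        # Least-squares fit on all selected atoms (orthogonal projection).
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    if support:
        coef[support] = sol
    return coef
```

Applying `omp` to every AAC fragment vector of one protein and stacking the results row by row would give the sparse feature matrix for that protein.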
2.2.3. Multilayer Pooling
The feature matrix obtained by sparse coding has a very high dimension. If it were flattened directly, the large data volume would add unnecessary space and time complexity to classification and make the classifier prone to overfitting. It is therefore necessary to reduce the dimensions of the feature matrix. Pooling maps a collection of feature vectors into a single vector. The two common pooling methods are max pooling and mean pooling. Aggregating the statistics of features at different positions extracts the effective information and reduces the amount of computation on the numerical matrix [55]. Max pooling takes the maximum value of the feature points in a neighborhood and better retains the edge information of the feature matrix, while mean pooling takes the average value of the feature points in a neighborhood and better captures the background information [56]. Given that the characteristics of sequence data differ from those of images, we chose mean pooling as the final dimension reduction method. The formula is shown as follows:
To obtain a more complete feature representation of the original protein sequence, multilayer pooling based on dictionaries of different sizes is performed, and the pooling results are integrated to capture both local and holistic features. The specific description is as follows: in the process of sparse coding, the values of dictionary sizes are set to
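A minimal sketch of mean pooling and multilayer integration is given below. It assumes that pooling averages the sparse codes over all fragments of one protein and that "integrating" the pooled vectors means concatenating them; neither detail is confirmed by the text, and the function names are hypothetical.

```python
from typing import List, Sequence


def mean_pool(codes: Sequence[Sequence[float]]) -> List[float]:
    """Average the sparse codes of all fragments, column by column."""
    if not codes:
        raise ValueError("need at least one fragment code")
    n = len(codes)
    return [sum(column) / n for column in zip(*codes)]


def multilayer_feature(codes_per_dictionary: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Concatenate the mean-pooled vectors obtained with each dictionary size."""
    feature: List[float] = []
    for codes in codes_per_dictionary:
        feature.extend(mean_pool(codes))
    return feature
```

With dictionaries of sizes K1 and K2, the resulting vector would have K1 + K2 entries regardless of how many fragments the protein was cut into, which is what makes sequences of different lengths comparable.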
2.3. Oversampling Method
Since the data sets used in this paper are unbalanced, which may lower prediction accuracy, we followed similar studies that used oversampling to balance the data set [16, 30, 43]. To further illustrate the effect of our method, a simple oversampling method, the synthetic minority oversampling technique (SMOTE), was applied to reduce the imbalance of our data sets. SMOTE is a classical oversampling method proposed by Chawla et al. [57] and is widely used for its good classification effect and simple operation. Its basic principle is to synthesize new minority samples between neighbouring minority samples, thereby reducing the imbalance of the data distribution. The details are as follows:
(1) For each sample
(2) Assuming that the sampling magnification is
(3) According to the following, combine each sample
(4) Finally, the interpolated sample
The imbalance degree of the data set determines the value of
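The interpolation idea behind SMOTE can be sketched as follows. This is an illustrative, standard-library-only version under stated assumptions: the function name, the default of k = 5 neighbours, the Euclidean distance, and the fixed random seed are choices made for the example, not parameters reported in this paper.

```python
import math
import random
from typing import List, Sequence


def smote(minority: Sequence[Sequence[float]], n_new_per_sample: int,
          k: int = 5, seed: int = 0) -> List[List[float]]:
    """Synthesize new minority samples by interpolating towards near neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic: List[List[float]] = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x (Euclidean distance, x excluded).
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        for _ in range(n_new_per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # random position on the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic
```

For example, generating one synthetic sample per original sample doubles a minority class, which is consistent with a class growing from 13 to 26 sequences.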
Table 2
Numbers of protein sequences in each class of the two data sets after oversampling.
Dataset | Cy | En | Me | Mi | Nu | Se | Other | Total |
---|---|---|---|---|---|---|---|---|
ZD98 | 43 | - | 30 | 26 | - | - | 24 | 123 |
CH317 | 112 | 47 | 55 | 51 | 52 | 51 | - | 368 |
2.4. Classifier and Performance Measures
To facilitate comparison with other feature extraction algorithms, we used the support vector machine (SVM) as the classification model in this study. After feature extraction, the LIBSVM package developed by Chang and Lin was used to construct the SVM multiclass classifier [58]. The Jackknife test was adopted to examine the effectiveness of the classifier. The Jackknife test is the least arbitrary validation method, as it always yields a unique result for a given benchmark data set [59]. Furthermore, for a more comprehensive evaluation, sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC), and overall accuracy (OA) over the entire data set are used as evaluation indexes [20, 21, 60]. These parameters are detailed in
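As a hedged illustration of the four evaluation indexes, the sketch below computes them from true and predicted labels using the standard one-versus-rest definitions, Sn = TP/(TP+FN), Sp = TN/(TN+FP), MCC = (TP·TN − FP·FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and OA = correct/total. These are assumed to match the definitions used in this paper; the function name and return format are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def evaluate(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[Dict[str, Dict[str, float]], float]:
    """Per-class Sn, Sp, MCC (one-vs-rest) and overall accuracy."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    report: Dict[str, Dict[str, float]] = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        tn = len(pairs) - tp - fn - fp
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        report[c] = {
            "Sn": tp / (tp + fn) if tp + fn else 0.0,
            "Sp": tn / (tn + fp) if tn + fp else 0.0,
            "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
        }
    oa = sum(t == p for t, p in pairs) / len(pairs)
    return report, oa
```

In a Jackknife test, `y_pred` would be collected by predicting each sequence with a classifier trained on all the other sequences, and `evaluate` would then be called once on the full lists.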
2.5. Parameters Selection
There are two key parameters in this study. The first is the fragment length used in local feature extraction. The shortest protein sequence in the data sets has length 50, and the fragment length is selected between 25 and 50. Figure 3 shows the prediction accuracy on ZD98 and CH317 for different fragment lengths.
[figure omitted; refer to PDF]
As shown in Figure 3, when the fragment length is between 35 and 40, the prediction accuracies on ZD98 and CH317 are highest and tend to be stable. The optimal fragment lengths for the two data sets used in this study are 35 and 40, respectively.
The second is the dimension D of the final feature vectors selected by PCA, which affects the accuracy of the entire algorithm. The more dimensions are selected, the more features are included, but the training time of the classifier may become too long. The fewer dimensions are selected, the more likely it is that truly meaningful features are lost, which degrades classification. Therefore, an optimal
As shown in Figure 4, when the dimension of the feature vector is low, the prediction accuracy on both data sets is relatively low. When the dimension exceeds a certain value, the prediction accuracy also decreases. When the dimension is between 60 and 70, the prediction accuracies on ZD98 and CH317 are highest and tend to be steady, and the current
3. Results and Discussion
The Jackknife prediction results of our experiments on ZD98 and CH317 are listed in Tables 3, 4, and 5. To further illustrate the effectiveness of our method, Table 3 also lists the sensitivity, specificity, Matthews correlation coefficient, and overall accuracy for each subcellular location of the two data sets.
Table 3
Experimental results on the two data sets (Jackknife test).
Dataset | Location | Sn(%) | Sp(%) | MCC(%) | OA(%) |
---|---|---|---|---|---|
ZD98 | Cy | 100 | 95.6 | 95.9 | 96.7 |
ZD98 | Me | 96.7 | 96.7 | 95.2 | |
ZD98 | Mi | 92.3 | 96.0 | 86.4 | |
ZD98 | Other | 95.9 | 95.8 | 90.5 | |
CH317 | Cy | 95.5 | 93.8 | 90.9 | 94.8 |
CH317 | Me | 93.6 | 92.7 | 91.1 | |
CH317 | Mi | 96.4 | 94.6 | 96.7 | |
CH317 | Se | 94.1 | 92.3 | 83.4 | |
CH317 | Nu | 94.2 | 92.5 | 89.6 | |
CH317 | En | 94.1 | 90.5 | 91.5 | |
Table 4
Comparison of accuracy on the ZD98 data set (Jackknife test, %).
Methods | Cyto | Memb | Mito | Other | OA(%) |
---|---|---|---|---|---|
DCC_SVM [40] | 93.0 | 86.7 | 92.3 | 75.0 | 88.9 |
OF_SVM [41] | 97.7 | 86.3 | 92.3 | 66.7 | 90.8 |
DE_SVM [42] | 95.4 | 93.3 | 76.9 | 83.3 | 90.8 |
BOW_SVM [24] | 97.7 | 92.9 | 76.9 | 83.3 | 91.7 |
GA_SVM [17] | 95.4 | 90.0 | 92.3 | 83.3 | 91.8 |
OA_SVM [43] | 95.3 | 88.9 | 97.4 | 91.7 | 93.2 |
Our | 100 | 96.7 | 92.3 | 95.9 | 96.7 |
Table 5
Comparison of accuracy on the CH317 data set (Jackknife test, %).
Methods | Cyto | Memb | Mito | Secr | Nucl | Endo | OA(%) |
---|---|---|---|---|---|---|---|
DCC_SVM [40] | 91.1 | 92.7 | 82.4 | 76.5 | 80.0 | 93.6 | 88.3 |
GA_SVM [17] | 92.9 | 89.1 | 82.4 | 76.5 | 84.6 | 93.6 | 89.0 |
BOW_SVM [24] | 94.6 | 87.3 | 82.4 | 82.4 | 84.3 | 91.5 | 89.2 |
IAC_SVM [44] | 96.4 | 94.5 | 82.4 | 76.5 | 80.8 | 93.6 | 90.5 |
EI_SVM [45] | 94.6 | 95.7 | 92.7 | 82.4 | 90.4 | 70.6 | 91.1 |
CF_SVM [46] | 96.4 | 90.9 | 92.3 | 95.7 | 82.4 | 64.7 | 91.5 |
Our | 95.5 | 93.6 | 96.4 | 94.1 | 94.2 | 94.1 | 94.8 |
Table 3 shows that the method obtains good experimental results on both data sets, with overall accuracies of 96.7% and 94.8%, respectively. This indicates that the method can effectively increase the accuracy of protein subcellular localization prediction. To facilitate comparison with other methods, we also list experimental results from several improved protein sequence feature extraction algorithms published in the past several years.
In Tables 4 and 5, DCC_SVM comes from Liang [40], using the detrended cross-correlation coefficient (2016); OF_SVM comes from Zhang [41], using the λ-order factor and principal component analysis (2017); DE_SVM comes from Liang [42], fusing two different descriptors based on evolutionary information (2018); BOW_SVM comes from Zhao [24], using bag of words (2017); GA_SVM comes from Liang [17], using Geary autocorrelation and the DCCA coefficient (2017); OA_SVM comes from Zhang [43], using oversampling and pseudo amino acid composition (2018); IAC_SVM comes from Zhang [44], integrating auto-cross correlation and PSSM (2018); EI_SVM comes from Xiang [45], using evolutionary information (2017); CF_SVM comes from Chen [46], using a set of discrete sequence correlation factors (2015). All the methods use SVM as the final classifier.
Table 4 shows that the proposed method achieves the largest improvement in overall prediction accuracy on ZD98, about 6 to 8 percentage points higher than traditional protein sequence feature extraction algorithms such as DCC_SVM, OF_SVM, and DE_SVM. For cytoplasmic proteins, the prediction accuracy is 100%, which means that all sequences in this class are predicted correctly, and the overall prediction accuracy is better than that of the other methods as well. Compared with other improved feature extraction algorithms such as BOW_SVM, GA_SVM, and OA_SVM, the accuracy on the same data set is also about 3 to 5 percentage points higher. These experiments suggest that the proposed method provides a better source of information for protein sequences and has clear advantages over similar feature extraction methods. From the comparison in Table 5, the prediction accuracy on mitochondrial proteins of CH317 reaches 96.4%, about 4.1 to 14 percentage points higher than other algorithms. The accuracy for nuclear proteins increases by up to 14.2 percentage points, and the overall prediction accuracy improves by 3.3 to 4.3 percentage points compared with improved algorithms such as IAC_SVM, EI_SVM, and CF_SVM, which further suggests that the method can optimize the underlying features of the sequence and effectively improve the prediction accuracy of apoptosis protein subcellular localization. Compared with traditional protein sequence feature extraction methods and their improved variants, our algorithm has low time complexity and achieves better results based on the simple AAC feature. Mean pooling also extracts the background information of the feature representation and reflects the distribution of sequence features more comprehensively, improving classification accuracy.
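The overall-accuracy gains quoted above can be checked directly from the OA columns of Tables 4 and 5. The short sketch below only restates those table values and subtracts them; the dictionary layout and function name are illustrative.

```python
# Overall accuracy (OA, %) copied from Tables 4 and 5.
OURS = {"ZD98": 96.7, "CH317": 94.8}
OTHERS = {
    "ZD98": {"DCC_SVM": 88.9, "OF_SVM": 90.8, "DE_SVM": 90.8,
             "BOW_SVM": 91.7, "GA_SVM": 91.8, "OA_SVM": 93.2},
    "CH317": {"DCC_SVM": 88.3, "GA_SVM": 89.0, "BOW_SVM": 89.2,
              "IAC_SVM": 90.5, "EI_SVM": 91.1, "CF_SVM": 91.5},
}


def oa_gains(dataset: str) -> dict:
    """Gain of the proposed method over each compared method, in percentage points."""
    return {method: round(OURS[dataset] - oa, 1)
            for method, oa in OTHERS[dataset].items()}
```

On ZD98 the gains range from 3.5 (OA_SVM) to 7.8 (DCC_SVM) percentage points, and on CH317 the gains over IAC_SVM, EI_SVM, and CF_SVM are 4.3, 3.7, and 3.3 percentage points.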
4. Conclusions
Prediction of apoptosis protein subcellular localization has long been a research hotspot for bioinformaticians worldwide. Building on the traditional protein sequence feature extraction algorithm AAC, this paper introduced sparse coding to optimize sequence features and proposed a feature fusion method based on multilevel dictionaries. First, sliding window segmentation was used to extract fragments from the protein sequences, and the traditional feature extraction algorithm was used to encode them. Then the K-SVD algorithm was used to learn the dictionary, and the sequence feature matrix was sparsely represented by the OMP algorithm. The feature representations based on dictionaries of different sizes were mean-pooled to capture both overall and local feature information. Finally, an SVM multiclass classifier was used to predict the subcellular location of the proteins. Experiments show that the proposed method obtains better prediction success rates in most subcellular classes and offers useful guidance for improving the feature expression of traditional apoptosis protein sequence feature extraction algorithms. Overall, it is a relatively effective method for predicting the subcellular localization of apoptosis proteins.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
This work was partially supported by National Key R & D Program of China (2017YFD0800204) and the Fundamental Research Funds for the Central Universities (No. KYZ201600175).
[1] M. Breker, M. Schuldiner, "The emergence of proteome-wide technologies: Systematic analysis of proteins comes of age," Nature Reviews Molecular Cell Biology, vol. 15 no. 7, pp. 453-464, DOI: 10.1038/nrm3821, 2014.
[2] S. Wang, Y. Yue, "Protein subnuclear localization based on a new effective representation and intelligent kernel linear discriminant analysis by dichotomous greedy genetic algorithm," PLoS ONE, vol. 13 no. 4, 2018.
[3] T. J. Burns, A. P. Frei, P. F. Gherardini, "High-throughput precision measurement of subcellular localization in single cells," Cytometry Part A, vol. 91 no. 2, pp. 180-189, DOI: 10.1002/cyto.a.23054, 2017.
[4] M. Salvatore, N. Shu, A. Elofsson, "The SubCons webserver: A user friendly web interface for state-of-the-art subcellular localization prediction," Protein Science, vol. 27 no. 1, pp. 195-201, DOI: 10.1002/pro.3297, 2017.
[5] X. Cheng, X. Xiao, K.-C. Chou, "PLoc-mHum: Predict subcellular localization of multi-location human proteins via general PseAAC to winnow out the crucial GO information," Bioinformatics, vol. 34 no. 9, pp. 1448-1456, DOI: 10.1093/bioinformatics/btx711, 2017.
[6] Q. Xiang, B. Liao, X. Li, "Subcellular localization prediction of apoptosis proteins based on evolutionary information and support vector machine," Artificial Intelligence in Medicine, vol. 78 no. 9, pp. 41-46, DOI: 10.1016/j.artmed.2017.05.007, 2017.
[7] M. R. Uddin, A. Sharma, D. M. Farid, "EvoStruct-SUB: An accurate Gram-positive protein subcellular localization predictor using evolutionary and structural features," Journal of Theoretical Biology, vol. 443, pp. 138-146, DOI: 10.1016/j.jtbi.2018.02.002, 2018.
[8] S. Qiao, B. Yan, J. Li, "Ensemble learning for protein multiplex subcellular localization prediction based on weighted KNN with different features," Applied Intelligence, vol. 48 no. 7, pp. 1813-1824, DOI: 10.1007/s10489-017-1029-6, 2018.
[9] S. Wang, B. Nie, K. Yue, "Protein subcellular localization with gaussian kernel discriminant analysis and its kernel parameter selection," International Journal of Molecular Sciences, vol. 18 no. 12, pp. 2718-2733, 2017.
[10] W. Xue, X.-Y. Hong, N. Zhao, "Predicting protein subcellular localization by approximate nearest neighbor searching," pp. 2842-2846, .
[11] Y. Huang, G. Huang, "A homology and pseudo amino acid composition-based multi-label model for predicting human membrane protein types," Current Proteomics, vol. 15 no. 2, pp. 135-141, DOI: 10.2174/1570164614666171030162205, 2018.
[12] M. Saidijam, S. Azizpour, S. G. Patching, "Amino acid composition analysis of human secondary transport proteins and implications for reliable membrane topology prediction," Journal of Biomolecular Structure and Dynamics, vol. 35 no. 5, pp. 929-949, DOI: 10.1080/07391102.2016.1167622, 2017.
[13] M. Rahimi, M. R. Bakhtiarizadeh, A. Mohammadi-Sangcheshmeh, "OOgenesis_Pred: A sequence-based method for predicting oogenesis proteins by six different modes of Chou’s pseudo amino acid composition," Journal of Theoretical Biology, vol. 414, pp. 128-136, DOI: 10.1016/j.jtbi.2016.11.028, 2017.
[14] K. Ahmad, M. Waris, M. Hayat, "Prediction of protein submitochondrial locations by incorporating dipeptide composition into chou’s general pseudo amino acid composition," Journal of Membrane Biology, vol. 249 no. 3, pp. 293-304, DOI: 10.1007/s00232-015-9868-8, 2016.
[15] X. Xiao, X. Cheng, S. Su, "PLoc-mGpos: Incorporate key gene ontology information into general PseAAC for predicting subcellular localization of gram-positive bacterial proteins," Natural Science, vol. 09 no. 09, pp. 330-349, DOI: 10.4236/ns.2017.99032, 2017.
[16] M. A. M. Hasan, S. Ahmad, M. N. I. Mondal, "PredMultiLoc-Gneg: Predicting subcellular localization of gram-negative bacterial proteins using feature selection in gene ontology space and resolving the data imbalance issue," pp. 109-112, .
[17] Y. Liang, S. Liu, S. Zhang, "Geary autocorrelation and DCCA coefficient: Application to predict apoptosis protein subcellular localization via PSSM," Physica A: Statistical Mechanics and its Applications, vol. 467, pp. 296-306, DOI: 10.1016/j.physa.2016.10.038, 2017.
[18] Y.-H. Yao, Z.-X. Shi, Q. Dai, "Apoptosis protein subcellular location prediction based on position-specific scoring matrix," Journal of Computational and Theoretical Nanoscience, vol. 11 no. 10, pp. 2073-2078, DOI: 10.1166/jctn.2014.3607, 2014.
[19] S. Wan, M.-W. Mak, S.-Y. Kung, "HybridGO-Loc: Mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins," PLoS ONE, vol. 9 no. 3, DOI: 10.1371/journal.pone.0089545, 2014.
[20] Z. Wang, Q. Zou, Y. Jiang, "Review of protein subcellular localization prediction," Current Bioinformatics, vol. 9 no. 3, pp. 331-342, DOI: 10.2174/1574893609666140212000304, 2014.
[21] G.-P. Zhou, K. Doctor, "Subcellular location prediction of apoptosis proteins," Proteins: Structure, Function, and Bioinformatics, vol. 50 no. 1, pp. 44-48, DOI: 10.1002/prot.10251, 2003.
[22] S. Wan, M. Mak, S. Kung, "GOASVM: A subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou’s pseudo-amino acid composition," Journal of Theoretical Biology, vol. 323, pp. 40-48, DOI: 10.1016/j.jtbi.2013.01.012, 2013.
[23] Y.-L. Chen, Q.-Z. Li, "Prediction of the subcellular location of apoptosis proteins," Journal of Theoretical Biology, vol. 245 no. 4, pp. 775-783, DOI: 10.1016/j.jtbi.2006.11.010, 2007.
[24] N. Zhao, L. Zhang, W. Xue, "Application of bag of words model in the prediction of protein subcellular location," Journal of Food Science and Biotechnology, vol. 36 no. 3, pp. 296-301, 2017.
[25] S. Wan, M.-W. Mak, S.-Y. Kung, "Mem-ADSVM: A two-layer multi-label predictor for identifying multi-functional types of membrane proteins," Journal of Theoretical Biology, vol. 398, pp. 32-42, DOI: 10.1016/j.jtbi.2016.03.013, 2016.
[26] F. Ali, M. Hayat, "Classification of membrane protein types using voting feature interval in combination with Chou’s pseudo amino acid composition," Journal of Theoretical Biology, vol. 384, pp. 78-83, DOI: 10.1016/j.jtbi.2015.07.034, 2015.
[27] S. Wan, M.-W. Mak, S.-Y. Kung, "MPLR-Loc: An adaptive decision multi-label classifier based on penalized logistic regression for protein subcellular localization prediction," Analytical Biochemistry, vol. 473, pp. 14-27, DOI: 10.1016/j.ab.2014.10.014, 2015.
[28] S. Sáez-Atienzar, J. Martínez-Gómez, J. I. Alonso-Barba, "Automatic quantification of the subcellular localization of chimeric GFP protein supported by a two-level Naive Bayes classifier," Expert Systems with Applications, vol. 42 no. 3, pp. 1531-1537, DOI: 10.1016/j.eswa.2014.09.052, 2015.
[29] S. K. Sønderby, C. K. Sønderby, H. Nielsen, O. Winther, "Convolutional LSTM networks for subcellular localization of proteins," International Conference on Algorithms for Computational Biology, vol. 9199, pp. 68-80, DOI: 10.1007/978-3-319-21233-3_6, 2015.
[30] K. Chou, "Impacts of bioinformatics to medicinal chemistry," Medicinal Chemistry, vol. 11 no. 3, pp. 218-234, DOI: 10.2174/1573406411666141229162834, 2015.
[31] X. Cheng, X. Xiao, K.-C. Chou, "PLoc-mEuk: Predict subcellular localization of multi-label eukaryotic proteins by extracting the key GO information into general PseAAC," Genomics, vol. 110 no. 1, pp. 50-58, DOI: 10.1016/j.ygeno.2017.08.005, 2018.
[32] S. Wan, M.-W. Mak, S.-Y. Kung, "FUEL-mLoc: Feature-unified prediction and explanation of multi-localization of cellular proteins in multiple organisms," Bioinformatics, vol. 33 no. 5, pp. 749-750, DOI: 10.1093/bioinformatics/btw717, 2017.
[33] X. Guo, F. Liu, Y. Ju, Z. Wang, C. Wang, "Human protein subcellular localization with integrated source and multi-label ensemble classifier," Scientific Reports, vol. 6 no. 1, DOI: 10.1038/srep28087, 2016.
[34] S. Wan, M.-W. Mak, S.-Y. Kung, "Ensemble linear neighborhood propagation for predicting subchloroplast localization of multi-location proteins," Journal of Proteome Research, vol. 15 no. 12, pp. 4755-4762, DOI: 10.1021/acs.jproteome.6b00686, 2016.
[35] J. J. A. Armenteros, C. K. Sønderby, S. K. Sønderby, H. Nielsen, O. Winther, "DeepLoc: Prediction of protein subcellular localization using deep learning," Bioinformatics, vol. 33 no. 21, pp. 3387-3395, DOI: 10.1093/bioinformatics/btx431, 2017.
[36] S. Wan, M.-W. Mak, S.-Y. Kung, "Gram-LocEN: Interpretable prediction of subcellular multi-localization of gram-positive and gram-negative bacterial proteins," Chemometrics and Intelligent Laboratory Systems, vol. 162, DOI: 10.1016/j.chemolab.2016.12.014, 2017.
[37] K. Tbarki, S. B. Said, R. Ksantini, Z. Lachiri, "Landmine detection improvement using one-class SVM for unbalanced data."
[38] X. Wang, H. Li, Q. Zhang, R. Wang, "Predicting subcellular localization of apoptosis proteins combining GO features of homologous proteins and distance weighted KNN classifier," BioMed Research International, vol. 2016, DOI: 10.1155/2016/1793272, 2016.
[39] M. H. M. Noor, Z. Salcic, K. I.-K. Wang, "Adaptive sliding window segmentation for physical activity recognition using a single tri-axial accelerometer," Pervasive and Mobile Computing, vol. 38 no. 1, pp. 41-59, DOI: 10.1016/j.pmcj.2016.09.009, 2017.
[40] Y. Liang, S. Liu, S. Zhang, "Detrended cross-correlation coefficient: Application to predict apoptosis protein subcellular localization," Mathematical Biosciences, vol. 282, pp. 61-67, DOI: 10.1016/j.mbs.2016.09.019, 2016.
[41] S. Zhang, J. Jin, "Prediction of protein subcellular localization by using λ-order factor and principal component analysis," Letters in Organic Chemistry, vol. 14 no. 9, pp. 717-724, 2017.
[42] Y. Liang, S. Zhang, "Prediction of apoptosis protein’s subcellular localization by fusing two different descriptors based on evolutionary information," Acta Biotheoretica, vol. 66 no. 1, pp. 61-78, DOI: 10.1007/s10441-018-9319-x, 2018.
[43] S. Zhang, X. Duan, "Prediction of protein subcellular localization with oversampling approach and Chou’s general PseAAC," Journal of Theoretical Biology, vol. 437, pp. 239-250, DOI: 10.1016/j.jtbi.2017.10.030, 2018.
[44] S. Zhang, Y. Liang, "Predicting apoptosis protein subcellular localization by integrating auto-cross correlation and PSSM into Chou’s PseAAC," Journal of Theoretical Biology, vol. 457, pp. 163-169, DOI: 10.1016/j.jtbi.2018.08.042, 2018.
[45] Q. Xiang, B. Liao, X. Li, "Subcellular localization prediction of apoptosis proteins based on evolutionary information and support vector machine," Artificial Intelligence in Medicine, vol. 78, pp. 41-46, DOI: 10.1016/j.artmed.2017.05.007, 2017.
[46] H. Chen, X. Chen, Q. Hu, Z. Cao, "Predicting protein subcellular location based on a novel sequence numerical model," Journal of Computational and Theoretical Nanoscience, vol. 12 no. 1, pp. 82-87, DOI: 10.1166/jctn.2015.3701, 2015.
[47] H. Nakashima, K. Nishikawa, "Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies," Journal of Molecular Biology, vol. 238 no. 1, pp. 54-61, DOI: 10.1006/jmbi.1994.1267, 1994.
[48] Y. Quan, Y. Xu, Y. Sun, Y. Huang, H. Ji, "Sparse coding for classification via discrimination ensemble," Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5839-5847.
[49] M. Yang, L. Zhang, X. Feng, D. Zhang, "Sparse representation based Fisher discrimination dictionary learning for image classification," International Journal of Computer Vision, vol. 109 no. 3, pp. 209-232, DOI: 10.1007/s11263-014-0722-8, 2014.
[50] B. M. Whitaker, P. B. Suresha, C. Liu, G. D. Clifford, D. V. Anderson, "Combining sparse coding and time-domain features for heart sound classification," Physiological Measurement, vol. 38 no. 8, pp. 1701-1729, DOI: 10.1088/1361-6579/aa7623, 2017.
[51] A. Cherian, S. Sra, "Riemannian sparse coding for positive definite matrices," Proceedings of the European Conference on Computer Vision, pp. 299-314.
[52] M. Aharon, M. Elad, A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54 no. 11, pp. 4311-4322, DOI: 10.1109/TSP.2006.881199, 2006.
[53] J. Wen, Z. Zhou, J. Wang, X. Tang, Q. Mo, "A sharp condition for exact support recovery with orthogonal matching pursuit," vol. 65, pp. 1370-1382, DOI: 10.1109/TSP.2016.2634550.
[54] A. Cohen, W. Dahmen, R. DeVore, "Orthogonal matching pursuit under the restricted isometry property," Constructive Approximation. An International Journal for Approximations and Expansions, vol. 45 no. 1, pp. 113-127, DOI: 10.1007/s00365-016-9338-2, 2017.
[55] Y. Liu, J. Cheng, Y. Ma, Y. Chen, "Protein secondary structure prediction based on two dimensional deep convolutional neural networks," pp. 1995-1999, DOI: 10.1109/CompComm.2017.8322886.
[56] Y. Chen, "Long sequence feature extraction based on deep learning neural network for protein secondary structure prediction," pp. 843-847.
[57] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16 no. 1, pp. 321-357, DOI: 10.1613/jair.953, 2002.
[58] C. Chang, C. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2 no. 3, article no. 27, DOI: 10.1145/1961189.1961199, 2011.
[59] H. Ding, Z. Y. Liang, F. B. Guo, J. Huang, W. Chen, H. Lin, "Predicting bacteriophage proteins located in host cell with feature selection technique," Computers in Biology and Medicine, vol. 71, pp. 156-161, DOI: 10.1016/j.compbiomed.2016.02.012, 2016.
[60] K. C. Chou, H. B. Shen, "Recent progress in protein subcellular location prediction," Analytical Biochemistry, vol. 370 no. 1, DOI: 10.1016/j.ab.2007.07.006, 2007.
Copyright © 2019 Xingjian Chen et al. This is an open access article distributed under the Creative Commons Attribution License (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. https://creativecommons.org/licenses/by/4.0/
Abstract
Predicting the subcellular localization of apoptosis proteins plays an important role in understanding the processes of cell proliferation and death. Computational approaches to this problem have recently become popular, because traditional biological experiments are too costly and time-consuming to keep pace with the growth of sequence data. To improve the prediction accuracy of apoptosis protein subcellular localization, we proposed a sparse coding method that is combined with a traditional feature extraction algorithm to obtain a sparse representation of apoptosis protein sequences. Multilayer pooling based on dictionaries of different sizes was used to integrate the processed features, and an oversampling approach was used to reduce the influence of unbalanced data sets. The extracted features were then input to a support vector machine to predict the subcellular localization of the apoptosis protein. The experimental results obtained by the Jackknife test on two benchmark data sets indicate that our method can significantly improve the accuracy of apoptosis protein subcellular localization prediction.
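The Jackknife test mentioned above is a leave-one-out evaluation: each protein is held out once, the classifier is trained on all remaining proteins, and the held-out protein is then predicted. The following is a minimal sketch of that procedure only, not the implementation used in this work. To stay self-contained it swaps the support vector machine for a 1-nearest-neighbor stand-in, and the names `nearest_neighbor_predict`, `jackknife_accuracy`, and the toy feature vectors and labels are hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training vector closest to `query`.

    Assumption: a 1-NN rule stands in for the SVM classifier of the paper.
    """
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    return train_y[best]


def jackknife_accuracy(features, labels, predict=nearest_neighbor_predict):
    """Leave-one-out (Jackknife) accuracy over a list of feature vectors."""
    correct = 0
    for i in range(len(features)):
        # Hold out sample i and train on every other sample.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Made-up two-dimensional feature vectors for two hypothetical location classes.
toy_features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
                [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]
toy_labels = ["cyto", "cyto", "cyto", "nuclear", "nuclear", "nuclear"]
toy_accuracy = jackknife_accuracy(toy_features, toy_labels)
```

Any classifier with the same `predict(train_x, train_y, query)` shape can be passed in place of the stand-in. In a real experiment the feature vectors would be the pooled sparse codes rather than toy values.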