This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Polycystic ovary syndrome (PCOS), a heterogeneous endocrine disorder, is closely associated with menstrual dysfunction, infertility, hirsutism, acne, obesity, and metabolic syndrome [1]. Three sets of diagnostic criteria for PCOS are widely followed: the criteria proposed by the National Institutes of Health (NIH) [2], the 2003 Rotterdam Consensus proposed by the European Society of Human Reproduction and Embryology (ESHRE) and the American Society for Reproductive Medicine (ASRM) [3, 4], and the criteria proposed by the Androgen Excess Society (AES) [5]. However, these criteria have created some controversy in the field [6]. The multifactorial etiology of PCOS is underpinned by a complex genetic architecture [7]. Ethnicity is closely related to the PCOS phenotype because genetic and environmental propensity to metabolic disorders differs among ethnic groups [8–10].
Although the identified genetic risk markers can be used as predictive and diagnostic tools for PCOS, individual markers may lack strong predictive power because of the complicated genetic architecture [6]. Combining various markers in diagnostic panels may significantly improve diagnostic success [11]. Many studies have successfully used genetic risk scores to explain increasing amounts of variance in diseases [12].
In recent years, the wide application of microarray technology and of more advanced and accurate RNA-sequencing (RNA-seq) technology has made the study of disease mechanisms more convenient. Because the two platforms differ, their data need to be analyzed separately.
The main difficulty in establishing a classification model from gene expression data is finding the most meaningful indices or features for classification. To address this, various machine learning approaches such as random forest (RF) [13, 14] and artificial neural network (ANN) [15] have been utilized. The single or combined use of these algorithms has contributed much to gene expression data classification [16], disease diagnosis [17], cell migration [18], and microbiome research [19]. Given their high classification accuracy and convenience, they have become powerful tools for learning feature representations.
In this work, we established a diagnostic model of PCOS using microarray and RNA-seq data from the Gene Expression Omnibus (GEO) database with the combined use of RF and ANN. First, the RF classifier was used to identify the key genes for classification; then, the ANN was used to calculate the weights of the key genes in microarray and RNA-seq data, respectively. Finally, a scoring model named neuralPCOS was developed by integrating RF and ANN. To validate the accuracy and superiority of the diagnostic model, we evaluated its performance with microarray and RNA-seq data and compared it with other marker genes obtained in previous studies [20, 21].
2. Materials and Methods
2.1. Study Design
RF and ANN were adopted in this study to establish the diagnostic model of PCOS. The study overview is depicted schematically in Figure 1. GSE6798 dataset (
2.2. Data Selection and Preprocessing
In the present study, a wide search of the National Center for Biotechnology Information Gene Expression Omnibus (NCBI GEO) database was conducted with the key words “PCOS, human”. As shown in Table 1, 6 sets of microarray data and 1 set of RNA-seq data were downloaded from the GEO database. To obtain one training dataset with a large sample size (microarray ComBat dataset1), three small microarray datasets (GSE137684, GSE137354, and GSE34526) were combined. Likewise, GSE43264 and GSE124226 were combined to form one validation dataset (microarray ComBat dataset2). These datasets were converted to logarithmic form after standardization, and ComBat (from the R package sva) was used to remove the batch effects [22]. Two microarray datasets with 28 and 23 samples, respectively, were obtained using classical and Bayesian correction methods.
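The combination of several small datasets described above can be sketched as follows. This is only an illustration in Python, not the R workflow used in the study: the per-dataset z-scoring is a simple stand-in for batch adjustment and is not ComBat's empirical Bayes method, the order of transformation and scaling is an assumption, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def log2_transform(dataset):
    """dataset: {sample_id: {gene: raw_value}} -> log2(value + 1) per entry."""
    return {s: {g: math.log2(v + 1.0) for g, v in row.items()}
            for s, row in dataset.items()}


def shared_genes(datasets):
    """Genes measured in every dataset (taken from the first sample of each)."""
    gene_sets = [set(next(iter(d.values()))) for d in datasets]
    return sorted(set.intersection(*gene_sets))


def center_scale_per_batch(dataset, genes):
    """z-score each gene within one dataset.

    Assumption: a crude stand-in for batch adjustment, NOT ComBat.
    """
    out = {s: {} for s in dataset}
    for g in genes:
        vals = [dataset[s][g] for s in dataset]
        mu, sd = mean(vals), pstdev(vals)
        for s in dataset:
            out[s][g] = (dataset[s][g] - mu) / sd if sd > 0 else 0.0
    return out


def merge_batches(datasets):
    """Combine datasets on their shared genes; keys are (batch_index, sample_id)."""
    genes = shared_genes(datasets)
    merged = {}
    for batch_id, d in enumerate(datasets):
        scaled = center_scale_per_batch(log2_transform(d), genes)
        for s, row in scaled.items():
            merged[(batch_id, s)] = row
    return genes, merged
```

Keeping the batch index in each sample key preserves the dataset of origin, which a real batch-correction method would need as its batch covariate.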
Table 1: Gene expression data from the Gene Expression Omnibus (GEO) database.
Dataset ID | Total samples | Control (n) | PCOS (n) | Data type | Tissue type | Country |
GSE6798 | 29 | 13 | 16 | Microarray | Skeletal muscle | Denmark |
GSE43264 | 15 | 7 | 8 | Microarray | Adipose | Ireland |
GSE34526 | 10 | 3 | 7 | Microarray | Granulosa cells | India |
GSE137684 | 12 | 4 | 8 | Microarray | Granulosa cells | China |
GSE137354 | 6 | 3 | 3 | Microarray | Endometrium | China |
GSE124226 | 8 | 4 | 4 | Microarray | Adipose stem cells | USA |
GSE84958 | 53 | 23 | 30 | RNA-seq | Adipose | UK |
2.3. Differentially Expressed Genes (DEGs) Screening
The GSE6798 dataset, generated on the Affymetrix Human Genome U133 Plus 2.0 Array (Affymetrix Inc., Santa Clara, California, USA) and containing 16 PCOS cases and 13 controls, was used for DEG analysis. The boxplot was drawn using the R package stats (v 3.5.0). The R package limma was used to calculate the DEGs between the PCOS and control samples by the classical Bayesian method with
2.4. Gene Ontology (GO) Enrichment Analysis
To further reveal the biological functions of the selected DEGs, GO enrichment analysis, covering biological process (BP), cellular component (CC), and molecular function (MF), was performed using the R package clusterProfiler [25]. Significant enrichment terms were screened with the threshold adjusted
2.5. Random Forest (RF) Classification
We used random forest to classify the DEGs with the R package randomForest [27]. First, the optimal value of the mtry parameter (the number of variables sampled as candidates at each node of a tree) was identified. All candidate values (1~2000) were looped through the random forest classifier, the error rate of each was calculated, and the value with the lowest error rate was selected. Next, the error rate for each of 1~3000 trees was calculated, and the optimal number of trees was determined by the lowest error rate and best stability. With these parameters, the random forest classifier was run, and the important genes were selected as the candidate PCOS-specific genes according to the Gini coefficient method.
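The two-step parameter search described above can be sketched as follows. The sketch covers only the selection logic and does not implement a random forest: `error_for` is a hypothetical callable standing in for a fitted classifier's error rate, and the tolerance-based reading of "best stability" is an assumption, since the text does not define it.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def select_mtry(candidates: Iterable[int],
                error_for: Callable[[int], float]) -> Tuple[int, Dict[int, float]]:
    """Evaluate every candidate mtry and return the one with the lowest error.

    Ties are broken in favour of the smaller mtry.
    """
    errors = {m: error_for(m) for m in candidates}
    best = min(errors, key=lambda m: (errors[m], m))
    return best, errors


def select_ntree(errors: List[float], tol: float = 0.01) -> int:
    """errors[i] is the error rate of a forest with i + 1 trees.

    Returns the smallest tree count from which every later error stays
    within `tol` of the minimum error (assumed meaning of "stability").
    If the curve never settles, the full length is returned.
    """
    floor = min(errors)
    stable_from = len(errors)
    for i in range(len(errors) - 1, -1, -1):
        if errors[i] - floor > tol:
            break
        stable_from = i + 1
    return stable_from
```

In practice the error curve would come from the out-of-bag error of the fitted forest; any rule that trades error against stability could replace the tolerance rule used here.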
2.6. Calculation of DEGs Weight by Artificial Neural Network (ANN)
The GSE84958 dataset was randomly divided into training data (
2.7. neuralPCOS
We constructed an equation named neuralPCOS that estimates the classification score from the key genes in microarray data or RNA-seq data:
neuralPCOS = Σ (gene expression × gene weight)
The expression value of each gene was multiplied by the weight of that gene, and the products for all genes were summed. (Note: before the score is calculated, the log2-transformed expression data need to be normalized by min-max normalization.)
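The scoring rule just described (min-max normalization followed by a weighted sum over genes) can be sketched in a few lines. Two points are assumptions rather than statements from the text: normalization is applied per gene across samples, and a gene with constant expression is mapped to 0. The function names are hypothetical.

```python
from typing import Dict, List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant vector maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def neural_pcos_scores(expression: Dict[str, List[float]],
                       weights: Dict[str, float]) -> List[float]:
    """Weighted sum of normalized expression, one score per sample.

    expression: {gene: [log2 expression value for each sample]}
    weights:    {gene: weight}, e.g. as obtained from a trained network
    Genes without a weight or without expression data are ignored.
    """
    n_samples = len(next(iter(expression.values())))
    scores = [0.0] * n_samples
    for gene, weight in weights.items():
        if gene not in expression:
            continue
        for i, value in enumerate(min_max_normalize(expression[gene])):
            scores[i] += weight * value
    return scores
```

With toy inputs `{"GENE_A": [1.0, 3.0, 5.0], "GENE_B": [2.0, 2.0, 4.0]}` and weights `{"GENE_A": 2.0, "GENE_B": -1.0}`, the three samples receive scores 0.0, 1.0, and 1.0.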
2.8. Evaluation of Performance by Area under Curve (AUC)
The AUCs of three kinds of scores (neuralPCOS, EC-PCOS, GC-PCOS) were calculated in GSE84958 RNA-seq validation data (
The three kinds of scores were as follows:
(1) neuralPCOS
(2) EC-PCOS: three upregulated genes including insulin-like growth factor 1 (IGF1), phosphatase and tensin homolog (PTEN), and insulin-like growth factor-binding protein 1 (IGFBP1) in endometrial cells (ECs) of PCOS [20].
(3) GC-PCOS: upregulated genes including hydroxy-delta-5-steroid dehydrogenase, 3 beta- and steroid delta-isomerase 2 (HSD3B2), steroidogenic acute regulatory protein (STAR), inhibin subunit beta A (INHBA), and cytochrome P450 family 19 subfamily A member 1 (CYP19A1) in granulosa cells (GCs) of PCOS [21].
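For readers unfamiliar with the metric, the AUC of any of these scores equals the probability that a randomly chosen PCOS sample scores higher than a randomly chosen control, with ties counted as one half. The following is a minimal sketch of that definition in Python; it is not the software used in the study, and the function name and label coding (1 = PCOS, 0 = control) are assumptions.

```python
from typing import List


def auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for PCOS, 0 for control (assumed coding).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of samples, which is unproblematic for datasets of the size listed in Table 1.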
3. Results
3.1. Identification of DEGs
Firstly, the boxplot presented RNA expression level in GSE6798 (
[figures omitted; refer to PDF]
3.2. Functional Characterization of Selected DEGs
GO enrichment analysis for the selected 264 DEGs was carried out to identify the significantly enriched GO terms. The GObar showed the predominant significantly enriched GO terms (adjusted
[figures omitted; refer to PDF]
3.3. Screening Candidate PCOS-Specific Genes by Random Forest
To obtain more reliable PCOS-specific genes, we input the above 264 DEGs into the RF classifier. The lowest error rate occurred when the number of variables was 4 (Figure 4(a)); meanwhile, the number of trees in the RF classifier was set to 1000 because of the low error rate and stability at that value (Figure 4(b)). Therefore, we chose 4 variables and 1000 trees as the final parameters of the RF classifier to obtain the dimensional importance of all variables. The top 12 genes ranked by MeanDecreaseAccuracy and MeanDecreaseGini are shown in Figure 4(c). Finally, we selected 0.15 as the screening threshold of importance in the MeanDecreaseGini result, and a set of 12 PCOS-specific DEGs was identified.
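The final screening step amounts to a threshold filter on the importance values. A minimal sketch follows; the 0.15 threshold is taken from the text, whereas the strict inequality and the function name are assumptions, and any gene names or importance values passed in are the caller's own.

```python
from typing import Dict, List


def screen_by_importance(importance: Dict[str, float],
                         threshold: float = 0.15) -> List[str]:
    """Return genes whose importance exceeds the threshold, most important first.

    Assumption: the cut-off is strict (importance > threshold).
    """
    kept = [g for g, value in importance.items() if value > threshold]
    return sorted(kept, key=lambda g: importance[g], reverse=True)
```

For example, toy importances of 0.30, 0.15, 0.05, and 0.22 for four genes would retain only the first and last, in that order.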
[figures omitted; refer to PDF]
3.4. ANN Analysis
The RF classifier identified the key genes that optimally differentiated between PCOS and controls. To further construct a PCOS-specific scoring model, ANN analysis was performed to calculate the weights of the 12 genes. Here, two parallel training processes were carried out according to the format of the training data, including RNA-seq training data GSE84958 (
[figures omitted; refer to PDF]
3.5. The Validation of neuralPCOS
Microarray ComBat dataset2 (
[figures omitted; refer to PDF]
4. Discussion
In recent years, the development of machine learning algorithms and the availability of gene expression data in public databases have provided approaches to infer biomarkers for disease diagnosis or prognosis in a wide range of fields [30–33]. In the field of PCOS, some attempts have been made to explore better ways of diagnosing PCOS with various machine learning algorithms [34–38]; some of these applied suitable algorithms to clinical data, such as survey data [35] or pelvic ultrasound data [37]. One algorithm was constructed to predict new PCOS candidates using data from the Polycystic Ovary Syndrome Database (PCOSDB; http://www.pcosdb.net/) [39] and the KnowledgeBase on Polycystic Ovary Syndrome (PCOSKB; http://pcoskb.bicnirrh.res.in) [36, 40]. Another study converted the ovary microarray data of the GEO database to gene set regularity (GSR) indices, which were then computed by the modified differential rank conversion algorithm [38]. Compared with these studies, we aimed to develop a diagnostic model based on gene expression data using as many samples as possible from the GEO database. We finally integrated RF and ANN algorithms to infer the key classification genes and calculate the weights of these genes.
In the present study, when identifying DEGs with the GSE6798 dataset, we removed the DEGs with low expression levels, which yields more reliable genes. GO enrichment analysis was performed and displayed by bar plot and bubble plot. Among the 11 enriched GO terms, 4 terms, namely actin binding [41], myofibril [42], sarcomere [42], and contractile fiber part [42], were also identified in other PCOS studies. We listed the top 12 core genes screened by the RF model for classification among the DEGs based on MeanDecreaseGini. Moreover, 10 of the 12 genes were also regarded as PCOS candidate genes in other studies: tropomodulin 1 (TMOD1) [43]; BTB domain containing 9 (BTBD9) [44]; trans-2,3-enoyl-CoA reductase like (TECRL) [44, 45]; glutathione S-transferase omega 1 (GSTO1) [44, 46, 47]; adenosine monophosphate deaminase 3 (AMPD3) [45]; alpha kinase 2 (ALPK2) [48]; Ras association (RalGDS/AF-6) and pleckstrin homology domains 1 (RAPH1) [44, 45, 48, 49]; aldehyde dehydrogenase 6 family member A1 (ALDH6A1) [44, 45, 50–52]; zinc finger protein 385B (ZNF385B) [53]; and ST3 beta-galactoside alpha-2,3-sialyltransferase 2 (ST3GAL2) [44]. Given that RNA-seq technology can detect novel transcripts and offers a wider dynamic range, higher specificity, and higher sensitivity than microarray technology [54], the gene expression data obtained by these two technologies may differ. In this study, we therefore calculated the weights of the core genes by ANN using each type of data separately. Although the weights of only 11 genes in both microarray data and RNA-seq data were calculated, 10 genes were verified in previous studies in both platforms. The novelty of our diagnostic model is that the scoring model was obtained by comprehensively considering the genes that are vital to classification and their weights. To validate the applicability and superiority of this model in different types of data, AUC analysis was performed in microarray ComBat dataset2 (
Even so, our study still has some limitations. Although our total sample size is not too small (PCOS:
5. Conclusions
A novel diagnostic model for PCOS was established based on machine learning algorithms using microarray and RNA-seq datasets; it showed better prediction performance in microarray data than existing marker genes.
Authors’ Contributions
Ning-Ning Xie and Fang-Fang Wang contributed equally to this work.
[1] R. J. Norman, D. Dewailly, R. S. Legro, T. E. Hickey, "Polycystic ovary syndrome," Lancet, vol. 370 no. 9588, pp. 685-697, DOI: 10.1016/S0140-6736(07)61345-2, 2007.
[2] J. K. Zawadzki, A. Dunaif, "Diagnostic criteria for polycystic ovary syndrome: towards a rational approach," Polycystic Ovary Syndrome, pp. 377-384, 1992.
[3] The Rotterdam ESHRE/ASRM-Sponsored PCOS Consensus Workshop Group, "Revised 2003 consensus on diagnostic criteria and long-term health risks related to polycystic ovary syndrome," Fertility and Sterility, vol. 81 no. 1, pp. 19-25, DOI: 10.1016/j.fertnstert.2003.10.004, 2004.
[4] The Rotterdam ESHRE/ASRM-Sponsored PCOS Consensus Workshop Group, "Revised 2003 consensus on diagnostic criteria and long-term health risks related to polycystic ovary syndrome (PCOS)," Human Reproduction, vol. 19 no. 1, pp. 41-47, DOI: 10.1093/humrep/deh098, 2004.
[5] R. Azziz, E. Carmina, D. Dewailly, E. Diamanti-Kandarakis, H. F. Escobar-Morreale, W. Futterweit, O. E. Janssen, R. S. Legro, R. J. Norman, A. E. Taylor, S. F. Witchel, Task Force on the Phenotype of the Polycystic Ovary Syndrome of The Androgen Excess and PCOS Society, "The Androgen Excess and PCOS Society criteria for the polycystic ovary syndrome: the complete task force report," Fertility and Sterility, vol. 91 no. 2, pp. 456-488, DOI: 10.1016/j.fertnstert.2008.06.035, 2009.
[6] B. C. J. M. Fauser, B. C. Tarlatzis, R. W. Rebar, R. S. Legro, A. H. Balen, R. Lobo, E. Carmina, J. Chang, B. O. Yildiz, J. S. E. Laven, J. Boivin, F. Petraglia, C. N. Wijeyeratne, R. J. Norman, A. Dunaif, S. Franks, R. A. Wild, D. Dumesic, K. Barnhart, "Consensus on women’s health aspects of polycystic ovary syndrome (PCOS): the Amsterdam ESHRE/ASRM-Sponsored 3rd PCOS Consensus Workshop Group," Fertility and Sterility, vol. 97 no. 1, pp. 28-38.e25, DOI: 10.1016/j.fertnstert.2011.09.024, 2012.
[7] M. R. Jones, M. O. Goodarzi, "Genetic determinants of polycystic ovary syndrome: progress and future directions," Fertility and Sterility, vol. 106 no. 1, pp. 25-32, DOI: 10.1016/j.fertnstert.2016.04.040, 2016.
[8] I. Sirota, D. E. Stein, M. Vega, M. D. Keltz, "Increased insulin-resistance and beta-cell function in polycystic ovary syndrome women-does ethnicity play a role?," Reproductive Sciences, vol. 20 no. S3, pp. 180a-181a, 2013.
[9] Y. Louwers, O. Lao, M. Kayser, "Inferred genetic ancestry versus reported ethnicity in polycystic ovary syndrome (PCOS)," Human Reproduction, vol. 28, pp. 349-349, 2013.
[10] R. Azziz, U. Ezeh, M. Pall, D. A. Dumesic, M. O. Goodarzi, "Effect of race on the metabolic dysfunction of Polycystic Ovary Syndrome (PCOS): comparing African-American (AA) and Non-Hispanic White (NHW) patients," Endocrine Reviews, vol. 31 no. 3, 2010.
[11] B. J. Vilhjálmsson, J. Yang, H. K. Finucane, A. Gusev, S. Lindström, S. Ripke, G. Genovese, P.-R. Loh, G. Bhatia, R. Do, et al., "Modeling linkage disequilibrium increases accuracy of polygenic risk scores," The American Journal of Human Genetics, vol. 97 no. 4, pp. 576-592, DOI: 10.1016/j.ajhg.2015.09.001, 2015.
[12] P. J. Talmud, J. A. Cooper, R. W. Morris, F. Dudbridge, T. Shah, J. Engmann, C. Dale, J. White, S. McLachlan, D. Zabaneh, A. Wong, K. K. Ong, T. Gaunt, M. V. Holmes, D. A. Lawlor, M. Richards, R. Hardy, D. Kuh, N. Wareham, C. Langenberg, Y. Ben-Shlomo, S. G. Wannamethee, M. W. Strachan, M. Kumari, J. C. Whittaker, F. Drenos, M. Kivimaki, A. D. Hingorani, J. F. Price, S. E. Humphries, UCLEB Consortium, "Sixty-five common genetic variants and prediction of type 2 diabetes," Diabetes, vol. 64 no. 5, pp. 1830-1840, DOI: 10.2337/db14-1504, 2015.
[13] M. B. Kursa, "Robustness of Random Forest-based gene selection methods," BMC Bioinformatics, vol. 15 no. 1, DOI: 10.1186/1471-2105-15-8, 2014.
[14] Z. Cai, D. Xu, Q. Zhang, J. Zhang, S.-M. Ngai, J. Shao, "Classification of lung cancer using ensemble-based feature selection and machine learning methods," Molecular BioSystems, vol. 11 no. 3, pp. 791-800, DOI: 10.1039/C4MB00659C, 2015.
[15] Y.-C. Chen, W.-C. Ke, H.-W. Chiu, "Risk classification of cancer survival using ANN with gene expression data from multiple laboratories," Computers in Biology and Medicine, vol. 48, DOI: 10.1016/j.compbiomed.2014.02.006, 2014.
[16] Y. Kong, T. Yu, "A deep neural network model using random forest to extract feature representation for gene expression data classification," Scientific Reports, vol. 8 no. 1, article 16477, DOI: 10.1038/s41598-018-34833-6, 2018.
[17] C.-H. Hsieh, R.-H. Lu, N.-H. Lee, W.-T. Chiu, M.-H. Hsu, Y.-C. (J.) Li, "Novel solutions for an old disease: diagnosis of acute appendicitis with random forest, support vector machines, and artificial neural networks," Surgery, vol. 149 no. 1, pp. 87-93, DOI: 10.1016/j.surg.2010.03.023, 2011.
[18] Z. Zhang, L. Chen, B. Humphries, R. Brien, M. S. Wicha, K. E. Luker, G. D. Luker, Y.-C. Chen, E. Yoon, "Morphology-based prediction of cancer cell migration using an artificial neural network and a random decision forest," Integrative Biology, vol. 10 no. 12, pp. 758-767, DOI: 10.1039/C8IB00106E, 2018.
[19] R. Janßen, J. Zabel, U. von Lukas, M. Labrenz, "An artificial neural network and Random Forest identify glyphosate-impacted brackish communities based on 16S rRNA amplicon MiSeq read counts," Marine Pollution Bulletin, vol. 149, DOI: 10.1016/j.marpolbul.2019.110530, 2019.
[20] M. N. Shafiee, C. Seedhouse, N. Mongan, C. Chapman, S. Deen, J. Abu, W. Atiomo, "Up-regulation of genes involved in the insulin signalling pathway (IGF1, PTEN and IGFBP1) in the endometrium may link polycystic ovarian syndrome and endometrial cancer," Molecular and Cellular Endocrinology, vol. 424, pp. 94-101, DOI: 10.1016/j.mce.2016.01.019, 2016.
[21] L. A. Owens, S. G. Kristensen, A. Lerner, G. Christopoulos, S. Lavery, A. C. Hanyaloglu, K. Hardy, C. Yding Andersen, S. Franks, "Gene expression in granulosa cells from small antral follicles from women with or without polycystic ovaries," The Journal of Clinical Endocrinology and Metabolism, vol. 104 no. 12, pp. 6182-6192, DOI: 10.1210/jc.2019-00780, 2019.
[22] W. E. Johnson, C. Li, A. Rabinovic, "Adjusting batch effects in microarray expression data using empirical Bayes methods," Biostatistics, vol. 8 no. 1, pp. 118-127, DOI: 10.1093/biostatistics/kxj037, 2007.
[23] M. E. Ritchie, B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, G. K. Smyth, "limma powers differential expression analyses for RNA-sequencing and microarray studies," Nucleic Acids Research, vol. 43 no. 7, article e47, DOI: 10.1093/nar/gkv007, 2015.
[24] W. Li, "Volcano plots in analyzing differential expressions with mRNA microarrays," Journal of Bioinformatics and Computational Biology, vol. 10 no. 6, DOI: 10.1142/S0219720012310038, 2012.
[25] G. Yu, L.-G. Wang, Y. Han, Q.-Y. He, "clusterProfiler: an R package for comparing biological themes among gene clusters," OMICS: A Journal of Integrative Biology, vol. 16 no. 5, pp. 284-287, DOI: 10.1089/omi.2011.0118, 2012.
[26] W. Walter, F. Sánchez-Cabo, M. Ricote, "GOplot: an R package for visually combining expression data with functional analysis," Bioinformatics, vol. 31 no. 17, pp. 2912-2914, DOI: 10.1093/bioinformatics/btv300, 2015.
[27] L. Breiman, "Random forests," Machine Learning, vol. 45 no. 1, pp. 5-32, DOI: 10.1023/A:1010933404324, 2001.
[28] F. Günther, S. Fritsch, "neuralnet: training of neural networks," The R Journal, vol. 2 no. 1, pp. 30-38, DOI: 10.32614/RJ-2010-006, 2010.
[29] X. Robin, N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J.-C. Sanchez, M. Müller, "pROC: an open-source package for R and S+ to analyze and compare ROC curves," BMC Bioinformatics, vol. 12 no. 1, DOI: 10.1186/1471-2105-12-77, 2011.
[30] A. A. Tabl, A. Alkhateeb, W. ElMaraghy, L. Rueda, A. Ngom, "A machine learning approach for identifying gene biomarkers guiding the treatment of breast cancer," Frontiers in Genetics, vol. 10, DOI: 10.3389/fgene.2019.00256, 2019.
[31] D. Wang, J. R. Li, Y. H. Zhang, L. Chen, T. Huang, Y. D. Cai, "Identification of differentially expressed genes between original breast cancer and xenograft using machine learning algorithms," Genes, vol. 9 no. 3, DOI: 10.3390/genes9030155, 2018.
[32] C. Wang, W. Pu, D. Zhao, Y. Zhou, T. Lu, S. Chen, Z. He, X. Feng, Y. Wang, C. Li, S. Li, L. Jin, S. Guo, J. Wang, M. Wang, "Identification of hyper-methylated tumor suppressor genes-based diagnostic panel for esophageal squamous cell carcinoma (ESCC) in a Chinese Han population," Frontiers in Genetics, vol. 9, DOI: 10.3389/fgene.2018.00356, 2018.
[33] Y. Zhang, J. T. C. Tseng, I. C. Lien, F. Li, W. Wu, H. Li, "mRNAsi index: machine learning in mining lung adenocarcinoma stem cell biomarkers," Genes, vol. 11 no. 3, DOI: 10.3390/genes11030257, 2020.
[34] D. K. Meena, D. M. Manimekalai, S. Rethinavalli, "A novel framework for filtering the PCOS attributes using data mining techniques," International Journal of Engineering Research & Technology, vol. 4 no. 1, pp. 702-706, 2015.
[35] B. Vikas, B. Anuhya, K. S. Bhargav, S. Sarangi, M. Chilla, "Application of the apriori algorithm for prediction of Polycystic Ovarian Syndrome (PCOS)," Information Systems Design and Intelligent Applications, pp. 934-944, 2018.
[36] X. Z. Zhang, Y. L. Pang, X. Wang, Y. H. Li, "Computational characterization and identification of human polycystic ovary syndrome genes," Scientific Reports, vol. 8 no. 1, article 12949, DOI: 10.1038/s41598-018-31110-4, 2018.
[37] J. J. Cheng, S. Mahalingaiah, "Data mining polycystic ovary morphology in electronic medical record ultrasound reports," Fertility Research and Practice, vol. 5 no. 1, DOI: 10.1186/s40738-019-0067-7, 2019.
[38] C.-H. Ho, C.-M. Chang, H.-Y. Li, H.-Y. Shen, F.-K. Lieu, P. S.-G. Wang, "Dysregulated immunological and metabolic functions discovered by a polygenic integrative analysis for PCOS," Reproductive BioMedicine Online, vol. 40 no. 1, pp. 160-167, DOI: 10.1016/j.rbmo.2019.09.011, 2020.
[39] M. Jesintha Mary, U. Vetrivel, D. Munuswamy, V. Melanathuru, "PCOSDB: PolyCystic Ovary Syndrome Database for manually curated disease associated genes," Bioinformation, vol. 12 no. 1, DOI: 10.6026/97320630012004, 2016.
[40] S. Joseph, R. S. Barai, R. Bhujbalrao, S. Idicula-Thomas, "PCOSKB: A KnowledgeBase on genes, diseases, ontology terms and biochemical pathways associated with PolyCystic Ovary Syndrome," Nucleic Acids Research, vol. 44 no. D1, pp. D1032-D1035, DOI: 10.1093/nar/gkv1146, 2016.
[41] T. S. Domingues, T. C. Bonetti, D. C. Pimenta, D. O. C. Mariano, B. Barros, A. P. Aquino, E. L. A. Motta, "Proteomic profile of follicular fluid from patients with polycystic ovary syndrome (PCOS) submitted to in vitro fertilization (IVF) compared to oocyte donors," JBRA Assisted Reproduction, vol. 23 no. 4, pp. 367-391, DOI: 10.5935/1518-0557.20190041, 2019.
[42] C. Lu, X. Liu, L. Wang, N. Jiang, J. Yu, X. Zhao, H. Hu, S. Zheng, X. Li, G. Wang, "Integrated analyses for genetic markers of polycystic ovary syndrome with 9 case-control studies of gene expression profiles," Oncotarget, vol. 8 no. 2, pp. 3170-3180, DOI: 10.18632/oncotarget.13881, 2017.
[43] E. Jansen, J. S. E. Laven, H. B. R. Dommerholt, J. Polman, C. van Rijt, C. van den Hurk, J. Westland, S. Mosselman, B. C. J. M. Fauser, "Abnormal gene expression profiles in human ovaries from polycystic ovary syndrome patients," Molecular Endocrinology, vol. 18 no. 12, pp. 3050-3063, DOI: 10.1210/me.2004-0074, 2004.
[44] D. Haouzi, S. Assou, C. Monzo, C. Vincens, H. Dechaud, S. Hamamah, "Altered gene expression profile in cumulus cells of mature MII oocytes from patients with polycystic ovary syndrome," Human Reproduction, vol. 27 no. 12, pp. 3523-3530, DOI: 10.1093/humrep/des325, 2012.
[45] Z. G. Ouandaogo, N. Frydman, L. Hesters, S. Assou, D. Haouzi, H. Dechaud, R. Frydman, S. Hamamah, "Differences in transcriptomic profiles of human cumulus cells isolated from oocytes at GV, MI and MII stages after in vivo and in vitro oocyte maturation," Human Reproduction, vol. 27 no. 8, pp. 2438-2447, DOI: 10.1093/humrep/des172, 2012.
[46] H. Liu, L. Zeng, K. Yang, G. Zhang, "A network pharmacology approach to explore the pharmacological mechanism of xiaoyao powder on anovulatory infertility," Evidence-Based Complementary and Alternative Medicine, vol. 2016, article 2960372, DOI: 10.1155/2016/2960372, 2016.
[47] A. S. Ambekar, D. S. Kelkar, S. M. Pinto, R. Sharma, I. Hinduja, K. Zaveri, A. Pandey, T. S. K. Prasad, H. Gowda, S. Mukherjee, "Proteomics of follicular fluid from women with polycystic ovary syndrome suggests molecular defects in follicular development," The Journal of Clinical Endocrinology & Metabolism, vol. 100 no. 2, pp. 744-753, DOI: 10.1210/jc.2014-2086, 2015.
[48] V. Skov, D. Glintborg, S. Knudsen, T. Jensen, T. A. Kruse, Q. Tan, K. Brusgaard, H. Beck-Nielsen, K. Højlund, "Reduced expression of nuclear-encoded genes involved in mitochondrial oxidative metabolism in skeletal muscle of insulin-resistant women with polycystic ovary syndrome," Diabetes, vol. 56 no. 9, pp. 2349-2355, DOI: 10.2337/db07-0275, 2007.
[49] E. Nilsson, A. Benrick, M. Kokosar, A. Krook, E. Lindgren, T. Källman, M. M. Martis, K. Højlund, C. Ling, E. Stener-Victorin, "Transcriptional and epigenetic changes influencing skeletal muscle metabolism in women with polycystic ovary syndrome," The Journal of Clinical Endocrinology & Metabolism, vol. 103 no. 12, pp. 4465-4477, DOI: 10.1210/jc.2018-00935, 2018.
[50] J. Qiao, L. Wang, R. Li, X. Zhang, "Microarray evaluation of endometrial receptivity in Chinese women with polycystic ovary syndrome," Reproductive Biomedicine Online, vol. 17 no. 3, pp. 425-435, DOI: 10.1016/S1472-6483(10)60228-3, 2008.
[51] H. Xu, Y. Han, J. Lou, H. Zhang, Y. Zhao, B. Győrffy, R. Li, "PDGFRA, HSD17B4 and HMGB2 are potential therapeutic targets in polycystic ovarian syndrome and breast cancer," Oncotarget, vol. 8 no. 41, pp. 69520-69526, DOI: 10.18632/oncotarget.17846, 2017.
[52] J. R. Wood, V. L. Nelson-Degrave, E. Jansen, J. M. McAllister, S. Mosselman, J. F. Strauss, "Valproate-induced alterations in human theca cell gene expression: clues to the association between valproate use and metabolic side effects," Physiological Genomics, vol. 20 no. 3, pp. 233-243, DOI: 10.1152/physiolgenomics.00193.2004, 2005.
[53] S. Kenigsberg, Y. Bentov, V. Chalifa-Caspi, G. Potashnik, R. Ofir, O. S. Birk, "Gene expression microarray profiles of cumulus cells in lean and overweight-obese polycystic ovary syndrome patients," Molecular Human Reproduction, vol. 15 no. 2, pp. 89-103, DOI: 10.1093/molehr/gan082, 2009.
[54] Z. Wang, M. Gerstein, M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics," Nature Reviews Genetics, vol. 10 no. 1, pp. 57-63, DOI: 10.1038/nrg2484, 2009.
Copyright © 2020 Ning-Ning Xie et al. This is an open access article distributed under the Creative Commons Attribution License (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. http://creativecommons.org/licenses/by/4.0/
Abstract
Polycystic ovary syndrome (PCOS) is one of the most common metabolic and reproductive endocrinopathies. However, few studies have tried to develop a diagnostic model based on gene biomarkers. In this study, we applied a computational method combining two machine learning algorithms, random forest (RF) and artificial neural network (ANN), to identify gene biomarkers and construct a diagnostic model. We collected gene expression data from the Gene Expression Omnibus (GEO) database, comprising 76 PCOS samples and 57 normal samples; five datasets were used: one for screening differentially expressed genes (DEGs), two for training, and two for validation. First, using RF, 12 key genes among 264 DEGs were identified as vital for classifying PCOS and normal samples. Next, the weights of these key genes were calculated using ANN with the microarray and RNA-seq training datasets, respectively. The resulting diagnostic models for the two types of datasets were named neuralPCOS. Finally, the two validation datasets were used to test neuralPCOS and compare its performance with that of two other sets of marker genes by area under the curve (AUC). Our model achieved an AUC of 0.7273 in the microarray dataset and 0.6488 in the RNA-seq dataset. To conclude, we uncovered gene biomarkers and developed a novel diagnostic model of PCOS, which would be helpful for diagnosis.
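As an illustration of the AUC metric used above to compare models, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `auc_score` and the toy labels and scores are hypothetical and chosen only for demonstration. The sketch relies on the standard identity that the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    labels: sequence of 0/1 values (here 1 = PCOS, 0 = normal).
    scores: sequence of classifier outputs, higher meaning "more likely PCOS".
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, NOT taken from the study.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
toy_auc = auc_score(toy_labels, toy_scores)
print(round(toy_auc, 4))
```

In this toy case, 7 of the 9 positive-negative pairs are ranked correctly, so the AUC is 7/9. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the reported values of 0.7273 and 0.6488.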
1 Women’s Hospital, School of Medicine, Zhejiang University, Hangzhou 310006, China
2 College of Food Science and Biotechnology, Zhejiang Gongshang University, Hangzhou 310018, China
3 Zhejiang Chinese Medical University, Hangzhou 310053, China