Introduction
Speech disorders significantly impair communication abilities, drastically diminishing the quality of life of affected individuals [1–3]. These disorders can stem from various causes, including congenital conditions, neurological damage, diseases, or other disorders [4–10]. Consequently, medical professionals have traditionally used vocal changes to diagnose the origins of speech disorders. Initial assessments relied on the auditory judgments of examiners; however, advancements in science and technology have enabled more precise acoustical analysis [11,12]. This progress has enhanced the quantification and accuracy of diagnoses, although limitations remain in fully understanding and interpreting complex voice signals. Recently, various studies have leveraged artificial intelligence (AI) to diagnose the causes of speech disorders through acoustical analysis of the voice. AI has demonstrated the potential to deliver diagnoses that are more accurate than those made by medical specialists or to support specialists in making more informed diagnostic decisions [13–15].
In 2023, Kim et al. extracted features using Mel-frequency cepstral coefficients (MFCC) from the voice data of 230 healthy individuals and 230 stroke patients and applied them to a convolutional neural network (CNN), achieving an accuracy of 99.60% [16]. In 2020, Singh and Xu extracted features using MFCC from the voice data of 860 healthy individuals and 140 Parkinson’s patients, applying the data to a support vector machine (SVM) and achieving an accuracy of 99.00% [17]. In 2018, Wu et al. achieved an accuracy of 77.00% by extracting features via the short-time Fourier transform (STFT) from the voice data of 482 healthy individuals and 482 patients with various speech disorders and applying them to a 2D-CNN [18]. Using both 2D-CNN and long short-term memory (LSTM) approaches, Syed et al. compared and researched the application of features extracted using MFCC on voice data from 1600 healthy individuals and 400 patients experiencing various speech disorders, achieving an accuracy of 97.11% using 2D-CNN [19]. Shih et al. extracted features using STFT and MFCC and applied them to convolutional neural network-gated recurrent units (CNN-GRU), CNN, and LSTM for comparison; they achieved an accuracy of 98.38% using CNN-GRU [20].
Despite the attainment of high accuracy rates, existing research is limited. Most studies are based on small datasets and primarily focus on comparing voice data between normal individuals and those experiencing speech disorders. Consequently, classification tasks have often focused on a single cause among the various origins of speech disorders, such as laryngeal cancer or Parkinson’s disease.
This study aims to overcome these limitations by utilizing speech disorder data collected through scripted recordings from patients with stroke, a representative neurological disease, and from patients with hearing impairment (HI), and by classifying whether a speech disorder is caused by stroke or HI using a CNN-based algorithm.
Materials and methods
Experimental setup
The research environment for this study was established using a system equipped with two NVIDIA RTX 2080 Ti GPUs (NVIDIA, Santa Clara, CA, USA), 329 GB RAM, and a 22-core Intel(R) Xeon(R) Gold 6238 CPU @ 2.10 GHz (Intel, Santa Clara, CA, USA). The system operated on the Ubuntu LTS operating system (version 16.04.7) within a Python environment (version 3.6.10). Data preprocessing was conducted using the Librosa (version 0.9.2) and Pandas (version 1.1.5) libraries, while model training was performed using the TensorFlow library (version 2.6.2). Additionally, statistical analysis and evaluation were conducted using libraries such as Sklearn (version 0.24.2), Imblearn (version 0.8.1), and Matplotlib (version 3.3.4), alongside the R environment (version 4.3.3), utilizing the pROC package (version 1.18.5) for Receiver Operating Characteristic (ROC) curve analysis.
Data
This study was designed as a retrospective analysis using pre-existing clinical voice data collected from patients with stroke and hearing impairment at Ewha Womans University Seoul Hospital. The institutional review board of Ewha Womans University Seoul Hospital (IRB Number: SEUMC 2022-03-012) approved this study; the requirement of informed consent was waived due to the retrospective nature of the study design. The data collection targeted 250 patients with stroke and 250 patients with HI, both representing potential congenital or acquired causes of speech disorders. Detailed information on the patients’ gender, age, diagnosis, intelligibility, and degree of disability is summarized in Table 1.
[Figure omitted. See PDF.]
Recordings were made in a lossless format to ensure high quality. In collaboration with medical personnel and HI specialists, patients were provided with 702 types of diverse scripts that one might encounter in daily life. These scripts were presented as full sentences to preserve the continuous characteristics of speech and were recorded, resulting in a total of 2,000 voice recordings. Each recording was assigned a ground-truth label of either stroke or HI based on clinical diagnoses confirmed by certified neurologists and audiologists at Ewha Womans University Seoul Hospital. To ensure the reliability of these labels, an independent review was conducted by speech-language pathologists who cross-checked the diagnostic information against the observed speech characteristics, such as nasalization and consonant weakening for stroke patients and syllable segmentation for HI patients. This multi-step verification process minimized labeling errors and ensured that each voice sample accurately represented its corresponding disorder category. The recordings were collected in soundproof speech therapy rooms and underwent a comprehensive data review, cleansing, and processing procedure. In particular, considering the speech characteristics of severely disabled individuals, voice data were cleansed and processed in collaboration with speech therapists based on the International Phonetic Alphabet. The scripts used for the recordings are listed in Table 2.
[Figure omitted. See PDF.]
Waveforms for comparison between voice data are depicted in Fig 1. In the case of individuals with speech disorders, two notable characteristics are observed: (1) nasalization, where the airflow from the vocal tract resonates simultaneously in the nasal and oral cavities, causing syllables to merge and sound muffled; and (2) consonant weakening, where the boundaries between consonants and vowels become unclear. Additionally, in the case of HI, the characteristic of syllable segmentation is more pronounced compared with stroke patients.
[Figure omitted. See PDF.]
(a) Normal Speech Waveform, (b) Stroke Patient’s Waveform, (c) Hearing-Impaired (HI) Patient’s Waveform.
The 2,000 recorded voice samples were divided into training, validation, and testing datasets at a ratio of 8:1:1, resulting in 1,600 recordings for training, 200 for validation, and 200 for testing.
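For illustration, such a recording-level 8:1:1 split can be sketched with scikit-learn as below; the stratification by diagnosis, the fixed random seed, and the variable names (file_paths, labels) are assumptions rather than details reported in the paper.

```python
# Hypothetical sketch of an 8:1:1 recording-level split (1,600 / 200 / 200).
# Stratification and the random seed are assumptions, not stated in the paper.
from sklearn.model_selection import train_test_split

# file_paths: list of 2,000 recording paths; labels: 0 = stroke, 1 = HI (illustrative)
train_files, hold_files, train_labels, hold_labels = train_test_split(
    file_paths, labels, test_size=0.2, stratify=labels, random_state=42)
val_files, test_files, val_labels, test_labels = train_test_split(
    hold_files, hold_labels, test_size=0.5, stratify=hold_labels, random_state=42)
```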
Data preprocessing
As illustrated in Fig 2, data preprocessing began with resampling all voice data to 16 kHz, in accordance with the Nyquist–Shannon sampling theorem. To ensure consistency in data handling, all voice recordings were uniformly segmented to a length of 4 s. Recordings shorter than 2 s were excluded from the analysis, and segments between 2 and 4 s were extended to the uniform length by zero padding, thereby enhancing signal stability and supporting consistent frequency transformation and resolution. Additionally, to address variations in the distance between the recorder and the patient during recording sessions, voice signals were normalized to the range of -1 to 1 and subsequently rescaled to the range of 0 to 1 to improve model training efficiency. Through this preprocessing, we ultimately secured 2,914 segments for training, 382 for validation, and 378 for testing.
[Figure omitted. See PDF.]
* Data cleaning performed using Audacity. ** Resampling to 16,000 Hz; normalization to a range of -1 to 1; segmentation into 4-s intervals with zero-padding, resulting in 3,674 segments of 64,000 samples each; normalization to a range of 0 to 1 applied to each segment; creation of a single NumPy file for training. *** 1D-CNN for data-driven feature extraction and training using the ResNet-18, Inception V3, and SEResNeXt-18 models.
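The preprocessing pipeline summarized in Fig 2 can be outlined roughly as follows. This is a minimal sketch using Librosa and NumPy; details such as how 4-s windows are cut from longer recordings and how trailing segments are handled are assumptions not specified in the text.

```python
# Rough sketch of the preprocessing described above: resample to 16 kHz,
# exclude clips shorter than 2 s, cut 4-s segments (64,000 samples) with
# zero-padding, peak-normalize to [-1, 1], then rescale to [0, 1].
# Windowing details are assumptions; the paper does not specify them.
import numpy as np
import librosa

TARGET_SR = 16000                 # resampling rate (Hz)
SEGMENT_LEN = 4 * TARGET_SR       # 4-s segments -> 64,000 samples
MIN_LEN = 2 * TARGET_SR           # recordings/segments shorter than 2 s are dropped

def preprocess(path):
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)   # resample to 16 kHz
    if len(y) < MIN_LEN:
        return []                                         # shorter than 2 s: excluded
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak                                      # normalize to [-1, 1]
    segments = []
    for start in range(0, len(y), SEGMENT_LEN):           # contiguous 4-s windows (assumed)
        seg = y[start:start + SEGMENT_LEN]
        if len(seg) < MIN_LEN:
            continue                                      # drop trailing stubs under 2 s
        seg = np.pad(seg, (0, SEGMENT_LEN - len(seg)))    # zero-pad to 64,000 samples
        segments.append(((seg + 1.0) / 2.0).astype(np.float32))  # rescale to [0, 1]
    return segments
```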
Models for classifying the causes of speech disorder using voice data
Feature extraction, as shown in Fig 2, was performed using a 1D-CNN model by applying convolution operations directly to the voice data. This study adopted the 1D-CNN approach because existing research indicates that 1D-CNNs outperform approaches based on recurrent neural networks (RNNs) such as LSTM, as well as MFCC-based methods and 2D-CNNs that use spectrograms for voice data processing [21,22].
The model incorporated three architectures: ResNet, SEResNeXt, and Inception. The ResNet model employed by He et al. in 2016 addressed the vanishing gradient problem in deep neural networks through residual connections [23]. As illustrated by Szegedy et al., the Inception network enables the efficient extraction of spatial information at multiple levels using filters of various dimensions [24]. A set of transformations with the same topology defines the building blocks of the ResNeXt architecture, a modular variation of ResNet, which implements a multi-branch architecture based on an additional dimension known as cardinality [25]. The SEResNeXt architecture is built upon ResNeXt with the integration of squeeze-and-excitation (SE) blocks, enhancing the network’s representational power through dynamic channel-wise feature recalibration [26].
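As a rough illustration of the building blocks just described, the sketch below implements a 1D residual block with an optional squeeze-and-excitation stage in Keras; the filter sizes, reduction ratio, and layer arrangement are assumptions, since the paper does not publish its exact 1D adaptations of these architectures.

```python
# Illustrative 1D residual block with an optional squeeze-and-excitation (SE)
# stage; hyperparameters are assumptions, not the authors' exact architecture.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block_1d(x, filters, use_se=False, reduction=16):
    shortcut = x
    y = layers.Conv1D(filters, kernel_size=3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(filters, kernel_size=3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if use_se:                                            # channel-wise recalibration
        s = layers.GlobalAveragePooling1D()(y)            # squeeze
        s = layers.Dense(filters // reduction, activation="relu")(s)
        s = layers.Dense(filters, activation="sigmoid")(s)  # excitation weights
        s = layers.Reshape((1, filters))(s)
        y = layers.Multiply()([y, s])
    if shortcut.shape[-1] != filters:                     # match channels on the skip path
        shortcut = layers.Conv1D(filters, kernel_size=1, padding="same")(shortcut)
    y = layers.Add()([y, shortcut])                       # residual connection
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(64000, 1))                 # one 4-s segment (64,000 samples)
x = residual_block_1d(inputs, filters=64, use_se=True)
```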
Hyperparameters were set with a batch size of 16 and a learning rate of 0.0001. The ReduceLROnPlateau algorithm was applied to optimize performance by gradually decreasing the learning rate if the model plateaued during training. The Adam optimizer was used to train the model over 200 epochs, and an early stopping algorithm using the binary cross-entropy loss function was introduced to prevent overfitting.
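A minimal Keras training configuration consistent with these hyperparameters might look like the sketch below; the ReduceLROnPlateau factor and the patience values are assumptions, as they are not reported, and model, x_train, and related names stand in for the assembled 1D-CNN and the preprocessed arrays.

```python
# Sketch of the training setup: Adam (lr 1e-4), binary cross-entropy,
# batch size 16, up to 200 epochs, ReduceLROnPlateau and early stopping.
# Factor and patience values are assumptions; they are not reported in the paper.
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    batch_size=16,
    epochs=200,
    callbacks=callbacks,
)
```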
Results
In this study, the performance of the trained models was validated against a separately constructed test dataset. The evaluation metrics employed were sensitivity, specificity, and accuracy, calculated from the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts in the confusion matrix. Additionally, the area under the curve (AUC) of the ROC curve was used as a performance metric. DeLong's test was used to verify the statistical significance of performance differences between the models. A comparative analysis among the three models was conducted based on the versions that exhibited the best performance.
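These metrics can be computed from the test-set predictions roughly as follows; the 0.5 decision threshold and the variable names are assumptions, and DeLong's test itself was performed separately in R with the pROC package.

```python
# Sketch of the evaluation metrics: sensitivity, specificity, accuracy from the
# confusion matrix, and AUC from predicted probabilities (threshold 0.5 assumed).
from sklearn.metrics import confusion_matrix, roc_auc_score

y_prob = model.predict(x_test).ravel()        # predicted probability of the positive class
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
auc = roc_auc_score(y_test, y_prob)           # area under the ROC curve
```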
As shown in Table 3, all models demonstrated strong performance across the evaluation metrics. In terms of sensitivity, ResNet-18 recorded 77.53%, Inception V3 93.26%, and SEResNeXt-18 87.64%. ResNet-18 exhibited a specificity of 82.00%, Inception V3 84.50%, and SEResNeXt-18 84.00%. ResNet-18 achieved an accuracy of 79.89%, Inception V3 88.62%, and SEResNeXt-18 85.71%. Overall, based on the evaluation metrics derived from the confusion matrix, Inception V3 demonstrated the highest performance among the three architectures.
[Figure omitted. See PDF.]
Fig 3 presents the confusion matrices of each model, illustrating the detailed classification performance for stroke and HI patients in terms of true positive, true negative, false positive, and false negative counts. Notably, Inception V3 (Fig 3b) produced markedly fewer false negatives, with 12 cases, compared with 40 for ResNet-18 (Fig 3a) and 22 for SEResNeXt-18 (Fig 3c). Inception V3 also produced fewer false positives, with 31 cases, compared with 36 for ResNet-18 and 32 for SEResNeXt-18. The comparison of the confusion matrices provides an additional layer of insight into how each model distributes its classification errors across the two classes.
[Figure omitted. See PDF.]
(a) ResNet-18, (b) Inception V3, (c) SEResNeXt-18.
Fig 4 shows a comparison of performance in terms of AUC, confirming that the models effectively classify both stroke and HI. ResNet-18 recorded an AUC of 0.839 with a standard deviation of 0.038 and a 95% confidence interval of 0.800–0.881, the lowest figures among the three models. Inception V3 demonstrated the highest performance, with an AUC of 0.913, a standard deviation of 0.019, and a 95% confidence interval of 0.881–0.947. SEResNeXt-18 exhibited intermediate performance, with an AUC of 0.906, a standard deviation of 0.009, and a 95% confidence interval of 0.872–0.939.
[Figure omitted. See PDF.]
As shown in Table 4, the comparison of AUC between models using DeLong's test revealed that the performance difference between ResNet-18 and each of the other two models was statistically significant (p < 0.05). However, the performance difference between Inception V3 and SEResNeXt-18 was not statistically significant (p = 0.227).
[Figure omitted. See PDF.]
Discussion
This paper presents a study on classifying stroke and HI using speech disorder voice data. We acquired high-quality source data and engaged in data cleansing and processing, which included effective noise removal. Subsequently, we performed data preprocessing steps such as normalization, zero-padding, and segmentation. Utilizing a 1D convolution layer, we automatically extracted and learned the local characteristics of complex waveform patterns from raw voice data. This model-based approach afforded remarkable classification performance, with accuracy scores of 79.89% for ResNet-18, 88.62% for Inception V3, and 85.71% for SEResNeXt-18. The AUC scores of 0.839 for ResNet-18, 0.913 for Inception V3, and 0.906 for SEResNeXt-18 confirm the models' effective differentiation between speech disorders caused by stroke and HI.
However, despite all models demonstrating strong performance, there were notable differences in their classification accuracy, which can be attributed to the distinct architectural characteristics of each model. ResNet-18, while benefiting from residual connections that mitigate the vanishing gradient problem and enable deeper network training, is primarily optimized for simplicity and computational efficiency [23]. As a result, it may not capture the subtle, multi-scale features present in complex datasets such as speech disorder data, which require a more nuanced approach to feature extraction. ResNet-18's lower performance may therefore be linked to its reliance on standard convolutional filters that do not adapt as dynamically to varying spatial patterns, making it less suited to highly variable speech signals.
Similarly, SEResNeXt-18, which incorporates squeeze-and-excitation (SE) blocks to adaptively recalibrate feature maps and enhance sensitivity to important features, improved on ResNet-18 but still fell short of Inception V3. While SEResNeXt-18 excels at focusing on critical features by adjusting the significance of channel-wise information, its relatively narrower network structure and limited ability to analyze complex patterns at multiple scales may hinder its performance on highly heterogeneous speech patterns. The SE block's recalibration is effective, but it may not be sufficient for capturing the complex temporal dynamics in voice data [25,26].
In contrast, Inception V3's superior performance can be attributed to its distinctive multi-branch architecture, which uses multiple convolutional filter sizes within a single layer to capture diverse spatial hierarchies. This design allows Inception V3 to learn both fine-grained and coarse features simultaneously, making it particularly well-suited to the intricate, multi-scale nature of speech disorder data. Moreover, Inception V3 employs factorized convolutions and auxiliary classifiers, which help reduce computational complexity while preserving its capacity to analyze deep feature representations. This combination of deep feature extraction and efficient computation enables the model to maintain high classification accuracy even in the presence of subtle variations in speech patterns caused by stroke and HI [24]. Consequently, Inception V3's ability to balance depth, width, and multi-scale feature extraction made it the most effective model for this task, demonstrating not only the highest overall accuracy but also the most balanced performance in differentiating between the stroke and HI classes.
In addition to each model's structural characteristics, the number of parameters likely also contributed to the observed performance differences. Within the constraints of our experimental environment, this study selected the highest-performing variant from each model family (ResNet-18, Inception V3, and SEResNeXt-18) for comparison across models. ResNet-18 has 925,697 parameters, Inception V3 has 12,345,761, and SEResNeXt-18 has 1,308,033, with Inception V3 containing the highest parameter count among the models. In the AUC comparison, DeLong's test showed a statistically significant difference between ResNet-18 and the other two models (p < 0.05), while no significant difference was observed between Inception V3 and SEResNeXt-18 (p = 0.227). These findings further support the suitability of Inception V3's balanced architecture and higher parameter count for complex speech classification tasks [27].
This research is significant for utilizing a larger dataset compared with those used in previous studies and achieving excellent performance across various evaluation metrics. It excels in accurately classifying speech disorder voices caused by stroke and HI rather than merely distinguishing between normal and abnormal speakers. Unlike previous research that primarily focused on the phonation of the vowel/a/, this study utilized entire scripts, allowing for a more comprehensive analysis of speech patterns and characteristics such as pitch breaks, voice fatigue, and voice breaks, which is expected to enhance accuracy and flexibility in screening diagnoses [28,29].
However, this study is limited to Korean voice data and focuses only on stroke and HI among the various causes of speech disorders, which indicates a lack of diversity in the data. Toward the development of multi-class or cause-specific classification models, future research should build datasets suitable for various languages and cultures and secure extensive voice datasets covering diverse causes of speech disorders, such as amyotrophic lateral sclerosis, Parkinson's disease, Down syndrome, and multiple sclerosis.
This paper presents significant results in experimentally classifying the causes of speech disorders stemming from stroke and HI using speech disorder voice data. The developed classification models can be employed for the early detection and diagnosis of speech disorders, potentially mitigating symptom severity and enhancing the effectiveness of treatments. In particular, the high sensitivity and balanced performance of these models make them well-suited to identifying early signs of speech deterioration, even when symptoms are subtle and not easily detectable by the human ear. This capability is crucial in clinical practice, where early intervention is often the key to preventing further progression of speech disorders. For instance, in stroke rehabilitation, prompt detection of dysarthria or related symptoms can help tailor personalized speech therapy programs, thereby improving recovery outcomes.
Moreover, these models can serve as efficient tools for remote monitoring and risk assessment. By leveraging automated voice analysis through digital devices, clinicians can track patients' speech changes over time and adjust treatment plans accordingly without requiring frequent hospital visits. This approach not only reduces the burden on healthcare facilities but also improves patient access to specialized care, especially for those in rural or medically underserved areas. Additionally, in contexts such as pandemics or when mobility is restricted, remote voice monitoring provides a safe and effective way to ensure continuous health management and intervention.
The advancement of AI technology and its integration with the medical field are expected to play a significant role in assisting medical diagnoses and improving accuracy and speed, leading to the realization of enhanced medical services. As demonstrated in this study, machine learning models can contribute to more nuanced diagnostic capabilities, potentially extending beyond speech disorders to other areas where precise pattern recognition is crucial. This evolution in diagnostic methodology could ultimately pave the way for AI-assisted systems that support clinicians in making more informed decisions, thereby optimizing patient outcomes and revolutionizing healthcare delivery.
References
1. Benninger MS, Ahuja AS, Gardner G, Grywalski C. Assessing outcomes for dysphonic patients. J Voice. 1998;12(4):540–50. pmid:9988041
2. Roy N, Merrill RM, Gray SD, Smith EM. Voice disorders in the general population: prevalence, risk factors, and occupational impact. Laryngoscope. 2005;115(11):1988–95. pmid:16319611
3. Mirza N, Ruiz C, Baum ED, Staab JP. The prevalence of major psychiatric pathologies in patients with voice disorders. Ear Nose Throat J. 2003;82(10):808–10, 812, 814. pmid:14606179
4. Novotný M, Rusz J, Čmejla R, Růžičková H, Klempíř J, Růžička E. Hypernasality associated with basal ganglia dysfunction: evidence from Parkinson’s disease and Huntington’s disease. PeerJ. 2016;4:e2530. pmid:27703866
5. Danesh-Sani SA, Rahimdoost A, Soltani M, Ghiyasi M, Haghdoost N, Sabzali-Zanjankhah S. Clinical assessment of orofacial manifestations in 500 patients with multiple sclerosis. J Oral Maxillofac Surg. 2013;71(2):290–4. pmid:22742955
6. Elliott E, Newton J, Rewaj P, Gregory JM, Tomarelli L, Colville S, et al. An epidemiological profile of dysarthria incidence and assistive technology use in the living population of people with MND in Scotland. Amyotroph Lateral Scler Frontotemporal Degener. 2020;21(1–2):116–22. pmid:31594398
7. Poole ML, Brodtmann A, Darby D, Vogel AP. Motor speech phenotypes of frontotemporal dementia, primary progressive aphasia, and progressive apraxia of speech. J Speech Lang Hear Res. 2017;60(4):897–911. pmid:28289749
8. Steurer H, Schalling E, Franzén E, Albrecht F. Characterization of mild and moderate dysarthria in Parkinson’s Disease: behavioral measures and neural correlates. Front Aging Neurosci. 2022;14:870998. pmid:35651530
9. De Cock E, Oostra K, Bliki L, Volkaerts A-S, Hemelsoet D, De Herdt V, et al. Dysarthria following acute ischemic stroke: Prospective evaluation of characteristics, type and severity. Int J Lang Commun Disord. 2021;56(3):549–57. pmid:33580596
10. Stubbs E, Togher L, Kenny B, Fromm D, Forbes M, MacWhinney B, et al. Procedural discourse performance in adults with severe traumatic brain injury at 3 and 6 months post injury. Brain Inj. 2018;32(2):167–81. pmid:29281420
11. Pruszewicz A, Obrebowski A, Swidziński P, Demeńko G, Wika T, Wojciechowska A. Usefulness of acoustic studies on the differential diagnostics of organic and functional dysphonia. Acta Otolaryngol. 1991;111(2):414–9. pmid:2068930
12. Hirano M, Hibi S, Yoshida T, Hirade Y, Kasuya H, Kikuchi Y. Acoustic analysis of pathological voice. Some results of clinical application. Acta Otolaryngol. 1988;105(5–6):432–8. pmid:3400446
13. Barlow J, Sragi Z, Rivera-Rivera G, Al-Awady A, Daşdöğen Ü, Courey MS, et al. The use of deep learning software in the detection of voice disorders: a systematic review. Otolaryngol Head Neck Surg. 2024;170(6):1531–43. pmid:38168017
14. Kim H, Jeon J, Han YJ, Joo Y, Lee J, Lee S, et al. Convolutional neural network classifies pathological voice change in laryngeal cancer with high accuracy. J Clin Med. 2020;9(11):3415. pmid:33113785
15. Hu H-C, Chang S-Y, Wang C-H, Li K-J, Cho H-Y, Chen Y-T, et al. Deep learning application for vocal fold disease prediction through voice recognition: preliminary development study. J Med Internet Res. 2021;23(6):e25247. pmid:34100770
16. Kim E-S, Shin DJ, Cho ST, Chung KJ. Artificial intelligence-based speech analysis system for medical support. Int Neurourol J. 2023;27(2):99–105. pmid:37401020
17. Singh S, Xu W. Robust detection of Parkinson’s disease using harvested smartphone voice data: a telemedicine approach. Telemed J E Health. 2020;26(3):327–34. pmid:31033397
18. Wu H, Soraghan J, Lowit A, Di-Caterina G. A deep learning method for pathological voice detection using convolutional deep belief networks. Interspeech 2018. 2018:446–50.
19. Syed SA, Rashid M, Hussain S, Zahid H. Comparative analysis of CNN and RNN for voice pathology detection. Biomed Res Int. 2021;2021:6635964. pmid:33937404
20. Shih D-H, Liao C-H, Wu T-W, Xu X-Y, Shih M-H. Dysarthria speech detection using convolutional neural networks with gated recurrent unit. Healthcare (Basel). 2022;10(10):1956. pmid:36292403
21. Hung C-H, Wang S-S, Wang C-T, Fang S-H. Using SincNet for learning pathological voice disorders. Sensors (Basel). 2022;22(17):6634. pmid:36081092
22. Mittal V, Sharma RK. Deep learning approach for voice pathology detection and classification. Int J Healthc Inf Syst Inf. 2021;16(4):1–30.
23. He KM, Zhang XY, Ren SQ, Sun J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016:770–8.
24. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016:2818–26.
25. Xie SN, Girshick R, Dollár P, Tu ZW, He KM. Aggregated residual transformations for deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017:5987–95. doi:10.1109/CVPR.2017.634
26. Hu J, Shen L, Albanie S, Sun G, Wu E. Squeeze-and-excitation networks. IEEE Trans Pattern Anal Mach Intell. 2020;42(8):2011–23. pmid:31034408
27. Iyer A, Kemp A, Rahmatallah Y, Pillai L, Glover A, Prior F, et al. A machine learning method to process voice samples for identification of Parkinson’s disease. Sci Rep. 2023;13(1):20615. pmid:37996478
28. Fujimura S, Kojima T, Okanoue Y, Shoji K, Inoue M, Omori K, et al. Classification of voice disorders using a one-dimensional convolutional neural network. J Voice. 2022;36(1):15–20. pmid:32173149
29. Fang S-H, Tsao Y, Hsiao M-J, Chen J-Y, Lai Y-H, Lin F-C, et al. Detection of pathological voice using cepstrum vectors: a deep learning approach. J Voice. 2019;33(5):634–41. pmid:29567049
Citation: Park JK, Mun SB, Kim YJ, Kim KG (2025) Deep learning-based classification of speech disorder in stroke and hearing impairment. PLoS One 20(5): e0315286. https://doi.org/10.1371/journal.pone.0315286
About the Authors:
Joo Kyung Park
Roles: Formal analysis, Methodology, Validation, Visualization, Writing – original draft
Affiliation: Department of Biomedical Engineering, College of Medicine, Gachon University, Gil Medical Center, Incheon, Republic of Korea
Sae Byeol Mun
Roles: Conceptualization, Data curation, Investigation, Methodology
Affiliation: Department of Health Sciences and Technology, Gachon Advanced Institute for Health Sciences & Technology, Gachon University, Incheon, Republic of Korea
Young Jae Kim
Roles: Investigation, Project administration, Validation, Writing – review & editing
Affiliation: Gachon Biomedical & Convergence Institute, Gachon University Gil Medical Center, Incheon, Republic of Korea
Kwang Gi Kim
Roles: Funding acquisition, Supervision, Validation, Writing – review & editing
E-mail: [email protected]
Affiliations: Department of Biomedical Engineering, College of Medicine, Gachon University, Gil Medical Center, Incheon, Republic of Korea, Department of Health Sciences and Technology, Gachon Advanced Institute for Health Sciences & Technology, Gachon University, Incheon, Republic of Korea, Gachon Biomedical & Convergence Institute, Gachon University Gil Medical Center, Incheon, Republic of Korea
ORCID: https://orcid.org/0000-0001-9714-6038
© 2025 Park et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
Background and objective
Speech disorders can arise from various causes, including congenital conditions, neurological damage, diseases, and other disorders. Traditionally, medical professionals have used changes in voice to diagnose the underlying causes of these disorders. With the advancement of artificial intelligence (AI), new possibilities have emerged in this field. However, most existing studies primarily focus on comparing voice data between normal individuals and those with speech disorders. Research that classifies the causes of these disorders within the abnormal voice data, attributing them to specific etiologies, remains limited. Therefore, our objective was to classify the specific causes of speech disorders from voice data resulting from various conditions, such as stroke and hearing impairments (HI).
Methods
We experimentally developed a deep learning model to analyze Korean speech disorder voice data caused by stroke and HI. Our goal was to classify the disorders caused by these specific conditions. To achieve effective classification, we employed the ResNet-18, Inception V3, and SEResNeXt-18 models for feature extraction and training processes.
Results
The models demonstrated promising results, with area under the curve (AUC) values of 0.839 for ResNet-18, 0.913 for Inception V3, and 0.906 for SEResNeXt-18.
Conclusions
These outcomes suggest the feasibility of using AI to efficiently classify the origins of speech disorders through the analysis of voice data.