Introduction
The diagnosis of Parkinson’s Disease (PD) traditionally relies on clinical assessments focused on an individual’s motor symptoms1. These methods, while effective, often miss the subtle early symptoms of the disease, leading to delayed intervention2. The situation is further exacerbated by limited access to specialized neurological care, particularly in regions with very low ratios of neurologists to population. For instance, Bangladesh had only 86 neurologists for over 140 million people in 20143, while some African nations had one neurologist per three million people, with 21 countries having fewer than five neurologists each4. Given the expected doubling of PD cases by 20305, there is a pressing need for accessible, home-based diagnostic solutions to address global disparities in healthcare access.
Recent advancements have seen a shift towards integrating digital biomarkers into automated, AI-based at-home PD detection and progression-tracking tools6, 7, 8, 9–10. Techniques range from sensor-based collection of nocturnal breathing signals9 and accelerometric data11 to digital analysis of facial expressions12. However, wearables and sensors may be inconvenient for the elderly13,14, and posed expressions can miss subtle diagnostic cues.
Alternatively, speech analysis offers a non-invasive route, leveraging natural speech patterns for PD detection. Traditional speech analysis in PD has primarily relied on sustained phonation tasks6,15, 16, 17, 18, 19, 20–21, which, although useful, do not reflect the complexities of natural speech. To counter this, studies have proposed building PD classifiers from continuous speech using technologies such as CNNs22, time-frequency analysis23, and SVMs24. However, these studies rely on fixed recording setups and small sample sizes, which limits the generalizability of the models and fails to adequately address accessibility concerns. Even with larger datasets, models like CNNs and SVMs face structural limitations. CNNs, while powerful for feature extraction, are primarily designed for spatial data, and their application to time-series or speech data can be challenging unless properly adapted25,26. They may require deep architectures to capture complex temporal dependencies in PD speech data, increasing the risk of overfitting if the model is not properly regularized or if the dataset lacks sufficient variability to cover real-world...