Introduction
The use of wearable devices via smartphones and tablet personal computers (PCs) has becomes an everyday occurrence. It is estimated that more than 5 billion people have a mobile device, of which more than 60% (77% in developed countries) use a smart device.1
Smartphones and wearables help prevent disease symptoms and can provide long-term disease management through passive and short-term sensing in daily life. The devices do this by capturing the physical, mental, and social aspects of human behavior that constitute well-being.2,3 The accessibility of mobile apps makes it easy to report routine mental health ratings, such as for depression or other mood states, into ecological instantaneous assessment tools.4–9
Existing studies have demonstrated that smartphones are a pervasive computing platform that provide a tremendous opportunity to automatically detect depression using collected sensory data. Further, they can be used for effective depression screenings.10–13 For example, Kolenik and Gams have reported that intelligent cognitive assistant (ICA) technology is used in various fields to imitate human behavior expressed through language models. This technology can be individually tailored to natural language, which has a huge impact on digital mental health services. In particular, ICA can effectively support stress, anxiety, and depression (SAD) by analyzing people's emotional and cognitive phenomena.14
Depression is the most frequent in psychiatric disorder studied via mobile devices. It affects more than 300 million people worldwide.15 Approximately 10–20% of primary care visits are associated with depression, making it the second most common chronic condition observed by primary care physicians.16 However, primary care physicians identify only 50% of depression cases;16 therefore, it can remain undiagnosed, leaving many in need of treatment and putting intervention out of reach. Depression is treated with evidence-based therapeutic approaches; however, only 7% of low-income countries and 28% of high-income countries provide interventions for depressed patients.17 In addition, the early diagnosis of depression with early intervention and treatment is associated with a better prognosis.18 Thus, screening for the symptoms of depression is an important issue. For example, persuasive technology (PT) proactively intervenes to alleviate stress, anxiety, and depression (SAD), which are critical issues in mental health well-being. These technologies are economical and capable of being used over remote distances; however, their usefulness is still limited.19 Therefore, preliminary diagnosis appears as an important issue along with prior intervention.
Previous studies on depression have shown correlations between various smartphone use attributes and depressive behaviors.2,12,20–26 In addition, the development of wristbands and smartphone-embedded sensors over the past decade has provided an opportunity to objectively measure numerous characteristic symptoms of depression and facilitate the passive monitoring of behavioral indicators of low mood.27
Individual behavioral characteristics that discriminate depressive symptoms include physical activity (eg, walking, running, sleeping28), behavioral changes (eg, smartphone conversation patterns such as language fluency and intonation29), circadian activity, and social interaction. These can be detected through a smartphone sensor in association with lassitude, anesthesia, and psychomotor retardation. Depressed people, for example, appear to make fewer phone calls and search the Internet less frequently on their mobile phones.16 A mobile GPS can help assess the severity of depression by movement.28,30–32
These technologies serve as an investigation method that discriminates depressive symptoms based on the characteristics of individual behaviors derived from the monitoring sensor.33–38 They use a smartphone sensor to distinguish depressed people from non-depressed people.12,30–32,36,39 However, studies to screen for and diagnose depression based on differences in levels of depression are lacking.39,40 Therefore, a study targeting different patient groups is necessary for the diagnosis of depression.
A diagnosis of depression is traditionally performed as a paper-type self-report tool. Diagnosis through a smartphone uses a validated self-report screening tool along with passive monitoring.12,41–43 This self-report screening tool is simple, economical, and familiar to people. The Patient Health Questionnaire (PHQ) used in the self-reported depression diagnosis is a depression self-report tool. It is a new tool for the criteria-based diagnosis of depression and other psychiatric disorders commonly encountered in primary care. It is half the length of many other measures of depression, with similar sensitivity and specificity. It consists of nine real-world criteria that form the basis for the diagnosis of DSM-IV Depressive Disorder.44 It is a validated measure of depression often used as a screening tool in clinical settings.45–49 Numerous studies have demonstrated the usefulness of PHQ-9 in influencing clinical decision-making.50 Moreover, a meta-analysis has concluded that shows its superior diagnostic properties when compared with longer screening tools.51
In general, the results of self-report questionnaires, including PHQ-9, are often used as a discriminate tool for depressive symptoms.52 In particular, PHQ-9 has a diagnostic basis for DSM-IV and can be used in various fields because it consists of short questions. In addition, the level of depression indicated by the PHQ-9 score shows a high correlation with the features of depression that can be detected by a smartphone.12,42,53–55
In addition to the merits of PHQ-9 as a discriminate tool for depression, it has high sensitivity and specificity. In a primary care study PHQ-9 data, the algorithm sensitivity and specificity was 73% and 98%,56 respectively. In the validation study for the summed-item method, a PHQ-9 score of ≥10 was 88% for both sensitivity and specificity for major depressive disorder.44 In recent studies, diagnostic measures [eg, sensitivity, specificity, area under the receiver operating characteristic curve (AUC)] are implemented in machine learning by extending the existing statistical approach for the selection criteria and prediction of PHQ-9.13,47,54,57–59 Machine learning is used in psychiatry to increase the accuracy of diagnosis and prognosis and make treatment and prevention decisions.60,61 It is particularly useful for predicting human behavior, including high-risk behaviors, and it is effective for discriminating psychopathology.62,63 Machine learning studies are conducted using measures to discriminate psychopathology (eg, prediction suicide ideation, suicide attempt and behaviors, malingering, personality detecting).61,63–69 In addition, studies using PHQ-9 are emerging. These have strengths in the identification of depression by using machine learning techniques.53,54,58,59,70,71
In this study, studies were analyzed using machine learning techniques on PHQ-9 depression screening data collected through mobile devices according to the latest depression screening trends. In addition, we intend to contribute to predicting depression through mobile devices in the future by examining the predictive diagnostic power of PHQ-9 on mobile devices through diagnostic meta-analysis.
Materials and Methods Data Sources and Searches
We searched EMBASE, MEDLINE, MEDLINE In-Process, and PsychINFO between 1964 and March 26, 2021. "Depression," "depressive disorder," "mood disorder," and "sensing," "sensor," "measuring," and "diagnosis" related to sensing were used as keywords. Combined expressions were searched. Below is the mutually agreed upon some search query, and additional search query is presented in the supplementary (Supplementary Materials).
"depression" AND "smartphone" OR "depression" AND "wearable" OR "depression" AND "mobile" OR "depression" AND "smartphone" AND "sensing" OR "depression" AND "smartphone" AND "sensor" OR "depression" AND "smartphone" AND "measuring" OR "depression" AND "smartphone" AND "diagnosis".
We included review articles, posters, all kinds of unpublished studies, and studies without language restrictions. This study was prepared according to the preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines and completed 27-point checklist PRISMA.72
Study Selection
After removing 1340 duplicate papers from the 5476 papers searched, two reviewers independently first selected the papers by title and abstract. All authors applied the eligibility criteria of quality assessment using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool and PRISMA checklist. They screened and reviewed the entire text of the papers and created a final list of papers containing eligible data through consensus (Figure 1). We removed all reports of mental disorders (such as bipolar disorder, and schizophrenia), except for depression because of other psychiatry mental discriminant categories. We excluded studies and interventions that discriminated mental health through sensing. In the case of interventions through sensing, there were studies that verified the therapeutic effects of interventions; therefore, an integrated analysis through meta-analysis was not possible, so this study was excluded. Only the depressed group as selected by the PHQ-9 were included. Those selected by the PHQ-2 and the PHQ-8, which are short-cut scales of the PHQ, were excluded. In addition, methodologies that selected or measured the PHQ-9 by traditional statistical methods other than machine learning were excluded.
Statistical Analysis
Data Synthesis The diagnostic accuracy was extracted in a standardized format with all possible participant characteristics, scores on PHQ-9, and data on sensitivity and specificity. Where appropriate, the cell contents of the 2×2 table were used by analyzing the receiver operating characteristic (ROC) curve calculated from the provided data and plotted.73
Meta-Analysis We performed a bivariate meta-analysis to obtain pooled estimates of specificity and sensitivity and 95% confidence intervals (CIs) and generate 95% confidence ellipses within the ROC curve space.74 A summary receiver operating characteristic (SROC) curve was constructed using a quantitative model.75
The heterogeneity was evaluated using I2(I2; the proportion of true variance), and meta-regression was performed to determine the heterogeneity. In addition, publication bias was assess to determine any bias in which a study may or may not be published according to the characteristics and results of individual studies. The probability of having the disease in question was estimated based on the diagnostic test results through Fagan's nomogram.76 Analyses were performed using STATA 17.0 (Texas, USA) using metandi, midas, metabias, and metareg of Stata algorithm meta-analysis words. "metandi" performs meta-analysis of diagnostic accuracy and it takes as input four variables: tp (true positives), fp (false positives), fn (false negatives), tn (true negatives) within each study.77 "midas" is a comprehensive program of statistical an graphical routines for undertaking meta-analysis of diagnostic test performance in Stata.78 And "metareg" performs random-effects meta-regression using aggregate-level data.79
Quality Assessment We conducted a quality assessment using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool.80 This integrates an assessment of bias risk across four key areas: patient selection, index test, reference standard, and flow and timing of assessments. Two authors independently assessed the risk of bias based on consensus criteria.
Results General Characteristics and Quality Analysis
We performed a meta-analysis of 1099 patients in four studies (age above 18 years old, moderate, and severe depression, data collection through mobile devices, Data analysis through machine learning algorithms). We performed a diagnostic meta-analysis according to the PHQ-9 cutoff score and machine learning algorithm techniques. The machine learning algorithm techniques were random forest, support vector machine (SVM), k-nearest neighbor (KNN), artificial neural network (ANN), and 10-fold cross-validation. Random forest (RF) is an reliable classifier that uses predictions derived from ensembles of decision trees,81 which successfully select and rank variables with the greatest ability to discriminate between the target classes.82 The Support Vector Machine is a discriminant classifier that can be defined as a separating hyperplane. It is the generalization of maximal margin classifier that comes with the definition of hyperplane.81 The K-Nearest Neighbors (kNN) algorithm is used for classification and regression. It performs great in pattern recognition and predictive analysis.83 The artificial neural network (ANN) is a machine learning method evolved from the idea of simulating the human brain.84 It is excellent fault tolerance and is fast and highly scalable with parallel processing.85 Cross-validation is widely used to estimate the prediction error, and also been used for model selection.86 These machine learning algorithm techniques analyzed PHQ-9 data collected by mobile devices, and this study performed diagnostic meta-analysis by integrating these studies.
The depression diagnosis cutoff score used with the PHQ-9 was 10 points for most studies, 15 points for severe cases, and 5 points or more as a depression diagnosis in one study (Table 1). Therefore, this study performed meta-regression and publication bias according to the level of depression. In addition, a quality analysis of the papers that were analyzed in a meta-analysis according to QUADAS-2 was performed (Table 2).
Diagnostic Meta Results of PHQ-9 Using Machine Learning
Among the summarized estimates, the pooled sensitivity was 0.797 (95% CI = 0.642–0.895), the pooled specificity was 0.850 (95% CI = 0.780–0.900), and the diagnostic odds ratio (OR) was 22.16 (95% CI = 7.273–67.499). The p-value of HOROC beta was not significant (0.071). If the HOROC beta value is a parameter representing the shape of the SROC curve and is statistically significant, heterogeneity is suspected. Therefore, this study did not show heterogeneity. However, the Corr (logits) correlation coefficient of the bivariate model was 0.763. To confirm heterogeneity between studies, the heterogeneity value was analyzed using a meta-regression analysis (Table 3). Heterogeneity refers to the degree of dispersion of effect sizes from each individual study and the degree of inconsistency in effect sizes across studies. This heterogeneity is important in meta-analysis to increase the relevance of conclusions drawn from subject studies and to improve scientific understanding of the evidence as a whole.87,88 As such, meta-regression analysis performs whether between-study heterogeneity can be explained by one or more moderators.89,90 Therefore, in this study, the cause of the heterogeneity was further tested through meta-regression analysis.
By analyzing the diagnostic meta result value for the diagnostic value of PHQ-9 using machine learning with a forest plot, the inter- and intra-study variation were confirmed. The intra-study variation was relatively large compared with that of previous studies (sensitivity = 0.50 [95% CI; 0.01–0.99], specificity = 0.83 [95% CI; 0.63–0.95]). McIntyre's study showed a large inter-study variation in both sensitivity and specificity compared with other studies; however, it showed a good balance overall (Figure 2).
Heterogeneity of Diagnostics Meta-Analysis
Although a large amount of heterogeneity is not suspected from the summary estimate and the SROC curve (Figure 3), the summary estimate Corr (logits) correlation coefficient indicates a negative value exceeding 0; therefore, so the PHQ-9 cutoff value was analyzed through meta-regression analysis (Table 4, Figure 4). As a result of the meta-regression analysis, the p-value of the PHQ-9 severe group was 0.95, which was not significant. This confirmed that the PHQ-9 score was not the cause of the heterogeneity.
Publication Errors
The publication errors of the studies used were distributed non-biased asymmetrically based on the regression line. The p-value was 0.50, which did not indicate publication errors (Figure 5).
Diagnosed with Depression
Fagan's nomogram is a graphical tool that can measure the probability of having a disease based on the results of a diagnostic test using Bayes' theorem that describes the probability of an event.76,91 Prior probability is entered to calculate the posterior probability (probability of contracting a specific disease). In the case of depression screening, this appears between 5% and 10% in primary care.92,93 In the meta-analysis study of Mitchell et al that screened for depression using PHQ-2 and PHQ-9, the prevalence in primary care was 11.3% (95% CI 10.92–11.68%).94 Therefore, assuming that the pre-prevalence value selected with PHQ-9 is 11%, the posterior probability of being diagnosed with depression is 40% if LR_positive, according to this Fagan's nomogram diagnostic test. If the diagnosis is LR_negative, the probability of being diagnosed with depression appears to be 3% (Figure 6).
Discussion
This study is the first diagnostic meta-analysis to reveal the efficacy of depression screening when used with a computer-based mobile device and PHQ-9. In addition, it is the first study to determine through meta-analysis whether machine learning algorithm methods have a diagnostic strength when PHQ-9 is used over traditional statistical methods via machine learning techniques that have predictive power.
The PHQ-9 has diagnostic properties for major depressive disorder.47,51,94–99 It is a short period and gold standard screening tool for depression based on the DSM-IV diagnostic criteria.44,56,100,101 In addition, the PHQ-9 is the computer version, and the diagnostic reliability and effectiveness are identical to offline methods102,103 and with smartphones.9,53,104
Similar to the depression diagnostic values of PHQ-9 through the existing traditional technique in this paper,105 when PHQ-9 data from mobile devices are analyzed with a machine learning algorithm, good results are reported with 80% sensitivity and 85% specificity. This indicates good diagnostic properties.
The depression-diagnostic properties of PHQ-9 are similar when measurements are performed through mobile devices and machine learning techniques are applied. The PHQ-9 cutoff score is optimally ≥10. When meta-regression was performed by dividing according to the level of PHQ-9, five studies assessing severe depression (PHQ-9 ≥15) showed a sensitivity of 78% and specificity of 84%. Six studies of non-severe depression that did not exceed 15 points showed a sensitivity and specificity similar to the overall sensitivity and specificity, of 81% and 86% respectively. Therefore, a PHQ-9 cutoff of ≥10 has more discriminating power to diagnose depression. This is in line with previous studies.51,106 Taken together, this study verified the diagnostic discriminatory power of depression according to the PHQ-9 of ≥10 cutoff.
Previous studies on diagnostic meta-analysis in various settings using PHQ have shown high sensitivity and specificity. In the case of PHQ-9, the sensitivity was 0.80–0.82 (95% CI 0.71–0.89) and the specificity was 0.84–0.92 (95% CI 0.80–0.95).51,94 As such, this study confirmed that PHQ-9 shows similar high sensitivity and specificity even with machine learning statistical techniques in a mobile environment.
Diagnosing depression by collecting depressive data from individuals ≥18 years of age in a mobile environment is more meaningful than diagnosing depression in primary care or clinical care settings. Screening for major depression in the mobile environment has been carried out in the general population (eg, university students, college students, the general community12,30,42,52,53,107,108), clinical field, and primary care field.104,109–111 Furthermore, depression diagnoses are performed for those examined in previous studies32,112,113 and participants recruited through an app.114,115 The strength of depression screening in a mobile environment is that the group is not limited to one environment; therefore, more data can be obtained. In particular, self-report assessments in a mobile environment support patients to overcome the locational limitations of traditionally managed assessments (ie, paper-and-pencil-based).6 Moreover, they can collect passive data to reveal the any correlation strengths. An additional potential benefit of depression screening in a mobile environment is that the app can be developed and used based on the storage of mobile devices, portable accessibility, and time-sensitive local and push notifications.4
The strength of this paper is that it shows the diagnostic meta-analysis of machine learning methods with the usefulness of the PHQ-9 for depression screening through mobile devices. Each machine learning algorithm showed little heterogeneity and good diagnostic usefulness. Machine learning selects algorithms according to research and sets up suitable training and testing sets. In this diagnostic meta-analysis, various ML models were analyzed and showed a high fit of AUC = 0.89. For the diagnosis of depression and mood disorders, machine learning with excellent predictive suitability has been introduced.57,110,113,116,117 Datasets collected in mobile settings are large; machine learning-based predictive models can analyze a large amount of data. This technique is useful for analyzing and conceptualizing multiple predictors.118,119 In the case of depression, if a diagnosis is delayed by screening tests, the prognosis deteriorates,120 and the importance of early detection is raised.121–124 In addition, as efficiency such as the cost-effectiveness of testing is proven,125 machine learning and depression screening in the mobile field show potential strengths.
The limitation of this paper is that there are various research results on depression screening conducted in the mobile field, but the results for PHQ-9 are sporadic, and in particular, there are few studies that use machine learning algorithm techniques. In addition, PHQ-9 has excellent diagnostic properties for depression; however, studies on its utility in DSM-5 should be conducted as it is based on DSM-IV. Additionally, the reference standards for diagnosing depression are narrow in the mobile field. Based on the effectiveness of machine learning algorithms for the diagnosis of depression in the mobile field of this study, we expect machine learning studies on depression diagnosis using various sensing data and self-report tests will be collected.
Conclusion
In this study, a diagnostic meta-analysis was performed on the selection of depression using machine learning techniques on data related to the PHQ-9, which diagnoses depression in the mobile field. We used mobile data from 1099 subjects with a pooled sensitivity of 80%, specificity of 85%, and an AUC = 0.89 for various machine learning algorithm methods. We found, excellent diagnostic effects by integrating meta-analysis. This study confirms that PHQ-9, which has been proven to be a useful screening test for depression, is an effective diagnostic tool in mobile assessments, as well as in primary care and clinical settings.
Funding
This research was supported by the Technology Innovation Program (20012931) funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea). This funding source had no role in the design of this study and will not have any role during its execution, analyses, interpretation of the data, or decision to submit results.
Disclosure
The authors declare no conflict of interest.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2021. This work is licensed under https://creativecommons.org/licenses/by-nc/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Purpose: Depression is a symptom commonly encountered in primary care; however, it is often not detected by doctors. Recently, disease diagnosis and treatment approaches have been attempted using smart devices. In this study, instrumental effectiveness was confirmed with the diagnostic meta-analysis of studies that demonstrated the diagnostic effectiveness of PHQ-9 for depression using mobile devices.
Patients and Methods: We found all published and unpublished studies through EMBASE, MEDLINE, MEDLINE In-Process, and PsychINFO up to March 26, 2021. We performed a meta-analysis by including 1099 subjects in four studies. We performed a diagnostic meta-analysis according to the PHQ-9 cut-off score and machine learning algorithm techniques. Quality assessment was conducted using the QUADAS-2 tool. Data on the sensitivity and specificity of the studies included in the meta-analysis were extracted in a standardized format. Bivariate and summary receiver operating characteristic (SROC) curve were constructed using the metandi, midas, metabias, and metareg functions of the Stata algorithm meta-analysis words.
Results: Using four studies out of the 5476 papers searched, a diagnostic meta-analysis of the PHQ-9 scores of 1099 people diagnosed with depression was performed. The pooled sensitivity and specificity were 0.797 (95% CI = 0.642– 0.895) and 0.85 (95% CI = 0.780– 0.900), respectively. The diagnostic odds ratio was 22.16 (95% CI = 7.273– 67.499). Overall, a good balance was maintained, and no heterogeneity or publication bias was presented.
Conclusion: Through various machine learning algorithm techniques, it was possible to confirm that PHQ-9 depression screening in mobiles is an effective diagnostic tool when integrated into a diagnostic meta-analysis.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer