This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Recent studies by the World Health Organization (WHO) [1] have revealed that 56 million Indians suffer from depression and another 38 million suffer from anxiety disorders. Even though depression is highly treatable, only a fraction of those affected receive adequate treatment, largely because of the societal stigma associated with mental health. Diagnosis and subsequent treatment are often delayed, imprecise, or missed entirely. The social media activity of individuals offers a new way to support early depression intervention services, especially for young adults [2, 3]. Many depressed individuals choose not to discuss their mental health with family and friends because the taboo surrounding depression is still strong, especially in India. When such individuals tweet, they consciously and subconsciously use words that indicate their mental state. Social media platforms have made it relatively easier to find these individuals [4, 5]. Since it is nearly impossible for a person, or even a team, to check the posts of every user across all platforms for such hints, the process needs to be automated. One widely accepted approach is sentiment analysis [6, 7], a cross-platform machine learning (ML) approach that can be used to flag a particular user based on the pattern of their social media posts.
2. Related Works
The ability of algorithms to evaluate text has improved substantially as a result of recent advances in deep learning [8, 9]. Surveys of sentiment analysis and opinion mining algorithms for social multimedia [10, 11] summarize existing research on multimodal sentiment analysis, which incorporates numerous media outlets. In the field of psychology, data mining has been used to detect depressed people on social networking platforms [12, 13]; a sentiment analysis method is first proposed that uses vocabulary and man-made rules to calculate the depression inclination of each post or microblog. A hybrid model for identifying depressed individuals that combines CNN and LSTM models is based on ordinary conversational text data obtained from Twitter [14]. However, the vast majority of these studies were conducted with an audience that spoke only English. Little work has been done on sentiment analysis for an audience that predominantly uses Indian languages on microblogging websites. Instead of learning character- or word-level representations, one proposed model learns subword-level representations in the LSTM architecture [7] and performs well on excessively noisy text with misspellings. A Twitter-based annotated corpus of code-mixed social media material in Hindi, English, and Hinglish has also been created [6, 15]. To create a more diverse canvas, that study used words with ambiguous meanings and irregular spellings in both languages [16, 17].
3. Proposed System
The proposed system uses a classifier model to classify tweets as “depressed” or “not depressed”. The model utilizes a pipeline composed of TF-IDF, CountVectorizer, and multinomial Naive Bayes (MNB) stages.
A tweet retrieved through the Twitter API serves as the input to the model. The tweet can be written in English, Hindi, or a mix of the two (Hinglish). Based on the words present in the tweet (for instance, depressed tweets most commonly include keywords such as “depressed,” “anxiety,” and “sad”), the model assigns it one of the two target class labels, depressed (denoted by 0 in the dataset) or not depressed (denoted by 1 in the dataset), and the predicted class is displayed on the screen. Figure 1 represents the architecture of the proposed model.
[figure(s) omitted; refer to PDF]
4. Technique Used
4.1. Data Collection
The tweets in the dataset were obtained using the Python module Tweepy via the Twitter API. Hashtags (#) such as #depressed, #anxiety, and #sad were used to select depressed tweets, whereas #happy and #life were used to select tweets that were not depressed. These tweets were then turned into a 670-data-point raw dataset with three columns: TID (unique Twitter ID), TWEET, and LABEL. Figure 2 shows the resulting output. The tweets were then compiled into a CSV file, shown in Table 1.
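The hashtag-based labelling described above can be sketched as follows. This is a minimal illustration using only the standard library, not the collection script itself: the Tweepy calls are omitted, the function names and sample tweets are hypothetical, and the rule that drops tweets with no or conflicting hashtag evidence is an assumption.

```python
import csv
import io
import re

# Hashtags named in the text; 0 = depressed, 1 = not depressed.
DEPRESSED_TAGS = {"#depressed", "#anxiety", "#sad"}
NOT_DEPRESSED_TAGS = {"#happy", "#life"}


def label_from_hashtags(text):
    """Return 0, 1, or None when the hashtags give no or conflicting evidence."""
    tags = {tag.lower() for tag in re.findall(r"#\w+", text)}
    is_depressed = bool(tags & DEPRESSED_TAGS)
    is_not_depressed = bool(tags & NOT_DEPRESSED_TAGS)
    if is_depressed == is_not_depressed:  # neither set matched, or both did
        return None
    return 0 if is_depressed else 1


def build_rows(tweets):
    """Turn (tid, text) pairs into rows with the TID, TWEET, LABEL columns."""
    rows = []
    for tid, text in tweets:
        label = label_from_hashtags(text)
        if label is not None:
            rows.append({"TID": tid, "TWEET": text, "LABEL": label})
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["TID", "TWEET", "LABEL"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample tweets, for illustration only.
sample = [
    (101, "I feel lost inside myself #depressed"),
    (102, "What a lovely morning #happy #life"),
    (103, "No hashtag here"),
]
rows = build_rows(sample)
csv_text = to_csv(rows)
```

In this sketch the third sample tweet is discarded because it carries none of the listed hashtags.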
[figure(s) omitted; refer to PDF]
4.2. Data Preprocessing
The raw dataset was preprocessed to bring all the textual data into a form that is predictable and analyzable for the model. Figure 1 depicts the flow of processes in data preprocessing. The modules stopwords, RegexpTokenizer, WordNetLemmatizer, and PorterStemmer from NLTK were used along with Python's string module. We also included Hindi stopwords [20] separately, as NLTK does not provide them.
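A minimal sketch of this kind of preprocessing is given below. It is a standard-library stand-in rather than the NLTK-based code: the stopword sets are tiny placeholders (not the NLTK English list or the Hinglish list from [20]), stemming and lemmatization are left out, and the function name `preprocess` is hypothetical.

```python
import re
import string

# Placeholder stopword sets, for illustration only.
ENGLISH_STOPWORDS = {"i", "am", "is", "the", "a", "to", "and", "so", "in", "of"}
HINGLISH_STOPWORDS = {"hai", "ki", "ka", "ko", "to", "me", "tu", "ya", "he"}
STOPWORDS = ENGLISH_STOPWORDS | HINGLISH_STOPWORDS

URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
TOKEN = re.compile(r"\w+")
STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def preprocess(tweet):
    """Lowercase, drop URLs/mentions/punctuation, tokenize, remove stopwords."""
    text = tweet.lower()
    text = URL_OR_MENTION.sub(" ", text)      # remove links and @mentions first
    text = text.translate(STRIP_PUNCTUATION)  # '#sad' becomes 'sad'
    tokens = TOKEN.findall(text)
    return [token for token in tokens if token not in STOPWORDS]


print(preprocess("Udaas rehne ki wajah to bohot hai life me..!!"))
```

URLs and mentions are removed before punctuation so that a link is dropped as a whole instead of being split into stray tokens.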
4.3. Undersampling
Initially, the dataset contained 670 data points, out of which 409 were associated with label 1, and 260 were associated with label 0. This imbalance would bias the model and skew its results if not rectified. We therefore undersampled the data associated with label 1, after which both target class labels were equally represented in a dataset of 520 data points.
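Random undersampling of the majority class can be sketched as below. The class counts are the ones stated above; the row layout, the function name `undersample`, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def undersample(rows, label_key="LABEL", seed=42):
    """Randomly keep as many rows per class as the smallest class has."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Synthetic rows with the class sizes given in the text.
rows = [{"TID": i, "LABEL": 1} for i in range(409)]
rows += [{"TID": 1000 + i, "LABEL": 0} for i in range(260)]
balanced = undersample(rows)
print(len(balanced))
```

With 260 rows in the smaller class, the balanced set holds 260 rows per label, which gives the 520 data points mentioned above.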
4.4. TF-IDF
The TF
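A minimal sketch of TF-IDF weighting follows. It assumes raw term counts as the term frequency and the smoothed inverse document frequency idf(t) = ln((1 + N) / (1 + df(t))) + 1, which is one common variant; the variant and settings used in the proposed pipeline are not stated here, and the function name `tfidf` and the toy documents are hypothetical. No length normalization is applied.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} dicts."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        weighted.append({term: tf * idf[term] for term, tf in term_freq.items()})
    return weighted


# Toy, already-tokenized documents.
docs = [["feel", "sad", "sad"], ["feel", "happy"], ["happy", "life"]]
weights = tfidf(docs)
print(weights[0])
```

In the first document, "sad" receives a higher weight than "feel" because it occurs twice there and appears in fewer documents overall.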
4.5. Multinomial Naive Bayes
The MNB algorithm is used as the primary classifier because it is more accurate than the Naive Bayes (NB) algorithm [5]. While NB considers the independent probability of each feature, MNB considers a feature vector where each term represents the TF
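The idea behind a multinomial Naive Bayes classifier can be sketched as below. This is an illustrative standard-library implementation with Laplace smoothing (`alpha`), not the classifier used in the experiments: it works on raw token counts rather than TF-IDF weights, and the class name `TinyMultinomialNB` and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes_ = sorted(set(labels))
        self.vocab_ = sorted({token for doc in docs for token in doc})
        class_counts = Counter(labels)
        term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            term_counts[label].update(doc)
        n_docs, vocab_size = len(labels), len(self.vocab_)
        self.log_prior_ = {
            c: math.log(class_counts[c] / n_docs) for c in self.classes_
        }
        self.log_likelihood_ = {}
        for c in self.classes_:
            total = sum(term_counts[c].values())
            denominator = total + self.alpha * vocab_size
            self.log_likelihood_[c] = {
                t: math.log((term_counts[c][t] + self.alpha) / denominator)
                for t in self.vocab_
            }
        return self

    def predict(self, docs):
        predictions = []
        for doc in docs:
            scores = {}
            for c in self.classes_:
                likelihood = self.log_likelihood_[c]
                # Tokens never seen in training are skipped.
                scores[c] = self.log_prior_[c] + sum(
                    likelihood[t] for t in doc if t in likelihood
                )
            predictions.append(max(scores, key=scores.get))
        return predictions


# Toy training data: 0 = depressed, 1 = not depressed.
train_docs = [
    ["sad", "lonely", "depressed"],
    ["anxiety", "sad"],
    ["happy", "life"],
    ["love", "happy", "life"],
]
train_labels = [0, 0, 1, 1]
model = TinyMultinomialNB(alpha=1.0).fit(train_docs, train_labels)
predictions = model.predict([["sad", "anxiety"], ["happy", "love"]])
print(predictions)
```

The same scoring rule applies when the counts are replaced by fractional feature weights such as TF-IDF values.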
4.6. Grid Search
Selecting the best hyperparameters for tuning the model can be tedious and time-consuming if performed manually. To automate this process, grid search was used [21]. These are the best hyperparameters that were determined for the proposed model.
An important feature to note is that the value of
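Exhaustive grid search itself can be sketched in a few lines. The parameter names (`alpha`, `use_idf`), the candidate values, and the scoring function below are hypothetical placeholders and are not the hyperparameters or scores of the proposed model; in practice the scoring function would train the pipeline with the given parameters and return its development accuracy.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(alpha, use_idf):
    """Stand-in for 'train with these parameters, return dev accuracy'."""
    return -(alpha - 0.5) ** 2 + (0.1 if use_idf else 0.0)


param_grid = {"alpha": [0.1, 0.5, 1.0], "use_idf": [True, False]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the candidate list lengths, which is why the search is automated rather than performed by hand.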
5. Implementation
The model is an application of supervised machine learning; the user only needs to deploy it and collect the result. Deployment requires basic interaction, in which the application asks for the keys and tokens needed to access the database (for Twitter: access_token, secret access token, consumer key, and consumer secret key). After that, the application requires minimal to no intervention from the user until the output is provided. The application collects a set of tweets from the database (Twitter) and feeds them into its core, which contains a trained model that classifies each tweet as depressed or not depressed. The model is tuned using grid search, which, as mentioned in the previous section, chooses the best combination of parameters. The model chosen is a pipeline of TF-IDF, CountVectorizer, and multinomial Naive Bayes. It aims to remain accurate across the different types of data provided to it. The model can also read Hindi tweets and classify them using its knowledge of the Hinglish terms that are commonly used on social media. After classification, the application reports results with an accuracy of up to 96.15% (based on the training dataset) and can provide a visual representation of the key lexicons it has encountered throughout the dataframe.
One of the strengths of the implementation is its modular approach: each job is assigned to a different module, and each major module cluster can work on its own without interference from the others. This makes the code easier to implement, upgrade, and read. A detailed test report for different types of tweets is provided in Table 2.
Table 1
Generated classification report of the model.
Class | Precision | Recall | F1-score | Support |
0 | 0.9815 | 0.9464 | 0.9636 | 56 |
1 | 0.9400 | 0.9792 | 0.9592 | 48 |
Accuracy | | | 0.9615 | 104 |
Macro avg | 0.9607 | 0.9628 | 0.9614 | 104 |
Weighted avg | 0.9623 | 0.9615 | 0.9616 | 104 |
6. Experimental Setup
The 670-data-point raw dataset taken from Twitter is a collection of real tweets in Hindi and English. The dataset has been split into two groups: the train set, which provides the training samples, and the development set, which is used to verify the accuracy of each grid search checkpoint. The train set represents around 90% of the data, and the development set around 10%. For testing, we train the grid search model several times and choose the one with the highest average development accuracy, as shown in Table 3.
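The roughly 90/10 split into train and development sets can be sketched as follows. The function name `train_dev_split`, the fixed seed, and the use of integer stand-ins for the tweets are assumptions made for illustration.

```python
import random


def train_dev_split(rows, dev_fraction=0.1, seed=0):
    """Shuffle the rows and hold out dev_fraction of them as the dev set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_dev = round(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# Integer stand-ins for the 670 raw data points.
train_set, dev_set = train_dev_split(range(670))
print(len(train_set), len(dev_set))
```

Shuffling before the cut keeps tweets collected for the same hashtag from ending up in only one of the two sets.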
Table 2
Test cases for testing the model.
Test case ID | Test condition | Tweets | Expected result | Actual result | Status |
01 | Test MNB (TF | Hello hk im soooo happi luv soooo much | Not depressed | Not depressed | Pass |
02 | Test MNB (TF | @Rishabverma740 / Tere dukh tere he rahenge / Phir tu isko suna / Ya usko suna / Kya farak padta hai / #Depressed | Depressed | Depressed | Pass |
03 | Test MNB (TF | Udaas rehne ki wajah to bohot hai life me..!! | Not depressed | Depressed | Fail |
04 | Test MNB (TF | Depress start counsel next monthal want happi | Not depressed | Depressed | Pass |
05 | Test MNB (TF | I Feel lost inside myself! #illness #lifelessons #useless #depressed | Depressed | Depressed | Pass |
06 | Test MNB, (TF | @sidnaaz_ka | Not depressed | Not depressed | Pass |
7. Results and Discussions
The model, which is a hybrid of MNB, TF
Table 3
Model metrics.
Metric | Value derived for proposed model |
MAE (mean absolute error) | 0.038461538461538464 |
R2 score | 0.85 |
Log loss | 1.3284375420378214 |
IoU (Jaccard score) | 0.9215686274509803 |
MSE (mean squared error) | 0.038461538461538464 |
RMSE (root mean squared error) | 0.19611613513818404 |
MSLE (mean squared log error) | 0.018478962073776976 |
NAE (normalized absolute error) | 0.20450490315512837 |
When applying MNB, TF
After training, evaluation measures are applied to check how the model is performing. The following evaluation parameters are used:
(i) Accuracy score
(ii) Confusion matrix with plot
(iii) ROC-AUC Curve
Accuracy: as far as the accuracy of the model is concerned, MNB (TF
F1-score: MNB(TF
The model has been evaluated against several metrics to compare the model’s predictions with the (known) values of the dependent variable in a dataset. Table 1 describes the model metrics derived for the classification model.
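Several of the reported figures can be cross-checked from a confusion matrix. The counts used below (53, 3, 1, 47) are an assumption: they are inferred from the per-class precision, recall, and support in the classification report and are not read from the confusion matrix figure. The sketch treats label 1 as the positive class and uses only the standard library.

```python
import math

# Assumed counts, inferred from the classification report
# (class 0: 56 samples, class 1: 48 samples).
TN, FP = 53, 3  # true label 0, predicted 0 / predicted 1
FN, TP = 1, 47  # true label 1, predicted 0 / predicted 1
n = TN + FP + FN + TP
errors = FP + FN

accuracy = (TP + TN) / n
precision_0, recall_0 = TN / (TN + FN), TN / (TN + FP)
precision_1, recall_1 = TP / (TP + FP), TP / (TP + FN)
f1_0 = 2 * TN / (2 * TN + FP + FN)
f1_1 = 2 * TP / (2 * TP + FP + FN)

# With 0/1 labels, each misclassification has absolute and squared error 1.
mae = errors / n
mse = errors / n
rmse = math.sqrt(mse)
# Each error contributes (ln(1 + 1) - ln(1 + 0))^2 = (ln 2)^2 to the MSLE.
msle = errors * math.log(2) ** 2 / n
jaccard_1 = TP / (TP + FP + FN)
# R^2 = 1 - SS_res / SS_tot, where SS_tot = n * p * (1 - p) for 0/1 labels.
p = (TP + FN) / n
r2 = 1 - errors / (n * p * (1 - p))

print(round(accuracy, 4), round(f1_0, 4), round(f1_1, 4))
print(mae, rmse, msle, jaccard_1, round(r2, 2))
```

Under these assumed counts, MAE and MSE coincide because every prediction error on binary labels has magnitude 1.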
A study has been conducted to compare the proposed model's metrics, specifically the accuracy and F1-score, with preexisting works, and the results of this study are shown in Table 4.
Table 4
Comparison between existing models and proposed model.
Method | Reported in | Accuracy (%) | F1-score |
Char-LSTM | Joshi, A. et al. (2016) | 59.8 | 0.511 |
Subword-LSTM | Joshi, A. et al. (2016) | 69.7 | 0.658 |
CNN-BiLSTM | Garg, N., and Sharma, K. (2020) | 83.21 | 0.556 |
MNB (TF | Proposed | 96.15 | 0.914 |
Figure 3 and Figure 4 represent the ROC curve and precision-recall curve obtained for the proposed model, respectively, and Figure 5 represents the confusion matrix of the model.
[figure(s) omitted; refer to PDF]
8. Conclusion and Future Enhancement
The proposed model helps to identify depressed individuals in a large data pool quickly, with minimal changes and hardly any human intervention. Another distinguishing factor of the proposed model is that it can classify tweets written in English, Hindi, and Hinglish. Because the entire architecture works over English and Hindi, it lends itself to implementation across multiple platforms globally, and especially in India. This could help curb the ever-increasing depression rates in an automated manner.
This work can be readily upgraded into an interactive bot. The bot would adapt itself to the depressed person and help them express themselves. This would help people spend time working on their mental health through regular conversation with the bot. The work can also be extended to include several other Indian languages.
[1] World Health Organization, "WHO Director-General’s opening remarks at the media briefing on COVID," March 2020. https://www.who.int/dg/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19-11-march-2020
[2] P. A. Cavazos-Rehg, M. J. Krauss, S. Sowles, S. Connolly, C. Rosas, M. Bharadwaj, L. J. Bierut, "A content analysis of depression-related tweets," Computers in Human Behavior, vol. 54, pp. 351-357, DOI: 10.1016/j.chb.2015.08.023, 2016.
[3] U. Chawda, S. K. Rakesh, "Implementation and Analysis of Depression Detection Model Using Emotion Artificial Intelligence," International Journal of Computer Sciences and Engineering, vol. 7, DOI: 10.26438/ijcse/v7i4.912, 2019.
[4] M. R. Islam, M. A. Kabir, A. Ahmed, A. R. M. Kamal, H. Wang, A. Ulhaq, "Depression detection from social network data using machine learning techniques," Health Information Science and Systems, vol. 6 no. 1, DOI: 10.1007/s13755-018-0046-0, 2018.
[5] M. Trotzek, S. Koitka, C. M. Friedrich, "Utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences," IEEE Transactions on Knowledge and Data Engineering, vol. 32 no. 3, pp. 588-601, DOI: 10.1109/TKDE.2018.2885515, 2018.
[6] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, R. Passonneau, "Sentiment analysis of twitter data," pp. 30-38.
[7] A. Joshi, A. Prabhu, M. Shrivastava, V. Varma, "Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text," Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2482-2491, 2016.
[8] S. Sharma, S. Sharma, "Analyzing the depression and suicidal tendencies of people affected by COVID-19’s lockdown using sentiment analysis on social networking websites," Journal of Statistics & Management Systems, vol. 24, DOI: 10.1080/09720510.2020.1833453, 2020.
[9] A. Ziani, N. Azizi, D. Schwab, M. Aldwairi, N. Chekkai, D. Zenakhra, S. Cheriguene, "Recommender system through sentiment analysis," 2nd International Conference on Automatic Control, Telecommunications and Signals, hal-01683511.
[10] N. Garg, K. Sharma, "Annotated corpus creation for sentiment analysis in code-mixed Hindi-English (Hinglish) social network data," Indian Journal of Science and Technology, vol. 13 no. 40, pp. 4216-4224, DOI: 10.17485/ijst/v13i40.1451, 2020.
[11] D. M. E.-D. M. Hussein, "A survey on sentiment analysis challenges," Journal of King Saud University - Engineering Sciences, vol. 30 no. 4, pp. 330-338, DOI: 10.1016/j.jksues.2016.04.002, 2018.
[12] S. Liao, J. Wang, R. Yu, K. Sato, Z. Cheng, "CNN for situations understanding based on sentiment analysis of twitter data," Procedia Computer Science, vol. 111, pp. 376-381, DOI: 10.1016/j.procs.2017.06.037, 2017.
[13] X. Wang, C. Zhang, Y. Ji, L. Sun, L. Wu, Z. Bao, "A depression detection model based on sentiment analysis in micro-blog social network," Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 201-213, DOI: 10.1007/978-3-642-40319-4_18, 2013.
[14] B. Verma, S. Gupta, L. Goel, "A neural network based hybrid model for depression detection in twitter," pp. 164-175, DOI: 10.1007/978-981-15-6634-9_16, 2020.
[15] Z. Li, Y. Fan, B. Jiang, T. Lei, W. Liu, "A survey on sentiment analysis and opinion mining for social multimedia," Multimedia Tools and Applications, vol. 78 no. 6, pp. 6939-6967, DOI: 10.1007/s11042-018-6445-z, 2019.
[16] N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, N. A. Smith, "Linguistic knowledge and transferability of contextual representations," DOI: 10.18653/v1/n19-1112, 2019.
[17] A. A. Maksutov, V. I. Zamyatovskiy, V. N. Vyunnikov, A. V. Kutuzov, "Knowledge base collecting using natural language processing algorithms," pp. 405-407, DOI: 10.1109/eiconrus49466.2020.9039303.
[18] T. Shen, J. Jia, G. Shen, F. Feng, X. He, H. Luan, J. Tang, T. Tiropanis, T. S. Chua, W. Hall, "Cross-domain depression detection via harvesting social media," Proceedings of the International Joint Conferences on Artificial Intelligence, pp. 1611-1617, DOI: 10.24963/ijcai.2018/223.
[19] G. Singh, B. Kumar, L. Gaur, A. Tyagi, "Comparison between multinomial and Bernoulli naïve Bayes for text classification," pp. 593-596, DOI: 10.1109/ICACTM.2019.8776800.
[20] S. Rana, "HinglishNLP [Source code]," 2020. https://github.com/TrigonaMinima/HinglishNLP/blob/master/data/assets/stop_hinglish
[21] P. C. Bhat, H. B. Prosper, S. Sekmen, C. Stewart, "Optimizing event selection with the random grid search," Computer Physics Communications, vol. 228, pp. 245-257, DOI: 10.1016/j.cpc.2018.02.018, 2018.
Copyright © 2022 Carmel Mary Belinda M J et al. Licensed under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/
Abstract
According to recent studies, young adults in India faced mental health issues such as low self-esteem and distress due to closures of universities and loss of income, and 43% reported symptoms of anxiety and/or depressive disorder. A solution is therefore urgently needed. A new classifier is proposed to find individuals who might have depression based on their tweets on the social media platform Twitter. The proposed model is based on linguistic analysis and text classification by calculating probability using the TF
1 Department of Computer Science & Engineering, Vel Tech Rangarajan Dr Sagunthala R and D Institute of Science and Technology, Chennai, India
2 Department of Computer Science and Information Technology, University of Lahore, Lahore, Pakistan
3 Department of Electrical and Computer Engineering, Bule Hora University, Bule Hora, Ethiopia