Linguistic Analysis of Hindi-English Mixed Tweets

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Recent studies by the World Health Organization (WHO) [1] have revealed that 56 million Indians suffer from depression and another 38 million suffer from anxiety disorders. Even though depression is highly treatable, only a fraction of those affected receive adequate treatment, largely because of the societal stigma associated with mental health. Diagnosis and subsequent treatment are often delayed, imprecise, or missed entirely. The social media activity of individuals offers a promising route to earlier depression intervention, especially for young adults [2, 3]. Many depressed individuals choose not to discuss their mental health with family and friends because the taboo surrounding depression remains strong, especially in India. When such individuals tweet, they consciously and subconsciously use words that indicate their mental state, and the advent of social media platforms has made it relatively easier to find them [4, 5]. Since it is nearly impossible for a person, or even a team, to check the posts of every user across all platforms for such hints, the process needs to be automated. One globally accepted approach is sentiment analysis [6, 7], a cross-platform machine learning (ML) technique that can flag a particular user based on the pattern of their social media posts.

2. Related Works

Recent advances in deep learning have substantially improved the ability of algorithms to evaluate text [8, 9]. Surveys of sentiment analysis and opinion mining for social multimedia [10, 11] summarize existing research on multimodal sentiment analysis, which incorporates numerous media sources. Data mining has also been used in the field of psychology to detect depressed people on social networking platforms [12, 13]; there, a sentiment analysis method is first proposed that uses a vocabulary and hand-written rules to calculate the depression inclination of each post or microblog. A hybrid model combining CNN and LSTM has been used to identify depressed individuals from ordinary conversational text obtained from Twitter [14]. However, the vast majority of these studies were conducted with an audience that spoke only English, and little work has been done on sentiment analysis for microblog users who predominantly write in Indian languages. For such text, a model was proposed that learns subword-level representations in the LSTM architecture instead of character- or word-level representations [7]; it performs well on excessively noisy text with misspellings. A Twitter-based annotated corpus of code-mixed social media material in Hindi, English, and Hinglish has also been created [6, 15]. To create a more diverse canvas, that study used words with ambiguous meanings and irregular spellings in both languages [16, 17].

3. Proposed System

The proposed system uses a classifier model to classify tweets as “depressed” or “not depressed”. The model is a pipeline composed of TF $*$ IDF weighting and the multinomial Naive Bayes (MNB) algorithm, with MNB serving as the classifier. Naive Bayes takes little effort to implement, which keeps the development phase short and leaves more time for testing and refinement [18]. The model is based on linguistic analysis and text classification: it calculates class probabilities from TF $*$ IDF weights instead of raw word counts, because the TF $*$ IDF weight reflects how important a word is to the document. Grid search is included to perform hyperparameter optimization. The performance of a model depends significantly on the hyperparameters used by its estimators, and selecting them manually can take considerable time and resources [19]; grid search automates this process.

In operation, a tweet retrieved through the Twitter API serves as the input to the model. The tweet can be written in English, Hindi, or a mix of the two (Hinglish). Based on the words present in the tweet, the model assigns one of two target class labels: depressed (denoted by 0 in the dataset) or not depressed (denoted by 1 in the dataset). For instance, depressed tweets most commonly include keywords such as “depressed,” “anxiety,” and “sad.” The predicted class is then displayed on the screen. Figure 1 represents the architecture of the proposed model.

[figure(s) omitted; refer to PDF]

4. Technique Used

4.1. Data Collection

The tweets in the dataset were obtained through the Twitter API using the Python module Tweepy. Hashtags such as #depressed, #anxiety, and #sad were used to select depressed tweets, whereas #happy and #life were used to select tweets that were not depressed. These tweets were assembled into a raw dataset of 670 data points with three columns: TID (unique Twitter ID), TWEET, and LABEL. Figure 2 represents the output derived. The tweets were then compiled into a CSV file, shown in Table 1.

[figure(s) omitted; refer to PDF]
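The labelling and CSV export described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: the Tweepy download is left out because it needs credentials, `label_for_query` and `write_dataset` are hypothetical names, and the two sample records are invented placeholders.

```python
import csv
import io

# Hashtags named in Section 4.1, mapped to the dataset's label convention
# (0 = depressed, 1 = not depressed).
DEPRESSED_TAGS = {"#depressed", "#anxiety", "#sad"}
NOT_DEPRESSED_TAGS = {"#happy", "#life"}


def label_for_query(hashtag):
    """Return the class label for the hashtag that was used as the search query."""
    tag = hashtag.lower()
    if tag in DEPRESSED_TAGS:
        return 0
    if tag in NOT_DEPRESSED_TAGS:
        return 1
    raise ValueError(f"unexpected query hashtag: {hashtag}")


def write_dataset(records, handle):
    """Write (tweet_id, text, query_hashtag) records as TID, TWEET, LABEL rows."""
    writer = csv.writer(handle)
    writer.writerow(["TID", "TWEET", "LABEL"])
    for tweet_id, text, query in records:
        writer.writerow([tweet_id, text, label_for_query(query)])


# Invented placeholder records standing in for the API results.
sample = [
    (1001, "feeling low today #sad", "#sad"),
    (1002, "khush rehne ka maza hi kuch aur hai #happy", "#happy"),
]
buffer = io.StringIO()
write_dataset(sample, buffer)
csv_text = buffer.getvalue()
```

Labelling by the query hashtag rather than by every hashtag in the text is a design choice of this sketch; a tweet that carries both #happy and #sad would otherwise be ambiguous.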

4.2. Data Preprocessing

The raw dataset was preprocessed to bring all the textual data into a form that is predictable and analyzable for the model. Figure 1 depicts the flow of processes in data preprocessing. The NLTK components stopwords, RegexpTokenizer, WordNetLemmatizer, and PorterStemmer were used along with Python's string module. Hindi stopwords [20] were added separately because NLTK does not provide them.
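A rough sketch of the cleaning step is given below. It is an assumption-laden stand-in rather than the paper's code: it uses only the standard library, the stop-word sets are tiny illustrative samples instead of the NLTK and Hinglish lists, and the stemming and lemmatization that the paper performs with NLTK are omitted.

```python
import re
import string

# Tiny illustrative stop-word lists (assumptions); the paper uses NLTK's
# English list plus a separate Hinglish list [20].
ENGLISH_STOPWORDS = {"i", "am", "is", "the", "a", "to", "and", "so"}
HINGLISH_STOPWORDS = {"hai", "ka", "ki", "ko", "me", "hi"}
STOPWORDS = ENGLISH_STOPWORDS | HINGLISH_STOPWORDS

URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(tweet):
    """Lowercase, strip URLs, mentions and punctuation, tokenize, drop stop words."""
    text = URL_OR_MENTION.sub(" ", tweet.lower())
    text = text.translate(PUNCTUATION_TABLE)
    return [token for token in text.split() if token not in STOPWORDS]


tokens = preprocess("@user I am so SAD today!! #depressed https://t.co/x")
# tokens == ["sad", "today", "depressed"]
```

Removing punctuation also strips the `#` sign, so hashtag words such as "depressed" remain available to the classifier as ordinary tokens.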

4.3. Undersampling

Initially, the dataset contained 670 data points, of which 409 carried label 1 and 260 carried label 0. This imbalance would have biased the model toward label 1 if left uncorrected. We therefore undersampled the data associated with label 1, which left an equal number of data points for both target class labels and 520 data points in the dataset.
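Random undersampling of this kind can be sketched in a few lines. The function name `undersample`, the fixed seed, and the placeholder rows are assumptions made for illustration; only the class counts come from the text.

```python
import random


def undersample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Placeholder rows with the class counts reported in Section 4.3.
rows = [("tweet", 1)] * 409 + [("tweet", 0)] * 260
balanced = undersample(rows, label_of=lambda row: row[1])
# len(balanced) == 520, with 260 rows per label
```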

4.4. TF $^{*}$ IDF

The TF $^{*}$ IDF algorithm was applied to generate a score that indicates how relevant a word is to a tweet within the dataset. The scikit-learn classes CountVectorizer and TfidfTransformer are used for this purpose. The TF $*$ IDF weight is given by $w_{i,j} = tf_{i,j} \times \log \frac{N}{df_{i}} \quad (1)$, where $tf_{i,j}$ = number of occurrences of term $i$ in document $j$, $df_{i}$ = number of documents containing term $i$, and $N$ = total number of documents.
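As a worked illustration of Equation (1), the sketch below computes the weights directly for a three-document toy corpus. The function name `tf_idf_weights` and the corpus are assumptions; the arithmetic follows the formula as stated in the text.

```python
import math


def tf_idf_weights(documents):
    """Return one {term: weight} dict per document, using w = tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = []
    for tokens in documents:
        row = {}
        for term in tokens:
            row[term] = row.get(term, 0) + 1  # raw term frequency tf
        for term in row:
            row[term] *= math.log(n_docs / doc_freq[term])
        weights.append(row)
    return weights


docs = [
    ["sad", "sad", "life"],
    ["happy", "life"],
    ["happy", "day", "life"],
]
weights = tf_idf_weights(docs)
# "sad" occurs twice in one of three documents: 2 * log(3 / 1)
# "life" occurs in every document: 1 * log(3 / 3) = 0
```

Note that scikit-learn's TfidfTransformer, with its default settings, uses a smoothed variant of the IDF term and normalizes each row, so its output would differ numerically from this plain form.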

4.5. Multinomial Naive Bayes

The MNB algorithm is used as the primary classifier because it is more accurate than the Naive Bayes (NB) algorithm [5]. While NB considers the independent probability of each feature, MNB considers a feature vector in which each term represents the TF $*$ IDF weight of a word, i.e., it considers not only the frequency of the word but also how important that word is in the entire document. This allows us to make classifications using only the most important words in each line of text. MNB can be represented mathematically by $C_{NB} = \arg \max_{k \in K} \left[ \log P(C_{k}) + \sum_{i = 1}^{n} x_{i} \cdot \log p_{ki} \right] \quad (2)$, where $P(C_{k})$ = prior probability of class $k$, $p_{ki}$ = probability of the $i$-th event occurring in class $k$, and $x_{i}$ = frequency of the $i$-th event.
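The decision rule in Equation (2) can be illustrated with a small sketch. The function name `mnb_predict`, the three-term vocabulary, and all probabilities below are made up for illustration and are not values from the trained model.

```python
import math


def mnb_predict(x, log_prior, log_likelihood):
    """Return the class k maximizing log P(C_k) + sum_i x_i * log p_ki."""
    scores = {}
    for k in log_prior:
        scores[k] = log_prior[k] + sum(
            x_i * log_likelihood[k][i] for i, x_i in enumerate(x)
        )
    return max(scores, key=scores.get), scores


# Hypothetical two-class model over the vocabulary ["sad", "happy", "life"].
log_prior = {0: math.log(0.5), 1: math.log(0.5)}
log_likelihood = {
    0: [math.log(0.6), math.log(0.1), math.log(0.3)],  # class 0: depressed
    1: [math.log(0.1), math.log(0.6), math.log(0.3)],  # class 1: not depressed
}
label, scores = mnb_predict([2.0, 0.0, 1.0], log_prior, log_likelihood)
# label == 0, because the vector puts most of its weight on "sad"
```

Working in log space turns the product of probabilities into a sum, which avoids numerical underflow on longer feature vectors.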

4.6. Grid Search

Selecting the best hyperparameters for the model can be tedious and time-consuming if performed manually. To automate this process, grid search has been used [21]. These are the best hyperparameters that were determined for the proposed model.

An important feature to note is that $\alpha = 1$ for the MNB algorithm, which indicates that Laplace smoothing, a technique for smoothing categorical data, has been used. A small-sample correction, or pseudocount, is incorporated into every probability estimate, so no probability is ever zero. This is a fairly efficient method to regularize the MNB algorithm.
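To make the interplay between the smoothing parameter and grid search concrete, here is a minimal sketch that trains a toy MNB on token counts and searches a small grid of $\alpha$ values. Everything in it is an assumption for illustration: the helper names, the toy tweets, the candidate values, and the single held-out validation set. A library routine such as scikit-learn's GridSearchCV would additionally cross-validate each combination.

```python
import itertools
import math
from collections import Counter


def train_mnb(docs, labels, alpha):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    vocab = sorted({t for doc in docs for t in doc})
    counts = {k: Counter() for k in set(labels)}
    for doc, k in zip(docs, labels):
        counts[k].update(doc)
    model = {}
    for k, term_counts in counts.items():
        total = sum(term_counts.values()) + alpha * len(vocab)
        model[k] = {
            "log_prior": math.log(labels.count(k) / len(labels)),
            # (count + alpha) / (total + alpha * |V|) is never zero for alpha > 0
            "log_p": {t: math.log((term_counts[t] + alpha) / total) for t in vocab},
        }
    return model


def predict(model, doc):
    """Pick the class with the highest log posterior; unseen terms are skipped."""
    def score(k):
        log_p = model[k]["log_p"]
        return model[k]["log_prior"] + sum(log_p[t] for t in doc if t in log_p)
    return max(sorted(model), key=score)


def grid_search(param_grid, evaluate):
    """Try every parameter combination and return (best_params, best_score)."""
    names = sorted(param_grid)
    best = None
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if best is None or score > best[1]:
            best = (params, score)
    return best


# Toy data (invented): label 0 = depressed, label 1 = not depressed.
train_docs = [["sad", "alone"], ["sad", "udaas"], ["happy", "khush"], ["happy", "life"]]
train_labels = [0, 0, 1, 1]
val_docs = [["udaas", "alone"], ["khush", "life"]]
val_labels = [0, 1]


def evaluate(alpha):
    """Validation accuracy of a model trained with the given smoothing value."""
    model = train_mnb(train_docs, train_labels, alpha)
    hits = sum(predict(model, d) == y for d, y in zip(val_docs, val_labels))
    return hits / len(val_docs)


best_params, best_score = grid_search({"alpha": [0.01, 0.1, 1.0]}, evaluate)
```

With $\alpha = 0$ the word "udaas" would have zero probability under class 1 and its logarithm would be undefined, which is the situation the pseudocount prevents. On this toy data every candidate scores equally, so the sketch simply keeps the first one it tried.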

5. Implementation

The model is an application of supervised machine learning; the user only needs to deploy it and collect the result. Deployment requires one basic interaction, in which the application asks for the keys and tokens needed to access the data source (for Twitter, the access token, access token secret, consumer key, and consumer secret). After that, the application requires minimal to no intervention from the user until it produces its output. It collects a set of tweets from the data source (Twitter) and feeds them into its core, which contains a trained model that classifies each tweet as depressed or not depressed. The model's hyperparameters are tuned with grid search, which, as described in the previous section, evaluates combinations of parameters and selects the best one. The tuned model is a pipeline of CountVectorizer, TF $*$ IDF, and multinomial Naive Bayes. The model is designed to maintain accuracy across the different types of data provided to it. It can also read Hindi tweets and classify them using its knowledge of the Hinglish terms that are commonly used on social media. After classification, the application reports an accuracy of up to 96.15% (data based on training dataset) and can provide a visual representation of the key lexicons it has encountered throughout the dataframe.

One of the strengths of the implementation is its modular design: each job is assigned to a separate module, and each major module cluster can work independently, without interference from other module clusters. This simplifies implementation and improves the upgradability and readability of the code. A detailed test report for different types of tweets is provided in Table 2.

Table 1

Generated classification report of the model.

	Precision	Recall	F1-score	Support
0	0.9815	0.9464	0.9636	56
1	0.9400	0.9792	0.9592	48
Accuracy			0.9615	104
Macro avg	0.9607	0.9628	0.9614	104
Weighted avg	0.9623	0.9615	0.9616	104
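The figures in Table 1 can be reproduced from a 2×2 confusion matrix. The counts used below are not quoted from the paper; they are inferred here from the reported precision, recall, and support values, and should be read as an assumption. The function name `classification_report` is likewise only illustrative.

```python
def classification_report(matrix):
    """Compute per-class precision, recall, F1 and overall accuracy.

    matrix[t][p] is the number of samples with true label t predicted as p.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    report = {}
    for k in range(n_classes):
        tp = matrix[k][k]
        predicted = sum(matrix[t][k] for t in range(n_classes))
        support = sum(matrix[k])
        precision = tp / predicted
        recall = tp / support
        f1 = 2 * precision * recall / (precision + recall)
        report[k] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": support}
    accuracy = sum(matrix[k][k] for k in range(n_classes)) / total
    return report, accuracy


# Inferred counts (an assumption): rows are true labels 0 and 1.
matrix = [[53, 3], [1, 47]]
report, accuracy = classification_report(matrix)
# round(report[0]["precision"], 4) == 0.9815, round(accuracy, 4) == 0.9615
```

Under these counts, 100 of the 104 evaluation tweets are classified correctly, which matches the reported accuracy of 96.15%.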

6. Experimental Setup

The raw dataset of 670 data points taken from Twitter is a collection of real tweets in Hindi and English. The dataset has been split into two groups: the train set, which provides the training samples, and the development set, which is used to verify the accuracy of each grid search checkpoint. The train set represents around 90% of the data and the development set around 10%. For testing, we train the grid search model several times and choose the one with the highest average development accuracy, as shown in Table 3.
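A split of this kind might look like the sketch below. The text does not say whether the split was stratified or how it was seeded, so the per-class hold-out, the seed, the function name `stratified_split`, and the placeholder rows are all assumptions.

```python
import random


def stratified_split(rows, label_of, dev_fraction=0.1, seed=0):
    """Split rows into (train, dev), holding out dev_fraction of each class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    train, dev = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_dev = round(len(shuffled) * dev_fraction)
        dev.extend(shuffled[:n_dev])
        train.extend(shuffled[n_dev:])
    return train, dev


# Placeholder rows: 520 items with alternating labels, as after undersampling.
rows = [(i, i % 2) for i in range(520)]
train, dev = stratified_split(rows, label_of=lambda row: row[1])
# len(train) == 468 and len(dev) == 52
```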

Table 2

Test cases for testing the model.

Test case ID	Test condition	Tweets	Expected result	Actual result	Status
01	Test MNB (TF $^{*}$ IDF) with grid search (for English not depressed tweet)	Hello hk im soooo happi luv soooo much	Not depressed	Not depressed	Pass

02	Test MNB (TF $^{*}$ IDF) with grid search (for Hindi depressed tweet)	@Rishabverma740 Tere dukh tere he rahenge Phir tu isko suna Ya usko suna Kya farak padta hai #Depressed	Depressed	Depressed	Pass

03	Test MNB (TF $^{*}$ IDF) with grid search (for Hindi not depressed tweet)	Udaas rehne ki wajah to bohot hai life me..!!Par fookat me khush rehne ka maza hi kuch aur hai.!! #happy #sad #life	Not depressed	Depressed	Fail

04	Test MNB (TF $^{*}$ IDF) with grid search (for English not depressed tweet)	Depress start counsel next monthal want happi	Not depressed	Depressed	Pass

05	Test MNB (TF $^{*}$ IDF) with grid search (for English depressed tweet)	I Feel lost inside myself! #illness #lifelessons #useless #depressed#Ignored #worthless #pathetic.	Depressed	Depressed	Pass

06	Test MNB (TF $^{*}$ IDF) with grid search (for Hinglish not depressed tweet)	@sidnaaz_kaHappy birthday preeti diLots of love and prayers! Hamesha khush rehna app!	Not depressed	Not depressed	Pass

7. Results and Discussions

The model, which is a hybrid of MNB, TF $*$ IDF, and grid search, is able to classify tweets as depressed or not depressed with an accuracy of 96.15%. The full classification report of the proposed model is shown in Table 1. The model is trained on the full development set and the scores are computed on the full evaluation set.

Table 3

Model metrics.

Metric	Value derived for proposed model
MAE (mean absolute error)	0.038461538461538464
R2 score	0.85
Log loss	1.3284375420378214
IoU (Jaccard score)	0.9215686274509803
MSE (mean squared error)	0.038461538461538464
RMSE (root mean squared error)	0.19611613513818404
MSLE (mean squared log error)	0.018478962073776976
NAE (normalized absolute error)	0.20450490315512837
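Several entries in Table 3 follow directly from the hard 0/1 predictions. The sketch below recomputes them from label vectors built out of counts inferred from Table 1 (53 and 47 correct predictions, 3 and 1 errors); those counts are an assumption of this sketch, not data quoted from the paper. Log loss and NAE are left out because they depend on details the text does not give.

```python
import math

# Counts inferred from Table 1 (an assumption): 56 true-0 and 48 true-1 samples.
y_true = [0] * 56 + [1] * 48
y_pred = [0] * 53 + [1] * 3 + [0] * 1 + [1] * 47

n = len(y_true)
errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
mae = sum(errors) / n
mse = sum(e * e for e in errors) / n
rmse = math.sqrt(mse)
msle = sum((math.log1p(t) - math.log1p(p)) ** 2
           for t, p in zip(y_true, y_pred)) / n

# Coefficient of determination: 1 - SS_res / SS_tot.
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

# Jaccard score for label 1: TP / (TP + FP + FN).
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
jaccard = tp / (tp + fp + fn)
```

Because the labels are binary, every error has magnitude 1, which is why MAE and MSE coincide at 4/104 in the table.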

When MNB, TF $*$ IDF, and grid search were applied to the dataset, TF $*$ IDF gave the best results. The models were trained, tested, and validated with a batch size of 500, 20 epochs, a dropout rate of 0.4 for every network, a vocabulary size of 5000, 32 hidden layers for every DL model, and an embedding size of 60. The evaluation split was tested with 90%, 80%, and 70% of the data used for training, with the remainder divided equally between testing and validation.

After training, evaluation measures are applied to check how the model is performing. The following evaluation parameters are used:

(i) Accuracy score

(ii) Confusion matrix with plot

(iii) ROC-AUC Curve

Accuracy: in terms of accuracy, MNB (TF $*$ IDF)-grid search performs better than Char-LSTM, Subword-LSTM, and CNN-BiLSTM.

F1-score: MNB (TF $*$ IDF)-grid search (F1-score = 0.914) > Subword-LSTM (F1-score = 0.658) > CNN-BiLSTM (F1-score = 0.556) > Char-LSTM (F1-score = 0.511).

The model has been evaluated against several metrics that compare the model’s predictions with the known values of the dependent variable in the dataset. Table 3 lists the model metrics derived for the classification model.

The proposed model's metrics, specifically accuracy and F1-score, have also been compared with those of preexisting works; the results of this comparison are shown in Table 4.

Table 4

Comparison between existing models and proposed model.

Method	Reported in	Accuracy (%)	F1-score
Char-LSTM	Joshi, A. et al. (2016)	59.8	0.511
Subword-LSTM	Joshi, A. et al. (2016)	69.7	0.658
CNN-BiLSTM	Garg, N., and Sharma, K. (2020)	83.21	0.556
MNB (TF $^{*}$ IDF)-grid search	Proposed	96.15	0.914

Figure 3 and Figure 4 represent the ROC curve and precision-recall curve obtained for the proposed model, respectively, and Figure 5 represents the confusion matrix of the model.

[figure(s) omitted; refer to PDF]

8. Conclusion and Future Enhancement

The proposed model helps to identify depressed individuals within a large data pool quickly, with minimal changes and hardly any human intervention. Another distinguishing factor of the proposed model is that it is able to classify tweets written in English, Hindi, and Hinglish. Because the entire architecture works over English and Hindi, it can be implemented globally, especially in India, and across multiple platforms. This may help to address the ever-increasing depression rates in an automated manner.

This work can be readily upgraded into an interactive bot. Such a bot would adapt to the depressed person and help them express themselves, encouraging people to spend time working on their mental health through regular conversation with the bot. The work can also be extended to include several other Indian languages.

References

[1] World Health Organization, "WHO Director-General’s opening remarks at the media briefing on COVID," March 2020. https://www.who.int/dg/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19-11-march-2020

[2] P. A. Cavazos-Rehg, M. J. Krauss, S. Sowles, S. Connolly, C. Rosas, M. Bharadwaj, L. J. Bierut, "A content analysis of depression-related tweets," Computers in Human Behavior, vol. 54, pp. 351-357, DOI: 10.1016/j.chb.2015.08.023, 2016.

[3] U. Chawda, S. K. Rakesh, "Implementation and Analysis of Depression Detection Model Using Emotion Artificial Intelligence," International Journal of Computer Sciences and Engineering, vol. 7, DOI: 10.26438/ijcse/v7i4.912, 2019.

[4] M. R. Islam, M. A. Kabir, A. Ahmed, A. R. M. Kamal, H. Wang, A. Ulhaq, "Depression detection from social network data using machine learning techniques," Health Information Science and Systems, vol. 6 no. 1, DOI: 10.1007/s13755-018-0046-0, 2018.

[5] M. Trotzek, S. Koitka, C. M. Friedrich, "Utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences," IEEE Transactions on Knowledge and Data Engineering, vol. 32 no. 3, pp. 588-601, DOI: 10.1109/TKDE.2018.2885515, 2018.

[6] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, R. Passonneau, "Sentiment analysis of twitter data," pp. 30-38.

[7] A. Joshi, A. Prabhu, M. Shrivastava, V. Varma, "Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text," Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2482-2491.

[8] S. Sharma, S. Sharma, "Analyzing the depression and suicidal tendencies of people affected by COVID-19’s lockdown using sentiment analysis on social networking websites," Journal of Statistics & Management Systems, vol. 24, DOI: 10.1080/09720510.2020.1833453, 2020.

[9] A. Ziani, N. Azizi, D. Schwab, M. Aldwairi, N. Chekkai, D. Zenakhra, S. Cheriguene, "Recommender system through sentiment analysis," 2nd International Conference on Automatic Control, Telecommunications and Signals, hal-01683511.

[10] N. Garg, K. Sharma, "Annotated corpus creation for sentiment analysis in code-mixed Hindi-English (Hinglish) social network data," Indian Journal of Science and Technology, vol. 13 no. 40, pp. 4216-4224, DOI: 10.17485/ijst/v13i40.1451, 2020.

[11] D. M. E.-D. M. Hussein, "A survey on sentiment analysis challenges," Journal of King Saud University - Engineering Sciences, vol. 30 no. 4, pp. 330-338, DOI: 10.1016/j.jksues.2016.04.002, 2018.

[12] S. Liao, J. Wang, R. Yu, K. Sato, Z. Cheng, "CNN for situations understanding based on sentiment analysis of twitter data," Procedia Computer Science, vol. 111, pp. 376-381, DOI: 10.1016/j.procs.2017.06.037, 2017.

[13] X. Wang, C. Zhang, Y. Ji, L. Sun, L. Wu, Z. Bao, "A depression detection model based on sentiment analysis in micro-blog social network," Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 201-213, DOI: 10.1007/978-3-642-40319-4_18, 2013.

[14] B. Verma, S. Gupta, L. Goel, "A neural network based hybrid model for depression detection in twitter," pp. 164-175, DOI: 10.1007/978-981-15-6634-9_16, 2020.

[15] Z. Li, Y. Fan, B. Jiang, T. Lei, W. Liu, "A survey on sentiment analysis and opinion mining for social multimedia," Multimedia Tools and Applications, vol. 78 no. 6, pp. 6939-6967, DOI: 10.1007/s11042-018-6445-z, 2019.

[16] N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, N. A. Smith, "Linguistic knowledge and transferability of contextual representations," DOI: 10.18653/v1/n19-1112, 2019.

[17] A. A. Maksutov, V. I. Zamyatovskiy, V. N. Vyunnikov, A. V. Kutuzov, "Knowledge base collecting using natural language processing algorithms," pp. 405-407, DOI: 10.1109/eiconrus49466.2020.9039303.

[18] T. Shen, J. Jia, G. Shen, F. Feng, X. He, H. Luan, J. Tang, T. Tiropanis, T. S. Chua, W. Hall, "Cross-domain depression detection via harvesting social media," Proceedings of the International Joint Conferences on Artificial Intelligence, pp. 1611-1617, DOI: 10.24963/ijcai.2018/223.

[19] G. Singh, B. Kumar, L. Gaur, A. Tyagi, "Comparison between multinomial and Bernoulli naïve Bayes for text classification," pp. 593-596, DOI: 10.1109/ICACTM.2019.8776800.

[20] S. Rana, "HinglishNLP [Source code]," 2020. https://github.com/TrigonaMinima/HinglishNLP/blob/master/data/assets/stop_hinglish

[21] P. C. Bhat, H. B. Prosper, S. Sekmen, C. Stewart, "Optimizing event selection with the random grid search," Computer Physics Communications, vol. 228, pp. 245-257, DOI: 10.1016/j.cpc.2018.02.018, 2018.

Copyright © 2022 Carmel Mary Belinda M J et al. This is an open access article distributed under the Creative Commons Attribution License (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. https://creativecommons.org/licenses/by/4.0/

Abstract

According to recent studies, young adults in India faced mental health issues due to closures of universities and loss of income, including low self-esteem and distress, and reported symptoms of anxiety and/or depressive disorder (43%). This makes it high time to come up with a solution. A new classifier is proposed to find individuals who might have depression based on their tweets on the social media platform Twitter. The proposed model is based on linguistic analysis and text classification by calculating probability using TF $*$ IDF (term frequency-inverse document frequency). Indians tend to tweet predominantly in English, Hindi, or a mix of these two languages (colloquially known as Hinglish). In the proposed approach, data are collected from Twitter and screened by passing them through a classifier built using the multinomial Naive Bayes algorithm and grid search, the latter being used for hyperparameter optimization. Each tweet is classified as depressed or not depressed. The entire architecture works over English and Hindi, which should help in implementing it globally and across multiple platforms and in addressing the ever-increasing depression rates in a methodical and automated manner. The proposed pipeline combines these techniques to obtain better results, attaining 96.15% accuracy and an F1 score of 0.914.

Details

Title

Linguistic Analysis of Hindi-English Mixed Tweets for Depression Detection

Author

Carmel Mary Belinda M J¹; Ravikumar, S¹; Arif, Muhammad²; Dhilip, Kumar V¹; Antony, Kumar K¹; Arulkumaran, G³

¹ Department of Computer Science & Engineering, Vel Tech Rangarajan Dr Sagunthala R and D Institute of Science and Technology, Chennai, India
² Department of Computer Science and Information Technology, University of Lahore, Lahore, Pakistan
³ Department of Electrical and Computer Engineering, Bule Hora University, Bule Hora, Ethiopia

Editor

Naeem Jan

Publication year

2022

Publication date

2022

Publisher

John Wiley & Sons, Inc.

ISSN

23144629

e-ISSN

23144785

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1155/2022/3225920

ProQuest document ID

2653908398
