This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Living in a society, we humans communicate with each other to share our thoughts, feelings, ideas, and different types of information. We use different communication media, such as text messages, emails, audio, and video, to express ourselves to others. In addition, people nowadays use a variety of emojis along with text messages to represent their feelings more precisely. Without any doubt, however, speech is the most natural and easiest form of communication for expressing ourselves.
In recent years, interaction with computing devices has become more conversational. Dialogue systems like Siri, Alexa, Cortana, and many more have penetrated the consumer market more widely than before [1]. Thus, to make them behave more like a human conversational partner, it is important to recognize human emotions from the user’s voice signals. Understanding the emotional state of a speaker is important for perceiving the exact meaning of what he or she says. Therefore, research on automatic speech emotion recognition (SER), the task of predicting the emotional state of humans from speech signals, has emerged in recent years, as it enhances human-computer interaction (HCI) systems and makes them more natural. Moreover, with the world being digitized day by day, speech emotion recognition has found increasing applications in our daily lives: call centers, e-tutoring, surveillance systems, psychological treatment, robotics, and online marketing are just some of them.
Cross-lingual speech emotion recognition studies have shown that models trained on a corpus in one language do not perform as well when tested on a corpus in a different language as they do in the monolingual setting [2, 3]. However, it is worth investigating whether such models perform better across different languages from the same language group. The first step in such a study is to identify the available resources for the target language group. Hence, this study investigates recent advancements in SER for the Indo-Aryan and Dravidian language families. Indo-Aryan and Dravidian languages are spoken by about 800 million and 250 million people worldwide, respectively [4, 5]. Speakers of Indo-Aryan languages are mostly from Bangladesh, India, Nepal, Sri Lanka, and Pakistan, while speakers of Dravidian languages are mainly from southern India. Despite their large numbers of speakers, most of these are low-resource languages. So far, there is no review work that highlights SER experiments for the Indo-Aryan or Dravidian language groups. Therefore, this study presents a brief review of work done on the development of SER for languages of the Indo-Aryan and Dravidian families.
The remainder of the paper is organized as follows: Section 2 gives a brief overview of a speech emotion recognition system, including the different types of emotional speech corpora, features, and classification algorithms utilized for the development of an SER system. Trends in speech emotion recognition research are discussed in Section 3. Section 4 discusses research works on SER in different languages over the last two decades. Section 5 shows the advancement of SER work in Indo-Aryan and Dravidian languages, and, lastly, the study is concluded in Section 6.
2. Overview of Speech Emotion Recognition System
A speech emotion recognition (SER) system analyzes human speech and predicts the emotion reflected by the speech. The system that recognizes the emotion from speech may be dependent on or independent of the speaker and gender. Comparatively, the recognition accuracy of a speaker-dependent system is higher than that of a speaker-independent system, but the disadvantage of this strategy is that the system responds appropriately only to the speakers it was trained on. As reflected in Figure 1, the first requirement for building an SER system is a suitable speech dataset covering different emotional states. For this purpose, raw speech data are collected from speakers in a variety of ways. Based on how the corpus is generated, emotional speech databases may be natural [6–10], acted [11–13], or elicited [14, 15]. Table 1 summarizes some prominent databases used for SER.
[figure(s) omitted; refer to PDF]
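As a concrete illustration of the difference between speaker-dependent and speaker-independent evaluation described above, the minimal Python sketch below contrasts a random utterance-level split with a split grouped by speaker identity. The feature matrix, emotion labels, and speaker IDs are synthetic placeholders, not data from any corpus listed in Table 1.

```python
# Minimal sketch of speaker-dependent vs. speaker-independent splits.
# All arrays are synthetic placeholders for illustration only.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
n_utterances = 200
features = rng.normal(size=(n_utterances, 40))      # e.g., 40-dim acoustic features
labels = rng.integers(0, 4, size=n_utterances)      # 4 emotion classes
speakers = rng.integers(0, 10, size=n_utterances)   # 10 speaker identities

# Speaker-dependent: utterances of the same speaker may land in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0)

# Speaker-independent: whole speakers are held out, so test speakers
# are never seen during training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(features, labels, groups=speakers))
```

Because the group-based split withholds every utterance of the test speakers, speaker-independent accuracies reported in the literature are usually lower than their speaker-dependent counterparts.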
Table 1
Summary of some commonly used databases for speech emotion recognition.
S/N | Database | Language | Year | Size of database | Data format | Emotions |
1 | RAVDESS [16] | English | 2018 | 7356 recordings by 24 actors | Audio-visual | Happy, angry, calm, sad, fearful, disgust, surprise, neutral |
2 | EmoDB [11] | German | 2005 | 800 recordings by 10 actors | Audio | Joy, boredom, fear, sadness, anger, disgust, neutral |
3 | IEMOCAP [14] | English | 2008 | 12 hours of data by 10 actors | Audio-visual | Happiness, anger, sadness, frustration, neutral |
4 | SUBESCO [12] | Bangla | 2021 | 7 hours of recordings containing 7,000 utterances by 20 native speakers | Audio | Happiness, anger, sadness, fear, surprise, disgust, neutral |
5 | FAU AIBO [6] | German | 2008 | 9.2 hours of speech by 51 children talking to robot Aibo | Audio | Angry, emphatic, neutral, positive, rest |
6 | BanglaSER [17] | Bangla | 2022 | 1,467 recording by 34 actors | Audio | Angry, happy, neutral, sad, surprise |
7 | MELD [7] | English | 2018 | 13,000 utterances from the TV series Friends by multiple speakers | Audio-visual and textual | Anger, fear, joy, surprise, sadness, disgust, neutral |
8 | AESDD [18] | Greek | 2018 | 500 utterances from 5 different actors | Audio | Anger, disgust, fear, happiness, sadness |
9 | SAVEE [13] | English | 2014 | 480 British English utterances by 4 male actors | Audio-visual | Disgust, fear, anger, happiness, sadness, surprise, neutral |
10 | RECOLA [19] | French | 2013 | 3.8 hours of recordings by 46 participants | Audio-visual | Social behaviors (agreement, dominance, engagement, performance, rapport) |
11 | CHEAVD [8] | Chinese | 2017 | 140 min emotional segments extracted from talk shows, TV plays, and films by 238 speakers | Audio-visual | Angry, fear, happy, neutral, sad, surprise |
12 | LSSED [20] | English | 2020 | 147,025 utterances from 820 subjects | Audio | Happiness, fear, anger, excitement, sadness, boredom, disappointment, disgust, surprise, normal and other |
13 | Urdu [2] | Urdu | 2018 | 400 utterances by 38 speakers from Urdu talk shows | Audio | Angry, happy, neutral, sad |
14 | Urdu-Sindhi speech emotion corpus [21] | Urdu, Sindhi | 2020 | 1435 speech recordings | Audio | Happiness, anger, sadness, disgust, surprise, sarcasm, neutral |
15 | IITKGP-SEHSC [22] | Hindi | 2011 | 12000 utterances by 10 professionals | Audio | Happy, anger, fear, disgust, surprise, sad, sarcastic, neutral |
16 | KSUEmotions [23] | Arabic | 2017 | 5 hours and 10 minutes of recordings by 23 speakers | Audio | Neutral, happiness, sadness, surprise, anger |
17 | Oriya emotional speech database [15] | Oriya | 2010 | 900 emotional utterances for text fragments from various drama scripts of Oriya language by 35 Oriya speakers | Audio | Happiness, sadness, astonish, anger, fear, neutral |
18 | Mandarin Chinese emotional speech database [24] | Mandarin | 2008 | 3,400 emotional speech utterances by 18 males and 16 females | Audio | Anger, happiness, sadness, boredom, neutral |
19 | CASIA natural emotional audio-visual database [9] | Chinese | 2014 | 2 hours spontaneous emotional segments extracted from 219 speakers from films, TV plays and talk shows | Audio-visual | Happy, angry, disgust, surprise, worried, sad, fear |
20 | Egyptian Arabic speech emotion (EYASE) database [25] | Arabic | 2020 | 579 utterances by 3 male and 3 female professional actors from Egyptian TV series | Audio-visual | Angry, happy, neutral, sad |
21 | Interface multilingual emotional speech database [26] | English, French, Spanish and Slovenian | 2002 | The database contains 8,928 English, 6,080 Slovenian, 5,600 French, and 5,520 Spanish utterances by 2 actors | Audio | Joy, disgust, anger, fear, sadness, surprise, neutral |
22 | Toronto emotional speech set (TESS) [27] | English | 2010 | 2,800 utterances by 2 actresses | Audio | Happiness, anger, sadness, disgust, pleasant surprise, fear, neutral |
23 | ANAD [28] | Arabic | 2018 | 1,384 recordings from Arabic talk shows | Audio | Happy, angry, surprised |
24 | Multilingual emotional speech database of north east India (MESDNEI) [29] | Assamese | 2009 | 4,200 utterances of 5 native languages of Assam by 30 speakers | Audio | Sadness, anger, fear, disgust, happiness, surprise, neutral |
25 | ShEMO [10] | Persian | 2019 | Speech data of 3 hours and 25 minutes from online radio plays | Audio | Anger, fear, happiness, sadness, surprise, neutral |
26 | SEMOUR+: a scripted emotional speech repository for Urdu [30] | Urdu | 2021 | 27,640 instances recorded by 24 actors | Audio | Anger, happiness, surprise, disgust, sadness, boredom, fearful, neutral |
27 | Arabic natural corpus [28] | Arabic | 2018 | 1,384 recordings from online Arabic talk shows | Audio | Angry, surprised, happy |
28 | EMOVO [31] | Italian | 2014 | 588 utterances of 14 sentences by 6 actors | Audio | Joy, sadness, anger, disgust, fear, surprise, neutral |
29 | Punjabi emotional speech database [32] | Punjabi | 2021 | 900 utterances by 15 speakers | Audio | Happy, angry, sad, fear, surprise, neutral |
30 | IITKGP-SESC [33] | Telugu | 2009 | 12,000 utterances by 10 professionals | Audio | Happy, anger, sad, disgust, fear, sarcastic, surprise, neutral |
Once the data are collected, the raw speech data go through preprocessing techniques such as noise reduction, silence removal, framing, windowing, and normalization to enhance the speech signal [34]. After preprocessing, the system moves to the feature extraction phase, which analyzes the speech signal and obtains different speech characteristics. Any machine learning model’s success depends largely on its features: selecting the right features can result in a more effective trained model, whereas choosing the wrong ones can significantly impede training. The selection of proper signal features is therefore crucial for better performance in recognizing the emotion of speech. From the beginning of SER research, various combinations of acoustic speech features, such as Mel-frequency cepstral coefficients (MFCCs), pitch, zero-crossing rate (ZCR), energy, and linear predictive cepstral coefficients (LPCCs), have been used [35]. In various studies, nonspeech characteristics, called nonacoustic features, have also been integrated with the acoustic ones for the identification of emotion [36, 37]; gestures, facial images, videos, and linguistic features are some of them.
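As an illustration of the preprocessing and feature extraction stages, the sketch below trims silence, normalizes the signal, and computes a few of the acoustic features mentioned above with librosa. The file name, sampling rate, and frame settings are assumed placeholder choices rather than the configurations used in any of the surveyed studies.

```python
# Hedged sketch of acoustic feature extraction for SER with librosa.
# "speech.wav" and all frame settings are illustrative placeholders.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)

# Basic preprocessing: trim leading/trailing silence and peak-normalize.
y, _ = librosa.effects.trim(y, top_db=25)
y = y / (np.max(np.abs(y)) + 1e-8)

frame = int(0.025 * sr)   # 25 ms analysis window
hop = int(0.010 * sr)     # 10 ms hop between frames

# Frame-level acoustic features: MFCCs, zero-crossing rate, energy, pitch.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)
energy = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr, hop_length=hop)

# A simple utterance-level representation: mean and standard deviation
# of each frame-level feature, concatenated into one vector.
def stats(x):
    x = np.atleast_2d(x)
    return np.concatenate([x.mean(axis=1), x.std(axis=1)])

feature_vector = np.concatenate([stats(mfcc), stats(zcr), stats(energy), stats(f0)])
```

Frame-level features like these are either fed directly to sequence models or, as here, summarized into utterance-level statistics before classification.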
After the feature selection process, a classifying algorithm is applied to recognize the speech emotion. Many classification algorithms have been used by researchers for the recognition of emotion from voice signals, and a variety of supervised and unsupervised machine learning models have been employed for this purpose. The hidden Markov model (HMM), support vector machine (SVM), Gaussian mixture model (GMM), K-nearest neighbor (KNN), artificial neural network (ANN), and decision tree (DT) are some of them. In recent years, along with the traditional classification methods, several deep learning techniques have also been utilized for the classification process and have shown promising results; convolutional neural networks (CNNs), long short-term memory (LSTM), deep CNNs, and recurrent neural networks (RNNs) are the commonly used ones. In many SER studies, multiple classifiers are integrated to enhance the recognition rate. Zhu et al. [38] combined two classifiers, a deep belief network (DBN) and a support vector machine (SVM), to classify the emotions of anger, fear, happiness, sadness, neutrality, and surprise in the Chinese Academy of Sciences emotional speech database. They used MFCC, pitch, formant, short-term ZCR, and short-term energy features and achieved a mean accuracy of 95.8%, which is better than using SVM or DBN individually.
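The short sketch below illustrates the conventional route just described: an SVM trained on utterance-level feature vectors. The feature matrix and labels are random stand-ins for the output of an extraction step such as the one above, so the printed accuracy is meaningless; the point is only the shape of the pipeline.

```python
# Hedged sketch of a conventional SER classifier (SVM on utterance features).
# X and y are synthetic placeholders, not data from any real corpus.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 30))        # 300 utterances, 30-dim feature vectors
y = rng.integers(0, 4, size=300)      # 4 emotion classes, e.g., angry/happy/sad/neutral

# Standardize features, then fit an RBF-kernel SVM; evaluate with 5-fold CV.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

The same pipeline shape applies when the SVM is replaced with a GMM, KNN, or decision tree, or when several such classifiers are combined into an ensemble as in [38].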
3. Speech Emotion Recognition Trends
The very first approach to determining the emotional state of a person from his or her speech was made in the late 1970s by Williamson [39], who devised a speech analyzer that determined an individual’s underlying emotion by analyzing pitch or frequency changes in the speech pattern. Later, in 1996, Dellaert et al. [40] published the first research paper on the topic and introduced statistical pattern recognition techniques to speech emotion recognition. Dellaert et al. [40] implemented K-nearest neighbors (KNN), kernel regression (KR), and maximum likelihood Bayes (MLB) classifiers using the pitch characteristics of the utterances to recognize four emotions: happiness, fear, anger, and sadness. Along with MLB and nearest neighbor (NN), Kang et al. [41] implemented the hidden Markov model (HMM), with HMM performing best at 89.1% accuracy for recognizing happiness, sadness, anger, fear, boredom, and neutral emotions using pitch and energy features. Since then, HMM has been widely used by researchers for speech emotion recognition with satisfactory results [42–45]. SVM, GMM, and decision tree (DT) are some more traditional machine learning models that have been used reliably over the years for the same purpose [45–50]. In the 2000s, neural networks (NNs) were also widely used in speech emotion recognition studies [51–54]. Indeed, in the earlier approaches, the use of conventional machine learning algorithms was widespread for recognizing the underlying emotion in human speech.
Over the last decade, however, the trend has shifted from conventional machine learning models towards deep learning models, which have become more popular and have shown promising results. Deep learning algorithms are neural networks with multiple layers. CNN, DCNN, LSTM, BLSTM, and RNN are some widely implemented deep learning techniques for SER [55–57].
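As a simplified illustration of this family of models, the PyTorch sketch below stacks a small CNN over log-mel spectrogram inputs and a bidirectional LSTM over the resulting time axis. The layer sizes, input shape, and seven-class output are illustrative assumptions, not the configuration of any specific system cited in this survey.

```python
# Hedged sketch of a CNN + BLSTM architecture for SER on spectrogram input.
# All layer sizes and the 7-class output are illustrative assumptions.
import torch
import torch.nn as nn

class CNNLSTMSER(nn.Module):
    def __init__(self, n_mels=40, n_classes=7):
        super().__init__()
        # 2D convolutions over the (frequency, time) plane of a log-mel spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Bidirectional LSTM over the time axis of the pooled feature maps.
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 4), hidden_size=64,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, n_classes)

    def forward(self, spec):                     # spec: (batch, 1, n_mels, time)
        h = self.conv(spec)                      # (batch, 32, n_mels/4, time/4)
        h = h.permute(0, 3, 1, 2).flatten(2)     # (batch, time/4, 32 * n_mels/4)
        out, _ = self.lstm(h)
        return self.fc(out[:, -1, :])            # classify from the last time step

model = CNNLSTMSER()
dummy = torch.randn(8, 1, 40, 200)               # batch of 8 synthetic spectrograms
logits = model(dummy)                            # (8, 7) emotion class scores
```

In practice, such a network is trained with cross-entropy loss on batches of labeled spectrograms, and attention pooling is often substituted for the last-time-step readout used here.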
In recent times, multitask learning and attention mechanisms are also being used for improved performance [58, 59]. For cross-corpus and cross-lingual speech emotion recognition, transfer learning is widely used [3, 60, 61].
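A hedged sketch of the transfer-learning idea is given below, reusing the illustrative CNNLSTMSER model defined in the previous sketch: the feature extractor trained on a source-language corpus is frozen, and only a new classification head is fine-tuned on the target language. The checkpoint path and the four-class target label set are hypothetical.

```python
# Hedged sketch of cross-lingual transfer learning for SER.
# Builds on the illustrative CNNLSTMSER class from the previous sketch;
# the checkpoint path and the target emotion set are hypothetical.
import torch
import torch.nn as nn

source_model = CNNLSTMSER(n_classes=7)
# source_model.load_state_dict(torch.load("source_language_ser.pt"))  # hypothetical checkpoint

# Freeze the convolutional and recurrent layers learned on the source corpus.
for p in source_model.conv.parameters():
    p.requires_grad = False
for p in source_model.lstm.parameters():
    p.requires_grad = False

# Replace the output head to match the target corpus's emotion set and
# fine-tune only the parameters that remain trainable.
source_model.fc = nn.Linear(2 * 64, 4)           # e.g., angry/happy/sad/neutral
optimizer = torch.optim.Adam(
    (p for p in source_model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

target_specs = torch.randn(8, 1, 40, 200)        # synthetic target-language batch
target_labels = torch.randint(0, 4, (8,))
optimizer.zero_grad()
loss = criterion(source_model(target_specs), target_labels)
loss.backward()
optimizer.step()
```

Freezing more or fewer layers, or fine-tuning the whole network with a small learning rate, are common variations of the same idea.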
Figure 2 depicts an analysis that shows that the use of deep learning techniques like CNN, RNN, LSTM, and DBN has increased over the years, along with traditional machine learning algorithms like SVM, DT, KNN, HMM, and GMM.
[figure(s) omitted; refer to PDF]
4. Survey on Speech Emotion Recognition Research Studies
After the first research paper on speech emotion recognition was published in 1996, the field of SER received a great deal of attention, and moderate progress has been made over the past 20 years toward creating an automatic SER system. Several acoustic and nonacoustic features have been utilized along with different classifying models. Comparatively, the number of SER experiments conducted for English, German, and French is higher than for other languages. One main reason is the availability of established and publicly accessible databases for these languages: RAVDESS, IEMOCAP, and SAVEE are some prominent emotional speech databases for English, the Berlin EmoDB and FAU AIBO for German, and RECOLA for French. The IEMOCAP database was used by researchers in [56, 58, 59, 62–64] for speech emotion recognition. Fayek et al. [56] evaluated deep learning techniques with CNN and LSTM-RNN using the IEMOCAP database and achieved 64.78% and 61.71% test accuracy for CNN and LSTM-RNN, respectively. Implementing a spectrogram-based self-attentional CNN-BLSTM, Li et al. [58] obtained a weighted accuracy of 81.6% and an unweighted accuracy of 82.8% on the IEMOCAP dataset for classifying angry, happy, neutral, and sad emotions. Using BLSTM with an attention mechanism, Yu and Kim [59] reported a weighted accuracy of 73% and an unweighted accuracy of 68% for the IEMOCAP corpus. Meng et al. [62] used an attention mechanism-based dilated CNN with residual blocks and BiLSTM for both IEMOCAP and the Berlin EmoDB and obtained 74.96% speaker-dependent and 69.32% speaker-independent accuracy for IEMOCAP, and 90.78% speaker-dependent and 85.39% speaker-independent accuracy for the Berlin EmoDB.
A combination of prosodic and modulation spectral features (MSFs) with an SVM classifier was implemented by Wu et al. [65] for the Berlin EmoDB, yielding a recognition rate of 91.6%. An improved recognition rate of 96.97% was achieved with a deep convolutional neural network (DCNN) on the Berlin EmoDB for the recognition of angry, neutral, and sad emotions [66]. For the Chinese language, Zhang et al. [67] employed SVM and a deep belief network (DBN) with MFCC, pitch, and formant features and obtained 84.54% mean accuracy with SVM and 94.6% with DBN on the Chinese Academy of Sciences emotional speech database. A higher mean accuracy of 95.8% was achieved for the same Chinese dataset in [38] by combining a deep belief network (DBN) with a support vector machine (SVM).
Experiments have also been conducted on cross-lingual speech emotion recognition. Sultana et al. [3] presented a cross-lingual study of English and Bangla using the RAVDESS and SUBESCO datasets, respectively, with a proposed system that integrates a deep CNN and a BLSTM network with a TDF layer. Transfer learning was used for the cross-lingual experiment, achieving weighted accuracies of 86.9% for SUBESCO and 82.7% for RAVDESS. Latif et al. [2] used an SVM classifier for cross-lingual emotion recognition across Urdu, German, English, and Italian, evaluating the cross-corpus study on the SAVEE, EmoDB, EMOVO, and URDU databases for English, German, Italian, and Urdu, respectively. Xiao et al. [68] investigated cross-lingual emotion recognition from speech using the EmoDB, DES, and CDESD databases for German, Danish, and Mandarin, respectively. Using CDESD as the training set and EmoDB as the test set, the authors achieved the best cross-corpus accuracy of 71.62% with a sequential minimal optimization (SMO) classifier. The IEMOCAP and RECOLA databases were used by Neumann [69] for a cross-lingual study of English and French with an attentive convolutional neural network (ACNN); an unweighted average recall of 59.32% was achieved on the IEMOCAP test set when training on RECOLA, and 61.27% on RECOLA when training on IEMOCAP. A cross-lingual, cross-corpus study was carried out for four languages, German, Italian, English, and Mandarin, by Goel and Beigi [70]; transfer learning and multitask learning techniques provided accuracies of 32%, 51%, and 65% for the EMOVO, SAVEE, and EmoDB databases, respectively, using IEMOCAP as the training database.
Apart from using the available prominent databases, researchers are also creating emotional speech corpora from acted, elicited, or natural recordings and experimenting with various classification models for identifying speech emotion. A multilingual database containing 720 utterances by 12 native Burmese and Mandarin speakers was built by Nwe et al. [43]. Using the short-time log frequency power coefficients (LFPC) feature, the authors implemented an HMM classifier that classifies six emotions, namely, anger, disgust, fear, joy, sadness, and surprise, with an average accuracy of 78% and a best accuracy of 96%.
5. Advancement of Speech Emotion Recognition in Indo-Aryan and Dravidian Languages
Indo-Aryan languages, also known as Indic languages, are the native languages of the Indo-Aryan peoples and form a branch of the Indo-Iranian languages within the Indo-European language family. An estimate made at the beginning of the 21st century shows that more than 800 million people, mostly in India, Bangladesh, Sri Lanka, Nepal, and Pakistan, speak Indo-Aryan languages [4]. Hindi, Bangla, Sinhala, Urdu, Punjabi, Assamese, Nepali, Marathi, Odia, Gujarati, Sindhi, Rajasthani, and Chhattisgarhi are some prominent Indo-Aryan languages. The Dravidian (or Dravidic) languages are spoken by 250 million people, primarily in southern India, southwest Pakistan, and northeastern Sri Lanka [5]. Tamil, Malayalam, Telugu, and Kannada are the most spoken Dravidian languages. Although a lot of work on speech emotion recognition has been conducted for English, German, Chinese, Mandarin, and French, the number of experiments for the Indo-Aryan and Dravidian languages is comparatively small. The inadequacy of available resources and variation in the nature of the languages are some reasons for this. However, in the last decade, improvement has been seen in speech emotion recognition research for both language families. Figure 3 shows an analysis of research works done for some of the languages.
[figure(s) omitted; refer to PDF]
5.1. Emotional Speech Databases for Indo-Aryan and Dravidian Languages
Some established and validated emotional speech corpora are available for some of these languages. Hindi is the most spoken Indo-Aryan language in terms of native speakers. The IITKGP-SESC, the Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus, developed by a team at the Indian Institute of Technology Kharagpur in 2009, is the first such corpus in Telugu, an Indian language [33]. The corpus contains 12,000 emotional speech utterances in Telugu, with happiness, surprise, anger, disgust, sadness, fear, sarcasm, and neutral emotions expressed by ten speakers.
Afterward, since emotions are language independent, Koolagudi et al. [22] felt the need for a speech corpus in other Indian languages and created the Indian Institute of Technology Kharagpur Simulated Emotion Hindi Speech Corpus (IITKGP-SEHSC) in the Hindi language. The database contains 12,000 utterances of Hindi speech recorded by ten professional FM radio artists in India. Eight emotions, namely, happiness, sadness, surprise, anger, sarcasm, fear, disgust, and neutral, are present in the database.
A publicly available speech emotion corpus exists for the Urdu language, containing 400 utterances by 38 speakers from different Urdu talk shows, annotated with the emotions of anger, happiness, neutrality, and sadness [2]. Asghar et al. [71] built a corpus comprising 2,500 emotional speech utterances by 20 speakers covering sadness, anger, disgust, happiness, and neutrality.
SUBESCO, the SUST Bangla Emotional Speech Corpus, is the largest available emotional speech corpus for the Bangla language, consisting of more than 7 hours of speech with 7,000 utterances [12]. Happiness, surprise, anger, sadness, disgust, fear, and neutrality are the emotional states present in the database.
Mohanty and Swain [15] developed an Oriya emotional speech corpus for the Oriya language having six emotion classes, namely, happiness, anger, fear, sadness, astonishment, and neutrality.
For the Assamese language, there exists an emotional speech corpus containing utterances in five native Assamese languages, namely, Assamese, Karbi, Bodo (or Boro), Missing (or Mishing), and Dimasa [29].
A Punjabi speech database was created by Kaur and Singh [32], consisting of 900 emotional speech utterances by 15 speakers. Happiness, fear, anger, surprise, sadness, and neutral are the six emotions present in the database.
Kannada emotional speech (KES) database developed by Geethashree and Ravi [72] contains acted emotional utterances in the local languages of Karnataka. The database includes the basic emotions of happiness, sadness, anger, and fear, with a neutral state by four native Kannada actors.
A Malayalam elicited emotional speech corpus for recognizing human emotion from speech was built by Jacob [73]. The database consists of 2,800 speech recordings covering the six basic emotions and a neutral state, produced by ten educated, urban, native female Malayalam speakers.
Apart from these corpora, there are many more small speech databases created for emotion recognition purposes in Indo-Aryan and Dravidian languages [74–77].
5.2. Speech Emotion Recognition for Indo-Aryan and Dravidian Languages
Over the last fifteen years, there has been moderate progress in SER research for languages of the Indo-Aryan and Dravidian families. Although the earlier approaches were based on traditional machine learning, in recent times state-of-the-art models have been used by researchers with good performance. After the first large Telugu (IITKGP-SESC) [33] and Hindi (IITKGP-SEHSC) [22] emotional speech databases were published in 2009 and 2011, respectively, many experiments have been carried out for these languages. In 2021, Agarwal and Om [78] used a deep neural network with a deer hunting optimization algorithm and obtained the highest accuracy of 93.75% for the IITKGP-SEHSC dataset. The same model, implemented for the RAVDESS database, outperformed the state of the art with a highest recognition rate of 97.14% [78]. Combining a DCNN and BLSTM, the model proposed by Sultana et al. [3] obtained state-of-the-art performance with 82.7% and 86.9% accuracy for the RAVDESS and SUBESCO databases for the English and Bangla languages, respectively.
Swain et al. [93] in 2022 implemented a deep convolutional recurrent neural network-based ensemble classifier for the Odia and RAVDESS databases, which provides better results than some state-of-the-art models for these databases, giving accuracy rates of 85.31% and 77.54%, respectively. Conventional approaches, alongside deep learning techniques, are thus also showing good performance for these language families. Table 2 summarizes some experiments on speech emotion recognition for Indo-Aryan and Dravidian languages.
Table 2
Review of some speech emotion recognition experiments for Indo-Aryan and Dravidian languages.
S/N | Reference | Database | Language | Approach used | Recognized emotions | Results |
1 | Koolagudi et al. [79] | IITKGP-SESC | Telugu | SVM and GMM with energy and pitch parameters | Happy, anger, fear, disgust, sarcastic, sad, neutral, surprise | 63.75% average accuracy obtained |
2 | Sultana et al. [3] | SUBESCO and RAVDESS | Bangla and English | The system integrates a DCNN and a BLSTM network with a TDF layer | Happy, calm, sad, surprise, fearful, disgust, angry, neutral | For the SUBESCO and RAVDESS datasets, the proposed model has achieved weighted accuracies of 86.9% and 82.7%, respectively |
3 | Kumar and Yadav [80] | IITKGP-SEHSC | Hindi | Deep LSTM with GMFCC and DMFCC features | Happy, fear, angry, sad, neutral | The proposed framework gives average accuracy of 91.2% for male speech and 87.6% for female speech |
4 | Mohanty and Swain [15] | Oriya emotional speech database | Oriya | Fuzzy K-means | Anger, sadness, astonish, fear, happiness, neutral | 65.16% recognition rate by incorporating mean pitch, first two formants, jitter, shimmer, and energy as feature vectors |
5 | Samantaray et al. [48] | MESDNEI | Assamese | SVM with dynamic, quality, derived, and prosodic features | Happy, anger, fear, disgust, surprise, sad, neutral | 82.26% average accuracy rate for speaker-independent case |
6 | Bhavan et al. [81] | EmoDB, RAVDESS and IITKGP-SEHSC | German, English and Hindi | Bagged ensemble of SVMs using MFCCs and spectral centroids | Happy, sad, calm, angry, surprise, fear, disgust, neutral | Obtained accuracy EmoDB: 92.45%, RAVDESS: 75.69% and IITKGP-SEHSC: 84.11% |
7 | Swain et al. [82] | Self-created database using utterances from two native languages of Odisha: Cuttacki and Sambalpuri | Oriya | SVM using MFCC as feature vector | Happiness, fear, anger, disgust, sadness, surprise, neutral | 82.14% recognition accuracy for SVM classifier |
8 | Zaheer et al. [30] | SEMOUR+ | Urdu | Ensemble classifier, CNN combined with VGG-19 model | Anger, disgust, happiness, surprise, boredom, sadness, fearful, neutral | The proposed model achieved 56% speaker-independent recognition rate |
9 | Wankhade et al. [47] | Speech emotional database containing dialogues from different Bollywood movies | Hindi | SVM classifier with MFCC and MEDC feature set | Angry, happy, sad, neutral | 71.66% recognition rate using SVM classifier |
10 | Ali et al. [83] | Self-created speech emotional corpus recorded in 5 regional languages of Pakistan | Urdu, Sindhi, Pashto, Punjabi, and Balochi | Learning classifiers (adaboostM1, J48, classification via regression, decision stump) with prosodic features | Happiness, sad, anger, neutral | 40% classification accuracy with pitch feature |
11 | Ancilin and Milton [84] | Urdu | Urdu | SVM classifier with mel frequency magnitude coefficient (MFMC) | Happy, sad, anger, neutral | 95.25% emotion recognition rate using MFMC |
12 | Farhad et al. [85] | Urdu | Urdu | Neural network, random forest and meta iterative classifiers with pitch and MFCC features | Happy, sad, angry | With an accuracy of 78.75%, random forest outperforms other classifiers |
13 | Darekar and Dhande [86] | Marathi database | Marathi | Adaptive ANN combining cepstral, non-negative matrix factorization (NMF) and pitch features | Happy, sad, angry, fear, neutral, surprised | Proposed model obtains 80% accuracy combining the 3 features |
14 | Koolagudi et al. [87] | IITKGP-SESC | Telugu | SVM and GMM model with epoch parameters were used | Happy, anger, fear, sadness, disgust, neutral | Average recognition rates are 58% and 61% for SVM and GMM, respectively |
15 | Kandali et al. [49] | Self-created acted emotional speech database by 27 speakers | Assamese | GMM classifier with MFCC features | Happy, sad, disgust, fear, angry, surprise, neutral | Highest mean classification score is 76.5% |
16 | Dhar and Guha [88] | Abeg: self-collected Bangla emotional speech dataset | Bangla | Logistic regression model with MFCC and LPC features | Happy, angry, neutral | Proposed model achieved 92% accuracy combining MFCC and LPC features |
17 | Jacob [89] | Hindi emotional speech database containing 2240 wav files collected from 10 speakers | Hindi | ANN model with jitter and shimmer features | Happy, sad, anger, fear, surprise, disgust, neutral | 83.3% overall accuracy obtained combining jitter and shimmer features |
18 | Fernandes and Mannepalli [90] | Acted emotional speech database containing 1400 utterances by 10 actors | Tamil | LSTM and BiLSTM with MFCC, MFCC delta, spectral kurtosis, bark spectrum, and spectral skewness features | Happy, anger, sad, fear, boredom, disgust, neutral | 84% accuracy rate obtained using LSTM and BiLSTM with dropout layers |
19 | Rajisha et al. [91] | Acted emotional dataset created by the authors | Malayalam | ANN and SVM classifier with MFCC, short-time energy, and pitch features | Happy, anger, sad, neutral | 88.4% recognition rate obtained using ANN and 78.2% with SVM |
20 | Kannadaguli and Bhat [92] | Self-created database containing 2800 emotional recordings | Kannada | Bayesian and HMM model with MFCC feature | Happy, excited, angry, sad | Average emotion error rate of 25.5% for Bayesian and 0.2% for HMM approach |
6. Conclusion
Speech emotion recognition being an integral part of HCI, a successful SER system with a healthy level of accuracy is essential for the better performance of human-computer interaction systems. This paper presents a survey of speech emotion recognition research for Indo-Aryan and Dravidian languages. A brief review of 31 research studies, covering the development of emotional speech corpora and the implemented approaches with the utilized features for emotion recognition, has been provided for these language families. In addition, a thorough study of some standard available emotional speech corpora and of research works conducted on the identification of emotional states from human speech in different languages has also been presented. Researchers working in this field may therefore find helpful insights about speech emotion recognition in this study.
[1] B. W. Schuller, "Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends," Communications of the ACM, vol. 61 no. 5, pp. 90-99, DOI: 10.1145/3129340, 2018.
[2] S. Latif, A. Qayyum, M. Usman, J. Qadir, "Cross lingual speech emotion recognition: Urdu vs. western languages," pp. 88-93, DOI: 10.1109/FIT44659.2018, .
[3] S. Sultana, M. Z. Iqbal, M. R. Selim, M. M. Rashid, M. S. Rahman, "Bangla speech emotion recognition and cross-lingual study using deep cnn and blstm networks," IEEE Access, vol. 10, pp. 564-578, DOI: 10.1109/access.2021.3136251, 2022.
[4] Wikipedia contributors, "Indo-Aryan languages," Wikipedia, The Free Encyclopedia, 2022. https://en.wikipedia.org/w/index.php?title=Indo-%20Aryan_languages&oldid=1107172048
[5] Wikipedia contributors, "Dravidian languages," Wikipedia, The Free Encyclopedia, 2022. https://en.wikipedia.org/w/index.php?title=Dravidian_languages&oldid=11%2009158908
[6] A. Batliner, S. Steidl, E. Nöth, "Releasing a Thoroughly Annotated and Processed Spontaneous Emotional Database: The Fau Aibo Emotion Corpus," 2008.
[7] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, R. Mihalcea, "Meld: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations," 2018. http://arXiv.org/abs/1810.02508
[8] Y. Li, J. Tao, L. Chao, W. Bao, Y. Liu, "Cheavd: a Chinese natural emotional audio–visual database," Journal of Ambient Intelligence and Humanized Computing, vol. 8 no. 6, pp. 913-924, DOI: 10.1007/s12652-016-0406-z, 2017.
[9] W. Bao, Y. Li, M. Gu, M. Yang, H. Li, L. Chao, J. Tao, "Building a Chinese natural emotional audio-visual database," pp. 583-587, DOI: 10.1109/ICSP32469.2014, .
[10] O. Mohamad Nezami, P. Jamshid Lou, M. Karami, "Shemo: a large-scale validated database for Persian speech emotion detection," Language Resources and Evaluation, vol. 53,DOI: 10.1007/s10579-018-9427-x, 2019.
[11] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, B. Weiss, "A Database of German Emotional Speech," Proceedings of the Interspeech 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, pp. 1517-1520, DOI: 10.21437/Interspeech.2005-446, .
[12] S. Sultana, M. S. Rahman, M. R. Selim, M. Z. Iqbal, "Sust bangla emotional speech corpus (subesco): an audio-only emotional speech corpus for bangla," PLoS One, vol. 16 no. 4,DOI: 10.1371/journal.pone.0250173, 2021b.
[13] P. Jackson, S. Haq, Surrey Audio-Visual Expressed Emotion (Savee) Database, 2014.
[14] C. Busso, M. Bulut, C. C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan, "Iemocap: interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42 no. 4, pp. 335-359, DOI: 10.1007/s10579-008-9076-6, 2008.
[15] S. Mohanty, B. K. Swain, "Emotion recognition using fuzzy k-means from oriya speech," Proceedings of the 2010 for International Conference [ACCTA-2010],DOI: 10.47893/IJCCT.2011.1066, .
[16] S. R. Livingstone, F. A. Russo, "The ryerson audio-visual database of emotional speech and song (ravdess): a dynamic, multimodal set of facial and vocal expressions in north american English," PLoS One, vol. 13 no. 5,DOI: 10.1371/journal.pone.0196391, 2018.
[17] R. K. Das, N. Islam, M. R. Ahmed, S. Islam, S. Shatabda, A. M. Islam, "Banglaser: a speech emotion recognition dataset for the bangla language," Data in Brief, vol. 42,DOI: 10.1016/j.dib.2022.108091, 2022.
[18] N. Vrysas, R. Kotsakis, A. Liatsou, C. A. Dimoulas, G. Kalliris, "Speech emotion recognition for performance interaction," Journal of the Audio Engineering Society, vol. 66 no. 6, pp. 457-467, DOI: 10.17743/jaes.2018.0036, 2018.
[19] F. Ringeval, A. Sonderegger, J. Sauer, D. Lalanne, "Introducing the recola multimodal corpus of remote collaborative and affective interactions," ,DOI: 10.1109/FG.2013.6553694, .
[20] W. Fan, X. Xu, X. Xing, W. Chen, D. Huang, "Lssed: a large-scale dataset and benchmark for speech emotion recognition," pp. 641-645, DOI: 10.1109/ICASSP39728.2021, .
[21] Z. S. Syed, S. Ali, M. Shehram, A. Shah, "Introducing the Urdu-Sindhi speech emotion corpus: a novel dataset of speech recordings for emotion recognition for two low-resource languages," International Journal of Advanced Computer Science and Applications, vol. 11 no. 4,DOI: 10.14569/ijacsa.2020.01104104, 2020.
[22] S. G. Koolagudi, R. Reddy, J. Yadav, K. S. Rao, "Iitkgp-sehsc: Hindi speech corpus for emotion analysis," .
[23] A. H. Meftah, M. A. Qamhan, Y. Seddiq, Y. A. Alotaibi, S. A. Selouani, "King saud university emotions corpus: construction, analysis, evaluation, and comparison," IEEE Access, vol. 9, pp. 54201-54219, DOI: 10.1109/access.2021.3070751, 2021.
[24] T. Pao, Y. Chen, J. Yeh, "Emotion recognition and evaluation from Mandarin speech signals," International Journal of Innovative Computing, Information and Control, vol. 4, pp. 1695-1709, 2008.
[25] L. Abdel-Hamid, "Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features," Speech Communication, vol. 122, pp. 19-30, DOI: 10.1016/j.specom.2020.04.005, 2020.
[26] V. Hozjan, Z. Kacic, A. Moreno, A. Bonafonte, A. Nogueiras, "Interface Databases: Design and Collection of a Multilingual Emotional Speech Database," .
[27] K. Dupuis, M. K. Pichora-Fuller, "Recognition of emotional speech for younger and older talkers: behavioural findings from the toronto emotional speech set," Canadian Acoustics, vol. 39, pp. 182-183, 2011.
[28] S. Klaylat, Z. Osman, L. Hamandi, R. Zantout, "Emotion recognition in Arabic speech," Analog Integrated Circuits and Signal Processing, vol. 96 no. 2, pp. 337-351, DOI: 10.1007/s10470-018-1142-4, 2018.
[29] A. B. Kandali, A. Routray, T. K. Basu, "Vocal emotion recognition in five native languages of Assam using new wavelet features," International Journal of Speech Technology, vol. 12,DOI: 10.1007/s10772-009-9046-4, 2009.
[30] N. Zaheer, O. U. Ahmad, M. Shabbir, A. A. Raza, "Speech emotion recognition for the Urdu language," Language Resources and Evaluation,DOI: 10.1007/s10579-022-09610-7, 2022.
[31] G. Costantini, I. Iaderola, A. Paoloni, M. Todisco, "Emovo corpus: an Italian emotional speech database," pp. 3501-3504, .
[32] K. Kaur, P. Singh, "Punjabi emotional speech database: design, recording and verification," International Journal of Intelligent Systems and Applications in Engineering, vol. 9 no. 4, pp. 205-208, DOI: 10.18201/ijisae.2021473641, 2021.
[33] S. G. Koolagudi, S. Maity, V. A. Kumar, S. Chakrabarti, K. S. Rao, "Iitkgp-sesc: speech database for emotion analysis," pp. 485-492, DOI: 10.1007/978-3-642-03547-0_46, .
[34] T. M. Wani, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi, E. Ambikairajah, "A comprehensive review of speech emotion recognition systems," IEEE Access, vol. 9, pp. 47795-47814, DOI: 10.1109/access.2021.3068045, 2021.
[35] A. Koduru, H. B. Valiveti, A. K. Budati, "Feature extraction algorithms to improve the speech emotion recognition rate," International Journal of Speech Technology, vol. 23 no. 1, pp. 45-55, DOI: 10.1007/s10772-020-09672-4, 2020.
[36] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, S. Narayanan, "Analysis of emotion recognition using facial expressions, speech and multimodal information," Proceedings of the 6th International Conference on Multimodal Interfaces, pp. 205-211, .
[37] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, S. Zafeiriou, "End-to-end multimodal emotion recognition using deep neural networks," IEEE Journal of selected topics in signal processing, vol. 11 no. 8, pp. 1301-1309, DOI: 10.1109/jstsp.2017.2764438, 2017.
[38] L. Zhu, L. Chen, D. Zhao, J. Zhou, W. Zhang, "Emotion recognition from Chinese speech for smart affective services using a combination of svm and dbn," Sensors, vol. 17 no. 7,DOI: 10.3390/s17071694, 2017.
[39] J. D. Williamson, "Speech analyzer for analyzing pitch or frequency perturbations in individual speech pattern to determine the emotional state of the person," US Patent, vol. 4 no. 093, 1978.
[40] F. Dellaert, T. Polzin, A. Waibel, "Recognizing emotion in speech," pp. 1970-1973, DOI: 10.1109/ICSLP.1996.606911, .
[41] B. S. Kang, C. H. Han, S. T. Lee, D. H. Youn, C. Lee, "Speaker dependent emotion recognition using speech signals," Proceedings of the Sixth International Conference on Spoken Language Processing, .
[42] B. Schuller, G. Rigoll, M. Lang, "Hidden Markov model-based speech emotion recognition," ,DOI: 10.1109/ICME.2003.1220939, .
[43] T. L. Nwe, S. W. Foo, L. C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41 no. 4, pp. 603-623, DOI: 10.1016/s0167-6393(03)00099-2, 2003.
[44] A. Nogueiras, A. Moreno, A. Bonafonte, J. B. Mariño, "Speech emotion recognition using hidden Markov models," Proceedings of the Seventh European Conference on Speech Communication and Technology, .
[45] Y. L. Lin, G. Wei, "Speech emotion recognition based on hmm and svm," pp. 4898-4901, DOI: 10.1109/ICMLC10707.2005, .
[46] L. Sun, B. Zou, S. Fu, J. Chen, F. Wang, "Speech emotion recognition based on dnn-decision tree svm model," Speech Communication, vol. 115, pp. 29-37, DOI: 10.1016/j.specom.2019.10.004, 2019.
[47] S. B. Wankhade, P. Tijare, Y. Chavhan, "Speech emotion recognition system using svm and libsvm," International Journal of Computer Science and Applications, vol. 4, 2011.
[48] A. K. Samantaray, K. Mahapatra, B. Kabi, A. Routray, "A novel approach of speech emotion recognition with prosody, quality and derived features using svm classifier for a class of north-eastern languages," pp. 372-377, DOI: 10.1109/ReTIS35379.2015, .
[49] A. B. Kandali, A. Routray, T. K. Basu, "Emotion recognition from Assamese speeches using mfcc features and gmm classifier," ,DOI: 10.1109/tencon14243.2008, .
[50] H. Hu, M. X. Xu, W. Wu, "Gmm supervector based svm with spectral features for speech emotion recognition," pp. IV-413, DOI: 10.1109/ICASSP.2007.366592, .
[51] J. Nicholson, K. Takahashi, R. Nakatsu, "Emotion recognition in speech using neural networks," Neural Computing & Applications, vol. 9 no. 4, pp. 290-296, DOI: 10.1007/s005210070006, 2000.
[52] X. Mao, L. Chen, L. Fu, "Multi-level speech emotion recognition based on hmm and ann," pp. 225-229, DOI: 10.1109/csie15041.2009, .
[53] B. Schuller, G. Rigoll, M. Lang, "Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture," ,DOI: 10.1109/ICASSP.2004.1326738, .
[54] M. W. Bhatti, Y. Wang, L. Guan, "A neural network approach for human emotion recognition in speech," pp. II-181, DOI: 10.1109/ISCAS.2004, .
[55] B. J. Abbaschian, D. Sierra-Sosa, A. Elmaghraby, "Deep learning techniques for speech emotion recognition, from databases to models," Sensors, vol. 21 no. 4,DOI: 10.3390/s21041249, 2021.
[56] H. M. Fayek, M. Lech, L. Cavedon, "Evaluating deep learning architectures for speech emotion recognition," Neural Networks, vol. 92, pp. 60-68, DOI: 10.1016/j.neunet.2017.02.013, 2017.
[57] S. K. Pandey, H. S. Shekhawat, S. M. Prasanna, "Deep learning techniques for speech emotion recognition: a review," ,DOI: 10.1109/RADIOELEKTRONIKA45779.2019, .
[58] Y. Li, T. Zhao, T. Kawahara, "Improved End-To-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning," Proceedings of the Interspeech, pp. 2803-2807, DOI: 10.21437/Interspeech.2019-2594, .
[59] Y. Yu, Y. J. Kim, "Attention-lstm-attention model for speech emotion recognition and analysis of iemocap database," Electronics, vol. 9 no. 5,DOI: 10.3390/electronics9050713, 2020.
[60] S. Latif, R. Rana, S. Younis, J. Qadir, J. Epps, "Transfer Learning for Improving Speech Emotion Classification Accuracy," 2018b. http://arXiv.org/abs/1801.06353
[61] J. Deng, Z. Zhang, E. Marchi, B. Schuller, "Sparse autoencoder-based feature transfer learning for speech emotion recognition," pp. 511-516, DOI: 10.1109/ACII31428.2013, .
[62] H. Meng, T. Yan, F. Yuan, H. Wei, "Speech emotion recognition from 3d log-mel spectrograms with deep learning network," IEEE Access, vol. 7, pp. 125868-125881, DOI: 10.1109/access.2019.2938007, 2019.
[63] D. Issa, M. Fatih Demirci, A. Yazici, "Speech emotion recognition with deep convolutional neural networks," Biomedical Signal Processing and Control, vol. 59,DOI: 10.1016/j.bspc.2020.101894, 2020.
[64] M. Neumann, N. T. Vu, "Attentive Convolutional Neural Network Based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech," 2017. http://arXiv.org/abs/1706.00612
[65] S. Wu, T. H. Falk, W. Y. Chan, "Automatic speech emotion recognition using modulation spectral features," Speech Communication, vol. 53 no. 5, pp. 768-785, DOI: 10.1016/j.specom.2010.08.013, 2011.
[66] P. Harár, R. Burget, M. K. Dutta, "Speech emotion recognition with deep learning," pp. 137-140, DOI: 10.1109/SPIN39695.2017, .
[67] W. Zhang, D. Zhao, Z. Chai, L. T. Yang, X. Liu, F. Gong, S. Yang, "Deep learning and svm-based emotion recognition from Chinese speech for smart affective services," Software: Practice and Experience, vol. 47, pp. 1127-1138, 2017.
[68] Z. Xiao, D. Wu, X. Zhang, Z. Tao, "Speech emotion recognition cross language families: Mandarin vs. western languages," pp. 253-257, .
[69] M. Neumann, "Cross-lingual and multilingual speech emotion recognition on English and French," pp. 5769-5773, DOI: 10.1109/ICASSP34228.2018, .
[70] S. Goel, H. Beigi, "Cross Lingual Cross Corpus Speech Emotion Recognition," 2020. http://arXiv.org/abs/2003.07996
[71] A. Asghar, S. Sohaib, S. Iftikhar, M. Shafi, K. Fatima, "An Urdu speech corpus for emotion recognition," PeerJ Computer Science, vol. 8,DOI: 10.7717/peerj-cs.954, 2022.
[72] A. Geethashree, D. Ravi, "Kannada emotional speech database: design, development and evaluation," pp. 135-143, .
[73] A. Jacob, "Modelling speech emotion recognition using logistic regression and decision trees," International Journal of Speech Technology, vol. 20 no. 4, pp. 897-905, DOI: 10.1007/s10772-017-9457-6, 2017.
[74] R. Kaushik, M. Sharma, K. K. Sarma, D. I. Kaplun, "I-vector based emotion recognition in Assamese speech," International Journal of Engineering and Future Technology, vol. 1, pp. 111-124, 2016.
[75] V. B. Waghmare, R. R. Deshmukh, P. P. Shrishrimal, G. B. Janvale, B. Ambedkar, "Emotion recognition system from artificial Marathi speech using mfcc and lda techniques," Proceedings of the Fifth International Conference on Advances in Communication, Network, and Computing–CNC, .
[76] A. Agrawal, A. Jain, "Speech emotion recognition of Hindi speech using statistical and machine learning techniques," Journal of Interdisciplinary Mathematics, vol. 23 no. 1, pp. 311-319, DOI: 10.1080/09720502.2020.1721926, 2020.
[77] K. Mannepalli, P. N. Sastry, M. Suman, "Analysis of emotion recognition system for Telugu using prosodic and formant features," pp. 137-144, .
[78] G. Agarwal, H. Om, "Performance of deer hunting optimization based deep learning algorithm for speech emotion recognition," Multimedia Tools and Applications, vol. 80 no. 7, pp. 9961-9992, DOI: 10.1007/s11042-020-10118-x, 2021.
[79] S. G. Koolagudi, N. Kumar, K. S. Rao, "Speech emotion recognition using segmental level prosodic analysis," .
[80] S. Kumar, J. Yadav, "Emotion recognition in Hindi language using gender information, gmfcc, dmfcc and deep lstm," .
[81] A. Bhavan, P. Chauhan, R. R. Shah, R. R. Shah, "Bagged support vector machines for emotion recognition from speech," Knowledge-Based Systems, vol. 184,DOI: 10.1016/j.knosys.2019.104886, 2019.
[82] M. Swain, S. Sahoo, A. Routray, P. Kabisatpathy, J. N. Kundu, "Study of feature combination using hmm and svm for multilingual odiya speech emotion recognition," International Journal of Speech Technology, vol. 18 no. 3, pp. 387-393, DOI: 10.1007/s10772-015-9275-7, 2015.
[83] S. A. Ali, A. Khan, N. Bashir, "Analyzing the impact of prosodic feature (pitch) on learning classifiers for speech emotion corpus," International Journal of Information Technology and Computer Science, vol. 7, pp. 54-59, DOI: 10.5815/ijitcs.2015.02.07, 2015.
[84] J. Ancilin, A. Milton, "Improved speech emotion recognition with mel frequency magnitude coefficient," Applied Acoustics, vol. 179,DOI: 10.1016/j.apacoust.2021.108046, 2021.
[85] M. Farhad, H. Ismail, S. Harous, M. M. Masud, A. Beg, "Analysis of emotion recognition from cross-lingual speech: Arabic, English, and Urdu," pp. 42-47, DOI: 10.1109/ICCAKM50778.2021, .
[86] R. V. Darekar, A. P. Dhande, "Emotion recognition from Marathi speech database using adaptive artificial neural network," Biologically inspired cognitive architectures, vol. 23, pp. 35-42, DOI: 10.1016/j.bica.2018.01.002, 2018.
[87] S. G. Koolagudi, R. Reddy, K. S. Rao, "Emotion recognition from speech signal using epoch parameters," ,DOI: 10.1109/SPCOM16513.2010, .
[88] P. Dhar, S. Guha, "A system to predict emotion from Bengali speech," International Journal of Mathematics and Soft Computing, vol. 7 no. 1, pp. 26-35, DOI: 10.5815/ijmsc.2021.01.04, 2021.
[89] A. Jacob, "Speech emotion recognition based on minimal voice quality features," pp. 0886-0890, DOI: 10.1109/ICCSP37400.2016, .
[90] B. Fernandes, K. Mannepalli, "Speech emotion recognition using deep learning lstm for Tamil language," Pertanika Journal of Science and Technology, vol. 29 no. 3,DOI: 10.47836/pjst.29.3.33, 2021.
[91] T. Rajisha, A. Sunija, K. Riyas, "Performance analysis of Malayalam language speech emotion recognition system using ann/svm," Procedia Technology, vol. 24, pp. 1097-1104, DOI: 10.1016/j.protcy.2016.05.242, 2016.
[92] P. Kannadaguli, V. Bhat, "A comparison of bayesian and hmm based approaches in machine learning for emotion detection in native Kannada speaker," ,DOI: 10.1109/eTechNxT42767.2018, .
[93] M. Swain, B. Maji, P. Kabisatpathy, A. Routray, "A dcrnn-based ensemble classifier for speech emotion recognition in Odia language," Complex & Intelligent Systems, vol. 8 no. 5, pp. 4237-4249, DOI: 10.1007/s40747-022-00713-w, 2022.
Copyright © 2022 Syeda Tamanna Alam Monisha and Sadia Sultana. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/
Abstract
Speech emotion recognition (SER) has grown to be one of the most trending research topics in computational linguistics in the last two decades. Speech being the primary communication medium, understanding the emotional state of humans from speech and responding accordingly have made the speech emotion recognition system an essential part of the human-computer interaction (HCI) field. Although there are a few review works carried out for SER, none of them discusses the development of SER system for the Indo-Aryan or Dravidian language families. This paper focuses on some studies carried out for the development of an automatic SER system for Indo-Aryan and Dravidian languages. Besides, it presents a brief study of the prominent databases available for SER experiments. Some remarkable research works on the identification of emotion from the speech signal in the last two decades have also been discussed in this paper.