
Abstract

With the advancement of human-computer interaction, the role of emotion recognition has become increasingly significant. Emotion recognition technology provides practical benefits across various industries, including user experience enhancement, education, and organizational productivity. For instance, in educational settings, it enables real-time understanding of students’ emotional states, facilitating tailored feedback. In workplaces, monitoring employees’ emotions can contribute to improved job performance and satisfaction. Recently, emotion recognition has also gained attention in media applications such as automated movie dubbing, where it enhances the naturalness of dubbed performances by synchronizing emotional expression in both audio and visuals. Consequently, multimodal emotion recognition research, which integrates text, speech, and video data, has gained momentum in diverse fields. In this study, we propose an emotion recognition approach that combines text and speech data, specifically incorporating the characteristics of the Korean language. For text data, we utilize KoELECTRA to generate embeddings, and for speech data, we extract features using HuBERT embeddings. The proposed multimodal transformer model processes text and speech data independently, subsequently learning interactions between the two modalities through a Cross-Modal Attention mechanism. This approach effectively combines complementary information from text and speech, enhancing the accuracy of emotion recognition. Our experimental results demonstrate that the proposed model surpasses single-modality models, achieving a high accuracy of 77.01% and an F1-Score of 0.7703 in emotion classification. This study contributes to the advancement of emotion recognition technology by integrating diverse language and modality data, suggesting the potential for further improvements through the inclusion of additional modalities in future work.
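
The fusion described in the abstract can be illustrated with a minimal sketch: text embeddings (e.g., from KoELECTRA) and speech embeddings (e.g., from HuBERT) are processed separately and then combined through cross-modal attention before classification. This is not the authors' published code; the 768-dimensional embeddings (typical of the base checkpoints), head count, mean pooling, classifier head, and number of emotion classes are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' implementation) of cross-modal
# attention fusion over precomputed KoELECTRA text embeddings and HuBERT speech
# embeddings. Dimensions and the classifier head are assumptions.
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8, n_classes: int = 7):
        super().__init__()
        # Text queries attend to speech keys/values, and vice versa.
        self.text_to_speech = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, n_classes),
        )

    def forward(self, text_emb: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
        # text_emb:   (batch, text_len,   d_model) from a text encoder such as KoELECTRA
        # speech_emb: (batch, speech_len, d_model) from a speech encoder such as HuBERT
        t2s, _ = self.text_to_speech(text_emb, speech_emb, speech_emb)
        s2t, _ = self.speech_to_text(speech_emb, text_emb, text_emb)
        # Mean-pool each attended sequence and concatenate for emotion classification.
        fused = torch.cat([t2s.mean(dim=1), s2t.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Toy usage with random tensors standing in for real encoder outputs.
model = CrossModalAttentionFusion()
text = torch.randn(2, 32, 768)     # e.g., token-level text embeddings
speech = torch.randn(2, 128, 768)  # e.g., frame-level speech embeddings
logits = model(text, speech)       # shape: (2, n_classes)
```

In this sketch each modality serves once as the query and once as the key/value source, so the model can learn how textual cues reweight acoustic frames and vice versa, which is the complementary interaction the abstract attributes to the Cross-Modal Attention mechanism.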

Details

Title
KoHMT: A Multimodal Emotion Recognition Model Integrating KoELECTRA, HuBERT with Multimodal Transformer
Author
Moung-Ho Yi 1; Keun-Chang Kwak 1; Ju-Hyun Shin 2

1 Department of Electronic Engineering, Chosun University, Gwangju 61452, Republic of Korea; [email protected] (M.-H.Y.); [email protected] (K.-C.K.)
2 Department of New Industry Convergence, Chosun University, Gwangju 61452, Republic of Korea
First page
4674
Publication year
2024
Publication date
2024
Publisher
MDPI AG
e-ISSN
2079-9292
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3144081027
Copyright
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.