
Abstract

With the advancement of human-computer interaction, the role of emotion recognition has become increasingly significant. Emotion recognition technology provides practical benefits across various industries, including user experience enhancement, education, and organizational productivity. For instance, in educational settings, it enables real-time understanding of students’ emotional states, facilitating tailored feedback. In workplaces, monitoring employees’ emotions can contribute to improved job performance and satisfaction. Recently, emotion recognition has also gained attention in media applications such as automated movie dubbing, where it enhances the naturalness of dubbed performances by synchronizing emotional expression in both audio and visuals. Consequently, multimodal emotion recognition research, which integrates text, speech, and video data, has gained momentum in diverse fields. In this study, we propose an emotion recognition approach that combines text and speech data, specifically incorporating the characteristics of the Korean language. For text data, we utilize KoELECTRA to generate embeddings, and for speech data, we extract features using HuBERT embeddings. The proposed multimodal transformer model processes text and speech data independently, subsequently learning interactions between the two modalities through a Cross-Modal Attention mechanism. This approach effectively combines complementary information from text and speech, enhancing the accuracy of emotion recognition. Our experimental results demonstrate that the proposed model surpasses single-modality models, achieving a high accuracy of 77.01% and an F1-Score of 0.7703 in emotion classification. This study contributes to the advancement of emotion recognition technology by integrating diverse language and modality data, suggesting the potential for further improvements through the inclusion of additional modalities in future work.
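
The fusion described in the abstract can be illustrated with a minimal sketch: text embeddings (e.g., from KoELECTRA) and speech embeddings (e.g., from HuBERT) are processed separately and then combined through cross-modal attention before classification. This is not the authors' published code; the 768-dimensional embeddings (typical of the base checkpoints), head count, mean pooling, classifier head, and number of emotion classes are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the authors' implementation) of cross-modal
# attention fusion over precomputed KoELECTRA text embeddings and HuBERT speech
# embeddings. Dimensions and the classifier head are assumptions.
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8, n_classes: int = 7):
        super().__init__()
        # Text queries attend to speech keys/values, and vice versa.
        self.text_to_speech = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, n_classes),
        )

    def forward(self, text_emb: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
        # text_emb:   (batch, text_len,   d_model) from a text encoder such as KoELECTRA
        # speech_emb: (batch, speech_len, d_model) from a speech encoder such as HuBERT
        t2s, _ = self.text_to_speech(text_emb, speech_emb, speech_emb)
        s2t, _ = self.speech_to_text(speech_emb, text_emb, text_emb)
        # Mean-pool each attended sequence and concatenate for emotion classification.
        fused = torch.cat([t2s.mean(dim=1), s2t.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Toy usage with random tensors standing in for real encoder outputs.
model = CrossModalAttentionFusion()
text = torch.randn(2, 32, 768)     # e.g., token-level text embeddings
speech = torch.randn(2, 128, 768)  # e.g., frame-level speech embeddings
logits = model(text, speech)       # shape: (2, n_classes)
```

In this sketch each modality serves once as the query and once as the key/value source, so the model can learn how textual cues reweight acoustic frames and vice versa, which is the complementary interaction the abstract attributes to the Cross-Modal Attention mechanism.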

Details

Title
KoHMT: A Multimodal Emotion Recognition Model Integrating KoELECTRA, HuBERT with Multimodal Transformer
Author
Moung-Ho Yi 1; Keun-Chang Kwak 1; Ju-Hyun Shin 2

1 Department of Electronic Engineering, Chosun University, Gwangju 61452, Republic of Korea; [email protected] (M.-H.Y.); [email protected] (K.-C.K.)
2 Department of New Industry Convergence, Chosun University, Gwangju 61452, Republic of Korea
First page
4674
Publication year
2024
Publication date
2024
Publisher
MDPI AG
e-ISSN
2079-9292
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3144081027
Copyright
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.