Abstract
Speech emotion recognition (SER) plays a pivotal role in enabling machines to infer human subjective emotions from audio information alone. This capability is essential for effective communication and an improved user experience in human-computer interaction (HCI). Recent studies have successfully integrated temporal and spatial features to improve recognition accuracy. This study presents a novel approach that integrates parallel convolutional neural networks (CNNs) with a Transformer encoder and incorporates a collaborative attention mechanism (co-attention) to extract spatiotemporal features from audio samples. The proposed model is evaluated on multiple datasets using various fusion methods; the combination of parallel CNNs with the Transformer encoder and hierarchical co-attention yields the most promising performance. On version 1 of the ASVP-ESD dataset, the proposed model achieves a weighted accuracy (WA) of 70% and an unweighted accuracy (UW) of 67%; on version 2, it achieves a WA of 52% and a UW of 45%. The model is further evaluated on the ShEMO dataset to confirm its robustness and effectiveness across diverse datasets, achieving a UW of 68%. These comprehensive evaluations across multiple datasets highlight the generalizability of the proposed approach.
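The abstract names the architectural components (parallel CNN branches, a Transformer encoder, and co-attention fusion) but not their exact configuration. The following is a minimal, illustrative PyTorch sketch of one plausible realisation; the layer sizes, kernel choices, class count, and the particular co-attention formulation are assumptions for illustration, not the authors' implementation.

    # Minimal sketch: parallel CNNs + Transformer encoder + co-attention fusion.
    # All hyperparameters and the fusion formulation are illustrative assumptions.
    import torch
    import torch.nn as nn

    class ParallelCNNTransformerSER(nn.Module):
        def __init__(self, n_mels: int = 64, d_model: int = 128, n_classes: int = 7):
            # n_classes is a placeholder; set it to the target dataset's emotion count.
            super().__init__()
            # Two parallel CNN branches with different kernel sizes to capture
            # spectral patterns at different scales (assumed configuration).
            self.cnn_a = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, 1)),
            )
            self.cnn_b = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, 1)),
            )
            self.cnn_proj = nn.Linear(64, d_model)
            # Transformer encoder over the time axis of the spectrogram frames.
            self.frame_proj = nn.Linear(n_mels, d_model)
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=4, batch_first=True)
            self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
            # Co-attention: the CNN summary attends over the Transformer's
            # frame-level outputs (one simple way to realise co-attention).
            self.co_attention = nn.MultiheadAttention(
                embed_dim=d_model, num_heads=4, batch_first=True)
            self.classifier = nn.Linear(2 * d_model, n_classes)

        def forward(self, spec: torch.Tensor) -> torch.Tensor:
            # spec: (batch, time, n_mels) log-mel spectrogram
            img = spec.unsqueeze(1)                                # (B, 1, T, n_mels)
            spatial = torch.cat(
                [self.cnn_a(img).flatten(1), self.cnn_b(img).flatten(1)], dim=1)
            spatial = self.cnn_proj(spatial).unsqueeze(1)          # (B, 1, d_model)
            temporal = self.transformer(self.frame_proj(spec))     # (B, T, d_model)
            fused, _ = self.co_attention(spatial, temporal, temporal)  # (B, 1, d_model)
            pooled = torch.cat([fused.squeeze(1), temporal.mean(dim=1)], dim=1)
            return self.classifier(pooled)

    if __name__ == "__main__":
        model = ParallelCNNTransformerSER()
        dummy = torch.randn(2, 300, 64)    # two clips of 300 frames, 64 mel bins
        print(model(dummy).shape)          # torch.Size([2, 7])

In this sketch the spatial (CNN) summary serves as the attention query and the temporal (Transformer) frame sequence as keys and values, which is one common way to let the two feature streams condition on each other before classification; the paper's hierarchical co-attention may differ in detail.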
Details
Artificial neural networks; Recognition; Neural networks; User experience; Emotions; Coders; Datasets; Human-computer interaction; Human-computer interface; Computer mediated communication; Emotion recognition; Attention; Classification; Effectiveness; Phonetics; Algorithms; Audio data; Acoustics; Ablation; Interpersonal communication; Speech; Speech recognition; Subjectivity; Human technology relationship; Robustness; Models; Acknowledgment; Generalizability; Machinery; Accuracy