Abstract
Speech emotion recognition (SER) plays a pivotal role in enabling machines to infer human subjective emotions from audio information alone. This capability is essential for effective communication and an improved user experience in human-computer interaction (HCI). Recent studies have successfully integrated temporal and spatial features to improve recognition accuracy. This study presents a novel approach that integrates parallel convolutional neural networks (CNNs) with a Transformer encoder and incorporates a collaborative attention mechanism (co-attention) to extract spatiotemporal features from audio samples. The proposed model is evaluated on multiple datasets using various fusion methods; the combination of parallel CNNs with the Transformer encoder and hierarchical co-attention yields the most promising performance. On version 1 of the ASVP-ESD dataset, the proposed model achieves a weighted accuracy (WA) of 70% and an unweighted accuracy (UW) of 67%; on version 2, it achieves a WA of 52% and a UW of 45%. The model is further evaluated on the ShEMO dataset to confirm its robustness and effectiveness across diverse datasets, achieving a UW of 68%. These comprehensive evaluations across multiple datasets highlight the generalizability of the proposed approach.
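The abstract names the architectural components (parallel CNN branches, a Transformer encoder, and co-attention fusion) but not their exact configuration. The following is a minimal, illustrative PyTorch sketch of one plausible realisation; the layer sizes, kernel choices, class count, and the particular co-attention formulation are assumptions for illustration, not the authors' implementation.

    # Minimal sketch: parallel CNNs + Transformer encoder + co-attention fusion.
    # All hyperparameters and the fusion formulation are illustrative assumptions.
    import torch
    import torch.nn as nn

    class ParallelCNNTransformerSER(nn.Module):
        def __init__(self, n_mels: int = 64, d_model: int = 128, n_classes: int = 7):
            # n_classes is a placeholder; set it to the target dataset's emotion count.
            super().__init__()
            # Two parallel CNN branches with different kernel sizes to capture
            # spectral patterns at different scales (assumed configuration).
            self.cnn_a = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, 1)),
            )
            self.cnn_b = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, 1)),
            )
            self.cnn_proj = nn.Linear(64, d_model)
            # Transformer encoder over the time axis of the spectrogram frames.
            self.frame_proj = nn.Linear(n_mels, d_model)
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=4, batch_first=True)
            self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
            # Co-attention: the CNN summary attends over the Transformer's
            # frame-level outputs (one simple way to realise co-attention).
            self.co_attention = nn.MultiheadAttention(
                embed_dim=d_model, num_heads=4, batch_first=True)
            self.classifier = nn.Linear(2 * d_model, n_classes)

        def forward(self, spec: torch.Tensor) -> torch.Tensor:
            # spec: (batch, time, n_mels) log-mel spectrogram
            img = spec.unsqueeze(1)                                # (B, 1, T, n_mels)
            spatial = torch.cat(
                [self.cnn_a(img).flatten(1), self.cnn_b(img).flatten(1)], dim=1)
            spatial = self.cnn_proj(spatial).unsqueeze(1)          # (B, 1, d_model)
            temporal = self.transformer(self.frame_proj(spec))     # (B, T, d_model)
            fused, _ = self.co_attention(spatial, temporal, temporal)  # (B, 1, d_model)
            pooled = torch.cat([fused.squeeze(1), temporal.mean(dim=1)], dim=1)
            return self.classifier(pooled)

    if __name__ == "__main__":
        model = ParallelCNNTransformerSER()
        dummy = torch.randn(2, 300, 64)    # two clips of 300 frames, 64 mel bins
        print(model(dummy).shape)          # torch.Size([2, 7])

In this sketch the spatial (CNN) summary serves as the attention query and the temporal (Transformer) frame sequence as keys and values, which is one common way to let the two feature streams condition on each other before classification; the paper's hierarchical co-attention may differ in detail.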
Details
Artificial neural networks; Recognition; Neural networks; User experience; Emotions; Coders; Datasets; Human-computer interaction; Human-computer interface; Computer mediated communication; Emotion recognition; Attention; Classification; Effectiveness; Phonetics; Algorithms; Audio data; Acoustics; Ablation; Interpersonal communication; Speech; Speech recognition; Subjectivity; Human technology relationship; Robustness; Models; Acknowledgment; Generalizability; Machinery; Accuracy