Abstract
Speech emotion recognition (SER) plays a vital role in enhancing human–computer interaction (HCI) and has applications in affective computing, virtual assistance, and healthcare. This research presents a high-performance SER framework based on a lightweight 1D Convolutional Neural Network (1D-CNN) and a multi-feature fusion technique. Rather than feeding spectrograms to the network as images, frame-level features (Mel-Frequency Cepstral Coefficients, Mel-Spectrograms, and Chroma vectors) are computed across each sequence, preserving temporal information while reducing computational cost. The model attained classification accuracies of 94.0% on MELD (multi-party conversations) and 91.9% on RAVDESS (acted speech). Ablation experiments demonstrate that fusing complementary features significantly outperforms any single-feature baseline. Data augmentation techniques, including Gaussian noise injection and time shifting, further enhance model generalisation. The proposed method demonstrates significant potential for real-time, audio-only emotion recognition on embedded or resource-constrained devices.
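The feature-fusion and augmentation steps described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the feature matrices would in practice come from an audio library such as librosa, and the dimensions, noise level, and shift amount below are hypothetical choices.

```python
import numpy as np

def add_gaussian_noise(signal, noise_std=0.005, seed=0):
    """Additive white Gaussian noise; noise_std is a hypothetical setting."""
    rng = np.random.default_rng(seed)
    return signal + rng.normal(0.0, noise_std, size=signal.shape)

def time_shift(signal, shift):
    """Circularly shift the waveform by `shift` samples."""
    return np.roll(signal, shift)

def fuse_frame_features(mfcc, mel, chroma):
    """Concatenate frame-level features along the feature axis.

    Each input has shape (n_frames, d_i); the fused output has shape
    (n_frames, d_mfcc + d_mel + d_chroma), giving the 1D-CNN one
    feature vector per frame rather than a spectrogram image.
    """
    return np.concatenate([mfcc, mel, chroma], axis=1)

# Toy example: a 1-second 440 Hz tone at 16 kHz, plus dummy feature matrices.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
wave = np.sin(2 * np.pi * 440 * t)

noisy = add_gaussian_noise(wave)          # same length, perturbed samples
shifted = time_shift(wave, sr // 10)      # shifted by 100 ms

# Hypothetical per-frame dimensions: 13 MFCCs, 64 Mel bands, 12 chroma bins.
fused = fuse_frame_features(np.zeros((10, 13)),
                            np.zeros((10, 64)),
                            np.zeros((10, 12)))
```

The fused matrix (here 10 frames × 89 features) is the kind of sequence input a lightweight 1D-CNN can convolve over along the time axis.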
Details
Accuracy; Embedded systems; Deep learning; Datasets; Wavelet transforms; Affective computing; Artificial intelligence; Human-computer interface; Emotion recognition; Spectrograms; Artificial neural networks; Neural networks; Ablation; Support vector machines; Random noise; Emotions; Methods; Acoustics; Real time; Speech; Speech recognition