As fundamental drivers of human behavior, emotions can be expressed through various modalities, including facial expressions. Facial emotion recognition (FER) has emerged as a pivotal area of affective computing, enabling accurate detection of human emotions from visual cues. To enhance efficiency while maintaining accuracy, we propose a novel approach that leverages deep learning and transfer learning techniques to classify emotions from only half of the human face. We introduce EMOFACE, a comprehensive half-facial imagery dataset annotated with 25 distinct emotion labels, providing a diverse and inclusive resource for multi-label half-facial emotion classification. By combining this dataset with the established FER2013 dataset, we employ a staged transfer learning framework that effectively addresses the challenges of multi-label half-facial emotion classification. Our proposed approach, which utilizes a custom convolutional neural network (ConvNet) and five pre-trained deep learning models (VGG16, VGG19, DenseNet, MobileNet, and ResNet), achieves strong results. We report an average binary accuracy of 0.9244 for training, 0.9152 for validation, and 0.9138 for testing, demonstrating the efficacy of our method. The potential applications of this research extend to various domains, including affective computing, healthcare, robotics, human–computer interaction, and self-driving cars. By advancing the field of half-facial multi-label emotion recognition, our work contributes to the development of more intuitive and empathetic human–machine interactions.
Introduction
Overview
Emotions are central to human experience and profoundly influence how individuals perceive, interact, and respond to their environment. They are critical drivers of behavior, affecting decision-making, communication, and social interactions [1, 2–3]. Emotions are not confined to internal experiences; they are expressed outwardly through various channels, including facial expressions, vocal tone, body language, and physiological responses [4, 5]. These expressions serve as essential tools for nonverbal communication, enabling individuals to convey their emotional states and understand those of others [6, 7]. Recognizing and interpreting human emotions has gained significant importance in the field of affective computing and human–computer interaction (HCI) [8, 9]. As technology becomes increasingly embedded in everyday life, there is a growing demand for systems capable of understanding and responding to human emotions. Such systems are crucial for developing empathetic and adaptive technologies, enhancing user experience, improving communication, and supporting mental health [10, 11]. Facial emotion recognition (FER) is vital in this context. The human face, with its ability to convey a vast array of emotions through subtle muscle movements, is the primary source of emotional expression [12, 13]. The challenge of accurately recognizing these emotions using computationally intelligent software systems and models is a key focus in affective computing and emotion recognition. By enabling machines to detect and interpret facial emotions, more intuitive, interactive, and responsive systems can be created, leading to advancements in areas such as social robotics, virtual reality, and personalized healthcare [14, 15]. The significance of emotion recognition extends beyond technological innovation. It has far-reaching implications for improving human–computer interactions by making them more natural, engaging, entertaining, and effective [16, 17]. As we develop systems that understand and respond to human emotions, we move closer to bridging the gap between human intuition and machine intelligence, ultimately fostering meaningful and impactful interactions between humans and technology [18, 19].
Emotions significantly influence daily behavior and decision making. According to [20, 21–22], they are conscious mental responses and reactions to specific stimuli or events in the environment, often accompanied by physical (physiological) and psychological changes. Individual emotional states vary, despite similar experiences. Emotions act as nonverbal communication within societies, expressed through facial movements, vocal tone, physical gestures, written language, and physiological indicators (vital signs) [23]. As noted in [24, 25–26], emotional psychology analyzes emotions through their components: subjective experiences, physiological responses, and behavioral reactions. Subjective experiences arise from internal or external triggers that prompt the brain to release hormones that are interpreted based on personal history, culture, beliefs, and experiences. Physiological responses involve body system changes such as the fight-or-flight response, which has evolved to enhance survival. Behavioral responses affect actions, decisions, and reactions, manifesting as verbal and nonverbal cues such as facial expressions, voice modulation, body movements, and vital sign changes. These forms of emotional expression are essential for overall health and wellbeing.
Literature review
Emotional psychology, as noted in [24, 25–26], identifies two main categories of emotional theories: developmental and classification. Developmental theories explore the evolution of physiological and psychological responses, whereas classification theories categorize and group emotions according to specific traits. Six key developmental theories provide a comprehensive understanding of emotions in this field. They aim to link physiological responses to behavioral or psychological reactions that influence individual emotional experiences. According to [27], psychology uses two primary methods to categorize emotions: dimensional and discrete emotion models. Dimensional models analyze emotions on a continuous spectrum, facilitating a thorough examination of their interrelationships. These models typically employ two or three dimensions, drawn from valence, arousal, duration, and dominance, to identify emotional patterns and complexities. An example is James Russell's circumplex model from the 1980s, which uses valence and arousal as axes to create a two-dimensional framework. Various dimensional models can be constructed by combining different emotional dimensions to study diverse aspects of the complexity of human emotions.
Discrete emotions theory, as described in [1], suggests that humans universally experience a finite set of specific emotions. Proponents of this theory contend that emotions are distinct, innate, and biologically determined rather than influenced by cultural or social factors, and that emotions are primitive and unique. According to this theory, all humans experience six universal emotions: happiness, sadness, fear, surprise, disgust, and anger. The discrete emotions theory classifies emotions into a limited set of categories, facilitating the use of computer vision and deep learning for emotion recognition and establishing benchmarks for addressing affective computing challenges. William James, an early advocate, grouped emotions as fear, rage, grief, and love. Paul Ekman proposed the concept of universal emotions, initially identifying six fundamental emotions in the 1970s: happiness, sadness, fear, disgust, anger, and surprise, later expanding the list to include embarrassment and excitement in the 1990s. Robert Plutchik introduced the wheel of emotions in the 1980s, suggesting eight primary emotions in complementary pairs: happiness-sadness, fear-anger, disgust-trust, and surprise-anticipation [28, 29]. In 2014, researchers at the University of Glasgow reduced the number of basic emotions to four, observing that fear and surprise, along with anger and disgust, share similar underlying facial muscle patterns [30, 31–32]. A study conducted at the University of California, Berkeley in 2017 by Alan S. Cowen and Dacher Keltner identified 27 distinct categories of human emotions [4, 33, 34].
As described in [2], affective computing combines elements of computer science, engineering, and psychology to develop software systems capable of understanding and interpreting human emotions. The process of emotion recognition involves the detection, interpretation, and analysis of human emotions through various channels. These emotion recognition systems employ different methods, including facial, voice, gait, text, and brain-based approaches, each with its own characteristics in terms of performance, complexity, reliability, and applicability. There are two primary approaches to developing these systems: unimodal, which focuses on processing a single type of emotional signal (such as facial images, speech, gait, or text), and multimodal, which integrates multiple signals to enhance performance and achieve more accurate results. The effectiveness of each modality in identifying emotions varies. Among these methods, facial-based emotion recognition is the most widely used, primarily because of the abundance of vision-related imagery datasets and advancements in image processing and computer-vision algorithms.
According to [35, 36, 37–38], facial emotion recognition generally employs two main approaches: traditional computer vision methods and deep learning techniques. Traditional approaches rely heavily on manually extracting features using handcrafted feature engineering techniques, such as the histogram of oriented gradients (HOG) and local binary patterns (LBP), which are then combined with machine learning algorithms such as support vector machines (SVMs), random forests, decision trees, logistic regression, and K-nearest neighbors. This approach requires extensive domain knowledge and expertise in feature engineering. In contrast, deep learning relies on automatic feature extraction through encoder-decoder architectures, particularly convolutional neural networks (CNNs). These networks are adept at learning intricate patterns and features, resulting in enhanced performance in classifying facial expressions in supervised machine learning tasks. As mentioned in [3], research on facial emotion recognition (FER) has primarily focused on multiclass classification, aiming to categorize facial images into distinct predefined emotion categories such as happiness, sadness, anger, and surprise. Significant advancements have been made using convolutional neural networks (CNNs) and transfer learning with pre-trained models such as VGG16, VGG19, ResNet, and MobileNet, achieving high accuracy on benchmark datasets, with reported rates often exceeding 70% and categorical cross-entropy loss ranging from 0.5 to 1.0. However, these methods fall short of the human baseline performance, reported to be between 85 and 90% accuracy in recognizing human emotions. Although these approaches provide robust frameworks for emotion classification, they oversimplify the complexity of human emotions by limiting each image to a single label. This limitation fails to capture the simultaneous occurrence of multiple emotions common in real-world interactions. Multi-label facial emotion recognition is essential for capturing the complexity of human facial expressions, which often involve multiple emotions experienced simultaneously. Unlike multiclass classification, which assigns one emotion per image, multi-label classification assigns several emotions per sample, thereby providing a more realistic and comprehensive understanding of human emotions. This is significant in fields such as human–computer interaction, social robotics, and affective computing. Several studies, such as [37, 38, 39, 40, 41–42], represent pioneering efforts in multi-label classification techniques, highlighting the advantages and potential for more detailed and accurate facial emotion recognition systems. This study seeks to fill this gap by creating systems capable of identifying multiple pertinent emotions in a single half-facial image. By employing multi-label classification techniques, we aim to improve the accuracy, robustness, and realism of computer vision systems designed for facial emotion recognition.
Available datasets
According to [4], numerous facial emotion classification image datasets are available for affective computing research. These datasets exhibit unique features such as variations in size, image quality, distribution, and inherent biases. Among these are the Extended Cohn-Kanade Dataset (CK+), AffectNet, the Japanese Female Facial Expression Dataset (JAFFE), and the Facial Emotion Recognition 2013 Dataset (FER2013). The creation of these datasets was motivated by the need to tackle the issue of multiclass facial emotion recognition and classification into a limited set of fundamental emotions, as listed in Table 1.
Table 1. Most commonly used facial imagery emotion recognition datasets
Dataset name | Description | No. of images | No. of classes | Age range | Ethnic diversity | Source |
|---|---|---|---|---|---|---|
FER- 2013 | Facial expression recognition dataset from a Kaggle competition | 35,887 | 7 | Various | Various | Kaggle |
AffectNet | Contains facial images labeled for seven different emotions | 440,000 | 7 | Various | Various | AffectNet |
CK+ | Extended Cohn-Kanade dataset with posed and spontaneous expressions | 593 | 7 | Adults | Mainly Caucasian | CK+ |
JAFFE | Japanese Female Facial Expression dataset with basic emotions | 213 | 7 | Adults | Asian | JAFFE |
KDEF | Karolinska Directed Emotional Faces with images of 6 basic emotions | 4900 | 6 | Adults | Mainly Caucasian | KDEF |
RAF-DB | Real-World Affective Faces Database with naturalistic facial expressions | 30,000 + | 7 | Various | Various | RAF-DB |
Emoface | Multi-label facial emotion recognition color imagery dataset | 4000 | 25 | Various + Diverse | Inclusive + Diverse | Systems and Biomedical Engineering Cairo University |
This paper introduces a novel computer vision dataset designed specifically for multi-label facial emotion recognition using only half of a human facial image. It also provides computational evidence for facial symmetry and its potential applications in deep learning and computer vision. Additionally, it proposes an innovative deep learning technique for recognizing multilabel facial emotions from half of the faces using staged transfer learning. This method significantly improves the precision and resilience of recognizing multiple complex facial expressions, while reducing computational complexity in terms of time and space. Through extensive evaluations and analyses, we demonstrate that our technique outperforms the existing methods in terms of accuracy and loss function, highlighting its superior performance.
The structure of this paper is as follows. Section II details the methodology, including the proposed staged transfer learning approach and the experimental setup for multilabel facial emotion recognition. The results are presented in Section III, which compares the performance and outcomes of models with and without full-face usage using the proposed technique across different subsets. Section IV discusses the implications of the findings in the context of the existing research. Finally, Section V concludes the paper by summarizing its key contributions and proposing avenues for future research.
Methods
Our methodology comprises five distinct segments: data assembly; facial symmetry and computational proof; data preparation and image preprocessing; deep learning; and evaluation metrics and performance measures.
Data assembly
The availability of high-quality data is essential for the development of robust algorithms and models in the realm of artificial intelligence, particularly in machine learning and deep learning. Our study employed a cutting-edge, unique, and inclusive multi-label facial emotion recognition imagery dataset, designed for real-time deep learning software systems. The dataset, known as “Emoface,” is described in [42] as being notable for its uniqueness, balance, diversity, inclusiveness, and lack of bias. This dataset comprises 4000 RGB images, each with dimensions of (150 × 150 × 3), and features high-quality, noise-free images collected to eliminate biases related to gender, ethnicity, and age. All images depict frontal face views and are annotated with 25 labels encompassing a wide range of basic and complex emotions, including happiness, sadness, disgust, excitement, boredom, focus, depression, and shock. Notably, the dataset incorporates AI-generated synthetic images to enhance the diversity and robustness of deep learning models, serving as a method of data augmentation to increase the dataset size. The primary goal of this dataset is to facilitate the development of multi-label full facial emotion classification algorithms. Our research also utilized the well-known fer2013 dataset, which contains 28,709 grayscale training images and 7178 grayscale testing images, all with a resolution of 48 × 48 pixels. Each image in this dataset is labeled with one of seven emotion annotations: happiness, sadness, fear, disgust, anger, surprise, and neutrality.
Facial symmetry and computational proof
Theoretical basis for facial symmetry and expressions asymmetry
According to [43, 44, 45, 46–47], human faces are characterized by symmetry, with the left side typically symmetric to the right side. Facial expressions exhibit asymmetry where certain emotions may be more dominant on one side. Studies suggest that emotions are processed differently in the hemispheres of the brain, leading to varying expressiveness between facial halves. The right hemisphere, controlling the left side of the face, is more involved in processing emotions and expressing negative emotions, such as sadness, fear, and anger. The left hemisphere, controlling the right side, is associated with cognitive and socially moderated positive emotions such as happiness and surprise [48]. This brain hemispheric lateralization suggests that a single half-face may be sufficient for robust facial emotion recognition, rather than relying on full-face data. This idea opens a new direction for AI and computer vision researchers to approach facial-based applications including face recognition, gender recognition, age recognition, emotion recognition, and other biometric-related applications. This concept is promising, as it theoretically reduces computational complexity by half in terms of time and space, leading to efficient computational resource allocation and processing power usage.
Computational efficiency and practical applications of half-face recognition
Although full-face facial emotion recognition is conventional, prior research suggests that treating the face as a symmetrical whole may introduce redundancy or weaken the classification performance. Studies have shown that the left hemiface is often perceived as more emotionally expressive than the right [49]. Borod et al. found participants viewed mirrored left halves of faces as more intensely expressive than mirrored right halves, suggesting one half might be more informative for classification [50]. Sackeim et al. demonstrated that the left half is consistently rated as more expressive, particularly for spontaneous emotions [51]. Recent research has shown that deep learning models can achieve competitive performance using only half-faces, indicating that full-face images do not always provide additional useful information [52]. By leveraging one side of the face, our study reduces computational complexity while maintaining classification performance and robustness.
Challenges and advantages in real-world scenarios
In real-world scenarios, occlusions, such as face masks, side poses, or partial camera views, often result in only half of the face being visible. Owing to pandemic-related mask usage, emotion recognition systems have struggled with accuracy for occluded faces [53]. Security surveillance, driver monitoring systems, and video conferencing applications often capture faces from side angles, leading to partial visibility of the expressions. By designing a model trained on half-faces, our approach enhances the robustness in settings where full-face data may not be available. This aligns with human perception studies, which indicate that observers can interpret emotions from partial facial information [54], thereby reflecting human nature in intelligent systems.
Computational advantages of half-face recognition
A key advantage of half-face recognition is the reduction in computational complexity. Processing only half of the facial imagery input reduces the parameters required for deep learning models, leading to lower memory usage, reduced storage space, faster inference times, and improved energy efficiency, which are crucial for real-time applications in embedded systems, robotics, and mobile devices [55]. Studies have shown that reducing the input image size while maintaining discriminative features can significantly improve processing speed with minimal accuracy loss [56]. By focusing on half-faces, our approach enables real-time performance while preserving high classification accuracy, making it suitable for human–computer interaction, healthcare, and affective computing applications.
Ablation study design and methodology
To empirically verify the effectiveness of half-face recognition, we propose an ablation study that compares two model configurations.
Full-face multi-label facial emotion recognition models: trained and tested on entire face images on the emoface dataset (full-face version).
Right and left half-face multi-label facial emotion recognition models: trained and tested using only the left half and right half of the face on the emoface dataset (half-face version).
Each model was evaluated based on binary accuracy (at a threshold of 0.5), binary cross-entropy logistic loss, and computational efficiency (inference speed and storage size). Previous studies have suggested that left hemiface models often achieve comparable or superior performance to full-face models in emotion recognition [57]. If our results confirm that half-faces can achieve similar or better accuracy, this would validate our approach, while demonstrating its efficiency and practical applicability.
By focusing on half-face emotion recognition, this study contributes to the broader field of affective computing and human-centered AI. Emotion recognition is a crucial component of robotic assistants, mental health monitoring, and human–computer interaction [8]. By reducing the computational demands while maintaining accuracy, our method makes emotion-aware AI systems more scalable and deployable in real-world applications. Additionally, in autonomous systems, such as driver monitoring, detecting emotional states from only a partial face can improve road safety and driver attention monitoring [58].
Conversion of multi-label full facial emoface dataset to half-facial imagery version
We converted the currently available multi-label full-facial Emoface dataset into a multi-label half-facial imagery dataset to address the problem of half-facial multi-label emotion recognition. From the 4000 images in the Emoface dataset, we randomly selected a sample image to test the initial concept of facial symmetry. The algorithm involves reading the input image, converting it to grayscale, sharpening the image, and applying the Canny edge detection algorithm [45] to extract key features, including horizontal and vertical edges. The lower and upper thresholds for Canny edge detection were set to 100 and 250, respectively. We divided the resulting edge map into two equal halves, flipped the right half horizontally, and compared it pixel-wise to the left half. The difference between the two halves is calculated using the Euclidean distance function; a smaller numerical difference indicates greater similarity between the two halves. This algorithm was then applied to the entire Emoface dataset, iterating over each image to compute the numerical difference between its halves. The primary goal was to validate the conversion of the Emoface dataset from full-facial to half-facial imagery without losing spatial information or key features. Figure 1 illustrates the process of calculating the numerical difference between the two halves of a randomly selected sample image, and a minimal implementation sketch follows the figure.
Fig. 1 [Images not available. See PDF.]
a Random sample image, (b) edge extraction via the Canny edge detection algorithm, (c) dividing the image into the left half (150 × 75 × 1), (d) dividing the image into the right half (150 × 75 × 1), and (e) mirroring the right half facial model for comparison with the left half
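The following minimal Python sketch illustrates the symmetry check described above, assuming OpenCV and NumPy; the function name, sharpening kernel, and the `emoface_paths` list are illustrative, while the Canny thresholds (100 and 250) and the 150 × 150 input size follow the text.

```python
import cv2
import numpy as np

def half_face_difference(image_path: str) -> float:
    """Euclidean distance between the left half and the mirrored right half
    of a Canny edge map of a frontal facial image."""
    img = cv2.imread(image_path)                          # 150 x 150 x 3 (BGR)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Simple sharpening kernel applied before edge extraction (illustrative).
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
    sharp = cv2.filter2D(gray, -1, kernel)

    # Canny edge detection with the thresholds reported in the text.
    edges = cv2.Canny(sharp, 100, 250)

    # Split column-wise, mirror the right half, and compare pixel-wise.
    h, w = edges.shape
    left = edges[:, : w // 2].astype(np.float32) / 255.0
    right = np.fliplr(edges[:, w // 2:]).astype(np.float32) / 255.0

    return float(np.sqrt(np.sum((left - right) ** 2)))   # smaller = more symmetric

# Example over the whole dataset (emoface_paths is a hypothetical list of file paths):
# distances = [half_face_difference(p) for p in emoface_paths]
```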
Statistical evaluation of half-facial imagery
We computed statistical measures to evaluate the use of half of the faces in the emoface dataset for deep learning and computer vision tasks, demonstrating the facial symmetry computationally. A histogram illustrates the numerical differences between the right and left halves of each facial image for enhanced data analysis and visualization. The histogram revealed a bell-shaped Gaussian normal distribution of Euclidean Distances for the dataset of 4000 images. The results showed a normal distribution, with most image pairs exhibiting Euclidean distances between 20 and 40. The peak at approximately 30 suggests that for most facial images, the differences between the two halves are relatively small, reflecting the general symmetry of human faces. This analysis provided insights into the consistency of facial features across the dataset and highlighted the degree of variation between the two halves.
The histogram shows a maximum numerical difference of approximately 60.0 due to the Canny edge detector threshold settings, which is acceptable. The Gaussian distribution confirmed that the emoface dataset primarily consisted of frontal view images, validating its prescribed characteristics and supporting the reliability and validity of converting full facial images to half facial representations using the proposed technique, as most of the images were vertically symmetrical. The statistical analysis of the half-face algorithm revealed an average discrepancy of 31.59, median of 31.87, modal value of 29.73, maximum disparity of 57.90, and instances of perfect symmetry with a minimum difference of 0.0. These insights enhance our understanding of the efficacy of half-facial representations in emoface datasets for multilabel half-facial emotion recognition.
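A short sketch of the statistical summary above is given below, assuming the per-image distances from the previous snippet are collected in a `distances` list; the number of histogram bins is an arbitrary choice, and the commented values are those reported in the text.

```python
import numpy as np
import matplotlib.pyplot as plt

distances = np.asarray(distances)                 # per-image differences (4000 values)
counts, edges = np.histogram(distances, bins=50)
modal_bin = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])

print(f"mean   {distances.mean():.2f}")           # reported ~31.59
print(f"median {np.median(distances):.2f}")       # reported ~31.87
print(f"mode   {modal_bin:.2f}")                  # reported ~29.73
print(f"max    {distances.max():.2f}")            # reported ~57.90
print(f"min    {distances.min():.2f}")            # reported 0.0

plt.hist(distances, bins=50)
plt.xlabel("Euclidean distance between mirrored halves")
plt.ylabel("Number of images")
plt.show()
```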
Data preprocessing strategies and novel dataset augmentation techniques
Computational proof validation identifies four data preprocessing strategies for half-facial multi-label emotion recognition using the 4k Emoface dataset: all right halves; all left halves; a mix of 50% right halves and 50% left halves; and a combination of all right and left halves. The fourth strategy effectively doubles the dataset size to 8k images, introducing a novel technique for augmenting computer vision imagery datasets and enhancing the flexibility of deep learning models in processing both the right and left halves of faces.
The histogram was segmented into six primary regions, each encompassing a specific number of images from the Emoface dataset. The primary goals of dividing the numerical difference histogram are to investigate potential correlations between numerical differences and the performance of deep learning models, and to determine if there is any association between numerical differences and the selected half of the human face for analysis.
Data preparation and image preprocessing
Using the half-facial algorithm, we systematically converted all Emoface dataset images from full-facial to half-facial by splitting each (150 × 150 × 3) image column-wise along its vertical midline into two equal halves of (150 × 75 × 3). We then resized each half-image to (100 × 100 × 3) to standardize and enlarge the spatial dimensions, resolution, and quality. Consistent image height and width are essential for most deep learning models and pre-trained architectures, ensuring compatibility and improved processing performance. Following splitting and resizing, the 4k Emoface dataset was transformed into 4k right halves and 4k left halves, which were then concatenated and shuffled into an 8k dataset. This ensured fair training, unbiased learning, randomization, reduced overfitting, and better generalization, eliminating any sequence or order pattern. The same process was applied to the target labels (annotations) to maintain the corresponding ground truth for each image. The half-facial 8k Emoface dataset was preprocessed and noise free: 75% (6000 images) was used for training, and the remaining 25% (2000 images) was split equally between cross-validation (1000 images) and testing (1000 images). Images were sized at (100 × 100 × 3) to incorporate color as a feature for training the deep neural networks, and the labels were multi-hot encoded to handle multiple emotions per image. A sketch of this preparation pipeline is given below.
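The sketch below outlines the preparation pipeline under the assumption that `images` is a (4000, 150, 150, 3) array and `labels` a (4000, 25) multi-hot array; variable names and the random seed are illustrative, not the authors' code.

```python
import numpy as np
import cv2

def split_and_resize(images, labels, size=(100, 100)):
    """Split full faces column-wise, resize the halves, concatenate, and shuffle."""
    rights, lefts = [], []
    for img in images:
        lefts.append(cv2.resize(img[:, :75, :], size))    # left half  -> 100 x 100 x 3
        rights.append(cv2.resize(img[:, 75:, :], size))   # right half -> 100 x 100 x 3
    half_images = np.concatenate([np.array(rights), np.array(lefts)])  # 8000 images
    half_labels = np.concatenate([labels, labels])        # ground truth duplicated per half
    perm = np.random.default_rng(seed=42).permutation(len(half_images))
    return half_images[perm].astype(np.float32) / 255.0, half_labels[perm]

X, y = split_and_resize(images, labels)
X_train, y_train = X[:6000], y[:6000]            # 75% training
X_val,   y_val   = X[6000:7000], y[6000:7000]    # 12.5% cross-validation
X_test,  y_test  = X[7000:],     y[7000:]        # 12.5% testing
```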
The FER2013 dataset images, originally sized at 48 × 48 pixels, were resized to 100 × 100 × 3 dimensions by converting grayscale images to pseudo-RGB format. This conversion is essential for pre-trained meta-architectures that require three-dimensional tensor inputs. This step ensures that the convolution computations between the initial layer's filters of deep convolutional neural networks (CNNs) and the corresponding input image channels are accurately aligned and matched, facilitating precise dot product calculations between filter weights (parameters) and input image/tensor pixel values (features). This process involves duplicating and concatenating the 2D grayscale image to create a pseudo-RGB image with a depth of three, as depicted in Fig. 2. The ultimate reason behind this preprocessing step is to adhere to the input standards of most common deep neural networks, as it does not have a direct influence on the multiclass classification performance of the first stage of multi-class facial emotion recognition. The dataset included 25,841 training images, 2868 cross-validation images, and 7178 testing images. The images from both datasets were normalized from [0–255] to [0–1] to enhance the training convergence and fitting speed of the models. Data augmentation techniques such as height and width translation, rotation, zooming, shearing, horizontal flipping, and the “nearest” fill mode were used to enlarge the fer2013 dataset.
Fig. 2 [Images not available. See PDF.]
a Pseudo-RGB image sample and data augmentation, including b zooming, c flipping, and d rotation
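A hedged sketch of the pseudo-RGB conversion and augmentation described above, assuming TensorFlow/Keras; the specific augmentation magnitudes (shift, rotation, zoom, and shear ranges) are illustrative, while the operations themselves follow the text.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def to_pseudo_rgb(gray_batch):
    """(N, 48, 48) uint8 grayscale -> (N, 100, 100, 3) float32 in [0, 1]."""
    x = gray_batch.astype(np.float32)[..., np.newaxis] / 255.0  # add channel axis
    x = np.repeat(x, 3, axis=-1)                                # duplicate to 3 channels
    return tf.image.resize(x, (100, 100)).numpy()               # upscale to 100 x 100

augmenter = ImageDataGenerator(
    width_shift_range=0.1,        # width translation
    height_shift_range=0.1,       # height translation
    rotation_range=15,            # rotation
    zoom_range=0.1,               # zooming
    shear_range=0.1,              # shearing
    horizontal_flip=True,         # horizontal flipping
    fill_mode="nearest",          # "nearest" fill mode
)
# Example: train_flow = augmenter.flow(to_pseudo_rgb(x_train), y_train, batch_size=256)
```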
Deep learning
We introduced an innovative approach in the fields of computer vision and deep learning for recognizing multiple emotions from partial (half-face) facial images using a staged transfer learning method. This strategy tackles intricate challenges by breaking them down into smaller, more manageable subtasks across several learning phases, each concentrating on a particular aspect of the primary problem. The fundamental concept involves gradually and incrementally training algorithms by exposing them to increasing levels of complexity, with the results of one phase becoming the input to subsequent phases. Our method incorporates hierarchical decomposition (breaking complex problems into simpler sub-problems), modular design (treating each phase as a separate learning unit), sequential processing (transferring knowledge between phases), and progressive refinement (improving model performance and resilience for effective problem solving). The primary goal of our proposed staged transfer learning technique is to adjust the learnable parameters of neural networks, whether they are randomly initialized in custom convolutional neural networks or pre-learned in pre-trained deep learning models.
We employed a two-stage approach to half facial emotion recognition using iterative staged transfer learning. The first stage concentrates on full facial emotion recognition across multiple classes. We trained various deep learning models, including a custom convnet and five pretrained meta-architectures, to categorize human emotions into seven fundamental classes using the multiclass fer2013 dataset. This step was designed to help the models develop an intuitive grasp and basic understanding of basic emotions such as happiness, sadness, and fear. The second stage shifted our focus to multilabel, half-facial emotion recognition. In this stage, we trained all the models on the multi-label half-facial emoface dataset to classify human emotions into 25 distinct multi-emotions per sample image.
The research employed a custom convolutional neural network (CNN) and five pretrained meta-architectures (VGG16, VGG19, MobileNet, DenseNet, and ResNet) [59, 60, 61–62] across two learning stages. The custom CNN featured an encoder-decoder design pattern: the encoder comprised five convolutional blocks, each including a convolution layer and a max-pooling layer, followed by a flattening layer to extract high-level features and generate a one-dimensional feature vector. The decoder included three hidden dense layers, batch normalization layers for faster convergence, and dropout layers for regularization. For full-facial multiclass emotion classification, the output layer had seven units with a softmax activation function, ensuring that the output probabilities sum to one. In the second stage of half-facial multi-label emotion classification, the final output layer was replaced with 25 binary classifier units using a sigmoid activation function, where each unit independently produces an output probability between 0 and 1. The structural details of both stages, applicable to the FER2013 and Emoface datasets, are shown in Fig. 3, and an architectural sketch follows the figure.
Fig. 3 [Images not available. See PDF.]
Custom convnet encoder-decoder configuration for staged transfer learning on fer2013 full facial multiclass emotion recognition and emoface multilabel half facial emotion recognition with color code for each layer
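The following Keras sketch reflects the encoder-decoder structure described above; the filter counts, dense-layer widths, and dropout rate are assumptions, whereas the five convolution/max-pooling blocks, the flattening layer, the three dense blocks with batch normalization and dropout, and the 7-unit softmax head for stage 1 follow the text.

```python
from tensorflow.keras import layers, models

def build_custom_convnet(input_shape=(100, 100, 3), n_outputs=7,
                         head_activation="softmax"):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # Encoder: five convolution + max-pooling blocks, then flattening.
    for filters in (32, 64, 128, 256, 512):
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    # Decoder: three dense blocks with batch normalization and dropout.
    for units in (512, 256, 128):
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(0.3))
    # Stage-1 head: 7-unit softmax; stage 2 swaps this for 25 sigmoid units.
    model.add(layers.Dense(n_outputs, activation=head_activation))
    return model

stage1_model = build_custom_convnet()   # full-face multiclass model for FER2013
```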
In the initial stage of full-facial multiclass emotion classification, we configured the custom convnet model with specific parameters. We used the Adam optimizer, set the learning rate to 0.001, and employed categorical cross-entropy as the loss function, with accuracy as the performance metric. The hyperparameters included 100 epochs and a batch size of 256 for all the datasets. We incorporated callback functions such as early stopping, model checkpoints, learning rate decay, and CSV loggers. For the subsequent stage focusing on half-facial multi-label emotion classification, we maintained the Adam optimizer with a 0.001 learning rate but switched to binary cross-entropy as the loss function and used binary accuracy with a 0.5 threshold as the metric. The hyperparameters were adjusted to 1000 epochs, with a training batch size of 32 and smaller batch sizes of eight for validation and testing. The same callback functions as in the first stage were retained. A configuration sketch is given below.
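The configuration sketch below assumes the `stage1_model` and `augmenter` from the previous sketches; the callback patience values and file names are illustrative, while the optimizer, learning rate, losses, metrics, epochs, and batch sizes follow the text.

```python
from tensorflow.keras import callbacks, metrics, optimizers

def get_callbacks(tag):
    return [
        callbacks.EarlyStopping(patience=10, restore_best_weights=True),
        callbacks.ModelCheckpoint(f"{tag}_best.h5", save_best_only=True),
        callbacks.ReduceLROnPlateau(factor=0.5, patience=5),   # learning-rate decay
        callbacks.CSVLogger(f"{tag}_log.csv"),
    ]

# Stage 1: full-face multiclass classification on FER2013.
stage1_model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                     loss="categorical_crossentropy",
                     metrics=["accuracy"])
# train_flow = augmenter.flow(x_train_rgb, y_train, batch_size=256)
# stage1_model.fit(train_flow, validation_data=(x_val_rgb, y_val),
#                  epochs=100, callbacks=get_callbacks("stage1"))

# Stage 2 (applied after the head swap shown later): binary cross-entropy and
# binary accuracy at a 0.5 threshold, 1000 epochs, training batch size 32.
stage2_compile_kwargs = dict(
    optimizer=optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=[metrics.BinaryAccuracy(threshold=0.5)],
)
```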
We fine-tuned the pretrained models by unfreezing the shallow layers in the encoders of the pretrained convnets and freezing the deep layers to preserve the conceptual idea of knowledge transfer. The decoders employed the same design as our custom-designed convnet across all the pretrained models. The pre-trained meta-architecture model structures for staged transfer learning (STL) are shown in Fig. 4.
Fig. 4 [Images not available. See PDF.]
Pretrained models encoder-decoder architecture design for staged transfer learning on both fer2013 full facial multiclass and emoface half-facial multilabel emotion recognition
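The sketch below illustrates the pretrained encoder-decoder design for one of the meta-architectures (VGG16 is used as an example, assuming Keras applications); exactly which encoder layers are left trainable is an assumption that mirrors, but may not match, the authors' partial-freezing scheme.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_pretrained_model(n_outputs=7, head_activation="softmax"):
    encoder = VGG16(weights="imagenet", include_top=False,
                    input_shape=(100, 100, 3))
    # Keep the shallow layers trainable and freeze the deeper blocks,
    # mirroring the partial fine-tuning described in the text (the cut-off
    # index here is an assumption).
    for layer in encoder.layers[4:]:
        layer.trainable = False

    x = layers.Flatten()(encoder.output)
    for units in (512, 256, 128):                  # same decoder as the custom convnet
        x = layers.Dense(units, activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(n_outputs, activation=head_activation)(x)
    return models.Model(encoder.input, outputs)

vgg16_stage1 = build_pretrained_model()            # 7-unit softmax head for FER2013
```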
During the initial stage of compiling the pretrained models, we employed the Adam optimizer, the categorical cross-entropy loss function, and an accuracy metric. The training process consisted of 100 epochs with a batch size of 256 for the training, validation, and testing subsets. We used the same callbacks as those used for the custom convolutional neural network. For the second stage, we switched to binary cross-entropy (logistic loss) as the loss function and changed the metric to binary accuracy with a threshold value of 0.5. We reduced the batch size to 16 for training and 8 for validation and testing. The callbacks were consistent with those used for the custom convnet.
In our staged transfer learning, the first stage on the FER2013 dataset uses pseudo-RGB grayscale full facial images as the input and seven output classes for multiclass emotion classification. Subsequently, the Emoface stage uses half-facial color images as the input and 25 binary classifiers as the output for multi-label emotion classification. This strategy entails removing the output layer and appending a new layer with the required number of outputs and the corresponding activation function. Our multistage learning approach aims to shift (shake) learnable parameters between different spaces, enabling step-by-step learning and reducing the burden of extensive representation learning, which typically occurs in a single learning shot. The main concept of the proposed technique is shown in Fig. 5, and a sketch of the head-swapping step follows the figure.
Fig. 5 [Images not available. See PDF.]
Parameter shaking/shifting concept of staged transfer learning for multi-label half-facial emotion recognition
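A minimal sketch of the head-swapping step behind the parameter shifting idea, assuming the Keras functional API, a trained stage-1 model such as `stage1_model` from the earlier sketches, and the `stage2_compile_kwargs` settings defined previously; the layer indexing is illustrative.

```python
from tensorflow.keras import layers, models

def to_stage2(stage1_model, n_labels=25):
    """Drop the 7-way softmax head and attach 25 independent sigmoid units."""
    features = stage1_model.layers[-2].output      # representation before the old head
    new_head = layers.Dense(n_labels, activation="sigmoid",
                            name="multilabel_head")(features)
    return models.Model(stage1_model.input, new_head)

stage2_model = to_stage2(stage1_model)
stage2_model.compile(**stage2_compile_kwargs)      # stage-2 settings from the earlier sketch
```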
Evaluation metrics and performance measures
To calculate the numerical difference (distance measure) between the two halves of each image, we employed the standard Euclidean distance function represented by the following equation:
$$D = \sqrt{\sum_{i=1}^{H}\sum_{j=1}^{W}\left(F_1(i,j) - F_2(i,j)\right)^2} \quad (1)$$
In Eq. (1), $D$ represents the Euclidean distance between the binary edge-detected image functions $F_1$ and $F_2$, where $H$ and $W$ denote the height and width of the images, respectively, and $i$ and $j$ are the indices for height and width, respectively.
In the initial phase of multiclass full facial emotion classification, the primary loss function typically employed is categorical cross-entropy loss. This function is mathematically represented by the following formulas:
$$L_{\mathrm{CCE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log\left(\hat{y}_{i,c}\right) \quad (2)$$
The categorical cross-entropy loss function in Eq. (2) uses $N$ as the total sample count and $C$ as the number of classes (seven in the first stage). The variable $y_{i,c}$ denotes whether sample $i$ is part of class $c$ (1 if true, 0 if false), while $\hat{y}_{i,c}$ represents the model's predicted probability for class $c$. This loss function calculates the negative log of the predicted probability for the correct class and then averages this value across all classes and samples. A lower loss value indicates superior model performance.
In the second phase of multi-label half facial emotion classification, binary cross-entropy was employed as the primary loss function. This function is represented by the following mathematical notation:
$$L_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log\left(\hat{y}_i\right) + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right)\right] \quad (3)$$
The most frequently employed function for binary classification and multi-label classification is the binary cross-entropy loss, also known as logistic loss. This function evaluates the disparity between the actual and predicted probabilities by imposing penalties for deviations from the true labels. In Eq. (3), $N$ signifies the total number of samples (images), $y_i$ represents the actual label (either 0 or 1), and $\hat{y}_i$ denotes the predicted probability that the sample label is 1. When a prediction is flawless, the cross-entropy loss equals zero, with lower loss values indicating superior model performance.
Accuracy served as the evaluation metric, utilizing multiclass accuracy for the first stage and binary accuracy for the second stage of multilabel classification with similar fundamental concepts and mathematical principles. In multiclass accuracy, the highest probability index determines the predicted class label, whereas in binary accuracy, the output of each neuron in the final layer of the 25 binary classifiers is binarized to 0 or 1 based on a 0.5 threshold. The mathematical notation for both accuracies is as follows:
$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left(\hat{y}_i = y_i\right) \quad (4)$$
where $N$ represents the total number of samples, $\hat{y}_i$ is the predicted class label for the $i$-th sample, $y_i$ is the actual class label for the $i$-th sample, and $\mathbf{1}(\cdot)$ is the indicator function that equals 1 if $\hat{y}_i = y_i$ and 0 otherwise.

Results
A complete computer vision system for multi-label half-facial emotion recognition is depicted in Fig. 6.
This section details the results of multi-label half-facial emotion recognition using a custom convnet and five pretrained meta-architectures (Fig. 6). The performance of each model was assessed on the training, cross-validation, and testing datasets, focusing on binary accuracy and binary cross-entropy loss (logistic loss). Additionally, we compared the half-facial strategy with the full-facial strategy using the proposed staged transfer learning (STL), as illustrated in Table 2.
Fig. 6 [Images not available. See PDF.]
Sample image of the multi-label half-facial emotion recognition full computer vision pipeline system
Table 2. Staged transfer learning multi-label full-facial and half-facial emotion classification training, cross-validation, and testing performance metrics
Model | Training binary cross entropy loss | Training binary accuracy | Validation binary cross entropy loss | Validation binary accuracy | Testing binary cross entropy loss | Testing binary accuracy |
|---|---|---|---|---|---|---|
Staged transfer learning full-face multi-label facial emotion recognition | ||||||
Custom Convnet | 0.1910 | 0.9280 | 0.2221 | 0.9181 | 0.2241 | 0.9162 |
VGG16 | 0.1745 | 0.9352 | 0.2042 | 0.9259 | 0.2170 | 0.9231 |
VGG19 | 0.1361 | 0.9494 | 0.2096 | 0.9274 | 0.2262 | 0.9218 |
MobileNet | 0.1589 | 0.9499 | 0.2349 | 0.9184 | 0.2630 | 0.9161 |
DenseNet | 0.1454 | 0.9510 | 0.2142 | 0.9239 | 0.2249 | 0.9212 |
ResNet | 0.2209 | 0.9197 | 0.2378 | 0.9184 | 0.2510 | 0.9138 |
Staged transfer learning half-face multi-label facial emotion recognition | ||||||
Custom Convnet | 0.1951 | 0.9260 | 0.2264 | 0.9198 | 0.2303 | 0.9162 |
VGG16 | 0.1893 | 0.9271 | 0.2289 | 0.9188 | 0.2316 | 0.9187 |
VGG19 | 0.1828 | 0.9258 | 0.2256 | 0.9179 | 0.2299 | 0.9175 |
MobileNet | 0.2033 | 0.9293 | 0.2548 | 0.9177 | 0.2631 | 0.9156 |
DenseNet | 0.1669 | 0.9377 | 0.2283 | 0.9196 | 0.2319 | 0.9168 |
ResNet | 0.3030 | 0.9002 | 0.3049 | 0.8977 | 0.3046 | 0.8982 |
We calculated the relative difference ratios for the binary cross-entropy loss and binary accuracy between full- and half-facial multi-label emotion recognition on the training, cross-validation, and testing datasets (Fig. 7).
Fig. 7 [Images not available. See PDF.]
Relative differences in a binary cross-entropy logistic loss and b binary accuracy between half and full faces for three different datasets
We assessed each image batch’s numerical difference normal distribution histogram using a custom convnet, evaluating the binary accuracy and binary cross-entropy loss to identify potential correlations between numerical differences and deep learning model performance. The dataset was divided into six batches based on the histogram of Euclidean distances between the right and left halves of the facial images. For each batch, binary cross-entropy loss (logistic loss) and binary accuracy at a 0.5 threshold were calculated using a custom convnet. Batch 1 (12 images) achieved a loss of 0.2564 and accuracy of 0.9181, while Batch 2 (164 images) had a loss of 0.2402 and accuracy of 0.9187. Batch 3 (1363 images) reported a loss of 0.2347 and an accuracy of 0.9173, and Batch 4 (2110 images) showed a loss of 0.2324 and an accuracy of 0.9169. Batch 5 (329 images) had a loss of 0.2344 and an accuracy of 0.9128, and batch 6 (12 images) achieved the lowest loss of 0.1691 and the highest accuracy of 0.9233.
The binary cross-entropy loss and binary accuracy at a 0.5 threshold were evaluated for both the right and left halves of 4000 facial images across several models. For the custom convolutional neural network (CNN), the right half exhibited a loss of 0.1987 and an accuracy of 88.78%, while the left half showed a loss of 0.1995 and an accuracy of 88.79%. VGG16 yielded a loss of 0.1916 and an accuracy of 88.65% for the right half and a 0.1932 loss with 88.55% accuracy for the left half. Similarly, VGG19 reported losses of 0.1881 and 0.1899, with accuracies of 88.67% and 88.85% for the right and left halves, respectively. MobileNet produced a loss of 0.1985 and an accuracy of 89.75% for the right halves and a loss of 0.1978 with 89.96% accuracy for the left halves. DenseNet recorded a loss of 0.1708 and an accuracy of 90.11% for the right half and a loss of 0.1721 with 90.10% accuracy for the left half. Finally, ResNet showed a loss of 0.2950 and an accuracy of 89.00% for the right half and a loss of 0.2958 with an accuracy of 89.08% for the left half.
The processing inference time (time complexity) for full- and half-spatial-dimensional images is compared in Table 3 and Fig. 8, and the computational load (space complexity) of the full and half facial imagery datasets is compared in Table 4.
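How per-sample inference time can be measured for this comparison is sketched below, assuming trained Keras models for the full-face and half-face strategies (`full_model` and `half_model` are hypothetical names) and held-out test arrays; the batch size and repeat count are arbitrary.

```python
import time

def average_inference_time_ms(model, images, repeats=3):
    """Average per-sample prediction time in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        model.predict(images, batch_size=8, verbose=0)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(images))

# t_full = average_inference_time_ms(full_model, X_test_full)
# t_half = average_inference_time_ms(half_model, X_test_half)
# reduction = 100.0 * (t_full - t_half) / t_full    # time reduction ratio (%)
```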
Samples from the testing dataset used to validate the performance and efficiency of multi-label half-facial emotion recognition are shown in Fig. 9.
Discussion
We evaluated the models using the binary accuracy and binary cross-entropy loss across the training, cross-validation, and testing datasets. As noted in Table 2, the custom convnet performs well with a balanced binary cross-entropy loss and binary accuracy, although slight overfitting is evident because the training accuracy is slightly higher than the validation and testing accuracies. Both vgg16 and vgg19 exhibited consistent performances, with vgg19 showing a slight advantage in terms of lower training loss and good generalization. MobileNet has a higher training accuracy but a drop in cross-validation and testing accuracy, indicating overfitting with higher validation and testing logistic losses compared to other models. DenseNet demonstrated the best overall performance in training binary accuracy and logistic loss, with consistent results across the validation and testing sets. Although the cross-validation loss was slightly higher than that of the VGG model, the testing accuracies remained similar. ResNet shows the poorest performance, with the highest binary cross-entropy loss and lowest binary accuracy across all datasets, indicating difficulty in learning or training on the half facial emoface dataset. DenseNet is deemed to be the most effective and robust owing to its balanced performance, low logistic loss, and high binary accuracy. VGG models also exhibit strong and reliable performance and can be considered robust alternatives to DenseNet. Custom ConvNet and MobileNet exhibited slight overfitting, which can be mitigated by additional regularization, data collection, and data augmentation. ResNet’s poor performance suggests it may not be suitable for this task or requires different tuning.
Table 2 and Fig. 7 show that full-facial multi-label emotion recognition consistently surpasses the half-facial approach across all deep learning models: switching from full to half faces slightly increases the binary cross-entropy loss and slightly decreases the binary accuracy. This indicates that full-face images provide more reliable data and richer spatial information, leading to more accurate facial emotion classification, although the relative differences remain small.
We used the custom convnet model to evaluate batches of images in the histogram based on binary accuracy with a 0.5 threshold and binary cross-entropy loss. Each batch was split into 50% right halves and 50% left halves, and the custom convnet was used to measure the performance of each batch separately. The results demonstrated no correlation between the batch numerical differences and deep learning model performance. To transform a facial-based task into a half-facial task, all the images in the dataset must be visually inspected for side views or orientations. In addition, to eliminate any dependency on the numerical differences, the dataset should be preprocessed using either a mix of 50% right halves and 50% left halves or a combination of all right and all left halves. This ensures fair learning and gives the models the flexibility to interpret emotions from either side of the human face. We used the fourth strategy to reinforce diversity and ease the identification of human emotions in multiple forms from either half of the human face.
The performance of the custom convnet and the five pretrained models on the 4000-image right- and left-half subsets of the Emoface dataset is described as follows. DenseNet exhibited the best performance for both halves, with the lowest logistic loss and the highest binary accuracy. VGG16, VGG19, and the custom convnet demonstrated consistent results and balanced performance across the two halves, with VGG19 exhibiting slightly better accuracy on the left-half subset. MobileNet performed well, with high accuracy on both halves, slightly favoring the left half. Despite consistent results across subsets, ResNet had the lowest binary accuracy and highest binary cross-entropy loss, making it the least effective model. Overall, the model results were consistent across the two half-subsets, indicating the robustness of the dataset preparation and preprocessing and the generalization capabilities of the models. This indicates that there is no direct relationship between trained deep learning model performance and the side of the face being evaluated, as all models were trained on a mix of right and left halves.
Table 3 and Fig. 8 illustrate that employing the half facial image strategy for multi-label facial emotion recognition notably decreases the average processing time per sample image (inference time) across all models compared to full facial images. Densenet, MobileNet, and ResNet demonstrated the highest efficiency, with time reductions exceeding 50%. This makes the half-facial approach particularly suitable for real-time applications, where speed is critical. Although Custom Convnet, VGG16, and VGG19 exhibited lower time reductions of less than 50%, they still presented significant improvements and substantial gains. These significant reductions in computational complexity in terms of time highlight the industrial and practical benefits of half-facial multilabel emotion recognition for real-time applications, such as multi-label facial emotion recognition apps and self-driving car systems for monitoring the emotional states of drivers and passengers.
Table 3. Processing (inference) times per sample using half- and full-face methods
Different models | Average time per sample full face (ms) | Average time per sample half face (ms) | Time reduction ratio (%) |
|---|---|---|---|
Custom Convnet | 1.111 | 0.494 | 44.46 |
Vgg16 | 2.953 | 1.375 | 46.56 |
Vgg19 | 3.250 | 1.591 | 48.95 |
Mobilenet | 1.448 | 0.734 | 50.69 |
Densenet | 2.278 | 1.163 | 51.05 |
Resnet | 2.440 | 1.243 | 50.94 |
Fig. 8 [Images not available. See PDF.]
Average time per sample and time complexity reduction bar chart for full-face and half-face multilabel emotion recognition
Table 4 highlights the notable improvements in computational load and space complexity. The reduction in spatial dimensions resulted in smaller image sizes and decreased computational demands. The pixel count was reduced by approximately 55.6%, resulting in less data processing per image. Consequently, smaller image sizes facilitate faster loading times and more efficient memory usage, whether on local machines or cloud-computing resources. The storage requirements on disks were reduced by nearly 54.7%, enhancing the efficiency of large dataset storage. In addition, the decreased computational load of the half-facial approach makes it more suitable for real-time applications in mobile and embedded systems, with limited processing power and computing resources. We observed a substantial decrease in computational complexity in both memory usage and execution time for all models when transitioning from the full face (entire spatial dimension) to half the spatial space, as detailed in our research study.
Table 4. Computational size (dimensional load) using half- and full-face methods
Parameter | Previous dimension | New dimension |
|---|---|---|
Spatial Size | 150 × 150 × 3 | 100 × 100 × 3 |
Pixels count | 67,500 px | 30,000 px |
Image size | ~ 66 KB | ~ 30 KB |
Storage size on disk | ~ 265 MB | ~ 120 MB |
Computational load | High | Low |
The results underscore the importance and utility of staged transfer learning in the fields of deep learning and computer vision, particularly when dealing with limited datasets and hard learning problems, such as multi-label half facial emotion recognition. Our study demonstrated a performance that exceeded the human benchmark for emotion recognition accuracy, which ranged from 80 to 90%, thus validating the efficacy of staged transfer learning in practical applications. The primary benefit of our proposed system is its ability to process half of the input facial image data and classify multiple emotions present in it, ranging from one to four per sample image, based on the selected classification threshold. As shown in Fig. 6, the computer vision system starts by inputting the facial image into the face detection and image processing blocks and splitting the facial image into two halves vertically, followed by the ANN block including the encoder and decoder parts, which identify the emotions in both halves of the facial images (right and left).
Figure 9 also demonstrates that our models can detect one to four emotions in each input sample image. This capability is determined by the binary classification threshold hyperparameter and the specific performance aspect we aim to emphasize. The number of detectable emotions is influenced by the classification threshold, which is adjusted based on whether we prioritize high precision or high recall (sensitivity) for our multi-label emotion classification system. The performance metrics of our deep learning models, particularly binary accuracy at a classification threshold of 0.5, outperform those reported in earlier research on multi-label facial emotion recognition [63], while maintaining comparable binary accuracy and binary cross-entropy to those reported in [64]. This demonstrates the success of the staged transfer learning approach and of multi-label half-facial emotion recognition compared with previous studies.
Fig. 9 [Images not available. See PDF.]
Model inferences from random samples in the testing dataset
According to Table 5 and in the context of recent advancements in multi-label facial emotion recognition, our proposed staged transfer learning method for half-face images demonstrates a significant improvement over existing techniques. While previous studies, such as the work by Pons and Masip (2018) using residual convolutional networks and Li and Deng (2019) with deep bi-manifold CNNs, have shown considerable accuracy (86% and 88%, respectively) using full-face images, our approach achieves a superior accuracy of 91% with a reduced average loss of 0.24. Additionally, unlike the extensive models evaluated by Greco et al. (2022), which focus on full-face recognition, our method leverages the computational efficiency of half-face images, ensuring faster processing times without compromising the accuracy. Furthermore, our method’s unique contribution lies in being the first to employ staged transfer learning on half-face images, offering a novel approach that balances computational complexity with high accuracy, thus making it highly suitable for real-time applications in affective computing and human–computer interaction.
Table 5. Benchmarking comparison table for multi-label facial emotion recognition techniques
| Paper | Authors | Year | Dataset | Model architecture | Average accuracy | Average loss | Full/half face | Key findings | Unique contributions |
|---|---|---|---|---|---|---|---|---|---|
| Multi-task, Multi-label and Multi-domain Learning with Residual Convolutional Networks for Emotion Recognition [65] | Gerard Pons, David Masip | 2018 | Two non-controlled datasets | Residual Convolutional Networks | 86% | 0.33 | Full face | Improved performance by jointly learning emotion recognition and facial Action Units detection | Multi-task and multi-domain learning framework |
| Blended Emotion in-the-Wild [66] | Shan Li, Weihong Deng | 2019 | RAF-ML, JAFFE, CK+, SFEW, MMI | Deep Bi-Manifold CNN | 88% | 0.32 | Full face | Effective for multi-label expressions in the wild | Use of crowdsourced annotations for training data |
| Benchmarking Deep Networks for Facial Emotion Recognition in the Wild [67] | Antonio Greco, Nicola Strisciuglio, Mario Vento, Vincenzo Vigilante | 2022 | RAF-DB-C, RAF-DB-P | VGG, DenseNet, SENet, Xception | 87% | 0.35 | Full face | Increased robustness to real-world variations | Extensive benchmarking of multiple deep networks |
| Multi-label Facial Emotion Recognition Using Korean Drama Video Clips [63] | Heeryon Cho, Woo Kyu Kang, Younsoo Park, Sungeu Chae, Seong-joon Kim | 2022 | Korean Drama Dataset | Autoencoder, CNN, ResNet50 | 85% | 0.34 | Full face | Valuable Asian multi-label facial emotion recognition dataset | Unique dataset sourced from Korean drama video clips |
| Learning Graph Convolutional Networks for Multi-Label Recognition and Applications [68] | Z.-M. Chen, X.-S. Wei, P. Wang, Y. Guo | 2023 | RAF-DB | SVML-RGCN | 90% | 0.28 | Full face | Explored relationships between expression labels | Novel graph-based approach for multi-label recognition |
| Fore-Background Features [69] | Yuehua Feng, Ruoyan Wei | 2024 | Emotic | FB-ER, ML-ERC | 89% | 0.30 | Full face | Improved recognition by considering fore-background features | Integration of fore-background feature extraction |
| Proposed Staged Transfer Learning for Multi-Label Facial ER Method | Mohamed M. Abd ElMaksoud, Sherif H. ElGohary, Ahmed H. Kandil | 2025 | EMOFACE, FER2013 | Custom CNN, VGG16, VGG19, DenseNet, MobileNet, ResNet | 91% | 0.24 | Half face | Reduced computational complexity while maintaining high accuracy | First to use staged transfer learning on half-face images |
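For readers who want to see the general idea behind the last row of Table 5, the sketch below shows a minimal two-stage (staged) transfer learning setup in Keras with a MobileNet backbone: the ImageNet-pretrained backbone is first adapted to single-label FER2013, then fine-tuned with a 25-way sigmoid head on half-face EMOFACE images using binary cross-entropy and binary accuracy at a 0.5 threshold. The input size, number of frozen layers, dataset loaders, and hyperparameters are assumptions for illustration, not the configuration reported in this work.

```python
# Minimal two-stage transfer-learning sketch (illustrative hyperparameters).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FER2013_CLASSES = 7    # single-label stage (FER2013)
NUM_EMOFACE_LABELS = 25    # multi-label stage (EMOFACE half faces)

# Assumed 96x96 RGB inputs so ImageNet weights can be reused.
base = tf.keras.applications.MobileNet(
    weights="imagenet", include_top=False, input_shape=(96, 96, 3), pooling="avg")

# Stage 1: adapt the ImageNet backbone to facial expressions on FER2013.
stage1 = models.Sequential([base,
                            layers.Dense(NUM_FER2013_CLASSES, activation="softmax")])
stage1.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# stage1.fit(fer2013_train, validation_data=fer2013_val, epochs=10)  # hypothetical loaders

# Stage 2: reuse the adapted backbone, freeze its early layers, and train a
# sigmoid multi-label head on half-face EMOFACE images.
for layer in base.layers[:-20]:
    layer.trainable = False
stage2 = models.Sequential([base,
                            layers.Dense(NUM_EMOFACE_LABELS, activation="sigmoid")])
stage2.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
               loss="binary_crossentropy",
               metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.5)])
# stage2.fit(emoface_train, validation_data=emoface_val, epochs=20)  # hypothetical loaders
```

Analogous stage-2 heads can be attached to the other backbones listed in Table 5 (VGG16, VGG19, DenseNet, ResNet) or to a custom ConvNet.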
The outcomes of this research are highly encouraging, presenting new opportunities for researchers in deep learning and computer vision who focus on face-based biometric applications. The proposed method offers several notable advantages: reduced computational complexity in both space and time; improved privacy protection, since half faces better preserve anonymity; increased robustness to occlusion, since half faces are less affected by obstructions; exploitation of human facial symmetry; the development of feature extraction techniques specific to half faces; more efficient data annotation, as half-face images are easier and quicker to label; applicability to real-world scenarios, enabling computer vision systems to operate in challenging environments; and more efficient inference time and storage for AI algorithms.
Conclusions
Significant advancements in computational efficiency have been achieved by reducing the inference execution time and minimizing the memory and storage requirements, exemplifying green AI principles. These enhancements not only improve performance but also promote environmentally friendly AI practices by decreasing energy consumption and resource usage. Our research underscores the potential and advantages of the proposed staged transfer learning technique in deep learning and computer vision, particularly in processing half of the facial image in multi-label facial emotion recognition systems. This study opens new avenues for practitioners and researchers in computer vision and deep learning to address face-related tasks more effectively. While further exploration is needed, our findings contribute to the evolving landscape of computer vision research, paving the way for innovative applications in emotion analysis, human–computer interaction, healthcare, mental health, robotics, self-driving cars, embedded systems, and other fields.
Potential applications and future work
Our proposed method for multi-label half-facial emotion recognition shows significant promise in healthcare, robotics, and human–computer interaction (HCI). In healthcare, it can monitor patients' emotional well-being and aid mental health professionals. In robotics, it enhances empathetic interactions between robots and humans, making them better companions and assistants. In HCI, it improves the user experience by enabling systems to respond to users' emotional states. The main application of our research is to augment and support children with autism through an interactive, rehabilitative, AI-based GUI software system. Preliminary tests in non-controlled environments, such as public spaces and workplaces, using standard webcams and mobile device cameras showed high accuracy and robustness despite challenges such as varying lighting and diverse backgrounds. Deployment challenges include privacy concerns, optimization for resource-constrained devices, and handling environmental variability. Future work will focus on improving model robustness with additional training data collected from various non-controlled environments and on collaborating with industry partners to refine the method based on user feedback.
Acknowledgements
Not applicable.
Authors’ contributions
S.H. ElGohary, A.H. Kandil and M.M. Abd ElMaksoud all proposed the idea, analyzed the data, discussed the results, and composed the manuscript. All authors have read and approved the final manuscript.
Funding
Not applicable.
Data availability
Not applicable.
Declarations
Ethics approval and consent to participate
This article does not include any studies with human participants performed by any of the authors.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no conflicts of interest.
Abbreviations
FER: Facial emotion recognition
STL: Staged transfer learning
CNN: Convolutional neural networks
ROI: Region of interest
ms: Milliseconds
px: Pixels count
KB: Kilobytes
MB: Megabytes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1 Ekman, P. An argument for basic emotions. Cognition & Emotion; 1992; 6,
2 R. S. Lazarus, Emotion And Adaptation, Oxford University Press, 1991
3 Plutchik, R. Emotion, a psychoevolutionary synthesis; 1980; New York, Harper & Row:
4 Scherer KR (2005) What are emotions? And how can they be measured?. Soc Sci Inf 44(4):695–729
5 Darwin C (1872) The Expression of the Emotions in Man and Animals. John Murray, London
6 Ekman, P; Friesen, WV. "Constants across cultures in the face and emotion". J Pers Soc Psychol; 1971; 17,
7 Matsumoto D, Hwang HC (2016) Emotion and culture: An overview, in Handbook of Competence and Motivation: Theory and Application, AJ Elliot, CS Dweck, and DS Yeager, eds., 2nd ed., Guilford Press p 285–305
8 Picard RW (1997) Affective Computing. MIT Press, Cambridge, MA
9 Calvo RA, D'Mello S (2010) Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Trans Affect Comput 1(1):18–37
10 Picard RW (2000) Toward computers that recognize and respond to user emotion. IBM Syst J 39(3–4):705–719
11 Cowie R et al, (2000) FEELTRACE: An instrument for recording perceived emotion in real time, in Proc. ISCA Workshop on Speech and Emotion. Newcastle, Northern Ireland, p 19–24
12 Ekman, P. Facial expression and emotion. American Psychologist; 1993; 48,
13 Ekman P (ed) (2005) What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS), 2nd ed., Oxford University Press
14 Zeng Z, Pantic M, Roisman GI, Huang TS (2009) A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Trans Pattern Anal Mach Intell 31(1):39–58
15 Pantic M, Rothkrantz LJM (2003) Toward an affect-sensitive multimodal human–computer interaction. Proceedings of the IEEE 91(9):1370–1390
16 Pantic M, Pentland A, Nijholt A, Huang TS (2006) Human computing and machine understanding of human behavior: A survey, in Proc. 8th International Conference on Multimodal Interfaces, Banff, AB, Canada p 239–248
17 D'Mello S, Kory J (2015) A review and meta-analysis of multimodal affect detection systems. ACM Computing Surveys 47(3):1–36
18 Cowie R, Douglas-Cowie MG, Douglas-Cowie E (2012) Tracing emotion: An overview. Int J Synt Emot 3(1):1–17
19 Schuller B, Rigoll G, Lang M (2004) Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine–belief network architecture. vol. 1 in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Montreal, p 577–580
20 Ekman P, Davidson R. J (Eds.) (1994) The Nature of Emotion: Fundamental Questions. Oxford University Press
21 Barrett, LF; Russell, JA. The Structure of Current Affect: Controversies and Emerging Consensus. Current Directions in Psychological Science; 1999; 8,
22 Goleman D (1995) Emotional Intelligence: Why It Can Matter More Than IQ. Bantam Books, New York
23 OpenStax, Psychology 2e, Houston, Texas: OpenStax, 2020
24 University of West Alabama (UWA), "The Science Of Emotion: Exploring The Basics Of Emotional Psychology," 27 June 2019. [Online]. Available: https://online.uwa.edu/news/emotional-psychology/
25 K. Cherry, "Emotions and types of emotional responses," 29 June 2023. [Online]. Available: https://www.verywellmind.com/what-are-emotions-2795178#:~:text=Plutchik%20proposed%20eight%20primary%20emotional,as%20happiness%20%2B%20anticipation%20%3D%20excitement.
26 Worthy L. D, Lavigne T, Romero F (2019) Culture and Psychology. Maricopa County Community College District
27 Wikipedia contributors, "Emotion Classification," 8 March 2024. [Online]. Available: https://en.wikipedia.org/wiki/Emotion_classification
28 S. Seconds, "Plutchik’s Wheel of Emotions: Exploring the Emotion Wheel," Six Seconds, 7 February 2024. [Online]. Available: https://www.6seconds.org/2022/03/13/plutchik-wheel-emotions/
29 Wikipedia contributors, "Robert Plutchik," 8 November 2023. [Online]. Available: https://en.wikipedia.org/wiki/Robert_Plutchik
30 University of Glasgow—University news, "Humans express four basic emotions rather than six," February 2014. [Online]. Available: https://www.gla.ac.uk/news/archiveofnews/2014/february/headline_306019_en.html
31 The Scotsman, "Four emotions shown through expressions—research," The Scotsman, 3 February 2014. [Online]. Available: https://www.scotsman.com/news/four-emotions-shown-through-expressions-research-1546042
32 BPS, "Study of dynamic facial expressions suggests there are four basic emotions, not six," 25 October 2022. [Online]. Available: https://www.bps.org.uk/research-digest/study-dynamic-facial-expressions-suggests-there-are-four-basic-emotions-not-six
33 Cowen, AS; Keltner, D. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences; 2017; 114,
34 Tracy, JL; Randles, D. Four Models of Basic Emotions: A Review of Ekman and Cordaro, Izard, Levenson, and Panksepp. Emotion Review; 2011; 3,
35 A. C. R. Paiva, R. Prada, and R. W. Picard (Eds.), Affective Computing and Intelligent Interaction: Second International Conference, ACII 2007, Lisbon, Portugal, September 12–14, 2007, Proceedings. Springer, 2007
36 C. Dalvi, M. Rathod, S. Patil, S. Gite and K. Kotecha, "A Survey of AI-Based Facial Emotion Recognition: Features, ML & DL Techniques, Age-Wise Datasets and Future Directions," IEEE Access, vol. 9, pp. 165,806—165,840, 2021
37 Li W, Zhang P, Huang W (2021) A New Deep Learning Method for Multi-label Facial Expression Recognition based on Local Constraint Features. International Conference on Computer Engineering and Artificial Intelligence (ICCEAI), Shanghai
38 Zhao K, Zhang H, Dong M, Guo J, Qi Y, Song Y-Z (2013) A multi-label classification approach for Facial Expression Recognition. Visual Communications and Image Processing (VCIP), Kuching, Malaysia
39 T. L. Praveena and N. Lakshmi, "Multi Label Classification for Emotion Analysis of Autism Spectrum Disorder Children using Deep Neural Networks," in Third International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 2021
40 H. Cho, W. K. Kang, Y.-S. Park, S. G. Chae and S.-j. Kim, "Multi-Label Facial Emotion Recognition Using Korean Drama Video Clips," in IEEE International Conference on Big Data and Smart Computing (BigComp), Daegu, Korea, 2022
41 H. Han and E. B. Sönmez, "Facial Expression Recognition on Wild and Multi-Label Faces with Deep Learning," in 3rd International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), Tenerife, Canary Islands, Spain, 2023
42 M. M. Abd ElMaksoud and S. H. ElGohary, "Transfer Learning for Multi-label Emotion Recognition: A Promising Approach for Biomedical Applications," in 11th International Japan-Africa Conference on Electronics, Communications, and Computations (JAC-ECC), Alexandria, Egypt, 2023
43 W. Mellouk and W. Handouzi, "Facial emotion recognition using deep learning: review and insights," Procedia Computer Science, vol. 175, pp. 689–694, 2020
44 J. Tao, T. Tan, and R. W. Picard (Eds.), Affective Computing and Intelligent Interaction. Springer, 2005
45 R. Vats et al., "Comprehensive review and analysis on facial emotion recognition methods," Journal of Electronic Imaging, vol. 32, no. 4, p. 040901, 2023
46 E. Kodhai, A. Pooveswari, P. Sharmila and N. Ramiya, "Literature Review on Emotion Recognition System," in International Conference on System, Computation, Automation and Networking (ICSCAN), Pondicherry, India, 2020
47 S. S. Ali et al., "Automatic Facial Emotion Recognition: A Review," in Fifth College of Science International Conference of Recent Trends in Information Technology (CSCTIT), Baghdad, Iraq, 2022
48 Adolphs, R. Neural systems for recognizing emotion. Current Opinion in Neurobiology; 2002; 12,
49 P. Ekman and W. V. Friesen, Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues, Prentice-Hall, 1975
50 Borod, JC; Haywood, MR; Koff, S. Neuropsychological aspects of facial asymmetry during emotional expression: A review. Neuropsychology Review; 1997; 7,
51 Sackeim, H; Gur, R; Saucy, J. Emotional asymmetry in the human face: Evidence from a split-face test. Neuropsychologia; 1978; 16,
52 Chen, C; Whitney, D. Facial emotion recognition using deep learning: The role of left vs. right hemiface in training. IEEE Transactions on Affective Computing; 2019; 12,
53 Ngan, S; Lyon, MD. Masked face recognition: Challenges and future directions. Pattern Recognition Letters; 2021; 141, pp. 74-80.
54 Calder, B et al. Perception of facial expressions from partial information. Cognitive Neuroscience Journal; 2005; 17,
55 Y. LeCun et al., "Efficient deep learning architectures for vision applications," Proceedings of CVPR, pp. 1547–1556, 2019
56 A. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017
57 Baltrušaitis, G; Robinson, P; Morency, L-P. Cross-domain learning in facial emotion recognition. IEEE Transactions on Affective Computing; 2018; 9,
58 Soleymani, M et al. Multimodal affective computing and AI. IEEE Transactions on Affective Computing; 2019; 10,
59 Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014)
60 Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017)
61 Huang, Gao, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. "Densely connected convolutional networks." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. 2017
62 He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. 2016
63 H. Cho, W. K. Kang, Y. -S. Park, S. G. Chae and S. -j. Kim, "Multi-Label Facial Emotion Recognition Using Korean Drama Video Clips," 2022 IEEE International Conference on Big Data and Smart Computing (BigComp), Daegu, Korea, Republic of, 2022, pp. 215–221, https://doi.org/10.1109/BigComp54360.2022.0004
64 M. M. A. ElMaksoud, A. H. Kandil and S. H. ElGohary, "Staged Transfer Learning for Multi-Label Facial Emotion Recognition from Full Faces," in 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES), Giza, Egypt, 2024, pp. 66–71. https://doi.org/10.1109/NILES63360.2024.10753183
65 G. Pons and D. Masip, "Multi-task, multi-label and multi-domain learning with residual convolutional networks for emotion recognition," arXiv preprint arXiv:1802.06664, 2018
66 S. Li and W. Deng, "Blended Emotion in-the-Wild: Multi-label Facial Expression Recognition Using Crowdsourced Annotations and Deep Locality Feature Learning," International Journal of Computer Vision, vol. 127, no. 3, pp. 884–906, 2019. [Online]. Available: https://doi.org/10.1007/s11263-018-1131-1
67 A. Greco, N. Strisciuglio, M. Vento, and V. Vigilante, "Benchmarking deep networks for facial emotion recognition in the wild," Multimedia Tools and Applications, vol. 82, pp. 11189–11220, 2022. [Online]. Available: https://doi.org/10.1007/s11042-022-10734-2
68 Z. -M. Chen, X. -S. Wei, P. Wang and Y. Guo, "Learning Graph Convolutional Networks for Multi-Label Recognition and Applications," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 6969–6983, 1 June 2023, https://doi.org/10.1109/TPAMI.2021.3063496
69 Y. Feng and R. Wei, "Method of Multi-Label Visual Emotion Recognition Fusing Fore-Background Features," Applied Sciences, vol. 14, no. 18, p. 8564, 2024. [Online]. Available: https://www.mdpi.com/2076-3417/14/18/8564