Facial emotion recognition using deep Siamese neural networks: multi-classifier fusion for single-emotion and multi-emotion models across age groups

Abstract

This research examines facial expressions as social cues in online platforms, focusing on online learning and remote work. Our high-accuracy emotion recognition framework is designed for post-session evaluation, aiding teaching strategies and identifying students needing support without real-time monitoring. By employing a sophisticated fusion of data, image, and feature-level analysis, complemented by multi-classifier systems, this study capitalizes on the strengths of Siamese networks to achieve a refined understanding of emotion recognition. The investigation spans various age groups and ethnicities, employing multiple datasets such as LIRIS-CSE, Cohn-Kanade, and Jaffe, alongside the author’s datasets for children and teens. This rigorous examination underlines the role of Information Fusion in enriching communication and collaboration within digital interfaces. The research underscores the use of advanced techniques in interpreting facial cues by merging Siamese networks with pretrained models such as VGG 19 and Inception Resnet V2. The paper compares Siamese networks with other architectures for remote work/play, and asserts that such networks are more flexible. Comparatively, networks with Siamese architecture use convolutional neural network and recurrent neural network for input branches while other networks such as VGG19 and Inception Resnet V2 use a single neural network. The SCNN-IRV2 model demonstrated impressive test accuracies, ranging from 96 to 99% for single-emotion models and 84–96% for multi-emotion models, reflecting an improvement of 1.05–10.13% over the Inception Resnet V2 CNN architecture. In a similar vein, the SCNN-VGG19 model achieved test accuracies between 95% and 99% for single-emotion models and 75–94% for multi-emotion models, surpassing the VGG19 CNN architecture by 207–422%. These findings highlight the role of advanced fusion techniques and thoughtful design in improving online education and remote work, fostering progress in emotional data analysis and human-computer interaction.

Full text

Translate

Turn on search term navigation

Introduction

Facial emotions

Facial expressions are important in so far as they give people much information about their feelings and thoughts [1]. At first the signals like widened eyes are more or less learnt purely for protection but they are now taught to as a way of interacting including warning and the pacifying of aggression among others [2]. This evolution paved way to the formation of languages and structuredmethod of communication [2]. Studying facial affect is very important for understanding people’s behavior and personality as well as advancing artificial intelligence [1].

Need for facial emotion recognition

Facial emotion recognition is adapting the human and computer interface in game, robots, hospitals, and retail and marketing. Emotion detection algorithms make a user’s experience enjoyable, boost the effectiveness of technologies, and provide a perspective on people’s behavior [3]. This technology helps to gain insight on how exactly people experience emotions with consideration to biology, culture and social aspect. Critics have noted that studies with findings that call into question the universality of the six regulating dimensions present in the Ekman’s work, stress the culture, age, and gender aspects of the subject [4]. Mobile through smartphone, security, IoT devices, facial emotion recognition application mainly contributes in enhancing the future technologies specifically user focused technologies and makes machines and devices more intelligent to understand humans in digital age.

Use of FER for online platforms

There is a technology boon that has been serving as the foundation of human-computer interaction (HCI) which involves human and computer systems have been unable to identify and recognize human emotions appropriately for better decision-making processes. In view of this, emotional recognition systems are critical so that machines are able to identify and respond in the right manner in relation to the emotion being expressed [5]. The demand for such systems has emerged due to new possibilities of the internet and appearance ofhumanoid robots in the wake of the Covid-19. These systems are based on the use of machine learning and help facilitate communication while at the same time fight cyber bullying and promote the safety of students. Consequently, it is possible to assert that the emotional significance of graphics is critical for both the proper identification of emotions on the World Wide Web and the provision of relevant reactions [6].

Technology has embraced the higher learning through flexibility and customised learning initiatives which have boosted the system’s uptake, but the conventional systems do not track the student behaviors well. This is another factor that makes it challenging to ascertain the kind of mood a learner is experiencing particularly if one never gets to see them physically. To address this there is need to include facial emotion detection to monitor the students’ learning activities and emotions. Previous work on determining the emotion from an image has beenЇlatest work on determining the emotion from an image has been with low accuracy and involves high labelling cost, but use of facial organ state detection could help in accurate image labelling [7]. In remote work, facial emotion recognition has gained importance due to challenges in expressing emotions and building trust within geographically dispersed teams. The physical separation inherent in remote work hinders communication and task urgency, leading to job isolation, as Emerson (1976) noted. Facial emotion recognition technology can improve self-expression, foster trust, and enhance team collaboration [8]. Amid the COVID-19 pandemic and the rise of virtual meetings, facial expressions have become a critical signal of emotions, often being the only visible form of communication. This has driven the growing demand for emotion detection technology to better understand and interpret feelings in remote environments [9].

Use of machine learning for FER

SVM and AdaBoost are both popular models used primarily for facial emotion recognition. SVM is most known for its accuracy and flexibility in coping with different data sizes. In facial emotion recognition, AdaBoost is used as either a feature selector or a classifier, as it builds a powerful classier from weak classes that`s’ why it is called boosting algorithm. In addition, multi-class AdaBoost was developed for different facial emotion recognition tasks. On the other hand, in this area, another classifier is employed - it is Random Forest comprising of the numerous decision trees. Nevertheless, typical machine learning models such as these suffer from performance drawbacks because they are not deep learners and can’t handle nonlinear or high dimensional data.

Use of deep learning for FER

An emergence of Deep Learning (DL) has changed areas like computer vision together with natural language processing. Through its employment in Emotion Recognition, it gives way for researchers to extract complex patterns within various datasets thus improve our understanding of subtle emotions [10]. Deep learning has generated major advancements in facial emotion recognition. Methods ranging from data preprocessing through augmentation to attention mechanisms plus spatiotemporal architectures have been employed by researchers who utilized convolutional neural networks (CNNs) to conduct such an effort. According to some research articles different innovative strategies were considered; in particular one article talked about the simultaneous use of more than one CNN model alone as well as integrating them into LSTMs (Long Short-Term Memory) networks. These advancements continue to hold promise for increasing the accuracy of FER.

Use of Siamese networks for FER

In Meta-Learning netwroks such as Matching Networks, Prototypical Networks, and Relation Networks, build upon the concept of metric learning by utilizing embeddings for few-shot classification. These networks differ in how they compute and compare embeddings, with Matching Networks focusing on cosine distance, Prototypical Networks using class prototypes, and Relation Networks learning a custom distance function. While these approaches improve upon each other in various ways, Siamese Networks remain a preferred choice in computer vision due to their simplicity, robust performance, and ability to handle the fundamental task of comparing image pairs effectively [11]. The Siamese Networks consist of two similar neural networks and, they learn the hidden representation of the input vectors that distinguish them by distance measures like cosine measure. This architecture is especially useful for the given tasks like image matching and face recognition because such tasks involve comparing the samples of complex data [12]. Using grids, previous approaches of image recognition are not only slow computationally, but also less accurate especially on large dataset while Siamese Networks are characterized by generalization capability and very high performance on all the datasets [13]. The general approach of learning good features takes time and can prove to be cumbersome with little information; this is something that Siamese Networks can assist in averting [14]. Thus, Siamese Networks could be as powerful as FER by inherently featuring the local structure of the embedding space. Siamese Networks help to cluster the classes in the dataset and leads to better performances and generalizations. The process of feature extraction and classification over a given period is also balanced in order to give accurate FER according to Shao et al. [15]

Ethical implications of FER research

The major concerns with FER technology are issues to do with privacy, fairness and potential negative impact on the vulnerable groups such as the children. They include issues to do with skewed results, privacy infringement, and improper use of emotions. Biases are more likely to be kept in FER systems because the created datasets are unbalanced regarding race, age, and gender which impact the recognition of emotions depending on cultural and demographic discriminants. The gathering of emotion data can be problematic in most cases such as in education, the legal system, and the medical field and may encourage surveillance capitalism. A major problem of FER is thus its reliability since annotating the emotion is a subjective process that can be influenced by the conditions chosen by the raters. Similar to the process of identifying falsehoods and the truth in facial expressions, FER tends to be quite subjective with the assumptions it makes. Facial expressions are captured in a more elicited manner from the databases data, so it gives great inter observer consistency but less natural emotions and thus difficult for the emotions classification [16]. To reduce the effects of bias, it is very essential to add enough variance into one’s training set. Nevertheless, bias is not entirely diminished in this approach. Among the identified considerations for creating an unbiased performance, it is necessary to incorporate users from different racial backgrounds into the training set [17].

While creating the FER dataset, the relatively minor ethical issues like privacy and biases were of concern and the data set contains images from different age groups, races, and gender to reduce bias as much as possible. Thus, we consider that the suggested improvements are able to reduce the number of bias problems in FER systems and make it more equal. As researchers, however, it should be acknowledged that this is quite a persistent problem. These biases will require further research and the diversification of the data as the only way to work towards a more efficient and fair emotion recognition software. It is for this reason that the ethical implications of FER, especially when using sensitive information like children’s facial images, are very crucial. In response to such issues, the data applied in this study as well as the proprietary data were collected with proper conformity to consent and the convenience of the participants. Consent to participate within the study was obtained through consent forms by all the members of the sample in a voluntary manner. Furthermore, to avoid melodrama of privacy issues and data protection, the datasets have been archived in a secure place and the access rights are only available to a selected few and in limited capacities. This information is tightly regulated, and it will stay that way indefinite period with the option of making the data set public when the necessary measures to ensure the privacy of the data subjects will be taken. This show the promises given and protection of individuals’ confidentiality and rights this makes the research legal and moral. According to the collected datasets given below though they are restricted [by way of size they contain spontaneous samples of emotions. It was rather demanding to get a variety of emotions targeting the diverse ethnicities and demographics of the society, as well as some of the test subjects avoiding participation. Though, we tried our best to seek participation from individuals belonging to different regions of India especially different categories.

Contributions of the paper

Novel Algorithm Introduction: This research is about a revolutionary code that combines two convolutional neural networks based on Siamese architecture: Inception Resnet V2 and VGG19 together with pre-trained models as well as discusses how separate models can be built per emotion recognized; thus, by merging these models, accuracy and efficiency of already existing ones are increased.
Extensive Research on Multi-Source Dataset: The analysis was conducted by the study through use of a multi-sourced data set involving varying ages groups ranging between kids and grown-ups. By going through it you realize how strong and good for all people that was suggested in that context thereby making it to be relevant under potential situations happening daily.
Development of a High-Accuracy, Emotion-Specific Modeling Framework : We introduced a novel methodology of constructing and training dedicated, single-emotion Siamese models for each target emotion. This approach diverges from standard multi-class classification and is specifically designed to maximize precision and recall for individual emotional states, a significant shift in model design for FER.
A Novel Multi-Model Fusion and Evaluation Protocol for Siamese Networks : We designed and implemented a comprehensive testing framework that collectively leverages our suite of single-emotion models for robust multi-emotion recognition. This contribution includes the novel development of a flexible inference system where parameters like the number of reference images and similarity thresholds can be tuned per emotion to optimize performance and mitigate bias, representing a new approach to deploying Siamese ensembles.
Innovative Testing Method Development: We pioneered a testing methodology that moves beyond simple argmax classification. By employing configurable, emotion-specific similarity thresholds and a “winning model” selection process, our framework provides a flexible mechanism to control for dominant emotions and boost the recognition of subtle or under-represented expressions, directly addressing common bias issues in FER systems.
- Over another. It provides an improved and sophisticated way in detecting feelings since it enhances the whole performance in the algorithm suggested.
Creation of a Unique Spontaneous Emotion Dataset: The research has introduced a distinct and all-inclusive emotion dataset with a focus on teenagers that is spontaneous. The establishment of this dataset helped fill a notable void in available resources and so became an invaluable tool for subsequent studies within hitherto unchartered territories of emotion recognition.
- These contributions collectively demonstrate the paper’s significant impact on advancing the field of emotion recognition, offering novel techniques, comprehensive analysis, and groundbreaking results that contribute to the existing knowledge and pave the way for further advancements in this domain.

Review of literature

The COVID-19 pandemic has compelled individuals to embrace new digital lifestyles, which involved the adoption of online teaching methods., as well as remote work and work-from-home arrangements across many industries. In the realm of education, most schools and universities have shifted to online or hybrid teaching methods, which present unique challenges for both teachers and students. One critical aspect of teaching effectiveness is analysing the emotions of students to gauge their learning moods. Due to the pandemic, many industries have also shifted to remote work and work-from-home arrangements, presenting new challenges for both employers and employees. In this context, the ability to analyse the emotional states of remote workers can be critical for understanding their engagement, productivity, and well-being. While physiological indices are integral to emotion generation, research predominantly focuses on human behaviours like facial expressions, voice, text, and gestures for emotion recognition [18]. Facial expressions serve as a primary visual indicator, revealing underlying human intentions, physiological responses, and emotional and cognitive states during communication [19]. Furthermore, the ability to analyse emotional states in both educational and professional settings can be enhanced through the use of machine learning techniques such as meta-learning, which has shown promising results in related fields. However, there is currently a lack of research on the application of meta-learning for emotional analysis in online education and remote work. To bridge this knowledge gap, the researchers in this case study performed an extensive review of existing literature to investigate the utilization of machine learning in the domain of facial emotion recognition. Specifically, they focused on exploring the utilization of Siamese networks as documented in Tables 1, 2, 3 and 4. Additionally, a comprehensive survey was conducted to examine the latest techniques employed in the field of facial emotion recognition, encompassing both basic and advanced machine learning approaches, as illustrated in Fig. 1.

Table 1. Emotion recognition using Siamese networks

Ref.	Methodology	Modality & Dataset	Issues	Accuracy
[20]	Deep Siamese neural network.	Modality: Facial Images Dataset: AffectNet	1. No comparison with other recent methods. 2. Limited dataset evaluation. 3. No details on the proposed loss function. 4. Limited discussion on hyperparameter impact.	64%
[21]	MLARE (Meta-Learning Approach to Recognize Emotions)	Modality: Facial expressions. Dataset: AED-2	The paper acknowledges its limited in-house dataset and lack of comparison with existing methods for emotion recognition. Ethical considerations regarding emotion recognition technology are not fully addressed.	90.6%
[22]	Siamese CNN	Modality: Facial expressions. Dataset: Multi-PIE	Limitations of the method include limited dataset evaluation on Multi-PIE, dependency on large training data, insufficient feature analysis, and lack of comparison with state-of-the-art methods for facial expression recognition.	84.87% − 88.47%
[23]	MREAP (metric-based meta-learning + Siamese Networks)	Modality: Facial expressions. Dataset: AED-1	Although the findings of the proposed work are promising, it should be noted that the methodology has only been evaluated on an in-house dataset, and its performance on other datasets or real-world scenarios needs further investigation.	80%
[24]	STLEV (Siamese Neural Network with triplet loss and LSTM architecture)	Modality: video data. Dataset: BU-4DFE	Proposed method needs: 1. Larger and diverse dataset evaluation. 2. Detailed computational complexity analysis. 3. Comprehensive comparison with state-of-the-art methods. 4. Addressing interpretability of decision-making process.	87.5%

Table 2. Emotion recognition using meta-learning

Ref.	Methodology	Modality & Dataset	Issues	Accuracy
[25]	Few-shot meta-learning with prototypical networks.	Modality: Facial images Dataset: CMU Multi-PIE & AffectNet	The method used many episodes from the CMU Multi-PIE dataset, contradicting claims of requiring fewer data. Achieved 68% accuracy on AffectNet, but reliance on many episodes may limit applicability in low-data scenarios.	90% - CMU Multi-PIE and 68% - AffectNet
[26]	Meta-Learning Supervisor Neural Network	Modality: Facial Images Dataset: The BookClub artistic makeup Dataset	The study has limitations such as a small dataset that may not be representative of the general population, limited emotional expressions, and closed eyes photo-shoots. The results lack in-depth analysis, comparison with other methods, and future direction for improvement.	54% − 72%
[27]	Few-shot classification with cross-validation	Modality: video Dataset: SMIC, CASME, and CASME II.	The method proposed relies on pre-trained models and requires ground truth frames for feature computation, limiting applicability. Generalization to real-world scenarios is unclear, and evaluation is limited to a few datasets.	69.59% - CASME 80.95% - CASME II 63.13% SMIC
[28]	Meta-transfer learning	Modality: Audio and Visual Dataset: eNTERFACE, SAVEE, and EMO-DB.	The proposed transfer learning method for emotion recognition addresses data scarcity, but limitations include potential performance issues, computational expense, and limited evaluation on benchmark datasets, raising concerns about generalizability.	94% - FER 85% − 94% - Audio Based Emotion Recognition

Table 3. Emotion recognition for online platforms

Ref.	Methodology	Modality & Dataset	Issues	Accuracy
[29]	Convolutional Neural Networks (CNN)	Modality: Facial Images Dataset: FER-2013	Limitations include no information on scaling for multi-device or classroom scenarios, improving dataset quality, or enhancing model accuracy. The paper lacks test accuracy, which is a more reliable performance measure, as it only reports validation accuracy.	Validation Accuracy − 65% Test Accuracy - N/A
[30]	Convolutional Neural Network (CNN)	Modality: Facial Images Dataset: FER-2013	Limitations include small sample size, webcam quality and lighting conditions affecting accuracy, variation based on individual factors, subjective labeling, and uncertainty about optimal model architecture.	86%
[31]	LBP and Wavelet Transform	Modality: Facial Images Dataset: Surveillance video on online learning students	The research has limitations in dataset diversity, ethical concerns related to privacy, and the accuracy of the proposed methodology for online learning behavior analysis. Further research and validation are needed to address these limitations and ensure reliability.	80.04%

Ref.

Methodology

Modality & Dataset

Issues

Accuracy

[29]

Convolutional Neural Networks (CNN)

Modality: Facial Images

Dataset: FER-2013

Limitations include no information on scaling for multi-device or classroom scenarios, improving dataset quality, or enhancing model accuracy. The paper lacks test accuracy, which is a more reliable performance measure, as it only reports validation accuracy.

Validation Accuracy − 65%

Test Accuracy - N/A

[30]

Convolutional Neural Network (CNN)

Modality: Facial Images

Dataset: FER-2013

Limitations include small sample size, webcam quality and lighting conditions affecting accuracy, variation based on individual factors, subjective labeling, and uncertainty about optimal model architecture.

86%

[31]

LBP and Wavelet Transform

Modality: Facial Images

Dataset: Surveillance video on online learning students

The research has limitations in dataset diversity, ethical concerns related to privacy, and the accuracy of the proposed methodology for online learning behavior analysis. Further research and validation are needed to address these limitations and ensure reliability.

80.04%

Table 4. Emotion recognition for remote work

Ref.	Methodology	Modality & Dataset	Issues	Accuracy
[32]	Convolutional neural network (CNN) based on the VGG16 architecture	Modality: Facial expressions, head movements, and gaze behavior. Datasets: CK+, JAFFE, BU-3DFE, FFQH, and a large FER dataset, as well as data collected from a virtual seminar called COINs 2020.	Limitations include difficulty establishing causality and restricted ability of video presentations to convey emotions. Additional investigation is required to delve into the correlation between emotions and the quality of presentations in more depth.	84% − 82.6%

Ref.

Methodology

Modality & Dataset

Issues

Accuracy

[32]

Convolutional neural network (CNN) based on the VGG16 architecture

Modality: Facial expressions, head movements, and gaze behavior.

Datasets: CK+, JAFFE, BU-3DFE, FFQH, and a large FER dataset, as well as data collected from a virtual seminar called COINs 2020.

Limitations include difficulty establishing causality and restricted ability of video presentations to convey emotions. Additional investigation is required to delve into the correlation between emotions and the quality of presentations in more depth.

84% − 82.6%

Table 2 highlights the effectiveness of meta-learning techniques in enhancing model adaptability and generalization across diverse datasets and age groups. Table 3 focuses on emotion recognition within online platforms, demonstrating how the proposed framework improves engagement assessment in virtual learning environments. Table 4 examines the applicability of the model in remote work scenarios, emphasizing its role in evaluating emotional cues to enhance collaboration and productivity.

Recent studies have explored diverse modalities and non-contact approaches for emotion recognition and human state assessment. Zhu et al. [33] introduced RMER-DT, a robust multimodal framework leveraging diffusion and transformer architectures to improve emotion recognition in conversational contexts. Similarly, Zhu et al. [34] proposed a client–server–based system for non-contact assessment of single and multiple emotional and behavioral states, demonstrating its applicability in real-world biomedical scenarios. In a related direction, Zhang et al. [35] investigated WiFi-based non-contact human presence detection technology, highlighting the potential of wireless sensing as a complementary modality for emotion and behavior analysis.

[See PDF for image]

Fig. 1

Flow chart representing SOTA techniques in FER

Gap analysis

A comprehensive analysis of the literature, as summarized in Tables 1, 2, 3 and 4, reveals several consistent limitations and a clear research gap that this study aims to address. While Siamese and meta-learning networks show promise for FER, existing studies are often constrained by;

Evaluation on limited or non-diverse datasets [20, 21, 23].
A lack of comparison with strong baseline models to concretely establish superiority.
An absence of work focused on the crucial demographics of children and teenagers in spontaneous settings.
Limited exploration of model flexibility and parameter tuning to handle real-world challenges like class imbalance and bias.

Furthermore, the application of these advanced architectures specifically for the post-hoc analysis of online learning and remote work environments remains underexplored. There is a distinct lack of a comprehensive framework that is both highly accurate and practically adaptable for use across diverse age groups and datasets.

Therefore, this study is necessitated by the need to develop a robust, flexible, and highly accurate FER framework that:

Systematically combines Siamese networks with powerful pre-trained feature extractors (VGG19, InceptionResNetV2) in a novel fusion approach.
Is rigorously validated on a multi-source, multi-age, multi-ethnicity dataset, including novel spontaneous datasets for children and teens.
Provides a clear and significant performance benchmark against standard CNN baselines.
Introduces a tunable, multi-model testing protocol to address bias and improve adaptability for real-world deployment.

This work directly addresses these gaps, moving beyond incremental improvements to offer a novel and holistic solution for non-real-time emotion recognition.

Methodology

Dataset

This section gives an overview of the data sets used in FER research. Variety exists in quantity, recipients, recording conditions as well as in age categories of the multi-source dataset. The data ranges contain both controlled and spontaneous expressions and some are posed while others are naturally elicited. Furthermore, modes of colours vary in the datasets; some are RGB while others are grayscale. They have been used in training and testing many FER models to improve their accuracy and performance. More information about the datasets that were used is provided below in Table 5.

Table 5. Comparing the datasets used

Dataset	Sample	Subjects	Recording Condition	Elicitation Method	Expressions	Age Group	Color Mode
Author’s Kids	81 Videos	12	Home	Spontaneous + Posed	Anger, Disgust, Fear, Happy, Neutral, Sad, Surprise	7–10	RGB
Author’s Teen	314 Videos	14	Home + Classroom	Spontaneous	Anger, Disgust, Fear, Happy, Sad, Surprise	14–19	RGB
LIRIS-CSE	208 Videos	12	Controlled + Home	Spontaneous + Posed	Disgust, Fear-Surprise, Happy, Surprise, Fear, Sad, Happy-Disgust, Happy-Surprise, Confusing	4–12	RGB
Cohn - Kanade	363 Images	123	Controlled + Lab	Posed	Anger, Contempt, Disgust, Fear, Happy, Neutral, Sad, Surprise	18–50	Grayscale
JAFFE	213 Images	10	Lab	Posed	Anger, Disgust, Fear, Happy, Neutral, Sad, Surprise	19–30	Grayscale

Author’s Kids dataset

[See PDF for image]

Fig. 2

Sample images of the Author’s Kids dataset

The authors of the dataset on children’s facial expressions utilized professional equipment and established a controlled environment with optimal lighting conditions. They used a high-quality DSLR camera, capable of recording videos at 60 frames per second, to capture a wide range of emotions displayed by the children. The dataset is divided into two parts: one comprising posed videos with 81 samples, and the other consisting of recorded online study lectures during the COVID-19 pandemic. The videos were captured at a resolution suitable for high-definition playback, ensuring a clear and detailed analysis of facial expressions. The dataset primarily includes children from India due to accessibility reasons. All the children included in the dataset fall within the age range of 7 to 10 years. Sample images of this dataset can be seen in Fig. 2 above.

Author’s teen dataset

[See PDF for image]

Fig. 3

Sample images of the Author’s Teen dataset

A unique and groundbreaking dataset capturing spontaneous emotions in teenagers has been created. The dataset comprises 14 subjects, including both male and female individuals aged between 14 and 19. It stands out as a diverse collection, representing various ethnicities. The subjects were exposed to specially designed videos aimed at eliciting six fundamental emotions: anger, disgust, fear, happiness, sadness, and surprise. The reactions of the subjects were recorded using a high-definition camera, capturing the subtle nuances of their emotional expressions. The recordings took place in both home and classroom settings, providing a comprehensive range of environments. The dataset consists of 314 videos, each lasting between 5 to 8 seconds. This remarkable dataset encompasses approximately 59,000 frames, including instances of the ‘Neutral’ emotion, offering a comprehensive exploration of teenage emotions. Sample images of this dataset can be seen in Fig. 3 below.

LIRIS-CSE dataset

[See PDF for image]

Fig. 4

Example of a reproduced image from the ‘CSE/allowed images’ directory. Copyright notice: ©LIRIS-CSE

The dataset comprises dynamic images of a dozen children of different ethnicities exhibiting six universal emotional expressions in a non-restrictive setting. The recordings were captured using a high-speed webcam mounted on a laptop, enabling spontaneous and natural expressions. The dataset was verified by 22 human evaluators of various ages. The reason for the author’s usage of this dataset is attributed to the shortage of research on recognizing emotions independently in children and to demonstrate the robustness of the model. The researchers aimed to investigate facial expression recognition from children to adults, and they employed this dataset to compare it with their own children’s dataset [36, 37–38]. Sample images of this dataset can be seen in Fig. 4 below.

Cohn-Kanade dataset

[See PDF for image]

Fig. 5

Example images of the Cohn-Kanade dataset

The Cohn-Kanade database is a well-known repository of facial expressions that includes images of people exhibiting diverse emotions and facial expressions. The database is one that has been developed by the researchers of the University of Pittsburgh and the Carnegie Mellon University to acquire the facial expressions of the subjects through video cameras. It is annotated with Facial Action Coding System or action units and emotionally-specified expressions. The authors used this dataset while carrying out their study to demonstrate the performance capacity of their proposed model in recognizing FER to adults. Furthermore, to evaluate the FER of the proposed model, the authors employed the Cohn-Kanade database for the sake of the adult category as it is crucial in the case of online learning and remote work [39, 40–41]. Some of the sample of the images that fall under this classification are shown in Fig. 5 below.

JAFFE dataset

[See PDF for image]

Fig. 6

Sample images of the JAFFE dataset

The JAFFE dataset comprises 213 grayscale images showcasing 10 Japanese female models exhibiting 7 distinct facial expressions, encompassing 6 fundamental expressions and 1 neutral expression. These images underwent evaluation by 60 Japanese individuals, who rated them based on 6 emotion descriptors. Kyushu University’s Psychology Department served as the setting for capturing the dataset, which was compiled by Michael Lyons, Miyuki Kamachi, and Jiro Gyoba. The images are in TIFF format, without compression, and have a resolution of 256 × 256 pixels. The authors chose this dataset due to its reputation as a challenging dataset for FER, and because they lacked an equivalent dataset of female teenagers to balance the dataset they had for male teenagers in their research [42, 43, 44–45]. Sample images of this dataset can be seen in Fig. 6 below.

Deep models used

VGG19 and InceptionResNetV2 are popular deep convolutional neural network models used for FER. We have decided to fine-tune these models and combine them with a Siamese network for their study. They made slight variations in the fine-tuning process for each model.

VGG19 architecture & feature extraction

VGG19 was selected as a feature extraction backbone due to its deep architecture and proven effectiveness in capturing detailed features. Its architecture consists of 19 layers: 16 convolutional layers using 3 × 3 filters with ReLU activation, 5 max-pooling layers (2 × 2 windows), and 3 fully connected (FC) layers. For our Siamese network integration, we utilized the pre-trained VGG19 model with ImageNet weights. We removed the original FC layers (responsible for ImageNet classification) and froze the weights of all convolutional layers to preserve the pre-trained feature representations. The output of the final convolutional block (a 3 × 3 × 512 feature map for a 100 × 100 input) was then used as the feature embedding for the Siamese distance calculation.

InceptionResNet V2 architecture & feature extraction

InceptionResNet V2 was used as it combines the strengths of both Inception and ResNet architectures. The Inception modules extract multi-scale features using varying convolutional filter sizes (1 × 1, 3 × 3, 5 × 5) within the same layer, while the residual connections (skip connections) mitigate the vanishing gradient problem and enable training of very deep networks. Similar to VGG19, we employed the pre-trained ImageNet weights, discarded the original classification head, and froze the convolutional base. The resulting output feature map from the InceptionResNet V2 backbone was then flattened and fed into the Siamese comparison layers.

Incorporating these pre-trained models into the Siamese network offered high-quality feature extraction with managed computational complexity, leveraging their powerful pre-trained weights for strong generalized feature representation.

Proposed model

Model architecture

This study introduces Siamese Convolutional Neural Networks (SCNN) for emotion recognition, leveraging their pairwise learning mechanism to perform efficient training even with limited samples. SCNNs are particularly suited for tasks such as facial expression recognition where labeled data may be scarce or class imbalance exists. The network’s core functionality is to compare two input images and determine their similarity, a fundamental capability for nuanced emotion classification tasks.

The SCNN architecture was tested with two prominent feature extraction backbones:

SCNN-IRV2: Integrated with InceptionResNetV2 for deeper feature extraction.
- Total parameters: 54,338,273.
- Trainable parameters: 54,277,729.
- Non-trainable parameters: 60,544.
SCNN-VGG19: Integrated with VGG19 for comparison with a simpler feature extraction framework.
- Total parameters: 20,028,993.
- Trainable parameters: 4,609.
- Non-trainable parameters: 20,024,384.

The fundamental advantage of the Siamese architecture over a standard CNN for FER lies in its learning objective. A standard CNN classifier is trained for categorical classification. It learns to map an input image to one of N emotion labels by minimizing a categorical cross-entropy loss. This forces the network to find features that are discriminative for separating all classes simultaneously, which can be challenging with subtle inter-class variations (e.g., between Fear and Surprise) and significant intra-class variations (e.g., different subjects expressing Happiness).

In contrast, our Siamese network is trained for metric learning. Its objective is not to classify a single image but to learn a feature embedding space where the Euclidean or Manhattan distance between two images directly corresponds to their semantic similarity. The twin networks (VGG19/IRV2) act as powerful feature extractors (Φ) that are optimized to project input images into this embedding space. The network learns to minimize the distance between embeddings of the same emotion (anchor-positive pairs) and maximize the distance between embeddings of different emotions (anchor-negative pairs). This pairwise training paradigm forces the feature extractors to learn highly nuanced and discriminative features that are robust to identity, ethnicity, and pose, as the model’s success depends on its ability to judge similarity rather than assign a label based on potentially biased cues. The final binary classification layer simply interprets this learned distance metric.

In addition, this study emphasizes the adaptability of the SCNN architecture. This adaptability facilitates tuning the similarity score threshold and feature learning to address class imbalance and improve computational efficiency. Figure 7 illustrates the comprehensive workflow of the designed model, encompassing the entire process from dataset preparation to the final prediction results obtained.

[See PDF for image]

Fig. 7

Representation of the architecture of the proposed model

The detailed architecture of the Feature Extraction Layer mentioned in Fig. 7.

[See PDF for image]

Fig. 8

Representation of the (a) SCNN-IRV2 and (b) SCNN-VGG19 models using Netron

Figure 8 illustrates the architectural representation of the SCNN-IRV2 and SCNN-VGG19 models using Netron, a tool for visualizing neural network structures. Subfigure (a) depicts the SCNN-IRV2 model, showcasing its fusion of Siamese Convolutional Neural Networks (SCNNs) with Inception ResNet V2, highlighting its ability to process facial expressions through parallel CNN and RNN branches. Subfigure (b) presents the SCNN-VGG19 model, demonstrating its integration of Siamese networks with VGG19, leveraging deep feature extraction for enhanced emotion recognition. These visualizations provide insight into the model architectures, facilitating a clearer understanding of their design and functionality in emotion classification tasks.

Training workflow

Data Preparation
- For each emotion-specific model, the data was divided into distinct subsets:
  - Anchor: Representative images of the target emotion.
  - Positive: Images sharing the same emotion as the anchor.
  - Negative: Images from other emotions, ensuring equal representation compared to anchor and positive sets.
- The dataset structure enables the generation of labeled image pairs for training
  - The dataset structure enables the generation of labeled image pairs for training
  - Anchor-Negative Pairs: Labelled as dissimilar (0).
  To ensure balance and reduce bias, the reference images for each class were carefully distributed across anchor, positive, and negative subsets.

Preprocessing

Images were resized to 100 × 100 pixels and normalized to values between 0 and 1. These transformations standardize the input data for efficient feature extraction by the Siamese network.

Training procedure

Loss Function: The model was trained using binary cross-entropy (BCE) loss. This choice is fundamental to the Siamese network’s pairwise learning paradigm. Unlike a standard multi-class classifier that uses categorical cross-entropy to assign a single label to an image, our network receives a pair of images and must predict a binary outcome: 1 (same emotion) or 0 (different emotions). The BCE loss is perfectly suited for this binary classification task. It measures the divergence between the predicted probability of similarity (a value between 0 and 1 from the sigmoid output) and the true binary label, effectively teaching the network to minimize the distance between feature vectors of the same emotion and maximize it for different emotions. This directly aligns with the core objective of metric learning for emotion recognition.
Optimizer: The Adam optimizer was used with an initial learning rate of 1e-4. Adam was selected for its adaptive learning rate capabilities, which combine the advantages of AdaGrad and RMSProp, making it well-suited for the complex parameter space of the deep Siamese architecture and promoting stable convergence.
Gradient Updates: A custom training loop utilized TensorFlow’s GradientTape to compute and apply gradients for fine-grained optimization of model weights. This approach provided greater flexibility in implementing the pairwise training logic.

The training process incorporated checkpointing mechanisms to ensure model integrity and recovery during training. Hyperparameters such as learning rate (1e-4), batch size (kept between 4 and 8), and epochs (up to 150–200) were carefully tuned to achieve a desired accuracy threshold, with early stopping applied to prevent overfitting.

Separate SCNN models were trained for each emotion to ensure specialization and generalization across unseen samples. The models were fine-tuned based on cross-class assessments to address any performance gaps between classes.

Testing methodology

Individual Model Evaluation
- For each trained model:
  Predictions: Each image pair was assigned a similarity score. Scores exceeding a threshold of 0.9 were categorized as similar.
  Performance Metrics:
  - Accuracy: Measures overall prediction correctness.
  - Precision and Recall: Evaluate the model's capability to identify true positive and negative pairs effectively.
  - F1 Score: Balances precision and recall for a unified performance metric.

[See PDF for image]

Fig. 9

Representation of the training of SCNN model for a single emotion

Figure 9 illustrates the training process of the SCNN model for recognizing a single emotion. The diagram represents how Siamese networks process paired facial images, extracting deep features through CNN and RNN branches to measure similarity and distinguish emotional expressions. The model undergoes iterative learning, refining its ability to classify emotions with high accuracy. This visualization highlights the training dynamics, emphasizing the role of contrastive loss in optimizing feature differentiation for precise emotion recognition.

Multi-Model Testing.

A multi-model framework was designed to assess collective performance across emotions:

Preprocessing: Consistent resizing and normalization across all test sets.
Testing Workflow: For each emotion-specific model, pairs of input images were processed, and the proportion of positive predictions was calculated.
Flexibility: The framework allowed adjustment of key parameters such as the number of reference images per model and the similarity score threshold to optimize performance. For instance, a similarity threshold of 90% was used to classify pairs as similar, with the model showing the highest matches selected as the final prediction.

[See PDF for image]

Fig. 10

Method to test the individual emotion models (SCNN-Multi-model) trained in real-time

Figure 10 illustrates the testing methodology for individual emotion models within the SCNN-Multi-model framework in real-time. This approach involves feeding facial images into the trained Siamese network, where deep feature extraction and similarity comparison enable accurate emotion classification. The model evaluates each input against learned representations, ensuring robust recognition across different facial expressions. This real-time testing process validates the effectiveness of the SCNN-Multi-model in dynamic environments, making it suitable for applications in online learning, remote work, and human-computer interaction.

Features and disadvantages

Features:

Binary Classification: Simplifies the interpretability of the model’s predictions, making it straightforward to identify emotional states.
Custom Distance Calculation: Includes a custom Keras layer to compute the L1 distance (Manhattan distance) between feature embeddings. This metric quantifies the similarity between two input images, a critical aspect for accurate emotion recognition.
Addressing Class Imbalance: Unlike traditional methods, the proposed methodology ensures effective learning despite unequal dataset sizes across classes. By training on single emotions, the model compares each emotion against the summation of all others, equalized in size. This approach enhances learning and minimizes bias. Furthermore, the multi-modal testing framework refines performance by storing reference images of all emotions against which test images are compared. Dominant emotions can be controlled by reducing reference images or increasing the similarity threshold, and vice versa for underperforming emotions.
Scalability and Adaptability: The architecture supports adjustments in parameters like reference image count and similarity thresholds, enabling customization for specific dataset characteristics and improving robustness.

Disadvantages:

Challenges with Pairwise Learning: The pairwise learning approach may encounter difficulties when dealing with large datasets exceeding 2000–3000 pairs for a single emotion. This could lead to confusion and hinder effective learning. A gradual increase in the size of positive, negative, and anchor sets is recommended, with 400–500 images identified as an optimal threshold to maintain training effectiveness and clarity.

Computational consumption of the proposed architectures

System Configuration.
- ◦The experiments were conducted on a system equipped with an Intel^® Xeon^® Silver 4208 CPU @ 2.10 GHz processor, 128 GB RAM, and a 64-bit operating system.
Training Performance.
- ◦SCNN-IRV2 Model (50 epochs).
  - Total training time: 924.88 s.
  - Memory utilization: 4098.61 MB.
  - GPU memory consumption: 38 MB.
- ◦SCNN-VGG Model (50 epochs).
  - Total training time: 347.97 s.
  - Memory utilization: 546.52 MB.
  - GPU memory consumption: 7 MB.
Inference Performance (Testing on 10 Images with 10 Reference Images).
- ◦SCNN-IRV2 Model.
  - Total inference time: 32.37 s.
  - Memory utilization: 803.45 MB.
  - GPU memory consumption: 14 MB.
- ◦SCNN-VGG Model.
  - Total inference time: 12.42 s.
  - Memory utilization: 265.64 MB.
  - GPU memory consumption: 14 MB.

Computational complexity and Real-Time viability

The data presented in above reveals a clear trade-off between model complexity, accuracy, and computational demand, which is critical for assessing real-world deployment potential.

The SCNN-IRV2 model, with its 54.3 million total parameters, achieves superior accuracy, particularly on datasets with subtle expressions (JAFFE, LIRIS) and across diverse demographics. However, this comes at a significant computational cost: a training time approximately 2.7x longer and a memory footprint nearly 7.5x larger than SCNN-VGG19 during training. Its inference time of 3.24 s per image (32.37s / 10 images) makes it unsuitable for real-time processing on standard hardware but ideal for high-accuracy post-session analysis, which is the primary application focus of this work.

Conversely, the SCNN-VGG19 model, with only 4,609 trainable parameters (a result of heavy freezing), presents a highly efficient alternative. Its significantly lower memory consumption (546.52 MB vs. 4098.61 MB) and faster inference (1.24 s per image) make it a compelling candidate for near-real-time applications on resource-constrained devices, including edge computing platforms or integrated web applications. While its absolute accuracy is generally lower than SCNN-IRV2, its performance—especially the dramatic 207–422% improvement over its baseline VGG19 CNN—remains highly effective. This suggests that the Siamese framework imbues even lightweight architectures with powerful discriminative capabilities.

This analysis provides a clear guideline for practitioners:

For applications demanding the highest possible accuracy in offline, post-hoc analysis (e.g., detailed educational or workplace engagement reports), SCNN-IRV2 is the recommended choice.
For applications requiring faster feedback or running on limited hardware (e.g., a live pilot indicator in a virtual meeting platform), SCNN-VGG19 offers an excellent balance of performance and efficiency, effectively making real-time emotion recognition more viable.

Mathematical representation of the proposed model

This section provides the mathematical formalization of the training procedure described in Sect. “Training Procedure”, detailing the loss computation and parameter updates.

Let A denote the input image tensor. Similarly, let B represent the validation image tensor of the same shape as A.

Feature extraction using VGG19/InceptionResnet V2 represented using M

Here, F(A) and F(B) are feature tensors extracted from the input and validation images, respectively, using the feature extraction model.

Calculation of L1 distance

Here, D(A, B) is the L1 distance tensor calculated element-wise between the corresponding elements of the feature tensors) F(A) and F(B). The shape of D(A, B) is the same as that of F(A) and F(B).

Flattening

Here, Flatten(D) reshapes the L1 distance tensor D(A, B) into a 2D tensor, where D_flat represents the flattened dimension, typically obtained by multiplying the height, width, and depth of D(A, B).

Dense layer

Here, W represents the weight matrix of the dense layer, and b represents the bias vector. The output tensor is obtained by applying the sigmoid activation function σ to the affine transformation of the flattened L1 distance tensor.

Training process

Where α is the learning rate, θ are the model parameters, and ∇θ denotes the gradient w.r.t. θ. The training process involves iteratively updating the weights of the Siamese model using the Adam optimizer with a predefined learning rate. The binary cross-entropy loss is computed for each pair of anchor images and validation. The loss function is minimized by adjusting the model parameters in the direction that reduces the discrepancy between predicted and ground truth similarities.

Gradient descent update

The weights are updated using the gradient descent algorithm, where the gradients of the loss function with respect to the model parameters are computed in Eq. (3). The updated weights are obtained by subtracting the scaled gradients from the current weights, as shown in the Eq. (4).

Batch training

For each batch j = 1,2…, S loss is computed and weights are updated as described above. To enhance model generalization and prevent overfitting, the Siamese network is trained using mini-batch stochastic gradient descent. Each training iteration involves selecting a batch of anchor-validation image pairs from the training dataset. This enables the model to learn from multiple examples simultaneously, facilitating faster convergence and improved generalization.

Inference

Once the Siamese network is trained, it can be used to predict the class using the similarity (σ) between new pairs of images during inference. The computed similarity is obtained from the output of the classification layer, which applies a sigmoid activation function to the flattened representation of the Siamese distance components. The thresholding of similarity using a predefined threshold value yields the final binary similarity decision.

Testing

Let Z_true,Z_pred,B and S be actual binary labels of test data, binary labels predicted by the Siamese model, total number of data batches in testing and total number of samples in the testing respectively.

Experimental evaluations are conducted to assess the performance of the proposed Siamese network architecture. This includes measuring metrics such as accuracy, precision, recall, and F1 score on a separate validation dataset to validate the effectiveness of the trained model in identifying similar image pairs.

Breakdown of preprocessing methods used in brief

In Sect. “Breakdown of Preprocessing Methods used in Brief”, we highlight that each dataset holds its individuality, and as such, different preprocessing techniques were employed to prepare them for the approach of Siamese Networks. These tailored preprocessing methods ensure that the datasets are optimized and compatible with the specific requirements of the Siamese network architecture, allowing for more accurate emotion recognition across diverse datasets.

Author’s kids

In the author’s dataset on children’s facial expressions, videos were first processed to extract individual frames. All videos were recorded at 60 frames per second (fps). For efficient processing and to minimize redundancy between consecutive frames, we extracted frames at a rate of 1 frame per second (1 fps). This sampling rate was chosen to capture a wide variety of expressions without over-representing near-identical frames. The first step was to convert them into frames of size 100 × 100 in JPEG/JPG format. Additionally, the Haar cascade classifier was utilized to detect and crop out faces from the frames. However, this conversion from video files to frames introduced the possibility of overfitting due to the presence of highly similar, sequential images. To actively mitigate this source of overfitting and ensure the model learns from a diverse set of facial expressions rather than memorizing minor temporal variations, the ‘fiftyOne’ library was used to filter out unique images. The ‘compute_uniqueness’ function was utilized to calculate the uniqueness of each sample relative to the rest of the samples, effectively removing duplicate and near-duplicate images and minimizing biases. For each subject, the 50 most unique images were filtered and used.

[See PDF for image]

Fig. 11

Uniqueness score computed using the FiftyOne library for each frame extracted from the video of the Author’s Kids dataset for the emotion Anger, 1 being the most unique

As seen from Fig. 11 each frame’s uniqueness is computed in the lower-left corner of the image. The final data split of the preprocessed data can be seen below in Fig. 12.

[See PDF for image]

Fig. 12

Final Author’s Kids dataset after preprocessing

Author’s Teen dataset

A similar process was used for the Author’s Teen Dataset as used in 3.5.1. Videos were processed by extracting frames at a sampling rate of 1 fps. These frames were then converted into JPEG/JPG format and resized to 100 × 100 pixels. Faces were detected using the Haar cascade classifier. To mitigate overfitting from highly similar sequential frames extracted from video, the ‘fiftyOne’ library filtered out duplicates using the ‘compute_uniqueness’ function. The top 50 unique images per subject were selected for analysis. Sample images of the unique images are shown in Fig. 13 above.

[See PDF for image]

Fig. 13

Sample Image from the Author’s teen Dataset of the uniqueness computed using the FifityOne library

The final data split of the preprocessed data is shown below in Fig. 14.

[See PDF for image]

Fig. 14

Final Author’s Teen Dataset after preprocessing

Cohn-Kanade dataset

This dataset was found with duplicates thus using the ‘hash’ function duplicates were first deleted. The size of all the images present were 48 × 48 but for the model designed the authors had decided for the size of the input image to be 100 × 100. Therefore, to increase the size of the image proportionally ESRGAN (Enhanced Super-Resolution Generative Adversarial Networks) was used. ESRGAN, an enhanced iteration of SRGAN, employs the RRDB architecture, relativistic GAN’s adversarial loss approach, and features before activation for perceptual loss. These modifications lead to improved visual quality with realistic textures. ESRGAN has shown superior performance, as evidenced by its first-place win in the PIRM2018-SR Challenge. The authors have utilized ESRGAN to enhance the images, leveraging its advanced features and performance. A sample is shown in Fig. 15 below.

[See PDF for image]

Fig. 15

(a) Before and (b) After using ESRGAN (Resolution of (a): 48 × 48; resolution of (b): 192 × 192)

After this, the same augmentation techniques were used as described in 9.1 and 9.2. The final data split of the preprocessed data is shown below in Fig. 16.

[See PDF for image]

Fig. 16

Final Cohn-Kanade dataset after preprocessing

LIRIS-CSE

The procedure used in Sect. 9.1 to preprocess the Author’s Kids Dataset was also applied to the Children’s Spontaneous Facial Expressions (LIRIS-CSE) Dataset. However, after grouping the emotions by subject, it was observed that not all 12 subjects depicted all seven emotions. To avoid introducing any biases, the focus was narrowed down to only four emotions: Fear, Happy, Sad, and Surprise. Subsequently, the filtered dataset was augmented using nine augmentation techniques, including horizontal and vertical flipping, random brightness adjustment, channel dropout, hue saturation, random noise addition, Gaussian blur, sharpening, and random rotation. These augmentations were employed to improve the resilience of the model for improved performance. The final data split of the preprocessed data is shown below in Fig. 17.

[See PDF for image]

Fig. 17

Final LIRIS-CSE dataset after preprocessing

JAFFE

For the Japanese Female Facial Expression (JAFFE) Database, the image resolution was already suitable for the analysis as it met the required standards. The main task was to segregate the images based on the emotion depicted by the subjects in the images, which was relatively straightforward due to the naming convention used in the dataset. No additional resizing, just the format conversion was necessary from TIFF to JPG. All the images were augmented similar to the previous datasets mentioned. The final data split of the preprocessed data is shown below in Fig. 18.

[See PDF for image]

Fig. 18

Final JAFFE dataset after preprocessing

Summary of the final datasets used for training and testing

In this section, we present the final preprocessed dataset, detailing its demographic, age, sex, and emotion distribution to provide a clear understanding of the variances involved. Given the sensitivity of FER, we aim to ensure that our findings are transparent, minimizing the risk of any conclusive bias.

[See PDF for image]

Fig. 19

Demographic distribution

[See PDF for image]

Fig. 20

Sex distribution

[See PDF for image]

Fig. 21

Emotion distribution

Figure 19 illustrates the demographic distribution of participants across various age groups, highlighting diversity in the dataset. Figure 20 presents the sex distribution, ensuring a fair representation of male and female participants to prevent gender bias in model performance. Figure 21 depicts the emotion distribution, showcasing the variety of facial expressions captured in the dataset, ensuring balanced training for both single-emotion and multi-emotion recognition tasks. These figures collectively emphasize the dataset’s diversity and its role in enhancing the model’s adaptability to real-world scenarios. Figure 22 illustrates the age distribution of participants in the final dataset used for training and testing the emotion recognition models. The dataset includes a diverse range of age groups, from children to adults, ensuring that the models generalize well across different developmental stages. This balanced representation helps in accurately capturing age-related variations in facial expressions, enhancing the model’s effectiveness for online learning, remote work, and human-computer interaction applications.

[See PDF for image]

Fig. 22

Age distribution

Experimental setup and results

Evaluation of single emotion models

In this section, we evaluate four models categorized into two groups. The first group focuses on individual emotion models trained separately. This includes the SCNN-Single Emotion Model-IRV2 (SCNN-SEM-IRV2), built on the Siamese-Inception ResNet V2 architecture, and the SCNN-Single Emotion Model-VGG19 (SCNN-SEM-VGG19), based on the Siamese-VGG19 architecture. Each model’s performance is analyzed independently to assess their capabilities in recognizing specific emotions.

Evaluation of multi-model architectures

The second category involves the collective assessment of individual emotion models, termed Multi-model evaluation. This includes SCNN-Multi-model-IRV2 (SCNN-MM-IRV2) and SCNN-Multi-model-VGG19 (SCNN-MM-VGG19). These models are compared against simpler deep learning models utilizing Inception ResNet V2 and VGG19 architectures.

Statistical analysis and metrics

To evaluate the performance of the multi-models comprehensively, a range of statistical parameters is analyzed, including:

Sensitivity
Specificity
Precision
Recall
Negative Predicted Value (NPV)
False Positive Rate (FPR)
False Discovery Rate (FDR)
False Negative Rate (FNR)
Accuracy
F1 Score
Matthews Correlation Coefficient (MCC)

This analysis ensures a thorough comparison of the models’ effectiveness in emotion recognition tasks.

Comparison with State-of-the-Art Models

Finally, the four models are compared against state-of-the-art benchmarks to evaluate their individual performances in accurately predicting emotions. This comparison highlights the strengths and limitations of each model, offering insights into their potential applications in real-world facial emotion recognition scenarios.

By structuring the evaluation into these sections, we aim to provide a clear and systematic understanding of the models’ performance and their broader implications for emotion recognition tasks.

Test method 1 (SCNN-Single emotion Models)

After training separate models for each emotion using Siamese networks in combination with VGG19 and Inception Resnet V2, the models were evaluated based on test accuracy, precision, recall, and F1-score. The models were trained on 7/10th of the total dataset, and the remaining data was kept for testing. The dataset was already segregated into positive and anchor sets for similarity comparisons and negative sets for dissimilar comparisons, with an equal number of images from the other six emotions. Validation data segregation was not needed as the Siamese network architecture handles validation implicitly. The results showed high accuracy, precision, recall, and F1-score for all emotions, indicating the effectiveness of the proposed approach as shown in Fig. 23.

[See PDF for image]

Fig. 23

Average performance evaluation of scnn-single emotion models on different metrics

Table 6. Standard deviation for Fig. 23

Model	SCNN-SEM-IVR2				SCNN-SEM-VGG19
Metrics	Test Accuracy	Precision	Recall	F1-Score	Test Accuracy	Precision	Recall	F1-Score
Author’s Kids	0.02	0.04	0.02	0.03	0	0.01	0.01	0.01
Author’s Teen	0.02	0.01	0.04	0.02	0.04	0.05	0.02	0.03
LIRIS	0.01	0.04	0	0.03	0.01	0.02	0	0.01
Cohn-Kanade	0.03	0.04	0.06	0.05	0.02	0.03	0.04	0.03
JAFFE	0.03	0.04	0.03	0.04	0.04	0.05	0.04	0.04

The data shown above in Fig. 23 represents the average performance metrics of models trained for each emotion and treated as a single model and Table 6 proves that the average calculated on the data has negligible deviation.

Author’s Kids Dataset: This dataset consists of images of both male and female children (ages 7–10) of Indian ethnicity, featuring both posed and spontaneous emotions in RGB images. Both models performed well here, with SCNN-SEM-VGG19 slightly outperforming SCNN-SEM-IRV2 in test accuracy. The high performance on this dataset suggests that both models are adept at handling images with a mix of posed and spontaneous emotional expressions in a relatively controlled environment (age and ethnicity).
Author’s Teen Dataset: Featuring both male and female teenagers (ages 14–19) of Indian ethnicity with spontaneous emotional expressions in RGB images, both models achieved high accuracies. However, SCNN-SEM-IRV2 slightly outperformed SCNN-SEM-VGG19 in terms of test accuracy, which may indicate that SCNN-SEM-IRV2 is better at generalizing emotional variations in older subjects within this dataset. This observation could be tied to the subtle nuances of emotion in spontaneous expressions in teens.
LIRIS-CSE Dataset: With images from children (ages 7–10) of various ethnicities exhibiting spontaneous emotions in RGB, both models performed exceptionally well. SCNN-SEM-VGG19 showed a slight advantage in all evaluation metrics (test accuracy, precision, recall, and F1-score). The better performance of SCNN-SEM-VGG19 on this dataset could be attributed to its enhanced ability to handle RGB images of young children displaying natural emotional expressions, which may require more advanced feature extraction capabilities.
Cohn-Kanade Dataset: The Cohn-Kanade dataset, comprising individuals of Euro-American, Afro-American, and other ethnicities (ages 18–50), with posed emotional expressions in black and white images, presented different challenges. SCNN-SEM-IRV2 outperformed SCNN-SEM-VGG19 in terms of precision, recall, and F1-score, despite both models achieving high test accuracy. This suggests that SCNN-SEM-IRV2 may be better equipped to handle posed emotions in black-and-white images, potentially due to its more robust feature extraction from monochrome images or its better adaptation to controlled emotional expressions.
JAFFE Dataset: The JAFFE dataset, featuring only female subjects of Japanese ethnicity (ages 19–30) with slight variations in emotional expressions and black-and-white images, posed significant challenges. SCNN-SEM-IRV2 outperformed SCNN-SEM-VGG19 in terms of test accuracy, precision, and recall. This could indicate that SCNN-SEM-IRV2 is more sensitive to the subtle variations in emotion present in this dataset, particularly when dealing with black-and-white images with slight variations in expressions.

In summary, both SCNN-SEM-IRV2 and SCNN-SEM-VGG19 exhibit exceptional performance across a range of emotion datasets, with all performance metrics surpassing 95% for both models. However, their relative performance varies based on dataset characteristics. SCNN-SEM-VGG19 generally outperforms in RGB images, particularly with spontaneous emotional expressions and younger subjects, while SCNN-SEM-IRV2 shows superior results with posed emotions, black-and-white images, and older age groups. Notably, SCNN-SEM-IRV2, which has more trainable parameters than VGG, may leverage its increased model complexity to better capture subtle features in these datasets. These findings highlight the significant influence of dataset characteristics on model performance in emotion recognition tasks. While both models demonstrate high accuracy, the variations in their performance across different datasets suggest that model selection should be tailored to the specific attributes of the dataset. This analysis is based on single-emotion models, but the next sections will explore the results when all emotion models are tested together. This is where potential biases and sensitivities—long-standing challenges in facial emotion recognition (FER)—will be thoroughly evaluated, offering deeper insights into the generalizability and robustness of these models. Thus, while the current results are promising, they open the door to more nuanced interpretations of FER performance, particularly when dealing with multi-emotion scenarios where inter-model dynamics can introduce further complexities. Sample images of the test can be seen in Fig. 24 above.

[See PDF for image]

Fig. 24

SCNN-Single Emotion Model Predictions: (a) Author’s Teen Dataset - ‘Happy’, (b) Cohn-Kanade dataset - ‘Anger’ (c) Author’s Kids dataset - ‘Anger’, (b) Author’s Teen Dataset - ‘Fear’

To enhance the understanding of the model’s predictions, we have included sample images for visualization. Image (a) and (b) show correct predictions, where the model accurately classifies both images in the pair. In contrast, Image (c) demonstrates an incorrect prediction, where the second row is misclassified by the model as Anger, although only one image represents Anger, while the other does not. Similarly, in Image (d), the model incorrectly classifies the second row as representing Fear, while one image shows Fear and the other shows Surprise.

Test Method 2 (SCNN-Multi-model)

Evaluation Process
- To evaluate emotion models, predictions were made on random pairs of images.
- Smaller datasets like LIRIS, JAFFE, and Cohn-Kanade were tested on their entire dataset, excluding augmented images.
- For larger datasets like the Author’s Kids and Teen datasets, computational efficiency was ensured using cross-validation and stratified sampling.
- Five sets of 100 random images were created to represent the entire dataset, and the average performance of these sets was used for evaluation.
Generalization and Testing
- To validate the model’s ability to generalize, 70% of the data was used for training, and the remaining was used for testing across all datasets.
- Testing the model on the entire dataset was crucial due to the probabilistic nature of Siamese networks, which depend on image pairs and their relationships. This ensured that the model could generalize effectively.
Emotion-Specific Analysis
- Each emotion was evaluated individually, and a correlation matrix was created to analyze relationships between emotions (e.g., Fear and Sad).
- This matrix helped understand intercorrelations and how the models scored for each emotion.
Handling Ambiguity in Predictions
- When the model failed to detect an emotion (i.e., similarity scores didn’t surpass the required threshold), this was factored into the analysis.

Comparison with Standard Models

The authors trained and tested simpler pretrained models like VGG19 and Inception ResNet V2 for comparison with their SCNN-Multi-Models.
Key Parameters for Testing.
- Reference Images: A specific number of images were selected to represent the dataset while reducing bias.
  - For Author’s Kids, Teen, and LIRIS datasets, one image per subject per emotion was chosen, ensuring fair representation.
  - These datasets were divided into five random sets, and reference images were randomly chosen for each set.
Dataset-Specific Adjustments
- For the Cohn-Kanade dataset (many subjects), all images for each subject were stored, and ten images per emotion were randomly selected for evaluation.
- For the JAFFE dataset, a detailed analysis was performed to determine the optimal number of reference images. The focus was on balancing common parameters (used across emotions) and specialized parameters (fine-tuned for dominant emotions).
Improving Model Flexibility and Accuracy
- Adjusting the reference images for dominant emotions in JAFFE demonstrated the model’s adaptability.

This approach aimed to enhance accuracy while showcasing the model’s capability to handle variations across datasets. By tailoring reference image selection and thresholds to each dataset’s characteristics, the authors effectively optimized the model’s performance and demonstrated its flexibility. This is further showed in Table 7 giving a detailed overview of the reference images used for testing SCNN-IRV2 and SCNN-VGG for the JAFFE dataset and in the third column we have the number of reference images used for other datasets.

Table 7. Parameters set for the testing of the JAFFE dataset

Emotion	Number of Reference images for IRV2	Number of Reference images for VGG19	Common Number of Reference Images for both IRV2 and VGG 19 for other datasets
Anger	10	9	10
Disgust	10	9
Fear	12	11
Happy	10	11
Neutral	4	7
Sad	15	15
Surprise	8	10

To ensure consistency, threshold values were set for emotion recognition using different models. Specifically, a threshold of 0.9 was established for the SCNN-MM-IRV2, indicating that a similarity score of at least 90% was required to recognize emotion. For the SCNN-MM-VGG19 model, the threshold value was set at 0.5, meaning that a similarity score of 50% or higher was necessary for emotion recognition. These threshold values were implemented to maintain uniformity and establish clear criteria for determining when emotion could be identified by each model.

Correlation matrix analysis after conducting the test method 2

The correlation matrix was obtained based on the number of samples of images predicted correctly and incorrectly on imbalanced datasets. To normalize the matrix each cell was divided by the sum of the whole column. Thus, the diagonal matrix represents the accuracy of each class and when averaged gives the test accuracy of that particular dataset, and the off-diagonal matrix values tell where the model has confused with other emotions.

Author’s Kids

[See PDF for image]

Fig. 25

Correlation Matrix for (a) SCNN-Multi-model-IRV2 and (b) SCNN-Multi-model-VGG19 on the Author’s Kids dataset

The correlation matrix provides insight into how well the Siamese models perform on each emotion in the Author’s Kids dataset. The correlation matrix provided for this dataset is the summation of the 5 sets of the dataset tested separately. For the SCNN-MM-IRV2, the highest accuracy is observed for the emotions Neutral and Surprise, with a perfect accuracy of 1, which indicates that the model can correctly identify all instances of Neutral and Suprise emotions in the dataset. The model performs reasonably well on the other emotions, with accuracies ranging from 0.89 to 0.99 for other emotions. Similarly, the SCNN-MM-VGG19 model also performed quite well on all emotions ranging from 0.86 to 0.97. Overall, the results suggest that both the Multimodels perform well on the Author’s Kids dataset, with high accuracies for all emotions.

Figure 25 presents the correlation matrix for SCNN-Multi-model-IRV2 and SCNN-Multi-model-VGG19 on the author’s kids dataset. Subfigure (a) shows SCNN-Multi-model-IRV2, while subfigure (b) depicts SCNN-Multi-model-VGG19, illustrating the relationship between predicted and actual emotions. High diagonal values indicate strong accuracy, while off-diagonal values highlight misclassifications. This analysis showcases the effectiveness of Siamese networks in recognizing children’s emotions with improved precision.

Author’s Teen

[See PDF for image]

Fig. 26

Correlation Matrix for (a) SCNN-Multi-model-IRV2 and (b) SCNN-Multi-model-VGG19 on the Author’s Teen Dataset

For the SCNN-MM-IRV2, it has performed exceptionally well achieving high accuracy for all the emotions. However, for the SCNN-MM-VGG19 Model, the model struggled with the emotion Fear and Surprise confusing it with Surprise and Sad respectively. Overall, the SCNN-MM-IRV2 has performed well on average whereas the SCNN-MM-VGG19 Model has struggled with some emotions.

Figure 26 presents the correlation matrix for SCNN-Multi-model-IRV2 and SCNN-Multi-model-VGG19 on the author’s teen dataset. Subfigure (a) displays the correlation matrix for SCNN-Multi-model-IRV2, while subfigure (b) represents SCNN-Multi-model-VGG19. These matrices illustrate the alignment between predicted and actual emotions, with strong diagonal values indicating high accuracy and off-diagonal values highlighting misclassifications. This analysis demonstrates the models’ effectiveness in recognizing teenagers’ emotions, further validating the Siamese network-based approach for diverse age groups.

LIRIS-CSE

[See PDF for image]

Fig. 27

Correlation Matrix for (a) SCNN-Multi-model-IRV2 and (b) SCNN-Multi-model-VGG19 on the LIRIS-CSE Dataset

From the correlation matrix, it can be inferred that both the Multi-Models performed well in recognizing fear expressions, with accuracies of 0.94 and 0.98 respectively. Happy is also recognized with relatively high accuracy by both models, with accuracies ranging from 1 to 0.82. However, the Multi-Models perform quite poorly on the emotion of Sad and Surprise confusing it with the emotion Fear. We also see an instance where an image has failed to cross the threshold in the SCNN-MM-IRV2 model. Hence, no emotion was detected. Overall, again on average, the SCNN-MM-IRV2 has performed better than the SCNN-MM-VGG19 but both struggled with the emotion of Sad making it a challenging emotion to recognize.

Figure 27 presents the correlation matrix for SCNN-Multi-model-IRV2 and SCNN-Multi-model-VGG19 on the LIRIS-CSE dataset. Subfigure (a) illustrates the correlation matrix for SCNN-Multi-model-IRV2, while subfigure (b) represents SCNN-Multi-model-VGG19. These matrices depict the relationship between predicted and actual emotions, where high diagonal values indicate strong classification accuracy, and off-diagonal values highlight misclassifications. The results emphasize the effectiveness of Siamese network-based models in accurately recognizing emotions within the LIRIS-CSE dataset, reinforcing their applicability in diverse real-world scenarios.

Cohn-Kanade

[See PDF for image]

Fig. 28

Correlation Matrix (a) SCNN-Multi-model-IRV2 and (b) SCNN-Multi-model-VGG19 on the Cohn-Kanade Dataset

Starting with the correlation matrix in Fig. 28 (a), the model performed well on all emotions. The confusion between emotions is relatively low, with the highest confusion being between Anger and Happy (0.09) and between Sad and Anger (0.07). Moving on to Fig. 28 (b)’s correlation matrix, the model struggles on average compared to SCNN-MM-IRV2. It is worth noting that although the SCNN-MM-IRV2 seems to perform better overall compared to the SCNN-MM-VGG19 Model, it still failed to detect emotions for certain images. Overall, the correlation matrix provides a good insight into the performance of the models on each emotion, as well as the confusion between emotions. The SCNN-MM-IRV2 performed better than SCNN-MM-VGG19 as seen in previous datasets as well.

Figure 28 presents the correlation matrix for SCNN-Multi-model-IRV2 and SCNN-Multi-model-VGG19 on the Cohn-Kanade dataset. Subfigure (a) shows the correlation matrix for SCNN-Multi-model-IRV2, while subfigure (b) represents SCNN-Multi-model-VGG19. These matrices illustrate the agreement between predicted and actual emotions, with strong diagonal values indicating high classification accuracy and off-diagonal values highlighting areas of misclassification. The results validate the robustness of Siamese network-based models in recognizing facial emotions, further demonstrating their effectiveness on benchmark datasets.

JAFFE

[See PDF for image]

Fig. 29

Correlation Matrix for (a) SCNN-Multi-model-IRV2 and (b) SCNN-Multi-model-VGG19 on the JAFFE Dataset using common parameters

The correlation matrix shown in Fig. 29 represents the model performance on the JAFFE dataset by using the parameters set for the rest of the datasets. In both (a) and (b) the neutral emotion is quite dominant. There are also instances in both models struggling with other emotions. Thus, to resolve these issues and to prove the model’s flexibility the authors configured the model’s parameters accordingly.

[See PDF for image]

Fig. 30

Correlation Matrix for (a) SCNN-Multi-model-IRV2 and (b) SCNN-Multi-model-VGG19 on the JAFFE Dataset using specialized parameters

When comparing Figs. 29 and 30 and a significant improvement in the model’s performance is observed. For the SCNN-Multi-model-IRV2 the accuracy increased from 0.73 to 0.87 and for the SCNN-Multi-model-VGG19 it has increased from 0.71 to 0.75. For the SCNN-MM-IRV2 matrix, the model performs quite well except for the emotion Fear and Sad confusing it with Surprise and Disgust respectively. The SCNN-MM-VGG19 performs best on Fear with an accuracy of 0.90, better than the SCNN-MM-IRV2 model. Other than this the model has struggled with all the emotions on average. The results suggest that the SCNN-MM-IRV2 again performs better on average while the SCNN-MM-VGG19 model outperforms better on Fear. However, both models have difficulty distinguishing between certain emotions, most of all the emotion Sad.

SCNN-Multi-model Performance Analysis in Comparison with Simple Deep Learning Models with the Architecture of Pretrained Models

The authors trained VGG19 and Inception Resnet V2 models on the datasets used to compare and prove the performance increase and robustness of the SCNN-IRV2 and VGG19 models. The researchers utilized VGG19 and Inception Resnet V2 architectures in training a Convolutional Neural Network (CNN) for FER on each dataset. The data was loaded using the Image Data Generator class and divided into training, validation, and test sets, following an 8:1:1 ratio. The pre-trained model was initialized with weights from the ImageNet dataset, while the fully connected layers were excluded by setting ‘include_top’ to ‘False’. To accommodate the facial expression classification task, a new dense layer was introduced with output units corresponding to each facial expression, employing a softmax activation function. The model was then compiled with a categorical cross-entropy loss function and accuracy metric. Additionally, a Model Checkpoint callback was implemented to save the model with the highest validation accuracy observed during training. The detailed performance of these models is mentioned in Table 8 below.

Table 8. Performance of the simple deep learning models trained on the datasets for comparative analysis

Model	Inception Resnet V2					VGG19
Dataset	Training Accuracy	Training Loss	Validation Accuracy	Validation Loss	Test Accuracy	Training Accuracy	Training Loss	Validation Accuracy	Validation Loss	Test Accuracy
Author’s Kids	0.99	0.02	0.99	0.01	0.95	0.18	1.93	0.18	1.93	0.18
Author’s Teen	0.98	0.08	0.99	0.04	0.93	0.18	1.78	0.19	1.78	0.19
LIRIS	0.97	0.1	0.76	1.29	0.79	0.25	1.38	0.28	1.38	0.26
Cohn-Kanade	0.98	0.06	0.94	0.23	0.94	0.27	1.7	0.24	1.73	0.27
JAFFE	0.95	0.18	0.81	0.53	0.79	0.15	1.93	1.92	0.16	0.15

[See PDF for image]

Fig. 31

Comparison of Test Accuracies between SCNN-Multi-models and Simple Pre-trained Models

By examining the table presented in Fig. 31, it becomes evident that the SCNN models consistently outperformed the normal architectures. This indicates that the Siamese network approach is better suited for this task compared to the normal model. The percentage increase is displayed in Table 9.

Table 9. Percentage increase of the SCNN models from their baseline models

Dataset Name	Performance Increase (%)
Dataset Name	SCNN - IRV2	SCNN - VGG
Author’s Kids	1.05	422.22
Author’s Teen	1.07	331.58
LIRIS	6.33	211.54
Cohn-Kanade	0	207.41
JAFFE	10.13	400

Both models perform exceptionally well on datasets with clear and distinct features, such as the Author’s Kids, Teen, and Cohn-Kanade datasets, showing only minimal percentage improvements for SCNN-IRV2. For datasets with subtle and nuanced expressions, such as LIRIS and JAFFE, SCNN-IRV2 demonstrates a notable performance boost, showcasing its ability to handle challenging datasets effectively. A remarkable performance increase is observed, ranging from 204.41 to 422.22%, emphasizing the significant advantage SCNN provides when paired with computationally lightweight models like VGG. The SCNN approach distinctly enhances performance due to its pairwise framework and testing methodology, which allows for more precise evaluation and learning. The dramatic improvements, particularly with VGG, highlight the SCNN’s potential as a computationally efficient solution when it comes to CNN architectures with less complexity and trainable parameters, for improving performance in emotion recognition tasks. This underscores its utility in scenarios where lightweight models are preferred.

Statistical analysis of SCNN-Multi-models

SCNN-IRV2 - LIRIS Dataset
Class	Sensitivity	Specificity	Precision	Recall	NPV	FPV	FDR	FNR	Accuracy	F1 Score	MCC
Fear	0.94	0.86	0.69	0.94	0.97	0.14	0.31	0.06	0.88	0.8	0.73
Happy	1	0.99	0.98	1	1	0.01	0.02	0	1	0.99	0.99
Sad	0.76	0.95	0.84	0.76	0.92	0.05	0.15	0.24	0.91	0.8	0.74
Surprise	0.68	0.99	0.94	0.68	0.9	0.01	0.05	0.32	0.91	0.79	0.75
SCNN-VGG19 - LIRIS Dataset
Class	Sensitivity	Specificity	Precision	Recall	NPV	FPV	FDR	FNR	Accuracy	F1 Score	MCC
Fear	0.98	0.64	0.48	0.98	0.99	0.36	0.52	0.02	0.73	0.64	0.54
Happy	0.82	1	1	0.82	0.94	0	0	0.18	0.96	0.9	0.88
Sad	0.52	0.96	0.81	0.52	0.86	0.04	0.19	0.48	0.85	0.63	0.57
Surprise	0.42	0.98	0.88	0.42	0.84	0.02	0.13	0.58	0.84	0.57	0.53
SCNN-IRV2 - Cohn-Kanade Dataset
Class	Sensitivity	Specificity	Precision	Recall	NPV	FPV	FDR	FNR	Accuracy	F1 Score	MCC
Anger	0.96	0.97	0.83	0.96	0.99	0.03	0.17	0.04	0.96	0.89	0.87
Disgust	0.98	0.99	0.95	0.98	1	0.01	0.05	0.02	0.99	0.97	0.96
Fear	0.96	0.99	0.86	0.96	1	0.01	0.14	0.04	0.98	0.91	0.9
Happy	0.84	1	1	0.84	0.96	0	0	0.16	0.96	0.91	0.9
Sad	0.89	0.99	0.89	0.89	0.99	0.01	0.11	0.11	0.98	0.89	0.88
Surprise	0.98	1	0.99	0.98	0.99	0	0.01	0.02	0.99	0.98	0.98
SCNN-VGG19 - Cohn-Kanade Dataset
Class	Sensitivity	Specificity	Precision	Recall	NPV	FPV	FDR	FNR	Accuracy	F1 Score	MCC
Anger	0.8	0.7	0.31	0.8	0.95	0.3	0.69	0.2	0.71	0.45	0.36
Disgust	0.49	1	0.97	0.49	0.89	0	0.03	0.51	0.9	0.65	0.65
Fear	0.72	0.95	0.56	0.72	0.97	0.05	0.44	0.28	0.93	0.63	0.6
Happy	0.71	0.99	0.94	0.71	0.92	0.01	0.06	0.29	0.93	0.81	0.78
Sad	0.71	0.95	0.59	0.71	0.97	0.05	0.41	0.29	0.93	0.65	0.61
Surprise	0.54	1	1	0.54	0.86	0	0	0.46	0.88	0.7	0.68
SCNN-IRV2 - Author’s Kids Dataset
Class	Sensitivity	Specificity	Precision	Recall	NPV	FPV	FDR	FNR	Accuracy	F1 Score	MCC
Anger	0.96	1	1	0.96	0.99	0	0	0.04	0.99	0.98	0.97
Disgust	0.89	0.99	0.93	0.89	0.98	0.01	0.07	0.11	0.97	0.91	0.89
Fear	0.99	1	0.99	0.99	1	0	0.01	0.01	1	0.99	0.98
Happy	0.94	0.99	0.94	0.94	0.99	0.01	0.06	0.06	0.98	0.94	0.93
Neutral	1	0.99	0.93	1	1	0.01	0.07	0	0.99	0.97	0.96
Sad	0.93	1	0.98	0.93	0.99	0	0.02	0.07	0.99	0.96	0.95
Surprise	1	0.99	0.93	1	1	0.01	0.07	0	0.99	0.97	0.96
SCNN-VGG19 - Author’s Kids Dataset
Class	Sensitivity	Specificity	Precision	Recall	NPV	FPV	FDR	FNR	Accuracy	F1 Score	MCC
Anger	0.94	1	1	0.94	0.99	0	0	0.06	0.99	0.97	0.97
Disgust	0.97	0.96	0.79	0.97	1	0.04	0.21	0.03	0.96	0.87	0.85
Fear	0.97	0.99	0.93	0.97	1	0.01	0.07	0.03	0.99	0.95	0.94
Happy	0.86	0.99	0.92	0.86	0.98	0.01	0.08	0.14	0.97	0.89	0.87
Neutral	0.94	1	1	0.94	0.99	0	0	0.06	0.99	0.97	0.97
Sad	0.96	1	1	0.96	0.99	0	0	0.04	0.99	0.98	0.97
Surprise	0.93	1	0.97	0.93	0.99	0	0.03	0.07	0.99	0.95	0.94
SCNN-IRV2 - Author’s Teen Dataset
Class	Sensitivity	Specificity	Precision	Recall	NPV	FPV	FDR	FNR	Accuracy	F1 Score	MCC
Anger	0.88	0.99	0.95	0.88	0.98	0.01	0.05	0.13	0.97	0.91	0.89
Disgust	0.91	0.98	0.9	0.91	0.98	0.02	0.1	0.09	0.97	0.91	0.89
Fear	0.9	1	1	0.9	0.98	0	0	0.1	0.98	0.95	0.94
Happy	0.99	0.99	0.96	0.99	1	0.01	0.04	0.01	0.99	0.98	0.97
Sad	1	0.97	0.86	1	1	0.03	0.14	0	0.97	0.92	0.91
Surprise	0.94	0.99	0.96	0.94	0.99	0.01	0.04	0.06	0.98	0.95	0.94
SCNN-VGG19 -Author’s Teen Dataset
Class	Sensitivity	Specificity	Precision	Recall	NPV	FPV	FDR	FNR	Accuracy	F1 Score	MCC
Anger	0.86	0.99	0.92	0.86	0.97	0.01	0.08	0.14	0.97	0.89	0.87
Disgust	0.88	0.98	0.9	0.88	0.97	0.02	0.1	0.12	0.96	0.89	0.87
Fear	0.59	0.99	0.94	0.59	0.93	0.01	0.06	0.41	0.93	0.73	0.71
Happy	0.84	0.98	0.88	0.84	0.97	0.02	0.12	0.16	0.96	0.86	0.83
Sad	0.96	0.93	0.76	0.96	0.99	0.07	0.24	0.04	0.94	0.85	0.82
Surprise	0.77	0.92	0.64	0.77	0.95	0.08	0.36	0.23	0.89	0.7	0.64
SCNN-IRV2 - JAFFE Dataset
Class	Sensitivity	Specificity	Precision	Recall	NPV	FPV	FDR	FNR	Accuracy	F1 Score	MCC
Anger	0.97	0.98	0.88	0.97	0.99	0.02	0.13	0.03	0.98	0.92	0.91
Disgust	0.97	0.95	0.78	0.97	0.99	0.05	0.22	0.03	0.96	0.86	0.84
Fear	0.62	1	1	0.62	0.94	0	0	0.38	0.95	0.77	0.76
Happy	203	0.98	0.91	0.97	0.99	0.02	0.09	0.03	0.98	0.94	0.92
Neutral	201	0.98	0.87	1	1	0.02	0.13	0	0.98	0.93	0.92
Sad	201	0.99	0.95	0.66	0.94	0.01	0.05	0.34	0.95	0.78	0.76
Surprise	203	0.98	0.88	0.97	0.99	0.02	0.13	0.03	0.98	0.92	0.91
SCNN-VGG19 - JAFFE Dataset
Class	Sensitivity	Specificity	Precision	Recall	NPV	FPV	FDR	FNR	Accuracy	F1 Score	MCC
Anger	0.72	0.96	0.75	0.72	0.95	0.04	0.25	0.28	0.93	0.74	0.69
Disgust	0.76	0.94	0.69	0.76	0.96	0.06	0.31	0.24	0.92	0.72	0.67
Fear	0.9	0.94	0.72	0.9	0.98	0.06	0.28	0.1	0.94	0.8	0.77
Happy	0.83	0.96	0.78	0.83	0.97	0.04	0.22	0.17	0.94	0.81	0.77
Neutral	0.71	0.95	0.71	0.71	0.95	0.05	0.29	0.29	0.92	0.71	0.67
Sad	0.48	0.95	0.64	0.48	0.92	0.05	0.36	0.52	0.89	0.55	0.49
Surprise	0.83	0.99	0.96	0.83	0.97	0.01	0.04	0.17	0.97	0.89	0.88

The significant performance increase of the SCNN models, particularly the dramatic 207–422% improvement over the standard VGG19 CNN, can be directly attributed to the metric learning paradigm of the Siamese architecture. The standard CNNs struggled significantly, as evidenced by their high training and validation loss and poor test accuracy shown in Table 8, indicating an inability to effectively learn discriminative features from the relatively smaller and more challenging FER datasets using a pure classification objective. They likely overfitted to superficial patterns or biases in the data.

The Siamese framework circumvented this issue. By learning a similarity metric, the SCNN models became adept at comparing facial expressions based on their core emotional content, rather than relying on features that might be specific to a dataset or subject. This is especially evident in the results for the JAFFE and LIRIS datasets, which contain subtle and spontaneous expressions. The standard CNNs failed on these, but the SCNN models, particularly SCNN-IRV2 with its greater representational capacity, achieved high accuracy. The ability to fine-tune the decision threshold and the number of reference images post-hoc further allowed the multi-model system to balance the sensitivity and precision for each emotion, a flexibility absent in standard CNNs. This demonstrates that the Siamese architecture is not just a different classifier but a superior framework for learning the fundamental concept of emotional similarity, which is why it excels where standard CNNs fail.

LIRIS Dataset:

The models used for LIRIS dataset showed similar trends for each parameter with SCNN-IRV2 showing slightly higher performance than SCNN-VGG:

Happy: The class with good performance in all metrics represents the model’s optimal functioning. This suggests that the features for this class are highly distinct and well-represented in the dataset, allowing the model to confidently and accurately classify instances without significant trade-offs.
Sad and Surprise: The model is highly conservative in identifying this class, prioritizing minimizing false positives over correctly detecting true positives. This results in strong performance on negative cases but poor sensitivity for positive ones, which might indicate inherent difficulty in recognizing this class.
Fear: The model is biased toward predicting positives for this class, prioritizing high Recall at the expense of Precision. This suggests a potential imbalance in decision thresholds or features leading to a tendency to over-predict positives.

The model’s consistently good performance for Happy suggests that its features are distinct, universal, and well-represented in the dataset, making it easily identifiable across different ethnicities and age groups. In contrast, the model exhibits conservative behavior for Sad and Surprise, with high specificity but low sensitivity, likely due to subtle, context-dependent expressions that overlap with other emotions, such as Fear. For Fear, the model shows high recall but low precision, indicating that the features associated with this emotion overlap with Sad and Surprise as seen in Fig. 28, leading to over-prediction and highlighting ambiguities in the dataset’s emotional expressions.

Cohn Kanade Dataset

There was a direct and distinct observation for the performance of both SCNN-IRV2 and SCNN-VGG:

Performance Comparison: IRV2 outperforms VGG across all parameters, which can be attributed to the significant difference in trainable parameters and layer complexity. IRV2’s deeper architecture and advanced feature extraction capabilities enable it to handle the dataset more effectively.
Dataset Characteristics: The Cohn-Kanade dataset, featuring a mature audience aged 18–50 who posed for expressions, offers clear and distinct emotional cues, which simplifies classification for models like IRV2. The controlled nature of these expressions contributes to the model’s high performance.
Generalization and Ethnicity: Despite the dataset’s mixed ethnicity, IRV2 demonstrates strong generalization capabilities, effectively handling diverse facial features. This robustness allows it to accurately classify expressions across a variety of ethnicities, further improving its performance over VGG.

Author’s Kids

Both SCNN-IRV2 and SCNN-VGG performed exceptionally well:

Exceptional Performance for Both Models: The Author’s Kids dataset has demonstrated exceptional performance for both VGG and IRV2, indicating that the features of expressions in this dataset are well-defined and easily distinguishable. This is in contrast to the LIRIS dataset, where emotional expressions were more spontaneous and subtle. The clear, partially posed expressions in the Author’s Kids dataset allowed both models to accurately classify emotions, benefiting from the distinct nature of the features.
Ethnicity and Expression Clarity: Another key factor contributing to the outstanding performance is that the Author’s Kids dataset features expressions from children of the same ethnicity, making it easier for the models to generalize across instances. The partial posing of expressions further enhances the clarity, enabling the models to better capture the subtle variations in emotional cues, which leads to higher classification accuracy.
VGG’s Struggle with Disgust Class: While both VGG and IRV2 perform well overall, the VGG model’s performance on the Disgust class is the notable exception. It shows comparatively lower precision and a higher false discovery rate, indicating that the disgust expression in this dataset might share subtle similarities with other emotions, which VGG struggles to distinguish effectively. This could be due to VGG’s relatively simpler architecture and limited feature extraction capabilities in comparison to IRV2.

Author’s Teen

There was a direct and distinct observation for the performance of both SCNN-IRV2 and SCNN-VGG:

Exceptional Performance of IRV2: Despite the spontaneous nature of the emotions in the Author’s Teen Dataset, IRV2 has effectively captured the distinctions between different emotional expressions. This success can be attributed to IRV2’s advanced architecture and deep feature extraction capabilities, which allow it to handle the complexity and variability of spontaneous emotions, ensuring high performance across all metrics.
Poor Performance of VGG: In contrast, VGG struggles significantly with the Fear, Sad, and Surprise classes. For Fear, it shows low sensitivity and recall, along with a high false negative rate, indicating difficulty in identifying this emotion. For Sad and Surprise, VGG exhibits low precision and high false discovery rate, while Surprise in particular also has low recall and sensitivity, showcasing the model’s inability to effectively differentiate these emotions.
Correlation Insights and Dataset Characteristics: The correlation between Surprise and both Fear and Sad suggests that these emotions share similar facial cues, which likely contributes to the difficulty in distinguishing them, especially for VGG. Despite the spontaneous nature of the expressions, IRV2’s ability to effectively capture these subtle distinctions highlights its superiority in handling the complexities of the dataset compared to VGG.

JAFFE

SCNN-IRV2 performed well while SCNN-VGG struggled due its low trainable parameters:

Challenge of JAFFE Dataset: The JAFFE dataset’s black-and-white images of female adults with minimal emotional distinctions make it inherently difficult for models to achieve high accuracy. The lack of clear visual cues between emotions increases the complexity of emotion classification, which is reflected in the model performances.
IRV2’s Strong Performance but Struggles with Specific Emotions: While IRV2 outperforms other models on this challenging dataset, its struggles with Fear and Sad emotions—due to low Sensitivity, Recall, and high False Negative Rate—highlight the difficulty in distinguishing these emotions in the dataset. This is evident from the fact that there was correlation between Sad-Disgust and Fear-Surprise. These results suggest that even advanced models like IRV2 face challenges when emotional cues are subtle.
VGG’s Mixed Performance: VGG’s performance on the JAFFE dataset is mixed. While it performs well on metrics like False Positive Rate, Negative Predicted Value, Specificity, and Accuracy, its poor performance in Sensitivity, Precision, and Recall suggests that it struggles with detecting positive emotional expressions, which points to limitations in its ability to capture and differentiate subtle emotional cues effectively.

SOTA analysis

Table 10 shows the SOTA analysis.

Table 10. Sota analysis

Category	Type	Model	Dataset	Accuracy	Study
Machine Learning	SVM	LBP	MMI, CK+, JAFFE	73.3%, 97.3%, 86.7%	[31]
	KNN	SIFT	BU-2DFE, Multi-Pie, CK+	80.1%, 85.2%, 99.1%
	Random Forest	CNN	JAFFE, CK+, FER2013, RAF-DB	98.9%, 99.9%, 84.3%, 92.3%
Deep Learning	CNN		CK+, FER2013, MultiPie, MMI	98.9%, 75.1%,94.7%, 77.9%
	GANs	STAR-GAN	MMI, KDEF	98.3%, 95.97%	[32]
	LSTM	DCBiLSTM	CK+, Oulu-CASIA, MMI	99.6%, 91.07%, 80.71%	[33]
Advanced ML/DL	Multi-Task Learning	DMTL-CNN	CK+, OuluCASIA	99.5%, 89.6%	[34]
	Transfer Learning	TL-DCNN	KDEF, JAFFE	96.51%, 99.52%	[35]
	Meta-Learning	Siamese-CNN	Multi-Pie	84.87–88.47%	[16]
	Meta-Learning	Meta Transfer Learning + Pathnet	INTERFACE, SAVEE, EMO-DB	94%	[22]
Proposed Model	Meta + Deep Learning	SCNN-Single Emotion Model-IRV2	Author’s Kids and Teen Dataset, LIRIS-CSE, Cohn-Kanade, JAFFE	97.5%, 98.57%, 99.02%, 97.71%, 98.55%	-
		SCNN-Multi-model-IRV2		96%, 94%, 84%, 94%, 87%
		SCNN-Single Emotion Model-VGG19		99.08%, 99.54%, 99.6%, 97.73%, 94.75%
		SCNN-Multi-model-VGG19		94%, 82%, 81%, 83%, 75%

Conclusion

The research demonstrates that the single-emotion model approach, leveraging pairwise learning and employing it as a multimodal framework for pairwise testing, has significantly enhanced the performance of SCNN over its baseline models, InceptionResNetV2 and VGG19. Detailed analysis revealed trade-offs in datasets like LIRIS and JAFFE, which feature subtle emotional expressions. A clear pattern emerged, highlighting common challenges with emotions such as fear and surprise, especially fear in children. These insights pave the way for future research to include mixed emotions and train models to handle these complexities more effectively.

Despite these challenges, the SCNN model effectively captured emotional nuances, even with limited datasets, and avoided bias due to its flexible, fine-tunable parameters—an essential achievement in the sensitive domain of facial emotion recognition (FER). Additionally, the study underscores the potential of architectures with fewer trainable parameters, such as VGG19, to be effectively enhanced for lightweight, computationally efficient models—a critical requirement for software development and deployment in real-world applications.

The findings emphasize that datasets encompassing diverse age groups, ethnicities, data types, and spontaneous as well as posed expressions are fundamental for advancing FER. They also highlight the unique emotional characteristics tied to different demographics and environments, advocating for specialized models tailored to specific contexts rather than a universal solution. The focus should be on enhancing these specialized models and maintaining adaptability to tackle challenging cases, as demonstrated by the Siamese network framework.

As a future direction, we propose the application of this framework for post-analysis in classrooms or workplace settings, enabling teachers or managers to correlate emotions with ongoing activities.

In this approach, the model processes recorded sessions through video frames to evaluate the emotional responses of participants during the learning experience. The insights derived from this analysis can help educators understand engagement levels, emotional fluctuations, and student reactions to specific content or teaching methods. By integrating this technology with existing platforms, such as learning management systems (LMS), the system can provide educators with actionable feedback and reports after each session, allowing them to make informed adjustments for future lessons or sessions. This method ensures minimal disruption to the learning process while still providing valuable insights into student emotional engagement.

With our demonstrated high accuracies, this could lead to tailored enhancements for group-level topics or individualized instructions. Furthermore, we recommend the generation of more datasets eliciting spontaneous emotions, as these provide critical training data to distinguish subtle emotional nuances.

By addressing these avenues, the study lays the groundwork for building robust, flexible, and inclusive FER systems with the potential to revolutionize emotion analysis in diverse real-world scenarios.

Author contributions

Conceptualization, T.R., and S.P.; methodology, T.R., and S.P.; software, T.R. and A.S.; validation, T.R., S.P., and A.S.; data acquisition and curation, T.R., formal analysis, T.R. and P.K.; investigation, T.R.; resources, T.R.; writing—original draft preparation, T.R.; writing—review and editing, S.P., A.S, P.K. and A.K.; visualization, T.R.; project administration, S.P. and A.K; All authors have read and agreed to the published version of the manuscript.

Funding

Open access funding provided by Symbiosis International (Deemed University).

Data availability

The dataset(s) supporting the conclusions of this article is(are) included within the article (and its additional file(s)) For Code Files Used and Statistical Analysis done: Project name: EmoRec-SCNN Project home page: [https://github.com/TejasARathod/EmoRec-SCNN] Archived version: [https://doi.org/10.5281/zenodo.8103565] **Data Availability Statement:** The availability of data and materials used in the study varies, with some datasets publicly accessible through specific repositories and others requiring reasonable requests to the corresponding authors due to licensing restrictions.- The Author’s Teen dataset generated and/or analysed during the current study are available in the Zenodo repository, [https://doi.org/10.5281/zenodo.8081621] - The Cohn-Kanade dataset generated and/or analysed during the current study are available on Kaggle repository, [https://www.kaggle.com/datasets/shawon10/ckplus]- The Author’s Kids dataset used and/or analysed during the current study are available from the corresponding author on reasonable request.- The data that support the findings of this study are available from LIRIS Children Spontaneous Facial Expression Video Database but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of [email protected] - The data that support the findings of this study are available from The Japanese Female Facial Expression (JAFFE) Dataset but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of JAFFE’s Zenodo Repository manager: [https://doi.org/10.5281/zenodo.3451524

Declarations

Ethics approval and consent to participate

During the curation of datasets involving human participants, ethical guidelines were followed, and informed consent was obtained.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Abbreviations

VGG

Visual Geometry Group

FER

Facial Emotion Recognition

SVM

Support Vector Machine

LSTM

Long short-term memory

FACS

Facial Action Coding System

Cohn-Kanade

JAFFE

Japanese Female Facial Expression

SCNN

Siamese Convolutional Neural Network

IRV2

Inception Resnet V2

ESRGAN

Enhanced Super-Resolution Generative Adversarial Networks

GANs

Generative adversarial networks

SCNN-SEM-IRV2

Siamese Convolutional Neural Network-Single Emotion Model- Inception Resnet V2

SCNN-SEM-VGG19

Siamese Convolutional Neural Network-Single Emotion Model- Visual Geometry Group 19

SCNN-MM-IRV2

Siamese Convolutional Neural Network-Multi Model-Inception Resnet V2

SCNN-MM-VGG19

Siamese Convolutional Neural Network-Multi Model-Visual Geometry Group 19

NPV

Negative Predicted Value

FPR

False Positive Rate

FDR

False Discovery Rate

FNR

False Negative Rate

MCC

Matthews Correlation Coefficient

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Ekman P. Should we call it expression or communication? Innovation (Abingdon) [Internet]. 1997;10(4):333–44. Available from: https://doi.org/10.1080/13511610.1997.9968538

2. Frith C. Role of facial expressions in social interactions. Philos Trans R Soc Lond B Biol Sci [Internet]. 2009;364(1535):3453–8. Available from: https://doi.org/10.1098/rstb.2009.0142

3. Canal FZ, Müller TR, Matias JC, Scotton GG, de Sa Junior AR, Pozzebon E et al. A survey on facial emotion recognition techniques: A state-of-the-art literature review. Inf Sci (Ny) [Internet]. 2022;582:593–617. Available from: https://doi.org/10.1016/j.ins.2021.10.005

4. Dalvi C, Rathod M, Patil S, Gite S, Kotecha K. A survey of AI-based facial emotion recognition: Features, ML & DL techniques, age-wise datasets and future directions. IEEE Access [Internet]. 2021;9:165806–40. Available from: https://doi.org/10.1109/access.2021.3131733

5. Zhang J, et al. Emotion recognition using multi-modal data and machine learning techniques: a tutorial. Rev Inform Fusion. 2020;59:103 -26.

6. Pal S, Mukhopadhyay S, Suryadevara N. Development and progress in sensors and technologies for human emotion recognition. Sensors (Basel) [Internet]. 2021;21(16):5554. Available from: https://doi.org/10.3390/s21165554

7. Wang S. Online learning behavior analysis based on image emotion recognition. Trait Du Signal [Internet]. 2021;38(3):865–73. Available from: https://doi.org/10.18280/ts.380333

8. Ha, NT WORKPLACE ISOLATION IN THE GROWTH TREND OF REMOTE WORKING. ALiterature Rev Rev Economic Bus Stud; 2021; 27, pp. 97-113.

9. Bissinger B, Märtin C, Fellmann M. Support of virtual human interactions based on facial emotion recognition software. In: HCI 2022, Held as Part of the 24th HCI International Conference, HCII 2022, Virtual Event. Cham: Springer International Publishing; 2022.

10. Geetha, AV et al. Multimodal emotion recognition with deep learning: advancements, challenges, and future directions. Inform Fusion; 2024; 105, 102218. [DOI: https://dx.doi.org/10.1016/j.inffus.2023.102218]

11. Bennequin E. Meta-learning algorithms for few-shot computer vision. arXiv preprint. 2019. arXiv:1909.13579.

12. Chicco, D. Cartwright, H. Siamese neural networks: an overview. Artificial neural networks. Methods in molecular biology; 2021; New York, NY, Humana: [DOI: https://dx.doi.org/10.1007/978-1-0716-0826-5_3] 2190

13. Du, J et al. Advancements in image recognition: A Siamese network approach. Inform Dynamics Appl; 2024; 3, 2 pp. 89-103. [DOI: https://dx.doi.org/10.56578/ida030202]

14. Koch G, Zemel R, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. ICML deep learning workshop. 2015;2(1)

15. Hayale, W; Negi, PS; Mohammad, H. Mahoor. Deep Siamese neural networks for facial expression recognition in the wild. IEEE Trans Affect Comput; 2021; 14, 2 pp. 1148-58. [DOI: https://dx.doi.org/10.1109/TAFFC.2021.3077248]

16. Mattioli, M; Cabitza, F. Not in my face: challenges and ethical considerations in automatic face emotion recognition technology. Mach Learn Knowl Extr; 2024; 6, pp. 2201-31. [DOI: https://dx.doi.org/10.3390/make6040109]

17. Sham, AH; Aktas, K; Rizhinashvili, D et al. Ethical AI in facial expression analysis: Racial bias. SIViP; 2023; 17, pp. 399-406. [DOI: https://dx.doi.org/10.1007/s11760-022-02246-8]

18. Bello H et al. InMyFace: inertial and mechanomyography-based sensor fusion for wearable facial activity recognition. Inform Fusion 2023: 101886.

19. Ma, Y et al. Audio-visual emotion fusion (AVEF): A deep efficient weighted approach. Inform Fusion; 2019; 46, pp. 184-92. [DOI: https://dx.doi.org/10.1016/j.inffus.2018.06.003]

20. Zhou, Xu, et al. Efficient lower layers parameter decoupling personalized federated learning method of facial expression recognition for home care robots. Inform Fusion. 2024: 102261.

21. Hayale W, Negi P, Mahoor M. Facial expression recognition using deep Siamese neural networks with a supervised loss function. In: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019). IEEE; 2019.

22. Maddula, NVSS; Nair, LR; Addepalli, H; Palaniswamy, S. Emotion recognition from facial expressions using Siamese network. Communications in computer and information science; 2021; Singapore, Springer Singapore: pp. 63-72.

23. Baddar, WJ; Kim, DH; Ro, YM. Learning features robust to image variations with Siamese networks for facial expression recognition. MultiMedia modeling; 2017; Cham, Springer International Publishing: pp. 189-200. [DOI: https://dx.doi.org/10.1007/978-3-319-51811-4_16]

24. Ramakrishnan S, Upadhyay N, Das P, Achar R, Palaniswamy S, Nippun Kumaar AA. Emotion Recognition from Facial Expressions using Images with Arbitrary Poses using Siamese Network. In: 2021 2nd International Conference on Smart Electronics and Communication (ICOSEC). IEEE; 2021.

25. Lawrance D, Palaniswamy S. Emotion recognition from facial expressions for 3D videos using siamese network. In: 2021 International Conference on Communication, Control and Information Sciences (ICCISc). IEEE; 2021.

26. Kuruvayil S, Palaniswamy S. Emotion recognition from facial images with simultaneous occlusion, pose and illumination variations using meta-learning. J King Saud Univ - Comput Inform Sci. 2021.

27. Selitskiy S, Christou N, Selitskaya N. Isolating uncertainty of the face expression recognition with the meta-learning supervisor neural network. In: 2021 5th International Conference on Artificial Intelligence and Virtual Reality (AIVR). New York, NY, USA: ACM; 2021.

28. Gong W, Zhang Y, Wang W, Cheng P, Gonzàlez J. Meta-MMFNet: Meta-learning based multi-model fusion network for micro-expression recognition. ACM Trans Multimed Comput Commun Appl [Internet]. 2022; Available from: https://doi.org/10.1145/3539576

29. Nguyen D, Nguyen DT, Sridharan S, Denman S, Nguyen TT, Dean D et al. Meta-transfer learning for emotion recognition. Neural Comput Appl [Internet]. 2023;35(14):10535–49. Available from: https://doi.org/10.1007/s00521-023-08248-y

30. Divya, M. Smart teaching using human facial emotion recognition (Fer) model. Turkish J Comput Math Educ (TURCOMAT); 2021; 12, pp. 6925-32.

31. Ma C. A deep learning approach for online learning emotion recognition. In: 13th International Conference on Computer Science & Education (ICCSE). IEEE; 2018.

32. Rößler J, Sun J, Gloor P. Reducing videoconferencing fatigue through facial emotion recognition. Future Internet. 2021;13.

33. Zhu, Z; Chen, H; Wang, R; Yu, X; Liu, Z; Cambria, E. RMER-DT: robust multimodal emotion recognition in conversational contexts based on diffusion and Transformers. Inform Fusion; 2025; 123, 103268. [DOI: https://dx.doi.org/10.1016/j.inffus.2025.103268]

34. Zhu, X; Liu, Z; Cambria, E; Yu, X; Fan, X; Chen, H; Wang, R. A client–server based recognition system: Non-contact single/multiple emotional and behavioral state assessment methods. Comput Methods Programs Biomed; 2025; 260, 108564. [DOI: https://dx.doi.org/10.1016/j.cmpb.2024.108564]

35. Zhang Y, Wang X, Wen J, et al. WiFi-based non-contact human presence detection technology. Sci Rep. 2024;14(3605). https://doi.org/10.1038/s41598-024-54077-x.

36. Khan RA, Crenn A, Meyer A, Bouakaz S. A novel database of children’s spontaneous facial expressions (LIRIS-CSE). Image Vis Comput [Internet]. 2019;83–84:61–9. Available from: https://doi.org/10.1016/j.imavis.2019.02.004

37. Ekundayo OS, Viriri S. Facial expression recognition: A review of trends and techniques. IEEE Access [Internet]. 2021;9:136944–73. Available from: https://doi.org/10.1109/access.2021.3113464

38. Wang X, Gong J, Hu M, Gu Y, Ren F. LAUN improved StarGAN for facial emotion recognition. IEEE Access [Internet]. 2020;8:161509–18. Available from: https://doi.org/10.1109/access.2020.3021531

39. Kanade T, Cohn JF, Tian Y. Comprehensive database for facial expression analysis. In: Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat No PR00580). IEEE Comput. Soc; 2002.

40. Mellouk W, Handouzi W. Facial emotion recognition using deep learning: review and insights. Procedia Comput Sci [Internet]. 2020;175:689–94. Available from: https://doi.org/10.1016/j.procs.2020.07.101

41. Ming Z, Xia J, Luqman MM, Burie JC, Zhao K. Dynamic multi-task learning for face recognition with facial expression. 2019.

42. Lyons M, Kamachi M, Gyoba J. The Japanese female facial expression (JAFFE) dataset. Zenodo; 1998.

43. Lyons MJ, Kamachi M, Gyoba J. Coding facial expressions with Gabor wavelets (IVC special issue) [Internet]. arXiv [cs.CV]. 2020 [cited 2023 Jul 1]. Available from: http://arxiv.org/abs/2009.05938

44. Lyons MJ, Excavating. AI re-excavated: Debunking a fallacious account of the JAFFE dataset [Internet]. arXiv [cs.CY]. 2021 [cited 2023 Jul 1]. Available from: http://arxiv.org/abs/2107.13998

45. Akhand, MAH; Roy, S; Siddique, N; Kamal, MAS; Shimamura, T. Facial emotion recognition using transfer learning in the deep CNN. Electronics; 2021; 10, 9 1036. [DOI: https://dx.doi.org/10.3390/electronics10091036]

Word count: 14522

Show less

© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Facial emotion recognition using deep Siamese neural networks: multi-classifier fusion for single-emotion and multi-emotion models across age groups

Content area

Abstract

Full text