Abstract
In the field of facial expression recognition (FER), two main trends exist: data-driven FER and feature-driven FER. The former focuses on data problems (e.g., sample imbalance and multimodal fusion), while the latter explores facial expression features. Because feature-driven FER plays the more essential role, we propose an expression recognition model based on Local–Global information Reasoning and Landmark Spatial Distributions (LGR–LSD) to mine facial features more deeply. In particular, to reason about local–global information, a Res18-LG module is designed with multiple attention mechanisms and a modified residual module. In addition, taking the spatial topology of facial landmarks into account, a topological relationship graph of landmarks and a two-layer graph neural network are introduced to extract spatial distribution features. Finally, experimental results on the FERPlus and RAF-DB datasets demonstrate that our model outperforms state-of-the-art methods.
Introduction
Facial expression recognition (FER) is an important research topic in computer vision, with many applications such as human–computer interaction [1], scene understanding [2], fatigue driving detection [3] and many others [4]. In recent years, this field has mainly followed two trends, i.e., data-driven FER and feature-driven FER [3, 5]. The former utilizes more modalities of data and alleviates some limitations of datasets, while the latter exploits facial features more comprehensively.
Fig. 1 [Images not available. See PDF.]
The illustration of local information and global information. In (a), excessive attention to local information about the mouth may lead to a false prediction of the expression, whereas the global information of the whole face indicates a negative expression. In (b), focusing only on the global information gives the impression of a sad adult, but the corners of his mouth reveal that he is actually snickering
Fig. 2 [Images not available. See PDF.]
Comparison between RGB pictures and the topological graph. In (a), most CNN-based methods extract spatial distribution features from whole RGB images (the triangular image in the upper left corner and the other dot images). In (b), spatial distribution features of landmark coordinate points and edges in the topological graph are extracted with GAT
Deep learning has been widely applied as a data-driven methodology for facial expression recognition. A large number of approaches have been proposed to improve data-driven FER [6], addressing multimodal fusion [7, 8], ambiguous annotation [9, 10], imbalanced class distribution [11, 12], and intra-class variations with inter-class similarities [3, 6, 13, 14]; these efforts led to better characterization of facial expressions and better model performance. Facial features [15], however, offer an essential advantage in understanding facial expressions because they directly reflect expression characteristics, and much attention has therefore been paid to feature-driven FER research. Feature-driven FER methods can be grouped into three categories: facial feature enhancement, local or global facial feature reasoning, and spatial distribution feature extraction of facial landmarks.
Facial feature enhancement For facial feature enhancement, attention mechanisms are usually adopted to automatically learn the weights of key features and obtain discriminative features [16–24]. Additionally, some new loss functions [14, 25–27] are specially designed to achieve this goal by reducing inter-class similarity, increasing intra-class similarity, or other means.
Local or global facial feature reasoning Most approaches [6, 7] merely focus on local or global features, causing a bias in identification. As illustrated in Fig. 1a, the child’s mouth seems to reflect a smile, despite the negative expression on the face. Similarly, the adult shown in Fig. 1b looks pained or sad, but the corners of his mouth reveal that he is actually tittering. Besides, some methods [21, 28–30] take the fusion of local and global features into account via two channels, but such separate extraction ignores the correlations between local and global features.
Spatial distribution feature extraction of facial landmarks Existing deep learning-based methods [31–35] extract landmark spatial distribution features from RGB images, such as triangular images or dot images shaped by facial landmarks, as shown in Fig. 2a. It is hard to directly represent the inherent topological relationships in such RGB images (as compared with the topological graphs shown in Fig. 2b). A non-Euclidean topology can flexibly describe the adjacency, association and inclusion relations among the essential spatial target points, lines and planes, which contain rich spatial distribution features. Therefore, constructing a topological graph with facial landmarks is beneficial for extracting spatial distribution features.
In summary, the main contributions of our LGR–LSD model are:
To exploit more discriminative local–global features and their fusion, we implement a local–global information reasoning module, in which the global feature remains closely correlated with the local feature.
We designed an augmentation loss function to improve the feature fusion via deep metric learning (DML) [36], leading to intra-class compactness and inter-class separation.
We used the landmark’s original coordinates to construct a topological graph to deduce the spatial distribution features better.
Extensive experiments are carried out on the Real-world Affective Faces Database (RAF-DB) [37] and the Facial Expression Recognition 2013 Plus (FERPlus) [38] datasets. Experimental results show that our proposed model can exploit local–global information and landmark spatial distribution features well and outperform state-of-the-art methods.
Related work
Here, we first give an overall view of FER, including data-driven FER methods and feature-driven FER methods. Meanwhile, since GAT is used in our model, we also offer a brief introduction to spatial distribution feature extraction with graph neural networks.
Facial expression recognition
As many reviews for FER have shown, data-driven FER and feature-driven FER are the two main trends in FER tasks [1, 6]. We aim to develop the feature-driven FER for more comprehensive local–global features of facial expressions and spatial distribution features of facial landmarks; thus, we introduce the former briefly and illustrate the latter in detail.
Data-driven FER methods
As shown in Sect. 1, data-driven FER can be primarily divided into two branches: one utilizing more multimodal data and the other focusing on dataset problems. For the former, multimodal analysis has gained increasing popularity because of the higher accessibility and availability of data and the fact that multimodal data can make up for the limitations of single-modal data. Early studies [39] have shown that verbal and physical signals, brain waves, heart rate and physiological hormone levels can reflect human emotions. Later studies prove that body posture [40, 41], skeletal features [42], and scene information [43] can contribute to facial expression recognition as well. On this basis, much progress has been made in fusing various modalities for FER [8, 39–45]. For the latter, dataset problems [2, 3, 6] in FER mainly involve ambiguous annotation, imbalanced class distribution, and intra-class variations with inter-class similarities. To address ambiguous annotation, recent works suppress fuzzy samples during the training process [9]. She et al. [11] mine and describe the latent distribution in the label space along with the ambiguity extent, alleviating the imbalanced class distribution problem. Several studies have addressed the last problem, and the main trend is to propose new losses in the corresponding networks, such as [14].
Feature-driven FER methods
The feature-driven idea is more essential to the FER task, prompting many researchers to enhance facial features or to better utilize local or global facial features. In addition, how to mine landmark spatial distribution features remains a critical problem in feature-driven FER.
To capture more facial features, attention mechanisms and loss functions are heavily used. Li et al. [16] and Yao et al. [18] leverage a general attention mechanism and a hybrid spatial-channel joint attention mechanism, respectively, to extract more useful features. Wen et al. [26] employ multiple mechanisms and design two loss functions to assist their feature clustering strategy. Gong et al. [27] combine a focal smoothing loss and an aggregation-separation loss to enhance the extracted features. Farzaneh et al. [14] and Fard [27] devise a deep attentive center loss and an adaptive correlation-based loss, respectively; these two losses improve the discriminative power of the learned embedded features through deep metric learning. Li et al. [25] define a new heuristic objective function which can guide deep neural networks to learn better features.
In addition, to better fuse local and global facial information, early works [19, 20] combine them by simple concatenation or weighted fusion. Then, with the advance of the Transformer [46], researchers devised more efficient networks to integrate local and global features. Huang et al. [47] and Ma et al. [30] adopt the Transformer module to improve the global features. Pecoraro et al. [48] formulate a novel self-attention module that addresses the local limitation of convolution better than global attention does. With the help of ViT [23], Xue et al. [22] and Zheng et al. [31] directly obtain a feature which contains both local and global information. Kim et al. [21], Ma et al. [30], Liu et al. [49], Xue et al. [50] and Liang et al. [29] extract local and global features separately through dual channels. However, these methods have one or more of the following disadvantages: enhancing only one kind of feature, directly acquiring a final feature without real fusion, or fusing through dual channels or simple concatenation, which loses part of the correlation between local and global features.
Furthermore, most of the existing deep learning-based methods [32–35, 51] extract features from RGB images converted from the original landmark coordinates, such as constructing triangle pictures or dot pictures to capture the landmark spatial distribution features of faces. Wang et al. [32] use landmarks to construct a heatmap, Ayeche et al. [33] use landmarks to construct a triangular image, and Hasani et al. [34] use landmarks to construct a mask-like importance weight map. Similar to [34], Wang et al. [51] construct a weighted mask by combining landmarks and the correlation coefficient. However, these methods fail to consider the spatial topological relationship among the original landmark coordinates.
The work above on local or global features fails to consider both kinds of features comprehensively or to truly integrate them, and the work on spatially distributed features ignores the inherent spatial topological relationship between landmarks. Hence, in this paper, a novel feature-driven FER technique is developed to explore and fuse more discriminative local–global facial features. Besides, a topological graph is introduced to encode the spatial information in the facial landmarks, alleviating the limitations of triangular images or dot images.
Fig. 3 [Images not available. See PDF.]
The overall framework of LGR–LSD. In the first component, a Res18-LG module, built on the ResNet-18 network and modified with multiple attention mechanisms, is formulated for local reasoning, global reasoning and further local–global fusion. Then an augmentation module with a designed augmentation loss is used to enhance the fused features. In the other component, corresponding topological graphs are constructed first, and the original landmark coordinates are directly input to a two-layer GAT to extract the space-distribution feature
Table 1. The illustration of ResNet-18 and Res18-LG
Layer order | ResNet-18 | Res18-LG |
|---|---|---|
Layer 1 | [3 × 3 Conv, 64] × 2 | [3 × 3 Conv, 64]_res × 2 |
Layer 2 | [3 × 3 Conv, 128] × 2 | [3 × 3 Conv, 128]_res × 2 |
Layer 3 | [3 × 3 Conv, 256] × 2 | [3 × 3 Conv, 256]_res × 2 |
Layer 4 | [3 × 3 Conv, 512] × 2 | [T_M] × 2 |
The symbol “[]” represents the residual structure, and the subscript “_res” represents the addition of an attention mechanism in the form of a residual structure. T_M represents the Transformer part, including the MHSA. In T_M, the 1 × 1 convolution layer is replaced with a 3 × 3 convolution layer
Fig. 4 [Images not available. See PDF.]
The diagram of Res18-LG. The residual structure in Layer 4 has the function of local–global fusion. In this way, feeding local features directly into Layer 4 for global information perception allows the obtained global feature to be closely correlated with the local feature
Fig. 5 [Images not available. See PDF.]
The demonstration of the LG augmentation module
Spatial distribution feature extraction with graph neural networks
As mentioned in Sect. 1, the topological graph has rich spatial distribution features. Hence, selecting a suitable tool to manipulate the non-Euclidean data is essential. Scarselli et al. [52] have shown that a graph neural network (GNN) is better suited to handling such a data structure. Therefore, this paper introduces a GNN to extract the spatial distribution features contained in the facial landmarks.
To process graph-structured data, researchers have attempted to extend neural networks, giving rise to GNNs. Further, to apply the powerful learning ability of the convolution operation to GNNs, Kipf and Welling [53] carefully designed the graph convolution network (GCN), which incorporates convolution operations to extract spatial features of graphs for learning. However, GCN cannot assign different weights when reasoning about the interactions among nodes [54]. To address this problem, Velickovic et al. [54] introduce the attention mechanism and propose GAT to perform node classification on graph-structured data. Furthermore, Brody et al. [55] identified that GAT does not compute dynamic attention and thus proposed GATv2 by modifying the order of the linear transformation and the nonlinearity in the attention computation, introducing the ability to compute dynamic attention. This makes GATv2 stronger than GAT and one of the best-performing GNNs available, which is why we choose GATv2.
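For clarity, the difference between the two scoring functions can be written in a few lines. The sketch below is a toy comparison with arbitrary dimensions, following the published formulations rather than any particular library implementation:

```python
import torch
import torch.nn.functional as F

# Toy comparison of GAT vs. GATv2 attention scoring for one node pair
# (dimensions are arbitrary; this follows the published formulations).
d_in, d_out = 2, 8
h_i, h_j = torch.randn(d_in), torch.randn(d_in)

# GAT: linearly transform each node first, apply LeakyReLU only after the
# dot product with the attention vector -> "static" attention ranking.
W_gat = torch.randn(d_out, d_in)
a_gat = torch.randn(2 * d_out)
e_gat = F.leaky_relu(a_gat @ torch.cat([W_gat @ h_i, W_gat @ h_j]))

# GATv2: move the nonlinearity between the shared linear transformation and
# the attention vector -> "dynamic" attention that can rank neighbors per query.
W_v2 = torch.randn(d_out, 2 * d_in)
a_v2 = torch.randn(d_out)
e_v2 = a_v2 @ F.leaky_relu(W_v2 @ torch.cat([h_i, h_j]))

print(e_gat.item(), e_v2.item())
```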
Fig. 6 [Images not available. See PDF.]
The demonstration of how to construct the topological relationship graph. The dots represent landmark coordinate points, the red edges represent information flow from point to point, and the green edges represent information flow from group to group
Methods
In this section, we introduce our LGR–LSD model in detail, covering the overall framework, the local–global reasoning component and the landmark space-distribution reasoning component, followed by the fusion, classification, and optimization steps.
Overall framework
As shown in Fig. 3, our LGR–LSD model can be divided into two components: Local–Global reasoning and Landmark Space-Distribution reasoning.
In the local–global reasoning (LGR) component, to eliminate the limitations brought by focusing only on local details, a Res18-LG module is modified to carry out local reasoning, global reasoning and local–global fusion. Then, a local–global augmentation module is used to enhance the fused feature through the modified augmentation loss (AL), which utilizes Deep Metric Learning to make the fused feature achieve intra-class compactness and inter-class separation. In the landmark space-distribution reasoning (LSD) component, the original landmark coordinate data are used to construct a topological relation graph to take full advantage of the spatial topological relationship. Then, GATv2 is introduced to deduce the spatial distribution feature. At last, the outputs of the two components are concatenated and fed into a fully connected layer for facial expression recognition.
Local–global reasoning
As shown in Fig. 3, this component is separated into two modules: the Res18-LG module (Table 1, Fig. 4) and the LG augmentation module (Fig. 5).
Firstly, the Res18-LG module is designed to explore and fuse more discriminative local–global facial features. Table 1 lists the details of each layer of ResNet-18 and Res18-LG. In ResNet-18, each layer contains multiple convolution kernels, which extract different local features as multi-channel feature maps. Spatial and channel attention mechanisms are embedded in the first three layers to further reason over these local features for enhanced representations. On this basis, we utilize the global information extraction ability of the MHSA in the Transformer [46] to integrate the reasoned local features into global ones. It is worth noting that we replace the 1 × 1 convolution layer in the Transformer block with a 3 × 3 convolution layer for a larger receptive field. Finally, a residual connection fuses the local and global features.
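To make the structure concrete, the following is a minimal PyTorch sketch of the Res18-LG idea. The channel/spatial attention modules, channel sizes and head count below are generic stand-ins (not the paper's exact configuration): attention is added in residual form to the local feature maps, which are then fed to an MHSA block whose residual connection performs the local–global fusion.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Generic squeeze-and-excitation-style channel attention, added residually."""
    def __init__(self, c, r=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(),
                                nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze spatial dims -> (B, C)
        return x + x * w[:, :, None, None]       # residual-style re-weighting

class SpatialAttention(nn.Module):
    """Generic spatial attention over mean/max channel maps, added residually."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x + x * torch.sigmoid(self.conv(s))

class MHSABlock(nn.Module):
    """Stand-in for the T_M block: 3x3 conv projection + multi-head self-attention."""
    def __init__(self, c, heads=4):
        super().__init__()
        self.proj = nn.Conv2d(c, c, kernel_size=3, padding=1)   # 3x3 instead of 1x1
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        t = self.proj(x).flatten(2).transpose(1, 2)             # (B, H*W, C) tokens
        g, _ = self.attn(t, t, t)                               # global reasoning
        g = g.transpose(1, 2).reshape(b, c, h, w)
        return x + g                                            # local-global fusion

if __name__ == "__main__":
    feat = torch.randn(2, 512, 7, 7)             # assumed feature map entering Layer 4
    fused = MHSABlock(512)(SpatialAttention()(ChannelAttention(512)(feat)))
    print(fused.shape)                           # torch.Size([2, 512, 7, 7])
```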
After local–global fusion, the fused feature is fed to the LG augmentation module. Here, the Deep Attentive Center Loss (DACL) [14] method with our modified augmentation loss enhances the fused feature. A detailed description of the modified loss is provided in Sect. 3.4. Besides, the fused feature is further fed to a De-albino sub-module [66] to compensate for the zero-padding side effect of the convolution operation.
Fig. 7 [Images not available. See PDF.]
The pipeline of the Landmark Space-Distribution reasoning part. A node represents a feature of a point, and the dashed line indicates the initial feature, which is directly replaced by the two-dimensional coordinates of the point. Nodes of the same color indicate that they are in the same set
Landmark space-distribution reasoning
To reason the spatial distribution features of facial landmarks, we construct a topological graph based on the landmark coordinates and introduce GATv2 to manipulate the graph.
We first follow Wang et al. [56] and utilize the Multi-Task Convolutional Neural Network (MTCNN) [57] to obtain the coordinates of the facial landmarks. As shown in Fig. 6a, the detector provides 68 key points, which shape the facial contour, eyes, eyebrows, nose, and mouth. Because the facial contour only determines the facial shape (e.g., round face or pointed face), it is less correlated with the facial expression. Hence, we use the 57 key points excluding the facial contour to construct a graph in this model (Fig. 6b). Specifically, they are divided into six sets according to facial organs. Within the same set, the points are connected sequentially to construct feature flow relationships (Fig. 6c). To form associations among the six sets, some points are further connected to construct the topological graph (Fig. 6d). It can be formulated as follows:
$G = (V, A)$,  (1)

which is composed of the node set $V$ and the weighted adjacency matrix $A$. Each node is represented by its coordinates, normalized by the width and height of the facial image. $A(i, j)$ represents the weight of the edge between node $i$ and node $j$; it is initialized as 1 if node $i$ is connected to node $j$, and 0 otherwise.

After constructing the above graph, a two-layer GATv2 is introduced to reason about its inner spatial distribution features. Generally, the input of a one-layer GATv2 can be initialized as
$H = \{ h_1, h_2, \ldots, h_N \}$,  (2)

where $h_i$ denotes the feature of the $i$-th node and $N$ denotes the number of nodes, which is 57 in this model. Consequently, the propagation and update are formulated as

$h_i' = \sigma \left( \sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j \right)$,  (3)

where $W$ is the linear transformation matrix for the node update, $\mathcal{N}_i$ is the set of neighbor nodes of node $i$ for the message propagation, and $\alpha_{ij}$ is the weight between node $i$ and node $j$. The weight is calculated by the attention mechanism as follows:

$e_{ij} = a^{\top} \, \mathrm{LeakyReLU}\left( W \left[ h_i \,\|\, h_j \right] \right)$,  (4)

$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \dfrac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$,  (5)

where $a$ is the learned parameter vector of the attention mechanism, and the symbol $\|$ represents vector concatenation.

As shown in Fig. 7, we stack two GATv2 layers as our two-layer spatial distribution extraction module. The input of the first layer is the node set $V$ in Eq. 1, and the neighbor nodes are determined by the weighted adjacency matrix $A$. It is worth noting that the two-layer structure can propagate messages to deeper nodes.
To obtain the spatial distribution features of the entire graph, we first utilize the graph partition to divide the nodes into six sets, as in the above graph construction. Then, graph pooling is added within each set to get their representations, respectively. Finally, multilayer perceptron (MLP) is utilized to fuse them as the final representations of the entire graph.
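As a rough illustration, the sketch below (assuming torch_geometric's GATv2Conv is available) builds a small landmark graph, applies the two-layer GATv2, pools each landmark set, and fuses the pooled vectors with an MLP. The index ranges of the six sets and the cross-set edges are placeholders; the actual wiring follows Fig. 6.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATv2Conv   # assumes torch_geometric is installed

# Hypothetical grouping of landmark indices into six organ sets; the real
# 57-point wiring follows Fig. 6. Points inside a set are chained sequentially,
# and a few cross-set edges link the sets together.
sets = {
    "left_brow": list(range(0, 5)),   "right_brow": list(range(5, 10)),
    "left_eye":  list(range(10, 16)), "right_eye":  list(range(16, 22)),
    "nose":      list(range(22, 31)), "mouth":      list(range(31, 51)),
}
edges = []
for pts in sets.values():
    edges += list(zip(pts[:-1], pts[1:]))            # point-to-point edges (red in Fig. 6)
edges += [(4, 22), (9, 22), (15, 31), (21, 31)]      # placeholder group-to-group edges
edge_index = torch.tensor(edges + [(b, a) for a, b in edges]).t().contiguous()

class LSDBranch(nn.Module):
    """Two-layer GATv2 over landmark coordinates, per-set pooling, and an MLP."""
    def __init__(self, hidden=32, out=128):
        super().__init__()
        self.gat1 = GATv2Conv(2, hidden)             # node feature = normalized (x, y)
        self.gat2 = GATv2Conv(hidden, hidden)
        self.mlp = nn.Sequential(nn.Linear(6 * hidden, out), nn.ReLU())

    def forward(self, coords, edge_index):
        h = torch.relu(self.gat1(coords, edge_index))
        h = torch.relu(self.gat2(h, edge_index))
        pooled = [h[idx].mean(dim=0) for idx in sets.values()]   # graph pooling per set
        return self.mlp(torch.cat(pooled))           # final graph-level representation

num_nodes = sum(len(v) for v in sets.values())
coords = torch.rand(num_nodes, 2)                    # detector output, scaled to [0, 1]
print(LSDBranch()(coords, edge_index).shape)         # torch.Size([128])
```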
Fusion, classification, and optimization
To obtain more discriminative feature representations, the outputs of the two components are concatenated and fed into a fully connected layer for facial expression classification.
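A minimal sketch of this classification head follows, with assumed feature dimensions (512-D from the local–global branch, 128-D from the landmark branch, 7 classes):

```python
import torch
import torch.nn as nn

lgr_feat = torch.randn(4, 512)                    # output of the LGR component (assumed size)
lsd_feat = torch.randn(4, 128)                    # output of the LSD component (assumed size)
classifier = nn.Linear(512 + 128, 7)              # one FC layer over the 7 expressions
logits = classifier(torch.cat([lgr_feat, lsd_feat], dim=1))
print(logits.shape)                               # torch.Size([4, 7])
```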
Besides, the cross-entropy loss and the designed augmentation loss are used to optimize our model:
Cross-entropy loss The cross-entropy loss computes the discrepancy between the prediction and the true label to formulate the softmax loss function as follows:
$L_{\mathrm{CE}} = -\dfrac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{c} y_{ij} \log \hat{y}_{ij}$,  (6)

where $c$ is the class number (7), $N$ is the number of training samples in every batch, $y_i$ is the real label vector and $\hat{y}_i$ is the predicted value vector.

Augmentation loss To enhance intra-class feature similarity and inter-class feature differences, the attention module in Sect. 3.2 calculates attention weights, which are used to compute a weighted Within Cluster Sum of Squares (WCSS) [14] between the deep features $x_i$ and their corresponding class centers $c_{y_i}$. Based on this, the augmentation loss is as follows:
$L_{\mathrm{AL}} = \dfrac{1}{2\,\sigma_c^2} \sum_{i=1}^{N} \left\| a_i \odot \left( x_i - c_{y_i} \right) \right\|_2^2$,  (7)
where $\odot$ indicates element-wise multiplication, $a_{ij}$ denotes the weight of the $i$-th deep feature along the $j$-th dimension in the embedding space, $x_{ij}$ is the $j$-th dimension of the $i$-th sample's deep feature vector belonging to the $y_i$-th class, $c_{y_i,j}$ is its corresponding class center according to the center loss, and $\sigma_c^2$ is the variance of the class centers:

$\sigma_c^2 = \dfrac{1}{c} \sum_{k=1}^{c} \left\| c_k - \bar{c} \right\|_2^2, \quad \bar{c} = \dfrac{1}{c} \sum_{k=1}^{c} c_k$.  (8)
Final loss We chose the same setting as the DACL method, adding the cross-entropy loss and the augmentation loss at the ratio of $1{:}\lambda$ to obtain the final loss as follows:

$L = L_{\mathrm{CE}} + \lambda\, L_{\mathrm{AL}}$.  (9)
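The following sketch illustrates one possible implementation of this objective (Eqs. 6–9), under the reading that the attention-weighted WCSS is normalized by the class-center variance. Tensor shapes, the random attention weights and the epsilon term are placeholders, not the paper's exact code:

```python
import torch
import torch.nn.functional as F

def augmentation_loss(feats, labels, centers, attn, eps=1e-8):
    """Attention-weighted WCSS between features and class centers,
    normalized by the variance of the class centers (Eqs. 7-8)."""
    diff = feats - centers[labels]                        # x_i - c_{y_i}
    wcss = 0.5 * (attn * diff.pow(2)).sum(dim=1).mean()   # weighted within-cluster term
    center_var = centers.var(dim=0).mean()                # variance of the class centers
    return wcss / (center_var + eps)

N, d, C, lam = 64, 512, 7, 0.01                           # batch, feature dim, classes, lambda
feats = torch.randn(N, d, requires_grad=True)
logits = torch.randn(N, C, requires_grad=True)
labels = torch.randint(0, C, (N,))
centers = torch.randn(C, d, requires_grad=True)           # learnable class centers
attn = torch.sigmoid(torch.randn(N, d))                   # weights from the LG augmentation module

loss = F.cross_entropy(logits, labels) + lam * augmentation_loss(feats, labels, centers, attn)
loss.backward()
print(float(loss))
```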
Table 2. Results of various methods on the RAF-DB dataset
Method | Venue/Year | Acc. (%) | Avg. Acc. (%) |
|---|---|---|---|
WeiCL [13] | TVCJ 2023 | 86.96 | – |
SCN [9] | CVPR 2020 | 87.03 | – |
DACL [14] | WACV 2021 | 87.78 | 80.44 |
VTFF [30] | TAC 2021 | 88.14 | – |
PACVT [49] | Inf. Sci 2023 | 88.21 | – |
FER-VT [47] | Inf. Sci 2021 | 88.26 | 80.63 |
CT-DBN [29] | TVCJ 2023 | 88.40 | – |
HPMI [18] | WPC 2022 | 88.44 | – |
Squeeze ViT [21] | Sens. 2022 | 88.90 | – |
HO Loss [25] | TVCJ 2023 | 89.03 | – |
DMUE [11] | CVPR 2021 | 89.42 | – |
FDRL [59] | CVPR 2021 | 89.47 | – |
DAN [26] | Biomim. 2023 | 89.70 | 85.32 |
MM-Net [28] | TVCJ 2023 | 89.77 | – |
EAFR [27] | NCAA 2022 | 89.80 | – |
CMCNN [60] | PR 2022 | 90.36 | 82.26 |
TransFER [22] | ICCV 2021 | 90.91 | – |
Facial Chirality [62] | TMM 2022 | 91.20 | – |
APViT [50] | TAC 2022 | 91.98 | 86.36 |
POSTER [31] | ICCV 2023 | 92.05 | 86.04 |
ARBEx [63] | arXiv 2023 | 92.47 | – |
PAtt-Lite [64] | arXiv 2023 | 95.05 | 90.38 |
LGR–LSD (ours) | – | 92.54 | 86.67 |
Experiments and results
To validate the effectiveness of the proposed model, we conduct extensive comparative experiments and ablation studies in this section. We first introduce two widely used datasets, RAF-DB [37] and FERPlus [38], followed by the implementation details. Then, we compare our proposed model with the state-of-the-art (SOTA) methods on these two datasets. Furthermore, a group of ablation experiments is devised to prove the effectiveness of the sub-modules corresponding to local–global reasoning and landmark space-distribution reasoning.
Dataset
We first introduce two widely used datasets:
RAF-DB consists of about 30,000 facial images acquired using crowdsourcing techniques. The images are annotated with categorical and compound expressions. In our experiments, we followed [56, 58–60] and used the images with 7 discrete basic expressions (i.e., happy, sad, surprise, anger, fear, disgust and neutral). There are 12,271 images for training and 3068 images for testing. The overall sample accuracy (Acc.) and the average accuracy (Avg. Acc.) are used for measurement.
FERPlus is extended from FER2013, which was used in the ICML 2013 Challenges. It is a large-scale dataset collected via the Google search engine, consisting of 28,709 training images, 3589 validation images and 3589 test images. Although contempt is included, leading to 8 classes in this dataset, we followed [25–30] and used the same 7 basic expressions as in the RAF-DB dataset, and only the images with more than half of the votes are chosen. The overall sample accuracy (Acc.) is used for evaluation.
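As an illustration of this selection rule (the vote format and the reading that a single expression must receive more than half of all votes are assumptions, not the official FERPlus file layout), a sample is kept only when one of the seven basic expressions holds a clear majority:

```python
from typing import Optional

KEEP = ["neutral", "happy", "surprise", "sad", "anger", "disgust", "fear"]

def select_label(votes: dict) -> Optional[str]:
    """votes maps an expression name to its crowd vote count for one image."""
    total = sum(votes.values())
    label, count = max(votes.items(), key=lambda kv: kv[1])
    if label in KEEP and count > total / 2:
        return label        # clear majority on one of the 7 basic expressions
    return None             # ambiguous or contempt-dominated sample is dropped

print(select_label({"happy": 8, "neutral": 2}))         # happy
print(select_label({"happy": 4, "sad": 4, "fear": 2}))  # None (no majority)
```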
Fig. 8 [Images not available. See PDF.]
The confusion matrix on RAF-DB and FERPlus datasets
Implementation details
Our implementation details are as follows:
Hardware Environment We implemented our LGR–LSD model in PyTorch on an NVIDIA RTX 2080Ti GPU with 11 GB of graphics memory, together with an Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50 GHz and 128 GB of main memory.
Hyper-parameter Setting The batch size, initial learning rate, and number of epochs were set to 64, 0.1, and 80, respectively. During the training process, the learning rate decayed to one-tenth of its previous value every 15 epochs. Besides, we set the value of $\lambda$ to 0.01 in the final loss (Eq. 9).
Training Details During the training phase, the stochastic gradient descent (SGD) optimizer was adopted for backpropagation, and ResNet-18 weights pre-trained on the MS-Celeb-1M [61] dataset were used for initialization.
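A minimal PyTorch sketch of this optimization setup is given below; the model object and the SGD momentum value are placeholders rather than the paper's exact settings:

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(640, 7)                             # stand-in for the LGR-LSD model
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)   # momentum value is assumed
scheduler = StepLR(optimizer, step_size=15, gamma=0.1)      # lr /= 10 every 15 epochs

for epoch in range(80):                                     # 80 epochs; batch size 64 in the loader
    # ... forward/backward passes over the training mini-batches go here ...
    scheduler.step()
```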
Table 3. Results of various methods on the FERPlus dataset
Method | Venue/Year | Acc. (%) |
|---|---|---|
PACVT [49] | Inf. Sci 2023 | 88.72 |
VTFF [30] | TAC 2021 | 88.81 |
RAN [56] | TIP 2020 | 88.95 |
HO Loss [25] | TVCJ 2023 | 89.03 |
CT-DBN [29] | TVCJ 2023 | 89.17 |
MM-Net [28] | TVCJ 2023 | 89.34 |
SCN [9] | CVPR 2020 | 89.39 |
DMUE [11] | CVPR 2021 | 89.51 |
EAFR [27] | NCAA 2022 | 89.57 |
DAN [26] | Biomim. 2023 | 89.70 |
FER-VT [47] | Inf. Sci 2021 | 90.04 |
TransFER [22] | ICCV 2021 | 90.83 |
APViT [50] | TAC 2022 | 90.86 |
POSTER [31] | ICCV 2023 | 91.62 |
ARBEx [63] | arXiv 2023 | 93.09 |
PAtt-Lite [64] | arXiv 2023 | 95.55 |
LGR–LSD (ours) | – | 92.13 |
Comparison with state-of-the-art methods
We present the comparative results with the state-of-the-art methods on the RAF-DB and FERPlus datasets in Tables 2 and 3, respectively. Overall, our LGR–LSD model achieves a recognition accuracy of 92.54% and an average recognition accuracy of 86.67% on the RAF-DB dataset, and it also performs well on the FERPlus dataset with an accuracy of 92.13% compared to the state-of-the-art methods. Besides, similar to DAN [26] and HPMI [18], we also use spatial and channel attention mechanisms, yet our model is 2.84 and 4.10 percentage points higher, respectively, which suggests that our extra LSD component and AL loss certainly play a role in FER. Squeeze ViT [21], TransFER [22], MM-Net [28], CT-DBN [29] and VTFF [30] involve local–global features through dual-channel networks; compared with these methods, our approach extracts global features associated with local features, focusing on the correlations between local and global features, and achieves better performance. In addition, compared to TransFER [22], FER-VT [47], APViT [50] and POSTER [31], which all adopt the Transformer structure, the improved performance indicates that our approach of using MHSA to extract global features associated with local features, instead of directly extracting global features, is reasonable.
Table 4. The number of samples for each classification of the training set
Dataset | Sur | Fea | Dis | Hap | Sad | Ang | Neu |
|---|---|---|---|---|---|---|---|
RAF-DB | 1290 | 281 | 717 | 4772 | 1982 | 705 | 2524 |
FERPlus | 7495 | 7083 | 2742 | 2384 | 1643 | 46 | 340 |
However, compared to ARBEx [63], our model demonstrates higher accuracy on the RAF-DB dataset but lower performance on the FERPlus dataset. This is caused by the unbalanced data distribution of the FERPlus dataset, which further indicates that our model is sensitive to the long-tailed distribution issue; this is analyzed later. Besides, PAtt-Lite [64] modifies MobileNetV1 with depthwise separable convolution, pointwise convolution and self-attention mechanisms to obtain a lighter-weight model, and it also achieves the best performance. Compared with this method, our model primarily emphasizes the local–global information of images and the spatial distribution features of landmarks. Moreover, we design the topological graph to model these landmarks and conduct extensive experiments to demonstrate their effectiveness. Hence, PAtt-Lite and our method contribute to different issues in facial expression recognition. As for the local feature in PAtt-Lite, it refers to the spatial locality of the data introduced by the depthwise separable convolution, which differs from our local–global information of images.
Fig. 9 [Images not available. See PDF.]
Features PCA reduced-dimensional scatter plots of the anger classification versus three other classifications. Each point represents the features of one sample
To further analyze our model, we present the confusion matrices in Fig. 8. Sub-figure (a) shows little confusion across the classes as a whole, indicating that our proposed AL loss function indeed promotes intra-class compactness and inter-class separation. However, in Fig. 8b, there is obvious confusion between Ang and Sad as well as between Ang and Hap. As shown in Table 4, this is because of the significant long-tail distribution in the FERPlus dataset compared with the RAF-DB dataset, and our model is vulnerable to this phenomenon. In addition, we notice that the accuracy of the anger class is very low and its samples are heavily misclassified, which is likely not just a side effect of the long-tail distribution. Since anger and sadness are both negative expressions, and anger and happiness may both involve an open mouth, we suspect that they share similar features. On this basis, we utilize principal component analysis [65] to reduce the dimensionality of the features for each class and draw two-dimensional scatter plots of the features of the anger class against three other classes (Sad, Happy, and Neutral, shown in Fig. 9). As illustrated in Fig. 9a, b, the points of the two classes almost overlap, while in Fig. 9c the points of the two classes remain clearly separated. This result shows that there are similarities between the Ang and Sad classes and between the Ang and Hap classes, which accounts for the significant confusion. To be more intuitive, Fig. 10 shows two Ang samples classified as Sad (a) and two Ang samples classified as Hap (b). From Fig. 10a, we can see the strong confusion between Ang and Sad, and the samples in Fig. 10b share features with happy expressions, such as raised corners of the mouth and an open, laughing mouth.
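The overlap analysis behind Fig. 9 can be reproduced with a short script of the following form, where random arrays stand in for the learned deep features of the two classes:

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
feat_ang = rng.normal(size=(100, 512))            # placeholder "Ang" deep features
feat_sad = rng.normal(size=(100, 512)) + 0.1      # placeholder "Sad" deep features

proj = PCA(n_components=2).fit_transform(np.vstack([feat_ang, feat_sad]))
plt.scatter(proj[:100, 0], proj[:100, 1], s=8, label="Ang")
plt.scatter(proj[100:, 0], proj[100:, 1], s=8, label="Sad")
plt.legend()
plt.savefig("ang_vs_sad_pca.png")                 # overlapping clusters indicate similar features
```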
Fig. 10 [Images not available. See PDF.]
The anger samples misclassified as Sad (a) or Happy (b)
Fig. 11 [Images not available. See PDF.]
The demonstration of the topological relationship graph composed of 68 key points
Table 5. Ablation experiments on RAF-DB in terms of standard accuracy
Method | Acc. (%) |
|---|---|
ResNet-18 (baseline) | 90.42 |
Res18-LG | 91.07 |
Res18-LG + Aug (w/o AL) | 91.49 |
Res18-LG + Aug (w AL) | 92.21 |
Res18-LG + Aug (w AL) + Landmark-68 | 92.54 |
Res18-LG + Aug (w AL) + Landmark-57 (final) | 92.54 |
“w” represents the phrase “with,” and “w/o” represents the phrase “without”
Ablation study
In this section, a group of ablation experiments is presented to prove the validity of our different modules. Since our model is vulnerable to the long-tail distribution in the FERPlus dataset, the experiments are only conducted on the RAF-DB dataset. We first describe the compared settings as follows:
ResNet-18 (baseline) This method takes ResNet-18 as the backbone, followed by the De-albino module, which can make up for the effect of zero padding of convolution. In fact, it is the ARM method [66] proposed by Shi et al.
Res18-LG Based on the above ResNet-18 (baseline), we utilize multiple attention mechanisms to modify the ResNet-18 for the local–global feature reasoning.
Res18-LG + Aug (w/o AL) Based on Res18-LG, we add an augmentation module with the original loss function of DACL.
Fig. 12 [Images not available. See PDF.]
Comparative visualizations between different models. For example, “ResNet-18 VS. Res18-LG” refers to the comparative results between incorrect classification of ResNet-18 and correct classification of Res18-LG
Res18-LG + Aug (w AL) Based on Res18-LG, we add an augmentation module with our designed AL loss.
Res18-LG + Aug (w AL) + Landmark-68 Based on Res18-LG + Aug (w AL), we add the Landmark Space-Distribution reasoning component. Unlike the final model, the composition of the graph (Fig. 11) uses all 68 key points.
Res18-LG + Aug (w AL) + Landmark-57 This is our final model. The composition of the graph is shown in Fig. 6, using the 57 key points that do not contain the facial contour.
From Table 5, it can be seen that the accuracy increases by 0.65 percentage points after the local–global reasoning module is added, which demonstrates its effectiveness. In addition, the accuracy is improved after using the augmentation module with the original loss, and when our augmentation loss is introduced, the accuracy is further improved by 0.42 percentage points. This result shows that our normalization operation helps the sparse center loss achieve better intra-class compactness and inter-class separation. Finally, when the landmark space-distribution reasoning component is added, the final accuracy reaches 92.54%, demonstrating that the spatial distribution of facial landmarks contains facial expression information. Table 5 also shows that adding the key points of the facial contour does not improve the recognition accuracy, while the inference complexity with 68 key points is clearly higher than that with 57 key points due to the use of the GATv2 network. Therefore, our choice of discarding the key points of the facial contour is justified, as the facial contour mainly determines the facial shape, which changes little and has little correlation with facial expressions in our model.
Partial visualization
In order to show the advantages of our method more clearly, some images in the testing dataset are visualized (Fig. 12), and the predicted values of different methods are marked nearby to compare the recognition results.
In Fig. 12a, it is evident that for the samples in the first row, focusing solely on the local area may lead to the incorrect recognition results on the left; for instance, concentrating only on the corner of the mouth could lead to a misclassification as Hap. Conversely, for the samples in the second row, attention to local information is crucial for obtaining the correct results on the right; for example, the global facial information of the latter three samples is not discriminative enough and can easily lead to a classification as Neu. These observations show that attending to local and global features simultaneously avoids erroneous cues more easily than attending to only one of them.
In Fig. 12b, c, we find that the spatial distribution features of landmarks play a significant role for samples with clearly visible facial organs. This is because the landmarks themselves are obtained through face detection; if the face is not clear, effective landmarks cannot be obtained, resulting in a lack of the corresponding spatial distribution features.
Conclusion
In this paper, we propose a novel LGR–LSD model that integrates local–global information and the spatial distribution features of facial landmarks. The local–global reasoning component makes up for the limitations of focusing on a single kind of information, and the augmentation loss further achieves intra-class compactness and inter-class separation. In addition, the combination of the topological graph and GATv2 can effectively extract the spatial distribution features of the facial landmarks. The recognition performance on the RAF-DB and FERPlus datasets verifies that our LGR–LSD method is superior to other state-of-the-art methods. We are fully aware, however, that our model is currently vulnerable to long-tail distributions, which is precisely what we intend to investigate next. We also plan to mine the dynamic features of the face and take static–dynamic feature fusion into account in future research.
Acknowledgements
This work was supported by the Sichuan Science and Technology Program under Grant 2023YFS0195.
Data availability
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Chattopadhyay, J., Kundu, S., Chakraborty, A., Banerjee, J.S.: Facial expression recognition for human computer interaction. In: International Conference on Computational Vision and Bio Inspired Computing, pp. 1181–1192. Springer (2018)
2. Wu, S; Wang, B. Facial expression recognition based on computer deep learning algorithm: taking cognitive acceptance of college students as an example. J. Ambient Intell. Hum. Comput.; 2021; 13, pp. 1-12.
3. Wolf, K. Measuring facial expression of emotion. Dialogues Clin. Neurosci.; 2022; 17,
4. Ye, J; Yu, Y; Fu, G; Zheng, Y; Liu, Y; Zhu, Y; Wang, Q. Analysis and recognition of voluntary facial expression mimicry based on depressed patients. IEEE J. Biomed. Health Inform.; 2023; 27,
5. Kollias, D.: Multi-label compound expression recognition: C-expr database & network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5589–5598 (2023)
6. Li, S; Deng, W. Deep facial expression recognition: a survey. IEEE Trans. Affect. Comput.; 2020; 13, 1195.
7. Huiqun, H; Guiping, S; Fenghua, H. Summary of expression recognition technology. J. Front. Comput. Sci. Technol.; 2022; 16,
8. Huang, Y; Du, C; Xue, Z; Chen, X; Zhao, H; Huang, L. What makes multi-modal learning better than single (provably). Adv. Neural Inf. Process. Syst.; 2021; 34, pp. 10944-10956.
9. Wang, K., Peng, X., Yang, J., Lu, S., Qiao, Y.: Suppressing uncertainties for large-scale facial expression recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6897–6906 (2020)
10. Zhang, Y; Wang, C; Deng, W. Relative uncertainty learning for facial expression recognition. Adv. Neural Inf. Process. Syst.; 2021; 34, pp. 17616-17627.
11. She, J., Hu, Y., Shi, H., Wang, J., Shen, Q., Mei, T.: Dive into ambiguity: latent distribution mining and pairwise uncertainty estimation for facial expression recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6248–6257 (2021)
12. Zhao, Z., Liu, Q., Zhou, F.: Robust lightweight facial expression recognition network with label distribution training. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 3510–3519 (2021)
13. Xi, Y; Mao, Q; Zhou, L. Weighted contrastive learning using pseudo labels for facial expression recognition. Vis. Comput.; 2023; 39,
14. Farzaneh, A.H., Qi, X.: Facial expression recognition in the wild via deep attentive center loss. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer vision, pp. 2402–2411 (2021)
15. Saurav, S; Gidde, P; Saini, R; Singh, S. Dual integrated convolutional neural network for real-time facial expression recognition in the wild. Vis. Comput.; 2022; 38, pp. 1-14. [DOI: https://dx.doi.org/10.1007/s00371-021-02069-7]
16. Li, J; Jin, K; Zhou, D; Kubota, N; Ju, Z. Attention mechanism-based CNN for facial expression recognition. Neurocomputing; 2020; 411, pp. 340-350. [DOI: https://dx.doi.org/10.1016/j.neucom.2020.06.014]
17. Hu, M; Ge, P; Wang, X; Lin, H; Ren, F. A spatio-temporal integrated model based on local and global features for video expression recognition. Vis. Comput.; 2021; 38, pp. 1-18.
18. Yao, L; He, S; Su, K; Shao, Q. Facial expression recognition based on spatial and channel attention mechanisms. Wirel. Pers. Commun.; 2022; 125, pp. 1-18. [DOI: https://dx.doi.org/10.1007/s11277-022-09616-y]
19. Yu, M; Zheng, H; Peng, Z; Dong, J; Du, H. Facial expression recognition based on a multi-task global-local network. Pattern Recognit. Lett.; 2020; 131, pp. 166-171. [DOI: https://dx.doi.org/10.1016/j.patrec.2020.01.016]
20. Zhang, H; Su, W; Wang, Z. Weakly supervised local–global attention network for facial expression recognition. IEEE Access; 2020; 8, pp. 37976-37987. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.2975913]
21. Kim, S; Nam, J; Ko, BC. Facial expression recognition based on squeeze vision transformer. Sensors; 2022; 22,
22. Xue, F., Wang, Q., Guo, G.: Transfer: learning relation-aware facial expression representations with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3601–3610 (2021)
23. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16 × 16 words: transformers for image recognition at scale. arXiv Preprint arXiv:2010.11929 (2020)
24. Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., Zhang, L.: Cvt: introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22–31 (2021)
25. Li, H; Xiao, X; Liu, X; Guo, J; Wen, G; Liang, P. Heuristic objective for facial expression recognition. Vis. Comput.; 2023; 39,
26. Wen, Z; Lin, W; Wang, T; Xu, G. Distract your attention: multi-head cross attention network for facial expression recognition. Biomimetics; 2023; 8,
27. Gong, W; Fan, Y; Qian, Y. Effective attention feature reconstruction loss for facial expression recognition in the wild. Neural Comput. Appl.; 2022; 34,
28. Xia, H; Lu, L; Song, S. Feature fusion of multi-granularity and multi-scale for facial expression recognition. Vis. Comput.; 2023; 40, pp. 1-13.
29. Liang, X; Xu, L; Zhang, W; Zhang, Y; Liu, J; Liu, Z. A convolution-transformer dual branch network for head-pose and occlusion facial expression recognition. Vis. Comput.; 2023; 39,
30. Ma, F; Sun, B; Li, S. Facial expression recognition with visual transformers and attentional selective fusion. IEEE Trans. Affect. Comput.; 2021; 14, 1236. [DOI: https://dx.doi.org/10.1109/TAFFC.2021.3122146]
31. Zheng, C., Mendieta, M., Chen, C.: Poster: a pyramid cross-fusion transformer network for facial expression recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3146–3155 (2023)
32. Wang, X; Wang, Y; Li, W; Du, Z; Huang, D. Facial expression animation by landmark guided residual module. IEEE Trans. Affect. Comput.; 2021; 14, 878. [DOI: https://dx.doi.org/10.1109/TAFFC.2021.3100352]
33. Ayeche, F; Alti, A. Facial expressions recognition based on delaunay triangulation of landmark and machine learning. Traitement Signal; 2021; 38,
34. Hasani, B., Mahoor, M.H.: Facial expression recognition using enhanced deep 3d convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 30–40 (2017)
35. Wang, Z; Zeng, F; Liu, S; Zeng, B. OAENet: oriented attention ensemble for accurate facial expression recognition. Pattern Recognit.; 2021; 112, 107694. [DOI: https://dx.doi.org/10.1016/j.patcog.2020.107694]
36. Kaya, M; Bilge, HŞ. Deep metric learning: a survey. Symmetry; 2019; 11,
37. Li, S., Deng, W., Du, J.: Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2852–2861 (2017)
38. Barsoum, E., Zhang, C., Ferrer, C.C., Zhang, Z.: Training deep networks for facial expression recognition with crowd-sourced label distribution. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 279–283 (2016)
39. Sebe, N., Cohen, I., Gevers, T., Huang, T.S.: Multimodal approaches for emotion recognition: a survey. In: Internet Imaging VI, vol. 5670, pp. 56–67. SPIE (2005)
40. Mittal, T., Guhan, P., Bhattacharya, U., Chandra, R., Bera, A., Manocha, D.: Emoticon: context-aware multimodal emotion recognition using Frege’s principle. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14234–14243 (2020)
41. Sun, B; Cao, S; He, J; Yu, L. Affect recognition from facial movements and body gestures by hierarchical deep spatio-temporal features and fusion strategy. Neural Netw.; 2018; 105, pp. 36-51. [DOI: https://dx.doi.org/10.1016/j.neunet.2017.11.021]
42. Shi, J; Liu, C; Ishi, CT; Ishiguro, H. Skeleton-based emotion recognition based on two-stream self-attention enhanced spatial-temporal graph convolutional network. Sensors; 2020; 21,
43. Huang, Y., Wen, H., Qing, L., Jin, R., Xiao, L.: Emotion recognition based on body and context fusion in the wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3609–3617 (2021)
44. Chen, J; Wang, C; Wang, K; Yin, C; Zhao, C; Xu, T; Zhang, X; Huang, Z; Liu, M; Yang, T. HEU emotion: a large-scale database for multimodal emotion recognition in the wild. Neural Comput. Appl.; 2021; 33,
45. Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., Mihalcea, R.: Meld: a multimodal multi-party dataset for emotion recognition in conversations. arXiv Preprint arXiv:1810.02508 (2018)
46. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
47. Huang, Q; Huang, C; Wang, X; Jiang, F. Facial expression recognition with grid-wise attention and visual transformer. Inf. Sci.; 2021; 580, pp. 35-54.
48. Pecoraro, R; Basile, V; Bono, V. Local multi-head channel self-attention for facial expression recognition. Information; 2022; 13,
49. Liu, C; Hirota, K; Dai, Y. Patch attention convolutional vision transformer for facial expression recognition with occlusion. Inf. Sci.; 2023; 619, pp. 781-794. [DOI: https://dx.doi.org/10.1016/j.ins.2022.11.068]
50. Xue, F; Wang, Q; Tan, Z; Ma, Z; Guo, G. Vision transformer with attentive pooling for robust facial expression recognition. IEEE Trans. Affect. Comput.; 2022; 14, pp. 3244-3256. [DOI: https://dx.doi.org/10.1109/TAFFC.2022.3226473]
51. Liu, Y; Zhang, X; Li, Y; Zhou, J; Li, X; Zhao, G. Graph-based facial affect analysis: a review. IEEE Trans. Affect. Comput.; 2022; 14, pp. 2657-2677. [DOI: https://dx.doi.org/10.1109/TAFFC.2022.3215918]
52. Scarselli, F; Gori, M; Tsoi, AC; Hagenbuchner, M; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw.; 2008; 20,
53. Welling, M., Kipf, T.N.: Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations (ICLR 2017) (2016)
54. Velickovic, P; Cucurull, G; Casanova, A; Romero, A; Lio, P; Bengio, Y. Graph attention networks. Stat; 2017; 1050, 20.
55. Brody, S., Alon, U., Yahav, E.: How attentive are graph attention networks? In: International Conference on Learning Representations (2021)
56. Wang, K; Peng, X; Yang, J; Meng, D; Qiao, Y. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Trans. Image Process.; 2020; 29, pp. 4057-4069. [DOI: https://dx.doi.org/10.1109/TIP.2019.2956143]
57. Zhang, K; Zhang, Z; Li, Z; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett.; 2016; 23,
58. Li, H., Sui, M., Zhao, F., Zha, Z., Wu, F.: MVT: mask vision transformer for facial expression recognition in the wild. arXiv Preprint arXiv:2106.04520 (2021)
59. Ruan, D., Yan, Y., Lai, S., Chai, Z., Shen, C., Wang, H.: Feature decomposition and reconstruction learning for effective facial expression recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7660–7669 (2021)
60. Yu, W; Xu, H. Co-attentive multi-task convolutional neural network for facial expression recognition. Pattern Recognit.; 2022; 123, 108401. [DOI: https://dx.doi.org/10.1016/j.patcog.2021.108401]
61. Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: MS-celeb-1M: A dataset and benchmark for large-scale face recognition. In: European Conference on Computer Vision, pp. 87–102. Springer (2016)
62. Lo, L; Xie, H; Shuai, HH; Cheng, WH. Facial chirality: from visual self-reflection to robust facial feature learning. IEEE Trans. Multimed.; 2022; 24, pp. 4275-4284. [DOI: https://dx.doi.org/10.1109/TMM.2022.3197365]
63. Wasi, A.T., Šerbetar, K., Islam, R., Rafi, T.H., Chae, D.K.: Arbex: Attentive feature extraction with reliability balancing for robust facial expression learning. arXiv preprint arXiv:2305.01486 (2023)
64. Ngwe, J.L., Lim, K.M., Lee, C.P., Ong, T.S.: PAtt-Lite: lightweight patch and attention MobileNet for challenging facial expression recognition. arXiv preprint arXiv:2306.09626 (2023)
65. Abdi, H; Williams, LJ. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat.; 2010; 2,
66. Shi, J., Zhu, S., Liang, Z.: Learning to amend facial expression representation via de-Albino and affinity. arXiv Preprint arXiv:2103.10189 (2021)