1. Introduction
In recent years, with the rapid development of space technology, human-robot interaction for on-orbit servicing (OOS) space robots has become an important research area [1–3]. Although the intelligence of space robots is limited, human-robot interaction plays an important role in space mission applications. Space robots can replace or assist astronauts in various on-board/off-board activities, so it is particularly important for space robots to recognize astronaut commands [4]. Recognizing astronauts' facial expressions is a widely used method of human-robot interaction in space that does not rely on highly intelligent space robots; it effectively combines human decision-making with the precise operational capabilities of space robots to improve their overall performance [5–7]. The accuracy of astronaut facial expression recognition and the size of the expression recognition model are therefore important indicators of the efficiency of space robots.
Early in the research process, the features of facial expressions were mostly extracted manually, but recognition accuracy was not high because facial expressions in natural environments are easily affected by many factors, such as occlusion, background, and pose [8]. In recent years, deep learning has achieved major breakthroughs in image recognition. Sun et al. [9] designed a facial expression recognition system combining shallow and deep features with an attention mechanism and proposed an attention model based on the relative positions of facial feature points and the textural features of local facial regions for better extraction of shallow features. Wenmeng and Hua [10] proposed a new end-to-end coattentive multitask convolutional neural network consisting of a channel coattention module and a spatial coattention module; their approach performs better than single-task and other multitask methods. Shi et al. [11] proposed a facial expression recognition method based on a multibranch cross-connected convolutional neural network built from residual connections, network-in-network, and tree structures; it also adds fast cross-connections that sum the convolutional output layers, which makes data flow between networks smoother and improves the feature extraction ability of each receptive field. Kong et al. [12] proposed a lightweight facial expression recognition method based on an attention mechanism and key region fusion; to reduce computational complexity, a lightweight convolutional neural network was used as the basic recognition model for expression classification, which reduces the computational effort of the network to some extent. Zhou et al. [13] designed a lightweight convolutional neural network that uses a multitask cascaded convolutional network for face detection and combines a residual module with a depthwise separable convolution module to greatly reduce the number of network parameters and make the model more portable.
Although most of the above studies can extract features and lighten the model to some extent, shortcomings remain. For example, the face acquisition process is susceptible to factors such as lighting, background, and pose, which reduces the learning ability of the model when training on the face sample set and leads to insufficient feature extraction. The number of network layers in a deep learning model also affects classification accuracy to a certain extent: as the number of layers increases, the vanishing gradient problem appears and recognition accuracy decreases. To solve these problems, this paper proposes a multiscale feature fusion attention lightweight network, making the following main contributions.
First, during the image preprocessing stage, a random erasing method based on data labels is used to mask the facial expression images to expand the training set samples and improve the robustness of the model.
Second, to further extract the deep features of facial expressions, an improved convolutional block attention module (CBAM) is embedded in the model, which re-represents the features of facial expressions in both the channel and spatial dimensions.
Third, to solve the problem of model redundancy caused by too many convolutional layers, the improved bottleneck layer is used to reduce the dimensionality of the network, which saves the computation of the network and increases the nonlinear expression capability of the model.
Fourth, to lighten the model, an improved depthwise separable convolution module is added to reduce the number of parameters computed by the network while speeding up the network operations.
Finally, comparison with different network models verifies that the model proposed in this paper achieves higher accuracy with a lighter structure.
2. Related Work
2.1. Spatial/Channel Attention Mechanism [14, 15]
CBAM is a lightweight module that combines channel attention and spatial attention to significantly improve model performance while requiring little computation and few parameters. The channel attention mechanism [16–18] focuses on which channel features are meaningful: global average pooling and global maximum pooling are used to obtain two feature descriptors, which are fed into a weight-sharing multilayer perceptron with a single hidden layer; the two outputs are summed element by element and passed through a sigmoid function to produce the channel attention weights. The spatial attention mechanism focuses on where the informative features are located and generates a spatial attention map from features pooled along the channel axis.
2.2. Bottleneck Layer
The bottleneck layer [21] is the core structure of the residual network [22] and mainly contains three convolutional layers, as shown in Figure 1. The first layer uses a 1 × 1 convolution kernel to reduce the channel dimension, the second layer uses a 3 × 3 kernel to extract features, and the third layer uses a 1 × 1 kernel to restore the channel dimension.
[figure(s) omitted; refer to PDF]
For a conventional convolution with a kernel of size $k \times k$, $C_{in}$ input channels, and $C_{out}$ output channels, the number of parameters (ignoring biases) is
$$P_{conv} = k \times k \times C_{in} \times C_{out}.$$
Assuming that the number of intermediate feature map channels in the bottleneck layer is $C_{mid}$, the first $1 \times 1$ convolution reduces the channel dimension from $C_{in}$ to $C_{mid}$, the $3 \times 3$ convolution operates on $C_{mid}$ channels, and the last $1 \times 1$ convolution restores the dimension to $C_{out}$, so the number of parameters generated by the bottleneck layer is
$$P_{bottleneck} = C_{in} \times C_{mid} + 3 \times 3 \times C_{mid} \times C_{mid} + C_{mid} \times C_{out}.$$
Comparing the two for the input and output feature map sizes in Figure 1 shows that the number of parameters generated during the bottleneck operation is greatly reduced.
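As a worked illustration (the channel sizes 256 → 64 → 256 and the 3 × 3 kernel here are assumed for the example, following the classic ResNet bottleneck, and are not taken from Figure 1):
$$P_{conv} = 3 \times 3 \times 256 \times 256 = 589{,}824, \qquad P_{bottleneck} = 256 \times 64 + 3 \times 3 \times 64 \times 64 + 64 \times 256 = 69{,}632,$$
so the bottleneck uses roughly $1/8$ of the parameters of the conventional convolution under these assumptions.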
2.3. Depthwise Separable Convolution
Depthwise separable convolution [23] is the core structure of the lightweight network MobileNet [24, 25] and is a combination of two parts: depthwise convolution and pointwise convolution. The specific structure is shown in Figure 2. Depthwise separable convolution has fewer parameters and a lower computational cost than a conventional convolution operation. The number of convolution kernels in depthwise convolution equals the number of channels in the previous layer, and each kernel is responsible for one channel. The feature map generated by this process therefore has the same number of channels as the input, so depthwise convolution alone cannot expand the dimensionality of the feature map, and convolving each channel independently cannot effectively use the feature information of different channels at the same spatial location. Pointwise convolution [26] therefore uses a 1 × 1 convolution kernel to combine the feature maps produced by depthwise convolution across channels, which expands the channel dimension and fuses information from different channels at the same spatial location.
[figure(s) omitted; refer to PDF]
We assume that the input feature map is of size $D_F \times D_F$ with $M$ channels, the convolution kernel is of size $D_K \times D_K$, and the number of output channels is $N$. A conventional convolution then requires $D_K \times D_K \times M \times N$ parameters.
The depthwise separable convolution first applies one $D_K \times D_K$ kernel per input channel (depthwise convolution) and then applies $N$ pointwise $1 \times 1$ kernels, so its parameter count is $D_K \times D_K \times M + M \times N$.
The ratio of the two is
$$\frac{D_K \times D_K \times M + M \times N}{D_K \times D_K \times M \times N} = \frac{1}{N} + \frac{1}{D_K^2}.$$
For a typical $3 \times 3$ kernel, this ratio approaches $1/9$ when the number of output channels $N$ is large, so the depthwise separable convolution reduces the number of parameters to roughly one-ninth of that of a conventional convolution.
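A minimal TensorFlow/Keras sketch of this comparison (the 48 × 48 × 32 input and the 64 output channels are illustrative assumptions; `SeparableConv2D` implements the depthwise-then-pointwise factorization described above):

```python
import tensorflow as tf

# Illustrative shapes: a 48x48 feature map with 32 channels, mapped to 64 channels.
inputs = tf.keras.Input(shape=(48, 48, 32))

conv = tf.keras.layers.Conv2D(64, 3, padding="same")          # conventional 3x3 convolution
sep = tf.keras.layers.SeparableConv2D(64, 3, padding="same")  # depthwise + pointwise convolution

_ = conv(inputs)  # call the layers once so their weights are created
_ = sep(inputs)

print(conv.count_params())  # 3*3*32*64 + 64 biases = 18,496
print(sep.count_params())   # 3*3*32 (depthwise) + 32*64 (pointwise) + 64 biases = 2,400
```

Ignoring biases, the ratio is 2336/18432 ≈ 0.127 = 1/64 + 1/9, which matches the formula above for $N = 64$ and $D_K = 3$.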
2.4. Wide Residual Neural Network Model
To mitigate the vanishing gradient problem caused by increasing depth in deep neural networks, a residual learning unit is introduced, which makes deep networks easier to optimize by adding an identity mapping between the input and the output. In the ResNet residual learning unit, the neural network input is $x$, the desired underlying mapping is $H(x)$, and the stacked layers learn the residual mapping $F(x) = H(x) - x$, so the output of the unit is $F(x) + x$.
The residual learning unit is shown in Figure 3, where dropout regularization prevents overfitting of the model and ReLU denotes the activation function. The wide residual network [27, 28] widens the channels of each residual unit instead of further deepening the network, which strengthens feature representation while keeping the network relatively shallow.
[figure(s) omitted; refer to PDF]
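A minimal TensorFlow/Keras sketch of a residual unit with dropout in the spirit of the wide residual network [28]; the pre-activation ordering, the 3 × 3 kernels, and the dropout rate are assumptions for illustration, not the exact configuration of Figure 3:

```python
import tensorflow as tf
from tensorflow.keras import layers

def wide_residual_unit(x, filters, stride=1, dropout_rate=0.3):
    """Pre-activation residual unit with dropout (WRN-style sketch)."""
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(y)
    y = layers.Dropout(dropout_rate)(y)          # dropout between the two convolutions
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    shortcut = x
    if stride != 1 or x.shape[-1] != filters:
        # Project the identity branch when the shape changes so F(x) + x is valid.
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)
    return layers.Add()([y, shortcut])

# Usage: widening means choosing a larger number of filters for each unit.
inputs = tf.keras.Input(shape=(48, 48, 16))
outputs = wide_residual_unit(inputs, filters=96)
```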
3. Methods
3.1. Overall Architecture
A network with too few layers cannot sufficiently represent the features of facial expressions, whereas too many layers increase the computation of the network and cause network redundancy. Considering these problems, this paper designs a multiscale feature fusion attention lightweight facial expression recognition network. In the image preprocessing stage, noise is added to the training set by an improved random erasing method, which enhances the robustness of the model while enriching the entire dataset.
Then, the preprocessed facial expression images are passed into the network. First, the depthwise separable shuffle module reduces the number of model parameters and speeds up the computation of the network. The SCAM is embedded in the middle of the network, and the grouping bottleneck module then reduces the dimensionality of the feature maps, which saves computation and increases the nonlinear expression capability of the model. The features then pass through another depthwise separable shuffle module and finally enter the Softmax layer, which outputs the classification results. The overall architecture of the model is shown in Figure 4.
[figure(s) omitted; refer to PDF]
The input to the network is a facial expression image, and the specific parameters of each layer of the model are listed in Table 1.
Table 1
Model parameters.
Light-NTWRN | ||||
Type | Filters | Size | Output | Repetition |
Input | — | — | — | |
Conv | 16 | — | ||
BN+ReLU | — | — | — | |
DS-1 | 96 | 5 | ||
BN+ReLU | ||||
DS-2 | 96 | |||
BN+ReLU+dropout | ||||
DS-3 | 96 | |||
BN+ReLU+SCAM | ||||
Conv-1 | 192 | 8 | ||
BN+ReLU | ||||
GConv-1 | 192 | |||
BN+ReLU+dropout | ||||
Conv-2 | 192 | |||
BN+ReLU+SCAM | ||||
DS-4 | 384 | 5 | ||
BN+ReLU | ||||
DS-5 | 384 | |||
BN+ReLU+dropout | ||||
DS-6 | 384 | |||
BN+ReLU+SCAM | ||||
GlobalAvg pooling | — | — | — | |
Softmax | — | — | — |
3.2. Image Preprocessing
Data augmentation is a common method in the image preprocessing stage that mitigates overfitting and improves the generalizability of the model to a certain extent. This paper expands the training set and enhances the robustness of the model by adding a small amount of noise to the images through an improved random erasing method [29].
First, in the preprocessing stage, the probability of applying random erasing to an object image is set as $p$, and the area ratio and aspect ratio of the erased rectangle are sampled within preset ranges.
Among them, the specific parameters of random erasing are set, as shown in Table 2.
Table 2
Random erasing parameters.
Parameter | Value |
Erasing probability $p$ | 0.5
Minimum erasing area ratio $s_l$ | 0.05
Maximum erasing area ratio $s_h$ | 0.3
Minimum erasing aspect ratio $r_1$ | 0.3
A point is then randomly selected in the image as the top-left corner of the erasing rectangle, and if the rectangle lies entirely within the image, its pixels are replaced with random values.
Since the background noise of facial expression pictures affects recognition accuracy and standard random erasing does not necessarily cover the facial expression region, which causes redundancy in the original dataset, the random erasing method is improved to ensure that the erasing region must lie at the face location: the coordinates of the randomly selected point are constrained so that the erased rectangle falls within the detected face region.
[figure(s) omitted; refer to PDF]
As seen from Figure 5, the improved method can ensure that each random erasing is within the range of facial expressions, artificially extends the dataset of training samples, improves the robustness of the model, and effectively reduces the risk of model overfitting.
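A minimal Python sketch of this face-constrained random erasing, assuming an 8-bit image and a face bounding box `(x, y, w, h)` supplied by a face detector; the parameter names follow Table 2, and the retry loop is an implementation assumption:

```python
import numpy as np

def face_random_erasing(img, face_box, p=0.5, s_l=0.05, s_h=0.3, r_1=0.3, rng=None):
    """Randomly erase a rectangle that lies entirely inside the face region.

    img: uint8 array of shape (H, W) or (H, W, C); face_box: (x, y, w, h).
    """
    rng = rng or np.random.default_rng()
    if rng.random() > p:                      # apply erasing with probability p
        return img
    x, y, w, h = face_box
    for _ in range(100):                      # retry until a rectangle fits inside the face
        target_area = rng.uniform(s_l, s_h) * w * h
        aspect = rng.uniform(r_1, 1.0 / r_1)
        eh = int(round(np.sqrt(target_area * aspect)))
        ew = int(round(np.sqrt(target_area / aspect)))
        if 0 < ew < w and 0 < eh < h:
            ex = x + rng.integers(0, w - ew)  # top-left corner constrained to the face box
            ey = y + rng.integers(0, h - eh)
            out = img.copy()
            noise_shape = (eh, ew) + img.shape[2:]
            out[ey:ey + eh, ex:ex + ew] = rng.integers(0, 256, noise_shape, dtype=np.uint8)
            return out
    return img
```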
3.3. Spatial Channel Attention Module (SCAM)
To further extract the deep features of different facial expressions and improve the accuracy of facial expression recognition, this paper improves the lightweight attention module (convolutional block attention module) proposed by Woo et al. [30]. This is a simple and effective attention module for convolutional neural networks. Given an intermediate feature map, our module sequentially generates attention maps along two separate dimensions, space and channel, and multiplies each attention map with the input feature map for adaptive feature refinement. Because SCAM is a lightweight, general-purpose module, it can be seamlessly integrated into any CNN architecture with negligible computational cost. Since convolutional operations extract information by mixing cross-channel and spatial information, we use our module to emphasize features that are meaningful along these two principal dimensions, the spatial and channel axes. To achieve this, we apply the spatial and channel attention modules in turn so that each branch learns where and what to attend to on the spatial and channel axes, respectively. Our module thus effectively aids the flow of information in the network by learning which information should be emphasized or suppressed. The features of the object image are represented in the spatial and channel dimensions, first by the spatial attention module and then by the channel attention module, and finally, the refined features are obtained. The structure of the SCAM proposed in this paper is shown in Figure 6.
[figure(s) omitted; refer to PDF]
The proposed SCAM contains two independent submodules, the spatial attention module and the channel attention module, which perform feature extraction in the spatial and channel dimensions, respectively. The input feature map is first refined by the spatial attention module and then by the channel attention module, and the resulting feature map is passed to the subsequent layers of the network.
(1) Spatial attention module
In the process of facial expression recognition, different expressions are associated with specific regions. Moreover, an overall facial expression consists of several regions, and more attention needs to be paid to the local features most relevant to the expression. The structure of the spatial attention module is shown in Figure 7.
[figure(s) omitted; refer to PDF]
First, the input feature map undergoes max pooling and average pooling along the channel dimension, followed by a CONCAT operation on the two resulting maps and a convolution layer; the output is passed through a sigmoid function to generate the spatial attention map, which is multiplied element by element with the input feature map to obtain the spatially refined features.
(2) Channel attention module
To represent the feature information of facial expressions in multiple dimensions, the feature map output by the spatial attention module is used as the input of this module. Global max pooling and global average pooling are applied over the width and height, and the two resulting descriptors are fed into a weight-sharing multilayer perceptron (MLP) with one hidden layer. The two outputs are then merged by element-by-element summation and passed through a sigmoid function, as follows:
$$M_c(F) = \sigma\left(\mathrm{MLP}\left(\mathrm{AvgPool}(F)\right) + \mathrm{MLP}\left(\mathrm{MaxPool}(F)\right)\right),$$
where $F$ is the input feature map and $\sigma$ denotes the sigmoid function.
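A minimal TensorFlow/Keras sketch of a spatial-then-channel attention module of this kind; the 7 × 7 convolution kernel and the reduction ratio of 8 are assumptions borrowed from the original CBAM, not values stated in this paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(x, kernel_size=7):
    """Pool along the channel axis, concatenate, convolve, and gate with a sigmoid."""
    avg_pool = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    max_pool = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)
    concat = layers.Concatenate(axis=-1)([avg_pool, max_pool])
    attn = layers.Conv2D(1, kernel_size, padding="same", activation="sigmoid")(concat)
    return layers.Multiply()([x, attn])

def channel_attention(x, reduction=8):
    """Shared MLP over globally average- and max-pooled descriptors, summed and gated."""
    channels = x.shape[-1]
    shared_mlp = tf.keras.Sequential([
        layers.Dense(channels // reduction, activation="relu"),
        layers.Dense(channels),
    ])
    avg_branch = shared_mlp(layers.GlobalAveragePooling2D()(x))
    max_branch = shared_mlp(layers.GlobalMaxPooling2D()(x))
    attn = layers.Activation("sigmoid")(layers.Add()([avg_branch, max_branch]))
    attn = layers.Reshape((1, 1, channels))(attn)
    return layers.Multiply()([x, attn])

def scam(x):
    """Spatial attention first, then channel attention, as described above."""
    return channel_attention(spatial_attention(x))
```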
3.4. Grouping Bottleneck Method
In this paper, the grouping bottleneck is improved on the basis of the group convolution method, and its specific structure is shown in Figure 8, in which the middle convolution of the bottleneck block is implemented as a grouped convolution (the Conv-1/GConv-1/Conv-2 stage in Table 1).
[figure(s) omitted; refer to PDF]
Grouped convolution divides the input feature map channels into groups and convolves each group separately. If the input feature map has $C_1$ channels, the output has $C_2$ channels, the kernel size is $k \times k$, and the channels are divided into $g$ groups, then each group maps $C_1/g$ input channels to $C_2/g$ output channels, so the total number of parameters is $k \times k \times \frac{C_1}{g} \times \frac{C_2}{g} \times g = \frac{k \times k \times C_1 \times C_2}{g}$, i.e., $1/g$ of that of a standard convolution with the same kernel size.
The numbers of parameters of the bottleneck and the grouping bottleneck are shown in Table 3. The grouping bottleneck block shows a substantial decrease in the number of parameters compared with the original bottleneck block, with a ratio of nearly 1/10, while the nonlinear expression capability of the model is increased.
Table 3
Comparison of the number of parameters before and after the bottleneck improvement.
Module | Number of parameters | Ratio
580327 | 1.78 | |
Bottleneck | 324873 | 1 |
Grouping bottleneck | 33257 | 0.10 |
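A minimal TensorFlow/Keras sketch of a bottleneck block whose middle convolution is grouped; the 1 × 1 / 3 × 3 / 1 × 1 kernel sizes and the group count of 4 are illustrative assumptions (the `groups` argument of `Conv2D` requires TensorFlow 2.3 or later):

```python
import tensorflow as tf
from tensorflow.keras import layers

def grouping_bottleneck(x, mid_channels, out_channels, groups=4):
    """Bottleneck block with a grouped convolution in the middle layer."""
    y = layers.Conv2D(mid_channels, 1, padding="same", activation="relu")(x)   # reduce dimensionality
    y = layers.Conv2D(mid_channels, 3, padding="same", groups=groups,
                      activation="relu")(y)                                     # grouped 3x3 convolution
    y = layers.Conv2D(out_channels, 1, padding="same")(y)                       # restore dimensionality
    shortcut = x
    if x.shape[-1] != out_channels:
        shortcut = layers.Conv2D(out_channels, 1, padding="same")(x)            # match channels for the sum
    return layers.ReLU()(layers.Add()([y, shortcut]))

# Usage: mid_channels must be divisible by the number of groups.
inputs = tf.keras.Input(shape=(24, 24, 192))
outputs = grouping_bottleneck(inputs, mid_channels=96, out_channels=192)
```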
3.5. Depthwise Separable Shuffle Method
In this paper, channel shuffling is used to improve the depthwise separable convolution [31, 32], and its structure is shown in Figure 9. The depthwise separable convolution first processes the input feature map with depthwise convolution, in which each channel is handled by its own convolution kernel, and the results are then stitched together along the channel axis with a CONCAT operation. As a result, each output feature depends on only part of the input channel features, and there is no information exchange between different channels, which limits the representational power of the extracted features. Although depthwise separable convolution uses pointwise convolution to further increase the feature dimensionality, which enhances the exchange of spatial feature information to a certain extent, increasing the dimensionality also increases the number of network parameters. Since each channel is convolved independently in the depthwise stage, a channel shuffle operation is inserted to rearrange the channels across groups, which allows feature information to flow between channels without introducing additional parameters.
[figure(s) omitted; refer to PDF]
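A minimal TensorFlow/Keras sketch of the depthwise separable shuffle idea; the 3 × 3 depthwise kernel and the group count of 4 are assumptions for illustration, and static spatial dimensions are assumed for the reshape:

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_shuffle(x, groups):
    """Interleave channels across groups so information can flow between them."""
    _, h, w, c = x.shape                      # assumes static H, W, C
    x = tf.reshape(x, [-1, h, w, groups, c // groups])
    x = tf.transpose(x, [0, 1, 2, 4, 3])      # swap the group and per-group channel axes
    return tf.reshape(x, [-1, h, w, c])

def ds_shuffle_block(x, filters, groups=4):
    """Depthwise separable convolution followed by a channel shuffle."""
    y = layers.DepthwiseConv2D(3, padding="same")(x)   # one kernel per input channel
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 1, padding="same")(y)   # pointwise convolution
    return layers.Lambda(lambda t: channel_shuffle(t, groups))(y)

# Usage: the number of output channels must be divisible by the number of groups.
inputs = tf.keras.Input(shape=(48, 48, 16))
outputs = ds_shuffle_block(inputs, filters=96)
```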
4. Experiment and Analysis
4.1. Experiment Preparation
To verify the accuracy and effectiveness of the Light-NTWRN network model proposed in this paper, comparative and ablation experiments are conducted on the FER2013, CK+, and JAFFE datasets. The experiments are based on the TensorFlow deep learning framework, and training and testing are conducted in PyCharm with the following hardware configuration: Windows 10 operating system, a 2.9 GHz Intel Core i7-10700F CPU, 16 GB of RAM, and an NVIDIA GeForce RTX 3070 (8 GB) graphics card. During the experiments, 70% of the facial expression images are randomly selected as the training set, and the remaining 30% are used as the test set. The experimental parameters are set as shown in Table 4.
Table 4
Experimental parameter settings.
Parameter | FER2013 | CK+ | JAFFE |
Optimizer | SGD | SGD | SGD |
Momentum | 0.9 | 0.9 | 0.9 |
Batch size | 30 | 20 | 40 |
Learning rate | 0.01 | 0.01 | 0.01 |
Learning rate decay | 0.5/50 | 0.5/50 | 0.5/50 |
Loss function | Cross entropy | Cross entropy | Cross entropy |
Epochs | 300 | 300 | 300 |
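A minimal TensorFlow/Keras sketch of the training configuration in Table 4 (FER2013 column); interpreting the learning rate decay "0.5/50" as halving the rate every 50 epochs is an assumption:

```python
import tensorflow as tf

# SGD with momentum 0.9 and an initial learning rate of 0.01, as in Table 4.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
loss_fn = tf.keras.losses.CategoricalCrossentropy()

def lr_schedule(epoch, lr):
    """Halve the learning rate every 50 epochs (assumed meaning of '0.5/50')."""
    return lr * 0.5 if epoch > 0 and epoch % 50 == 0 else lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
# These objects would be passed to model.compile(...) and to model.fit(..., epochs=300,
# batch_size=30, callbacks=[lr_callback]) once the Light-NTWRN model and data are built.
```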
4.2. Facial Expression Dataset
The FER2013 facial expression dataset consists of 35,886 facial expression images; the dataset is expanded to 80,000 images by the improved random erasing method, of which the training set contains 56,000 images and the test set contains 24,000 images. Each image is a grayscale image with a fixed size of 48 × 48 pixels.
CK+ is extended from the Cohn-Kanade dataset and contains 123 participants, 593 image sequences, and 7 expression categories. The CK+ images are all collected under the same lighting and background, so the acquisition conditions are good. The dataset is expanded to 1,500 images through the improved random erasing method, with 70% used for training and 30% for testing.
The JAFFE dataset was collected from 10 Japanese female students who each made 7 different expressions, for a total of 213 photos, which were expanded to 3,408 photos by rotation, flipping, contrast enhancement, translation, cropping, scaling, and the improved random erasing method.
4.3. Ablation Experiment
To verify the effectiveness of the Light-NTWRN network model proposed in this paper, ablation experiments are conducted for each module, and the experimental results are shown in Table 5. WRN denotes the improved wide residual network, RE denotes the improved random erasing method, SCAM denotes the improved attention mechanism module, GBN denotes the grouping bottleneck method, and DS denotes the depthwise separable shuffle method, where WRN+RE+SCAM+GBN+DS denotes the Light-NTWRN network proposed in this paper.
Table 5
Light-NTWRN network ablation experiments.
Model | FER2013 (%) | CK+ (%) | JAFFE (%) | Parameter (M) |
WRN | 69.60% | 95.38% | 91.42% | 18.35 |
WRN+RE | 71.05% | 97.54% | 93.88% | 18.35 |
WRN+RE+SCAM | 72.27% | 98.35% | 94.28% | 23.21 |
WRN+RE+SCAM+GBN | 72.58% | 98.60% | 94.87% | 15.72 |
Light-NTWRN (ours) | 73.21% | 98.72% | 95.21% | 10.14 |
First, the facial expression images are input into the model after the improved random erasing operation, and for the model to acquire more local features of facial expressions, the improved SCAM is embedded into the network to reassign the feature weights of facial expressions from both the channel and space dimensions. The grouping bottleneck method is improved to solve the problem of model redundancy caused by too many convolutional layers. To reduce the number of parameters computed by the network and speed up the network operation, an improved depthwise separable shuffle method is added. To verify the effectiveness of each improved module, the Light-NTWRN network ablation experiments are shown in Table 5.
The ablation experiments are shown in Figure 10, where part (a) represents the FER2013 ablation experiment, part (b) the CK+ ablation experiment, and part (c) the JAFFE ablation experiment. According to the ablation experiments on the FER2013 dataset in part (a), Light-NTWRN has the fastest convergence rate; the recognition accuracy grows slowly after 100 epochs of training, gradually levels off at around 210 epochs, and reaches a maximum of 73.21%.
[figure(s) omitted; refer to PDF]
From the ablation experiments on the CK+ dataset in part (b) of Figure 10, it can be seen that the accuracy of the model increases rapidly at the beginning of training and oscillates between the 50th and 100th epochs. When the training reaches 150 epochs, the accuracy becomes stable, and the highest accuracy reaches 98.72%. From the ablation experiments on the JAFFE dataset in part (c), the accuracy of the model also grows quickly at the beginning of training; when the training reaches 180 epochs, the accuracy becomes stable, and the highest accuracy reaches 95.21%. From these experiments, it is found that the accuracy of the model improves after adding SCAM, at the cost of a slight loss in network speed, while GBN and DS effectively reduce the number of network parameters and improve the accuracy of the model. Furthermore, the accuracy of the model proposed in this paper on the three datasets FER2013, CK+, and JAFFE is improved by 3.61%, 3.34%, and 3.79%, respectively, compared with the original model, and the number of parameters is reduced by 44.74% compared with the original network, which proves that the proposed model is more effective and computes faster.
To further verify the effectiveness and the robustness of the proposed model in this paper, the confusion matrix experiments are shown in Figure 11, where part a represents the confusion matrix on the FER2013 dataset, part b represents the confusion matrix on the CK+ dataset, and part c represents the confusion matrix on the JAFFE dataset.
[figure(s) omitted; refer to PDF]
From the confusion matrix on the FER2013 dataset in part (a), the recognition accuracy of the anger, fear, and sadness categories is low because the facial activity of these three expressions is less obvious and their feature points are difficult to extract. The recognition performance of each category on the CK+ dataset is better, and the accuracy is higher. On the JAFFE dataset, the recognition accuracy of the anger and disgust categories is lower because the misidentified samples all belong to negative emotions, which are more similar to each other and whose facial features are difficult to extract, so recognition is more challenging.
4.4. Mainstream Algorithm Comparison Experiment
To verify the effectiveness of the Light-NTWRN algorithm proposed in this paper for facial expression recognition, comparison experiments are conducted with five mainstream algorithms, namely, AlexNet, VGG16, VGG19, ResNet18, and ResNet50, comparing the number of parameters and the recognition accuracy on the three datasets. The specific results are shown in Table 6.
Table 6
Comparison experiments of mainstream algorithms.
Model | FER2013 (%) | CK+ (%) | JAFFE (%) | Parameter (M) |
AlexNet | 67.51 | 87.59 | 89.83 | 60.92 |
VGG16 | 68.89 | 95.46 | 91.04 | 14.75 |
VGG19 | 68.53 | 92.18 | 90.37 | 20.06 |
ResNet18 | 70.09 | 89.39 | 92.55 | 11.69
ResNet50 | 71.26 | 92.46 | 93.08 | 25.56 |
Light-NTWRN (ours) | 73.21 | 98.72 | 95.21 | 10.14 |
The Light-NTWRN algorithm proposed in this paper achieves the highest facial expression recognition accuracy on the FER2013 dataset, an improvement of nearly 2% over ResNet50, the best of the mainstream algorithms. On the CK+ dataset, VGG16 achieves the highest recognition accuracy among the mainstream networks, while the model proposed in this paper improves on VGG16 by 3.26%. The recognition accuracy on the JAFFE dataset reaches 95.21%.
This can further verify the effectiveness of the three improved methods proposed in this paper, which can improve the recognition accuracy. Compared with the other five mainstream algorithms, the Light-NTWRN algorithm proposed in this paper has the highest accuracy and the best algorithm performance in terms of facial expression recognition and has strong generalization.
In terms of model size, the number of parameters of the network proposed in this paper is 10.14 M, the lowest among the compared algorithms, while it still maintains high recognition accuracy, which verifies that the model is both advanced and efficient. It also further verifies the effectiveness of the three improvements proposed in this paper for model lightweighting.
We also compare the proposed method with other existing methods. More advanced existing methods include MANet [33], a model that obtains key region features by adaptively learning weights; Minaee et al. [34], a model that incorporates spatial attention into a convolutional network; WMDCNN [35], a model based on a weighted mixture of two channels of static images; and APRNET50 [36], a model that uses multiscale feature extraction blocks instead of residual units. The comparison is performed on the FER2013, CK+, and JAFFE datasets. As seen in Table 7, the model proposed in this paper achieves the highest accuracy, and the above experiments demonstrate its effectiveness.
Table 7
Recognition rates of various algorithms on the facial expression dataset.
Model | FER2013 (%) | CK+ (%) | JAFFE (%) |
MANet [33] | 69.46 | 96.28 | — |
Minaee [34] | 70.20 | 98.00 | 92.80 |
WMDCNN [35] | — | 98.50 | 92.30 |
APRNET50 [36] | 73.00 | 94.95 | 94.80 |
Light-NTWRN (ours) | 73.21 | 98.72 | 95.21 |
5. Conclusion
This paper proposes a multiscale feature fusion attention lightweight facial expression recognition method that effectively suppresses the influence of irrelevant feature information on the model, alleviates the vanishing gradients caused by too many layers of the neural network, reduces the number of parameters computed by the network, and improves the computational speed of the model. The improved SCAM module focuses on the most relevant feature information to speed up the convergence of the model and improve its performance. The improved random erasing method expands the training set while enhancing the robustness of the model to noise. The grouping bottleneck method reduces the dimensionality of the feature maps while increasing the nonlinear expression capability of the model. In addition, the depthwise separable shuffle method reduces the number of parameters computed by the network while increasing its computational speed. The accuracy of the proposed model (Light-NTWRN) is 73.21% on the FER2013 dataset, 98.72% on the CK+ dataset, and 95.21% on the JAFFE dataset, with a low number of parameters, and the experimental results are better than those of many current mainstream algorithms, showing good effectiveness and robustness. However, recognition accuracy is still not high enough when facial expressions are occluded, and more attention should be given to recognition performance in such cases in the future.
Acknowledgments
The work was supported by the Key Research Project of Science and Technology of Chongqing Education Commission (no. kjzd-k201801901) and Chongqing Postgraduate Research and Innovation Project (CYS22663).
[1] J. Pei, H. Wu, T. Li, Y. Han, "Workspace, stiffness analysis and design optimization of coupled active-passive multilink cable-driven space robots for on-orbit services," Chinese Journal of Aeronautics, DOI: 10.1016/j.cja.2022.03.001, 2022.
[2] K. Hambuchen, J. Marquez, T. Fong, "A review of NASA human-robot interaction in space," Current Robotics Reports, vol. 2 no. 3, pp. 265-272, DOI: 10.1007/s43154-021-00062-5, 2021.
[3] Q. Gao, X. Zhang, W. Pang, "Fast and accurate hand visual detection by using a spatial-channel attention SSD for hand-based space robot teleoperation," International Journal of Aerospace Engineering, vol. 2022,DOI: 10.1155/2022/3396811, 2022.
[4] L. Yingxiao, H. Ju, M. Ping, R. Jiang, "Target localization method of non-cooperative spacecraft on on-orbit service," Chinese Journal of Aeronautics,DOI: 10.1016/j.cja.2022.04.001, 2022.
[5] X. L. Ding, Y. C. Wang, Y. B. Wang, K. Xu, "A review of structures, verification, and calibration technologies of space robotic systems for on-orbit servicing," SCIENCE CHINA Technological Sciences, vol. 64 no. 3, pp. 462-480, DOI: 10.1007/s11431-020-1737-4, 2021.
[6] R. R. Santos, D. A. Rade, I. M. da Fonseca, "A machine learning strategy for optimal path planning of space robotic manipulator in on-orbit servicing," Acta Astronautica, vol. 191, pp. 41-54, DOI: 10.1016/j.actaastro.2021.10.031, 2022.
[7] P. Rousso, S. Samsam, R. Chhabra, "A mission architecture for on-orbit servicing industrialization," 2021 IEEE Aerospace Conference (50100), .
[8] J. Xing, J. Zhong, "MiniExpNet: a small and effective facial expression recognition network based on facial local regions," Neurocomputing, vol. 462, pp. 353-364, DOI: 10.1016/j.neucom.2021.07.079, 2021.
[9] X. Sun, P. Xia, F. Ren, "Multi-attention based deep neural network with hybrid features for dynamic sequential facial expression recognition," Neurocomputing, vol. 444, pp. 378-389, DOI: 10.1016/j.neucom.2019.11.127, 2021.
[10] Y. Wenmeng, X. Hua, "Co-attentive multi-task convolutional neural network for facial expression recognition," Pattern Recognition, vol. 123,DOI: 10.1016/j.patcog.2021.108401, 2022.
[11] C. Shi, C. Tan, L. Wang, "A facial expression recognition method based on a multibranch cross-connection convolutional neural network," IEEE Access, vol. 9, pp. 39255-39274, DOI: 10.1109/ACCESS.2021.3063493, 2021.
[12] Y. Kong, Z. Ren, K. Zhang, S. Zhang, Q. Ni, J. Han, "Lightweight facial expression recognition method based on attention mechanism and key region fusion," Journal of Electronic Imaging, vol. 30 no. 6, article 063002,DOI: 10.1117/1.JEI.30.6.063002, 2021.
[13] N. Zhou, R. Liang, W. Shi, "A lightweight convolutional neural network for real-time facial expression detection," IEEE Access, vol. 9, pp. 5573-5584, DOI: 10.1109/ACCESS.2020.3046715, 2020.
[14] Z. Niu, G. Zhong, H. Yu, "A review on the attention mechanism of deep learning," Neurocomputing, vol. 452, pp. 48-62, DOI: 10.1016/j.neucom.2021.03.091, 2021.
[15] Y. Chen, L. Liu, V. Phonevilay, K. Gu, R. Xia, J. Xie, Q. Zhang, K. Yang, "Image super-resolution reconstruction based on feature map attention mechanism," Applied Intelligence, vol. 51 no. 7, pp. 4367-4380, DOI: 10.1007/s10489-020-02116-1, 2021.
[16] H. Zhang, G. Peng, Z. Wu, J. Gong, D. Xu, H. Shi, "MAM: a multipath attention mechanism for image recognition," IET Image Processing, vol. 16 no. 3, pp. 691-702, DOI: 10.1049/ipr2.12370, 2022.
[17] L. Yao, S. He, K. Su, Q. Shao, "Facial expression recognition based on spatial and channel attention mechanisms," Wireless Personal Communications, vol. 125 no. 2, pp. 1483-1500, DOI: 10.1007/s11277-022-09616-y, 2022.
[18] H. Wang, H. Zhang, "Adaptive target tracking based on channel attention and multi-hierarchical convolutional features," Pattern Analysis and Applications, vol. 25 no. 2, pp. 305-313, DOI: 10.1007/s10044-021-01043-2, 2022.
[19] Z. Qiu, S. I. Becker, A. J. Pegna, "Spatial attention shifting to emotional faces is contingent on awareness and task relevancy," Cortex, vol. 151, pp. 30-48, DOI: 10.1016/j.cortex.2022.02.009, 2022.
[20] C. Chen, D. Gong, H. Wang, Z. Li, K. Y. K. Wong, "Learning spatial attention for face super-resolution," IEEE Transactions on Image Processing, vol. 30, pp. 1219-1231, DOI: 10.1109/TIP.2020.3043093, 2021.
[21] Z. Xue, T. Li, S. T. Peng, C. Y. Zhang, H. C. Zhang, "A data-driven method to predict future bottlenecks in a remanufacturing system with multi-variant uncertainties," Journal of Central South University, vol. 29 no. 1, pp. 129-145, DOI: 10.1007/s11771-022-4906-z, 2022.
[22] S. Panigrahi, U. S. N. Raju, "Pedestrian detection based on hand-crafted features and multi-layer feature fused-Res Net model," International Journal on Artificial Intelligence Tools, vol. 30 no. 5, article 2150028,DOI: 10.1142/S0218213021500287, 2021.
[23] C. Sekhar Vorugunti, V. Pulabaigari, P. Mukherjee, A. Sharma, "DeepFuseOSV: online signature verification using hybrid feature fusion and depthwise separable convolution neural network architecture," IET Biometrics, vol. 9 no. 6, pp. 259-268, DOI: 10.1049/iet-bmt.2020.0032, 2020.
[24] R. F. Rachmadi, S. Nugroho, I. Purnama, "Lightweight residual network for person re-identification," IOP Conference Series Materials Science and Engineering, vol. 1077 no. 1, article 012046,DOI: 10.1088/1757-899X/1077/1/012046, 2021.
[25] Y. Nan, J. Ju, Q. Hua, H. Zhang, B. Wang, "A-MobileNet: an approach of facial expression recognition," Alexandria Engineering Journal, vol. 61 no. 6, pp. 4435-4444, DOI: 10.1016/j.aej.2021.09.066, 2022.
[26] F. Chollet, "Xception: deep learning with depthwise separable convolutions," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251-1258, 2017.
[27] A. I. Mohammed, A. A. Tahir, "A new optimizer for image classification using wide ResNet (WRN)," Academic Journal of Nawroz University, vol. 9 no. 4,DOI: 10.25007/ajnu.v9n4a858, 2020.
[28] S. Zagoruyko, N. Komodakis, "Wide residual networks," 2017. https://arxiv.org/abs/1605.07146
[29] Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang, "Random erasing data augmentation," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34 no. 7, 2020.
[30] S. Woo, J. Park, J. Y. Lee, I. S. Kweon, "Cbam: convolutional block attention module," Proceedings of the European conference on computer vision (ECCV), .
[31] L. Wang, D. He, "Image super-resolution reconstruction algorithm based on channel shuffle," 2021 Asia-Pacific Conference on Communications Technology and Computer Science (ACCTCS), pp. 225-229, DOI: 10.1109/ACCTCS52002.2021.00051, .
[32] X. Y. Zhang, K. Zhao, T. Xiao, M. M. Cheng, M. H. Yang, "Structured sparsification with joint optimization of group convolution and channel shuffle," Uncertainty in Artificial Intelligence. PMLR, pp. 440-450, .
[33] Y. Gan, J. Chen, Z. Yang, L. Xu, "Multiple attention network for facial expression recognition," IEEE Access, vol. 8, pp. 7383-7393, DOI: 10.1109/ACCESS.2020.2963913, 2020.
[34] S. Minaee, M. Minaei, A. Abdolrashidi, "Deep-emotion: facial expression recognition using attentional convolutional network," Sensors, vol. 21 no. 9, DOI: 10.3390/s21093046, 2021.
[35] H. Zhang, B. Huang, G. Tian, "Facial expression recognition based on deep convolution long short-term memory networks of double-channel weighted mixture," Pattern Recognition Letters, vol. 131, pp. 128-134, DOI: 10.1016/j.patrec.2019.12.013, 2020.
[36] C. Jiamin, X. Yang, "Expression recognition based on attention pyramid convolution residual network," Computer Engineering and Applications, pp. 1-11, 2022, http://kns.cnki.net/kcms/detail/11.2127.TP.20210702.1749.004.html.
Abstract
Facial expression recognition based on residual networks is important for technologies related to space human-robot interaction and collaboration but suffers from low accuracy and slow computation in complex network structures. To solve these problems, this paper proposes a multiscale feature fusion attention lightweight wide residual network. The network first uses an improved random erasing method to preprocess facial expression images, which improves the generalizability of the model. A modified depthwise separable convolution in the feature extraction network reduces the computation associated with the network parameters and enhances the characterization of the extracted features through a channel shuffle operation. Then, an improved bottleneck block is used to reduce the dimensionality of the upper-layer network feature maps, further reducing the number of network parameters while enhancing the feature extraction capability of the network. Finally, an optimized multiscale feature lightweight attention mechanism module is embedded to further improve the ability of the network to extract human facial expression features. The experimental results show that the accuracy of the model is 73.21%, 98.72%, and 95.21% on FER2013, CK+, and JAFFE, respectively, with 10.14 M parameters. Compared with other networks, the model proposed in this paper achieves faster computing speed and better accuracy at the same time.