Abstract
To improve the accuracy and precision of gesture recognition, this study improves YOLOv5 by incorporating a coordinate attention mechanism and a bidirectional feature pyramid network, and constructs a static gesture recognition model on this basis. In addition, a multimodal inter-frame motion attention weight module is introduced to enhance the model’s ability to recognize dynamic gestures. In the performance evaluation experiments, the proposed model achieves an area under the receiver operating characteristic curve of 0.94, an F1 score of 96.4%, and an intersection over union of 0.9. The accuracy of static gesture recognition reaches 100%, while the average accuracy of dynamic gesture recognition reaches 95.7%, significantly outperforming the comparison models. These results demonstrate that the proposed gesture recognition model offers high accuracy for static gestures and reliable recognition performance for dynamic gestures. This approach provides a potential method and perspective for improving human–computer interaction in virtual reality and intelligent assistance scenarios.
Introduction
With the rapid development of information technology, gestures have attracted increasing attention as one of the most intuitive and natural forms of human–computer interaction [1]. Early gesture recognition technologies relied mainly on wearable devices, which are inconvenient to use and expensive to produce, severely limiting their development [2]. The rise of computer vision has given rise to vision-based gesture recognition methods [3]. However, the complexity of real-world environments makes feature extraction difficult and reduces the accuracy of dynamic gesture recognition. There is an urgent need for a method that can recognize dynamic gestures both efficiently and precisely. Multimodal technology enables the integration of multiple data sources. In gesture recognition, it can capture distance and spatial structure information separately, making full use of the complementary advantages of different modalities [4]. Inter-frame motion carries rich trajectory information of hand gestures. Analyzing the motion between consecutive frames helps accurately locate the hand movement region and reduce environmental interference [5]. Attention mechanisms enhance the ability to extract critical information, while sharing attention weights enables better representation of features relevant to gesture recognition [6]. This study designs a multimodal inter-frame motion attention weight module and, based on it, constructs a dynamic gesture recognition model using a deep learning network to achieve efficient and accurate recognition. The key innovation lies in the proposed module, which fuses multimodal inter-frame motion features from Red Green Blue (RGB) images and depth images through shared attention weights. By integrating this module, the static gesture recognition model based on YOLOv5 is further strengthened in recognizing dynamic gestures. The model is expected to provide new insights for the intelligent development of human–computer interaction.
Related works
Inter-frame motion has been a core concept in dynamic visual analysis and has attracted attention from both domestic and international researchers. To select motion pixels relevant to the target in videos, Srinivas and Ganivada developed an improved inter-frame difference method. This method enhanced the correlation of target-related pixels through a motion feature matrix, and simulation results showed higher accuracy in recognizing related pixels [7]. Zhong et al. aimed to improve the accuracy of detecting forged videos by introducing a fast inter/intra-frame forgery detection method. By applying sparse feature extraction and matching, they accelerated the algorithm for determining video authenticity. Experimental results demonstrated improved detection accuracy and robustness [8]. Since variations in inter-frame motion are often subtle, attention mechanisms are essential for enhancing feature extraction, and many researchers have explored their application. For instance, Arepalli and Naik proposed an intelligent monitoring system for dissolved oxygen in water. The system used a lightweight spatially shared attention long short-term memory model to predict hypoxic conditions and achieved a recognition accuracy of 99.8% in real-world use [9]. To detect surface temperature changes more precisely, Suleman and Shridevi introduced a spatial attention long short-term memory model. This model learned key time steps and weather variables simultaneously to capture multiple meteorological features for temperature prediction; compared to baseline models, it maintained state-of-the-art prediction accuracy [10]. Zhou et al. addressed the limitations of convolutional neural networks in extracting features from medical images. They developed a spiking cortical model based on global and local attention modules. By applying weighted summation across components, the model emphasized key features in medical images and significantly improved recognition accuracy [11].
Attention mechanisms have also been widely used in gesture recognition to enhance feature extraction. For example, Ojeda-Castelo et al. proposed a gesture recognition model that combined deep neural networks and attention mechanisms for touchless gesture interaction, aiming to replace keyboard and mouse-based operations. The model demonstrated robust and reliable gesture recognition with low response time [12]. To improve the accuracy of gesture recognition systems, Al Farid et al. proposed a method that considered multi-text interpretation and non-rigid hand features. Compared to existing studies, this method raised recognition accuracy to 97%, showing strong practical value [13]. Jin et al. developed a convolutional neural network-based approach for dynamic gesture recognition using millimeter-wave radar. By reconstructing frequency-modulated continuous waves into three-dimensional data blocks and extracting their features through convolutional neural networks, the model achieved 96% accuracy under randomly disturbed scenarios [14]. To enhance the robustness of gesture recognition, Sun et al. proposed a multi-level feature fusion algorithm based on a two-stream convolutional neural network. They built a gesture database from sensor images and used it to improve tracking and recognition precision. Experiments on multiple datasets confirmed that the method outperformed others in recognition accuracy [15]. To address the poor performance of current deepfake detection methods on unknown forged images, which stems from their failure to account for spatial receptive fields and local representation learning, Guo et al. proposed a spatial kernel selection and halo attention network for deepfake detection. Comparative experiments on three public datasets showed that the method outperformed state-of-the-art methods in both intra-dataset and cross-dataset testing [16].
In summary, although many existing studies have improved gesture recognition accuracy through neural networks, challenges remain in dynamic gesture recognition, particularly regarding precision and vulnerability to environmental interference. Therefore, this study designs a multimodal inter-frame motion attention weight module and integrates it into a YOLOv5-based static gesture recognition model to form a dynamic gesture recognition system. The model fuses inter-frame motion features from RGB and depth images and applies a shared attention weight mechanism to enhance adaptability across modalities. It is expected to improve the accuracy and precision of dynamic gesture recognition and facilitate human–computer interaction.
Gesture recognition model based on inter-frame motion and shared attention weights
Design of gesture recognition method based on YOLOv5
Gesture recognition technology provides various conveniences for smart applications. YOLOv5 is widely applied in gesture recognition because its precise object detection and localization allow it to identify gesture features efficiently. However, YOLOv5 still lacks sensitivity to subtle hand movements and posture changes, and its accuracy in recognizing complex gestures such as grasping and pinching remains relatively low, so its ability to extract multi-scale fine-grained features needs to be improved [17]. Coordinate Attention (CA) enhances performance in tasks such as object detection and semantic segmentation by increasing the model’s sensitivity to the spatial positions and directional features of targets [18]. This study therefore uses CA to optimize feature extraction in the backbone of YOLOv5. The operation process of CA is shown in Fig. 1.
Fig. 1. Operation flow of CA
As can be seen from Fig. 1, unlike traditional attention mechanisms, CA decomposes the spatial dimension into two orthogonal directions (horizontal and vertical) and performs global pooling along each direction. This process generates a pair of direction-aware feature maps while retaining precise position information. Specifically, for the input feature map, the CA mechanism first performs average pooling along the horizontal and vertical directions to generate two sets of intermediate feature maps. These intermediate features are concatenated and passed through a shared 1 × 1 convolution, batch normalization, and a nonlinear activation function to produce a fused feature map. The fused features are then split along the spatial dimension into two tensors, and a 1 × 1 convolution is applied to each to generate attention weights in the horizontal and vertical directions. Finally, the original input feature map is multiplied by the generated attention maps to obtain an enhanced representation of spatially salient features. The outputs of the directional pooling over the height and width of each channel are calculated as shown in Eq. (1).
$$z_{c}^{h}(h)=\frac{1}{W}\sum_{0\le i<W}x_{c}(h,i),\qquad z_{c}^{w}(w)=\frac{1}{H}\sum_{0\le j<H}x_{c}(j,w)\tag{1}$$
In Eq. (1), $W$ and $H$ represent the width and height of the feature map, $c$ is the channel index, and $i$ and $j$ are the horizontal and vertical coordinates of the input features in that channel. Equation (1) captures long-range dependencies along the two spatial directions, thereby enhancing the discriminative ability of gesture features. The 3D convolution operation for the input feature maps is expressed in Eq. (2).
2
Equation (2) is parameterized by the size of the input feature map and the convolution stride. Although CA significantly improves the ability of YOLOv5 to capture gesture features, it still has limitations in transmitting semantic information and preserving detail during multi-scale feature fusion. A single attention mechanism is not sufficient to balance high-level semantic information with low-level spatial details. The Bidirectional Feature Pyramid Network (BiFPN) addresses this issue by integrating features more efficiently across levels through bidirectional cross-scale connections and weighted feature fusion [19]. The structure of BiFPN is shown in Fig. 2.
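Before moving on to BiFPN, the CA block described above can be illustrated with a minimal PyTorch-style sketch (directional pooling, a shared 1 × 1 convolution, and per-direction attention weights). The module name, reduction ratio, and activation choice are our own assumptions rather than the paper’s exact implementation.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Minimal coordinate-attention sketch (illustrative, not the paper's code)."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool along width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool along height -> (B, C, 1, W)
        self.shared = nn.Sequential(                   # shared 1x1 conv + BN + nonlinearity
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.Hardswish(),
        )
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # height-direction weights
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # width-direction weights

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                       # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)   # (B, C, W, 1)
        y = self.shared(torch.cat([x_h, x_w], dim=2))   # concatenate along the spatial dim
        y_h, y_w = torch.split(y, [h, w], dim=2)        # split back into two tensors
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w                        # re-weight the input feature map

# Example: attention over a 64-channel feature map
feat = torch.randn(1, 64, 40, 40)
print(CoordAttention(64)(feat).shape)  # torch.Size([1, 64, 40, 40])
```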
Fig. 2. Structure of BiFPN
As shown in Fig. 2, BiFPN adds a bottom-up reverse path to enable bidirectional information exchange, making it easier to fuse low-level detail features with high-level semantic features. Skip connections between feature levels and bidirectional flow between adjacent levels are also supported. Where input and output nodes are at the same level, an extra edge (marked in green) is added to promote feature fusion. The fusion process can be divided into four steps. First, high-level semantic features are upsampled and fused with low-level features of the corresponding scale to enhance the expression of detailed information. Second, low-level detail features are downsampled and fused with high-level features of the corresponding scale to enhance semantic consistency. Third, the lateral connections of the traditional FPN are preserved and strengthened, which avoids vanishing gradients and retains more of the original feature information. Fourth, learnable weights are applied to different input features so that the contribution of each feature is adjusted automatically during fusion. The computation with the additional edge is shown in Eq. (3).
$$P_i^{td}=\mathrm{Conv}\!\left(\frac{w_1\,P_i^{in}+w_2\,\mathrm{Resize}\!\left(P_{i+1}^{in}\right)}{w_1+w_2+\epsilon}\right)\tag{3}$$
In Eq. (3), $\epsilon$ is a small coefficient that ensures stable fusion, and $w_1$ and $w_2$ are the learnable weight parameters of features with different levels of importance; $\mathrm{Resize}(\cdot)$ here denotes upsampling to the corresponding scale. The final output at level $i$ is calculated as shown in Eq. (4).
$$P_i^{out}=\mathrm{Conv}\!\left(\frac{w_1'\,P_i^{in}+w_2'\,P_i^{td}+w_3'\,\mathrm{Resize}\!\left(P_{i-1}^{out}\right)}{w_1'+w_2'+w_3'+\epsilon}\right)\tag{4}$$
In Eq. (4), $\mathrm{Resize}(\cdot)$ indicates the downsampling pooling operation used for feature fusion, and $w_1'$, $w_2'$, and $w_3'$ are the corresponding fusion weights. The weight normalization in BiFPN is expressed in Eq. (5).
$$w_i'=\frac{w_i}{\epsilon+\sum_{j=1}^{n}w_j}\tag{5}$$
In Eq. (5), $n$ represents the number of feature inputs involved in the multi-scale fusion. While CA enhances the model’s ability to capture key gesture features, BiFPN improves multi-scale target detection. Therefore, this study improves the YOLOv5 network with CA and BiFPN, names the modified network AB-YOLO, and builds a gesture recognition method on it. The structure of the AB-YOLO gesture recognition method is shown in Fig. 3.
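As a concrete illustration of the weighted fusion in Eqs. (3)–(5), the following minimal sketch implements fast normalized fusion with learnable non-negative weights. The class name and the two-input example are illustrative assumptions rather than the exact BiFPN implementation used in AB-YOLO.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion: learnable weights normalized by their sum plus epsilon."""
    def __init__(self, num_inputs, epsilon=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # one learnable weight per input
        self.epsilon = epsilon

    def forward(self, features):
        w = F.relu(self.weights)                 # keep weights non-negative
        w = w / (w.sum() + self.epsilon)         # normalization as in Eq. (5)
        return sum(wi * fi for wi, fi in zip(w, features))

# Example: fuse an upsampled higher-level map with a same-scale lateral map
p_high = torch.randn(1, 64, 20, 20)
p_low = torch.randn(1, 64, 40, 40)
p_high_up = F.interpolate(p_high, scale_factor=2, mode="nearest")
fused = WeightedFusion(num_inputs=2)([p_high_up, p_low])
print(fused.shape)  # torch.Size([1, 64, 40, 40])
```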
Fig. 3. Structure of AB-YOLO gesture recognition model
As shown in Fig. 3, the framework of AB-YOLO is similar to that of the original YOLOv5. The main improvements are the addition of the CA module in the backbone and the BiFPN module in the neck. In YOLOv5, the Spatial Pyramid Pooling-Fast (SPPF) module performs multi-scale pooling on feature maps to adjust the receptive field. Placing the CA module before SPPF allows important information in the feature map to be expressed more fully through channel weight adjustment. The updated BiFPN structure extracts effective spatial location features from shallow layers and key semantic information from deeper layers, enabling multi-scale feature fusion. The computation of depth-wise and standard convolution operations is given in Eq. (6).
$$\frac{k\cdot k\cdot C_{in}+C_{in}\cdot C_{out}}{k\cdot k\cdot C_{in}\cdot C_{out}}=\frac{1}{C_{out}}+\frac{1}{k^{2}}\tag{6}$$
In Eq. (6), $k$, $C_{in}$, and $C_{out}$ represent the kernel size, the number of input feature map channels, and the number of output channels, respectively. The loss function of the AB-YOLO gesture recognition method is defined in Eq. (7).
$$L_{CIoU}=1-IoU+\frac{\rho^{2}\!\left(b,\,b^{gt}\right)}{c^{2}}+\alpha v\tag{7}$$
In Eq. (7), $\alpha$ is a trade-off parameter that balances the priorities of the loss terms, and $v$ measures aspect-ratio consistency, preventing discrepancies between the sizes of the predicted bounding boxes and the ground truth; $IoU$ is the overlap between the predicted and ground-truth boxes, $\rho(\cdot)$ is the distance between their centers, and $c$ is the diagonal length of the smallest enclosing box.
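As a back-of-the-envelope illustration of the convolution cost comparison in Eq. (6), the following snippet counts the parameters of a standard convolution and a depth-wise separable convolution for generic values of $k$, $C_{in}$, and $C_{out}$; the numbers are illustrative and not taken from the paper.

```python
def standard_conv_params(k, c_in, c_out):
    # Standard convolution: every output channel looks at every input channel
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depth-wise filter per input channel + 1x1 point-wise projection
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)
dws = depthwise_separable_params(k, c_in, c_out)
print(std, dws, round(std / dws, 1))  # 294912 33920 8.7  (~8.7x fewer parameters)
```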
Construction of multimodal gesture recognition model
Although AB-YOLO achieves accurate recognition of static gestures, its single-frame detection mechanism cannot capture the temporal features of gesture dynamics. Inter-frame motion describes the movement of objects or pixels between adjacent frames of a video sequence, which allows the motion of human joints to be analyzed in actions such as running or hand gestures. However, background features between frames are often ignored in gesture recognition tasks [20]. To address this issue, this study proposes a Multimodal Inter-Frame Motion Attention Module (MIMA). MIMA uses the prominent movement of the hand to focus the attention mechanism on relevant search regions. It incorporates multimodal information from RGB and depth images to reduce redundancy and improve the model’s ability to recognize dynamic gesture features. The operation process of MIMA is shown in Fig. 4.
Fig. 4. Structure of MIMA
As shown in Fig. 4, the MIMA module consists of three main steps. First, patch embedding expands the spatial channel dimension and projects the frame sequence into a higher-dimensional space. Then, convolutional modules and linear layers extract local features from the motion between adjacent frames. Finally, a shared attention weight mechanism reduces redundant environmental information and simplifies the computation. The tensor reshaping operation in the feature extractor is calculated in Eq. (8).
8
Equation (8) operates on feature sets from adjacent regions of the previous and current frames and extracts local features from the inter-frame motion through convolutional modules and linear layers. A shortcut connection is used to reduce degradation, as shown in Eq. (9).
$$X_{out}=X_{orig}+X_{enh}\tag{9}$$
In Eq. (9), $X_{orig}$ is the local feature from the original frame, and the shortcut combines it with the enhanced feature $X_{enh}$. The enhanced feature representation produced by the linear layer is calculated in Eq. (10).
$$Y_{1}=W^{\top}X_{1}+b,\qquad Y_{2}=W^{\top}X_{2}+b\tag{10}$$
In Eq. (10), $Y_1$ and $Y_2$ are the outputs after enhancing the features with the linear layer, and $W$, $b$, and $(\cdot)^{\top}$ represent the weight matrix, bias, and transpose of the linear layer, respectively. To enable the exchange of weight information between modalities during training and improve recognition precision, this study proposes a mechanism called Share Attention Weights Among Modules (SAWAM). SAWAM allows each modality to adjust its attention weights using the weights of the other modality, which reduces redundant environmental information and simplifies the computation. The operation flow of SAWAM is shown in Fig. 5.
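Before describing SAWAM in detail, the MIMA feature-extraction steps above can be sketched as follows: patch embedding, local inter-frame feature extraction with a convolution and a linear layer, and a shortcut connection. This is a speculative illustration; all module names, dimensions, and design details are our own assumptions rather than the paper’s implementation.

```python
import torch
import torch.nn as nn

class InterFrameMotionBlock(nn.Module):
    """Speculative MIMA-style block: embeds two adjacent frames and fuses their motion."""
    def __init__(self, in_channels=3, embed_dim=64, patch_size=4):
        super().__init__()
        # Patch embedding: expand channels and reduce spatial resolution
        self.embed = nn.Conv2d(in_channels, embed_dim,
                               kernel_size=patch_size, stride=patch_size)
        # Local feature extractor over the concatenated adjacent-frame features
        self.local = nn.Sequential(
            nn.Conv2d(2 * embed_dim, embed_dim, kernel_size=3, padding=1),
            nn.GELU(),
        )
        self.linear = nn.Linear(embed_dim, embed_dim)  # linear enhancement of features

    def forward(self, frame_prev, frame_curr):
        f_prev = self.embed(frame_prev)                     # (B, D, H', W')
        f_curr = self.embed(frame_curr)
        motion = self.local(torch.cat([f_prev, f_curr], dim=1))
        # Linear layer applied on the channel dimension, then a shortcut connection
        enhanced = self.linear(motion.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return f_curr + enhanced

rgb_prev = torch.randn(1, 3, 64, 64)
rgb_curr = torch.randn(1, 3, 64, 64)
print(InterFrameMotionBlock()(rgb_prev, rgb_curr).shape)  # torch.Size([1, 64, 16, 16])
```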
Fig. 5. Operation flow of SAWAM (icon source: https://www.aigei.com/)
As shown in Fig. 5, SAWAM mainly includes three processes: average pooling, Hadamard product, and squared difference of weights. The pooling operation obtains representative local weights of a single modality through the weight matrix, which are shared with other modalities. The Hadamard product performs element-wise multiplication on matrices of the same dimension, producing representative weights that reflect the influence across modalities. The average pooling calculation for locally shared representative weights is shown in Eq. (11).
$$\bar{W}_{rgb}=\mathrm{AvgPool}\!\left(W_{rgb}\right),\qquad \bar{W}_{d}=\mathrm{AvgPool}\!\left(W_{d}\right)\tag{11}$$
In Eq. (11), $W_{rgb}$ and $W_{d}$ are the attention weights of the RGB and depth modalities output by MIMA. To ensure that the shared weights represent both modalities, the fusion by the Hadamard product is calculated in Eq. (12).
$$W_{s}=\bar{W}_{rgb}\odot\bar{W}_{d}\tag{12}$$
In Eq. (12), $W_{s}$ is the effective shared weight. To reduce the impact of incorrect weights, the distance between the shared weights is calculated as shown in Eq. (13).
$$D=\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(\bar{W}_{rgb}(i,j)-\bar{W}_{d}(i,j)\right)^{2}\tag{13}$$
In Eq. (13), $\bar{W}_{rgb}(i,j)$ and $\bar{W}_{d}(i,j)$ are the effective weights at a given position, and $M$ and $N$ represent the numbers of rows and columns. The purpose of sharing attention weights is to exchange weight information between different modalities and thereby improve recognition accuracy. In this way, each modality can adjust its attention weights using the weights of the other modality, reducing redundant environmental information and simplifying the computation. Consider the two modalities used here, RGB images and depth images. During training, the MIMA module generates a set of attention weights for each modality, and the SAWAM mechanism then shares and fuses these weights between modalities. For example, if a region of an RGB image is important for gesture recognition, the high attention weight of that region is shared with the depth modality to help it focus on the same region. The model can thus recognize gestures more accurately because it combines information from both modalities. The study also aims to maintain a balance between shared attention weights and modality-specific information to avoid overfitting. First, SAWAM does not directly replace the original modality weights with the shared weights; instead, it uses Eq. (12) for weighted fusion, which preserves the original feature distribution of each modality and ensures that modality-specific information is not lost. Second, the attention weights of the RGB and depth modalities are computed independently before fusion, preserving each modality’s characteristic feature responses; the shared weight is obtained by pooling and aggregating the bimodal information, but it enters the fusion only as complementary information rather than as the dominant signal. Third, Eq. (13) explicitly constrains the degree of difference between the cross-modal weights, preventing one modality’s weights from excessively influencing the other and thus maintaining the independence and diversity of the modalities. Fourth, the pooling, Hadamard product, and other operations in SAWAM are trainable, so the model can dynamically adjust the ratio of shared to modality-specific information according to the task, achieving an adaptive balance. AB-YOLO is not efficient at recognizing gesture variation across multiple frames. Therefore, this study integrates MIMA and SAWAM into AB-YOLO and constructs a gesture recognition model named FA-YOLO to distinguish and identify dynamic gestures. The structure of FA-YOLO is shown in Fig. 6.
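A speculative sketch of the SAWAM steps described above (average pooling of each modality’s attention weights, Hadamard-product fusion, and a squared-difference term between the modality weights) is given below; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def share_attention_weights(w_rgb, w_depth, kernel=2):
    """w_rgb, w_depth: attention weight maps of shape (B, 1, H, W)."""
    pool = torch.nn.AvgPool2d(kernel)
    p_rgb, p_depth = pool(w_rgb), pool(w_depth)     # representative local weights, Eq. (11)
    shared = p_rgb * p_depth                        # Hadamard-product fusion, Eq. (12)
    # Mean squared difference between the modality weights, in the spirit of Eq. (13)
    divergence = ((p_rgb - p_depth) ** 2).mean()
    return shared, divergence

w_rgb = torch.rand(1, 1, 8, 8)
w_depth = torch.rand(1, 1, 8, 8)
shared, div = share_attention_weights(w_rgb, w_depth)
print(shared.shape, div.item())  # torch.Size([1, 1, 4, 4]) ...
```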
Fig. 6. Structure of FA-YOLO gesture recognition model (icon source: https://www.aigei.com/)
As shown in Fig. 6, FA-YOLO consists of five main components: parameter tuning, CA-based feature extraction enhancement, BiFPN-based multi-scale fusion enhancement, MIMA for inter-frame motion attention, and system training and deployment. Parameter tuning is essential; setting appropriate learning rates and weight decay coefficients improves training efficiency and recognition accuracy. The CA module in the backbone extracts multi-level features from the input image. BiFPN strengthens the fusion of small-scale features. MIMA improves the model's overall ability to dynamically recognize gesture features. The classification loss uses cross-entropy, calculated in Eq. (14).
$$L_{cls}=-\sum_{i}y_{i}\log\hat{y}_{i}\tag{14}$$
In Eq. (14), $y_i$ is the manually defined label and $\hat{y}_i$ is the model’s predicted probability for that label. The summation of the loss terms with shared weights across modalities in SAWAM is calculated in Eq. (15).
$$L_{total}=\sum_{m\in\{rgb,\,d\}}L_{cls}^{(m)}+\lambda D\tag{15}$$
In Eq. (15), $\lambda$ is the proportion assigned to the effective attention-weight loss $D$.
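A hedged sketch of the training objective suggested by Eqs. (14) and (15) is given below: a cross-entropy classification loss plus a weighted term for the shared attention weights. The weighting factor and the exact form of the sharing term are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, weight_divergence, share_ratio=0.1):
    cls_loss = F.cross_entropy(logits, labels)          # classification loss, cf. Eq. (14)
    return cls_loss + share_ratio * weight_divergence   # add shared-weight term, cf. Eq. (15)

logits = torch.randn(4, 10)              # 4 samples, 10 gesture classes
labels = torch.randint(0, 10, (4,))
loss = total_loss(logits, labels, weight_divergence=torch.tensor(0.05))
print(loss.item())
```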
Comprehensive performance evaluation of gesture recognition model
Performance analysis of FA-YOLO for static gesture recognition
To verify the recognition performance of the FA-YOLO model on static gestures, the study compared it with Mobile Network Version 3 (MNV3), Efficient Network Baseline Model B0 (EN-B0), and Vision Transformer (ViT) on the MNIST gesture dataset. The MNIST gesture dataset contains grayscale images of gestures such as waving, heart shape, and thumbs up, each sized 28 × 28 pixels. It comprises 70,000 images, randomly divided into 60,000 for training, 5000 for validation, and 5000 for testing. This dataset was chosen because it effectively eliminates interference from complex backgrounds, lighting changes, and other factors, allowing the model’s ability to extract and recognize the essential features of gestures to be evaluated more clearly. It also facilitates fair and reproducible performance comparisons between models, laying the foundation for further research on more complex datasets. The experiments ran on a Windows 10 system equipped with an RTX 4090 GPU, a 256 GB SSD, and 64 GB of RAM, and all algorithms were implemented in Python 3.7. The experimental parameters were set as follows: the model was trained with the Adam optimizer at an initial learning rate of 0.001, dynamically adjusted with a cosine annealing schedule; the batch size was set to 32; and the model was trained for 100 epochs to ensure sufficient convergence. To improve generalization, data augmentation methods such as random horizontal flipping, brightness and contrast adjustment, scaling, and rotation were used during training to simulate appearance changes in real environments and enhance the model’s adaptability to variations in lighting, posture, and scale. Confusion matrices for the recognition of the thumbs-up and heart gestures by these models are shown in Fig. 7.
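Before turning to the results, the training configuration described above can be summarized in the following minimal sketch (Adam with an initial learning rate of 0.001, cosine annealing, batch size 32, 100 epochs, and the listed augmentations). The placeholder model, transform parameters, and omitted data-loading code are assumptions for illustration only.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # brightness/contrast changes
    transforms.RandomAffine(degrees=15, scale=(0.9, 1.1)),  # rotation and scaling
    transforms.ToTensor(),
])

model = torch.nn.Linear(28 * 28, 10)  # placeholder standing in for the detector
optimizer = Adam(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=100)  # one step per epoch, 100 epochs

for epoch in range(100):
    # ... iterate over the training loader with batch_size=32 and update the model ...
    scheduler.step()
```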
Fig. 7. Confusion matrices of gesture recognition for different models
As shown in Fig. 7a, MNV3 achieved recognition accuracies of 92.8% and 95.8%. In Fig. 7b, ViT reached accuracies of 76.2% and 85.8%, with correctly recognized counts of 32 and 230, respectively. Figure 7c shows that EN-B0 achieved accuracies of 85.7% and 94.7%, with 36 and 251 correct recognitions, slightly better than ViT. Figure 7d shows FA-YOLO achieved 100.0% and 99.6% accuracies with 42 and 264 correct recognitions, significantly outperforming the other models. These results demonstrated that FA-YOLO achieved superior static gesture recognition accuracy compared to several market-leading models. To further verify model performance, the study also compared Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) values of the models, shown in Fig. 8.
Fig. 8. ROC curves and AUC for different models
In Fig. 8a, FA-YOLO’s ROC curve clearly bowed toward the upper left corner, with an AUC of 0.94 and a true positive rate of 0.9 at a false positive rate of 0.4. Figure 8b showed that ViT’s ROC curve bowed slightly toward the upper left but achieved lower true positive rates below a false positive rate of 0.4, with an AUC of 0.82. Figure 8c displayed MNV3’s ROC curve, which bowed somewhat toward the upper left with an AUC of 0.84 and reached a true positive rate of 0.80 at a false positive rate of 0.4. In Fig. 8d, EN-B0 had an AUC of 0.83 and showed weaker true positive detection. These findings confirmed FA-YOLO’s superior static gesture recognition, with the largest ROC AUC among the compared models. The study also analyzed the F1 score (the harmonic mean of precision and recall) and training loss for all models, as shown in Fig. 9.
Fig. 9. F1 scores and loss rates over iterations
Figure 9a showed that FA-YOLO’s F1 score stabilized above 95.6% after 20 iterations, significantly higher than MNV3, EN-B0, and ViT at 91.2%, 86.8%, and 82.7%, respectively. FA-YOLO captured true positives better and handled clustered data features more effectively than the other models. In Fig. 9b, losses were initially high for all models but decreased rapidly before iteration 10. After 30 iterations, FA-YOLO’s loss stabilized while the other models did not converge well. These results indicated that FA-YOLO converged faster with fewer iterations, showing better stability and recognition accuracy. The above results are likely because, for gestures with high intra-class variation and low inter-class variation, the proposed method maintains discriminative ability through a multimodal complementary mechanism whose recognition relies on the synergy of spatial appearance, temporal dynamics, and depth features. The CA attention mechanism enhances the perception of subtle spatial structures such as fingertip orientation and finger spacing. The RGB modality provides high-resolution appearance information, while the depth modality provides spatial structure information that is unaffected by lighting and texture interference; together they distinguish similar gestures by spatial form. The MIMA module extracts temporal dynamic features by analyzing the trajectory, velocity, and direction of hand movements across consecutive frames; even when static postures are similar, their motion patterns may differ, and temporal modeling exploits this dynamic discriminative information. The SAWAM mechanism dynamically adjusts the contributions of the RGB and depth modalities by sharing attention weights, so for similar gestures that are difficult to distinguish, the model can adaptively emphasize the more discriminative modality and achieve robust recognition.
Performance analysis of FA-YOLO for dynamic gesture recognition
After confirming FA-YOLO’s performance on static gestures, the study examined how multimodal inter-frame motion attention improves dynamic gesture recognition. FA-YOLO was compared with the Two-Stream Convolutional Neural Network (Two-Stream CNN), the Skeleton Point Graph Neural Network (SPGNN), and a Recurrent Neural Network (RNN) on the NYU gesture dataset. These models have had a broad impact on dynamic recognition tasks and offer good comparability and reference value. Models based on spatiotemporal transformers, by contrast, require substantial computing resources and rely on large-scale training data; in practical scenarios such as embedded systems and real-time interaction, they have yet to balance accuracy and efficiency, so they were not included in the comparison. The NYU dataset includes RGB and depth images suitable for multimodal recognition. The data of different subjects are strictly divided into training, validation, and test sets on the principle of subject independence, ensuring that the evaluation reflects generalization to new users: the training set contains all gesture sequences from the training subjects, while the validation and test sets each contain data from different subjects. In addition, during training and testing, the acquisition and synchronization of RGB images and depth data rely on the aligned multimodal data provided by the NYU gesture dataset, which is collected synchronously with Microsoft Kinect sensors and uses infrared structured-light technology to generate depth information in real time that matches the RGB frame rate and is spatially registered to the RGB images. During training, the system automatically matches each RGB frame with its corresponding depth map by timestamp, ensuring strict temporal and spatial alignment of the bimodal data; during testing, the same paired input is used, and multimodal feature fusion and recognition are performed through forward propagation. No additional manual alignment is needed throughout the process; hardware-level synchronization and calibration parameters ensure the consistency of the multimodal data. The recognition accuracies of ten dynamic gestures for each model are listed in Table 1.
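Before turning to Table 1, the timestamp-based RGB–depth pairing described above can be illustrated conceptually with the following sketch, which matches each RGB frame to the depth frame with the nearest timestamp. Since the NYU data are already hardware-synchronized and registered, the timestamps and structures here are invented purely for illustration.

```python
from bisect import bisect_left

def pair_by_timestamp(rgb_stamps, depth_stamps):
    """Return, for each RGB timestamp, the index of the closest depth timestamp."""
    pairs = []
    for t in rgb_stamps:
        i = bisect_left(depth_stamps, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(depth_stamps)]
        pairs.append(min(candidates, key=lambda j: abs(depth_stamps[j] - t)))
    return pairs

rgb_stamps = [0.000, 0.033, 0.066, 0.100]
depth_stamps = [0.001, 0.034, 0.065, 0.099]
print(pair_by_timestamp(rgb_stamps, depth_stamps))  # [0, 1, 2, 3]
```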
Table 1. Dynamic gesture recognition accuracy (detection rate, %) of different models
Types of gestures | Number of gestures | RNN | Two-Stream CNN | SPGNN | FA-YOLO |
|---|---|---|---|---|---|
Call | 40 | 87.5 | 90.4 | 92.5 | 94.6 |
Dislike | 40 | 87.5 | 92.3 | 93.4 | 98.7 |
Fist | 40 | 92.3 | 86.4 | 94.5 | 93.1 |
Four | 40 | 88.6 | 90.0 | 92.4 | 96.6 |
Like | 40 | 90.0 | 88.9 | 91.6 | 89.8 |
Ok | 40 | 92.3 | 85.4 | 93.4 | 93.4 |
One | 40 | 80.5 | 92.6 | 92.5 | 96.7 |
Palm | 40 | 92.5 | 94.3 | 91.4 | 96.3 |
Peace | 40 | 87.5 | 91.6 | 92.4 | 99.8 |
Mute | 40 | 82.5 | 88.8 | 93.3 | 98.4 |
According to Table 1, each gesture group contained 40 samples. The lowest accuracy in the table was RNN’s 80.5% on the One gesture, while FA-YOLO reached the highest accuracy of 99.8% on the Peace gesture. The average accuracies of RNN, Two-Stream CNN, SPGNN, and FA-YOLO were 88.1%, 90.0%, 92.7%, and 95.7%, respectively. FA-YOLO’s accuracies on the Fist, Like, and Ok gestures were relatively lower at 93.1%, 89.8%, and 93.4%, while those on Dislike, Peace, and Mute were higher at 98.7%, 99.8%, and 98.4%. These results indicated that FA-YOLO achieved a high average accuracy of 95.7% on dynamic gestures, outperforming the comparison models in recognition performance and generalization. This demonstrates that the multimodal fusion architecture and attention mechanism of the proposed method can learn robust, subject-independent gesture features rather than overfitting the motion patterns of specific users. The study also performed cluster analysis on the recognition of four complex dynamic gestures: Call, Dislike, Four, and Mute, as shown in Fig. 10.
Fig. 10. Recognition clustering for dynamic gestures
In Fig. 10a, SPGNN showed good recognition with clear clusters for Call and Dislike but poor recognition of the other two gestures. Figure 10b showed that Two-Stream CNN distinctly separated only Call, with inaccurate recognition of the others. Figure 10c revealed no clear clusters, indicating poor recognition. Figure 10d showed that FA-YOLO formed clear clusters for all four gestures, demonstrating superior recognition accuracy. This further confirms that FA-YOLO forms clear and compact clusters of gesture samples from unknown subjects in the feature space, with distinct boundaries between classes, indicating that the model extracts feature representations with high discriminability and generalization ability. The study further investigated the Intersection over Union (IoU) between predicted and true bounding boxes and the change in dynamic gesture recognition accuracy over iterations, as shown in Fig. 11.
Fig. 11. Bounding box overlap and recognition accuracy changes
Figure 11a showed that FA-YOLO’s predicted bounding box curve closely followed the true bounding box curve, with an IoU of 0.9. The Two-Stream CNN and RNN curves were more dispersed, showing poor bounding box fits at 2 mm, 6 mm, and 8 mm with IoUs of 0.6 and 0.5, respectively; in particular, RNN deviated by more than 50% in fitted length at 6 mm. SPGNN had an IoU of 0.7, showing better overlap. In Fig. 11b, after 10 iterations FA-YOLO achieved the highest recognition accuracy at 87.2%. As iterations increased, FA-YOLO’s accuracy reached 100%, while SPGNN, Two-Stream CNN, and RNN reached only 95.3%, 92.1%, and 90.0% after 50 iterations. These findings showed that FA-YOLO achieved better overlap on small targets and fewer missed detections, reaching high recognition performance with fewer iterations.
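For reference, the IoU metric reported above can be computed for axis-aligned boxes as in the following minimal sketch.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter + 1e-9)

# Example: a predicted box that largely overlaps the ground-truth box
print(round(iou((10, 10, 50, 50), (12, 12, 52, 52)), 2))  # ~0.82
```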
Analysis of application results based on FA-YOLO
To further investigate the performance of the proposed method in the real world, it was tested on the more challenging EgoHands dataset, which contains first-person gesture videos covering complex backgrounds, occlusions, and lighting changes. The dataset was divided into training, test, and validation sets at a ratio of 7:2:1, and five repeated experiments were evaluated with paired t-tests. First, ablation experiments were conducted to test the contribution of the different modules to performance. For devices without depth sensors, SAWAM and the depth branch were removed and only RGB-stream processing was retained; in this configuration, MIMA extracts RGB temporal features instead of performing cross-modal fusion, and enhanced temporal attention weights compensate for the missing depth information. The results are shown in Table 2.
Table 2. Results of the ablation experiments
Model | Accuracy (%) | Parameters (M) | FLOPs (G) | Latency (ms) |
|---|---|---|---|---|
Baseline (YOLOv5s) | 76.2 ± 0.8 | 7.2 | 16.5 | 12.3 |
+ CA | 81.5 ± 0.6* | 0.4 | 0.8 | 1.2 |
+ BiFPN | 85.3 ± 0.7* | 1.1 | 1.6 | 2.1 |
+ MIMA (RGB only) | 88.9 ± 0.5* | 2.3 | 3.2 | 4.5 |
+ MIMA + SAWAM (Full) | 92.7 ± 0.4* | 3.1 | 4.7 | 6.8 |
“*” indicates p < 0.05 compared with the configuration in the previous row
According to Table 2, the CA module enhances spatial position perception, yielding a 5.3% improvement in accuracy with relatively low computational overhead. BiFPN further improves accuracy by 3.8% through multi-scale feature fusion but introduces more parameters and FLOPs. The MIMA module (RGB only) brings a 3.6% performance gain by exploiting temporal information, at the cost of noticeably higher latency. The multimodal fusion of SAWAM contributes a further 3.8% to the final performance, although its additional computational cost is comparatively high. The complete FA-YOLO model adds 3.1 M parameters and 4.7 G FLOPs over the baseline, and single-frame inference latency increases by 6.8 ms; its measured latency on a Jetson Xavier NX is 38.2 ms (about 26 FPS), which essentially meets real-time requirements. Although every added module increases computational overhead, the performance gains are statistically significant (p < 0.05), demonstrating the effectiveness of each module in improving the model’s recognition ability. To comprehensively evaluate the generalization ability and robustness of FA-YOLO in real, complex environments, system testing was conducted on EgoHands, focusing on the model’s ability to maintain performance under lighting changes, partial occlusion, and background interference. The results are shown in Table 3.
Table 3. Robustness (accuracy, %) of different methods in real-world scenarios
Test conditions | YOLOv5s | Two-Stream CNN | SPGNN | FA-YOLO |
|---|---|---|---|---|
Normal conditions | 76.2 | 83.1 | 89.5 | 92.7 |
Strong light/weak light | 62.4 | 70.8 | 75.2 | 91.1 |
Partial occlusion | 58.7 | 65.3 | 71.6 | 89.9 |
Complex dynamic background | 60.1 | 68.5 | 74.3 | 87.2 |
Comprehensive challenge scenario | 55.3 | 62.7 | 68.4 | 86.5 |
Table 3 shows that under normal test conditions, FA-YOLO achieved an accuracy of 92.7%, significantly better than the other models. Under strong or weak lighting, its accuracy drops to 91.1%; despite the decrease, its performance retention rate of 91.8% is the highest, significantly better than YOLOv5s and Two-Stream CNN, indicating that the coordinate attention mechanism and multimodal fusion effectively mitigate interference from lighting changes. When the gesture is partially occluded, the multimodal FA-YOLO still maintains an accuracy of 89.9%, better than SPGNN and Two-Stream CNN, thanks to MIMA’s use of temporal motion information and SAWAM’s enhancement of effective features. Under complex dynamic background interference, FA-YOLO’s accuracy is 87.2%, the most robust performance, demonstrating the advantages of BiFPN multi-scale fusion and the attention mechanisms in suppressing background noise. In comprehensive scenarios that combine multiple challenges, the multimodal FA-YOLO still achieves 86.5% accuracy, clearly leading the other models and demonstrating strong overall robustness.
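For reference, the paired t-test evaluation mentioned for Table 2 can be reproduced with a sketch like the following, where the per-run accuracies are invented for illustration only.

```python
from scipy import stats

# Accuracies from five repeated runs of two configurations (illustrative values)
acc_with_module = [92.3, 92.9, 92.5, 93.0, 92.8]     # e.g., + MIMA + SAWAM (Full)
acc_without_module = [88.5, 89.2, 88.7, 89.1, 89.0]  # e.g., + MIMA (RGB only)

t_stat, p_value = stats.ttest_rel(acc_with_module, acc_without_module)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 -> statistically significant gain
```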
Conclusion
To improve the accuracy and precision of dynamic gesture recognition, this study optimized the YOLOv5 network with Coordinate Attention and a Bidirectional Feature Pyramid Network to enhance feature extraction and multi-scale fusion. A Multimodal Inter-Frame Motion Attention Module was also proposed to strengthen YOLOv5’s ability to recognize multi-frame motion gestures, and a dynamic gesture recognition model was constructed on the improved network. In the static gesture recognition experiments, FA-YOLO achieved a recognition accuracy of 100.0% for the thumbs-up gesture, its ROC curve bowed clearly toward the upper left with an AUC of 0.94, and the F1 score reached 96.4%, all significantly outperforming the comparison models. In the dynamic gesture recognition experiments, the average recognition accuracy of FA-YOLO reached 95.7%; for complex dynamic gestures such as Dislike, Peace, and Mute, the recognition accuracies were 98.7%, 99.8%, and 98.4%, respectively. In the bounding box overlap experiment, the Intersection over Union reached 0.9, notably higher than the 0.6 and 0.5 achieved by the Two-Stream CNN and RNN models, indicating better alignment with small targets in the images. These results demonstrate that the proposed FA-YOLO model achieves high accuracy in static gesture recognition and excellent recognition performance and precision for multimodal motion images, significantly outperforming existing models. Although the effectiveness of FA-YOLO has been verified experimentally, the study did not explicitly model the negative impact of lighting variations on gesture recognition. Future work will introduce synthetic lighting changes into the training data, such as brightness adjustment, contrast perturbation, and simulated shadows, to enhance robustness to lighting conditions. Multimodal data such as infrared or thermal imaging can also be integrated to reduce the model’s dependence on visible-light illumination. In addition, lighting-invariance constraints can be added to the loss function, and domain-adaptive cross-environment generalization methods can be explored to improve the model’s adaptability and recognition stability in complex lighting scenarios.
Acknowledgements
The authors have no acknowledgments to report.
Author’s contribution
Q.L. independently completed the article and made a significant contribution to it.
Funding
No funding was received.
Data availability
Data are provided within the manuscript.
Declarations
Conflict of interest
The authors declare no competing interests.
Ethics, consent to participate, and consent to publish
Not applicable.
Clinical trial number
Not applicable.
Ethical approval
Not applicable.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Zhu, X; Wang, Y; Cambria, E; Zhu, X; Rida, I. RMER-DT: Robust multimodal emotion recognition in conversational contexts based on diffusion and transformers. Inf Fusion; 2025; 123, 103268. [DOI: https://dx.doi.org/10.1016/j.inffus.2025.103268]
2. Wang, R; Xu, D; Cascone, L; Wang, Y; Chen, H; Zheng, J; Zhu, X. RAFT: robust adversarial fusion transformer for multimodal sentiment analysis. Array; 2025; 27, 1004445. [DOI: https://dx.doi.org/10.1016/j.array.2025.100445]
3. Gao, M; Sun, J; Li, Q; Khan, MA; Shagn, J; Zhu, X; Jepon, G. Towards trustworthy image super-resolution via symmetrical and recursive artificial neural network. Image Vis Comput; 2025; 158,
4. Cheng L, Zhang H, Di B, Niyato D, Song L. Large language models empower multimodal integrated sensing and communication. IEEE Commun Mag. 2025;63(5):190-7.
5. Song, R; Ai, Y; Tian, B; Chen, L; Zhu, F. Msfanet: a light weight object detector based on context aggregation and attention mechanism for autonomous mining truck. IEEE Trans Intell Veh; 2022; 8,
6. Xu, C; Wu, X; Wang, M; Qiu, F; Liu, Y. Improving dynamic gesture recognition in untrimmed videos by an online lightweight framework and a new gesture dataset ZJUGesture. Neurocomputing; 2023; 523,
7. Srinivas, Y; Ganivada, A. A modified inter-frame difference method for detection of moving objects in videos. Int J Inf Technol; 2025; 17,
8. Zhong, JL; Gan, YF; Yang, JX. A fast forgery frame detection method for video copy-move inter/intra-frame identification. J Ambient Intell Humaniz Comput; 2023; 14,
9. Arepalli, PG; Naik, KJ. A deep learning-enabled IoT framework for early hypoxia detection in aqua water using light weight spatially shared attention-LSTM network. J Supercomput; 2024; 80,
10. Suleman, MAR; Shridevi, S. Short-term weather forecasting using spatial feature attention based LSTM model. IEEE Access; 2022; 10,
11. Zhou, Q; Huang, Z; Ding, M et al. Medical image classification using light-weight CNN with spiking cortical model based attention module. IEEE J Biomed Health Inform; 2023; 27,
12. Ojeda-Castelo, JJ; Capobianco-Uriarte, MLM; Piedra-Fernandez, JA; Ayala, R. A survey on intelligent gesture recognition techniques. IEEE Access; 2022; 10,
13. Al Farid, F; Hashim, N; Abdullah, J; Bhuiyan, MR; Shahida Mohd Isa, WN. A structured and methodological review on vision-based hand gesture recognition system. J Imag; 2022; 8,
14. Jin, B; Ma, X; Zhang, Z; Lian, Z; Wang, B. Interference-robust millimeter-wave radar-based dynamic hand gesture recognition using 2-D CNN-transformer networks. IEEE Internet Things J; 2023; 11,
15. Sun, Y; Weng, Y; Luo, B; Li, G; Tao, B. Gesture recognition algorithm based on multi-scale feature fusion in RGB-D images. IET Image Process; 2023; 17,
16. Guo, S; Li, Q; Gao, M; Zhu, X; Rida, I. Generalizable deepfake detection via spatial kernel selection and halo attention network. Image Vis Comput; 2025; 160, 105582. [DOI: https://dx.doi.org/10.1016/j.imavis.2025.105582]
17. Purohit, J; Dave, R. Leveraging deep learning techniques to obtain efficacious segmentation results. Arch Adv Eng Sci; 2023; 1,
18. Santoso, J; Yamada, T; Ishizuka, K; Hashimoto, T; Makino, S. Speech emotion recognition based on self-attention weight correction for acoustic and text features. IEEE Access; 2022; 10,
19. Ektefaie, Y; Dasoulas, G; Noori, A; Farhat, M; Zitnik, M. Multimodal learning with graphs. Nat Mach Intell; 2023; 5,
20. Rajamohanan, R; Latha, BC. An optimized YOLO v5 model for tomato leaf disease classification with field dataset. Eng Technol Appl Sci Res; 2023; 13,
© The Author(s) 2025. This work is published under the Creative Commons BY-NC-ND 4.0 License (http://creativecommons.org/licenses/by-nc-nd/4.0/).