Full Text


© The Author(s) 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Human action recognition is an important research topic in computer vision because it enables computers to automatically recognize human behaviors and accurately comprehend human intentions, making it an essential form of communication between computers and humans. Inspired by research on keyframe extraction and multi-feature fusion, this paper improves the accuracy of action recognition by extracting keyframe features and fusing them with video features. We propose a novel multi-stream architecture composed of two distinct models combined through different fusion techniques. The first model pairs a two-dimensional convolutional neural network (2D-CNN) with long short-term memory (LSTM) networks to learn long-term spatial and temporal features from video keyframe images. The second model is a three-dimensional convolutional neural network (3D-CNN) that captures short-term spatial-temporal features from video clips. We then put forward two frameworks that show how different fusion structures can improve action-recognition performance: the early-fusion framework examines the effect of fusing the two models' features before classification, while the late-fusion framework combines the decisions produced by the two models. The various fusion techniques reveal how much each spatial and temporal feature influences the recognition model's accuracy. Our method is evaluated on two important action-recognition benchmarks, HMDB-51 and UCF-101. The early-fusion strategy achieves accuracies of 70.1% and 95.5% on HMDB-51 and UCF-101, respectively, while the late-fusion strategy achieves 77.7% and 97.5%, respectively.
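The distinction between the two fusion strategies in the abstract can be illustrated with a minimal, framework-free sketch. This is not the paper's implementation: the function names, toy feature vectors, and the equal-weight average are all illustrative assumptions; early fusion is shown as feature concatenation before a shared classifier, and late fusion as a weighted average of each stream's per-class scores.

```python
def early_fusion(feat_2d, feat_3d):
    """Feature-level (early) fusion: concatenate the 2D-CNN+LSTM stream's
    features with the 3D-CNN stream's features so a single downstream
    classifier sees one joint feature vector."""
    return feat_2d + feat_3d  # list concatenation stands in for tensor concat

def late_fusion(scores_2d, scores_3d, w=0.5):
    """Decision-level (late) fusion: combine the per-class scores that the
    two already-trained streams produce, here via a weighted average."""
    return [w * a + (1 - w) * b for a, b in zip(scores_2d, scores_3d)]

# Hypothetical per-class scores from the two streams for three action classes.
scores_2d = [0.7, 0.2, 0.1]
scores_3d = [0.5, 0.4, 0.1]
fused = late_fusion(scores_2d, scores_3d)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

In practice the weight `w` (and whether to average, multiply, or learn the combination) is a design choice that determines how much each stream's spatial versus temporal evidence influences the final prediction.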

Article Highlights

Current models for human action recognition rely on still images or brief video clips, and they struggle to correctly identify activities that extend beyond their constrained temporal context.

We propose frameworks that investigate several deep learning models with different fusion methods for the effective recognition and categorization of human actions.

On the standard benchmark datasets, our proposed frameworks outperform the majority of state-of-the-art techniques.

Details

Title
Various frameworks for integrating image and video streams for spatiotemporal information learning employing 2D–3D residual networks for human action recognition
Author
Yosry, Shaimaa 1 ; Elrefaei, Lamiaa 1 ; ElKamaar, Rafaat 1 ; Ziedan, Rania R. 1 

 Benha University, Electrical Engineering Department, Faculty of Engineering at Shoubra, Cairo, Egypt (GRID:grid.411660.4) (ISNI:0000 0004 0621 2741) 
Pages
141
Publication year
2024
Publication date
Apr 2024
Publisher
Springer Nature B.V.
ISSN
2523-3963
e-ISSN
2523-3971
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2963049981