© 2022 Zhuang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Video question answering (Video-QA) is an intensively studied subject in Artificial Intelligence and one of the tasks that can be used to evaluate an AI system's ability to understand video. In this paper, we propose a Modality Attention Fusion framework with Hybrid Multi-head Self-attention (MAF-HMS). MAF-HMS answers multiple-choice questions over a joint video-subtitle-QA representation by fusing attention and self-attention across the modalities. We use BERT to extract text features and Faster R-CNN to extract visual features, which together provide the input representation our model uses to answer questions. In addition, we construct a Modality Attention Fusion (MAF) framework that builds an attention fusion matrix from the different modalities (video, subtitles, QA), and we use Hybrid Multi-head Self-attention (HMS) to further determine the correct answer. Experiments on three separate scene datasets show that our overall model outperforms the baseline methods by a large margin. Finally, we conduct extensive ablation studies to verify the contribution of each network component, and we demonstrate the effectiveness and advantages of our method over existing methods through experimental results broken down by question type and by required modality.
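To make the abstract's pipeline more concrete, here is a minimal, dependency-free sketch of the general idea: cross-modal attention from QA tokens to video and subtitle features, followed by multi-head self-attention and a score for each multiple-choice candidate. This is not the authors' MAF-HMS implementation. The abstract does not specify the fusion operator, the "hybrid" part of HMS, or the answer head. Several choices below are assumptions made only for illustration: additive fusion, identity Q/K/V projections, mean pooling, the linear scorer `w`, and the helper names `fuse_modalities`, `score_candidate` and `predict`. The sketch also assumes that the BERT and Faster R-CNN features have already been projected to a shared size `d`.

```python
import math
import random


def matmul(a, b):
    """(n x k) @ (k x m) -> (n x m) for lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_self_attention(x, num_heads):
    """Split features into heads, self-attend per head, concatenate the heads.
    Sketch only: no learned Q/K/V or output projections, and no 'hybrid' variant."""
    d = len(x[0])
    assert d % num_heads == 0, "feature size must divide evenly into heads"
    h = d // num_heads
    heads = []
    for i in range(num_heads):
        xi = [row[i * h:(i + 1) * h] for row in x]
        heads.append(attention(xi, xi, xi))
    # Concatenate the per-head outputs token by token.
    return [[val for head in heads for val in head[t]] for t in range(len(x))]


def fuse_modalities(video, subs, qa):
    """Hypothetical fusion: QA tokens attend to video and to subtitles,
    and both context vectors are added to the QA features."""
    v_ctx = attention(qa, video, video)
    s_ctx = attention(qa, subs, subs)
    return [[a + b + c for a, b, c in zip(r_qa, r_v, r_s)]
            for r_qa, r_v, r_s in zip(qa, v_ctx, s_ctx)]


def score_candidate(video, subs, qa, w, num_heads=2):
    """Score one question+answer candidate (hypothetical head)."""
    fused = fuse_modalities(video, subs, qa)
    refined = multi_head_self_attention(fused, num_heads)
    pooled = [sum(col) / len(refined) for col in zip(*refined)]  # mean-pool tokens
    return sum(p * wi for p, wi in zip(pooled, w))  # hypothetical linear scorer


def predict(video, subs, candidates, w):
    """Return the index of the highest-scoring candidate and all scores."""
    scores = [score_candidate(video, subs, qa, w) for qa in candidates]
    return max(range(len(scores)), key=scores.__getitem__), scores


def random_features(n, d):
    """Random stand-in features; real inputs would come from BERT / Faster R-CNN."""
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]


if __name__ == "__main__":
    random.seed(0)
    d = 4  # assumed shared feature size after projection
    video, subs = random_features(5, d), random_features(3, d)
    candidates = [random_features(4, d) for _ in range(5)]  # five QA pairs
    w = [random.gauss(0, 1) for _ in range(d)]
    best, scores = predict(video, subs, candidates, w)
    print("predicted answer index:", best)
```

Each candidate answer is scored on its own, and the highest score wins. This matches the multiple-choice setting the abstract describes. A trained model would learn the projections and scorer rather than using fixed or random values.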

Details

Title
Modality attention fusion model with hybrid multi-head self-attention for video understanding
Author
Zhuang, Xuqiang; Liu, Fang’ai; Hou, Jian; Hao, Jianhua; Cai, Xiaohong
First page
e0275156
Section
Research Article
Publication year
2022
Publication date
Oct 2022
Publisher
Public Library of Science
e-ISSN
1932-6203
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2722285035