Abstract

Region features extracted by object detection networks have been pivotal to advances in visual question answering (VQA). However, because they lack global context, these features may yield inaccurate answers to questions that depend on such information. Conversely, grid features provide detailed global context but falter on questions requiring high-level semantic insight, because they lack semantic richness. This paper therefore proposes an improved attention-based dual-stream visual fusion network (MDVFN) that fuses the two: grid features supply region features with global context, while region features supply grid features with high-level semantic information. Specifically, we design a visual crossed attention (VCA) module in the attention network, which interactively fuses the two visual features to enhance them before the question features guide attention. To reduce the semantic noise generated by the interaction of the two image features in the VCA module, we apply a targeted optimization: before fusion, position information is embedded into each visual feature separately, and a visual fusion graph constrains the fusion process. Additionally, to combine text information, grid features, and region features, we propose a modality-mixing network. To validate our model, we conducted extensive experiments on the VQA-v2 benchmark dataset and the GQA dataset. These experiments demonstrate that MDVFN outperforms state-of-the-art methods: the proposed model achieved accuracies of 72.16% and 72.03% on the VQA-v2 and GQA datasets, respectively.
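
The abstract's idea of two visual streams attending to each other under a graph constraint can be illustrated with a minimal sketch. This is not the authors' implementation: the names `cross_attend`, `graph`, and the toy features are hypothetical, and the learned projections, position embeddings, multi-head structure, and question-guided attention of an actual model are omitted. The boolean `graph` is only an assumed stand-in for a visual fusion graph that restricts which region-grid pairs may interact.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values, allowed):
    """Each query attends over keys_values; allowed[i][j] says whether
    query i may attend to item j (assumed stand-in for a fusion graph)."""
    dim = len(queries[0])
    fused = []
    for i, q in enumerate(queries):
        # Only graph-permitted pairs take part in attention.
        idx = [j for j in range(len(keys_values)) if allowed[i][j]]
        if not idx:
            fused.append(list(q))  # no permitted neighbour: keep the feature unchanged
            continue
        # Scaled dot-product scores against the permitted items.
        scores = [sum(a * b for a, b in zip(q, keys_values[j])) / math.sqrt(dim)
                  for j in idx]
        weights = softmax(scores)
        context = [sum(w * keys_values[j][d] for w, j in zip(weights, idx))
                   for d in range(dim)]
        # Residual connection: original feature plus attended context.
        fused.append([x + c for x, c in zip(q, context)])
    return fused

# Toy data: 2 region features and 3 grid features, each 2-dimensional.
regions = [[1.0, 0.0], [0.0, 1.0]]
grids = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical fusion graph: region i may attend to grid j only where True.
graph = [[True, False, True], [False, True, False]]

# Regions gather context from grids; grids gather semantics from regions.
regions_fused = cross_attend(regions, grids, graph)
graph_t = [[graph[i][j] for i in range(len(regions))] for j in range(len(grids))]
grids_fused = cross_attend(grids, regions, graph_t)
```

In this sketch each stream keeps its own length and dimensionality after fusion, so a later stage could still treat region and grid features separately.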

Details

Title
Modular dual-stream visual fusion network for visual question answering
Publication title
Volume
41
Issue
1
Pages
549-562
Publication year
2025
Publication date
Jan 2025
Publisher
Springer Nature B.V.
Place of publication
Heidelberg
Country of publication
Netherlands
Publication subject
ISSN
01782789
e-ISSN
14322315
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2024-05-28
Milestone dates
2024-02-28 (Registration); 2024-02-26 (Accepted)
First posting date
28 May 2024
ProQuest document ID
3159547624
Document URL
https://www.proquest.com/scholarly-journals/modular-dual-stream-visual-fusion-network/docview/3159547624/se-2?accountid=208611
Copyright
Copyright Springer Nature B.V. Jan 2025
Last updated
2025-01-31
Database
ProQuest One Academic