
Abstract

Region features extracted by object detection networks have been pivotal in advancing visual question answering (VQA). However, because these features lack global context, they may yield inaccurate answers to questions that demand such information. Conversely, grid features provide detailed global context but falter on questions requiring high-level semantic understanding because they lack semantic richness. This paper therefore proposes an improved attention-based dual-stream visual fusion network (MDVFN), which fuses region features with grid features so that the region features gain global context while the grid features are supplemented with high-level semantic information. Specifically, we design a visual cross attention (VCA) module in the attention network that interactively fuses the two visual features to enhance each of them before the question features guide attention. Notably, to reduce the semantic noise generated when the two image features interact in the VCA module, we apply targeted optimizations: before fusion, visual position information is embedded into each feature stream, and a visual fusion graph constrains the fusion process. Additionally, to combine text information, grid features, and region features, we propose a modality-mixing network. To validate our model, we conducted extensive experiments on the VQA-v2 benchmark and the GQA dataset, which demonstrate that MDVFN outperforms state-of-the-art methods: our model achieved accuracies of 72.16% and 72.03% on VQA-v2 and GQA, respectively.
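To make the fusion idea concrete, below is a minimal, hypothetical sketch in PyTorch of a cross-attention module in which each visual stream attends over the other, so region features gather global context from grid features and grid features gather high-level semantics from region features. The class name, dimensions, and additive position embedding are illustrative assumptions, not the authors' implementation, and the question-guided attention, fusion-graph constraint, and modality-mixing network described above are omitted.

```python
import torch
import torch.nn as nn


class VisualCrossAttention(nn.Module):
    """Illustrative sketch of cross-attention fusion of region and grid features."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Region queries attend over grid keys/values, and vice versa.
        self.region_from_grid = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.grid_from_region = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_region = nn.LayerNorm(dim)
        self.norm_grid = nn.LayerNorm(dim)

    def forward(self, region_feats, grid_feats, region_pos=None, grid_pos=None):
        # Embed positional information before fusion (assumed to be additive),
        # as the abstract states position is embedded prior to the interaction.
        if region_pos is not None:
            region_feats = region_feats + region_pos
        if grid_pos is not None:
            grid_feats = grid_feats + grid_pos

        # Region features gather global context from grid features.
        region_out, _ = self.region_from_grid(region_feats, grid_feats, grid_feats)
        # Grid features gather high-level semantics from region features.
        grid_out, _ = self.grid_from_region(grid_feats, region_feats, region_feats)

        # Residual connections preserve each stream's original information.
        region_feats = self.norm_region(region_feats + region_out)
        grid_feats = self.norm_grid(grid_feats + grid_out)
        return region_feats, grid_feats


if __name__ == "__main__":
    vca = VisualCrossAttention(dim=512)
    regions = torch.randn(2, 36, 512)  # e.g. 36 detected-object region features
    grids = torch.randn(2, 49, 512)    # e.g. 7x7 grid features, flattened
    fused_regions, fused_grids = vca(regions, grids)
    print(fused_regions.shape, fused_grids.shape)  # (2, 36, 512) and (2, 49, 512)
```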

