Abstract
Region features extracted by object detection networks have been pivotal in visual question answering (VQA) advancements. However, lacking global context, these features may yield inaccurate answers for questions demanding such information. Conversely, grid features provide detailed global context but falter on questions requiring high-level semantic insights due to their lack of semantic richness. Therefore, this paper proposes an attention-based modular dual-stream visual fusion network (MDVFN), which fuses region features with grid features so that the region features obtain global context information while the grid features are supplemented with high-level semantic information. Specifically, we design a visual cross attention (VCA) module in the attention network, which interactively fuses the two visual features to enhance them before attention is guided by the question features. To reduce the semantic noise generated by the interaction of the two image features in the VCA module, we carry out targeted optimization: before fusion, visual position information is embedded into each feature stream, and a visual fusion graph is used to constrain the fusion process. Additionally, to combine text information, grid features, and region features, we propose a modality-mixing network. To validate our model, we conducted extensive experiments on the VQA-v2 benchmark dataset and the GQA dataset. These experiments demonstrate that MDVFN outperforms the most advanced methods. For instance, our proposed model achieved accuracies of 72.16% and 72.03% on the VQA-v2 and GQA datasets, respectively.
Introduction
Multimodal learning that combines vision and language information has been widely studied in the fields of computer vision and natural language processing in recent years. Important breakthroughs have been made in directions such as image-text matching [1–3], image captioning [4–6], and visual question answering [7–9]. VQA represents a particularly complex challenge, necessitating an intricate comprehension of both the visual and textual elements present in an image and its associated question, alongside the capability to logically deduce and provide precise responses.
The challenges in visual question answering (VQA) are not only to decipher the connections between objects in an image and the essence of the query but also to accurately identify critical information amidst complex visual attributes. As such, VQA systems demand sophisticated capabilities for visual and textual analysis as well as logical reasoning. Over time, significant strides in VQA have been supported by innovative strategies such as the implementation of residual networks [10] and the application of attention mechanisms [11].
Among these advancements, a notable technique involves substituting traditional grid features with image region features identified by object detection frameworks like Faster R-CNN [12] as the model's input. Region features provide object-level information, which makes it easy to capture salient objects in the image and greatly reduces the difficulty of visual embedding in the model. Nevertheless, their shortfall in encompassing global context means certain details, such as object placement and proximities, remain elusive. Grid features excel here, their extensive and structured coverage offering a comprehensive view of the entire image.
Nevertheless, grid features have clear limitations. While they encompass the entirety of an image, they fall short in providing deeper semantic insights. This shortfall hampers their ability to discern intricate details within the image, like facial expressions and movements, consequently detracting from precision. This shortfall underscores the shift toward region features. How, then, can we address the deficiencies inherent in both types of visual features?
Obviously, it is a common idea to combine them to form a more comprehensive and rich feature embedding. Yet, a significant challenge persists: traditional grid features, despite encapsulating global contextual information absent in region features, significantly lack the semantic depth provided by region features. Achieving an optimal blend of grid and region features for enhanced fusion outcomes is challenging. Jiang et al. [13] tackled this challenge by adapting a pre-trained object detector to extract grid features at a parallel semantic level, offering theoretical groundwork for our research. Another aspect to consider is the application of this concept to attention mechanisms in visual question answering models. Attention-based methods, such as the VQA approach initially introduced by Shih et al. [14], and co-attention networks like BAN [30] and MCAN [8] in recent years, have become staples in VQA models. We opt to extend the MCAN model's framework for our investigation. A critical question in incorporating our concept into the MCAN framework is whether merging and enhancing the two visual features might adversely affect the co-attention mechanism.
Through extensive experiments, we found that if the two visual features interact directly, considerable noise is introduced. The enhanced visual features then carry this noise into the final model output through question-guided attention. For example, in Fig. 1 (left), for the given image the model is asked "Is this a black and white striped fire hydrant?" If the green and red regions from the two features are fused interactively, then when question-guided attention focuses on "fire hydrant", both visual features clearly introduce color noise for "black and white".
Fig. 1 [Images not available. See PDF.]
The left part of the figure is an example of the semantic noise introduced when the region features are fused with the grid map, and the green and red shaded areas in the grid indicate the areas covered during the fusion. On the right is an example of region restriction for visual fusion
Fig. 2 [Images not available. See PDF.]
Overview of the proposed modular dual-stream visual fusion network architecture. Faster R-CNN (G) represents the modified Faster R-CNN network for extracting grid features. The input question and image are first represented as a series of word embeddings, image region features, and grid features in the feature representation layer. Based on the obtained feature representation, the information within features between visual features and between modalities is mined through the encoder–decoder layer. See Fig. 5 for the internal structure. The output features are combined together through the modality-mixing network to generate the final feature Z and obtain the answer
To address these issues, we propose a new modular dual-stream visual fusion network (MDVFN). Specifically, our dual-stream model (as shown in Fig. 2) first extracts deep features of the two kinds of visual information through self-attention. Distinct from existing VQA attention mechanisms, our approach incorporates position encoding within the attention calculation to minimize noise during visual fusion. In addition, a visual cross attention (VCA) module is proposed to handle the fusion of the two kinds of visual information. In this module, we introduce a visual fusion graph to further reduce noise in the fusion. For example, in Fig. 1 (right), when we restrict the fusion of the two kinds of visual information to the blue region, the noise introduced is effectively reduced. Through the VCA module, object-level information is propagated from region features to grid features, and the global context information of grid features is complemented into region features. Because two visual features are introduced, the number of final model outputs increases from two to three: grid information, region information, and text information. As a result, traditional fusion mechanisms, such as linear fusion or more advanced methods like MUTAN [16] and BLOCK [17], are not applicable to our model. Therefore, we propose a new fusion network whose main idea is to mix the three kinds of information through their intrinsic connections.
To validate the proposed MDVFN model, we conduct extensive experiments on the VQA-v2 [18] and GQA [19] datasets and achieve state-of-the-art performance.
In summary, this article has three main contributions:
We propose a deep modular dual-stream visual fusion network (MDVFN), which compensates for the respective shortcomings of region-feature and grid-feature visual question answering models and realizes the complementarity of the two feature types: object-level information is propagated from the region features to the grid features, and the global information of the grid features is supplemented to the region features. In addition, the MDVFN model can attend not only to information between the two modalities but also to information between the two visual features.
We propose a visual cross attention module. This module seamlessly integrates both types of visual information, employing visual position encoding and a visual fusion graph to diminish noise during fusion, thereby enhancing the precision of answer prediction.
We propose a modality-mixing network, which uses the interrelations of the three kinds of information to control the mixing of the model's grid visual, region visual, and question text outputs, obtaining the final feature representation for answer prediction.
Related work
In this section, we briefly review recent developments and the problems addressed in the field of visual question answering. We also discuss the evolution of visual feature inputs in VQA. Finally, we review traditional feature fusion methods.
Visual question answering
Visual question answering (VQA) has emerged as a focal area within multimodal machine learning, blending computer vision with natural language processing to emulate human-like responses to visual queries. Initially, the approach by Antol et al. [20] leveraged CNNs and LSTMs to merge visual and textual data, extracting and classifying features through element-wise multiplication.
Early VQA efforts involved merging visual data from two broad multimodal sources to formulate answers, a process that often omitted crucial details due to its reliance on global information. To address these shortcomings, advancements such as the integration of residual networks were introduced, significantly enhancing the model’s capability to assimilate and refine fusion features [10].
To address these challenges, Chen et al. [11] introduced a novel attention guidance mechanism, distinct from prior approaches. They identified visual features within the spatial feature map that correlated with the question's semantic features to create a "question-guided attention map," utilizing a configurable convolution kernel infused with visual data derived from the question itself. Yang et al. [21] further enhanced this method by integrating it with the "Stacked Attention Network" (SAN), enabling iterative answer inference. Ben-younes et al. [16] developed a multimodal bilinear pooling technique that merges visual and textual features from the image space and the question, respectively.
An important advancement came with the "bottom-up" attention model proposed by Anderson et al. [22], marking a shift in VQA attention models from grid features to region features.
In order to learn both text attention and visual attention simultaneously, Lu et al. [23] proposed a joint attention learning framework that alternates between learning text attention and visual attention in a hierarchical manner. Yu et al. [24] proposed a co-attention architecture that reduces noise by applying self-attention to the question embedding and applying question-conditioned attention to the image embedding. To focus on the dense interaction between each question word and each image region, Yu et al. [8] proposed a deep modular co-attention network that achieves complete interaction between image regions and question words. This dense co-attention model significantly outperforms its predecessors, showcasing enhanced performance.
Visual feature
Traditional global grid features based on CNNs have been commonly used as visual features in VQA models. For example, Zhou et al. [25] used GoogLeNet to extract visual features and then used a bag-of-words model to extract textual features of the questions to predict the correct answers. However, with the introduction of attention mechanisms, this type of convolution-based global grid feature has gradually been replaced. Unlike ordinary attention over specific parts of the input, bottom-up attention uses object-based bounding-box features extracted from the pre-trained object detector Faster R-CNN [12] as visual input. Compared to convolution-based grid features, object-based region features greatly reduce the difficulty of visual embedding.
Nonetheless, Jiang et al. [13] contended that traditional grid features are not inherently inferior to object-based visual features. They modified the pre-trained object detector to extract grid features at the same level and used them in the VQA model. They found that using grid features resulted in better VQA accuracy than region features. This finding helps to have a more comprehensive consideration of visual feature selection in VQA.
Fusion strategies for VQA
To improve the prediction accuracy of VQA models, it is necessary to use complex fusion mechanisms to combine multimodal features. Currently, linear fusion is the most common and simplest way to fuse multimodal features by adding or multiplying them element-wise. For example, [23] used the element-wise summation method, and [26] used the element-wise multiplication method. However, more complex fusion methods can further align the information from different modalities. Yu et al. [24] proposed the MFB method, which uses a bilinear pooling method that combines co-attention with multimodal matrix factorization. Ben-younes et al. [17] proposed a fusion method based on block-superdiagonal tensor decomposition, which generalizes the "rank" and "mode rank" of tensors using the concept of block-term ranks, representing accurate interactions between modalities while still preserving single-modality information.
Our approach
The paper proposes a modular dual-stream visual fusion network called MDVFN, which consists of a modular dual-stream visual fusion module and a multimodal fusion module. The modular dual-stream visual fusion module is composed of self-attention units (SA), grid guided-attention units (or region guided-attention units) (GA), and visual cross attention units (VCA). The multimodal fusion module mainly combines the three outputs of the previous module to achieve classification prediction. The overall structure of the model is shown in Fig. 2.
Feature extraction and position embedding
The grid features and region features are extracted from the given image. For the region features, a bottom-up approach is adopted to extract intermediate features from the pre-trained object detection model Faster R-CNN [12], which is trained on a large-scale visual dataset. Finally, region visual features of size $m \times 2048$ are obtained: the extracted convolutional feature of the $i$-th object is denoted $x_i \in \mathbb{R}^{2048}$, and the region feature matrix is expressed as $X_R = [x_1, \ldots, x_m] \in \mathbb{R}^{m \times 2048}$.
The grid features in this paper were obtained using the pre-trained Faster R-CNN model provided by Jiang et al. [13]. The model uses a dilated C5 backbone and 1×1 RoIPool as the detection head and was trained on the VG dataset. In the feature extraction stage, the dilation can be removed and grid features can be extracted directly instead of the original region features. The network is used to extract 8×8 grid features, represented as $X_G \in \mathbb{R}^{64 \times 2048}$.
To handle the input questions, we tokenize them into words, trimming each question to a maximum of 14 words. Then, we use GloVe word embeddings pre-trained on a large-scale corpus to convert each word into a 300-dimensional vector, so each question is represented as an $n \times 300$ word vector matrix, where $n$ is the number of words in the question. Finally, the matrix is fed into a single-layer LSTM to obtain the word embeddings $Y \in \mathbb{R}^{n \times d_y}$, where $d_y$ is the number of hidden units in the LSTM.
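To make this pipeline concrete, the following is a minimal PyTorch-style sketch of the question encoder described above; the class and variable names are ours, the random `vocab` tensor stands in for real GloVe vectors, and the hidden size of 512 follows the question feature dimension reported in the experimental settings.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Trim/pad questions to 14 tokens, embed with 300-d GloVe, run a 1-layer LSTM."""
    def __init__(self, glove_vectors, hidden_size=512, max_len=14):
        super().__init__()
        # glove_vectors: (vocab_size, 300) tensor of pre-trained word vectors
        self.embed = nn.Embedding.from_pretrained(glove_vectors, freeze=False)
        self.lstm = nn.LSTM(input_size=300, hidden_size=hidden_size,
                            num_layers=1, batch_first=True)
        self.max_len = max_len

    def forward(self, token_ids):
        # token_ids: (batch, n) integer word indices, padded/trimmed to max_len
        token_ids = token_ids[:, :self.max_len]
        x = self.embed(token_ids)      # (batch, n, 300) GloVe vectors
        y, _ = self.lstm(x)            # (batch, n, hidden_size) word embeddings Y
        return y

# usage sketch
vocab = torch.randn(10000, 300)        # stand-in for real GloVe vectors
encoder = QuestionEncoder(vocab)
q = torch.randint(0, 10000, (2, 14))
print(encoder(q).shape)                # torch.Size([2, 14, 512])
```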
Here, we choose to add positional information to both the grid features and the region features. While adding positional information to visual features is widely used in many other fields, it has rarely been applied to visual question answering. In our experiments, we found that adding visual positional information solely to grid or region features actually decreases the performance of visual question answering models, as shown in Table 3. We hypothesize that this is because, unlike models in other fields, visual question answering models pay more attention to the relationship between question features and visual features than to enhancing the visual features themselves. Therefore, the purpose of introducing visual positional information in our work is not to enhance the visual features themselves, but to optimize the fusion of the two types of visual information and reduce the noise introduced by fusion. The grid position encoding (GPE) is obtained by embedding the grid positions with two concatenated one-dimensional sine and cosine functions, a standard position encoding method:
$\mathrm{GPE}(i,j)=\left[\mathrm{PE}(i);\mathrm{PE}(j)\right]$  (1)

where $i$ and $j$ represent the indices of the grid rows and columns, and $\mathrm{PE}(pos)$ is defined as follows:

$\mathrm{PE}(pos,\,2k)=\sin\!\left(pos/10000^{2k/d}\right)$  (2)

$\mathrm{PE}(pos,\,2k+1)=\cos\!\left(pos/10000^{2k/d}\right)$  (3)

Here, $pos$ represents the grid position and $k$ represents the scale index. For the region features, the paper embeds the four-dimensional bounding box as the region position encoding (RPE):

$\mathrm{RPE}_i=\left[x_i^{tl},\,y_i^{tl},\,x_i^{br},\,y_i^{br}\right]W_b$  (4)
where $i$ is the region box index, $(x_i^{tl}, y_i^{tl})$ and $(x_i^{br}, y_i^{br})$ represent the upper-left and lower-right corners of the region box, respectively, and $W_b$ is the embedding matrix.
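For illustration, here is a minimal sketch of the two encodings of Eqs. (1)-(4); the exact sin/cos layout of the row and column encodings and the normalized box coordinates are our assumptions, not details taken from the paper.

```python
import torch

def grid_position_encoding(rows=8, cols=8, dim=512):
    """Sinusoidal encodings for row and column indices, concatenated (Eqs. 1-3)."""
    def sincos(pos, d):
        # pos: (n,) positions; returns an (n, d) sin/cos encoding
        k = torch.arange(d // 2, dtype=torch.float32)
        div = torch.pow(10000.0, 2 * k / d)
        angles = pos.unsqueeze(1) / div                       # (n, d/2)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)
    i = torch.arange(rows, dtype=torch.float32)
    j = torch.arange(cols, dtype=torch.float32)
    row_pe = sincos(i, dim // 2)                              # (rows, dim/2)
    col_pe = sincos(j, dim // 2)                              # (cols, dim/2)
    # concatenate the row and column encodings for every grid cell
    gpe = torch.cat([row_pe.unsqueeze(1).expand(rows, cols, -1),
                     col_pe.unsqueeze(0).expand(rows, cols, -1)], dim=-1)
    return gpe.reshape(rows * cols, dim)                      # (64, dim)

def region_position_encoding(boxes, weight):
    """Embed 4-d bounding boxes (x1, y1, x2, y2) with a learned matrix (Eq. 4)."""
    # boxes: (m, 4) corner coordinates; weight: (4, dim) embedding matrix
    return boxes @ weight

gpe = grid_position_encoding()
rpe = region_position_encoding(torch.rand(36, 4), torch.randn(4, 512))
print(gpe.shape, rpe.shape)   # torch.Size([64, 512]) torch.Size([36, 512])
```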
Fig. 3 [Images not available. See PDF.]
Illustration of the three basic attention units. a The self-attention (SA) unit takes one group of input features X and outputs the attended feature Z of X; when the input X is visual information, the corresponding visual position information is also input. b The visual cross attention (VCA) unit takes two groups of input visual features and outputs the attended feature Z obtained by letting one interact with the other. c The guided-attention (GA) unit takes two groups of input features X and Y and outputs the attended feature Z of X guided by Y
Fig. 4 [Images not available. See PDF.]
Illustration of the visual fusion graph. We use 8×8 grid features. The green box in the picture is a region feature, and the blue shaded area is the overlap between the grid and the region feature, which is used to restrict visual information fusion
SA, VCA and GA units
The attention units in this paper use the same attention mechanism as other co-attention networks [8, 23, 27], which is an extension of the multi-head attention mechanism. Multi-head attention consists of h parallel attention heads, where each attention head is a scaled dot-product attention.
$\mathrm{Att}(Q,K,V)=\mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d_h}}\right)V$  (5)

$\mathrm{MHAtt}(Q,K,V)=\left[\mathrm{head}_1;\ldots;\mathrm{head}_h\right]W^{O}$  (6)

$\mathrm{head}_j=\mathrm{Att}\!\left(QW_j^{Q},\,KW_j^{K},\,VW_j^{V}\right)$  (7)

The attention function operates on Q, K, and V, which correspond to the query, key, and value, and its output is a weighted average of the values. Here, $W_j^{Q}$, $W_j^{K}$, and $W_j^{V}$ are the projection matrices of the $j$-th head, $W^{O}$ is the output projection matrix, and $d_h = d/h$ is the dimensionality of the output features from each head.
Self-attention unit
The self-attention unit is built on this multi-head attention mechanism, as shown in Fig. 3a. The SA layer consists of multi-head dot-product attention and a feed-forward layer. Given a set of input features X and, if X is a visual feature, the corresponding positional information PE, all instances in X are weighted and summed through attention to obtain the output, allowing the model to focus on the relationships within the paired samples. The output of the attention calculation is then fed into the feed-forward layer FFN, which consists of two fully connected layers with a ReLU in between, and is finally normalized by layer normalization to produce the output.
$f=\mathrm{MHAtt}\!\left(X+PE_{Q},\,X+PE_{K},\,X\right)$  (8)

$\tilde{Z}=\mathrm{LayerNorm}\!\left(X+f\right)$  (9)

$\mathrm{FFN}(x)=\mathrm{FC}\!\left(\mathrm{ReLU}\!\left(\mathrm{FC}(x)\right)\right)$  (10)

$Z=\mathrm{LayerNorm}\!\left(\tilde{Z}+\mathrm{FFN}(\tilde{Z})\right)$  (11)

Here, $PE_{Q}$ and $PE_{K}$ are the positional information added to Q and K.
Visual cross attention unit
In this paper, we propose a visual cross attention unit to achieve complex interactions between grid features and region features and to fuse the two types of visual information with each other. To reduce noise, an important measure is to create a visual fusion graph that constrains the fusion scope of the two types of visual information. As shown in Fig. 4, the overlap between the grid and the regions in the image is used as the visual fusion graph, ensuring that no information from non-overlapping areas is introduced when the two kinds of visual information are fused. In the cross attention mechanism of this module, one type of visual information is used as the query and the other type as the key and value. As shown in Eq. 12, the visual fusion graph obtained from the two types of visual information is used as the mask M, in which the non-overlapping grid-region pairs are set to 0. This mask M is combined with the mask used for attention calculation to obtain the mask used for the visual cross attention calculation, which reduces the influence of noise when the two types of visual information are fused.
$M_{ij}=\begin{cases}1, & j\in\mathcal{G}_i\\ 0, & \text{otherwise}\end{cases}$  (12)

Here, $PE_i$ is the position encoding of node $i$, $\mathcal{G}_i$ is the visual fusion graph of node $i$, which includes the nodes of the other visual stream overlapping node $i$, and $x_i$ represents its visual features.

$\tilde{M}=M_{att}\odot M$  (13)

$\mathrm{MHAtt}_{M_r}(Q,K,V)=\mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d_h}}+\left(1-M_r\right)\cdot\left(-\infty\right)\right)V$  (14)

$\mathrm{MHAtt}_{M_g}(Q,K,V)=\mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d_h}}+\left(1-M_g\right)\cdot\left(-\infty\right)\right)V$  (15)

In the above equations, $\tilde{M}$ represents the combination of M and the attention mask used for attention calculation, and $M_r$ and $M_g$ represent the masks obtained from the visual fusion graph for the region stream and the grid stream, respectively. For the L-th output of VCA:

$\tilde{X}_R^{(L)}=\mathrm{LayerNorm}\!\left(X_R^{(L)}+\mathrm{MHAtt}_{M_r}\!\left(X_R^{(L)}+RPE,\;X_G^{(L)}+GPE,\;X_G^{(L)}\right)\right)$  (16)

$\tilde{X}_G^{(L)}=\mathrm{LayerNorm}\!\left(X_G^{(L)}+\mathrm{MHAtt}_{M_g}\!\left(X_G^{(L)}+GPE,\;X_R^{(L)}+RPE,\;X_R^{(L)}\right)\right)$  (17)

Afterward, the fused region and grid features are obtained through an FFN layer:

$Z_R^{(L)}=\mathrm{LayerNorm}\!\left(\tilde{X}_R^{(L)}+\mathrm{FFN}\!\left(\tilde{X}_R^{(L)}\right)\right)$  (18)

$Z_G^{(L)}=\mathrm{LayerNorm}\!\left(\tilde{X}_G^{(L)}+\mathrm{FFN}\!\left(\tilde{X}_G^{(L)}\right)\right)$  (19)
Through VCA, we fuse the region features into the grid features and vice versa. Specifically, the grid features are used as queries and the region features as keys and values to fuse the object-level information of the region features into the grid features; conversely, the region features are used as queries and the grid features as keys and values to supplement the context information of the grid features to the region features. In this way, bidirectional feature fusion enhancement is realized.
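The sketch below illustrates one direction of the VCA computation under our reading of Eqs. (12)-(19): a region-to-grid overlap test builds the fusion-graph mask, which then blocks attention between non-overlapping pairs. The `fusion_graph_mask` construction (uniform grid cells, normalized box coordinates) and the reuse of `nn.MultiheadAttention` are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

def fusion_graph_mask(boxes, rows=8, cols=8):
    """Region-to-grid overlap mask: entry (i, j) is True when grid cell j overlaps
    region box i. Boxes are (m, 4) with normalized (x1, y1, x2, y2) coordinates."""
    m = boxes.size(0)
    mask = torch.zeros(m, rows * cols, dtype=torch.bool)
    for j in range(rows * cols):
        r, c = divmod(j, cols)
        y1, y2 = r / rows, (r + 1) / rows
        x1, x2 = c / cols, (c + 1) / cols
        overlap = (boxes[:, 0] < x2) & (boxes[:, 2] > x1) & \
                  (boxes[:, 1] < y2) & (boxes[:, 3] > y1)
        mask[:, j] = overlap
    return mask                                        # (m, rows*cols)

class VisualCrossAttention(nn.Module):
    """One direction of the VCA unit: queries from one visual stream attend to the
    other stream, with non-overlapping pairs masked out."""
    def __init__(self, d=512, h=8, d_ff=2048):
        super().__init__()
        self.mhatt = nn.MultiheadAttention(d, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x_q, pe_q, x_kv, pe_kv, graph_mask):
        # graph_mask: (len_q, len_kv) True where fusion is allowed
        f, _ = self.mhatt(x_q + pe_q, x_kv + pe_kv, x_kv,
                          attn_mask=~graph_mask)       # True entries are *blocked*
        z = self.norm1(x_q + f)
        return self.norm2(z + self.ffn(z))

# region features attend to grid features (the other direction is symmetric)
boxes = torch.rand(36, 2)
boxes = torch.cat([boxes, boxes + 0.3], dim=1).clamp(0, 1)
mask = fusion_graph_mask(boxes)                        # (36, 64)
x_r, pe_r = torch.randn(1, 36, 512), torch.randn(1, 36, 512)
x_g, pe_g = torch.randn(1, 64, 512), torch.randn(1, 64, 512)
print(VisualCrossAttention()(x_r, pe_r, x_g, pe_g, mask).shape)  # torch.Size([1, 36, 512])
```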
Guided-attention unit
The guided-attention unit is designed to model the relationship between paired samples: given text features Y and visual features X ($X_G$ or $X_R$), Y guides the attention over X, as shown in Fig. 3c. It is worth noting that no image position information is added in this module.
$f=\mathrm{MHAtt}\!\left(X,\,Y,\,Y\right)$  (20)

$\tilde{Z}=\mathrm{LayerNorm}\!\left(X+f\right)$  (21)

Similarly, the output is passed through another FFN layer to obtain the visual information guided by the question:

$Z=\mathrm{LayerNorm}\!\left(\tilde{Z}+\mathrm{FFN}(\tilde{Z})\right)$  (22)
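For completeness, a matching sketch of the GA unit of Eqs. (20)-(22) is shown below; as stated above, no positional information is added here, and the question features supply the keys and values.

```python
import torch
import torch.nn as nn

class GuidedAttentionUnit(nn.Module):
    """GA unit (Eqs. 20-22): visual features X attend to question features Y."""
    def __init__(self, d=512, h=8, d_ff=2048):
        super().__init__()
        self.mhatt = nn.MultiheadAttention(d, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x, y):
        f, _ = self.mhatt(x, y, y)          # Eq. 20: Y supplies keys and values
        z = self.norm1(x + f)               # Eq. 21
        return self.norm2(z + self.ffn(z))  # Eq. 22

x, y = torch.randn(2, 36, 512), torch.randn(2, 14, 512)
print(GuidedAttentionUnit()(x, y).shape)    # torch.Size([2, 36, 512])
```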
Modular composition for VQA
Based on the three basic units shown in Fig. 3, we combine them to form the modular dual-stream visual fusion (MDVF) layer, as shown in Fig. 5. For the text feature input Y, after N SA layers, the text output $Y^{(N)}$ is obtained. For the L-th layer grid feature input $X_G^{(L-1)}$ and region feature input $X_R^{(L-1)}$, each is first processed by an SA unit, $\mathrm{SA}(X_G^{(L-1)})$ (or $\mathrm{SA}(X_R^{(L-1)})$), and then by a visual cross attention unit, $\mathrm{VCA}(X_G, X_R)$ (or $\mathrm{VCA}(X_R, X_G)$), to model the interaction between the paired visual features $\langle X_G, X_R\rangle$.
After the two kinds of visual information are processed by the VCA module, each has incorporated the other. Then, the two kinds of visual information are separately processed by guided-attention units, $\mathrm{GA}(X_G, Y^{(N)})$ (or $\mathrm{GA}(X_R, Y^{(N)})$), using the text features of the N-th layer, to model the complex relationship between $\langle X_G, Y\rangle$ (or $\langle X_R, Y\rangle$). Finally, the visual outputs $X_G^{(L)}$ and $X_R^{(L)}$ of the layer are obtained.
In this way, we construct one modular dual-stream visual fusion (MDVF) layer; a schematic composition of the attention-unit sketches is given below.
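The following schematic forward pass summarizes the composition just described. It reuses the `SelfAttentionUnit`, `VisualCrossAttention`, and `GuidedAttentionUnit` sketches from the previous subsections, assumes those classes are in scope, and is meant to show data flow only rather than the authors' implementation.

```python
# Schematic data flow of one MDVF layer, composing the unit sketches given above.
def mdvf_layer(x_grid, x_region, y_final, gpe, rpe, graph_mask,
               sa_g, sa_r, vca_g, vca_r, ga_g, ga_r):
    # 1) intra-feature modelling, with positional information for both streams
    x_g = sa_g(x_grid, gpe)
    x_r = sa_r(x_region, rpe)
    # 2) bidirectional visual fusion, constrained by the visual fusion graph
    #    (graph_mask holds region-to-grid overlaps; transpose it for grid queries)
    x_g_fused = vca_g(x_g, gpe, x_r, rpe, graph_mask.t())
    x_r_fused = vca_r(x_r, rpe, x_g, gpe, graph_mask)
    # 3) question-guided attention using the encoder's final text features
    return ga_g(x_g_fused, y_final), ga_r(x_r_fused, y_final)
```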
Fig. 5 [Images not available. See PDF.]
Implementation illustration of our encoder–decoder layer. Y represents the question features, X1 and X2 represent the region features and grid features, and Graph represents the visual fusion graph. The encoder consists of N SA layers in series. The decoder takes the two kinds of visual information as well as the question features of the final encoder layer as input, and outputs two kinds of visual information that are passed to the next layer
Fig. 6 [Images not available. See PDF.]
Illustration diagram of the proposed modality-mixing network. is the output of transformed by Eq. 24, and the modality-mixing network mainly realizes the combine of features, namely Equation 28. The visual guidance embedding is learned after concatenating the three features in pairs, and then the final output of the three features is obtained through the modality-mixing computing. The final output Z is obtained by simply concatenating the three features
Modular dual-stream visual fusion network
In this section, we discuss the modular dual-stream visual fusion network proposed in this paper, which feeds three features into the cascaded MDVF encoder–decoder model, namely the question features Y, the grid features $X_G$, and the region features $X_R$. The modular visual fusion model consists of L MDVF layers cascaded in depth (denoted $\mathrm{MDVF}^{(1)}, \ldots, \mathrm{MDVF}^{(L)}$); the inputs of layer $l$ are $X_G^{(l-1)}$ and $X_R^{(l-1)}$, while the outputs are $X_G^{(l)}$ and $X_R^{(l)}$. For $l=1$, the inputs are $X_G^{(0)}=X_G$ and $X_R^{(0)}=X_R$.
The encoder in the model can be understood as processing the textual information through N layers of self-attention to obtain a deep textual representation. The decoder, in turn, processes the two types of visual information through the SA and VCA units and acquires text guidance through the GA unit to produce the two visual outputs. The $l$-th layer of the decoder is represented as:
$\left[X_G^{(l)},\,X_R^{(l)}\right]=\mathrm{MDVF}^{(l)}\!\left(\left[X_G^{(l-1)},\,X_R^{(l-1)}\right],\,Y^{(N)}\right)$  (23)
Modality-mixing network and output classification
For the output of the L-th MDVF layer, namely the text features $Y^{(L)}$ and the two visual features $X_G^{(L)}$ and $X_R^{(L)}$, our model has three outputs, which makes previous VQA fusion mechanisms no longer applicable. First, we use a two-layer MLP to obtain an attended feature from each of the three outputs. Taking the text features as an example, the attended feature $\tilde{y}$ is obtained as:
$\alpha=\mathrm{softmax}\!\left(\mathrm{MLP}\!\left(Y^{(L)}\right)\right),\qquad \tilde{y}=\sum_{i=1}^{n}\alpha_i\,y_i^{(L)}$  (24)

The attended visual features $\tilde{x}_G$ and $\tilde{x}_R$ are obtained from $X_G^{(L)}$ and $X_R^{(L)}$ in the same way.
To combine these three features, we adopt the idea of cross-guidance, using one feature to guide the fusion of the other two. As shown in Fig. 6, we first concatenate the three features pairwise to obtain the mix guidance embeddings:

$m_1=\sigma\!\left(w_1^{\top}\!\left[\tilde{y};\tilde{x}_G\right]+b_1\right)$  (25)

$m_2=\sigma\!\left(w_2^{\top}\!\left[\tilde{y};\tilde{x}_R\right]+b_2\right)$  (26)

$m_3=\sigma\!\left(w_3^{\top}\!\left[\tilde{x}_G;\tilde{x}_R\right]+b_3\right)$  (27)

In these equations, the symbol $[\cdot;\cdot]$ represents the vector concatenation operation, $w_1$, $w_2$, and $w_3$ are weight vectors, $b_1$, $b_2$, and $b_3$ are scalar biases, and $\sigma$ is the sigmoid function $\sigma(x)=1/(1+e^{-x})$. Then, the three mix guidance embeddings are used to weight the three features. Specifically, for the language feature, it guides the combination of the grid feature and the region feature as follows:

$z_Y=\mathrm{LayerNorm}\!\left(m_1 W\tilde{x}_G+m_2 W'\tilde{x}_R+b\right)$  (28)

Here, $W$ and $W'$ are weight matrices, and $b$ is the bias vector; $z_G$ and $z_R$ are obtained analogously. Then, the three cross-fused results are concatenated to obtain the final output vector:

$z=\left[z_Y;z_G;z_R\right]$  (29)
The final combined feature z is projected onto an N-dimensional vector of answer scores, followed by a sigmoid function, and binary cross-entropy (BCE) is used as the loss function to train the network.
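A simplified sketch of the modality-mixing head and classifier follows. The attentional reduction matches Eq. (24), while the gating and mixing are a plausible reading of Eqs. (25)-(29) rather than the exact formulation; the answer vocabulary size of 3129 is taken from the experimental settings.

```python
import torch
import torch.nn as nn

class ModalityMixing(nn.Module):
    """Sketch: attentional reduction (Eq. 24), pairwise sigmoid guidance gates
    (Eqs. 25-27), cross-guided mixing (Eq. 28, simplified), concatenation (Eq. 29),
    and a final projection to answer scores trained with BCE."""
    def __init__(self, d=512, n_answers=3129):
        super().__init__()
        # Eq. 24: two-layer MLP producing per-token attention weights
        self.reduce = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1)) for _ in range(3))
        self.gate = nn.ModuleList(nn.Linear(2 * d, 1) for _ in range(3))   # Eqs. 25-27
        self.proj = nn.ModuleList(nn.Linear(d, d) for _ in range(6))       # Eq. 28 matrices
        self.norm = nn.ModuleList(nn.LayerNorm(d) for _ in range(3))
        self.classifier = nn.Linear(3 * d, n_answers)                      # answer scores

    def forward(self, y_seq, g_seq, r_seq):
        # attended vectors (Eq. 24)
        y, g, r = [(torch.softmax(red(s), dim=1) * s).sum(1)
                   for s, red in zip((y_seq, g_seq, r_seq), self.reduce)]
        # pairwise guidance gates (Eqs. 25-27)
        m_yg, m_yr, m_gr = [torch.sigmoid(gate(torch.cat(pair, -1)))
                            for gate, pair in zip(self.gate, ((y, g), (y, r), (g, r)))]
        p = self.proj
        # each feature guides the combination of the other two (Eq. 28, simplified)
        z_y = self.norm[0](m_yg * p[0](g) + m_yr * p[1](r))
        z_g = self.norm[1](m_yg * p[2](y) + m_gr * p[3](r))
        z_r = self.norm[2](m_yr * p[4](y) + m_gr * p[5](g))
        z = torch.cat([z_y, z_g, z_r], dim=-1)          # Eq. 29
        return self.classifier(z)                       # sigmoid + BCE during training

mixer = ModalityMixing()
scores = mixer(torch.randn(2, 14, 512), torch.randn(2, 64, 512), torch.randn(2, 36, 512))
loss = nn.BCEWithLogitsLoss()(scores, torch.rand(2, 3129))
print(scores.shape, float(loss))
```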
Experiments
In this section, we conduct experiments on two of the most popular datasets, VQA 2.0 [18] and GQA [19], to evaluate the effectiveness of the proposed MDVFN model. We introduce the experimental settings and details and design several ablation experiments to verify the effectiveness of the proposed modules.
Datasets
In our experiments, we employed two popular datasets: VQA 2.0 [18] and GQA [19]. We followed the standard training/testing partitioning of each dataset.
VQA-v2 dataset is widely used in visual question answering research and evaluation, and is the most commonly used VQA benchmark dataset. VQA-v2 dataset is developed based on VQA-v1 [20], and it contains human-annotated question-answer pairs related to images from the MS-COCO [28] dataset, with three questions per image and 10 answers per question. These questions come from real-world everyday scenes, including human behavior, scene analysis, object recognition, and other domains. The dataset is divided into three parts: The training set consists of 80k images and 444k QA pairs, the val set consists of 40k images and 214k QA pairs, and the test set consists of 80k images and 448k QA pairs. In addition, there are two test subsets, Test-dev and Test-std, for online evaluation of model performance. The results include three accuracies for each type (yes/no, numeric, and other) and overall accuracy.
The GQA train split contains 72,140 images with 943k questions, and the val split contains 10,243 images with 132k questions. The dataset provides an annotated scene graph for each image and ground-truth reasoning steps for each question; the reasoning steps can be used to identify visual context, and the scene graph annotations can be used to train models with perfect sight. We train our models on the GQA train split and test them on the GQA val split. GQA also has Test-dev and test splits, which are not used in our work because they do not provide scene graph and reasoning-step annotations.
Experiment and implementation details
We followed the experimental setup of [8], where the dimensions of the input grid and region image features, the input question feature dimension, and the fused multimodal feature dimension were set to 2048, 512, and 1024, respectively. The number of heads h for multi-head attention was set to 8, and the latent dimension d was set to 512, with a size of 512/8=64 for each head. The answer vocabulary size N was 3129.
To train the MDVFN model, we used the Adam solver [29] with β₁ = 0.9 and β₂ = 0.98. Following [8], the base learning rate was set to min(2.5T × 10⁻⁵, 10⁻⁴), where T is the current epoch number starting from 1. The learning rate decayed by 1/5 every two epochs after 10 epochs. The model was trained for 13 epochs with a batch size of 64.
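One concrete reading of this schedule is sketched below; the base-rate formula follows the MCAN setup of [8], and the exact decay indexing after epoch 10 is an assumption.

```python
def learning_rate(epoch):
    """Epoch-indexed learning rate (epoch starts from 1), per the schedule above;
    the decay indexing after epoch 10 is an assumption."""
    lr = min(2.5e-5 * epoch, 1e-4)          # warm-up capped at the base rate
    if epoch > 10:                          # decay by 1/5 every 2 epochs after epoch 10
        lr *= 0.2 ** ((epoch - 11) // 2 + 1)
    return lr

print([round(learning_rate(t), 7) for t in range(1, 14)])
```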
Experimental results
In the experiments, the MDVFN model was compared with representative models from recent years, including Bottom-up [22], BAN [30], DFAF [27], MCAN [8], AGAN [15], MMNAS [31], DSSQN [32], SA-VQA [33], QMRGT [41], TRARs [7], and CFR [34]. On the GQA dataset, we compared against models such as Bottom-up [22], LRTA [35], LCGN [36], LXMERT [37], Oscar [38], NSM [39], VinVL [40], SA-VQA [33], and CFR [34].
Table 1. Comparison between MDVFN and recent models without large-scale pre-training on the VQA 2.0 dataset

| Method | Test-dev All | Test-dev Yes/No | Test-dev Num | Test-dev Others | Test-std All |
|---|---|---|---|---|---|
| Bottom-up [22] | 65.32 | 81.82 | 44.21 | 57.26 | 65.67 |
| BAN [30] | 70.04 | 85.42 | 54.04 | 60.52 | 70.35 |
| DFAF [27] | 70.22 | 86.09 | 53.32 | 60.49 | 70.34 |
| MCAN [8] | 70.63 | 86.82 | 53.26 | 60.72 | 70.90 |
| AGAN [15] | 71.16 | 86.87 | 54.29 | 61.56 | 71.50 |
| MMNAS [31] | 71.24 | 87.27 | 55.68 | 61.05 | 71.46 |
| DSSQN [32] | 71.03 | 87.13 | 53.78 | 61.18 | 71.37 |
| SA-VQA [33] | 71.10 | 86.97 | 55.31 | 61.11 | 71.38 |
| QMRGT [41] | 71.29 | 87.56 | 53.50 | 61.43 | 71.41 |
| TRARs [7] | 72.00 | 87.43 | 54.69 | 62.72 | – |
| CFR [34] | 72.50 | – | – | – | – |
| Ours (MDVFN) | 72.16 | 87.53 | 55.95 | 62.74 | 72.46 |
| Ours (MDVFN†) | 72.81 | 88.21 | 56.32 | 63.31 | 73.05 |

† With grid features of resolution 16×16
According to Table 1, the overall accuracy of the proposed MDVFN model on the VQA 2.0 Test-dev split exceeds that of the other models. QMRGT is a recently proposed graph-based VQA network; the comparison shows that our model remains strongly competitive among recent models. It is particularly worth noting that MCAN is the standard modular co-attention network applied in the VQA field in recent years, and our proposed MDVFN model outperforms it on all metrics.
We also investigated the impact of grid resolution on the results. As shown in Table 1, we utilized grid features of 8×8 and 16×16. The results demonstrate that the MDVFN model performs better with the 16×16 grid. This indicates that a finer grid not only offers semantic advantages but also tightens the constraint on the fusion range during the integration of visual features, thereby reducing fusion noise.
Table 2. Comparison of the accuracy of our method against other methods on the GQA dataset

| Method | Open | Binary | Overall |
|---|---|---|---|
| Bottom-up [22] | 34.83 | 66.64 | 49.74 |
| LRTA [35] | – | – | 54.48 |
| LCGN [36] | – | – | 56.10 |
| LXMERT [37] | – | – | 60.33 |
| Oscar [38] | – | – | 61.62 |
| NSM [39] | 49.25 | 78.94 | 63.17 |
| VinVL [40] | – | – | 64.65 |
| SA-VQA [33] | 59.05 | 77.39 | 67.65 |
| CFR [34] | – | – | 72.1 |
| Ours (MDVFN) | 64.64 | 82.45 | 72.03 |
Table 2 summarizes the comparative results of our model on the GQA dataset against other advanced models. In contrast to the VQA 2.0 dataset, the GQA dataset features more compositional reasoning questions, which are advantageous for models that focus on the relationship between reasoning questions and images. However, the MDVFN model primarily emphasizes enhancing the model’s understanding of visual features through different granularities of visual features. Although our model’s accuracy on the GQA dataset is slightly lower than that of reasoning-focused models like CFR, it still significantly outperforms other model approaches.
Figure 7 shows the answers predicted by our model for several questions, together with the attention maps learned over the images and questions.
Fig. 7 [Images not available. See PDF.]
Attention maps learned by the full MDVFN model. The highlighted areas indicate where the model attends in the image, Pred indicates the answer predicted by the model, green indicates that the predicted answer is correct, and red indicates that the predicted answer does not match the ground-truth answer
Ablation studies
In this section, we conducted several ablation experiments on the VQA-v2 dataset to validate the effectiveness of the modules proposed in this paper.
Visual Feature. To intuitively demonstrate the impact of the two types of visual information on the model, we conducted experiments with different visual inputs: a single type of visual information, a basic fusion of the two types, and the full integration of both types via the MDVFN model. According to the experimental results (as shown in Table 3), the variants that use both types of visual features outperform the single-feature variants, confirming that the two kinds of visual information are complementary.
VCA. We also conducted a series of experiments to assess the effectiveness of the proposed visual cross attention (VCA) module. The experimental outcomes (Table 3) reveal that the accuracy of the Exp. 8 model significantly surpasses that of the Exp. 5 model. This indicates that leveraging the interactive relationship between the two kinds of visual information can notably enhance the model's performance on visual question answering, and that our VCA module is effective in facilitating this enhancement.
Position Embedding and Visual Fusion Graph. To demonstrate the effectiveness of the introduced visual position embedding and visual fusion graph in reducing noise in the fusion of the two visual modalities, we conducted several ablation experiments. Specifically, we incorporated position encoding in the visual fusion module in Exp. 6, whereas in Exp. 7 we integrated a visual fusion graph in the visual fusion module (as shown in Table 3). According to the experimental results, in the dual-stream visual information model, incorporating visual position embedding and visual fusion graphs significantly improves performance. This indicates that the two components reduce noise during visual information fusion, improve the efficiency of the fusion, and thus improve the model's performance. It is noteworthy that by comparing Exp. 1, Exp. 2, Exp. 3, and Exp. 4 in Table 3, adding visual position information to a single-visual-feature model actually reduces its performance. One possible reason is that, unlike other multimodal domains such as image captioning, visual question answering models pay more attention to the connection between question features and visual features than to enhancing the visual features themselves.
Modality-mixing. In Table 3, the fusion method employed in the Exp. 8 model is traditional linear fusion, while the Exp. 9 model uses the modality-mixing approach proposed in this paper. According to the experimental results, the proposed modality-mixing network better exploits the complementarity between different modalities and achieves better fusion results than the traditional linear multimodal fusion method.
Table 3. Ablation experiments on the VQA-v2 dataset

| Exp | Method | Acc |
|---|---|---|
| 1 | Base + Region (one feature) | 70.63 |
| 2 | Base + Grid (one feature) | 71.20 |
| 3 | Base + Region (one feature) + PE | 70.48 |
| 4 | Base + Grid (one feature) + PE | 71.07 |
| 5 | Two features (w/o Fusion Module) | 71.51 |
| 6 | Two features + Fusion Module + PE | 71.70 |
| 7 | Two features + Fusion Module + Graph | 71.86 |
| 8 | Two features + VCA (Fusion Module + PE + Graph) | 72.01 |
| 9 | Two features + VCA + Modality-mixing network | 72.16 |

"w/o Fusion Module" indicates that visual information fusion between the grid and the region is not performed; "Graph" refers to the visual fusion graph.
Table 4. Ablation experiments on the VQA-v2 dataset with different grid densities

| Exp | Method | Acc |
|---|---|---|
| 1 | Region & Grid (4×4) | 70.98 |
| 2 | Region & Grid (6×6) | 72.02 |
| 3 | Region & Grid (8×8) | 72.16 |
| 4 | Region & Grid (16×16) | 72.81 |
Grid density. In Table 4, we fused grid features of different densities with the region features to explore the impact of grid density on visual feature fusion. The results indicate that the performance of the proposed model improves continuously as the grid density increases. We posit that when the visual fusion graph guides the fusion of grid and region features, lower-density grids correspond to fewer but larger grid cells, and the larger fused cells introduce more noise into the fusion process. As the grid density increases, the visual fusion graph imposes tighter constraints on the fusion, because higher-density grids segment entities into finer cells and thus suppress noise during fusion. The accuracy of the 4×4 grid is 1.04% and 1.18% lower than that of the 6×6 and 8×8 grids, respectively, while the gaps among the 6×6, 8×8, and 16×16 grids are smaller. One reason for this difference is that the coarser grid carries less accurate semantic information; another is that the 4×4 grid divides an image into only 16 cells, so some region features interact with more irrelevant information.
Conclusion
In this paper, we propose a dual-stream collaborative network for visual question answering (VQA) that enables bidirectional interaction between region features and grid features. Our model combines regions and grids using visual cross attention (VCA) and introduces visual positional information and a visual fusion graph to constrain the visual fusion process, effectively reducing the noise generated during fusion. Additionally, we propose a multimodal cross-fusion mechanism to fuse the three types of output information. Extensive experiments demonstrate the superiority of our approach, with the proposed model achieving significant results on the VQA-v2 dataset. However, while the visual fusion graph built from the overlap between region and grid features effectively suppresses noise during fusion, it also restricts fusion between a region and non-overlapping areas. In future work, we aim to further improve the fusion mechanism to alleviate this limitation.
Author Contributions
Lixia Xue was involved in methodology; Wenhao Wang helped in writing—original draft preparation; Ronggui Wang helped in writing—review and editing; Juan Yang assisted in verification.
Funding
We express our sincere thanks to the anonymous reviewers for their helpful comments and suggestions to raise the standard of this paper. This work was supported partially by the National Key R&D Program of China (grant number 2020YFC1512601) and the National Natural Science Foundation of China (grant number 62106064).
Availability of data and materials
The datasets generated during and/or analyzed during the current study are not publicly available due [individual privacy] but are available from the corresponding author on reasonable request.
Declarations
Conflict of interest
Financial interest or non-financial interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Rahman, T., Xu, B., Sigal, L.: Watch, listen and tell: multi-modal weakly supervised dense event captioning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8907–8916 (2019)
2. Ji, Z; Lin, Z; Wang, H; He, Y. Multi-modal memory enhancement attention network for image-text matching. IEEE Access; 2020; 8, pp. 38438-38447. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.2975594]
3. Diao, H., Zhang, Y., Ma, L., Lu, H.: Similarity reasoning and filtration for image-text matching. In: AAAI Conference on Artificial Intelligence (2021)
4. Anderson, P., Fernando, B., Johnson, M., Gould, S.: Spice: semantic propositional image caption evaluation. In: European Conference on Computer Vision (2016)
5. Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R.: Meshed-memory transformer for image captioning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10575–10584 (2019)
6. Luo, Y., Ji, J., Sun, X., Cao, L., Wu, Y., Huang, F., Lin, C.-W., Ji, R.: Dual-level collaborative transformer for image captioning. arXiv:2101.06462 (2021)
7. Zhou, Y., Ren, T., Zhu, C., Sun, X., Liu, J., Ding, X.-h., Xu, M., Ji, R.: Trar: Routing the attention spans in transformer for visual question answering. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2054–2064 (2021)
8. Yu, Z., Yu, J., Cui, Y., Tao, D., Tian, Q.: Deep modular co-attention networks for visual question answering. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6274–6283 (2019)
9. Zheng, X; Wang, B; Du, X; Lu, X. Mutual attention inception network for remote sensing visual question answering. IEEE Trans. Geosci. Remote Sens.; 2022; 60, pp. 1-14. [DOI: https://dx.doi.org/10.1109/TGRS.2022.3225843]
10. Kim, J.-H., Lee, S.-W., Kwak, D., Heo, M.-O., Kim, J., Ha, J.-W., Zhang, B.-T.: Multimodal residual learning for visual qa. In: NIPS (2016)
11. Chen, K., Wang, J., Chen, L.-C., Gao, H., Xu, W., Nevatia, R.: Abc-cnn: An attention based convolutional neural network for visual question answering. arXiv:1511.05960 (2015)
12. Ren, S; He, K; Girshick, RB; Sun, J. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2015; 39, pp. 1137-1149. [DOI: https://dx.doi.org/10.1109/TPAMI.2016.2577031]
13. Jiang, H., Misra, I., Rohrbach, M., Learned-Miller, E.G., Chen, X.: In defense of grid features for visual question answering. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10264–10273 (2020)
14. Shih, K.J., Singh, S., Hoiem, D.: Where to look: Focus regions for visual question answering. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4613–4621 (2015)
15. Zhou, Y., Ji, R., Sun, X., Luo, G., Hong, X., Su, J., Ding, X., Shao, L.: K-armed bandit based multi-modal network architecture search for visual question answering. In: Proceedings of the 28th ACM International Conference on Multimedia (2020)
16. Ben-younes, H., Cadène, R., Cord, M., Thome, N.: Mutan: multimodal tucker fusion for visual question answering. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2631–2639 (2017)
17. Ben-younes, H., Cadène, R., Thome, N., Cord, M.: BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection (2019)
18. Goyal, Y; Khot, T; Summers-Stay, D; Batra, D; Parikh, D. Making the v in vqa matter: elevating the role of image understanding in visual question answering. Int. J. Comput. Vis.; 2016; 127, pp. 398-414. [DOI: https://dx.doi.org/10.1007/s11263-018-1116-0]
19. Hudson, D.A., Manning, C.D.: GQA: a new dataset for real-world visual reasoning and compositional question answering. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6693–6702 (2019)
20. Agrawal, A; Lu, J; Antol, S; Mitchell, M; Zitnick, CL; Parikh, D; Batra, D. Vqa: visual question answering. Int. J. Comput. Vis.; 2015; 123, pp. 4-31.
21. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21–29 (2015)
22. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2017)
23. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. arXiv:1606.00061 (2016)
24. Yu, Z; Yu, J; Xiang, C; Fan, J; Tao, D. Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans. Neural Netw. Learn. Syst.; 2017; 29, pp. 5947-5959. [DOI: https://dx.doi.org/10.1109/TNNLS.2018.2817340]
25. Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A.D., Fergus, R.: Simple baseline for visual question answering. arXiv:1512.02167 (2015)
26. Nam, H., Ha, J.-W., Kim, J.: Dual attention networks for multimodal reasoning and matching. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2156–2164 (2016)
27. Gao, P., Jiang, Z., You, H., Lu, P., Hoi, S., Wang, X., Li, H.: Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6632–6641 (2018)
28. Ren, M., Kiros, R., Zemel, R.S.: Exploring models and data for image question answering. In: NIPS (2015)
29. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR arxiv:1412.6980 (2014)
30. Kim, J.-H., Jun, J., Zhang, B.-T.: Bilinear attention networks. In: Neural Information Processing Systems (2018)
31. Yu, Z., Cui, Y., Yu, J., Wang, M., Tao, D., Tian, Q.: Deep multimodal neural architecture search. In: Proceedings of the 28th ACM International Conference on Multimedia (2020)
32. Shen, X., Han, D., Chang, C.C., Zong, L.: Dual self-guided attention with sparse question networks for visual question answering. IEICE Trans. Inf. Syst. 105-D, 785–796 (2022)
33. Xiong, P., You, Q., Yu, P., Liu, Z., Wu, Y.: Sa-vqa: structured alignment of visual and semantic representations for visual question answering. arXiv:2201.10654 (2022)
34. Nguyen, B.X., Do, T., Tran, H., Tjiputra, E., Tran, Q.D., Nguyen, A.: Coarse-to-fine reasoning for visual question answering. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 4557–4565 (2022)
35. Liang, W., Niu, F., Reganti, A., Thattai, G., Tur, G.: LRTA: a transparent neural-symbolic reasoning framework with modular supervision for visual question answering. arXiv:2011.10731 (2020)
36. Hu, R., Rohrbach, A., Darrell, T., Saenko, K.: Language-conditioned graph networks for relational reasoning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10293–10302 (2019)
37. Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: Conference on Empirical Methods in Natural Language Processing (2019)
38. Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., Gao, J.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: European Conference on Computer Vision (2020)
39. Hudson, D.A., Manning, C.D.: Learning by abstraction: the neural state machine. In: Neural Information Processing Systems (2019)
40. Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., Gao, J.: VinVL: revisiting visual representations in vision-language models. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5575–5584 (2021)
41. Xu, Z; Gu, J; Liu, M; Zhou, G; Fu, H; Qiu, C. A question-guided multi-hop reasoning graph network for visual question answering. Inf. Process. Manag.; 2023; 60, [DOI: https://dx.doi.org/10.1016/j.ipm.2022.103207] 103207.