
Abstract

Given a question–image input, a visual commonsense reasoning (VCR) model predicts an answer together with a supporting rationale, which requires inference grounded in real-world knowledge. The VCR task, which demands exploiting multi-source information, different levels of understanding, and extensive commonsense knowledge, is a cognition-level scene understanding challenge. It has attracted researchers' interest due to its wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches to the VCR task have generally relied on pre-training or on memory-based models that encode long-term dependencies. However, these approaches suffer from limited generalizability and information loss over long sequences. In this work, we propose a parallel attention-based cognitive VCR network, termed PAVCR, which fuses visual–textual information efficiently and encodes semantic information in parallel, enabling the model to capture rich information for cognition-level inference. Extensive experiments show that the proposed model yields significant improvements over existing methods on the benchmark VCR dataset. Moreover, the proposed model provides an intuitive interpretation of visual commonsense reasoning.
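The abstract does not detail PAVCR's internals, so the following is only a minimal illustrative sketch of what a parallel attention-based visual–textual fusion module could look like, assuming a generic cross-attention design in PyTorch. All names (ParallelAttentionFusion, d_model, n_heads) are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: generic parallel cross-attention fusion of visual
# and textual features; NOT the authors' PAVCR implementation.
import torch
import torch.nn as nn


class ParallelAttentionFusion(nn.Module):
    """Fuse visual and textual features via two cross-attention branches run in parallel."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Branch 1: textual queries attend to visual keys/values.
        self.text_to_image = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Branch 2: visual queries attend to textual keys/values.
        self.image_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Fuse the two branch summaries into a single joint representation.
        self.fuse = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU())

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_tokens, d_model); image_feats: (batch, n_regions, d_model)
        t2i, _ = self.text_to_image(text_feats, image_feats, image_feats)
        i2t, _ = self.image_to_text(image_feats, text_feats, text_feats)
        # Mean-pool each branch and concatenate before projecting.
        joint = torch.cat([t2i.mean(dim=1), i2t.mean(dim=1)], dim=-1)
        return self.fuse(joint)  # (batch, d_model)


if __name__ == "__main__":
    fusion = ParallelAttentionFusion()
    text = torch.randn(2, 20, 512)    # e.g., question + candidate answer tokens
    image = torch.randn(2, 36, 512)   # e.g., detected object region features
    print(fusion(text, image).shape)  # torch.Size([2, 512])
```

The two attention branches run independently over the same inputs, so visual-conditioned and text-conditioned views are computed in parallel rather than sequentially, which is the general idea the abstract alludes to when describing parallel encoding of semantic information.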

Details

Title
Attention Mechanism-Based Cognition-Level Scene Understanding
Author
Tang, Xuejiao 1; Zhang, Wenbin 2

1 Institute for Information Processing, Leibniz University Hannover, Welfengarten 1, 30167 Hannover, Germany; [email protected]
2 Knight Foundation School of Computing & Information Sciences, Florida International University, Miami, FL 33199, USA
First page
203
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
2078-2489
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3181516715
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).