
Abstract

Given a question–image input, a visual commonsense reasoning (VCR) model predicts an answer together with a supporting rationale, which requires inference grounded in real-world knowledge. The VCR task, which demands exploiting multi-source information, different levels of understanding, and extensive commonsense knowledge, is a cognition-level scene understanding challenge. It has attracted researchers' interest due to its wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches to the VCR task have generally relied on pre-training or on memory-based models that encode long-term dependencies. However, these approaches suffer from limited generalizability and information loss over long sequences. In this work, we propose a parallel attention-based cognitive VCR network, termed PAVCR, which fuses visual–textual information efficiently and encodes semantic information in parallel, enabling the model to capture rich information for cognition-level inference. Extensive experiments show that the proposed model yields significant improvements over existing methods on the benchmark VCR dataset. Moreover, the proposed model provides an intuitive interpretation of visual commonsense reasoning.
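The abstract does not detail PAVCR's internals, so the following is only a minimal illustrative sketch of what a parallel attention-based visual–textual fusion module could look like, assuming a generic cross-attention design in PyTorch. All names (ParallelAttentionFusion, d_model, n_heads) are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: generic parallel cross-attention fusion of visual
# and textual features; NOT the authors' PAVCR implementation.
import torch
import torch.nn as nn


class ParallelAttentionFusion(nn.Module):
    """Fuse visual and textual features via two cross-attention branches run in parallel."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Branch 1: textual queries attend to visual keys/values.
        self.text_to_image = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Branch 2: visual queries attend to textual keys/values.
        self.image_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Fuse the two branch summaries into a single joint representation.
        self.fuse = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU())

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_tokens, d_model); image_feats: (batch, n_regions, d_model)
        t2i, _ = self.text_to_image(text_feats, image_feats, image_feats)
        i2t, _ = self.image_to_text(image_feats, text_feats, text_feats)
        # Mean-pool each branch and concatenate before projecting.
        joint = torch.cat([t2i.mean(dim=1), i2t.mean(dim=1)], dim=-1)
        return self.fuse(joint)  # (batch, d_model)


if __name__ == "__main__":
    fusion = ParallelAttentionFusion()
    text = torch.randn(2, 20, 512)    # e.g., question + candidate answer tokens
    image = torch.randn(2, 36, 512)   # e.g., detected object region features
    print(fusion(text, image).shape)  # torch.Size([2, 512])
```

The two attention branches run independently over the same inputs, so visual-conditioned and text-conditioned views are computed in parallel rather than sequentially, which is the general idea the abstract alludes to when describing parallel encoding of semantic information.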

Details

Title
Attention Mechanism-Based Cognition-Level Scene Understanding
Author
Tang, Xuejiao 1; Zhang, Wenbin 2

1 Institute for Information Processing, Leibniz University Hannover, Welfengarten 1, 30167 Hannover, Germany; [email protected]
2 Knight Foundation School of Computing & Information Sciences, Florida International University, Miami, FL 33199, USA
First page
203
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
2078-2489
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3181516715
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).