
Abstract

Traditional multimodal contrastive learning pulls a text and its corresponding image together as a positive pair, where the text typically follows a fixed sentence template or specific descriptive statement and the image is usually represented by global features (with some fine-grained work using local features). As in unimodal self-supervised contrastive learning, this approach can be seen as enforcing a strict identity constraint in a multimodal context. However, remote sensing images are inherently complex and cannot be adequately described in a single sentence, and they carry rich ancillary information beyond object features alone, so this strict identity constraint may be insufficient. To fully leverage the characteristics of remote sensing images, we propose a multimodal contrastive learning method for remote sensing image feature extraction based on a tripartite relaxation of positive samples, in which the model is relaxed in three respects. The first relaxation concerns both the text and image inputs: by introducing learnable parameters into the language and image branches instead of relying on fixed sentence structures and fixed image features, the network can describe remote sensing images in text more flexibly and extract ancillary information from the image features, thereby relaxing the input constraints. The second relaxation is achieved through multimodal alignment of various features: by aligning semantic information with the corresponding semantic regions in the images, the method relaxes local image features under semantic constraints, which addresses the problem of selecting image patches in unimodal settings, where no semantic constraint is available. The proposed method has been validated on four datasets; on the PatternNet dataset, it achieved 91.1% accuracy with just one shot.
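The abstract's first relaxation (learnable parameters replacing a fixed sentence template in the text branch) and the underlying image-text contrastive objective can be illustrated with a minimal sketch. The sketch below assumes a CLIP-style setup in PyTorch; the module name, dimensions, and temperature are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RelaxedPromptTextBranch(nn.Module):
    # Hypothetical module: prepends learnable context vectors to class-name
    # token embeddings, replacing a fixed template such as "a photo of a {class}".
    def __init__(self, embed_dim=512, n_ctx=8):
        super().__init__()
        # Learnable context vectors: the "relaxed" part of the text input.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, class_embeds):  # class_embeds: (n_classes, n_tokens, d)
        n_cls = class_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, class_embeds], dim=1)  # (n_classes, n_ctx + n_tokens, d)

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    # Symmetric InfoNCE: matched image-text pairs on the diagonal are positives.
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    logits = img_feats @ txt_feats.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

A training step would encode a batch of images and the prompted texts with their respective encoders and minimize contrastive_loss over the batch; the second relaxation described in the abstract (aligning semantic tokens with local image regions) would add an analogous loss between token features and patch features.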

Details

Title
Multimodal Contrastive Learning for Remote Sensing Image Feature Extraction Based on Relaxed Positive Samples
Author
Zhang, Zhenshi 1; Li, Qiujun 2; Jing, Wenxuan 2; He, Guangjun 3; Zhu, Lili 4; Gao, Shijuan 2

1 College of Basic Education, National University of Defense Technology, Changsha 410073, China; [email protected]
2 School of Geosciences and Info-Physics, Central South University, Changsha 410083, China; [email protected] (Q.L.); [email protected] (W.J.)
3 State Key Laboratory of Space-Ground Integrated Information Technology, Beijing Institute of Satellite Information Engineering, Beijing 100086, China; [email protected]
4 Hunan Key Laboratory of Land Resources Evaluation and Utilization, Hunan Provincial Institute of Land and Resources Planning, Changsha 410083, China
First page
7719
Publication year
2024
Publication date
2024
Publisher
MDPI AG
e-ISSN
1424-8220
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3144170615
Copyright
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.