Full Text

Turn on search term navigation

1. Introduction

Remote sensing imaging technology has significantly advanced in the last decades. Modern airborne sensors provide a large coverage of the Earth’s surface with improved spatial, spectral and temporal resolutions, thereby playing a crucial role in numerous research areas, including ecology, environmental science, soil science, water contamination, glaciology, land surveying and analysis of the crust of the Earth. Automatic analysis of remote sensing imaging brings unique challenges, such as data are generally multi-modal (e.g., optical or synthetic aperture radar sensors), located in the geographical space (geo-located) and typically on a global scale with ever growing data volumes.

Deep learning, especially convolutional neural networks (CNNs), has dominated many areas of computer vision, including object recognition, detection and segmentation. These networks typically take an RGB image as an input and perform a series of convolution, local normalization and pooling operations. CNNs typically rely on a large amount of training data, and the resulting pre-trained models are then utilized as generic feature extractors for a variety of downstream applications. The success of deep learning-based techniques in computer vision has also inspired the remote sensing community with significant advances being made in many remote sensing tasks, including hyperspectral image classification, change detection and very high-resolution satellite instance segmentation.

One of the main building blocks in CNNs is the convolution operation, which captures local interactions between elements (e.g., contour and edge information) in the input image. CNNs encode biases, such as spatial connectivity and translation equivariance. These charactertistics aid in constructing generalizable and efficient architectures. However, the local receptive field in CNNs limits modeling long-range dependencies in an image (e.g., distant part relationships). Moreover, convolutions are content-independent as the convolutional filter weights are stationary with same weights applied to all inputs regardless of their nature. Recently, vision transformers (ViTs) [1] have demonstrated impressive performance across a variety of tasks in computer vision. ViTs are based on the self-attention mechanism that effectively captures global interactions by learning the relationships between the elements of a sequence. Recent works [2,3] have shown that ViTs possess content-dependent long-range interaction modeling capabilities and can flexibly adjust their receptive fields to counter nuisances in data and learn effective feature representations. As a result, ViTs and their variants have been successfully utilized for many computer vision tasks, including classification, detection and segmentation.

Following the success of ViTs in computer vision, the remote sensing community has also witnessed a significant growth (see Figure 1) in the employment of transformer-based frameworks in many tasks, such as very high-resolution image classification, change detection, pan sharpening, building detection and image captioning. This has started a new wave of promising research in remote sensing with different approaches utilizing either ImageNet pre-training [4,5,6] or performing remote sensing pre-training [7] with vision transformers. Similarly, there exist approaches in the literature that are based on pure transformer design [8,9] or utilize a hybrid approach [10,11,12] based on both transformers and CNNs. It is, therefore, becoming increasingly challenging to keep pace with the recent progress due to the rapid influx of transformer-based methods for different remote sensing problems. In this work, we these advances and present an account of recent transformer-based approaches in the popular field of remote sensing. To summarize, our main contributions are the following:

We present a holistic overview of applications of transformer-based models in remote sensing imaging. To the best of our knowledge, we are the first to present a survey on transformers in remote sensing, thereby bridging the gap between recent advances in computer vision and remote sensing in this rapidly growing and popular area.
We present an overview of both CNNs and transformers, discussing their respective strengths and weaknesses.
We present a review of more than 60 transformer-based research works in the literature to discuss the recent progress in the field of remote sensing.
Based on the presented review, we discuss different challenges and research directions on transformers in remote sensing.

The rest of the paper is organized as follows: Section 2 discusses other related surveys on remote sensing imaging. In Section 3, we present an overview of different imaging modalities in remote sensing, whereas Section 4 provides a brief overview of CNNs and vision transformers. Afterwards, we review advances with respect to transformer-based approaches in very high-resolution (VHR) imaging (Section 5), hyperspectral image analysis (Section 6) and synthetic aperture radar (SAR) in Section 7. In Section 8, we conclude our survey and discuss potential future research directions.

2. Related Work

In the literature, several works have performed a review of machine learning techniques for remote sensing imaging in the past decade. Tuia et al. [13] compare and evaluate different active learning algorithms for the supervised remote sensing image classification task. The work of [14] focuses on the problem of hyperspectral image classification and reviews recent advances in relation to machine learning and vision techniques. Zhu et al. [15] present a comprehensive review of utilizing deep learning techniques for remote sensing image analysis. Their work provides a comprehensive review of the existing approaches along with describing a list of resources about deep learning in remote sensing. Ma et al. [16] review major deep learning concepts in remote sensing with respect to image resolution and study area. To this end, their work studies different remote sensing tasks, such as image registration, fusion, scene classification and object segmentation.

Recently, transformer-based approaches have witnessed a significant surge within the computer vision community, following the breakthrough from transformer-based models [17] in natural language processing (NLP). Khan et al. [18] present an overview of the transformer models in vision with emphasis on recognition, generative modeling, multi-modal, video processing and low-level vision tasks. Shamshad et al. [19] survey the use of transformer models in medical imaging, focusing on different medical imaging tasks, such as segmentation, detection, reconstruction, registration and clinical medical report generation. The work of [20] presents an overview of the growing trend of using transformers to model video data. Their work also compares the performance of vision transformers on different video tasks, such as action recognition.

Different from the aforementioned surveys, our work presents a review of recent advances of transformer-based approaches in the popular area of remote sensing. To the best of our knowledge, this is the first survey presenting a comprehensive account of transformers in remote sensing, particularly dedicated to progress in very high-resolution, hyperspectral and synthetic aperture radar image analysis.

3. Remote Sensing Imaging Data

Remote sensing imagery is generally acquired from a range of sources, as well as data collection techniques. Remote sensing image data can be typically characterized by their spatial, spectral, radiometric, and temporal resolutions. Spatial resolution refers to each pixel size within an image along with the area of the surface of the Earth represented by that corresponding pixel. Spatial resolution characterizes the small and fine-detailed features in an imaging scene that can be separated. Spectral resolution refers to the capability of the sensor to collect information about the scene by discerning finer wavelengths with narrower bands (e.g., 10 nm). On the other hand, radiometric resolution characterizes the extent of the information in each pixel, where a larger dynamic range for a sensor implies more details are to be discerned in the image. The temporal resolution refers to the time it takes between consecutive images of the same location on ground acquired by the sensor. Here, we briefly discuss commonly utilized remote sensing imaging types with examples shown in Figure 2.

Very High-resolution Imagery: In recent years, the emergence of very high-resolution (VHR) satellite sensors has paved the way towards yielding the higher spatial resolution imagery beneficial for land use change detection, object-based image analysis (object detection and instance segmentation), precision agriculture farming (e.g., management of crops, soil and pests) and emergency responses. Furthermore, these recent advances in sensor technology, along with new deep learning-based techniques, allow the use of VHR remote sensing imagery to analyze the biophysical and biogeochemical processes both in coastal and inland waters. Nowadays, optical sensors produce panchromatic and multispectral imagery of the Earth’s surface at a much finer spatial resolutions (e.g., 10 to 100 cm/pixel).

Hyperspectral Imagery: Here, each pixel in the scene is captured using a continuous spectrum of light with fine wavelength resolutions. The continuous spectrum extends wavelengths beyond the visible spectrum and includes wavelengths from ultraviolet (UV) to infrared (IR). Generally, the spectral resolution of hyperspectral images are expressed using the wave number along with the nanometers (nm). The most popular continuous spectrum used for measuring the pixels is mid-infrared, which is near infrared and visible wavelength bands. In order to acquire hyperspectral imagery, there are different electromagnetic measurements, such as Raman spectroscopy, X-ray spectroscopy, Terahertz spectroscopy, 3D ultrasonic imaging, magnetic resonance and confocal laser microscopy scanners, that can measure the entire emission spectrum for each pixel at a specific excitation wavelength. The hyperspectral images have high dimensionality and strong resolving power for fine spectra. The imagery offers a wide range of applications, including in environmental science [21] and mining [22]. Different from regular images that contain only the primary colors (red, green and blue) within the visible spectrum, hyperspectral images are rich in spectral information that can reflect the physical structure and chemical composition of the item of interest. In remote sensing, automatically analyzing hyperspectral imagery is an active research topic.

Synthetic Aperture Radar Imagery: A large amount of synthetic aperture radar (SAR) images are produced by Earth observation satellites every day through emission and reception of electromagnetic signals. In the past decades, SAR images have gained popularity due to their higher spatial resolution, all-weather capability, de-speckling tools, such as CAESAR, along with recent advances in the SAR specific image processing. SAR imagery can be used for numerous applications, including geographical localization, object detection, functionalities of basic radars and geophysical feature estimation of complex settings, such as roughness, moisture content and density. Furthermore, SAR imagery can be used for disaster management (oil slick detection and ice tracking), forestry and hydrology.

4. From CNNs to Vision Transformers

In this section, we first present a brief overview of CNNs and then provide a brief description of vision transformers recently utilized for different vision tasks.

4.1. Convolutional Neural Networks

Convolutional neural networks (CNNs) have dominated a variety of computer vision tasks, including image classification [23] and object detection [24]. CNNs are typically made up of series of two main parts: convolutional and pooling layers. The convolutional layer produces feature maps by convolving the local region in the input with a set of kernels. These features are subjected to a non-linear function with the same process repeated for each convolutional layer. In CNNs, the pooling layer carries out a downsampling operation (typically utilizing the max or mean operation) to feature maps. In different existing CNN architectures, the convolutional and pooling layers are followed by a set of fully connected layers, where the last fully connected layer is the softmax computing each object category score.

Popular CNN Backbones: Here, we briefly discuss different popular CNN backbone architectures in the literature.

AlexNet: Krizhevsky et al. [23] propose a CNN architecture, named AlexNet, for the image classification task. AlexNet comprises five convolutional layers followed by three fully-connected layers. The proposed network architecture utilizes Rectified Linear Units (ReLU) for training efficiency. The network contains 60 million parameters and 500,000 neurons with network training performed on the large-scale ImageNet dataset [25]. Different data augmentation techniques are employed to increase the training set. In the ImageNet 2012 competition, AlexNet achieved a competitive performance with top-1 and top-5 error rates of 39.7% and 18.9%, respectively.

VGGNet: Different from AlexNet, Simonyan and Zisserman [26] introduced an architecture named VGGNet that comprises 16 layers in total. The network takes an input image of 224 × 224 size and has around 138 million parameters. It uses different data augmentation techniques, including scale jittering, during network training. The VGGNet architecture comprises convolution layers of 3 × 3 filter, where the receptive fields are convolved at each pixel with a stride of one pixel. The VGGNet contains multiple pooling layers, performing spatial pooling over 2 × 2 windows with a stride of two pixels. Furthermore, VGGNet contains two fully connected layers followed by a softmax for yielding output predictions. The VGG architecture achieved top classification accuracy on the 2014 ImageNet classification challenge.

ResNet: Different from AlexNet and VGGNet, He et al. [27] introduced residual neural networks (ResNet) that stacks residual blocks to build a network. ResNet provides a residual learning approach for training networks that are much deeper than their previously utilised counterparts. Instead of learning unreferenced functions, it explicitly reformulates the layers as learning residual functions with reference to the layer inputs. Extensive empirical evidence demonstrates that residual networks are easier to optimize with improved accuracy from higher depth.

The development of CNN-based architectures has led to the rise of novel techniques, improved hardware (e.g., GPUs and TPUs), better optimization methods and many open-source libraries. Interested readers can go through the survey papers related to CNN methods for remote sensing [15,16]. Previous works have analyzed that CNNs are able to capture image-specific inductive bias, which increases their effectiveness in learning better feature representations. However, CNNs do not capture long-range dependencies that aid enhanced expressivity of the representations. Next, we briefly present vision transformers that are capable of modelling long-range dependencies in the images.

4.2. Vision Transformers

Recently, transformer-based models have achieved promising results across many computer vision and natural language processing (NLP) tasks. Vaswani et al. [17] first introduced transformers as an attention-driven model for machine translation applications. To capture the long-range dependencies, transformers use self-attention layers instead of the traditional recurrent neural network that struggles to encode such dependencies between the elements of a sequence.

To effectively capture the long-range dependencies within an input image, the work of [1] introduces vision transformers (ViTs) for the image recognition task, as shown in Figure 3. ViTs [1] interpret an image as a sequence of patches and process it via a conventional transformer encoder similar to those used in NLP tasks. The success of ViTs in generic visual data have sparked the interest not only in different areas of computer vision, but also in the remote sensing community, where a number of ViT-based techniques have been explored in recent years for various tasks.

Next, we briefly describe the key component of self-attention within transformers.

Self-Attention: The self-attention mechanism has been an integral component of transformers as it captures the long-range dependencies and encodes the interaction between all of the sequences tokens (patch embedding). The key idea of self-attention is to learn self-alignment, that is, to update the token by aggregating global knowledge from all the other tokens in the sequence [28]. Given a 2D image $x \in R^{H \times W \times C}$ , the process starts with flattening the image into a series of 2D patches $x_{p} a t \in R^{M \times (P^{2} C)}$ , where C represents number of channels, H and W represent the height and width of the image, respectively, $P \times P$ is the dimension of each individual patch and $M = H W / P^{2}$ represents the total number of patches. A learnable linear projection layer of E dimension is used to project these flattened patches and can be showed as a matrix $X \in R^{N \times E}$ . The aim of the self-attention is to apprehend the interaction among all the M embeddings, which is achieved by introducing the three learnable weight matrices to modify input X into queries (as $W^{Q} \in R^{E \times E_{q}}$ ), keys (as $W^{K} \in R^{E \times E_{k}}$ ) and values (as $W^{V} \in R^{E \times E_{v}}$ ), where $E_{q} = E_{k}$ . The sequence X is first projected onto these weight matrices to obtain $K = X W^{K}$ , $V = X W^{V}$ and $Q = X W^{Q}$ . The relative attention matrix $A \in R^{M \times M}$ is

(1) $Z = s o f t m a x (\frac{Q K^{T}}{\sqrt{E_{q}}}) V$

Masked Self-Attention: All entities are attended to the usual self-attention layer. These self-attention blocks used in the decoder for the transformer model [17], which is trained to anticipate the next entity in the sequence, are masked to prevent attending to the subsequent entities. This task is performed by an element-wise multiplication operation with a mask $M \in R^{n \times n}$ , where $M$ is an upper-triangular matrix. Here, masked self-attention is represent by

(2) $s o f t m a x (\frac{Q K^{T}}{\sqrt{d_{q}}} \circ M)$

where ∘ represents the Hadamard product. In masked self-attention, the attention ratings of future entities are set to zero when predicting an entity in the sequence.

Multi-Head Attention: Multi-head attention (MHA) comprises multiple self-attention blocks concatenated simultaneously channel-wise in order to capture different complex interactions between different sequences of embeddings. Each of the head of the multi-head self-attention has its own learnable weight matrices represented as $W^{Q_{i}}$ , $W^{K_{i}}$ and $W^{V_{i}}$ , where $i = 0 \cdot \cdot \cdot \cdot \cdot \cdot (h - 1)$ and h denotes the number head in multi-head self-attention. Hence, we can express

(3) $M H A (Q, K, V) = [Z_{0}, . . ., Z_{h - 1}] W^{O}$

where the output of each head is concatenated to form single matrix

B \in R^{M \times h \cdot E_{v}}

, whereas

W^{O} \times R^{h . E_{v} \times M}

computes the linear transformation of the heads.

Popular Transformers Backbones: Here, we briefly discuss some recent transformer-based backbones.

ViT: The work of [1] introduces an architecture, where a pure transformer is utilized directly to a sequence of image patches for the task of image classification. The ViT architecture design does not employ image-specific inductive biases (e.g., translation equivariance and locality), and the pre-training is performed on large-scale ImageNet-21k or JFT-300M dataset.

Swin: Liu et al. [29] improved the ViT design by introducing an architecture that produces hierarchical feature representation. The Swin transformer has linear computational complexity with respect to input image size, where the efficiency is achieved by restricting the self-attention computation to non-overlapping local windows while enabling cross-window connection.

PVT: The work of [30] introduces a pyramid vision transformer (PVT) architecture to perform pixel-level dense prediction tasks. The PVT architecture utilizes a progressively shrinking pyramid and a spatial-reduction attention layer for producing high-resolution multi-scale feature maps. The PVT backbone has shown to achieve impressive performance on object detection and segmentation tasks compared to its CNN counterpart with a similar number of parameters.

Transformers offer unique characteristics that are useful for different vision tasks. Compared to the convolution operation in CNNs, where static filters are computed, filters in self-attention are dynamically calculated. Furthermore, permutations and changes in the number of input points have little effect on self-attention. Recent studies [2,3] have explored different interesting properties of vision transformers and compare them with CNNs. For instance, the recent work of [2] shows that vision transformers are more robust to severe occlusions, domain shifts and perturbations. Next, we present a review of transformers in remote sensing based on the taxonomy shown in Figure 4.

5. Transformers in VHR Imagery

Here, we review transformer-based approaches utilized to address different problems in very-high resolution (VHR) imagery.

5.1. Scene Classification

Remote sensing scene classification is a challenging problem, where the task is to automatically associate a semantic category label to a given high-resolution image comprising ground objects and different land cover types. Among the existing vision transformer-based VHR scene classification approaches, Bazi et al. [4] explore the impact of the standard vision transformer architecture of [1] (ViT) and investigate different data augmentation strategies for generating addition data. In addition, their work also evaluates the impact of compressing the network by pruning the layers while maintaining the classification accuracy. The work of [31] introduces a joint CNN-transformer framework, where there is one CNN stream and another ViT stream, as shown in Figure 5. The features from the two streams are concatenated and the entire framework is trained using a joint loss function, comprising cross-entropy and center losses, to optimize the two-stream architecture. Zhang et al. [32] introduce a framework, called Remote Sensing Transformer (TRS), that strives to combine the merits of CNNs and transformers by replacing the spatial convolutions with multi-head self-attention. The resulting multi-head self attention bottleneck has fewer parameters and is shown to be effective compared to other bottlenecks. The work of [5] introduces a two-stream Swin transformer network (TSTNet) that comprises two streams: original and edge. The original stream extracts standard image features, whereas the edge stream contains a differentiable edge Sobel operator module and provides edge information. Further, a weighted feature fusion module is introduced to effectively fuse the features from the two streams for boosting the classification performance. The work of [6] introduces a transformer-based framework with a patch generation module designed to generate homogeneous and heterogeneous patches. The patch generation module generates the heterogeneous patches directly, whereas the homogeneous patches are obtained using a superpixel segmentation method.

Remote Sensing Pre-training: Different from the aforementioned approaches that either use only transformers or hybrid CNN-transformer designs with backbone networks pretrained on ImageNet datasets, the recent work of [7] investigates training vision transformer backbones, such as Swin, from scratch on the large-scale MillionAID remote sensing dataset [33]. The resulting trained backbone models are then fine-tuned for different tasks, including scene classification. Figure 6 shows the response maps, obtained using Grad-CAM++ [34], of different ImageNet (IMP) and remote sensing pre-trained (RSP) models. It can be observed that RSP models learn better semantic representations by paying more attention to the important targets compared to their IMP counterparts. Furthermore, the transformer-based backbones, such as Swin-T, better capture the contextual information due to the self-attention mechanism. Moreover, backbones, such as ViTAEv2-S, that combine the merits of CNNs and transformers along with RSP can achieve better recognition performance.

Table 1 shows a comparison of the aforementioned classification approaches on one of the most commonly used VHR classification benchmarks: AID [35]. The AID dataset contains images acquired from multi-source sensors. The dataset possesses a high degree of intra-class variation since the images are collected from different countries under different times and seasons with variable imaging conditions. There are in total 10,000 images in the dataset and 30 categories. The performance is measured in terms of mean classification accuracy over all the categories. For more details on AID, we refer to [35]. Other than RSP that performs an initial pre-training on the Million-AID dataset, all approaches here utilize models pre-trained on the ImageNet benchmark.

5.2. Object Detection

Localizing objects in VHR imaging is a challenging problem due to extreme scale variations and the diversity of different object classes. Here, the task is to simultaneously recognize and localize (either rectangle or oriented bounding-boxes) all instances belonging to different object categories in an image. Most existing approaches employ a hybrid strategy by combining the merits of CNNs and transformers within existing two-stage and single-stage detectors. Other than the hybrid strategy, few recent works explore the DETR-based transformers object detection paradigm [36].

Hybrid CNN-Transformers based Methods: The work of [37] introduces a local perception Swin transformer (LPSW) backbone to improve the standard transformers for detecting small-sized objects in VHR imagery. The proposed LPSW strives to combine the merits of transformers and CNNs to improve the local perception capabilities for better detection performance. The proposed approach is evaluated with different detectors, such as Mask RCNN [38]. The work of [39] introduces a transformer-based detection architecture, where a pre-trained CNN is used to extract features and a transformer is adapted to process a feature pyramid of a remote sensing image. Zhang et al. [40] introduce a detection framework where an efficient transformer is utilized as a branch network to improve CNN’s ability to encode global features. Additionally, a generative model is employed to expand the input remote sensing aerial images ahead of the backbone network. The work of [41] proposes a detection framework based on RetinaNet, where a feature pyramid transformer (FPT) is utilized between the backbone network and the post-processing network to generate semantically meaningful features. The FPT enables the interaction among features at different levels across the scale. The work of [42] introduces a framework where transformers are adopted to model the relationship of sampled features in order to group them appropriately. Consequently, better grouping and bounding box predictions are obtained without any post-processing operations. The proposed approach effectively eliminates the background information, which helps in achieving improved detection performance.

Zhang et al. [43] introduce a hybrid architecture that combines the local characteristics of depth separable convolutions with the global (channel) characteristics of MLP. The work of [44] introduces a two-stage angle-free detector, where both the RPN and regression are angle-free. Their work also evaluates the proposed detector with a transformer-based backbone (Swin-Tiny). Liu et al. [45] propose a hybrid network architecture, called TransConvNet, that aims at combining the advantages of CNNs and transformers by aggregating both global and local information to address the rotation invariability of CNNs with a better contextual attention. Furthermore, an adaptive feature fusion network is designed to capture information from multiple resolutions. The work of [46] introduces a detection framework, called Oriented Rep-Points, that utilizes flexible adaptive points as a representation. The proposed anchor-free approach learns to select the point samples from classification, localization and orientation. Specifically, to learn geometric features for arbitrarily-oriented aerial objects, a quality assessment and sample assignment scheme is introduced that measures and identifies high-quality sample points for training, as shown in Figure 7. Furthermore, their approach utilizes a spatial constraint for penalizing the sample points that are outside the oriented box for robust learning of the points.

DETR-based Detection Methods: Few recent approaches have investigated adapting the transformer-based DETR detection framework [36] for oriented object detection in VHR imaging. The work of [47] adapts the standard DETR for oriented object detection. In their approach, an efficient encoder is designed for transformers by replacing the standard attention mechanism with a depthwise separable convolution. Dai et al. [48] propose a transformer-based detector, called AO2-DETR, where an oriented proposal generation scheme is employed to explicitly produce oriented object proposals. Furthermore, their approach comprises an adaptive oriented proposal refinement module that is designed to compute rotation-invariant features by eliminating the misalignment between region features and objects. Furthermore, a rotation-aware matching loss is utilized to perform a matching process for direct set prediction without the duplicated predictions.

Table 2 shows a comparison of the aforementioned detection approaches on the most commonly used VHR detection benchmark, DOTA [49]. The dataset comprises 2806 large aerial images of 15 different object categories: plane, baseball diamond, basketball court, soccer-ball field, bridge, ground track field, small vehicle, ship, large vehicle, tennis court, roundabout, swimming pool, harbor, storage tank and helicopter. The detection performance accuracy is measured in terms of mean average precision (mAP). For more details on DOTA, we refer to [49]. The results show that most of these recent methods obtain similar detection accuracy with a slight improvement in performance obtained when using the Swin-T backbone.

5.3. Image Change Detection

In remote sensing, image change detection is an important task for detecting changes on the surface of the Earth with numerous applications in agriculture [50,51], urban planning [52] and map revision [53]. Here, the task is to generate change maps obtained by comparing the multi-temporal or bi-temporal images with each pixel in the resulting binary change map having a value of either zero or one depending on whether the corresponding position has changed or not. Among the recent transformer-based change detection approaches, Chen et al. [54] propose a bi-temporal image transformer encapsulated in a deep feature differencing-based framework that is designed to model the spatio-temporal contextual information. Within the proposed framework, the encoder is employed to capture context in token-based space-time. The resulting contextualized tokens are then fed to the decoder where the features are refined in the pixel-space. Guo et al. [55] propose a deep multi-scale Siamese architecture, called MSPSNet, that utilizes a parallel convolutional structure (PCS) and self-attention. The proposed MSPSNet performs feature integration of different temporal images via PCS and then features refinement based on self-attention to further enhance the multi-scale features. The work of [56] introduces a Swin transformer-based network with a Siamese U-shaped structure, called SwinSUNet, for change detection. The proposed SwinSUNet comprises three modules: encoder, fusion and decoder. The encoder transforms the input image into tokens and produces multi-scale features by employing a hierarchical Swin transformer. The resulting features are concatenated in the fusion having linear projection and Swin transformer blocks. The decoder contains upsampling and merging within Swin transformer blocks to progressively generate change predictions.

Wang et al. [57] introduce an architecture, called UVACD, that combines CNNs and transformers for change detection. Within UVACD, the high-level semantic features are extracted via a CNN backbone, whereas transformers are utilized to generate better change features by capturing the temporal information interaction. The work of [58] introduces a hybrid architecture, TransUNetCD, that strives to combine the merits of transformers and UNet. Here, the encoder takes features extracted from CNNs and enriches them with global contextual information. The corresponding features are then unsampled and combined with multi-scale features to obtain global-local features for localization. The work of [59] introduces a hybrid multi-scale transformer, called Hybrid-TransCD, that captures both fine-grained and large object features by utilizing heterogeneous tokens via multiple receptive fields.

Table 3 shows a comparison of aforementioned change detection approaches on the most commonly used benchmarks: WHU [60] and LEVIR [61]. The WHU dataset comprises a single pair of high-resolution (0.075m) images. Here, the images are of size 32,507 × 15,354. The LEVIR dataset comprises 637 pairs of high-resolution (0.5 m) images. The images are of size 1024 × 1024. The performance is measured in terms of the F1 score with respect to the change category. Figure 8 presents a qualitative comparison of different methods with SwinSUNet on example images from the WHU-CD dataset.

5.4. Image Segmentation

In remote sensing, automatically segmenting an image into semantic categories by performing pixel-level classification is a challenging problem with a wide range of applications, including geological surveys, urban resources management, disaster management and monitoring. Most existing transformer-based remote sensing image segmentation approaches typically employ a hybrid design with an aim to combine the merits of CNNs and transformers. The work of [65] introduces a light-weight transformer-based framework, Efficient-T, that comprises an implicit edge enhancement technique. The proposed Efficient-T employs hierarchical Swin transformers along with the MLP head. A coupled CNN-transformer framework, called CCTNet, is introduced in [66], which is aimed at combining the local details, such as edges and texture, captured by the CNNs along with the global contextual information obtained via transformers for crop segmentation in remote sensing images. Furthermore, different modules, such as test time augmentation and post-processing steps, are introduced in order to remove holes and small objects at the inference for restoring the complete segmented images. A CNN-transformer framework, named STransFuse, is introduced in [67], where both coarse-grained and fine-grained feature representations at multiple scales are extracted and later combined adaptively by utilizing a self-attentive mechanism. The work of [68] proposes a hybrid architecture, where the Swin transformer backbone that captures long-range dependencies is combined with a U-shaped decoder, which employs an atrous spatial pyramid pooling block based on depth-wise separable convolution along with an SE block to better preserve local details in an image. The work of [69] utilizes a pre-trained Swin Transformer backbone along with three decoder designs, namely U-Net, feature pyramid network and pyramid scene parsing network, for semantic segmentation in aerial images.

We present in Table 4 a quantitative comparison of aforementioned approaches on the two most commonly used semantic segmentation datasets: Potsdam [70] and Vaihingen [71]. The Potsdam dataset comprises 38 patches, where each patch has a resolution of 6000 × 6000 pixels collected over the Potsdam City with a ground sampling distance of 5 cm. The dataset has six categories. The Vaihingen dataset comprises 33 samples, where each sample has a resolution from 1996 × 1995 to 3816 × 2550 pixels. Here, the ground sampling distance is 9 cm. This dataset contains the same categories as Potsdam. The performance is measured in terms of overall accuracy (OA) computed using true positives, false positives, false negatives and true negatives. Figure 9 presents a qualitative comparison between Trans-CNN and other approaches on the Potsdam dataset.

Building Extraction: transformer-based techniques have also been recently explored for the problem of building extraction, where the task is to automatically identify building and non-building pixels in a remote sensing image. A dual-pathway transformer framework is introduced in [72] that strives to learn long-range dependencies both in spatial and channel directions. The work of [73] proposes a transformers framework, STEB-UNet, comprising a Swin transformer-based encoding booster that captures semantic information from multi-level features generated from different scales. The encoder booster is further integrated in a U-shaped network design that fuses local and large-scale semantic features. A transformer-based architecture, called BuildFormer, comprising a window-based linear attention, a convolutional MLP and a batch normalization, is introduced in [74]. The work of [75] explores the problem of generalizability of building extraction models to different areas and proposes a transfer learning approach to fine-tune models from one area to a subset of another unseen area.

Other than semantic image segmentation and building extraction with transformers, a recent study by [37] explores the problem of instance segmentation, where the task is to automatically classify each pixel into an object class within an image while also differentiating multiple object instances. Their approach aims at combining the advantages of CNNs and transformers by designing a local perception Swin transformer backbone to enhance both local and global feature information.

5.5. Others

Apart from the problems discussed above, transformer-based techniques are also explored for other VHR remote sensing tasks, such as image captioning and super-resolution (Table 5).

Image Captioning: Image captioning in remote sensing images is a challenging problem, where the task is to generate a semantically natural description of a given image. Few recent works have explored using transformers for image captioning. The work of [97] introduces a framework where standard transformers are adapted for remote sensing image caption generation by integrating residual connections, dropout layers and fusing features adaptively. Moreover, a reinforcement learning technique is utilized to further improve the caption generation process. An encoder–decoder architecture is introduced in [98], where the multi-scale features are first extracted from different layers of CNNs in the encoder and then a multi-layer aggregated transformer is utilized in the decoder to effectively exploit the multi-scale features for generating sentences. The work of [99] introduces a topic token-based mask transformers framework, where a topic token is integrated into the encoder and serves as a prior in the decoder for capturing improved global semantic relationships.

Image Super Resolution: Remote sensing image super-resolution is the task of recovering high-resolution images from their low-resolution counterparts. A few recent works have explored transformers for this task. A transformer-based multi-stage enhancement structure is introduced in [100] that leverages features from different stages. The proposed multi-stage structure can be combined with conventional super-resolution techniques in order to fuse multi-resolution low- and high-dimension features. Ref. [101] proposes a CNN-transformer hybrid architecture to integrate both local and global feature information for super-resolution. The work of [102] explores the problem of multi-image super-resolution, where the task is to merge multiple low-resolution remote sensing images of the same scene into a high-resolution one. Here, a transformer-based approach is introduced comprising an encoder having residual blocks, a fusion module and a super-pixel convolution-based decoder.

To summarize the review of transformers in VHR imagery, we present a holistic overview of different techniques in the literature in Table 6.

6. Transformers in Hyperspectral Imaging

As discussed earlier, hyperspectral images are represented by several spectral brands and analyzing hyperspectral data is crucial in a wide range of problems. Here, we present a review of recent transformer-based approaches for different hyperspectral imaging (HSI) tasks.

6.1. Image Classification

Here, the task is to automatically classify and assign a category label to each pixel in an image acquired through hyperspectral sensors. We first review recent works that are either based on the pure transformer design or utilize a hybrid CNN-transformer approach. Afterwards, we discuss few recent transformer-based approaches fusing different modalities for hyperspectral image classification.

Pure transformer-based Methods: Among existing works, the approach of [114] introduces a bi-directional encoder representation from transformers, called HSI-BERT, that strives to capture global dependencies. The proposed architecture is flexible and can be generalized from different regions with the need to perform pre-training. A transformer-based backbone, called SpectralFormer, is introduced in [8], which can take pixel-wise or patch-wise inputs and is designed to capture spectrally local sequence knowledge from nearby hyperspectral bands. SpectralFormer utilizes cross-layer skip connection to circulate information from shallow to deep layers by learning soft residuals across layers, thereby producing group-wise spectral embeddings. To circumvent the problem of the fixed geometric structure of convolution kernels, a spectral—spatial transformer network is proposed in [115], comprising a spatial attention and a spectral association module. While the spatial attention aims at connecting the local regions through aggregation of all input feature channels with spatial kernel weights, the spectral association is achieved through the integration of all spatial locations of the corresponding masked feature maps. Transformers are also explored in the spatial and spectral dimensions in [9]. Here, a framework is introduced comprising spectral self-attention that learns to capture interactions along the spectral dimension, and a spatial self-attention designed to pay attention to features along the spatial dimension. The resulting features from both spectral and spatial self-attention are then combined and input to the classifier.

Hybrid CNN-Transformers based Methods: Several works recently have explored combining the merits of CNNs and transformers to better capture both the local information as well as long-range dependencies for hyperspectral image classification. To this end, a convolutional transformer network, named CTN, is introduced in [10], which utilizes center position encoding to generate spatial position features by combining pixel positions with spectral features as well as a convolutional transformer to further obtain local-global features, as shown in Figure 10. A hyperspectral image transformer (HiT) classification approach is proposed in [11], where convolutions are embedded into transformer architecture to further integrate local spatial contextual information. The proposed approach comprises two main modules, where one module, called spectral-adaptive 3D convolution projection, is designed to generate spatial–spectral local information via spectral adaptive 3D convolution layers from hyperspectral images. The other module, named Conv-Permutator, employs depthwise convolutions to capture spatial–spectral representations separately along the spectral, height and width dimensions. The work of [12] introduces a multi-scale convolutional transformer that effectively captures spatial–spectral information, which can be integrated with the transformer network. Furthermore, a self-supervised pre-task is defined that masks the token of the central pixel in the encoder, whereas remaining tokens are input to the decoder in order to reconstruct the spectral information corresponding to the central pixel. In [116], a spectral–spatial feature tokenization transformer, called SSFTT, is proposed that generates spectral–spatial and semantic features. The SSFTT comprises a feature extraction module that produces low-level spectral and spatial features by employing a 3D and a 2D convolution layer. Furthermore, a Gaussian weighted feature tokenizer is utilized in SSFTT for feature transformation, which are then input to a transformer encoder for feature representation. Consequently, a linear layer is employed to generate the sample label. Zhao et al. [10] proposes a convolutional transformer network (CTN) that employs center position encoding to combine spectral features with pixel positions. The proposed architecture introduces convolutional transformer blocks that effectively integrate local and global features from hyperspectral image patches. Yang et al. [11] introduces a hyperspectral image transformer (HiT) framework where convolution operations are embedded within the transformer design for also integrating local spatial contextual information. The HiT framework comprises of a spectral-adaptive 3D convolution projection to capture local spatial–spectral information. Additionally, the HiT framework employs a conv-permutator module that uses the depthwise convolution for explicitly capturing the spatial–spectral information along different dimensions: height, width and spectral. The work of [116] introduces a spectral–spatial feature tokenization transformer, named SSFTT, that consists of a spectral–spatial feature extraction scheme for encoding shallow spectral–spatial features, a feature transformation module which produces transformed features used as input in the encoder.

Multi-modal Fusion Transformers based Methods: Few recent transformer-based works also explore fusing different modalities, such as hyperspectral, SAR and LiDAR, for hyperspectral image classification. A multi-modal fusion transformer, MFT, is introduced in [117] and comprises a data fusion scheme to derive class tokens in the transformers from multi-modal data (e.g., LiDAR and SAR) along with the standard hyperspectral patch tokens. Furthermore, the attention mechanism within MFT fuses information from tokens of hyperspectral and other modalities into a new token of integrated features. The work of [118] introduces an approach where a spectral sequence transformer is utilized to extract features from hyperspectral images along the spectral dimension and a spatial hierarchical transformer to generate spatial features in a hierarchical manner from both hyperspectral and LiDAR data.

Table 7 shows a comparison of some representative CNN-based approaches with both pure transformers and hybrid CNN-transformers-based methods on two popular hyperspectral image classification benchmarks: Indian Pines and Pavia. The Indian Pines dataset is acquired through airborne visible/infrared imaging spectrometer (AVIRIS) sensors in Northwestern Indiana, USA. Here, the images comprise 145 × 145 pixels in the spatial dimension at a ground sampling distance (GSD) of 20m with 220 spectral bands that cover the wavelength range of 400–2500 nm. After the removal of noisy bands, 200 spectral brands are retained. The original dataset contains 16 class, where several methods discard the small classes. For the remaining categories, the number of training samples are 200 per class. The Pavia dataset comprises images acquired through the reflective optics system imaging spectrometer (ROSIS) sensor over Pavia, Italy. Here, the images consist of 610 × 340 pixels in the spatial dimension at a GSD of 1.3m with 103 spectral bands covering from 430 to 860 nm. The dataset contains nine categories, where the number of training samples are 200 per class. Generally, three metrics are used to evaluate the performance of methods quantitatively: overall accuracy, average accuracy and kappa coefficient. The overall accuracy (OA) denotes to the proportion of correctly classified test samples, whereas average accuracy (AA) reflects the average recognition accuracy for each category. The kappa coefficient refers to the consistency between the generated classification maps from the model and the available ground truth. Figure 11 presents a qualitative comparison between HSI-Bert [114] and other existing CNN-based methods on the Pavia dataset.

6.2. Hyperspectral Pansharpening

In the hyperspectral pansharpening problem, the task is to enhance low-resolution hyperspectral image spatially using the spatial information from registered panchromatic image, while preserving the spectral information of the low-resolution image. Pansharpening plays an important role in a variety of tasks in remote sensing, including classification and change detection. Previously, CNN-based approaches have shown promising results for this task. Recently, transformer-based methods have performed favorably for this problem by also utilizing the useful global contextual information. A multi-scale spatial–spectral interaction transformer, MSIT, is proposed by [121] that comprises a convolution–transformer encoder to extract multi-scale local and global features from low-resolution and panchromatic images. The work of [122] introduces an architecture where global features are constructed using transformers and local features are computed using a shallow CNN. These multi-scale features extracted in a pyramidal fashion are learned simultaneously. The proposed approach further introduces a loss formulation with spatial and spectral loss simultaneously used for training using the real data. Liang et al. [123] propose a framework, named PMACNet, where both the region-of-interest from the low-resolution image and the residuals for regression to high-resolution image are learned in a parallel CNN structure. Afterwards, a pixel-wise attention module is utilized to adapt the residuals based on the learned region-of-interest.

A transformer-based regression network is introduced by [124], where the feature extraction of spatial and spectral information is performed by utilizing a Swin transformer model. The work of [125] introduces a transformer-based approach, where multi-spectral and panchromatic features are formulated as keys and queries for enabling joint learning of features across the modalities. Furthermore, this work employs an invertible neural module to perform effective fusion of the features for generating the pansharpened images. Bandara et al. [126] propose a framework comprising separate feature extractors for panchromatic and hyperspectral images, a soft attention mechanism and a spectral-spatial fusion module. The pansharpened image quality is improved by learning cross-feature space dependencies of the different features.

To summarize the review of transformers in hyperspectral imaging, we provide a holistic overview of the existing techniques in literature in Table 8.

7. Transformers in SAR Imagery

As discussed earlier, SAR images are constructed from the signals of the electromagnetic waves through a sensor platform transmitted to the surface of Earth. SAR possesses unique characteristics due to being unaffected with different environmental conditions, such as day, night and fog. Here, we review recent transformer-based approaches for SAR imaging tasks.

7.1. SAR Image Interpretation

Classification: Accurately classifying the target categories within SAR images is a challenging problem with numerous real-world applications. Recently, transformers have been explored for automatic interpretation and target recognition in SAR imagery. The work of [141] explores vision transformers for polarimetric SAR (PolSAR) image classification. In this framework, the pixel values of the image patches are considered as tokens and the self-attention mechanism is employed to capture long-range dependencies followed by multi-layer perceptron (MLP) and learnable class tokens to integrate features. A contrastive learning technique is utilized within the framework to reduce the redundancies and perform the classification task. Figure 12 shows the overview of the framework and a qualitative comparison in terms of supervised classification is presented in Figure 13.

Other than the aforementioned pure transformer-based approach, hybrid methods utilizing both CNNs and transformers also exist in the literature. The work of [142] introduces a globa–local network structure (GLNS) framework that combines the merits of CNNs and transformers for SAR image classification. The proposed GLNS employs a lightweight CNN along with an efficient vision transformer to capture both local and global features, which are later fused to perform the classification task. Other than standard fully-supervised learning, transformers are also explored in the limited supervision regime, such as few-shot SAR image classification. Cai et al. [143] introduces a few-shot SAR classification approach, named ST-PN, where a spatial transformer network is utilized for performing spatial alignment on CNN-based features.

Segmentation and Detection: Detection and segmentation in SAR imagery is vital for different applications, such as crop identification, target detection and terrain mapping. In SAR imagery, segmentation can be challenging due to the appearance of speckles, which is a type of multiplicative noise that increases with the back-scattering radar magnitude. Among recent transformer-based approaches, the work of [144] introduces a framework, named GCBANet, for SAR ship instance segmentation. Within the GCBANet framework, a global contextual block is employed to encode spatial holistic long-range dependencies. Furthermore, a boundary-aware box prediction technique is introduced to predict the boundaries of the ship. Xia et al. [145] introduce an approach, named CRTransSar, that combines the benefits of CNNs and transformers to capture both local and global information for SAR object detection. The proposed CRTransSar works by constructing a backbone with attention and convolutional blocks. A geospatial transformer framework is introduced in [146], comprising the steps of image decomposition, multi-scale geo-spatial contextual attention and recomposition for detecting aircrafts in SAR imagery. A feature relation enhancement framework is proposed in [147] for aircraft detection in SAR imagery. The proposed framework adopts a fusion pyramid structure to combine features of different levels and scales. Further, a context attention enhancement technique is employed to improve the positioning accuracy in complex backgrounds.

Other than ship and aircraft detection, the recent work of [148] introduces a transformer-based framework for 3D detection of oil tank targets in SAR imagery. In this framework, the incidence angle is input to the transformer as a prior token followed by a feature description operator that utilizes scattering centers for refining the predictions.

7.2. Others

Apart from SAR image classification, detection and segmentation, few works exist exploring transformers for other SAR imaging problems, such as image despeckling.

SAR Image Despeckling: The aforementioned interpretation of SAR imaging is made challenging due to the degradation of images caused by a multiplicative noise known as speckle. Recently, transformers have been explored for SAR image despeckling. The work of [149] introduces a transformer-based framework comprising an encoder that learns global dependencies among various SAR image regions. The transformer-based network is trained in an end-to-end fashion with synthetic speckled data by utilizing a composite loss function.

Change Detection in SAR Images: SAR images can be affected by imaging noise, which presents challenges when detecting changes in high-resolution (HR) SAR data. Recently, a self-supervised contrastive representation learning technique has been proposed by [150], where hierarchical representations are constructed using a convolution-enhanced transformer to distinguish the changes from HR SAR images. A convolution-based module is introduced to enable interactions across windows when performing self-attention computations within local windows.

SAR Image Registration: Several applications, such as change detection, involves joint analysis and processing of multiple SAR images that are likely acquired in different imaging conditions. Thus, accurate SAR image registration is desired where the reference and the sensed images are registered. The recent work of [151] explores transformers for large-size SAR dense-matching registration. Here, a hybrid CNN-transformer is employed to register images under weak texture condition. First, coarse registration is performed via the down-sampled original SAR image. Then, cluster centers of registration points are selected from the previous coarse registration step. Afterwards, the registration of image pairs are performed using a CNN-transformer module. Lastly, the resulting point pair subsets are integrated to achieve the final global transformation through RANSAC.

In summary, we present a holistic overview of the existing transformers techniques in SAR imagery in Table 9.

8. Conclusions

In this work, we presented a broad overview of transformers in remote sensing imaging: very-high resolution (VHR), hyperspectral and synthetic aperture radar (SAR). Within these different remote sensory imagery, we further discuss transformer-based approaches on a variety of tasks, such as classification, detection and segmentation. Our survey covers more than 60 transformer-based remote sensing research works in the literature. We observed transformers to obtain favorable performance on different remote sensing tasks likely due to their capabilities to capture long-range dependencies along with their representation flexibility. Further, the public availability of several standard transformer architectures and backbones make it easier to explore their applicability in remote sensing imaging problems.

Open Research Directions: As discussed earlier, most existing transformer-based recognition approaches employ backbones pre-trained on the ImageNet dataset. One exception is the work of [7], which explores pre-training vision transformers on a large-scale remote sensing dataset. However, in both cases the pre-training is performed in a supervised fashion. An open direction is to explore large-scale pre-training in a self-supervised fashion by taking into account an abundant amount of unlabeled remote sensing imaging data.

Our survey also shows that most existing approaches typically utilize a hybrid architecture where the aim is to combine the merits of convolutions and self-attention. However, transformers are typically known to have a higher computational cost to compute global self-attention. Several recent works have explored different improvements in the transformers design, such as, reduced computational overhead [165], efficient hybrid CNN-transformers backbones [166] and unified architectures for image and video classification [167]. Moreover, due to the utilization of more training data by transformers, there is a need to construct larger-scale datasets in remote sensing imaging. For most problems discussed in this work and especially in case of object detection, heavy backbones are typically utilized to achieve better detection accuracy. However, this significantly slows down the speed of the aerial detector. An interesting open direction is to design light-weight transformer-based backbones to classify detect oriented targets in remote sensing imagery. Another open research direction is to explore the adaptability of the transformer-based models to a heterogeneous source of images, such as SAR and UAV (e.g., change detection).

In this survey, we also observe several existing approaches to utilize transformers in a plug-and-play fashion for remote sensing. This leads to the need of designing effective domain-specific architectural components and loss formulations to further boost the performance. Moreover, it is intriguing to study the adversarial feature space of vision transformer models that are pre-trained on remote sensing benchmarks and their transferability.

In the future, it is expected that more sophisticated pure transformer architectures with specifically designed self-attention mechanisms for remote sensing problems will be explored. Another potential future research direction is to investigate new hybrid CNN-transformer architectures that leverage the capabilities of convolutions and self-attention in the context of remote sensing tasks.

Additionally, we intend to frequently update and maintain the latest transformers in remote sensing papers with their respective code at https://github.com/VIROBO-15/Transformer-in-Remote-Sensing, accessed on 6 February 2023.

Author Contributions

Conceptualization, A.A.A., F.S.K. and A.K.; methodology, A.A.A., F.S.K. and A.K.; validation, A.A.A., F.S.K. and A.K.; formal analysis, A.A.A., F.S.K. and A.K.; investigation, A.A.A., F.S.K. and A.K.; resources, A.A.A., F.S.K. and A.K.; writing—original draft preparation, A.A.A., A.K.; writing—review and editing, A.A.A., F.S.K. and A.K.; supervision, R.M.A., S.K., H.C., F.S.K. and G.-S.X.; All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank and express their deepest gratitude to Mohamed bin Zayed University for Artificial Intelligence for the constant, helpful presence throughout the research journey.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

View Image - Figure 1. Recent transformer-based techniques in the remote sensing imaging. On the left and middle: pie-charts are representing statistics of the articles covered in this survey in terms of different remote sensing imaging problems and data type representations. On the right: we show a plot illustrating the consistent increase in the number of papers recently.

Figure 1. Recent transformer-based techniques in the remote sensing imaging. On the left and middle: pie-charts are representing statistics of the articles covered in this survey in terms of different remote sensing imaging problems and data type representations. On the right: we show a plot illustrating the consistent increase in the number of papers recently.

View Image - Figure 2. Example hyperspectral images from Pavia, Indian Pines and Kennedy Space Center datasets (a), example VHR images (b) and SAR images of the L band E-SAR dataset (c).

Figure 2. Example hyperspectral images from Pavia, Indian Pines and Kennedy Space Center datasets (a), example VHR images (b) and SAR images of the L band E-SAR dataset (c).

View Image - Figure 3. The vision transformer’s architecture is shown on the left and the encoder block’s specifications are shown on the right. The input image is first divided into patches. These are then projected (after flattening) into a feature space, where a transformer encoder analyzes them to create the classification output. * indicates to Extra learnable [class] embedding. Adapted with permission from [1,19].

Figure 3. The vision transformer’s architecture is shown on the left and the encoder block’s specifications are shown on the right. The input image is first divided into patches. These are then projected (after flattening) into a feature space, where a transformer encoder analyzes them to create the classification output. * indicates to Extra learnable [class] embedding. Adapted with permission from [1,19].

View Image - Figure 4. The taxonomy of transformers in VHR, hyperspectral and SAR imagery with a variety of tasks, such as classification, detection, segmentation, pan sharpening and change detection.

Figure 4. The taxonomy of transformers in VHR, hyperspectral and SAR imagery with a variety of tasks, such as classification, detection, segmentation, pan sharpening and change detection.

View Image - Figure 5. The CTNet architecture comprising two modules: the ViT stream (T-stream) and the CNNs stream (C-stream). The T-stream and C-stream are designed to capture semantic features and the local structural information. Figure is from [31]. Best viewed zoomed in.

Figure 5. The CTNet architecture comprising two modules: the ViT stream (T-stream) and the CNNs stream (C-stream). The T-stream and C-stream are designed to capture semantic features and the local structural information. Figure is from [31]. Best viewed zoomed in.

View Image - Figure 6. Comparison in terms of response maps obtained using different models on example VHR images. The original images are shown in (a), whereas the evaluated models are: (b) IMP-ResNet-50, (c) SeCo-ResNet-50, (d) RSP-ResNet-50, (e) IMP-Swin-T, (f) RSP-Swin-T, (g) IMP-ViTAEv2-S and (h) RSP-ViTAEv2-S. Here, IMP denotes ImageNet pre-training and RSP refers to remote sensing pre-training. In the response map, the warmer color indicates a higher response. Figure is from [7].

Figure 6. Comparison in terms of response maps obtained using different models on example VHR images. The original images are shown in (a), whereas the evaluated models are: (b) IMP-ResNet-50, (c) SeCo-ResNet-50, (d) RSP-ResNet-50, (e) IMP-Swin-T, (f) RSP-Swin-T, (g) IMP-ViTAEv2-S and (h) RSP-ViTAEv2-S. Here, IMP denotes ImageNet pre-training and RSP refers to remote sensing pre-training. In the response map, the warmer color indicates a higher response. Figure is from [7].

View Image - Figure 7. Overview of the anchor-free Oriented RepPoints detection architecture [46] that strives to learn selecting points samples for classification, regression and orientation. RepPoints utilizes the same structure of the shared head as in [46], except a quality assessment and sample assignment strategy (APAA) are employed for selecting high-quality sample points for training. Figure is adapted with permission from [46]. Best viewed zoomed in.

Figure 7. Overview of the anchor-free Oriented RepPoints detection architecture [46] that strives to learn selecting points samples for classification, regression and orientation. RepPoints utilizes the same structure of the shared head as in [46], except a quality assessment and sample assignment strategy (APAA) are employed for selecting high-quality sample points for training. Figure is adapted with permission from [46]. Best viewed zoomed in.

View Image - Figure 8. Results of different CD methods visualized, such as FC-EF [62], FC-Siam-Conc [62], FC-Siam-Diff [62], CDNet [63], DASNet [64], STANet [61] and SwinSUNet [56], compared to (a–d) sample imagery sets, such as the WHU-CD [60] test set. Various colors were utilised to convey different denotations; white represents true positive, black represents true negative, red represents false positive and green represents false negative. Figure is from [56].

Figure 8. Results of different CD methods visualized, such as FC-EF [62], FC-Siam-Conc [62], FC-Siam-Diff [62], CDNet [63], DASNet [64], STANet [61] and SwinSUNet [56], compared to (a–d) sample imagery sets, such as the WHU-CD [60] test set. Various colors were utilised to convey different denotations; white represents true positive, black represents true negative, red represents false positive and green represents false negative. Figure is from [56].

View Image - Figure 9. A qualitative comparison between the hybrid Trans-CNN with other existing segmentation approaches. The examples are from the Potsdam dataset. Every two rows present the results as a group. Here, from left to right and top to bottom are: (a) the corresponding ground-truth, (b) results obtained from AFNet + TTA, (c) results of ResUNet, (d) results of CASIA2, (e) results achieved using Trans-CNN and (f) the RGB image. The inccorect classification results from AFNet + TTA, ResUNet, CASIA2 and Trans-CNN are presented in (g–j), respectively. Figure is from [68].

Figure 9. A qualitative comparison between the hybrid Trans-CNN with other existing segmentation approaches. The examples are from the Potsdam dataset. Every two rows present the results as a group. Here, from left to right and top to bottom are: (a) the corresponding ground-truth, (b) results obtained from AFNet + TTA, (c) results of ResUNet, (d) results of CASIA2, (e) results achieved using Trans-CNN and (f) the RGB image. The inccorect classification results from AFNet + TTA, ResUNet, CASIA2 and Trans-CNN are presented in (g–j), respectively. Figure is from [68].

View Image - Figure 10. Overview of the CTN framework [10] for hyperspectral image classification. Given the HSI data patches, CTN processes them to center position encoding (CPE), convolutional transformer and classification modules. Here, the output represents the category label. Figure is from [10]. Best viewed zoomed in.

Figure 10. Overview of the CTN framework [10] for hyperspectral image classification. Given the HSI data patches, CTN processes them to center position encoding (CPE), convolutional transformer and classification modules. Here, the output represents the category label. Figure is from [10]. Best viewed zoomed in.

View Image - Figure 11. A qualitative comparison, in terms of visualization of classification maps between HSI-BERT and several CNN-based methods on the Pavia dataset. Here, (a) CNN, (b) CNN-PPF, (c) CDCNN, (d) DRCNN and (e) HSI-BERT. Figure is from [114].

Figure 11. A qualitative comparison, in terms of visualization of classification maps between HSI-BERT and several CNN-based methods on the Pavia dataset. Here, (a) CNN, (b) CNN-PPF, (c) CDCNN, (d) DRCNN and (e) HSI-BERT. Figure is from [114].

View Image - Figure 12. Overview of the ViT-PolSAR framework [141] for supervised polarimetric SAR image classification. Here, the pixel values of the SAR image patches are considered as tokens and then the self-attention mechanism is utilized to encode longe-range dependencies followed by MLP. Figure is from [141]. Best viewed zoomed in.

Figure 12. Overview of the ViT-PolSAR framework [141] for supervised polarimetric SAR image classification. Here, the pixel values of the SAR image patches are considered as tokens and then the self-attention mechanism is utilized to encode longe-range dependencies followed by MLP. Figure is from [141]. Best viewed zoomed in.

View Image - Figure 13. A visual comparison in terms of supervised classification of the entire map on the ALOS2 San Francisco dataset. Here, (a–h) shows the results obtained from Wishart, RBF-SVM, CV-CNN, 3D-CNN, PSENet, SF-CNN and ViT-PolSAR, respectively. Figure is from [141].

Figure 13. A visual comparison in terms of supervised classification of the entire map on the ALOS2 San Francisco dataset. Here, (a–h) shows the results obtained from Wishart, RBF-SVM, CV-CNN, 3D-CNN, PSENet, SF-CNN and ViT-PolSAR, respectively. Figure is from [141].

Table 1

Performance, in terms of classification accuracy, of different transformer-based methods on the popular AID dataset with 20:80 train-test ratio.

Method	Venue	Backbone	AID (20%)
V16-21K [4]	Remote Sensing	ViT	94.97
CTNet [31]	GRSL	ResNet34 + ViT	96.35
TRS [32]	Remote Sensing	TRS	95.54
TSTNet [5]	Remote Sensing	Swin-T	97.20
RSP [7]	TGRS	RSP-Swin-T-E300	96.83

Table 2

Comparison in terms of detection accuracy (mAP) of different detectors utilizing a hybrid CNN-transformer design, transformers pre-trained backbone or a DETR-based transformer architecture on DOTA benchmark. The results are presented on the orientated bounding-boxes task of the DOTA benchmark.

Method	Venue	Backbone	DOTA
ADT-Det [41]	Remote Sensing	ResNet50	76.89
RBox [42]	CVPR	ResNet50	79.59
Rodformer [43]	Sensors	ResNet50	63.89
Rodformer [43]	Sensors	ViT-B4	75.60
PointRCNN [44]	Remote Sensing	Swin-T	80.14
Hybrid Network [45]	Remote Sensing	TransC-T	78.41
Oriented RepPoints [46]	Arxiv	ResNet50	75.97
Oriented RepPoints [46]	Arxiv	Swin-T	77.63
O²DETR [47]	Arxiv	ResNet50	79.66
AO2-DETR [48]	Arxiv	ResNet50	79.22

Table 3

Comparison, in terms of F1 score, of different transformer-based change detection methods on the two popular benchmarks: WHU and LEVIR.

Method	Venue	WHU	LEVIR
CD-Trans [54]	TGRS	83.98	89.31
MSPSNet [55]	TGRS	-	89.18
UVACD [57]	Remote Sensing	92.84	91.30
SwinSUNet [56]	TGRS	93.8	-
TransUNetCD [58]	TGRS	93.59	91.1
HybridTransCD [59]	IJGI	-	90.06

Table 4

Performance comparison, in terms of overall accuracy (OA), of different transformer-based semantic segmentation methods on two popular benchmarks: Potsdam and Vaihingen.

Method	Venue	Potsdam	Vaihingen
Efficient-T [65]	Remote Sensing	90.08	88.41
STransFuse [67]	JSTAR	86.71	86.07
Trans-CNN [68]	TGRS	91.0	90.40
SwinTF [69]	Remote Sensing	-	90.97

Table 5

Overview of transformer-based approaches in VHR remote sensing imaging. Here, we highlight transformer-based methods for different VHR remote sensing tasks.

Transformers in Very-High Resolution (VHR) Satellite Imagery
Method	Task	Datasets	Metrics	Highlights
V16-21K [4]	Classification	Merced [76], AID [35], Optimal31 [77], NWPU [78]	Overall classification accuracy	Explores vision transformers along with combination of data augmentation techniques for boosting accuracy.
TRS [32]	Classification	Merced [76], AID [35], Optimal31 [77], NWPU [78]	Overall classification accuracy	Integrates transformers into CNNs by replacing the last three ResNet bottlenecks with encoders having multi-head self-attention bottleneck.
TSTNet [5]	Classification	Merced [76], AID [35], NWPU [78]	Overall classification accuracy	A Swin transformer-based two-stream architecture that uses both deep features from the image and edge features from edge stream.
CTNet [31]	Classification	AID [35], NWPU [78]	Overall classification accuracy	Comprises a ViT stream that mines semantic features and the CNN stream, which captures local structural features.
HHTL [6]	Classification	Merced [76], AID [35], RSSDIVCS [79], NWPU [78]	Overall classification accuracy	Explores integrating heterogenous non-overlapping patches and homogenous patches obtained using superpixel segmentation.
RSP [7]	Classification, Segmentation, Detection	MillionAID [33], Potsdam [70], iSAID [80], HRSC2016 [81], DOTA [49], CCD [82], LEVIR [61]	Overall classification accuracy, mAP, F1 score	Investigates pre-training transformers on a large-scale remote sensing dataset.
SAIEC [37]	Detection, Segmentation	DIOR [83], HRRSD [84], NWPU VHR-10 [85]	mAP	Introduces a local perception Swin transformer backbone that aims to combine the merits of transformers and CNNs for improving the local perception capabilities.
T-TRD-DA [39]	Detection	DIOR [83], NWPU VHR-10 [85]	mAP	Proposes a transformer-based detector utilizing a pre-trained CNN for feature extraction and multiple-layer transformers for multi-scale feature aggregation at global spatial positions.
GANsformer [40]	Detection	DIOR [83], NWPU VHR-10 [85]	mAP	Introduces an efficient transformer, with reduced parameters, as a branch network to capture global features along with a generative model to expand the input image ahead of backbone.
ADT-Det [41]	Detection	DIOR [83], HRSC2016 [81]	mAP	Introduces a RetineNet-based framework with a feature pyramid transformer integrated between the backbone and post-processing network for generating multi-scale semantic features.
PointRCNN [44]	Detection	DOTA [49], HRSC2016 [81]	mAP	Introduces a two-stage angle-free dectection framework, which is also evaluated using the transformer-based Swin backbone.
HybridNetwork22 [45]	Detection	DOTA [49], UCAS-AOD [86], VEDAI [87]	mAP	Integrates multi-scale global and local information from transformers and CNNs through an adaptive feature fusion network.
Oriented RepPoints [46]	Detection	DOTA [49], UCAS-AOD [86], HRSC2016 [81]	mAP	Proposes an anchor-free detector and learns flexible adaptive points as representations through a quality assessment and sample assignment scheme.
O²DETR [47]	Detection	DOTA [49], SKU110K-R [88], HRSC2016 [81]	mAP	Extends the standard DETR for oriented detection by introducing an encoder employing depthwise separable convolution.
AO2DETR [48]	Detection	DOTA [49]	mAP	Introduces a DETR-based detector with oriented proposal generation scheme, a refine module to compute rotation-invariant features and a rotation-aware matching loss for performing the matching process for direct set predictions.
RBox [42]	Detection	SynthText [89], ICDAR 2015 (IC15) [90], MLT-2017 (MLT17) [91], MSRA-TD500 [92], MTWI [93], Total-Text [94], CTW1500 [95]	mAP	Proposes a framework employing transformers to model the relationship of sampled features for better grouping and box prediction without requiring post-processing operation.
Rodformer [43]	Detection	DOTA [49]	mAP	A hybrid detection architecture integrating the local characteristics of depth-separable convolutions with the global characteristics of MLP.
CD-Trans [54]	Change Detection	WHU [60], LEVIR [61], DSIFN [96]	F1 score	Introduces a bi-temporal image transformer designed to model the spatio-temporal contextual information. The encoder captures context in token-based space-time, which is then fed to a decoder where feature refinement is performed in the pixel-space.

Table 6

Overview of transformer-based approaches in VHR remote sensing imaging. Here, we highlight transformer-based methods for different VHR remote sensing tasks.

Transformers in Very-High Resolution (VHR) Satellite Imagery
Method	Task	Datasets	Metrics	Highlights
MSPSNet [55]	Change Detection	SYSU-CD [103], LEVIR [61]	F1 score	Introduces a multi-scale Siamese framework employing a parallel convolutional structure for feature integration of different temporal images and self-attention for feature refinement.
SwinSUNet [56]	Change Detection	CCD [82], WHU [60], OSCD [104], HRSCD [105]	F1 score	Introduces a Swin transformer-based network with a Siamese U-shaped structure having encoder, fusion and decoder modules.
TransUNetCD [58]	Change Detection	WHU [60], LEVIR [61], CCD [82], DSIFN [96], OSCD [104], S2Looking [106]	F1 score	Introduces a framework integrating merits of transformers and UNet through capturing enriched contextualized features which are upsampled and fused with multi-scale features to generate global-local features.
Hybrid-TransCD [59]	Change Detection	LEVIR [61], SYSU-CD [103]	F1 score	Introduces a multi-scale transformer that encodes both fine-grained and large object features through heterogeneous tokens via multiple receptive fields.
CCTNet [66]	Segmentation	Barley Remote Sensing Dataset [107]	F1 score, overall accuracy	Proposes a hybrid CNN-transformer framework to combine local details and global conextual information for crop segmentation.
STransFuse [67]	Segmentation	Potsdam [70], Vaihingen [71]	F1 score, overall accuracy	Introduces a framework that encodes both coarse-grained as well as fine-grained features at multiple scales which are fused using self-attentive mechanism.
Trans-CNN [68]	Segmentation	Potsdam [70], Vaihingen [71]	F1 score, overall accuracy	Introduces a framework with a Swin transformer backbone to capture long-range dependencies and a U-shaped decoder with depth-wise separable convolution to encode local details.
SwinTF [69]	Segmentation	Vaihingen [71], Thailand North Landsat-8 corpus (private), Thailand Isan Landsat-8 corpus (private)	F1 score, overall accuracy	Introduces a framework with pre-trained Swin backbone along with a U-Net, feature pyramid network and a pyramid scene parsing network for segmentation.
Efficient-T [65]	Segmentation	Potsdam [70], Vaihingen [71]	F1 score, overall accuracy	Proposes a light-weight framework consisting of an implicit edge enhancement scheme along with a Swin transformers.
STT [72]	Building Extraction	WHU [60], INRIA [108]	IoU, overall accuracy, F1 score	Introduces a transformers framework to learn long-range dependencies both in the spatial and channel direction.
STEB-UNet [73]	Building Extraction	WHU [60], Massachusetts [108]	IoU, F1 score	Introduces a transformer framework capturing semantic information from multi-scale features which are further fused to local features.
BuildFormer [74]	Building Extraction	WHU [60], Massachusetts [108], INRIA [108]	IoU, F1 score	Introduces an architecture consisitng of a window-based linear attention and a convolutional MLP.
T-Trans [75]	Building Extraction	Massachusetts [108] ,INRIA [108]	IoU, F1 score	Explores the task of generalizability of building extraction models to different areas and introduces a transfer learning method to fine-tune models from one area to a subset of another unseen area.
TRL [97]	Image Captioning	RSICD [109], UCM-captions [110], Sydney-Caption [111]	BLEU, ROUGE, METEOR and CIDEr	Proposes an approach adapting transformers by integrating residual connections, dropout and adatpive feature fusion for remote sensing image caption generation.
MLAT [98]	Image Captioning	RSICD [109], UCM-captions [110], Sydney-Caption [111]	BLEU, ROUGE, METEOR and CIDEr	Introduces an architecture where multi-scale features from CNN layers are extracted in encoder and a multi-layer aggregated transformer in the decoder uses those features for sentence generation.
Ren et al. [99]	Image Captioning	RSICD [109], UCM-captions [110], Sydney-Caption [111]	BLEU, ROUGE, METEOR and CIDEr	Proposes a topic token-based mask transformers with the topic token being integrated into encoder while serving as prior in decoder for capturing global semantic relationships.
TR-MISR [102]	Image Super Resolution	RSICD [109], UCM-captions [110], PROBA-V [112]	cPSNR, cSSIM	Introduces a transformer-based architecture with an encoder having residual blocks, a fusion module along with a super-pixel convolution-based decoder for multi-image super-resolution.
MSE-Net [100]	Image Super Resolution	UCMerced [113], AID [35]	cPSNR, cSSIM	Proposes a multi-stage enchancement framework to utilize features from different stages and further integrating them with standard super-resolution technique for combining multi-resolution low as well as high-dimension feature representations.
SRT [101]	Image Super Resolution	UCMerced [113]	cPSNR, cSSIM	Introduces a hybrid framework that integrates local features from CNNs and global features from transformers.

Table 7

Comparison in terms of overall accuracy (OA) of some representative CNN-based methods with pure transformers and hybrid CNN-transformer-based hyperspectral image classification methods on two popular benchmarks: Indian Pines and Pavia. Here, the results are reported using 200 samples for training for each category.

Method	Venue	Type	Indian Pines	Pavia
CNN [119]	Sensors	CNNs	87.01	92.27
CNN-PPF [120]	TGRS	CNNs	93.90	96.48
HSI-BERT [114]	TGRS	Pure	99.56	99.75
DSS-TRM [9]	EJRS	Pure	99.43	98.50
CTN [10]	GRSL	Hybrid	99.11	97.48

Table 8

Overview of transformer-based approaches in hyperspectral and multispectral imaging. Here, we highlight methods for different hyperspectral remote sensing tasks.

Transformers in Hyperspectral Imagery
Method	Task	Datasets	Metrics	Highlights
SpectralFormer [8]	Classification	Indian Pines [127], Pavia University [128], Houston 2013 [129]	Overall classification accuracy, kappa	Introduces a transformer-based backbone to capture spectrally local information from nearby hyperspectral bands by generating group-wise spectral embeddings.
MCT [12]	Classification	Salinas [130], Yellow River Estuary	Overall classification accuracy, kappa	Proposes a multi-scale convolutional transformer to encode spatial-spectral information that is integrated with transformers network.
MFT [117]	Classification	University of Houston [129], Trento, MUUFL Gulfport [131], Augsburg scenes	Overall classification accuracy, kappa	Proposes a multi-modal transfomers that derives class tokens from multi-modal data along with the standard hyperspectral patch tokens.
CTN [10]	Classification	Indian Pines [127], Pavia University [128]	Overall classification accuracy, kappa	Introduces a convolutional transformer network with dedicated blocks that integrates local and global features from hyspectral image patches.
DHViT [118]	Classification	Trento, Houston 2013 [129], Houston 2018 [132]	Overall classification accuracy, kappa	Introduces an approach comprising a spectral sequence transformer to encode features along the spectral dimension and a spatial hierarchical transformer to produce hierarchical spatial features for hyperspectral and LiDAR data.
DSS-TRM [9]	Classification	Pavia University [128], Salinas [130], Indian Pines [127]	Overall classification accuracy, kappa	Introduces a transformer-based approach consisting of spectral self-attention and spatial self-attention to capture interactions along spectral and spatial dimension, respectively.
HiT [11]	Classification	Indian Pines [127], Pavia University [128], Houston2013 [129], Xiongan	Overall classification accuracy, kappa	Proposes a hyperspectral image transformer consisting of a 3D convolution projection module to encode local spatial-spectral details and a conv-permutator modue to capture the information along height, width and spectral dimensions.
HSI-BERT [114]	Classification	Indian Pines [127], Pavia University [128], Salinas [130]	Overall classification accuracy	Proposes a transformer-based method that captures capture global dependencies using a bi-direction encoder representation.
SSFTT [116]	Classification	Indian Pines [127], Pavia University [128], Houston 2013 [129]	Overall classification accuracy, kappa	Proposes a spectral–spatial feature tokenization transformer that utilizes both spectral-spatial shallow and semantic features for representation and learning.
SSTN [115]	Classification	Pavia University [128], Kennedy Space Center, Indian Pines [127], University of Houston [129], Pavia Center [133]	Overall classification accuracy, kappa	Introduces a spectral–spatial transformer with a spatial attention and a spectral association module. The two modules perform spectral and spatial association through the integration of spectral and spatial locations, respectively.
CTIN [134]	Pan-Sharpening	worldview II [135], worldview III [136], GaoFen-2	IQA, ERGAS, PSNR, SAM	A transformer-based approach is introduced, where multi-spectral and panchromatic features are captured for joint feature learning across modalities. Further, an invertible neural module performs feature fusion to generate pansharpened images.
HyperTransformer [126]	Pan-Sharpening	Pavia Center [133], Botswana [137], Chikusei [138]	Cross-correlation(CC), spectral Angle Mapping (SAM), RSNR, ERGAS, PSNR	Introduces a transformer-based framework with separate feature extractors for panchromatic and hyperspectral images and a spectral-spatial fusion module to learn cross-feature space dependencies of features.
PMACNet [123]	Pan-Sharpening	worldview II [135], worldview III [136]	Spatial correlation coefficient(SCC), spectral angle mapper (SAM)	Introduces a framework with a parallel CNN structure to learn ROIs from low-resolution image and residuals from high-resolution image. It also contains a a pixel-wise attention module to adapt residuals on the learned ROIs.
CPT-noRef [122]	Pan-Sharpening	Gaofen-1, worldview II [135], Pleiades [139]	IQA, ERGAS, SAM, correlation coefficient(CC)	A CNN-transformers framework where global features are generated using transformers and local features are constructed using a shallow CNNs. The features are combined and a loss formulation having spatial and spectral losses are utilized for training.
MSIT [121]	Pan-Sharpening	GeoEye-1, QuickBird [140]	ERGAS, SAM, Q4	Introduces a multi-scale spatial–spectral interaction transformer with a convolution-transformer encoder for generating multi-scale global and local features from both low-resolution and panchromatic images.
Su et al. [124]	Pan-Sharpening	worldview II [135], QuickBird [140], GaoFen-2	spatial correlation coefficient(SCC), ESGAS, RMSE, SAM, Q4	A transformer-based approach with spatial and spectral feature extraction performed using a Swin model.

Table 9

Overview of transformer-based approaches in SAR imaging. Here, we highlight methods for different SAR remote sensing tasks.

Transformers in Hyperspectral Imagery
Method	Task	Datasets	Metrics	Highlights
ViT-PolSAR [141]	Classification	AIRSAR Flevoland [152], ESAR Oberpfaffenhofen [153], AIRSAR San Francisco [154], ALOS2 San Francisco [155]	AA, OA, kappa	Explores transformers, where self-attention is used to capture long-range dependencies followed by MLP for polarimetric SAR image classification.
GLNS [142]	Classification	Gaofen-3 SAR [156], F-SAR [157]	AA, OA, kappa	Introduces a global–local network structure to exploit the merits of CNNs and transformers with local and global features that are fused to perform classification.
ST-PN [143]	Classification	MSTAR [158]	Accuracy	Proposes a spatial transformer network for spatial alignment of features extracted from CNNs for few-shot SAR classification.
GCBANet [144]	Segmentation	SSDD [159], HRSID [160]	AP	Introduces a transformer-based approach with a global contextual block for capturing spatial holistic long-range dependencies and a boundary-aware prediction scheme for estimating the boundaries of ship.
CRTransSar [145]	Detection	SMCDD [145], SSDD [159]	Accuracy, recall, mAP, F1	Proposes a backbone based on convolutional and attention blocks for capturing both local and global features.
Geospatial Transformers [146]	Detection	Gaofen-3 [156]	DR, FAR	Introduces a framework with multi-scale geo-spatial attention for aircraft detection in SAR imaging.
SFRE-Net [147]	Detection	Gaofen-3 [156]	Precision, recall, F1	Introduces a feature relation enhancement architecture consisting of a fusion pyramid structure and a context attention enhancement technique.
3DET-ViT [148]	Detection	L1B SAR [161]	AP, AR, mean Offset	Proposes a transformer-based framework that takes incidence angle as a prior token with a feature description operator employing scattering centers for prediction refinement.
ID-ViT [149]	Despeckling	Berkeley Segmentation Dataset [162]	PSNR, SSIM	Proposes a framework comprising an encoder to learn global dependencies among SAR image regions, where the network is trained using synthetic speckled data.
CLT [150]	Change Detection	Brazil and Namibia datasets [163], simulation data [150]	KC	Introduces a self-supervised contrastive representation learning method with a convolution-enhanced transformer to generate hierarchical representations for distinguishing changes from HR SAR images.
CF-ViT [151]	Image Registration	MegaDepth [164]	KC	A CNN-transformers framework that first performs coarse registration on the down-sampled image, followed by registration of image pairs via a CNN-transformer module with the resulting point pair subsets integrated to obtain final global registration.

References

1. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Proceedings of the ICLR; Virtual-Only, 3–7 May 2021.

2. Naseer, M.; Ranasinghe, K.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Intriguing Properties of Vision Transformers. Proceedings of the NeurIPS; Virtual-Only, 7–10 December 2021.

3. Park, N.; Kim, S. How Do Vision Transformers Work?. Proceedings of the ICLR; Virtual-Only, 25 April 2022.

4. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision transformers for remote sensing image classification. Remote Sens.; 2021; 13, 516. [DOI: https://dx.doi.org/10.3390/rs13030516]

5. Hao, S.; Wu, B.; Zhao, K.; Ye, Y.; Wang, W. Two-Stream Swin Transformer with Differentiable Sobel Operator for Remote Sensing Image Classification. Remote Sens.; 2022; 14, 1507. [DOI: https://dx.doi.org/10.3390/rs14061507]

6. Ma, J.; Li, M.; Tang, X.; Zhang, X.; Liu, F.; Jiao, L. Homo–Heterogenous Transformer Learning Framework for RS Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2022; 15, pp. 2223-2239. [DOI: https://dx.doi.org/10.1109/JSTARS.2022.3155665]

7. Wang, D.; Zhang, J.; Du, B.; Xia, G.S.; Tao, D. An Empirical Study of Remote Sensing Pretraining. IEEE Trans. Geosci. Remote Sens.; 2022; [DOI: https://dx.doi.org/10.1109/TGRS.2022.3176603]

8. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens.; 2021; 60, 5518615. [DOI: https://dx.doi.org/10.1109/TGRS.2021.3130716]

9. Liu, B.; Yu, A.; Gao, K.; Tan, X.; Sun, Y.; Yu, X. DSS-TRM: Deep spatial–spectral transformer for hyperspectral image classification. Eur. J. Remote Sens.; 2022; 55, pp. 103-114. [DOI: https://dx.doi.org/10.1080/22797254.2021.2023910]

10. Zhao, Z.; Hu, D.; Wang, H.; Yu, X. Convolutional Transformer Network for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett.; 2022; 19, pp. 1-5. [DOI: https://dx.doi.org/10.1109/LGRS.2022.3169815]

11. Yang, X.; Cao, W.; Lu, Y.; Zhou, Y. Hyperspectral Image Transformer Classification Networks. IEEE Trans. Geosci. Remote Sens.; 2022; 60, 5528715. [DOI: https://dx.doi.org/10.1109/TGRS.2022.3171551]

12. Jia, S.; Wang, Y. Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification. arXiv; 2022; arXiv: 2203.04771

13. Tuia, D.; Volpi, M.; Copa, L.; Kanevski, M.; Munoz-Mari, J. A survey of active learning algorithms for supervised remote sensing image classification. IEEE J. Sel. Top. Signal Process.; 2011; 5, pp. 606-617. [DOI: https://dx.doi.org/10.1109/JSTSP.2011.2139193]

14. Camps-Valls, G.; Tuia, D.; Bruzzone, L.; Benediktsson, J.A. Advances in hyperspectral image classification: Earth monitoring with statistical learning methods. IEEE Signal Process. Mag.; 2013; 31, pp. 45-54. [DOI: https://dx.doi.org/10.1109/MSP.2013.2279179]

15. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag.; 2017; 5, pp. 8-36. [DOI: https://dx.doi.org/10.1109/MGRS.2017.2762307]

16. Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens.; 2019; 152, pp. 166-177. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2019.04.015]

17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. NeurIPS; 2017; 30, pp. 600-610.

18. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah:, M. Transformers in Vision: A Survey. ACM Comput. Surv.; 2021; 54, pp. 1-41. [DOI: https://dx.doi.org/10.1145/3505244]

19. Shamshad, F.; Khan, S.; Zamir, S.W.; Khan, M.H.; Hayat, M.; Khan, F.S.; Fu, H. Transformers in medical imaging: A survey. arXiv; 2022; arXiv: 2201.09873

20. Selva, J.; Johansen, A.; Escalera, S.; Nasrollahi, K.; Moeslund, T.; Clapes, A. Video Transformers: A Survey. arXiv; 2022; arXiv: 2201.05991[DOI: https://dx.doi.org/10.1109/TPAMI.2023.3243465]

21. Teng, M.Y.; Mehrubeoglu, R.; King, S.A.; Cammarata, K.; Simons, J. Investigation of epifauna coverage on seagrass blades using spatial and spectral analysis of hyperspectral images. Proceedings of the 2013 5th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS); Gainesville, FL, USA, 26–28 June 2013; pp. 1-4.

22. Notesco, G.; Dor, E.B.; Brook, A. Mineral mapping of makhtesh ramon in israel using hyperspectral remote sensing day and night LWIR images. Proceedings of the 2014 6th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS); Lausanne, Switzerland, 24–27 June 2014; pp. 1-4.

23. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. NeurIPS; 2012; 60, pp. 84-90. [DOI: https://dx.doi.org/10.1145/3065386]

24. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. NeurIPS; 2015; 28, pp. 1137-1149. [DOI: https://dx.doi.org/10.1109/TPAMI.2016.2577031]

25. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei, L. ImageNet: A large-scale hierarchical image database. Proceedings of the CVPR; Miami, FL, USA, 20–25 June 2009; pp. 248-255.

26. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv; 2014; arXiv: 1409.1556

27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. Proceedings of the CVPR; Las Vegas, NV, USA, 27–30 June 2016; pp. 770-778.

28. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv; 2014; arXiv: 1409.0473

29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the CVPR; Montreal, QC, Canada, 10–17 October 2021; pp. 10012-10022.

30. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. Proceedings of the ICCV; Montreal, QC, Canada, 10–17 October 2021; pp. 568-578.

31. Deng, P.; Xu, K.; Huang, H. When CNNs meet vision transformer: A joint framework for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett.; 2021; 19, pp. 1-5. [DOI: https://dx.doi.org/10.1109/LGRS.2021.3109061]

32. Zhang, J.; Zhao, H.; Li, J. TRS: Transformers for Remote Sensing Scene Classification. Remote Sens.; 2021; 13, 4143. [DOI: https://dx.doi.org/10.3390/rs13204143]

33. Long, Y.; Xia, G.S.; Li, S.; Yang, W.; Yang, M.Y.; Zhu, X.X.; Zhang, L.; Li, D. On Creating Benchmark Dataset for Aerial Image Interpretation: Reviews, Guidances and Million-AID. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2021; 14, pp. 4205-4230. [DOI: https://dx.doi.org/10.1109/JSTARS.2021.3070368]

34. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV); Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 839-847.

35. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens.; 2017; 55, pp. 3965-3981. [DOI: https://dx.doi.org/10.1109/TGRS.2017.2685945]

36. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. Proceedings of the ECCV; Glasgow, UK, 23–28 August 2020; pp. 213-229.

37. Xu, X.; Feng, Z.; Cao, C.; Li, M.; Wu, J.; Wu, Z.; Shang, Y.; Ye, S. An Improved Swin Transformer-Based Model for Remote Sensing Object Detection and Instance Segmentation. Remote Sens.; 2021; 13, 4779. [DOI: https://dx.doi.org/10.3390/rs13234779]

38. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. Proceedings of the ICCV; Venice, Italy, 22–29 October 2017; pp. 2961-2969.

39. Li, Q.; Chen, Y.; Zeng, Y. Transformer with Transfer CNN for Remote-Sensing-Image Object Detection. Remote Sens.; 2022; 14, 984. [DOI: https://dx.doi.org/10.3390/rs14040984]

40. Zhang, Y.; Liu, X.; Wa, S.; Chen, S.; Ma, Q. GANsformer: A Detection Network for Aerial Images with High Performance Combining Convolutional Network and Transformer. Remote Sens.; 2022; 14, 923. [DOI: https://dx.doi.org/10.3390/rs14040923]

41. Zheng, Y.; Sun, P.; Zhou, Z.; Xu, W.; Ren, Q. ADT-Det: Adaptive Dynamic Refined Single-Stage Transformer Detector for Arbitrary-Oriented Object Detection in Satellite Optical Imagery. Remote Sens.; 2021; 13, 2623. [DOI: https://dx.doi.org/10.3390/rs13132623]

42. Tang, J.; Zhang, W.; Liu, H.; Yang, M.; Jiang, B.; Hu, G.; Bai, X. Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection. Proceedings of the CVPR; New Orleans, LA, USA, 19–24 June 2022; pp. 4563-4572.

43. Dai, Y.; Yu, J.; Zhang, D.; Hu, T.; Zheng, X. RODFormer: High-Precision Design for Rotating Object Detection with Transformers. Sensors; 2022; 22, 2633. [DOI: https://dx.doi.org/10.3390/s22072633]

44. Zhou, Q.; Yu, C. Point RCNN: An Angle-Free Framework for Rotated Object Detection. Remote Sens.; 2022; 14, 2605. [DOI: https://dx.doi.org/10.3390/rs14112605]

45. Liu, X.; Ma, S.; He, L.; Wang, C.; Chen, Z. Hybrid Network Model: TransConvNet for Oriented Object Detection in Remote Sensing Images. Remote Sens.; 2022; 14, 2090. [DOI: https://dx.doi.org/10.3390/rs14092090]

46. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented RepPoints for Aerial Object Detection. Proceedings of the IEEE/CVF; Nashville, TN, USA, 20–25 June 2021; pp. 1829-1838.

47. Ma, T.; Mao, M.; Zheng, H.; Gao, P.; Wang, X.; Han, S.; Ding, E.; Zhang, B.; Doermann, D. Oriented Object Detection with Transformer. arXiv; 2021; arXiv: 2106.03146

48. Dai, L.; Liu, H.; Tang, H.; Wu, Z.; Song, P. AO2-DETR: Arbitrary-Oriented Object Detection Transformer. arXiv; 2022; arXiv: 2205.12785[DOI: https://dx.doi.org/10.1109/TCSVT.2022.3222906]

49. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. Proceedings of the CVPR; Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974-3983.

50. Muzein, B.S. Remote Sensing & GIS for Land Cover, Land Use Change Detection and Analysis in the Semi-Natural Ecosystems and Agriculture Landscapes of the Central Ethiopian Rift Valley. Ph.D. Thesis; Institute of Photogrammetry and Remote Sensing, Technology University of Dresden: Dresden, Germany, 2006.

51. Haack, B.; Wolf, J.; English, R. Remote sensing change detection of irrigated agriculture in Afghanistan. Geocarto Int.; 1998; 13, pp. 65-75. [DOI: https://dx.doi.org/10.1080/10106049809354643]

52. Bolorinos, J.; Ajami, N.K.; Rajagopal, R. Consumption change detection for urban planning: Monitoring and segmenting water customers during drought. Water Resour. Res.; 2020; 56, e2019WR025812. [DOI: https://dx.doi.org/10.1029/2019WR025812]

53. Metternicht, G. Change detection assessment using fuzzy sets and remotely sensed data: An application of topographic map revision. ISPRS J. Photogramm. Remote Sens.; 1999; 54, pp. 221-233. [DOI: https://dx.doi.org/10.1016/S0924-2716(99)00023-4]

54. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens.; 2021; 60, 5607514. [DOI: https://dx.doi.org/10.1109/TGRS.2021.3095166]

55. Guo, Q.; Zhang, J.; Zhu, S.; Zhong, C.; Zhang, Y. Deep multiscale Siamese network with parallel convolutional structure and self-attention for change detection. IEEE Trans. Geosci. Remote Sens.; 2021; 60, 3131993. [DOI: https://dx.doi.org/10.1109/TGRS.2021.3131993]

56. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens.; 2022; 60, 5224713. [DOI: https://dx.doi.org/10.1109/TGRS.2022.3221492]

57. Wang, G.; Li, B.; Zhang, T.; Zhang, S. A Network Combining a Transformer and a Convolutional Neural Network for Remote Sensing Image Change Detection. Remote Sens.; 2022; 14, 2228. [DOI: https://dx.doi.org/10.3390/rs14092228]

58. Li, Q.; Zhong, R.; Du, X.; Du, Y. TransUNetCD: A Hybrid Transformer Network for Change Detection in Optical Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens.; 2022; 60, 5622519. [DOI: https://dx.doi.org/10.1109/TGRS.2022.3169479]

59. Ke, Q.; Zhang, P. Hybrid-TransCD: A Hybrid Transformer Remote Sensing Image Change Detection Network via Token Aggregation. Int. J. Geo-Inform.; 2022; 11, 263. [DOI: https://dx.doi.org/10.3390/ijgi11040263]

60. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens.; 2018; 57, pp. 574-586. [DOI: https://dx.doi.org/10.1109/TGRS.2018.2858817]

61. Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens.; 2020; 12, 1662. [DOI: https://dx.doi.org/10.3390/rs12101662]

62. Daudt, R.C.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. Proceedings of the ICIP; Athens, Greece, 7 October 2018; pp. 4063-4067.

63. Alcantarilla, P.F.; Stent, S.; Ros, G.; Arroyo, R.; Gherardi, R. Street-view change detection with deconvolutional networks. Auton. Robot.; 2018; 42, pp. 1301-1322. [DOI: https://dx.doi.org/10.1007/s10514-018-9734-5]

64. Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Huang, H.; Zhu, J.; Liu, Y.; Li, H. DASNet: Dual attentive fully convolutional Siamese networks for change detection in high-resolution satellite images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2020; 14, pp. 1194-1206. [DOI: https://dx.doi.org/10.1109/JSTARS.2020.3037893]

65. Xu, Z.; Zhang, W.; Zhang, T.; Yang, Z.; Li, J. Efficient transformer for remote sensing image segmentation. Remote Sens.; 2021; 13, 3585. [DOI: https://dx.doi.org/10.3390/rs13183585]

66. Wang, H.; Chen, X.; Zhang, T.; Xu, Z.; Li, J. CCTNet: Coupled CNN and Transformer Network for Crop Segmentation of Remote Sensing Images. Remote Sens.; 2022; 14, 1956. [DOI: https://dx.doi.org/10.3390/rs14091956]

67. Gao, L.; Liu, H.; Yang, M.; Chen, L.; Wan, Y.; Xiao, Z.; Qian, Y. STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2021; 14, pp. 10990-11003. [DOI: https://dx.doi.org/10.1109/JSTARS.2021.3119654]

68. Zhang, C.; Jiang, W.; Zhang, Y.; Wang, W.; Zhao, Q.; Wang, C. Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens.; 2022; 60, pp. 1-20. [DOI: https://dx.doi.org/10.1109/TGRS.2022.3144894]

69. Panboonyuen, T.; Jitkajornwanich, K.; Lawawirojwong, S.; Srestasathiern, P.; Vateekul, P. Transformer-Based Decoder Designs for Semantic Segmentation on Remotely Sensed Images. Remote Sens.; 2021; 13, 5100. [DOI: https://dx.doi.org/10.3390/rs13245100]

70. Available online: https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (accessed on 27 August 2022).

71. Available online: https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx (accessed on 27 August 2022).

72. Chen, K.; Zou, Z.; Shi, Z. Building extraction from remote sensing images with sparse token transformers. Remote Sens.; 2021; 13, 4441. [DOI: https://dx.doi.org/10.3390/rs13214441]

73. Xiao, X.; Guo, W.; Chen, R.; Hui, Y.; Wang, J.; Zhao, H. A Swin Transformer-Based Encoding Booster Integrated in U-Shaped Network for Building Extraction. Remote Sens.; 2022; 14, 2611. [DOI: https://dx.doi.org/10.3390/rs14112611]

74. Wang, L.; Fang, S.; Meng, X.; Li, R. Building extraction with vision transformer. IEEE Trans. Geosci. Remote Sens.; 2022; 14, 2611. [DOI: https://dx.doi.org/10.1109/TGRS.2022.3186634]

75. Qiu, C.; Li, H.; Guo, W.; Chen, X.; Yu, A.; Tong, X.; Schmitt, M. Transferring transformer-based models for cross-area building extraction from remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2022; 15, pp. 4104-4116. [DOI: https://dx.doi.org/10.1109/JSTARS.2022.3175200]

76. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. Proceedings of the SIGSPATIAL; San Jose, CA, USA, 2–5 November 2010; pp. 270-279.

77. Wang, Q.; Liu, S.; Chanussot, J.; Li, X. Scene classification with recurrent attention of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens.; 2019; 57, pp. 1155-1167. [DOI: https://dx.doi.org/10.1109/TGRS.2018.2864987]

78. Cheng, G.; Li, Z.; Yao, X.; Guo, L.; Wei, Z. Remote sensing image scene classification using bag of convolutional features. IEEE Geosci. Remote Sens. Lett.; 2017; 14, pp. 1735-1739. [DOI: https://dx.doi.org/10.1109/LGRS.2017.2731997]

79. Li, Y.; Zhu, Z.; Yu, J.G.; Zhang, Y. Learning deep cross-modal embedding networks for zero-shot remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens.; 2021; 59, pp. 10590-10603. [DOI: https://dx.doi.org/10.1109/TGRS.2020.3047447]

80. Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. Isaid: A large-scale dataset for instance segmentation in aerial images. Proceedings of the CVPR Workshops; Long Beach, CA, USA, 16–20 June 2019; pp. 28-37.

81. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. Proceedings of the ICPRAM; Porto, Portugal, 24–26 February 2017.

82. Lebedev, M.; Vizilter, Y.V.; Vygolov, O.; Knyaz, V.; Rubis, A.Y. Change Detection in remote sensing images using conditional adversarial networks. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci.; 2018; 42, pp. 324-331. [DOI: https://dx.doi.org/10.5194/isprs-archives-XLII-2-565-2018]

83. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens.; 2020; 159, pp. 296-307. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2019.11.023]

84. Zhang, Y.; Yuan, Y.; Feng, Y.; Lu, X. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Trans. Geosci. Remote Sens.; 2019; 57, pp. 5535-5548. [DOI: https://dx.doi.org/10.1109/TGRS.2019.2900302]

85. Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens.; 2016; 54, pp. 7405-7415. [DOI: https://dx.doi.org/10.1109/TGRS.2016.2601622]

86. Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. Proceedings of the ICIP; Quebec City, QC, Canada, 27–30 September 2015; pp. 3735-3739.

87. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent.; 2016; 34, pp. 187-203. [DOI: https://dx.doi.org/10.1016/j.jvcir.2015.11.002]

88. Pan, X.; Ren, Y.; Sheng, K.; Dong, W.; Yuan, H.; Guo, X.; Ma, C.; Xu, C. Dynamic refinement network for oriented and densely packed object detection. Proceedings of the CVPR; Seattle, WA, USA, 13–19 June 2020; pp. 11207-11216.

89. Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. Proceedings of the CVPR; Las Vegas, NV, USA, 27–30 June 2016; pp. 2315-2324.

90. Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S. et al. ICDAR 2015 competition on robust reading. Proceedings of the ICDAR; Tunis, Tunisia, 23–26 August 2015; pp. 1156-1160.

91. Nayef, N.; Yin, F.; Bizid, I.; Choi, H.; Feng, Y.; Karatzas, D.; Luo, Z.; Pal, U.; Rigaud, C.; Chazalon, J. et al. Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. Proceedings of the ICDAR; Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1454-1459.

92. Yao, C.; Bai, X.; Liu, W.; Ma, Y.; Tu, Z. Detecting texts of arbitrary orientations in natural images. Proceedings of the CVPR; Providence, RI, USA, 16–21 June 2012; pp. 1083-1090.

93. He, M.; Liu, Y.; Yang, Z.; Zhang, S.; Luo, C.; Gao, F.; Zheng, Q.; Wang, Y.; Zhang, X.; Jin, L. ICPR2018 contest on robust reading for multi-type web images. Proceedings of the ICPR; Beijing, China, 20–24 August 2018; pp. 7-12.

94. Ch’ng, C.K.; Chan, C.S. Total-text: A comprehensive dataset for scene text detection and recognition. Proceedings of the ICDAR; Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 935-942.

95. Yuliang, L.; Lianwen, J.; Shuaitao, Z.; Sheng, Z. Detecting curve text in the wild: New dataset and new solution. arXiv; 2017; arXiv: 1712.02170

96. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens.; 2020; 166, pp. 183-200. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2020.06.003]

97. Shen, X.; Liu, B.; Zhou, Y.; Zhao, J. Remote sensing image caption generation via transformer and reinforcement learning. Multi. Tools Appl.; 2020; 79, pp. 26661-26682. [DOI: https://dx.doi.org/10.1007/s11042-020-09294-7]

98. Liu, C.; Zhao, R.; Shi, Z. Remote-Sensing Image Captioning Based on Multilayer Aggregated Transformer. IEEE Geosci. Remote Sens. Lett.; 2022; 19, 6506605. [DOI: https://dx.doi.org/10.1109/LGRS.2022.3150957]

99. Ren, Z.; Gou, S.; Guo, Z.; Mao, S.; Li, R. A Mask-Guided Transformer Network with Topic Token for Remote Sensing Image Captioning. Remote Sens.; 2022; 14, 2939. [DOI: https://dx.doi.org/10.3390/rs14122939]

100. Lei, S.; Shi, Z.; Mo, W. Transformer-Based Multistage Enhancement for Remote Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote Sens.; 2021; 60, 5615611. [DOI: https://dx.doi.org/10.1109/TGRS.2021.3136190]

101. Ye, C.; Yan, L.; Zhang, Y.; Zhan, J.; Yang, J.; Wang, J. A Super-resolution Method of Remote Sensing Image Using Transformers. IDAACS; 2021; 2, pp. 905-910.

102. An, T.; Zhang, X.; Huo, C.; Xue, B.; Wang, L.; Pan, C. TR-MISR: Multiimage Super-Resolution Based on Feature Fusion with Transformers. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2022; 15, pp. 1373-1388. [DOI: https://dx.doi.org/10.1109/JSTARS.2022.3143532]

103. Shi, Q.; Liu, M.; Li, S.; Liu, X.; Wang, F.; Zhang, L. A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection. IEEE Trans. Geosci. Remote Sens.; 2021; 60, 5604816. [DOI: https://dx.doi.org/10.1109/TGRS.2021.3085870]

104. Daudt, R.C.; Le Saux, B.; Boulch, A.; Gousseau, Y. Urban change detection for multispectral earth observation using convolutional neural networks. Proceedings of the IEEE International Geoscience and Remote Sensing Symposium; Valencia, Spain, 22–27 July 2018; pp. 2115-2118.

105. Daudt, R.C.; Le Saux, B.; Boulch, A.; Gousseau, Y. Multitask learning for large-scale semantic change detection. Comput. Vis. Image Underst.; 2019; 187, 102783. [DOI: https://dx.doi.org/10.1016/j.cviu.2019.07.003]

106. Shen, L.; Lu, Y.; Chen, H.; Wei, H.; Xie, D.; Yue, J.; Chen, R.; Lv, S.; Jiang, B. S2Looking: A satellite side-looking dataset for building change detection. Remote Sens.; 2021; 13, 5094. [DOI: https://dx.doi.org/10.3390/rs13245094]

107. Barley Remote Sensing Dataset. Available online: https://tianchi.aliyun.com/dataset/dataDetail?dataId=74952 (accessed on 27 August 2022).

108. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can semantic labeling methods generalize to any city? The inria aerial image labeling benchmark. Proceedings of the IGARSS; Fort Worth, TX, USA, 23–28 July 2017; pp. 3226-3229.

109. Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens.; 2017; 56, pp. 2183-2195. [DOI: https://dx.doi.org/10.1109/TGRS.2017.2776321]

110. MEGA. Available online: https://mega.nz/folder/wCpSzSoS#RXzIlrv–TDt3ENZdKN8JA (accessed on 27 August 2022).

111. MEGA. Available online: https://mega.nz/folder/pG4yTYYA#4c4buNFLibryZnlujsrwEQ (accessed on 27 August 2022).

112. Märtens, M.; Izzo, D.; Krzic, A.; Cox, D. Super-resolution of PROBA-V images using convolutional neural networks. Astrodynamics; 2019; 3, pp. 387-402. [DOI: https://dx.doi.org/10.1007/s42064-019-0059-8]

113. Available online: http://weegee.vision.ucmerced.edu/datasets/landuse.html (accessed on 27 August 2022).

114. He, J.; Zhao, L.; Yang, H.; Zhang, M.; Li, W. HSI-BERT: Hyperspectral image classification using the bidirectional encoder representation from transformers. IEEE Trans. Geosci. Remote Sens.; 2019; 58, pp. 165-178. [DOI: https://dx.doi.org/10.1109/TGRS.2019.2934760]

115. Zhong, Z.; Li, Y.; Ma, L.; Li, J.; Zheng, W.S. Spectral-spatial transformer network for hyperspectral image classification: A factorized architecture search framework. IEEE Trans. Geosci. Remote Sens.; 2021; 60, 5514715. [DOI: https://dx.doi.org/10.1109/TGRS.2021.3115699]

116. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens.; 2022; 60, 5522214. [DOI: https://dx.doi.org/10.1109/TGRS.2022.3221534]

117. Roy, S.K.; Deria, A.; Hong, D.; Rasti, B.; Plaza, A.; Chanussot, J. Multimodal fusion transformer for remote sensing image classification. arXiv; 2022; arXiv: 2203.16952

118. Xue, Z.; Tan, X.; Yu, X.; Liu, B.; Yu, A.; Zhang, P. Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification. IEEE Trans. Image Process.; 2022; 31, pp. 3095-3110. [DOI: https://dx.doi.org/10.1109/TIP.2022.3162964]

119. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep Convolutional Neural Networks for Hyperspectral Image Classification. Sensors; 2015; 2015, 258619. [DOI: https://dx.doi.org/10.1155/2015/258619]

120. Li, W.; Wu, G.; Zhang, F.; Du, Q. Hyperspectral Image Classification Using Deep Pixel-Pair Features. IEEE Trans. Geosci. Remote Sens.; 2017; 2, pp. 844-853. [DOI: https://dx.doi.org/10.1109/TGRS.2016.2616355]

121. Zhang, F.; Zhang, K.; Sun, J. Multiscale Spatial–Spectral Interaction Transformer for Pan-Sharpening. Remote Sens.; 2022; 14, 1736. [DOI: https://dx.doi.org/10.3390/rs14071736]

122. Li, S.; Guo, Q.; Li, A. Pan-Sharpening Based on CNN+ Pyramid Transformer by Using No-Reference Loss. Remote Sens.; 2022; 14, 624. [DOI: https://dx.doi.org/10.3390/rs14030624]

123. Liang, Y.; Zhang, P.; Mei, Y.; Wang, T. PMACNet: Parallel Multiscale Attention Constraint Network for Pan-Sharpening. IEEE Geosci. Remote Sens. Lett.; 2022; 19, 5512805. [DOI: https://dx.doi.org/10.1109/LGRS.2022.3170904]

124. Su, X.; Li, J.; Hua, Z. Transformer-Based Regression Network for Pansharpening Remote Sensing Images. IEEE Trans. Geosci. Remote Sens.; 2022; 60, 5407423. [DOI: https://dx.doi.org/10.1109/TGRS.2022.3152425]

125. Zhou, M.; Huang, J.; Fang, Y.; Fu, X.; Liu, A. Pan-Sharpening with Customized Transformer and Invertible Neural Network. AAAI; 2022; 36, pp. 3553-3561. [DOI: https://dx.doi.org/10.1609/aaai.v36i3.20267]

126. Bandara, W.; Patel, V. HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening. Proceedings of the CVPR; New Orleans, LA, USA, 19–24 June 2022; pp. 1767-1777.

127. 220 Band AVIRIS Hyperspectral Image Data Set: June 12, 1992 Indian Pine Test Site 3. Available online: https://purr.purdue.edu/publications/1947/1 (accessed on 27 August 2022).

128. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Pavia_Centre_and_University (accessed on 27 August 2022).

129. Available online: https://hyperspectral.ee.uh.edu/?page_id=459 (accessed on 27 August 2022).

130. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Salinas (accessed on 27 August 2022).

131. Gader, P.; Zare, A.; Close, R.; Aitken, J.; Tuell, G. Muufl Gulfport Hyperspectral and Lidar Airborne Data Set; Technical Report REP-2013-570 University of Florida: Gainesville, FL, USA, 2013.

132. Hyperspectral Image Analysis Lab. Available online: https://hyperspectral.ee.uh.edu/?page_id=1075 (accessed on 27 August 2022).

133. Pavia Centre Scene. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Pavia_Centre_scene (accessed on 27 August 2022).

134. Zhou, H.; Liu, Q.; Wang, Y. PanFormer: A Transformer Based Model for Pan-sharpening. arXiv; 2022; arXiv: 2203.02916

135. WorldView-2 Full Archive and Tasking. Available online: https://earth.esa.int/eogateway/catalog/worldview-2-full-archive-and-tasking (accessed on 27 August 2022).

136. WorldView-3 Full Archive and Tasking. Available online: https://earth.esa.int/eogateway/catalog/worldview-3-full-archive-and-tasking (accessed on 27 August 2022).

137. Botswana. Available online: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Botswana (accessed on 27 August 2022).

138. Yokoya, N.; Iwasaki, A. Airborne Hyperspectral Data over Chikusei; Technical Report Space Application Laboratory, University of Tokyo: Tokyo, Japan, 2016; Volume 5.

139. Pleiades. Available online: https://pleiades.stoa.org/downloads (accessed on 27 August 2022).

140. QuickBird Full Archive. Available online: https://earth.esa.int/eogateway/catalog/quickbird-full-archive (accessed on 27 August 2022).

141. Dong, H.; Zhang, L.; Zou, B. Exploring Vision Transformers for Polarimetric SAR Image Classification. IEEE Trans. Geosci. Remote Sens.; 2022; 60, 5219715. [DOI: https://dx.doi.org/10.1109/TGRS.2021.3137383]

142. Liu, X.; Wu, Y.; Liang, W.; Cao, Y.; Li, M. High Resolution SAR Image Classification Using Global-Local Network Structure Based on Vision Transformer and CNN. IEEE Geosci. Remote Sens. Lett.; 2022; 19, 4505405. [DOI: https://dx.doi.org/10.1109/LGRS.2022.3151353]

143. Cai, J.; Zhang, Y.; Guo, J.; Zhao, X.; Lv, J.; Hu, Y. ST-PN: A Spatial Transformed Prototypical Network for Few-Shot SAR Image Classification. Remote Sens.; 2022; 14, 2019. [DOI: https://dx.doi.org/10.3390/rs14092019]

144. Ke, X.; Zhang, X.; Zhang, T. GCBANet: A Global Context Boundary-Aware Network for SAR Ship Instance Segmentation. Remote Sens.; 2022; 14, 2165. [DOI: https://dx.doi.org/10.3390/rs14092165]

145. Xia, R.; Chen, J.; Huang, Z.; Wan, H.; Wu, B.; Sun, L.; Yao, B.; Xiang, H.; Xing, M. CRTransSar: A Visual Transformer Based on Contextual Joint Representation Learning for SAR Ship Detection. Remote Sens.; 2022; 14, 1488. [DOI: https://dx.doi.org/10.3390/rs14061488]

146. Chen, L.; Luo, R.; Xing, J.; Li, Z.; Yuan, Z.; Cai, X. Geospatial transformer is what you need for aircraft detection in SAR Imagery. IEEE Trans. Geosci. Remote Sens.; 2022; 60, pp. 1-15. [DOI: https://dx.doi.org/10.1109/TGRS.2022.3162235]

147. Zhang, P.; Xu, H.; Tian, T.; Gao, P.; Tian, J. SFRE-Net: Scattering Feature Relation Enhancement Network for Aircraft Detection in SAR Images. Remote Sens.; 2022; 14, 2076. [DOI: https://dx.doi.org/10.3390/rs14092076]

148. Ma, C.; Zhang, Y.; Guo, J.; Hu, Y.; Geng, X.; Li, F.; Lei, B.; Ding, C. End-to-End Method with Transformer for 3D Detection of Oil Tank from Single SAR Image. IEEE Trans. Geosci. Remote Sens.; 2021; 60, 5217619.

149. Perera, M.; Bandara, W.; Valanarasu, J.; Patel, V. Transformer-based SAR Image Despeckling. arXiv; 2022; arXiv: 2201.09355

150. Dong, H.; Ma, W.; Jiao, L.; Liu, F.; Shang, R.; Li, Y.; Bai, J. A Contrastive Learning Transformer for Change Detection in High-Resolution SAR Images; SSRN 4169439 SSRN: Rochester, NY, USA, 2022.

151. Fan, Y.; Wang, F.; Wang, H. A Transformer-Based Coarse-to-Fine Wide-Swath SAR Image Registration Method under Weak Texture Conditions. Remote Sens.; 2022; 14, 1175. [DOI: https://dx.doi.org/10.3390/rs14051175]

152. Norikane, L.; Broek, B.; Freeman, A. Application of modified VICAR/IBIS GIS to analysis of July 1991 Flevoland AIRSAR data. Proceedings of the AIRSAR Workshop; Pasadena, CA, USA, 1–5 June 1992; Volume 3.

153. E-SAR—The Airborne SAR System of DLR. Available online: https://www.dlr.de/hr/en/desktopdefault.aspx/tabid-2326/3776_read-5679/ (accessed on 27 August 2022).

154. Available online: https://ietr-lab.univ-rennes1.fr/polsarpro-bio/san-francisco/dataset/SAN_FRANCISCO_AIRSAR.zip (accessed on 27 August 2022).

155. Use Data. Available online: https://www.eorc.jaxa.jp/ALOS/en/alos-2/a2_data_e.htm (accessed on 27 August 2022).

156. GF-3 (Gaofen-3). Available online: https://directory.eoportal.org/web/eoportal/satellite-missions/g/gaofen-3 (accessed on 27 August 2022).

157. F-SAR—The New Airborne SAR System. Available online: https://www.dlr.de/hr/en/desktopdefault.aspx/tabid-2326/3776_read-5691/ (accessed on 27 August 2022).

158. MSTAR Overview. Available online: https://www.sdms.afrl.af.mil/index.php?collection=mstar (accessed on 27 August 2022).

159. Li, J.; Qu, C.; Shao, J. Ship detection in SAR images based on an improved faster R-CNN. Proceedings of the BIGSARDATA; Beijing, China, 3–14 November 2017; pp. 1-6.

160. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A High-Resolution SAR Images Dataset for Ship Detection and Instance Segmentation. IEEE Access; 2020; 8, pp. 120234-120254. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3005861]

161. CryoSat Products. Available online: https://earth.esa.int/eogateway/catalog/cryosat-products (accessed on 27 August 2022).

162. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the ICCV; Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 416-423.

163. TerraSAR-X ESA Archive. Available online: https://earth.esa.int/eogateway/catalog/terrasar-x-esa-archive (accessed on 27 August 2022).

164. Li, Z.; Snavely, N. MegaDepth: Learning Single-View Depth Prediction from Internet Photos. Proceedings of the CVPR; Salt Lake City, UT, USA, 18–23 June 2018; pp. 2041-2050.

165. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. Proceedings of the CVPR; New Orleans, LA, USA, 19–24 June 2022; pp. 12124-12134.

166. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. Proceedings of the ICLR; Virtual-Only, 25 April 2022.

167. Yanghao, L.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection. Proceedings of the CVPR; New Orleans, LA, USA, 19–24 June 2022; pp. 4804-4814.

Word count: 16788

Show less

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

Deep learning-based algorithms have seen a massive popularity in different areas of remote sensing image analysis over the past decade. Recently, transformer-based architectures, originally introduced in natural language processing, have pervaded computer vision field where the self-attention mechanism has been utilized as a replacement to the popular convolution operator for capturing long-range dependencies. Inspired by recent advances in computer vision, the remote sensing community has also witnessed an increased exploration of vision transformers for a diverse set of tasks. Although a number of surveys have focused on transformers in computer vision in general, to the best of our knowledge we are the first to present a systematic review of recent advances based on transformers in remote sensing. Our survey covers more than 60 recent transformer-based methods for different remote sensing problems in sub-areas of remote sensing: very high-resolution (VHR), hyperspectral (HSI) and synthetic aperture radar (SAR) imagery. We conclude the survey by discussing different challenges and open issues of transformers in remote sensing.

Details

Title

Transformers in Remote Sensing: A Survey

Author

Abdulaziz Amer Aleissaee¹

; Kumar, Amandeep¹; Rao, Muhammad Anwer¹; Khan, Salman¹; Cholakkal, Hisham¹; Gui-Song, Xia²

; Fahad Shahbaz Khan¹

¹ Computer Vision Faculty, Mohamed bin Zayed University of Artificial Intelligence, Building 1B, Masdar City, Abu Dhabi P.O. Box 5224, United Arab Emirates; [email protected] (A.K.);
² School of Computer Science, Wuhan University, Wuchang District, Wuhan 430072, China

First page

1860

Publication year

2023

Publication date

2023

Publisher

MDPI AG

e-ISSN

20724292

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/rs15071860

ProQuest document ID

2799747487

Transformers in Remote Sensing: A Survey

Jump to:

Full Text

Abstract

Details

Suggested sources