1. Introduction
High-resolution remote sensing images play a crucial role in various applications, including urban planning [1,2], environmental monitoring [3,4], precision agriculture [5], and national defense [6]. Advances in remote sensing technologies, such as unmanned aerial vehicles (UAVs) and high-resolution satellite imagery, have led to an unprecedented increase in data availability [7,8]. However, efficiently interpreting and utilizing this vast amount of data remains challenging due to its complexity, variability, and the presence of fine-grained details [6,9].
Semantic segmentation, which classifies each pixel into meaningful categories, is a crucial step in extracting valuable insights from remote sensing imagery [8,10]. Accurate segmentation enables precise land cover classification, infrastructure mapping, and disaster response planning, making it a key component in harnessing the full potential of remote sensing data. However, semantic segmentation in remote sensing is inherently complex due to challenges such as scale variations [11], high intra-class variance [8], and low inter-class variance across different land cover types [8,9,10]. These challenges arise from diverse object scales and complex backgrounds in remote sensing images. Traditional segmentation methods can be categorized into unsupervised learning approaches [12], shallow learning models [8], and deep learning-based methods [3,8]. Unsupervised methods rely on handcrafted feature extraction but often fail to generalize well to diverse and complex remote sensing environments [2,5,12]. Shallow learning models improve upon this but still struggle to capture high-level semantic information, limiting their effectiveness [6,7,8].
In contrast, deep learning-based models have significantly advanced remote sensing segmentation by learning hierarchical feature representations. Architectures such as fully convolutional networks (FCN) [13], U-Net [14], SegNet [13], and DeepLab [15]—originally developed for natural and medical image segmentation—have been adapted to remote sensing tasks, achieving remarkable performance improvements. For instance, multi-scale spatial–spectral feature extraction networks [16] and low-memory collaborative global–local networks have improved feature representation. However, existing approaches still face challenges such as suboptimal feature enhancement, redundant representations, and inefficient fusion of global and local features.
Recent advancements in attention mechanisms have significantly improved deep learning models for image classification [17], object detection [16], and semantic segmentation [18]. Attention mechanisms dynamically emphasize critical features while filtering out less relevant information, thereby enhancing model performance. These mechanisms can be categorized into self-attention [17], which enhances intra-image feature interactions and strengthens global dependencies, and cross-attention [18], which facilitates information exchange between multiple feature representations, improving feature fusion.
In remote sensing, attention mechanisms have been leveraged for semantic segmentation to address challenges related to scale variations and feature fusion. Several recent models have demonstrated their effectiveness in refining feature extraction. For instance, WETS-Net [19] employs a dual attention mechanism to improve the segmentation of weak targets. However, its focus on weak edge targets limits its adaptability to more complex remote sensing scenes. Similarly, GLIMS [20] integrates CNNs and transformers for global–local feature extraction but suffers from high computational overhead, making it less suitable for real-time applications. Other attention-based models, such as AACNet [21], MSANet [22], and DANet [23], leverage different attention mechanisms to enhance spatial and channel-wise feature learning. However, they often struggle with either computational inefficiency, occlusion handling, or overemphasis on global features, which can neglect fine-grained details.
Despite these advancements, existing approaches often fail to fully exploit attention mechanisms for modeling long-range dependencies and emphasizing critical features, leading to suboptimal feature fusion and representation learning. Global–local networks (GLNet) [24], while effective in integrating global and local information, often underutilize attention, resulting in inadequate segmentation outcomes [25].
To address these limitations, we propose the dual-level network (DLNet), an enhanced GLNet-based framework [26] incorporating self-attention and cross-attention modules for refined global–local feature extraction and fusion. DLNet introduces a hierarchical design with macro-level and micro-level processing, fused via a dual-attention mechanism. Specifically, DLNet includes self-attention, cross-attention, and deep feature fusion to maximize contextual awareness and fine-detail preservation. The macro-level branch extracts global context, while the micro-level branch captures fine-grained details. Self-attention enhances features within each level, and cross-attention facilitates bidirectional interaction for comprehensive global–local fusion. Deep feature fusion integrates these enriched features for precise segmentation maps.
DLNet’s dual-level attention mechanism enhances segmentation performance, model transparency, and reliability, making it suitable for applications like urban planning, environmental monitoring, and disaster management. By leveraging attention and feature fusion, DLNet effectively addresses global–local feature extraction challenges, improving segmentation accuracy for high-resolution remote sensing imagery.
Our key contributions are as follows:
This paper develops a novel segmentation framework based on GLNet, improving segmentation accuracy and computational efficiency. The proposed model achieves state-of-the-art performance on high-resolution remote sensing images by enhancing feature fusion through dual-level attention.
This paper incorporates both self-attention and cross-attention mechanisms to enhance feature extraction at global and local levels. The self-attention module improves semantic understanding and preserves fine details, while the cross-attention module facilitates bidirectional interaction between macro (global) and micro (local) features, improving multi-scale feature fusion and enhancing segmentation quality.
This paper proposes a deep feature fusion mechanism, which combines hierarchical multi-level representations using downsampling, upsampling, and concatenation, ensuring a balanced and robust feature-sharing strategy.
This paper conducts extensive ablation studies to analyze the contributions of each module. Results demonstrate that integrating self-attention improves segmentation accuracy from 71.6% to 73.5% mIoU on DeepGlobe, while the addition of cross-attention further raises accuracy to 74.1%. The complete DLNet framework, incorporating both attention mechanisms and deep feature fusion, achieves the highest mIoU of 76.9%, surpassing all baselines while maintaining lower memory usage (1443 MB vs. 2447 MB in EHSNet) and an optimal balance of computation (1632.8 G FLOPs) and efficiency (518.3 ms inference time).
2. Related Work
2.1. Deep Learning-Based Semantic Segmentation
Traditional segmentation methods, including unsupervised clustering [12] and shallow machine learning models [8], often struggle to handle the complexity, scale variations, and spectral inconsistencies inherent in remote sensing imagery [2,5,6,7,8,12]. To overcome these limitations, deep learning-based segmentation models have emerged as the dominant approach, achieving remarkable improvements in accuracy and generalization.
Fully convolutional networks (FCN) [27] pioneered the application of deep learning to semantic segmentation, introducing an end-to-end framework that replaces fully connected layers with convolutional layers, enabling pixel-wise predictions. FCN reuses pre-trained networks and employs deconvolution layers for upsampling, allowing for finer segmentation results. The encoder–decoder architecture further enhanced segmentation accuracy by incorporating hierarchical feature extraction and structured information flow. U-Net [28], SegNet [13], and ENet [29] further refined segmentation accuracy by integrating skip connections and structured feature hierarchies. Dilated convolution methods [30] address the challenge of preserving spatial details by enlarging the receptive field without increasing computational complexity. This technique was effectively utilized in the DeepLab series, culminating in DeepLabV3+ [31], which improved semantic segmentation through atrous spatial pyramid pooling (ASPP) and an enhanced decoder structure.
Despite these advancements, conventional deep learning models often struggle with multi-scale feature extraction, global contextual representation, and fine-grained detail preservation, motivating the development of hybrid frameworks that integrate local and global features.
2.2. Global–Local Feature Fusion Networks
To address the challenges of multi-scale feature extraction and contextual information loss, several hybrid models have been proposed, focusing on global–local feature fusion. These networks aim to integrate coarse global context with fine-grained local details, improving segmentation accuracy in complex remote sensing images.
The collaborative global and local network (GLNet) [26] is a representative model that efficiently integrates global contextual information with fine-grained local details, enabling improved segmentation accuracy. GLNet employs two parallel branches: one that captures global-level semantic structures and another that preserves high-resolution spatial information. However, existing GLNet-based frameworks often underutilize attention mechanisms, resulting in suboptimal feature fusion and inefficient long-range dependency modeling [25].
Other hybrid frameworks have also demonstrated promising results. The multi-scale context aggregation network (MSCAN) [32] integrates pyramid pooling and multi-resolution feature aggregation to refine segmentation in high-resolution images. HRNet (High-Resolution Network) [33] maintains high-resolution feature representations throughout the network, ensuring that both global structures and local boundaries are preserved. However, HRNet’s computational complexity limits its deployment in real-time applications.
Despite these advancements, existing global–local feature fusion networks still face challenges in efficient feature refinement, dynamic adaptability, and attention-based weighting mechanisms. This motivates the development of DLNet, a framework that further optimizes global–local interactions using self-attention, cross-attention, and hierarchical deep feature fusion.
2.3. Attention Mechanisms in Semantic Segmentation
Inspired by human cognitive processes, attention mechanisms have become an essential component in deep learning, allowing models to dynamically focus on relevant features while suppressing irrelevant information. Self-attention [17], popularized by the transformer architecture [34], enhances intra-image feature interactions, while cross-attention [18] facilitates multi-scale feature exchange and context-aware learning. These mechanisms have been widely adopted in semantic segmentation to capture long-range dependencies and improve feature representation.
Several networks have successfully integrated attention mechanisms to enhance segmentation performance. DANet [23] and SENet [35] utilize channel and spatial attention to refine feature extraction, improving global feature awareness. CBAM (convolutional block attention module) [36] further extends this concept by introducing spatial and channel-wise feature recalibration.
Recent studies have explored the application of dual-attention mechanisms in remote sensing segmentation. For instance, the weak edge target segmentation network (WETS-Net) [19] incorporates self-attention and spatial attention modules to enhance edge preservation and weak target segmentation. However, WETS-Net is primarily optimized for weak edge targets, making it less effective for complex large-scale scenes with dense object distributions. Additionally, its reliance on a predefined feature fusion strategy limits adaptability to diverse remote sensing datasets.
Similarly, the GLIMS network [24] combines CNNs and transformers to balance local feature extraction and global context modeling. While GLIMS effectively improves segmentation accuracy, its high computational overhead makes it unsuitable for real-time applications or deployment on resource-constrained platforms. Furthermore, GLIMS lacks explicit feature regularization, leading to potential feature redundancy and reduced segmentation robustness in heterogeneous landscapes.
3. Methodologies
3.1. Design Overview
The architecture of the DLNet is illustrated in Figure 1. The design integrates macro-level self-attention, micro-level self-attention, dual-level cross-attention, a macro-level stream, a micro-level stream, feature aggregation, and feature map fusion to improve high-resolution remote sensing image segmentation. The network follows a structured pipeline consisting of three main stages: (1) data preprocessing, which involves input scaling and subset regularization to refine spatial structures while preserving essential features; (2) self- and cross-attention, where macro-level self-attention captures global contextual relationships, micro-level self-attention preserves fine-grained details, and dual-level cross-attention facilitates bidirectional information exchange between macro- and micro-level features; and (3) feature fusion and aggregation, where multi-scale feature representations from both streams are integrated through a deep fusion mechanism to generate high-precision segmentation maps. This hierarchical design enables DLNet to effectively capture both global spatial dependencies and fine-resolution structural details, ensuring robust and accurate segmentation performance across complex remote sensing imagery.
3.1.1. Data Preprocessing
Initially, the high-resolution input image undergoes downscaling and subset regularization to refine data representation while preserving essential spatial structures. The data are then processed into two hierarchical levels: the macro level, which captures global contextual features, and the micro level, which focuses on local fine-grained details. This dual-level processing ensures a balanced representation, effectively handling both large-scale spatial dependencies and small-scale structural variations, thereby optimizing feature extraction for the subsequent attention modules.
3.1.2. Self-Attention and Cross-Attention
After the data processing stage, the high-resolution image is processed into two levels to effectively capture both global and local features. Then, the macro-level self-attention module captures global contextual relationships, enabling a broad understanding of spatial dependencies across the entire image. By analyzing large-scale spatial structures, it enhances the model’s ability to differentiate between various land cover types and delineate object boundaries. This mechanism ensures a cohesive high-level semantic representation, allowing the model to retain essential contextual information while preserving structural integrity in complex remote sensing imagery. The micro-level self-attention module focuses on local feature refinement, preserving high-resolution textures and fine details. By concentrating on small-scale variations, this module improves segmentation accuracy in regions with complex textures, edges, and fine-grained structures, preventing the loss of crucial details during downsampling.
The dual-level cross-attention module facilitates bidirectional interaction between the macro-level and micro-level streams, enhancing the fusion of contextual and detailed information. By aligning global and local feature representations, this mechanism ensures that broad contextual insights complement fine-grained segmentation accuracy, leading to improved boundary delineation and overall segmentation performance. The combination of self-attention and cross-attention enables multi-scale feature integration, allowing the model to capture both long-range dependencies and localized patterns effectively.
3.1.3. Feature Fusion and Aggregation
After the self-attention and cross-attention mechanisms are applied at both the macro level and micro level, the global and local features are enhanced, ensuring the preservation of both broad contextual information and fine-grained details. The processed features are then directed into two specialized processing streams. The macro-level stream captures large-scale spatial structures and contextual dependencies, enhancing the model’s ability to understand complex relationships across the entire image. Meanwhile, the micro-level stream preserves high-resolution details, refining texture variations, object boundaries, and small-scale features essential for precise segmentation.
Once processed through their respective streams, the features undergo feature aggregation and fusion, where the enriched macro-level and micro-level representations are integrated. The dual-level cross-attention module plays a key role in aligning these features, ensuring seamless interaction between global context and local precision. This fusion process effectively mitigates inconsistencies between large-scale semantic features and fine-grained structures, leading to a more cohesive and accurate segmentation output.
The final feature fusion module combines the learned representations from both streams, generating a refined segmentation map with enhanced structural coherence and boundary precision. By leveraging both broad contextual awareness and detailed feature representation, this design ensures robust and accurate segmentation for high-resolution remote sensing images. The dual-level attention-enhanced framework effectively integrates global and local features, making it well-suited for complex remote sensing applications requiring high spatial accuracy.
3.2. Data Preprocessing
The first step is the data preprocessing phase, where the high-resolution input images are processed and converted into two levels: macro-level and micro-level. This hierarchical division is essential to effectively balance global contextual understanding and local feature preservation. The macro-level processing captures large-scale spatial structures, enabling the model to understand broad contextual dependencies across the image. Meanwhile, the micro-level processing focuses on fine-grained details, ensuring the preservation of textures, edges, and small-scale variations crucial for accurate segmentation. This dual-level representation enhances the model’s ability to handle complex spatial variations while maintaining computational efficiency, optimizing feature extraction for the subsequent attention mechanisms.
Specifically, for a given dataset of high-resolution images $D$:
$D = \{(x_i, y_i)\}_{i=1}^{N}, \quad x_i, y_i \in \mathbb{R}^{H \times W \times 3}$ (1)
where $x_i$ represents the i-th input image, $y_i$ is the corresponding segmentation mask for image i, and both $x_i$ and $y_i$ belong to the space $\mathbb{R}^{H \times W \times 3}$. $N$ denotes the total number of images in the dataset. Then, the proposed macro-level branch scales down the resolution of the input image as follows:
$D_{\mathrm{mac}} = \{(x_i^{\mathrm{mac}}, y_i^{\mathrm{mac}})\}_{i=1}^{N}, \quad x_i^{\mathrm{mac}}, y_i^{\mathrm{mac}} \in \mathbb{R}^{h \times w \times 3}$ (2)
where $x_i^{\mathrm{mac}}$ represents the i-th low-resolution image at the macro level, and $y_i^{\mathrm{mac}}$ is its corresponding segmentation mask. Both $x_i^{\mathrm{mac}}$ and $y_i^{\mathrm{mac}}$ belong to the space $\mathbb{R}^{h \times w \times 3}$, where $h < H$ and $w < W$. To facilitate multi-scale feature learning, the downsampling operation is applied in the data preprocessing stage to generate macro-level representations. The goal of downsampling is to reduce the resolution of the input images while preserving essential spatial structures and contextual information. This process enables the model to efficiently capture large-scale spatial dependencies while reducing computational overhead.
The downsampling operation is performed using a combination of stride-based convolutional downsampling, bilinear interpolation, and Gaussian pyramid-based smoothing to ensure minimal loss of structural details. Given an input image $I$ of dimensions $H \times W$, the macro-level representation is obtained as follows:
$I_{\mathrm{mac}} = \mathrm{Conv}_{s=2}(I)$ (3)
where $\mathrm{Conv}_{s=2}(\cdot)$ represents a convolutional layer with a stride of 2, effectively reducing the spatial resolution by half in both dimensions. This approach ensures that the network retains key structural information while eliminating redundant details. In addition to convolutional downsampling, bilinear interpolation is employed when further resolution adjustments are required. Bilinear interpolation smoothly resizes the input image while maintaining spatial coherence, making it particularly useful when downsampling factors are non-integer values. Furthermore, to prevent aliasing artifacts, a Gaussian pyramid-based smoothing operation is applied before resizing. This method involves convolving the image with a Gaussian filter to reduce high-frequency noise before downsampling, ensuring that the essential feature representations remain intact.
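As a concrete illustration of Equation (3) and the accompanying anti-aliasing steps, the following PyTorch sketch chains Gaussian smoothing, a stride-2 convolution, and an optional bilinear resize; the module name, kernel sizes, channel counts, and the example macro-level size are illustrative assumptions rather than DLNet's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms.functional as TF

class MacroDownsample(nn.Module):
    """Illustrative macro-level downsampling: Gaussian smoothing to suppress
    aliasing, a stride-2 convolution (Eq. 3), and optional bilinear resizing
    for non-integer scale factors."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        # Stride-2 convolution halves the spatial resolution (Eq. 3).
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, x, target_size=None):
        # Gaussian pre-smoothing (one pyramid level) before resolution reduction.
        x = TF.gaussian_blur(x, kernel_size=5, sigma=1.0)
        x = self.down(x)                               # H x W -> H/2 x W/2
        if target_size is not None:                    # non-integer factors
            x = F.interpolate(x, size=target_size, mode="bilinear",
                              align_corners=False)
        return x

# Example: a 2448 x 2448 DeepGlobe image reduced to a (hypothetical) macro size.
img = torch.randn(1, 3, 2448, 2448)
macro = MacroDownsample()(img, target_size=(612, 612))
```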
The downsampling process is applied to both square (e.g., DeepGlobe Land Cover Classification Dataset, Inria Aerial Image Labeling Dataset) and rectangular images, depending on the dataset used. In our implementation, images retain their original aspect ratio during downsampling to prevent distortion of spatial structures. In the case of non-square images, our approach ensures that downsampling does not artificially alter the aspect ratio. Instead, any required resizing maintains proportional scaling to avoid distortion and to retain meaningful spatial relationships in the imagery.
By implementing this hierarchical downsampling strategy, the model effectively captures both high-level global contextual information and fine-grained local details, ensuring balanced multi-scale feature extraction for high-resolution remote sensing image segmentation.
Meanwhile, the proposed micro branch processes cropped high-resolution patches from the original datasets as shown:
$D_{\mathrm{mic}} = \{(x_{i,j}^{\mathrm{mic}}, y_{i,j}^{\mathrm{mic}})\}, \quad x_{i,j}^{\mathrm{mic}}, y_{i,j}^{\mathrm{mic}} \in \mathbb{R}^{h_p \times w_p \times 3}$ (4)
where $x_{i,j}^{\mathrm{mic}}$ represents the j-th high-resolution patch in the i-th image at the micro level, and $y_{i,j}^{\mathrm{mic}}$ is its corresponding segmentation mask. Both $x_{i,j}^{\mathrm{mic}}$ and $y_{i,j}^{\mathrm{mic}}$ belong to the space $\mathbb{R}^{h_p \times w_p \times 3}$, where $h_p < H$ and $w_p < W$. After data processing, the high-resolution input images are transformed into two levels: the macro level, which consists of downsampled images, and the micro level, where the original high-resolution image is divided into non-overlapping patches.
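A minimal sketch of the micro-level patchification implied by Equation (4) is given below; the patch size is a placeholder, and the image extent is assumed divisible by it.

```python
import torch

def to_micro_patches(image: torch.Tensor, patch: int = 512) -> torch.Tensor:
    """Split a (C, H, W) image into non-overlapping (C, patch, patch) tiles.
    The patch size here is a placeholder; H and W are assumed divisible by it."""
    c, h, w = image.shape
    tiles = (image
             .unfold(1, patch, patch)       # tile along height
             .unfold(2, patch, patch)       # tile along width
             .permute(1, 2, 0, 3, 4)        # (nH, nW, C, patch, patch)
             .reshape(-1, c, patch, patch))
    return tiles                            # one row per micro-level patch

patches = to_micro_patches(torch.randn(3, 2048, 2048), patch=512)  # 16 patches
```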
This dual-level representation provides several key insights and benefits. The macro-level representation enables the model to capture global contextual dependencies and broad spatial relationships, which are essential for understanding large-scale patterns and structural consistency across the image. By downsampling the input, this level also reduces computational complexity, making it feasible to process large remote sensing datasets efficiently. On the other hand, the micro-level preserves fine-grained details, ensuring that the model retains critical high-resolution information such as sharp edges, small objects, and texture variations. The division into non-overlapping patches prevents excessive feature dilution, allowing the model to focus on localized patterns without interference from global-scale transformations.
By integrating both levels, the model benefits from enhanced multi-scale feature extraction, leading to more accurate segmentation with improved boundary precision. This approach ensures a balanced fusion of large-scale spatial awareness and fine-resolution feature retention, making it particularly effective for complex remote sensing applications such as urban planning, environmental monitoring, and land cover classification.
3.3. Self-Attention and Cross-Attention
After the data processing stage, we apply self-attention and cross-attention mechanisms. Our aim is to refine feature representations by capturing long-range dependencies and enhancing the interaction between global and local features. Specifically, the self-attention module extracts robust contextual information from each level, while the cross-attention module facilitates a bidirectional exchange between the macro-level and micro-level features. This integrated approach significantly improves segmentation performance by ensuring that both broad contextual relationships and fine-grained details are effectively fused.
3.3.1. Self-Attention
The macro-level downsampled images and micro-level high-resolution non-overlapping patches are each processed through their respective self-attention mechanisms. The macro-level self-attention enhances broad contextual feature representation, capturing large-scale spatial relationships and structural patterns across the image. Meanwhile, the micro-level self-attention focuses on preserving intricate details, ensuring the retention of fine textures, edges, and small-scale variations critical for precise segmentation.
The self-attention module strengthens feature expression by dynamically capturing global correlations across the input image. For each feature, it computes a weighted representation based on its relationship to all other features, enabling the model to focus on semantically significant regions.
As shown in Figure 2, the self-attention mechanism is applied to convolutional feature maps extracted from the input using different CNNs. The query ($Q$), key ($K$), and value ($V$) matrices are computed from the feature maps via separate convolutions. Next, the attention mechanism is performed by computing the dot product between $Q$ and the transpose of $K$ to generate the attention map. A softmax operation is then applied to normalize the attention weights. The resulting attention map is multiplied by $V$ to obtain the enhanced global feature representation. Finally, the self-attended feature maps are aggregated to refine contextual understanding and improve segmentation accuracy. The specific calculation process is shown in Equation (5).
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d_k}}\right)V$ (5)
where $Q$ is the query matrix, derived from the input feature maps; $K$ is the key matrix, which captures relevant feature representations; $V$ is the value matrix, which contains the feature information to be attended; and $d_k$ represents the dimension of the key vectors, used for scaling to stabilize gradients. The product $QK^{\top}$ computes the similarity matrix, determining the correlation between different spatial locations in the feature space. The softmax function normalizes the similarity scores to produce attention weights, and the weighted sum with $V$ generates the enhanced feature representation.

In semantic segmentation, self-attention enhances feature extraction by capturing long-range dependencies and reinforcing spatial correlations. Unlike CNNs, which rely on local receptive fields, self-attention provides a global receptive field, allowing the model to integrate broader contextual information while retaining fine detail accuracy. Cross-attention further refines interactions between different resolution levels. By integrating global semantics with local precision, cross-attention optimizes segmentation boundaries and structural coherence, leveraging the enriched contextual information from self-attention to enhance segmentation performance.
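For reference, the following sketch shows one way to implement the self-attention of Equation (5) on a convolutional feature map; the 1 × 1 projections, channel-reduction ratio, and residual connection are common design choices assumed here, not details taken from the paper.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Scaled dot-product self-attention over a CNN feature map (Eq. 5).
    The 1x1 projections and reduction ratio are illustrative choices."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // reduction, 1)
        self.k = nn.Conv2d(channels, channels // reduction, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)          # (B, HW, C')
        k = self.k(x).flatten(2)                          # (B, C', HW)
        v = self.v(x).flatten(2).transpose(1, 2)          # (B, HW, C)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out + x                                    # residual connection

feat = torch.randn(2, 256, 32, 32)
refined = SelfAttention2d(256)(feat)
```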
3.3.2. Cross-Attention
Once the macro-level downsampled images and micro-level high-resolution non-overlapping patches are enhanced by their respective self-attention mechanisms, the features undergo cross-attention processing, enabling the exchange of complementary information between the two levels. The refined features are then routed into their respective macro-level and micro-level branches, ensuring a well-balanced fusion of global structural context and fine-grained local details, ultimately enhancing segmentation precision.
The cross-attention mechanism establishes relationships between macro-level and micro-level features, facilitating the fusion of global and local information. The calculation process is illustrated in Figure 3. In the first step, the macro-level features are used as the query matrix (Q), while the micro-level features serve as the key (K) and value (V) matrices, following the standard attention mechanism. The objective is to use macro-level features as an index, allowing for fine-grained details from the micro-level branch to be integrated into the macro-level branch. As a result, the macro-level semantic representation is enriched with high-resolution details, achieving a fusion of macro- and micro-level features. This enables the model to capture both global context and local structures, improving the understanding of the overall image composition.
After completing this step, the refined macro-level features are then used as the new keys (K) and values (V), while the micro-level features act as the new queries (Q). This reciprocal interaction ensures that local details from the micro-level branch also gain a global contextual perspective. Through this iterative exchange, the model effectively establishes a strong connection between local and global semantic details, enhancing segmentation accuracy and overall model performance in high-resolution semantic segmentation tasks.
Let $F_{\mathrm{mac}}$ denote the macro-level features (downsampled image representation) and $F_{\mathrm{mic}}$ denote the micro-level features (non-overlapping patches) after self-attention. The macro-level features act as the query $Q_{\mathrm{mac}}$, while the micro-level features serve as the key $K_{\mathrm{mic}}$ and value $V_{\mathrm{mic}}$. The attention weights are computed as
$A_{1} = \mathrm{softmax}\!\left(\dfrac{Q_{\mathrm{mac}} K_{\mathrm{mic}}^{\top}}{\sqrt{d_k}}\right)$ (6)
The updated macro-level features after integrating micro-level information are
$F_{\mathrm{mac}}' = A_{1} V_{\mathrm{mic}}$ (7)
To ensure a complete bidirectional exchange, the refined macro-level features are then used as keys $K_{\mathrm{mac}}'$ and values $V_{\mathrm{mac}}'$, while the micro-level features act as queries $Q_{\mathrm{mic}}$. The second attention weight matrix is calculated as
$A_{2} = \mathrm{softmax}\!\left(\dfrac{Q_{\mathrm{mic}} (K_{\mathrm{mac}}')^{\top}}{\sqrt{d_k}}\right)$ (8)
The updated micro-level features incorporating global context are
$F_{\mathrm{mic}}' = A_{2} V_{\mathrm{mac}}'$ (9)
After the cross-attention process, the final macro-level and micro-level representations are $F_{\mathrm{mac}}'$ and $F_{\mathrm{mic}}'$. These enhanced features are then passed to their respective macro and micro branches for further processing.
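The bidirectional exchange of Equations (6)–(9) can be sketched compactly with standard multi-head attention layers; the token layout, embedding size, and head count below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DualLevelCrossAttention(nn.Module):
    """Bidirectional cross-attention between macro- and micro-level tokens
    (Eqs. 6-9). Embedding size and head count are illustrative assumptions."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.macro_from_micro = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.micro_from_macro = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_macro, f_micro):
        # Macro-to-micro: macro tokens query fine-grained micro details (Eqs. 6-7).
        macro_upd, _ = self.macro_from_micro(query=f_macro, key=f_micro, value=f_micro)
        # Micro-to-macro: micro tokens query the refined macro context (Eqs. 8-9).
        micro_upd, _ = self.micro_from_macro(query=f_micro, key=macro_upd, value=macro_upd)
        return macro_upd, micro_upd

# Tokens are flattened feature-map locations, e.g. (B, H*W, C).
f_mac = torch.randn(1, 38 * 38, 256)     # macro-level tokens (downsampled image)
f_mic = torch.randn(1, 64 * 64, 256)     # micro-level tokens (one patch)
mac_out, mic_out = DualLevelCrossAttention()(f_mac, f_mic)
```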
The cross-attention mechanism facilitates a bidirectional feature exchange, ensuring a balanced integration of global and local information. In the first phase, macro-to-micro attention enhances global representations by incorporating fine-grained details from the micro-level features. This process allows the model to retain broad structural context while refining the representation with local precision. In the second phase, micro-to-macro attention enables local features to gain a richer semantic understanding by attending to refined macro-level features. This interaction ensures that local details are contextualized within the broader spatial structure, improving overall feature coherence.
By leveraging this hierarchical fusion of features, the cross-attention mechanism significantly enhances semantic segmentation accuracy. This process results in a more comprehensive representation of complex high-resolution images, improving the model’s ability to capture both fine details and global patterns effectively.
3.4. Deep Feature Fusion
After undergoing cross-attention, the model generates a refined representation of high-resolution images, effectively capturing both fine-grained local details and global structural patterns. This is followed by deep feature fusion, where enhanced macro-level (global) and micro-level (local) features are integrated to ensure a cohesive and balanced representation. This fusion process significantly improves segmentation accuracy and structural consistency.
As shown in Figure 4, the deep feature fusion module facilitates bidirectional information sharing between the macro-level and micro-level streams. The macro-level stream extracts large-scale semantic representations from downsampled inputs, capturing broader contextual information, while the micro-level stream processes non-overlapping high-resolution patches, preserving intricate object details. For example, in the segmentation of urban landscapes, the macro-level stream effectively captures the overall structure of roads and buildings, ensuring spatial consistency, while the micro-level stream refines details such as narrow streets, rooftop textures, and small vegetation patches that would otherwise be lost in global-scale processing.
To align and integrate features from both levels, several hierarchical operations are performed. Downsampling adjusts the resolution of macro-level features to align them with micro-level features, ensuring consistency across spatial scales. Upsampling restores fine details while maintaining contextual coherence, allowing the model to recover high-resolution structures. Concatenation merges feature maps from both streams, enhancing multi-scale feature integration and improving segmentation precision.
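The core resample–concatenate–mix step described above can be sketched as follows; the channel counts and the single-level fusion shown here are simplifications, since DLNet applies such operations across multiple backbone levels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseMacroMicro(nn.Module):
    """Illustrative fusion step: resample the macro-level map to the micro-level
    resolution, concatenate along channels, and mix with a 1x1 convolution."""
    def __init__(self, macro_ch: int = 256, micro_ch: int = 256, out_ch: int = 256):
        super().__init__()
        self.mix = nn.Conv2d(macro_ch + micro_ch, out_ch, kernel_size=1)

    def forward(self, f_macro, f_micro):
        # Upsample (or downsample) macro features to match the micro-level grid.
        f_macro = F.interpolate(f_macro, size=f_micro.shape[-2:],
                                mode="bilinear", align_corners=False)
        fused = torch.cat([f_macro, f_micro], dim=1)   # channel-wise concatenation
        return self.mix(fused)

fused = FuseMacroMicro()(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 128, 128))
```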
During training, a multi-scale loss function is applied to balance the learning process between macro- and micro-level representations. The main loss function is computed from the final fused feature representation, ensuring optimal segmentation accuracy. Additionally, auxiliary losses are introduced at intermediate layers to reinforce hierarchical feature extraction and improve model generalization.
To prevent overfitting and ensure robust performance across diverse datasets, regularization techniques are incorporated during training. Feature aggregation further integrates deep feature representations across multiple layers, refining segmentation accuracy and ensuring a more comprehensive feature representation.
The dual-level feature fusion mechanism progressively refines macro- and micro-level features through bidirectional pooling and fusion techniques, effectively capturing both global structures and fine-grained object details. This multi-scale representation enhances segmentation accuracy, ensuring a balanced understanding of high-resolution images. The final segmentation layers process the aggregated features to produce high-precision segmentation masks. DLNet achieves robust, accurate, and scalable segmentation for high-resolution remote sensing imagery through hierarchical feature fusion, structured sharing, and loss-guided optimization.

Robust performance is defined by maintaining high segmentation accuracy and precise boundaries across diverse datasets with varying spatial resolutions and complexities. DLNet demonstrates strong generalization, adapting effectively to datasets like DeepGlobe and Inria. Its dual-level attention mechanism ensures resilience to spatial variability, while hierarchical processing stabilizes performance across different input resolutions. Cross-attention refines boundaries, and optimized memory usage ensures computational efficiency for large-scale applications. This balance of accuracy, adaptability, and efficiency enables DLNet to deliver consistent, high-quality segmentation in real-world scenarios.
4. Evaluation
4.1. Experimental Setup
The experimental evaluation was conducted using an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) to ensure a consistent hardware setup. The feature pyramid network (FPN) [37] and ResNet50 [38] were selected as the backbone architectures for feature extraction. To facilitate efficient multi-scale feature fusion, deep feature sharing was implemented between the conv2 to conv5 blocks of ResNet50 and across the top-down pathway of FPN. Additionally, weakly coupled regularization was introduced in the final stage of the FPN to enhance model generalization and improve segmentation performance.
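A minimal sketch of this backbone wiring using torchvision is shown below; it extracts the conv2–conv5 stage outputs of ResNet50 and passes them through an FPN top-down pathway, while the deep feature sharing and weakly coupled regularization described above are omitted.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

# Pull the conv2-conv5 stage outputs (layer1-layer4 in torchvision's ResNet50)
# and feed them to an FPN top-down pathway.
body = create_feature_extractor(
    resnet50(), return_nodes={"layer1": "c2", "layer2": "c3",
                              "layer3": "c4", "layer4": "c5"})
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048],
                            out_channels=256)

feats = body(torch.randn(1, 3, 512, 512))   # multi-scale backbone features
pyramid = fpn(feats)                        # 256-channel maps at each level
```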
To maintain spatial consistency, input images are downsampled and cropped into fixed-size patches, with a 50-pixel overlap between adjacent patches. This overlap preserves contextual continuity across patch boundaries, preventing segmentation discontinuities in high-resolution imagery.
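A simple sliding-window cropper consistent with this scheme is sketched below; the patch size used here is a placeholder, while the 50-pixel overlap follows the text.

```python
import torch

def crop_with_overlap(image: torch.Tensor, patch: int = 500, overlap: int = 50):
    """Slide a window of size `patch` over a (C, H, W) image with `overlap`
    pixels of overlap; the patch size is a placeholder value. The final row and
    column of windows are clamped to the image border for full coverage."""
    _, h, w = image.shape
    stride = patch - overlap
    tops = sorted(set(list(range(0, max(h - patch, 0) + 1, stride)) + [max(h - patch, 0)]))
    lefts = sorted(set(list(range(0, max(w - patch, 0) + 1, stride)) + [max(w - patch, 0)]))
    return [image[:, t:t + patch, l:l + patch] for t in tops for l in lefts]

tiles = crop_with_overlap(torch.randn(3, 2448, 2448))
```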
The model is trained using focal loss to mitigate class imbalance by assigning higher weights to hard-to-classify samples. A batch size of 6 is used to balance computational efficiency and training stability. Training is conducted for 100 epochs using the Adam optimizer with an initial learning rate of . To ensure a smooth learning rate reduction over time, a cosine annealing learning rate decay strategy is applied, where the minimum learning rate is set to . The scheduler is configured with and , ensuring gradual learning rate decay to facilitate convergence.
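The training configuration can be sketched as follows; the focal-loss gamma and the initial and minimum learning rates are placeholder values, since only the optimizer, batch size, epoch count, and cosine annealing schedule are fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Focal loss down-weights easy pixels; gamma here is a common default,
    not necessarily the value used in the paper."""
    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, target):
        ce = F.cross_entropy(logits, target, reduction="none")
        pt = torch.exp(-ce)                      # probability of the true class
        return ((1.0 - pt) ** self.gamma * ce).mean()

model = nn.Conv2d(3, 7, kernel_size=1)           # stand-in for the full DLNet model
criterion = FocalLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # placeholder initial LR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)          # 100 epochs; placeholder minimum LR

for epoch in range(100):
    # ... iterate over batches of size 6: loss = criterion(model(x), y); backprop ...
    scheduler.step()
```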
The optimal model is selected based on the highest mean intersection over union (mIoU) achieved on the validation set. After each epoch, mIoU is computed, and the model weights corresponding to the highest validation mIoU are saved for final testing. This approach ensures that the best-performing model is used for evaluation while preventing overfitting.
4.2. Datasets
The evaluation is conducted on two widely used high-resolution RGB remote sensing datasets, the specific characteristics of which are summarized in Table 1. These datasets were chosen to rigorously evaluate DLNet’s performance on images with specific spectral and spatial properties.
(1) DeepGlobe Land Cover Classification Dataset [39]: This dataset consists of images with a resolution of 2448 × 2448 pixels, annotated across seven land cover classes: urban, agriculture, rangeland, forest, water, barren, and unknown. It is divided into training (803 images), validation (171 images), and testing (172 images) subsets. The dataset presents challenges in semantic segmentation due to complex land cover variations within satellite imagery.
(2) Inria Aerial Image Labeling Dataset [40]: This dataset provides 360 high-resolution aerial images, each measuring 5000 × 5000 pixels with a spatial resolution of 0.3 m per pixel. These images cover diverse urban settlements across ten cities worldwide, ranging from densely populated metropolitan areas to alpine towns. The dataset is split into training (180 images, 405 km²) and testing (180 images, 405 km²), ensuring no overlap in city coverage between the sets. Ground truth annotations are provided for two semantic classes—buildings and non-buildings—specifically for aerial imagery.
These datasets, with their defined spectral bands (RGB), channel number (3), and spatial resolutions, provide diverse challenges in semantic segmentation. They are ideal for evaluating DLNet’s ability to generalize across different geographic regions and land cover types within the context of high-resolution RGB imagery.
DLNet is specifically designed and evaluated for high-resolution RGB remote sensing images, as detailed in Table 1. While its framework is adaptable to other types of remote sensing data, including multispectral and hyperspectral imagery, with appropriate modifications to input channels and feature extraction mechanisms, this study intentionally focuses on RGB imagery to establish a clear benchmark on well-established datasets. The self-attention and cross-attention mechanisms in DLNet are optimized for capturing multi-scale spatial relationships inherent in high-resolution RGB data, making the model suitable for applications requiring precise boundary delineation and semantic understanding in these settings.
4.3. Baseline Models
To comprehensively evaluate the performance of DLNet, we compare it against a diverse set of state-of-the-art segmentation models, covering convolutional networks, multi-scale feature aggregation models, transformer-based architectures, hybrid attention-based networks, and real-time segmentation models. These baselines ensure a rigorous and fair evaluation across different methodological paradigms.
These baseline models represent different segmentation strategies, including classical convolutional architectures such as U-Net [28], SegNet [13], and FCN-8s [27]. Multi-scale feature aggregation models include DeepLabV3+ [31], PSPNet [41], and DenseASPP [42]. Transformer-based methods encompass MaskFormer [43], Mask2Former [44], TransUNet [45], SegViT [46], HRFormer [47], UperNet [48], FCT-L [49], RS3Mamba [50], and the SAM-Assisted Model (SAM-Asst.) [51]. Global–local feature fusion techniques include GLNet [26], GLE-Net [52], RS-Dseg [53], MetaSegNet [54], CM-UNet [55], RS3Mamba [50], EHSNet [56], MBNet [57], and MagNet [58]. Additionally, real-time segmentation models such as ICNet [59] prioritize inference speed, while lightweight architectures like MBNet [57] focus on computational efficiency. These models collectively provide a diverse benchmark, allowing for a comprehensive evaluation of DLNet’s effectiveness in high-resolution remote sensing segmentation.
4.4. Evaluation Metrics
To comprehensively assess the accuracy and computational efficiency of the proposed approach, we employ multiple evaluation metrics, including mean intersection over union (mIoU), attention intensity distribution ($\bar{A}$), overlap ratio ($O$), boundary F1-score, GPU memory usage, floating-point operations (FLOPs), and inference time.
Mean intersection over union (mIoU): The mIoU is the primary metric used to evaluate segmentation performance. It measures the ratio of the intersection to the union of the predicted segmentation and the ground truth for each class, averaged across all target classes. The mathematical formulation of mIoU is given by
$\mathrm{mIoU} = \dfrac{1}{k}\displaystyle\sum_{i=1}^{k}\dfrac{|P_i \cap G_i|}{|P_i \cup G_i|}$ (10)
where $P_i$ represents the predicted segmentation mask for class i, $G_i$ represents the ground truth segmentation for class i, and $k$ denotes the total number of classes. A higher mIoU value indicates better segmentation performance, with more precise boundary alignment between predictions and ground truth.

Attention Intensity Distribution ($\bar{A}$): To analyze how the model distributes attention across different image regions, we compute the attention intensity distribution. The attention maps generated by the self-attention and cross-attention modules are statistically evaluated. Given an attention weight matrix $A$, where each element $A_{ij}$ represents the attention weight between pixel i and pixel j, the normalized intensity distribution is computed as
$\bar{A} = \dfrac{1}{n^2}\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij}$ (11)
where $n$ denotes the number of spatial positions in the attention map.
This metric provides insights into how effectively the model allocates attention to key semantic regions. A higher $\bar{A}$ value suggests a stronger focus on critical features, enhancing segmentation accuracy.
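Referring back to Equation (10), a minimal sketch of the mIoU computation over predicted and ground-truth label maps is given below; skipping classes absent from both maps is a common convention assumed here.

```python
import torch

def mean_iou(pred: torch.Tensor, gt: torch.Tensor, num_classes: int) -> float:
    """Eq. (10): per-class intersection-over-union between predicted and
    ground-truth label maps, averaged over the classes present in either map."""
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (gt == c)).sum().item()
        union = ((pred == c) | (gt == c)).sum().item()
        if union > 0:                          # ignore classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = torch.randint(0, 7, (2448, 2448))       # e.g. a DeepGlobe-sized prediction
gt = torch.randint(0, 7, (2448, 2448))
print(f"mIoU: {mean_iou(pred, gt, num_classes=7):.3f}")
```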
Overlap Ratio with Semantic Boundaries (O): To quantify how well high-attention areas align with ground-truth segmentation boundaries, we introduce the overlap ratio (O). This metric measures the spatial alignment between the regions with the highest attention scores and the actual object boundaries in the ground-truth segmentation mask. It is defined as
$O = \dfrac{|A_{\mathrm{top}} \cap S|}{|S|}$ (12)
where $A_{\mathrm{top}}$ represents the image regions receiving the top portion of attention weights, and $S$ denotes the ground-truth segmentation mask. A higher $O$ value indicates that the attention mechanism effectively focuses on meaningful semantic regions, improving segmentation precision.

Boundary F1-Score: This metric evaluates the accuracy of segmentation boundaries by comparing the predicted and ground-truth boundaries. It is computed as
$F1_{\mathrm{boundary}} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (13)
where precision measures the proportion of predicted boundary pixels that align with the ground truth, and recall measures the proportion of true boundary pixels correctly predicted. A higher boundary F1-score indicates better alignment between predicted and actual boundaries, ensuring more precise object delineation.

GPU Memory Usage: We measure memory usage during inference to assess the model’s practical memory footprint. Specifically, we report the peak GPU memory consumed during inference for each model.
Floating-Point Operations (FLOPs): FLOPs represent the total number of arithmetic operations performed by a model during inference. Lower FLOPs indicate a more computationally efficient model, while higher FLOPs suggest increased complexity and processing demands. The FLOPs for each model are computed based on convolutional layers, attention mechanisms, and feature fusion modules, providing insight into computational overhead.
Inference Time: Inference time includes the total processing time required for segmentation, encompassing feature extraction, attention mechanisms, and output generation. It is measured on an NVIDIA RTX 3090 GPU and reported in milliseconds per image. This metric is crucial for evaluating the feasibility of DLNet in real-time and large-scale applications.
By integrating mIoU for segmentation accuracy, attention intensity distribution for attention region analysis, overlap ratio for boundary alignment, boundary F1-score for segmentation precision, GPU memory usage for computational efficiency, FLOPs for complexity assessment, and inference time for real-time applicability, we achieve a well-rounded evaluation of DLNet’s performance. These metrics collectively validate the effectiveness of the proposed model in accurately segmenting high-resolution remote sensing imagery while maintaining computational efficiency.
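To make the boundary-oriented metrics concrete, the sketch below gives one plausible implementation of the overlap ratio in Equation (12) and the boundary F1-score in Equation (13); the top-attention fraction, the morphological boundary extraction, and the pixel matching tolerance are all assumptions, since the exact choices are not specified here.

```python
import torch
import torch.nn.functional as F

def boundary_map(mask: torch.Tensor) -> torch.Tensor:
    """Mark pixels whose 3x3 neighbourhood contains more than one label
    (a simple morphological-gradient boundary detector)."""
    m = mask[None, None].float()
    dilated = F.max_pool2d(m, 3, stride=1, padding=1)
    eroded = -F.max_pool2d(-m, 3, stride=1, padding=1)
    return (dilated != eroded).squeeze(0).squeeze(0)

def boundary_f1(pred: torch.Tensor, gt: torch.Tensor, tol: int = 2) -> float:
    """Eq. (13); the pixel tolerance `tol` is an assumed matching radius."""
    pb, gb = boundary_map(pred), boundary_map(gt)
    dilate = lambda b: F.max_pool2d(b[None, None].float(),
                                    2 * tol + 1, stride=1, padding=tol).squeeze().bool()
    precision = (pb & dilate(gb)).sum() / pb.sum().clamp(min=1)
    recall = (gb & dilate(pb)).sum() / gb.sum().clamp(min=1)
    return (2 * precision * recall / (precision + recall).clamp(min=1e-8)).item()

def overlap_ratio(attn: torch.Tensor, gt_mask: torch.Tensor, top: float = 0.1) -> float:
    """Eq. (12): coverage of the ground-truth region S by the highest-attention
    pixels; the top fraction (10%) is a placeholder threshold."""
    hot = attn >= torch.quantile(attn.flatten(), 1.0 - top)
    s = gt_mask.bool()
    return ((hot & s).sum() / s.sum().clamp(min=1)).item()

p, g = torch.randint(0, 7, (512, 512)), torch.randint(0, 7, (512, 512))
print(boundary_f1(p, g), overlap_ratio(torch.rand(512, 512), g == 1))
```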
4.5. Performance on DeepGlobe Dataset
The performance comparison on the DeepGlobe dataset is summarized in Table 2 and Figure 5. Traditional models, including UNet, ICNet, and PSPNet, demonstrated relatively low mean intersection over union (mIoU) scores, all below 60%, indicating limitations in complex land cover segmentation. More advanced architectures, such as SegNet, FCN-8s, and DeepLabv3+, achieved mIoU values between 60% and 71%, showing improved feature extraction but still facing challenges in balancing accuracy and computational efficiency.
Among recent state-of-the-art methods, GLNet achieves an mIoU of 71.6% with a memory footprint of 1224 MB, while EHSNet improves segmentation accuracy to 76.3% at the cost of significantly increased memory usage (2447 MB). In comparison, DLNet achieves the highest mIoU of 76.9%, surpassing all competing models. Additionally, DLNet demonstrates superior computational efficiency with a memory consumption of only 1443 MB.
As shown in Figure 5b, DLNet demonstrates competitive inference speed, especially when compared to models with lower segmentation accuracy, highlighting its efficiency. Figure 5c demonstrates DLNet’s efficient memory management, with memory consumption significantly lower than other high-performing models like MaskFormer and RS3Mamba.
DLNet offers a strong balance between accuracy and computational resources. It requires significantly less memory (1443 MB) than EHSNet (2447 MB) while achieving higher segmentation accuracy. DLNet’s inference time (518.3 ms) is comparable to other high-performing models like FCT-L (453.6 ms) and EHSNet (495.2 ms), while maintaining superior accuracy. Although DLNet has a slightly higher FLOPs count (1632.8 G) compared to EHSNet (1579.6 G), the substantial improvement in accuracy justifies this increased computational cost.
The trade-offs among mIoU, memory consumption, and inference time are illustrated in Figure 6. Figure 6 highlights that models with lower FLOPs, such as UNet and ICNet, struggle with segmentation accuracy, whereas models with higher FLOPs, including DLNet and EHSNet, demonstrate significantly better mIoU values. Figure 6 confirms that DLNet’s combination of self-attention and cross-attention optimally balances segmentation accuracy and computational feasibility.
Overall, DLNet successfully balances segmentation accuracy and computational efficiency. While models like EHSNet consume substantial memory resources, DLNet achieves a higher segmentation accuracy with a significantly lower memory footprint. Furthermore, DLNet exhibits improved inference speed, making it a scalable and practical solution for large-scale remote sensing applications. The results confirm that DLNet is a robust and scalable segmentation model, offering an optimal balance between accuracy, memory efficiency, and inference speed. Its superior segmentation performance, combined with its computational efficiency, establishes DLNet as a highly effective solution for remote sensing image analysis.
4.6. Performance on the Inria Aerial Dataset
The experimental results on the Inria Aerial dataset are presented in Table 3 and Figure 7. This dataset poses unique challenges due to its high-resolution urban imagery, requiring models to effectively distinguish between buildings and non-building regions. Among the tested models, ICNet achieved an mIoU of 31.1%, indicating poor performance in high-resolution aerial image segmentation. However, it maintained the lowest memory usage at 2379 MB, making it a viable option for applications where computational efficiency is prioritized over accuracy. In contrast, DeepLabv3+ and FCN-8s exhibited improved segmentation accuracy, with mIoU scores of 55.9% and 61.6%, respectively. However, these improvements came at a significant computational cost, as DeepLabv3+ consumed 4334 MB and FCN-8s reached 4253 MB, making them less suitable for memory-constrained environments.
GLNet achieved an mIoU of 71.2% while maintaining a memory consumption of 2663 MB, demonstrating a strong balance between segmentation accuracy and computational cost. However, DLNet surpassed all other models, achieving the highest mIoU of 73.6% while requiring only 2401 MB of memory, which is lower than GLNet’s memory consumption. When analyzing the trade-offs among segmentation accuracy, computational efficiency, and inference speed, DLNet proved to be a more efficient alternative to models like DeepLabv3+ and FCN-8s, as it reduced computational cost while achieving higher segmentation accuracy. Compared to DeepLabv3+, which required 425.3 G FLOPs and had an inference time of 128.7 ms, and FCN-8s, which required 498.2 G FLOPs and had an inference time of 142.3 ms, DLNet was computationally more efficient, requiring only 378.6 G FLOPs and achieving a faster inference time of 119.4 ms. These results demonstrate that DLNet is capable of delivering high-precision segmentation while maintaining a lower computational burden, making it ideal for real-world aerial segmentation tasks.
The trade-offs among mIoU, memory consumption, and inference time are illustrated in Figure 8. Figure 8b highlights that DLNet maintained a favorable balance between inference speed and segmentation accuracy, outperforming models that required significantly higher computational resources. Additionally, Figure 8a confirms that DLNet’s self-attention and cross-attention mechanisms optimized feature extraction while keeping memory consumption manageable.
Unlike other high-accuracy models, such as FCN-8s and DeepLabv3+, which required significantly more memory, DLNet achieved the best balance between accuracy, computational efficiency, and inference speed. These findings confirm that DLNet is a scalable and practical solution for aerial image analysis, making it an optimal choice for large-scale remote sensing applications that require high accuracy, low memory consumption, and computational efficiency.
4.7. Impact of Attention on Performance Study
To assess the impact of self-attention and cross-attention mechanisms, we conducted ablation experiments on the DeepGlobe dataset, comparing DLNet with GLNet. The results, presented in Table 4, highlight the role of attention mechanisms in enhancing segmentation accuracy, refining boundary delineation, and optimizing memory usage.
The baseline GLNet model, without attention modules, achieved an mIoU of 71.6% with a memory footprint of 1224 MB and a FLOPs count of 1187.6 G. In contrast, DLNet, incorporating self-attention and cross-attention, demonstrated superior segmentation performance. The full DLNet model achieved an mIoU of 76.9% while maintaining a relatively low memory footprint of 1443 MB, with a FLOPs count of 1632.8 G.
Impact of Self-Attention: Integrating self-attention in DLNet resulted in a notable performance improvement. The mIoU increased from 71.6% to 73.5%, and the boundary F1-score improved from 0.50 to 0.72. Additionally, the attention intensity distribution increased from 0.03 to 0.15, signifying that self-attention enables the model to focus more effectively on critical semantic regions. However, this enhancement came with an increase in computational complexity, as the FLOPs count rose to 1348.2 G and the inference time increased to 458.1 ms.
Impact of Cross-Attention: Cross-attention further boosted segmentation accuracy, improving the mIoU from 71.6% to 74.1%. The overlap ratio increased from 0.38 to 0.70, demonstrating the effectiveness of cross-attention in aligning high-attention areas with object boundaries. The boundary F1-score also improved from 0.50 to 0.75, reinforcing the role of cross-attention in refining segmentation boundaries. However, the computational cost increased, with FLOPs reaching 1376.9 G and inference time rising to 484.6 ms.
Impact of Combination of Self-attention and Cross-Attention: The combination of both attention mechanisms yielded the highest performance gains. The full DLNet model attained an mIoU of 76.9% and the best boundary F1-score of 0.80. The attention intensity distribution reached 0.28, highlighting the model’s enhanced ability to allocate attention to semantically important regions. While the FLOPs increased to 1632.8 G and inference time reached 518.3 ms, the segmentation improvements justify the additional computational cost.
Despite the increased computational demands, DLNet demonstrated superior scalability and efficiency in high-resolution remote sensing segmentation. The model consistently outperformed GLNet in segmentation accuracy, boundary delineation, and contextual feature extraction. Self-attention effectively refined fine-scale structures, while cross-attention provided a global contextual understanding, allowing for better differentiation of land cover types.
Although incorporating attention mechanisms increases computational complexity, the improvements in segmentation quality and boundary precision make the trade-off worthwhile. DLNet maintains a balanced approach by achieving state-of-the-art performance with moderate memory usage. These findings validate that attention-based feature fusion significantly enhances the segmentation of high-resolution remote sensing images while remaining computationally feasible for large-scale applications.
4.8. Qualitative Analysis of Segmentation Results
Figure 9 presents the visual comparison of segmentation outputs from different models, including GLNet, DLNet without self-attention, DLNet without cross-attention, and the full DLNet model. The results highlight key differences in segmentation accuracy, boundary precision, and contextual consistency across the models, demonstrating the impact of attention mechanisms on segmentation performance.
GLNet exhibits over-segmentation, leading to excessive fragmentation of objects within the scene. The model struggles to maintain coherent region boundaries, resulting in visually inconsistent segmentations. The absence of attention mechanisms limits its ability to refine fine-grained details, particularly in complex land cover regions. This behavior results in a segmentation output that lacks structural integrity and exhibits excessive misclassification.
DLNet without self-attention improves upon GLNet by integrating cross-attention for global context modeling. However, the absence of self-attention leads to blurred boundaries, reducing the sharpness of object delineation. While the segmentation captures overall scene structure better than GLNet, it fails to effectively preserve fine-grained details, making the boundaries between different land cover types less distinct. This results in a loss of local precision, particularly in areas with subtle transitions between classes.
DLNet without cross-attention focuses on local feature refinement through self-attention but lacks the ability to capture long-range contextual dependencies. This limitation causes poor fine-grained segmentations, where isolated structures appear less connected to the overall scene. While boundary sharpness is improved compared to DLNet without self-attention, the segmentation output exhibits inconsistencies in region classification, leading to disjointed segmentations in more complex scenes.
The full DLNet model, incorporating both self-attention and cross-attention mechanisms, delivers the most accurate segmentation results. By balancing local feature refinement with global context modeling, it achieves a well-structured segmentation output with clear object delineation and reduced boundary errors. The integration of both attention mechanisms enables the model to maintain fine-grained segmentations while effectively capturing contextual dependencies across the entire image. Compared to the other models, the full DLNet model produces the most visually coherent and semantically accurate segmentation results, demonstrating superior performance in high-resolution remote sensing image analysis.
4.9. Comparative Analysis of Baseline Models
To comprehensively evaluate DLNet, we compare it against a diverse set of state-of-the-art segmentation models, encompassing convolution-based architectures, multi-scale feature aggregation methods, and transformer-based attention mechanisms. Each model has distinct design choices that influence its performance in remote sensing image segmentation.
Convolution-Based Segmentation Models: Encoder–decoder networks such as U-Net [28] and SegNet [13] pass fine-scale information from encoder to decoder (feature skip connections in U-Net, pooling indices in SegNet) to preserve spatial detail. While effective in medical imaging, their limited receptive fields hinder their ability to capture large-scale contextual relationships in high-resolution remote sensing imagery, which translates into lower mIoU values on heterogeneous landscapes. Fully convolutional networks such as FCN-8s [27] improve feature reconstruction by upsampling coarse predictions and fusing them with finer feature maps through skip connections. However, their lack of explicit attention mechanisms reduces their ability to distinguish between similar land cover types, leading to lower segmentation boundary precision compared to DLNet.
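As an illustration of the encoder–decoder pattern discussed above, the following minimal U-Net-style sketch (PyTorch) shows how a decoder stage concatenates upsampled coarse features with the matching high-resolution encoder features before predicting per-pixel logits. The depths and channel widths are arbitrary choices for readability and do not correspond to any specific published configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convolutions with ReLU, the basic encoder-decoder building block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """A two-level encoder-decoder with one skip connection."""
    def __init__(self, in_ch: int = 3, num_classes: int = 7):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.down = nn.MaxPool2d(2)
        self.enc2 = conv_block(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = conv_block(64, 32)        # 32 (upsampled) + 32 (skip)
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s1 = self.enc1(x)                     # high-resolution encoder features
        bottom = self.enc2(self.down(s1))     # coarse, semantically richer features
        up = self.up(bottom)
        # Skip connection: reuse the high-resolution features to recover detail.
        fused = torch.cat([up, s1], dim=1)
        return self.head(self.dec1(fused))    # per-pixel class logits

logits = TinyUNet()(torch.randn(1, 3, 128, 128))
print(logits.shape)  # torch.Size([1, 7, 128, 128])
```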
Multi-Scale Feature Aggregation Models: PSPNet [41] and DeepLabV3+ [31] employ pyramid pooling and atrous spatial pyramid pooling (ASPP), respectively, to capture multi-scale context, which substantially strengthens feature extraction for complex landscapes. However, their reliance on a fixed set of pooling scales and dilation rates can discard fine-grained details, resulting in suboptimal boundary delineation. Densely connected variants such as DenseASPP [42] improve information flow by densely connecting branches with different dilation rates, yet they still struggle to capture long-range dependencies, which DLNet’s self-attention and cross-attention mechanisms address directly.
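The ASPP idea can be sketched in a few lines: parallel dilated 3 × 3 convolutions with increasing rates observe progressively larger contexts at the same spatial resolution, and their outputs are concatenated and projected. The simplified module below assumes PyTorch and deliberately omits the image-level pooling branch and batch normalization used in DeepLabV3+.

```python
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates (plus a 1x1 branch).

    Each branch keeps the spatial size but has a different receptive field, so the
    concatenated output mixes fine detail with progressively wider context.
    """
    def __init__(self, in_ch: int, out_ch: int, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1)]
            + [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
               for r in rates]
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

# Example: aggregate multi-scale context over a 256-channel backbone feature map.
ctx = SimpleASPP(256, 64)(torch.randn(1, 256, 32, 32))
print(ctx.shape)  # torch.Size([1, 64, 32, 32])
```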
Attention-Based and Transformer Models: Recent advancements in segmentation have introduced transformer-based architectures, such as MaskFormer [43] and Mask2Former [44], which frame segmentation as a mask classification problem. While these models improve object-level understanding, they are less effective for pixel-wise land cover classification, where spatially distributed structures dominate. Hybrid models like GLNet [26] and EHSNet [56] incorporate global–local feature fusion, balancing coarse global context with fine-grained details. While these methods improve segmentation boundary accuracy, their reliance on multiple processing branches increases memory consumption, making them computationally expensive compared to DLNet.
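To visualize the global–local design pattern behind such hybrids, the sketch below shows the typical data flow for one full-resolution crop: the whole image is downsampled for a cheap global branch, the crop is processed at full resolution by a local branch, and the globally derived features covering the crop are resized and concatenated with the local features. The shared backbone, the fixed 256 × 256 global size, and fusion by concatenation are simplifying assumptions for illustration; this is not the actual GLNet or EHSNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalFusion(nn.Module):
    """Schematic global-local pipeline for one full-resolution crop."""
    def __init__(self, channels: int = 32, num_classes: int = 7):
        super().__init__()
        # A tiny shared backbone stands in for the real feature extractor.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(2 * channels, num_classes, kernel_size=1)

    def forward(self, image: torch.Tensor, crop_box: tuple) -> torch.Tensor:
        y0, x0, size = crop_box
        # Global branch: coarse view of the whole image (cheap, low memory).
        global_view = F.interpolate(image, size=(256, 256),
                                    mode="bilinear", align_corners=False)
        global_feat = self.backbone(global_view)
        # Local branch: full-resolution crop (detailed, but sees only a window).
        local_feat = self.backbone(image[:, :, y0:y0 + size, x0:x0 + size])
        # Take the part of the global features that covers the crop and
        # resize it to the local feature resolution before fusing.
        scale = 256 / image.shape[-1]
        gy0, gx0 = int(y0 * scale), int(x0 * scale)
        gsz = max(1, int(size * scale))
        global_crop = F.interpolate(
            global_feat[:, :, gy0:gy0 + gsz, gx0:gx0 + gsz],
            size=local_feat.shape[-2:], mode="bilinear", align_corners=False)
        return self.head(torch.cat([local_feat, global_crop], dim=1))

# Example: segment one 512x512 crop from a 1024x1024 image.
out = GlobalLocalFusion()(torch.randn(1, 3, 1024, 1024), crop_box=(0, 0, 512))
print(out.shape)  # torch.Size([1, 7, 512, 512])
```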
Recent transformer-based architectures, such as FCT-L [49], leverage Fourier-based tokenization to improve feature representation and reduce computational overhead. While these approaches show promise, they require extensive pretraining on large-scale datasets to generalize effectively. In contrast, DLNet integrates self-attention and cross-attention mechanisms while maintaining computational efficiency, making it a more scalable solution for high-resolution remote sensing applications. The ability to balance segmentation accuracy, memory consumption, and inference time makes DLNet a highly efficient model compared to both traditional convolution-based networks and transformer-based approaches.
4.10. DLNet’s Innovations and Performance Evaluation
DLNet introduces key innovations that provide advantages over both traditional convolution-based and transformer-based models. Unlike single-scale fusion models, DLNet employs hierarchical feature fusion, iteratively refining macro- and micro-level features using dual-level attention mechanisms to ensure a balanced feature representation across multiple scales. The cross-attention mechanism effectively aligns feature maps with real-world object boundaries, reducing misclassification between adjacent land cover types and improving segmentation precision. While attention-based models often introduce excessive computational overhead, DLNet optimizes memory usage while maintaining higher segmentation accuracy than competing models.
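A minimal sketch of the kind of macro–micro interaction described above is given below: positions of the micro-level (local) feature map act as queries that attend over the macro-level (global) feature map, so each local position is augmented with a context-aware mixture of global features; applying the same block in the opposite direction would make the exchange bidirectional. The single-head formulation and projection widths are assumptions made for clarity and are not DLNet’s exact cross-attention module.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Local (micro) positions query the global (macro) feature map."""
    def __init__(self, local_ch: int, global_ch: int, dim: int = 64):
        super().__init__()
        self.to_q = nn.Conv2d(local_ch, dim, kernel_size=1)
        self.to_k = nn.Conv2d(global_ch, dim, kernel_size=1)
        self.to_v = nn.Conv2d(global_ch, local_ch, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = local_feat.shape
        q = self.to_q(local_feat).flatten(2).transpose(1, 2)    # (B, HW_local, dim)
        k = self.to_k(global_feat).flatten(2)                    # (B, dim, HW_global)
        v = self.to_v(global_feat).flatten(2).transpose(1, 2)    # (B, HW_global, C_local)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        # Residual fusion: local detail plus globally aligned context.
        return local_feat + self.gamma * ctx

# Example: fuse a 64-channel local map with a 128-channel downsampled global map.
fused = CrossAttentionFusion(64, 128)(torch.randn(1, 64, 32, 32),
                                      torch.randn(1, 128, 16, 16))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```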
The evaluation of DLNet across multiple remote sensing datasets confirms its effectiveness in high-resolution segmentation. DLNet consistently outperforms state-of-the-art models by achieving high segmentation accuracy while maintaining efficient memory usage, demonstrating a strong balance between precision and computational efficiency. The integration of self-attention and cross-attention enhances feature extraction, improves boundary precision, and refines segmentation quality, effectively capturing both global and local spatial features.
The ablation study highlights the critical role of attention mechanisms in improving segmentation accuracy and structural consistency. While these mechanisms introduce additional computational costs, the trade-off is justified by the significant improvements in feature integration and segmentation precision. Overall, DLNet offers a practical and robust solution for remote sensing segmentation, combining high accuracy with computational efficiency. Its scalability and adaptability make it well-suited for real-world applications requiring precise land cover mapping and object delineation.
5. Limitations and Future Work
5.1. Limitations
Despite these advantages, the DLNet framework has certain limitations:
1. Computational Overhead: The inclusion of self-attention and cross-attention mechanisms increases the overall computational complexity, particularly during inference. This may pose challenges for large-scale deployment, especially in real-time applications or edge computing environments.
2. Generalization Across Diverse Datasets: Although DLNet achieves high accuracy on the DeepGlobe and Inria Aerial datasets, its generalization capability across different remote sensing datasets remains untested. Variations in spectral resolution, imaging conditions, and geographic regions may affect model performance.
3. Scope of Dataset Evaluation: The evaluation is deliberately restricted to high-resolution RGB remote sensing imagery, with the spectral properties, channel composition, and spatial resolution of the datasets explicitly defined. This keeps the assessment within the model’s intended application domain, but it also means that the reported results should not be read as evidence that DLNet applies universally to all types of remote sensing data, such as multispectral or hyperspectral imagery.
5.2. Future Work
To further enhance DLNet’s capabilities and address its current limitations, we propose the following future research directions:
1. Improving Computational Efficiency: Future work will focus on reducing computational complexity by exploring lightweight attention mechanisms and model compression techniques such as pruning, quantization, and knowledge distillation. These optimizations aim to cut memory consumption and inference time while maintaining high segmentation accuracy, making DLNet practical for real-time and large-scale applications.
2. Expanding Dataset Generalization: DLNet will be evaluated on additional remote sensing datasets with diverse spatial resolutions, imaging modalities, and geographic distributions. This expansion will help assess its ability to generalize across different environments, including urban, agricultural, and natural landscapes. We will also explore domain adaptation and transfer learning to enhance generalization across different datasets with limited labeled samples.
3. Integration with Real-World Applications: DLNet will be adapted to support domain-specific applications such as urban planning, vegetation monitoring, and disaster impact assessment. Collaborations with remote sensing practitioners and geospatial experts will be established to align the model’s outputs with operational requirements, ensuring that the segmentation results provide meaningful insights for real-world decision-making.
4. Hybrid Models with Knowledge Incorporation: Future research will explore hybrid deep learning approaches that integrate domain-specific knowledge, rule-based geospatial information, and traditional remote sensing techniques. This integration aims to enhance segmentation accuracy, particularly in cases where annotated training data are scarce or where additional contextual information can improve decision-making.
5. Extending Applicability to Multispectral and Hyperspectral Data: While DLNet has been optimized for high-resolution RGB remote sensing imagery, future work will extend its applicability to multispectral and hyperspectral datasets. This adaptation will involve architectural modifications to support additional spectral bands, along with specialized feature extraction techniques tailored for different sensing modalities. These enhancements will enable DLNet to leverage richer spectral information, improving feature representation and segmentation accuracy in broader remote sensing applications.
By addressing these research directions, DLNet aims to become a scalable, efficient, and high-performing segmentation model for diverse remote sensing applications.
6. Conclusions
This paper presents DLNet, an enhanced global and local network (GLNet)-based framework designed to improve high-resolution remote sensing image segmentation. By integrating self-attention and cross-attention mechanisms, DLNet effectively enhances feature representation and fusion, addressing challenges such as complex spatial structures, fine-grained details, and land cover variations.
The proposed framework leverages self-attention to capture long-range dependencies for improved global contextual understanding, while cross-attention facilitates bidirectional interaction between macro-level and micro-level features. This dual-level attention mechanism refines feature fusion, enabling more precise segmentation with improved spatial coherence and semantic consistency.
Extensive experiments on benchmark datasets validate DLNet’s superior performance. On the DeepGlobe dataset, DLNet achieves a 76.9% mean intersection over union (mIoU), outperforming existing models such as GLNet (71.6%) and EHSNet (76.3%), while maintaining a lower memory footprint of 1443 MB. Additionally, DLNet demonstrates an inference time of 518.3 ms, balancing segmentation accuracy with computational efficiency. Compared to models like EHSNet (495.2 ms) and FCT-L (453.6 ms), DLNet achieves a higher mIoU with only a moderate increase in FLOPs (1632.8 G FLOPs vs. 1579.6 G FLOPs in EHSNet). On the Inria Aerial dataset, DLNet attains an mIoU of 73.6%, surpassing GLNet (71.2%) while also reducing computational cost, demonstrating its adaptability across different remote sensing imagery.
The trade-off between computational complexity and segmentation accuracy is a key consideration in DLNet’s design. While attention mechanisms enhance feature extraction, they introduce additional computational overhead. DLNet manages this trade-off by maintaining a lower memory footprint than transformer-based models while achieving competitive segmentation accuracy. The ablation studies confirm that integrating the self-attention and cross-attention modules raises mIoU on DeepGlobe from 71.6% to 76.9% relative to the baseline GLNet configuration, together with a marked improvement in boundary quality.
Despite these advancements, DLNet has certain limitations. The inclusion of attention mechanisms increases computational overhead, making real-time applications more challenging. Although DLNet exhibits competitive efficiency, further optimization is required to balance segmentation accuracy, inference speed, and computational cost, particularly for large-scale remote sensing applications. Additionally, while the model generalizes well across the tested datasets, further evaluation is needed on diverse remote sensing datasets with different spectral resolutions and geographic variations.
Future research will focus on optimizing computational efficiency through lightweight attention mechanisms, model compression techniques, and hardware-aware optimizations to improve real-time applicability. Expanding DLNet’s applicability to broader datasets and real-world applications, such as urban planning, environmental monitoring, and disaster response, will also be a priority. Additionally, integrating DLNet with multispectral and hyperspectral data will further enhance its adaptability across different remote sensing modalities.
In conclusion, DLNet advances the state of the art in remote sensing segmentation by offering a scalable, efficient, and interpretable solution. Its ability to balance segmentation accuracy, inference speed, and computational efficiency makes it a promising framework for high-resolution remote sensing image analysis, contributing to practical applications in environmental monitoring, land use planning, and disaster management.
Author Contributions: Methodology, W.M. and L.S.; Software, W.M.; Validation, S.M.; Formal analysis, L.S.; Resources, S.M.; Writing—original draft, W.M.; Writing—review & editing, B.H.; Visualization, D.L.; Supervision, B.H. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Conflicts of Interest: The authors declare no conflicts of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 2. Illustration of the self-attention and cross-attention mechanisms in the proposed dual-level network (DLNet).
Figure 3. Illustration of feature fusion and aggregation process in the proposed dual-level network (DLNet).
Figure 4. Illustration of cross-attention mechanism in the proposed dual-level network (DLNet).
Figure 5. Comparison of performance across different models on the DeepGlobe dataset.
Figure 6. Comparison of different models on the DeepGlobe dataset. (The larger red dotted box represents a zoomed-in view of the overlapping region shown in the smaller red dotted box. Different colors represent various baseline models. Red stars indicate our approach).
Figure 7. Comparison of performance across different models on the Inria Aerial dataset.
Figure 8. Comparison of different models on the Inria Aerial dataset. (The larger red dotted box represents a zoomed-in view of the overlapping region shown in the smaller red dotted box. Different colors represent various baseline models. Red stars indicate our approach).
Summary of remote sensing datasets used in this study.
| Property | DeepGlobe Dataset | Inria Aerial Dataset |
|---|---|---|
| Type of Imagery | High-resolution satellite imagery | High-resolution aerial imagery |
| Spectral Bands | RGB (3 bands) | RGB (3 bands) |
| Number of Channels | 3 (red, green, blue) | 3 (red, green, blue) |
| Spatial Resolution | 0.5 m per pixel | 0.3 m per pixel |
| Image Size | 2448 × 2448 pixels | 5000 × 5000 pixels |
| Land Cover Categories/Classes | Urban, agriculture, rangeland, forest, water, barren, unknown | Buildings, non-buildings |
Comparison of performance on the DeepGlobe dataset.
| Method | Patch FLOPs (G) | Patch Inference (ms) | Patch Memory (MB) | Patch mIoU (%) | Global FLOPs (G) | Global Inference (ms) | Global Memory (MB) | Global mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| UNet | 89.4 | 28.5 | 949 | 37.3 | 358.8 | 97.2 | 507 | 38.4 |
| ICNet | 54.3 | 19.8 | 1195 | 35.5 | 217.2 | 75.6 | 2557 | 40.2 |
| PSPNet | 210.7 | 62.1 | 1153 | 53.3 | 842.8 | 248.3 | 6289 | 56.6 |
| SegNet | 170.1 | 49.2 | 1139 | 60.8 | 680.4 | 186.8 | 10,339 | 61.2 |
| DeepLabv3+ | 195.2 | 57.8 | 1279 | 63.1 | 780.8 | 223.5 | 3119 | 63.5 |
| FCN-8s | 225.5 | 68.9 | 1963 | 63.3 | 902.0 | 267.5 | 5227 | 70.1 |
| DenseASPP | 298.4 | 85.6 | 2346 | 64.7 | 1193.6 | 348.9 | 9740 | 71.0 |
| MaskFormer | 325.8 | 92.4 | 3088 | 60.3 | 1303.2 | 384.7 | 12,089 | 67.2 |
| Mask2Former | 340.6 | 97.1 | 3894 | 57.8 | 1362.4 | 405.5 | 13,245 | 67.0 |
| TransUnet | 270.5 | 78.3 | 2589 | 74.1 | 1082.0 | 310.2 | 8942 | 75.3 |
| SegViT | 312.2 | 89.6 | 3210 | 73.5 | 1250.8 | 370.1 | 10,112 | 74.9 |
| HRFormer | 285.7 | 83.2 | 2950 | 74.7 | 1142.8 | 340.7 | 9528 | 75.1 |
| UperNet | 295.1 | 86.4 | 3074 | 73.8 | 1180.4 | 355.2 | 9942 | 74.6 |
| CM-UNet | 280.6 | 80.5 | 2742 | 75.5 | 1122.4 | 328.5 | 9126 | 76.0 |
| GLE-Net | 290.3 | 84.2 | 2890 | 75.8 | 1160.8 | 335.6 | 9432 | 76.2 |
| RS-Dseg | 320.4 | 92.8 | 3320 | 75.2 | 1282.0 | 375.2 | 10,389 | 75.6 |
| MetaSegNet | 305.9 | 87.6 | 3189 | 74.5 | 1225.6 | 355.8 | 10,002 | 75.0 |
| RS3Mamba | 335.6 | 96.1 | 3450 | 76.3 | 1342.4 | 392.3 | 10,624 | 76.5 |
| SAM-Asst. | 315.8 | 90.5 | 3278 | 75.7 | 1265.8 | 368.1 | 10,215 | 75.9 |

| Method | FLOPs (G) | Inference Time (ms) | Memory (MB) | mIoU (%) |
|---|---|---|---|---|
| GLNet | 1187.6 | 367.2 | 1224 | 71.6 |
| MBNet | 1028.4 | 317.5 | 1549 | 72.6 |
| FCT-L | 1452.3 | 453.6 | 4332 | 72.8 |
| MagNet | 1105.8 | 340.9 | 1559 | 73.0 |
| EHSNet | 1579.6 | 495.2 | 2447 | 76.3 |
| DLNet (Ours) | 1632.8 | 518.3 | 1443 | 76.9 |
Comparison of performance on the Inria Aerial dataset.
Model | FLOPs (G) | Inference Time (ms) | Memory (MB) | mIoU (%) |
---|---|---|---|---|
ICNet | 99.5 | 38.2 | 2379 | 31.1 |
DeepLabv3+ | 425.3 | 128.7 | 4334 | 55.9 |
FCN-8s | 498.2 | 142.3 | 4253 | 61.6 |
GLNet | 456.8 | 135.1 | 2663 | 71.2 |
GLE-Net | 491.2 | 138.6 | 2785 | 72.5 |
RS-Dseg | 505.3 | 142.8 | 2900 | 73.1 |
MetaSegNet | 478.5 | 136.4 | 2810 | 72.8 |
CM-UNet | 462.9 | 132.5 | 2655 | 73.3 |
RS3Mamba | 485.1 | 134.9 | 2720 | 73.4 |
SAM-Asst. | 498.7 | 139.2 | 2790 | 73.0 |
DLNet (Ours) | 378.6 | 119.4 | 2401 | 73.6 |
Ablation study of self-attention and cross-attention on the DeepGlobe dataset.
| Model | Self-Attn. | Cross-Attn. |  |  | mIoU (%) | Boundary F1-Score | Memory (MB) | FLOPs (G) | Inference Time (ms) |
|---|---|---|---|---|---|---|---|---|---|
| GLNet | – | – | 0.03 | 0.38 | 71.6 | 0.50 | 1224 | 1187.6 | 367.2 |
| DLNet (Ours) | – | ✓ | 0.15 | 0.65 | 73.5 | 0.72 | 1341 | 1348.2 | 458.1 |
| DLNet (Ours) | ✓ | – | 0.20 | 0.70 | 74.1 | 0.75 | 1383 | 1376.9 | 484.6 |
| DLNet (Ours) | ✓ | ✓ | 0.28 | 0.82 | 76.9 | 0.80 | 1443 | 1632.8 | 518.3 |
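For completeness, the snippet below sketches how the two headline metrics in the ablation table can be computed from a predicted and a reference label map: mIoU from per-class intersection over union, and a boundary F1-score that counts boundary pixels matched within a small pixel tolerance. The two-pixel tolerance and the SciPy-based morphology are our assumptions; the evaluation protocol actually used in the experiments may differ in detail.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """mIoU over integer label maps, ignoring classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

def boundary_f1(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """F1 over boundary pixels, with matches allowed within `tol` pixels."""
    def boundary(labels: np.ndarray) -> np.ndarray:
        # A pixel is a boundary pixel if its label differs from a neighbour's.
        b = np.zeros_like(labels, dtype=bool)
        b[:-1, :] |= labels[:-1, :] != labels[1:, :]
        b[:, :-1] |= labels[:, :-1] != labels[:, 1:]
        return b

    pb, gb = boundary(pred), boundary(gt)
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    precision = (pb & binary_dilation(gb, struct)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, struct)).sum() / max(gb.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

# Toy example on a two-class 8x8 map: prediction shifted one pixel to the right.
gt = np.zeros((8, 8), dtype=int); gt[2:6, 2:6] = 1
pred = np.zeros_like(gt); pred[2:6, 3:7] = 1
print(mean_iou(pred, gt, num_classes=2), boundary_f1(pred, gt))
```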
References
1. Haack, B.N. An Assessment of Landsat MSS and TM Data for Urban and Near-Urban Land Cover Digital Classification. Remote Sens. Environ.; 1987; 21, pp. 201-213.
2. Haack, B.N.; English, R. National Land Cover Mapping by Remote Sensing. World Dev.; 1996; 24, pp. 845-855.
3. Cohen, W.B.; Goward, S.N. Landsat’s Role in Ecological Applications of Remote Sensing. BioScience; 2004; 54, pp. 535-545. [DOI: https://dx.doi.org/10.1641/0006-3568(2004)054[0535:LRIEAO]2.0.CO;2]
4. Weng, Q. Thermal Infrared Remote Sensing for Urban Climate and Environmental Studies: Methods, Applications, and Trends. ISPRS J. Photogramm. Remote Sens.; 2009; 64, pp. 335-344. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2009.03.007]
5. Weng, Q. Remote Sensing of Impervious Surfaces in the Urban Areas: Requirements, Methods, and Trends. Remote Sens. Environ.; 2012; 117, pp. 34-49. [DOI: https://dx.doi.org/10.1016/j.rse.2011.02.030]
6. Haack, B.N.; David Craven, S.J.; Solomon, E. Urban Growth in Kathmandu, Nepal: Mapping, Analysis, and Prediction. Remote Sens. Environ.; 2002; 80, pp. 337-348.
7. Weng, Q. Techniques and Methods in Urban Remote Sensing; Wiley-IEEE Press: Hoboken, NJ, USA, 2019.
8. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L. A Review of Remote Sensing Image Segmentation by Deep Learning Techniques. Int. J. Digit. Earth; 2024; 17, 2328827. [DOI: https://dx.doi.org/10.1080/17538947.2024.2328827]
9. Sun, L.; Zou, H.; Wei, J.; Cao, X.; He, S.; Li, M.; Liu, S. Semantic Segmentation of High-Resolution Remote Sensing Images Using Deep Learning. Remote Sens.; 2023; 15, 1598. [DOI: https://dx.doi.org/10.3390/rs15061598]
10. Weng, Q.; Peng, F.; Gao, F. Generating Daily Land Surface Temperature at Landsat Resolution by Fusing Landsat and MODIS Data. Remote Sens. Environ.; 2014; 145, pp. 55-67. [DOI: https://dx.doi.org/10.1016/j.rse.2014.02.003]
11. Ma, X.; Lian, R.; Wu, Z.; Guo, H.; Ma, M.; Wu, S.; Du, Z.; Song, S.; Zhang, W. LOGCAN++: Adaptive Local-Global Class-Aware Network for Semantic Segmentation of Remote Sensing Imagery. arXiv; 2024; arXiv: 2406.16502
12. Saha, S.; Mou, L.; Shahzad, M.; Zhu, X.X. Segmentation of VHR EO Images Using Unsupervised Learning. arXiv; 2021; arXiv: 2108.04222
13. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv; 2015; arXiv: 1511.00561[DOI: https://dx.doi.org/10.1109/TPAMI.2016.2644615] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28060704]
14. Ronneberger, O.; Fischer, P.; Brox, T. U-Net. Papers with Code. 2015; Available online: https://paperswithcode.com/paper/u-net-convolutional-networks-for-biomedical (accessed on 18 May 2015).
15. Ramesh, K.K.D.; Kumar, G.K.; Swapna, K.; Datta, D.; Rajest, S.S. A Review of Medical Image Segmentation Models. Eai Endorsed Trans. Pervasive Health Technol.; 2022; 7, 27.
16. El-Sayed, M.A.; El-Sharkawy, M.A.; El-Gendy, M.A. An Introductory Survey on Attention Mechanisms in Computer Vision. Proceedings of the IEEE Transactions on Neural Networks and Learning Systems; Shenzhen, China, 4–6 December 2020.
17. Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention Mechanisms in Computer Vision: A Survey. arXiv; 2021; arXiv: 2111.07624
18. El-Sayed, M.A.; El-Sharkawy, M.A.; El-Gendy, M.A. Deep Learning Based on Attention in Semantic Segmentation: An Introductory Survey. arXiv; 2022; arXiv: 2204.07756
19. Wu, N.; Jia, D.; Li, Z.; He, Z. Weak Edge Target Segmentation Network Based on Dual Attention Mechanism. Appl. Sci.; 2024; 14, 8963. [DOI: https://dx.doi.org/10.3390/app14198963]
20. Yazıcı, Z.A.; Öksüz, ı.; Ekenel, H.K. GLIMS: Attention-Guided Lightweight Multi-Scale Hybrid Network for Volumetric Semantic Segmentation. Image Vis. Comput.; 2024; 135, 105055.
21. Roy, S.K.; Manna, S.; Song, T.; Bruzzone, L. Attention-based adaptive spectral–spatial kernel ResNet for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens.; 2020; 59, pp. 7831-7843.
22. Qin, J.; Bai, H.; Zhao, Y. Multi-scale attention network for image inpainting. Comput. Vis. Image Underst.; 2021; 204, 103155. [DOI: https://dx.doi.org/10.1016/j.cviu.2020.103155]
23. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Long Beach, CA, USA, 16–17 June 2019; pp. 3146-3154.
24. Li, Y.; Sun, F.; Zhou, C. GLIMS: A Hybrid CNN-Transformer Approach for Local-Global Image Segmentation. arXiv; 2024; arXiv: 2404.17854
25. El-Sayed, M.A.; El-Sharkawy, M.A.; El-Gendy, M.A. Effect of Attention Mechanism in Deep Learning-Based Remote Sensing Image Analysis. Remote Sens.; 2021; 13, 2965.
26. Chen, Y.; Li, J.; Wang, Z.; Liu, X. Collaborative Global-Local Networks for Memory-Efficient Segmentation of Ultra-High Resolution Images. Proceedings of the IEEE International Conference on Computer Vision (ICCV); Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8924-8933.
27. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Boston, MA, USA, 7–12 June 2015; pp. 3431-3440.
28. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234-241.
29. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv; 2016; arXiv: 1606.02147
30. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. Proceedings of the International Conference on Learning Representations (ICLR); San Diego, CA, USA, 7–9 May 2015.
31. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany, 8–14 September 2018; pp. 801-818.
32. Lin, Y.; Gao, Z.; Xu, Z.; Zhuang, Y.; Ma, Y.; Zhang, X. Global Context Aggregation by Dilated Convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Long Beach, CA, USA, 15–20 June 2019; pp. 11906-11915.
33. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X. et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell.; 2020; 43, pp. 3349-3364. [DOI: https://dx.doi.org/10.1109/TPAMI.2020.2983686]
34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Long Beach, CA, USA, 4–9 December 2017; pp. 5998-6008.
35. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132-7141.
36. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany, 8–14 September 2018; pp. 3-19.
37. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 2117-2125.
38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA, 27–30 June 2016; pp. 770-778.
39. Jocher, G.; Stoken, A.; Borovec, J.; Liu, C.; Adam, H.; Laurentiu, D.; Jack, P.; Yu, L.; Prashant, R.; Russ, F. et al. Ultralytics/YOLOv5: V3.0; Zenodo: Genève, Switzerland, 2020.
40. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can Semantic Labeling Methods Generalize to Any City? The Inria Aerial Image Labeling Benchmark. Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS); Fort Worth, TX, USA, 23–28 July 2017.
41. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Honolulu, HI, USA, 21–26 July 2017; pp. 2881-2890.
42. Yang, M.; Yu, K.; Zhang, C.; Li, K.; Yang, K.; Li, J. DenseASPP for Semantic Segmentation in Street Scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684-3692.
43. Cheng, B.; Collins, M.; Zhu, Y.; Liu, T.; Huang, T.S.; Adam, H.; Chen, L.C. Per-Pixel Classification is Not All You Need for Semantic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Nashville, TN, USA, 20–25 June 2021; pp. 4784-4793.
44. Cheng, B.; Schwing, A.; Kirillov, A. Masked-Attention Mask Transformer for Universal Image Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); New Orleans, LA, USA, 18–24 June 2022; pp. 1290-1299.
45. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Liu, Y.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv; 2021; arXiv: 2102.04306
46. Zheng, B.; Zhang, H.; Yuan, Y.; Yang, J.; Wang, X. SegViT: Semantic Segmentation with Vision Transformers. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); New Orleans, LA, USA, 18–24 June 2022; pp. 10023-10033.
47. Gao, Y.; Zhang, J.; Zhang, Y.; Tao, D. HRFormer: High-Resolution Transformer for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote. Sens.; 2022; 60, pp. 1-15.
48. Xiao, T.; Liu, Y.; Dai, B.; Dai, J.; Yuan, L. Unified Perceptual Parsing for Scene Understanding. Proceedings of the European Conference on Computer Vision (ECCV); Tel Aviv, Israel, 23–28 August 2021.
49. Zheng, S.; Lu, J.; Zhao, H.; Xu, X.; Yang, Z.; Zhang, S.; Li, S.; Luo, G.; Xu, Y. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Nashville, TN, USA, 19–25 June 2021; pp. 6881-6890.
50. Mao, Y.; Li, Y.; Li, Y.; Jiang, J. RS3Mamba: Rethinking Efficient Feature Aggregation for Remote Sensing Image Segmentation with State Space Model. arXiv; 2024; arXiv: 2404.02457
51. Ma, X.; Wu, Q.; Zhao, X.; Zhang, X.; Pun, M.O.; Huang, B. SAM-Assisted Remote Sensing Imagery Semantic Segmentation with Object and Boundary Constraints. arXiv; 2023; arXiv: 2312.02464[DOI: https://dx.doi.org/10.1109/TGRS.2024.3443420]
52. Yang, J.; Chen, G.; Huang, J.; Ma, D.; Liu, J.; Zhu, H. GLE-net: Global-Local Information Enhancement for Semantic Segmentation of Remote Sensing Images. Sci. Rep.; 2024; 14, 76622. [DOI: https://dx.doi.org/10.1038/s41598-024-76622-4] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/39455717]
53. Luo, Z.; Pan, J.; Hu, Y.; Deng, L.; Li, Y.; Qi, C.; Wang, X. RS-Dseg: Semantic segmentation of high-resolution remote sensing images based on a diffusion model component with unsupervised pretraining. Sci. Rep.; 2024; 14, 18609. [DOI: https://dx.doi.org/10.1038/s41598-024-69022-1]
54. Wang, L.; Dong, S.; Chen, Y.; Meng, X.; Fang, S.; Fei, S. MetaSegNet: Metadata-Collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens.; 2024; 62, 5644211. [DOI: https://dx.doi.org/10.1109/TGRS.2024.3477548]
55. Liu, M.; Dan, J.; Lu, Z.; Yu, Y.; Li, Y.; Li, X. CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation. arXiv; 2024; arXiv: 2405.10530
56. Chen, W.; Li, Y.; Dang, B.; Zhang, Y. EHSNet: End-to-End Holistic Learning Network for Large-Size Remote Sensing Image Semantic Segmentation. arXiv; 2022; arXiv: 2211.11316
57. Sun, S.; Wu, C.; Fang, W. MBNet: A Lightweight Deep Network for Aerial Image Segmentation. Remote Sens.; 2020; 12, 3278.
58. Chen, Y.; Li, Z.; Zhang, Y.; Liu, Y.; Han, J. MAGNET: A Network that Enhances Segmentation through Magnetized Feature Learning, Ensuring Spatial Coherence and Boundary Precision. arXiv; 2023; arXiv: 2303.11186
59. Zhao, H.; Qi, X.; Shen, X.; Shi, J.; Jia, J. ICNet for Real-Time Semantic Segmentation on High-Resolution Images. Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany, 8–14 September 2018; pp. 405-420.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
With advancements in remote sensing technologies, high-resolution imagery has become increasingly accessible, supporting applications in urban planning, environmental monitoring, and precision agriculture. However, semantic segmentation of such imagery remains challenging due to complex spatial structures, fine-grained details, and land cover variations. Existing methods often struggle with ineffective feature representation, suboptimal fusion of global and local information, and high computational costs, limiting segmentation accuracy and efficiency. To address these challenges, we propose the dual-level network (DLNet), an enhanced framework incorporating self-attention and cross-attention mechanisms for improved multi-scale feature extraction and fusion. The self-attention module captures long-range dependencies to enhance contextual understanding, while the cross-attention module facilitates bidirectional interaction between global and local features, improving spatial coherence and segmentation quality. Additionally, DLNet optimizes computational efficiency by balancing feature refinement and memory consumption, making it suitable for large-scale remote sensing applications. Extensive experiments on benchmark datasets, including DeepGlobe and Inria Aerial, demonstrate that DLNet achieves state-of-the-art segmentation accuracy while maintaining computational efficiency. On the DeepGlobe dataset, DLNet achieves a 76.9% mean intersection over union (mIoU) with a memory footprint of 1443 MB, and on the Inria Aerial dataset it reaches an mIoU of 73.6%, outperforming competing global–local and transformer-based approaches.
Details
1 School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an 710121, China;
2 School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China;
3 Department of Management, Kean University, Union, NJ 07083, USA;
4 Department of Computer Science and Technology, Kean University, Union, NJ 07083, USA