
Highlights

What are the main findings?

We propose GPRNet, a geometry-aware semantic segmentation framework that integrates a Geometric Prior-Refined Block (GPRB) and a Mutual Calibrated Fusion Module (MCFM) to enhance boundary sensitivity and cross-stage semantic consistency.

GPRB leverages learnable directional derivatives to construct structure-aware strength and orientation maps, enabling more accurate spatial localization in complex scenes.

MCFM introduces geometric alignment and semantic enhancement mechanisms that effectively reduce the encoder–decoder feature gap.

GPRNet achieves consistent performance gains on ISPRS Potsdam and LoveDA, improving mIoU by up to 1.7% and 1.3% respectively over strong CNN-, attention-, and transformer-based baselines.

What are the implications of the main findings?

Incorporating geometric priors through learnable gradient-based features improves the model’s ability to capture structural patterns and preserve fine boundaries in high-resolution remote sensing imagery.

The mutual calibration mechanism demonstrates an effective design for encoder–decoder interaction, showing potential for broader applicability across segmentation architectures and modalities.

The empirical evidence indicates that geometry-informed representation learning can serve as a general principle for enhancing land-cover mapping in diverse and structurally complex environments.

Abstract

Semantic segmentation of high-resolution remote sensing images remains a challenging task due to the intricate spatial structures, scale variability, and semantic ambiguity among ground objects. Moreover, the reliable delineation of fine-grained boundaries continues to impose difficulties on existing CNN- and transformer-based models, particularly in heterogeneous urban and rural environments. In this study, we propose GPRNet, a novel geometry-aware segmentation framework that leverages geometric priors and cross-stage semantic alignment for more precise land-cover classification. Central to our approach is the Geometric Prior-Refined Block (GPRB), which learns directional derivative filters, initialized with Sobel-like operators, to generate edge-aware strength and orientation maps that explicitly encode structural cues. These maps are used to guide structure-aware attention modulation, enabling refined spatial localization. Additionally, we introduce the Mutual Calibrated Fusion Module (MCFM) to mitigate the semantic gap between encoder and decoder features by incorporating cross-stage geometric alignment and semantic enhancement mechanisms. Extensive experiments conducted on the ISPRS Potsdam and LoveDA datasets validate the effectiveness of the proposed method, with GPRNet achieving improvements of up to 1.7% mIoU on Potsdam and 1.3% mIoU on LoveDA over strong recent baselines. Furthermore, the model maintains competitive inference efficiency, suggesting a favorable balance between accuracy and computational cost. These results demonstrate the promising potential of geometric-prior integration and mutual calibration in advancing semantic segmentation in complex environments.


1. Introduction

Land cover, shaped by the interplay between natural dynamics and human interventions, plays a vital role in regulating Earth’s biogeophysical and biogeochemical processes [1,2,3,4]. It governs energy exchanges between land and atmosphere, influences hydrological regimes, supports biodiversity, and underpins ecosystem functionality and planetary health. With accelerating urbanization and intensified land exploitation, land cover patterns are undergoing unprecedented transformation, reshaping environmental conditions and amplifying socio-ecological risks such as biodiversity loss, climate anomalies, resource degradation, and public health challenges. Against this backdrop, producing accurate and timely land cover information, especially in rapidly evolving urban environments, has become essential for guiding sustainable development and informed policy-making. Remote sensing images (RSIs), owing to their synoptic perspective, frequent acquisition, and rich spectral-spatial information, provide a powerful foundation for large-scale land cover mapping and change detection [5,6,7,8].

Although RSIs provide a wealth of spatial, spectral, and temporal information for large-scale land cover observation, effectively transforming these data into semantically meaningful class labels remains a significant challenge [9,10,11,12,13,14]. Semantic segmentation has emerged as a core technique for addressing this issue, enabling pixel-wise classification by learning discriminative features and capturing spatial context [15,16,17,18,19,20,21]. In the domain of high-resolution RSIs, segmentation-based approaches are particularly valuable for delineating complex land cover patterns at fine scales, including heterogeneous urban structures and transitional zones. Consequently, a growing body of research has explored various deep learning paradigms—ranging from convolutional neural networks (CNNs) to attention mechanisms and transformer architectures—to enhance the performance of semantic segmentation [22,23,24,25,26].

CNNs have laid the foundation for modern semantic segmentation, with pioneering models such as Fully Convolutional Networks (FCN) [27], SegNet [28], U-Net [29], and DeepLab series [30] demonstrating strong performance across various natural image domains. These models exploit hierarchical convolutional layers to learn spatially-aware features while enabling end-to-end pixel-level prediction. In the context of RSIs, many CNN-based architectures have been adapted or newly developed to address domain-specific challenges, such as large-scale variation, fine object structures, and spectral heterogeneity. Representative methods, like ResUNet-a [31], D-LinkNet [32], OBIA-DL [33], HR-PSPNet [34], and so forth, incorporate multi-scale context aggregation for better receptive field expansion. Despite their success, CNN-based models often struggle to preserve detailed geometric structures due to inherent limitations in local receptive fields [35,36,37].

To overcome the locality limitations of CNNs, attention mechanisms have been introduced to enhance feature dependencies across spatial and channel dimensions [38,39]. Methods such as SENet [40], CBAM [41], DANet [42], and NLNet [43] have demonstrated the effectiveness of attention in refining contextual representations for semantic segmentation tasks. Motivated by these advances, researchers in the remote sensing community have incorporated attention modules to better model long-range spatial dependencies and suppress background noise; representative examples include LANet [44], SCAttNet [45], HMANet [46], HCANet [47], RAANet [48], MACU-Net [49], A2FPN [50], SUAS [51], and FSDENet [52]. Nevertheless, attention-based modules still rely on convolutional backbones, which may limit their ability to fully capture global structures in highly heterogeneous LULC scenes.

Recently, transformers have gained increasing attention in semantic segmentation due to their capability to model global context via self-attention mechanisms. Vision Transformer (ViT) [53], Segmenter [54], and Swin Transformer [55] are notable examples that demonstrate promising results in capturing long-range dependencies and semantic consistency. In the remote sensing field, several transformer-based methods have emerged to address the structural complexity and scale variance inherent in high-resolution RSIs, including FarSeg++ [56], LETFormer [57], CLCFormer [58], CMLFormer [59], UM2Former [60], UnetFormer [61], AAFormer [62], RSAM-Seg [63], TDBAN [64], and so on. However, transformer-based methods often suffer from computational complexity and may still face difficulties in preserving fine-grained geometric details, especially in densely built urban scenes.

Despite these methodological advances, several key challenges persist in high-resolution LULC segmentation:

1. Mainstream models rely heavily on stacked convolution and downsampling layers to extract high-level semantic features. While effective for global contextual reasoning, these operations often lead to the degradation or loss of detailed spatial and geometric information. This is especially detrimental in LULC tasks, where object boundaries and shape configurations carry critical semantic cues.

2. Skip connections and feature fusion modules in encoder–decoder networks are typically designed to combine features from different stages without explicitly addressing the geometric misalignment between low-level detail and high-level abstraction. The lack of structure-aware guidance in this fusion process can result in semantic inconsistencies, blurred boundaries, or incorrect class transitions, particularly in heterogeneous scenes.

3. Although some recent approaches have explored edge-aware supervision or auxiliary boundary detection, they tend to treat geometric features as post-hoc refinements rather than integrating them into the core representation learning and feature alignment processes.

These observations suggest that a unified strategy capable of embedding geometric priors directly into the feature extraction and fusion stages is crucial for achieving structure-preserving and semantically consistent segmentation across scales. To overcome the aforementioned limitations, we propose GPRNet, a novel semantic segmentation network that systematically incorporates geometric prior knowledge into both the representation learning path and the feature fusion path. Within the encoder and decoder, we introduce the Geometric Prior-Refined Block (GPRB), which explicitly encodes geometric cues, such as gradient magnitude and orientation, into the feature learning pipeline. By employing learnable directional derivative filters initialized with Sobel-like operators, GPRB provides stable yet adaptable geometric descriptors that enhance structural sensitivity during abstraction and reconstruction. To address feature misalignment between encoder and decoder, we design the Mutual Calibrated Fusion Module (MCFM), which leverages geometric priors to guide the interaction and alignment of multi-level features, ensuring structure-consistent fusion through attention-driven calibration. Together, these modules constitute a unified architecture that integrates geometry-aware reasoning into the entire semantic segmentation process, leading to improved edge preservation, alignment accuracy, and overall segmentation quality. The main contributions are as follows:

1. We propose the GPRB, which introduces learnable geometric priors—modeled via gradient magnitude and orientation maps—into the encoder and decoder pathways. This module enhances structure-aware feature learning and improves the preservation of geometric fidelity throughout the representation hierarchy.

2. We design the MCFM, which performs cross-stage alignment between encoder and decoder features using geometric prior-aware calibration. This module alleviates spatial-semantic misalignment and strengthens the consistency of multi-level fusion.

3. We develop the GPRNet, which unifies the above modules into a geometry-refined encoder–decoder network. Comprehensive experiments on the ISPRS Potsdam [65] and LoveDA [66] datasets validate the effectiveness of GPRNet compared with state-of-the-art baselines. Ablation studies further highlight the roles of geometric branches within GPRB and MCFM.

The paper is structured as follows. Section 2 provides an overview of related works in semantic segmentation of RSIs. Section 3 introduces the GPRNet with its sub-modules. Section 4 presents the experiments and results. Section 5 draws the conclusion of our work and points out the future directions.

2. Related Works

2.1. CNN-Based Semantic Segmentation Methods for RSIs

CNNs have long served as the foundational framework for semantic segmentation of RSIs, owing to their hierarchical feature extraction capabilities and strong spatial locality modeling. Early architectures adapted from natural image domains, such as U-Net and FCN variants, laid the groundwork for encoder–decoder pipelines widely adopted in RSIs. However, the unique challenges posed by high-resolution RSIs—such as complex background textures, diverse object scales, and intricate geometric patterns—necessitate specialized architectural adaptations beyond vanilla CNNs.

To enhance representation learning, several studies have focused on refining feature aggregation strategies and multi-scale encoding. DFANet [67] introduces a deep feature aggregation framework that hierarchically integrates multi-resolution features via both shallow and deep fusion, further refined with a conditional random field for structural consistency. ResUNet-a [31] extends the U-Net framework by incorporating residual connections and atrous convolutions, which enhance semantic feature extraction while preserving spatial resolution. D-LinkNet [32] introduces a ResNet-based encoder with dilated convolutions and a light-weight decoder for efficient segmentation of road networks and other elongated structures, demonstrating high boundary sensitivity. OBIA-DL [33] follows a hybrid approach that integrates object-based image analysis (OBIA) with deep learning, leveraging shape and texture cues in combination with CNN predictions for improved classification accuracy. HR-PSPNet [34] adapts the Pyramid Scene Parsing Network (PSPNet) to the high-resolution RSI domain by refining pyramid pooling and upsampling strategies, thereby enhancing its ability to capture multi-scale context. Similarly, MACANet [68] leverages an adaptive-scale context extraction block and sequential aggregation to bridge semantic gaps across hierarchical layers, enabling more flexible multi-scale representations for HR-RSI segmentation. In the pursuit of computational efficiency without sacrificing accuracy, lightweight networks like MS-DSNet [69] incorporate depthwise separable convolutions across multiple scales, striking a balance between receptive field expansion and inference cost. Dynamic convolution has also emerged as a promising alternative; GDCNet [70] proposes Gaussian dynamic convolution modules combined with pyramid pooling to dynamically adjust receptive fields, enabling more robust segmentation under large-scale variation. Other efforts focus on improving spatial-structural integration. MS-SkipNet [71] employs redesigned multi-scale skip connections with atrous convolutions to preserve spatial resolution during deep encoding. DFFAN [72] introduces a dual-function feature aggregation scheme incorporating affinity matrices and boundary-aware fusion to balance contextual and spatial cues, improving the delineation of fine-scale structures such as edges and borders. These approaches represent diverse directions within non-attention CNN-based segmentation frameworks, emphasizing encoder–decoder refinement, residual learning, and spatial context modeling. However, despite their architectural sophistication, they generally rely on implicitly learned edge and shape information and still exhibit limitations in modeling long-range dependencies and explicitly capturing geometric priors.

2.2. Attention-Enhanced CNN Methods for RSIs

To alleviate the locality limitations of standard CNNs and enhance semantic discrimination, attention mechanisms have been extensively integrated into CNN-based architectures in the context of RSI semantic segmentation. These enhancements aim to strengthen spatial–channel interactions, suppress background interference, and better model long-range dependencies, which are critical for handling heterogeneous land cover scenes [73].

Several early-stage works focused on single or dual attention branches. For instance, LANet [44], SCAttNet [45], HMANet [46], and HCANet [47] incorporated spatial and channel attention in parallel or cascaded forms to recalibrate features and refine segmentation masks. RAANet [48] further explored regional attention aggregation to suppress irrelevant background responses. MACU-Net [49] introduced a multi-scale attention design to enhance complex object recognition across spatial resolutions. In a similar vein, A2FPN [50] proposed adaptive attention fusion at multiple levels of the decoder, providing consistent improvements in delineating fine-grained classes such as buildings and roads. Recent advancements have gone beyond modular integration, focusing instead on constructing hybrid attention-CNN architectures. For example, Swin-CDSA [74] combined cascaded depthwise convolution modules with spatial attention mechanisms, achieving improved edge localization and local–global context understanding. Jiang et al. [75] introduced a feature enhancement U-Net framework enhanced by spatial–channel dual attention and region-of-interest contrast amplification. LDCANet [76] proposed a dual-range context aggregation module that couples lightweight self-attention with convolutions to balance performance and computational cost. Other efforts have explored multi-branch and fusion strategies. The multi-branch attention fusion network (MBAFNet) [77] simultaneously integrates pixel-, channel-, and spatial-level attention on shallow and deep fused features to enhance semantic discrimination. In another design, AFNet [78] devised an adaptive attention fusion framework leveraging hierarchical guidance across decoder stages. MCAT-UNet [79] further expanded on residual-guided aggregation and dual attention modules to refine boundaries and recover small structures.

Despite their structural differences, most attention-enhanced CNN models still share a common design principle: geometric priors are treated as emergent properties implicitly learned by the network rather than as explicit guidance in the feature extraction pipeline. This assumption can be fragile in heterogeneous or densely structured urban scenes, where edges, shapes, and boundaries play a decisive role in maintaining semantic coherence. Furthermore, fusion between encoder and decoder branches is generally attention-driven but rarely incorporates explicit geometric alignment, which may lead to discontinuities or artifacts near object borders. In contrast, GPRNet explicitly introduces geometric priors into both representation learning and cross-stage fusion through GPRB and MCFM, enabling structure-aware refinement and geometrically consistent alignment across different feature levels.

2.3. Transformer-Based Semantic Segmentation Methods for RSIs

Transformer-based models have recently garnered significant attention in RSI segmentation due to their ability to model global dependencies via self-attention. Early adaptations of Vision Transformers (ViTs) in semantic segmentation, such as Segmenter [54] and Swin Transformer [55], paved the way for more sophisticated transformer-based segmentation frameworks. Building upon these foundations, numerous specialized methods have been proposed for RSIs, aiming to address the inherent spatial complexity, object scale variation, and semantic inconsistency present in remote scenes.

In the realm of RSIs, FarSeg++ [56], LETFormer [57], and CLCFormer [58] are among the pioneering works adapting hierarchical or lightweight transformer backbones for efficient segmentation of high-resolution RSIs. These models emphasize multi-scale feature representation and semantic consistency across dense scenes. CMLFormer [59] further enhances multiscale contextual reasoning by combining CNN and multiscale local-context transformers, achieving improved performance with reduced computational overhead. To mitigate challenges such as foreground-background imbalance and coarse object delineation, several innovative methods have emerged. For instance, MMT [80] introduces a mixed-mask attention strategy to improve intra-class and inter-class correlation modeling, along with a progressive multiscale learning scheme. Similarly, CSTUNet [81] employs a dual-encoder architecture with CNN and Swin Transformer branches, integrating spatial attention modules to preserve fine details during downsampling. Other notable hybrid designs include CCTNet [82], which integrates a CNN encoder with a cross-shaped Transformer decoder, enabling lightweight global-local feature aggregation. LSENet [83] proposes spatial and local enhancement modules to guide Swin-based segmentation, specifically targeting edge preservation. SSDT [84] emphasizes semantic scale separation via K-means clustering and decoupled attention, yielding scale-aware and semantically distinct features. RSSFormer [85] and CTFNet [86] both employ CNN-Transformer fusion strategies within encoder–decoder frameworks, aiming to capture complementary local and global cues. Collectively, these recent methods illustrate a clear trend towards hybrid architectures that exploit transformer-style global context modeling while retaining convolutional inductive biases for efficiency.

Despite their impressive performance, transformer-based models often exhibit high computational demands and may still struggle to capture fine-grained geometrical structures, especially in urban environments with complex object layouts and densely packed instances. In practice, the need to balance global context modeling, boundary precision, and computational efficiency motivates alternative designs that can exploit structural priors without incurring excessive overhead. In this context, the proposed GPRNet offers a complementary geometry-refined encoder–decoder solution that explicitly embeds geometric priors into the segmentation pipeline and can be integrated with or serve as a lightweight alternative to transformer-based representations.

3. Method

3.1. Overview

The proposed GPRNet is a geometry-aware semantic segmentation network designed to enhance land cover mapping by incorporating geometric prior knowledge into both feature extraction and fusion stages. As illustrated in Figure 1, GPRNet adopts an encoder–decoder architecture with skip connections, where each stage is augmented with geometry-guided modules to improve spatial structure modeling and semantic consistency. In this work, we instantiate the encoder with a convolutional backbone, but the proposed geometric modules are formulated in a backbone-agnostic manner and can, in principle, be integrated into other encoder–decoder architectures with minimal modification.

Formally, let $X \in \mathbb{R}^{H \times W \times 3}$ denote the input remote sensing image with height $H$, width $W$, and 3 RGB channels. The encoder extracts hierarchical deep features through successive convolutional and downsampling operations, producing multi-level feature maps $\{F_i^{enc}\}_{i=1}^{4}$, where $F_i^{enc} \in \mathbb{R}^{\frac{H}{2^i} \times \frac{W}{2^i} \times C_i}$ denotes the encoded feature at stage $i$ with channel dimension $C_i$.

Each encoder stage contains a GPRB, which enriches the convolutional representation by integrating geometric features—specifically, gradient-based magnitude and orientation priors. Given an intermediate feature map $F_{in} \in \mathbb{R}^{h \times w \times c}$, the GPRB outputs a refined feature map $F_{gpr}$:

(1) $F_{gpr} = \mathrm{GPRB}(F_{in}) = F_{res} + F_{geo}$,

where $F_{res}$ denotes the residual convolutional feature, and $F_{geo}$ is the geometrically guided feature refined by the geometric attention map $A_{geo}$. This attention map is derived from the learnable gradient magnitude $M$ and orientation $\theta$ via a geometric prior generation operator (GPGO):

(2) $A_{geo} = \sigma\big(\mathrm{Conv}_{1\times 1}([M, \theta])\big)$,

where σ denotes the sigmoid activation, and [·,·] indicates channel-wise concatenation. In this way, the GPRB explicitly encodes local geometric structures into a spatial attention signal that is subsequently used to modulate semantic features.

Following the encoder, a bottleneck block further encodes the deepest features, which are then gradually upsampled and decoded. The decoder mirrors the encoder structure and also integrates GPRBs to preserve structural cues during reconstruction. At each decoding stage, the feature from the previous layer $F_{i+1}^{dec}$ is upsampled and fused with the corresponding encoder feature $F_i^{enc}$ using an MCFM:

(3) $F_i^{fuse} = \mathrm{MCFM}(F_i^{enc}, F_{i+1}^{dec})$,

which explicitly aligns the encoder and decoder features by leveraging geometric priors to mitigate spatial misalignment and ensure semantic consistency. Both GPRB and MCFM are designed as plug-and-play units that preserve the spatial resolution and channel dimensionality of their inputs, which facilitates seamless insertion into standard encoder–decoder pipelines.

After sequential decoding, the final high-resolution feature map is passed through a 1×1 convolution and softmax activation to generate the per-pixel classification map:

(4) $Y_{pred} = \mathrm{Softmax}\big(\mathrm{Conv}_{1\times 1}(F_1^{dec})\big), \quad Y_{pred} \in \mathbb{R}^{H \times W \times K}$,

where K is the number of land cover categories.

In summary, GPRNet offers a unified architecture with two core designs: (1) GPRB modules inject structural priors into feature learning, improving geometry-aware representation; and (2) MCFM modules bridge semantic gaps between encoder and decoder features with geometry-guided alignment. Compared with purely CNN- or attention-based designs, GPRNet emphasizes explicit geometric reasoning while retaining a relatively lightweight convolutional backbone, aiming to balance boundary precision and computational efficiency for practical LULC applications. Together, they ensure fine-grained boundary preservation and spatial consistency for land cover segmentation.
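To make the data flow of Equations (1)–(4) concrete, the following minimal PyTorch sketch outlines a four-stage encoder–decoder skeleton in which GPRB and MCFM are represented by simple placeholder modules (the more detailed sketches in Sections 3.2 and 3.3 can be substituted for them). The channel widths, the plain convolutions used as stand-ins, and the placement of pooling and upsampling are illustrative assumptions rather than the exact GPRNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GPRBPlaceholder(nn.Module):
    """Stand-in for the Geometric Prior-Refined Block (detailed in Section 3.2)."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)


class MCFMPlaceholder(nn.Module):
    """Stand-in for the Mutual Calibrated Fusion Module (detailed in Section 3.3)."""

    def __init__(self, c):
        super().__init__()
        self.proj = nn.Conv2d(2 * c, c, kernel_size=1)

    def forward(self, f_enc, f_dec):
        # Eq. (3): fuse the encoder feature with the upsampled decoder feature.
        return self.proj(torch.cat([f_enc, f_dec], dim=1))


class GPRNetSkeleton(nn.Module):
    """Illustrative data flow of Eqs. (1)-(4); widths and stand-in blocks are assumptions."""

    def __init__(self, num_classes=6, widths=(64, 128, 256, 512)):
        super().__init__()
        chans = [3] + list(widths)
        self.enc = nn.ModuleList([GPRBPlaceholder(chans[i], chans[i + 1]) for i in range(4)])
        self.bottleneck = GPRBPlaceholder(widths[3], widths[3])
        self.reduce = nn.ModuleList([  # bring the deeper feature down to the stage width
            nn.Conv2d(widths[3] if i == 3 else widths[i + 1], widths[i], 1) for i in range(4)])
        self.fuse = nn.ModuleList([MCFMPlaceholder(widths[i]) for i in range(4)])
        self.dec = nn.ModuleList([GPRBPlaceholder(widths[i], widths[i]) for i in range(4)])
        self.head = nn.Conv2d(widths[0], num_classes, kernel_size=1)

    def forward(self, x):
        feats, h = [], x
        for stage in self.enc:                       # F_i^enc at H/2^i resolution
            h = F.max_pool2d(stage(h), 2)
            feats.append(h)
        h = self.bottleneck(h)
        for i in range(3, -1, -1):                   # decode from deep to shallow
            h = F.interpolate(h, size=feats[i].shape[-2:], mode="bilinear", align_corners=False)
            h = self.fuse[i](feats[i], self.reduce[i](h))   # Eq. (3), then GPRB refinement
            h = self.dec[i](h)
        logits = self.head(F.interpolate(h, size=x.shape[-2:], mode="bilinear", align_corners=False))
        return torch.softmax(logits, dim=1)          # Eq. (4); use raw logits for training losses


if __name__ == "__main__":
    y = GPRNetSkeleton()(torch.randn(1, 3, 256, 256))
    print(y.shape)  # torch.Size([1, 6, 256, 256])
```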

3.2. Details of Geometric Prior-Refined Block (GPRB)

To enhance geometry-aware feature learning during both encoding and decoding stages, we design the GPRB, which explicitly introduces gradient-based structural priors into the representation pipeline. The detailed architecture of GPRB is depicted in Figure 2. Intuitively, GPRB augments standard convolutional features with learnable gradient responses that highlight object contours, edges, and anisotropic patterns, thereby providing explicit cues for structure-preserving segmentation.

Given an input feature map $F_{in} \in \mathbb{R}^{h \times w \times c}$, GPRB consists of two parallel branches:

The top branch includes a standard convolutional block comprising four stacked 3×3 convolutions followed by ReLU activations. Let $F_{res} \in \mathbb{R}^{h \times w \times c}$ denote the residual feature map obtained from this branch:

(5) $F_{res} = B_{conv}(F_{in})$,

where $B_{conv}$ denotes the convolutional block function. This branch preserves the representational capacity of a conventional CNN block and serves as the primary carrier of semantic information.

The bottom branch constructs a geometry-aware attention map that guides the modulation of the input feature. It consists of three stages: (a) learnable directional gradient estimation, (b) geometric prior generation, and (c) feature modulation.

(a) Learnable Directional Gradient Estimation. We define a pair of learnable convolutional filters $G_x, G_y \in \mathbb{R}^{k \times k \times c \times 1}$, which are initialized similarly to Sobel filters but trained jointly with the network. These filters compute directional gradients along the horizontal and vertical axes:

(6) $F_x = F_{in} * G_x, \quad F_y = F_{in} * G_y$,

where ∗ denotes depth-wise convolution. By starting from Sobel-like initialization and allowing the kernels to be updated during training, the network benefits from stable low-level edge detection at early stages while gradually adapting the gradient operators to dataset-specific geometric patterns (e.g., building footprints, road networks, and parcel boundaries).
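As a concrete illustration of this step, the learnable gradient operators can be realized as depthwise (grouped) convolutions whose kernels are copied from Sobel filters at initialization and left trainable. The sketch below is a minimal PyTorch realization under these assumptions; the 3×3 kernel size and the channel averaging used to obtain single-channel $F_x$ and $F_y$ (so that $M$ and $\theta$ in Equations (7) and (8) are single-channel maps) are our own illustrative choices, since the text does not specify how per-channel responses are aggregated.

```python
import torch
import torch.nn as nn


class LearnableSobel(nn.Module):
    """Depthwise directional derivatives G_x, G_y initialized from Sobel kernels (Eq. 6)."""

    def __init__(self, channels: int):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.],
                                [-2., 0., 2.],
                                [-1., 0., 1.]])
        # One 3x3 kernel per channel (groups=channels gives a depthwise convolution).
        self.gx = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.gy = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        with torch.no_grad():
            self.gx.weight.copy_(sobel_x.expand(channels, 1, 3, 3))
            self.gy.weight.copy_(sobel_x.t().expand(channels, 1, 3, 3))
        # The kernels remain trainable, so they can drift away from plain Sobel filters.

    def forward(self, f_in):
        # Collapse per-channel responses to single-channel maps so that M and theta
        # in Eqs. (7)-(8) have shape (B, 1, h, w); averaging is an assumption here.
        fx = self.gx(f_in).mean(dim=1, keepdim=True)
        fy = self.gy(f_in).mean(dim=1, keepdim=True)
        return fx, fy


fx, fy = LearnableSobel(64)(torch.randn(2, 64, 128, 128))
print(fx.shape, fy.shape)  # torch.Size([2, 1, 128, 128]) each
```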

(b) Geometric Prior Generation Operator (GPGO). The magnitude map M and orientation map θ are computed as follows:

(7) $M = \sqrt{F_x^2 + F_y^2}$,

(8) $\theta = \arctan\!\left(\dfrac{F_y}{F_x + \epsilon}\right)$,

where $M \in \mathbb{R}^{h \times w \times 1}$ captures edge strength, $\theta \in \mathbb{R}^{h \times w \times 1}$ encodes orientation (normalized to $[-1, 1]$ via tanh if necessary), and $\epsilon$ is a small constant to avoid division by zero. This formulation assumes that local geometric structures can be approximated by first-order directional derivatives, which is a standard yet effective assumption for capturing edges and contour flows in high-resolution RSIs.

Then, a lightweight attention generation module transforms [M,θ] into a spatial attention map:

(9) $A_{geo} = \sigma\big(\mathrm{Conv}_{1\times 1}([M, \theta])\big)$,

where $[\cdot, \cdot]$ denotes channel-wise concatenation and $\sigma$ is the sigmoid activation. The attention map $A_{geo} \in \mathbb{R}^{h \times w \times 1}$ selectively emphasizes geometry-salient regions. Notably, $A_{geo}$ is shared across channels and varies only over spatial locations, which encourages coherent emphasis along object boundaries while keeping the additional parameter overhead negligible.

(c) Geometric Feature Modulation. Using $A_{geo}$, the input feature $F_{in}$ is modulated via element-wise multiplication:

(10) $F_{geo} = F_{in} \odot A_{geo}$,

where ⊙ denotes element-wise multiplication with broadcasting. This operation reinforces geometry-relevant areas while suppressing noisy or irrelevant background regions. Compared with purely implicit learning of edge awareness, this explicit modulation scheme provides a direct mechanism for injecting geometric priors into the feature representation.

Finally, the outputs from the two branches are fused via element-wise summation:

(11) $F_{gpr} = F_{res} + F_{geo}$,

yielding the geometry-refined feature map $F_{gpr} \in \mathbb{R}^{h \times w \times c}$ that retains both semantic richness and structural awareness.

In practice, GPRB is inserted at multiple encoder and decoder stages as a plug-and-play module, sharing the same interface as a standard residual block and preserving input–output dimensionality. As demonstrated in the ablation study, this design introduces only modest computational overhead while substantially improving the delineation of fine-scale geometric structures such as building outlines and road segments, guiding the model to retain the fine-grained structural cues essential for accurate land cover delineation.
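Putting Equations (5)–(11) together, a compact GPRB can be sketched as follows. The residual branch uses four 3×3 convolutions as described above; the Sobel initialization, the channel-averaged gradient maps, and the small ε added inside the square root are illustrative assumptions of this sketch rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn


class GPRB(nn.Module):
    """Sketch of the Geometric Prior-Refined Block (Eqs. 5-11); hyperparameters are illustrative."""

    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # Top branch: four stacked 3x3 convolutions with ReLU activations (Eq. 5).
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        self.res_branch = nn.Sequential(*layers)
        # Bottom branch: learnable, Sobel-initialized depthwise directional derivatives (Eq. 6).
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.gx = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.gy = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        with torch.no_grad():
            self.gx.weight.copy_(sobel_x.expand(channels, 1, 3, 3))
            self.gy.weight.copy_(sobel_x.t().expand(channels, 1, 3, 3))
        # GPGO: 1x1 convolution over [M, theta] followed by a sigmoid (Eq. 9).
        self.attn = nn.Sequential(nn.Conv2d(2, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, f_in):
        f_res = self.res_branch(f_in)                          # Eq. (5)
        fx = self.gx(f_in).mean(dim=1, keepdim=True)           # channel averaging is an assumption
        fy = self.gy(f_in).mean(dim=1, keepdim=True)
        mag = torch.sqrt(fx ** 2 + fy ** 2 + self.eps)         # Eq. (7); eps added for stability
        ori = torch.tanh(torch.atan(fy / (fx + self.eps)))     # Eq. (8), squashed to [-1, 1]
        a_geo = self.attn(torch.cat([mag, ori], dim=1))        # Eq. (9); shape (B, 1, h, w)
        f_geo = f_in * a_geo                                   # Eq. (10), broadcast over channels
        return f_res + f_geo                                   # Eq. (11)


out = GPRB(64)(torch.randn(2, 64, 128, 128))
print(out.shape)  # torch.Size([2, 64, 128, 128])
```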

3.3. Mutual Calibrated Fusion Module (MCFM)

To address the feature misalignment between the encoder and decoder pathways, we design the MCFM, which consists of two complementary sub-modules, Cross-Stage Geometric Alignment (CSGA) and Semantic-Enhanced Fusion (SEF). As shown in Figure 3, MCFM adaptively aligns and fuses multi-level features while preserving both structural and semantic consistency. It operates on features that have been refined by GPRB, allowing geometric cues to guide cross-stage calibration and fusion.

Given the encoder feature $F_{enc} \in \mathbb{R}^{h \times w \times c}$ and decoder feature $F_{dec} \in \mathbb{R}^{h \times w \times c}$, the MCFM performs the following steps:

To reduce computational cost in the alignment stage, both features are first passed through a 1×1 convolution layer to obtain channel-reduced features:

(12) $F_{enc}^{c} = \mathrm{Conv}_{1\times 1}(F_{enc}), \quad F_{dec}^{c} = \mathrm{Conv}_{1\times 1}(F_{dec})$,

where $F_{enc}^{c}, F_{dec}^{c} \in \mathbb{R}^{h \times w \times c'}$, with $c' < c$. This dimensionality reduction controls the computational burden of the subsequent mutual attention while preserving the dominant semantic and geometric information.

The compressed features are then aligned via a mutual attention mechanism that facilitates cross-stage geometric calibration. Specifically, we compute mutual attention between $F_{enc}^{c}$ and $F_{dec}^{c}$:

(13) $A_{mutual} = \mathrm{softmax}\!\left(\dfrac{Q_{enc} K_{dec}^{T}}{\sqrt{d}}\right)$,

where $Q_{enc} = W_Q F_{enc}^{c}$ and $K_{dec} = W_K F_{dec}^{c}$ are the query and key matrices, $d$ is the head dimension, and $W_Q$, $W_K$ are learnable linear projections.

The calibrated features are then computed as:

(14) $F_{enc}^{mc} = F_{enc}^{c} + A_{mutual} \cdot V_{dec}$,

(15) $F_{dec}^{mc} = F_{dec}^{c} + A_{mutual}^{T} \cdot V_{enc}$,

where $V_{dec}$ and $V_{enc}$ are the value projections of $F_{dec}^{c}$ and $F_{enc}^{c}$, respectively. Through this bidirectional interaction, encoder features are informed by decoder semantics and vice versa, which helps to reconcile discrepancies in spatial detail and abstraction level around complex structures.
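For readers who prefer code, a minimal single-head sketch of the cross-stage calibration in Equations (12)–(15) is given below, treating each spatial position as a token. The reduction ratio, the token-flattening strategy, and the separate value projections are assumptions of this sketch; a multi-head variant and memory-saving measures would be needed for very large feature maps, since the attention cost grows quadratically with the number of spatial positions.

```python
import torch
import torch.nn as nn


class CSGA(nn.Module):
    """Single-head sketch of Cross-Stage Geometric Alignment (Eqs. 12-15)."""

    def __init__(self, c: int, reduction: int = 4):
        super().__init__()
        c_red = c // reduction
        self.enc_proj = nn.Conv2d(c, c_red, kernel_size=1)     # Eq. (12)
        self.dec_proj = nn.Conv2d(c, c_red, kernel_size=1)
        self.wq = nn.Linear(c_red, c_red, bias=False)
        self.wk = nn.Linear(c_red, c_red, bias=False)
        self.wv_enc = nn.Linear(c_red, c_red, bias=False)
        self.wv_dec = nn.Linear(c_red, c_red, bias=False)

    def forward(self, f_enc, f_dec):
        b, _, h, w = f_enc.shape
        # Compress channels, then flatten the h*w spatial positions into tokens.
        e = self.enc_proj(f_enc).flatten(2).transpose(1, 2)    # (B, N, c')
        d = self.dec_proj(f_dec).flatten(2).transpose(1, 2)
        q, k = self.wq(e), self.wk(d)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)   # Eq. (13)
        e_mc = e + attn @ self.wv_dec(d)                       # Eq. (14)
        d_mc = d + attn.transpose(1, 2) @ self.wv_enc(e)       # Eq. (15)

        def to_map(t):   # back to (B, c', h, w) for the subsequent fusion stage
            return t.transpose(1, 2).reshape(b, -1, h, w)

        return to_map(e_mc), to_map(d_mc)


e_mc, d_mc = CSGA(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(e_mc.shape, d_mc.shape)  # torch.Size([1, 16, 32, 32]) each
```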

Afterwards, the calibrated features $F_{enc}^{mc}$ and $F_{dec}^{mc}$ are projected back to the original dimension and summed to form the preliminary fusion:

(16) $F_{sum} = F_{enc}^{mc} + F_{dec}^{mc}$.

To guide the feature selection, a channel-wise attention map is computed. First, we apply global average pooling followed by two fully connected layers with a non-linearity and sigmoid activation:

(17) $A_{fuse} = \sigma\big(W_2 \cdot \delta(W_1 \cdot \mathrm{GAP}(F_{sum}))\big)$,

where $\mathrm{GAP}(\cdot)$ denotes global average pooling, $W_1 \in \mathbb{R}^{c \times c/r}$, $W_2 \in \mathbb{R}^{c/r \times c}$, $\delta$ is the ReLU activation, and $\sigma$ is the sigmoid. Here, $A_{fuse} \in \mathbb{R}^{1 \times 1 \times c}$ depends only on the channel dimension and is broadcast over spatial locations, meaning that the gating in SEF is channel-wise rather than pixel-wise. This design choice strikes a balance between expressiveness and efficiency while allowing the network to emphasize channels that are more reliable for boundary localization or semantic discrimination.

The final output feature $F_{fuse} \in \mathbb{R}^{h \times w \times c}$ is obtained via SEF:

(18) $F_{fuse} = A_{fuse} \odot F_{enc}^{mc} + (1 - A_{fuse}) \odot F_{dec}^{mc}$,

where ⊙ denotes channel-wise multiplication with broadcasting. This dynamic fusion enables the network to selectively combine semantic abstraction from the decoder with geometric precision from the encoder. Consequently, encoder features that preserve fine spatial details can be up-weighted in channels where geometry is critical, while decoder features can dominate in channels carrying high-level semantic cues.
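Concretely, the semantic-enhanced fusion of Equations (16)–(18) amounts to a squeeze-and-excitation style channel gate applied to the two calibrated streams. The sketch below assumes a reduction ratio r = 8 and uses 1×1 convolutions in place of fully connected layers (equivalent for 1×1 spatial inputs); both choices are illustrative rather than the authors' exact settings.

```python
import torch
import torch.nn as nn


class SEF(nn.Module):
    """Sketch of Semantic-Enhanced Fusion (Eqs. 16-18); the reduction ratio r is an assumption."""

    def __init__(self, c_red: int, c_out: int, r: int = 8):
        super().__init__()
        # Project the calibrated features back to the original channel dimension.
        self.back_enc = nn.Conv2d(c_red, c_out, kernel_size=1)
        self.back_dec = nn.Conv2d(c_red, c_out, kernel_size=1)
        # Channel-wise gate: GAP -> FC -> ReLU -> FC -> sigmoid (Eq. 17),
        # realized with 1x1 convolutions acting on 1x1 spatial inputs.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_out, c_out // r, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out // r, c_out, kernel_size=1), nn.Sigmoid())

    def forward(self, f_enc_mc, f_dec_mc):
        e = self.back_enc(f_enc_mc)
        d = self.back_dec(f_dec_mc)
        a = self.gate(e + d)              # Eqs. (16)-(17): gate computed from F_sum
        return a * e + (1.0 - a) * d      # Eq. (18): channel-wise convex combination


# Example: fuse the calibrated outputs of the CSGA sketch (16 channels) back to 64 channels.
fused = SEF(c_red=16, c_out=64)(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```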

In summary, MCFM functions as a geometry-aware fusion interface between the encoder and decoder, addressing cross-stage misalignment and enhancing boundary localization. Its two-stage design ensures that the fused features respect both the structural integrity and the semantic richness necessary for precise land cover segmentation. As discussed in the experimental section, this design brings consistent improvements over simple skip concatenation or addition, with a moderate increase in FLOPs and parameters.

4. Experiments

4.1. Datasets

4.1.1. ISPRS Potsdam Dataset

The ISPRS Potsdam dataset provides ultra-high-resolution airborne imagery, with each tile measuring 6000 × 6000 pixels at a ground sampling distance (GSD) of 5 cm per pixel. A total of 38 image tiles are available, annotated with six semantic classes consistent with the ISPRS Vaihingen dataset. Each tile provides four spectral bands: red, green, blue, and near-infrared.

In our experiments, we adopted a specific split protocol: 13 tiles (including IDs such as 2_13, 2_14, 3_13, 3_14, etc.) were used exclusively for testing, while tile 2_10 was designated for validation. The remaining 22 tiles, excluding tile 7_10 due to known label inaccuracies, were used for training. Following common practice in recent works, and to ensure a fair comparison with baselines that only provide RGB configurations, we restrict model inputs to the RGB channels and do not use the near-infrared band. To facilitate model training and evaluation, each large image was partitioned into non-overlapping patches of size 256 × 256 pixels. An example of the Potsdam dataset is visualized in Figure 4.
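For reference, partitioning a large tile into non-overlapping 256 × 256 patches can be expressed with a few lines of tensor reshaping. The snippet below is a generic sketch (band selection, file I/O, and label handling omitted) rather than the exact preprocessing script used in this work; in particular, how the border remainder of a 6000-pixel tile is handled is not specified in the text, and simple cropping is assumed here.

```python
import torch


def to_patches(image: torch.Tensor, patch: int = 256) -> torch.Tensor:
    """Split a (C, H, W) tile into non-overlapping (N, C, patch, patch) patches.

    Any border remainder (e.g., 6000 is not divisible by 256) is cropped; how the
    original work handles this remainder is not stated in the paper.
    """
    c, h, w = image.shape
    h_crop, w_crop = h - h % patch, w - w % patch
    x = image[:, :h_crop, :w_crop]
    x = x.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, nH, nW, patch, patch)
    return x.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)


tile = torch.randn(3, 6000, 6000)     # one RGB Potsdam tile
patches = to_patches(tile)
print(patches.shape)                   # torch.Size([529, 3, 256, 256])
```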

4.1.2. LoveDA Dataset

The LoveDA dataset [66] offers a diverse collection of 5987 high-resolution optical remote sensing images, each with a spatial resolution of 0.3 meters and a fixed size of 1024 × 1024 pixels. It covers seven land use and land cover (LULC) categories: buildings, roads, water, barren land, forest, agricultural areas, and background. The dataset is divided into 2522 training images, 834 validation images, and 835 test images, following the official protocol.

What makes LoveDA particularly challenging is its scene-level diversity—it contains both urban and rural environments, sourced from three representative Chinese cities: Nanjing, Changzhou, and Wuhan. These variations introduce complexities such as scale diversity, ambiguous class boundaries, and uneven category distributions. The dataset is well-suited for evaluating semantic segmentation models under real-world variability. A representative sample from the dataset is shown in Figure 5.

4.2. Implementation Details

All experiments were conducted using the PyTorch deep learning library, running on a Linux-based environment equipped with a single NVIDIA A40 GPU (48 GB memory). The complete training and evaluation pipeline, including our proposed GPRNet and competing semantic segmentation models, followed a consistent experimental setup summarized in Table 1. Unless otherwise specified, the same training protocol was applied to both datasets and to all compared models to ensure a fair and reproducible benchmark.

To improve generalization and mitigate overfitting, standard data augmentation strategies—including random horizontal flipping and random cropping—were applied during training. For both Potsdam and LoveDA, input images were cropped into 256 × 256 patches during training, while full-size predictions on the test sets were obtained by tiling and stitching the outputs without overlap. All models were trained using the Adam optimizer, with an initial learning rate set to 0.002 and decayed according to a polynomial schedule. The maximum number of training epochs was set to 500, and a batch size of 32 was used. The best-performing model checkpoints were selected based on the minimum validation loss observed during training. We did not employ dataset-specific hyperparameter tuning for GPRNet, so that the gains reported in the following sections are attributable to the proposed geometric modules rather than extensive parameter search.
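For reproducibility, the training protocol of Table 1 roughly corresponds to the following PyTorch setup. The polynomial decay exponent and the per-epoch scheduler stepping are not stated in the paper and are therefore assumptions of this sketch.

```python
import torch
import torch.nn as nn


def build_training_setup(model: nn.Module, max_epochs: int = 500):
    """Optimizer/scheduler/loss sketch matching Section 4.2 (poly power 0.9 is an assumption)."""
    criterion = nn.CrossEntropyLoss()                  # same loss used for all compared models
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
    scheduler = torch.optim.lr_scheduler.LambdaLR(     # polynomial decay of the base learning rate
        optimizer, lr_lambda=lambda epoch: max(0.0, 1 - epoch / max_epochs) ** 0.9)
    return criterion, optimizer, scheduler


# Typical epoch loop (batch size 32, 256x256 crops, random flips applied in the dataloader).
# Note that CrossEntropyLoss expects raw logits, i.e., it is applied before the softmax of Eq. (4).
# for images, labels in train_loader:
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
# scheduler.step()  # once per epoch
```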

For fair benchmarking, we compared GPRNet with eleven strong baselines encompassing CNN-based, attention-augmented, and transformer-based segmentation architectures. These include U-Net [29], DeepLab V3+ [30], DANet [42], ResUNet-a [31], D-LinkNet [32], RAANet [48], MACU-Net [49], SCAttNet [45], SUAS [51], RSAM-Seg [63], and UM2Former [60]. All models were re-implemented or adapted based on publicly available official repositories or reproduction protocols to ensure consistency in input resolution, loss function, and training regimen. In particular, we adopted identical data augmentation, optimization settings, and training schedules for GPRNet and all baselines, and we used the same cross-entropy loss to avoid introducing confounding factors across methods.

4.3. Evaluation Metrics

To comprehensively assess the effectiveness of our proposed model and competing baselines, we employ four widely adopted metrics in the field of semantic segmentation: the class-wise F1 score, average F1 score (AF), overall accuracy (OA), and mean intersection over union (mIoU). These indicators collectively reflect both pixel-level correctness and region-level segmentation quality.

The F1 score, defined as the harmonic mean of precision and recall, evaluates the balance between false positives and false negatives for each individual class. The AF is obtained by taking the mean across all semantic classes. OA represents the ratio of correctly predicted pixels to the total number of pixels, offering a global view of classification performance. Meanwhile, mIoU captures the average overlap between predicted and ground truth regions across all classes, and is often considered a primary benchmark metric in segmentation tasks.

Formally, these metrics are computed using the following expressions:

(19) $F1 = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$,

(20) $\mathrm{OA} = \dfrac{TP + TN}{TP + TN + FP + FN}$,

(21) $\mathrm{IoU} = \dfrac{TP}{TP + FP + FN}$,

where precision and recall are defined as:

(22) $\mathrm{Precision} = \dfrac{TP}{TP + FP}$,

(23) $\mathrm{Recall} = \dfrac{TP}{TP + FN}$,

and TP, TN, FP, and FN refer to the number of true positives, true negatives, false positives, and false negatives, respectively, for a given class. The class-wise F1 and IoU values are first computed for each land cover category, and then the average F1 score and mean IoU are obtained as

(24) $\mathrm{AF} = \dfrac{1}{K} \sum_{k=1}^{K} F1^{(k)}$,

(25) $\mathrm{mIoU} = \dfrac{1}{K} \sum_{k=1}^{K} \mathrm{IoU}^{(k)}$,

where $K$ denotes the number of semantic classes, and $F1^{(k)}$ and $\mathrm{IoU}^{(k)}$ denote the F1 score and IoU for class $k$, respectively. These metrics collectively offer a robust and multi-faceted evaluation of segmentation accuracy, structural consistency, and classification quality.
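All of these quantities can be computed from a single per-class confusion matrix; the sketch below illustrates Equations (19)–(25) for integer label maps (ignore-index handling and other dataset-specific details are omitted).

```python
import torch


def segmentation_metrics(pred: torch.Tensor, target: torch.Tensor, num_classes: int):
    """Per-class F1/IoU plus AF, OA, and mIoU (Eqs. 19-25) from integer label maps."""
    conf = torch.zeros(num_classes, num_classes, dtype=torch.long)
    idx = target.flatten() * num_classes + pred.flatten()
    conf += torch.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

    tp = conf.diag().float()
    fp = conf.sum(dim=0).float() - tp           # predicted as class k but labelled otherwise
    fn = conf.sum(dim=1).float() - tp           # labelled class k but predicted otherwise
    precision = tp / (tp + fp).clamp(min=1)
    recall = tp / (tp + fn).clamp(min=1)
    f1 = 2 * precision * recall / (precision + recall).clamp(min=1e-12)   # Eq. (19)
    iou = tp / (tp + fp + fn).clamp(min=1)                                 # Eq. (21)
    oa = tp.sum() / conf.sum().float()                                     # Eq. (20), multi-class form
    return {"F1": f1, "IoU": iou, "AF": f1.mean(), "OA": oa, "mIoU": iou.mean()}


metrics = segmentation_metrics(torch.randint(0, 6, (2, 256, 256)),
                               torch.randint(0, 6, (2, 256, 256)), num_classes=6)
print(metrics["mIoU"], metrics["OA"])
```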

4.4. Compared with State-of-the-Art Models

4.4.1. Numerical Evaluation of ISPRS Potsdam

Table 2 reports the quantitative comparison on the ISPRS Potsdam dataset. Compared with traditional CNN-based models, GPRNet exhibits clear improvements on most semantic classes. For instance, relative to DeepLabV3+ and DANet, our method achieves notable gains in the Car and Low vegetation categories, which are typically challenging due to small object scale or low intra-class variance. This behavior is consistent with the design goal of explicitly modeling geometric priors and structurally modulating feature responses.

When compared with attention-enhanced CNN methods such as SCAttNet and SUAS, GPRNet further improves the F1-scores for all classes. Notably, the F1-score for the Building class reaches 98.01%, suggesting that the proposed structural calibration mechanisms are beneficial for preserving man-made object boundaries. Likewise, the improvements on the Tree and Car categories indicate an enhanced ability to capture both edge precision and object localization, which we attribute to the complementary effects of GPRB and MCFM.

Transformer-based models like RSAM-Seg and UM2Former already exhibit strong performance due to their global context modeling, yet GPRNet still attains the highest class-wise F1-scores and mIoU (82.32%) among all compared methods, while maintaining AF and OA at a comparable level. These results suggest that incorporating lightweight and interpretable geometric priors into the representation learning process can effectively enrich semantic consistency without sacrificing fine-grained details.

Overall, GPRNet achieves competitive or superior performance under all reported metrics on Potsdam, and establishes a new state-of-the-art in terms of mIoU within the considered benchmark setting. The integration of structure-aware guidance and geometry-informed feature fusion leads to robust improvements in semantic segmentation, especially in complex urban scenes with mixed object scales and cluttered layouts.

4.4.2. Visual Inspections of ISPRS Potsdam

To further evaluate the qualitative performance of our GPRNet, we present visual comparisons on the ISPRS Potsdam dataset, as shown in Figure 6. Each row depicts one representative sample, where we compare the segmentation outputs of different state-of-the-art methods with the ground truth. As illustrated in Figure 6, our GPRNet demonstrates clear advantages in capturing fine-grained boundaries and reducing misclassifications, particularly for narrow structures (e.g., cars, roads) and edge areas. For example, in the first row, several baseline models—such as U-Net [29] and DeepLab V3+ [30]—tend to blur the contours between impervious surfaces and buildings. DANet [42] and ResUNet-a [31] introduce improved boundary alignment, but still exhibit noticeable label leakage in the car regions.

More recent transformer-based models, such as RSAM-Seg [63] and UM2Former [60], better retain object shape and contextual coherence, yet may still produce slightly over-smoothed predictions along thin or intricately shaped regions. In contrast, GPRNet leverages geometric priors and mutual-calibrated fusion to achieve sharper delineation of object boundaries and more consistent segmentation across heterogeneous regions. Notably, it more reliably distinguishes visually similar classes (e.g., low vegetation vs. tree), and reduces spurious clutter predictions present in other methods, although occasional confusions may still occur in extremely small or heavily occluded structures.

Taken together, the visual results qualitatively support the quantitative trends in Table 2 and highlight the benefits of explicitly incorporating geometric priors for structure-preserving land cover mapping.

4.4.3. Numerical Evaluation of LoveDA

The performance comparison on the LoveDA dataset is summarized in Table 3. Overall, GPRNet achieves strong performance and attains the highest class-wise F1-scores across all semantic categories, consistently outperforming both CNN- and transformer-based competitors at the category level. For example, in the Building, Road, and Agriculture categories, GPRNet shows noticeable improvements, indicating its capability to model large man-made structures and adapt to irregular shapes of farmland in rural scenes.

Compared with SCAttNet and SUAS, our method further boosts the F1-scores in the Barren and Forest classes, where complex textures and ambiguous boundaries often lead to misclassifications. This reinforces the effectiveness of the GPRB module in extracting directional edge information and emphasizes the role of MCFM in cross-scale feature calibration.

When compared with RSAM-Seg and UM2Former—two recent transformer-based baselines that already offer strong global context modeling—GPRNet still achieves the best mIoU (63.61%) and improves the F1-score for each individual class, while keeping AF and OA at a comparable level. These results suggest that geometric prior modeling not only enriches local structural cues but also complements global semantic consistency in a challenging benchmark characterized by diverse urban and rural scenes.

In summary, GPRNet demonstrates robust segmentation capabilities under complex urban and rural scenarios with varied object scales and uneven class distributions. The gains on LoveDA are moderate but consistent, indicating that explicitly geometry-aware design is beneficial while leaving room for further improvements, especially in extremely heterogeneous or highly imbalanced regions.

4.4.4. Visual Inspections of LoveDA

To further assess the qualitative effectiveness of the proposed GPRNet in complex real-world scenarios, we present representative visual comparisons on the LoveDA dataset in Figure 7. The visualizations include diverse rural and urban scenes that pose challenges such as class imbalance, fine-grained structure segmentation, and complex inter-class confusion.

From the visualization results, we observe that conventional CNN-based models (e.g., U-Net [29], DeepLab V3+ [30]) suffer from considerable misclassifications, especially in distinguishing between vegetation, forest, and agricultural regions, largely due to their limited global context modeling. DANet [42] and ResUNet-a [31] moderately alleviate these issues via enhanced context aggregation, yet they often fail to suppress noisy predictions in complex rural areas.

Recent attention-augmented methods like MACU-Net [49] and SCAttNet [45] show better structure preservation but still misidentify barren areas or small water bodies. Transformer-based RSAM-Seg [63] and UM2Former [60] demonstrate stronger class separation and contextual consistency; however, their results can be over-smooth and may neglect small-scale details, such as narrow roads or isolated background pixels.

In contrast, GPRNet yields more coherent and precise segmentation masks, accurately capturing both large-scale semantic layouts and fine-grained object boundaries. It more effectively handles the intra-class variability within vegetation-related classes and maintains the integrity of linear structures like roads, while also producing fewer spurious predictions in barren and background regions. Nonetheless, some residual confusion persists in highly cluttered or severely imbalanced areas, which we further analyze in the discussion section. Overall, the qualitative results on LoveDA corroborate the quantitative improvements in Table 3 and underscore the advantages of combining geometric priors with cross-stage semantic calibration in challenging LULC scenarios.

4.5. Effects of GPRB

To assess the overall contribution of the proposed GPRB, we first conduct a macro-level ablation study by replacing GPRB with alternative modules while keeping the rest of the GPRNet framework unchanged, including MCFM and the encoder–decoder backbone. Table 4 presents the comparative results on the Potsdam and LoveDA datasets. Specifically, we consider the following design variants:

Conv Block: A standard residual convolutional structure, composed of a 1×1 convolution, a 3×3 convolution, and another 1×1 convolution, followed by an identity skip connection.

MHSA Block: A standard multi-head self-attention module that models long-range dependencies but lacks explicit geometric priors.

GPRB (ours): Our proposed module that integrates learnable directional derivatives and structure-aware modulation, enhancing edge and texture representation through geometric prior guidance.

From Table 4, we observe that while both Conv Block and MHSA Block provide reasonable baselines, GPRB consistently achieves the best AF, OA and mIoU on both datasets. Specifically, the Conv Block serves as a strong CNN baseline but lacks the ability to model long-range or structure-sensitive information. The MHSA Block improves on this by leveraging global context modeling; however, it remains less effective at preserving local geometric details, especially in high-resolution scenes. In contrast, GPRB achieves the best results, benefiting from the explicit modeling of edge significance and orientation through the geometric-prior mechanism. The improvements over MHSA Block are relatively moderate but consistent (e.g., +1.01% mIoU on LoveDA and +1.36% mIoU on Potsdam), indicating that geometric priors and self-attention are complementary and that explicitly encoding geometric cues can further enhance a strong context modeling baseline.

In terms of model efficiency, the Conv Block achieves the lowest parameter count (31.02 MB) and computational cost (88.4 GFLOPs), yet its accuracy lags significantly behind due to its limited context representation capacity. MHSA Block increases the FLOPs to 107.2 G while improving performance, but still falls short in challenging edge-aware segmentation scenarios. The proposed GPRB module introduces only a marginal overhead in parameters (35.12 MB) and computation (110.8 G), yet delivers the highest accuracy across both datasets. This demonstrates that GPRB provides a favorable trade-off between structural awareness and computational efficiency. A more fine-grained analysis of the internal components and deployment strategies of GPRB is presented in Section 4.6, where we further disentangle the roles of geometric magnitude, orientation, and learnable gradient operators.

Beyond quantitative metrics, we also examine how different variants influence the internal feature responses of GPRB through qualitative comparisons of feature activation maps. As shown in Figure 8, the visualized results include both the final segmentation predictions and the corresponding intermediate activation maps for three representative variants: (i) the standard convolutional block (Conv Block), (ii) the multi-head self-attention block (MHSA Block), and (iii) our GPRB. The second row of each group shows the feature activation maps extracted before the final decoder output, offering insight into how each module guides the network’s spatial attention.

From the activation maps, we observe that the Conv Block mainly responds to coarse texture regions, lacking clear edge awareness or structural focus. Its responses are diffused and show limited localization capability. The MHSA Block captures global patterns but tends to dilute fine-grained details, especially in the presence of small or narrow objects such as roads and buildings; its activations appear more scattered and exhibit weaker alignment with object boundaries. In contrast, our GPRB exhibits strong activations along object boundaries and geometric structures, with sharper responses in transition zones and less background interference. This validates that the incorporation of directional derivatives and geometric priors enhances the model’s ability to preserve spatial consistency and delineate fine object contours.

These visual observations reinforce the quantitative results presented above, indicating that GPRB not only improves semantic segmentation accuracy, but also encourages more interpretable and structure-aware feature representations compared with purely convolutional or attention-based alternatives.

4.6. Extended Analysis of Geometric Priors

To gain deeper insights into the geometric prior design, we further perform an extended ablation study that disentangles (i) the internal components of the geometric branch, (ii) the choice of gradient operators, and (iii) the deployment of GPRB across different encoder–decoder stages. Unless otherwise stated, all experiments follow the same training protocol as described in Section 4.2, and are conducted within the GPRNet framework with identical optimization settings.

4.6.1. Decomposition of Geometric Branch Components

We first analyze the contribution of different components inside the geometric branch of GPRB. Specifically, we consider the following four variants:

Conv-only: GPRB without the geometric branch, i.e., only the residual convolutional branch is used (equivalent to “Conv Block” in Table 4).

Mag-only (M): the attention map Ageo is generated solely from the magnitude map M, without using the orientation θ.

Ori-only (θ): the attention map is generated solely from the orientation map θ, without using the magnitude M.

Mag+Ori (M+θ): the full GPRB, where both M and θ are concatenated and fed into the 1×1 convolution to generate Ageo (our default design).

Table 5 summarizes the results on Potsdam and LoveDA.

Several observations can be made from Table 5. First, introducing even a single geometric cue (Mag-only or Ori-only) leads to a substantial improvement over the Conv-only baseline on both datasets, confirming that explicit geometric modulation is beneficial for land cover segmentation. Second, magnitude-based attention (Mag-only) slightly outperforms orientation-only attention (Ori-only), which is reasonable as edge strength is more directly correlated with object boundaries, while orientation mainly encodes directional continuity. Third, combining both magnitude and orientation (Mag+Ori) yields the best AF, OA and mIoU on Potsdam and LoveDA, indicating that M and θ capture complementary aspects of geometric structure (edge saliency and directional pattern), and that their joint modeling is crucial for fully exploiting geometric priors.

4.6.2. Fixed vs. Learnable Gradient Operators

To further validate the design choice of learnable directional derivatives, we compare GPRB variants that differ only in the way Gx and Gy are defined:

Fixed Sobel: Gx and Gy are fixed Sobel kernels and are not updated during training.

Fixed Canny: edge maps are precomputed by a Canny detector and used as a binary magnitude prior; the orientation channel is derived from the Canny gradient direction and kept fixed.

Learnable Gx/Gy: Gx and Gy are initialized with Sobel-like filters but trained jointly with the network (our default).

All variants share the same backbone and training protocol; only the gradient operator is changed. The results are reported in Table 6.

The fixed Sobel and fixed Canny variants already perform competitively, confirming that classical edge detectors can provide useful priors for high-resolution RSI segmentation. However, the learnable Gx/Gy design consistently achieves the best performance on both datasets, with improvements of about 0.7–1.1% mIoU over fixed operators. This suggests that allowing the gradient filters to adapt to dataset-specific structures (e.g., building footprints with non-rectangular shapes, complex road networks, and heterogeneous vegetation) is beneficial, while still leveraging the inductive bias provided by Sobel-like initialization.
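In implementation terms, the difference between the fixed and learnable variants reduces to whether the gradient kernels participate in optimization. The minimal sketch below illustrates this toggle, assuming the Sobel-initialized convolutions gx and gy from the GPRB sketch in Section 3.2 (attribute names are hypothetical); the fixed Canny variant, which precomputes edge and orientation maps offline, is not shown.

```python
import torch.nn as nn


def set_gradient_operators(gx: nn.Conv2d, gy: nn.Conv2d, learnable: bool = True) -> None:
    """Toggle between the 'Fixed Sobel' and 'Learnable Gx/Gy' ablation variants."""
    for conv in (gx, gy):
        conv.weight.requires_grad_(learnable)   # frozen kernels keep their Sobel initialization


# Usage with the GPRB sketch from Section 3.2 (hypothetical attribute names):
# gprb = GPRB(64)
# set_gradient_operators(gprb.gx, gprb.gy, learnable=False)   # Fixed Sobel variant
```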

4.6.3. Layer-Wise Deployment of GPRB

Finally, we investigate where GPRB is most effective within the encoder–decoder hierarchy. Let $\{E_1, \ldots, E_4\}$ denote the four encoder stages from shallow to deep, and $\{D_1, \ldots, D_4\}$ the corresponding decoder stages. We consider the following configurations:

Shallow encoder only: GPRB is applied at $E_1$ and $E_2$; the other stages use the Conv Block.

Deep encoder only: GPRB is applied at $E_3$ and $E_4$; the other stages use the Conv Block.

Encoder only: GPRB is applied at all encoder stages $E_1$–$E_4$, and the Conv Block is used in the decoder.

Decoder only: the Conv Block is used in the encoder, and GPRB is applied at all decoder stages $D_1$–$D_4$.

All stages: GPRB is applied at both encoder and decoder stages (our default GPRNet).

The comparison is summarized in Table 7.

Several trends emerge. First, deploying GPRB only at shallow encoder stages (Shallow encoder only) already improves performance over the Conv-only baseline by enhancing low-level edge awareness, but the gains are limited because high-level semantic abstraction is still governed by purely convolutional features. Second, placing GPRB at deep encoder stages (Deep encoder only) brings larger improvements, highlighting that geometric priors are particularly helpful when the network is forming high-level representations that must remain sensitive to object shapes and boundaries. Third, using GPRB in the entire encoder (Encoder only) yields further gains, suggesting that propagating geometry-aware features through the whole encoding path is beneficial. Fourth, equipping only the decoder with GPRB (Decoder only) also improves performance, but is slightly less effective than the Encoder only setting, likely because geometric information injected late in the decoding process cannot fully compensate for the loss of structural cues during early abstraction. Finally, applying GPRB at both encoder and decoder stages (All stages) achieves the best performance on both datasets, confirming that geometric priors are most effective when consistently integrated throughout the representation learning and reconstruction pipeline.

4.7. Effects of MCFM

To evaluate the contribution of the proposed MCFM, we conduct a series of controlled ablation experiments by replacing MCFM with several representative fusion strategies while keeping all other architectural components, including GPRB and the encoder–decoder backbone, unchanged. The compared variants include: (1) simple element-wise addition (w/o MCFM), (2) residual-style summation (Sum Fusion), and (3) concatenation followed by a 1×1 convolution (Concat Fusion). These variants reflect commonly used fusion mechanisms in encoder–decoder segmentation networks and therefore provide a meaningful baseline for assessing the effectiveness of MCFM. The results on the ISPRS Potsdam and LoveDA datasets are shown in Table 8.
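
For reference, the three baseline fusion strategies can be written compactly as follows. Concat Fusion follows the concatenation-plus-1×1-convolution description above; the 1×1 projection inside Sum Fusion is an assumption made for illustration.

```python
import torch
import torch.nn as nn

def add_fusion(enc, dec):
    """'w/o MCFM': naive element-wise addition of encoder and decoder features."""
    return enc + dec

class SumFusion(nn.Module):
    """Residual-style summation baseline (projection is an illustrative assumption)."""
    def __init__(self, ch):
        super().__init__()
        self.proj = nn.Conv2d(ch, ch, kernel_size=1)
    def forward(self, enc, dec):
        return dec + self.proj(enc)

class ConcatFusion(nn.Module):
    """Concatenation followed by a 1x1 convolution (baseline in Table 8)."""
    def __init__(self, ch):
        super().__init__()
        self.proj = nn.Conv2d(2 * ch, ch, kernel_size=1)
    def forward(self, enc, dec):
        return self.proj(torch.cat([enc, dec], dim=1))
```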

From Table 8, several observations can be made. First, removing MCFM entirely and adopting naïve element-wise addition leads to clear performance degradation on both datasets. In particular, AF decreases by 2.16% on Potsdam and 1.94% on LoveDA, demonstrating that encoder and decoder features cannot be directly blended without addressing cross-stage inconsistencies. These inconsistencies arise from differences in spatial semantics and geometric patterns between low-level detailed features and high-level abstract representations, which motivates the need for an explicit calibration mechanism.

Both Sum Fusion and Concat Fusion introduce additional learnable parameters and modestly improve over the naïve addition strategy. However, their gains remain limited because they do not explicitly resolve feature misalignment or semantic ambiguity. Summation-based fusion lacks adaptive weighting, while concatenation relies solely on channel mixing without spatially aware calibration. As a result, these strategies struggle to preserve fine boundaries and small-object structures, particularly in heterogeneous regions of Potsdam and in the rural–urban transitions of LoveDA.

In contrast, the proposed MCFM achieves the highest AF, OA and mIoU on both datasets. Compared with the Concat Fusion baseline, MCFM improves mIoU by +1.25% on Potsdam and +1.04% on LoveDA. These margins, though modest in absolute value, are consistent across all metrics and datasets, indicating that MCFM provides reliable and stable fusion rather than case-dependent improvements. The performance gain can be attributed to two core mechanisms: (i) CSGA performs cross-stage geometric alignment by leveraging mutual attention between encoder and decoder features, reducing spatial discrepancies; (ii) SEF applies semantic-aware weighting to adaptively balance geometric detail and contextual abstraction.
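
To make these two mechanisms concrete, the following highly simplified sketch mirrors their roles: mutual spatial re-weighting between encoder and decoder features stands in for CSGA, and a global semantic gate stands in for SEF. All module internals here are assumptions made for illustration and do not reproduce the actual MCFM.

```python
import torch
import torch.nn as nn

class MutualCalibrationSketch(nn.Module):
    """Simplified sketch of MCFM's two ideas (not the authors' implementation):
    (i) CSGA-like mutual calibration, (ii) SEF-like semantic gating."""
    def __init__(self, ch):
        super().__init__()
        self.enc_gate = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
        self.dec_gate = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())
        self.sem_gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                      nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())
        self.out = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, enc, dec):
        # (i) Cross-stage calibration: each stream is re-weighted by a map from the other.
        enc_c = enc * self.dec_gate(dec)
        dec_c = dec * self.enc_gate(enc)
        fused = torch.cat([enc_c, dec_c], dim=1)
        # (ii) Semantic-aware channel weighting of the fused representation.
        return self.out(fused) * self.sem_gate(fused)
```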

Regarding computational overhead, MCFM introduces only 1.38 MB more parameters and approximately 10 GFLOPs compared to simpler fusion schemes. Given the consistent improvements across all metrics, this additional cost represents a favorable trade-off for dense prediction tasks on high-resolution imagery. Moreover, MCFM preserves the plug-and-play nature of fusion modules and can be seamlessly integrated into other encoder–decoder architectures.

Overall, these experiments confirm that MCFM effectively mitigates cross-stage feature misalignment and enhances the interaction between geometric details and semantic abstractions, ultimately improving segmentation robustness in complex remote sensing scenes.

4.8. Discussions

The experimental results across both quantitative and qualitative evaluations provide deeper insights into the behavior, advantages, and remaining challenges associated with the proposed GPRNet architecture. The superior performance on the ISPRS Potsdam benchmark demonstrates the effectiveness of explicitly modeling geometric cues in dense urban environments, where object boundaries, roof outlines, and narrow man-made structures dominate the scene. Likewise, the consistent improvements on the LoveDA dataset illustrate that geometric priors can generalize to rural and mixed landscapes, even under high intra-class variability and irregular texture patterns.

A key observation is that GPRB and MCFM address two complementary aspects of encoder–decoder segmentation models. GPRB enhances the representational capacity of both encoder and decoder by introducing learnable directional gradients that reinforce structurally salient regions, thereby alleviating the common over-smoothing effects found in convolutional or attention-only architectures. On the other hand, MCFM mitigates cross-stage inconsistencies by performing geometry-aware mutual calibration and semantic-aware adaptive fusion, enabling stable feature interactions across different abstraction levels. These two modules jointly contribute to the improved delineation of fine-scale objects such as cars and road boundaries in Potsdam, as well as class separation among vegetation-related categories in LoveDA.

Despite these strengths, our experiments also reveal several limitations. First, classes dominated by homogeneous textures or weak geometric patterns (e.g., water and background in LoveDA) benefit less from geometric priors, suggesting that the effectiveness of GPRB depends partially on boundary richness. Second, while the computational overhead of GPRB and MCFM is relatively modest, it is still non-negligible when compared to lightweight CNN designs. This may limit the applicability of GPRNet in extremely resource-constrained deployment scenarios unless further model compression is considered. Third, although cross-entropy loss ensures fair comparison with existing baselines, it may underutilize the full potential of geometric priors; loss functions tailored to boundary quality (e.g., boundary IoU, Lovász loss or distance-transform losses) could further enhance GPRB’s effectiveness. We opted not to introduce additional losses in order to isolate the contribution of our modules, but this represents a promising future direction.
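
As one example of such a boundary-tailored objective, the sketch below up-weights the cross-entropy loss on pixels adjacent to label boundaries, detected with a simple one-hot dilation. This is only an illustrative alternative and is not the loss used in our experiments.

```python
import torch
import torch.nn.functional as F

def boundary_weighted_ce(logits, target, num_classes, boundary_weight=2.0, ignore_index=255):
    """Cross-entropy with extra weight on pixels near label boundaries (illustrative)."""
    ce = F.cross_entropy(logits, target, ignore_index=ignore_index, reduction="none")
    valid = (target != ignore_index)
    onehot = F.one_hot(target.clamp(0, num_classes - 1), num_classes).permute(0, 3, 1, 2).float()
    # A pixel lies on a boundary if a 3x3 dilation of any class mask differs from the mask itself.
    dilated = F.max_pool2d(onehot, kernel_size=3, stride=1, padding=1)
    boundary = ((dilated - onehot).sum(dim=1) > 0) & valid
    weights = torch.ones_like(ce)
    weights[boundary] = boundary_weight
    weights = weights * valid
    return (ce * weights).sum() / weights.sum().clamp_min(1.0)
```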

Finally, the interpretability of GPRNet—evidenced by activation maps that align with true geometric structures—underscores the viability of incorporating physically motivated priors into modern segmentation architectures. Future research may explore multi-scale geometric reasoning, adaptive gradient filters for multispectral data, or extending cross-stage calibration to temporal or multimodal inputs. Overall, the findings from our experimental analysis emphasize that geometric priors are a valuable and complementary component in improving fine-grained segmentation of high-resolution remote sensing imagery.

5. Conclusions

In this work, we introduced GPRNet, a geometry-aware semantic segmentation framework tailored for high-resolution remote sensing imagery. The proposed Geometric Prior-Refined Block (GPRB) enhances feature representation by integrating learnable directional derivatives and explicit geometric priors, enabling improved boundary localization and structural consistency. Complementarily, the Mutual Calibrated Fusion Module (MCFM) addresses cross-stage feature misalignment through geometric alignment and semantic-aware adaptive fusion, leading to more stable interactions between encoder and decoder features.

Extensive experiments on the ISPRS Potsdam and LoveDA datasets demonstrate that GPRNet consistently outperforms strong CNN-, attention-, and transformer-based baselines across a range of metrics. Ablation studies further confirm not only the individual contributions of GPRB and MCFM, but also their synergistic effect in balancing global semantic reasoning and fine-grained geometric precision. Notably, GPRNet achieves these improvements with moderate parameter and computational overhead, reinforcing its suitability for high-resolution remote sensing tasks.

Beyond accuracy improvements, the explicit inclusion of geometric priors enhances interpretability and supports robust generalization across diverse landscapes. Although the current study adopts a standard cross-entropy loss to ensure fair comparison with existing baselines, integrating boundary-aware or geometry-sensitive loss functions may further amplify the strengths of GPRB, particularly in fine-scale delineation. Future work will explore extending the framework to cross-domain, few-shot, and multimodal segmentation scenarios (e.g., DSM, SAR, multispectral imagery), as well as developing more compact model variants suitable for resource-constrained deployments.

Overall, this study demonstrates that combining geometric priors with modern encoder–decoder architectures offers a principled and effective way to advance fine-grained land-cover segmentation. The findings provide a strong foundation for future developments in geometry-aware representation learning and contribute to the broader goal of achieving reliable and interpretable high-resolution remote sensing analysis.

Author Contributions

Conceptualization, Z.L. and X.L.; methodology, Z.L., Z.X., R.X. and X.L.; software, Z.X., J.S. and R.M.; validation, Z.L., Z.X. and L.C.; formal analysis, Z.L., Z.X., R.X. and X.L.; investigation, Z.L., R.X., J.S. and R.M.; resources, X.L., D.L. and L.C.; data curation, Z.X., J.S. and L.C.; writing—original draft preparation, Z.L., Z.X. and R.X.; writing—review and editing, X.L., Z.L. and D.L.; visualization, Z.X., J.S. and R.M.; supervision, X.L. and D.L.; project administration, X.L. and L.C.; funding acquisition, X.L. and D.L. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Public datasets were used in this paper. The download links are: [https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx], accessed on 12 December 2023 and [https://github.com/Junjue-Wang/LoveDA], accessed on 12 December 2023. The source code and trained models will be made available upon request from the first author.

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1 The overall architecture of the proposed GPRNet. It integrates GPRBs within the encoder and decoder stages to inject structural priors during feature learning, and employs MCFMs to guide geometry-aware feature alignment across scales.

Figure 2 Details of the GPRB. It integrates a standard convolutional branch with a geometric-prior guided modulation branch, where gradient magnitude M and normalized orientation θ are used to generate the attention map Ageo for structure-aware feature refinement.

Figure 3 Architecture of the Mutual Calibrated Fusion Module (MCFM). It includes a CSGA block for cross-stage alignment and an SEF block for semantic-aware adaptive fusion of encoder and decoder features.

Figure 4 Example image from the ISPRS Potsdam dataset.

Figure 5 Example image from the LoveDA dataset.

Figure 6 Visual comparisons on the ISPRS Potsdam dataset. (a) Input image, (b) Ground truth, (c) U-Net [29], (d) DeepLab V3+ [30], (e) DANet [42], (f) ResUNet-a [31], (g) D-LinkNet [32], (h) RAANet [48], (i) MACU-Net [49], (j) SCAttNet [45], (k) SUAS [51], (l) RSAM-Seg [63], (m) UM2Former [60], (n) GPRNet (ours).

Figure 7 Visual comparisons on the LoveDA dataset. (a) Input image, (b) Ground truth, (c) U-Net [29], (d) DeepLab V3+ [30], (e) DANet [42], (f) ResUNet-a [31], (g) D-LinkNet [32], (h) RAANet [48], (i) MACU-Net [49], (j) SCAttNet [45], (k) SUAS [51], (l) RSAM-Seg [63], (m) UM2Former [60], (n) GPRNet (ours).

Figure 8 Visual comparison of segmentation results and feature activation maps for different GPRB variants. (a) Input image, (b) Ground truth, (c) GPRNet (ours), (d) Conv Block, (e) MHSA Block.

Training configuration and hyperparameter settings.

Configuration Item Value
Learning rate scheduler Polynomial decay
Initial learning rate 0.002
Loss function Cross-entropy
Optimizer Adam
Adam parameters (β1, β2) (0.9, 0.999)
Epochs 500
Batch size 32
Input patch size 256 × 256
GPU Memory 48 GB
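
The settings above translate into the following PyTorch sketch; the polynomial-decay exponent (0.9) is an assumption, as it is not specified in the table.

```python
import torch
import torch.nn as nn

def configure_training(model, epochs=500, base_lr=2e-3, power=0.9):
    """Training setup from the table: cross-entropy loss, Adam (0.9, 0.999),
    and polynomial learning-rate decay over 500 epochs (batch size 32, 256x256 patches)."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, betas=(0.9, 0.999))
    # Polynomial decay: lr(t) = base_lr * (1 - t / epochs) ** power  (power is an assumption).
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda epoch: (1.0 - epoch / epochs) ** power)
    return criterion, optimizer, scheduler
```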

Results on the ISPRS Potsdam dataset. Class-wise F1-score, AF, OA and mIoU are listed, where the bold text indicates the best results.

Methods Impervious Surfaces Building Low Vegetation Tree Car AF OA mIoU
U-Net [29] 87.51 89.32 73.98 87.03 48.19 77.21 75.83 70.02
DeepLab V3+ [30] 84.49 86.13 77.37 77.37 85.47 82.17 80.82 73.59
DANet [42] 86.96 92.04 80.14 79.95 89.36 85.69 83.99 77.13
ResUNet-a [31] 85.50 87.15 78.29 78.29 86.49 83.14 81.40 74.11
D-LinkNet [32] 86.20 88.01 79.30 79.10 87.90 84.10 82.20 75.03
RAANet [48] 90.40 95.70 87.20 81.80 77.60 86.54 85.21 77.95
MACU-Net [49] 89.16 92.76 86.80 83.02 79.25 86.20 85.90 77.99
SCAttNet [45] 91.87 96.90 85.24 87.05 92.78 90.77 89.06 80.83
SUAS [51] 92.10 96.95 86.12 87.89 93.22 91.26 89.61 81.32
RSAM-Seg [63] 92.85 97.32 87.40 88.25 93.76 91.92 90.10 81.92
UM2Former [60] 93.40 97.65 88.31 89.01 94.20 92.51 90.92 82.20
GPRNet (ours) 94.12 98.01 89.20 89.97 94.55 91.97 90.33 82.32

Results on the LoveDA dataset. Class-wise F1-score, AF, OA and mIoU are listed, where the bold text indicates the best results.

Methods Background Building Road Water Barren Forest Agriculture AF OA mIoU
U-Net [29] 50.21 54.74 56.38 77.12 18.09 48.93 66.05 53.07 51.81 47.84
DeepLabV3+ [30] 52.29 54.99 57.16 77.96 16.11 48.18 67.79 53.50 52.30 47.62
DANet [42] 54.47 61.02 63.37 79.17 26.63 52.28 70.02 58.14 54.64 50.18
ResUNet-a [31] 54.15 56.94 59.19 80.73 16.68 49.89 70.20 55.40 53.22 48.46
D-LinkNet [32] 55.02 58.10 60.22 80.91 17.70 50.83 71.33 56.59 54.01 49.35
RAANet [48] 55.02 62.19 65.58 81.03 29.25 54.11 74.07 60.18 58.95 53.93
MACU-Net [49] 59.16 64.08 66.73 81.01 32.23 55.81 75.79 62.12 59.65 54.16
SCAttNet [45] 65.95 71.88 77.04 86.61 50.79 61.19 82.00 70.78 67.31 61.09
SUAS [51] 66.71 72.40 78.02 86.89 52.26 62.10 83.17 71.79 67.95 61.90
RSAM-Seg [63] 67.40 73.32 79.08 87.01 53.82 63.21 84.20 72.86 68.70 62.71
UM2Former [60] 68.11 74.00 80.13 87.32 54.39 64.05 84.89 73.70 69.58 63.17
GPRNet (ours) 68.95 75.12 81.02 88.21 55.60 65.43 85.73 72.15 68.75 63.61

Ablation study of GPRB on Potsdam and LoveDA datasets. All models adopt the GPRNet framework. Results are reported as AF/OA (%)/mIoU (%).

Framework GPRB Variant Potsdam (AF/OA/mIoU) LoveDA (AF/OA/mIoU) Params (MB) FLOPs (G)
GPRNet Conv Block 83.76/81.65/74.38 55.96/53.81/49.23 31.02 88.4
MHSA Block 90.12/88.60/80.96 71.15/67.58/61.04 34.60 107.2
GPRB (ours) 91.97/90.33/82.32 72.15/68.75/63.61 35.12 110.8

Component-wise ablation of the geometric branch in GPRB. Results are reported as AF/OA (%)/mIoU (%). All variants are evaluated within the GPRNet framework with fixed training settings.

Variant Description Potsdam (AF/OA/mIoU) LoveDA (AF/OA/mIoU)
Conv-only No geometric branch 83.76/81.65/74.38 55.96/53.81/49.23
Mag-only (M) Ageo from M only 91.02/89.74/81.43 71.32/67.91/62.01
Ori-only (θ) Ageo from θ only 90.56/89.21/80.98 71.01/67.64/61.73
Mag + Ori (M+θ) Full GPRB (ours) 91.97/90.33/82.32 72.15/68.75/63.61

Ablation on gradient operators in GPRB. Results are reported as AF/OA(%)/mIoU(%).

Gradient Operator Potsdam (AF/OA/mIoU) LoveDA (AF/OA/mIoU)
Fixed Sobel 91.10/89.82/81.61 71.56/68.12/62.47
Fixed Canny 90.88/89.60/81.22 71.23/67.96/62.11
Learnable Gx/Gy (ours) 91.97/90.33/82.32 72.15/68.75/63.61

Layer-wise ablation of GPRB deployment. Results are reported as AF/OA(%)/mIoU(%).

GPRB Placement Potsdam (AF/OA/mIoU) LoveDA (AF/OA/mIoU)
Shallow encoder only (E1–E2) 90.95/89.44/80.95 71.02/67.66/62.05
Deep encoder only (E3–E4) 91.33/89.81/81.57 71.46/68.10/62.37
Encoder only (E1–E4) 91.58/89.97/81.89 71.83/68.41/63.12
Decoder only (D1–D4) 91.21/89.65/81.42 71.37/67.96/62.53
All stages (E1–E4, D1–D4) 91.97/90.33/82.32 72.15/68.75/63.61

Ablation study of MCFM on Potsdam and LoveDA datasets. Results are reported as AF/OA(%)/mIoU(%).

Framework Fusion Strategy Potsdam (AF/OA/mIoU) LoveDA (AF/OA/mIoU) Params (MB) FLOPs (G)
GPRNet w/o MCFM 89.81/88.43/80.47 70.21/67.07/61.89 33.74 96.5
Sum Fusion 90.54/89.11/81.20 71.02/67.89/62.43 34.21 99.6
Concat Fusion 90.47/88.96/81.07 71.15/68.03/62.57 34.46 101.4
MCFM (ours) 91.97/90.33/82.32 72.15/68.75/63.61 35.12 110.8

References

1. Gong, P.; Wang, J.; Huang, H. Stable classification with limited samples in global land cover mapping: Theory and experiments. Sci. Bull.; 2024; 69, pp. 1862-1865. [DOI: https://dx.doi.org/10.1016/j.scib.2024.03.040]

2. Roy, S.K.; Sukul, A.; Jamali, A.; Haut, J.M.; Ghamisi, P. Cross hyperspectral and LiDAR attention transformer: An extended self-attention for land use and land cover classification. IEEE Trans. Geosci. Remote Sens.; 2024; 62, 5512815. [DOI: https://dx.doi.org/10.1109/TGRS.2024.3374324]

3. Jiang, S.; Lin, H.; Ren, H.; Hu, Z.; Weng, L.; Xia, M. Mdanet: A high-resolution city change detection network based on difference and attention mechanisms under multi-scale feature fusion. Remote Sens.; 2024; 16, 1387. [DOI: https://dx.doi.org/10.3390/rs16081387]

4. Irfan, A.; Li, Y.; E, X.; Sun, G. Land Use and Land Cover classification with deep learning-based fusion of SAR and optical data. Remote Sens.; 2025; 17, 1298. [DOI: https://dx.doi.org/10.3390/rs17071298]

5. Sun, X.; Li, X.; Tan, B.; Gao, J.; Wang, L.; Xiong, S. Integrating Otsu Thresholding and Random Forest for Land Use/Land Cover (LULC) Classification and Seasonal Analysis of Water and Snow/Ice. Remote Sens.; 2025; 17, 797. [DOI: https://dx.doi.org/10.3390/rs17050797]

6. Indhanu, N.; Chalermyanont, T.; Chub-Uppakarn, T. Spatial assessment of land use and land cover change impacts on groundwater recharge and groundwater level: A case study of the Hat Yai basin. J. Hydrol. Reg. Stud.; 2025; 57, 102097. [DOI: https://dx.doi.org/10.1016/j.ejrh.2024.102097]

7. Song, L.; Xia, M.; Weng, L.; Lin, H.; Qian, M.; Chen, B. Axial cross attention meets CNN: Bibranch fusion network for change detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2022; 16, pp. 21-32. [DOI: https://dx.doi.org/10.1109/JSTARS.2022.3224081]

8. Sun, Y.; Zhang, Q.; Song, W.; Tang, S.; Singh, V.P. Hydrological responses of three gorges reservoir region (China) to climate and land use and land cover changes. Nat. Hazards; 2025; 121, pp. 1505-1530. [DOI: https://dx.doi.org/10.1007/s11069-024-06870-0]

9. Li, J.; Zheng, K.; Gao, L.; Han, Z.; Li, Z.; Chanussot, J. Enhanced Deep Image Prior for Unsupervised Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens.; 2025; 63, 5504218. [DOI: https://dx.doi.org/10.1109/TGRS.2025.3531646]

10. Li, J.; Cai, Y.; Li, Q.; Kou, M.; Zhang, T. A review of remote sensing image segmentation by deep learning methods. Int. J. Digit. Earth; 2024; 17, 2328827. [DOI: https://dx.doi.org/10.1080/17538947.2024.2328827]

11. Cheng, J.; Deng, C.; Su, Y.; An, Z.; Wang, Q. Methods and datasets on semantic segmentation for Unmanned Aerial Vehicle remote sensing images: A review. ISPRS J. Photogramm. Remote Sens.; 2024; 211, pp. 1-34. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2024.03.012]

12. Jiang, W.; Sun, Y.; Lei, L.; Kuang, G.; Ji, K. Change detection of multisource remote sensing images: A review. Int. J. Digit. Earth; 2024; 17, 2398051. [DOI: https://dx.doi.org/10.1080/17538947.2024.2398051]

13. Li, X.; Xu, F.; Yu, A.; Lyu, X.; Gao, H.; Zhou, J. A Frequency Decoupling Network for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens.; 2025; 63, 5607921. [DOI: https://dx.doi.org/10.1109/TGRS.2025.3531879]

14. Li, X.; Xu, F.; Zhang, J.; Yu, A.; Lyu, X.; Gao, H.; Zhou, J. Dual-domain decoupled fusion network for semantic segmentation of remote sensing images. Inf. Fusion; 2025; 124, 103359. [DOI: https://dx.doi.org/10.1016/j.inffus.2025.103359]

15. Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell.; 2022; 44, pp. 3523-3542. [DOI: https://dx.doi.org/10.1109/TPAMI.2021.3059968]

16. Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl.; 2021; 169, 114417. [DOI: https://dx.doi.org/10.1016/j.eswa.2020.114417]

17. Li, X.; Xu, F.; Liu, F.; Tong, Y.; Lyu, X.; Zhou, J. Semantic Segmentation of Remote Sensing Images by Interactive Representation Refinement and Geometric Prior-Guided Inference. IEEE Trans. Geosci. Remote Sens.; 2024; 62, 3339291. [DOI: https://dx.doi.org/10.1109/TGRS.2023.3339291]

18. Asgari Taghanaki, S.; Abhishek, K.; Cohen, J.P.; Cohen-Adad, J.; Hamarneh, G. Deep semantic segmentation of natural and medical images: A review. Artif. Intell. Rev.; 2021; 54, pp. 137-178. [DOI: https://dx.doi.org/10.1007/s10462-020-09854-1]

19. Li, X.; Xu, F.; Liu, F.; Lyu, X.; Gao, H.; Zhou, J.; Kaup, A. A Euclidean Affinity-Augmented Hyperbolic Neural Network for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens.; 2025; 63, 5636718. [DOI: https://dx.doi.org/10.1109/TGRS.2025.3594760]

20. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geosci. Remote Sens. Mag.; 2017; 5, pp. 8-36. [DOI: https://dx.doi.org/10.1109/MGRS.2017.2762307]

21. Li, X.; Xu, F.; Liu, F.; Lyu, X.; Tong, Y.; Xu, Z.; Zhou, J. A Synergistical Attention Model for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens.; 2023; 61, 3243954. [DOI: https://dx.doi.org/10.1109/TGRS.2023.3243954]

22. Li, J.; Zheng, K.; Li, Z.; Gao, L.; Jia, X. X-Shaped Interactive Autoencoders With Cross-Modality Mutual Learning for Unsupervised Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens.; 2023; 61, 5518317. [DOI: https://dx.doi.org/10.1109/TGRS.2023.3300043]

23. Li, J.; Zheng, K.; Gao, L.; Ni, L.; Huang, M.; Chanussot, J. Model-Informed Multistage Unsupervised Network for Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens.; 2024; 62, 5516117. [DOI: https://dx.doi.org/10.1109/TGRS.2024.3391014]

24. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV); Paris, France, 2–6 October 2023; pp. 3992-4003. [DOI: https://dx.doi.org/10.1109/ICCV51070.2023.00371]

25. Ramos, L.T.; Sappa, A.D. Multispectral semantic segmentation for land cover classification: An overview. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2024; 17, pp. 14295-14336. [DOI: https://dx.doi.org/10.1109/JSTARS.2024.3438620]

26. Yuan, B.; Zhao, D. A survey on continual semantic segmentation: Theory, challenge, method and application. IEEE Trans. Pattern Anal. Mach. Intell.; 2024; 46, pp. 10891-10910. [DOI: https://dx.doi.org/10.1109/TPAMI.2024.3446949]

27. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA, USA, 7–12 June 2015; pp. 3431-3440. [DOI: https://dx.doi.org/10.1109/CVPR.2015.7298965]

28. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell.; 2017; 39, pp. 2481-2495. [DOI: https://dx.doi.org/10.1109/TPAMI.2016.2644615]

29. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Munich, Germany, 5–9 October 2015; pp. 234-241.

30. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell.; 2018; 40, pp. 834-848. [DOI: https://dx.doi.org/10.1109/TPAMI.2017.2699184]

31. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens.; 2020; 162, pp. 94-114. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2020.01.013]

32. Zhou, H.; Zhang, Y.; Wu, J.; Wang, C. D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops; Salt Lake City, UT, USA, 18–22 June 2018; pp. 182-186.

33. Du, S.; Du, S.; Liu, B.; Zhang, X. Incorporating DeepLabv3+ and object-based image analysis for semantic segmentation of very high resolution remote sensing images. Int. J. Digit. Earth; 2021; 14, pp. 357-378. [DOI: https://dx.doi.org/10.1080/17538947.2020.1831087]

34. Sun, Y.; Zheng, W. HRNet-and PSPNet-based multiband semantic segmentation of remote sensing images. Neural Comput. Appl.; 2023; 35, pp. 8667-8675. [DOI: https://dx.doi.org/10.1007/s00521-022-07737-w]

35. Li, X.; Li, T.; Chen, Z.; Zhang, K.; Xia, R. Attentively learning edge distributions for semantic segmentation of remote sensing imagery. Remote Sens.; 2021; 14, 102. [DOI: https://dx.doi.org/10.3390/rs14010102]

36. Zhao, Q.; Liu, J.; Li, Y.; Zhang, H. Semantic segmentation with attention mechanism for remote sensing images. IEEE Trans. Geosci. Remote Sens.; 2021; 60, 5403913. [DOI: https://dx.doi.org/10.1109/TGRS.2021.3085889]

37. Li, R.; Zheng, S.; Duan, C.; Su, J.; Zhang, C. Multistage attention ResU-Net for semantic segmentation of fine-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett.; 2021; 19, 8009205. [DOI: https://dx.doi.org/10.1109/LGRS.2021.3063381]

38. Li, X.; Xu, F.; Liu, F.; Xia, R.; Tong, Y.; Li, L.; Xu, Z.; Lyu, X. Hybridizing Euclidean and hyperbolic similarities for attentively refining representations in semantic segmentation of remote sensing images. IEEE Geosci. Remote Sens. Lett.; 2022; 19, 5003605. [DOI: https://dx.doi.org/10.1109/LGRS.2022.3225713]

39. Li, X.; Xu, F.; Lyu, X.; Gao, H.; Tong, Y.; Cai, S.; Li, S.; Liu, D. Dual attention deep fusion semantic segmentation networks of large-scale satellite remote-sensing images. Int. J. Remote Sens.; 2021; 42, pp. 3583-3610. [DOI: https://dx.doi.org/10.1080/01431161.2021.1876272]

40. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132-7141.

41. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany, 8–14 September 2018; pp. 3-19.

42. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 3141-3149. [DOI: https://dx.doi.org/10.1109/CVPR.2019.00326]

43. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794-7803. [DOI: https://dx.doi.org/10.1109/CVPR.2018.00813]

44. Ding, L.; Tang, H.; Bruzzone, L. LANet: Local Attention Embedding to Improve the Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens.; 2021; 59, pp. 426-435. [DOI: https://dx.doi.org/10.1109/TGRS.2020.2994150]

45. Li, H.; Qiu, K.; Chen, L.; Mei, X.; Hong, L.; Tao, C. SCAttNet: Semantic Segmentation Network with Spatial and Channel Attention Mechanism for High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett.; 2021; 18, pp. 905-909. [DOI: https://dx.doi.org/10.1109/LGRS.2020.2988294]

46. Niu, R.; Sun, X.; Tian, Y.; Diao, W.; Chen, K.; Fu, K. Hybrid Multiple Attention Network for Semantic Segmentation in Aerial Images. IEEE Trans. Geosci. Remote Sens.; 2022; 60, 5603018. [DOI: https://dx.doi.org/10.1109/TGRS.2021.3065112]

47. Li, X.; Xu, F.; Xia, R.; Lyu, X.; Gao, H.; Tong, Y. Hybridizing cross-level contextual and attentive representations for remote sensing imagery semantic segmentation. Remote Sens.; 2021; 13, 2986. [DOI: https://dx.doi.org/10.3390/rs13152986]

48. Zhou, R.L.T.L.N.L.W. RAANet: A Residual ASPP with Attention Framework for Semantic Segmentation of High-Resolution Remote Sensing Images. Remote Sens.; 2022; 14, 3109. [DOI: https://dx.doi.org/10.3390/rs14133109]

49. Li, R.; Duan, C.; Zheng, S.; Zhang, C.; Atkinson, P.M. MACU-Net for Semantic Segmentation of Fine-Resolution Remotely Sensed Images. IEEE Geosci. Remote Sens. Lett.; 2022; 19, 8007205. [DOI: https://dx.doi.org/10.1109/LGRS.2021.3052886]

50. Li, R.; Wang, L.; Zhang, C.; Duan, C.; Zheng, S. A2-FPN for semantic segmentation of fine-resolution remotely sensed images. Int. J. Remote Sens.; 2022; 43, pp. 1131-1155. [DOI: https://dx.doi.org/10.1080/01431161.2022.2030071]

51. Qiu, X.; Zhang, Z.; Luo, X.; Zhang, X.; Yang, Y.; Wu, Y.; Su, J. Semantic Uncertainty-Awared for Semantic Segmentation of Remote Sensing Images. IET Image Process.; 2025; 19, e70045. [DOI: https://dx.doi.org/10.1049/ipr2.70045]

52. Fu, J.; Yu, Y.; Wang, L. FSDENet: A Frequency and Spatial Domains-Based Detail Enhancement Network for Remote Sensing Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2025; 18, pp. 19378-19392. [DOI: https://dx.doi.org/10.1109/JSTARS.2025.3583558]

53. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv; 2020; arXiv: 2010.11929

54. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision; Montreal, QC, Canada, 11–17 October 2021; pp. 7262-7272.

55. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision; Montreal, QC, Canada, 11–17 October 2021; pp. 10012-10022.

56. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. FarSeg++: Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery. IEEE Trans. Pattern Anal. Mach. Intell.; 2023; 45, pp. 13715-13729. [DOI: https://dx.doi.org/10.1109/TPAMI.2023.3296757]

57. Li, X.; Xu, F.; Xia, R.; Xu, N.; Liu, F.; Yuan, C.; Huang, Q.; Lyu, X. Locality-Enhanced Transformer for Semantic Segmentation of High-Resolution Remote Sensing Images. Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Seoul, Republic of Korea, 14–19 April 2024; pp. 2870-2874. [DOI: https://dx.doi.org/10.1109/ICASSP48485.2024.10446525]

58. Long, J.; Li, M.; Wang, X. Integrating Spatial Details With Long-Range Contexts for Semantic Segmentation of Very High-Resolution Remote-Sensing Images. IEEE Geosci. Remote Sens. Lett.; 2023; 20, 2501605. [DOI: https://dx.doi.org/10.1109/LGRS.2023.3262586]

59. Wu, H.; Zhang, M.; Huang, P.; Tang, W. CMLFormer: CNN and Multiscale Local-Context Transformer Network for Remote Sensing Images Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2024; 17, pp. 7233-7241. [DOI: https://dx.doi.org/10.1109/JSTARS.2024.3375313]

60. Xu, A.; Xue, Z.; Li, Z.; Cheng, S.; Su, H.; Xia, J. UM2Former: U-Shaped Multimixed Transformer Network for Large-Scale Hyperspectral Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens.; 2025; 63, 5506221. [DOI: https://dx.doi.org/10.1109/TGRS.2025.3543821]

61. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens.; 2022; 190, pp. 196-214. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2022.06.008]

62. Li, X.; Xu, F.; Li, L.; Xu, N.; Liu, F.; Yuan, C.; Chen, Z.; Lyu, X. AAFormer: Attention-Attended Transformer for Semantic Segmentation of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett.; 2024; 21, 5002805. [DOI: https://dx.doi.org/10.1109/LGRS.2024.3477609]

63. Zhang, J.; Li, Y.; Yang, X.; Jiang, R.; Zhang, L. RSAM-Seg: A SAM-Based Model with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation. Remote Sens.; 2025; 17, 590. [DOI: https://dx.doi.org/10.3390/rs17040590]

64. Du, B.; Shan, L.; Shao, X.; Zhang, D.; Wang, X.; Wu, J. Transform Dual-Branch Attention Net: Efficient Semantic Segmentation of Ultra-High-Resolution Remote Sensing Images. Remote Sens.; 2025; 17, 540. [DOI: https://dx.doi.org/10.3390/rs17030540]

65. ISPRS 2D Semantic Labeling Contest—Potsdam. Available online: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (accessed on 2 December 2022).

66. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv; 2021; arXiv: 2110.08733

67. Wang, Z.; Guo, J.; Huang, W.; Zhang, S. High-resolution remote sensing image semantic segmentation based on a deep feature aggregation network. Meas. Sci. Technol.; 2021; 32, 095002. [DOI: https://dx.doi.org/10.1088/1361-6501/abfbfd]

68. Li, X.; Lei, L.; Kuang, G. Multilevel adaptive-scale context aggregating network for semantic segmentation in high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett.; 2021; 19, 6003805. [DOI: https://dx.doi.org/10.1109/LGRS.2021.3091284]

69. Zhang, K.; Bello, I.M.; Su, Y.; Wang, J.; Maryam, I. Multiscale depthwise separable convolution based network for high-resolution image segmentation. Int. J. Remote Sens.; 2022; 43, pp. 6624-6643. [DOI: https://dx.doi.org/10.1080/01431161.2022.2142081]

70. Feng, M.; Sun, X.; Dong, J.; Zhao, H. Gaussian dynamic convolution for semantic segmentation in remote sensing images. Remote Sens.; 2022; 14, 5736. [DOI: https://dx.doi.org/10.3390/rs14225736]

71. Ma, B.; Chang, C.Y. Semantic segmentation of high-resolution remote sensing images using multiscale skip connection network. IEEE Sens. J.; 2021; 22, pp. 3745-3755. [DOI: https://dx.doi.org/10.1109/JSEN.2021.3139629]

72. Huang, J.; Weng, L.; Chen, B.; Xia, M. DFFAN: Dual function feature aggregation network for semantic segmentation of land cover. ISPRS Int. J.-Geo-Inf.; 2021; 10, 125. [DOI: https://dx.doi.org/10.3390/ijgi10030125]

73. Weng, L.; Pang, K.; Xia, M.; Lin, H.; Qian, M.; Zhu, C. Sgformer: A local and global features coupling network for semantic segmentation of land cover. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2023; 16, pp. 6812-6824. [DOI: https://dx.doi.org/10.1109/JSTARS.2023.3295729]

74. Kang, Y.; Ji, J.; Xu, H.; Yang, Y.; Chen, P.; Zhao, H. Swin-CDSA: The Semantic Segmentation of Remote Sensing Images Based on Cascaded Depthwise Convolution and Spatial Attention Mechanism. IEEE Geosci. Remote Sens. Lett.; 2024; 21, 3003405. [DOI: https://dx.doi.org/10.1109/LGRS.2024.3431638]

75. Jiang, J.; Feng, X.; Ye, Q.; Hu, Z.; Gu, Z.; Huang, H. Semantic segmentation of remote sensing images combined with attention mechanism and feature enhancement U-Net. Int. J. Remote Sens.; 2023; 44, pp. 6219-6232. [DOI: https://dx.doi.org/10.1080/01431161.2023.2264502]

76. He, G.; Dong, Z.; Feng, P.; Muhtar, D.; Zhang, X. Dual-range context aggregation for efficient semantic segmentation in remote sensing images. IEEE Geosci. Remote Sens. Lett.; 2023; 20, 2500605. [DOI: https://dx.doi.org/10.1109/LGRS.2023.3233979]

77. Li, K.; Qiang, Z.; Lin, H.; Wang, X. A Multi-Branch Attention Fusion Method for Semantic Segmentation of Remote Sensing Images. Remote Sens.; 2025; 17, 1898. [DOI: https://dx.doi.org/10.3390/rs17111898]

78. Liu, Y.; Zhu, Q.; Cao, F.; Chen, J.; Lu, G. High-resolution remote sensing image segmentation framework based on attention mechanism and adaptive weighting. ISPRS Int. J.-Geo-Inf.; 2021; 10, 241. [DOI: https://dx.doi.org/10.3390/ijgi10040241]

79. Wang, T.; Xu, C.; Liu, B.; Yang, G.; Zhang, E.; Niu, D.; Zhang, H. MCAT-UNet: Convolutional and cross-shaped window attention enhanced UNet for efficient high-resolution remote sensing image segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2024; 17, pp. 9745-9758. [DOI: https://dx.doi.org/10.1109/JSTARS.2024.3397488]

80. Xu, Z.; Geng, J.; Jiang, W. MMT: Mixed-Mask Transformer for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens.; 2023; 61, 5613415. [DOI: https://dx.doi.org/10.1109/TGRS.2023.3289408]

81. Fan, L.; Zhou, Y.; Liu, H.; Li, Y.; Cao, D. Combining Swin Transformer with UNet for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens.; 2023; 61, 5530111. [DOI: https://dx.doi.org/10.1109/TGRS.2023.3329152]

82. Wu, H.; Zeng, Z.; Huang, P.; Yu, X.; Zhang, M. CCTNet: CNN and Cross-Shaped Transformer Hybrid Network for Remote Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2024; 17, pp. 19986-19997. [DOI: https://dx.doi.org/10.1109/JSTARS.2024.3487003]

83. Ding, R.X.; Xu, Y.H.; Liu, J.; Zhou, W.; Chen, C. LSENet: Local and Spatial Enhancement to Improve the Semantic Segmentation of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett.; 2024; 21, 7506005. [DOI: https://dx.doi.org/10.1109/LGRS.2024.3431578]

84. Zheng, C.; Jiang, Y.; Lv, X.; Nie, J.; Liang, X.; Wei, Z. SSDT: Scale-Separation Semantic Decoupled Transformer for Semantic Segmentation of Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2024; 17, pp. 9037-9052. [DOI: https://dx.doi.org/10.1109/JSTARS.2024.3383066]

85. Xu, R.; Wang, C.; Zhang, J.; Xu, S.; Meng, W.; Zhang, X. RSSFormer: Foreground Saliency Enhancement for Remote Sensing Land-Cover Segmentation. IEEE Trans. Image Process.; 2023; 32, pp. 1052-1064. [DOI: https://dx.doi.org/10.1109/TIP.2023.3238648] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37022079]

86. Wu, H.; Huang, P.; Zhang, M.; Tang, W. CTFNet: CNN-Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett.; 2024; 21, 5000305. [DOI: https://dx.doi.org/10.1109/LGRS.2023.3336061]

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).