Abstract
With the rapid development of autonomous driving technology, accurate and efficient scene understanding has become increasingly important. Semantic segmentation for autonomous driving aims to accurately identify and segment elements such as roads, sidewalks, and vegetation to provide the necessary perceptual information. However, current semantic segmentation algorithms still face several challenges, mainly inaccurate segmentation of road edge contours, misclassification of parts of whole objects into other categories, and difficulty in segmenting objects with few pixels. Therefore, this paper proposes a Segmentation Network based on Swin-UNet and Skip Connection (SUSC-SNet). It includes a skip connection module (SCM), a multi-branch fusion module (MFM), and a dual-branch fusion module (DBFM). SCM uses dense skip connections to achieve aggregated semantic extension and highly flexible use of encoder features in the decoder. MFM and DBFM control the degree of fusion of each branch through learned weights, increasing flexibility and adaptability. We conducted a fair experimental comparison between SUSC-SNet and several advanced segmentation networks on two publicly available autonomous driving datasets. SUSC-SNet increased mean intersection over union by 0.67% and 0.9%, respectively, and mean class accuracy by 0.95% and 0.67%, respectively. A series of experiments demonstrated the efficiency, robustness, and applicability of SUSC-SNet.
Introduction
Autonomous driving technology can regulate urban traffic and greatly reduce the probability of traffic accidents1–3. Semantic segmentation technology that realizes environmental perception for vehicles is conducive to promoting the industrialization of autonomous driving4,5. By parsing the scene through semantic segmentation, autonomous vehicles can respond to changes in road conditions more accurately and make correct driving decisions6,7.
The development of semantic segmentation technology is crucial for the development of autonomous driving8–10. Many semantic segmentation methods for autonomous driving, such as LCFNets11, DAT12, and DDRNet13, have made progress in various aspects to advance the industry. However, due to the complex and constantly changing environment in autonomous driving scenarios, the following types of accuracy problems usually exist: inaccurate segmentation of object edge contours, misclassification of parts of whole objects into other categories, and difficulty in segmenting objects occupying few pixels. Current segmentation methods are not sufficient to deal with the complex road conditions in the driving scene. As shown in Fig. 1 (warm colors represent regions with high activation values that strongly influence the model’s prediction for the road class, while cool colors indicate areas with low or negligible relevance to the predicted road), an extremely important task in semantic segmentation for autonomous driving is to accurately segment road information; however, existing methods are inaccurate in segmenting the edge contours of roads. In addition, existing methods misidentify bus stops and walls, which share features with roads, as roads.
Fig. 1 [Images not available. See PDF.]
Activation map of road category in semantic segmentation. Warm colors (e.g., red) represent regions with high activation values that strongly influence the model’s prediction for road class, while cool colors (e.g., blue) indicate areas with low or negligible relevance to the predicted road. There are two challenges: (1) Segmentation methods are not accurate when segmenting road edge contours. (2) Segmentation methods tend to misidentify bus stops and walls, which are similar to road features, as roads.
Swin-UNet14 fuses multi-scale features from the decoder and encoder via simple skip connections to recover the spatial resolution of the feature map for further segmentation prediction. It demonstrates good segmentation accuracy and robust generalization ability on medical datasets. Inspired by Swin-UNet, to solve the above problems in autonomous driving scenarios, this paper proposes a Segmentation Network based on Swin-UNet and Skip Connection (SUSC-SNet) to further improve the accuracy and safety of semantic segmentation techniques in the field of autonomous driving.
SUSC-SNet introduces multi-fusion dense skip connections into the Swin-UNet model structure. This integration enables effective fusion of downsampled feature maps of different depths obtained from the Swin-UNet backbone, so the model can simultaneously attend to contextual information and spatial visual information. In addition, feature maps of different depths can partially share the same encoder and undergo joint feedback learning through deep supervision. This strategy effectively reduces the semantic gap between the encoder and decoder, while capturing features at different levels and receptive fields of different sizes to obtain richer multi-scale information. Therefore, this model can effectively extract local spatial features and global features of autonomous driving images and fuse deep and shallow semantic information, minimizing the spatial information loss caused by downsampling to achieve accurate segmentation in autonomous driving scenarios. The main contributions of this work are as follows.
We propose a hybrid architecture with improved computational efficiency. The skip connection module (SCM) builds dense cross-layer connections for aggregated semantic extension in the decoder network; experiments show that this design improves mIoU by 3.19% and mAcc by 1.79%. The multi-branch fusion module (MFM) and dual-branch fusion module (DBFM) control the degree of branch fusion through adaptive dynamic weighting, improving mIoU by 1.97% and 0.56%, respectively, on the BDD100K dataset.
We introduce Swin-UNet to autonomous driving through architectural improvements. SCM reduces encoder–decoder semantic gaps while capturing multi-level features and multi-scale receptive fields for richer multi-scale information. The synergistic design of Swin-UNet with the three proposed modules enables the network to achieve 70.12% mIoU and 79.77% mAcc on BDD100K while maintaining 105G FLOPs computational complexity.
We conduct comparative experiments between the proposed SUSC-SNet and several state-of-the-art segmentation networks on two publicly available autonomous driving datasets. SUSC-SNet achieves an inference speed of 4.29 FPS, which meets autonomous driving requirements, while improving mIoU by 0.67% and 0.9% and mAcc by 0.95% and 0.67% on the two datasets, respectively. A series of experiments demonstrate the efficiency, robustness, and applicability of SUSC-SNet.
Related work
Semantic segmentation is a fundamental task in the field of automated driving, which aims to assign each pixel in the incoming image to a specific category label for parsing the visual scene. To meet the accuracy requirements of semantic segmentation, researchers have proposed several approaches. DDRNet13 consists of a deep dual-resolution backbone and an enhanced low-resolution context information extractor. Compared to existing dual-path methods, the fusion of two deep branches and multiple bilateral backbones can generate higher-quality details. LCFNets11 introduces compensation branches to preserve the features of the original image. By using two efficient modules, the Lightweight Detail Guided Fusion Module (L-DGF) and the Lightweight Semantic Guided Fusion Module (L-SGF), the detail branch and the semantic branch are allowed to selectively extract features from the compensation branch, and a novel aggregation layer was designed to balance the features of the three branches and guide their effective fusion. Fan et al.15 proposed a short-term dense connectivity module and designed STDC, a backbone network dedicated to segmentation tasks, which is capable of extracting shallow detail features while reducing the computational cost. Based on this, Peng et al.16 proposed PP-LiteSeg, which includes a lightweight decoder and a unified attention fusion module that better utilizes shallow detail features and improves the model’s ability to extract detail information at lower computational cost. CLUSTSEG17 is a transformer-based framework that tackles different image segmentation tasks (i.e., superpixel, semantic, instance, and panoptic) through a unified neural clustering scheme. CLUSTSEG is innovative in two aspects: (1) cluster centers are initialized in heterogeneous ways to address task-specific demands (e.g., instance- or category-level distinctiveness) without modifying the architecture; and (2) pixel-cluster assignment, formalized in a cross-attention fashion, is alternated with cluster-center updates without learning additional parameters. Wang et al.18 devised a new segmentation framework that boosts query-based models through discriminative query embedding learning. It explores two essential properties of the relation between queries and instances, namely dataset-level uniqueness and transformation equivariance. SSS-Former19 efficiently combines CNN and encoder–decoder structures in a Transformer-inspired manner, which enhances the spatial topological connectivity of features and thus improves the overall performance. Wang et al.20 incorporate a dual-attention Transformer module to capture channel and spatial relationships, which resolves semantic gaps and yields a high-performance image segmentation model. Inspired by the concept of connecting multiple similar blocks, Multi-Unet21 is composed of multiple U-shaped block modules, where each subsequent module is directly connected to the previous module to facilitate data transmission between different U-shaped structures. Chen et al.22 introduced multiple fused dense skip connections between the encoder and decoder to compensate for the spatial information loss caused by inevitable downsampling in Swin-UNet and other methods. BASeg23 uses a Context Aggregation Module (CAM) to capture long-range dependencies between boundary regions and internal pixels of objects to achieve mutual gain and improve intra-class consistency.
Zhao et al.24 proposed FE-Net, a method that improves segmentation performance in complex backgrounds using edge labels and pixel weights. However, due to the complex environment and the numerous categories of traffic participants in autonomous driving scenarios, and the fact that many categories share common features, semantic segmentation in the field of autonomous driving is more challenging than in other application scenarios.
The above research methods provide valuable references and insights for the study of this article and offer avenues for further exploration. Deep neural networks based on U-shaped structures and skip connections are commonly used for image segmentation tasks25,26. Although useful, they have inherent limitations that hinder their ability to effectively learn global information, especially when dealing with scenarios with numerous and similar categories. This problem has not been solved in either the UNet27 or UNet++28 network models. Swin-UNet14 solves this problem by re-importing tokenized image blocks into a transformer-based U-shaped decoder architecture via skip connections. However, Swin-UNet faces new challenges when trying to obtain detailed segmentation results during the upsampling process. This is because it relies entirely on the transformer structure and single direct skips without incorporating dense multi-fusion skip operations.
Based on the above findings, we propose to introduce a dense multi-fusion skip connection into the Swin-UNet model structure. This integration enables the effective integration of downsampling feature maps of different depths obtained from the Swin-UNet model structure. Therefore, this model, SUSC-SNet, can simultaneously focus on contextual information and spatial visual information. In addition, feature maps of different depths can partially share the same encoder and undergo joint feedback learning through deep supervision. The VGG29 network structure in SCM is concise and uses multiple small convolutional kernels to replace large convolutional kernels for convolution operations, achieving reduced computational complexity; MFM and DBFM utilize gating units to implement weight allocation, enhancing the model’s flexibility in learning from data. SUSC-SNet can effectively extract local spatial features and global features of images, and combine depth and shallow semantic information to accurately perform segmentation tasks in autonomous driving scenes. Compared with existing methods incorporating skip connections, SUSC-SNet presents significant differences in structural design, functional optimization, and application scenarios. Traditional approaches typically adopt simple feature map concatenation or direct element-wise addition: U-Net27 employs channel-wise concatenation without dimensionality adjustment, potentially causing feature redundancy; Swin-Unet14 and iSwin-Unet30 utilize fixed concatenation combined with linear layers for segmentation, lacking adaptive feature importance selection, which reduces the model’s focus on critical features, especially in complex background tasks; UCTransNet31 re-weights features through channel attention, but suppressed important channels may fail to recover in subsequent operations. Moreover, when input resolution increases, UCTransNet’s attention computation and feature concatenation lead to memory explosion. In contrast, SUSC-SNet achieves dynamic encoder–decoder feature alignment through adaptive weight allocation and a learnable lightweight convolutional network, reducing spatial information loss caused by downsampling to enable precise segmentation in autonomous driving scenarios.
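The joint feedback learning through deep supervision mentioned above can be illustrated with a short sketch. The following is a minimal, hypothetical example (the module name, auxiliary weight, and loss choice are our own assumptions, not the exact SUSC-SNet implementation): auxiliary segmentation heads attached to decoder features of different depths contribute weighted losses that are summed with the main loss.

```python
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervisionLoss(nn.Module):
    """Hypothetical sketch: combine a main segmentation loss with weighted
    auxiliary losses computed on intermediate decoder outputs."""
    def __init__(self, aux_weight: float = 0.4):
        super().__init__()
        self.aux_weight = aux_weight
        self.ce = nn.CrossEntropyLoss(ignore_index=255)

    def forward(self, main_logits, aux_logits_list, target):
        # main_logits: (B, C, H, W); target: (B, H, W) with class indices
        loss = self.ce(main_logits, target)
        for aux_logits in aux_logits_list:
            # auxiliary predictions from shallower decoder stages are
            # upsampled to the label resolution before computing the loss
            aux_up = F.interpolate(aux_logits, size=target.shape[-2:],
                                   mode='bilinear', align_corners=False)
            loss = loss + self.aux_weight * self.ce(aux_up, target)
        return loss
```

In this sketch, gradients from every auxiliary head flow back into the shared encoder, which is the "joint feedback learning" effect described above.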
Method
Figure 2 shows the network structure of SUSC-SNet, which includes four parts: encoder, decoder, bottleneck, and skip connection module. In addition, SUSC-SNet also includes three components: MFM, DBFM, and VGG. SCM can effectively extract local spatial features and global features of images, integrating deep and shallow semantic information. The VGG network has a concise structure and uses multiple small convolutional kernels to replace large convolutional kernels for convolution operations, achieving reduced computational complexity. MFM and DBFM utilize gating units to implement weight allocation, enhancing the model’s flexibility in learning from data. We will introduce the specific details of SUSC-SNet and provide insights into why it works well.
Fig. 2 [Images not available. See PDF.]
The overall architecture of SUSC-SNet.
Approach details
Encoder
The encoder is built from Swin Transformer blocks. First, the input image of size H × W × 3 is transformed into a sequence embedding: it is divided into non-overlapping patches of size 4 × 4 by patch partitioning, so the feature dimension of each patch is 4 × 4 × 3 = 48. The linear embedding layer projects the feature dimension to C, and the resulting features of size H/4 × W/4 × C are fed into two successive Swin Transformer blocks to learn feature representations. Finally, 2× downsampling is performed through a patch merging layer to reduce the number of tokens and double the feature dimension; the change in size and resolution is thus completed in the patch merging layer. This process is repeated three times in the encoder, each time with two Swin Transformer blocks and one patch merging layer.
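For clarity, the patch partition, linear embedding, and patch merging operations described above can be sketched as follows. This is a simplified illustration consistent with the Swin Transformer design; the tensor layouts and module names are our assumptions, not the authors’ code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an H x W x 3 image into non-overlapping 4 x 4 patches
    (feature dimension 4*4*3 = 48) and project them to C channels."""
    def __init__(self, in_chans=3, embed_dim=96, patch_size=4):
        super().__init__()
        # a strided convolution is equivalent to patch partition + linear embedding
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, H/4 * W/4, C)

class PatchMerging(nn.Module):
    """2x downsampling: concatenate each 2 x 2 neighborhood (4C channels)
    and reduce to 2C, halving resolution and doubling the feature dimension."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                # x: (B, H*W, C)
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)
        return self.reduction(self.norm(x))    # (B, H/2 * W/2, 2C)
```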
Unlike traditional multi-head self-attention (MSA) architectures, the Swin Transformer block32 employs a shifted window mechanism for feature interaction. As illustrated in Fig. 3, the framework cascades two consecutive Swin Transformer blocks. Each block integrates a LayerNorm (LN) layer, a window-based or shifted window-based MSA module (denoted as W-MSA and SW-MSA, respectively), skip connections, and a two-layer MLP with GELU activation. Specifically, the first block adopts W-MSA to process local window regions, while the subsequent block utilizes SW-MSA to enable cross-window communication. These Swin Transformer blocks are to be mathematically expressed as:
$$\hat{z}^{l} = \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1}, \tag{1}$$
$$z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}, \tag{2}$$
$$\hat{z}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l}, \tag{3}$$
$$z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}, \tag{4}$$
where $\hat{z}^{l}$ and $z^{l}$ represent the outputs of the W-MSA (SW-MSA) module and the MLP module of the $l$-th block, respectively. The self-attention computation, aligned with previous studies33,34, is defined as:
$$\text{Attention}(Q, K, V) = \text{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V, \tag{5}$$
where $Q, K, V \in \mathbb{R}^{M^{2} \times d}$ denote the query, key, and value matrices; $M^{2}$ and $d$ represent the number of patches in a window and the dimension of the query or key, respectively. The values in $B$ are taken from the bias matrix $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$.
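A minimal sketch of the windowed self-attention in Eq. (5) is given below. The dimensions and parameter names are illustrative assumptions, and the relative-position-bias indexing follows the public Swin Transformer design rather than the authors’ exact code.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Self-attention within an M x M window with a learned relative
    position bias B, i.e. SoftMax(Q K^T / sqrt(d) + B) V  (Eq. 5)."""
    def __init__(self, dim, window_size=7, num_heads=3):
        super().__init__()
        self.num_heads = num_heads                      # dim must be divisible by num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # bias table of size (2M-1)^2 per head; B is gathered from it
        self.bias_table = nn.Parameter(
            torch.zeros((2 * window_size - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size),
            indexing='ij')).flatten(1)                  # (2, M*M)
        rel = coords[:, :, None] - coords[:, None, :]   # (2, M*M, M*M)
        rel = rel.permute(1, 2, 0) + window_size - 1
        index = rel[..., 0] * (2 * window_size - 1) + rel[..., 1]
        self.register_buffer('index', index)            # (M*M, M*M)

    def forward(self, x):                                # x: (B_w, M*M, dim)
        Bw, N, C = x.shape
        qkv = self.qkv(x).reshape(Bw, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each (B_w, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale    # Q K^T / sqrt(d)
        bias = self.bias_table[self.index.view(-1)].view(N, N, -1)
        attn = attn + bias.permute(2, 0, 1).unsqueeze(0) # add relative position bias B
        attn = attn.softmax(dim=-1)
        return self.proj((attn @ v).transpose(1, 2).reshape(Bw, N, C))
```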
Fig. 3 [Images not available. See PDF.]
Swin transformer block32.
Bottleneck
Since an overly deep Transformer is difficult to converge35, we use only two successive Swin Transformer blocks to construct the bottleneck for learning deep feature representations. In the bottleneck, the feature dimension and resolution remain unchanged.
Decoder
Similar to the encoder, a symmetric decoder based on Swin Transformer blocks is constructed. In contrast to the patch merging layer used in the encoder, we use a patch expanding layer in the decoder to upsample the extracted deep features. The patch expanding layer transforms the feature maps of adjacent dimensions into higher-resolution feature maps (2× upsampling) and correspondingly reduces the feature dimension to half of the original.
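The patch expanding layer can be sketched as follows; this is a simplified version based on the Swin-UNet description, and the exact details are assumptions.

```python
import torch.nn as nn

class PatchExpanding(nn.Module):
    """2x upsampling: expand C channels to 2C with a linear layer, then
    rearrange the 2C features of each token into a 2 x 2 spatial block with
    C/2 channels, doubling resolution and halving the feature dimension."""
    def __init__(self, dim):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x, H, W):                 # x: (B, H*W, C)
        B, L, C = x.shape
        x = self.expand(x)                      # (B, H*W, 2C)
        x = x.view(B, H, W, 2, 2, C // 2)       # split each token into a 2x2 block
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, 2 * H, 2 * W, C // 2)
        x = x.view(B, 4 * H * W, C // 2)
        return self.norm(x)                     # (B, (2H)*(2W), C/2)
```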
Skip connection module
SUSC-SNet uses a dense skip connection method to achieve aggregated semantic extension and highly flexible use of encoder features in the decoder network. As shown in Figs. 2 and 4, the proposed model adopts a dense connection design by introducing nested and multi-fusion dense skip connections while fusing long and short connections. SUSC-SNet also uses MFM and DBFM to control the degree of fusion of each branch by weights, which adds flexibility and adaptability to the module. The VGG architecture employed in SCM utilizes multiple 3 × 3 convolutional kernels, reducing the parameter count while enhancing non-linearity to improve the model’s representational capacity. Furthermore, its increased depth enables the learning of more complex hierarchical features. This dense skip connection strategy effectively reduces the semantic gap between the encoder and decoder, while capturing features at different levels and receptive fields of different sizes, thus obtaining richer multi-scale information. Therefore, this model can effectively extract local spatial features and global features of autonomous driving images and fuse deep and shallow semantic information, minimizing the spatial information loss caused by the downsampling process.
Fig. 4 [Images not available. See PDF.]
Architecture of the SCM. SCM uses a dense skip connection method to achieve aggregated semantic extension and highly flexible encoder features in the decoder. MFM and DBFM control the degree of fusion of each branch through weights, increasing flexibility and adaptability.
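The weighted fusion idea behind MFM and DBFM, and the stacked 3 × 3 convolutions used in SCM, can be illustrated with the following sketch. The module internals are our assumptions based on the description of gating units and VGG-style blocks, not the released implementation.

```python
import torch
import torch.nn as nn

class VGGBlock(nn.Module):
    """Two stacked 3 x 3 convolutions (VGG style): fewer parameters than a
    single large kernel, with an extra non-linearity."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class GatedFusion(nn.Module):
    """Fuse N branches with learned weights: a gating unit predicts one
    weight per branch from the concatenated features (softmax-normalized),
    so the contribution of each branch is controlled adaptively."""
    def __init__(self, channels, num_branches):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels * num_branches, num_branches, kernel_size=1))

    def forward(self, branches):                           # list of (B, C, H, W) tensors
        w = torch.softmax(self.gate(torch.cat(branches, dim=1)), dim=1)  # (B, N, 1, 1)
        # a two-branch instance would correspond to a DBFM-like fusion,
        # and more branches to an MFM-like fusion (our assumption)
        return sum(w[:, i:i + 1] * branches[i] for i in range(len(branches)))
```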
Results
Implementation details
The experimental system is Ubuntu 18.04. CUDA 11.3 is used for NVIDIA A100 GPU acceleration. We set the batch size to 4 and employed the Adam optimization algorithm to update the model’s gradient. In the SUSC-SNet experiment, the training process involved 90k iterations. For fair comparisons, all experiments are implemented on PyTorch 1.12 and MMSegmentation 1.2.036. Unless otherwise noted, all parameters and optimization algorithms follow the settings of MMSegmentation 1.2.0.
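The iteration-based training schedule can be reproduced with a standard loop such as the sketch below (90k iterations, batch size 4, Adam). The model, data, and learning rate here are placeholders standing in for SUSC-SNet, the BDD100K dataloader, and the MMSegmentation defaults, not the actual pipeline.

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Conv2d(3, 19, kernel_size=1).to(device)        # stand-in for SUSC-SNet
criterion = nn.CrossEntropyLoss(ignore_index=255)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # illustrative learning rate

max_iters = 90_000                                         # iteration-based, not epoch-based
for step in range(max_iters):
    # dummy batch: in practice this comes from the BDD100K dataloader (batch size 4)
    images = torch.randn(4, 3, 512, 1024, device=device)   # 512 x 1024 crops
    labels = torch.randint(0, 19, (4, 512, 1024), device=device)
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```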
Datasets
The BDD100K37 dataset released by the University of California, Berkeley is one of the largest and most diverse autonomous driving datasets to date. The BDD100K dataset contains 100k driving images covering New York, the San Francisco Bay Area, and Berkeley. This dataset includes different scenarios such as city streets, residential areas, and highways.
Cityscapes38 is a semantic segmentation dataset used for intelligent driving scenarios. It was sponsored by Mercedes Benz and is currently recognized as one of the most authoritative and professional image segmentation datasets in the field of computer vision. It contains 5000 driving images from 27 cities in Germany and neighboring regions, aiming to achieve a high degree of diversity in foreground objects, backgrounds, and overall scene layout.
The detailed information of the dataset used in this paper is presented in Table 1.
Table 1. Details of the datasets used in this paper.
Dataset | Resolution | Category | Train | Val | Test |
|---|---|---|---|---|---|
BDD100K | 720 × 1280 | Road, Sidewalk, Building, Wall, Fence, Pole, Traffic light, Traffic sign, Vegetation, Terrain, Sky, Person, Rider, Car, Truck, Bus, Train, Motorcycle, Bicycle | 60,000 | 20,000 | 20,000 |
Cityscapes | 1024 × 2048 | Road, Sidewalk, Building, Wall, Fence, Pole, Traffic light, Traffic sign, Vegetation, Terrain, Sky, Person, Rider, Car, Truck, Bus, Train, Motorcycle, Bicycle | 3000 | 1000 | 1000 |
Evaluation metrics
In autonomous driving perception, two metrics are commonly used to objectively evaluate the difference between segmentation results and ground truth: mean Intersection over Union (mIoU) and mean class Accuracy (mAcc). mIoU is obtained by averaging the intersection over union of all categories, while mAcc is obtained by averaging the accuracy of all categories. For the categories that require special attention in autonomous driving: road, sidewalk, wall, and vegetation, we also compare them using road Intersection over Union (IoU_road), road Accuracy (Acc_road), and sidewalk Intersection over Union (IoU_sidewalk), sidewalk Accuracy (Acc_sidewalk), wall Intersection over Union (IoU_wall), wall Accuracy (Acc_wall), vegetation Intersection over Union (IoU_vegetation), and vegetation Accuracy (Acc_vegetation). The calculation of mIoU and mAcc is as follows.
$$\text{mIoU} = \frac{1}{C}\sum_{i=1}^{C}\frac{TP_{i}}{TP_{i} + FP_{i} + FN_{i}}, \tag{6}$$
$$\text{mAcc} = \frac{1}{C}\sum_{i=1}^{C}\frac{TP_{i}}{TP_{i} + FN_{i}}, \tag{7}$$
here, True Positives (TP) and True Negatives (TN) refer to instances where the model correctly predicts the positive and negative classes, respectively, while False Positives (FP) denote cases where the model incorrectly assigns a negative instance to the positive class, and False Negatives (FN) represent misclassifications of a positive instance as negative. The IoU threshold is set to 0.5. $TP_{i}$ denotes the number of True Positives for class $i$, $TN_{i}$ the number of True Negatives for class $i$, $FP_{i}$ the number of False Positives for class $i$, $FN_{i}$ the number of False Negatives for class $i$, and $C$ is the number of categories.
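The two metrics can be computed from per-class counts as in the following sketch, a straightforward implementation of Eqs. (6) and (7); the function and variable names are ours.

```python
import numpy as np

def miou_macc(pred, gt, num_classes, ignore_index=255):
    """Compute mIoU and mAcc from integer label maps.
    IoU_i = TP_i / (TP_i + FP_i + FN_i);  Acc_i = TP_i / (TP_i + FN_i)."""
    mask = gt != ignore_index
    pred, gt = pred[mask], gt[mask]
    ious, accs = [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        if tp + fn == 0:          # class absent from the ground truth
            continue
        ious.append(tp / (tp + fp + fn))
        accs.append(tp / (tp + fn))
    return float(np.mean(ious)), float(np.mean(accs))
```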
Performance comparison
We compared the performance of SUSC-SNet with several advanced semantic segmentation methods: TU-net39, MFAFNet40, SAN41, Mask2Former42, STDC15, DDRNet13, SegFormer43, and Segmenter44, evaluated with different backbones. Tables 2 and 3 show their results on BDD100K and Cityscapes, respectively. Figure 5 shows the performance comparison (M2F: Mask2Former, SF: SegFormer, SM: Segmenter).
Table 2. Performance results of multiple methods on BDD100K (Bold represents the best result, Italic represents the second-best result).
Method | Backbone | Crop size | mIoU | mAcc | IoU_road | Acc_road | IoU_sidewalk | Acc_sidewalk | IoU_vegetation | Acc_vegetation |
|---|---|---|---|---|---|---|---|---|---|---|
TU-net39 | U-Net | 512 × 1024 | 59.13 | 70.92 | 95.08 | 97.06 | 68.29 | 80.17 | 87.18 | 92.46 |
MFAFNet40 | ResNet18 | 512 × 1024 | 56.82 | 67.13 | 94.47 | 96.64 | 64.13 | 78.89 | 86.16 | 92.9 |
SAN41 | ViT-B | 640 × 640 | 50.04 | 61.31 | 93.80 | 96.80 | 60.77 | 74.86 | 83.94 | 90.49 |
SAN | ViT-L | 640 × 640 | 48.55 | 57.58 | 93.85 | 97.24 | 62.10 | 73.45 | 84.60 | 91.42 |
Mask2Former42 | R-50 | 512 × 1024 | 62.04 | 72.88 | 95.13 | 97.27 | 68.37 | 81.07 | 87.60 | 92.26 |
Mask2Former | R-101 | 512 × 1024 | 63.32 | 74.64 | 94.76 | 97.30 | 68.69 | 81.94 | 87.46 | 92.37 |
Mask2Former | Swin-T | 512 × 1024 | 63.84 | 74.20 | 95.24 | 97.16 | 67.98 | 83.50 | 88.58 | 93.38 |
Mask2Former | Swin-S | 512 × 1024 | 66.96 | 77.90 | 95.74 | 97.50 | 71.05 | 83.57 | 88.65 | 93.67 |
Mask2Former | Swin-L | 512 × 1024 | 69.45 | 78.82 | 95.83 | 97.46 | 70.87 | 83.30 | 89.26 | 94.06 |
Mask2Former | Swin-B | 512 × 1024 | 68.03 | 77.90 | 95.98 | 97.65 | 71.01 | 83.51 | 89.39 | 94.03 |
STDC15 | STDC1 | 512 × 1024 | 47.54 | 54.91 | 92.87 | 95.92 | 59.81 | 72.29 | 84.11 | 92.70 |
STDC | STDC2 | 512 × 1024 | 48.35 | 55.53 | 94.03 | 96.57 | 62.61 | 74.42 | 84.41 | 92.82 |
DDRNet13 | DDRNet23 | 512 × 1024 | 42.06 | 54.03 | 87.95 | 90.11 | 56.51 | 71.45 | 83.14 | 92.65 |
SegFormer43 | MIT-B | 512 × 1024 | 51.31 | 59.89 | 91.64 | 95.08 | 57.38 | 71.94 | 85.31 | 91.82 |
Segmenter44 | ViT-B | 512 × 1024 | 54.45 | 62.78 | 94.09 | 96.96 | 61.52 | 73.18 | 86.29 | 93.10 |
Segmenter | ViT-T | 512 × 1024 | 45.30 | 52.36 | 91.87 | 97.15 | 52.98 | 60.31 | 83.38 | 91.54 |
Segmenter | ViT-S | 512 × 1024 | 54.60 | 63.31 | 93.14 | 97.74 | 58.79 | 70.20 | 85.29 | 91.32 |
Segmenter | ViT-L | 512 × 1024 | 61.27 | 70.87 | 94.78 | 97.50 | 65.89 | 76.93 | 86.57 | 92.82 |
SUSC-SNet (Ours) | Swin-L | 512 × 1024 | 70.12 | 79.77 | 96.11 | 97.73 | 72.02 | 82.94 | 89.55 | 94.71 |
Table 3. Performance results of multiple methods on Cityscapes (Bold represents the best result, Italic represents the second-best result).
Method | Backbone | Crop size | mIoU | mAcc | IoU_road | Acc_road | IoU_sidewalk | Acc_sidewalk | IoU_wall | Acc_wall | IoU_vegetation | Acc_vegetation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
TU-net | U-Net | 512 × 1024 | 79.32 | 87.14 | 98.26 | 99.03 | 85.64 | 92.74 | 59.91 | 68 | 92.66 | 96.55 |
MFAFNet | ResNet18 | 512 × 1024 | 74.33 | 82.76 | 97.64 | 98.45 | 82.41 | 92.32 | 54.07 | 65.2 | 91.37 | 95.9 |
SAN | ViT-B | 640 × 640 | 62.72 | 74.02 | 96.37 | 98.37 | 73.20 | 82.19 | 52.05 | 64.55 | 88.24 | 94.28 |
SAN | ViT-L | 640 × 640 | 65.06 | 76.77 | 97.02 | 98.50 | 77.69 | 86.41 | 58.66 | 68.23 | 89.22 | 94.81 |
Mask2Former | R-50 | 512 × 1024 | 77.69 | 87.38 | 98.04 | 98.68 | 84.86 | 93.51 | 51.11 | 62.10 | 92.59 | 95.98 |
Mask2Former | R-101 | 512 × 1024 | 76.52 | 86.50 | 98.15 | 98.73 | 84.69 | 93.92 | 55.73 | 72.41 | 92.61 | 95.92 |
Mask2Former | Swin-T | 512 × 1024 | 77.81 | 87.31 | 98.27 | 98.77 | 85.71 | 94.62 | 60.17 | 71.67 | 92.98 | 96.36 |
Mask2Former | Swin-S | 512 × 1024 | 81.21 | 89.79 | 98.49 | 98.98 | 87.64 | 94.88 | 61.61 | 82.30 | 93.10 | 96.65 |
Mask2Former | Swin-L | 512 × 1024 | 81.89 | 89.98 | 98.50 | 99.01 | 87.63 | 94.49 | 68.44 | 79.83 | 93.07 | 96.40 |
Mask2Former | Swin-B | 512 × 1024 | 80.03 | 90.07 | 98.50 | 99.08 | 87.64 | 94.56 | 66.40 | 81.69 | 93.17 | 96.15 |
STDC | STDC1 | 512 × 1024 | 71.38 | 79.90 | 97.56 | 98.53 | 81.05 | 91.29 | 48.57 | 53.48 | 90.97 | 96.31 |
STDC | STDC2 | 512 × 1024 | 72.67 | 81.09 | 97.78 | 98.65 | 82.31 | 91.63 | 47.52 | 53.41 | 91.42 | 96.20 |
DDRNet | DDRNet23 | 512 × 1024 | 77.95 | 86.51 | 98.14 | 98.90 | 85.05 | 92.31 | 52.87 | 58.30 | 92.31 | 96.82 |
SegFormer | MIT-B | 512 × 1024 | 69.48 | 79.38 | 97.19 | 98.12 | 79.07 | 89.68 | 51.42 | 65.56 | 91.38 | 96.70 |
Segmenter | ViT-B | 512 × 1024 | 73.04 | 82.41 | 97.53 | 98.46 | 81.23 | 91.28 | 54.41 | 66.97 | 91.18 | 96.08 |
Segmenter | ViT-T | 512 × 1024 | 65.23 | 74.87 | 96.91 | 98.43 | 77.07 | 87.59 | 47.68 | 57.14 | 90.26 | 95.33 |
Segmenter | ViT-S | 512 × 1024 | 71.16 | 79.36 | 97.45 | 98.65 | 80.73 | 90.17 | 54.45 | 64.92 | 90.93 | 95.74 |
Segmenter | ViT-L | 512 × 1024 | 60.91 | 71.36 | 95.51 | 96.93 | 70.09 | 87.05 | 41.39 | 51.83 | 88.60 | 94.28 |
SUSC-SNet (Ours) | Swin-L | 512 × 1024 | 82.79 | 90.74 | 98.56 | 99.02 | 88.16 | 95.13 | 65.16 | 81.72 | 93.21 | 96.12 |
SUSC-SNet achieves the best mIoU and mAcc on both datasets. On BDD100K, SUSC-SNet achieved 70.12% mIoU, outperforming the other methods by at least 0.67%, and 79.77% mAcc, surpassing them by at least 0.95%. In terms of individual categories, SUSC-SNet achieved the best performance in IoU_road, IoU_sidewalk, IoU_vegetation, and Acc_vegetation, while achieving the second-best result, by a minimal margin, in Acc_road. On Cityscapes, SUSC-SNet achieved 82.79% mIoU, outperforming the other models by at least 0.9%, and 90.74% mAcc, surpassing them by at least 0.67%. This is because SUSC-SNet uses a dense skip connection method that achieves aggregated semantic extension and highly flexible use of encoder features in the decoder network. This strategy effectively reduces the semantic gap between the encoder and decoder, while capturing features at different levels and receptive fields of different sizes to obtain richer multi-scale information.
Table 4. Comparison of the proposed model with several state-of-the-art semantic segmentation methods in computational complexity and inference speed on the BDD100K dataset.
Method | Backbone | FLOPs (G) | Params (M) | FPS | mIoU |
|---|---|---|---|---|---|
TU-net | U-Net | 19 | 20 | 7.51 | 59.13 |
MFAFNet | ResNet18 | 42 | 15 | 5.86 | 56.82 |
SAN | ViT-L | 213 | 143 | 2.96 | 48.55 |
Mask2Former | Swin-L | 303 | 216 | 1.87 | 69.45 |
STDC | STDC1 | 4 | 8 | 21.19 | 47.54 |
DDRNet | DDRNet23 | 14 | 20 | 21.05 | 42.06 |
SegFormer | MIT-B | 22 | 24 | 7.02 | 51.31 |
Segmenter | ViT-L | 310 | 307 | 0.86 | 61.27 |
SUSC-SNet | Swin-L | 105 | 72 | 4.29 | 70.12 |
Fig. 5 [Images not available. See PDF.]
Performance comparison on two datasets.
We conducted a systematic comparison of computational complexity and inference speed between SUSC-SNet and other advanced segmentation networks. The experimental results in Table 4 show that, based on the Swin-L backbone, SUSC-SNet achieves 70.12% mIoU with an inference speed of 4.29 FPS while maintaining moderate computational complexity (105G FLOPs and 72M parameters). This represents a 0.67% improvement over Mask2Former (69.45% mIoU), the most accurate model in the comparison, and an 8.85% improvement over Segmenter (61.27% mIoU) using the ViT-L backbone. In terms of efficiency, SUSC-SNet significantly outperforms Transformer-based models, requiring only 49.3% of the FLOPs of the ViT-L-based SAN model (213G) and running about five times faster than Segmenter (0.86 FPS). Notably, compared to lightweight models, SUSC-SNet achieves a 22.58% improvement in mIoU despite having roughly 26 times the computational complexity of STDC (4G FLOPs), validating the effectiveness of the proposed skip connection module in achieving semantic expansion through dense connections and the design advantage of the branch fusion modules in enhancing small-object segmentation through weighted fusion strategies. These experimental results further confirm that SUSC-SNet achieves a favorable balance between accuracy and efficiency, demonstrating high effectiveness and engineering applicability for semantic segmentation tasks in autonomous driving.
Ablation experiment
To validate the contribution of each module to SUSC-SNet’s performance, this study designed systematic ablation experiments. As shown in Table 5, the model’s performance in autonomous driving semantic segmentation declines when SCM, MFM, or DBFM is removed: (1) Critical role of SCM: Removing SCM causes a 3.19% drop in mIoU (70.12% vs. 66.93%) and a 1.79% reduction in mAcc (79.77% vs. 77.98%), demonstrating that dense skip connections are crucial for aggregating multi-scale semantic features and particularly irreplaceable for recovering detailed information such as road edge contours. (2) Adaptive advantage of MFM: Without MFM, mIoU and mAcc decrease by 1.97% and 1.71%, respectively, validating that dynamic weight-controlled multi-branch fusion effectively mitigates target misclassification issues. (3) Refinement enhancement by DBFM: Removing DBFM alone results in the smallest performance decline (mIoU − 0.56%, mAcc − 0.23%), yet DBFM noticeably improves segmentation accuracy for small-object categories. The complete model’s peak performance (mIoU 70.12%, mAcc 79.77%) shows that dual-branch collaborative optimization enhances the robustness of the feature representation. Experimental results consistently indicate that SCM, MFM, and DBFM achieve optimal performance through synergistic interaction in SUSC-SNet.
Table 5. Results of different components in the proposed model on the BDD100K dataset.
Method | mIoU | mAcc |
|---|---|---|
SUSC-SNet w/o SCM | 66.93 | 77.98 |
SUSC-SNet w/o MFM | 68.15 | 78.06 |
SUSC-SNet w/o DBFM | 69.56 | 79.54 |
SUSC-SNet | 70.12 | 79.77 |
To intuitively demonstrate the effectiveness of the core module SCM, Fig. 6 compares activation maps of the baseline model (without SCM) and the proposed SUSC-SNet model across different scenarios. Three driving scenarios were selected for comparative analysis: the baseline model, lacking skip connections, exhibited localized fragmentation in activation responses (e.g., blurred road edges in Cases 1 and 2, and misclassification of buildings as roads in Cases 1 and 3). In contrast, SUSC-SNet with SCM generated activation distributions with stronger semantic consistency—for instance, accurately focusing on continuous road regions in Case 2, and distinguishing buildings from roads in Cases 1 and 3. This visualization validates that SCM significantly enhances the model’s semantic understanding of complex driving scenes by narrowing the semantic gap between encoder and decoder.
Fig. 6 [Images not available. See PDF.]
Comparison of activation maps between SUSC-SNet without the core module SCM and the complete version SUSC-SNet.
Sensitivity analysis
Backbone
To evaluate the impact of different backbone architectures on model performance, we conducted a comprehensive sensitivity analysis by replacing the backbone of SUSC-SNet with ResNet and Swin Transformer variants (ResNet-50, ResNet-101, Swin-T, Swin-S, Swin-B, and Swin-L) while keeping other modules unchanged. As Table 6 shows, when using the ResNet series as the backbone, the model performance was relatively limited: ResNet-50 achieved mIoU of 65.01% and mAcc of 75.13%, while ResNet-101 improved to mIoU of 66.23% and mAcc of 75.28%. In contrast, the Swin Transformer series demonstrated significant advantages, with performance generally improving as the model size increased: Swin-T achieved mIoU of 67.69% and mAcc of 77.27%; Swin-S further optimized to mIoU of 70.07% and mAcc of 79.33%; Swin-B showed slight fluctuations (mIoU of 69.71%, mAcc of 79.36%), but Swin-L ultimately achieved the best performance with mIoU of 70.12% and mAcc of 79.77%. This indicates that larger backbone models generally provide better feature extraction capabilities for semantic segmentation tasks, despite increased computational complexity, which is consistent with our expectations regarding the relationship between model capacity and performance.
Table 6. The results of SUSC-SNet with different backbones on the BDD100K dataset.
Backbone | mIoU | mAcc |
|---|---|---|
R-50 | 65.01 | 75.13 |
R-101 | 66.23 | 75.28 |
Swin-T | 67.69 | 77.27 |
Swin-S | 70.07 | 79.33 |
Swin-B | 69.71 | 79.36 |
Swin-L | 70.12 | 79.77 |
Iteration
In this sensitivity analysis, we investigate the impact of the training iteration count on the performance of our proposed SUSC-SNet. As detailed in Table 7, the model’s mIoU and mAcc were evaluated at five distinct iteration levels. The results demonstrate a clear trend: performance improves significantly as the iteration count increases from 60k to 90k, with mIoU rising from 69.93% to 74.25% and mAcc increasing from 79.62% to 81.29%. This indicates effective model convergence within this range. However, beyond 90k iterations, further training yields diminishing returns. mIoU peaks at 74.25% at 90k iterations and subsequently declines slightly to 73.81% at 120k iterations, while mAcc reaches a maximum of 81.36% at 105k iterations. This observation suggests that SUSC-SNet achieves optimal generalization around 90k iterations, and extended training may lead to overfitting or computational inefficiency without substantial gains. Consequently, we adopt 90k iterations as the standard setting for all subsequent experiments to ensure robust and efficient performance.
Table 7. The results of SUSC-SNet at different iterations on the BDD100K val set.
Iteration | mIoU | mAcc |
|---|---|---|
60,000 | 69.93 | 79.62 |
75,000 | 72.86 | 80.57 |
90,000 | 74.25 | 81.29 |
105,000 | 73.96 | 81.36 |
120,000 | 73.81 | 81.28 |
Discussion
The confusion matrices in Fig. 7 evaluate the semantic segmentation performance of SUSC-SNet on BDD100K and Cityscapes, where rows correspond to ground-truth classes and columns represent predicted classes. Diagonal entries highlight the accuracy per class. On BDD100K, sky (98.16%) and vegetation (94.71%) achieved near-perfect recognition due to their distinct visual features. Critical classes such as road (69.05%) and building (91.21%) show robust performance. Fine-grained challenges remain: wall is often confused with building (18.6%), while fence shows mutual confusion with building (9.85%) and pole (8.43%). Dynamic classes such as person (81.28%) and rider (78.64%) also share partial misclassification (6.69% of rider predicted as person), reflecting pose and scale similarities. Despite their sparse occurrence, traffic light (81.96%) and traffic sign (82.71%) maintain competitive accuracy, with little confusion against structural classes such as pole. These results underscore the capability of SUSC-SNet in diverse driving scenarios, while highlighting the need for improved discriminative features to address contextually overlapping or geometrically nuanced classes.
Fig. 7 [Images not available. See PDF.]
Confusion matrices of SUSC-SNet. Rows correspond to ground-truth classes and columns represent predicted classes. Diagonal entries highlight the accuracy per class.
To evaluate the effectiveness and applicability of SUSC-SNet segmentation in autonomous driving scenarios, we randomly selected five cases for multiple method comparison, as shown in Fig. 8. For ease of comparison, Case 1 is divided into two parts. The top image is the whole image, and the bottom image is a zoomed view of a more easily observable part.
Fig. 8 [Images not available. See PDF.]
Comparison of segmentation results from different methods. The red box is a zoomed image of the specific area mentioned in the paper. In Case 1, the road segmented by SAN, DDRNet, SegFormer, Segmenter, and STDC is curved; the road segmented by Mask2Former is straight, but the sidewalk at the edge of the road is incorrectly segmented. Compared to the other models, only SUSC-SNet successfully segmented the straight road in the left detail image and correctly segmented the sidewalk in the right detail image.
Compared to ground truth, SUSC-SNet successfully segmented almost all road information. Compared to other methods, the roads segmented by SUSC-SNet are closer to the actual situation. Taking the left detail image of Case 1 as an example, the road segmented by SAN, DDRNet, SegFormer, Segmenter, and STDC is curved; the road segmented by Mask2Former is straight, but the sidewalk at the edge of the road is incorrectly segmented. Compared to the other models, only SUSC-SNet successfully segmented the straight road in the left detail image and correctly segmented the sidewalk in the right detail image. We believe this is because SUSC-SNet can effectively extract local spatial features and global features of autonomous driving images and fuse deep and shallow semantic information, thereby increasing flexibility and adaptability.
To further validate the discrepancies between segmentation predictions and ground truth, we conducted an overlay analysis of SUSC-SNet’s predictions against the annotated masks. As shown in Fig. 9, the distinct category boundaries in the overlapped visualization demonstrate that our model achieves pixel-level boundary segmentation, which benefits from the SCM module’s capacity to effectively reduce the semantic gap between encoder and decoder while capturing features at different hierarchical levels and receptive fields of varying sizes to obtain richer multiscale information. These visual observations, corroborated by the quantitative metrics, not only intuitively highlight SUSC-SNet’s advantages in preserving structural details, but also provide a reliable foundation for precise critical-region segmentation in autonomous driving applications. However, it should be noted that pixel deviations persist at certain low-contrast boundaries (e.g., the right road edge in Fig. 9e), suggesting the necessity of an enhanced low-contrast boundary perception module in future work.
Fig. 9 [Images not available. See PDF.]
The overlay visualization of SUSC-SNet predictions and ground truth.
The proposed SUSC-SNet systematically addresses three core challenges in driving scene segmentation through its modular design: (1) Precise road edge contour segmentation: The SCM module employs dense skip connections to dynamically calibrate the spatial alignment between encoder and decoder features. Experimental results demonstrate that SUSC-SNet significantly reduces boundary information attenuation caused by downsampling through adaptive weight allocation, achieving sub-pixel level segmentation at road edges (as shown in Fig. 9). (2) Misclassification suppression: The MFM module balances local details and global semantic information via a weighted multi-branch feature pyramid fusion strategy. Experiments show that the model maintains misclassification rates within acceptable ranges during ambiguous category predictions, attributed to the enhanced contextual awareness from branch weight adjustment mechanisms. (3) Small-object segmentation optimization: The DBFM module introduces a dual-path feature enhancement unit to process shallow features and semantic features separately. Combined with lightweight convolution to expand receptive fields, this design yields a 0.56% accuracy gain. While maintaining a real-time inference speed of 4.29 FPS, SUSC-SNet achieves state-of-the-art accuracy. This efficiency-accuracy balance makes it particularly suitable for deployment on vehicular embedded systems, providing a robust engineering foundation for perception in complex driving scenarios.
Limitations of the study
It is worth noting that this work aims to improve the segmentation accuracy of important categories (such as roads) in autonomous driving. However, the problem remains that other similar categories (such as walls and buildings) are easily confused. In addition, several interesting directions not discussed in this paper can be further explored. For example, the newly proposed SUSC-SNet can theoretically be applied to other segmentation scenarios, which requires further experimental verification. The segmentation effect in extreme weather and nighttime still needs more experimental verification. Moreover, the added modules increase computational complexity, unavoidably slowing down inference speed, albeit still fulfilling the real-time demands of intelligent driving systems.
Conclusion
This paper proposes a novel semantic segmentation network for autonomous driving. The network introduces multiple fused dense skip connections between the encoder and decoder, allowing the decoder network to fuse features of different scales and compensate for the spatial information loss caused by downsampling. The network includes SCM, MFM, and DBFM. SCM uses a dense skip connection method to achieve aggregated semantic extension and highly flexible use of encoder features in the decoder. MFM and DBFM control the degree of fusion of each branch through weights, increasing flexibility and adaptability. The experimental results on the publicly available BDD100K and Cityscapes datasets show that this method outperforms other advanced methods in segmentation accuracy and has a positive effect on the development of autonomous driving. In the future, we intend to perform comparative analyses of performance and visualizations on more models, analyze the characteristics of each model, and explore strategies for further performance improvements; we plan to develop dedicated optimization modules for visually similar categories (e.g., walls and buildings); to address challenges posed by low-contrast category boundaries, we will explore data augmentation techniques; because the proposed method incurs higher computational complexity due to the additional modules, we aim to develop a lightweight version through channel compression and knowledge distillation to reduce redundancy and parameter count; additionally, we plan to apply our method to more datasets to further enhance the network’s stability and reliability.
Author contributions
Jiayao Li was accountable for the development of the study, execution of experiments, analysis of the findings, and preparation of the manuscript. Chak Fong Cheang supervised the project and offered a critical evaluation of the written document. Xiaoyuan Yu and Suigu Tang supervised the project and contributed to mathematical modeling. Zhaolong Du and Qianxiang Cheng contributed to experimental data analysis. All authors participated in the review of the manuscript.
Funding
This work was supported by the Science and Technology Development Fund, Macau SAR, under Grant 0067/2023/RIB3 and Grant 0089/2022/A.
Data availability
The data that support the findings of this study are openly available at http://bdd-data.berkeley.edu/ and https://www.cityscapesdataset.com/.
Declarations
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Wei, F. & Wang, W. SCCA-YOLO: A spatial and channel collaborative attention enhanced YOLO network for highway autonomous driving perception system. Sci. Rep. 15, 6459 (2025). https://doi.org/10.1038/s41598-025-90743-4
2. Chib, P. S. & Singh, P. Recent advancements in end-to-end autonomous driving using deep learning: A survey. IEEE Trans. Intell. Veh. 9, 103–118 (2023). https://doi.org/10.1109/TIV.2023.3318070
3. Liu, H., Wu, C. & Wang, H. Real time object detection using LiDAR and camera fusion for autonomous driving. Sci. Rep. 13, 8056 (2023). https://doi.org/10.1038/s41598-023-35170-z
4. Ma, M., Fu, Y., Dong, Y., Liu, X. & Huang, K. PODI: A private object detection inference framework for autonomous vehicles. Knowl. Based Syst. 301, 112267 (2024). https://doi.org/10.1016/j.knosys.2024.112267
5. Chen, L. et al. End-to-end autonomous driving: Challenges and frontiers. IEEE Trans. Pattern Anal. Mach. Intell. 16, 10164–10183 (2024). https://doi.org/10.1109/TPAMI.2024.3435937
6. Lai-Dang, Q.-V. A survey of vision transformers in autonomous driving: Current trends and future directions. arXiv preprint arXiv:2403.07542 (2024).
7. Eid Kishawy, M. M., Abd El-Hafez, M. T., Yousri, R. & Darweesh, M. S. Federated learning system on autonomous vehicles for lane segmentation. Sci. Rep. 14, 25029 (2024). https://doi.org/10.1038/s41598-024-71187-8
8. Li, J., Dai, H., Han, H. & Ding, Y. Mseg3d: Multi-modal 3d semantic segmentation for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 21694–21704 (2023).
9. Vinoth, K. & Sasikumar, P. Multi-sensor fusion and segmentation for autonomous vehicle multi-object tracking using deep Q networks. Sci. Rep. 14, 31130 (2024). https://doi.org/10.1038/s41598-024-82356-0
10. Li, T., Cui, Z. & Zhang, H. Semantic segmentation feature fusion network based on transformer. Sci. Rep. 15, 6110 (2025). https://doi.org/10.1038/s41598-025-90518-x
11. Yang, L., Bai, Y., Ren, F., Bi, C. & Zhang, R. LCFNets: Compensation strategy for real-time semantic segmentation of autonomous driving. IEEE Trans. Intell. Veh. 9, 4715–4729 (2024). https://doi.org/10.1109/TIV.2024.3363830
12. Ni, J. et al. Distribution-aware continual test-time adaptation for semantic segmentation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 3044–3050 (IEEE, 2024).
13. Pan, H., Hong, Y., Sun, W. & Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes. IEEE Trans. Intell. Transp. Syst. 24, 3448–3460 (2022). https://doi.org/10.1109/TITS.2022.3228042
14. Cao, H. et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. In European Conference on Computer Vision, 205–218 (Springer, 2022).
15. Fan, M. et al. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9716–9725 (2021).
16. Peng, J. et al. Pp-LiteSeg: A superior real-time semantic segmentation model. arXiv preprint arXiv:2204.02681 (2022).
17. Liang, J., Zhou, T., Liu, D. & Wang, W. CLUSTSEG: clustering for universal segmentation. In Proceedings of the 40th International Conference on Machine Learning, 20787–20809 (2023).
18. Wang, W., Liang, J. & Liu, D. Learning equivariant segmentation with instance-unique querying. Adv. Neural Inf. Process. Syst. 35, 12826–12840 (2022).
19. Zhou, Y. A serial semantic segmentation model based on encoder–decoder architecture. Knowl. Based Syst. 295, 111819 (2024). https://doi.org/10.1016/j.knosys.2024.111819
20. Wang, H., Cao, P., Yang, J. & Zaiane, O. Narrowing the semantic gaps in U-Net with learnable skip connections: The case of medical image segmentation. Neural Netw. 178, 106546 (2024). https://doi.org/10.1016/j.neunet.2024.106546
21. Zhao, Q. et al. Multi-Unet: An effective multi-U convolutional networks for semantic segmentation. Knowl. Based Syst. 309, 112854 (2024). https://doi.org/10.1016/j.knosys.2024.112854
22. Chen, Y. et al. Scunet++: Swin-Unet and CNN bottleneck hybrid architecture with multi-fusion dense skip connection for pulmonary embolism CT image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 7759–7767 (2024).
23. Xiao, X. et al. BASeg: Boundary aware semantic segmentation for autonomous driving. Neural Netw. 157, 460–470 (2023). https://doi.org/10.1016/j.neunet.2022.10.034
24. Zhao, Z., Chen, X., Cao, J., Zhao, Q. & Liu, W. FE-Net: Feature enhancement segmentation network. Neural Netw. 174, 106232 (2024). https://doi.org/10.1016/j.neunet.2024.106232
25. Cheng, J., Gao, C., Wang, F. & Zhu, M. SegNetr: Rethinking the local–global interactions and skip connections in U-shaped networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 64–74 (Springer, 2023).
26. Azad, R. et al. Medical image segmentation review: The success of U-Net. IEEE Trans. Pattern Anal. Mach. Intell. 46, 10076–10095 (2024). https://doi.org/10.1109/TPAMI.2024.3435571
27. Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, proceedings, part III 18, 234–241 (Springer, 2015).
28. Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N. & Liang, J. Unet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, 3–11 (Springer, 2018).
29. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
30. Chen, S. et al. Pavement crack detection based on the improved Swin-Unet model. Buildings 14, 1442 (2024). https://doi.org/10.3390/buildings14051442
31. Wang, H., Cao, P., Wang, J. & Zaiane, O. R. UCTransNet: Rethinking the skip connections in u-net from a channel-wise perspective with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, 2441–2449 (2022).
32. Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022 (2021).
33. Hu, H., Gu, J., Zhang, Z., Dai, J. & Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3588–3597 (2018).
34. Hu, H., Zhang, Z., Xie, Z. & Lin, S. Local relation networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3464–3473 (2019).
35. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G. & Jégou, H. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 32–42 (2021).
36. MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation (2020).
37. Yu, F. et al. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2636–2645 (2020).
38. Cordts, M. et al. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3213–3223 (2016).
39. Xu, Y., Cao, B. & Lu, H. Improved U-Net++ semantic segmentation method for remote sensing images. IEEE Access 13, 55877–55886 (2025). https://doi.org/10.1109/ACCESS.2025.3552581
40. Dang, Y., Gao, Y. & Liu, B. MFAFNet: A multiscale fully attention fusion network for remote sensing image semantic segmentation. IEEE Access 12, 123388–123400 (2024). https://doi.org/10.1109/ACCESS.2024.3451153
41. Xu, M., Zhang, Z., Wei, F., Hu, H. & Bai, X. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2945–2954 (2023).
42. Cheng, B., Misra, I., Schwing, A. G., Kirillov, A. & Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1290–1299 (2022).
43. Xie, E. et al. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 34, 12077–12090 (2021).
44. Strudel, R., Garcia, R., Laptev, I. & Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7262–7272 (2021).
© The Author(s) 2025. This work is published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nc-nd/4.0/).