Introduction
Higher spatial resolution aerial images are now widely available owing to recent technological advancements in the remote sensing field1. They supply a rich amount of geospatial knowledge and capture the fine structures within an area under investigation with high accuracy, enabling a better interpretation of the observed land covers2. The remote sensing sector uses aircraft to capture these higher spatial resolution aerial images because of the high demand for such imagery in various applications3. Airborne devices, particularly the Unmanned Aerial Vehicle (UAV), have demonstrated advantages in affordability, adaptability, and the generation of higher-resolution images across a variety of tasks4. In particular, modern sensors incorporated into these manned and unmanned aerial systems can produce higher-quality aerial images at significantly finer ground sampling distances5. This makes the aerial platform a viable and appealing option for dependably acquiring higher-resolution aerial images that can be used in intelligent agriculture, traffic monitoring, and military surveillance6. When employed for remote sensing applications, aerial systems are typically required to carry out tasks such as object recognition and scene processing7.
Semantic segmentation, or semantic labeling, is a fundamental operational component employed for interpreting scenes and objects in the aerial images produced by aircraft devices8. Its major goal is to assign every single pixel in an image a semantic description of the object with which it is associated9. This has significant implications for a wide range of remote sensing applications10. When a deep learning-aided semantic segmentation system is developed, it must be trained using pixel-level annotated samples11. The advancement of aerial imaging techniques has made it possible to rapidly capture an immense quantity of aerial photographs with markedly improved spatial resolution12. Semantic segmentation of these aerial images, however, remains a difficult task, as it requires assigning every single pixel an appropriate land cover class.
Annotating images at the pixel level is a costly, time-consuming, and difficult process13. Since these higher-resolution aerial images frequently show duplicate object characteristics and are more likely to contain many object classes, existing algorithms must fully take into consideration factors such as intra-class uniformity and inter-class differentiation14. Even though several earlier approaches based on conventional machine learning and handcrafted features were put forward to address these issues, they have proven challenging when a flawless implementation is required15. While the aircraft system is regarded as a useful source of higher-resolution remote sensing images, much interesting research has been carried out on semantic segmentation techniques operating on such aerial platforms16. Nonetheless, certain extremely difficult problems remain with higher-resolution aerial image segmentation approaches and their implementation on aircraft platform-based airborne hardware17. An efficient model for semantically segmenting higher-resolution aerial remote sensing images is therefore highly essential.
Some of the crucial contributions of the suggested semantic segmentation framework are listed as follows.
To design a deep learning-based semantic segmentation model using high-quality aerial images to analyze the topological characteristics of a given region.
To integrate an advanced image enhancement technique as a multi-scale residual network to improve aerial image quality, facilitating a more effective segmentation process.
To design and implement MSTDeeplabv3+ , leveraging multi-scale feature extraction from aerial images to achieve highly accurate land cover classification.
To introduce a new optimization method, the Improved Red Piranha Optimization (IRPO) algorithm, specifically designed to select critical features extracted from the transformer component of MSTDeeplabv3+ , significantly improving segmentation accuracy while reducing computational complexity.
To conduct extensive benchmarking of the proposed segmentation model against conventional methods and heuristic approaches, demonstrating its superior segmentation accuracy and computational efficiency.
The Improved Red Piranha Optimization (IRPO) framework stands out as the key innovation, effectively refining the feature selection from the transformer component and thereby improving segmentation performance and computational efficiency.
The remainder of this paper is organized as follows. In Sect. "Literature survey", the literature review is given. In Sect. "Semantic segmentation with deep learning network", the process of semantic segmentation of aerial images with the implementation of a multi-scale feature-tuned deep learning network is explained. Section "Image enhancement and enhanced optimization algorithm for semantic segmentation" presents the implementation of image enhancement for semantic segmentation utilizing aerial images, along with a discussion of the optimization method. In Sect. "Deep learning model for semantic segmentation", the deployment of the heuristic-based semantic segmentation model is explained. The results and discussion regarding the executed semantic segmentation model are given in Sect. "Results and discussion". Finally, the paper concludes in Sect. "Conclusion".
Literature survey
Related works
In 2022, Zhao et al.18 introduced a new approach aimed at reducing the domain gap in semantic segmentation tasks using an "Unsupervised Domain Adaptation (UDA)" technique. Generative approaches are often used to reduce the domain gap in aerial images and provide better cross-domain results. The "Digital Surface Model (DSM)" was generally utilized to obtain additional information on the aerial images. Thus, by considering the Depth Cycle Consistency Loss (DCCL) and Depth Supervised Loss (DSL), the data required by the generative technique was obtained and used by the implemented Depth-assisted ResiDualGAN (DRDG) model. Experimental verification proved that effective outcomes were obtained by utilizing the DRDG approach for cross-domain semantic segmentation.
In 2019, Luo et al.19 suggested the "Channel Attention Mechanism (CAM) model aided by a Fully Convolutional Network (FCN)" (CAM-DFCN) for semantic segmentation tasks. The CAM-DFCN framework used an encoder-decoder architecture. Within the encoding unit, a pair of similar deep "Residual Networks (ResNet)" was split into distinct layers and operated over spectral images, generating the extra data essential for performing the semantic segmentation task. Feature map fusion was carried out at every stage. The CAM was incorporated into the decoder to handle the concatenated feature maps at every stage and to automatically weigh the feature maps in distinct channels, performing feature selection so that the more discriminative features were chosen for the classification task. For more precise predictions, the CAM then balanced the semantic and spatial position data in the neighboring stages' fused feature maps. Assessment of the suggested CAM-DFCN on two standard databases showed that excellent outcomes were obtained by this model.
In 2017, Audebert et al.20 demonstrated a "segment-before-detect" technique for the segmentation, identification, and categorization of multiple wheeled vehicles from higher-quality aerial images. The deployed deep FCN was trained using two benchmark databases, and the acquired semantic maps were employed to derive accurate vehicle segmentations. These maps were sufficiently precise to enable straightforward connected-component extraction to assist the vehicle detection task. The distribution of vehicles throughout a city was determined with the aid of this technique. Lastly, vehicle classification was done using a CNN on a standard database. Extensive experimentation was carried out to prove the superiority of the suggested model.
In 2020, Chai et al.21 tackled the difficulty of employing a "Deep Convolutional Neural Network (DCNN)" to learn the spatial context in semantic segmentation. A ground-truth label map was employed to derive a signed distance map, and the DCNN was trained to predict such distance maps. By choosing the class with the greatest distance, the final label map was produced from the predicted distance maps. In comparison to existing methods, the segmentation results were smoother. Based on experimental findings, the suggested strategy worked better than most state-of-the-art techniques.
In 2018, Chen et al.22 suggested a technique using a "shuffle CNN" to achieve semantic segmentation by periodically rearranging aerial images. This method was meant to be used in addition to existing semantic segmentation techniques. Both the shallow and deep versions of the approach offered good results in identifying tiny objects. Furthermore, an approach to improve the predictions using "Field-of-View (FoV)" enhancement was carried out. By ensembling the score maps, the detection outcome can be enhanced. Trials carried out on two distinct datasets confirmed the efficiency of this strategy and its applicability to different types of networks.
In 2018, Azimi et al.23 suggested using the wavelet transform to improve a symmetric FCN for autonomous segmentation of lane markings from aerial images. A modified loss function in conjunction with a novel form of data augmentation was employed. Without utilizing any third-party data, a high degree of precision in the pixel-by-pixel localization of lane markings was obtained. A dataset from an advanced transportation system was utilized to validate this model, and it is also well suited for use by upcoming techniques and models.
In 2023, Dilusha and Jong24 implemented an approach for semantic segmentation of degraded aerial images based on deep learning. The suggested approach was capable of gauging the level of degradation of a distorted image acquired by an analog image receiver. To improve the segmentation score on the distorted images, two frameworks, one for segmentation and the other for approximation, were trained jointly. The outcomes demonstrated that the Intersection-over-Union (IoU) achieved by the executed approach was greater than that of a model performing segmentation alone.
In 2023, Moazzam et al.25 introduced a novel technique that increased the pixel-level inter-class categorization accuracy for crops and weeds. Semantic segmentation was carried out in two steps. In step I, a binary pixel-level classification model was created to separate the background from vegetation. In step II, a three-class pixel-level classification model was generated to classify background, weeds, and tobacco. The first stage's output served as the second stage's input. The model was tested using pixel-by-pixel hand-annotated data. The two-stage semantic segmentation approach demonstrated better pixel-level categorization precision. In comparison to step II, step I was shallower and required a smaller model to provide adequate detection outcomes.
In 2023, Lingwal et al.26 suggested an approach for cropland mapping using machine learning methods to estimate the cropland area and identify cropland boundaries through semantic segmentation of land cover. The process first used a variety of filters with an edge detection method to determine the features responsible for identifying the land boundaries. To generate the ground truth for identifying objects, the images were either annotated or masked. For semantic segmentation, the chosen features were fed into a Random Forest (RF) model. The satellite images were gathered using the QGIS program. The study concluded that RF produced the best results for accurately segmenting the images into various sections, with enhanced kappa coefficient, accuracy, and mean IoU, compared with other models.
In 2020, Gomaa et al.27 proposed an efficient vehicle detection and tracking strategy for aerial videos employing morphological operations and feature-point motion analysis, describing a real-time system for detecting and tracking cars in aerial footage. Combining top-hat and bottom-hat morphological transformations enhances vehicle detection. After detection, the KLT tracker and K-means clustering group object features by their motion characteristics to eliminate background regions, and a linkage mechanism traces the vehicle cluster trajectories. The strategy outperformed current methods with 95.1% recall, 97.5% precision, and 95.2% tracking accuracy across various scenarios.
In 2024, Gomaa et al.28 suggested an advanced domain adaptation technique for object detection leveraging semi-automated dataset construction and an enhanced YOLOv8, providing a novel approach to cross-domain object detection. The authors provide a semi-automated dataset creation method that removes manual tagging: motion data from surveillance videos is used to label training data, efficiently separating background and foreground elements. The research also proposes a novel activation function to improve YOLOv8's nonlinear fitting. Through extensive benchmark dataset trials, this combined technique enhances detection accuracy and lowers annotation costs.
In 2024, Gomaa et al.29 proposed a novel deep learning-based domain adaptation technique for object recognition using a Semi-Self Building Dataset (SSBD) framework and a modified YOLOv4 architecture. A semi-automated data collection approach builds and adapts the training dataset using pseudo-labeling and confidence filtering to overcome domain shifts between the source and target domains. The modified YOLOv4 model improves generalization and accuracy across domains with lightweight adjustments. The recommended adaptation strategy outperformed existing training methods in target-domain object identification experiments.
In 2025, Gomaa et al.30 suggested a deep learning Residual Channel-Attention (RCA) network that improves scene classification in high-resolution remote sensing images. Traditional convolutional neural networks (CNNs) struggle to grasp the complex semantic linkages, substantial intra-class variability, and inter-class similarity in such images. RCA combines a lightweight residual structure with a squeeze-and-excitation (SE) channel attention mechanism to address these challenges, helping the model extract multi-scale spatial information and prioritize important feature channels over irrelevant ones. The RCA network outperformed state-of-the-art remote sensing image classification methods on RSSCN7, PatternNet, and EuroSAT, with classification accuracies of 97%, 99%, and 96%, respectively.
Problem statement
The manual process of semantic segmentation is impractical, as it consumes considerable time and produces inaccurate results. Therefore, deep learning models have been introduced to automate semantic segmentation tasks. The features and challenges of the existing deep learning-based semantic segmentation models are given in Table 1. DRDG18 offers better segmentation of cross-domain aerial images than existing techniques, but the utilized GAN approach is highly unstable. CAM-FCN19 enhances segmentation performance as it utilizes an effectively selected set of features; however, multi-scale features are not considered in this model, making it slightly inaccurate when dealing with 3D images. FCN and CNN20 accurately determine the instances of objects in an aerial image, thus aiding efficient real-time vehicle monitoring and traffic planning tasks; however, this approach requires a large amount of data. DCNN21 generates highly accurate semantically segmented images, but its computation is expensive. Shuffle CNN22 effectively detects tiny objects in aerial images, yet this model needs a higher memory cost. FCN23 provides a larger receptive field, and thus better semantic segmentation performance is achieved by this technique; yet this model's outcomes are affected by shadow areas in the images. Deep CNN24 provides effective segmentation outcomes even with distorted images; however, larger datasets built using various image-transmission methods would have to be included to segment images of other planets. UNet with Vanilla Mini CNN and VGG-1625 facilitates efficient segmentation-based classification outcomes even in the presence of complicated lighting conditions, and its inference time makes it an ideal choice for real-world applications; however, this method has a higher computational complexity, and its generalization ability is not well known. RF26 effectively classifies cropland as uncultivated or barren land, thus aiding an effective cropland mapping technique; however, only a limited number of images can be effectively segmented using this method. Thus, to overcome these challenges, an improved deep learning-based semantic segmentation model is developed in this work.
Table 1. Features and challenges of existing semantic segmentation models using deep learning technique.
Author | Methodology | Features | Challenges |
---|---|---|---|
Zhao et al.18 | DRDG | This approach to the segmentation of cross-domain aerial images is better than existing techniques | The utilized GAN approach is highly unstable |
Luo et al.19 | CAM-FCN | Enhanced performance of segmentation task is provided by this model as it utilizes an effectively selected set of features | Multi-scale features are not considered in this model thus making it slightly inaccurate while dealing with 3D images |
Audebert et al.20 | FCN and CNN | The instances of objects in an aerial image can be detected accurately by this approach, thus aiding efficient real-time vehicle monitoring and traffic planning tasks | A large amount of data is required by this approach |
Chai et al.21 | DCNN | Highly accurate semantically segmented images are obtained from this approach | The computation of this approach is expensive |
Chen et al.22 | Shuffle CNN | The detection of tiny objects from the aerial images is made possible by this technique | Higher memory cost is needed by this model |
Azimi et al.23 | FCN | As the receptive field of this method is greater, better semantic segmentation performance is achieved by this technique | This model’s outcomes are affected by shadow areas in the images |
Dilusha and Jong24 | Deep CNN | Even distorted images can be segmented effectively by this approach | Larger datasets by utilizing various image-transmitting methods have to be included to segment images regarding other planets |
Moazzam et al.25 | UNet with Vanilla Mini CNN and VGG-16 | Facilitates efficient segmentation-based classification outcomes even in the presence of complicated lighting; the inference time of these methods makes them an ideal choice for real-world applications | This method has a higher computational complexity; the generalization ability of this approach is not well known |
Lingwal et al.26 | RF | The effective classification of cropland as uncultivated and barren land is done with the aid of this approach, thus aiding an effective cropland mapping technique | Only a limited number of images can be effectively segmented using this method |
Semantic segmentation with deep learning network
Developed semantic segmentation model
Semantic segmentation is the technique of assigning each pixel of aerial imagery to a certain land cover category. Interpreting land cover from such images is difficult because of variations in ground level. Aircraft technology is now known to be a dependable and effective instrument for detecting and interpreting aerial scenes. However, accurate and fast semantic segmentation of higher-quality aerial images for remote sensing applications remains difficult because of three obstacles: the requirement for low-latency operation on aerial platforms, the limited processing resources available on board, and the confusion caused by objects with small inter-class deviations and large intra-class deviations in higher-quality aerial images. CNNs were brought to the field of remote sensing in recent years, and several approaches have been put forward for enriched semantic labeling obtained by CNNs. However, sufficient performance is still not met by these models. Therefore, a suitable method for performing semantic segmentation tasks has to be developed using advanced deep learning techniques.
An efficient deep learning-based semantic segmentation model is implemented with the aid of a heuristic approach in this research work for aiding remote sensing applications. An illustration of the implemented semantic segmentation model created using the deep learning technique is shown in Fig. 1. The proposed system incorporates three essential components, MSRN, MSTDeepLabV3+ , and IRPO, to enhance aerial image segmentation, and their interaction follows a sequential and interconnected pattern. The multi-scale residual network is applied to enhance the quality of the aerial images: remote sensing images often suffer from low resolution, noise, and blurriness, and MSRN refines them by preserving fine details and sharpening edges. The output of MSRN serves as input for the segmentation model, ensuring that higher-quality images lead to better feature extraction. The MSTDeeplabv3+ model processes the enhanced images to extract multi-scale features and perform the initial segmentation. Land cover types vary in scale and complexity, and a multi-scale transformer-based approach ensures that both fine and coarse details are captured. The model takes the enhanced images from MSRN as input, generates feature maps that highlight the different land cover areas, and feeds the extracted deep feature representations to IRPO for further optimization. The improved red piranha optimization algorithm selects the most relevant features from MSTDeeplabv3+ to enhance segmentation accuracy while reducing computational complexity. The IRPO receives the multi-scale feature maps from MSTDeeplabv3+ and iteratively refines the feature selection by optimizing the segmentation decision boundaries. The optimized feature set is fed back to MSTDeeplabv3+ for the final segmentation. Experiments are carried out to verify the efficacy of the implemented model. The workflow illustration of the proposed semantic segmentation process is shown in Fig. 2.
Fig. 1 [Images not available. See PDF.]
Pictorial representation of the executed semantic segmentation model developed with the aid of a deep learning approach.
Fig. 2 [Images not available. See PDF.]
The workflow illustration of the proposed semantic segmentation process.
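The end-to-end flow described above can be summarized in a short sketch. The function names below (enhance_msr, mst_deeplab_features, irpo_select, decode_segmentation) are hypothetical placeholders standing in for the MSRN, MSTDeepLabV3+ , and IRPO stages; they are not the authors' actual implementation.

```python
# Illustrative sketch of the three-stage pipeline (MSRN -> MSTDeepLabV3+ -> IRPO).
# All four component functions are hypothetical placeholders.

def segment_aerial_image(raw_image, enhance_msr, mst_deeplab_features,
                         irpo_select, decode_segmentation):
    """Chain the three components into one segmentation pass."""
    enhanced = enhance_msr(raw_image)               # stage 1: image enhancement
    features = mst_deeplab_features(enhanced)       # stage 2: multi-scale features
    selected = irpo_select(features)                # stage 3: heuristic feature selection
    return decode_segmentation(enhanced, selected)  # final pixel-wise labels
```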
IRPO-MSTDeepLabV3+ enhances DeepLabV3+ by integrating multi-scale transformers into the architecture. While DeepLabV3+ uses Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context, MSTDeepLabV3+ replaces or augments this with transformer blocks that model long-range dependencies more effectively. The traditional ASPP of DeepLabV3+ is good at capturing context with dilated convolutions but limited in modeling global dependencies. The MST modules in IRPO-MSTDeepLabV3+ use attention to dynamically focus on important spatial regions across scales, leading to better semantic understanding, especially in complex scenes. IRPO-MSTDeepLabV3+ therefore improves performance in challenging scenarios such as object boundaries and small or thin objects, a significant improvement over DeepLabV3+ , which may struggle with fine boundary preservation.
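To make the contrast with ASPP concrete, the following is a minimal sketch, assuming a PyTorch setting, of the kind of multi-head self-attention block that could augment or replace an ASPP branch. It illustrates the idea of attending over all spatial positions; it is not the exact MST module used in MSTDeepLabV3+ .

```python
import torch
import torch.nn as nn

class MSTBlock(nn.Module):
    """Hypothetical transformer block over a feature map: flattens the
    spatial grid into tokens, applies multi-head self-attention to model
    long-range dependencies, and restores the (B, C, H, W) layout.
    `channels` must be divisible by `heads`."""

    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # (B, H*W, C) token sequence
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)    # residual connection + norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```

Unlike a dilated convolution, whose receptive field is fixed by its rate, every output position here can attend to every other position in one step.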
Aerial image dataset description
The data consisting of various aerial images for performing the semantic segmentation task are collected from two distinct websites. A detailed description of the collected aerial images is given in Table 2.
Table 2. Description of the gathered aerial images.
Dataset name | Dataset link | Description |
---|---|---|
“Semantic segmentation of aerial imagery” | “https://www.kaggle.com/humansintheloop/semantic-segmentation-of-aerial-imagery?select=Semantic+segmentation+dataset access date: 2024-01-15” | This dataset, acquired using MBRSC satellites and annotated with pixel-wise semantic segmentation of six classes, offers substantial information about aerial photographs of Dubai. There are six tiles containing a total of seventy-two images |
“Aerial Image Segmentation Dataset” | “http://jiangyeyuan.com/ASD/Aerial%20Image%20Segmentation%20Dataset.html access date: 2024-01-15” | The scientific community refers to it as an aerial image segmentation dataset. The dataset includes 80 high-quality photos with spatial resolutions ranging from 0.3 to 1.0 m. It includes many environments such as cities, residences, warehouses, schools, and power plants. These photos have dimensions of 512 × 512 pixels |
Image enhancement and enhanced optimization algorithm for semantic segmentation
Multi-Scale RetiNex: image enhancement
The gathered aerial images are given as input to the MSR model for image quality enhancement. MSR31 is an image processing algorithm that improves neural vision and color interpretation; it is the weighted mean of several single-scale RetiNex images. MSR is an image processing approach based on human perception that improves colored images. It offers Dynamic Range Compression (DRC) and color consistency, and is also capable of rendering luminance and hue. The MSR of a single spectral band is mathematically computed as given in Eq. (1).

$$R_{MSR}(x, y) = \sum_{n=1}^{N} w_n R_n(x, y) \quad (1)$$

In Eq. (1), the total quantity of scales being utilized is denoted by $N$, the blurring factor is represented as $\sigma_n$, the scale weights are indicated as $w_n$ such that $\sum_{n=1}^{N} w_n = 1$, and the single-scale RetiNex output of the $n$-th component is denoted by $R_n$. A single-scale RetiNex is determined using the formula provided in Eq. (2).

$$R_n(x, y) = \log I(x, y) - \log\left(G_n(x, y) * I(x, y)\right) \quad (2)$$

In Eq. (2), the symbol $*$ indicates the convolution operation, the term $I(x, y)$ indicates the value at the $(x, y)$ coordinates of an RGB color channel, and $G_n$ indicates the Gaussian surround function at scale $n$. The enhanced images obtained from the MSR model are then passed on to the segmentation stage.
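A minimal NumPy/OpenCV sketch of Eqs. (1) and (2) is given below. The scale values and equal weights are typical choices from the Retinex literature, not necessarily those used in this work.

```python
import cv2
import numpy as np

def single_scale_retinex(channel, sigma):
    """SSR, Eq. (2): log(I) - log(G_sigma * I) for one color channel."""
    channel = channel.astype(np.float64) + 1.0          # avoid log(0)
    blurred = cv2.GaussianBlur(channel, (0, 0), sigma)  # Gaussian surround
    return np.log(channel) - np.log(blurred)

def multi_scale_retinex(image, sigmas=(15, 80, 250), weights=None):
    """MSR, Eq. (1): weighted mean of SSR outputs over several scales.
    sigmas/weights are illustrative literature defaults."""
    weights = weights or [1.0 / len(sigmas)] * len(sigmas)
    out = np.zeros_like(image, dtype=np.float64)
    for sigma, w in zip(sigmas, weights):
        for ch in range(image.shape[2]):                # per RGB channel
            out[..., ch] += w * single_scale_retinex(image[..., ch], sigma)
    # rescale the log-domain result to a displayable 8-bit range
    out = (out - out.min()) / (out.max() - out.min() + 1e-8)
    return (255 * out).astype(np.uint8)
```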
Proposed IRPO for parameter tuning
The RPO algorithm32 is used to optimally select the features from the multi-scale transformer unit of the MSTDeepLabV3+ model. The RPO algorithm is selected because it has an enhanced convergence rate and a greater capability for reaching globally optimal solutions. However, the presence of an arbitrary (random) variable in the RPO algorithm makes it harder to attain the optimum. Therefore, a fitness-based concept is used in the IRPO algorithm to update this arbitrary variable.

3

In Eq. (3), the first term indicates the "best fitness value" and the second denotes the "worst fitness value". The value of this variable in the traditional RPO algorithm lies in the range [0, 1]. This parameter is improvised through Eq. (3) in the implemented IRPO algorithm, and the resulting value is used to update Eqs. (6) and (14). RPO is a nature-inspired metaheuristic optimization approach that imitates the Red Piranha fish's hunting style.
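Since the exact expression of Eq. (3) is not reproduced here, the sketch below shows one plausible fitness-ratio normalization of the kind described, mapping the best and worst fitness values to a value in [0, 1]. The formula is an assumption for illustration, not the paper's exact Eq. (3).

```python
def fitness_based_update(f_best, f_worst):
    """Hypothetical stand-in for Eq. (3): replaces the uniform random
    variable of standard RPO with a value driven by the population's
    best and worst fitness. The ratio form is assumed, not the
    authors' exact formula."""
    denom = abs(f_best) + abs(f_worst) + 1e-12  # guard against division by zero
    return abs(f_best) / denom                  # stays in [0, 1] like the original variable
```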
Foraging for fish: A scout that spots possible food sends out an Encircle Signal (ES) to the rest of the piranha swarm. Because piranhas are incredibly tidy hunters, once notified they encircle their target to stop it from dispersing or fleeing; the strike then starts once the target is surrounded. Given the uncertainty surrounding the location of the ideal solution (the target), the piranha scouts wander arbitrarily in search of food. This differs from the exploitation stage of the RPO algorithm, in which the piranhas adjust their location according to the leaders' positions (optimal solutions that are closer to the target). Thus, randomization is the fundamental element needed to accomplish efficient exploration. At this point, the swarm's leaders are regarded as the scouts, who are selected at random. Presuming that the overall quantity of iterations is evenly split over the RPO algorithm's three stages, scouting, surrounding, and striking are each repeated for one-third of the total iterations. Consider a population set of piranhas. The number of scout piranhas is obtained by applying a scaling factor to the population size, and the scouts are taken as the leaders of the piranha swarm. Excluding the scouts, the rest of the piranhas in the swarm are segmented into clusters, with an equal number of fish assigned to every cluster except the final one, which may contain fewer piranhas. Relative to its scout, each cluster fish's position is updated based on the mathematical functions provided in Eqs. (4), (5), (6), and (7).
4
5
6
7
In Eqs. (4), (5), (6), and (7), the terms denote, respectively, the position vector of the cluster's scout, the piranha's distance from its target, the position vector of a cluster's piranha, and the coefficient vector $A$; $t$ denotes the time step, and $r_1$, $r_2$, $r_3$, and $r_4$ are random variables in the range [0, 1]. Equation (3) updates the random variable in Eq. (6). The value of $C$ lies in the range [0, 2]. The value of $a$ in the RPO algorithm falls from 2 to 0 throughout the repetitions. When seeking a portion of food (exploration), $A$ takes an arbitrary value that lies outside the interval [−1, 1]. The RPO technique begins with arbitrary agent placements within the solution space. The piranhas' location-modifying technique gives them high exploring capability after every repetition. However, significant exploitation and convergence are accomplished during the striking stage, where comparable position updates take place using a similar equation; this enhanced exploitation and convergence is achieved by keeping the value of $A$ within [−1, 1]. The simulation of piranha behavior involves a drop in the value of $a$, which consequently leads to a fall in $A$. Given that $A$ is within the interval [−1, 1], the piranha's new location will fall between where it is now and where its leaders are, which approximately corresponds to the location of the food. Thus, while the piranhas approach their prey continuously, the RPO algorithm guarantees excellent avoidance of local optima and rapid convergence throughout the iterations. RPO is therefore regarded as an ideal global optimizer because of the adaptive modification of the search vector $A$, which makes it simple to switch between exploration and exploitation. Following the scouting iterations, distinct position vectors exist for every piranha; these vectors are the outcome of the scouts' cycles along with the agents' original arbitrary location assumption. Since the piranha is an avaricious fish, greedy selection is carried out following the completion of the seeking phase's repetitions: each piranha's most advantageous position is accepted. After that, every piranha begins the surrounding stage from the position that suits it best. The starting position vectors of the search agents are drawn at random between the minimum and maximum limits of the search space. In the end, the objective function value for each piranha's new location is determined.
Prey enclosure: When hunting for prey, piranhas move arbitrarily, even though the scouts direct their movements. This makes it possible for the piranhas to uncover new places inside the search area. The term Prey Encircling Signal (PES) refers to the signal raised at this stage. Once the information in this signal reaches the members of the piranha swarm, they begin to encircle the target and prevent it from fleeing. Once the surrounding of the target is finished, the striking stage starts. Throughout the surrounding stage, the logarithmic spiral is selected as the primary technique for updating the locations of the swarm members. Initially, the alpha individuals are set, and a hypothetical prey position is derived from them. The prey position in each dimension is calculated using Eq. (8).
8
Next, to update every piranha in the swarm and its respective position, the distance between each member and a prospective prey is first determined using Eq. (9).
9
The search agents’ movement toward the prey is then described by the spiral formula, as shown in Eq. (10).
10
The value of the spiral parameter is computed using Eq. (11).

11

The value of this parameter lies in the range [−1, 1].
Additionally, this ensures that proper exploitation is carried out during the RPO's surrounding stage. Once more, because piranhas are rapacious fish, the candidate solutions (piranhas) relocate to their optimal locations at the end of the surrounding stage. The position that offers the most effective validation of the defined objective function is the optimal one for a piranha.
Striking of the target: By the end of the surrounding phase, the target is entirely enveloped and has no opportunity to flee. Additionally, as the piranhas approach the target too closely, the alpha fishes send out a unique signal known as the Frenzy Signal (FS) to order the entire swarm of piranhas to attack the target at once. The other fishes in the swarm go into a frenzy when the FS signal is raised by the alpha piranha upon approaching the target closely, and they then begin to compete to get closer to the target prey after the alpha piranha. Since the alpha piranhas are the fish closest to the prey, the entire swarm follows them. Using Eq. (10), the piranhas' locations are updated over the striking stage. After first identifying the alpha individuals, Eq. (10) is used to forecast the probable prey's location, which is thought to be among the alpha fishes. The separation between each piranha and the possible prey is determined using Eq. (9), and the piranha's next location is then determined using Eq. (10), which also determines the locations of the remaining piranhas in the swarm. Updating the position of each search agent during the striking of the target is accomplished by using Eqs. (12)–(16).
12
13
14
15
16
The value of $a$ in Eq. (16) varies from 2 to 0, and Equation (3) updates the random variable in Eq. (14). As a result, the piranha's latest location will lie somewhere between its present location and the prey's location. It is now crucial to discuss the peculiar behavior of piranhas when they are attacking. When the alpha fishes send out their FS, the remaining piranhas in the swarm follow suit as they come into closer proximity to the target. In this instance, the piranhas in the swarm get extremely near to one another, a phenomenon known as crowding of solutions that could result in solution collisions. The piranhas become hungrier and more violent as a result of the FS spreading throughout the swarm and their approaching too closely to one another. This could cause the fish to attack one another rather than the prey, which signifies losing a potential solution. The piranha that escapes, usually the weaker one, starts searching for a spot to hide and then retraces the path taken by the alpha fishes to rejoin the attack. To simulate this sort of action, suppose that each fish seeks to defend itself from the other fish by keeping a safe distance, or safety barrier, between itself and its peers. It is assumed that this safety shield has a width $\delta$. According to this assumption, two fish are extremely close to one another if their mutual distance is smaller than $\delta$; that is, $d(P_u, P_v) < \delta$ is the impact criterion between piranhas $P_u$ and $P_v$, where $\delta$ is the protective-barrier width and $d(P_u, P_v)$ is the Cartesian distance between the two piranhas in the $n$-dimensional search space, computed using Eq. (17).

$$d(P_u, P_v) = \sqrt{\sum_{k=1}^{n} \left(x_{u,k} - x_{v,k}\right)^2} \quad (17)$$

In Eq. (17), the terms $x_{u,k}$ and $x_{v,k}$ represent the $k$-th components of the two piranhas' position vectors. To test the impact criterion, the distance between every pair of piranhas should be computed following every repetition of the striking stage. Therefore, if two or more piranhas collide, the strongest agent, i.e., the fish with the best objective function value, stays in its current location, and the weaker one flees to a different random site to resume tracking. However, the algorithm's execution speed would be impacted by the constant detection of collisions between piranhas after every loop; the number of such checks over the striking phase's iterations is therefore limited to a required count of checkpoints.
The check-pointing technique can be utilized to get over this obstacle. At the last repetition, the objective function's optimal validation value for every piranha is found. The pseudocode of the IRPO algorithm is given in Algorithm 1.
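For illustration, a condensed Python skeleton of the three phases and the collision check is given below. The update rules follow the standard whale-style equations that the description of Eqs. (4)–(16) suggests, and the barrier width and even three-way phase split are assumptions; this is a sketch of the algorithmic structure, not the authors' Algorithm 1.

```python
import numpy as np

def irpo_skeleton(obj, dim, n_agents=10, iters=50, lb=-1.0, ub=1.0):
    """Illustrative skeleton of scouting, encircling, and striking,
    minimizing `obj`. Update equations are assumed WOA-style forms."""
    rng = np.random.default_rng(0)
    X = rng.uniform(lb, ub, (n_agents, dim))        # random initial positions
    fit = np.array([obj(x) for x in X])
    phase_len = iters // 3                          # assumed even phase split
    delta = 0.05                                    # protective-barrier width (assumed)
    for t in range(iters):
        best = X[fit.argmin()].copy()
        a = 2.0 * (1 - t / iters)                   # a decreases from 2 to 0
        for i in range(n_agents):
            r1, r2 = rng.random(dim), rng.random(dim)
            A, C = 2 * a * r1 - a, 2 * r2           # coefficient vectors (Eqs. 6, 7 style)
            if t < phase_len:                       # scouting: follow a random scout
                leader = X[rng.integers(n_agents)]
            else:                                   # encircle/strike: follow the best agent
                leader = best
            D = np.abs(C * leader - X[i])           # distance to target (Eq. 4 style)
            if phase_len <= t < 2 * phase_len:      # logarithmic-spiral encircling
                l = rng.uniform(-1, 1)
                X[i] = D * np.exp(l) * np.cos(2 * np.pi * l) + leader
            else:
                X[i] = leader - A * D               # Eq. 5 style position update
            X[i] = np.clip(X[i], lb, ub)
        fit = np.array([obj(x) for x in X])
        if t >= 2 * phase_len:                      # striking: collision check, Eq. 17
            for i in range(n_agents):
                for j in range(i + 1, n_agents):
                    if np.linalg.norm(X[i] - X[j]) < delta:
                        loser = i if fit[i] > fit[j] else j  # weaker agent flees
                        X[loser] = rng.uniform(lb, ub, dim)
                        fit[loser] = obj(X[loser])
    return X[fit.argmin()], fit.min()
```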
The flowchart of the executed IRPO algorithm is shown in Fig. 3.
Fig. 3 [Images not available. See PDF.]
Flowchart of the executed IRPO algorithm.
Deep learning model for semantic segmentation
Trans-Deeplabv3+
The DeepLabV3+ model is a variation of the standard Fully Convolutional Network (FCN) that has demonstrated outstanding results by utilizing the contextual information of the image33,34. DeepLabV3+ is the most recent iteration of the DeepLab network series. The model inherits the "Atrous Spatial Pyramid Pooling (ASPP)" module, which builds on the Spatial Pyramid Pooling (SPP) of DeepLabV3. To capture contextual characteristics at numerous scales, the model undertakes convolution operations using parallel atrous convolutional layers at different rates. The model can reconstruct object boundaries thoroughly because of its efficient decoder unit, which also lessens the conventional CNN's boundary-loss issue. As a result, the DeepLabV3+ framework can carry out an in-depth validation of the contextual characteristics in a given input image. After the division of the input images into several scales, the DeepLabV3+ is modified to perform multi-scale feature fusion from the multi-scale feature maps to acquire primary water-forecasting findings. An encoding unit and a decoding unit are still the two components that make up the DeepLabV3+ . The core network used by the encoder unit of the DeepLabV3+ is ResNet-50. The network employed in this work is composed of five convolution layers, with varying numbers of bottleneck construction components in every layer.
A wider, deeper receptive field with more detail is associated with smaller-scale convolutional outputs. As the convolution layers deepen, pooling and down-sampling reduce the spatial resolution of the resulting feature maps, which causes the original global information to be lost. Thus, to maintain an adequate quantity of the original global information, one low-level feature set is taken from the first convolution layer following the pooling layer and another from the third bottleneck unit of the second convolution layer; the latter is expanded through interpolation to exactly the size of the former. These feature maps are combined and then passed through a channel-reducing convolution, so that the resultant low-level feature map has the same number of output channels as the ASPP module. The ASPP module is linked after the down-sampling of the five convolution layers and uses the fifth convolution layer's resultant feature map as its input. Concatenating the five concurrently generated feature maps in ASPP results in a single multi-channel feature map, which then goes through a convolution layer and is up-sampled to the size of the matching low-level feature map. The resultant feature maps are fused with the appropriate low-level feature map and sent through a convolutional layer in the decoder unit to precisely retrieve the segmented water-body information. To create an additional feature map, the related features are subjected to three convolutional operations. The acquired characteristics are then up-sampled to the sizes of the corresponding input images; this step is carried out to match the segmentation logits to the feature-map size and retrieve accurate image features. After obtaining the two-channel feature maps from each of the three input image blocks scaled by the decoder unit, the up-sampled feature maps of identical size are mosaically merged. As a result, three feature maps are acquired for remote sensing images of the same size. The mosaic features of the various scales then have to be merged into one feature map to produce a more accurate classification result. If the retrieved characteristics from the various scales were simply combined and fed into a single classifier, each scale's properties would be treated equally and the benefits of each feature would not be emphasized. A weighted fusion technique is therefore employed that assigns varying weights to the feature maps of distinct sizes, taking into account the different influences of the various scale features on the eventual prediction. With this technique, the level of extraction refinement can be controlled by varying the influence coefficients of the different-sized feature maps. The computation takes place as given in Eq. (18).
$$R_p(S, p) = \sum_{q=1}^{Q} \delta_q S_q \quad (18)$$

In Eq. (18), the term $R_p(S, p)$ indicates the output (fused) feature map, $S_q$ denotes the feature map estimated at the $q$-th scale, $p$ represents the total number of available classes, $Q$ is the number of scales in this framework, and $\delta_q$ indicates the weight allocated to the feature map of the $q$-th scale. Following the weighted fusion procedure, the resulting feature map is subjected to a softmax normalization, which ensures that the final result is a probability distribution and yields the pseudo-probabilities of the various categories. Lastly, a classification rule establishes each pixel's final label. The transformer unit is added to the encoder section of the DeepLabV3+ model to form the Trans-DeepLabV3+ model; the transformer's encoder section is utilized for this task. A transformer30 encoding unit is made of a "Multi-Head Self-Attention (MHSA)" unit, a normalization layer, a scaling unit, and a masked layer. All these layers are concatenated and fed forward to the encoder unit of the DeepLabV3+ model. The architecture of the Trans-DeepLabV3+ model is shown in Fig. 4.
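A brief sketch of the weighted fusion and softmax normalization around Eq. (18) is shown below, assuming the per-scale score maps have already been resized to a common resolution; the weighted-sum form follows the description above.

```python
import numpy as np

def weighted_scale_fusion(score_maps, weights):
    """Fuse per-scale score maps with scale weights (Eq. 18 as described),
    then softmax-normalize to per-class pseudo-probabilities.
    score_maps: list of (H, W, num_classes) arrays at a common resolution.
    weights: one scalar weight per scale."""
    fused = sum(w * m for w, m in zip(weights, score_maps))
    exp = np.exp(fused - fused.max(axis=-1, keepdims=True))  # numerically stable softmax
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1)                             # final per-pixel labels
```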
Fig. 4 [Images not available. See PDF.]
Diagrammatic representation of the trans-deeplabv3+ model.
Enhanced image segmentation using developed MSTDeeplabv3+
A potential concern is the redundancy of incorporating the Multi-Scale Retinex (MSRN) technique before a robust multi-scale model such as DeepLabV3+ , which already incorporates sophisticated feature normalization and enhancement mechanisms. The rationale for incorporating MSRN is its capacity to perform low-level color and contrast enhancement, which is complementary to the high-level semantic feature extraction conducted by DeepLabV3+ . In particular, MSRN improves the local dynamic range and color constancy of the input images, which can be especially advantageous in low-contrast scenarios or challenging illumination conditions, where critical features may otherwise be underrepresented in the early phases of the model. The enhanced images from the MSR are given to the developed MSTDeepLabV3+ model, which is designed by providing the multi-scale enhanced input images to a multi-scale transformer model. Three distinct sets of features are obtained from the implemented MSTDeepLabV3+ model's multi-scale transformer unit, one each from the first, second, and third scales of the transformers. These three feature sets are optimally selected using the IRPO algorithm. The three optimally selected feature sets are then concatenated using two weights that are also optimally selected by the IRPO algorithm. The fusion of these weights with the extracted "multi-scale features" from the distinct layers of the transformer at various scales is given by Eqs. (19) and (20).
19
20
In Eqs. (19) and (20), the first term indicates the first set of fused features and the second denotes the final fused features. The finally obtained fused features from the multi-scale transformer unit are then taken as input to the encoder unit of the DeepLabV3+ model for performing the segmentation in the final stages.
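Assuming Eqs. (19) and (20) take the two-step weighted-sum form implied by the description (two optimized weights combining three selected feature sets), the fusion can be sketched as follows; the exact equations are not reproduced here, so the forms below are illustrative assumptions.

```python
def two_step_weighted_fusion(f1, f2, f3, w1, w2):
    """Hypothetical reading of Eqs. (19)-(20): the three IRPO-selected
    scale features (f1, f2, f3) are fused pairwise with the two
    optimized weights (w1, w2). The assumed forms are illustrative."""
    fused_first = w1 * f1 + (1.0 - w1) * f2           # Eq. (19), assumed form
    fused_final = w2 * fused_first + (1.0 - w2) * f3  # Eq. (20), assumed form
    return fused_final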
To perform an effective semantic segmentation task, the features extracted by the multi-scale transformer unit of the deployed MSTDeepLabV3+ model are optimally selected using the executed IRPO algorithm. This optimal selection is carried out to enhance the semantic segmentation performance, whose objective is given by Eq. (21).
21

In Eq. (21), the term Oj indicates the objective function, Hi indicates the dice coefficient to be maximized, and We indicates the accuracy to be maximized. The features are optimally selected within [1, 100], and the weights are optimized in the range [0.01, 0.99]. The dice coefficient and the accuracy are enhanced by optimally selecting the features. The dice coefficient Hi is determined using Eq. (22).

$$H_i = \frac{2\,|M \cap S|}{|M| + |S|} \quad (22)$$

In Eq. (22), the term $M$ represents the mask image and $S$ denotes the segmented image. The value of accuracy is computed with the aid of Eq. (23).
$$W_e = \frac{T_P + T_N}{T_P + T_N + F_P + F_N} \quad (23)$$
In Eq. (23), the term $T_N$ represents the "true negative", $T_P$ indicates the "true positive", $F_N$ denotes the "false negative", and $F_P$ represents the "false positive". Once the multi-scale fused features are obtained from the transformer unit, the semantically segmented images are given as output from the implemented MSTDeepLabV3+ framework. The pictorial illustration of the executed MSTDeepLabV3+ -based semantic segmentation model is shown in Fig. 5.
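The two criteria of Eqs. (22) and (23) can be computed directly from a binary ground-truth mask and a binary segmentation, as in the following sketch.

```python
import numpy as np

def dice_coefficient(mask, segmented):
    """Dice coefficient of Eq. (22) between a binary mask and a
    binary segmentation."""
    mask, segmented = mask.astype(bool), segmented.astype(bool)
    intersection = np.logical_and(mask, segmented).sum()
    return 2.0 * intersection / (mask.sum() + segmented.sum() + 1e-8)

def pixel_accuracy(mask, segmented):
    """Accuracy of Eq. (23): (TP + TN) / (TP + TN + FP + FN),
    i.e., the fraction of pixels labeled correctly."""
    return (mask == segmented).mean()
```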
Fig. 5 [Images not available. See PDF.]
Diagrammatic representation of multi-scale feature-tuned trans-deeplabv3+ model.
Results and discussion
Heuristic and metaheuristic optimizers are frequently preferred over traditional optimization strategies, such as gradient-based approaches or Bayesian optimization, particularly in deep learning, due to their robustness, flexibility, and adaptability to complicated problem landscapes. Deep learning models commonly exhibit extremely non-convex, high-dimensional loss surfaces with local minima, saddle points, and flat areas, which can pose a problem for gradient-based algorithms that rely on smooth, well-behaved gradients. Metaheuristic methods do not require gradient information and may explore the search space more broadly, lowering the chance of being stuck in poor solutions. Furthermore, these optimizers can deal with noisy, discontinuous, or poorly scaled objective functions, which are prevalent in real-world deep-learning tasks. Metaheuristic approaches scale better and are more useful for big neural networks than Bayesian optimization, which is often more successful in low-dimensional settings but computationally expensive in high-dimensional environments. While they do not guarantee optimality, their ability to consistently discover good-enough solutions, as well as their adaptability to many types of neural architectures and loss landscapes, makes them useful tools where stability and convergence behavior are crucial.
To address the concern regarding novelty, we supplemented our work with empirical results demonstrating the comparative performance and stability of the proposed approach against traditional methods. These results are presented in Fig. 6, which highlights the optimizer's effectiveness in avoiding local minima and maintaining stable convergence across diverse training scenarios.
Fig. 6 [Images not available. See PDF.]
Experimental outcome from the proposed semantic segmentation model.
Experimental setup
The generated semantic segmentation model was executed on the Python platform for analysis purposes. The experimental investigation was carried out using the Ubuntu operating system and an NVIDIA Tesla P100 GPU with 16 GB RAM. The installed software consisted of CUDA 11.0, TensorFlow 2.1.0, and the Keras deep learning framework (version 2.3.0). This study used a transformer-based semantic segmentation model with an average inference speed of 42 ms per image at a batch size of 32, allowing real-time processing of more than 23 frames per second. After integrating the multi-scale transformer with DeepLabv3+ , the model has approximately 6.4 million parameters, resulting in a model size of about 25.6 megabytes. The IRPO-MSTDeepLabV3+ was designed with a chromosome length of 3*50 + 2, a maximum iteration count of 50, and a population of size 10. The chromosome length, maximum iterations, and population size were selected to achieve a balance between exploration and computational efficiency: the chromosome length facilitates enough variety in the encoded solutions, while the iteration count and population size ensure that the algorithm identifies effective solutions without imposing undue processing burdens on the system. This setup suits situations with limited processing resources or where rapid convergence is sought without significantly compromising search quality. In this work, two publicly available datasets, the semantic segmentation of aerial imagery and the aerial image segmentation datasets, were used. Around 80% of the collected data was utilized for training the model and the remaining 20% was used for testing the framework. The IRPO-MSTDeepLabV3+ -based semantic segmentation model was evaluated by comparing its performance with conventional approaches like the "Mine Blast Optimization (MBO)-MSTDeepLabV3+ 35, Chameleon Swarm Optimization (CSO)-MSTDeepLabV3+36, Artificial Gorilla Troops Optimizer (AGTO)-MSTDeepLabV3+37, RPO-MSTDeepLabV3+32, Trans-UNet38, ResUNet14, MSTDeepLabV3+39, G-RDA-DeepLabV340, IBWO-AMC-DeepLabV341", respectively. The algorithmic steps of the Wilcoxon non-parametric statistical test are given in Algorithm 2.
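As a minimal illustration of the non-parametric test of Algorithm 2, SciPy's paired Wilcoxon signed-rank test can be applied to per-run scores of two models; the score values below are placeholders, not the reported results.

```python
from scipy.stats import wilcoxon

# Placeholder per-run accuracy scores for two compared models;
# the actual values come from the experiments in Tables 4-7.
irpo_scores = [99.55, 99.54, 99.55, 99.56, 99.53]
rpo_scores = [99.34, 99.35, 99.34, 99.36, 99.33]

stat, p_value = wilcoxon(irpo_scores, rpo_scores)  # paired, non-parametric
print(f"Wilcoxon statistic={stat:.3f}, p={p_value:.4f}")
# p < 0.05 would indicate a statistically significant paired difference
```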
Computational time
The computational time taken by every model used in this work is listed in Table 3. From the table, it is seen that the computational time taken by the executed IRPO-MSTDeepLabV3+ is lower than that of most of the conventional methods.
Table 3. Computational time analysis.
Methods | Computation time (seconds) |
---|---|
MBO-MSTDeepLabv3+ 35 | 124.29 |
CSO-MSTDeepLabv3+ 36 | 123.03 |
AGTO-MSTDeepLabv3+ 37 | 121.65 |
RPO-MSTDeepLabv3+ 32 | 120.03 |
TransUNet38 | 129.01 |
ResUNet14 | 128.02 |
MSTDeepLabV3+ 39 | 123.11 |
G-RDA-DeepLabV340 | 119.23 |
IBWO-AMC-DeepLabV341 | 113.01 |
IRPO-MSTDeepLabV3+ | 120.00 |
Experimental outcomes
The experimental outcomes from the proposed semantic segmentation model for aerial images are provided in Fig. 6. The suggested model specifically targets classes that are frequently pertinent in high-resolution remote sensing imagery, such as buildings, roads, vegetation, water bodies, and land (unpaved area). These categories were chosen due to their significance in a variety of practical applications, including environmental monitoring, urban planning, and disaster response. To elucidate the relevance of the segmentation tasks, Fig. 6 presents visual information on class-specific performance, thereby demonstrating the model's practicality in real-world applications.
The segmentation outcome of the implemented IRPO-MSTDeepLabV3+ model is compared with the segmentation outcomes of various existing models, such as TransUNet, ResUNet, and MSTDeepLabV3+ , and with the mask images. By analyzing the segmentation outcomes, it is seen that more accurate and cleaner segmentation outputs, closer to the mask images, are obtained using the implemented IRPO-MSTDeepLabV3+ model than with the other conventional approaches.
Evaluation measures
The various evaluation measures used to assess the working of the implemented semantic segmentation model are given below.
The sensitivity/recall is calculated with the aid of Eq. (24).
$$\text{Sensitivity} = \frac{T_P}{T_P + F_N} \quad (24)$$
The correlation coefficient between the mask and the segmented image is determined as provided in Eq. (25).
$$CC = \frac{\tfrac{1}{n}\sum_{i=1}^{n}\left(M_i - \mu_M\right)\left(S_i - \mu_S\right)}{\sigma_M\,\sigma_S} \quad (25)$$

In Eq. (25), the term $n$ indicates the sample size, $\sigma$ represents the standard deviation, $\mu$ denotes the mean function, and the subscripts $M$ and $S$ refer to the mask and the segmented image, respectively.
The computation of precision takes place with the help of Eq. (26).

$$\text{Precision} = \frac{T_P}{T_P + F_P} \quad (26)$$

The specificity is calculated with the aid of Eq. (27).

$$\text{Specificity} = \frac{T_N}{T_N + F_P} \quad (27)$$

The F1-score is determined with the help of Eq. (28).

$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \quad (28)$$

The "False Negative Rate (FNR)" is estimated using Eq. (29).

$$FNR = \frac{F_N}{F_N + T_P} \quad (29)$$

The determination of the "False Positive Rate (FPR)" is as provided in Eq. (30).

$$FPR = \frac{F_P}{F_P + T_N} \quad (30)$$

The "Negative Predictive Value (NPV)" is determined using Eq. (31).

$$NPV = \frac{T_N}{T_N + F_N} \quad (31)$$

The calculation for the "False Discovery Rate (FDR)" is given by Eq. (32).

$$FDR = \frac{F_P}{F_P + T_P} \quad (32)$$
where the term $T_N$ represents the "true negative", $T_P$ indicates the "true positive", $F_N$ denotes the "false negative", and $F_P$ represents the "false positive".
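All of the above measures follow directly from the four confusion-matrix counts, as the following sketch shows (counts are assumed to be positive).

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (24)-(32) computed from the confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                  # recall, Eq. (24)
    precision = tp / (tp + fp)                    # Eq. (26)
    specificity = tn / (tn + fp)                  # Eq. (27)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (28)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
        "fnr": fn / (fn + tp),                    # Eq. (29)
        "fpr": fp / (fp + tn),                    # Eq. (30)
        "npv": tn / (tn + fn),                    # Eq. (31)
        "fdr": fp / (fp + tp),                    # Eq. (32)
    }
```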
Performance assessment of implemented semantic segmentation model regarding distinct algorithms
The performance assessment of the implemented model in comparison with various optimization algorithms is given in Tables 4 and 5 for datasets 1 and 2, respectively. Various performance measures, including Type I and Type II measures, are considered for the validation of the executed model. Each performance metric is computed in terms of "statistical measures such as median, worst, mean, standard deviation, and best". On comparison, it is noted that the performance offered by the implemented IRPO-MSTDeepLabV3+ -based model is better across all the metrics, as shown in Tables 4 and 5. The accuracy of the implemented IRPO-MSTDeepLabV3+ -based semantic segmentation model is 0.71%, 0.51%, 0.4%, and 0.1% higher than that of the MBO-MSTDeepLabV3+ , CSO-MSTDeepLabV3+ , AGTO-MSTDeepLabV3+ , and RPO-MSTDeepLabV3+ algorithms, respectively, when the mean is taken as the statistical measure for dataset 2. Thus, it is proved that, out of all the distinct algorithms considered for validation, the implemented IRPO-MSTDeepLabV3+ -based semantic segmentation model offers a better and more accurate outcome for the semantic segmentation of aerial images in an easier and less complicated manner.
Table 4. Performance assessment of the implemented semantic segmentation model regarding distinct algorithms for dataset 1.
Terms/Segmentation models | MBO-MSTDeepLabV3+ 35 | CSO-MSTDeepLabV3+ 36 | AGTO-MSTDeepLabV3+ 37 | RPO-MSTDeepLabV3+ 32 | IRPO-MSTDeepLabV3+ |
---|---|---|---|---|---|
Dice coefficient | |||||
Best | 98.80 | 99.00 | 99.20 | 99.39 | 99.59 |
Worst | 98.71 | 98.90 | 99.10 | 99.31 | 99.51 |
Mean | 98.85 | 99.05 | 99.25 | 99.34 | 99.55 |
Median | 98.85 | 99.05 | 99.25 | 99.35 | 99.55 |
Standard deviation | 0.0002 | 0.0002 | 0.0002 | 0.0001 | 0.0001 |
Accuracy | |||||
Best | 98.79 | 98.99 | 99.19 | 99.39 | 99.59 |
Worst | 98.70 | 98.90 | 99.11 | 99.31 | 99.51 |
Mean | 98.75 | 98.95 | 99.15 | 99.34 | 99.55 |
Median | 98.75 | 98.95 | 99.15 | 99.35 | 99.55 |
Standard deviation | 0.0002 | 0.0002 | 0.0002 | 0.0001 | 0.0001 |
Sensitivity | |||||
Best | 98.81 | 99.00 | 99.20 | 99.40 | 99.60 |
Worst | 98.70 | 98.90 | 99.10 | 99.30 | 99.50 |
Mean | 98.74 | 99.95 | 99.15 | 99.34 | 99.55 |
Median | 98.75 | 99.95 | 99.15 | 99.34 | 99.55 |
Standard deviation | 0.0003 | 0.0002 | 0.0002 | 0.0002 | 0.0002 |
Specificity | |||||
Best | 98.80 | 99.01 | 99.20 | 99.40 | 99.59 |
Worst | 98.70 | 98.90 | 99.10 | 99.29 | 99.49 |
Mean | 98.75 | 98.96 | 99.14 | 99.35 | 99.55 |
Median | 98.75 | 98.96 | 99.14 | 99.35 | 99.55 |
Standard deviation | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002 |
Precision | |||||
Best | 98.81 | 99.01 | 99.20 | 99.40 | 99.60 |
Worst | 98.69 | 98.90 | 99.09 | 99.29 | 99.50 |
Mean | 98.75 | 98.95 | 99.14 | 99.35 | 99.55 |
Median | 98.75 | 98.95 | 99.14 | 99.35 | 99.55 |
Standard deviation | 0.0002 | 0.0002 | 0.0003 | 0.0002 | 0.0002 |
F1-Score | |||||
Best | 98.80 | 99.00 | 99.20 | 99.49 | 99.69 |
Worst | 98.71 | 98.90 | 99.10 | 99.31 | 99.51 |
Mean | 98.85 | 99.05 | 99.25 | 99.40 | 99.65 |
Median | 98.85 | 99.05 | 99.25 | 99.45 | 99.65 |
Standard deviation | 0.0002 | 0.0002 | 0.0002 | 0.0001 | 0.0001 |
Table 5. Performance assessment of the implemented semantic segmentation model against distinct algorithms for dataset 2.
Terms/ Segmentation Models | MBO-MSTDeepLabV3+ 35 | CSO-MSTDeepLabV3+ 36 | AGTO-MSTDeepLabV3+ 37 | RPO-MSTDeepLabV3+ 32 | IRPO-MSTDeepLabV3+ |
---|---|---|---|---|---|
Dice coefficient | |||||
Best | 98.81 | 99.00 | 99.19 | 99.40 | 99.59 |
Worst | 98.70 | 98.91 | 99.10 | 99.30 | 99.50 |
Mean | 98.75 | 98.95 | 99.14 | 99.35 | 99.54 |
Median | 98.75 | 98.96 | 99.15 | 99.35 | 99.55 |
Standard deviation | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002 |
Accuracy | |||||
Best | 98.81 | 99.00 | 99.19 | 99.39 | 99.59 |
Worst | 98.70 | 98.91 | 99.10 | 99.30 | 99.50 |
Mean | 98.75 | 98.95 | 99.14 | 99.35 | 99.54 |
Median | 98.75 | 98.95 | 99.15 | 99.35 | 99.54 |
Standard deviation | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0001 |
Sensitivity | |||||
Best | 98.82 | 99.00 | 99.20 | 99.40 | 99.59 |
Worst | 98.70 | 98.90 | 99.10 | 99.29 | 99.50 |
Mean | 98.75 | 98.95 | 99.14 | 99.35 | 99.54 |
Median | 98.75 | 98.95 | 99.14 | 99.35 | 99.54 |
Standard deviation | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002 |
Specificity | |||||
Best | 98.80 | 99.01 | 99.21 | 99.39 | 99.60 |
Worst | 98.70 | 98.90 | 99.10 | 99.30 | 99.50 |
Mean | 98.75 | 98.95 | 99.15 | 99.35 | 99.55 |
Median | 98.75 | 98.95 | 99.14 | 99.36 | 99.55 |
Standard deviation | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002 |
Precision | |||||
Best | 98.82 | 99.02 | 99.21 | 99.40 | 99.60 |
Worst | 98.69 | 98.90 | 99.09 | 99.29 | 99.50 |
Mean | 98.75 | 98.95 | 99.15 | 99.35 | 99.55 |
Median | 98.75 | 98.95 | 99.15 | 99.36 | 99.54 |
Standard deviation | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002 |
F1-Score | |||||
Best | 98.81 | 99.00 | 99.29 | 99.40 | 99.69 |
Worst | 98.70 | 98.91 | 99.10 | 99.30 | 99.50 |
Mean | 98.85 | 99.05 | 99.14 | 99.45 | 99.54 |
Median | 98.85 | 99.06 | 99.15 | 99.45 | 99.65 |
Standard deviation | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002 |
Performance assessment of the proposed semantic segmentation model versus conventional models
The performance of the implemented semantic segmentation model compared with other segmentation approaches is provided in Tables 6 and 7 for datasets 1 and 2, respectively. The generated semantic segmentation model is evaluated using a range of performance metrics, such as accuracy, dice coefficient, sensitivity, precision, specificity, and F1-score, across the statistical measures of worst, median, best, mean, and standard deviation. It is seen that the deployed IRPO-MSTDeepLabV3+-based semantic segmentation system performs better across various metrics than the other approaches, as Table 6 illustrates. When the mean is taken as the statistical measure for dataset 1, the applied IRPO-MSTDeepLabV3+-based semantic segmentation model achieves a precision that is 4.96%, 7.68%, 4.3%, 1.43%, 4.63%, and 5.07% higher than that of the existing TransUNet, ResUNet, MSTDeepLabV3+, G-RDA-DeepLabV3, IBWO-AMC-DeepLabV3, and HFH-EFO-AA-TransDeepLabV3+ techniques, respectively. Thus, it is demonstrated that the implemented IRPO-MSTDeepLabV3+-based semantic segmentation model provides a more accurate and superior result on the semantic segmentation of aerial images than all the other segmentation models for remote sensing applications.
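A hedged sketch of how relative improvements such as "4.96% higher precision than TransUNet" are conventionally computed is given below: the proposed model's score minus the baseline's, expressed as a percentage of the baseline. The quoted figures are presumably derived from unrounded metric values, so plugging in the rounded entries of Table 6 reproduces them only approximately.

```python
def relative_gain(proposed: float, baseline: float) -> float:
    """Percentage improvement of `proposed` over `baseline`."""
    return (proposed - baseline) / baseline * 100.0

# Rounded mean precisions from Table 6 (dataset 1); the output is about
# +5.0%, close to the 4.96% quoted in the text for TransUNet.
print(f"vs TransUNet: {relative_gain(99.55, 94.8):+.2f}%")
```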
Table 6. Performance assessment of the implemented semantic segmentation model vs. conventional techniques for Dataset 1.
Terms/Segmentation Models | TransUNet38 | ResUNet14 | MSTDeepLabV3+ 39 | G-RDA-Deeplabv340 | IBWO-AMC-Deeplabv341 | IRPO-MSTDeepLabV3+ |
---|---|---|---|---|---|---|
Dice coefficient | ||||||
Best | 94.8 | 93.0 | 95.9 | 98.2 | 99.1 | 99.59 |
Worst | 94.7 | 91.9 | 94.9 | 98.0 | 91.1 | 99.51 |
Mean | 94.8 | 92.4 | 95.4 | 98.1 | 95.1 | 99.55 |
Median | 94.8 | 92.4 | 95.5 | 98.1 | 94.7 | 99.55 |
Standard deviation | 0.000 | 0.002 | 0.002 | 0.000 | 0.026 | 0.0001 |
Accuracy | ||||||
Best | 94.8 | 93.0 | 95.7 | 97.6 | 99.0 | 99.59 |
Worst | 94.7 | 91.9 | 94.7 | 97.5 | 90.0 | 99.51 |
Mean | 94.8 | 92.4 | 95.2 | 97.6 | 94.5 | 99.55 |
Median | 94.8 | 92.4 | 95.3 | 97.6 | 94.0 | 99.55 |
Standard deviation | 0.000 | 0.002 | 0.003 | 0.000 | 0.029 | 0.0001 |
Sensitivity | ||||||
Best | 94.8 | 93.0 | 100.0 | 100.0 | 98.2 | 99.60 |
Worst | 94.7 | 91.8 | 100.0 | 100.0 | 89.5 | 99.50 |
Mean | 94.7 | 92.4 | 100.0 | 100.0 | 94.2 | 99.55 |
Median | 94.7 | 92.4 | 100.0 | 100.0 | 93.9 | 99.55 |
Standard deviation | 0.000 | 0.004 | 0.00 | 0.000 | 0.027 | 0.0002 |
Specificity | ||||||
Best | 94.8 | 93.1 | 91.4 | 93.4 | 100.0 | 99.59 |
Worst | 94.7 | 91.8 | 89.4 | 93.2 | 90.7 | 99.49 |
Mean | 94.8 | 92.5 | 90.5 | 93.3 | 94.9 | 99.55 |
Median | 94.8 | 92.5 | 90.5 | 93.3 | 94.2 | 99.55 |
Standard deviation | 0.000 | 0.003 | 0.005 | 0.001 | 0.033 | 0.0002 |
Precision | ||||||
Best | 94.8 | 93.1 | 92.1 | 96.4 | 100.0 | 99.60 |
Worst | 94.7 | 91.9 | 90.4 | 96.1 | 92.7 | 99.50 |
Mean | 94.8 | 92.5 | 91.3 | 96.3 | 96.1 | 99.55 |
Median | 94.8 | 92.5 | 91.3 | 96.3 | 95.5 | 99.55 |
Standard deviation | 0.000 | 0.003 | 0.005 | 0.001 | 0.025 | 0.0002 |
F1-Score | ||||||
Best | 94.8 | 93.0 | 95.9 | 98.2 | 99.1 | 99.69 |
Worst | 94.7 | 91.9 | 94.9 | 98.0 | 91.1 | 99.51 |
Mean | 94.8 | 92.4 | 95.4 | 98.1 | 95.1 | 99.65 |
Median | 94.8 | 92.4 | 95.5 | 98.1 | 94.7 | 99.65 |
Standard deviation | 0.000 | 0.002 | 0.002 | 0.000 | 0.026 | 0.0001 |
Table 7. Performance assessment of the implemented semantic segmentation model vs. conventional techniques for dataset 2.
Terms/segmentation models | TransUNet38 | ResUNet14 | MSTDeepLabV3+ 39 | G-RDA-Deeplabv340 | IBWO-AMC-Deeplabv341 | IRPO-MSTDeepLabV3+ |
---|---|---|---|---|---|---|
Dice coefficient | ||||||
Best | 94.8 | 93.0 | 95.9 | 98.2 | 99.1 | 99.59 |
Worst | 94.7 | 91.9 | 94.9 | 98.0 | 90.6 | 99.50 |
Mean | 94.8 | 92.4 | 95.4 | 98.1 | 95.1 | 99.54 |
Median | 94.7 | 92.4 | 95.4 | 98.1 | 95.4 | 99.55 |
Standard deviation | 0.000 | 0.003 | 0.003 | 0.000 | 0.027 | 0.0002 |
Accuracy | ||||||
Best | 94.8 | 92.9 | 95.7 | 97.6 | 99.0 | 99.59 |
Worst | 94.7 | 91.9 | 94.6 | 97.5 | 90.0 | 99.50 |
Mean | 94.8 | 92.4 | 95.2 | 97.6 | 94.7 | 99.54 |
Median | 94.7 | 92.4 | 95.2 | 97.6 | 95.0 | 99.54 |
Standard deviation | 0.000 | 0.003 | 0.003 | 0.000 | 0.029 | 0.0001 |
Sensitivity | ||||||
Best | 94.8 | 93.1 | 100.0 | 100.0 | 98.1 | 99.59 |
Worst | 94.7 | 91.9 | 100.0 | 100.0 | 88.9 | 99.50 |
Mean | 94.8 | 92.4 | 100.0 | 100.0 | 94.4 | 99.54 |
Median | 94.8 | 92.4 | 100.0 | 100.0 | 95.4 | 99.54 |
Standard deviation | 0.000 | 0.004 | 0.000 | 0.000 | 0.030 | 0.0002 |
Specificity | ||||||
Best | 94.8 | 93.1 | 91.4 | 93.4 | 100.0 | 99.60 |
Worst | 94.7 | 91.9 | 89.3 | 93.2 | 91.3 | 99.50 |
Mean | 94.7 | 92.4 | 90.3 | 93.3 | 95.0 | 99.55 |
Median | 94.8 | 92.3 | 90.3 | 93.3 | 94.6 | 99.55 |
Standard deviation | 0.000 | 0.004 | 0.006 | 0.001 | 0.029 | 0.0002 |
Precision | ||||||
Best | 94.8 | 93.0 | 92.1 | 96.4 | 100.0 | 99.60 |
Worst | 94.7 | 91.8 | 90.3 | 96.1 | 92.3 | 99.50 |
Mean | 94.7 | 92.4 | 91.2 | 96.3 | 95.7 | 99.55 |
Median | 94.7 | 92.4 | 91.1 | 96.3 | 95.4 | 99.55 |
Standard deviation | 0.000 | 0.004 | 0.005 | 0.001 | 0.026 | 0.0002 |
F1-Score | ||||||
Best | 94.8 | 93.0 | 95.9 | 98.2 | 99.1 | 99.69 |
Worst | 94.7 | 91.9 | 94.9 | 98.0 | 90.6 | 99.51 |
Mean | 94.8 | 92.4 | 95.4 | 98.1 | 95.1 | 99.65 |
Median | 94.7 | 92.4 | 95.4 | 98.1 | 95.4 | 99.65 |
Standard deviation | 0.000 | 0.003 | 0.003 | 0.000 | 0.027 | 0.0001 |
Operation analysis of the executed semantic segmentation model with other algorithms
The operation analysis of the executed semantic segmentation model, when compared with other algorithms, is provided in Figs. 7 and 8 for datasets 1 and 2, respectively. The dice coefficient of the executed IRPO-MSTDeepLabV3+-based semantic segmentation model is 1.01%, 1.27%, and 2.56% higher than that of the AGTO-MSTDeepLabV3+, CSO-MSTDeepLabV3+, and MBO-MSTDeepLabV3+ algorithms, respectively, and is similar to that of the RPO-MSTDeepLabV3+ algorithm for dataset 2 when the worst value is considered as the statistical measure. From the overall analysis, it is observed that the IRPO-MSTDeepLabV3+-based semantic segmentation model delivers enriched performance in aerial image segmentation tasks.
Fig. 7. Validation of the performance of the executed semantic segmentation model against conventional algorithms for dataset 1 in terms of (a) Dice coefficient, (b) Accuracy, (c) Sensitivity, (d) Specificity, (e) Precision, (f) F1-score.
Fig. 8. Validation of the performance of the executed semantic segmentation model against conventional algorithms for dataset 2 in terms of (a) Dice coefficient, (b) Accuracy, (c) Sensitivity, (d) Specificity, (e) Precision, (f) F1-score.
Analyzing the functioning of the generated semantic segmentation model against existing techniques
Figures 9 and 10 illustrate the analysis of the generated semantic segmentation model’s functioning versus that of the existing techniques for datasets 1 and 2, respectively. Analyzing the IRPO-MSTDeepLabV3+-based semantic segmentation framework, it is found that the implemented model produces an effective segmentation result with greater accuracy than the other existing approaches. When the mean is taken as the statistical metric, the accuracy of the implemented IRPO-MSTDeepLabV3+-based semantic segmentation model is 4.26%, 6.52%, 6.52%, 5.38%, 8.89%, and 5.95% superior to that of the HFH-EFO-AA-TransDeepLabV3+, IBWO-AMC-DeepLabV3, G-RDA-DeepLabV3, MSTDeepLabV3+, ResUNet, and TransUNet techniques, respectively. This verifies that the developed IRPO-MSTDeepLabV3+-based framework produces precise semantic segmentation that closely matches the ground-truth images for the provided aerial images, making it helpful for real-time segmentation of aerial images.
Fig. 9. Functioning evaluation of the generated semantic segmentation model against existing models for dataset 1 in terms of (a) Dice coefficient, (b) Accuracy, (c) Sensitivity, (d) Specificity, (e) Precision, (f) F1-score.
Fig. 10. Functioning evaluation of the generated semantic segmentation model against existing models for dataset 2 in terms of (a) Dice coefficient, (b) Accuracy, (c) Sensitivity, (d) Specificity, (e) Precision, (f) F1-score.
Performance assessment of two state-of-the-art models
The performance offered by the two state-of-the-art models for both datasets is provided in Tables 8 and 9. The semantic segmentation performance of these models on the aerial images is analyzed for both datasets. When compared with the implemented IRPO-MSTDeepLabV3+ model, the segmentation outcomes provided by both the MobileNet and the TransMobileNet models are inferior.
Table 8. Performance assessment of the state-of-the-art models for Dataset 1.
Performance metrics | Terms | MobileNet42 | TransMobileNet43 | IRPO-MSTDeepLabV3+ |
---|---|---|---|---|
Dice coefficient | Best | 0.913 | 0.929 | 0.9959 |
| Worst | 0.893 | 0.920 | 0.9951 |
| Mean | 0.904 | 0.924 | 0.9955 |
| Median | 0.904 | 0.925 | 0.9955 |
| Std. deviation | 0.005 | 0.002 | 0.0001 |
Accuracy | Best | 0.913 | 0.928 | 0.9959 |
| Worst | 0.893 | 0.919 | 0.9951 |
| Mean | 0.904 | 0.924 | 0.9955 |
| Median | 0.904 | 0.925 | 0.9955 |
| Std. deviation | 0.004 | 0.002 | 0.0001 |
Sensitivity | Best | 0.914 | 0.931 | 0.9960 |
| Worst | 0.893 | 0.918 | 0.9950 |
| Mean | 0.905 | 0.924 | 0.9955 |
| Median | 0.905 | 0.924 | 0.9955 |
| Std. deviation | 0.006 | 0.003 | 0.0002 |
Specificity | Best | 0.913 | 0.931 | 0.9959 |
| Worst | 0.893 | 0.918 | 0.9949 |
| Mean | 0.903 | 0.924 | 0.9955 |
| Median | 0.904 | 0.924 | 0.9955 |
| Std. deviation | 0.006 | 0.003 | 0.0002 |
Precision | Best | 0.913 | 0.930 | 0.9960 |
| Worst | 0.893 | 0.918 | 0.9950 |
| Mean | 0.904 | 0.924 | 0.9955 |
| Median | 0.904 | 0.924 | 0.9955 |
| Std. deviation | 0.005 | 0.003 | 0.0002 |
F1-Score | Best | 0.913 | 0.929 | 0.9969 |
| Worst | 0.893 | 0.920 | 0.9951 |
| Mean | 0.904 | 0.924 | 0.9965 |
| Median | 0.904 | 0.925 | 0.9965 |
| Std. deviation | 0.005 | 0.002 | 0.0001 |
Table 9. Performance assessment of the state-of-the-art models for dataset 2.
Performance metrics | Terms | MobileNet42 | TransMobileNet43 | IRPO-MSTDeepLabV3+ |
---|---|---|---|---|
Dice coefficient | Best | 0.912 | 0.929 | 0.9959 |
| Worst | 0.894 | 0.919 | 0.9950 |
| Mean | 0.903 | 0.924 | 0.9954 |
| Median | 0.903 | 0.924 | 0.9955 |
| Std. deviation | 0.005 | 0.002 | 0.0002 |
Accuracy | Best | 0.912 | 0.929 | 0.9959 |
| Worst | 0.894 | 0.919 | 0.9950 |
| Mean | 0.903 | 0.924 | 0.9954 |
| Median | 0.903 | 0.924 | 0.9954 |
| Std. deviation | 0.004 | 0.002 | 0.0001 |
Sensitivity | Best | 0.914 | 0.931 | 0.9959 |
| Worst | 0.893 | 0.918 | 0.9950 |
| Mean | 0.904 | 0.924 | 0.9954 |
| Median | 0.905 | 0.923 | 0.9954 |
| Std. deviation | 0.006 | 0.004 | 0.0002 |
Specificity | Best | 0.914 | 0.931 | 0.9960 |
| Worst | 0.893 | 0.918 | 0.9950 |
| Mean | 0.903 | 0.925 | 0.9955 |
| Median | 0.903 | 0.925 | 0.9955 |
| Std. deviation | 0.006 | 0.004 | 0.0002 |
Precision | Best | 0.914 | 0.930 | 0.9960 |
| Worst | 0.893 | 0.918 | 0.9950 |
| Mean | 0.903 | 0.925 | 0.9955 |
| Median | 0.903 | 0.925 | 0.9955 |
| Std. deviation | 0.005 | 0.003 | 0.0002 |
F1-Score | Best | 0.912 | 0.929 | 0.9969 |
| Worst | 0.894 | 0.919 | 0.9951 |
| Mean | 0.903 | 0.924 | 0.9965 |
| Median | 0.903 | 0.924 | 0.9965 |
| Std. deviation | 0.005 | 0.002 | 0.0001 |
Ablation experiments
Ablation experiments were performed on both the Semantic Segmentation of Aerial Imagery and the Aerial Image Segmentation datasets. Compared to UNet, DeepLabv3, DeepLabv3+, and MST-DeepLabv3+, the IRPO-MSTDeepLabv3+ segments tiny buildings, roads, and trees with superior accuracy while presenting a more realistic morphology. The object boundaries obtained when segmenting large structures are more accurate and free of discernible voids, and the segmentation efficacy is much enhanced. Tables 10 and 11 present the baseline comparisons of the proposed IRPO-MSTDeepLabv3+ on the Semantic Segmentation of Aerial Imagery and the Aerial Image Segmentation datasets, respectively; these comparisons show that the proposed method is more effective than the other models, as illustrated by the evaluation sketch following Table 11.
Table 10. Baseline comparisons of IRPO-MSTDeepLabv3+ on semantic segmentation of aerial imagery dataset.
Model | Accuracy | Dice-coefficient |
---|---|---|
UNet (Encoder–Decoder) | 95.16 | 95.38 |
DeepLabv3 | 96.54 | 96.61 |
DeepLabv3+ (CNN + ASPP) | 96.88 | 96.66 |
MST-DeepLabv3+ | 95.40 | 95.70 |
IRPO-MSTDeepLabv3+ | 99.55 | 99.55 |
Table 11. Baseline comparisons of IRPO-MSTDeepLabv3+ on Aerial image segmentation dataset.
Model | Accuracy | Dice-coefficient |
---|---|---|
UNet (Encoder–Decoder) | 95.22 | 95.88 |
DeepLabv3 | 96.53 | 96.66 |
DeepLabv3+ (CNN + ASPP) | 96.23 | 96.39 |
MST-DeepLabv3+ | 95.70 | 95.90 |
IRPO-MSTDeepLabv3+ | 99.59 | 99.59 |
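The following minimal ablation-runner sketch shows how such a comparison can be organized: every architectural variant is evaluated on the same test split and its accuracy and Dice coefficient are tabulated as in Tables 10 and 11. Here `build_variant` and `evaluate` are hypothetical placeholders standing in for the actual training and evaluation pipeline, and the stub scores are copied from Table 10 purely for illustration.

```python
from typing import Callable, Dict, Tuple

def run_ablation(variants: Dict[str, Callable[[], object]],
                 evaluate: Callable[[object], Tuple[float, float]]) -> None:
    # Evaluate each variant on the shared test split and tabulate scores (%).
    print(f"{'Model':<24}{'Accuracy':>10}{'Dice':>8}")
    for name, build_variant in variants.items():
        model = build_variant()
        accuracy, dice = evaluate(model)
        print(f"{name:<24}{accuracy:>10.2f}{dice:>8.2f}")

if __name__ == "__main__":
    # Dummy stubs so the sketch runs; real builders would return trained models.
    variants = {"UNet": lambda: "unet", "MST-DeepLabv3+": lambda: "mst",
                "IRPO-MSTDeepLabv3+": lambda: "irpo"}
    stub_scores = {"unet": (95.16, 95.38), "mst": (95.40, 95.70),
                   "irpo": (99.55, 99.55)}
    run_ablation(variants, evaluate=lambda model: stub_scores[model])
```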
Conclusion
An efficient image segmentation technique was implemented in this work. The aerial images were gathered from standard data sources, and the collected images underwent image enhancement with the help of the MSR approach. The enhanced images were provided as input to the developed MSTDeepLabV3+ model, whose parameters were tuned with the help of the IRPO algorithm, and the final semantically segmented aerial images were obtained from the MSTDeepLabV3+. An experimental setup was implemented to evaluate the performance of the developed MSTDeepLabV3+-based semantic segmentation model. On experimentation, when the mean was taken as the statistical measure for dataset 1, the applied IRPO-MSTDeepLabV3+-based semantic segmentation model achieved a precision 4.96%, 7.68%, 4.3%, 1.43%, 4.63%, and 5.07% higher than that of the existing TransUNet, ResUNet, MSTDeepLabV3+, G-RDA-DeepLabV3, IBWO-AMC-DeepLabV3, and HFH-EFO-AA-TransDeepLabV3+ techniques, respectively. Thus, it is proved that the implemented model provides excellent semantically segmented images, making it an ideal model for remote sensing and land cover analysis tasks. However, this model’s performance degrades on distorted or mid-resolution images; a model capable of semantically segmenting distorted images will therefore be developed in the near future.
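For readers who wish to reproduce the enhancement stage, the sketch below assumes the MSR step follows the classical multi-scale Retinex formulation (log image minus log of a Gaussian-smoothed image, averaged over several scales). The scales, equal weights, and grayscale input are illustrative assumptions, not the settings used in this work.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def multi_scale_retinex(image: np.ndarray, sigmas=(15, 80, 250)) -> np.ndarray:
    img = image.astype(np.float64) + 1.0              # +1 avoids log(0)
    msr = np.zeros_like(img)
    for sigma in sigmas:
        surround = gaussian_filter(img, sigma=sigma)  # illumination estimate
        msr += np.log(img) - np.log(surround)         # single-scale retinex
    msr /= len(sigmas)                                # equal weights per scale
    # Rescale to 8-bit so the result can feed the segmentation network.
    msr = (msr - msr.min()) / (msr.max() - msr.min() + 1e-12)
    return (msr * 255).astype(np.uint8)

enhanced = multi_scale_retinex(np.random.randint(0, 256, (64, 64)).astype(np.uint8))
```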
Acknowledgements
This research was supported by Kyungpook National University Research Fund, 2024.
Author contributions
P. A. and K. L. conceived the experiment(s), A. N. K. and D. J. P. conducted the experiment(s), Y.V.P.K. and R. M. analyzed the results. P. A. and K. L. wrote the main manuscript text, A. N. K. and D. J. P. prepared figures, and Y.V.P.K. and R. M. edited the manuscript. All authors reviewed the manuscript.
Funding
This research was supported by Kyungpook National University Research Fund, 2024.
Data availability
The experimental dataset utilized in this research has been named the “Semantic Segmentation of Aerial Imagery Dataset” and the “Aerial Image Segmentation Dataset”, sourced from the following sites: https://www.kaggle.com/datasets/humansintheloop/semantic-segmentation-of-aerial-imagery?select=Semantic+segmentation+dataset and http://jiangyeyuan.com/ASD/Aerial%20Image%20Segmentation%20Dataset.html.
Declarations
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Li, P; Zhang, H; Guo, Z; Lyu, S; Chen, J; Li, W; Song, X; Shibasaki, R; Yan, J. Understanding rooftop PV panel semantic segmentation of satellite and aerial images for better using machine learning. Adv. Appl. Energy; 2021; 4, 100057.
2. Anilkumar, P; Venugopal, P. Research contribution and comprehensive review towards the semantic segmentation of aerial images using deep learning techniques. Secur. Commun. Netw.; 2022; 202,
3. Hussein, SK; Ali, KH. Semantic segmentation of aerial images using u-net architecture. Iraqi J. Electr. Electronics Eng.; 2021; 18, pp. 58-63.
4. Benjdira, B; Bazi, Y; Koubaa, A; Ouni, K. Unsupervised domain adaptation using generative adversarial networks for semantic segmentation of aerial images. Remote Sens.; 2019; 11,
5. Sun, Y; Zhang, X; Xin, Q; Huang, J. Developing a multi-filter convolutional neural network for semantic segmentation using high-resolution aerial imagery and LiDAR data. ISPRS J. Photogramm. Remote. Sens.; 2018; 143, pp. 3-14.
6. Lee, S; Kim, J. Land cover classification using semantic image segmentation with deep learning. Korean J. Remote Sens.; 2019; 35,
7. Yang, H; Bo, Yu; Luo, J; Chen, F. Semantic segmentation of high spatial resolution images with deep neural networks. GISci. Remote Sens.; 2019; 56,
8. Niu, R; Sun, X; Tian, Y; Diao, W; Chen, K; Fu, K. Hybrid multiple attention network for semantic segmentation in aerial images. IEEE Trans. Geosci. Remote Sens.; 2021; 60, pp. 1-18.
9. Wieland, M; Martinis, S; Kiefl, R; Gstaiger, V. Semantic segmentation of water bodies in very high-resolution satellite and aerial images. Remote Sens. Environ.; 2023; 287, 113452.
10. Mou, L; Hua, Y; Zhu, XX. Relation matters: Relational context-aware fully convolutional network for semantic segmentation of high-resolution aerial images. IEEE Trans. Geosci. Remote Sens.; 2020; 58,
11. Gunawan, AA; Santoso, IA; Irwansyah, E. Semantic segmentation of aerial imagery for road and building extraction with deep learning. ICIC Express Letters; 2020; 14,
12. Volpi, M; Tuia, D. Deep multi-task learning for a geographically-regularized semantic segmentation of aerial images. ISPRS J. Photogramm. Remote. Sens.; 2018; 144, pp. 48-60.
13. Kemker, R; Salvaggio, C; Kanan, C. Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning. ISPRS J. Photogramm. Remote. Sens.; 2018; 145, pp. 60-77.
14. Diakogiannis, FI; Waldner, F; Caccetta, P; Chen, Wu. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote. Sens.; 2020; 162, pp. 94-114.
15. Marmanis, D; Wegner, JD; Galliani, S; Schindler, K; Datcu, M; Stilla, U. Semantic segmentation of aerial images with an ensemble of CNNs. ISPRS Annals Photogr., Remote Sens. Spatial Inform. Sci.; 2016; 3, pp. 473-480.
16. Chen, K; Weinmann, M; Sun, X; Yan, M; Hinz, S; Jutzi, B. Semantic segmentation of aerial imagery via multi-scale shuffling convolutional neural networks with deep supervision. ISPRS Ann. Photogram., Remote Sens. Spat. Inform. Sci.; 2018; 4, pp. 29-36.
17. Osco, LP; Nogueira, K; Marques Ramos, AP; Faita Pinheiro, MM; Furuya, DE; Gonçalves, WN; de Castro Jorge, LA; Marcato Junior, J; dos Santos, JA. Semantic segmentation of citrus-orchard using deep neural networks and multispectral UAV-based imagery. Prec. Agricult; 2021; 22,
18. Zhao, Y., Peng G., Han G., & Xiuwan, C. Depth-assisted ResiDualGAN for cross-domain aerial images semantic segmentation. arXiv preprint, (2022).
19. Luo, H; Chen, C; Fang, L; Zhu, Xi; Lijing, Lu. High-resolution aerial images semantic segmentation using deep fully convolutional network with channel attention mechanism. IEEE J. Select. Top. Appl. Earth Observ. Remote Sens.; 2019; 12,
20. Audebert, N; Le Saux, B; Lefèvre, S. Segment-before-detect: Vehicle detection and classification through semantic segmentation of aerial images. Remote Sens.; 2017; 9,
21. Chai, D; Newsam, S; Huang, J. Aerial image semantic segmentation using DCNN predicted distance maps. ISPRS J. Photogramm. Remote. Sens.; 2020; 161, pp. 309-322.
22. Chen, K; Kun, Fu; Yan, M; Gao, X; Sun, X; Wei, X. Semantic segmentation of aerial images with shuffling convolutional neural networks. IEEE Geosci. Remote Sens. Lett.; 2018; 15,
23. Azimi, SM; Fischer, P; Körner, M; Reinartz, P. Aerial LaneNet: Lane-marking semantic segmentation in aerial imagery using wavelet-enhanced cost-sensitive symmetric fully convolutional neural networks. IEEE Trans. Geosci. Remote Sens.; 2018; 57,
24. De Silva, KDM; Lee, HJ. Distorted aerial images semantic segmentation method for software-based analog image receivers using deep combined learning. Appl. Sci.; 2023; 13,
25. Moazzam, SI; Khan, US; Qureshi, WS; Nawaz, T; Kunwar, F. Towards automated weed detection through two-stage semantic segmentation of tobacco and weed pixels in aerial Imagery. Smart Agricult. Technol.; 2023; 4, 100142.
26. Lingwal, S; Bhatia, KK; Singh, M. Semantic segmentation of landcover for cropland mapping and area estimation using Machine Learning techniques. Data Intell.; 2023; 5,
27. Gomaa, A; Abdelwahab, MM; Abo-Zahhad, M. Efficient vehicle detection and tracking strategy in aerial videos by employing morphological operations and feature points motion analysis. Multimed. Tools Appl.; 2020; 79, pp. 26023-26043. [DOI: https://dx.doi.org/10.1007/s11042-020-09242-5]
28. A. Gomaa, Advanced domain adaptation technique for object detection leveraging semi-automated dataset construction and enhanced YOLOv8, In: 2024 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES), Giza, Egypt, 2024, pp. 211–214, https://doi.org/10.1109/NILES63360.2024.10753164.
29. Gomaa, A; Abdalrazik, A. Novel deep learning domain adaptation approach for object detection using semi-self building dataset and modified YOLOv4. World Electr. Veh. J.; 2024; 15,
30. Gomaa, A; Saad, OM. Residual channel-attention (RCA) network for remote sensing image scene classification. Multimed. Tools Appl.; 2025; [DOI: https://dx.doi.org/10.1007/s11042-024-20546-8]
31. Zotin, A. Fast algorithm of image enhancement based on multi-scale retinex. Procedia Comput Sci.; 2018; 131,
32. Rabie, AH; Saleh, AI; Mansour, NA. Red piranha optimization (RPO): A natural inspired meta-heuristic algorithm for solving complex optimization problems. J. Ambient Intell. Humaniz Comput.; 2023; 14,
33. Li, Z; Wang, R; Zhang, W; Fengmin, Hu; Meng, L. Multiscale features supported DeepLabV3+ optimization scheme for accurate water semantic segmentation. IEEE Access; 2019; 7, pp. 155787-155804.
34. Nascimento, EG; de Melo, TA; Moreira, DM. A transformer-based deep neural network with wavelet transform for forecasting wind speed and wind energy. Energy; 2023; 278, 127678.
35. Alweshah, M; Alkhalaileh, S; Albashish, D; Mafarja, M; Bsoul, Q; Dorgham, O. A hybrid mine blast algorithm for feature selection problems. Soft. Comput.; 2021; 25, pp. 517-534.
36. Braik, MS. Chameleon Swarm Algorithm: A bio-inspired optimizer for solving engineering design problems. Expert Syst. Appl.; 2021; 174, 114685.
37. Abdollahzadeh, B; Soleimanian Gharehchopogh, F; Mirjalili, S. Artificial gorilla troops optimizer: A new nature-inspired metaheuristic algorithm for global optimization problems. Int. J. Intell. Syst.; 2021; 36,
38. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L. & Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation, arXiv, (2021).
39. Chen, S; Wei, X; Zheng, W. ASA-DRNet: An improved Deeplabv3+ framework for SAR image segmentation. Electronics; 2023; 12,
40. Anilkumar, P; Venugopal, P. An enhanced multi-objective-derived adaptive deeplabv3 using g-rda for semantic segmentation of aerial images. Arab. J. Sci. Eng.; 2023; 48,
41. Anilkumar, P; Venugopal, P. An adaptive multichannel DeepLabv3+ for semantic segmentation of aerial images using improved Beluga Whale Optimization Algorithm. Multim. Tools Appl.; 2023; 11, pp. 106688-106705.
42. Ahmed, I; Ahmad, M; Jeon, G. A real-time efficient object segmentation system based on U-Net using aerial drone images. J. Real-Time Image Proc.; 2021; 18, pp. 1745-1758.
43. Jiaqi, F; Bingzhao, G; Quanbo, G; Yabing, R; Jia, Z; Hongqing, C. SegTransConv: transformer and CNN hybrid method for real-time semantic segmentation of autonomous vehicles. IEEE Trans. Intell. Transport Syst; 2023; 25,
© The Author(s) 2025. This work is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Abstract
The use of higher-resolution spatial aerial images for semantic segmentation in everyday tasks has increased due to recent advancements in remote sensing and several other applications. Nonetheless, supervised learning necessitates a substantial quantity of images with pixel-level labeling. Currently available techniques, which are mostly Deep Semantic Segmentation Networks (DSSN), might not be appropriate for application domains with a dearth of labeled target masks. For semantic segmentation of higher-quality aerial images, multi-scale semantic details have to be extracted. Many techniques have been executed in recent years to increase the networks’ capacity to capture multi-scale details in a variety of ways; however, these techniques consistently exhibit poor efficiency regarding speed and accuracy when dealing with aerial images. In this work, an effective image semantic segmentation method utilizing deep learning techniques is designed using a heuristic technique. Standard information sources are used to collect the aerial photos, and the Multi-Scale RetiNex (MSRN) technique is employed to enhance the color quality of the obtained images. The Multiscale Feature Tuned-Trans-DeepLabV3+ (MSTDeepLabV3+) system then receives the improved image as input for the feature extraction task, and the Improved Red Piranha Optimization (IRPO) approach is deployed to fine-tune the MSTDeepLabV3+ parameters. The MSTDeepLabV3+ provides the final semantically segmented aerial images. To assess how well the implemented model performs, an experimental setup is carried out, and the simulation outcomes prove the excellent performance offered by the executed model.
Details
1 Department of Electronics and Communication Engineering, Mother Theresa Institute of Engineering and Technology, 517408, Palamaner, Andhra Pradesh, India (ROR: https://ror.org/03h56sg55) (GRID: grid.418403.a) (ISNI: 0000 0001 0733 9339)
2 Department of Artificial Intelligence and Data Science, Mother Theresa Institute of Engineering and Technology, 517408, Palamaner, Andhra Pradesh, India (ROR: https://ror.org/03h56sg55) (GRID: grid.418403.a) (ISNI: 0000 0001 0733 9339)
3 Department of Science and Humanities, Mother Theresa Institute of Engineering and Technology, 517408, Palamaner, Andhra Pradesh, India (ROR: https://ror.org/03h56sg55) (GRID: grid.418403.a) (ISNI: 0000 0001 0733 9339)
4 School of Electronics Engineering, VIT-AP University, 522241, Amaravati, Andhra Pradesh, India (ROR: https://ror.org/007v4hf75)
5 Department of Artificial Intelligence, School of Electronics Engineering, Kyungpook National University, 37224, Daegu, Republic of Korea (ROR: https://ror.org/040c17130) (GRID: grid.258803.4) (ISNI: 0000 0001 0661 1556)