1. Introduction

With the advancement of remote sensing technology, large numbers of HRSIs are available [1,2]. To make full use of HRSI, HRSI semantic segmentation has received extensive attention from researchers. HRSI semantic segmentation aims to produce the pixel-level classification map from one HRSI, which is a fundamental and critical problem in the remote sensing image understanding domain [3]. It has important application value in remote sensing image interpretation tasks, such as building extraction [4,5], road extraction [6,7], water extraction [8,9] and land-use classification [10,11]. As HRSIs often have rich structures and complex contextual features, HRSI semantic segmentation is still a very challenging research task.

In the past decade, deep learning-based methods have achieved great success in many image tasks [12,13]. For semantic segmentation, the mainstream methods are deep convolutional neural networks (DCNNs), which achieve excellent performance due to their powerful representation capabilities. After Long et al. [14] proposed the milestone fully convolutional network (FCN), many superior DCNNs were put forward, such as UNet [15,16], Deeplab [17,18], and HRNet [19,20]. In addition, a variety of DCNN-based methods dominate the field of HRSI semantic segmentation [10,11,21]. Unlike traditional methods [22,23,24], DCNNs can automatically extract rich features from images by fitting the model parameters to massive labeled data [25]. Generally, the basic difference between these deep learning models is their architecture. For instance, UNet [15] constructs a U-shaped structure and concatenates early features with the corresponding up-sampled features. Deeplab v3+ [18] introduces dilated convolution to expand the receptive field and fuses low-level and high-level features. HRNet [19] stacks blocks layer by layer and maintains the high spatial resolution of features. Therefore, the problem of creating a better classification method becomes one of creating a more sophisticated DCNN model. Nevertheless, compared with designing features for traditional methods, designing a DCNN is a black-box optimization problem [14], which requires a high level of expert knowledge and expensive computation. Meanwhile, the prior knowledge built into a hand-designed model may be task-dependent, which means that different model architectures have to be designed for different segmentation tasks (water extraction [9], building extraction [4], land-use classification [26]). Naturally, automatic design of the network architecture in a data-driven manner is a promising solution for the various applications in remote sensing.

Neural Architecture Search (NAS), the process of automating architecture engineering, is thus a logical next step for semantic segmentation. NAS aims to obtain a superior model architecture on a target dataset directly, without human design [27]. NAS can be summarized in three components: (1) the architecture search space; (2) the search strategy, which selects a sub-model for the following step; and (3) the performance estimation strategy, which receives the model architecture from the previous step and returns its performance [28]. After several years of development, the main NAS methods can be divided into four types: (1) reinforcement learning (RL); (2) evolutionary algorithms (EA); (3) one-shot NAS; and (4) gradient-based NAS. Early methods based on RL [27,29] or EA [30,31] train candidate architectures from scratch and iteratively update the controller (RL) or the population (EA) with the estimation results, which is extremely time-consuming (thousands of GPU days). To overcome this problem, later methods introduced a weight-sharing mechanism; they can be divided into one-shot NAS [32,33] and gradient-based NAS [34,35]. One-shot methods first define a super-net, instead of sampling candidate models, to avoid training every model separately. Benefiting from the weight-sharing mechanism, these methods only need to train the super-net once, and its parameters are shared with all subnets. Subnets are then sampled together with their inherited parameters and evaluated directly on the validation dataset until a satisfactory model is found. Gradient-based methods use the weight-sharing strategy as well, but unlike one-shot NAS, they relax the discrete search space into a continuous super-net with architecture parameters, which are optimized during the search [34]. Most existing NAS methods are designed for natural images. However, due to the large differences between remote sensing images and natural images, it is usually inappropriate to transfer these NAS methods directly to the remote sensing domain.

Most recently, NAS has gradually been applied to remote sensing image scene classification and has achieved excellent results [36,37,38,39,40]. However, there has been little research on the dense prediction task of semantic segmentation. RSNet [39] and Auto-deeplab [35] follow the weight-sharing strategy and design different search spaces to obtain the segmentation model. Notably, although the weight-sharing strategy decreases the search time significantly, the super-net is much larger than an ordinary model and therefore occupies more GPU memory. Researchers who lack sufficient computing resources thus have no choice but to reduce the size of the input images, which is particularly detrimental for remote sensing image segmentation. Because remote sensing images have very high resolution, each cropped image covers a smaller field of view and lacks global information. Meanwhile, more cropping also means more border effects. All of these factors affect the performance of NAS and of the final searched model. Moreover, semantic segmentation, as a dense prediction task, needs a more complex model architecture, which places higher requirements on the design of the super-net.

To address the aforementioned problems, this paper proposes a novel NAS framework, shown in Figure 1, which accommodates a sufficient input image size while offering a larger search space. To achieve that, our approach adopts a decoupling strategy, which is reflected in both the super-net design and the search method. For the super-net design, we first abstracted the DCNN architecture into three aspects: up- and down-sampling, skip connections, and operators. Then, inspired by the two-level (network-level, cell-level) hierarchical super-net in [35], we designed a three-level (path-level, connection-level, cell-level) super-net corresponding to these aspects. Moreover, thanks to the decoupling strategy, the complexity of the super-net is increased progressively. For the search method, we followed the gradient-based methods [34] to train the super-net level by level. Benefiting from the continuous search space, the super-net at each level can be trained end-to-end, which makes it more flexible and makes it easier to decode a superior subnet. After obtaining the subnet at each level, we used it to generate the super-net of the next level, so that the whole decoupled search is coherent. Although the search process is divided into three stages and the super-net needs to be trained and regenerated three times, the search is still efficient. Finally, we obtained a three-level NAS framework that covers as many network architecture candidates as possible while keeping memory consumption acceptable for HRSI. Extensive experiments on the GID [26] and FU [41] datasets demonstrated that our proposed DNAS can outperform state-of-the-art handcrafted DCNNs (e.g., PSPNet [42], HRNet [19], and MSFCN [11]) and some existing NAS methods [35,43]. We have released the source code for DNAS at https://github.com/faye0078/DNAS.

To summarize, our main contributions are as follows:

We propose a novel decoupling NAS (DNAS) framework for HRSI semantic segmentation, which can automatically design DCNNs in a data-driven manner. The effectiveness of the proposed framework was verified on the GID and FU datasets.
A three-level super-net design is proposed for the first time in this work. Compared with the existing NAS methods, our hierarchical super-net can cover many more network architecture candidates on the same hardware.
A decoupling search strategy is designed for the three-level super-net, which largely reduces the GPU memory consumed while training the super-net. Therefore, our method can adapt to HRSIs and is able to search for a more suitable network architecture.

The rest of this paper is organized as follows. Related work is discussed in Section 2. Section 3 describes our proposed method. In Section 4, we show the evaluation experimental results. Finally, Section 5 concludes this paper.

2. Related Work

In the following, we first review the development of NAS methods in natural image processing. Then, we present a brief description of existing NAS methods for HRSI processing, including classification and semantic segmentation.

2.1. NAS Methods for Natural Image Processing

Over the past years, many NAS methods have been proposed to promote automatic architecture engineering and have achieved excellent success in different image processing tasks, such as image classification [27,34] and semantic segmentation [35,43]. Early NAS works [27,44] mostly targeted image classification and directly searched the whole network architecture based on RL or EA. All candidates were trained from scratch to validate their performance. Though these works achieved impressive results, their expensive computation overheads hindered their application to common downstream tasks (e.g., semantic segmentation). To alleviate this situation, researchers modified NAS methods in two aspects: search space and search strategy. For the search space, researchers proposed restricted search spaces. NASNet [45] first introduced the cell-based search space and searched for a single cell architecture, which constrained the size of the search space and made the search process easier. After that, many works [35,43,46] followed the cell-based search space design. For the search strategy, the weight-sharing mechanism was introduced to NAS, which meant that the search process no longer needed to train each candidate from scratch, but only one super-net. Gradient-based NAS [34,47] and one-shot NAS [33,48] methods both adopt this mechanism. Benefiting from improvements in these two aspects, NAS methods have gradually been used for more complex semantic segmentation tasks. Because semantic segmentation demands both global contextual semantic information and local detailed features, researchers began to design more flexible search spaces containing multi-scale paths. For example, Auto-deeplab [35] introduced a scale path into the search space for the first time and designed a two-level hierarchical search space. DCNAS [49] designed a more complex super-net structure than Auto-deeplab, which contains cross-layer connections in the search space and adopts a lighter cell structure. In this work, we designed a more flexible search space to extract multi-scale features.

2.2. NAS Methods for HRSI Processing

Currently, most of the NAS methods for HRSI focus on image classification. The works in [36,38,40] follow the cell-based search space and the gradient-based search strategy. In [37], an EA-based NAS method is used, in which the computational complexity and the performance error of the searched network are balanced by multi-objective optimization. Only a few works have targeted HRSI semantic segmentation. RSNet [39] follows the two-level hierarchical search space of Auto-deeplab and adopts the gradient-based search strategy. RSBNet [50] is based on the one-shot NAS method and uses EA to search for the optimal architecture. Notably, RSNet and RSBNet both search only the backbone architecture, which is manually equipped with different recognition heads in the final retraining process.

Overall, the works most similar to ours are Auto-deeplab [35], DCNAS [49], and RSNet [39]. They all use the gradient-based NAS method and design a multi-scale search space for semantic segmentation. However, these methods use a huge super-net, which makes it difficult to feed input images large enough for the super-net to converge efficiently, especially for HRSI. As a result, the NAS result may be adversely affected.

3. Methodology

In DNAS, three levels of decoupling NAS framework are proposed for HRSI semantic segmentation. We summarize the topology of the DCNN architecture as follows: scale selection path, cross-layer connection, and convolution operator. According to these three characteristics, we constructed a path-level search space, connection-level search space, and cell-level search space, respectively, which decouples the search space so that the entire search process is divided into three stages and forms a progressive search strategy. The overall pipeline of our method is illustrated in Figure 1. In the following subsections, we firstly explain the three-level search space in detail and, then, we expound on the gradient-based search method and decoding method for different levels of the super-net.

3.1. Decoupling Architecture Search Space

Inspired by the idea of a hierarchical search space [35], we divided the search space into three levels and constructed them in a decoupled manner. For the path-level, we used a path search space similar to [35], which can cover various options for scaling paths. For the connection-level, we added skip connections between cells to aggregate features from the preceding modules and expand the search space. For the cell-level, we populated each cell with specific operators, which are prevalent in modern DCNNs.

3.1.1. Path-Level Search Space

The selection of various scale features is crucial for semantic segmentation, and a key aspect influencing the scale features is the network scale path. Therefore, to obtain an optimal scale path, we adopted the skeleton hyper-network in [35] as the path search space, as shown in Figure 2. Unlike [35], we only used a simple 3 × 3 convolution kernel to fill each cell, focusing the search process on scale path while reducing the display memory consumption of the super-net. Specifically, the super-net structure could be abstracted as a directed acyclic graph, and each cell was simply connected to the adjacent cells before and after through three sampling methods: down-sampling, up-sampling, and keeping. The input feature $F_{l}^{s}$ of each cell can be expressed as:

(1) $F_{l}^{s} = β_{l}^{\frac{s}{2} \to s} \mathrm{Conv} (F_{l - 1}^{\frac{s}{2}}) + β_{l}^{s \to s} \mathrm{Conv} (F_{l - 1}^{s}) + β_{l}^{2 s \to s} \mathrm{Conv} (F_{l - 1}^{2 s})$

where $(s, l) \in ℂ_{1} = S \times L$ represents all cells in the search space, $S = \{4, 8, 16, 32\}$ represents the scale space of the super-net, $L$ represents the length of the super-net, $\mathrm{Conv}$ represents the 3 × 3 convolution operation, and $β_{l}^{s^{'} \to s}$ represents the architecture weight of the path-level super-net, which is constrained by the softmax function in the process of super-net training as:

(2) $β_{l - 1 \to l}^{s_{1} \to s_{2}} = \frac{\exp (β_{l - 1 \to l}^{s_{1} \to s_{2}})}{\sum_{s \in S} \exp (β_{l - 1 \to l}^{s \to s_{2}})}$

In the same way as [34], we used the gradient-based NAS method for the super-net. The architecture weight $β_{l}^{s^{'} \to s}$ is optimized during super-net training and selected during super-net decoding, which is described in detail in Section 3.2. Since each cell is filled with only a single convolution kernel, the entire super-net is lighter than that of [35] and can process higher-resolution images and larger batch sizes. Meanwhile, the lightweight cell structure focuses the search on scale selection, which leads to a better scale path.
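The weighting in Equations (1) and (2) can be illustrated with a minimal, dependency-free sketch. It is not the released implementation: features are stood in for by flat Python lists, the resampling and 3 × 3 convolution steps are assumed to have been applied already, and the names `softmax` and `mix_inputs` are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of raw architecture weights."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mix_inputs(raw_beta, candidates):
    """Softmax-weighted sum of the candidate inputs of one cell.

    raw_beta:   unnormalized weights for the three incoming paths
                (s/2 -> s, s -> s, 2s -> s), as in Equation (2).
    candidates: the three incoming features, assumed to be already
                resampled and convolved to the same length (toy stand-in
                for tensors of identical shape).
    """
    weights = softmax(raw_beta)
    length = len(candidates[0])
    return [
        sum(w * feat[k] for w, feat in zip(weights, candidates))
        for k in range(length)
    ]

# Equal raw weights give an equal mixture of the three incoming features.
mixed = mix_inputs([0.0, 0.0, 0.0], [[3.0, 3.0], [6.0, 6.0], [9.0, 9.0]])
```

Because the mixture is differentiable with respect to the raw weights, those weights can be trained by gradient descent together with the convolution parameters, which is the property the gradient-based search relies on.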

3.1.2. Connection-Level Search Space

After searching for the scale path, we fixed the searched path as P and added cross-layer connections with forward cells to it, as shown in Figure 3. In contrast to the scale path search space, the path of the backbone network was fixed, and the corresponding network architecture parameters were no longer updated. At the same time, we deleted lower-scale cells and introduced complex connections to the scale path P, which was regarded as a critical path. Specifically, we added cross-layer connections between the cells on P and the forward cells and retained high-scale cells, which further increased the complexity of the connections between cells. Each cell still used a single convolution operator. Finally, the cells were divided into normal cells and complex cells according to whether they were on the critical path P. Normal cells, like the cells of the path-level search space, were only connected with the adjacent cells before and after them by up- and down-sampling. Let $γ$ represent the network parameters; the input features of normal cells can then be expressed as:

(3) $F_{l}^{s} = γ_{l - 1 \to l}^{\frac{s}{2} \to s} \mathrm{Conv} (F_{l - 1}^{\frac{s}{2}}) + γ_{l - 1 \to l}^{s \to s} \mathrm{Conv} (F_{l - 1}^{s}) + γ_{l - 1 \to l}^{2 s \to s} \mathrm{Conv} (F_{l - 1}^{2 s})$

For complex cells, due to the addition of cross-layer connections with previous cells, the input features are represented as:

(4) $F_{l_{2}}^{s_{2}} = \sum_{s \in S, l < l_{2}} γ_{l \to l_{2}}^{s \to s_{2}} \mathrm{Conv} (F_{l}^{s}) + β_{l_{1} \to l_{2}}^{s_{1} \to s_{2}} \mathrm{Conv} (F_{l_{1}}^{s_{1}})$

and the network parameters are constrained by the softmax function as:

(5) $γ_{l_{1} \to l_{2}}^{s_{1} \to s_{2}} = \frac{\exp (γ_{l_{1} \to l_{2}}^{s_{1} \to s_{2}})}{\sum_{s \in S, l < l_{2}} \exp (γ_{l \to l_{2}}^{s \to s_{2}})}$

where $(s_{2}, l_{2})$ denote the scale of the output feature and the corresponding layer position in the super-net, and $(s_{1}, l_{1})$ denote the scale of the preceding cell on the critical path and its corresponding layer position in the super-net.

Using the gradient-based NAS method, $γ$ is also optimized in the process of super-net training and selected in the process of super-net decoding. We highlighted the role of cross-layer connections in the search process by fixing the network parameters $β$ of the critical path, while still populating cells with a simple convolution operator.

3.1.3. Cell-Level Search Space

In the cell-level search space, we defined the operators of each cell, as shown in Figure 4. Specifically, the single 3 × 3 convolution operation used in the path-level and connection-level search spaces was replaced by a richer set of operators:

(6) $\mathrm{Conv} (F_{l}^{s}) \to \sum_{op \in OP} {(α_{l}^{s})}_{op} \, op (F_{l}^{s})$

The operators in $OP$ are popular in current deep network designs, as shown in Table 1. Compared to [35,39,49], we used a wider variety of operators, including different types of pooling operators and convolution operators of different types, kernel sizes, and dilation rates. Thanks to the decoupled search space design, fewer computing resources are consumed even though the number of operators increased. In addition, since the cross-layer connections between cells have already been considered in the connection-level search space, identity (skip) operations were not included in these operators.

In the cell-level search space, the topology between cells was fixed, and only the network architecture weight $α$ of the operator was updated in the process of super-net optimization.

3.2. Optimization and Decoding of Differentiable Decoupling Search Spaces

The network architecture parameters $(β, γ, α)$ were used in the construction of the search space of the above three-level search spaces. We directly trained the entire super-net, and decoded the search results through the network architecture parameters. Below, we elaborate on the training optimization of the three-level super-net and the corresponding decoding process.

3.2.1. Optimization of Decoupling Search Spaces

Inspired by [35], when conducting a differentiable search, we divided the training data into two parts, which were used to optimize the original weight parameters of the model and the architecture parameters introduced when building the search space, respectively. The three-level search space contains the network architecture parameters $(β, γ, α)$, which represent scale paths, cross-layer connections, and operators, respectively. Let $ξ_{i} (i = 1, 2, 3)$ denote $β, γ, α$, and let $ω$ denote the weight parameters of the network model. The super-net optimization process can be expressed as follows: (1) Divide the training data equally into training set A and training set B. (2) Update the network weight parameters $ω$ by calculating $\nabla_{ω} L_{trainA} (ω, ξ_{i})$ on training set A. (3) Update the architecture parameters $ξ_{i}$ by calculating $\nabla_{ξ_{i}} L_{trainB} (ω, ξ_{i})$ on training set B, where $L$ is the cross-entropy loss function. After the super-net training is complete, the searched network structure can be obtained by different decoding strategies.
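The alternating update in steps (1) to (3) can be sketched on a toy problem. The example below is only illustrative and makes several assumptions that are not in the text: a scalar weight `omega`, two hypothetical candidate operators (`x` and `-x`) mixed by `softmax(xi)`, a squared-error loss in place of cross entropy, plain gradient descent in place of SGD and Adam, and finite-difference gradients to avoid any dependency.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, omega, xi):
    # Two hypothetical candidate operators, x and -x, mixed by softmax(xi).
    p = softmax(xi)
    return omega * (p[0] * x + p[1] * (-x))

def loss(data, omega, xi):
    # Mean squared error (a stand-in for the cross-entropy loss in the text).
    return sum((predict(x, omega, xi) - y) ** 2 for x, y in data) / len(data)

def grad(f, params, eps=1e-6):
    # Central finite differences keep the sketch dependency-free.
    g = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + eps] + params[i + 1:]
        down = params[:i] + [params[i] - eps] + params[i + 1:]
        g.append((f(up) - f(down)) / (2 * eps))
    return g

def search(train_a, train_b, steps=200, lr_omega=0.05, lr_xi=0.01):
    omega, xi = 1.0, [0.0, 0.0]
    for _ in range(steps):
        # Step (2): update the network weight on split A, architecture fixed.
        g_omega = grad(lambda p: loss(train_a, p[0], xi), [omega])[0]
        omega -= lr_omega * g_omega
        # Step (3): update the architecture parameters on split B, weight fixed.
        g_xi = grad(lambda p: loss(train_b, omega, p), xi)
        xi = [v - lr_xi * g for v, g in zip(xi, g_xi)]
    return omega, xi

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # toy target: y = 2x
train_a, train_b = data[:2], data[2:]                    # step (1): equal split
omega, xi = search(train_a, train_b)
```

In this toy setting the architecture parameter of the useful operator grows relative to the other one, which mirrors how the learned $ξ_{i}$ are later read out during decoding.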

Since we designed a three-level search space, the training of the super-net needs to be performed three separate times. However, the total time of the three searches does not increase compared to [35,39] because the super-net in each search stage is sufficiently lightweight compared to the coupled methods.

3.2.2. Decoding of Decoupling Search Spaces

After completing the search phase of each search space, the optimized network structure parameters need to be decoded to construct the optimal network architecture obtained from the search. In our work, the network structure parameters of the three stages needed to be decoded separately. These decoding processes are described separately below.

(1). Scale structure parameter decoding: In this search space, since the structure parameter $β$ of each cell satisfied Equation (2), the parameter could be regarded as the probability of each scale selection. Therefore, the goal of decoding was to find a path with the highest selection probability. We used the Viterbi algorithm [51] to solve the path with the highest probability. After obtaining the optimal path, P, the construction of the connection search space needed to rely on the decoding result. The complex cell in Figure 2 is the optimal path obtained by decoding at this time.
(2). Connection structure parameters decoding: In order to ensure that the cross-layer connection was fully utilized, we used the decoding strategy in [49], and selected the connection with $γ > 0$ for the $γ$ before normalization in Equation (5). As mentioned above, the construction of the operator search space depended on the connection decoding and path decoding results.
(3). Operator structure parameters decoding: Since more candidate operators were used than in [35,39,49], we selected, for each cell, the three operators with the largest structure parameter $α$ as the search results.
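The three decoding rules above can be sketched as follows. This is a simplified illustration under stated assumptions, not the released code: `beta[l][i][j]` is a hypothetical layout holding the normalized weight of moving from scale index `i` at layer `l` to scale index `j` at the next layer, only moves between adjacent scales are allowed, the path is assumed to start at the finest scale, and the connection and operator names are invented for the example.

```python
import math

def viterbi_scale_path(beta, start=0):
    """Return the scale-index path with the highest product of weights."""
    num_scales = len(beta[0])
    neg_inf = float("-inf")
    score = [neg_inf] * num_scales
    score[start] = 0.0                      # log-probability of the start cell
    back = []
    for layer in beta:
        new_score = [neg_inf] * num_scales
        pointers = [None] * num_scales
        for j in range(num_scales):
            for i in range(num_scales):
                # Skip non-adjacent scales, unreachable cells, and zero weights.
                if abs(i - j) > 1 or score[i] == neg_inf or layer[i][j] <= 0.0:
                    continue
                cand = score[i] + math.log(layer[i][j])
                if cand > new_score[j]:
                    new_score[j] = cand
                    pointers[j] = i
        score = new_score
        back.append(pointers)
    # Backtrack from the best final cell.
    end = max(range(num_scales), key=lambda j: score[j])
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

def select_connections(raw_gamma):
    """Keep cross-layer connections whose unnormalized weight is positive."""
    return [name for name, value in raw_gamma.items() if value > 0]

def top_operators(alpha, k=3):
    """Keep the k operators with the largest architecture weight."""
    return sorted(alpha, key=alpha.get, reverse=True)[:k]

# Toy example with three scales and two transitions.
beta = [
    [[0.6, 0.4, 0.0], [0.3, 0.4, 0.3], [0.0, 0.5, 0.5]],
    [[0.5, 0.5, 0.0], [0.05, 0.05, 0.9], [0.0, 0.5, 0.5]],
]
path = viterbi_scale_path(beta)
```

In the toy example, a greedy choice would stay at the first scale after the first transition (0.6 > 0.4), whereas the Viterbi search prefers the path whose overall product is larger (0.4 × 0.9 = 0.36 against 0.6 × 0.5 = 0.30), which is why a global decoder is used for the scale path.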

Finally, we constructed the searched segmentation network by concatenating the connection decoding results and operator decoding results. Then, we trained it on the full training set.

4. Experimental Results and Discussion

In this section, we first introduce the datasets used in the experiments. Then, we describe the detailed experimental settings and, finally, we give the experimental results and compare them with current popular methods.

4.1. Datasets Description

To verify the effectiveness of the proposed method on high resolution remote sensing data, we conducted experiments on the Gaofen image dataset (GID) [26] and the FU dataset [41]. Some examples of the two datasets are shown in Figure 5 and Figure 6.

The original GID dataset contains 150 HRSIs acquired by the Gaofen-2 satellite in more than 60 cities in China, and has the advantages of large coverage, wide distribution, and high spatial resolution. These images are widely distributed over a geographic area of more than 50,000 square kilometers. As shown in Figure 5, each GID image contains 4 bands: red, green, blue, and near-infrared. The labels contain 5 types of objects: built-up, farmland, forest, meadow, and water. During the experiment, we cut each original image of size 6800 × 7200 and its corresponding label into blocks of size 512 × 512 and obtained a total of 31,500 labeled high-resolution remote sensing image blocks. These blocks were randomly divided into a training set, a validation set, and a test set in a ratio of 6:2:2. When training the model and calculating the accuracy, we did not consider unlabeled black pixels.
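The block count can be checked with a short calculation. Taking the ceiling of each image dimension divided by the block size reproduces the reported total, which suggests that the edge blocks are padded or overlapped; how exactly the edges are handled is not stated in the text and is an assumption here, as are the helper names.

```python
import math

def tiles_per_image(height, width, tile=512):
    """Number of tile x tile blocks needed to cover an image completely."""
    return math.ceil(height / tile) * math.ceil(width / tile)

per_image = tiles_per_image(6800, 7200)   # 14 rows of blocks x 15 columns
total = 150 * per_image                   # 150 GID images

# 6:2:2 split into training, validation, and test blocks.
n_train = total * 6 // 10
n_val = total * 2 // 10
n_test = total - n_train - n_val
```

Under these assumptions, each image yields 210 blocks, giving 31,500 blocks in total and a split of 18,900 / 6,300 / 6,300.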

The FU dataset contains 321 HRSIs from 16 cities in different regions of France, each with a size of 10,000 × 10,000. The FU dataset contains 5 bands: red, green, blue, slope, and aspect. The labels contain 12 ground object categories: built-up, infrastructure, mining land, artificial meadow, arable land, permanent land, pasture, forest, shrub land, bare land, wetlands, and water. Similarly, the original image and the corresponding labels were cut into blocks of 512 × 512 size. These blocks were randomly divided into a training set, a validation set, and a test set in a ratio of 6:2:2.

4.2. Evaluation Metrics

In this paper, we calculated the overall accuracy (OA), the mean intersection over union (MIoU), and the frequency weighted intersection over union (FWIoU) to evaluate the segmentation performance [52]. These evaluation indicators are defined in Equations (7)–(9):

(7) $MIoU = \frac{1}{n + 1} \sum_{i = 0}^{n} \frac{TP_{i}}{FN_{i} + FP_{i} + TP_{i}}$

(8) $OA = \frac{TP + TN}{TP + FP + TN + FN}$

(9) $FWIoU = \frac{1}{n + 1} \sum_{i = 0}^{n} (\frac{TP_{i}}{FN_{i} + FP_{i} + TP_{i}} \cdot \frac{TP_{i} + FN_{i}}{TP_{i} + FP_{i} + TN_{i} + FN_{i}})$

where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives, respectively, and $n$ is the number of classes.
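As an illustration, the three metrics can be computed from a confusion matrix as sketched below. The function name and the matrix layout (`cm[i][j]` counts pixels of true class `i` predicted as class `j`) are assumptions. Note that FWIoU is computed here in its commonly used form, as the frequency-weighted sum of per-class IoU without the $\frac{1}{n + 1}$ prefactor that appears in Equation (9); treat this as an assumption of the sketch rather than as the authors' evaluation code. Every class is assumed to occur at least once, so no division by zero is handled.

```python
def segmentation_metrics(cm):
    """Return (OA, MIoU, FWIoU) from a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = [cm[i][i] for i in range(n)]
    fn = [sum(cm[i]) - tp[i] for i in range(n)]                       # missed pixels of class i
    fp = [sum(cm[r][i] for r in range(n)) - tp[i] for i in range(n)]  # wrongly assigned to class i
    iou = [tp[i] / (tp[i] + fp[i] + fn[i]) for i in range(n)]
    oa = sum(tp) / total                                              # correctly classified share
    miou = sum(iou) / n                                               # unweighted mean of class IoU
    fwiou = sum((tp[i] + fn[i]) / total * iou[i] for i in range(n))   # weighted by class frequency
    return oa, miou, fwiou

# Toy two-class example.
oa, miou, fwiou = segmentation_metrics([[3, 1], [2, 4]])
```

For the toy matrix, the per-class IoU values are 3/6 and 4/7, so OA is 0.7, MIoU is about 0.536, and FWIoU is about 0.543.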

4.3. Experiment Settings

We set the length of the search space to $L$ = 12, 14 and set the number of operators to 3 when searching for operators. At the same time, the model started with a down-sampling module, which consisted of two convolution operations with stride 2, turning the spatial resolution of the feature to s = 4, and increasing the number of channels to 40. Inside the super-net, the number of channels increased by the same multiple as the feature scale increased. For example, when s = 32, the number of channels increases to 320. The model finally ended with an up-sampling module, which included two convolution operations and up-sampling operations with stride 1, and the dropout rates of the convolution operations were 0.5 and 0.1, respectively.
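The channel rule can be written as a one-line calculation. The values for s = 8 and s = 16 below are inferred from the stated proportionality and the two reported end points (40 channels at s = 4, 320 at s = 32); the constant and function names are hypothetical.

```python
BASE_SCALE = 4       # scale after the initial down-sampling module
BASE_CHANNELS = 40   # channels at that scale

def channels_at(scale):
    """Channel count that grows by the same factor as the feature scale."""
    return BASE_CHANNELS * scale // BASE_SCALE

channel_table = {s: channels_at(s) for s in (4, 8, 16, 32)}
```

Under this reading, the four scales use 40, 80, 160, and 320 channels, respectively.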

When performing the model search, the super-net was trained for a total of 60 epochs on the training set. We divided the training set equally into training sets A and B and adopted a random flipping data augmentation strategy during training. For the first 20 epochs, we only used training set A to train the model parameter $ω$, and for the last 40 epochs, we used training sets A and B to train the model parameter $ω$ and the model structure parameter $ξ_{i} (i = 1, 2, 3)$, respectively. On training set A, a stochastic gradient descent (SGD) optimizer with a momentum factor of 0.9 was used to update the model parameter. Its learning rate decreased from 0.025 to 0.001 according to cosine annealing, with a weight decay coefficient of 0.0003. On training set B, an adaptive moment estimation (Adam) optimizer was used to update the structural parameter $ξ$. Its learning rate was set to 0.003, with a weight decay coefficient of 0.001. For the searched model, the head and tail sampling modules were the same as above, and we replaced the super-net with the decoded model. During retraining, the full training set was used for 200 epochs with the same SGD optimizer and data augmentation strategy as during super-net training. The experiments were implemented in the PyTorch framework and run on an NVIDIA GeForce RTX 3090 GPU.
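The learning-rate schedule for $ω$ can be sketched with the standard cosine annealing formula. The text gives only the end points (0.025 and 0.001); stepping the schedule once per epoch over the 60 search epochs is an assumption of this sketch, and `cosine_lr` is a hypothetical helper rather than a function from the released code.

```python
import math

def cosine_lr(epoch, total_epochs=60, lr_max=0.025, lr_min=0.001):
    """Cosine-annealed learning rate at a given epoch."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

schedule = [cosine_lr(e) for e in range(61)]   # epochs 0..60
```

The rate starts at 0.025, passes through the midpoint 0.013 at epoch 30, and reaches 0.001 at epoch 60.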

4.4. Comparison with the State-of-the-Art Methods

We conducted experiments on the GID and FU datasets with the above experimental settings and compared our proposed method with some popular methods. The artificially designed models included PSPNet [42], DeeplabV3+ [18], HRNet [19], and MSFCN [11]. The backbone networks of PSPNet and DeeplabV3+ were ResNet-101. To ensure a fair comparison, these models were trained from scratch.

Table 2 and Table 3 show that our method achieved the best MIoU on both datasets, and the visualization results of our method are shown in Figure 7 and Figure 8. In detail, compared with the best comparison model on the GID dataset, MSFCN, our method was 0.27% higher, and compared with the best comparison model on the FU dataset, Deeplabv3+, it was 1.09% higher. The visualization results of our method on the two datasets were also better than those of the comparison models. Moreover, compared with artificially designed networks, DNAS was more flexible and robust. For example, MSFCN [11] achieved the second-best MIoU of 0.9112 on the GID dataset (0.9140 for DNAS), but its MIoU on the FU dataset was 0.4708, significantly lower than the 0.5216 of DNAS. Deeplabv3+ [18] achieved the second-best MIoU of 0.5107 on the FU dataset (0.5216 for DNAS), but its MIoU on the GID dataset was 0.8752, significantly lower than the 0.9140 of DNAS. This shows that the performance of artificially designed networks varies across datasets, and that many structures often have to be tried manually through a large number of experiments to obtain better prediction results. Our proposed search space can cover enough possible network architecture candidates, including comparison networks such as Deeplabv3+ and HRNet. Based on this search space, our method could search for network structures that outperform the above artificial networks. Therefore, with the decoupling NAS strategy, our method designs dataset-specific network structures for different datasets and achieves state-of-the-art results with stronger robustness.

In addition, we compared the proposed method with some existing NAS methods. In the experiments of Auto-deeplab, we used the officially provided super-net model and then completed the experiment according to the same settings. Due to the huge memory usage of the super-net, the batch size was only set to 2 during training. In the experiments of Fast-NAS, we used the officially provided code and searched 1000 discrete networks. Auto-deeplab uses a gradient-based NAS method, which is the same as DNAS, but the batch size of its input images during the search training process is too small, which makes it tough for the super-net to converge. Therefore, the network performance obtained by the search was not as good as DNAS. Fast-NAS uses the NAS method based on reinforcement learning. Since it only trains each selected sub-network for 6 epochs and employs a subset of datasets, the final retraining performance of the network was unstable. In DNAS, the search space is more complex than Auto-deeplab and Fast-NAS, and the super-net is trained for 60 epochs on the whole dataset, so the architecture parameters are well converged and can be decoded to a better network. Compared with the existing NAS methods, our proposed method was more suitable for HRSI.

4.5. Efficiency Analysis

Firstly, we analyzed the efficiency of the search process. In Table 4, we compare the NAS methods that use the gradient descent strategy. Benefiting from the decoupled search space design, the super-net at each level was lightweight enough for training to converge. The path-level super-net was not filled with complex cell structures. The connection-level and cell-level super-nets were constrained by the decoding results of the previous stage. Although our method needed to go through a three-stage search process, the search time did not increase. At the same time, our method could accommodate a larger input image size and training batch size, making the super-net easier to converge, so that a more suitable network architecture could be found.

We also analyzed the efficiency of the network models obtained from the search, as shown in Table 2 and Table 3. The input image size in all experiments was 512 × 512. Compared with the manually designed networks [18,19,42], the searched network has far fewer parameters and requires less computation, with memory usage that is lower than or comparable to theirs. The method in [43] was designed for real-time semantic segmentation, so the network it obtains is the most lightweight, and the method in [35] does not search for cross-layer connections. As a result, the networks found by these two methods were more lightweight than the one found by our proposed method.

4.6. Ablation Study

Finally, we conducted ablation experiments on the GID dataset. This part verified the role of the three search spaces we designed and the effect of the hyperparameter $L$ on the results, as shown in Table 5 and Table 6, respectively. For the three-level search space, we trained the network obtained at each search stage on the full training set. Over the three-level search, the architecture of the network was progressively refined: the path level determined the backbone structure of the network, the connection level determined its cross-layer connections, and the cell level determined the specific operators of each cell. Table 5 shows that the MIoU rose from 0.8648 after the first stage to 0.8775 after the second and 0.9140 after the third, which supports the effectiveness of our decoupled search space. For the search space length $L$, we ran experiments with $L$ = 12, 14, and 16. The super-net with $L$ = 14 gave the best performance. Intuitively, a more complex search space should yield a better model. However, enlarging the search space also made optimization more difficult, so that the optimal network could not be found effectively. In fact, during the experiment, the training batch size of the super-net was only 2 when $L$ = 16, the same as for the other NAS methods above. This indicates that a suitable search space is one that can accommodate image inputs of sufficient resolution and batch size while still converging effectively.

5. Conclusions

In the remote sensing field, where image data are abundant, NAS is attractive because it designs network architectures automatically in a data-driven way. To make the NAS process converge efficiently while enriching the search space, this paper proposed a novel DNAS method for HRSI semantic segmentation. In DNAS, the search process is divided into three stages to automatically construct a highly complex semantic segmentation DCNN model. For the first time, we introduced the search of cross-layer connections, which increases the complexity of the search space and allows richer image features to be extracted. Meanwhile, by decoupling the search, the method ensures that the super-net can be fed enough training images during the search, so that it converges effectively and yields a more suitable model.

The method was evaluated on two datasets to analyze its accuracy and efficiency. We compared our method with popular deep network methods and several NAS methods on the GID and FU datasets. The experiments demonstrated that our method outperformed the compared methods in accuracy. Compared with other NAS methods, our method finds strong models more easily because it can accommodate sufficiently large input images and batch sizes during the search. In future work, we may expand the existing operator search space (e.g., with operators specific to remote sensing) and introduce a more efficient neural search method in the third stage of the search.

Author Contributions

Conceptualization, Y.L. (Yansheng Li); methodology, Y.W., Y.L. (Yansheng Li); validation, Y.W., W.C.; writing—original draft preparation, Y.W., Y.L. (Yunzhou Li); writing—reviewing and editing, Y.L. (Yunzhou Li), B.D., W.C.; project administration, Y.L. (Yansheng Li); funding acquisition, Y.L. (Yansheng Li). All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The experiments are conducted on publicly open datasets. The download sites of the publicly open datasets can be found in the corresponding published papers; we do not repeat these sites here.

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures and Tables

Figure 1. The framework of decoupling neural architecture search.

Figure 2. Path-level Search Space.

Figure 3. Connection-level Search Space.

Figure 4. Cell-level Search Space.

Figure 5. The example image and corresponding label from the GID dataset.

Figure 6. FU dataset example image and corresponding classification label map.

Figure 7. Visual comparison of semantic segmentation results on the GID dataset. (a,b) are the raw images and ground truth, respectively. (c,d) are the results of PSPNet and Deeplabv3+, respectively. (e,f) are the results of HRNet and MSFCN, respectively. (g) presents the results of Auto-deeplab, and (h) presents the results of Fast-NAS. The results of our proposed DNAS are displayed in (i).

Figure 8. Further visual examples of semantic segmentation results on the GID dataset. (a,b) are the raw images and ground truth, respectively. (c,d) are the results of PSPNet and Deeplabv3+, respectively. (e,f) are the results of HRNet and MSFCN, respectively. (g) presents the results of Auto-deeplab, and (h) presents the results of Fast-NAS. The results of our proposed DNAS are displayed in (i).

Table 1

Candidate operators in the cell-level search space.

Type	Kernel Size	Operator
Pooling	3 × 3	max pooling
	3 × 3	average pooling
	-	global pooling
Convolution	1 × 1	convolution
	3 × 3	convolution
	3 × 3	atrous conv with rate 3
	3 × 3	atrous conv with rate 12
	3 × 3	depthwise-separable conv
	5 × 5	depthwise-separable conv
	7 × 7	depthwise-separable conv
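
To give a rough sense of how the candidates in Table 1 differ in cost and spatial coverage, the sketch below uses the standard weight-count formulas for a regular convolution and a depthwise-separable convolution (bias terms ignored), and the standard effective-extent formula for an atrous kernel. The channel width of 64 is a hypothetical value chosen for illustration and is not taken from the paper.

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def sep_conv_params(k, c_in, c_out):
    """Weights of a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


def effective_kernel(k, rate):
    """Spatial extent covered by a k x k atrous (dilated) kernel."""
    return k + (k - 1) * (rate - 1)


if __name__ == "__main__":
    channels = 64  # hypothetical channel width, not from the paper
    print("3x3 conv weights:          ", conv_params(3, channels, channels))
    for k in (3, 5, 7):
        print(f"{k}x{k} separable conv weights:", sep_conv_params(k, channels, channels))
    for rate in (3, 12):
        print(f"3x3 atrous conv, rate {rate}: spans {effective_kernel(3, rate)} pixels")
```

Under these assumptions a 3 × 3 separable convolution needs 4672 weights against 36,864 for a regular 3 × 3 convolution, while the atrous kernels with rates 3 and 12 span 7 and 25 pixels, respectively, without adding weights.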

Table 2

Comparison with the existing methods on the GID dataset.

Model	MIoU	OA	FWIoU	Params	Memory	FLOPs
PSPNet [42]	0.9011	0.9757	0.9531	72.58 M	2820.78 M	279.64 G
Deeplabv3+ [18]	0.8752	0.9746	0.9513	59.35 M	1117.32 M	88.9 G
HRNet [19]	0.9001	0.9449	0.9511	65.85 M	1438.48 M	93.77 G
MSFCN [11]	0.9113	0.9767	0.9548	14.17 M	2769.02 M	197.62 G
Auto-deeplab [35]	0.8749	0.9741	0.9505	13.35 M	925.15 M	16.19 G
Fast-NAS [43]	0.8807	0.9759	0.9536	2.18 M	651.46 M	3.9 G
Our DNAS ( $L$ = 12)	0.8917	0.9776	0.9566	6.15 M	753.00 M	16.89 G
Our DNAS ( $L$ = 14)	0.9140	0.9800	0.9611	7.14 M	1195.38 M	54.06 G
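
For readers unfamiliar with the accuracy columns, the sketch below computes OA (overall accuracy), MIoU (mean intersection over union), and FWIoU (frequency-weighted IoU) from a confusion matrix using their commonly used definitions. It is assumed, not confirmed by this section, that the paper follows these definitions; the function name `segmentation_metrics` and the 2 × 2 toy matrix are hypothetical.

```python
def segmentation_metrics(cm):
    """Compute OA, MIoU and FWIoU from a square confusion matrix.

    cm[i][j] is the number of pixels whose true class is i and whose
    predicted class is j. A class that never occurs in either the labels
    or the predictions gets IoU 0 here, which is a simplification.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    ious, freqs = [], []
    for c in range(n):
        tp = cm[c][c]
        gt = sum(cm[c])                          # pixels labeled as class c
        pred = sum(cm[r][c] for r in range(n))   # pixels predicted as class c
        union = gt + pred - tp
        ious.append(tp / union if union else 0.0)
        freqs.append(gt / total)
    oa = sum(cm[c][c] for c in range(n)) / total
    miou = sum(ious) / n
    fwiou = sum(f * iou for f, iou in zip(freqs, ious))
    return {"OA": oa, "MIoU": miou, "FWIoU": fwiou}


if __name__ == "__main__":
    toy = [[50, 10],
           [5, 35]]  # hypothetical two-class confusion matrix
    for name, value in segmentation_metrics(toy).items():
        print(f"{name}: {value:.4f}")
```

Because FWIoU weights each class by its pixel frequency while MIoU weights all classes equally, a weak result on a rare class lowers MIoU much more than FWIoU.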

Table 3

Comparison with the existing methods on the FU dataset.

Model	MIoU	OA	FWIoU	Params	Memory	FLOPs
PSPNet [42]	0.5003	0.7181	0.5730	72.58 M	2820.78 M	279.64 G
Deeplabv3+ [18]	0.5107	0.7341	0.5896	59.35 M	1117.32 M	88.9 G
HRNet [19]	0.5034	0.7254	0.5782	65.85 M	1438.48 M	93.77 G
MSFCN [11]	0.4708	0.7090	0.5592	14.17 M	2769.02 M	197.62 G
Auto-deeplab [35]	0.4971	0.7268	0.5808	14.52 M	1035.15 M	18.92 G
Fast-NAS [43]	0.4586	0.7116	0.5606	2.82 M	792.26 M	4.2 G
Our DNAS ( $L$ = 12)	0.5155	0.7331	0.5870	6.57 M	1291.14 M	39.49 G
Our DNAS ( $L$ = 14)	0.5216	0.7345	0.5912	12.81 M	1360.04 M	46.97 G

Table 4

The searching efficiency comparison on the GID dataset.

Model	Image Size	Batch Size	GPU Days
Auto-deeplab [35]	512	2	2.5
RSNet [39]	256	2	7
Our DNAS	512	24, 16, 6	5.3

Table 5

The efficiency analysis of three-level search space on the GID dataset.

Model	Built-Up	Farmland	Forest	Meadow	Water	MIoU
Ours-Stage1	0.9583	0.9392	0.9509	0.5328	0.9430	0.8648
Ours-Stage2	0.9615	0.9450	0.9511	0.5786	0.9511	0.8775
Ours-Stage3	0.9685	0.9584	0.9694	0.7121	0.9616	0.9140

Table 6

The efficiency analysis of different $L$ on the GID dataset.

Model	Built-Up	Farmland	Forest	Meadow	Water	MIoU
$L$ = 12	0.9678	0.9550	0.9680	0.6206	0.9585	0.8940
$L$ = 14	0.9685	0.9584	0.9694	0.7121	0.9616	0.9140
$L$ = 16	0.9663	0.9499	0.9565	0.5543	0.9537	0.8762
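
As a quick consistency check, the MIoU column in Tables 5 and 6 can be reproduced as the unweighted mean of the five per-class IoU values in each row. The sketch below does this with the numbers copied from the two tables; the helper names `mean_iou` and `max_gap` are illustrative.

```python
# Per-class IoU (Built-Up, Farmland, Forest, Meadow, Water) and reported MIoU,
# copied from Table 5 and Table 6.
TABLE_5 = {
    "Ours-Stage1": ([0.9583, 0.9392, 0.9509, 0.5328, 0.9430], 0.8648),
    "Ours-Stage2": ([0.9615, 0.9450, 0.9511, 0.5786, 0.9511], 0.8775),
    "Ours-Stage3": ([0.9685, 0.9584, 0.9694, 0.7121, 0.9616], 0.9140),
}
TABLE_6 = {
    "L = 12": ([0.9678, 0.9550, 0.9680, 0.6206, 0.9585], 0.8940),
    "L = 14": ([0.9685, 0.9584, 0.9694, 0.7121, 0.9616], 0.9140),
    "L = 16": ([0.9663, 0.9499, 0.9565, 0.5543, 0.9537], 0.8762),
}


def mean_iou(per_class):
    """Unweighted mean of the per-class IoU values."""
    return sum(per_class) / len(per_class)


def max_gap(table):
    """Largest absolute difference between recomputed and reported MIoU."""
    return max(abs(mean_iou(ious) - reported) for ious, reported in table.values())


if __name__ == "__main__":
    for title, table in (("Table 5", TABLE_5), ("Table 6", TABLE_6)):
        for name, (ious, reported) in table.items():
            print(f"{title} {name}: recomputed {mean_iou(ious):.4f}, reported {reported:.4f}")
        print(f"{title} largest gap: {max_gap(table):.5f}")
```

The recomputed values agree with the reported ones to within rounding of the four-decimal table entries. The rows also show that most of the variation in MIoU comes from the Meadow class, which ranges from 0.5328 to 0.7121 while the other four classes stay above 0.93.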

References

1. Gorelick, N.; Hancher, M.; Dixon, M.; Ilyushchenko, S.; Thau, D.; Moore, R. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sens. Environ.; 2017; 202, pp. 18-27. [DOI: https://dx.doi.org/10.1016/j.rse.2017.06.031]

2. Li, Y.; Ma, J.; Zhang, Y. Image retrieval from remote sensing big data: A survey. Inf. Fusion; 2021; 67, pp. 94-115. [DOI: https://dx.doi.org/10.1016/j.inffus.2020.10.008]

3. Sun, W.; Wang, R. Fully convolutional networks for semantic segmentation of very high resolution remotely sensed images combined with DSM. IEEE Geosci. Remote Sens. Lett.; 2018; 15, pp. 474-478. [DOI: https://dx.doi.org/10.1109/LGRS.2018.2795531]

4. Xu, Y.; Wu, L.; Xie, Z.; Chen, Z. Building extraction in very high resolution remote sensing imagery using deep learning and guided filters. Remote Sens.; 2018; 10, 144. [DOI: https://dx.doi.org/10.3390/rs10010144]

5. Zhao, K.; Kang, J.; Jung, J.; Sohn, G. Building extraction from satellite images using mask R-CNN with building boundary regularization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; Salt Lake City, UT, USA, 18–22 June 2018; pp. 247-251.

6. Abdollahi, A.; Pradhan, B.; Shukla, N.; Chakraborty, S.; Alamri, A. Deep learning approaches applied to remote sensing datasets for road extraction: A state-of-the-art review. Remote Sens.; 2020; 12, 1444. [DOI: https://dx.doi.org/10.3390/rs12091444]

7. Zhou, L.; Zhang, C.; Wu, M. D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; Salt Lake City, UT, USA, 18–22 June 2018; pp. 182-186.

8. Wang, Y.; Li, Z.; Zeng, C.; Xia, G.-S.; Shen, H. An urban water extraction method combining deep learning and Google Earth engine. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2020; 13, pp. 769-782. [DOI: https://dx.doi.org/10.1109/JSTARS.2020.2971783]

9. Li, Y.; Dang, B.; Zhang, Y.; Du, Z. Water body classification from high-resolution optical remote sensing imagery: Achievements and perspectives. ISPRS J. Photogramm. Remote Sens.; 2022; 187, pp. 306-327. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2022.03.013]

10. Li, Y.; Zhou, Y.; Zhang, Y.; Zhong, L.; Wang, J.; Chen, J. DKDFN: Domain Knowledge-Guided deep collaborative fusion network for multimodal unitemporal remote sensing land cover classification. ISPRS J. Photogramm. Remote Sens.; 2022; 186, pp. 170-189. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2022.02.013]

11. Li, R.; Zheng, S.; Duan, C.; Wang, L.; Zhang, C. Land cover classification from remote sensing images based on multi-scale fully convolutional network. Geo-Spat. Inf. Sci.; 2022; 25, pp. 278-294. [DOI: https://dx.doi.org/10.1080/10095020.2021.2017237]

12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA, 27–30 June 2016; pp. 779-788.

13. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2012; Volume 25.

14. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Boston, MA, USA, 7–12 June 2015; pp. 3431-3440.

15. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. Proceedings of the International Conference on Medical image Computing and Computer-Assisted Intervention; Munich, Germany, 5–9 October 2015; pp. 234-241.

16. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3-11.

17. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell.; 2017; 40, pp. 834-848. [DOI: https://dx.doi.org/10.1109/TPAMI.2017.2699184]

18. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany, 8–14 September 2018; pp. 801-818.

19. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell.; 2020; 43, pp. 3349-3364. [DOI: https://dx.doi.org/10.1109/TPAMI.2020.2983686]

20. Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-hrnet: A lightweight high-resolution network. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA, 20–25 June 2021; pp. 10440-10450.

21. Peng, C.; Li, Y.; Jiao, L.; Chen, Y.; Shang, R. Densely based multi-scale and multi-modal fully convolutional networks for high-resolution remote-sensing image semantic segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2019; 12, pp. 2612-2626. [DOI: https://dx.doi.org/10.1109/JSTARS.2019.2906387]

22. McFeeters, S.K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens.; 1996; 17, pp. 1425-1432. [DOI: https://dx.doi.org/10.1080/01431169608948714]

23. Li, J.; Narayanan, R.M. A shape-based approach to change detection of lakes using time series remote sensing images. IEEE Trans. Geosci. Remote Sens.; 2003; 41, pp. 2466-2477.

24. Huang, X.; Zhang, L. Road centreline extraction from high-resolution imagery based on multiscale structural features and support vector machines. Int. J. Remote Sens.; 2009; 30, pp. 1977-1987. [DOI: https://dx.doi.org/10.1080/01431160802546837]

25. Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl.; 2021; 169, 114417. [DOI: https://dx.doi.org/10.1016/j.eswa.2020.114417]

26. Tong, X.-Y.; Xia, G.-S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ.; 2020; 237, 111322. [DOI: https://dx.doi.org/10.1016/j.rse.2019.111322]

27. Zoph, B.; Le, Q.V. Neural architecture search with reinforcement learning. arXiv; 2016; arXiv: 1611.01578

28. Elsken, T.; Metzen, J.H.; Hutter, F. Neural architecture search: A survey. J. Mach. Learn. Res.; 2019; 20, pp. 1997-2017.

29. Liu, C.; Zoph, B.; Neumann, M.; Shlens, J.; Hua, W.; Li, L.-J.; Fei-Fei, L.; Yuille, A.; Huang, J.; Murphy, K. Progressive neural architecture search. Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany, 8–14 September 2018; pp. 19-34.

30. Real, E.; Aggarwal, A.; Huang, Y.; Le, Q.V. Regularized evolution for image classifier architecture search. Proceedings of the AAAI Conference on Artificial Intelligence; Honolulu, HI, USA, 27 January–1 February 2019; pp. 4780-4789.

31. Suganuma, M.; Shirakawa, S.; Nagao, T. A genetic programming approach to designing convolutional neural network architectures. Proceedings of the Genetic and Evolutionary Computation Conference; Berlin, Germany, 15–19 July 2017; pp. 497-504.

32. Brock, A.; Lim, T.; Ritchie, J.M.; Weston, N. Smash: One-shot model architecture search through hypernetworks. arXiv; 2017; arXiv: 1708.05344

33. Bender, G.; Kindermans, P.-J.; Zoph, B.; Vasudevan, V.; Le, Q. Understanding and simplifying one-shot architecture search. Proceedings of the International Conference on Machine Learning; Stockholm, Sweden, 10–15 July 2018; pp. 550-559.

34. Liu, H.; Simonyan, K.; Yang, Y. Darts: Differentiable architecture search. arXiv; 2018; arXiv: 1806.09055

35. Liu, C.; Chen, L.-C.; Schroff, F.; Adam, H.; Hua, W.; Yuille, A.L.; Fei-Fei, L. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 82-92.

36. Jing, W.; Ren, Q.; Zhou, J.; Song, H. AutoRSISC: Automatic design of neural architecture for remote sensing image scene classification. Pattern Recognit. Lett.; 2020; 140, pp. 186-192. [DOI: https://dx.doi.org/10.1016/j.patrec.2020.09.034]

37. Ma, A.; Wan, Y.; Zhong, Y.; Wang, J.; Zhang, L. SceneNet: Remote sensing scene classification deep learning network using multi-objective neural evolution architecture search. ISPRS J. Photogramm. Remote Sens.; 2021; 172, pp. 171-188. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2020.11.025]

38. Peng, C.; Li, Y.; Jiao, L.; Shang, R. Efficient convolutional neural architecture search for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens.; 2020; 59, pp. 6092-6105. [DOI: https://dx.doi.org/10.1109/TGRS.2020.3020424]

39. Wang, J.; Zhong, Y.; Zheng, Z.; Ma, A.; Zhang, L. RSNet: The search for remote sensing deep neural networks in recognition tasks. IEEE Trans. Geosci. Remote Sens.; 2020; 59, pp. 2520-2534. [DOI: https://dx.doi.org/10.1109/TGRS.2020.3001401]

40. Zhang, Z.; Liu, S.; Zhang, Y.; Chen, W. RS-DARTS: A convolutional neural architecture search for remote sensing image scene classification. Remote Sens.; 2021; 14, 141. [DOI: https://dx.doi.org/10.3390/rs14010141]

41. Castillo Navarro, J.; Bertrand, L.S.; Boulch, A.; Audebert, N.; Lefèvre, S. MiniFrance. IEEE Dataport; 2020; [DOI: https://dx.doi.org/10.21227/b9pt-8x03]

42. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 2881-2890.

43. Nekrasov, V.; Chen, H.; Shen, C.; Reid, I. Fast neural architecture search of compact semantic segmentation models via auxiliary cells. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 9126-9135.

44. Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y.L.; Tan, J.; Le, Q.V.; Kurakin, A. Large-scale evolution of image classifiers. Proceedings of the International Conference on Machine Learning; Sydney, NSW, Australia, 6–11 August 2017; pp. 2902-2911.

45. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697-8710.

46. Pham, H.; Guan, M.; Zoph, B.; Le, Q.; Dean, J. Efficient neural architecture search via parameters sharing. Proceedings of the International Conference on Machine Learning; Macau, China, 26–28 February 2018; pp. 4095-4104.

47. Chen, X.; Xie, L.; Wu, J.; Tian, Q. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. Proceedings of the IEEE/CVF International Conference on Computer Vision; Seoul, Korea, 27–28 October 2019; pp. 1294-1303.

48. Guo, Z.; Zhang, X.; Mu, H.; Heng, W.; Liu, Z.; Wei, Y.; Sun, J. Single path one-shot neural architecture search with uniform sampling. Proceedings of the European Conference on Computer Vision; Glasgow, UK, 23–28 August 2020; pp. 544-560.

49. Zhang, X.; Xu, H.; Mo, H.; Tan, J.; Yang, C.; Wang, L.; Ren, W. Dcnas: Densely connected neural architecture search for semantic image segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA, 20–25 June 2021; pp. 13956-13967.

50. Peng, C.; Li, Y.; Shang, R.; Jiao, L. RSBNet: One-Shot Neural Architecture Search for A Backbone Network in Remote Sensing Image Recognition. arXiv; 2021; arXiv: 2112.03456

51. Forney, G.D. The viterbi algorithm. Proc. IEEE; 1973; 61, pp. 268-278. [DOI: https://dx.doi.org/10.1109/PROC.1973.9030]

52. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A review on deep learning techniques applied to semantic segmentation. arXiv; 2017; arXiv: 1704.06857

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Deep learning methods, especially deep convolutional neural networks (DCNNs), have been widely used in high-resolution remote sensing image (HRSI) semantic segmentation. In the literature, most successful DCNNs are designed manually through a large number of experiments, which consume a great deal of time and depend on rich domain knowledge. Recently, neural architecture search (NAS), as an approach to automatically designing network architectures, has achieved great success in many kinds of computer vision tasks. For HRSI semantic segmentation, NAS faces two major challenges: (1) the high complexity of the task, which is caused by the pixel-by-pixel prediction demand in semantic segmentation, leads to a rapid expansion of the search space; (2) HRSI semantic segmentation often needs to exploit long-range dependency (i.e., a large spatial context), which means the NAS technique requires a large amount of GPU memory in the optimization process and can be difficult to converge. With these considerations in mind, we propose a new decoupling NAS (DNAS) framework to automatically design the network architecture for HRSI semantic segmentation. In DNAS, we adopt a hierarchical search space with three levels: path-level, connection-level, and cell-level. To adapt to this hierarchical search space, we devised a new decoupling search optimization strategy to decrease the memory occupation. More specifically, the search optimization strategy consists of three stages: (1) a light super-net (i.e., the specific search space) in the path-level space is trained to get the optimal path coding; (2) the optimal path is endowed with various cross-layer connections and trained to obtain the connection coding; (3) the super-net, which is initialized by the path coding and connection coding, is populated with various concrete cell operators, and the optimal cell operators are finally determined. It is worth noting that the well-designed search space can cover various network candidates and that the optimization process can be done efficiently. Extensive experiments on the publicly open GID and FU datasets showed that our DNAS outperformed the state-of-the-art methods, including manually designed networks and NAS methods.

Details

Title

DNAS: Decoupling Neural Architecture Search for High-Resolution Remote Sensing Image Semantic Segmentation

Author

Wang, Yu; Li, Yansheng; Chen, Wei; Li, Yunzhou; Dang, Bo

First page

3864

Publication year

2022

Publication date

2022

Publisher

MDPI AG

e-ISSN

20724292

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/rs14163864

ProQuest document ID

2706285738
