Deep Convolutional Neural Network for Large-Scale

Full text

Turn on search term navigation

1. Introduction

1.1. Background

The date palm tree (Phoenix dactylifera L.) is one of the oldest perennial fruit trees [1] and has been one of the most cultivated fruit trees since the Neolithic/Early Bronze Age [2]. The palm tree has unique and easily recognized characteristics, including a single trunk, palm leaves, and fronds. The crown of a date palm tree is densely covered with long pinnate leaves, which vary with the age of the tree and environmental conditions and can be as long as 4 m, on average [3]. The average height of date palm trees typically ranges from 15 m to 25 m [4]. Date palm trees can generally be grown in arid and semi-arid environments and are planted extensively on the Arabian Peninsula, in West Asia, and in North Africa. These trees are resilient and capable of surviving in a very hot and dry climates and tolerating saline and alkaline soils [5]. Date palm trees may live for more than 100 years [6] if they are not attacked by pests (i.e., red palm weevil) or diseases. These trees play a considerable role in harsh arid and semi-arid environments by supporting and stabilizing desert ecosystems [6]. Date palm trees bear fruit at an average age of five years, with an average annual yield of 400–600 kg/tree/year and may continue to produce for up to 60 years [7]. According to the Food and Agriculture Organization [8], the world production of dates has increased from 1,852,592 tons in 1961 to 9,075,446 tons in 2019. The world’s total harvested area increased six times from 1961 (240,972 ha) to 2019 (1,381,434 ha). The estimation of the population of palm trees and the harvest are derived on the basis of the total quantity of the produced dates, and accurate quantification of date palm trees is either limited or obsolete [9]. The precise information about the number, distribution, and health of date palm trees is crucial for sustainable management, disease and pest control, and yield estimation. Considering that palm trees are distributed over large agricultural and urban areas, the mapping and consistent monitoring of these trees using field-based surveys are impractical and can be laborious and time-intensive tasks.

Remote sensing technologies have substantially boosted the efficiency and accuracy of vegetation mapping, as they offer valuable and feasible tools for acquiring and observing large areas with comprehensive options for resolution [10,11,12]. A tremendous amount of satellite-based data is being collected and has been used extensively for the extraction of vegetation cover, forestry, and changes over the Earth’s surface at regional and global scales [13,14,15,16,17,18,19]. However, satellite and piloted aircrafts are constrained by their ability to deliver adequate spatial and temporal resolutions, which are essential to several applications that require short revisit times; such applications include discriminating vegetation or crop types and monitoring their phonological stages and health [20,21]. The capabilities of unmanned aerial vehicles (UAVs) in acquiring images with flexible revisit scheduling at low altitudes with ultra-spatial and temporal resolutions have enabled the observation of small individual plants and the extraction of information at a fine scale that can support farmers in their decision making, improve agricultural production, and optimize the utilization of resources [22,23]. A plethora of studies have successfully employed UAV platforms to acquire red–green–blue (RGB), multispectral, hyperspectral, and thermal images for studying vegetation [24,25,26,27,28,29,30,31], invasive plants [32,33,34], plant diseases, pests and stresses [35,36,37,38,39,40], agriculture [41,42,43,44,45,46], and individual trees [47,48].

Given the formidable and increasing amount of remotely sensed data, a wide spectrum of machine learning (ML) techniques has been used and developed to extract meaningful information and harness the unprecedented sources of data for versatile earth-related applications. Deep learning (DL), as a subfield of machine learning and artificial intelligence, has received considerable attention in the field of remote sensing in the past few years and has increasingly been used in a wide range of applications. In the same manner as the function of the human brain, DL algorithms learn by establishing the natural relationships between input and output data through multilayered, interconnected deep neural network (DNN) architectures [49,50]. Different from the classical machine learning models, DNN is data-driven, which eliminates the need for the construction of manually hand-crafted features of hierarchal data representations; high-level deep features are automatically learned from an input of imagery datasets. DL outperforms classical ML algorithms by effectively tackling the curse of dimensionality and achieving a better and consistent level of classification accuracies from massive image datasets without a significant drop in accuracy [51]. Convolutional neural networks are one of the most widely used deep supervised learning models in a wide spectrum of remote sensing applications and have achieved extraordinary improvement in recent years in the classification of remotely sensed data [52,53]. The use of diverse CNNs in crops and plant phenology recognition [54,55,56,57,58,59], weed detection [60,61,62], agriculture [51,63], vegetation mapping [64,65,66,67,68], tree crown detection and mapping [69,70,71,72], and disease detection [73,74,75,76,77] has elicited considerable interest.

1.2. Related Work

The mapping and detection of individual tree crowns, tree/plant/vegetation species, crops, and wetlands from UAV-based images are achieved by diverse CNN architectures, which are used to perform different tasks, including path-based classification [78,79,80,81,82,83,84,85,86,87], object detection [88,89,90,91,92,93,94,95,96,97], and semantic segmentation [98,99,100,101,102,103,104,105,106,107]. Recently, semantic segmentation, a commonly used term in computer vision where each pixel within the input imagery is assigned to a particular class, has been a widely used technique in diverse earth-related applications [108]. Various architectures, such as fully convolutional networks (FCNs), SegNet [109], U-Net [110], and DeepLab V3+ [111], have been used successfully to delineate tree and vegetation species [70,98,100,101,103,105,106,112,113,114,115,116,117,118,119,120,121,122,123,124], crops [51,57,58,102,125,126], wetlands [107,127], and weeds [61,99] from various remotely sensed data. For instance, Freudenberg et al. [128] utilized U-Net architecture to detect oil and coconut palms from WorldView 2, 3 satellite images. Their approach, which achieved accuracies ranging from 89% to 92%, was proposed as a way to precisely monitor palm trees at large scales. To obtain oil palm plantation maps from high spatial-resolution satellite images, Dong et al. [129] proposed a U-Net structure with a residual channel attention unit and a conditional random field for post-processing. The study achieved an overall accuracy of 96.88% and a mean intersection-over-union of 90.58%. Morales et al. [105] semantically segmented Mauritia flexuosa palm trees from UAV images, which were acquired under different environments and light conditions on the basis of Google’s DeepLab V3+ architecture. The presented DeepLab V3+ model outperformed four U-Net architectures and was able to distinguish young palms or palms partially covered by other types of vegetation. Torres et al. [100] evaluated five semantic segmentation architectures, including SegNet, U-Net, FC-DenseNet, and two DeepLab V3+ variants for segmenting single tree species. An intersection over union ranging from 77.1% to 92.5% was reported by the experimental analysis, which demonstrated the effectiveness of the evaluated architecture.

To the best of the authors’ knowledge, the vast majority of date palm mapping studies focus on the utilization of the traditional machine learning algorithms, such as traditional maximum likelihood supervised classification [130], spectral indices and thresholding analysis [131,132], hybrid per-pixel classification approach [133], fuzzy logic [134], and DT rule-based object-based image analysis [135]. Only limited studies have been dedicated to using deep learning techniques to detect date palm trees [9,136]. The current study aims to (1) develop a deep semantic segmentation method based on U-Shape convolutional network (U-Net) architecture and a pre-trained deep residual network for large-scale mapping of date palm trees; (2) establish a comprehensive and versatile labeled dataset to support the development of the proposed semantic segmentation model for date palm trees from very-high spatial resolution unmanned aerial vehicle (UAV) images; (3) compare the performance of the proposed approach with those of different state-of-the-art semantic segmentation networks.

2. Study Area and Materials

2.1. Experimental Site

The study area is located in the eastern region of Ajman Emirate, United Arab Emirates (UAE). It is geographically located between latitude 25.36°N and 25.43°N and longitude 55.54°E and 55.63°E (World Geodetic System, 1984), as shown in Figure 1, and covers approximately 85 km². The climate of the UAE ranges from arid to hyper-arid, with a daily high temperature ranging between 24 and 42 °C, with mean temperatures of 18 °C–34 °C and extreme hot daytime temperatures occurring frequently, which reach above 40 °C in the summer season [137,138]. The majority of the UAE experiences sporadic and irregular rainfall in time and geographical distribution, whereas the average annual rainfall can be less than 6 mm in the interior of the southern desert and can reach almost 160 mm in the northern and eastern mountainous regions of the country [139].

2.2. UAV Image Acquisition and Preprocessing

A commercial-grade off-the-shelf fixed-wing UAV (eBee-plus, senseFly^®, Cheseaux-sur-Lausanne, Switzerland) was used to acquire the VHSR images used in this research. The UAV system was equipped with a senseFly S.O.D.A (sensor optimized for drone applications) RGB camera (20 MP digital compact camera with a focal length of 28 mm that acquires VHR visible-color images: red (660 nm), green (520 nm), blue (450 nm)) onboard an inertial measurement unit and Global Navigation Satellite System (GNSS) with real-time kinematic/postprocessed kinematic (RTK/PPK) modes to enable high horizontal accuracy. Flight missions were planned and undertaken using senseFly’s eMotion flight controller and data management software. Following the preflight planning and manual launch of the eBee-plus, flight sessions were managed independently by the onboard autopilot. Flight missions were undertaken at an average flying height of 100 m above elevation data (AED), with 80% longitudinal and lateral overlaps. The utilized elevation data, provided in senseFly’s eMotion software, were based on a 90 m resolution digital elevation model derived from the Shuttle Radar Topography Mission (SRTM) combined with other data sources (i.e., ASTER GDEM, SRTM30, cartographic data) [140]. Flight lines were oriented perpendicular to the direction of the prevailing wind on the day of the survey. During the flights, a ground-based Trimble R10 GNSS receiver was used in static mode as a base reference station. The preprocessing of the acquired image data was initiated by rectifying the drone locations where the images were captured during the flight, using the PPK mode. Specifically, the ground GNSS RINEX (receiver independent exchange format) data and drone-based GNSS data (drone flight log file) were processed using eMotion software. Then, pix4Dmapper software (v.4.4.10; Prilly, Switzerland) was used to import the geotagged overlapping images and develop the orthomosaic of the study area. The final product was one orthomosaic RGB image with an average ground sampling distance of 5 cm.

2.3. Labeled Data

Having high-quality and sufficient training data is critical for machine learning algorithms. For the sake of labeling remotely sensed data for semantic segmentation, which requires each pixel in an image to be assigned to a category number (class), a corresponding binary mask was manually prepared for the UAV data. In preparing the binary mask, date palm tree pixels were encircled using ArcGIS Desktop software (v.10.7) to indicate the presence of date palm trees in the study area (Figure 2). The corresponding ground truth data served as a benchmark for the training and evaluation of the implemented models. In this study, the labeling process was comprehensive enough to cover the entire dataset and thereby incorporate as many versatile contexts as possible (i.e., palm trees in farms with vegetation and soil backgrounds and palm trees in urban environments). Given the fine details in the VHSR UAV data, the processing and analysis of large UAV images are demanding and may consume much time and memory. Moreover, resampling these data results in the loss of spatial resolution. As convolutional layers involve extensive computations, the VHSR orthomosaic UAV data and the corresponding mask were divided into equal-sized image tiles (512 × 512) to cope with the GPU memory limitations. An image-label pair was produced for each image tile in the study area and its corresponding mask (Figure 2). The generated image tiles were divided into three distinct sets: 65% of the data was used for training, 15% was used for validation, and 20% was used for testing purposes. Overall, 11,754 image tiles were selected for training and were artificially enlarged thrice through data augmentation by rotating the image-label pair by 90°, 180°, and 270° using the Sk-Image and Scipy libraries in Python. A total of 2300 image tiles were selected from the generated image tiles for validation purposes. Meanwhile, 3900 image tiles were kept for testing the generalizability of the model. The total number of image tiles used in the current study was enough to develop an efficient DL model for date palm tree mapping, as it is greater than the number of image tiles used in several successful studies [58,103,114,115,122,123,127,128,141].

3. Methodology

Different from patch-based classification and object detection techniques that are based on CNNs, fully convolutional neural networks (FCNs) could be used to delineate the boundary and position of individual date palm trees by performing pixel-level semantic segmentation. This section is dedicated to providing a brief description of the proposed U-Net architecture, the compared FCN networks (e.g., DeepLab V3+ and PSPNet), the utilized segmentation evaluation metrics, and the experimental setup.

3.1. U-Net

U-Net, a U-shaped architecture originally proposed for biomedical image semantic segmentation, is one of the commonly used FCN architectures in studies to classify remotely sensed data for multiple applications. It is a symmetric CNN architecture that compromises the encoder (capturing the context in the input image), the bottleneck, and the decoder (mapping and restoring the contextual information back to the original resolution). The encoder part in a U-Net architecture, a contracting path comprising a set of convolutional and max-pooling layers, receives the input image patches and produces an increased number of down-sampled feature maps on the basis of the depth of the network. The decoder part, an expanding path comprising a set of convolutional, concatenation, and upsampling layers, seeks to retrieve the precise locations and fine characteristics of the features that have been learned by the encoder to semantically segment images. Such retrieval is usually achieved by continuously upsampling feature maps and concatenating them with learned high-resolution features obtained from the corresponding blocks from the encoder.

In this study, a deep residual learning network (ResNet) [142] was considered the encoder backbone of the U-Net network for extracting features from input datasets. ResNet, an network architecture that was motivated by the design of Visual Geometry Group network (VGG) [143], was designed to solve the problem of deep gradient explosion and gradient vanishing when the number of layers in the network increases. ResNet architecture encompasses several sets of blocks (i.e., a sequence of convolution, batch normalization, and ReLu) that implement a specific type of connection method, which is referred to as shortcut connection or skip connection. The output feature maps of a particular layer (x) are forwarded and added to a deeper layer (y = F (x) + x). The depth of the ResNet may vary according to the basis of the number of the designed residual layers. ResNet-18, ResNet-50, and ResNet-101 are some common examples of ResNet variations. In this paper, a pertained ResNet-50 based on ImageNet was used as a backbone to increase classification performance and the generalizability of the proposed model. Additional implementation details of the proposed approach are shown in Figure 3 and Table 1.

3.2. DeepLabV3+

The family of DeepLab architectures proposed by the Google research team adopts multiscale atrous (i.e., holes) convolutions to solve the problems of multiscale objects in image segmentation. Different from the traditional convolution operation, the atrous convolution maintains the same resolution of features without increasing the number of parameters [144]. Four DeepLab architectures have been proposed over the past few years. First, the first version of DeepLab architectures, DeepLab V1, incorporates deep convolutional neural networks and probabilistic graph models (i.e., conditional random field). Second, the DeepLab V2 introduces Atrous Spatial Pyramid Pooling (ASPP) mechanism for extracting multiscale contextual information by using multiple parallel dilated convolutions with different dilation rates [145,146]. While the DeepLab V3 utilizes an improved ASPP module [111], the latest DeepLab architecture, that is DeepLab V3+ [147], improves previous DeepLab versions by introducing a decoder to refine segmentation results and produce more distinctive boundaries. Overall, DeepLab V3+ architecture encompasses an encoder, ASPP module, and a decoder (Figure 4). The adopted encoder network serves as a feature extractor, which reduces feature maps and captures the rich semantic information. The design of the encoder varies depending on the adopted backbone network. Multilevel features of the input image are captured through the ASPP mechanism to solve the multiscale problem of image segmentation of objects. Eventually, the decoder gradually retrieves the spatial information to produce more refined and sharp segmentation results [147]. In this study, the performance of the proposed approach was compared with those of two variants of DeepLab V3+, based on ResNet-50 [142] and Xception [148] backbone networks.

3.3. Pyramid Scene Parsing Network

Similar to DeepLab V3+, the pyramid scene parsing network (PSPNet) [149] utilizes a spatial pyramid pooling module between the encoder and decoder structure to capture global contextual information [150] and integrate multiscale features by controlling the size of the receptive field [151]. As shown in Figure 5, feature maps are extracted by the encoder (adopted CNN architecture), and then a series of parallel poolings is used with different grid scales for aggregating contextual information from various regions in extracted features and obtaining a broad spectrum of information. Convoluted low-dimension feature maps are then upsampled through bilinear interpolation, concatenated, and ultimately fed to a convolution layer with a proper activation function to extract a probability map(s). The backbone network of the adopted PSPNet was based on ResNet-50 [142] in this study.

3.4. Evaluation Metrics

To quantitatively evaluate and analyze the performance of various semantic segmentation architectures for detecting date palm trees, various pixel-by-pixel accuracy measures were utilized. Dice similarity coefficient (DSC) (also known as the F-score) and Mean Intersection-Over-Union (Mean-IOU) (also known as the Jaccard Index) metrics were utilized to evaluate the performance of the different trained models on independent testing datasets. These measures are generally used to compute the amount of agreement between the semantically segmented pixels (CNN output) and the hand-annotated masks. These measures can mathematically be expressed in accordance with Equations (1)–(4). Their computed values range from 0 to 1, wherein the value of 1 indicates the utmost similarity between the predicted and labeled mask (high segmentation accuracy), and 0 shows no similarity between them.

(1) $Precision = \frac{TP}{TP + FP}$

(2) $Recall = \frac{TP}{TP + FN}$

(3) $DSC (m, p) = 2 \times \frac{| m \cap p |}{| m | + | p |} = 2 \times \frac{Precision \times Recall}{Precision + Recall}$

(4) $Mean - IOU (m, p) = \frac{| m \cap p |}{| m | + | p | - | m \cap p |}$

where m denotes the binary ground truth mask, and

p

represents the predicted semantic segmentation. TP, TN, FP, and FN symbolize the numbers of true positive, true negative, false positive, and false negative, respectively.

3.5. Loss Function

The Dice loss [152], which is a regionally based loss that optimizes the network by using the dice coefficient, was used in this study on the basis of the empirical evaluation. This loss can mitigate the class imbalance between the foreground class (i.e., palm trees) and the background class in binary segmentation tasks [153,154]. Equation (4) expresses the formulation of Dice loss (L_Dice).

(5) $L_{Dice} = 1 - \frac{2 \sum_{i}^{N} p_{i} m_{i} + ξ}{\sum_{i}^{N} p_{i}^{2} + \sum_{i}^{N} m_{i}^{2} + ξ},$

where

p_{i} \in \{0, 1\}

is the predicted probability (sigmoid output) of the ith pixel in the image,

m_{i}

is the labeled mask of the ith pixel in the image (

0 \leq m_{i} \leq 1

), and

ξ

is a minimal constant value to avoid the division by zero problems in the denominator.

3.6. Experimental Setup

Segmentation models were built using the TensorFlow deep learning framework and executed using multiple graphics processing unit (GPU) hardware. The data parallelism approach using the TensorFlow MirroredStrategy, which enables synchronous distributed training on multiple GPUs in one server, was applied to fit the semantic segmentation models on a Linux cluster, with the following specifications: Intel^®, Xeon^®, 2.3 GHz, 512 GB of RAM, eight NVIDIATM Tesla K80 (GK210GL) GPUs, and 100 TB of storage. Figure 6 depicts the distributed training strategy used in this study. Eight replicas were created, and each variable in the segmentation model was mirrored across all replicas. The global batch size was set to 32, and the global batch was divided into small minibatches over the eight GPUs. Each GPU independently performed forward and backward parallel passes, and the gradients for the different batches of datasets were computed separately. An independent validation set (2300 image-mask pairs) was used after each training epoch to compute the loss and accuracy of the trained model, thereby avoiding overfitting, and then evaluate its generalizability. Dice loss was computed between outputs of the replicated models and corresponding masks of input image sets as an objective function. The dice coefficient was used to evaluate the segmentation outputs. The gradients of the objective function were gathered and averaged, and an identical update was applied to each independent network.

The hyperparameters of all segmentation models were empirically tuned by conducting a set of experiments. The encoder part of all models was initialized using ImageNet pretrained weights, and these weights were fine-tuned with further training. Among the stochastic gradient-based optimization algorithms, the adaptive momentum estimation (ADAM) optimizer [155] was chosen in this study for all FCN networks because of its efficiency in improving convergence and dealing with vanishing learning rates [156]. All segmentation models were trained for 100 epochs by using the ADAM optimizer with an initial learning rate of 0.001 and momentum hyperparameters β₁ and β₂ of 0.9 and 0.999, respectively. The training process continued until the model converged or until a maximum number of epochs was reached. Various techniques, including early stopping, were used to avoid overfitting. Early stopping was applied to control the training when the performance in the validation dataset degraded over a particular number of consecutive epochs. L2 regularization was also added to all convolutional layers to avoid overfitting.

4. Results

4.1. Evaluation of Segmentation Performance

The performance of the proposed segmentation model (U-Net based on a pretrained ResNet-50) for mapping date palm trees was compared with different state-of-the-art fully convolutional networks, including U-Net (VGG-16 backbone network), DeepLab V3+ (ResNet-50 backbone network), DeepLab V3+ (Xception backbone network), and PSPNet (ResNet-50 backbone network). Figure 7 displays the evolution of the loss and dice similarity coefficient curves of the proposed approach over the epochs. Several accuracy measures, including precision, recall, F-score, and Mean-IOU metrics, were applied to the validation dataset for all models to evaluate the performance of the proposed architecture against other segmentation architectures quantitatively, as shown in Figure 8. Among them, the proposed model outperformed other segmentation architectures, followed by PSPNet, DeepLab V3+ (Xception backbone network), U-Net (VGG-16 backbone network), and DeepLab V3+ (ResNet-50 backbone network). The proposed approach achieved an F-score, Mean-IOU, precision, and recall of 92%, 85%, 0.92, and 0.91, respectively. A precision metric of 0.92 indicates positive detections relative to the labeled data (consistency of 92% between 2300 labeled and predicted images).

The output of semantic segmentation models is a probability map ranging from 0 to 1, thereby indicating the probability of the presence or absence of date palm trees in an image. Here, a threshold value greater than 0.5 was applied to the probability map to derive the segmentation results. Figure 9 shows six randomly selected images from the validation dataset and their corresponding masks and experimental results of the five segmentation models. The original image and ground truth are shown on the left side of the image, and the result of the proposed approach is illustrated on the right side of the image. All segmentation models provided satisfactory segmentation results, with an F-score ranging from 81% to 92% and Mean-IOU ranging from 78% to 85%. However, a disparity can be observed between the results of the five segmentation models in terms of the size and boundary of the detected palm trees. Quantitative and visual analyses showed that the proposed approach presents significant potential in mapping date palm trees from UAV images, because it provides a satisfactory delineation of date palm trees.

4.2. Generalizability Evaluation

As described in Section 2.3 a total of 3900 images, extracted from the VHSR orthomosaic UAV product, were selected as the testing dataset to evaluate the generalization capability of the proposed network. Figure 10 illustrates the segmentation quality metric for semantic segmentation of the testing dataset achieved by the trained deep learning models. Figure 11 displays nine randomly selected images from the testing dataset and their corresponding masks and segmentation outputs of the trained models. The proposed segmentation model demonstrates excellent generalization capacity because it achieved an F-score of 91% and Mean-IOU of 85%. The compared segmentation models also maintained a similar range of accuracies. The proposed model demonstrates an efficient model for date palm tree mapping from UAV images via the comparative evaluation of segmentation results.

5. Discussion

Large-scale mapping of date palm trees is essential for their consistent monitoring and sustainable management, given the substantial commercial, environmental, and landscaping values of date palm trees. Mapping and monitoring date palm trees using ground surveys is challenging because these trees are planted in different agricultural and urban environments. The increasing availability and continuous development of commercial UAV systems have amplified the popularity and utilization of UAV-based images in a wide range of earth-related studies. Different from satellite-based images, large-scale UAV images are acquired in different seasons, flight heights, spatial resolutions, weather conditions, sunlight angles, and image illuminations. Developing an accurate transferable approach for large-scale mapping of date palm trees from UAV images can be challenging because feature values may vary significantly based on the source of data, image object segmentation level, and intraclass variability among classes, given the dependence of traditional machine learning techniques on the selection of shallow handcrafted features (i.e., band ratio, color invariants, and geometrical features). Thus, misclassification is expected when traditional machine learning is applied to different imageries [108].

In the current study, a deep semantic segmentation model based on U-Net architecture and a deep residual network was proposed for the large-scale mapping of date palm trees. A pretrained ResNet-50 based on ImageNet was adopted in the encoder module of U-Net. A comprehensive labeled dataset was developed to support the development of the proposed semantic segmentation for date palm trees from very-high spatial resolution (VHSR) UAV images. The labeled dataset was compiled from different agricultural and urban environments with a substantial variance in tree crown sizes, shapes, ages, health status, and backgrounds. The model was trained on eight GPUs (NIVIDIA^TM Tesla K80) through synchronous distributed training. The developed model was evaluated with independent validation and testing datasets. The performance of the proposed model was also compared with different advanced segmentation networks with various encoder backbones, including two variants of the DeepLab V3+ [147] (based on pretrained ResNet-50 and Xception backbones), PSPNet [149] (based on pretrained ResNet-50), and U-Net (based on a pretrained VGG-16) [143]. All segmentation models were tested on an NIVIDIA^TM Titan RTX graphics card with 24 Gb RAM. Table 2 compares the number of parameters, training time per epoch, and the testing time of the evaluated segmentation models.

The proposed approach maintained high accuracy in the validation and testing datasets and indicated that date palm trees can be mapped with an average F-score (>91%) and Mean-IOU (>85%). With a F-score that ranges from 88% to 92% and a Mean-IoU that ranges from 78% to 85%, all of the evaluated segmentation models provided satisfactory segmentation results on the testing dataset. U-net architecture based on ResNet-50 architecture outperformed other segmentation models, with a F-score ranging from 1.2%–10% and Mean-IOU ranging from 2%–16%. This was followed by PSPNet (ResNet-50 Backbone), DeepLab V3+ (Xception Backbone), U-net (VGG-16 Backbone), and DeepLab V3+ (ResNet-50 Backbone). Numerous studies that have used and compared different deep semantic segmentation architectures for tree, crop, and vegetation mapping have reported similar ranges of segmentation metrics [57,70,100,102,117,125]. For instance, five semantic segmentation architectures, including SegNet, U-Net, FC-DenseNet, and DeepLab V3+( based on Xception and MobileNetV2 backbones), were evaluated in Reference [100] for segmenting threatened single tree species from UAV-based images. An intersection-over-union that ranges from 77.1% to 92.5% and F1-score between 87.0% and 96.1% were reported by their experimental analysis. FC-DenseNet and U-Net models were superior to DeepLab V3+ (MobileNetV2), SegNet, and DeepLab V3+ (Xception). Cao and Zhang [112] utilized an improved U-Net model by replacing the convolutional layer in the U-Net network with a residual unit of ResNet for classifying different tree species from high-spatial-resolution airborne images. The developed approach was then followed by post-classification processing using conditional random fields to obtain smoother tree boundaries. An overall classification accuracy of 87% was achieved by the improved U-Net network. Ferreira et al. [70] achieved high accuracy by incorporating ResNet-18 in the DeepLab V3+ architecture to detect and classify Amazonian palm species from UAV images.

The proposed model in this study shows an efficient approach for date palm tree mapping from UAV images. It can segment date palm trees in relatively complex agricultural and urban environments and where palm trees are partially obscured by higher trees and shadow. Figure 9 and Figure 11 depict the segmentation outputs of randomly selected images (512 × 512 pixels) from the validation and testing dataset. Although the differences in evaluation scores between ResUnet-50 and some of the evaluated architectures may not appear significant, it demonstrates better delineation of date palm trees. Considering that it is computationally intensive to train and test large UAV imagery in a deep semantic segmentation model, the UAV data were split into smaller tiles (512 × 512 pixels) and fed to the trained network and predicted the presence of date palm trees. In the splitting process of a large image, an overlap between the tiles may be considered to ensure better delineation of the palm trees around the edges of the generated image tiles. The final prediction is reconstructed by merging the segmentation outputs of the classified tiles. For instance, Figure 12 shows the segmentation output of the proposed model for different image tiles with larger sizes (5120 × 5120 pixels) without performing any post-processing operations. However, a minor misclassification might be encountered in the reconstructed product. For instance, when a palm tree is divided into two separate images, the shape/size of these predicted palm trees might vary slightly. In addition, some minor vertical lines can be observed, as shown in Figure 12d–f. These errors can be refined by post-processing computer-vision operations [70].

6. Conclusions

This study presented an automatic approach for the large-scale mapping of date palm trees from VHSR UAV images based on a deep semantic segmentation model. A pre-trained deep residual learning framework (ResNet-50) was used as the backbone of the encoder module of a U-Net. A large and diverse labeled dataset was created to aid in the development of the proposed semantic segmentation model. A distributed training strategy was used to train the model on multiple GPUs. The proposed segmentation model was evaluated with different state-of-the-art fully convolutional networks, including U-Net (VGG-16), PSPNet (based on ResNet-50), and two variations of DeepLab V3+ (ResNet-50 and Xception backbones). Experimental results showed that the proposed approach was superior to other semantic segmentation models in validation and testing datasets, achieving an F-score of 91% and Mean-IOU of 85%. The proposed deep fully convolutional network is an efficient tool for the accurate mapping and delineation of date palm tree VHSR UAV images, thereby building and updating geospatial databases and enabling consistent monitoring of date palm trees.

Author Contributions

Conceptualization, M.B.A.G. and H.Z.M.S.; methodology, M.B.A.G. and H.Z.M.S.; formal analysis, M.B.A.G.; writing—original draft preparation, M.B.A.G.; writing—review and editing, M.B.A.G., H.Z.M.S., A.S. and R.A.-R.; visualization, M.B.A.G.; Resources, H.Z.M.S., A.S. and R.A.-R.; supervision, H.Z.M.S., A.S., A.W. and S.J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable to this article.

Acknowledgments

The authors would like to acknowledge Universiti Putra Malaysia for the financial support, the municipality of Ajman for providing remotely sensed data of the study area, and the University of Sharjah for providing the high performance computing cluster used in this research.

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures and Tables

View Image - Figure 1. Geographical location of the study area: (a) the Middle East; (b) UAE; (c) location of the study area and the digital elevation model of Ajman Emirate; (d) UAV image of the study area.

Figure 1. Geographical location of the study area: (a) the Middle East; (b) UAE; (c) location of the study area and the digital elevation model of Ajman Emirate; (d) UAV image of the study area.

Figure 2. Samples of different image-label pairs: (a,c,e) image tiles and (b,d,f) their corresponding masks.

Figure 3. U-Net architecture based on ResNet-50 adopted in this study.

Figure 4. DeepLab V3+ architecture.

Figure 5. PSPNet architecture.

Figure 6. Mirrored distributed training strategy.

Figure 7. Metric evolution over epochs of the proposed approach.

Figure 8. Evaluation metrics for all models obtained from the validation dataset.

View Image - Figure 9. Comparison of six randomly selected images (a–f) from the validation dataset and their corresponding segmentation results. The left column shows the selected RGB images, followed by the corresponding ground truth label, and segmentation results of the evaluated models including U-Net (based on VGG-16 network), DeepLab V3+ (based on ResNet-50 network), DeepLab V3+ (based on Xception network), PSPNet and the proposed approach).

Figure 9. Comparison of six randomly selected images (a–f) from the validation dataset and their corresponding segmentation results. The left column shows the selected RGB images, followed by the corresponding ground truth label, and segmentation results of the evaluated models including U-Net (based on VGG-16 network), DeepLab V3+ (based on ResNet-50 network), DeepLab V3+ (based on Xception network), PSPNet and the proposed approach).

Figure 10. Summary of evaluation metrics of segmentation models obtained from the testing dataset.

View Image - Figure 11. Comparison of nine randomly selected images (a–i) from the testing dataset and their corresponding segmentation results. The left side of the figure shows the selected RGB images followed by the corresponding ground truth label and classification results of the evaluated models.

Figure 11. Comparison of nine randomly selected images (a–i) from the testing dataset and their corresponding segmentation results. The left side of the figure shows the selected RGB images followed by the corresponding ground truth label and classification results of the evaluated models.

Figure 12. (a–f) Segmentation output of the proposed model for different randomly selected image tiles with larger sizes.

Table 1

The architecture of the proposed U-Net network.

Path	Unit	Kernel Size (k), Feature Map (fm)	Output Size (Width × Height × Channels)
Input			512 × 512 × 3
Encoder	ZeroPadding2D		518 × 518 × 3
	Conv2D	k = (7 × 7), fm = 64
	Batch normalization + Relu	k = (3 × 3), fm = 64	256 × 256× 64
	ZeroPadding2D	Fm = 64	258 × 258× 64
	MaxPooling2D	k = (3 × 3), fm = 64	128 × 128 × 64
	Convltional block 2 $\times$ Identity block	$(\begin{matrix} k = (1 \times 1), fm = 64 \\ k = (3 \times 3), f m = 64 \\ k = (1 \times 1), f m = 256 \end{matrix}) \times 3$	128 × 128 × 256
	Convltional block 3 $\times$ Identity block	$(\begin{matrix} k = (1 \times 1), fm = 128 \\ k = (3 \times 3), f m = 128 \\ k = (1 \times 1), f m = 512 \end{matrix}) \times 4$	64 × 64 × 512
	Convltional block 5 $\times$ Identity block	$(\begin{matrix} k = (1 \times 1), fm = 256 \\ k = (3 \times 3), f m = 256 \\ k = (1 \times 1), f m = 1024 \end{matrix}) \times 6$	32 × 32 × 1024
	Convltional blockIdentity block	$(\begin{matrix} k = (1 \times 1), fm = 512 \\ k = (3 \times 3), f m = 512 \\ k = (1 \times 1), f m = 2048 \end{matrix}) \times 2$	16 × 16 × 2048
Bottleneck	Conv2D	$k = (1 \times 1), fm = 512$	16 × 16 × 512
	Batch normalization + Relu	$k = (1 \times 1), fm = 512$	16 × 16× 512
	Conv2D	$k = (3 \times 3), fm = 512$	16 × 16 × 512
	Batch normalization + Relu	$k = (3 \times 3), fm = 512$	16 × 16 × 512
	Conv2D	$k = (1 \times 1), fm = 2048$	16 × 16 × 2048
	Batch normalization	$k = (1 \times 1), fm = 2048$	16 × 16 × 2048
Decoder	Upsampling2D	$fm = 2048$	32 × 32 × 2048
	Decoder block	$(\begin{matrix} k = (1 \times 1), fm = 2048 \\ k = (1 \times 1), f m = 2048 \\ k = (1 \times 1), f m = 2048 \end{matrix})$	32 × 32 × 2048
	Concatenate_1	$fm = 3072$	32, 32, 3072
	Upsampling2D	$fm = 3072$	64, 64, 3072
	Decoder block	$(\begin{matrix} k = (1 \times 1), fm = 1024 \\ k = (1 \times 1), f m = 1024 \\ k = (1 \times 1), f m = 1024 \end{matrix})$	64, 64, 1024
	Concatenate005F2	$fm = 1536$	64, 64, 1536
	Upsampling2D	$fm = 1536$	128, 128, 1536
	Decoder block	$(\begin{matrix} k = (1 \times 1), f = 512 \\ k = (1 \times 1), f = 512 \\ k = (1 \times 1), f = 512 \end{matrix})$	128, 128, 512
	Concatenate_3	$f = 768$	128, 128, 768
	Upsampling2D	$f = 768$	256, 256, 768
	Decoder block	$(\begin{matrix} k = (1 \times 1), fm = 256 \\ k = (1 \times 1), f m = 256 \\ k = (1 \times 1), f m = 256 \end{matrix})$	256, 256, 256
	Concatenate_4	$fm = 320$	256, 256, 320
	Upsampling2D	$fm = 320$	512, 512, 320
	Decoder block	$(\begin{matrix} k = (1 \times 1), fm = 64 \\ k = (1 \times 1), f m = 64 \\ k = (1 \times 1), f m = 64 \end{matrix})$	512, 512, 64
Output	Conv2D + sigmoid	k = (1 × 1), fm = 1	512, 512, 1

Table 2

Training and testing details for each model.

Model	Backbone	No. of Trainable Parameters (M)	Training Time/Epoch (m)	Test Time (s)/Image
U-Net	ResNet-50	~157.280	~75	~0.17
U-Net	VGG-16	~25.858	~43	~0.1
DeepLab V3+	ResNet-50	~17.795	~25	~0.07
DeepLab V3+	Xception	~21.558	~29	~0.09
PSPNet	ResNet-50	~46.631	~65	~0.14

Word count: 6206

Show less

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

Large-scale mapping of date palm trees is vital for their consistent monitoring and sustainable management, considering their substantial commercial, environmental, and cultural value. This study presents an automatic approach for the large-scale mapping of date palm trees from very-high-spatial-resolution (VHSR) unmanned aerial vehicle (UAV) datasets, based on a deep learning approach. A U-Shape convolutional neural network (U-Net), based on a deep residual learning framework, was developed for the semantic segmentation of date palm trees. A comprehensive set of labeled data was established to enable the training and evaluation of the proposed segmentation model and increase its generalization capability. The performance of the proposed approach was compared with those of various state-of-the-art fully convolutional networks (FCNs) with different encoder architectures, including U-Net (based on VGG-16 backbone), pyramid scene parsing network, and two variants of DeepLab V3+. Experimental results showed that the proposed model outperformed other FCNs in the validation and testing datasets. The generalizability evaluation of the proposed approach on a comprehensive and complex testing dataset exhibited higher classification accuracy and showed that date palm trees could be automatically mapped from VHSR UAV images with an F-score, mean intersection over union, precision, and recall of 91%, 85%, 0.91, and 0.92, respectively. The proposed approach provides an efficient deep learning architecture for the automatic mapping of date palm trees from VHSR UAV-based images.

Details

Title

Deep Convolutional Neural Network for Large-Scale Date Palm Tree Mapping from UAV-Based Images

Author

Mohamed Barakat A Gibril¹

; Helmi Zulhaidi Mohd Shafri¹; Shanableh, Abdallah²; Al-Ruzouq, Rami²

; Wayayok, Aimrun³

; Shaiful Jahari Hashim⁴

¹ Department of Civil Engineering and Geospatial Information Science Research Centre (GISRC), Faculty of Engineering, Universiti Putra Malaysia (UPM), Serdang 43400, Selangor, Malaysia; [email protected]
² Department of Civil and Environmental Engineering, Faculty of Engineering, University of Sharjah, Sharjah 27272, United Arab Emirates; [email protected] (A.S.); [email protected] (R.A.-R.); GIS and Remote Sensing Center, Research Institute of Sciences and Engineering, University of Sharjah, Sharjah 27272, United Arab Emirates
³ Department of Biological and Agricultural Engineering, Faculty of Engineering, Universiti Putra Malaysia (UPM), Serdang 43400, Selangor, Malaysia; [email protected]
⁴ Department of Computer and Communication Systems Engineering, Faculty of Engineering, Universiti Putra Malaysia (UPM), Serdang 43400, Selangor, Malaysia; [email protected]

First page

2787

Publication year

2021

Publication date

2021

Publisher

MDPI AG

e-ISSN

20724292

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/rs13142787

ProQuest document ID

2554735804

Deep Convolutional Neural Network for Large-Scale Date Palm Tree Mapping from UAV-Based Images

Jump to:

Full text

Abstract

Details

Suggested sources