Uncrewed aerial vehicles (UAVs) in combination with automatic image classification offer a great tool to monitor vegetation, plant diversity, and related plant traits at high spatial and temporal resolution (Anderson & Gaston, 2013; Bendig et al., 2015; Sotille et al., 2020; Van Iersel et al., 2018). This technical-digital combination benefits small-scale ecosystems, transition zones, and habitat patches with high spatiotemporal complexity. The possibility of monitoring such landscape features has created a growing demand for the discretization of ecological features and regions and provided new potential for conservation planning and political management of environmental areas of interest (Azpiroz et al., 2012; Dobson et al., 1997; Foster et al., 2003). One landscape element of high ecosystem importance is the kettle hole (Fig. 1). Kettle holes are depression wetlands that are mostly smaller than 1 ha and mainly occur at high densities in the agricultural landscapes of the young moraine areas (Kalettka & Rudat, 2006; Pätzig et al., 2012). They provide a wide range of ecosystem services, such as improving the hydrological cycle, flood control, and the chemical condition of fresh waters, as well as the biotic remediation of wastes, acting as a biological filter and shelter for local biodiversity (Pätzig et al., 2012; Vasić et al., 2020). Most kettle holes experience a seasonal wet-dry cycle and exhibit high potential for large biological species diversity. They cover up to 5% of the arable land of Germany (Pätzig et al., 2012; Vahrson & Frielinghaus, 1998). Due to their position within farmed areas and their high perimeter-area ratio, they are highly vulnerable and severely impacted by intensive agricultural practices in their surroundings.
There is a demand for a better understanding of the impacts of such activities and for proposing solutions that enable better conservation of these habitats while sustainable agricultural activities continue. Furthermore, the functioning, ecosystem services, and disservices of kettle holes, which result from complex interactions of abiotic and biotic processes within kettle holes and with their surroundings, need to be understood.
Regular high-resolution monitoring allows a better understanding of the functioning of these small-scale ecosystems and promotes the development of effective conservation and management measures. Photogrammetry workflows that enable automated, high-precision detection of dominant plant species are valuable tools to investigate ecosystem health. Much of the kettle hole area is not flooded most of the time, especially during the peak vegetation period at the end of the summer. Therefore, a large percentage of the area is covered with emerged plant species such as amphibious or terrestrial plants, which can be detected in aerial images.
Efficient workflows for automated vegetation segmentation with UAV imagery already exist and present promising results (Casado et al., 2015; Cruzan et al., 2016; Torres-Sánchez et al., 2014). They are often based on RGB and multi-spectral images in combination with shallow or deep learning (DL) techniques (Cen et al., 2019; Chabot et al., 2018; Turner et al., 2018). DL is a set of techniques that allow computers to learn complex patterns from data. These techniques can adapt to changing environments and enable ongoing improvement in understanding the scene. With the advancement of mapping and computer technologies, DL methods are being tested to identify patterns in the most varied fields of science (Dos Santos Ferreira et al., 2017; Martins et al., 2021; Shen et al., 2017; Torres et al., 2020). Typically, a large amount of training data is needed to train a DL-based network for subsequent classification (Goodfellow et al., 2016). However, the manual labeling of all pixels to extract image features for training is troublesome when dealing with imagery from kettle holes due to the visual complexity of the features. Even when using high spatial resolution datasets, a human operator has difficulties distinguishing between plant species and assigning each pixel to a class. Therefore, only some parts of each training image can be labeled. In this study, we introduce an approach combining unsupervised superpixel segmentation with subsequent convolutional neural networks (CNNs) to circumvent this challenge, as the complex dataset of the natural ecosystem has an abundance of nuances and details. Superpixels are an unsupervised classification approach to group image regions with similar perceptual features into clustered regions (Achanta et al., 2012). Each superpixel is treated as a training sample, thereby creating a labeled dataset that the CNN can use to find structured patterns without the need to label each image in its entirety.
We classify data from kettle holes to support the understanding of the composition of the landscape by performing a detailed detection of the flora present to characterize the species in the ecosystem. Our approach paves the way to creating databases that help identify environmental factors influencing the health of the ecosystem.
Materials and Methods
Our workflow is summarized in the following steps: data acquisition, manual image labeling, superpixel clustering, training the CNNs based on the superpixel images, and final segmentation of the image with the trained CNN models (Fig. 2).
Data acquisition was performed in the landscape laboratory AgroScapeLab Quillow (ASLQ, Leibniz Centre for Agricultural Landscape Research [ZALF]), named after the river Quillow, whose catchment covers an area of about 170 km². The region is located in the Central European lowlands, shaped into a hilly landscape by the last ice age. About three-quarters of the landscape is used for agriculture. The climate is sub-humid with a negative climatic water balance: from 1992 to 2019, an average of 658 mm of water evaporated and an average of 573 mm of precipitation fell per year, and the mean annual temperature was 8.8°C (Pätzig & Düker, 2021). UAV images were collected from 19 of the more than 1500 kettle holes in the region. The selected kettle holes are very diverse in their morphological and hydrological characteristics, and they were classified according to Kalettka and Rudat (2006).
As a result of the high variability in hydrogeomorphological conditions and water quality, the kettle holes are highly distinct in their biodiversity composition, making each one unique and covering a wide range of dominant vegetation types (Table 1; Fig. 3). For this work, we investigated nine locally dominant plant species classes commonly present in kettle holes, with the inclusion of dead plant biomass belonging to different plant life-forms occurring at each kettle hole. The plant life-forms ranged from amphibian plants to woody plants (Phanerophytes; Table S1). The Helophytes tended to be the most numerous plant life-form, as they often dominated the surface of the kettle holes, especially in the dry season, as during our study period.
Table 1 Examined plant species classes (including dead plants) and the associated plant life-forms and the number of samples, that is, superpixels, from each class in the unbalanced and balanced superpixel dataset.
Dominant plant species classes included (Class) | Plant life-form | Unbalanced | Balanced |
Carex riparia | Helophytes | 13 626 (16.52%) | 318 (9.09%) |
Cirsium arvense (L.) Scop. | Hemicryptophytes (nitrophilous perennials) | 318 (0.39%) | 318 (9.09%) |
Dead plants | Helophytes etc. (no woody plants) | 1450 (1.76%) | 318 (9.09%) |
Oenanthe aquatica (L.) Poiret | Amphibian plants | 2559 (3.10%) | 318 (9.09%) |
Others | Other plants, soil, water, stones, crops, etc. | 22 712 (27.53%) | 318 (9.09%) |
Phalaris arundinacea | Helophytes | 7532 (9.13%) | 318 (9.09%) |
Phragmites australis (Cav.) Trin. ex Steud. | Helophytes | 16 103 (19.52%) | 318 (9.09%) |
Salix alba | Phanerophytes (woody plants) | 3180 (3.85%) | 318 (9.09%) |
Salix cinerea | Phanerophytes (woody plants) | 9749 (11.82%) | 318 (9.09%) |
Typha latifolia | Helophytes | 468 (0.57%) | 318 (9.09%) |
Urtica dioica L. s. l. | Hemicryptophytes (nitrophilous perennials) | 4806 (5.83%) | 318 (9.09%) |
The size of the selected kettle holes ranged from 0.038 ha to about 1.2 ha. Concerning their water permanence, some kettle holes had been dry for an extended period of time and were considered episodic, while others were classified as permanent. However, due to prolonged droughts since 2018, all examined permanent kettle holes became semi-permanent, holding water most of the time but potentially drying up, especially at the end of the summer.
UAV imagery was obtained using a DJI Phantom 4 RTK carrying a 20-megapixel CMOS sensor with a fixed focal length of 8.8 mm. The sensor captured RGB images. The camera was attached to a gimbal that helped compensate for system vibrations due to the rotor movement and for pitch and roll movements of the aircraft due to wind. Mission planning was done with the DJI GS RTK App integrated in the P4 RTK. We used the 2D photogrammetry flight plan for 10 kettle holes and the 3D photogrammetry multi-oriented flight plan for nine kettle holes, at a flight altitude between about 25 and 35 m above ground, leading to a mean ground sampling distance of 9 mm. Flights were conducted either under cloud-free conditions or uniform cloud cover, between about 10:00 and 13:00 in broad daylight.
Manual labeling and superpixels
We manually labeled nine plant species classes, one class others, and one class dead plants (Table 1). Dead plants were included in the dataset to improve the quality of the classification and to enable a potential health assessment of the ecosystem as a whole. The labeling of plant species classes was done by manually drawing polygons around homogeneous plant species areas with the VGG Image Annotator (Dutta & Zisserman, 2019). We randomly selected images from different kettle holes to create a detailed dataset for the segmentation. Because the UAV imagery of kettle holes depicts complex natural scenery, manually labeling every pixel of an image with a particular class was highly challenging. Thus, instead, we took only patches of the image that could be clearly assigned to the objects of interest.
Afterward, we used the superpixel algorithm to divide pixels into perceptually meaningful regions, with the aim of eventually deriving labeled superpixels as training images. Superpixels are an unsupervised classification approach that captures image redundancy to provide a simpler primitive for computing image attributes and to considerably simplify subsequent image processing tasks (Achanta et al., 2012). The superpixel dataset was created with the Simple Linear Iterative Clustering (SLIC) algorithm. SLIC generates clusters of pixels with similar attributes such as color, texture, and shape. We used the software Pynovisão (Dos Santos Ferreira et al., 2017) and considered 405 attributes (Table 2). The attribute extraction can be understood as mathematical operations performed on the digital image data to group regions with similarities. The chosen attributes for extraction were based on previous approaches (Costa et al., 2019) and improved with the implementation of the K-curvature extraction algorithm (Abu Bakar et al., 2015). The algorithm uses K-means to cluster similar pixels and thereby separate the image into small pieces, that is, superpixels. We used a SLIC configuration with a K value of 4000; K corresponds to the approximate number of segments into which a given image is separated. The superpixel size depends on the K value and the image size. Other settings were sigma 5 and compactness 10: sigma smooths each image channel, and compactness balances the proximity of pixel color and space, with higher values favoring spatial proximity and therefore resulting in squarer superpixel shapes. The final size of each superpixel was in the range of about 50–100 pixels (Fig. 4), varying with the color attributes as they influence the border delimitation.
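The clustering step can be sketched with the SLIC implementation in scikit-image (an assumption for illustration; the study itself used Pynovisão). The compactness and sigma values mirror the reported configuration, while the number of segments is scaled down for a small synthetic tile:

```python
# Illustrative sketch of the SLIC superpixel step using scikit-image
# (the study itself used Pynovisão); compactness=10 and sigma=5 follow the
# configuration in the text, n_segments is scaled down for a toy image.
import numpy as np
from skimage.segmentation import slic

rng = np.random.default_rng(0)
tile = rng.random((200, 200, 3))  # stand-in for a UAV image tile

segments = slic(tile, n_segments=400, compactness=10, sigma=5, start_label=0)

print(segments.shape)      # label map with the same spatial shape as the tile
print(segments.max() + 1)  # actual number of superpixels produced
```

Note that SLIC treats `n_segments` as an approximate target; the actual number of superpixels depends on the image content.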
Table 2 Attributes and corresponding number of extractors used in the Simple Linear Iterative Clustering (SLIC) approach.
In the final step of the training data generation, we used the previously annotated plant species classes to extract their annotation coordinates and their corresponding class label. The class assignment of a superpixel was then performed considering that information. Thus, the annotated class that dominated each cluster, that is, superpixel, was chosen as the representing class considering a probabilistic threshold of 50%. For instance, if a superpixel region overlapped with two annotations such as Typha latifolia and dead plants having 51% and 49% of overlap, respectively, the superpixel was classified as T. latifolia. The attribute information of that superpixel was then stored to be later fed into the CNN training, making the machine understand these features as a representation of T. latifolia. The SLIC approach was used across the entire annotated image dataset to separate segments of each class.
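The majority-overlap rule described above can be sketched as follows; the arrays and class ids are illustrative, not the study's data:

```python
# Minimal sketch of the majority-overlap rule: each superpixel receives the
# annotated class covering most of its annotated pixels. Toy data only.
import numpy as np

def label_superpixels(segments, annotation, background=-1):
    """segments: per-pixel superpixel ids; annotation: per-pixel class ids,
    with `background` marking unannotated pixels."""
    labels = {}
    for sp in np.unique(segments):
        classes, counts = np.unique(annotation[segments == sp],
                                    return_counts=True)
        keep = classes != background       # ignore unannotated pixels
        if not keep.any():
            continue                       # superpixel has no annotation
        classes, counts = classes[keep], counts[keep]
        labels[int(sp)] = int(classes[np.argmax(counts)])
    return labels

segments = np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1]])
annotation = np.array([[2, 2, -1, 3],
                       [2, -1, 3, 3]])
print(label_superpixels(segments, annotation))  # {0: 2, 1: 3}
```

Superpixel 0 overlaps mostly with class 2, superpixel 1 mostly with class 3, so those become their training labels.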
Sampling the labeled images with the superpixel approach led to a very unbalanced dataset (Table 1), as most of the superpixels (63.57%) fell into only three classes: others (27.53%), Phragmites australis (19.52%), and Carex riparia (16.52%). This can bias DL algorithms. Therefore, we also created a balanced dataset by randomly selecting 318 superpixels from each class using the undersampling technique; 318 corresponds to the size of the smallest class in the unbalanced dataset, Cirsium arvense. Thus, a high number of labels of the plant classes were created, but to ensure a balanced labeled dataset, not every identified region was selected. Four samples from each of the 11 superpixel classes are shown in Figure 5, where each superpixel covers an approximate area of 9.2 × 9.2 cm.
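The undersampling step can be sketched as follows; the class names follow Table 1, while the sample lists are placeholders:

```python
# Sketch of the undersampling step: every class is randomly reduced to the
# size of the smallest class (318 superpixels here).
import random

def undersample(samples_by_class, seed=42):
    """samples_by_class: dict mapping class name -> list of samples."""
    rng = random.Random(seed)
    n_min = min(len(v) for v in samples_by_class.values())
    return {cls: rng.sample(v, n_min) for cls, v in samples_by_class.items()}

data = {"Cirsium arvense": list(range(318)),
        "Carex riparia": list(range(13626)),
        "Others": list(range(22712))}
balanced = undersample(data)
print({cls: len(v) for cls, v in balanced.items()})
# every class now holds 318 samples
```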
Three state-of-the-art (SOTA) CNNs were used for the image segmentation: NASNetMobile (Zoph et al., 2018), EfficientNet (Tan & Le, 2019), and Xception (Chollet, 2017). They were selected due to their SOTA results in previous tasks and their different learning strategies.
A CNN architecture is made up of an input layer, which in this study is the superpixel. Following the input, there are hidden layers, which process the data to extract information. There are many ways to configure the hidden layers, and many of them can be stacked to make the network 'deeper'; the number of hidden layers depends on the architecture of the network. The hidden layers learn mathematical functions to obtain predictions from the input data and to create an output that aligns with a human's way of understanding and making sense of the data. The final output layer concludes the network; in our study, it was a raster in which each pixel was assigned to a class.
NASNet is a novel class of algorithms that searches for an adequate network architecture for the problems at hand. It uses a reinforcement learning search method to optimize the architecture configurations (Zoph & Le, 2016). For this research, we used the mobile implementation NASNetMobile, which is less resource intensive than NASNet. It searches for the architecture's building blocks on small datasets and then transfers them to larger datasets. The ‘NASNet search space’ enables the transferability of knowledge from smaller datasets to bigger ones.
The controller recurrent neural network (RNN) samples parallel or ‘child’ networks, which are trained to converge to a target accuracy on a held-out validation set. These accuracies create a gradient and update the controller that will improve the architecture as processing goes forward. The structures of the cells are searched within a search space (Fig. S1). The controller RNN selects an operation from a set of operations, which were selected based on their prevalence in the DL literature (Zoph et al., 2018), to apply to the hidden states. These operations are, for example, max poolings of different sizes and depthwise-separable convolutions of different sizes. The model is supposed to find the best architecture of the CNN related to the dataset and the computation processing capabilities.
EfficientNet
CNNs are usually developed at a determined resource quantity and manually scaled for better results if more resources are available (Tan & Le, 2019). EfficientNet uses a compound coefficient to automatically scale the network depth, width, and resolution dimensions to use all the available resources of the machine. Thus, differently sized CNNs will be generated depending on the data and hardware used. The main difficulty of scaling the model is that the optimal depth, width, and resolution depend on each other, and the values change under different resource constraints. Tan and Le (2019) observed that scaling up any dimension of network width, depth, or resolution improves accuracy, but the accuracy gain diminishes for bigger models. In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of the network during CNN scaling. The network achieved very good results on the ImageNet dataset (Deng et al., 2009) while being smaller and faster than previous CNNs. EfficientNet achieved SOTA accuracies on CIFAR (Çalik & Demirci, 2018) and four other transfer learning datasets (Fig. S2).
Xception
Xception explores inception modules to leverage depthwise separable and regular convolutions, based on the assumption that cross-channel and spatial correlations are decoupled. Less computational power is needed because fewer operations are required to perform the convolutions. A depthwise separable convolution, commonly called 'separable convolution' in DL frameworks such as TensorFlow and Keras, can be understood as an Inception module with a maximally large number of towers (Chollet, 2017). The towers are defined by a pooling phase, followed by convolutions (Chollet, 2017). More specifically, inception modules are replaced by depthwise separable convolutions, that is, a spatial convolution performed independently over each channel of an input, followed by a pointwise convolution, that is, a 1 × 1 convolution projecting the channels output by the depthwise convolution onto a new channel space (Chollet, 2017). The Xception architecture has 36 convolutional layers forming the feature extraction base of the network (Fig. S3). In short, the Xception architecture is a linear stack of depthwise separable convolution layers with residual connections (Chollet, 2017). Xception presented performance gains on the ImageNet dataset (Deng et al., 2009) compared to other structures such as Inception V3 (Szegedy et al., 2015).
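The core operation can be illustrated with a small NumPy sketch of a depthwise separable convolution, a per-channel spatial convolution followed by a 1 × 1 pointwise convolution; shapes and weights below are toy values, not actual Xception layers:

```python
# Toy NumPy sketch of a depthwise separable convolution: a spatial
# convolution applied independently per channel, followed by a 1x1 pointwise
# convolution that mixes channels into a new channel space.
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """x: (H, W, C); depthwise_k: (k, k, C); pointwise_w: (C, C_out)."""
    h, w, c = x.shape
    k = depthwise_k.shape[0]
    out_h, out_w = h - k + 1, w - k + 1
    # spatial convolution applied independently per input channel
    depthwise = np.zeros((out_h, out_w, c))
    for ch in range(c):
        for i in range(out_h):
            for j in range(out_w):
                depthwise[i, j, ch] = np.sum(
                    x[i:i + k, j:j + k, ch] * depthwise_k[:, :, ch])
    # 1x1 pointwise convolution projects onto a new channel space
    return depthwise @ pointwise_w

x = np.ones((5, 5, 3))
out = depthwise_separable_conv(x, np.ones((3, 3, 3)), np.ones((3, 2)))
print(out.shape)  # (3, 3, 2)
```

Compared with a regular convolution, the spatial filtering and the channel mixing are factored into two cheaper operations.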
Hyperparameters and optimizers
The networks used in this study have different numbers of parameters (Table 3), which control the learning process. Parameters, in general, are weights that are learned during training. They come in the form of matrices that contribute to the model's predictive power and are changed during the back-propagation process. This change is governed by the chosen algorithm, that is, the optimization strategy.
Table 3 Number of trainable parameters of the three CNNs used.
Architectures | Trainable parameters |
EfficientNet | 43 293 709 |
NasNetMobile | 30 994 717 |
Xception | 72 450 857 |
After the convolutional cells have been learned, several hyperparameters, which are not learned, may be explored to build a final network for a given task (Table 4). In this study, the networks were trained with 1000 and 100 epochs for the balanced and unbalanced datasets, respectively. The number of epochs is lower for the unbalanced case due to hardware limitations, because the number of training instances is more than one order of magnitude larger than for the balanced case. The remaining parameter values are the same for both datasets. A 10% patience was used for early stopping, that is, training would stop if no improvement in the validation loss was observed within 10% of the total number of epochs. Data augmentation was implemented through random flips, random zooms (max = 10%), random horizontal and vertical shifts (max = 50%), and random rotations (max = 90°).
Table 4 Hyperparameter set used in this study (same for all 3 CNNs).
Hyperparameter | Value |
Training epochs | 1000 (B), 100 (U) |
Early stop patience | 10% |
Early stop monitor | Loss |
Loss function | Softmax |
Checkpoint saving | True |
Initial learning rate | 0.01 |
Validation split | 20% |
Neurons FC layer | 512 |
Dropout FC layer | 50% |
Data augmentation | Yes |
Cross-validation data technique | Fivefold |
Transfer learning | ImageNet |
Fine tuning | True |
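The 10%-patience early-stopping rule can be sketched as follows; the validation-loss sequence is synthetic:

```python
# Sketch of the 10%-patience early-stopping rule: training halts if the
# monitored validation loss has not improved within 10% of the planned
# epochs. The loss curve below is synthetic.
def epochs_run(val_losses, total_epochs):
    patience = max(1, int(0.10 * total_epochs))
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch + 1          # stopped early after this epoch
    return len(val_losses)            # ran to completion

# improves for 50 epochs, then plateaus -> stops 100 epochs after the best one
losses = [1.0 - 0.01 * i for i in range(50)] + [0.51] * 950
print(epochs_run(losses, total_epochs=1000))  # 150
```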
For the optimization of the CNNs, three adaptive strategies were tested. An optimizer is used to update the weights in search of the smallest loss value; thereby, the network is said to be learning. The general idea of the optimization is to tweak the parameters iteratively, changing the learning rate to minimize the loss function.
The optimizers chosen in this study were Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), and Adam (Kingma & Ba, 2014). They are adaptive gradient descent-based algorithms. Adaptive learning rate optimizers have consistently shown better results than the standard, non-adaptive strategies when it is not possible to fine-tune a specific learning rate schedule (Bera & Shrivastava, 2020). We chose these optimizers because they perform well when the data are sparse, and they are capable of adapting well to the data.
Adagrad is an algorithm for gradient-based optimization that adapts the learning rate to the parameters. For parameters associated with frequently occurring features, smaller updates (i.e., low learning rates) are used, while for parameters associated with uncommon features, larger updates (i.e., high learning rates) are used. As a result, it is well suited to handle sparse data, and there is no need to tune the learning rate manually. One of the disadvantages of Adagrad is that the learning rate tends to become infinitesimally small, causing the algorithm to lose its capacity to learn; this happens because Adagrad accumulates all past squared gradients in the denominator. Adadelta is a less aggressive extension of Adagrad. Instead of accumulating all past squared gradients, it restricts the window of accumulated past gradients to a fixed size. This way, Adadelta continues learning even after many updates have been done. Adam is an alternative way of calculating adaptive learning rates for each parameter. Aside from retaining an exponentially decaying average of past squared gradients, it also keeps an exponentially decaying average of the past gradients themselves, similar to momentum.
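The Adagrad update can be illustrated with a toy NumPy sketch; the learning rate and the quadratic objective are illustrative choices, not the study's settings:

```python
# Toy NumPy sketch of the Adagrad update: the accumulated squared gradients
# in the denominator shrink the step size for frequently updated parameters.
import numpy as np

def adagrad_step(w, grad, accum, lr=0.5, eps=1e-8):
    accum = accum + grad ** 2                    # all past squared gradients
    w = w - lr * grad / (np.sqrt(accum) + eps)   # per-parameter scaled step
    return w, accum

# minimize f(w) = w^2 (gradient 2w) starting from w = 5
w, accum = np.array([5.0]), np.zeros(1)
for _ in range(500):
    w, accum = adagrad_step(w, 2 * w, accum)
print(float(w[0]))  # close to the minimum at 0
```

The growing accumulator is exactly why Adagrad's steps eventually become very small, the weakness that Adadelta and Adam address with decaying averages.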
Metrics and experimental setup
We calculated the confusion matrix to evaluate the classification performance, and from it, we derived the precision P (Eq. 1), recall R (Eq. 2), and F1-score (Eq. 3). TP, TN, FP, and FN stand for true positives, true negatives, false positives, and false negatives, respectively. In our analysis, positives and negatives refer to the pixels correctly and falsely assigned to the corresponding class by the underlying classifier. Such positives and negatives are true or false, depending on whether or not they agree with the assigned ground truth class.

P = TP / (TP + FP) (1)

R = TP / (TP + FN) (2)

F1 = 2 · P · R / (P + R) (3)
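Deriving these per-class metrics from a confusion matrix can be sketched as follows; the 2 × 2 matrix is a toy example with rows as ground truth and columns as predictions:

```python
# Sketch of per-class precision, recall, and F1-score (Eqs. 1-3) derived
# from a confusion matrix (rows = ground truth, columns = predictions).
import numpy as np

def per_class_metrics(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as the class but belonging elsewhere
    fn = cm.sum(axis=1) - tp   # class members predicted as something else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

cm = np.array([[8, 2],
               [1, 9]])
p, r, f1 = per_class_metrics(cm)
print(np.round(p, 2), np.round(r, 2), np.round(f1, 2))
```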
Three CNNs were trained, each using three optimizers. The balanced and unbalanced datasets were randomly divided into five folds; one fold was held out as the test set and the remaining four folds were used for training. This resampling strategy, called fivefold stratified cross-validation, is commonly used to evaluate machine learning algorithms (Wilson et al., 2020; Wong & Yeh, 2019). In the case of the unbalanced dataset, the random division considered the different number of samples of each class, hence preserving the class size ratio in each fold. Precision, recall, and F1-score were used to measure the performance of each algorithm over the five test folds.
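A stratified fivefold split can be sketched as follows; the two-class toy labels mimic an unbalanced dataset:

```python
# Minimal sketch of a stratified fivefold split: samples of each class are
# dealt round-robin across folds so every fold preserves the class ratio.
import numpy as np

def stratified_folds(labels, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.where(labels == cls)[0])
        for i, sample in enumerate(idx):
            folds[i % k].append(int(sample))
    return folds

labels = np.array([0] * 50 + [1] * 200)          # 20% / 80% class ratio
folds = stratified_folds(labels)
print([sum(labels[f]) / len(f) for f in folds])  # 0.8 in every fold
```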
Results
The result assessment of the processing used the superpixel images as a base. The best classification performance for the unbalanced dataset was achieved by Xception using the Adagrad optimizer, indicated by the highest average precision (0.83), recall (0.75), and F1-score (0.77) (Table 5). However, EfficientNet, also using the Adagrad optimizer, has a lower interquartile range (IQR) (Fig. 6). In the case of the balanced dataset, the Adagrad optimizer provides the best overall performance (precision, recall, and F1-score) for the EfficientNet and Xception architectures; the latter displays a higher median and smaller IQR. The F1-score for Xception is 0.85, and a good balance between precision (0.86) and recall (0.85) is achieved (Table 5). For NASNetMobile, the Adadelta optimizer reveals a higher median without outliers. However, the results for NASNetMobile are generally inferior to the other two CNNs (Fig. 6). As expected, the results using the balanced dataset are superior to the unbalanced dataset, but the IQR for EfficientNet using Adagrad is lower than for Xception in the unbalanced case.
Table 5 Average precision (precis.), recall, and F1-score for each architecture and optimizer on the unbalanced and balanced superpixel datasets.
 | Unbalanced | Balanced |
Architecture | Optimizer | Precis. | Recall | F1-score | Optimizer | Precis. | Recall | F1-score |
EfficientNet | Adadelta | 0.14 | 0.14 | 0.11 | Adadelta | 0.28 | 0.21 | 0.19 |
EfficientNet | Adagrad | 0.74 | 0.68 | 0.69 | Adagrad | 0.80 | 0.78 | 0.78 |
EfficientNet | Adam | 0.03 | 0.07 | 0.04 | Adam | 0.01 | 0.07 | 0.02 |
NASNetMobile | Adadelta | 0.19 | 0.17 | 0.14 | Adadelta | 0.28 | 0.26 | 0.23 |
NASNetMobile | Adagrad | 0.32 | 0.19 | 0.13 | Adagrad | 0.25 | 0.22 | 0.18 |
NASNetMobile | Adam | 0.16 | 0.15 | 0.10 | Adam | 0.01 | 0.07 | 0.02 |
Xception | Adadelta | 0.29 | 0.23 | 0.21 | Adadelta | 0.55 | 0.52 | 0.51 |
Xception | Adagrad | 0.83 | 0.75 | 0.77 | Adagrad | 0.86 | 0.85 | 0.85 |
Xception | Adam | 0.14 | 0.18 | 0.14 | Adam | 0.48 | 0.40 | 0.37 |
For both the balanced and unbalanced datasets, most of the nine experimental configurations yielded unsatisfactory outcomes (F1-scores mostly below 0.3). However, two good outcomes were obtained with Xception and EfficientNet when employing the Adagrad optimizer. Adagrad is better suited for sparse data, which is the case for some vegetation classes in this study, and it outperformed the other two optimizers (Adadelta and Adam). In this study, combining a fixed-size CNN (i.e., Xception) and an adaptive learning rate optimizer (i.e., Adagrad) produced the best F1-score results.
The learning curve for the best configuration (Xception and Adagrad) is shown in Figure 7 and confirms that the choice of 1000 epochs for training was sufficient. The number could even have been reduced because, after about 40 epochs, the validation loss stops decreasing, indicating a possible start of overfitting. EfficientNet and Xception reveal a good performance in identifying the plant classes, indicated by the high values on the diagonal of the confusion matrix (Fig. 8). The success of the DL method varied depending on the class. The highest errors occurred for C. riparia, which was mistakenly classified as others in 14% of the test set samples with both architectures, EfficientNet and Xception. EfficientNet presented a lower correct classification rate (64%) than Xception (70%) for this class. The brownish color, which also appears in some of the training samples from the class others, may have confused both DL algorithms. NASNetMobile could not learn to classify most of the classes; its highest producer's accuracy in the confusion matrix was only 0.78, for the dead plants class.
Only a small number of samples (i.e., 318 superpixels per class) were needed in this study to train the networks. Manual labeling is one of the costliest aspects of a supervised image classification workflow; thus, the need for only a small number of samples is an auspicious advancement for improved mapping of complex vegetation systems (Fig. 9).
Vegetation mapping is an important technical undertaking for managing natural resources. Traditional approaches (e.g., field surveys or collateral and supplementary data analysis) are time-consuming, data-lagging, and often prohibitively expensive. Remote sensing in combination with machine learning technologies can provide a practical and cost-effective way to study changes in vegetation cover. This study, applying machine learning methods to UAV images, achieved F1-scores that are comparable to previous studies of plant segmentation (Elkind et al., 2019; Martins et al., 2021; Torres et al., 2020). Some studies implemented additional information such as digital surface models (DSMs) or multi-spectral data to classify species and obtained similar or better F1-scores (Benjamin et al., 2021; Chabot et al., 2018; Durgan et al., 2020; Husson et al., 2016; Schulze-Brüninghoff et al., 2021). However, retrieving DSMs might be difficult in kettle holes, for example, under windy conditions due to moving vegetation (Pätzig et al., 2020). Moreover, the application of multi-spectral cameras makes the approach more expensive. Furthermore, these studies relied on extensive, hand-crafted feature selection (e.g., spectral and textural variables and band indices), which was then used with random forest or support vector machine classifiers, whereas CNNs allow for end-to-end learning. Another study by Bhatnagar et al. (2020) also used CNNs and achieved F1-scores up to 90%, although it classified vegetation communities (grouped plant species) rather than individual species.
In order to begin vegetation preservation and restoration projects, it is important to first determine the existing status of vegetation cover. Using UAVs in combination with DL methods shows potential for highly efficient kettle hole mapping. Nonetheless, remote sensing-based plant species detection is still complex. Potentially strong changes of plant communities over small areas, due to steep abiotic gradients and meta-community processes within kettle holes and beyond, demand spatially precise, high-resolution data to enable a good discretization of the environment. Furthermore, in some classification cases, information-richer datasets, such as multispectral or hyperspectral data, may still need to be applied (Rossi et al., 2021). Also, even trained specialists may make mistakes caused by poor image quality, influences of shadow, or a pixel being covered by several species (mixed pixels).
When using the results of vegetation mapping from remote sensing imagery, further aspects need to be considered (Rapp et al., 2005): how well the chosen classification system represents actual vegetation community composition, how effectively remote sensing images capture the distinguishing features of each mapping unit, and how well these mapping units can be delineated by photointerpreters. Thus, we must evaluate the applicability of each chosen machine learning method for each specific task. To better depict plant community compositions, a well-fitted vegetation classification system should be carefully established according to the study's purpose. There is no superior image classifier that can be used in all applications equally; applying or developing new classifiers fit for certain applications is a challenging task needing further research.
Kettle holes in Germany have received legislative protection, but existing conservation methods are insufficient in terms of preserving potential habitat functions, which depend on the specific environmental conditions (Berger et al., 2011; Pätzig et al., 2012). A significant achievement of this study is the correct simultaneous classification of plant species of different plant life-forms with an F1-score of 0.85. This is a pioneering and forward-looking methodological step toward the efficient characterization of many kettle holes and thus a better understanding of the dynamics, functions, and services of kettle hole ecosystems. In the next step, creating spatiotemporal land cover change maps of these regions, combined with information about current agricultural practices, can help to better understand how different agricultural strategies influence these ecosystems. This can support the provision of specific protection measures for each kettle hole and can also build the base for transferring experiences to other regions.
Using the superpixels as training images instead of the original image makes the training of the CNNs less complex and less resource-demanding, because a predetermined number of grouped pixels is used instead of all the pixels of the original image. In addition, the resources needed for object labeling are reduced, because not every feature in the image has to be addressed. Instead, the superpixel algorithm creates a dataset of similar feature patches, and only the labeled patches, selected based on the manually drawn polygons, are considered during the training process. Thus, an unsupervised, automated approach is used to derive training images, that is, the superpixels, which are manually labeled based on the polygons, yielding potentially many training images from one original image.
Conclusion
DL with CNNs is especially suited for information extraction from UAV imagery, and it can become essential for the assessment of the challenging ecosystem of kettle holes. These ecosystems greatly benefit from high-resolution monitoring of their diversity and change over time. We demonstrated that it is possible to train a suitable network for classifying different plant species and dead plants in kettle hole systems. Three different networks and three different optimizers were tested to automatically classify plants that are typical of this landscape and were captured in UAV images. The best results (F1-score 0.86) were achieved with the CNN Xception and the optimizer Adagrad. A balanced dataset was necessary for successful training; however, only a small number of samples (i.e., 318 superpixels per class) was needed. A workflow was introduced to sample the training data efficiently from patchily annotated images by means of a preceding unsupervised image segmentation step.
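The best-performing setup (Xception with Adagrad, 10 plant classes) can be outlined as a minimal Keras model definition. Input size, learning rate, and the use of `weights=None` (instead of downloading ImageNet weights) are assumptions to keep the sketch self-contained, not the study's exact hyperparameters.

```python
# Minimal sketch of the reported setup: Xception backbone + Adagrad optimizer
# on 10 plant classes. Hyperparameters here are illustrative assumptions.
import tensorflow as tf

num_classes = 10  # plant classes in the kettle-hole dataset

# Backbone without the top layer; superpixel patches would be resized to a
# fixed input shape before training. weights=None keeps the sketch offline;
# pretrained ImageNet weights would typically be used in practice.
base = tf.keras.applications.Xception(
    include_top=False, weights=None, input_shape=(299, 299, 3), pooling="avg"
)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```

Training would then proceed on the balanced set of labeled superpixel patches (318 per class, as reported above).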
This study provides a method to narrow the current gap between machine learning and remote sensing on the one hand and ecology, including conservation, on the other. Machine learning is a viable tool to assess the health of ecosystems and should be increasingly implemented in long-term monitoring programs to facilitate evidence-based conservation. In the next step, the segmented images, whose exterior and interior camera geometry are known, will be projected onto an orthomosaic to quantify present plant species cover. Furthermore, the classification needs to be extended to different periods of the year to capture and quantify seasonal changes and better assess the dynamics of the kettle hole ecosystem.
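Once classified images are projected onto an orthomosaic, quantifying species cover reduces to counting class pixels. The following sketch uses a synthetic class map with hypothetical class codes; it only illustrates the cover computation, not the photogrammetric projection itself.

```python
# Hedged sketch of the planned cover quantification: per-class cover as the
# fraction of pixels in a classified raster. The class map is synthetic and
# the class codes are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
class_map = rng.integers(0, 3, size=(100, 100))  # e.g., 0=water, 1=reed, 2=dead

labels, counts = np.unique(class_map, return_counts=True)
cover = {int(c): n / class_map.size for c, n in zip(labels, counts)}
for c, frac in cover.items():
    print(f"class {c}: {frac:.1%} cover")
```

Repeating this computation for orthomosaics from different seasons would yield the land cover change maps mentioned above.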
Acknowledgments
This research has been partly funded by the Leibniz Centre for Agricultural Landscape Research (ZALF) through the integrated priority project SWBTrans: 'Smart Use of Heterogeneities of Agricultural Landscapes'. We would like to thank Dorith Henning for labeling the images. We would like to thank the Graduate Program of Environmental Technologies of the Federal University of Mato Grosso do Sul (UFMS), which supported the doctoral dissertation of the first author. We would like to thank NVIDIA Corporation for the donation of the GPU used in this research. Open Access funding enabled and organized by Projekt DEAL.
© 2023. This work is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Abstract
The use of uncrewed aerial vehicles to map the environment has increased significantly in the last decade, enabling a finer assessment of the land cover. However, creating accurate maps of the environment is still a complex and costly task. Deep learning (DL) is a new generation of artificial neural network research that, combined with remote sensing techniques, allows a refined understanding of our environment and can help to solve challenging land cover mapping issues. This research focuses on the vegetation segmentation of kettle holes. Kettle holes are small, pond-like, depressional wetlands. Quantifying the vegetation present in this environment is essential to assess the biodiversity and the health of the ecosystem. A machine learning workflow has been developed, integrating a superpixel segmentation algorithm to build a robust dataset, which is followed by a set of DL architectures to classify 10 plant classes present in kettle holes. The best architecture for this task was Xception, which achieved an average F1-score of 0.86.
Affiliations
1 Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil
2 Provisioning of Biodiversity in Agricultural Systems, Leibniz Centre for Agricultural Landscape Research (ZALF) e.V, Müncheberg, Germany
3 Universidade Católica Dom Bosco, Campo Grande, Brazil; Instituto Federal de Mato Grosso do Sul, Aquidauana, Brazil
4 Universidade Católica Dom Bosco, Campo Grande, Brazil
5 Department of Forest Engineering, Santa Catarina State University (UDESC), Lages, Santa Catarina, Brazil
6 Institute of Photogrammetry and Remote Sensing, Technische Universität Dresden, Dresden, Germany