1. Introduction
Pine wilt disease (PWD) is a worldwide forest disease caused by the pine wood nematode (PWN: Bursaphelenchus xylophilus (Steiner & Bührer) Nickle) that leads to the devastating death of pine species; it can harm 70 species of conifers, comprising 57 pine species and 13 non-pine species [1]. The disease is often called the “cancer of pines” or a “fire without smoke” because it spreads easily, develops rapidly, has a high mortality rate, and is difficult to control. At present, the disease is distributed in only eight countries: the United States, Canada, and Mexico in North America; China, South Korea, Japan, and North Korea in East Asia; and Portugal in Europe. PWD originated in North America and has been disseminated by human transport. It is now listed as a quarantine pest in 52 countries worldwide [2].
Pine wilt disease was first identified in China in 1982 on Pinus sylvestris at Zhongshan Ling, Nanjing, Jiangsu Province, and subsequently spread rapidly to the surrounding areas, causing massive pine tree mortality. The economic damage caused by PWD in China has fluctuated but generally increased since 1998, and the disease entered a full-blown outbreak stage between 2013 and 2017. From 1996 to 2010, China’s mean annual economic loss attributable to the various forest pests and diseases exceeded RMB 100 billion, including direct economic losses and losses of ecological service value [3,4,5]. Since 2015, the economic losses incurred because of PWD have increased notably, at a growth rate in excess of 40%. The mean total economic damage was estimated at RMB 7.17 billion per year, comprising a direct financial loss of RMB 1.53 billion and an indirect financial loss of RMB 5.64 billion [6].
PWD is a highly infectious and destructive disease that causes considerable economic losses. Its symptoms typically manifest as a gradual change in needle coloration, initially from green to yellow and subsequently to reddish brown; in addition, bark beetles may be observed on infected trees. The key to controlling and preventing pine wilt disease is early detection [7]. Initially, detection relied primarily on manual monitoring of pine forests. Monitors observed that the disease typically manifested between April and November, accompanied by the gradual change in needle color from green to yellow and then to reddish brown, and that the presence of bark beetles in infected areas indicated the disease’s progression. When the bark is cut with a knife, infected trees release little or no resin. These characteristics were initially employed as indicators of whether a pine tree was afflicted with pine wilt disease [8]. However, this method of identification is highly subjective and experience-dependent, and it lacks accuracy. Building on the use of resin flow to diagnose infection, monitors also punched round holes into the xylem of pine trunks and observed the flow of resin in the holes to determine whether a tree was infected; this proved to be a simple but still inaccurate method that could only assist diagnosis [9]. To enhance the precision of PWD detection, biologists have put forward a range of biological detection techniques. Morphological detection is straightforward and does not require sophisticated instrumentation; consequently, it is frequently preferred by laboratories for detecting pine wood nematodes [10]. With the advent of molecular biology, biologists have employed techniques such as DNA assays, PCR assays, and RT-PCR assays, which enable more precise detection of pine wood nematode disease [11,12,13]. However, these methods lack the timeliness required to control pine wilt disease effectively. Researchers therefore continue to propose novel methods for the rapid and precise diagnosis of pine wilt disease. While biochemical methods offer a high degree of accuracy, the development of more efficient detection techniques remains a subject requiring further investigation [13].
The development of geographic information technology and sensor technology has led to widespread research on, and application of, low-altitude earth observation networks consisting of remote sensing satellites and unmanned aerial vehicles (UAVs) for monitoring PWD [7]. Lee et al. employed a range of remotely sensed data, including high-spatial-resolution satellite imagery (IKONOS in 2000/2003 and QuickBird in 2005), aerial photographs, and digital UAV data, to assess various methods for detecting infected pine trees [14]. Furthermore, Lee et al. employed a ground-based hyperspectral sensor to examine the wavelength bands of infected trees at varying stages of disease onset, with the analyzed wavelengths lying predominantly within the optical band [15]. The results demonstrated that the most pronounced alterations in the vegetation index occurred at a wavelength of 688 nm. Kim et al. employed multi-temporal hyperspectral aerial photo data with a spatial resolution of 1 m, acquired via a UAV, along with NDVI and VIgreen, to identify fallen and standing discolored pine trees [16]. The use of drones for remote sensing has the potential to overcome the limitations of satellite remote sensing. The flexibility and responsiveness of drone-based remote sensing allow for immediate adjustments to monitoring programs, whereas satellite remote sensing is constrained by a fixed revisit period and weather conditions and is relatively less time-sensitive. Moreover, when the number of infected trees in a forest is small, satellite remote sensing cannot accurately reflect these changes on the remote sensing map, thereby preventing timely control of the disease’s spread. Nonetheless, the evaluation of UAV remote sensing images remains predominantly reliant on manual visual inspection, which requires skilled readers and is inherently subjective [12].
With the advancement of artificial intelligence technology, automatic identification of PWD at low altitude, with high resolution and in real time, has been achieved by carrying AI models on UAVs. Deng et al. combined UAV remote sensing with AI technology and trained a PWD detection model (R-R-CNN) that pairs a region proposal network (RPN) with a residual neural network (ResNet) on the basis of the Faster Region-based Convolutional Neural Network (Faster R-CNN) deep learning framework; after the network was optimized, detection accuracy improved notably [17]. Yu et al. employed two deep learning detection algorithms (YOLOv4 and Fast R-CNN) and two common machine learning algorithms (Support Vector Machine and Random Forest) for the early detection of infected pines based on UAV remote sensing [18]. Xie et al. presented a feature fusion network for PWD detection based on UAV imagery by adopting an attention mechanism and deformable convolution [19]. In the latest research, improved YOLO models remain the mainstream detection models. Wang et al. detected PWD based on AAV remote sensing and an improved YOLOv5s model [20]. Du et al. improved the YOLOv5 model with adaptive spatial feature fusion and a SimAM module and used it to detect infected pines [21]. Zhu et al. used an improved YOLOv7 model to detect infected pine trees and achieved good results [22].
The use of enhanced YOLO models and machine learning algorithms for pine wilt recognition has become a prevalent approach among researchers. Recently, state-space models (SSMs) with selection mechanisms and hardware-aware architectures, such as Mamba [23], have exhibited considerable potential for modeling long sequences. Given that the complexity of the Transformer self-attention mechanism grows quadratically with image size, and given the increasing computational demands, researchers are currently investigating how Mamba can be adapted to computer vision tasks [24]. While Mamba models have been used for vision tasks [25,26], to date, no scholars have applied Mamba models to pine wilt recognition.
Consequently, a fusion model is proposed in this paper. The objectives are to detect pine wilt using this fusion model and to assess, for researchers, the potential of the Mamba model for pine wilt disease detection. The principal contributions of our research are as follows:
An unmanned aerial vehicle was employed to gather RGB images of pine trees afflicted by pine wilt in a mixed forest environment. The images were then pre-processed to constitute the final research dataset.
The Mamba backbone is employed in the fusion model to capture features from the input PWD images.
The fusion model was augmented with an attention network, and four attention mechanism modules were evaluated to identify the optimal configuration.
PAFPN is employed in the fusion model to augment the model’s capacity to integrate multi-scale object features.
The fusion model replaces traditional convolution with depthwise separable convolution.
2. Materials and Methods
2.1. Construction of Dataset
2.1.1. Data Collection
The pine tree image data employed in this study to identify pine wilt disease were collected by our team from natural pine forests containing exotic pine species and Masson pine (horsetail pine). The pine wilt images were collected in Hengxi Street, Jiangning District, Nanjing City, Jiangsu Province, China (31.6675° N, 118.6906° E) (as shown in Figure 1). The area comprises 22 natural villages, with the terrain sloping from southwest to northeast and the highest elevation reaching 382 m above sea level. The annual precipitation is 1276.9 mm, and the air temperature remains above 0 degrees Celsius for most of the year. The environment is conducive to the survival of the pine wood nematode, thereby creating conditions that facilitate the emergence of pine wilt infection in the study area [27]. Photographic documentation of pine forests in several natural villages, including Xugao Village, was conducted primarily during the spring of 2023. The documentation spanned the full spectrum of pine wilt infestation, from the initial signs of infestation to the complete defoliation of the trees. Images depicting different levels of infestation severity were obtained in favorable weather conditions with good visibility. Given the expansive nature of the pine forest environment, the dataset encompasses both areas with dense pine tree cover and areas with relatively sparse pine populations. In densely forested areas, the images show shading between trees and overlapping canopies, as well as fluctuations in background illumination attributable to the time of day at which they were captured.
RGB images of PWD in pine trees were captured from a range of heights and distances using a drone (DJI Innovations Mavic Air 2). The dataset comprised different categories of infected pine trees, collectively referred to as pine wilt. Figure 2 provides a visual representation of the early and late stages of wilt symptoms.
2.1.2. Data Pre-Processing
Previous studies demonstrate a robust positive correlation between dataset quality and the performance of machine learning systems; poor data quality can have a detrimental impact on the capability of deep learning models trained on such data [28,29]. The use of deep learning methods for the identification of pine wilt disease represents a favorable approach for enhancing the efficiency and accuracy of control mechanisms for this disease. Nevertheless, the quality of images captured by UAVs is affected by various mechanical and environmental factors [30], necessitating pre-processing of the images prior to further analysis [31]. To obtain a higher quality dataset and thereby improve the capability of the trained model, the dataset was enhanced using the following techniques: affine transformation [32], HSV enhancement [33] (as shown in Appendix A), and mosaic enhancement [34] (as shown in Figure 3); the images were also resized to a consistent size. A minimal sketch of two of these augmentations is given below.
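For readers who wish to reproduce this kind of pre-processing, the sketch below shows one common way to implement the HSV and affine augmentations with OpenCV. It is a minimal illustration under our own assumptions (the function names, gain ranges, and border fill value are ours, not taken from the paper); the mosaic augmentation, which stitches four images and their boxes into one canvas, is omitted for brevity, and bounding boxes would need to be transformed with the same affine matrix M.

```python
import cv2
import numpy as np

def augment_hsv(img, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Randomly jitter hue, saturation, and value of a BGR uint8 image."""
    r = np.random.uniform(-1, 1, 3) * [h_gain, s_gain, v_gain] + 1
    hue, sat, val = cv2.split(cv2.cvtColor(img, cv2.COLOR_BGR2HSV))
    x = np.arange(256, dtype=np.float32)
    lut_h = ((x * r[0]) % 180).astype(np.uint8)          # hue is 0-179 in 8-bit HSV
    lut_s = np.clip(x * r[1], 0, 255).astype(np.uint8)
    lut_v = np.clip(x * r[2], 0, 255).astype(np.uint8)
    hsv = cv2.merge((cv2.LUT(hue, lut_h), cv2.LUT(sat, lut_s), cv2.LUT(val, lut_v)))
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

def random_affine(img, degrees=10, translate=0.1, scale=0.1):
    """Random rotation/scale/translation; box coordinates must be warped with the same matrix M."""
    h, w = img.shape[:2]
    angle = np.random.uniform(-degrees, degrees)
    s = 1 + np.random.uniform(-scale, scale)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, s)
    M[:, 2] += np.random.uniform(-translate, translate, 2) * (w, h)
    return cv2.warpAffine(img, M, (w, h), borderValue=(114, 114, 114))
```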
2.1.3. Dataset Labeling and Annotation
Following the completion of the preprocessing stage, 1000 images of pine wilt disease were obtained; each image may contain multiple pine wilt targets. This study was conducted on non-dead pine trees, and the labeled targets were divided into two categories. The first category comprised targets whose branches and leaves were still largely green but partly yellowish-brown; this was defined as PWD0. The second category comprised targets whose branches and crowns had completely, or almost completely, turned red and appeared wilted; this was defined as PWD1. The dataset was annotated using LabelImg 1.8.6, with the smallest rectangular box used to frame each infected pine tree target, and the annotation files were saved in text format. The images were then randomly assigned to the training, test, and validation sets in a ratio of 8:1:1. Table 1 presents the statistics of the dataset used for the experimental analysis.
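The random 8:1:1 split can be reproduced with a few lines of Python. This is an illustrative sketch only: the directory layout, the file extension, and the assumption of YOLO-style text labels (one “class x_center y_center width height” line per box, as exported by LabelImg) are ours, not specifications from the paper.

```python
import random
from pathlib import Path

def split_dataset(image_dir, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly assign images to training/validation/test subsets in an 8:1:1 ratio."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * ratios[0])
    n_val = int(len(images) * ratios[1])
    return {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],
    }
```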
2.2. The Structure of the Fusion Model
The model comprises four components: the Mamba backbone, the attention network, the PAFPN, and the head. The Mamba backbone can attend to longer contexts in order to extract deep semantic information effectively; its VSSBlock module employs a state space model to capture and process image features, enabling the model to extract complex image features. Lightweight attention modules integrated into the attention network augment the recognition of pine wilt features. In this study, four attention modules were tested, drawing on previous studies and the research experience of the team, and the most effective module was selected for incorporation into the fusion model. The model’s capacity to recognize multi-scale objects is further enhanced by the PAFPN, which can process images of varying sizes captured from different distances by the UAV, thus facilitating the identification of infected trees. In the final stage, the 20 × 20, 40 × 40, and 80 × 80 feature maps are employed to identify large, medium, and small targets, respectively. The fusion model structure is shown in Figure 4, and a schematic sketch of the overall data flow is given below.
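The skeleton below summarizes this data flow in PyTorch. It is purely structural: the class name and constructor arguments are hypothetical, and the backbone, attention blocks, PAFPN, and heads are passed in as placeholder modules rather than reproducing the exact implementations used in this paper.

```python
import torch.nn as nn

class FusionDetector(nn.Module):
    """Schematic data flow: Mamba backbone -> attention network -> PAFPN -> three heads."""
    def __init__(self, backbone, attention_blocks, pafpn, heads):
        super().__init__()
        self.backbone = backbone                      # returns multi-scale features (strides 8/16/32)
        self.attention = nn.ModuleList(attention_blocks)
        self.pafpn = pafpn                            # fuses the multi-scale features
        self.heads = nn.ModuleList(heads)             # predict on 80x80, 40x40, 20x20 grids

    def forward(self, x):                             # x: (B, 3, 640, 640)
        feats = self.backbone(x)
        feats = [att(f) for att, f in zip(self.attention, feats)]
        feats = self.pafpn(feats)
        return [head(f) for head, f in zip(self.heads, feats)]
```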
2.3. Mamba
The Mamba backbone network comprises four modules: the Stem module, the Visual State-Space (VSS) module, the Vision Clue Merge module, and the Spatial Pyramid Pooling–Fast (SPPF) module [35]. The Stem module reduces the spatial size of the image, thereby reducing the computational burden and performing initial feature extraction. The VSS module enhances the feature representations by capturing long-range contextual information while maintaining linear complexity [36], which is particularly advantageous for high-resolution images. The Vision Clue Merge module subsequently fuses and optimizes these features, ensuring that features extracted from different levels can be effectively combined to strengthen the detection of both large and small targets. Finally, the SPPF module further improves the model’s feature extraction ability. In previous studies, the Mamba model was not employed for the detection of pine wilt disease; the objective of the present study is to use the Mamba model to construct a fusion model and thereby propose a new method for pine wilt detection. The following section provides a concise overview of the state space model and the VSS block.
2.3.1. Preliminary: State Space Model
Recently, the state space model has emerged as a research hotspot. Building on prior SSM studies [37,38,39], Mamba [23] achieves linear complexity with respect to input size and addresses the computational efficiency challenge that Transformers face when modeling long sequences. In the domain of generalized visual backbones, Vision Mamba [40] proposed a pure visual backbone model based on SSMs, marking the first introduction of Mamba into the visual domain.
State-space models employ first-order differential equations or difference equations to describe the evolution of the internal state of a dynamic system and its relationship to the output. The matrix–vector formulation offers a methodology for the analysis of multivariable systems [40]. The sequence x(t) ∈ ℝ^L is mapped to y(t) ∈ ℝ^L through the hidden state h(t) ∈ ℝ^N. The formulas are as follows:
h′(t) = A h(t) + B x(t)  (1)
y(t) = C h(t) + D x(t)  (2)
where A ∈ ℝ^(N×N) represents the evolution parameter, B ∈ ℝ^(N×1) and C ∈ ℝ^(1×N) are the projection parameters, and D represents the skip connection.
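As an illustration of how such a system is used in practice, the toy NumPy sketch below discretizes Equations (1) and (2) and scans a one-dimensional sequence. The function name and the simple first-order (Euler) discretization are our own simplifications; Mamba itself uses an exact zero-order-hold discretization and makes B, C, and the step size Δ input-dependent (selective), which is omitted here.

```python
import numpy as np

def ssm_scan(x, A, B, C, D, delta=0.1):
    """Discretize h'(t) = A h(t) + B x(t), y(t) = C h(t) + D x(t) and scan a 1-D sequence x."""
    N = A.shape[0]
    A_bar = np.eye(N) + delta * A        # first-order approximation of exp(delta * A)
    B_bar = delta * B                    # discretized input projection (shape (N,))
    h = np.zeros(N)
    y = np.zeros_like(x, dtype=float)
    for k, xk in enumerate(x):
        h = A_bar @ h + B_bar * xk       # recurrent state update
        y[k] = C @ h + D * xk            # readout with skip connection D
    return y
```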
The original Mamba is designed for one-dimensional sequences. To process vision tasks, Vision Mamba [40] first transforms the two-dimensional image t ∈ ℝ^(H×W×C) into flattened two-dimensional patches x_p ∈ ℝ^(J×(P²·C)), where (H, W) is the size of the input image, P is the size of the image patches, C is the number of channels, and J is the number of patches. Next, Vision Mamba linearly projects x_p to vectors of size D and adds position embeddings E_pos ∈ ℝ^((J+1)×D), as follows:
T_0 = [t_cls; t_p^1 W; t_p^2 W; …; t_p^J W] + E_pos  (3)
where t_p^j is the j-th patch of t and W ∈ ℝ^((P²·C)×D) is the learnable projection matrix. Vision Mamba is inspired by Vision Transformer [41] and BERT [42].
2.3.2. VSS Block
In VMamba [43] and Vision Mamba, state-space models were extended to the vision domain. Ding et al. [36] proposed RTMamba and introduced a VSS module into the backbone as a deep feature processing module. The incorporation of the VSS block into the backbone of our fusion model is motivated by its ability to capture long-range contextual information while maintaining linear complexity. Moreover, the VSS block facilitates the acquisition of comprehensive semantic information. In the training stage, the VSS block is trained in conjunction with the semantic converter for feature interaction. The semantic alignment functions direct the VSS block to prioritize features that are more semantically meaningful and align closely with the semantic converter’s understanding. Figure 5 illustrates the structure of the VSS block.
After layer normalization, the input PWD features are divided into two distinct pathways. One pathway passes sequentially through a linear layer, a depth-wise convolution, the two-dimensional selective scanning (SS2D) operation, and layer normalization to obtain features. The other pathway passes only through a linear layer, and the outputs of the two pathways are then combined through element-wise multiplication. Finally, a residual connection links the result, after a linear layer, to the initial features, thereby generating the output of the VSS block.
The formulas are as follows:
x₁ = LN(SS2D(DWConv(Linear(LN(x)))))  (4)
x₂ = Linear(LN(x))  (5)
y = Linear(x₁ ⊙ x₂) + x  (6)
where x denotes the input features; LN(·) is the layer normalization operation; Linear(·) is the linear transformation operation; DWConv(·) is the depth-wise convolution operation; ⊙ is the element-wise multiplication operation; SS2D(·) is the 2D selective scanning operation; and y represents the result obtained by processing the input feature x through the VSS block. During the training stage, the results produced by the VSSBlock module are projected to the corresponding stage of the semantic converter, and a semantic alignment loss is then applied, facilitating the efficient acquisition of semantic and spatial detail information by the backbone.
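A structural PyTorch sketch of this two-branch data flow is given below. It is an illustration of Equations (4)–(6) under our own assumptions: the class name is hypothetical, the SS2D operation is passed in as a placeholder module (e.g., nn.Identity() for shape testing), a channels-last layout is assumed, and the activation functions used in the original VSS block are omitted for brevity.

```python
import torch.nn as nn

class VSSBlockSketch(nn.Module):
    """Two-branch VSS data flow of Eqs. (4)-(6); `ss2d` is a placeholder for 2D selective scanning."""
    def __init__(self, dim, ss2d):
        super().__init__()
        self.ln_in = nn.LayerNorm(dim)
        self.linear1 = nn.Linear(dim, dim)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # depth-wise convolution
        self.ss2d = ss2d
        self.ln_out = nn.LayerNorm(dim)
        self.linear2 = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, H, W, C), channels-last
        z = self.ln_in(x)
        a = self.linear1(z).permute(0, 3, 1, 2)             # to (B, C, H, W) for the conv
        a = self.dwconv(a).permute(0, 2, 3, 1)              # back to channels-last
        a = self.ln_out(self.ss2d(a))                       # Eq. (4)
        b = self.linear2(z)                                 # Eq. (5)
        return self.proj(a * b) + x                         # Eq. (6): gate, project, residual
```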
2.4. Attention Mechanisms
Attention is a pivotal concept in deep learning. It draws inspiration from human biological systems, which focus attention on different areas when processing large amounts of information [44]. In the context of the visual system, the attention mechanism can be conceptualized as a dynamic selection process that adaptively assigns weights to features according to the perceived importance of the input [45]. Attention mechanisms have become a research focus in machine learning, and it has been demonstrated that they can enhance learning outcomes relative to the original model and improve generalization [44]. In our fusion model, we also incorporate an attention network with the objective of further improving the model’s ability to identify infected trees. Many scholars include attention modules such as the Convolutional Block Attention Module (CBAM) [46], Squeeze-and-Excitation (SE) [47], and the Simple, Parameter-Free Attention Module (SimAM) [48] in their networks with the aim of improving model capability. In our study, we tested four modules (CBAM, SE, SimAM, and BoT3 [49]) and compared their effects on model performance to determine the optimal configuration for the fusion model.
2.4.1. SE Block
SE stands for “Squeeze-and-Excitation”, an attention mechanism used to improve convolutional neural networks (CNNs). The SE architecture was put forward by Jie Hu et al. [47] in 2018; its core idea is to introduce a global attention mechanism into CNNs to adaptively learn the relative importance of every channel. The SE network implements this attention mechanism in two steps: squeeze and excitation. In the squeeze step, the SE network performs global pooling on the feature map of each channel, compressing it into a scalar. In the excitation step, the SE network converts the compressed feature vector into a weight vector through fully connected layers, which is used to weight each channel’s feature map. Incorporating the SE block into our fusion model reinforces the model’s emphasis on pine wilt characteristics while attenuating the influence of extraneous channel features [50]. The workflow of the SE module is shown in Figure 6. The SE module is computed as follows:
u_c = v_c ∗ X = Σ_{s=1}^{C′} v_c^s ∗ x^s  (7)
where X is the input feature map, U is the output feature map, V = [v_1, v_2, …, v_C] denotes the learned set of filter kernels, v_c refers to the parameters of the c-th filter, v_c^s denotes a 2D spatial kernel, and ∗ denotes a convolution operation.
z_c = F_sq(u_c) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)  (8)
s = F_ex(z, W) = σ(W_2 δ(W_1 z))  (9)
where z_c, the c-th element of z, is the statistic generated by reducing u_c over its spatial dimensions H × W; F_sq(·) is the function that squeezes the input features; F_ex(·) is the function that excites the input features; W_1 ∈ ℝ^((C/r)×C) and W_2 ∈ ℝ^(C×(C/r)); σ represents the sigmoid function; and δ represents the ReLU function.
2.4.2. Convolutional Block Attention Module
CBAM aims to address the limitations of ordinary convolutional neural networks in handling information of varying orientations, shapes, and scales. To achieve this, CBAM introduces two attention mechanisms: channel attention and spatial attention. Channel attention improves the feature representation of different channels, while spatial attention extracts pivotal information from different locations in space. CBAM comprises two main components, the channel attention module (CAM) and the spatial attention module (SAM), which can be embedded separately into different layers of a CNN to enhance feature representation [46]. Figure 7 illustrates the CBAM structure.
To improve the representation of channel features and extract information from different spatial locations, we added CBAM, a global attention mechanism that focuses on both spatial and channel features, to our fusion model. This helps the model pay more attention to small targets and reduces missed detections.
The computational flow of the CBAM module is as follows:
Channel attention mechanisms:
F′ = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) ⊗ F  (10)
Spatial attention mechanisms:
F″ = σ(f^(7×7)([AvgPool(F′); MaxPool(F′)])) ⊗ F′  (11)
where F′ denotes the weighted channel features produced by the channel attention mechanism; F denotes the input features; σ denotes the sigmoid function; MLP denotes the shared multi-layer perceptron applied separately to the globally average-pooled and globally max-pooled features; AvgPool denotes average pooling; MaxPool denotes max pooling; F″ denotes the weighted spatial features produced by the spatial attention mechanism; f^(7×7) denotes a 7 × 7 convolution; [·; ·] denotes a concatenation operation; and ⊗ denotes element-wise multiplication.
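A standard PyTorch implementation of the two CBAM sub-modules, matching Equations (10) and (11), is sketched below; the reduction ratio of 16 and the 7 × 7 spatial kernel follow the original CBAM paper and are not settings reported here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):                                   # Eq. (10)
        avg = self.mlp(x.mean(dim=(2, 3)))                  # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))                   # global max pooling branch
        return x * torch.sigmoid(avg + mx)[:, :, None, None]

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                   # Eq. (11)
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))                          # channel attention, then spatial attention
```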
2.4.3. A Simple, Parameter-Free Attention Module
SimAM is a lightweight attention module for convolutional neural networks proposed by Yang et al. [48]. It produces attention weights by calculating the local self-similarity of the feature maps and introduces no extra parameters, yet it can effectively improve CNN performance. The core idea of SimAM is predicated on the local self-similarity of images: adjacent pixels in an image typically exhibit strong similarity, while distant pixels exhibit weak similarity. Attention weights are generated by calculating the similarity between each pixel in the feature map and its neighboring pixels. Unlike CBAM, which builds its attention from separate 1-D channel and 2-D spatial weights, SimAM directly computes 3-D weights by mimicking the visual information processing of neurons in the human brain, assigning a unique weight to each neuron [51]. The process is shown in Figure 8. The SimAM formulas are as follows:
e_t(w_t, b_t, y, x_i) = (1/(M − 1)) Σ_{i=1}^{M−1} (−1 − (w_t x_i + b_t))² + (1 − (w_t t + b_t))² + λ w_t²  (12)
w_t = −2(t − μ_t) / ((t − μ_t)² + 2σ_t² + 2λ)  (13)
b_t = −(1/2)(t + μ_t) w_t  (14)
μ̂ = (1/M) Σ_{i=1}^{M} x_i,  σ̂² = (1/M) Σ_{i=1}^{M} (x_i − μ̂)²  (15)
e_t* = 4(σ̂² + λ) / ((t − μ̂)² + 2σ̂² + 2λ)  (16)
where e_t represents the neuron energy to be computed; w_t and b_t are the weight and bias of the transformation, respectively; y represents the label value; t represents the input feature of the target neuron; x_i represents the input features of the other neurons in the same channel of X; i is the index over the spatial dimension; M = H × W represents the number of neurons in the channel; μ̂ represents the mean and σ̂² the variance over all neurons in the channel (used to approximate μ_t and σ_t²); λ represents the regularization coefficient; and e_t* represents the minimum energy. Based on the above formulas and the definition of the attention mechanism, the feature enhancement is calculated as:
X̃ = sigmoid(1/E) ⊙ X  (17)
where E groups all e_t* across the channel and spatial dimensions and ⊙ denotes element-wise multiplication.
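Because SimAM is parameter-free, Equations (16) and (17) can be implemented in a few lines; the sketch below mirrors the pseudo-code released with the SimAM paper, with λ = 1e-4 as an assumed default.

```python
import torch

def simam(x, lam=1e-4):
    """Parameter-free SimAM attention for a feature map x of shape (B, C, H, W)."""
    n = x.shape[2] * x.shape[3] - 1                          # M - 1 neurons per channel
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)        # squared deviation from the channel mean
    v = d.sum(dim=(2, 3), keepdim=True) / n                  # channel variance estimate
    e_inv = d / (4 * (v + lam)) + 0.5                        # inverse of the minimal energy of Eq. (16)
    return x * torch.sigmoid(e_inv)                          # Eq. (17): re-weight every neuron
```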
2.4.4. Bottleneck Transformers
Aravind Srinivas et al. proposed BoTNet, a robust backbone architecture that integrates self-attention mechanisms for a spectrum of computer vision tasks, such as object detection and image classification. By replacing only the 3 × 3 convolution with Multi-Head Self-Attention (MHSA) in ResNet, without any other modifications, they significantly improved the baseline in object detection and instance segmentation while also reducing the number of parameters, with minimal latency overhead [52].
The MHSA block is the core module of BoTNet and differs from the MHSA used in the Transformer. Unlike the one-dimensional sequences processed by Transformers, the inputs here are two-dimensional feature maps, as in CNN models. Normalization uses batch normalization, as in CNNs, rather than layer normalization, and BoTNet uses three non-linear activations. The content-position module introduces two-dimensional position encoding, which is the most significant difference from the Transformer.
The Multi-Head Self-Attention (MHSA) layer is applied in the bottleneck block [49]. The notations ⊕ and ⊗ represent element-wise addition and matrix multiplication, respectively [53,54], and 1 × 1 denotes a pointwise convolution. Incorporating this layer into the network structure significantly improves its performance. Figure 9 illustrates the structures of the ResNet bottleneck and the Bottleneck Transformer.
2.5. PAFPN
To strengthen the multi-scale fusion ability of our fusion model, we include a PAFPN. The PAFPN was developed by enhancing the FPN, drawing on the conceptual framework of PANet [55,56]. FPN adds a top-down path for feature fusion on top of the backbone, exploiting both high-resolution low-level features and semantically strong high-level features to achieve better predictions by integrating these layers of information. Following the ideas of PANet, a bottom-up path is added on top of the FPN module to enhance the feature information of the PWD image, thereby facilitating more accurate detection by the overall network [57]. After backbone processing, the output feature layers F1, F2, F4, and F6 are obtained. First, the intermediate feature layers P1, P2, P4, and P6 are generated through conventional top-down FPN processing. Then, in the bottom-up path, the new feature layers F′_i, i ∈ {1, 2, 4, 6}, are obtained through lateral connections: each preceding feature layer is downsampled by a 3 × 3 convolution with stride 2 and fused by element-wise summation with the corresponding feature layer P_i, generating the new feature layer F′_i. The PAFPN structure is shown in Figure 10, and a minimal sketch of the bottom-up fusion step is given below.
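The bottom-up fusion step can be sketched as follows. This is an illustrative helper under our own assumptions: the function and variable names are ours, and each downsampling convolution is assumed to halve the spatial size and preserve the channel count so that the element-wise sum is shape-compatible.

```python
import torch.nn as nn

def bottom_up_fuse(fpn_outs, downsample_convs):
    """Bottom-up augmentation: downsample the previous level (3x3, stride 2) and add it to the next FPN level."""
    outs = [fpn_outs[0]]                      # highest-resolution FPN output comes first
    for p, down in zip(fpn_outs[1:], downsample_convs):
        outs.append(p + down(outs[-1]))       # element-wise sum through the lateral connection
    return outs

# Each downsampling conv could be, e.g., nn.Conv2d(c, c, kernel_size=3, stride=2, padding=1).
```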
2.6. DSconv
Depthwise separable convolution (DSConv) [58] comprises a depth-wise convolution and a point-wise convolution: the depth-wise convolution extracts spatial features, while the point-wise convolution extracts channel features. Depthwise separable convolution groups convolutions along the feature dimension, performing an independent depth-wise convolution for every channel and then aggregating all channels with a 1 × 1 point-wise convolution before producing the output. The depthwise separable convolution block comprises a 3 × 3 depth-wise convolution and a 1 × 1 point-wise convolution, each followed by a Batch Normalization (BN) layer and a Rectified Linear Unit (ReLU) layer. The structure of DSConv is shown in Figure 11.
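A minimal PyTorch sketch of such a block is shown below; the helper name is ours, and the layer ordering follows the description above (depth-wise 3 × 3 convolution and point-wise 1 × 1 convolution, each followed by BN and ReLU).

```python
import torch.nn as nn

def dsconv(in_ch, out_ch):
    """Depth-wise 3x3 conv + point-wise 1x1 conv, each followed by BN and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),  # depth-wise
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),                          # point-wise
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )
```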
2.7. Evaluation Metrics and Experimental Conditions
2.7.1. Metrics
To evaluate the capability of the model, three aspects are considered: detection accuracy, detection speed, and model size. Detection precision is evaluated using accuracy, recall, and mean average precision (mAP). For detection speed, the widely used indicator is frames per second (FPS). Model size is evaluated in terms of the number of model parameters and the actual memory footprint.
The Average Precision (AP) value is defined as the area enclosed by the P-R curve, where P represents the precision rate and R denotes the recall rate. The IoU threshold is set to 0.5. In general, a larger value indicates a more effective learning outcome for the model, and vice versa. The precision is calculated as follows:
Precision = TP / (TP + FP)  (18)
where TP indicates that the prediction is pine wilt disease and is correct, and FP indicates that the prediction is pine wilt disease and is incorrect. The recall (R) represents the proportion of targets that have been correctly predicted by the model, relative to the total number of targets. The recall rate is calculated as follows:
Recall = TP / (TP + FN)  (19)
where FN indicates that the target is pine wilt disease but the model failed to detect it. The formulas for calculating AP and mAP are:
AP = ∫₀¹ P(R) dR  (20)
mAP = (1/N) Σ_{i=1}^{N} AP_i  (21)
where N is the number of target classes. FPS (frames per second) describes the number of images that can be processed per second:
FPS = 1 / t,  t = t_pre-process + t_inference + t_NMS  (22)
Parameters denote the number of parameters used by a given model, expressed in millions (M).
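For completeness, the sketch below shows how precision, recall, and a trapezoidal approximation of AP (Equations (18)–(20)) can be computed from detection counts and a P–R curve; the function names are ours, and the mAP of Equation (21) is simply the mean of the per-class AP values.

```python
def precision_recall(tp, fp, fn):
    """Eqs. (18)-(19): precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Eq. (20): area under the P-R curve, approximated by the trapezoidal rule
    over (recall, precision) points sorted by increasing recall."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap
```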
2.7.2. Experimental Platform and Parameter Settings
The experimental settings are listed in Table 2. Training comprised 300 epochs with an input image size of 640 × 640. The initial learning rate was set to 0.01, and the SGD optimizer was employed [59].
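As a reference point, the snippet below collects these settings into a hypothetical training configuration; the momentum and weight-decay values are common defaults and are our assumptions, not values reported in this paper.

```python
import torch

# Hypothetical configuration mirroring Table 2 and Section 2.7.2; `model` is the fusion model instance.
EPOCHS, IMG_SIZE, BATCH_SIZE, LR0 = 300, 640, 16, 0.01

def build_optimizer(model):
    """SGD optimizer with the reported initial learning rate (momentum/weight decay are assumptions)."""
    return torch.optim.SGD(model.parameters(), lr=LR0, momentum=0.937, weight_decay=5e-4)
```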
3. Results
3.1. Test Results of Distinct Attention Modules
The fusion model without the attention network is taken as the baseline and designated Base. Table 3 presents the experimental results of five models: the original Base model, Base-SE, Base-SimAM, Base-CBAM, and Base-BoT3. The results demonstrate that the Base-CBAM and Base-BoT3 models achieve better detection accuracy than the Base model, whereas the Base-SE and Base-SimAM models do not show the same improvement. This suggests that incorporating an appropriate attention mechanism module can improve the model’s accuracy in detecting targets. Figure 12 visualizes and compares the mAP of the different models through their P-R curves. Compared with the Base model, the models incorporating an attention mechanism module are faster in terms of detection speed. Furthermore, their parameter counts and actual memory consumption are slightly lower than those of the Base model, except for Base-BoT3. Taken together, these results show that selecting an attention network composed of appropriate attention mechanism modules can effectively enhance model performance. Ultimately, the fusion model was configured with Base-CBAM, considering the three factors of detection speed, detection accuracy, and model size.
3.2. Ablation Experiments in Test
The experimental procedure in this paper is as follows: initially, the model is trained using the training and validation sets; subsequently, the test set is used to evaluate the trained model and obtain the data required for model evaluation. The model consists of three primary components: the Mamba backbone, the attention network, and the PAFPN; finally, the detection head outputs the detection results. The attention network is composed of CBAM modules (see Section 3.1 for the selection process). To evaluate the effect of each module on model performance, we sequentially tested the following configurations: Mamba backbone; Mamba backbone + attention network; Mamba backbone + PAFPN; and Mamba backbone + attention network + PAFPN. This approach was adopted to demonstrate the effectiveness of the fusion model. Detailed data from the ablation experiments are shown in Table 4. The results demonstrate that incorporating the PAFPN and the attention network into the Mamba backbone framework enhances model performance, as evidenced by the increase in mAP. The combination of all components yields a fusion model whose precision is superior to that of the Mamba backbone alone, with a 5.7% rise in accuracy, a 6.0% rise in recall, and a 5.8% rise in mAP.
3.3. Comparison Experiments in Test
To further validate the performance of our model and demonstrate its ability to identify pine wilt, we conducted a comparative analysis with the YOLO series models. Specifically, we compared the fusion model in its optimal configuration with YOLOv3, YOLOv5, YOLOv6, YOLOv7, YOLOv8, and YOLOv9. The results of this analysis are presented in Table 5. A detailed examination of the experimental outcomes reveals that our fusion model demonstrates a clear lead in detection precision, with only YOLOv6 approaching it in mAP and marginally exceeding it in recall. This outcome substantiates the assertion that our proposed detection model possesses distinctive capabilities in the detection of PWD. Furthermore, the P-R curves in Figure 13 offer a visual representation of the superior detection accuracy of our fusion model. Regarding recall, the detection model does not exhibit superior performance compared with all of the comparison models. These results align with those reported by Huang et al. [60]: to enhance detection precision, a model must spend more time analyzing image features. To illustrate this, consider YOLOv5 and YOLOv8, which, despite exhibiting the highest detection speeds, demonstrate markedly low recall rates. A comparison of the parameter sizes of the models reveals that our model achieves superior detection accuracy with a moderate size and an acceptable detection speed.
3.4. Detection Results
Figure 14 demonstrates that the detection results align with the data in Table 3. The models with the SimAM and SE modules do not perform as well as the baseline model, whereas the models with the CBAM and BoT3 modules exhibit enhanced detection.
Figure 15 demonstrates that all models identified the infected area; YOLOv7 incorrectly identified one area, while the remaining models identified it accurately. A comparison of our fusion model with the other models shows that the fusion model has high accuracy in identifying pine wilt, indicating that the fusion model incorporating Mamba has significant potential for pine wilt recognition.
Figure 16 illustrates the efficacy of the various models in detecting large targets. All models accurately identify the infected region. An analysis of the detection results reveals that, with the exception of YOLOv5, which exhibits superior detection metrics compared to our fusion model, our fusion model outperforms the other models in detecting the large target depicted in the figure. This suggests that our fusion model also has significant potential for detecting large targets.
Figure 17 illustrates the efficacy of the various models in recognizing smaller targets and confusing objects. It should be noted that the image was captured in a mixed forest, and therefore it cannot be guaranteed that all the trees depicted are pine trees. Analysis of the detection results shows that all models successfully identified the infected area. However, all models except the fusion model tended to misidentify another tree species bearing red fruits as affected trees. A further analysis of the correctly identified areas reveals that the accuracy of our fusion model is significantly higher than that of the other models. Collectively, these detection results indicate that our model has superior multi-scale object detection capabilities.
4. Discussion
The objective of this study is to examine the potential of a fusion model that integrates the Mamba model and an attention mechanism for pine wilt recognition. The introduction of the Mamba model into the field of vision by the Vision Mamba proposal serves as the foundation for our study. In our fusion model, we constructed a Mamba backbone network with Mamba at its core, with the objective of acquiring pine wilt features more efficiently. The study results also demonstrated that this network is more effective at extracting features associated with PWD. Before the incorporation of the attention network, the model exhibited a performance level comparable to that of the YOLO family of models for pine wilt detection, with an elevated recall rate. Previous studies have demonstrated that the integration of an attention mechanism can enhance model performance. Consequently, we introduced the attention network into the fusion model with the objective of enhancing its efficacy in identifying PWD. In the present experiment, four distinct attention modules were employed: SE, CBAM, SimAM, and BoT3. Simplicity and light weight are the advantages of SE; however, it is unable to capture attention in the spatial dimension. CBAM combines both channel and spatial attention; however, it requires more computation than the other attention mechanisms. SimAM has the advantage of computing three-dimensional attention weights directly, which can reduce the number of model parameters; however, in comparison to other attention mechanisms, its accuracy is reduced. The BoT3 module can focus on global features with a multi-head attention mechanism at its core; however, this also increases the amount of computation. Based on the experimental results in this paper, we identified CBAM as the optimal configuration for the attention network of the fusion model. However, our findings indicate that not all attention modules are equally effective in enhancing the performance of our model. Consequently, we intend to pursue further research to identify more suitable attention mechanism modules, with the aim of improving the model’s ability to recognize pine wilt. Furthermore, the authors of Mamba have recently presented a new study [61] that establishes a comprehensive link between SSMs and attention variants. We believe this study will help explain the behavior of the attention network in our experiments; we will take it into account when optimizing the fusion model in the future and will compare the model with the Vision Transformer for PWD recognition. Concurrently, the ablation experiments elucidate the impact of the three primary components of the model on its performance. Incorporating an appropriate attention network alone can effectively enhance the accuracy of the model; however, it also makes the model more cautious, which may result in the omission of positive samples. Adding the PAFPN alone can significantly augment the accuracy, but it also increases the training time and the amount of computation. Incorporating all three components into the fusion model concurrently yields the best detection performance under the prevailing circumstances. Subsequent research will aim to identify a module that performs comparably to PAFPN yet is more lightweight, thereby enhancing the deployability of the model on edge computing devices.
The present experiment utilized a dataset comprising 1000 images. Because pine wood nematode disease is highly hazardous, any detection must be reported for local biochemical testing, facilitating immediate validation and remedial action. Consequently, the team constructed only a small dataset consisting of Masson pines and exotic pines. In future studies, we intend to expand the dataset to include additional pine species, thereby enhancing the generalizability of the model. At present, the model trained on this dataset meets the test requirements for pine wilt detection in the region under investigation; further validation and extension of the model to other regions is planned. Meanwhile, due to experimental constraints, a batch size of 16 was employed. As shown in Figure A1 and Figure A2, the effects of three commonly used optimizers, namely Adam, AdamW, and SGD, were compared in the experiments; SGD was selected as the final optimizer, and 300 training epochs were conducted. In future research, we will also endeavor to elucidate the impact of these settings on our fusion model in greater detail.
5. Conclusions
The fusion model, which combines the Mamba and attention mechanisms, was found to be a viable approach for pine wilt detection, demonstrating superior performance compared to the YOLO family of models against which it was evaluated.
The ablation experimental findings indicate that combining the Mamba backbone, the attention network, and the PAFPN within the fusion model is conducive to achieving optimal performance, which attests to the rationality and efficacy of the model’s construction. The fusion model without the attention mechanism demonstrated an accuracy of 85.9%, a recall of 81.1%, and a mAP of 84.0% in detecting pine wilt, performance on a par with mainstream detection models. Four attention mechanism modules were tested in the experiments, namely CBAM, SE, SimAM, and BoT3. The SE and SimAM modules led to a decline in model performance, whereas the BoT3 and CBAM modules led to a notable enhancement. The CBAM module was ultimately identified as the most effective enhancement to the model, demonstrating a 4.1% increase in accuracy, a 0.7% improvement in recall, and a higher detection speed than the model without attention mechanisms, together with a reduction in parameter overhead. Thus, identifying an appropriate attention mechanism module represents a significant enhancement for our fusion model. Furthermore, a comparison was conducted between the fusion model in its optimal configuration and YOLOv3, YOLOv5, YOLOv6, YOLOv7, YOLOv8, and YOLOv9. The findings indicated that the detection accuracy of the fusion model surpassed that of all the aforementioned models; however, its frame rate (FPS) was not the highest, and its parameter count was not the lowest. In addition to the results of the ablation experiments, which demonstrated that the PAFPN increases computation and parameters, it is postulated that this is because the Mamba backbone employs state-space modeling techniques, which effectively capture features at different and deeper levels; processing this richer feature information necessitates a greater expenditure of time. For example, in the experiments, YOLOv5 and YOLOv8 exhibited the highest FPS, yet their recall rates were only 73.4% and 76.4%, respectively, considerably inferior to those of our fusion model. In comparison to the work of previous researchers, the parameter count of our model and its detection speed are already compatible with deployment on the majority of edge devices and UAVs for detection purposes. It is crucial to note that the extended training period of our fusion model requires enhanced device performance, which may impose more rigorous training device specifications when addressing intricate problems.
Consequently, the fusion model incorporating the Mamba and attention mechanisms has been demonstrated to exhibit favorable performance and potential for pine wilt detection, and it is anticipated that it will continue to be improved in subsequent studies. It is also expected to be used for pine wilt detection in Nanjing and other areas.
M.B. and X.D. designed the program, analyzed, and processed the data, drafted the original manuscript, and participated in the writing touch-ups. J.D. and L.Y. were involved in the revision of the thesis. H.L. designed the project and revised the manuscript. All authors have read and agreed to the published version of the manuscript.
The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.
The authors declare no conflicts of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 2. The picture on the left shows an early infected pine tree, and the picture on the right shows a late infected pine tree.
Figure 3. The images on the left show the affine transformation, the middle image shows the HSV enhancement and the right image shows the mosaic enhancement.
Figure 5. A comprehensive illustration of the VSS block; SS2D represents the 2D selective scanning operation.
Figure 8. The top picture is the channel-wise attention. The picture in the middle is the spatial-wise attention. The picture at the bottom shows the full 3-D weights for attention.
Figure 9. The left is a ResNet bottleneck. The right is a bottleneck transformer.
Figure 14. A comparative analysis of the efficacy of various models in detecting infected trees. (a) The origin image; (b) Base; (c) Base-SimAM; (d) Base-SE; (e) Base-BoT3; (f) Base-CBAM.
Figure 15. Comparing the performance of different models in detecting medium-sized infected tree targets. (a) The origin image; (b) fusion model; (c) YOLOv3; (d) YOLOv5; (e) YOLOv6; (f) YOLOv7; (g) YOLOv8; (h) YOLOv9.
Figure 16. Comparing the performance of different models in detecting large-sized infected tree targets. (a) The origin image; (b) fusion model; (c) YOLOv3; (d) YOLOv5; (e) YOLOv6; (f) YOLOv7; (g) YOLOv8; (h) YOLOv9.
Figure 17. Comparing the performance of different models in detecting small-sized infected tree targets. (a) The origin image; (b) fusion model; (c) YOLOv3; (d) YOLOv5; (e) YOLOv6; (f) YOLOv7; (g) YOLOv8; (h) YOLOv9.
Datasets for analysis.
Sets | Number of Images | Number of Instances |
---|---|---|
Training set | 800 | 2112 |
Verification set | 100 | 284 |
Test set | 100 | 303 |
The experimental environment.
Experimental Environment | Version |
---|---|
Programming language | Python 3.10.3 |
Operating system | Windows 11 |
Deep learning framework | Pytorch 2.3.1 |
GPU | NVIDIA GeForce RTX 4070 |
GPU accelerator | CUDA 11.8 |
The results of using different attention modules.
Model | Precision (%) | Recall (%) | mAP (%) | FPS | Parameters (M) | Size (Mb) |
---|---|---|---|---|---|---|
Base | 85.9 | 81.1 | 84.0 | 38.76 | 6.1 | 11.8 |
Base-SE | 80.1 | 79.4 | 83.7 | 44.34 | 5.8 | 11.4 |
Base-SimAM | 83.5 | 78.3 | 83.5 | 44.46 | 5.7 | 11.3 |
Base-CBAM | 90.0 | 81.8 | 86.5 | 40.16 | 5.9 | 11.6 |
Base-BoT3 | 86.3 | 81.6 | 85.9 | 44.64 | 6.0 | 11.9 |
The results of ablation experiments.
Mamba Backbone | Attention Network | PAFPN | Precision (%) | Recall (%) | mAP (%) | FPS | Parameters (M) | Size (Mb) |
---|---|---|---|---|---|---|---|---|
✓ | ✘ | ✘ | 84.3 | 75.8 | 80.7 | 48.54 | 3.6 | 7.2 |
✓ | ✘ | ✓ | 85.9 | 81.1 | 84.0 | 38.76 | 6.1 | 11.8 |
✓ | ✓ | ✘ | 88.7 | 72.8 | 81.4 | 46.29 | 3.7 | 7.3 |
✓ | ✓ | ✓ | 90.0 | 81.8 | 86.5 | 40.16 | 5.9 | 11.6 |
The results of different models.
Model | Precision (%) | Recall (%) | mAP (%) | FPS | Parameters (M) | Size (Mb) |
---|---|---|---|---|---|---|
YOLOv3 | 86.7 | 79.6 | 83.3 | 64.11 | 12.1 | 23.3 |
YOLOv5 | 88.1 | 73.4 | 81.9 | 69.40 | 2.7 | 5.1 |
YOLOv6 | 84.2 | 84.9 | 86.0 | 59.24 | 4.3 | 8.4 |
YOLOv7 | 80.7 | 71.7 | 79.7 | 38.71 | 23.7 | 55.8 |
YOLOv8 | 82.9 | 76.4 | 83.4 | 65.49 | 3.1 | 5.9 |
YOLOv9 | 86.5 | 76.8 | 85.8 | 49.40 | 25.3 | 49.3 |
Fusion model | 90.0 | 81.8 | 86.5 | 40.16 | 5.9 | 11.6 |
Appendix A
RGB color to HSV color:
The principle of the conversion is straightforward. For any pixel in the image, let its RGB color be (R, G, B) and its HSV color be (H, S, V). First, the R, G, and B values are normalized to the range 0–1:
R′ = R/255, G′ = G/255, B′ = B/255.
Appendix B
Figure A1. The results of different epochs and different optimizers without attention modules.
References
1. Zhao, B.G. Pine Wilt Disease in China. Pine Wilt Disease; Zhao, B.G.; Futai, K.; Sutherland, J.R.; Takeuchi, Y. Springer: Tokyo, Japan, 2008; [DOI: https://dx.doi.org/10.1007/978-4-431-75655-2_4]
2. Mota, M.M.; Vieira, P. Pine Wilt Disease: A World Wide Threat to Forest Ecosystems; Springer: New York, NY, USA, 2008.
3. Su, H.-J.; Zhao, J.; You, K.-D.; Chang, G.-B.; Chai, S.-Q.; Qu, T. Evaluation of economic losses caused by forest pests disasters in China. For. Pest Dis.; 2004; 23, 16.
4. Song, S.-Y.; Su, H.-J.; Yu, H.-Y.; Qu, T.; Chang, G.-B.; Zhao, J. Evaluation of economic losses caused by forest pest disasters between 2006 and 2010 in China. For. Pest Dis.; 2011; 6, pp. 1-4.
5. Yan, J. Economic Analysis and Countermeasures of Forestry Biological Disaster Management in China. Ph.D. Thesis; Beijing Forestry University: Beijing, China, 2008.
6. Zhao, J.; Huang, J.; Yan, J.; Fang, G. Economic Loss of Pine Wilt Disease in Mainland China from 1998 to 2017. Forests; 2020; 11, 1042. [DOI: https://dx.doi.org/10.3390/f11101042]
7. Wu, W.; Zhang, Z.; Zheng, L.; Han, C.; Wang, X.; Xu, J.; Wang, X. Research Progress on the Early Monitoring of Pine Wilt Disease Using Hyperspectral Techniques. Sensors; 2020; 20, 3729. [DOI: https://dx.doi.org/10.3390/s20133729]
8. Zhao, B.G.; Futai, K.; Sutherland, J.R.; Takeuchi, Y. Pine Wilt Disease; Springer: Berlin/Heidelberg, Germany, 2008; Volume 17.
9. Ma, Y.; Lu, Q.; Yu, C.; Li, Q.; Liu, H.; Zhang, X. Study on Early Diagnosis Technology of Pine Wilt Disease. J. Shandong Agric. Univ.; 2014; 45, pp. 158-160.
10. Braasch, H. Morphology of Bursaphelenchus xylophilus Compared with Other Bursaphelenchus Species; Brill: Lisbon, Portugal, 2004; pp. 127-143.
11. Bogale, M.; Baniya, A.; DiGennaro, P. Nematode Identification Techniques and Recent Advances. Plants; 2020; 9, 1260. [DOI: https://dx.doi.org/10.3390/plants9101260]
12. Hu, Y.Q.; Kong, X.C.; Wang, X.R.; Zhong, T.K.; Zhu, X.W.; Mota, M.M.; Ren, L.L.; Liu, S.; Ma, C. Direct PCR-based method for detecting Bursaphelenchus xylophilus, the pine wood nematode in wood tissue of Pinus massoniana. For. Pathol.; 2011; 41, pp. 165-168. [DOI: https://dx.doi.org/10.1111/j.1439-0329.2010.00692.x]
13. Li, M.; Li, H.; Ding, X.; Wang, L.; Wang, X.; Chen, F. The Detection of Pine Wilt Disease: A Literature Review. Int. J. Mol. Sci.; 2022; 23, 10797. [DOI: https://dx.doi.org/10.3390/ijms231810797]
14. Ho, L.S.; Lee, W.-K.; Cho, H.-K. Detection of The Pine Trees Damaged by Pine Wilt Disease using High Resolution Satellite and Airborne Optical Imagery. Korean J. Remote Sens.; 2007; 23, pp. 409-420.
15. Lee, J.B.; Kim, E.S.; Lee, S.H. An Analysis of Spectral Pattern for Detecting Pine Wilt Disease Using Ground-Based Hyperspectral Camera. Korean J. Remote Sens.; 2014; 30, pp. 665-675. [DOI: https://dx.doi.org/10.7780/kjrs.2014.30.5.11]
16. Kim, S.-R.; Lee, W.-K.; Lim, C.-H.; Kim, M.; Kafatos, M.C.; Lee, S.-H.; Lee, S.-S. Hyperspectral Analysis of Pine Wilt Disease to Determine an Optimal Detection Index. Forests; 2018; 9, 115. [DOI: https://dx.doi.org/10.3390/f9030115]
17. Deng, X.; Tong, Z.; Lan, Y.; Huang, Z. Detection and Location of Dead Trees with Pine Wilt Disease Based on Deep Learning and UAV Remote Sensing. AgriEngineering; 2020; 2, pp. 294-307. [DOI: https://dx.doi.org/10.3390/agriengineering2020019]
18. Yu, R.; Luo, Y.; Zhou, Q.; Zhang, X.; Wu, D.; Ren, L. Early detection of pine wilt disease using deep learning algorithms and UAV-based multispectral imagery. For. Ecol. Manag.; 2021; 497, 119493. [DOI: https://dx.doi.org/10.1016/j.foreco.2021.119493]
19. Xie, W.; Wang, H.; Liu, W.; Zang, H. Early-Stage Pine Wilt Disease Detection via Multi-Feature Fusion in UAV Imagery. Forests; 2024; 15, 171. [DOI: https://dx.doi.org/10.3390/f15010171]
20. Wang, L.; Cai, J.; Wang, T.; Zhao, J.; Gadekallu, T.R.; Fang, K. Detection of Pine Wilt Disease Using AAV Remote Sensing With an Improved YOLO Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2024; 17, pp. 19230-19242. [DOI: https://dx.doi.org/10.1109/JSTARS.2024.3478333]
21. Du, Z.; Wu, S.; Wen, Q.; Zheng, X.; Lin, S.; Wu, D. Pine wilt disease detection algorithm based on improved YOLOv5. Front. Plant Sci.; 2024; 15, 1302361. [DOI: https://dx.doi.org/10.3389/fpls.2024.1302361] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38699534]
22. Zhu, X.; Wang, R.; Shi, W.; Liu, X.; Ren, Y.; Xu, S.; Wang, X. Detection of Pine-Wilt-Disease-Affected Trees Based on Improved YOLO v7. Forests; 2024; 15, 691. [DOI: https://dx.doi.org/10.3390/f15040691]
23. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv; 2023; arXiv: 2312.00752
24. Zhang, H.; Zhu, Y.; Wang, D.; Zhang, L.; Chen, T.; Wang, Z.; Ye, Z. A Survey on Visual Mamba. Appl. Sci.; 2024; 14, 5683. [DOI: https://dx.doi.org/10.3390/app14135683]
25. Zhou, W.; Kamata, S.; Wang, H.; Wong, M.S.; Hou, H. Mamba-in-Mamba: Centralized Mamba-Cross-Scan in Tokenized Mamba Model for Hyperspectral image classification. Neurocomputing; 2025; 613, 128751. [DOI: https://dx.doi.org/10.1016/j.neucom.2024.128751]
26. Wang, Z.; Li, C.; Xu, H.; Zhu, X. Mamba YOLO: SSMs-Based YOLO For Object Detection. arXiv; 2024; arXiv: 2406.05835
27. Li, Y.X.; Zhang, X.Y. High risk of invasion and expansion of pine wood nematode in middle temperate zone of China. J. Temp. For. Res.; 2018; 1, pp. 3-6.
28. Meister, S.; Möller, N.; Stüve, J.; Groves, R.M. Synthetic image data augmentation for fibre layup inspection processes: Techniques to enhance the data set. J. Intell. Manuf.; 2021; 32, pp. 1767-1789. [DOI: https://dx.doi.org/10.1007/s10845-021-01738-7]
29. Chen, H.; Chen, J.; Ding, J. Data Evaluation and Enhancement for Quality Improvement of Machine Learning. IEEE Trans. Reliab.; 2021; 70, pp. 831-847. [DOI: https://dx.doi.org/10.1109/TR.2021.3070863]
30. Eskandari, R.; Mahdianpari, M.; Mohammadimanesh, F.; Salehi, B.; Brisco, B.; Homayouni, S. Meta-analysis of Unmanned Aerial Vehicle (UAV) Imagery for Agro-environmental Monitoring Using Machine Learning and Statistical Models. Remote Sens.; 2020; 12, 3511. [DOI: https://dx.doi.org/10.3390/rs12213511]
31. Aslahishahri, M.; Stanley, K.G.; Duddu, H.; Shirtliffe, S.; Vail, S.; Stavness, I. Spatial Super Resolution of Real-World Aerial Images for Image-Based Plant Phenotyping. Remote Sens.; 2021; 13, 2308. [DOI: https://dx.doi.org/10.3390/rs13122308]
32. Weisstein, E.W. Affine Transformation. 2004; Available online: https://mathworld.wolfram.com/ (accessed on 3 May 2024).
33. Sural, S.; Qian, G.; Pramanik, S. Segmentation and histogram generation using the HSV color space for image retrieval. Proceedings of the International Conference on Image Processing; Rochester, NY, USA, 22–25 September 2002; Volume II. [DOI: https://dx.doi.org/10.1109/ICIP.2002.1040019]
34. Zeng, G.; Yu, W.; Wang, R.; Lin, A. Research on mosaic image data enhancement for overlapping ship targets. arXiv; 2021; arXiv: 2105.05090
35. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell.; 2015; 37, pp. 1904-1916. [DOI: https://dx.doi.org/10.1109/TPAMI.2015.2389824]
36. Ding, H.; Xia, B.; Liu, W.; Zhang, Z.; Zhang, J.; Wang, X.; Xu, S. A Novel Mamba Architecture with a Semantic Transformer for Efficient Real-Time Remote Sensing Semantic Segmentation. Remote Sens.; 2024; 16, 2620. [DOI: https://dx.doi.org/10.3390/rs16142620]
37. Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv; 2022; arXiv: 2111.00396
38. Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers. arXiv; 2021; arXiv: 2110.13985
39. Smith, J.T.H.; Warrington, A.; Linderman, S.W. Simplified State Space Layers for Sequence Modeling. arXiv; 2023; arXiv: 2208.04933
40. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv; 2024; arXiv: 2401.09417
41. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv; 2020; arXiv: 2010.11929
42. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv; 2019; arXiv: 1810.04805
43. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual state space model. arXiv; 2024; arXiv: 2401.10166
44. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing; 2021; 452, pp. 48-62. [DOI: https://dx.doi.org/10.1016/j.neucom.2021.03.091]
45. Guo, M.-H.; Xu, T.-X.; Liu, J.-J.; Liu, Z.-N.; Jiang, P.-T.; Mu, T.-J.; Zhang, S.-H.; Martin, R.R.; Cheng, M.-M.; Hu, S.-M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media; 2022; 8, pp. 331-368. [DOI: https://dx.doi.org/10.1007/s41095-022-0271-y]
46. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.-S. CBAM: Convolutional Block Attention Module. Proceedings of the European Conference on Computer Vision; Munich, Germany, 8–14 September 2018.
47. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132-7141.
48. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. Proceedings of the International Conference on Machine Learning; Virtual, 18–24 July 2021; PMLR: Birmingham, UK, 2021; pp. 11863-11874.
49. Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Virtual, 19–25 June 2021; pp. 16519-16529.
50. Lin, Y.; Cai, R.; Lin, P.; Cheng, S. A detection approach for bundled log ends using K-median clustering and improved YOLOv4 Tiny network. Comput. Electron. Agric.; 2022; 194, 106700. [DOI: https://dx.doi.org/10.1016/j.compag.2022.106700]
51. Carrasco, M. Visual Attention: The Past 25 Years. Vis. Res.; 2011; 51, pp. 1484-1525. [DOI: https://dx.doi.org/10.1016/j.visres.2011.04.012] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/21549742]
52. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA, 27–30 June 2016.
53. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794-7803.
54. Yu, C.; Feng, Z.; Wu, Z.; Wei, R.; Song, B.; Cao, C. HB-YOLO: An Improved YOLOv7 Algorithm for Dim-Object Tracking in Satellite Remote Sensing Videos. Remote Sens.; 2023; 15, 3551. [DOI: https://dx.doi.org/10.3390/rs15143551]
55. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 2117-2125.
56. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759-8768.
57. Zhang, Y.; Xie, F.; Huang, L.; Shi, J.; Yang, J.; Li, Z. A lightweight one-stage defect detection network for small object based on dual attention mechanism and PAFPN. Front. Phys.; 2021; 9, 708097. [DOI: https://dx.doi.org/10.3389/fphy.2021.708097]
58. Nascimento, M.G.; Fawcett, R.; Prisacariu, V.A. DSConv: Efficient convolution operator. Proceedings of the IEEE/CVF International Conference on Computer Vision; Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5148-5157.
59. Lydia, A.; Francis, S. Adagrad—An optimizer for stochastic gradient descent. Int. J. Inf. Comput. Sci.; 2019; 5, pp. 566-568.
60. Huang, J.; Rathod, V.; Sun, C.; Zhu, M.; Korattikara, A.; Fathi, A.; Fischer, I.; Wojna, Z.; Song, Y.; Guadarrama, S. et al. Speed/accuracy trade-offs for modern convolutional object detectors. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 7310-7311.
61. Dao, T.; Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv; 2024; arXiv: 2405.21060
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Pine wilt disease (PWD) is a highly destructive forest quarantine disease found worldwide; it can destroy entire pine forests within a short period, causing significant economic losses and environmental damage. Manual monitoring, biochemical detection, and satellite remote sensing are frequently inadequate for detecting and controlling PWD in a timely manner. This paper presents a fusion model that integrates the Mamba model with an attention mechanism and is intended for deployment on unmanned aerial vehicles (UAVs) to detect infected pine trees. The experimental dataset consists of UAV images of pine trees in mixed forests, collected primarily in the spring of 2023, between February and May, and preprocessed into the research dataset. The fusion model comprises three principal components. The first is a Mamba backbone network built around the State Space Model (SSM), which extracts PWD features efficiently. The second is an attention network that allows the model to focus more effectively on PWD features; the optimal configuration was determined by evaluating four attention modules. The third, the Path Aggregation Feature Pyramid Network (PAFPN), fuses and refines features at different scales, strengthening the model's ability to detect multi-scale objects. In addition, the standard convolutional layers in the model were replaced with depthwise separable convolutional layers (DSConv), which reduces the number of parameters and improves detection speed. The final fusion model was validated on a test set, achieving an accuracy of 90.0%, a recall of 81.8%, an mAP of 86.5%, a parameter count of 5.9 M, and a detection speed of 40.16 FPS. Compared with YOLOv8, accuracy is improved by 7.1%, recall by 5.4%, and mAP by 3.1%. These results indicate that the fusion model is suitable for deployment on edge devices, such as UAVs, and can detect PWD effectively.
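As an illustrative aside rather than the authors' implementation, the parameter savings that motivate the DSConv replacement can be sketched as follows, reading DSConv as a depthwise separable convolution (a depthwise convolution followed by a 1x1 pointwise convolution). The class name, channel widths, and activation choice below are assumptions made for the example, not details taken from the paper.

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):  # hypothetical name, not from the paper
    """Depthwise 3x3 convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()  # activation choice is an assumption

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Rough parameter comparison against a standard 3x3 convolution (256 channels assumed):
standard = nn.Conv2d(256, 256, 3, padding=1, bias=False)   # 256*256*9 = 589,824 weights
separable = DepthwiseSeparableConv(256, 256)                # ~68,000 weights incl. BatchNorm
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))

Applying this factorization across a detector's convolutional layers is consistent with the reported reduction to 5.9 M parameters and the higher inference speed, although which layers were replaced is specific to the authors' model.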
Details

1 College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China
2 College of Mechanical and Electrical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China