1. Introduction
Strawberry is among the most popular small berries worldwide due to its high economic and nutritional values [1]. However, strawberry fruit matures quickly and is susceptible to damage, which can easily result in decay and economic loss if not harvested promptly. Currently, strawberry harvesting relies on traditional manual approaches, which causes problems of high labor intensity and costs [2]. Automated strawberry harvesting can effectively improve productivity and reduce labor costs, which presents a promising solution to these challenges.
Strawberry detection is a crucial technology for achieving automated harvesting. Traditionally, the primary detection methods involve the integration of machine learning algorithms and computer vision techniques. Image-processing techniques, such as gray morphology, threshold segmentation, logical operations, the OTSU (Nobuyuki Otsu) method, and mean shift segmentation, are widely employed to extract the visual characteristics of fruits from various color spaces [3,4]. Subsequently, the extracted visual characteristics are combined with prediction algorithms, such as Kalman filtering, and machine learning algorithms, such as K-nearest neighbor and SVMs (support vector machines), to facilitate target recognition and localization [5,6,7]. While the mentioned methods can effectively identify fruit targets in certain environments, their performance and robustness may be affected by fluctuating lighting and fruit occlusion [8].
Currently, deep neural network methods have gained popularity for strawberry detection due to their robust feature extraction capabilities [9]. Applying deep neural networks with high accuracy and speed for strawberry harvesting machinery has drawn much research attention. Zhang et al. [10] proposed an improved lightweight network RTSD-Net based on YOLOv4-Tiny for strawberry target detection under field conditions. Similarly, Yu et al. [11] presented a novel strawberry detection method implemented on a harvesting robot to localize picking points based on the lightweight network Mobilenet-V1. Mejia et al. [12] proposed a strawberry localization method and an autonomous rover system, which utilizes image-processing techniques and the Mobilenet-V2 network to estimate strawberry ripeness in field conditions. Originally designed for embedded devices, these models prioritized lightweight design and detection speed. However, this emphasis compromised their accuracy in complex environments, which was only approximately 80%.
Many researchers employed two-stage detection algorithms, which are typically represented by the R-CNN (region-based convolutional neural network) series, to enhance strawberry detection accuracy. Yu et al. [13] proposed a novel algorithm to achieve the visual localization of strawberry picking points based on the Mask R-CNN algorithm, which demonstrated robustness for overlapping and occluded fruits under various lighting conditions. Tang et al. [14] proposed a strawberry detection method that combines Mask R-CNN, region segmentation techniques, and SVM classifiers, which can accurately detect strawberries at six different ripeness levels. Perez-Borrero et al. [15] proposed a specialized Mask R-CNN model for strawberry instance segmentation, which promised both speed and precision outcomes. Chen et al. [16] utilized a UAV to capture near-surface photographs of strawberry fields and employed the Faster R-CNN algorithm to detect and count flowers, mature strawberries, and immature strawberries, which achieved an average counting accuracy of 84.1%. Also, Zheng et al. [17] proposed a detection model based on the Faster R-CNN algorithm for counting strawberry fruit and flowers and effectively addressed the occlusion problem. Zhou et al. [2] proposed a Faster-RCNN-based strawberry-monitoring system that uses image acquisition and transfer learning to detect target fruits and measure strawberry ripeness and achieved an average accuracy exceeding 86%. Li et al. [18] introduced an intelligent system that uses Faster R-CNN to detect strawberries in field scenarios, which achieved high accuracy for the automatic monitoring and harvesting of strawberries. While these R-CNN-based models demonstrated high detection accuracies, their two-stage processing typically led to slower operation speeds, averaging around 15 FPS.
To balance detection accuracy and speed, single-stage detection algorithms have been widely employed in strawberry detection. The YOLO series is a typical one-stage detection algorithm with a speed advantage achieved by directly predicting classifications and bounding boxes. Wang et al. [19] proposed a DSE (detailed semantics enhancement) model based on YOLOv3 for multi-stage strawberry detection, which excelled in identifying different strawberry stages, with an average precision of 86.58%. However, the model’s large size, with 300 million parameters, resulted in a slower detection speed of just 18 FPS and hindered deployment on embedded devices. Du et al. [8] proposed a DSW-YOLO(DCNv3-SA-WIoU-YOLO) network model for detecting strawberries at different occlusion levels, which achieved an average precision of 86.7%. Zhou et al. [20] introduced a YOLOv3-based method for classifying strawberry maturity in UAV (unmanned aerial vehicle) images, which achieved an average accuracy of 88%. The aforementioned YOLO-based models have effectively improved the detection speed, but there is still room for improvement in the detection accuracy. Chai et al. [21] proposed a novel strawberry detection algorithm based on a unique combination of YOLOv7 and augmented reality technology, which achieved a high F1 score of 92%. However, the method’s accuracy in natural environments cannot be guaranteed, as the detection environment was limited to a greenhouse. Li et al. [22] proposed a multi-stage strawberry detection algorithm by integrating the ASFF (adaptive spatial feature fusion) module into YOLOv5, which improved the detection performance in natural environments, with an average precision exceeding 90%. However, this integration also led to a significant decrease in detection speed.
In summary, most research focused on improving specific performance metrics without achieving a well-balanced trade-off between model accuracy, speed, and size. To address this problem, an RTF-YOLO network model based on YOLOv5 is proposed in this paper to detect strawberry fruits. The main contributions of this work are summarized as follows:
(1). A novel neural network model was proposed for strawberry detection under varying illumination and occlusion scenarios, which demonstrated improved speed and accuracy compared with other mainstream networks.
(2). An efficient convolution module based on structural reparameterization was proposed and fused into the backbone and neck networks, improving the model’s detection speed from 122 to 145 FPS.
(3). The triplet attention mechanism and focal loss function were introduced to improve the detection precision, which led to a 3.6% increase in the mAP0.5, reaching 90.24%.
The rest of this article is organized as follows. Section 2 introduces the specific algorithm, while Section 3 focuses on the results of the conducted experiments to evaluate and compare the proposed model with state-of-the-art methods. Section 4 summarizes the main conclusions of this work.
2. Materials and Methods
2.1. Image Acquisition And Augmentation
In this study, the growth stages of strawberries were observed, and images were captured using a camera. We collected a total of 2040 images, which were split into training, validation, and testing datasets with proportions of 70%, 20%, and 10%, respectively. These images were labeled as ripen and unripen, with a total count of 8712 strawberry targets, as shown in Table 1.
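For illustration, the 70/20/10 split described above can be reproduced with a few lines of Python; the directory name and file extension below are hypothetical placeholders rather than the authors' actual dataset layout.

```python
import random
from pathlib import Path

# Hypothetical image directory and extension; adjust to the actual dataset layout.
images = sorted(Path("strawberry_images").glob("*.jpg"))
random.seed(0)
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.7 * n), int(0.2 * n)
splits = {
    "train": images[:n_train],                 # 70%
    "val":   images[n_train:n_train + n_val],  # 20%
    "test":  images[n_train + n_val:],         # 10%
}
for name, files in splits.items():
    print(name, len(files))
```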
In natural environments, illumination variations and occlusion instances are the main factors that affect the detection performance. The dataset included samples under such scenarios to enhance the model’s generalization performance, as shown in Figure 1.
The data were augmented with the following methods: horizontal flipping, rotating by 90 degrees, rotating by 180 degrees, rotating by 270 degrees, and randomly changing the saturation and brightness [23], as shown in Figure 2. The number of images increased to 9180 after the data augmentation.
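The sketch below shows how such augmented copies could be generated with torchvision; the saturation and brightness ranges are assumptions for illustration, and the bounding-box coordinates would additionally need to be transformed for the mirrored and rotated copies (omitted here).

```python
import random
from PIL import Image
import torchvision.transforms.functional as TF

def augment_variants(img: Image.Image):
    """Return the augmented copies described in the text (boxes not handled here)."""
    sat = random.uniform(0.7, 1.3)   # assumed saturation range
    bri = random.uniform(0.7, 1.3)   # assumed brightness range
    return [
        TF.hflip(img),                        # horizontal mirror
        TF.rotate(img, 90, expand=True),      # 90° rotation
        TF.rotate(img, 180, expand=True),     # 180° rotation
        TF.rotate(img, 270, expand=True),     # 270° rotation
        TF.adjust_brightness(TF.adjust_saturation(img, sat), bri),
    ]
```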
2.2. RTF-YOLO for Strawberry Detection
This paper proposes a novel RTF-YOLO model derived from YOLOv5. The framework of RTF-YOLO is shown in Figure 3, and the specific module structures of YOLOv5 are shown in Figure 4.
YOLOv5 is a convolutional neural network mainly comprising three parts: a backbone network, a neck network, and a detection head. The backbone network consists of the Conv, C3 (cross-stage partial), and SPPF (spatial pyramid pooling—fast) modules. The Conv and C3 modules are the primary feature extractors, while the SPPF is employed to extract multi-scale features to improve the detection accuracy for objects of varying sizes.
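As a rough sketch of the SPPF idea (not the authors' exact implementation, and omitting the BN layer and SiLU activation that YOLOv5's Conv block normally adds), three chained 5 × 5 max-pools approximate parallel pooling at multiple scales:

```python
import torch
import torch.nn as nn

class SPPFSketch(nn.Module):
    """Chained 5x5 max-pools whose outputs are concatenated with the input,
    which is equivalent to pooling with 5x5, 9x9, and 13x13 windows."""
    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```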
YOLOv5s’s neck part was constructed by PANet, which employs a hierarchical FPN (feature pyramid network) structure [24] and PAN (path aggregation network) structure [25] to transmit distinctive semantic attributes and positional characteristics from top to bottom and vice versa. PANet enhances the receptive field and provides a richer input representation. YOLOv5s’s detection head comprises three convolutional layers to detect objects at distinct scales. This multi-scale design, coupled with varied aspect ratio anchor boxes, enhances the detection performance across diverse object sizes and shapes. YOLOv5 utilizes a compound loss function, which comprises localization loss, confidence loss, and classification loss. The CE (cross-entropy) loss and GIOU (generalized intersection over union) method were selected for the classification loss and localization loss, respectively.
The RTF-YOLO model essentially follows YOLOv5s’s architecture due to its superiority and enhances the backbone network, neck network, and loss function. First, the C3 module was replaced by the proposed QARepNeXt module to reduce the computational complexity based on structure reparameterization technology. Then, the triplet attention mechanism was embedded in the neck network to enhance feature fusion. Lastly, the focal loss function was introduced to replace the CE loss function, which can balance the positive and negative examples and strengthen the learning ability of challenging examples.
2.2.1. Backbone
The backbone network was reconstructed with structure reparameterization technology [26] and the QARepVgg block [27].
Many mainstream studies focused on well-designed architectures like ResNet, MobileNet, and Inception. These networks use multi-branch parallelism to increase the model’s representational capacity and improve the accuracy. However, complex multi-branch designs lead to increased memory access and usage, which, in turn, reduces the model’s inference speed [26,28].
Ding et al. [26] proposed the RepVgg block based on structure reparameterization technology. The RepVgg block has different structures during training and inference. In training, the RepVgg block has three parallel branches, namely, the 3 × 3 convolution, 1 × 1 convolution, and identity branches, with each followed by a BN (batch normalization) layer. During inference, the three parallel branches are converted into a single-path structure, as shown in Figure 5a, specifically:
(1). In the 1 × 1 convolution branch, the 1 × 1 convolution is converted into an equivalent 3 × 3 convolution by zero-padding the kernel.
(2). In the identity branch, the identity mapping is converted into an equivalent 1 × 1 convolution by using the identity matrix as the kernel. This 1 × 1 convolution is then converted into a 3 × 3 convolution by padding.
(3). All three branches thus take the form of a 3 × 3 convolution followed by a BN layer. The convolution and BN layer are then fused into a single 3 × 3 convolution on each branch.
(4). The 3 × 3 convolutions of the three branches are merged into one convolution through the addition of their kernels and biases.
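A minimal sketch of the fusion arithmetic behind steps (1)–(4) is given below; it illustrates the general reparameterization technique rather than the authors' exact code, and assumes the BN statistics are already frozen for inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BN layer into the preceding convolution (step (3))."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = bn.bias.data + (conv_bias - bn.running_mean) * scale
    return fused

def pad_1x1_to_3x3(weight_1x1: torch.Tensor) -> torch.Tensor:
    """Zero-pad a (out, in, 1, 1) kernel to an equivalent (out, in, 3, 3) kernel (step (1))."""
    return F.pad(weight_1x1, [1, 1, 1, 1])

# Step (4): the padded 3x3 kernels and biases of the three fused branches are
# simply summed to obtain the single inference-time convolution.
```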
The RepVgg block achieved an optimal balance between performance and inference speed. However, in quantization scenarios, models with the RepVgg block experienced a significant decline in performance, including an over 20% drop in top-1 accuracy on ImageNet with INT8 (8-bit integer) inference.
Chu et al. [27] conducted an in-depth analysis of performance degradation during the standard quantization. The authors introduced a simple yet effective improved structure, QARepVgg block, which benefited from the advantages of reparameterization and possessed quantization-friendly attributes, as illustrated in Figure 5b. Consequently, networks based on the QARepVgg block significantly minimize the performance discrepancy between INT8 and FP32 (32-bit floating point), which enhances the model’s suitability for deployment on edge devices.
We utilized the QARepVgg block to create the QARepNeXt module. In detail, the width (the size of the feature map) of the QARepNeXt module was kept consistent with that of the C3 module, and the stacking layers of the QARepVgg block were aligned with the BottleNeck in the C3 module. Similar to RepVgg, QARepNeXt operates as a multi-branch parallel structure during training and converts to a single-path structure during inference, as depicted in Figure 6. The multi-branch structure promotes diverse feature learning and robustness during training but increases the computational complexity. Converting to a single-path structure can enhance the inference speed while preserving the benefits of the features learned during training.
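The training-time structure, as described for Figure 5b, can be sketched roughly as follows; details such as the activation function and bias handling follow the cited QARepVgg paper [27] and are simplified here.

```python
import torch.nn as nn

class QARepVggStyleBlock(nn.Module):
    """Training-time form: a 3x3 conv + BN branch, a BN-free 1x1 conv branch,
    and a BN-free identity branch, with one BN appended after the branch sum.
    At deployment the three branches are fused into a single 3x3 convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3x3 = nn.BatchNorm2d(channels)
        self.conv1x1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.post_bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.bn3x3(self.conv3x3(x)) + self.conv1x1(x) + x
        return self.act(self.post_bn(out))
```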
The QARepNeXt module was implemented to replace the C3 module in the YOLOv5 model, which resulted in an enhanced backbone and neck network. This substitution led to a 26% elevation in inference speed while maintaining the model’s original performance quality.
2.2.2. Attention Mechanism in the Neck Network
Compared with the C3 structure, QARepNeXt has relatively fewer parallel branches, which slightly diminishes the model’s representational capacity. Therefore, we applied the triplet attention mechanism [29] to the fused feature maps in the neck network [30] to enhance the model’s feature extraction capabilities.
The triplet attention module calculates attention weights by capturing the interactions between different dimensions of the input tensor. It fuses three parallel branches for the (channel, height), (channel, width), and (height, width) dimensions, as shown in Figure 7. The Z-pool layer combines the average-pooled and max-pooled features across each dimension, reducing the tensor's zeroth dimension to two. This yields a detailed tensor representation while reducing its depth, making subsequent computations more efficient. The Z-pool operation can be defined as
$$Z\text{-pool}(\chi) = \left[\operatorname{MaxPool}_{0d}(\chi),\ \operatorname{AvgPool}_{0d}(\chi)\right] \tag{1}$$
We only applied the triplet attention mechanism to the last two layers of the neck network, which contain higher-level semantic features, as depicted in Figure 3. This improves the detection ability without significantly increasing the detection latency.
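A compact sketch of the triplet attention computation is shown below; it follows the structure in Figure 7 (Z-pool, a 7 × 7 convolution, and a sigmoid gate on each branch) but omits the BN layer used in the original module, so it should be read as an illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and average-pooled features along the channel dimension, i.e., Eq. (1)."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.zpool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.zpool(x)))

class TripletAttention(nn.Module):
    """Three parallel branches capture (C, W), (C, H), and (H, W) interactions;
    their outputs are averaged."""
    def __init__(self):
        super().__init__()
        self.gate_cw = AttentionGate()
        self.gate_ch = AttentionGate()
        self.gate_hw = AttentionGate()

    def forward(self, x):  # x: (N, C, H, W)
        # Swap C and H so the gate attends over the (C, W) plane, then swap back.
        x_cw = self.gate_cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # Swap C and W so the gate attends over the (C, H) plane, then swap back.
        x_ch = self.gate_ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # Standard spatial attention over the (H, W) plane.
        x_hw = self.gate_hw(x)
        return (x_cw + x_ch + x_hw) / 3.0
```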
2.2.3. Loss Function
In our field strawberry detection scenario, some instances were difficult to detect due to lighting variations, leaf occlusions, and similar colors between the foreground and background. To address this, Lin et al. [31] proposed the focal loss function, which can mitigate the foreground–background class imbalance problem and enhance the learning of challenging cases. This approach significantly improved the detection accuracy.
The focal loss function down-weights the loss assigned to well-classified instances by reshaping the standard cross-entropy loss, which is defined as
$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases} \tag{2}$$
where $y \in \{\pm 1\}$ indicates the ground-truth class, and $p \in [0, 1]$ is the model's estimated probability for the class labeled $y = 1$. To simplify the notation, we define $p_t$ as follows:
$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \tag{3}$$
and the CE loss can be rewritten as
$$\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t) \tag{4}$$
The class imbalance problem is typically addressed by introducing a weighting factor $\alpha \in [0, 1]$ for class 1 and $1 - \alpha$ for class $-1$. With $\alpha_t$ defined analogously to $p_t$, the $\alpha$-balanced CE loss can be written as
$$\mathrm{CE}(p_t) = -\alpha_t \log(p_t) \tag{5}$$
Although $\alpha$ can balance the importance of positive and negative examples, it cannot differentiate between easy and hard examples. To address this, the focal loss function introduces a modulating factor $(1 - p_t)^{\gamma}$ to reduce the importance of easy examples and direct the training toward hard examples. The focal loss function is expressed as follows:
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{6}$$
where $\gamma > 0$ is a tunable focusing parameter. The modulating factor effectively reduces the loss contribution of easy examples. For instance, with $\gamma = 2$, an example classified with $p_t = 0.95$ would have a 400× lower loss compared with CE. Consequently, greater significance is placed on rectifying misclassified examples.
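A minimal binary-classification sketch of Eq. (6) is given below; the default α and γ values are the ones selected later in Section 3.2.3, and the labels are encoded as {0, 1} rather than {±1} for convenience. It is an illustration of the loss formula, not the detector's full classification loss.

```python
import torch

def focal_loss(p: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.13, gamma: float = 1.4, eps: float = 1e-8) -> torch.Tensor:
    """Binary focal loss, Eq. (6). p: predicted probabilities; y: labels in {0, 1}."""
    p_t = torch.where(y == 1, p, 1.0 - p)                          # Eq. (3)
    alpha_t = torch.where(y == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t + eps)).mean()

# Example: loss on probabilities derived from a classification head's logits.
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
loss = focal_loss(torch.sigmoid(logits), targets)
```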
3. Experiments And Results
3.1. Experimental Setting
3.1.1. Parameters And Equipment
We conducted the experiments with a batch size of 16, an image size of 640 × 640 pixels, and 300 epochs. The remaining parameters were kept as the default values of YOLOv5s. The runtime environment is shown in Table 2.
3.1.2. Index Definition
Various metrics were employed to measure the effectiveness of RTF-YOLO, namely, the recall (R), precision (P), AP (average precision), mAP (mean average precision), parameter count, FPS, and FLOPs (floating point operations). The formulas for calculating these performance metrics are as follows:
$$P = \frac{TP}{TP + FP} \tag{7}$$
$$R = \frac{TP}{TP + FN} \tag{8}$$
$$F_1 = \frac{2 \times P \times R}{P + R} \tag{9}$$
$$AP = \int_0^1 P(R)\, dR \tag{10}$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{11}$$
where TP represents the count of accurate predictions the model makes for positive examples, FP denotes the count of negative examples incorrectly identified as positive, and FN signifies the number of positive samples erroneously categorized as negative. F1 is a composite evaluation index of P and R. AP is the area under the P-R (precision–recall) curve, indicating the overall model performance; a larger area implies better performance. mAP, the mean value of AP across all N classes, indicates the model's detection performance. FLOPs refer to the computational complexity measurement of a model, while the detection speed is evaluated with FPS, which represents the number of images the model can process per second during inference.
3.2. Experimental Results and Analysis
This paper proposes three improved network models derived from YOLOv5: R-YOLO, RT-YOLO, and RTF-YOLO. R-YOLO only utilized the QARepNeXt module for improvement. RT-YOLO incorporated both the QARepNeXt module and the triplet attention mechanism for its enhancement. Meanwhile, RTF-YOLO combined the QARepNeXt module, the triplet attention mechanism, and the focal loss function for improvement. Several experiments were conducted to validate the improved models’ effectiveness.
3.2.1. Experiments of the QARepNeXt Module
To validate the effectiveness of the QARepNeXt module, several improved C3 modules based on mainstream lightweight networks, such as GhostNetV2, MobileNetV3, and FasterNet, were constructed to replace the original C3 module in YOLOv5. Table 3 presents the experimental results of the different models.
We strived to find a fast backbone network to meet the need for real-time strawberry detection. Among the various metrics, the actual detection speed, that is, the FPS, served as our primary selection criterion for the backbone. According to the results in Table 3, these lightweight models effectively reduced the model's parameter count and FLOPs. However, these improved models, except for QARepNeXt, did not demonstrate performance improvements that matched the reduction in FLOPs. The detection speed is determined by both the FLOPs and the FLOPS (floating point operations per second) [28], which is captured by
$$\text{Latency} = \frac{\text{FLOPs}}{\text{FLOPS}} \tag{12}$$
It is important to achieve higher FLOPS beyond simply reducing FLOPs for faster neural networks. While many attempts have been made to reduce FLOPs, such as reducing the parameter count, they seldom consider optimizing FLOPS simultaneously to achieve truly low latency. QARepNeXt converts to a single-path structure during inference, whereas the other models all retain complex multi-branch designs. The single-path structure is a hardware-efficient architecture that can utilize computing ability and memory bandwidth more effectively [26], which results in higher FLOPS. Despite having more parameters and higher FLOPs, QARepNeXt achieved the best detection speed. In conclusion, we selected QARepNeXt as the primary module for the backbone and neck, which efficiently improved the detection speed without compromising accuracy.
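To make the distinction concrete with purely illustrative numbers: a model requiring 16 GFLOPs per image on hardware that its multi-branch layout keeps busy at only 1 TFLOPS needs 16 × 10⁹ / 10¹² = 16 ms per image (about 62 FPS), whereas the same 16 GFLOPs executed by a single-path structure sustaining 4 TFLOPS takes only 4 ms (about 250 FPS). Reducing FLOPs alone therefore cannot guarantee low latency if the achievable FLOPS drops at the same time.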
3.2.2. Experiments of the Triplet Attention Mechanism
The attention mechanisms were embedded in the last two layers of R-YOLO’s neck network to better extract crucial features. We conducted comparative experiments on various types of attention mechanisms, including channel attention mechanisms, such as SE, and channel and spatial attention mechanisms, such as CBAM and the triplet. The experimental results for different attention mechanisms are shown in Table 4.
According to the results in Table 4, all the attention mechanisms improved the detection accuracy of R-YOLO. Given an input image, the channel attention focused on what was meaningful. It considered each channel of a feature map as a feature detector, contributing more to the classification. In this study, the primary challenge was to accurately identify the target class, and thus, the channel attention contributed more to the model's accuracy. This explained why the SE attention mechanism, which only includes the channel attention part, achieved a significant performance improvement [32]. Conversely, the spatial attention component complemented the channel attention by emphasizing the precise localization of informative parts, which led to further improvements in detection accuracy. As a result, the channel and spatial attention mechanisms achieved a better performance than SE. Since the triplet attention can capture the cross-dimensional interactions between the channel and spatial dimensions, it achieved the highest precision rate, mAP0.5, mAP0.75, and mAP0.5:0.95 values. Therefore, we selected the triplet attention mechanism to enhance the model's feature extraction.
3.2.3. Experiments of the Focal Loss Function
The focal loss function was integrated with RT-YOLO to improve the recall rate by enhancing the model's learning ability toward challenging examples. The focal loss function has two key parameters, $\alpha$ and $\gamma$. In our experiment, we explored various values of $\alpha$ and $\gamma$, where $\alpha \in [0.10, 0.20]$ and $\gamma \in [1.0, 5.0]$, and found that the optimal result was achieved when $\alpha = 0.13$ and $\gamma = 1.4$ on the validation dataset, as shown in Table 5. We applied the best parameter values on the test dataset and achieved an improved mAP0.5 of 90.24%, which manifested the model's robustness. A lower value of $\alpha$ (0.13 in this study) is typically chosen in conjunction with a higher value of $\gamma$ (1.4 in this study) [31]. The results of our research are in agreement with this assertion.
3.3. Ablation Experiments of Different Improved Models
The detection results of the different improved models are shown in Table 6 and Figure 8. R-YOLO improved the detection speed by 32 FPS while keeping the mAP0.5 essentially unchanged. Compared with R-YOLO, RT-YOLO improved the mAP0.5 by 2.6 points, but the attention mechanism introduced an increase in FLOPs, which resulted in a slight reduction in the detection speed. Finally, RTF-YOLO integrated the focal loss function with RT-YOLO, which improved the model's recall rate and raised the mAP0.5 by more than 1 point. Ultimately, compared with the original YOLOv5 model, our proposed RTF-YOLO model achieved an almost 3.6 point improvement in mAP0.5 and a 23 FPS increase in detection speed.
The confusion matrices of the different improved models are shown in Figure 9. In each confusion matrix, the horizontal axis represents the ground-truth labels, while the vertical axis represents the labels predicted by the model. The first two columns represent the counts of the ripen and unripen strawberry labels, respectively; their sum equals the total number of labels in the test dataset in Table 1, which was 820. The triplet attention mechanism boosted the feature extraction, which raised the count of accurately detected strawberries by 32. Unripened strawberries were susceptible to false and missed detections because their color resembles that of the leaves. The focal loss function enhanced the learning capacity for these challenging objects. As a result, RTF-YOLO reduced the count of false and missed detections of unripened strawberries by six compared with RT-YOLO.
The visualization results of the different improved models are depicted in Figure 10. The label “unripen” in the diagram signifies unripened strawberries, whereas the label “ripen” signifies ripened strawberries. All the improved models demonstrated some reduction in the occurrences of false positives and false negatives, but considering the overall performance, RTF-YOLO achieved the best results. For example, as shown in (a), the YOLOv5 model incorrectly detected the leaves on the bottom-right and right sides as unripened strawberry fruits. The RT-YOLO model resolved the false positive problem but introduced a false negative in the top right. RTF-YOLO, on the other hand, successfully addressed both types of problems. As shown in (b), YOLOv5 and R-YOLO suffered from false negatives for unripened strawberries, whose colors are similar to those of the leaves. However, RTF-YOLO effectively eliminated these false negatives by enhancing the learning of such challenging examples.
3.4. Comparative Experimental Analysis of Different Models
We compared RTF-YOLO with other mainstream target detection networks, including Faster R-CNN, SSD, YOLOv5s, ASFF-YOLOv5 [22], DSE-YOLO [19], and DSW-YOLO [8], as shown in Table 7. According to the experimental results, Faster R-CNN achieved a recall rate of 81.04%, but its precision rate was only 38.43%, which led to many false detections of strawberries. In contrast, SSD achieved a precision rate of 82.33%, but its recall rate was only 19.24%, which led to many missed detections. Faster R-CNN employed a single high-level feature map for target classification and localization, which led to its inadequate detection ability for small and multi-scale targets. Although SSD used multi-scale feature maps from different layers, it failed to fully exploit the information from low-level high-resolution feature maps, which is vital for small-object detection. As a result, neither of the two networks could meet the requirements for multi-stage strawberry detection.
YOLOv5 utilized the PANet network to integrate feature maps from diverse levels, which enhanced the receptive field and achieved a richer input representation. Models based on YOLOv5 therefore achieved better detection accuracy. The mAP0.5 for RTF-YOLO was 90.24%, which showed improvements of 3.6% and 3% compared with YOLOv5s and ASFF-YOLOv5, respectively. YOLOv5's single-stage detection scheme offered a higher detection speed. However, introducing enhancement modules, such as the ASFF module, DSE module, and attention mechanisms, increased the model complexity and reduced the detection speed. Benefiting from its improved backbone network, RTF-YOLO's detection speed surpassed YOLOv5s and the other YOLO-based improved networks by over 19% and 45%, respectively. Although DSE-YOLO and DSW-YOLO achieved a high detection accuracy, their large model sizes and slower detection speeds limit their deployment on edge computing devices. In contrast, RTF-YOLO offered a threefold advantage in detection speed and model scale. Therefore, RTF-YOLO was more suitable for detecting strawberries in complex environments in real time.
4. Conclusions
This article proposes a novel RTF-YOLO network model for strawberry detection under fluctuating lighting and fruit occlusion scenarios. The RTF-YOLO model was obtained by integrating YOLOv5 with the QARepNeXt module, the triplet attention mechanism, and the focal loss function. The QARepNeXt module improved the detection speed of the model, while the triplet attention mechanism enhanced the feature extraction capability. The focal loss function was utilized to address the foreground–background class imbalance problem and enhance the learning ability for challenging examples. The model's performance was validated with a dataset collected from the field. The experimental results showed that the model achieved a precision of 91.92%, a recall rate of 81.43%, an mAP0.5 of 90.24%, and a detection speed of 145 FPS. Compared with other mainstream object detection algorithms, RTF-YOLO was more advantageous in terms of the mAP0.5, model size, and detection speed. Consequently, the proposed algorithm can provide guidance for the yield prediction and automated harvesting of strawberries.
Conceptualization, methodology, and software, S.S.; data curation, F.D. and Z.T.; writing—review and editing, S.S., F.D., Z.T. and C.H.; supervision and project administration, F.D., Z.T. and C.H. All authors read and agreed to the published version of this manuscript.
The data presented in this study are available upon request from the corresponding authors.
Author Chunxiao Han was employed by the company SDIC XinJiang Luobupo Potash Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 1. Images under different lighting and occlusion scenarios: (a) normal scenario, (b,c) occlusion scenarios, (d) low-light scenario, (e) high-light scenario, and (f) mixed low-light and high-light scenario.
Figure 2. Image augmentation methods: (a) original image, (b) horizontal mirror, (c) 90° clockwise rotation, (d) 180° clockwise rotation, (e) 270° clockwise rotation, and (f) random transformation of saturation and brightness.
Figure 3. The overall framework of the RTF-YOLO model. The backbone network consists of QARepNeXt, Conv, and SPPF modules. The neck network consists of FPN and PAN structures and the triplet attention mechanism. The head network includes three detection heads with different resolutions and utilizes the focal loss function for target classification.
Figure 4. The C3 and SPPF modules of YOLOv5. The C3 combines N bottleneck blocks with three Conv modules, which optimizes the feature fusion for efficient feature extraction. The SPPF module generates multi-scale feature maps by applying max pooling at different scales, which enriches the semantic information of the feature representation.
Figure 5. Reparameterization of a QARepVgg block compared with RepVgg. The main improvements were as follows: (1) the BN layers following the 1 × 1 and identity branches were removed, and (2) a new BN layer was appended to the convolutional layer obtained from the fusion of the multiple branches.
Figure 6. Illustration of the QARepNeXt module, which adopts a QARepVgg block (multi-branch structure) during training and a convolution module (single-branch structure) during inference.
Figure 7. Illustration of the triplet attention, which has three parallel branches.
Figure 8. The precision–recall curves. (a) The recall–precision curve of different improved models. (b) The recall–precision curve of various categories obtained by RTF-YOLO.
Figure 10. Visualization results of YOLOv5, R-YOLO, RT-YOLO, and RTF-YOLO are shown in their respective columns, with subfigures (a–c) serving as three examples. The false positive samples are encircled with a yellow dashed line, while the false negative samples are encircled with a blue dashed line.
Details of the strawberry dataset used for training, validating, and testing the proposed model.
Dataset | No. Images | Ripen | Unripen | Total
---|---|---|---|---
Training | 1428 | 3240 | 2952 | 6192
Validation | 408 | 962 | 738 | 1700
Test | 204 | 370 | 450 | 820
The experimental environment used in the current study.
Hardware and Software | Configuration |
---|---|
CPU | Intel(R) Core(TM) i9-11900K |
RAM | 64 GB |
GPU | GeForce RTX 2080Ti |
Operating system | Windows 10 |
Cuda | Cuda 11.3.0 |
Data processing | Python 3.10 |
Deep learning framework | Pytorch 1.9.0 |
Performance, complexity, and analysis speed of various backbones.
Backbone | mAP0.5 (%) | mAP0.75 (%) | mAP0.5:0.95 (%) | Params (M) | FLOPs (G) | FPS |
---|---|---|---|---|---|---|
YOLOv5 | 86.55 | 62.08 | 55.61 | 7.03 | 16.0 | 122 |
GhostNetV2 | 87.05 | 58.53 | 53.95 | 4.90 | 10.6 | 114 |
MobileNetV3 | 86.77 | 58.74 | 54.63 | 6.44 | 14.1 | 122 |
ShuffleNetV2 | 85.52 | 61.39 | 53.72 | 4.85 | 10.3 | 97 |
InceptionNet | 87.10 | 63.72 | 55.61 | 5.37 | 11.7 | 94 |
FasterNet | 87.34 | 64.82 | 57.44 | 5.72 | 12.6 | 120 |
QARepNeXt | 86.58 | 65.46 | 56.87 | 13.00 | 33.6 | 154 |
Performance comparison of R-YOLO with various attention mechanisms.
Model | P (%) | R (%) | mAP0.5 (%) | mAP0.75 (%) | mAP0.5:0.95 (%) | FPS |
---|---|---|---|---|---|---|
R-YOLO | 89.05 | 78.50 | 86.58 | 62.75 | 54.87 | 154 |
R-YOLO + SE | 89.69 | 79.44 | 88.02 | 62.24 | 55.35 | 149 |
R-YOLO + CBAM | 89.49 | 81.26 | 88.54 | 62.07 | 55.80 | 137 |
R-YOLO + coordinate | 89.96 | 80.48 | 88.24 | 62.14 | 55.48 | 141 |
R-YOLO + NAM | 89.84 | 80.40 | 88.10 | 62.36 | 55.06 | 152 |
R-YOLO + SGE | 89.37 | 80.02 | 87.82 | 62.78 | 54.93 | 141 |
R-YOLO + SimAM | 90.34 | 80.52 | 89.02 | 63.19 | 56.07 | 139 |
R-YOLO + triplet | 91.43 | 81.05 | 89.21 | 63.46 | 56.29 | 145 |
Performance (mAP0.5, %) of RT-YOLO with the focal loss function under different α and γ values on the validation dataset.
α \ γ | 1.00 | 1.20 | 1.40 | 1.60 | 2.00 | 2.40 | 3.00 | 5.00
---|---|---|---|---|---|---|---|---
0.10 | 88.26 | 87.72 | 89.63 | 88.28 | 88.82 | 82.96 | 73.98 | 51.57
0.11 | 87.41 | 88.31 | 89.97 | 87.77 | 88.48 | 83.18 | 75.11 | 52.33
0.12 | 88.37 | 88.36 | 90.16 | 89.08 | 87.66 | 84.17 | 72.32 | 55.10
0.13 | 88.81 | 89.60 | 90.32 | 88.04 | 88.57 | 84.88 | 76.97 | 54.84
0.15 | 89.67 | 88.67 | 89.84 | 89.06 | 88.61 | 83.59 | 79.63 | 53.91
0.20 | 88.75 | 88.27 | 88.31 | 87.66 | 86.79 | 85.95 | 83.90 | 53.07
The ablation experiment results of different improved models.
Model | P (%) | R (%) | AP0.5 Unripen (%) | AP0.5 Ripen (%) | mAP0.5 (%) | mAP0.75 (%) | mAP0.5:0.95 (%) | FPS
---|---|---|---|---|---|---|---|---
YOLOv5 | 89.67 | 77.21 | 82.60 | 90.50 | 86.55 | 62.08 | 55.61 | 122
R-YOLO | 89.05 | 78.50 | 82.20 | 90.90 | 86.58 | 62.75 | 54.87 | 154
RT-YOLO | 91.43 | 81.05 | 84.60 | 93.50 | 89.21 | 63.46 | 56.29 | 145
RTF-YOLO | 91.92 | 81.43 | 86.30 | 94.30 | 90.24 | 64.84 | 57.65 | 145
Comparative analysis of performance and efficiency metrics for various object detection models.
Model | P (%) | R (%) | mAP0.5 (%) | Params (M) | FPS |
---|---|---|---|---|---|
Faster R-CNN | 38.43 | 81.04 | 64.32 | 137.10 | 23 |
SSD | 82.33 | 19.24 | 63.24 | 26.29 | 50 |
YOLOv5 | 89.67 | 77.21 | 86.65 | 7.03 | 122 |
ASFF-YOLOv5 | 86.00 | 81.75 | 87.25 | 12.46 | 100 |
DSE-YOLO | 85.34 | 81.09 | 87.98 | 224.39 | 21 |
DSW-YOLO | 82.80 | 82.10 | 86.70 | 32.40 | 42 |
RTF-YOLO | 91.92 | 81.43 | 90.24 | 13.00 | 145 |
References
1. Manganaris, G.A.; Goulas, V.; Vicente, A.R.; Terry, L.A. Berry antioxidants: Small fruits providing large benefits. J. Sci. Food Agric.; 2014; 94, pp. 825-833. [DOI: https://dx.doi.org/10.1002/jsfa.6432] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24122646]
2. Zhou, C.; Hu, J.; Xu, Z.; Yue, J.; Ye, H.; Yang, G. A novel greenhouse-based system for the detection and plumpness assessment of strawberry using an improved deep learning technique. Front. Plant Sci.; 2020; 11, 559. [DOI: https://dx.doi.org/10.3389/fpls.2020.00559] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32582225]
3. Zhao, Y.; Gong, L.; Huang, Y.; Liu, C. A review of key techniques of vision-based control for harvesting robot. Comput. Electron. Agric.; 2016; 127, pp. 311-323. [DOI: https://dx.doi.org/10.1016/j.compag.2016.06.022]
4. Linker, R.; Cohen, O.; Naor, A. Determination of the number of green apples in RGB images recorded in orchards. Comput. Electron. Agric.; 2012; 81, pp. 45-57. [DOI: https://dx.doi.org/10.1016/j.compag.2011.11.007]
5. Arefi, A.; Motlagh, A.M. Development of an expert system based on wavelet transform and artificial neural networks for the ripe tomato harvesting robot. Aust. J. Crop Sci.; 2013; 7, pp. 699-705.
6. Lu, J.; Sang, N. Detecting citrus fruits and occlusion recovery under natural illumination conditions. Comput. Electron. Agric.; 2015; 110, pp. 121-130. [DOI: https://dx.doi.org/10.1016/j.compag.2014.10.016]
7. Hamuda, E.; Mc Ginley, B.; Glavin, M.; Jones, E. Improved image processing-based crop detection using Kalman filtering and the Hungarian algorithm. Comput. Electron. Agric.; 2018; 148, pp. 37-44. [DOI: https://dx.doi.org/10.1016/j.compag.2018.02.027]
8. Du, X.; Cheng, H.; Ma, Z.; Lu, W.; Wang, M.; Meng, Z.; Jiang, C.; Hong, F. DSW-YOLO: A detection method for ground-planted strawberry fruits under different occlusion levels. Comput. Electron. Agric.; 2023; 214, 108304. [DOI: https://dx.doi.org/10.1016/j.compag.2023.108304]
9. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep learning in agriculture: A survey. Comput. Electron. Agric.; 2018; 147, pp. 70-90. [DOI: https://dx.doi.org/10.1016/j.compag.2018.02.016]
10. Zhang, Y.; Yu, J.; Chen, Y.; Yang, W.; Zhang, W.; He, Y. Real-time strawberry detection using deep neural networks on embedded system (rtsd-net): An edge AI application. Comput. Electron. Agric.; 2022; 192, 106586. [DOI: https://dx.doi.org/10.1016/j.compag.2021.106586]
11. Yu, Y.; Zhang, K.; Liu, H.; Yang, L.; Zhang, D. Real-time visual localization of the picking points for a ridge-planting strawberry harvesting robot. IEEE Access; 2020; 8, pp. 116556-116568. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3003034]
12. Mejia, G.; de Oca, A.M.; Flores, G. Strawberry localization in a ridge planting with an autonomous rover. Eng. Appl. Artif. Intell.; 2023; 119, 105810. [DOI: https://dx.doi.org/10.1016/j.engappai.2022.105810]
13. Yu, Y.; Zhang, K.; Yang, L.; Zhang, D. Fruit detection for strawberry harvesting robot in non-structural environment based on Mask-RCNN. Comput. Electron. Agric.; 2019; 163, 104846. [DOI: https://dx.doi.org/10.1016/j.compag.2019.06.001]
14. Tang, C.; Chen, D.; Wang, X.; Ni, X.; Liu, Y.; Liu, Y.; Mao, X.; Wang, S. A fine recognition method of strawberry ripeness combining Mask R-CNN and region segmentation. Front. Plant Sci.; 2023; 14, 1211830. [DOI: https://dx.doi.org/10.3389/fpls.2023.1211830] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37670853]
15. Perez-Borrero, I.; Marin-Santos, D.; Gegundez-Arias, M.E.; Cortes-Ancos, E. A fast and accurate deep learning method for strawberry instance segmentation. Comput. Electron. Agric.; 2020; 178, 105736. [DOI: https://dx.doi.org/10.1016/j.compag.2020.105736]
16. Chen, Y.; Lee, W.S.; Gan, H.; Peres, N.; Fraisse, C.; Zhang, Y.; He, Y. Strawberry yield prediction based on a deep neural network using high-resolution aerial orthoimages. Remote Sens.; 2019; 11, 1584. [DOI: https://dx.doi.org/10.3390/rs11131584]
17. Zheng, C.; Liu, T.; Abd-Elrahman, A.; Whitaker, V.M.; Wilkinson, B. Object-Detection from Multi-View remote sensing Images: A case study of fruit and flower detection and counting on a central Florida strawberry farm. Int. J. Appl. Earth Obs. Geoinf.; 2023; 123, 103457. [DOI: https://dx.doi.org/10.1016/j.jag.2023.103457]
18. Li, J.; Zhu, Z.; Liu, H.; Su, Y.; Deng, L. Strawberry R-CNN: Recognition and counting model of strawberry based on improved faster R-CNN. Ecol. Inform.; 2023; 77, 102210. [DOI: https://dx.doi.org/10.1016/j.ecoinf.2023.102210]
19. Wang, Y.; Yan, G.; Meng, Q.; Yao, T.; Han, J.; Zhang, B. DSE-YOLO: Detail semantics enhancement YOLO for multi-stage strawberry detection. Comput. Electron. Agric.; 2022; 198, 107057. [DOI: https://dx.doi.org/10.1016/j.compag.2022.107057]
20. Zhou, X.; Lee, W.S.; Ampatzidis, Y.; Chen, Y.; Peres, N.; Fraisse, C. Strawberry maturity classification from UAV and near-ground imaging using deep learning. Smart Agric. Technol.; 2021; 1, 100001. [DOI: https://dx.doi.org/10.1016/j.atech.2021.100001]
21. Chai, J.J.; Xu, J.L.; O’Sullivan, C. Real-Time Detection of Strawberry Ripeness Using Augmented Reality and Deep Learning. Sensors; 2023; 23, 7639. [DOI: https://dx.doi.org/10.3390/s23177639] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37688097]
22. Li, Y.; Xue, J.; Zhang, M.; Yin, J.; Liu, Y.; Qiao, X.; Zheng, D.; Li, Z. YOLOv5-ASFF: A Multistage Strawberry Detection Algorithm Based on Improved YOLOv5. Agronomy; 2023; 13, 1901. [DOI: https://dx.doi.org/10.3390/agronomy13071901]
23. Wang, J.; Perez, L. The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Netw. Vis. Recognit.; 2017; 11, pp. 1-8.
24. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 2117-2125.
25. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759-8768.
26. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA, 20–25 June 2021; pp. 13733-13742.
27. Chu, X.; Li, L.; Zhang, B. Make RepVGG Greater Again: A Quantization-aware Approach. arXiv; 2022; arXiv: 2212.01593[DOI: https://dx.doi.org/10.1609/aaai.v38i10.29045]
28. Chen, J.; Kao, S.h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t walk: Chasing higher FLOPS for faster neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Vancouver, BC, Canada, 17–24 June 2023; pp. 12021-12031.
29. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; Virtual, 5–9 January 2021; pp. 3139-3148.
30. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv; 2022; arXiv: 2206.02424
31. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision; Venice, Italy, 22–29 October 2017; pp. 2980-2988.
32. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany, 8–14 September 2018; pp. 3-19.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
The recognition and localization of strawberries are crucial for automated harvesting and yield prediction. This article proposes a novel RTF-YOLO (RepVgg-Triplet-FocalLoss-YOLO) network model for real-time strawberry detection. First, an efficient convolution module based on structural reparameterization is proposed. This module was integrated into the backbone and neck networks to improve the detection speed. Then, the triplet attention mechanism was embedded into the last two detection heads to enhance the network’s feature extraction for strawberries and improve the detection accuracy. Lastly, the focal loss function was utilized to enhance the model’s recognition capability for challenging strawberry targets, thereby improving the model’s recall rate. The experimental results demonstrated that the RTF-YOLO model achieved a detection speed of 145 FPS (frames per second), a precision of 91.92%, a recall rate of 81.43%, and an mAP0.5 of 90.24%, outperforming other mainstream object detection models in terms of accuracy, speed, and model size.
Details
1 Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China
2 Institute of Urban Agriculture, Chinese Academy of Agricultural Sciences, Chengdu 610213, China
3 SDIC XinJiang Luobupo Potash Co., Ltd., Hami 839000, China