Abstract
Remote sensing object detection has recently emerged as one of the most challenging topics in the field of deep learning applications due to the demand for both high detection performance and computational efficiency. To address these challenges, this study introduces an efficient one-stage object detector designed mainly for detecting objects in remote sensing images, built around several innovations. Firstly, a feature extraction block called PRepConvBlock is proposed; it leverages reparameterized convolution and partial feature utilization to effectively reduce the complexity of convolution operations, allowing larger kernel sizes to be used to form longer-range interactions between features and significantly expand receptive fields. Secondly, a shallow multi-scale fusion framework called SB-FPN is introduced; based on Bi-FPN, it exploits the cross-interaction between shallow and deeper scales while inheriting the bidirectional connections of Bi-FPN to enhance the visual representation of features. Lastly, by combining these innovations, a Shallow-level Optimized Reparameterization Architecture Detector (SORA-DET) is proposed. This object detector is designed for UAV remote sensing object detection tasks and employs up to four detection heads. As a result, our proposed detector obtains competitive performance, outperforming most other large-size models and SOTA works. In detail, SORA-DET achieves 39.3% mAP50 on the VisDrone2019 test set and up to 84.0% mAP50 on the SeaDroneSeeV2 validation set. Furthermore, compared with other large-scale one-stage detectors, our proposed detector uses nearly 88.1% fewer parameters and reaches an inference time of only 5.4 ms.
Introduction
Remote sensing object detection is the process of localizing and identifying objects in aerial imagery and has become an essential tool across diverse domains such as disaster management, urban planning, agriculture, forestry, and military operations. With the assistance of unmanned aerial vehicles (UAVs), aerial imaging has become a more efficient way to capture and analyze data from above, offering a flexible, cost-effective, and high-resolution alternative to traditional satellite-based and aircraft-based imaging systems. However, unlike traditional object detection, remote sensing object detection currently faces various challenges. For instance, the UAV imaging process tends to produce numerous images in which target objects appear at different sizes, distances, and angles. For example, when the UAV changes its altitude, the distance from the camera to the target objects also changes, leading to differences in object scale and position that seriously affect the detection heads of many deep learning models. Furthermore, as UAVs often operate regardless of the time of day, object detection methods are expected to function reliably under various conditions. Consequently, there is a strong demand for robust deep learning architectures that can learn features suitable for recognizing target objects across various scales, distances, angles, and environments. Fig. 1 illustrates several challenges in remote sensing object detection.
Fig. 1 [Images not available. See PDF.]
Several challenges in remote sensing object detection. A wide range of small-scale, blurry target objects appears under various conditions, burdening the detection heads of many one-stage object detectors.
Several works have been carried out to address these problems. For instance, the work of Zhu et al.1 explored the ability of the self-attention mechanism in remote sensing object detection by replacing the baseline detection head of YOLOv52 with a transformer-based detection head. Further work from Zhao et al.3 integrated the SwinTransformer layer4 into the YOLOv75 architecture to enhance overall detection ability. Although the self-attention mechanism in Transformers has brought remarkable detection results, such combinations normally require a large number of parameters, as self-attention has quadratic complexity6, leading to significantly higher computation cost and resource consumption. Although these costs can easily be handled by high-performance computing (HPC) servers, they are a serious obstacle for small device platforms such as UAVs. Thereby, numerous lightweight approaches for deep learning models have been introduced, including the work of Hu et al.7, which proposed a unique architecture based on YOLOv5 named EL-YOLOv5. The authors replaced the original spatial pooling with efficient spatial pyramid pooling (ESPP) and introduced a new loss function, alpha-CIoU, obtaining an efficient small-version model with only 7.56 million parameters. Liu et al. (2022) introduced a unique module, Spatial-Coordinate Self-Attention (SCSA), which effectively represents small features by constructing non-local interactions between the pixels of an image; as a result, the number of parameters is reduced to only 1.41 million. However, despite the low parameter counts, these approaches essentially trade performance for reduced complexity: the study7 achieved only 18.4% mAP50-95 on the VisDrone2019 dataset, while the work8 reached 36.6% mAP50 on the same dataset. As a consequence, remote sensing object detection struggles with both detection ability and computational efficiency, and balancing this trade-off is not simple.
Based on the aforementioned problems, this study introduces a novel one-stage detector architecture that has low resource consumption and parameter count while achieving competitive performance in remote sensing object detection compared with other large-scale deep learning models. To reach this goal, the detector is built on several noticeable innovations, from the feature extraction and aggregation blocks to the multi-scale feature fusion network and additional prediction heads. Firstly, convolution plays a vital role in many modern deep learning architectures due to its scalability and robustness. However, the capability of convolution relies on the kernel size to form interactions between features, which greatly restricts its performance when extracting more complex patterns of target objects9. While enlarging the kernel size can improve accuracy, it also adds significant computation cost and extra parameters10, becoming a hindrance for deep learning models operating on small devices. Therefore, inspired by the ideas of partial convolution11 and reparameterization convolution12, this study designs a unique block that stacks multiple reparameterized convolutions while making only a portion of the features learnable rather than all of them. In detail, the reparameterized convolution allows the block to expand its learning capability during training and to be shortened during inference to save space and increase inference speed. Furthermore, by extracting only partial features, we can expand the convolution kernel size significantly without adding much computation cost. As a result, this block learns features effectively and efficiently by expanding its learning ability and using a larger kernel size to form interactions between features. The block is named the Partial Reparameterization Convolution Block (PRepConvBlock) and is applied for both feature extraction in the backbone network and feature aggregation in the neck of the one-stage detector. Next, acknowledging the crucial role of multi-scale detection in remote sensing, where tiny and small objects usually appear densely in images, novel multi-scale networks such as PANet13 and Bi-FPN14 are normally applied in several works3,7. However, these networks, especially Bi-FPN, only aggregate and learn features of the same scale, disregarding the shallower scales that contain spatial information about small-size features. Because Bi-FPN passes information from its first layers to later layers of the same scale, it is a robust network for remote sensing object detection and has been applied in several works15,16. We further expand its ability by aggregating the preceding shallow scale, making more information about tiny and small objects learnable and enhancing the feature fusion to create more precise feature representations. The modified version of Bi-FPN is called the Shallow Bi-directional Feature Pyramid Network (SB-FPN) and is used as the multi-scale detection network of the one-stage detector.
To conclude, by combining several innovative approaches across different parts of the one-stage detector architecture, the study introduces a new efficient yet effective one-stage detector, called SORA-DET, designed mainly for remote sensing object detection and meeting the requirements of both light weight and high performance. The proposed detector outperforms most other SOTA models, reaching competitive performance on the VisDrone2019 and SeaDroneSeeV2 datasets. In summary, the study makes the following contributions:
Firstly, the study proposes a unique PRepConvBlock that comprises stacked reparameterized convolutions and learns only a portion of the features, allowing the kernel size to be expanded efficiently so that more features interact within a filter, enhancing feature representation and expanding the receptive field of the architecture.
Secondly, the study introduces SB-FPN, developed from Bi-FPN, which lets shallow-scale features cross-interact with the current scale to provide more information about tiny and small objects, improving the model's detection ability on remote sensing images through more precise feature representation.
Lastly, based on these component-level innovations, the study introduces a novel one-stage detector, SORA-DET, that meets the requirements of both light weight and high performance for UAV platform integration.
Related works
For decades, remote sensing object detection has been a challenging topic that has received considerable attention, as scattered, small objects must be distinguished while the model must still run on low-capability devices such as UAVs. The proposed methods therefore vary widely, from conventional convolutional networks to combinations of attention modules and transformer-based architectures designed to achieve higher performance.
For instance, the work of Zhengxin Zhang17 proposed replacing the conventional convolutions in the C2f feature extraction block with reparameterized convolutions, while employing a unique sandwich-fusion network that concatenates feature scales along the top-down path and passes them to large-kernel depthwise convolutions. Another approach, from Jinshan et al.18, combined ghost convolution19 with the Efficient Layer Aggregation Network (ELAN) in YOLOv52 to lower the resource consumption of the architecture while maintaining reasonable performance. While these studies17,18 produced a model with only 5.25 million parameters, the use of depthwise convolution can make it much harder for the model to converge and generalize, even with a large kernel size. Moreover, one of the biggest disadvantages of the conventional convolution operation is its restriction on forming longer interactions between pixels: the extracted features are limited by the kernel size, making it difficult to capture long-range dependencies.
Therefore, the approach of Yangang et al.20 added an attention module, integrating CBAM21 into the downsampling process so that the network learns to retain the important extracted features. In addition, work by Yifan et al.22 introduced a local attention module to filter out irrelevant features in the YOLOv8 architecture, yielding higher performance of up to 32.1% mAP50 on the VisDrone2019 test set23. Although attention modules play a crucial role in enhancing the architecture's performance, these mechanisms typically struggle to form long-range interactions between pixels, resulting in poor global feature capture. Hence, transformer-based approaches were explored in the work of Yaning et al.24, where RT-DETR25 was adopted to tackle the problem; the long-range dependencies captured by this model allow performance to reach up to 42.4% mAP50 on the VisDrone2019 test set. Further work26 proposed a Multi-head Channel and Spatial Trans-Attention (MCSTA) module to perform non-local interaction between pixels across different dimensions; as a plug-and-play module, MCSTA can be integrated into any architecture without further customization. The work of Xingkui et al.1 also proposed transformer-based prediction heads for UAV remote sensing object detection and achieved SOTA performance at the time. However, while the long-range dependencies of non-local interaction in transformer-based architectures can significantly increase model performance, they add quadratic complexity, greatly restricting adaptation to low-platform devices such as UAVs and making such models more suitable for high-resource environments like HPC servers than for small devices.
Other works have also been conducted to enhance the overall performance of the detection framework. For instance, Liu et al.27 proposed a balanced feature pyramid network (BFPN) that employs a multi-scale balanced module (MSBM) to incorporate both deep-level features and spatial information from shallow-level scales. Another work28 introduced an adaptive feature pyramid network (AFPN) to aggregate more discriminative information, enriching the features for the detection heads. Ma et al.29 proposed a domain-adaptive framework that uses a dehazing module with the YOLOX30 architecture to detect objects effectively under sophisticated circumstances, especially foggy weather. Regarding the small object detection problem, Karaca et al.31 applied EfficientDet14 to strengthen ship detection in SAR images. Furthermore, the work of Dong et al.32 introduced a kernel correlation filter (KCF) tracker to track target objects in camera surveillance, where objects tend to appear blurred.
Overall, recent works mainly pursue two goals. The first is to achieve the lowest possible resource consumption in order to run on low-platform devices; however, studies that follow this goal tend to struggle with low performance. The second is to reach the highest possible performance, which results in high complexity and makes it difficult to run on small devices. Therefore, this study aims to address both goals by designing a one-stage object detector that adopts the unique PRepConvBlock, extending the model's learning ability during training and leveraging partial feature learning to use large kernel sizes, forming long-range interactions between pixels while still maintaining high inference speed through reparameterization. Furthermore, the introduction of SB-FPN preserves the representation of small, scattered features, enhancing the detection heads' capability.
Methodology
Partial reparameterization convolution block (PRepConvBlock)
Fig. 2 [Images not available. See PDF.]
Structure of PRepConvBlock during training and inference. The block during training is able to use a large kernel size and extended convolution effectively due to partial feature utilization, while during inference, the reparameterization technique allows the block to merge the smaller kernel size convolution with the larger one in order to reduce the unnecessary complexity and redundant gradient flows.
The introduction of reparameterized convolution has brought breakthroughs to the modern architectures of several one-stage object detectors. This technique allows extended convolution weights to be learned during training and decomposed into a compact set of parameters during inference. Through this mechanism, this type of convolution improves generalization efficiently, allowing more patterns of the target objects to be learned. Furthermore, the kernel size in CNNs has been proven crucial for a wide range of vision tasks, including the detection of small or tiny objects in remote sensing33–35, as it is the main component forming interactions between features and gradually expanding the receptive field of the architecture. However, simply applying a large kernel size to reparameterized convolution potentially leads to an unintentional increase in complexity and extra parameters during training. Therefore, the idea of partial convolution11 is adopted, allowing only a portion of the features to be used as the input of the reparameterized convolutions and effectively increasing the kernel size with a reasonable trade-off in resource consumption. To understand partial feature utilization, first assume x is the input feature map and y is the output feature map; the formula below then describes the conventional approach with full feature utilization:
$y = K \ast x + b \quad (1)$
where K denotes the convolution weights and b the bias. Eq. (1) corresponds to full usage of the input feature $x \in \mathbb{R}^{C \times H \times W}$. By applying partial feature utilization, the features fed to the convolution operation can be cut down by half, reducing the high computation cost incurred by a large kernel size, and the operation can be rewritten as:

$y = \mathrm{Concat}\!\left(K_{p} \ast x_{1:C/2} + b_{p},\; x_{C/2+1:C}\right) \quad (2)$
In Eq. (2), the input to the convolution is reduced by half: only the first C/2 channels $x_{1:C/2}$ are convolved, with a correspondingly smaller kernel weight $K_p$ and bias $b_p$. The convolved output is then concatenated with the untouched channels $x_{C/2+1:C}$ to form the final feature map of the operation. Because fewer features are used for convolution, there is a risk of weaker learning of complex features. However, this risk is suppressed by large kernel utilization, which gives the detector larger receptive fields and allows it to form interactions between features effectively despite the reduced number of learnable features. As a result, based on the combination of reparameterized convolution and partial convolution, the Partial Reparameterization Convolution Block (PRepConvBlock) is developed. Fig. 2 shows the overall structure of the PRepConvBlock. During the training phase, by using several convolution operations in parallel, the model can expand its learning capability effectively and provide richer representations. Furthermore, since the reparameterization technique is applied later in the inference phase, the convolution operation is able to utilize large kernel sizes efficiently, enlarging the receptive fields with only a small extra resource cost thanks to the partial feature usage, which involves only half of the input channels rather than all of them. The outputs of these convolutional layers are activated with the SiLU (Sigmoid Linear Unit) activation function, which improves gradient flow and maps the extracted features to higher-dimensional spaces, helping the model learn more complex patterns and feature interactions. In addition, the vanishing gradient effect may worsen as the architecture becomes deeper in later stages; therefore, skip connections are applied at the end of the reparameterized convolutions and at the end of the PRepConvBlock to subdue this effect. During the inference phase, the structure of the PRepConvBlock is simplified through the reparameterization technique, which transforms the multiple parallel convolutions into a single sequential convolutional pathway. This optimized block keeps only the merged convolutions, reducing computational overhead while maintaining the feature representations learned during training. By removing redundant computations, this approach significantly accelerates inference, suppresses unnecessary resource consumption, and preserves the model's performance. To understand the operation of the reparameterized convolution, the equations below describe the convolution operations in their original form during the training phase of the PRepConvBlock, covering its two reparameterization phases.
$m = \mathrm{BN}_{l1}\!\left(K_{l1} \ast x\right) + \mathrm{BN}_{s1}\!\left(K_{s1} \ast x\right) \quad (3)$

$y = \mathrm{BN}_{l2}\!\left(K_{l2} \ast m\right) + \mathrm{BN}_{s2}\!\left(K_{s2} \ast m\right) \quad (4)$
where $K_{l1}$ and $K_{s1}$ are the large and small kernels of the first convolution phase, $K_{l2}$ and $K_{s2}$ those of the second phase, BN is batch normalization, x is the input feature, m is the feature after the first convolution phase, and y is the output of the operation. The batch normalization fusion is defined as FuseBN in Eq. (7), which involves $\sigma^2$ and $\mu$, the variance and mean of the inputs, and $\gamma$ and $\beta$, the learnable parameters of batch normalization. These equations demonstrate the complex operations performed during training, which reduce the inference speed of the detector due to the multiple gradient flows. Hence, it is essential to remove these redundancies without affecting model performance. To express the reparameterization technique, the equations below indicate the batch normalization fusion process for both convolution weights and biases, which is the first step of the reparameterization:

$\left(\hat{K}_{l1},\, \hat{b}_{l1}\right) = \mathrm{FuseBN}\!\left(K_{l1},\, \mathrm{BN}_{l1}\right) \quad (5)$
$\left(\hat{K}_{s1},\, \hat{b}_{s1}\right) = \mathrm{FuseBN}\!\left(K_{s1},\, \mathrm{BN}_{s1}\right) \quad (6)$
where FuseBN is defined as:

$\mathrm{FuseBN}\!\left(K,\, \mathrm{BN}\right) = \left(\dfrac{\gamma}{\sqrt{\sigma^{2} + \epsilon}}\, K,\;\; \beta - \dfrac{\gamma\, \mu}{\sqrt{\sigma^{2} + \epsilon}}\right) \quad (7)$
In Eqs. (5) and (6), $\hat{K}_{l1}$ and $\hat{b}_{l1}$ denote the BN-fused weight and bias of the convolution with the large kernel, while $\hat{K}_{s1}$ and $\hat{b}_{s1}$ are obtained by the same process for the convolution with the small kernel. These formulas use $\epsilon$ as a small constant added to prevent division by zero. This whole process produces convolution weights and biases that have absorbed their batch normalization. Next, in order to merge the smaller kernel convolution with the bigger one, a padding step is performed, as illustrated below:
Let $\hat{K}_{s1}$ be a weight tensor with shape $C_{out} \times C_{in} \times k_{s} \times k_{s}$. Because this shape differs from that of $\hat{K}_{l1}$, which is $C_{out} \times C_{in} \times k_{l} \times k_{l}$, the merging cannot be performed directly. Therefore, a zero-padding technique is applied to give the weights the proper shape:
$\tilde{K}_{s1} = \mathrm{ZeroPad}\!\left(\hat{K}_{s1},\, 1\right) \quad (8)$
Then, the merging process can be performed as:
$K_{1} = \hat{K}_{l1} + \tilde{K}_{s1}, \qquad b_{1} = \hat{b}_{l1} + \hat{b}_{s1} \quad (9)$
Thus, Eq. (3) can be optimized as:
$m = K_{1} \ast x + b_{1} \quad (10)$
In Eq. (8), ZeroPad denotes zero padding applied to the kernel weight $\hat{K}_{s1}$ with a padding amount of 1 on each side, producing a kernel weight with the same size as $\hat{K}_{l1}$ so that the reparameterization of the convolution weights in Eq. (9) can be performed, where $K_{1}$ is the unified weight and $b_{1}$ is the unified bias. This yields the reparameterization of the first convolution phase with its large and small kernels, as shown in Fig. 2.
Next, using the same approach as in Eqs. (5) and (6), we obtain the fused batch normalization weights and biases for the second convolution phase with kernels $K_{l2}$ and $K_{s2}$, which can be written as:
$\left(\hat{K}_{l2},\, \hat{b}_{l2}\right) = \mathrm{FuseBN}\!\left(K_{l2},\, \mathrm{BN}_{l2}\right) \quad (11)$

$\left(\hat{K}_{s2},\, \hat{b}_{s2}\right) = \mathrm{FuseBN}\!\left(K_{s2},\, \mathrm{BN}_{s2}\right) \quad (12)$
Then, in Eqs. (11) and (12), again, the FuseBN is described in Eq. (7). Next, the padding process for the smaller convolution weight is indicated as:
$\tilde{K}_{s2} = \mathrm{ZeroPad}\!\left(\hat{K}_{s2},\, 1\right) \quad (13)$
From Eqs. (11) and (12), the batch normalization fusion weights and biases of the second phase are obtained, and the padded convolution weight $\tilde{K}_{s2}$ is obtained in Eq. (13). Eventually, the formulation that merges the two convolution weights and biases is as follows:
$K_{2} = \hat{K}_{l2} + \tilde{K}_{s2}, \qquad b_{2} = \hat{b}_{l2} + \hat{b}_{s2} \quad (14)$
Then, Eq. (4) can be optimized as:
$y = K_{2} \ast m + b_{2} \quad (15)$
Based on the results of Eqs. (10) and (15), the final formula describing the convolution operation of the two stacked reparameterized convolutions is:
$y = K_{2} \ast \left(K_{1} \ast x + b_{1}\right) + b_{2} \quad (16)$
where y is the final output feature and x is the input feature. This whole reparameterization process is carried out to cut off unnecessary gradient flows while preserving the essential convolution weights and biases, speeding up the inference phase of the detector without affecting detection performance. Compared with the original form in Eqs. (3) and (4) without the reparameterization technique, this new form is much simpler. To gain a better understanding of the proposed method's flow, pseudo-code 1 indicates the whole process of the PRepConvBlock. As a result, the proposed PRepConvBlock offers several advantages over conventional convolutional blocks. First, this feature extraction block with reparameterization enhances feature representation during training and optimizes computational efficiency during inference. Second, by combining partial feature utilization, it provides a balance between accuracy and efficiency, making it particularly effective for remote sensing object detection in real-time applications.
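To make the reparameterization concrete, the sketch below implements a simplified PRepConvBlock-style phase in PyTorch, mirroring Eqs. (2)-(16). It assumes a 5×5 large branch and a 3×3 small branch purely for illustration; the actual kernel sizes, channel split, and skip connections of SORA-DET follow Fig. 2, and this is not the authors' reference implementation.

```python
# Illustrative sketch, not the authors' code: one partial, reparameterizable
# convolution phase with an assumed 5x5 large kernel and 3x3 small kernel.
import torch
import torch.nn as nn

def fuse_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Eq. (7): fold BatchNorm statistics into the convolution weight and bias."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    return conv.weight * scale.reshape(-1, 1, 1, 1), bn.bias - bn.running_mean * scale

class PartialRepConv(nn.Module):
    def __init__(self, channels: int, large_k: int = 5, small_k: int = 3):
        super().__init__()
        self.half = channels // 2                          # partial feature utilization
        c = self.half
        self.conv_l = nn.Conv2d(c, c, large_k, padding=large_k // 2, bias=False)
        self.bn_l = nn.BatchNorm2d(c)
        self.conv_s = nn.Conv2d(c, c, small_k, padding=small_k // 2, bias=False)
        self.bn_s = nn.BatchNorm2d(c)
        self.act = nn.SiLU()
        self.fused = None                                  # single merged conv for inference

    def forward(self, x):
        xp, xr = x[:, :self.half], x[:, self.half:]        # split channels, Eq. (2)
        if self.fused is not None:                         # inference: one conv, Eq. (10)
            yp = self.fused(xp)
        else:                                              # training: parallel branches, Eq. (3)
            yp = self.bn_l(self.conv_l(xp)) + self.bn_s(self.conv_s(xp))
        return torch.cat([self.act(yp), xr], dim=1)        # concat untouched channels

    @torch.no_grad()
    def reparameterize(self):
        """Eqs. (5)-(9): fuse BN, zero-pad the small kernel, merge the branches."""
        w_l, b_l = fuse_bn(self.conv_l, self.bn_l)
        w_s, b_s = fuse_bn(self.conv_s, self.bn_s)
        pad = (self.conv_l.kernel_size[0] - self.conv_s.kernel_size[0]) // 2
        w_s = nn.functional.pad(w_s, [pad] * 4)            # Eq. (8): e.g. 3x3 -> 5x5
        k = self.conv_l.kernel_size[0]
        self.fused = nn.Conv2d(self.half, self.half, k, padding=k // 2, bias=True)
        self.fused.weight.copy_(w_l + w_s)                 # Eq. (9): unified weight
        self.fused.bias.copy_(b_l + b_s)                   # Eq. (9): unified bias
```

In evaluation mode the merged branch produces the same outputs as the two training-time branches, so stacking two such phases reproduces the single sequential pathway of Eq. (16).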
Shallow bi-directional feature pyramid network (SB-FPN)
Fig. 3 [Images not available. See PDF.]
Structure of SB-FPN. This proposed feature fusion network allows the shallow feature scale to interact with the deeper one to provide more useful information about small and tiny objects, while the bidirectional connection is inherited from the Bi-FPN to enhance the model generalization. The overall structure is designed with 4 detection heads, including P2, P3, P4, and P5, which are P/4, P/8, P/16 and P/32 scales.
In modern deep learning architectures for object detection, most current implementations of feature extraction and aggregation networks such as PANet, FPN, and Bi-FPN use a simplified version of the baseline design. These lightweight versions normally remove edge nodes to save computation cost, providing higher inference speed. Fig. 3 illustrates the simplified versions of PANet and Bi-FPN and our proposed SB-FPN.
The simplified PANet is the main feature network inside several SOTA one-stage detectors such as YOLOv836, YOLOv937, YOLOv1038, and YOLOv1139. This feature network is widely known for its simplicity, reliability, and speed, introducing a bottom-up path aggregation that allows information to be passed from lower-level feature maps (P2, P3) to higher-level ones (P4, P5). However, this structure relies primarily on unidirectional connections, limiting direct feature fusion across multiple scales. Therefore, when handling advanced tasks such as detecting small objects in remote sensing, PANet tends to perform worse due to insufficient extracted features. One solution is to increase the model size, making more parameters learnable, but this approach is limited by the resource capacity of small devices. Another solution is a better-designed version of PANet, namely Bi-FPN, which incorporates bidirectional feature flows with additional shortcut connections. This bidirectional structure allows more efficient feature reuse by fusing multi-scale features along both top-down and bottom-up pathways, providing better feature fusion. Furthermore, the shallow P2 node has proven its ability to provide rich extracted information for detecting small objects, avoiding the feature loss caused by scale reduction40,41–42. Therefore, the P2 node is added, increasing the number of multi-scale detection heads to four. However, despite the P2 node, Bi-FPN essentially aggregates only information of the same scale and lacks interaction between nodes of different scales, especially the shallow scales. As a consequence, our Shallow Bi-directional Feature Pyramid Network (SB-FPN) is proposed; it is based on Bi-FPN but cross-concatenates the shallow scales during the aggregation phase, as indicated in Fig. 3. In detail, the P2 node is aggregated with the middle P3 node, and the P3 node is concatenated with the middle P4 node, allowing rich information to be cross-learned and providing better shallow feature representation. To understand how the feature fusion network is improved, the formula below describes the feature fusion process for the middle layers of the simplified PANet at the P4 scale:
$P_{4}^{out} = \mathrm{Conv}\!\left(\mathrm{Concat}\!\left(P_{4}^{mid},\; \mathrm{Down}\!\left(P_{3}^{out}\right)\right)\right) \quad (17)$
In Eq. (17), the final $P_{4}^{out}$ is obtained by aggregating two different scales in the fusion network: the intermediate feature $P_{4}^{mid}$ from the top-down path and the downsampled output $P_{3}^{out}$ of the shallower level. This bottom-up construction in PANet enriches the feature representation. However, as information loss becomes more serious in the deeper layers of the network, the extracted features of small-scale objects may vanish. Hence, Bi-FPN allows the original input of the same scale to be aggregated as well in order to improve the representations, which can be described as follows:
$P_{4}^{out} = \mathrm{Conv}\!\left(\mathrm{Concat}\!\left(P_{4}^{in},\; P_{4}^{mid},\; \mathrm{Down}\!\left(P_{3}^{out}\right)\right)\right) \quad (18)$
As indicated in Eq. (18), the final $P_{4}^{out}$ is enriched with spatial information from the original input $P_{4}^{in}$ of the feature fusion network. This study further proposes a cross-interaction at the middle layers of the fusion network that aggregates the shallow scale with the deeper one, thereby providing and maintaining precise spatial information about the target objects throughout the network. This can be described as:
$P_{4}^{out} = \mathrm{Conv}\!\left(\mathrm{Concat}\!\left(P_{4}^{cross},\; P_{4}^{mid},\; \mathrm{Down}\!\left(P_{3}^{out}\right)\right)\right) \quad (19)$
where $P_{4}^{cross}$ can be written as:

$P_{4}^{cross} = \mathrm{Concat}\!\left(P_{4}^{in},\; \mathrm{Down}\!\left(P_{3}^{in}\right)\right) \quad (20)$
In Eq. (19), $P_{4}^{cross}$ is used instead of $P_{4}^{in}$; it is the combination produced by the cross-interaction with the shallow-scale feature, as indicated in Eq. (20). The SB-FPN is proposed to extract, provide, and maintain precise spatial information about target objects of various scales, especially small ones, in order to improve the feature representation passed to the final detection heads. As a result, the proposed SB-FPN enhances gradient flow by increasing the number of paths and facilitating more stable gradient propagation, leading to faster convergence and improved training stability. Furthermore, with the shallow nodes connected, feature propagation is denser and more interconnected across all levels, which can significantly improve the model's ability to capture fine-grained details in remote sensing object detection scenarios.
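As a concrete illustration, the sketch below implements the P4-node fusion of Eqs. (19)-(20) in PyTorch. The channel counts and the use of stride-2 convolutions for the Down operation are assumptions made for the example; the actual SB-FPN layers follow Fig. 3.

```python
# Illustrative sketch, not the authors' code: SB-FPN cross-scale fusion at P4.
import torch
import torch.nn as nn

class SBFPNP4Fusion(nn.Module):
    """Builds P4_out from P4_in, P4_mid, P3_in, and P3_out (Eqs. (19)-(20))."""
    def __init__(self, c3: int, c4: int):
        super().__init__()
        # Stride-2 convolutions stand in for the "Down" operations.
        self.down_p3_in = nn.Conv2d(c3, c4, 3, stride=2, padding=1)
        self.down_p3_out = nn.Conv2d(c3, c4, 3, stride=2, padding=1)
        self.fuse = nn.Conv2d(4 * c4, c4, 3, padding=1)  # Conv over the concatenation

    def forward(self, p3_in, p3_out, p4_in, p4_mid):
        p4_cross = torch.cat([p4_in, self.down_p3_in(p3_in)], dim=1)             # Eq. (20)
        merged = torch.cat([p4_cross, p4_mid, self.down_p3_out(p3_out)], dim=1)  # Eq. (19)
        return self.fuse(merged)

# Example shapes for a 640x640 input: P3 is P/8 (80x80), P4 is P/16 (40x40).
fusion = SBFPNP4Fusion(c3=128, c4=256)
p3_in, p3_out = torch.randn(1, 128, 80, 80), torch.randn(1, 128, 80, 80)
p4_in, p4_mid = torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40)
print(fusion(p3_in, p3_out, p4_in, p4_mid).shape)  # torch.Size([1, 256, 40, 40])
```

The same pattern is applied one level lower, where the P2 (P/4) features cross-interact with the middle P3 node.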
Shallow-level optimized reparameterization architecture detector (SORA-DET)
Fig. 4 [Images not available. See PDF.]
Architecture of SORA-DET. The PRepConvBlock serves as the main feature extraction and aggregation in both the backbone and neck of the one-stage detectors while having SB-FPN adopted as the feature fusion network, which allows the interaction between different feature scales to deliver rich information from shallow scales to the later ones.
With the PRepConvBlock and the SB-FPN feature fusion network in place, the proposed one-stage object detector, named SORA-DET, is constructed. The architecture of the proposed method is illustrated in Fig. 4. This one-stage object detector is designed specifically for UAV remote sensing object detection, meeting the requirements of high performance, reliability, and efficient computation so that it can run on low-platform devices.
The proposed architecture adopts the SB-FPN as its main feature fusion network, with four detection heads from P2 to P5, allowing it to detect a wide range of object scales from P/4 to P/32. In this network, the P2 or P/4 level is a key component for feature extraction, as it is one of the scales that contains the most information on target objects, especially in remote sensing tasks. The cross-interaction between the shallow scales P2 and P3 and between P3 and P4 in SB-FPN therefore provides more features, enriching the representation for small and tiny object detection. To expand the model's learning ability on the scattered, sparse information of remote sensing objects, PRepConvBlocks are used throughout the architecture, from the backbone for feature extraction to the neck for feature aggregation. With the reparameterization technique, the block extends its convolutions to learn the target feature representations during training, while in the inference phase it merges the parallel convolutional weights into a single kernel, removing redundant gradient paths, increasing inference speed, and saving parameters. Furthermore, this block makes only half of the features learnable rather than every single feature, enabling the efficient use of a larger kernel size. By using the PRepConvBlock, the proposed architecture can thus enhance its gradient flow and improve generalization while maintaining a significantly larger receptive field, which is beneficial for forming interactions over bigger feature regions and improving the detection of target objects in remote sensing tasks. The proposed one-stage object detector also inherits several modules from the original YOLOv9 model, including ADown and SPPELAN, the latter combining spatial pyramid pooling with an efficient layer aggregation network. Specifically, ADown is an enhanced downsampling module that employs average pooling and max pooling as the main mechanisms to scale down the features rather than convolution alone, reducing complexity and parameters while preserving important spatial characteristics. This downsampling strategy enables the network to capture essential hierarchical features while maintaining a balanced trade-off between accuracy and computational cost. SPPELAN is an improved version of the SPP module used in earlier one-stage detectors such as YOLOv5 and YOLOv7. It extracts multi-scale features using different receptive fields while enhancing feature propagation and gradient flow efficiency through the ELAN structure. Hence, it improves localization and scale awareness, yielding better performance in object detection tasks, especially in dense scenes and with small objects. To conclude, the proposed architecture offers several key advantages. Firstly, by employing SB-FPN as the main feature fusion network, the architecture aggregates information from shallow scales and further enriches its feature representation with bidirectional connections, enhancing the ability to detect objects of various scales in complex scenes, especially small objects in UAV remote sensing images. Secondly, the PRepConvBlock enhances the model's detection ability and maintains large receptive fields for capturing small-object features thanks to the reparameterization technique and partial feature learning.
Finally, the balanced combination of lightweight convolutional blocks and the proposed feature fusion network allows the model to achieve high detection accuracy while keeping computational requirements manageable, making it suitable for deployment on small devices like UAVs for remote sensing object detection applications.
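The pooling-based downsampling described above can be sketched as follows. This is a rough reconstruction from the textual description rather than the official YOLOv9 ADown code, so the exact channel split, kernel sizes, and pooling parameters may differ from the module used in SORA-DET.

```python
# Illustrative sketch of ADown-style downsampling: average and max pooling
# replace a single strided convolution to halve the spatial resolution cheaply.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingDownsample(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        half_in, half_out = c_in // 2, c_out // 2
        self.conv_avg = nn.Conv2d(half_in, half_out, 3, stride=2, padding=1)
        self.conv_max = nn.Conv2d(half_in, half_out, 1)

    def forward(self, x):
        x = F.avg_pool2d(x, kernel_size=2, stride=1)       # light smoothing before the split
        x1, x2 = x.chunk(2, dim=1)                         # split channels into two branches
        x1 = self.conv_avg(x1)                             # strided 3x3 branch
        x2 = self.conv_max(F.max_pool2d(x2, 3, stride=2, padding=1))  # max-pool branch
        return torch.cat([x1, x2], dim=1)                  # both branches at half resolution
```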
Experiments
This section evaluates the performance of the proposed SORA-DET against other large-scale one-stage detectors, including YOLOv5x2, YOLOv8x36, YOLOv9e37, YOLOv10x38, and YOLOv11x39, as well as several SOTA works in the field of remote sensing object detection. The evaluation metrics assess model complexity, computational efficiency, and detection performance.
Experimental metrics
Both performance and complexity are evaluated for every experimental model. Precision, recall, and mean average precision (mAP) serve as performance metrics, while the number of parameters and the number of floating-point operations (FLOPs) are used to estimate complexity. The following formulas indicate how the performance metrics are calculated.
Precision is an essential metric that indicates the proportion of correct predictions among all predictions. It is widely applied in various tasks to assess the overall performance of machine learning and deep learning models. Let TP (true positive) denote a correct prediction of the model and FP (false positive) a wrong prediction. The precision metric is then calculated as follows:
$\mathrm{Precision} = \dfrac{TP}{TP + FP} \quad (21)$
Recall is another essential metric for evaluating the robustness of the model. It measures how effective the model is at detecting objects by computing the ratio of correct predictions to all actual observations. Let FN denote a false negative; the recall metric is then obtained as follows:
$\mathrm{Recall} = \dfrac{TP}{TP + FN} \quad (22)$
Finally, mean average precision (mAP) is a widely adopted performance metric for object detection tasks. It assesses model performance over a range of Intersection over Union (IoU) thresholds, considering both precision and recall. The formulas for mAP are as follows:
$mAP = \dfrac{1}{S} \sum_{k=1}^{S} AP_{k} \quad (23)$

$AP_{k} = \sum_{s} \left(R_{s+1} - R_{s}\right) P\!\left(R_{s+1}\right) \quad (24)$
where S is the total number of classes and $AP_{k}$ is the average precision of class k at a given IoU threshold, while $R_{s}$ and $R_{s+1}$ are recall values at consecutive thresholds and $P(R_{s+1})$ indicates the precision at the recall level $R_{s+1}$. In this study, the thresholds mAP@0.5, mAP@0.75, and mAP@[.5:.95] are used for evaluation. For convenience, mAP50, mAP75, and mAP50-95 are used as the metric symbols for the rest of the article.
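A minimal sketch of the AP computation in Eq. (24) is shown below: given per-detection confidence scores and TP/FP flags already matched at a fixed IoU threshold, it accumulates precision over increments of recall. It is purely illustrative and is not the evaluation code used for the reported results.

```python
# Illustrative AP computation for one class at one IoU threshold.
import numpy as np

def average_precision(scores, is_tp, num_gt):
    order = np.argsort(-np.asarray(scores))           # rank detections by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(num_gt, 1)                  # Eq. (22) over the ranked list
    precision = cum_tp / (cum_tp + cum_fp)            # Eq. (21) over the ranked list
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):               # sum (R_{s+1} - R_s) * P(R_{s+1})
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# Example: four detections for one class, three ground-truth boxes.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_gt=3))
```

Averaging such AP values over all classes gives Eq. (23), and repeating the procedure over IoU thresholds from 0.5 to 0.95 yields mAP50-95.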
Experimental datasets
All the models are trained and assessed on two different datasets for remote sensing object detection: the VisDrone2019 dataset43 and the SeaDroneSeeV2 dataset44.
VisDrone2019 is a medium-scale dataset created by the AISKYEYE team at Tianjin University. It is designed specifically for remote sensing tasks, from object detection to object tracking. The data were collected in 14 cities, and all images were captured by UAVs in both urban and suburban areas under different weather and lighting conditions. There are 10 classes in total, with object scales ranging from tiny to medium-sized. Furthermore, this dataset contains considerable motion blur, occlusion, and dense object instances. In summary, it is one of the most challenging datasets for remote sensing tasks, with several benchmarking works carried out every year. To convey the object sizes and class distribution of this dataset, Fig. 5 presents some of its key information. The dataset is already split by the original authors into 6471 images for the training set, 548 images for the validation set, and 1610 images for the test set. Table 1 below indicates the ratio of each set in VisDrone2019.
SeaDroneSeeV2 is another comprehensive dataset, designed mainly for aerial marine object detection. It was created by Varga et al. for open-water scenarios and is used for search-and-rescue (SAR) purposes. Although the dataset contains many high-resolution images, it is challenging due to sun glare, waves, and a highly imbalanced class distribution. There are 6 classes in total with a high dispersion of object sizes, which burdens the detection heads of advanced one-stage object detectors. Fig. 5 also illustrates the key information of SeaDroneSeeV2. The original authors split this dataset into three sets: 8930 images for training, 1547 for validation, and 3750 for testing. However, as the test-set labels are unavailable, only the SeaDroneSeeV2 validation set is used for evaluation. Table 1 indicates the ratio of each set for SeaDroneSeeV2.
Fig. 5 [Images not available. See PDF.]
Several detection results of the one-stage detectors. It can be seen that the SORA-DET can detect numerous small-scale objects under several complex scenarios. These results indicate the robustness of the proposed one-stage detector in remote sensing object detection.
Table 1. The ratio of each set in two different experimental datasets.
Dataset | Training set | Validation set | Test set |
|---|---|---|---|
VisDrone201943 | 6471 (74.99%) | 548 (6.35%) | 1610 (18.65%) |
SeaDroneSeeV244 | 8930 (62.76%) | 1547 (10.87%) | 3750 (26.35%) |
Experimental environments and hyperparameter setups
All the experiments are conducted on a single machine powered by an Intel Core i7 12700K CPU, 32 GB of RAM, and an RTX 3090 GPU with 24 GB of VRAM. The software environment consists of Python 3.10.14, PyTorch 2.4.1 with CUDA 12.8, and Ubuntu 23.0. Anaconda is used to create an isolated environment for both training and evaluation.
To provide a fair comparison, all the models are trained and evaluated using the hyperparameter settings indicated in Table 2. No pretrained weights are used during the training phase for any of the experimental models.
Table 2. Hyperparameter setups for one-stage detectors.
Hyperparameter | Value |
|---|---|
Learning rate | 0.01 |
Batch size | 4 |
Momentum | 0.937 |
Weight decay | 0.0005 |
Optimizer | SGD |
For each dataset, because of limited resources, the number of epochs and the input size for training and testing differ, as indicated in Table 3. Using different input sizes allows the study to evaluate the ability of the proposed model on various scales of target objects in remote sensing object detection tasks.
Table 3. Number of training epochs and input size setups of both datasets for one-stage detectors.
Hyperparameter | VisDrone2019 | SeaDroneSeeV2 |
|---|---|---|
Input size | 640 | 1024 |
Epochs | 150 | 70 |
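For reference, the settings of Tables 2 and 3 can be consolidated into a single configuration, as sketched below. The dictionary keys are illustrative; the paper does not specify the training scripts or argument names actually used.

```python
# Consolidated training configuration (illustrative key names).
train_config = {
    "optimizer": "SGD",
    "learning_rate": 0.01,
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "batch_size": 4,
    "pretrained": False,  # no pretrained weights for any experimental model
    "datasets": {
        "VisDrone2019": {"input_size": 640, "epochs": 150},
        "SeaDroneSeeV2": {"input_size": 1024, "epochs": 70},
    },
}
```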
Experimental ablation study results
This experimental section assesses the contribution of each introduced component of the SORA-DET architecture. Table 4 below indicates the overall results.
Table 4. The ablation study result of each component in the SORA-DET architecture on the VisDrone2019 test set.
Model | Detection framework | Precision (%) | Recall (%) | mAP50 (%) | mAP50-95 (%) | Parameters (Million) | Inference speed (ms) |
|---|---|---|---|---|---|---|---|
YOLOv9 | PANet | 43.8% | 33.2% | 31.6% | 18.1% | 7.17 M | 3.0 ms |
YOLOv9 + PRepConvBlock | PANet | 46.8% | 36.5% | 34.6% | 20.1% | 6.28 M | 3.9 ms |
YOLOv9 + PRepConvBlock + P2 | PANet | 51.3% | 39.2% | 39.1% | 22.7% | 6.53 M | 5.4 ms |
YOLOv9 + PRepConvBlock + P2 | Bi-FPN | 49.5% | 39.9% | 39.2% | 22.7% | 6.68 M | 5.4 ms |
YOLOv9 + PRepConvBlock + P2 | SB-FPN | 50.4% | 40.1% | 39.3% | 22.9% | 6.94 M | 5.4 ms |
As illustrated, the baseline is the small-scale YOLOv9 version, showing reasonable performance with 31.6% mAP50 while maintaining an extremely fast inference time of 3.0 ms. Although this version is attractive because of its inference speed, indicating low computational complexity, it is not optimal for small object detection. Hence, the YOLOv9+PRepConvBlock version is introduced. By having the PRepConvBlock replace all the RepNCSPELAN blocks in the backbone and neck networks, the performance of the architecture increases remarkably to 34.6% mAP50 and 20.1% mAP50-95, while the number of parameters decreases from 7.17 million to 6.28 million. The PRepConvBlock thus proves its ability in remote sensing object detection, especially small object detection, by combining reparameterization with partial feature utilization, which allows the block to use larger kernel sizes to form relationships between features and to extend its learning capability during training to tackle sophisticated circumstances such as nighttime scenes or small-scale target objects. Further acknowledging the limitation of deep scales in small object detection, the P2 (P/4) scale is added to the architecture to provide rich spatial information from the shallow scale to the aggregation phase of the model. As a result, the performance of the architecture rises significantly to 39.1% mAP50 and 22.7% mAP50-95, with only a slight increase in parameters from 6.28 million to 6.53 million. However, the inference time also shows a drawback, increasing to 5.4 ms compared with 3.0 ms for the original YOLOv9. These results indicate the crucial role of the shallow scale in small object detection, especially in remote sensing. Nonetheless, at this stage the network used in the architecture is the simplified PANet, which is optimized for computational efficiency but struggles with small receptive fields, a drawback for small object detection. Hence, Bi-FPN is first introduced to replace the simplified PANet; this version brings a slight improvement in mAP50, to 39.2%, while keeping the same inference time as the previous version and increasing the parameter count to 6.68 million. Despite the marginal benefit, these results indicate the potential of applying more advanced networks to improve feature aggregation. Therefore, the SB-FPN is introduced, the improved version with cross-interaction between shallow scales, forming the final SORA-DET, which shows the highest recall in the comparison, 40.1%, highlighting its ability to detect objects of various sizes. This last version also brings a small increase in complexity, with parameters reaching 6.94 million, while maintaining the same inference time as the simplified-PANet version of the architecture.
Experimental performance results on VisDrone2019 dataset
This experimental section evaluates the performance of the proposed SORA-DET against other SOTA one-stage detectors on one of the most challenging datasets in remote sensing object detection, the VisDrone2019 dataset. Table 5 indicates the overall performance of the models on the VisDrone2019 test set.
Table 5. The performance of one-stage detectors on the VisDrone2019 test set.
Model | Precision (%) | Recall (%) | mAP50 (%) | mAP75 (%) | mAP50-95 (%) |
|---|---|---|---|---|---|
YOLOv5x2 | 49.0% | 38.3% | 36.6% | 22.1% | 21.5% |
YOLOv8x36 | 51.4% | 36.8% | 36.7% | 22.3% | 21.5% |
YOLOv9e37 | 51.6% | 39.4% | 38.6% | 23.7% | 22.8% |
YOLOv10x38 | 48.4% | 37.3% | 36.2% | 22.1% | 21.3% |
YOLOv11x39 | 49.1% | 38.1% | 36.9% | 22.0% | 21.4% |
SORA-DET (Our) | 50.4% | 40.1% | 39.3% | 23.5% | 22.9% |
As illustrated in the table, YOLOv5x attains average performance with 36.6% mAP50 and 21.5% mAP50-95, while its recall reaches 38.3%, which is even higher than that of YOLOv8x (36.8%). YOLOv8x brings only a slight improvement in mAP50, from 36.6% to 36.7%, while matching YOLOv5x on mAP50-95. YOLOv9e reaches the highest performance among the YOLO family, with an mAP50 of 38.6% and a recall of 39.4%. These results indicate that YOLOv9e is more robust than the earlier versions in handling the complex cases that arise in remote sensing scenarios. In contrast, although they are later members of the YOLO family, YOLOv10x and YOLOv11x reach lower performance than YOLOv9e, and even lower than YOLOv5x in some metrics. It is therefore noticeable that the self-attention components in these YOLO variants are not as effective as expected for remote sensing object detection, as they do not reach the desired results. Finally, our proposed SORA-DET delivers the best performance in most metrics except precision. The introduced one-stage detector reaches 39.3% mAP50 and 22.9% mAP50-95, indicating its ability to detect objects of various sizes and conditions in remote sensing images. In summary, SORA-DET, with its innovations from the feature extraction block to the aggregation network, has proven its ability to extend model capacity for small object detection under the sophisticated circumstances of remote sensing. To further demonstrate the robustness of the proposed SORA-DET compared with other one-stage detectors, Fig. 6 below illustrates the detection results of the proposed detector and the other models, indicating the potential of the introduced methods for remote sensing object detection tasks.
Fig. 6 [Images not available. See PDF.]
Several detection results of the one-stage detectors. It can be seen that the SORA-DET can detect numerous small-scale objects under several complex scenarios. These results indicate the robustness of the proposed one-stage detector in remote sensing object detection.
Experimental performance results on the SeaDroneSeeV2 dataset
To further validate the performance of SORA-DET on remote sensing object detection, the model is trained and evaluated on the SeaDroneSeeV2 dataset alongside other large-scale one-stage detectors. Table 6 indicates the overall performance of each model on the SeaDroneSeeV2 validation set.
Table 6. The performance of one-stage detectors on the SeaDroneSeeV2 validation set.
Model | Precision (%) | Recall (%) | mAP50 (%) | mAP75 (%) | mAP50-95 (%) |
|---|---|---|---|---|---|
YOLOv5x2 | 85.2% | 74.2% | 79.2% | 49.1% | 48.0% |
YOLOv8x36 | 86.6% | 73.5% | 78.5% | 49.7% | 47.5% |
YOLOv9e37 | 87.5% | 76.1% | 81.5% | 53.0% | 50.1% |
YOLOv10x38 | 84.0% | 73.6% | 78.8% | 49.2% | 48.6% |
YOLOv11x39 | 84.4% | 76.3% | 80.2% | 51.3% | 49.0% |
SORA-DET (Our) | 90.1% | 79.6% | 84.0% | 53.7% | 51.9% |
In this table, YOLOv5x demonstrates solid performance with an mAP50 of 79.2% and an mAP50-95 of 48.0%. Despite this reasonable performance, it is still outperformed by later members of the YOLO family in mAP50-95, indicating difficulties in certain scenarios of remote sensing object detection. YOLOv8x improves on the previous version by achieving 86.6% precision, but sees a decrease in mAP50 and mAP50-95, from 79.2% to 78.5% and from 48.0% to 47.5%, respectively. These results illustrate that this model offers slightly higher precision at the cost of other metrics, which in turn degrades its ability to detect objects of various sizes in remote sensing tasks. Compared with the earlier models, YOLOv9e delivers the best performance, with its mAP50 reaching 81.5% and its mAP50-95 50.1%, indicating its ability to handle varied, sophisticated conditions better than the other models in the YOLO family. In contrast, YOLOv10x and YOLOv11x, the successors of YOLOv9e, perform below expectations: YOLOv10x offers an mAP50 of 78.8% and YOLOv11x only 80.2%, both lower than YOLOv9e, although YOLOv11x achieves a slightly higher recall. These results illustrate the unsuitability of the later YOLO family members for complex detection circumstances in remote sensing imagery. Finally, our proposed SORA-DET surpasses all the other large-scale one-stage object detectors with an mAP50 of 84.0% and an mAP50-95 of 51.9%. The proposed detector achieves the highest value on every metric, providing a consistent ability to detect small objects and objects in complex remote sensing conditions. In summary, these results once again indicate the significance of the innovations in SORA-DET, namely the efficient large-kernel utilization in PRepConvBlock and the robust shallow-scale aggregation network SB-FPN. To illustrate the performance of the proposed SORA-DET, Fig. 7 shows its detection results compared with other one-stage detectors on the SeaDroneSeeV2 dataset.
Fig. 7 [Images not available. See PDF.]
Several detection results of the introduced SORA-DET compared to other one-stage detectors. Numerous small-scale targeting objects can be spotted by the SORA-DET, which indicates its robustness in various remote sensing object detection scenarios.
Experimental feature map visualization results
Fig. 8 [Images not available. See PDF.]
Extracted feature maps of each one-stage detector. Obviously, the feature maps extracted by SORA-DET represent more potential features with detailed patterns of targeting objects, which are useful for the later aggregation phase and the detection head of the detector.
This section analyzes the feature maps extracted by the proposed SORA-DET and compares them to those of other large-scale one-stage detectors. Fig. 8 shows the extracted feature maps under various remote sensing scenarios.
It is evident that the proposed SORA-DET extracts more detailed patterns of the target objects at various scales, producing well-defined feature representations for the later aggregation phase and more robust features for the detection heads. In comparison, the other YOLO variants tend to produce more generalized but less precise feature maps. Although these maps still support detection, they lack the level of detail in complex patterns that is beneficial for recognizing small-scale target objects or detecting under complex conditions. Unlike the previous one-stage detectors, thanks to the PRepConvBlock, SORA-DET can capture larger interactions between features to form more precise spatial information about the target objects. This information can later be aggregated with channel information in the deeper stages of the model to form a better feature representation for the final detection heads, enhancing performance in detecting objects in remote sensing images. In conclusion, some YOLO variants, such as YOLOv9e or YOLOv11x, can highlight regions of potential objects but lack the ability to construct precise features. In contrast, the proposed SORA-DET demonstrates superior feature extraction, providing more detailed, well-refined information that improves the detection ability of the model, especially in complex remote sensing object detection scenarios.
Experimental complexity results
Fig. 9 [Images not available. See PDF.]
The complexity of various one-stage detectors, from resource consumption to computation cost. The SORA-DET consumes the fewest parameters with an extremely low inference time, making it more suitable for remote sensing object detection tasks on low-platform devices.
This experimental section evaluates the complexity of the proposed SORA-DET in terms of both computation cost and parameter consumption. Fig. 9 illustrates the inference time and the number of parameters of various one-stage object detectors.
It is obvious that our proposed SORA-DET not only has the smallest parameter count but also a significantly lower inference time, demonstrating its computational efficiency and modest resource consumption compared with other one-stage detectors. With only 6.94 million parameters and a 5.4 ms inference time, SORA-DET is an excellent option for remote sensing object detection tasks, particularly on low-platform devices where computational resources are limited. In comparison, most other advanced one-stage detectors have remarkably large parameter counts and long inference times, demonstrating their heavy resource usage. For instance, YOLOv5x consumes up to 142.46 million parameters, and YOLOv9e requires up to 58.15 million parameters, meaning SORA-DET uses roughly 88.1% fewer parameters. Although YOLOv11x offers a relatively good inference time of 7.7 ms, it requires up to 64.76 million parameters. It can be concluded that SORA-DET offers both the lowest parameter count and an exceptionally low inference time, making it the most suitable option for remote sensing object detection applications on low-platform devices.
Experimental performance with other SOTA works on the VisDrone2019 test set
This experimental section evaluates the trained SORA-DET against other SOTA works on the VisDrone2019 test set. Table 7 indicates the overall performance of each method, compared in terms of mAP50, mAP50-95, and the number of parameters. Several works do not report all the required metrics; the missing values are marked with "–".
Table 7. Comparison of the proposed SORA-DET with other SOTA works on the VisDrone2019 test set.
| Models | mAP50 (%) | mAP50-95 (%) | Parameters (million) |
|---|---|---|---|
| Drone-YOLO-medium17 | 38.9 | 22.5 | 33.9 |
| TA-YOLO-o26 | 37.7 | 21.9 | 21.4 |
| LUD-YOLO-S45 | 33.3 | 21.3 | 10.3 |
| GBS-YOLOv546 | 35.3 | 20.0 | – |
| OD-YOLO47 | 36.1 | – | – |
| Improved-YOLOv8n48 | 37.6 | – | 3.0 |
| MPE-YOLO49 | 37.0 | – | 4.4 |
| YOLO-HV50 | 38.1 | 19.9 | 38.5 |
| LEAF-YOLO51 | 39.1 | 22.0 | 4.28 |
| HIC-YOLOv552 | 35.0 | 19.2 | 9.3 |
| Faster RCNN + IT-transformer53 | 33.6 | 19.8 | – |
| Cascade RCNN + IT-transformer53 | 35.5 | 22.0 | – |
| Faster GDF54 | 31.8 | 17.7 | 48.9 |
| Cascade GDF54 | 31.7 | 18.7 | 76.7 |
| Free Anchor GDF54 | 28.5 | 15.9 | 44.0 |
| Grid GDF54 | 30.8 | 18.2 | 72.0 |
| SORA-DET (ours) | 39.3 | 22.9 | 6.94 |
As illustrated, several works achieve reasonable performance in mAP50 and mAP50-95 but at the cost of a significantly large parameter count, indicating high model complexity; examples include Drone-YOLO17, TA-YOLO26, and YOLO-HV50. Other works, however, obtain notable mAP50 results with low resource consumption, demonstrating their suitability for remote sensing object detection. GBS-YOLOv546, for instance, introduces shallow-scale P/4 feature aggregation as a separate module rather than as part of the feature pyramid network, thereby reducing the computation cost remarkably. Improved-YOLOv8n48, on the other hand, concentrates its modifications on the detection head of the nano-scale YOLOv8, keeping resource usage low while steadily improving detection performance. LEAF-YOLO51 integrates partial convolution11 into the Efficient Layer Aggregation Network (ELAN) of YOLOv75 to lower the computation cost of the convolution operations, reaching 39.1% mAP50 with only 4.28 million parameters. Noticeably, YOLO-HV50 incorporates a self-attention mechanism to dynamically re-weight the extracted features, which pushes its parameter count up to 38.5 million for a performance of 38.1% mAP50. In addition, Faster RCNN and Cascade RCNN equipped with the IT-transformer53 show only marginal improvements over other SOTA works, and the two-stage detectors enhanced with GDF54 obtain similar outcomes while suffering from high parameter counts. These results imply that even two-stage detectors and transformer-based architectures struggle to recognize small, noisy objects under the complex conditions of remote sensing images. Ultimately, the proposed SORA-DET outperforms all the other lightweight variants, reaching 39.3% mAP50 and 22.9% mAP50-95. Its detection ability is strengthened by the improved feature extraction block, which exploits large kernel sizes efficiently through reparameterization convolution and partial feature utilization, and by the SB-FPN, which replaces the conventional PANet and increases the robustness of feature fusion for the final detection heads. In summary, these results establish SORA-DET as a competitive solution for lightweight UAV-based remote sensing object detection.
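To make the cost argument behind partial convolution concrete, a back-of-the-envelope multiply-accumulate comparison is sketched below; the channel count and feature-map size are arbitrary illustration values, and the one-quarter channel fraction is an assumption chosen to mirror a commonly quoted setting for partial convolution11.

```python
def conv_macs(c_in: int, c_out: int, k: int, h: int, w: int) -> int:
    """Approximate multiply-accumulate count of a dense k x k convolution."""
    return c_in * c_out * k * k * h * w

# Arbitrary illustration values for one feature map.
C, H, W = 256, 80, 80
dense   = conv_macs(C, C, 3, H, W)            # ordinary 3x3 convolution over all channels
partial = conv_macs(C // 4, C // 4, 3, H, W)  # 3x3 convolution over a quarter of the channels

print(dense / partial)   # 16.0 -> the partial branch costs roughly 1/16 of the dense one
```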
Limitations and future works
Fig. 10 [Images not available. See PDF.]
Although SORA-DET retains more precise spatial information of the targeted objects, it lacks sufficiently robust features to raise the model’s confidence during the detection phase.
Although SORA-DET delivers exceptional performance in remote sensing object detection, its resource consumption is still moderate compared to several SOTA works. In addition, the SB-FPN brings only a marginal improvement while adding extra parameters, which is not optimal when running on extremely resource-constrained devices. Further work is therefore needed to address these drawbacks. For instance, the SB-FPN could be extended to cross-interact with all scales, from the shallowest to the deepest, so that the final feature fusion provides more precise representations and improves the recall of the detection heads; a sketch of the kind of fusion node this extension could build on follows this section. In addition, the proposed method was trained and compared under a single hyperparameter setup with SGD, so later studies examining the behavior of SORA-DET with other optimizers are needed to understand the model better. While the PRepConvBlock demonstrates robust feature extraction, further work could reduce its resource consumption without affecting performance. As shown in Fig. 10, although the proposed block extracts more precise spatial information about the targeted objects, it treats the remaining features equally, which improves detection but introduces a trade-off in the model’s confidence scores. In summary, SORA-DET can be further applied to other object detection tasks, and additional studies should be conducted to address the problems identified in this work.
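As a pointer for the suggested SB-FPN extension, the sketch below shows a fast normalized fusion node in the style of Bi-FPN, which the shallow/deep cross-interaction discussed here could build on; the class name, upsampling choice, and layer configuration are illustrative assumptions and do not reproduce the actual SB-FPN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion of a shallow and a deeper feature map,
    the kind of node a shallow/deep cross-interaction could reuse."""

    def __init__(self, channels: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))        # learnable non-negative fusion weights
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Upsample the deeper (lower-resolution) map to the shallow map's spatial size.
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="nearest")
        w = F.relu(self.w)
        w = w / (w.sum() + self.eps)                # normalize weights to sum to ~1
        return self.conv(w[0] * shallow + w[1] * deep)
```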
Conclusion
This study has proposed a novel one-stage object detector, SORA-DET, for detecting objects in remote sensing images. The detector incorporates several innovations, from the feature extraction block to the feature fusion network. Firstly, the feature extraction block PRepConvBlock combines partial feature utilization and reparameterization convolution to enable significantly larger kernel sizes, forming longer-range interactions between features without additional computation cost. Secondly, the feature fusion network SB-FPN, based on Bi-FPN, aggregates the shallow scale with the deeper one so that rich spatial information from the shallow scale is preserved during feature aggregation, leading to more precise feature representations for the detection heads. As a result, the proposed SORA-DET outperforms most other large-scale one-stage detectors on two challenging remote sensing datasets, VisDrone2019 and SeaDroneSeeV2, reaching 39.3% and 84.0% mAP50, respectively, while consuming only 6.94 million parameters and 5.4 ms of inference time. The proposed detector also shows competitive results against other SOTA works, maintaining high detection performance with remarkably low complexity. These results illustrate the potential of applying SORA-DET to the challenges of remote sensing object detection on resource-constrained devices.
Acknowledgements
This work is supported by Osaka Metropolitan University and Ho Chi Minh City Open University.
Author contributions
M.T.P.N. conducted methodology, original draft, formal analysis and validation; Q.D.N.N. conducted methodology, original draft; M.K.P.T. conducted editing, supervision; T.H.T. conducted editing and supervision; T.N. conducted supervision; H.V.A.L. conducted methodology, original draft, validation and editing. All authors read, reviewed and agreed to the published version of the manuscript.
Data availability
All datasets used in the study are public and accessible, including VisDrone2019 (https://github.com/VisDrone/VisDrone-Dataset), SeaDroneSeeV2 (https://seadronessee.cs.uni-tuebingen.de/).
Declarations
Competing interests
The authors declare no conflicts of interest.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Zhu, X., Lyu, S., Wang, X. & Zhao, Q. Tph-yolov5: Improved yolov5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2778–2788 (2021).
2. Ultralytics. YOLOv5: A state-of-the-art real-time object detection system. https://docs.ultralytics.com (2021).
3. Zhao, L. & Zhu, M. Ms-yolov7: Yolov7 based on multi-scale for object detection on UAV aerial photography. Drones 7, 188 (2023).
4. Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022 (2021).
5. Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7464–7475 (2023).
6. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
7. Hu, M. et al. Efficient-lightweight yolo: Improving small object detection in yolo for aerial images. Sensors 23, 6423 (2023).
8. Liu, C., Yang, D., Tang, L., Zhou, X. & Deng, Y. A lightweight object detector based on spatial-coordinate self-attention for UAV aerial images. Remote Sens. 15, 83 (2022).
9. Alzubaidi, L. et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J. Big Data 8, 1–74 (2021).
10. Chen, H., Chu, X., Ren, Y., Zhao, X. & Huang, K. Pelk: Parameter-efficient large kernel convnets with peripheral convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5557–5567 (2024).
11. Chen, J. et al. Run, don’t walk: chasing higher flops for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12021–12031 (2023).
12. Ding, X. et al. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13733–13742 (2021).
13. Liu, S., Qi, L., Qin, H., Shi, J. & Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8759–8768 (2018).
14. Tan, M., Pang, R. & Le, Q. V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10781–10790 (2020).
15. Li, R., Gao, Y. & Zhang, R. An improved yolov5-based small target detection method for uav aerial image. In Chinese Intelligent Automation Conference, 298–312 (Springer, 2023).
16. Li, S., Yang, X., Lin, X., Zhang, Y. & Wu, J. Real-time vehicle detection from UAV aerial images based on improved yolov5. Sensors 23, 5634 (2023).
17. Zhang, Z. Drone-yolo: An efficient neural network method for target detection in drone images. Drones 7, 526 (2023).
18. Cao, J., Bao, W., Shang, H., Yuan, M. & Cheng, Q. Gcl-yolo: A ghostconv-based lightweight yolo network for UAV small object detection. Remote Sens. 15, 4932 (2023).
19. Han, K. et al. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1580–1589 (2020).
20. Li, Y. et al. Sod-yolo: Small-object-detection algorithm based on improved yolov8 for UAV images. Remote Sens. 16, 3057 (2024).
21. Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), 3–19 (2018).
22. Lyu, Y., Zhang, T., Li, X., Liu, A. & Shi, G. Lightuav-yolo: A lightweight object detection model for unmanned aerial vehicle image. J. Supercomput. 81, 105 (2025).
23. Zhu, P. et al. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7380–7399 (2021).
24. Kong, Y., Shang, X. & Jia, S. Drone-detr: Efficient small object detection for remote sensing image using enhanced rt-detr model. Sensors 24, 5496 (2024).
25. Zhao, Y. et al. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16965–16974 (2024).
26. Li, M., Chen, Y., Zhang, T. & Huang, W. Ta-yolo: a lightweight small object detection model based on multi-dimensional trans-attention module for remote sensing images. Complex Intell. Syst. 1–15 (2024).
27. Liu, Y., Li, Q., Yuan, Y. & Wang, Q. Single-shot balanced detector for geospatial object detection. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2529–2533 (IEEE, 2022).
28. Liu, Y., Li, Q., Yuan, Y., Du, Q. & Wang, Q. Abnet: Adaptive balanced network for multiscale object detection in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 60, 1–14 (2021).
29. Ma, J., Lin, M., Zhou, G. & Jia, Z. Joint image restoration for domain adaptive object detection in foggy weather condition. In 2024 IEEE International Conference on Image Processing (ICIP), 542–548 (IEEE, 2024).
30. Ge, Z., Liu, S., Wang, F., Li, Z. & Sun, J. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021).
31. Karaca, A. C. Robust and fast ship detection in sar images with complex backgrounds based on efficientdet model. In 2021 5th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), 334–339 (IEEE, 2021).
32. Dong, E., Zhang, Y. & Du, S. An automatic object detection and tracking method based on video surveillance. In 2020 IEEE International Conference on Mechatronics and Automation (ICMA), 1140–1144 (IEEE, 2020).
33. Zhang, Y., Wang, W., Ye, M., Yan, J. & Yang, R. Lga-yolo for vehicle detection in remote sensing images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. (2025).
34. Dong, Y., Xu, F. & Guo, J. Lkr-detr: Small object detection in remote sensing images based on multi-large kernel convolution. J. Real-Time Image Proc. 22, 46 (2025).
35. Chen, J., Hu, Z., Wu, W., Zhao, Y. & Huang, B. Lkpf-yolo: A small target ship detection method for marine wide-area remote sensing images. IEEE Trans. Aerospace Electron. Syst. (2024).
36. Jocher, G., Chaurasia, A. & Qiu, J. Ultralytics YOLOv8. https://github.com/ultralytics/ultralytics (2023).
37. Wang, C.-Y., Yeh, I.-H. & Mark Liao, H.-Y. Yolov9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision, 1–21 (Springer, 2024).
38. Wang, A. et al. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 37, 107984–108011 (2024).
39. Ultralytics. Ultralytics YOLOv11. https://github.com/ultralytics/ultralytics (2024).
40. Sun, Y., Liu, W., Gao, Y., Hou, X. & Bi, F. A dense feature pyramid network for remote sensing object detection. Appl. Sci. 12, 4997 (2022).
41. Xiao, Y. & Di, N. Sod-yolo: A lightweight small object detection framework. Sci. Rep. 14, 25624 (2024).
42. Lu, Y. & Sun, M. Lightweight multidimensional feature enhancement algorithm lps-yolo for UAV remote sensing target detection. Sci. Rep. 15, 1340 (2025).
43. Du, D. et al. Visdrone-det2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2019).
44. Varga, L. A., Kiefer, B., Messmer, M. & Zell, A. Seadronessee: A maritime benchmark for detecting humans in open water. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2260–2270 (2022).
45. Fan, Q., Li, Y., Deveci, M., Zhong, K. & Kadry, S. Lud-yolo: A novel lightweight object detection network for unmanned aerial vehicle. Inf. Sci. 686, 121366 (2025).
46. Liu, H. et al. Improved gbs-yolov5 algorithm based on yolov5 applied to UAV intelligent traffic. Sci. Rep. 13, 9577 (2023).
47. Bu, Y., Ye, H., Tie, Z., Chen, Y. & Zhang, D. Od-yolo: Robust small object detection model in remote sensing image with a novel multi-scale feature fusion. Sensors 24, 3596 (2024).
48. Xu, L., Zhao, Y., Zhai, Y., Huang, L. & Ruan, C. Small object detection in UAV images based on yolov8n. Int. J. Comput. Intell. Syst. 17, 223 (2024).
49. Su, J., Qin, Y., Jia, Z. & Liang, B. Mpe-yolo: Enhanced small target detection in aerial imaging. Sci. Rep. 14, 17799 (2024).
50. Xu, S., Zhang, M., Chen, J. & Zhong, Y. Yolo-hypervision: A vision transformer backbone-based enhancement of yolov5 for detection of dynamic traffic information. Egypt. Inform. J. 27, 100523 (2024).
51. Nguyen, H. H., Hoang, M. S. et al. Leaf-yolo: Lightweight edge-real-time small object detection on aerial imagery. Intell. Syst. Appl. 25, 200484 (2025).
52. Tang, S., Zhang, S. & Fang, Y. Hic-yolov5: Improved yolov5 for small object detection. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 6614–6619 (IEEE, 2024).
53. Wei, J., Wang, Q. & Zhao, Z. Interactive transformer for small object detection. Comput. Mater. Contin. 77 (2023).
54. Zhang, R., Shao, Z., Huang, X., Wang, J. & Li, D. Object detection in UAV images via global density fused convolutional network. Remote Sens. 12, 3140 (2020).
© The Author(s) 2025. This article is published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nc-nd/4.0/).