Small object detection is an essential but challenging task in computer vision. Transformer-based algorithms have demonstrated remarkable performance on computer vision tasks. Nevertheless, they suffer from inadequate feature extraction for small objects. Additionally, they are difficult to deploy on resource-constrained platforms due to their heavy computational burden. To tackle these problems, an efficient local-global fusion Transformer (ELFT) is proposed for small object detection, based on attention and a grouping strategy. Specifically, we first design an efficient local-global fusion attention (ELGFA) mechanism to extract sufficient location features and integrate detailed information from feature maps, thereby improving accuracy. In addition, we present a grouped feature update module (GFUM) to reduce computational complexity by alternately updating high-level and low-level features within each group. Furthermore, a context broadcast module (CB) is introduced to obtain richer context information, which further enhances the ability to detect small objects. Extensive experiments are conducted on three benchmarks, i.e., Remote Sensing Object Detection (RSOD), NWPU VHR-10 and PASCAL VOC2007, achieving 95.8%, 94.3% and 85.2% in mean average precision (mAP), respectively. Compared to DINO, the number of parameters is reduced by 10.4%, and the floating point operations (FLOPs) are reduced by 22.7%. The experimental results demonstrate the efficacy of ELFT in small object detection tasks while maintaining an attractive level of computational complexity.
1 Introduction
Object detection is an active topic in computer vision, with extensive applications in face detection, autonomous driving, and security monitoring. In particular, small object detection is a vital area of concern. It holds crucial application value across various fields, such as aerial image analysis, satellite remote sensing, medical image interpretation, and industrial automation. However, small object detection faces many challenges, such as a small proportion of pixels, insufficient semantic information, and occlusion in complex scenes. These challenges make it more difficult to detect small objects accurately. Additionally, deep neural networks for small object detection tend to degrade spatial information as it propagates through the network layers. The global dependency modeling of the Transformer [1] can mitigate this problem. However, Transformer-based methods encounter problems including substantial computational requirements and slow training. These issues hinder their ability to meet the demands of real-time detection and platforms with limited resources. Some approaches design networks by integrating local and global strategies, aiming to extract both global and local features. For example, the Fourier Neural Operator with Local Priors and Global Perceptron (FOLGP) proposed in [2] integrates Transformer modules to model global contextual information across frequency bands, enhancing the capture of correlations between different frequency components in the pantograph-catenary system (PCS). This integrated approach stands in contrast to other strategies: local feature extraction methods like the self-supervised pre-training method [3] employ domain-specific masking to retain local structural features of railway components, achieving high efficiency in small object detection. Meanwhile, multi-modal methods such as [4] fuse local cues from heterogeneous sensors (e.g., infrared/visible) to resolve ambiguities in complex scenes. However, these methods lack explicit mechanisms for modeling global system dynamics. Therefore, achieving efficient and accurate detection of small objects remains an open question.
The rise of convolutional neural networks has brought small object detection into the era of deep learning, significantly improving the intelligence and automation levels of small object detection. The improved Faster R-CNN [5] enhanced small object detection by adopting an Improved IoU loss for bounding box regression, bilinear-interpolated RoI pooling to reduce localization errors, and multi-scale feature fusion to strengthen representation capability. Building on this, the enhanced SSD [6] introduced feature cross-reinforcement, an improved group shuffling-efficient channel attention mechanism and an adaptive training sample selection algorithm to address the challenges of small object detection. Specifically for UAV aerial imagery, Chen et al. [7] proposed SOD-YOLOv7, which integrated Swin Transformer with a Bi-Level Routing Attention (BRA) mechanism within its feature extraction network. Along with the advancement of Transformers in natural language processing tasks, the Facebook team introduced the Detection Transformer (DETR) [8] object detection algorithm. DETR transformed object detection into a set prediction problem, opening up a novel approach to the field. Even though DETR eliminated manual processing steps such as anchor box design and Non-Maximum Suppression used in traditional object detection methods, it still has evident drawbacks, such as a slow convergence rate, low detection accuracy for small objects, and high computational resource requirements. DINO [9] introduced contrastive denoising training on the basis of Deformable DETR [10], DAB-DETR [11], and DN-DETR [12]. This innovative method accelerated convergence during training and enhanced the stability of the algorithm, laying the foundation for future research. However, it leaves much room for improving its detection capabilities, particularly for small objects. In this task, the problems of insufficient feature information and high computational cost must be solved simultaneously, which makes it more challenging.
To address this challenge, we explore the DINO-based object detection algorithm and propose an efficient local-global fusion Transformer (ELFT) for small object detection. ELFT integrates three novel components, i.e., the efficient local-global fusion attention module (ELGFA), the grouped feature update module (GFUM) and the context broadcast module (CB). Concretely, the ELGFA is designed to ensure accurate location information for the region of interest by independently processing feature vectors in the global, vertical, and horizontal directions. We further propose GFUM to increase encoder efficiency and reduce computational complexity by alternately updating high-level and low-level features. As for the CB module, it is employed to obtain global context and enhance the precision of detecting small objects. In summary, ELFT leverages attention mechanisms and a grouping strategy to trade off performance against computational complexity. The main contributions are summarized below:
1. ELFT is proposed to address the limitations of feature extraction networks in capturing long-range dependencies. Furthermore, it aims to mitigate the high computational complexity that is inherent in Transformer-based detection methods.
2. We introduce an efficient local-global fusion attention, a context broadcast module and a grouping feature update module to achieve a fine balance between high performance and reasonable computational cost.
3. Experiments are conducted on three public benchmarks: RSOD, NWPU VHR-10, and PASCAL VOC2007, and the results show that the proposed method improves detection accuracy while reducing the parameter count and computational complexity.
The remainder of this article is structured as follows. The “Related Work” section provides an overview of relevant literature. “Efficient Local-Global Fusion Transformer Algorithm” delves into the specifics of our proposed ELFT architecture. The “Experiments and Results” section presents the experimental results and their corresponding analysis. Lastly, the “Conclusion” section summarizes the key points discussed in this article.
2 Related works
In this part, we offer a brief overview of related research on small object detection methods based on DETR, lightweight methods based on DETR, and attention mechanisms.
2.1 Small object detection methods based on DETR
Effectively representing objects at diverse scales has remained a core challenge in object detection. This is especially true for small-scale objects, which occupy limited pixel space in an image and carry relatively little feature information. Therefore, it is crucial to fully explore and utilize the detailed information in the image. Table 1 summarizes studies on small object detection based on DETR. From the multi-scale features perspective, Zhu et al. [10] improved the accuracy of detecting small objects by adopting a multi-scale feature fusion strategy to merge information from different scales. On this basis, Cunha et al. [13] replaced the simple data augmentation in Deformable DETR with an AugMix-based augmentation technique, originally designed for classification, to adapt to small objects. To address the weak correlation between network layers in DETR and its poor performance in small object training and detection, Skip DETR [14] enhanced feature fusion through skip connections and multi-scale feature extraction to provide an efficient solution for forestry pest detection. In contrast, Cao et al. [15] designed a coarse-to-fine decoder layer structure, approaching the problem from the perspective of bounding box localization. The structure incorporated an adaptive scale fusion module to merge feature information from different scales using object query features, thus refining small object localization. Meanwhile, Hoanh et al. [16] introduced an object-focus network with a dual head, where one head served as a dedicated small object prediction module for obtaining coarse locations of small objects. From the perspective of the limitations of inductive bias, Dubey et al. [17] addressed DETR’s shortcomings in small object detection through feature fusion and normalized inductive bias design, while preserving the end-to-end advantage of set-based prediction.
[Figure omitted. See PDF.]
As analyzed above, these researchers primarily focused on multi-scale features, data augmentation techniques, coarse location acquisition and inductive bias design. However, the detection performance for small objects remains inadequate due to their low pixel ratio and frequent occlusion. Consequently, fully exploring and leveraging the feature information of small objects to enhance detection accuracy remains a critical research direction. In this paper, we strengthen the precision of detecting small objects by combining local and global contextual information, which is a feasible and efficient solution.
2.2 Lightweight methods based on DETR
The growing demand for object detection in practical applications, especially on resource-constrained platforms, has made lightweight object detection algorithms an urgent topic and a research hotspot in both academia and industry. Table 2 summarizes lightweight methods based on DETR. Sparse DETR [18] utilized the decoder’s cross-attention map (DAM) to guide feature selection, and filtered encoder outputs with learnable cross-attention to reduce computational cost. However, not all the feature vectors filtered by Sparse DETR corresponded to foreground regions. Based on this observation, Huawei Noah’s Ark Lab argued that although Sparse DETR used the DAM to supervise the foreground feature vectors, the DAM could introduce errors during training. Therefore, Focus-DETR [19] was proposed to better supervise the filtering of foreground feature vectors by utilizing ground-truth boxes and labels. It achieved better performance by introducing positioning and semantic information for multi-level semantic discrimination. Sun et al. [20] proposed Pruning DETR, which adjusted the importance of module outputs by leveraging a parameter scale factor and a sparse regularization term. It pruned the DETR structurally, resulting in a decreased computational burden and improved inference speed. Furthermore, Zheng et al. [21] introduced an efficient attention mechanism that combined linear attention and token pruning to speed up the Transformer model.
[Figure omitted. See PDF.]
The above methods reduced the computational load by filtering feature vectors, decreasing attention complexity, and pruning. However, they exhibited notable drawbacks in terms of detection accuracy. Thanks to the grouped feature update module, our algorithm improves the encoder’s efficiency and reduces computational complexity while maintaining the detection accuracy.
2.3 Attention mechanisms
The attention mechanisms simulate the human visual and cognitive processes, enabling the neural network to autonomously learn and dynamically concentrate on the critical parts of the input data, ultimately improving the algorithm’s effectiveness. The main structure of SENet [22] was the Squeeze-and-Excitation (SE) block. By incorporating the attention mechanism into the image channels, the algorithm paid more attention to the channel features containing a large amount of information. To exploit the spatial correlation details of the image, a model that analyzes pairwise pixel correlations and hierarchical statistics was established [23]. DANet [24] adopted both spatial and channel attention mechanisms to extract the spatial and channel information in the image, which improved the accuracy of segmentation. Even though they boosted performance, most of them ignored the width and height information of the object. To alleviate this issue, Zhang et al. [25] designed a Quadruple Attention module to extend the attention mechanism across the distinct dimensions of channel, position, height, and width, thus better representing the characteristics of small objects. Their Magnifying Glass module further emphasized small objects during detection and further improved detection performance. Multi-head self-attention (MSA) enables the model to learn diverse data representations within different subspaces, thereby obtaining various information. Hyeon-Woo et al. [26] conjectured that MSA tends to learn intensive interactions, but the gradient of intensive self-attention is steeper, leading to unstable training. Hence, they proposed the context broadcast module (CB) and dimension-scaled CBs to provide uniform attention, which effectively improved the performance of ViT. To focus attention on COVID-19 lesions, Zhou et al. [27] designed an SE-Res block by adding a residual connection and subsequently introduced it into ResNet. Srinivas et al. [28] integrated the self-attention mechanism into ResNet. The image was processed using convolutions to obtain localized information, and the global dependency was then modeled using the self-attention mechanism. This approach surpassed the baseline ResNet model in object detection performance.
As discussed above, the attention mechanism has been successfully integrated into ResNet. However, two major issues arise: on the one hand, the lack of long-range dependencies restricts the algorithm’s capacity to detect small objects, compromising both accuracy and reliability. On the other hand, its non-lightweight design makes it difficult to deploy on devices with constrained computing resources. To overcome such problems, this paper designs a lightweight ELGFA module with a simple structure and few parameters, which can effectively capture long-range spatial dependencies and accurately locate the target object.
3 Efficient local-global fusion transformer algorithm
In this part, we introduce the network architecture of ELFT. Subsequently, we offer a comprehensive explanation regarding the proposed ELGFA module, along with the GFUM, and CB module.
3.1 Network architecture
The Transformer model captures global dependencies within a sequence through its attention mechanism, and is primarily composed of encoders and decoders. The encoder consists of multiple stacked self-attention layers, each followed by a Feed-Forward Network (FFN) layer. The self-attention mechanism computes the relationship between every element in the sequence and all the others. This computation is described in Eq (1). The FFN layer consists of two linear transformations separated by an activation function.
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \quad (1)$$
For the input sequence X, the query matrix Q, the key matrix K, and the value matrix V are obtained through linear transformations. Then, the attention scores are calculated using the dot product of Q and K. A scaling factor $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors, is applied to mitigate the potential vanishing gradient issue in the softmax operation caused by large dot product values. The attention weights are computed by applying the softmax function to the attention scores. Ultimately, the final output is the weighted sum of V, where the weights are given by the attention weights.
The structure of the decoder resembles that of the encoder, but it has a cross-attention layer, which takes into account the output of the encoder. The definition of cross-attention is given by Eq (2):
$$\mathrm{CrossAttention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \quad (2)$$
where K and V represent the outputs from the encoder, while Q stems from the input of the decoder.
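As a concrete illustration of Eqs (1) and (2), the following minimal PyTorch sketch computes single-head scaled dot-product attention; the tensor sizes are arbitrary and the learned linear projections that normally produce Q, K and V are omitted for brevity, so this is a simplified sketch rather than the multi-head (deformable) attention actually used in the encoder and decoder.

import torch

def scaled_dot_product_attention(q, k, v):
    # q: (n_q, d_k), k: (n_kv, d_k), v: (n_kv, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # scaled dot products, Eq (1)
    weights = torch.softmax(scores, dim=-1)         # attention weights
    return weights @ v                              # weighted sum of the values

x = torch.randn(100, 256)                                 # encoder tokens
self_out = scaled_dot_product_attention(x, x, x)          # self-attention: Q, K, V from the same sequence

queries = torch.randn(900, 256)                           # decoder queries
cross_out = scaled_dot_product_attention(queries, x, x)   # cross-attention, Eq (2): K and V from the encoder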
DETR primarily consists of a backbone network, Transformer encoders, decoders, and an FFN structure. DINO is an improved method based on DETR. The specific detection process is depicted in Fig 1. When an image is received, multi-scale features are extracted by a backbone network such as ResNet or Swin Transformer. These features are then input into the Transformer encoder alongside their respective position embeddings. After encoding and enhancing the input features, the encoder initializes the decoder’s location queries by associating them with the positional information of the selected top-k features. The Transformer decoder receives both these queries and the content queries to assist in obtaining better location information. The decoder then uses deformable attention to integrate the encoder’s feature outputs and iteratively updates the queries layer by layer. The predicted bounding boxes generated by the decoder are matched with the actual object boxes. The Hungarian matching process involves calculating category loss, positioning loss, and GIoU loss separately after obtaining the predictions, along with the actual labels and boxes. These losses are then combined to form a cost matrix, which is used to match labels and prediction boxes. Concurrently, there is an additional branch for contrastive denoising training. Each actual box generates a positive sample and a negative sample: noise is added to the labels and boxes, and the boxes with smaller noise are marked as positive while those with larger noise are marked as negative, to avoid the same object being predicted repeatedly. Finally, the ultimate prediction box is determined jointly by the initial box and the predicted offset.

To mitigate the computational complexity and enhance the detection capability for small objects, we propose an improved DINO-based algorithm, named ELFT, as illustrated in Fig 2. To reduce overall computational demands, we first optimize the encoder and the backbone network. GFUM is designed to alternately refresh high-level and low-level features, thereby enhancing the encoder’s efficiency and minimizing the algorithm’s computational cost. During image processing, the backbone network ResNet50 suffers from insufficient context information acquisition. The loss of spatial information through multi-layer network transmission often results in missed and false detections of small objects. In addition, the network structure of ResNet50 is relatively complex, which requires more computing resources and time during training. To circumvent this, the ELGFA module is designed to replace the last two bottlenecks of ResNet50, thus diminishing the parameter count and computational cost while enhancing detection precision. Additionally, the multi-head self-attention mechanism tends to learn intensive interactions. When the attention weights are very close, the gradient of this intensive self-attention is steeper and the training process is more complicated. To this end, the CB module is introduced to provide uniform attention and help stabilize training. By broadcasting the global contextual information, it ensures that the entire input is considered comprehensively when each position is processed. This helps distinguish small objects from the background or other objects. The pseudo-code of the overall method is shown in Algorithm 1.
In contrast to DINO, ELFT achieves a reduction in both parameters and computational demands, and substantially boosts the effectiveness in detecting small objects.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Algorithm 1 ELFT.
Require: input RGB image I
Ensure: bounding boxes B and class labels C
1: MultiScaleFeatures ← ELGFA-ResNet50(I) ▷ backbone feature extraction
2: EncodedFeatures ← TransformerEncoder(MultiScaleFeatures, PositionEmbeddings) ▷ encoder with GFUM
3: TopKFeatures ← SelectTopK(EncodedFeatures) ▷ based on objectness score
4: AnchorQueries ← InitAnchors(TopKFeatures) ▷ dynamic anchor boxes
5: ContentQueries ← InitContent() ▷ static content queries
6: PositiveQueries ← AddNoise(GT_boxes, scale = λ1)
7: NegativeQueries ← AddNoise(GT_boxes, scale = λ2), with λ2 > λ1
8: CDNQueries ← Concat(PositiveQueries, NegativeQueries)
9: CurrentQueries ← Concat(CDNQueries, AnchorQueries, ContentQueries)
10: Decoder ← TransformerDecoder() with CB
11: for layer in Decoder do
12:  CurrentQueries ← DecoderLayer(CurrentQueries, EncodedFeatures)
13: end for
14: Predictions ← PredictionHeads(CurrentQueries)
15: B ← Predictions.boxes
16: C ← Predictions.classes
17: return B and C
3.2 Efficient local-global fusion attention
For the small object detection task, using ResNet50 as the backbone network for image feature extraction often leads to insufficient context information. As the network depth increases, the resolution and detailed information of the features tend to diminish, further compromising context acquisition. This results in an increased likelihood of false and missed detections of small objects. Moreover, the network structure of ResNet50 is relatively complex and has a large number of parameters, which demands more computing resources and time during training. Motivated by Efficient Local Attention [29], we introduce ELGFA. By pooling image features along the height, width, and global dimensions, ELGFA effectively integrates both local and global contextual information to enhance the input feature map, aiming to accurately identify regions of interest. The structure of the ELGFA module is presented in Fig 3, and its processing flow is as follows. Specifically, the input feature map x undergoes global average pooling, yielding a tensor of dimensions (batch_size, channels, 1, 1), which contains the global context information gc for each channel. The estimation of global context is expressed as Eq (3):
$$gc(x)_{b,c}=\frac{1}{h\times w}\sum_{i=1}^{h}\sum_{j=1}^{w}x_{b,c,i,j} \quad (3)$$
where x denotes the input feature map of size (b, c, h, w), where b, c, h, w refer to the batch_size, channels, height, and width of the feature map, respectively. gc(x)b,c is the result of global average pooling for channel c in sample b.
[Figure omitted. See PDF.]
After that, the input feature map x is average-pooled along the width and height directions, and the tensors $p_h$ and $p_w$ are obtained. Here, $p_h$ represents the local context along the height direction (H, 1), with dimensions (b, c, h). Similarly, $p_w$ denotes the local context along the width direction (1, W), with dimensions (b, c, w). They are calculated by Eqs (4) and (5), respectively.
$$p_h(x)_{b,c,i}=\frac{1}{w}\sum_{j=1}^{w}x_{b,c,i,j} \quad (4)$$
$$p_w(x)_{b,c,j}=\frac{1}{h}\sum_{i=1}^{h}x_{b,c,i,j} \quad (5)$$
A one-dimensional convolution with a kernel size of 7 is employed to effectively extract features while enhancing the horizontal and vertical positional information. Then, the feature vectors from the vertical and horizontal directions are processed by group normalization and a nonlinear activation function to generate bidirectional positional attention predictions. The whole process is defined by Eqs (6) and (7):
$$y_h=\sigma\left(Gn\left(F_h(p_h)\right)\right) \quad (6)$$
$$y_w=\sigma\left(Gn\left(F_w(p_w)\right)\right) \quad (7)$$
where $p_h$ and $p_w$ are the results of Eqs (4) and (5), respectively. $F_h$ and $F_w$ refer to the 1D convolutions in the vertical and horizontal directions, respectively. $Gn$ represents group normalization and $\sigma$ denotes the Sigmoid nonlinear activation function.
The weight of each position in the final feature map is determined by a combination of the global context information and the attention weights in the vertical and horizontal directions. The original input x is multiplied element-wise with the enhanced local contexts $y_h$ and $y_w$, along with the global context gc. This allows the algorithm to consider both local and global context information, thereby enhancing the feature representation, as shown in Eq (8):
$$y=x\odot y_h\odot y_w\odot gc(x) \quad (8)$$
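A minimal PyTorch sketch of the ELGFA computation in Eqs (3)–(8) is given below. The kernel size of 7 follows the text, while the use of depthwise 1D convolutions, the number of normalization groups, and other implementation details are assumptions made for illustration rather than the released implementation.

import torch
import torch.nn as nn

class ELGFA(nn.Module):
    # Efficient local-global fusion attention (sketch of Eqs 3-8).
    def __init__(self, channels, kernel_size=7, num_groups=16):
        super().__init__()
        pad = kernel_size // 2
        # F_h and F_w: 1D convolutions along the height and width directions
        self.conv_h = nn.Conv1d(channels, channels, kernel_size, padding=pad, groups=channels)
        self.conv_w = nn.Conv1d(channels, channels, kernel_size, padding=pad, groups=channels)
        self.gn = nn.GroupNorm(num_groups, channels)   # Gn
        self.sigmoid = nn.Sigmoid()                    # sigma

    def forward(self, x):                              # x: (b, c, h, w)
        b, c, h, w = x.shape
        gc = x.mean(dim=(2, 3), keepdim=True)          # Eq (3): global average pooling -> (b, c, 1, 1)
        p_h = x.mean(dim=3)                            # Eq (4): pool along the width  -> (b, c, h)
        p_w = x.mean(dim=2)                            # Eq (5): pool along the height -> (b, c, w)
        y_h = self.sigmoid(self.gn(self.conv_h(p_h))).view(b, c, h, 1)   # Eq (6)
        y_w = self.sigmoid(self.gn(self.conv_w(p_w))).view(b, c, 1, w)   # Eq (7)
        return x * y_h * y_w * gc                      # Eq (8): fuse local and global contexts

feat = torch.randn(2, 512, 32, 32)
out = ELGFA(512)(feat)                                 # output keeps the input shape (2, 512, 32, 32)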
ResNet50 mitigates the vanishing gradient problem in deep networks through residual learning, thereby enabling deeper architectures. Its hierarchical feature extraction structure generates multi-scale features that are crucial for object detection. By leveraging pretrained weights, the network achieves rapid convergence through optimal initialization. This residual mechanism strikes a balance between network depth and computational efficiency, ultimately achieving a trade-off between performance and overhead. Therefore, adopting ResNet50 as the backbone network represents an appropriate choice. However, a key limitation of ResNet50 lies in its local convolutional operations, which fail to capture distant contextual dependencies critical for small object detection. In this paper, we integrate the proposed ELGFA module into ResNet50 to capture global context information. Specifically, the convolution layer within the bottleneck of the last two layers of the S5 block is modified to form ELGFA-ResNet50 (see Table 3). It not only decreases the computational requirements of the feature extraction network, but also enhances the precision of detection.
[Figure omitted. See PDF.]
3.3 Grouped feature update module
The computational cost of the encoder in DINO accounts for 58.3% of the entire algorithm. Among these calculations, high-level features contribute only one-quarter of all processed tokens, while low-level features account for the remaining three-quarters. This is one of the reasons for the high computational cost of the encoder [30]. To deal with this issue, we propose GFUM and integrate it into the encoder to alternately update high-level and low-level features. The architecture of GFUM is displayed in Fig 4, and the pseudo-code for the alternate update process is presented in Algorithm 2. To be more specific, the six encoder layers are organized into three groups, with each group comprising two encoder layers. In the first encoder layer of each group, high-level features are used to query all tokens, updating their feature vectors while the low-level features remain unchanged. In this way, the number of queried features shrinks to one-fourth of the original, thus decreasing the overall computational burden. In the second encoder layer of each group, low-level features serve as queries over all tokens, enabling an update of their respective representations while preserving the integrity of the multi-scale features. Through this grouping and alternate updating strategy, efficient computation is achieved.
[Figure omitted. See PDF.]
Algorithm 2 Alternate update algorithm.
Require: set of input feature maps src, set of target feature maps tgt, total number of layers layers, number of groups g, position encoding pos, reference points ref
Ensure: set of processed feature maps src
1: for i ← 0 to layers − 1 do
2:  if i mod 2 = 0 then ▷ first layer of each group
3:   query ← high-level features of src
4:   query ← EncoderLayer(query, src, pos, ref)
5:   write query back into the high-level features of src
6:   keep the low-level features of src unchanged
7:  else ▷ second layer of each group
8:   query ← low-level features of src
9:   query ← EncoderLayer(query, src, pos, ref)
10:   write query back into the low-level features of src
11:  end if
12: end for
13: return src
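The grouped update can also be sketched in PyTorch. In the simplified code below, the flattened multi-scale tokens are split into a low-level part and a high-level part, and standard multi-head attention stands in for the deformable attention actually used; the token counts, layer width, and the placement of the high-level tokens at the end of the sequence are illustrative assumptions.

import torch
import torch.nn as nn

class QueryLayer(nn.Module):
    # Stand-in encoder layer: the chosen queries attend to all tokens (keys/values).
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.ffn = nn.Sequential(nn.Linear(d_model, 1024), nn.ReLU(), nn.Linear(1024, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, q, kv):                          # q: (n_q, 1, d), kv: (n_kv, 1, d)
        q = self.norm1(q + self.attn(q, kv, kv)[0])
        return self.norm2(q + self.ffn(q))

def grouped_feature_update(src, layers, num_high):
    # src: (num_tokens, 1, d); the last num_high tokens are the high-level features.
    for i, layer in enumerate(layers):
        if i % 2 == 0:                                 # first layer of each group: high-level queries
            updated = layer(src[-num_high:], src)
            src = torch.cat([src[:-num_high], updated], dim=0)
        else:                                          # second layer of each group: low-level queries
            updated = layer(src[:-num_high], src)
            src = torch.cat([updated, src[-num_high:]], dim=0)
    return src

tokens = torch.randn(1000, 1, 256)                     # 750 low-level + 250 high-level tokens, batch of 1
encoder = nn.ModuleList(QueryLayer() for _ in range(6))  # six layers form three groups of two
out = grouped_feature_update(tokens, encoder, num_high=250)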
3.4 Context broadcast module
The multi-scale deformable attention module in the Transformer architecture serves as the core component for capturing information from every position within an image. It facilitates intensive interactions by increasing the diversity of the attention mechanisms: each attention head independently learns a unique attention weight distribution to capture information from different areas of the image. However, this diversity is prone to producing a more complex distribution of attention weights, where each position may receive a relatively high attention weight, which potentially diminishes the learning efficiency of the algorithm.
To further improve the algorithm’s capability of perceiving global context information, we introduce a CB module into the FFN after the multi-scale deformable attention module, as shown in Fig 5. The CB module computes the average of all feature vectors in a layer and then rebroadcasts this global context information to each feature vector. In this manner, each feature vector acquires a context representation based on the average information from all feature vectors. The introduction of the CB module enables the algorithm to learn more intensive interactions: each vector considers not only its local context but also the global context. It is worth noting that CB improves the algorithm’s ability to accurately locate and identify small objects. In the meantime, the introduction of global information contributes to a stable training process.
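Since the CB operation reduces to adding the mean feature vector back to every token, it can be written in a few lines; the sketch below assumes tokens flattened to shape (batch, num_tokens, dim), and the exact insertion point inside the FFN follows the description above and [26].

import torch
import torch.nn as nn

class ContextBroadcast(nn.Module):
    # Adds the average of all feature vectors (global context) back to every token.
    def forward(self, x):                               # x: (batch, num_tokens, dim)
        context = x.mean(dim=1, keepdim=True)           # average over all feature vectors
        return x + context                              # broadcast the global context to each token

tokens = torch.randn(2, 1000, 256)
out = ContextBroadcast()(tokens)                        # same shape; every token now carries global context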
[Figure omitted. See PDF.]
4 Experiments and results
In this part, we conduct experiments on the RSOD, NWPU VHR-10, and PASCAL VOC datasets. Furthermore, we evaluate ELFT by comparing it with state-of-the-art methods. Additionally, we carry out ablation studies to assess the efficacy of the three modules we have introduced in ELFT.
4.1 Experiment details
In this paper, we implement ELFT based on the deep learning framework PyTorch. All experiments are conducted on a 12th Gen Intel Core i9-12900H CPU (2.50 GHz) and an NVIDIA GeForce RTX 3060 GPU. The implementation uses Python as the programming language, with Windows 11 as the operating system. During training, AdamW [31] is employed to optimize the network with a batch size of 1, a weight decay of 0.0001, and a training period of 12 epochs. The learning rate is initialized to 0.00001 and is decreased by a factor of 10 after 11 epochs.
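The optimizer and schedule above correspond to a standard AdamW setup; a minimal sketch is shown below, where the placeholder model and random inputs stand in for the ELFT network and the detection data loader, which are assumptions made only to keep the example self-contained.

import torch
import torch.nn as nn

model = nn.Linear(10, 4)   # placeholder for the ELFT network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)
# Drop the learning rate by a factor of 10 after 11 of the 12 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[11], gamma=0.1)

for epoch in range(12):
    for step in range(100):                          # one pass over the data with batch size 1
        loss = model(torch.randn(1, 10)).sum()       # placeholder for the detection losses
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()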
4.2 Datasets
The RSOD dataset [32,33], which consists of 976 images, is an openly accessible resource specifically designed for object detection within remote sensing imagery. The dataset contains four different object categories: aircraft, oil tank, overpass, and playground, with a total of 6,950 objects. Among them, 4,993 aircraft objects are marked in 466 images, 1,585 oil tank objects in 165 images, 180 overpass objects in 176 images, and 191 playground objects in 189 images. This paper divides the RSOD dataset into a training set and a validation set, with a ratio of 70% allocated to the training set and 30% to the validation set.
The NWPU VHR-10 dataset [34–36] is a public remote sensing dataset released by Northwestern Polytechnical University. It contains 800 images covering 10 object categories, of which 650 images contain objects and 150 contain only background. The dataset has been partitioned into a training set and a validation set in a 7:3 ratio.
PASCAL VOC2007 dataset [37,38] contains 9,963 images from 20 categories, which are divided into the train-val2007 dataset and the test2007 dataset. In this paper, train-val2007 is used as a training set, including 5,011 images, and the model is evaluated on the test2007 dataset, including 4,952 images.
4.3 Metrics
We employ average precision (AP), mean average precision (mAP), floating point operations (FLOPs), F1-score, and optimal localization recall precision (oLRP) [39] as the evaluation metrics to assess the performance of the algorithm. AP denotes the area enclosed by the precision-recall curve and the coordinate axes. Precision refers to the fraction of accurately predicted positive samples out of the total number of samples predicted to be positive. Recall represents the fraction of positive samples that are correctly predicted among all actual positive samples. Specifically, they can be defined by Eqs (9) and (10):
$$Precision=\frac{TP}{TP+FP} \quad (9)$$
$$Recall=\frac{TP}{TP+FN} \quad (10)$$
where TP denotes the positive samples that are correctly predicted, FP denotes the negative samples that are incorrectly predicted as positive, and FN represents the positive samples that are incorrectly predicted as negative. AP can be calculated by Eq (11).
$$AP=\int_{0}^{1}Precision(Recall)\,d(Recall) \quad (11)$$
mAP is the average of different kinds of AP. mAP can be calculated by Eq (12) as follows:
$$mAP=\frac{1}{n}\sum_{i=1}^{n}AP_{i} \quad (12)$$
where n is the number of object categories.
The F1-score serves as the harmonic average of Precision and Recall. A higher F1-score signifies an optimal trade-off between Precision and Recall, reflecting the algorithm’s performance. F1-score is estimated as Eq (13).
$$F1=\frac{2\times Precision\times Recall}{Precision+Recall} \quad (13)$$
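The metrics in Eqs (9)–(13) can be computed as in the short sketch below; the 101-point interpolation used to approximate the integral in Eq (11) is a common convention and is an assumption here, not necessarily the exact protocol of the evaluation toolkit.

import numpy as np

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                               # Eq (9)
    recall = tp / (tp + fn)                                  # Eq (10)
    f1 = 2 * precision * recall / (precision + recall)       # Eq (13)
    return precision, recall, f1

def average_precision(precisions, recalls):
    # Approximate the area under the precision-recall curve (Eq 11)
    # by interpolating the precision at 101 evenly spaced recall levels.
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 101):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 101

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)             # Eq (12)

# Toy example: 80 true positives, 10 false positives, 20 false negatives.
print(precision_recall_f1(80, 10, 20))                       # ~ (0.889, 0.800, 0.842)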
FLOPs are employed as a metric to quantify the computational demand and complexity of the algorithm. A larger FLOPs value signifies a higher requirement for computing resources.
oLRP comprehensively evaluates localization accuracy, precision error, and recall error. Under a given IoU threshold, it measures how tightly the bounding boxes enclose the objects, yielding a more reliable assessment of localization performance. Let X represent the set of ground-truth boxes and Y denote the detector’s predicted box set. Given a confidence score threshold s and an IoU threshold $\tau$, the LRP error of $Y_s$ (the predictions with scores above s) relative to X can be calculated by Eq (14):
$$LRP(X,Y_s)=\frac{1}{N_{TP}+N_{FP}+N_{FN}}\left(w_{IoU}\,LRP_{IoU}(X,Y_s)+w_{FP}\,LRP_{FP}(X,Y_s)+w_{FN}\,LRP_{FN}(X,Y_s)\right) \quad (14)$$
where $N_{TP}$, $N_{FP}$, and $N_{FN}$ denote the numbers of TP, FP, and FN samples, respectively. $w_{IoU}=\frac{N_{TP}}{1-\tau}$, $w_{FP}=|Y_s|$, and $w_{FN}=|X|$ represent the weights of the components. $LRP_{IoU}$, $LRP_{FP}$, and $LRP_{FN}$ can be defined by Eqs (15), (16), and (17):
$$LRP_{IoU}(X,Y_s)=\frac{1}{N_{TP}}\sum_{i=1}^{N_{TP}}\left(1-IoU(x_i,y_{x_i})\right) \quad (15)$$
$$LRP_{FP}(X,Y_s)=1-Precision=\frac{N_{FP}}{|Y_s|} \quad (16)$$
$$LRP_{FN}(X,Y_s)=1-Recall=\frac{N_{FN}}{|X|} \quad (17)$$
where $y_{x_i}$ denotes the prediction assigned to the ground-truth box $x_i$.
In summary, the LRP error can be organized as Eq (18).
$$LRP(X,Y_s)=\frac{1}{N_{TP}+N_{FP}+N_{FN}}\left(\sum_{i=1}^{N_{TP}}\frac{1-IoU(x_i,y_{x_i})}{1-\tau}+N_{FP}+N_{FN}\right) \quad (18)$$
The oLRP is defined as the minimum LRP error achievable over confidence score thresholds at the IoU threshold $\tau=0.5$, as in Eq (19).
$$oLRP=\min_{s}LRP(X,Y_s)\big|_{\tau=0.5} \quad (19)$$
The moLRP is the mean optimal localization recall precision error over all categories, where c denotes a category and C the set of categories, as in Eq (20).
$$moLRP=\frac{1}{|C|}\sum_{c\in C}oLRP_{c} \quad (20)$$
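The LRP error of Eq (18) for a single confidence threshold, and oLRP as the minimum over thresholds (Eq 19), can be sketched as follows; representing each threshold simply by the IoUs of its matched detections plus FP/FN counts is a simplification of the full evaluation protocol in [39].

def lrp_error(matched_ious, n_fp, n_fn, tau=0.5):
    # LRP error for one confidence score threshold (Eq 18).
    n_tp = len(matched_ious)
    total = n_tp + n_fp + n_fn
    if total == 0:
        return None
    loc_term = sum((1.0 - iou) / (1.0 - tau) for iou in matched_ious)  # localization component
    return (loc_term + n_fp + n_fn) / total

def olrp(per_threshold_stats, tau=0.5):
    # oLRP (Eq 19): the minimum LRP error over confidence score thresholds.
    errors = [lrp_error(ious, fp, fn, tau) for ious, fp, fn in per_threshold_stats]
    return min(e for e in errors if e is not None)

# Toy example for one class: two candidate confidence thresholds.
stats = [
    ([0.9, 0.8, 0.7], 5, 2),   # permissive threshold: more matches but five false positives
    ([0.9, 0.8], 1, 3),        # strict threshold: fewer false positives, more misses
]
print(olrp(stats))             # ~ 0.77, the better of the two thresholds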
4.4 Results analysis
4.4.1 Learning rate analysis.
To validate the selection of the initial learning rate of 1E-5, we conduct parameter experiments comparing five candidate values: {1E-6, 5E-6, 1E-5, 5E-5, 1E-4}; the experimental results are shown in Fig 6. As Fig 6 demonstrates, the initial learning rate of 1E-5 strikes an optimal trade-off between convergence speed and stability. When the learning rate is too large, the model struggles to converge due to oscillations and the accuracy of small object detection is poor. Conversely, an excessively small learning rate slows down the training process, although it does not cause significant performance degradation. These results validate 1E-5 as a robust choice for ELFT to ensure the stability of the training process.
[Figure omitted. See PDF.]
4.4.2 Comparative experiments.
To verify the algorithm’s effectiveness in improving the detection accuracy of small object images while reducing computational complexity, comparative experiments are conducted on the RSOD, NWPU VHR-10, and PASCAL VOC2007 datasets.
In this paper, the current mainstream object detection algorithms Faster RCNN, SSD, YOLOv7, Deformable DETR, Conditional DETR, DAB-DETR, DN-DETR, Sparse DETR, and DINO are selected for comparative evaluation on the RSOD dataset. The comparison results are presented in Table 4, where “*” indicates reproduced experimental results, bold indicates the optimal results, and underlined values are suboptimal results. As is evident from Table 4, ELFT achieves an mAP of 95.8%, surpassing the baseline algorithm while simultaneously reducing the parameters by 10.4% and the computational complexity by 22.7%. The moLRP is 0.371, meaning that the mean optimal localization recall precision error is the smallest, which indicates that the bounding boxes predicted by the ELFT algorithm enclose the targets more tightly and the localization performance is better. These improvements outperform those of the compared representative object detection algorithms. Concretely, the mAP is increased by 2.1% and the F1-score by 1.1% compared to the baseline.
[Figure omitted. See PDF.]
Abbreviations: Deformable DETR (D-DETR), Conditional DETR (C-DETR).
Fig 7 shows the comparison of the Precision-Recall curves for different algorithms on RSOD dataset at an IoU threshold of 0.5. It can be observed that the precision of DINO decreases significantly as recall increases, while ELFT still maintains high precision even at higher recall rates. Meanwhile, ELFT exhibits the largest area under the Precision-Recall curve among all compared mainstream algorithms, demonstrating superior performance. The detection results of different detection algorithms on the RSOD dataset for each category are presented in Table 5. The results reveal that ELFT achieves the highest mAP values in the overpass and playground object categories, while also securing high values in the aircraft and oil tank categories. Compared to various DETR series, ELFT demonstrates different degrees of improvement in mAP. In addition, the mAP surpasses that of representative algorithms of Faster RCNN, SSD, and YOLO series. These results show that ELFT exhibits excellent performance while reducing computational complexity.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Fig 8 illustrates the oLRP, $oLRP_{IoU}$, $oLRP_{FP}$, and $oLRP_{FN}$ of different algorithms on the RSOD dataset for each category. It can be seen from the comparison that ELFT achieves the lowest optimal localization recall precision error in the target categories of oil tanks, overpasses, and playgrounds, which proves its effectiveness. The evaluation and comparison results of different algorithms on the NWPU VHR-10 dataset are presented in Table 6. The findings indicate that the proposed algorithm attains an F1-score of 96.1%, which suggests that ELFT strikes a better balance between precision and recall. The moLRP, $moLRP_{IoU}$, $moLRP_{FP}$, and $moLRP_{FN}$ are 0.425, 0.168, 0.051, and 0.104, respectively, and the mAP reaches 94.3%. It is worth noting that the model outperforms the compared mainstream algorithms. More specifically, it improves the mAP by 3.3%, 2.9%, 2.4%, 1.8%, 20% and 1.7% compared to Deformable DETR, Conditional DETR, DAB-DETR, DN-DETR, Sparse DETR and DINO, respectively. The results demonstrate the efficacy and performance of the proposed algorithm. Furthermore, when compared to Faster RCNN, SSD, and YOLOv7, ELFT exhibits a significant increase in mAP of 6.3%, 32.6%, and 4.7%, respectively, indicating its advantages in detection performance.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Fig 9 presents the Precision-Recall curves of different algorithms on the NWPU VHR-10 dataset. ELFT significantly outperforms the other algorithms in the high-recall region and reduces missed detections. These results indicate that ELFT effectively balances precision and recall, and is suitable for small object detection scenarios. The detection results of different detection algorithms on the NWPU VHR-10 dataset for each category are reported in Table 7. The analysis of Table 7 makes it apparent that ELFT achieves the highest value in detecting four types of small objects: aircraft, basketball courts, harbors, and vehicles. Additionally, it achieves suboptimal values in detecting three types of small objects: oil tanks, tennis courts, and bridges. A possible reason why it does not reach the highest or suboptimal performance on ships and ground track fields is that the proportion of these objects in the images is tiny and their resolution may be low. Regarding baseball diamonds, their features are similar to those of the surrounding environment, making it challenging to extract distinct features, which tends to incur lower detection accuracy.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Fig 10 illustrates the performance of different algorithms in terms of oLRP, $oLRP_{IoU}$, $oLRP_{FP}$, and $oLRP_{FN}$ for each category in the NWPU VHR-10 dataset. It shows that the optimal localization recall precision error of ELFT is the smallest in all categories except the storage tank category, suggesting that ELFT predicts bounding boxes that enclose the targets more tightly. The proposed algorithm and DINO are compared with some current representative object detection algorithms on the PASCAL VOC2007 dataset, and the experimental results are displayed in Table 8. The findings show that the proposed algorithm achieves an F1-score of 91.3% and an mAP of 85.3%, outperforming the other algorithms in both metrics. ELFT achieves a lower optimal localization recall precision error (oLRP) of only 0.475 compared to the other algorithms, thereby proving its effectiveness. Specifically, the F1-score is increased by at least 0.5% and at most 5.3% compared to the other algorithms, while the mAP is increased by at least 0.7% and up to 15.4%. The moLRP is reduced by a minimum of 0.008 and a maximum of 0.059 compared to the other algorithms, further demonstrating the high performance of the ELFT algorithm in localization, regression and classification.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Fig 11 shows the Precision-Recall curves of different algorithms on the PASCAL VOC2007 dataset. From the curve patterns, the Precision-Recall curves of all the compared algorithms on the PASCAL VOC2007 dataset exhibit a gently decreasing trend. However, the Precision-Recall curve of ELFT always lies above those of the other algorithms, which indicates that ELFT maintains a higher accuracy across the entire recall range. The detection results for the different classes in the PASCAL VOC2007 dataset with different detection algorithms are presented in Table 9. The results reveal that among the 20 categories in the VOC dataset, except for the table and horse categories, the remaining 18 categories achieve optimal or suboptimal detection results compared to the other algorithms. One possible reason is that the number of positive sample images for tables and horses in the dataset is relatively small, which limits the algorithm’s ability to learn and recognize features for these two categories.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Fig 12 shows the performance of different algorithms in terms of oLRP, $oLRP_{IoU}$, $oLRP_{FP}$, and $oLRP_{FN}$ for each category of the PASCAL VOC2007 dataset. It can be seen that ELFT exhibits a smaller optimal localization recall precision error, which proves the superiority of its performance.
[Figure omitted. See PDF.]
4.4.3 Ablation experiments.
To validate the effectiveness of the different components, this paper performs ablation experiments on the RSOD dataset, comparing the performance of the algorithm in different configurations. Experiment A represents the DINO algorithm, while experiments B to F represent the addition of different improved methods. “✓” indicates that a method is used, while “×” indicates that it is not. The experimental results are summarized in Table 10.
[Figure omitted. See PDF.]
The core idea of the GFUM module is that the six-layer encoder is divided into three groups. For each group, the first layer aggregates information globally through high-level features to reduce computational load, while the second layer preserves details using low-level features, achieving a balance between multi-scale feature maintenance and computational efficiency. Moreover, the ELGFA module, by pooling image features along the height, width, and global dimensions, effectively integrates both local and global contextual information to enhance the input feature map, aiming to accurately identify regions of interest. Furthermore, the CB module computes the average of the feature vectors and then rebroadcasts this global context information to each feature vector. Comparing experiment A and experiment B, we observe that adding GFUM to the encoder in DINO reduces FLOPs by 21.0% with a slight increase in mAP, demonstrating that GFUM effectively decreases the computational load of the encoder by alternately updating high- and low-level features. When comparing experiment B with experiment C, we observe that the total number of parameters is reduced by 10.4%, with a further reduction in FLOPs to 223.88G. Notably, the mAP is also slightly improved. The ELA module precisely identifies the position of the region of interest, preserves the dimensionality of the input feature channels, and remains lightweight. Furthermore, the comparison between experiment B and experiment D reveals that the ELGFA module considers the global context information more comprehensively, alleviates the missed detection of small objects, and increases the mAP to 94.5%. The comparison between experiment B and experiment E shows that the CB module further boosts the mAP by 0.9%, thereby proving the effectiveness of the CB module in the algorithm. Experiment F combines the three improved methods; compared to the original DINO algorithm, the total number of parameters is reduced by 10.4%, the FLOPs are reduced by 22.7%, and the mAP is increased by 2.1%. Note that experiment F also performs the best in terms of mAP, parameters and FLOPs. Therefore, the experimental results confirm the effectiveness of ELFT.
To evaluate the versatility of the ELGFA module, we additionally replace the backbone of the benchmark model DINO with ResNet101 and incorporate the ELGFA module, conducting experiments on the three datasets; the experimental results are shown in Table 11. It can be seen that when ResNet50 is used as the backbone, the FLOPs are 289.68G, the parameters are 45.15M, and the mAP on the three datasets is 93.7%, 92.6%, and 84.6%, respectively. After adding the ELGFA module, the FLOPs are reduced to 284.80G, the parameters are decreased to 40.44M, and the mAP on the three datasets is increased to 95.4%, 93.5%, and 84.8%, respectively. For the ResNet101 backbone configuration, the baseline yields 369.23G FLOPs and 64.09M parameters, with mAP of 95.6%, 93.3%, and 84.7%, respectively. After adding the ELGFA module, the FLOPs are reduced to 364.28G, the parameters are decreased to 59.38M, and the mAP on the three datasets is boosted to 96.1%, 93.7%, and 85.0%, respectively. These results demonstrate that the ELGFA module is applicable to different models and can effectively improve small object detection accuracy. In addition, the experimental results show that ResNet101 outperforms ResNet50 only marginally, while requiring a higher computational cost. This is why we choose ResNet50 as the backbone network.
[Figure omitted. See PDF.]
Furthermore, to validate the generality of the ELGFA module across additional detectors, we integrate it into three state-of-the-art DETR variants, i.e., Deformable DETR, DAB-DETR, and DN-DETR. The results of the experiments conducted on the RSOD dataset are summarized in Table 12. It is clear that ELGFA integration consistently improves small object detection performance across all evaluated frameworks while reducing computational overhead. Specifically, embedding the ELGFA module into Deformable DETR reduces the parameter count by 4.71M and the FLOPs to 198.50G, while the F1-score and mAP are improved by 0.2% and 0.6%, respectively. Similarly, DAB-DETR with ELGFA achieves an 11.4% parameter reduction and reduces FLOPs to 96.98G, with F1-score and mAP gains of 1.0% and 1.9%, respectively. DN-DETR with ELGFA decreases the parameter count to 36.72M, reduces FLOPs by 4.9%, and improves the F1-score and mAP by 0.9% and 1.7%, respectively. The mean optimal localization recall precision error (moLRP) is decreased to 0.395, 0.392, and 0.383, respectively. These findings demonstrate the generalizability of the ELGFA module, which effectively improves small object detection performance while maintaining computational efficiency across diverse detectors.
[Figure omitted. See PDF.]
4.5 Visualizations
To thoroughly investigate ELFT’s performance in small object detection, we perform Grad-CAM visualization for heatmap analysis, as shown in Fig 13. It intuitively shows how the model localizes specific regions. The Grad-CAM technique enhances the visualization of the model’s decision-making process, which aids in understanding how targets are identified within images. For small object detection, ELFT demonstrates distinct strengths. Experimental results indicate that it successfully detects small objects missed by the original algorithm. This further confirms that the improved method has higher sensitivity to small objects, contributing to enhanced detection accuracy. The detection results of DAB-DETR, DINO, and the proposed algorithm are shown in Fig 14, where wrongly detected or missed objects are highlighted with a red circle. As Fig 14(a) illustrates, DAB-DETR falsely detects two boarding bridges as aircraft, and the original DINO similarly falsely detects the boarding bridges at the top right of the picture as aircraft. In contrast, ELFT accurately detects all objects within the image without any false detections. As Fig 14(b) shows, the four tennis courts in the upper right corner are not detected by DAB-DETR, and the lower tennis courts are not detected by DINO. On the contrary, ELFT successfully detects all objects. Due to the dense distribution of the tennis courts and their similarity in color to the background, they are difficult to detect. The CB module embedded in the proposed algorithm considers the global context of the entire image, thereby significantly enhancing the detection performance in dense environments. In Fig 14(c), both DAB-DETR and the original DINO algorithm fail to detect the four oil tank objects located at the top right. Conversely, ELFT accurately detects these objects, indicating that the ELGFA module proposed in this paper can better extract the location features of objects in the image. This fusion of feature map details effectively solves the issue of missed detections of oil tank objects. Figs 15 and 16 present the detection results on the RSOD dataset and PASCAL VOC2007 dataset, respectively.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
4.6 Discussion
Although the proposed model demonstrates competitive performance in both quantitative metrics and qualitative visualization, some failure cases of our method appear in several scenarios, e.g., low-light scenes or severe object occlusion. For low light, the reduction in image contrast and color fidelity limits the discriminative capacity of visual features, leading to suboptimal predictions. For severe object occlusion, partial visibility of key semantic regions reduces the effectiveness of local feature extraction, making the representation less reliable. We attribute these degradations to two main factors: first, data distribution bias, as the training dataset contains fewer samples representing extreme lighting or heavy occlusion conditions; second, limitations in spatial dependency modeling, where the current mechanism may not fully capture long-range relationships or adapt to highly variable object appearances. At present, transfer learning and progressive hierarchical attention mechanisms have been applied to address these problems. It is an interesting direction to apply these paradigms to small object detection in the future.
5 Conclusion
In this paper, we present the ELFT object detection algorithm, which achieves a balance between computational cost and detection performance. The ResNet50 structure is augmented with the ELGFA module to efficiently capture long-range spatial relationships and integrate global and local contextual information. The GFUM is devised to optimize the efficiency of the encoder. Furthermore, we adopt the CB module to obtain more extensive and detailed context information, which makes the localization and recognition of small objects more accurate. Extensive experiments validate the efficacy of the proposed approach in enhancing the accuracy of detecting small objects while reducing the parameter count and computational cost. Since the algorithm mainly relies on a Transformer structure with the quadratic computational complexity of self-attention, there is ample room to simplify it further. We will make further efforts to explore lightweight methods and to significantly improve the detection ability for small objects in future work.
References
1. Vaswani A. Attention is all you need. Adv Neural Inform Process Syst. 2017.
2. Cheng Y, Yan J, Zhang F, Li M, Zhou N, Shi C, et al. Surrogate modeling of pantograph-catenary system interactions. Mech Syst Signal Process. 2025;224:112134.
3. Yang H, Liu Z, Ma N, Wang X, Liu W, Wang H, et al. CSRM-MIM: A self-supervised pretraining method for detecting catenary support components in electrified railways. IEEE Trans Transp Electrific. 2025;11(4):10025–37.
4. Yan J, Cheng Y, Zhang F, Zhou N, Wang H, Jin B, et al. Multimodal imitation learning for arc detection in complex railway environments. IEEE Trans Instrum Meas. 2025;74:1–13.
5. Cao C, Wang B, Zhang W, Zeng X, Yan X, Feng Z, et al. An improved faster R-CNN for small object detection. IEEE Access. 2019;7:106838–46.
6. Gong L, Huang X, Chao Y, Chen J, Lei B. An enhanced SSD with feature cross-reinforcement for small-object detection. Appl Intell. 2023;53(16):19449–65.
7. Chen J, Wen R, Ma L. Small object detection model for UAV aerial image based on YOLOv7. SIViP. 2023;18(3):2695–707.
8. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. Lecture Notes in Computer Science. Springer International Publishing; 2020. p. 213–29. https://doi.org/10.1007/978-3-030-58452-8_13
9. Zhang H, Li F, Liu SL, Zhang L, Su H, Zhu J, et al. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. The eleventh international conference on learning representations; 2023.
10. Zhu XZ, Su WJ, Lu LW, Li B, Wang XG, Dai JF. Deformable DETR: Deformable transformers for end-to-end object detection. International conference on learning representations; 2021.
11. Liu SL, Li F, Zhang H, Yang X, Qi XB, Su H, et al. DAB-DETR: Dynamic anchor boxes are better queries for DETR. International conference on learning representations; 2022.
12. Li F, Zhang H, Liu S, Guo J, Ni LM, Zhang L. DN-DETR: Accelerate DETR training by introducing query denoising. IEEE Trans Pattern Anal Mach Intell. 2024;46(4):2239–51. pmid:38019624
13. Cunha E, Macêdo D, Zanchettin C. Improving small object detection with DETRAug. In: 2023 international joint conference on neural networks (IJCNN); 2023. p. 1–8. https://doi.org/10.1109/ijcnn54540.2023.10191541
14. Liu B, Jia Y, Liu L, Dang Y, Song S. Skip DETR: End-to-end skip connection model for small object detection in forestry pest dataset. Front Plant Sci. 2023;14:1219474. pmid:37649993
15. Cao X, Yuan P, Feng B, Niu K. CF-DETR: Coarse-to-fine transformers for end-to-end object detection. AAAI. 2022;36(1):185–93.
16. Hoanh N, Pham TV. Focus-attention approach in optimizing DETR for object detection from high-resolution images. Knowl-Based Syst. 2024;296:111939.
17. Dubey S, Olimov F, Rafique MA, Jeon M. Improving small objects detection using transformer. J Visual Commun Image Represent. 2022;89:103620.
18. Roh B, Shin J, Shin W, Kim S. Sparse DETR: Efficient end-to-end object detection with learnable sparsity. International conference on learning representations; 2022.
19. Zheng D, Dong W, Hu H, Chen X, Wang Y. Less is more: Focus attention for efficient DETR. In: 2023 IEEE/CVF international conference on computer vision (ICCV); 2023. p. 6651–60. https://doi.org/10.1109/iccv51070.2023.00614
20. Sun H, Zhang S, Tian X, Zou Y. Pruning DETR: Efficient end-to-end object detection with sparse structured pruning. SIViP. 2023;18(1):129–35.
21. Zheng W, Lu S, Yang Y, Yin Z, Yin L. Lightweight transformer image feature extraction network. PeerJ Comput Sci. 2024;10:e1755. pmid:39669455
22. Hu J, Shen L, Albanie S, Sun G, Wu E. Squeeze-and-excitation networks. IEEE Trans Pattern Anal Mach Intell. 2020;42(8):2011–23. pmid:31034408
23. Yu L, Liu N, Zhou W, Dong S, Fan Y, Abbas K. Weber’s law based multi-level convolution correlation features for image retrieval. Multimed Tools Appl. 2021;80(13):19157–77.
24. Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, et al. Dual attention network for scene segmentation. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2019. p. 3141–9. https://doi.org/10.1109/cvpr.2019.00326
25. Zhang J, Xia K, Huang Z, Wang S, Akindele RG. ETAM: Ensemble transformer with attention modules for detection of small objects. Expert Syst Applic. 2023;224:119997.
26. Hyeon-Woo N, Yu-Ji K, Heo B, Han D, Oh SJ, Oh T-H. Scratching visual transformer’s back with uniform attention. In: 2023 IEEE/CVF international conference on computer vision (ICCV); 2023. p. 5784–95. https://doi.org/10.1109/iccv51070.2023.00534
27. Zhou T, Chang X, Liu Y, Ye X, Lu H, Hu F. COVID-ResNet: COVID-19 recognition based on improved attention ResNet. Electronics. 2023;12(6):1413.
28. Srinivas A, Lin T-Y, Parmar N, Shlens J, Abbeel P, Vaswani A. Bottleneck transformers for visual recognition. In: 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2021. p. 16514–24. https://doi.org/10.1109/cvpr46437.2021.01625
29. Xu W, Wan Y. ELA: Efficient local attention for deep convolutional neural networks. arXiv preprint; 2024.
30. Li F, Zeng A, Liu S, Zhang H, Li H, Zhang L, et al. Lite DETR: An interleaved multi-scale encoder for efficient DETR. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2023. p. 18558–67. https://doi.org/10.1109/cvpr52729.2023.01780
31. Loshchilov I, Hutter F. Decoupled weight decay regularization. International conference on learning representations; 2019.
32. Long Y, Gong Y, Xiao Z, Liu Q. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans Geosci Remote Sensing. 2017;55(5):2486–98.
33. Xiao Z, Liu Q, Tang G, Zhai X. Elliptic Fourier transformation-based histograms of oriented gradients for rotationally invariant object detection in remote-sensing images. Int J Remote Sensing. 2015;36(2):618–44.
34. Cheng G, Han J, Zhou P, Guo L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J Photogr Remote Sensing. 2014;98:119–32.
35. Cheng G, Han J. A survey on object detection in optical remote sensing images. ISPRS J Photogr Remote Sensing. 2016;117:11–28.
36. Cheng G, Zhou P, Han J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans Geosci Remote Sensing. 2016;54(12):7405–15.
37. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A. The pascal visual object classes (VOC) challenge. Int J Comput Vis. 2009;88(2):303–38.
38. Everingham M. The pascal visual object classes challenge (VOC2007) results; 2007. http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007/index.html
39. Oksuz K, Cam BC, Akbas E, Kalkan S. Localization recall precision (LRP): A new performance metric for object detection. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 504–19.
40. Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49. pmid:27295650
41. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, et al. SSD: Single shot multibox detector. Lecture Notes in Computer Science. Springer International Publishing; 2016. p. 21–37. https://doi.org/10.1007/978-3-319-46448-0_2
42. Wang C-Y, Bochkovskiy A, Liao H-YM. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2023. p. 7464–75. https://doi.org/10.1109/cvpr52729.2023.00721
43. Meng D, Chen X, Fan Z, Zeng G, Li H, Yuan Y, et al. Conditional DETR for fast training convergence. In: 2021 IEEE/CVF international conference on computer vision (ICCV); 2021. p. 3631–40. https://doi.org/10.1109/iccv48922.2021.00363
44. 44. Ge Z. Yolox: Exceeding yolo series in 2021 . arXiv preprint; 2021.
* View Article
* Google Scholar
45. 45. Zhang Y, Xu A, Lan D, Zhang X, Yin J, Goh HH. ConvNeXt-based anchor-free object detection model for infrared image of power equipment. Energy Rep. 2023;9:1121–32.
* View Article
* Google Scholar
Citation: Hua G, Wu F, Hao G, Xia C, Li L (2025) ELFT: Efficient local-global fusion transformer for small object detection. PLoS One 20(9): e0332714. https://doi.org/10.1371/journal.pone.0332714
About the Authors:
Guoguang Hua
Roles: Conceptualization, Formal analysis, Methodology, Project administration, Software, Validation, Writing – review & editing
Affiliation: School of Artificial Intelligence, Guangzhou Maritime University, Guangzhou, Guangdong, China
Fangfang Wu
Roles: Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
Affiliation: School of Information and Electrical Engineering, Hebei University of Engineering, Handan, Hebei, China
Guangzhao Hao
Roles: Software, Validation
Affiliation: Section of Network and Information, Handan Water Supply Co. Ltd, Handan, Hebei, China
Chenbo Xia
Roles: Software, Validation, Visualization
Affiliation: School of Information and Electrical Engineering, Hebei University of Engineering, Handan, Hebei, China
Li Li
Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Writing – review & editing
E-mail: [email protected]
Affiliation: School of Information and Electrical Engineering, Hebei University of Engineering, Handan, Hebei, China
ORCID: https://orcid.org/0000-0001-7045-4727
© 2025 Hua et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.