1. Introduction
In recent years, with the rapid development of the automobile industry, the number of motor vehicles in cities has grown quickly. By the end of 2020, the number of motor vehicles nationwide had exceeded 300 million, bringing with it colossal traffic pressure and many traffic regulation problems. As traffic pressure and traffic control problems become more serious, they cause many inconveniences to the production and daily life of urban residents and also restrict the rapid development of cities and towns. On the other hand, as artificial intelligence gradually matures, its application to vehicles has received extensive attention; the application to vehicle detection is the focus of this article. Moreover, vehicle detection covers more and more areas, such as road traffic monitoring, automatic gate control at community entrances, parking spaces at charging piles, and automated driving.
Unattended charging areas are often occupied by pedestrians or non-charging private cars, which inconveniences new energy vehicle owners, and when a vehicle is fully charged the owner needs to be reminded to move it away from the charging pile. Vehicle detection in autonomous driving also still has many problems: a vehicle may be severely occluded or misidentified in different scenes, which is one of the leading causes of accidents in self-driving cars. In addition, before license plate recognition, the vehicle body is usually detected first to narrow the recognition range, reduce interference from the surrounding environment, and improve accuracy. Under these conditions, vehicle detection remains a challenging task.
Vehicle detection has two primary purposes. The first is to determine whether a vehicle (such as a bus, truck, or car) appears in a video or image; during detection, its location must be determined and marked. The second is to determine the specific category: the particular type of vehicle is identified by analyzing the semantic information (e.g., [1]) of the car in the frame, completing the vehicle detection task.
Neural network design often follows several principles. The main one is to shorten information paths and enhance information propagation; for example, residual connections [2] and dense connections [3] serve this purpose well. Improving the flexibility and diversity of information paths is also effective, typically through a split-transform-merge strategy [4]. There are also many methods [5–7] that combine high-resolution image information with high-level semantic information.
Driven by these cutting-edge algorithms, we propose an improved FCOS algorithm for vehicle detection, consisting of three main points. First, a car is a rigid structure whose appearance changes with shooting angle and scene occlusion. The receptive field of a standard convolution kernel is rectangular and extends uniformly to the surroundings, so it may not cover the entire car. We introduce deformable convolution [8], which enables the convolution kernel to adaptively learn position offsets for its responses according to the deformation of the target. Second, adding a bottom-up module to the original FPN further enhances the flow of information between feature layers and shortens the path between the bottom layer and the top layer. Third, Pang et al. [9] earlier proposed Libra R-CNN, arguing that today's detectors all follow region selection and feature extraction and then converge gradually under the guidance of a multitask loss, so any imbalance in this pipeline directly affects training. Following this balance concept, we add a balanced module after the improved FPN: it integrates feature maps of different resolutions, uses the fused feature to strengthen the original pyramid, and applies nonlocal attention to enhance contextual connections. This operation reduces the inconsistency of bbox head predictions caused by the differing variances of the feature levels, and the whole process is almost cost-free because it relies only on interpolation and pooling.
The work of this paper is as follows:
(1) The introduction of DCN [8] allows the receptive field of the convolution kernel to adapt to the shape of the target
(2) We added a bottom-up module behind the traditional FPN [6] to reduce the distance from bottom to top
(3) We added a balanced module after the improved FPN to reduce the inconsistency of the bbox head prediction
2. Related Works
This section mainly introduces prior research on vehicle detection and the FCOS algorithm.
2.1. Detection Algorithm
Formerly, many methods combining hand-crafted feature extraction with classifiers were proposed for vehicle detection. For example, HOG [10] features were first used for detection, and HOG [10] and LBP [11] features were then combined to further improve accuracy; Li Xiangfeng et al. applied the Haar [12] feature algorithm. Although these algorithms achieve good detection results in simple scenarios, they struggle with complex scenes.
After 2012, convolutional neural network-based deep learning algorithms became popular; they extract semantic information about vehicles from different feature layers and, given a sufficient dataset, overcome the insufficient robustness of traditional algorithms.
Detection algorithms fall into three broad families. The first is the two-stage algorithms (e.g., [9, 13–21]). These algorithms use anchors as priors to further improve accuracy and speed up the convergence of the network, but they are often slower than single-stage detectors. The second is the single-stage algorithms (e.g., [22–27]); many of these drop the anchor prior, so their regression strategies differ considerably.
The last is the transformer-based direction pioneered by Facebook with DETR [29], which introduces the transformer [28] from NLP into computer vision and greatly simplifies the network model. However, this family still has some shortcomings in accuracy and remains far from practical engineering deployment (e.g., [29–32]).
2.2. FCOS Algorithm
We choose an anchor-free network because anchor-based algorithms have several limitations:
(1) Detection performance is sensitive to the size, number, and aspect ratio of the anchors, which must be retuned for every new task, and this is not conducive to generalization
(2) To match the GT boxes well, a large number of anchors must be generated, most of which are labeled as negative samples, causing an imbalance between positive and negative examples
(3) Matching anchors to GT boxes requires computing IoU over a huge number of box pairs, which consumes a lot of computing power, slows down detection, and increases cost (a minimal sketch of this computation follows the list)
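To make the third point concrete, the following is a minimal sketch of the pairwise IoU computation that anchor-based matching performs for every training image; the function name and shapes are illustrative and are not taken from any specific detector.

```python
import torch

def pairwise_iou(anchors: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """IoU between every anchor and every GT box.

    anchors:  (N, 4) boxes as (x1, y1, x2, y2)
    gt_boxes: (M, 4) boxes as (x1, y1, x2, y2)
    returns:  (N, M) IoU matrix
    """
    # Intersection rectangle for every anchor/GT pair via broadcasting.
    lt = torch.max(anchors[:, None, :2], gt_boxes[None, :, :2])   # (N, M, 2)
    rb = torch.min(anchors[:, None, 2:], gt_boxes[None, :, 2:])   # (N, M, 2)
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]                                # (N, M)

    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    union = area_a[:, None] + area_g[None, :] - inter
    return inter / union.clamp(min=1e-6)
```

With on the order of one hundred thousand anchors per image, this N x M matrix has to be rebuilt for every image at every assignment step, which is the computational burden the list refers to.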
FCOS [22] is an anchor-free detection algorithm. Earlier anchor-free algorithms still lagged noticeably behind anchor-based ones in accuracy, but FCOS [22], through its alternative design, successfully surpassed anchor-based detectors and became the state of the art of its year.
The definition of positive and negative samples in FCOS [22] is quite different from before. If a location $(x, y)$ falls into any GT box, it is a positive sample, and the network regresses the distances from this point to the four sides of the bounding box, $l^{*} = x - x_{0}$, $t^{*} = y - y_{0}$, $r^{*} = x_{1} - x$, and $b^{*} = y_{1} - y$, where $(x_{0}, y_{0})$ and $(x_{1}, y_{1})$ are the top-left and bottom-right corners of the GT box.
Earlier anchor-free algorithms had no good solution for locations falling inside overlapping GT boxes, so point regression was ambiguous. In FCOS [22], this ambiguity is significantly reduced by the feature pyramid. As is well known, the shallow layers of a neural network are rich in detailed features, which benefits small target detection [33], while higher levels carry more semantic features and are used to detect large targets. To keep objects of very different sizes from overlapping on the same level, a threshold $m_{i}$ is set for each pyramid level: a location is assigned to level $i$ only if $m_{i-1} < \max(l^{*}, t^{*}, r^{*}, b^{*}) \le m_{i}$, so each level regresses objects within its own size range.
To further suppress prediction boxes that are far from the center of the GT box, FCOS [22] adopts the centerness branch (Equation (2)), $\text{centerness} = \sqrt{\frac{\min(l^{*}, r^{*})}{\max(l^{*}, r^{*})} \times \frac{\min(t^{*}, b^{*})}{\max(t^{*}, b^{*})}}$, and uses the BCE loss [34] to optimize this branch.
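For concreteness, here is a minimal sketch of how the regression targets and the centerness above can be computed for a single location; the function names are ours, and the code only illustrates the FCOS definitions rather than reproducing the authors' implementation.

```python
import math

def fcos_targets(x, y, box):
    """Distances (l, t, r, b) from location (x, y) to the sides of a GT box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    l, t, r, b = x - x0, y - y0, x1 - x, y1 - y
    if min(l, t, r, b) <= 0:      # the location lies outside the box: negative sample
        return None
    return l, t, r, b

def centerness(l, t, r, b):
    """Equation (2): close to 1 at the box centre and close to 0 near its border."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

# Example: a location slightly off-centre inside a 100 x 60 box.
tgt = fcos_targets(55, 35, (0, 0, 100, 60))
print(tgt, centerness(*tgt))      # (55, 35, 45, 25) ~0.76
```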
3. Methods
This chapter gives a detailed overview of the improved parts of the algorithm (the complete structure is shown in Figure 1). Deformable convolution [8] is added together with a modulation (suppression) factor that can reduce the influence of noise and background. The added bottom-up module significantly reduces the loss of information on its way to the top layers. The balanced module feeds the integrated feature map into a nonlocal attention structure and then redistributes it to obtain new feature maps. With these changes, the improved FCOS increases accuracy considerably without adding much computation.
[figure omitted; refer to PDF]
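To summarize how the pieces fit together, the following is a high-level sketch of the data flow described in this chapter; the module names are placeholders for illustration and do not correspond to the authors' released code.

```python
import torch.nn as nn

class ImprovedFCOS(nn.Module):
    """Backbone with DCN -> FPN -> bottom-up path -> balanced module -> FCOS heads."""

    def __init__(self, backbone, fpn, bottom_up, balanced, bbox_head):
        super().__init__()
        self.backbone = backbone    # ResNet with deformable convolution in C3-C5
        self.fpn = fpn              # standard top-down FPN
        self.bottom_up = bottom_up  # PAN-style bottom-up path augmentation
        self.balanced = balanced    # gather -> nonlocal refine -> scatter
        self.bbox_head = bbox_head  # shared FCOS classification/regression/centerness head

    def forward(self, images):
        c_feats = self.backbone(images)    # C2-C5 feature maps
        p_feats = self.fpn(c_feats)        # top-down pyramid
        p_feats = self.bottom_up(p_feats)  # short bottom-up path (Section 3.2)
        p_feats = self.balanced(p_feats)   # balanced module (Section 3.3)
        return self.bbox_head(p_feats)     # per-level cls, reg, and centerness outputs
```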
3.1. Deformable Convolution
For example, for the sampling grid $R$ of a standard convolution, the modulated form of deformable convolution computes the output at a location $p_{0}$ as $y(p_{0}) = \sum_{p_{k} \in R} w(p_{k}) \cdot x(p_{0} + p_{k} + \Delta p_{k}) \cdot \Delta m_{k}$, where $\Delta p_{k}$ is the learned offset of the $k$-th sampling point and $\Delta m_{k} \in [0, 1]$ is the learned modulation (suppression) factor that weakens the contribution of noisy or background positions.
In the experiments, DCN [8] was added to the C3–C5 stages of the backbone, which brought a considerable increase in accuracy, and we also include the C2 feature map to further improve the learning ability of the model.
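In mmdetection, switching selected backbone stages to deformable convolution is a configuration change. The fragment below sketches such a configuration; the keys follow mmdetection's public ResNet options as we understand them and are an assumption, not a copy of the authors' config file.

```python
# Sketch of an mmdetection backbone config with DCN enabled on the C3-C5 stages.
model = dict(
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),   # export C2-C5 to the neck
        dcn=dict(type='DCN', deform_groups=1, fallback_on_stride=False),
        stage_with_dcn=(False, True, True, True),   # no DCN on C2, DCN on C3-C5
    )
)
```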
3.2. Improved FPN
We modified the neck module of FCOS [22]. The original FPN [6] improves detection mainly by fusing high-level and low-level features, which especially benefits small targets. As is well known, high-level features carry semantic information, while low-level features contain more detailed descriptions. Inspired by PAN [7] (the champion of the instance segmentation competition that year), this paper adds a bottom-up path augmentation module, shown in Figure 4, after the traditional FPN [6]. In the FPN [6], information flows top-down, and transferring shallow features to the top layer requires dozens or even more than one hundred network layers; after such a long path, the shallow feature information is seriously lost. With the bottom-up path augmentation added here, shallow features reach P2 through the lateral connections of the original FPN and are then passed from P2 to the top level along the bottom-up path. Fewer than 10 layers are traversed, which better preserves the shallow feature information; a small sketch of this path is given after the figure.
[figure omitted; refer to PDF]
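The following PyTorch sketch illustrates the bottom-up path augmentation described above, assuming all pyramid levels already share the same number of channels; the class and variable names are ours.

```python
import torch.nn as nn

class BottomUpPath(nn.Module):
    """PAN-style augmentation: N_i = conv(downsample(N_{i-1}) + P_i), from fine to coarse."""

    def __init__(self, channels=256, num_levels=5):
        super().__init__()
        self.down = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_levels - 1)])
        self.fuse = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(num_levels - 1)])

    def forward(self, feats):        # feats: FPN outputs ordered from the finest level up
        outs = [feats[0]]            # the lowest level starts the new bottom-up path
        for i in range(1, len(feats)):
            # Each level is reached from the bottom in a single downsampling step.
            n = self.down[i - 1](outs[-1]) + feats[i]
            outs.append(self.fuse[i - 1](n))
        return outs
```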
3.3. Balanced Module
Figure 6 shows the specific form of the nonlocal attention. First, the input feature map $x$ is transformed by three $1 \times 1$ convolutions into the embeddings $\theta(x)$, $\phi(x)$, and $g(x)$.
[figure omitted; refer to PDF]
The embeddings $\theta(x_{i}) = W_{\theta} x_{i}$ and $\phi(x_{j}) = W_{\phi} x_{j}$, corresponding to parts 1 and 2 in the structure diagram, are used to compute the Gaussian distance in the embedding space, $f(x_{i}, x_{j}) = e^{\theta(x_{i})^{T} \phi(x_{j})}$ (Equations (7) and (8)).
Then, the three features above are reshaped so that only the channel dimension is kept, the correlation is computed by matrix multiplication of $\theta$ and $\phi$, and the weights are normalized to the range 0~1 by a softmax operation, $w_{ij} = \operatorname{softmax}_{j}\left(\theta(x_{i})^{T} \phi(x_{j})\right)$.
Sorting out Equations (9) and (10) gives the response at each position as a weighted sum over all positions, $y_{i} = \sum_{j} w_{ij}\, g(x_{j})$.
Finally, the attention output is multiplied back onto the feature matrix: it is projected by $W_{z}$ and added to the input through a residual connection, $z_{i} = W_{z} y_{i} + x_{i}$, so the refined balanced feature can be scattered back to strengthen every pyramid level.
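A compact PyTorch sketch of this embedded-Gaussian nonlocal block, together with the gather-refine-scatter step of the balanced module, is given below; it is our own rendering of the idea under the assumption that all pyramid levels have 256 channels, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian nonlocal attention: z = W_z softmax(theta^T phi) g + x."""

    def __init__(self, channels=256, reduction=2):
        super().__init__()
        inter = channels // reduction
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)     # (b, hw, c')
        phi = self.phi(x).flatten(2)                         # (b, c', hw)
        g = self.g(x).flatten(2).transpose(1, 2)             # (b, hw, c')
        attn = torch.softmax(theta @ phi, dim=-1)            # pairwise weights in [0, 1]
        y = (attn @ g).transpose(1, 2).reshape(b, -1, h, w)  # weighted aggregation
        return x + self.out(y)                               # residual connection

def balanced_refine(feats, block, mid=2):
    """Resize all levels to the middle one, average them, refine with nonlocal
    attention, and add the refined map back to every level (Libra-style balance)."""
    size = feats[mid].shape[-2:]
    gathered = torch.stack([
        F.adaptive_max_pool2d(f, size) if f.shape[-1] > size[-1]
        else F.interpolate(f, size=size, mode='nearest')
        for f in feats]).mean(0)
    refined = block(gathered)
    return [f + F.interpolate(refined, size=f.shape[-2:], mode='nearest') for f in feats]
```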
3.4. Loss Function
The final loss function is the same as in FCOS [22]: $L(\{p_{x,y}\}, \{t_{x,y}\}) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}(p_{x,y}, c^{*}_{x,y}) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y} > 0\}} L_{reg}(t_{x,y}, t^{*}_{x,y})$, plus the BCE loss [34] on the centerness branch.
In order to highlight the improvements, we have not changed the loss function of the original algorithm. Here, $L_{cls}$ is the focal loss, $L_{reg}$ is the IoU loss, $N_{pos}$ is the number of positive samples, $\lambda$ balances the two terms, $c^{*}_{x,y}$ is the class label of location $(x, y)$, and $t^{*}_{x,y}$ denotes its regression targets.
4. Experiment
Our experiments perform detection on three diverse datasets, UA-DETRAC, MS COCO 2017, and Pascal VOC, used jointly for training (from each dataset, only pictures containing the car, bus, and truck categories are used).
4.1. Experimental Details
UA-DETRAC (a multitarget tracking dataset recorded on different roads in Beijing and Tianjin, China) contains various weather conditions, such as cloudy, night, sunny, and rainy, and its occlusion labels range from unoccluded to heavily occluded; the video was recorded at 25 frames per second, giving about 130,000 pictures. Because consecutive frames are highly similar, this article keeps one frame out of every 40 and increases the contrast to prevent overfitting. The second dataset consists of the vehicle pictures selected from Pascal VOC2012. The third dataset is MS COCO 2017; the full name of COCO is "Common Objects in Context," a dataset provided by the Microsoft team for image recognition, whose images are divided into training, validation, and test sets. This article samples all vehicle pictures from the COCO training and validation sets, and the combined dataset contains about 20,000 images (the specific numbers are in Table 1). All annotations are stored in the VOC format; 90% of the data is used for training and validation, and the remaining 10% is used for testing.
Table 1
The distribution of the dataset (number of images).
Dataset | DETRAC | Pascal-VOC2012 | COCO
Images | 3457 | 467 | 16977
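A small sketch of the frame subsampling and the 90/10 split described in Section 4.1; the helper names and the file-list handling are placeholders for illustration.

```python
import random

def subsample_frames(frame_paths, step=40):
    """Keep one frame out of every `step` to thin out the near-duplicate DETRAC frames."""
    return frame_paths[::step]

def split_dataset(samples, test_ratio=0.1, seed=0):
    """Hold out 10% of the images for testing; the rest is used for training/validation."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_test = int(len(samples) * test_ratio)
    return samples[n_test:], samples[:n_test]   # (trainval, test)
```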
The experiments in this article are all based on the mmdetection [36] framework developed by SenseTime, which is built on top of PyTorch. It divides a detection algorithm into several major modules (backbone, neck, head, bbox encode/decode, and loss), decoupling the connections between them. This article uses 6 Nvidia TITAN Xp GPUs to train the network, and all parameter settings are consistent with the official mmdetection defaults, where the learning rate scales linearly with the number of GPUs.
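For clarity, a sketch of the linear scaling rule mentioned above; the base values are mmdetection's usual 8-GPU defaults and are stated here as an assumption rather than taken from the paper.

```python
# Linear scaling rule: the learning rate grows in proportion to the total batch size.
base_lr = 0.01                      # assumed default for an 8-GPU x 2-images-per-GPU setup
base_batch = 8 * 2
num_gpus, imgs_per_gpu = 6, 2       # the 6-GPU configuration used in this article
lr = base_lr * (num_gpus * imgs_per_gpu) / base_batch
print(lr)                           # 0.0075
```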
4.2. Accuracy Experiment
We report our main results on the test split (approximately 2K images). We first forward the input image through the network and obtain the predicted bounding boxes with their predicted classes. Unless specified otherwise, the postprocessing and data augmentation use the official mmdetection defaults. We hypothesize that the performance of our detector could be improved further by careful hyperparameter tuning. We compare against mainstream algorithms of recent years in Table 2; among them, our method achieves the best performance on this dataset.
Table 2
Our proposed algorithm vs. other algorithms with ResNet-50 and FPN as default.
Model | mAP |
Faster-RCNN [13] | 67.4 |
ATSS [23] | 67.2 |
GFL [37] | 66.8 |
FCOS [22] | 66.8 |
YOLOF [38] | 67.1 |
YOLOv3 [16] | 58.9
SSD [39] | 57.6
DETR [29] | 61.0
CenterNet [40] | 51.3
RetinaNet [26] | 66.0
Ours | 70.1 |
4.3. Model Complexity
We also tested the complexity of each model on the dataset, as shown in Table 3. The table shows that one-stage networks often have fewer parameters than two-stage networks. Our model dramatically improves accuracy while reducing GFLOPs relative to the FCOS baseline, and its number of parameters rises only a little.
Table 3
Model complexity comparison.
Model | Input shape | GFLOPs | Parameters (M) |
Faster-RCNN [13] | 1280 × 800 | 206.67 | 41.13 |
ATSS [23] | 1280 × 800 | 201.51 | 31.89 |
GFL [37] | 1280 × 800 | 204.61 | 32.04 |
FCOS [22] | 1280 × 800 | 196.76 | 31.84 |
YOLOF [38] | 1280 × 800 | 98.21 | 42.11 |
YOLOv3 [16] | 1280 × 800 | 193.89 | 61.53
SSD [39] | 1280 × 800 | 343.77 | 24.68
DETR [29] | 1280 × 800 | 101.34 | 20.09
CenterNet [40] | 1280 × 800 | 51.02 | 14.21
RetinaNet [26] | 1280 × 800 | 205.24 | 36.15
Ours | 1280 × 800 | 174.39 | 35.1 |
4.4. Ablation Experiment
This section analyzes the ablation experiments on the improved network (Table 4). The DCN [8] expands the receptive field of the convolution kernel and lets the sampling points shift with the deformation of the object; its visualization in Figure 7 shows that the regions the features focus on differ markedly. The experimental results in Table 4 make it evident that the combination of the three improvements performs best. Figure 8 shows the line chart of the cls and bbox losses; compared with FCOS, our method converges faster and more stably on cls.
Table 4
Ablation study for the proposed methods.
Method | mAP |
FCOS [22] | 66.8 |
+Improved FPN | 67.2 |
+DCN [8] | 68.4 |
+Balanced module | 67.9 |
+All | 70.1 |
[figures omitted; refer to PDF]
[figures omitted; refer to PDF]
4.5. Visualization of Results
Figure 9 visualizes the output of the algorithm. Vehicles are detected whether they are far away or in unusual scenes, even when the picture carries little information, which demonstrates the effectiveness of our algorithm.
[figures omitted; refer to PDF]
5. Conclusions
We improved the detection algorithm based on the anchor-free FCOS [22]. We introduced DCN [8] into the original backbone to broaden the receptive field of the convolution kernel, added a bottom-up module to improve the FPN and reduce the loss during information transmission, and added a balance module because the variance mismatch across the feature pyramid affects accuracy. The experiments show that these changes work well and demonstrate the advantage of the improved algorithm.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 61605083) and Jiangsu Provincial Key Research and Development Program (China).
[1] J. Long, E. Shelhamer, T. Darrell, "Fully convolutional networks for semantic segmentation," Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, .
[2] K. He, X. Zhang, S. Ren, J. Sun, "Deep residual learning for image recognition," Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, .
[3] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, "Densely connected convolutional networks," Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, .
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, "Going deeper with convolutions," Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, .
[5] H. Zhu, X. Li, P. Zhang, G. Li, J. He, H. Li, K. Gai, "Learning tree-based deep model for recommender systems," Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1079-1088, .
[6] T. Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, "Feature pyramid networks for object detection," Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117-2125, .
[7] W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, C. Shen, "Efficient and accurate arbitrary-shaped text detection with pixel aggregation network," Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, pp. 8440-8449, .
[8] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, Y. Wei, "Deformable convolutional networks," Proceedings of the 2017 IEEE International Conference on Computer Vision, pp. 764-773, .
[9] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, D. Lin, "Libra r-CNN: towards balanced learning for object detection," Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 821-830, .
[10] N. Dalal, B. Triggs, "Histograms of oriented gradients for human detection," pp. 886-893, DOI: 10.1109/CVPR.2005.177, .
[11] Z. Guo, L. Zhang, D. Zhang, "A completed modeling of local binary pattern operator for texture classification," IEEE Transactions on Image Processing, vol. 19 no. 6, pp. 1657-1663, DOI: 10.1109/tip.2010.2044957, 2010.
[12] P. Viola, M. Jones, "Rapid object detection using a boosted cascade of simple features," Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,DOI: 10.1109/CVPR.2001.990517, .
[13] S. Ren, K. He, R. Girshick, J. Sun, "Faster r-cnn: towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, pp. 91-99, 2015.
[14] Z. Cai, N. Vasconcelos, "Cascade R-Cnn: delving into high quality object detection," Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154-6162, .
[15] J. Redmon, A. Farhadi, "YOLO9000: better, faster, stronger," Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263-7271, .
[16] J. Redmon, A. Farhadi, "Yolov3: an incremental improvement," 2018. http://arxiv.org/abs/1804.02767
[17] J. Wang, K. Chen, S. Yang, C. Change Loy, D. Lin, "Region proposal by guided anchoring," Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2965-2974, .
[18] Y. Cao, J. Xu, S. Lin, F. Wei, H. Hu, "Gcnet: non-local networks meet squeeze-excitation networks and beyond," Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, .
[19] C. Zhu, Y. He, M. Savvides, "Feature selective anchor-free module for single-shot object detection," Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 840-849, .
[20] X. Lu, B. Li, Y. Yue, Q. Li, J. Yan, "Grid R-CNN," Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7363-7372, .
[21] K. Hu, Y. Zhang, C. Weng, P. Wang, Z. Deng, Y. Liu, "An underwater image enhancement algorithm based on generative adversarial network and natural image quality evaluation index," Journal of Marine Science and Engineering, vol. 9 no. 7,DOI: 10.3390/jmse9070691, 2021.
[22] Z. Tian, C. Shen, H. Chen, H. Tong, "Fcos: a simple and strong anchor-free object detector," IEEE Transactions on Pattern Analysis and Machine Intelligence,DOI: 10.1109/TPAMI.2020.3032166, 2020.
[23] S. Zhang, C. Chi, Y. Yao, Z. Lei, S. Z. Li, "Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection," Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9759-9768, .
[24] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, "You only look once: unified, real-time object detection," Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 779-788, .
[25] S. Zhang, L. Wen, X. Bian, Z. Lei, S. Z. Li, "Single-shot refinement neural network for object detection," Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, pp. 4203-4212, .
[26] T. Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, "Focal loss for dense object detection," Proceedings of the 2017 IEEE International Conference on Computer Vision, pp. 2980-2988, .
[27] B. Zhu, J. Wang, Z. Jiang, F. Zong, S. Liu, Z. Li, J. Sun, "Autoassign: differentiable label assignment for dense object detection," 2020. http://arxiv.org/abs/2007.03496
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, "Attention is all you need," Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), pp. 5998-6008, .
[29] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, "End-to-End object detection with transformers," pp. 213-229, DOI: 10.1007/978-3-030-58452-8_13, .
[30] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dang, "Deformable detr: deformable transformers for end-to-end object detection," 2020. http://arxiv.org/abs/2010.04159
[31] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, "An image is worth 16 x 16 words: transformers for image recognition at scale," 2020. http://arxiv.org/abs/2010.11929
[32] Y. Qu, M. Xia, Y. Zhang, "Strip pooling channel spatial attention network for the segmentation of cloud and cloud shadow," Computers & Geosciences, vol. 157,DOI: 10.1016/j.cageo.2021.104940, 2021.
[33] M. Xia, X. Zhang, W. A. Liu, L. Weng, Y. Xu, "Multi-stage feature constraints learning for age estimation," IEEE Transactions on Information Forensics and Security, vol. 15 no. 1, pp. 2417-2428, DOI: 10.1109/tifs.2020.2969552, 2020.
[34] Z. Zhang, M. R. Sabuncu, "Generalized cross entropy loss for training deep neural networks with noisy labels," Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS), .
[35] G. Brazil, X. Liu, "M3d-rpn: monocular 3d region proposal network for object detection," Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, pp. 9287-9296, .
[36] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. Change Loy, D. Lin, "MMDetection: open mmlab detection toolbox and benchmark," 2019. http://arxiv.org/abs/1906.07155
[37] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, J. Yang, "Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection," 2020. http://arxiv.org/abs/2006.04388
[38] Q. Chen, Y. Wang, T. Yang, X. Zhang, J. Chen, J. Sun, "You only look one-level feature," Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13039-13048, .
[39] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, "SSD: single shot multibox detector," Proceedings of the Computer Vision-ECCV 2016, pp. 21-37, DOI: 10.1007/978-3-319-46448-0_2, .
[40] X. Zhou, D. Wang, P. Krähenbühl, "Objects as points," , 2019. http://arxiv.org/abs/1904.07850
Copyright © 2021 Fei Yan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/
Abstract
Vehicle detection is an important component of both intelligent transportation and autonomous driving, yet it still faces many problems, such as inaccurate localization and low detection accuracy in complex scenes. FCOS, a representative anchor-free detection algorithm, caused a sensation when it appeared but now looks slightly dated. Motivated by this situation, we propose an improved FCOS algorithm. The improvements are as follows: (1) we introduce deformable convolution into the backbone to solve the problem that the receptive field cannot cover the whole target; (2) we add a bottom-up information path after the FPN of the neck module to reduce the loss of information during propagation; (3) we introduce a balanced module, following the balance principle, which reduces the inconsistent detection of the bbox head caused by the mismatched variances of different feature maps. For the comparative experiments, we extracted recent vehicle images from UA-DETRAC, COCO, and Pascal VOC. The experimental results show that our method achieves good results on this dataset.
Details
1 College of Automation, Nanjing University of Information Science & Technology, Nanjing 210044, China; Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing 210044, China
2 Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing 210044, China
3 College of Energy and Electrical Engineering, Hohai University, Nanjing 21100, China