1. Introduction
Object detection is an important and challenging field in computer vision, one which has been the subject of extensive research [1]. The goal of object detection is to detect all objects and class the objects. It has been widely used in autonomous driving [2], pedestrian detection [3], medical imaging [4], industrial detection [5], robot vision [6], intelligent video surveillance [7], remote sensing images [8], etc.
In recent years, deep learning techniques have been applied in object detection [9]. Deep learning uses low-level features to form more abstractive high-level features, and to hierarchically represent the data in order to improve object detection [10]. Compared with traditional detection algorithms, the deep learning-based object detection method based has better performance in terms of robustness, accuracy and speed for multi-classification tasks.
Object detection methods based on deep learning mainly include region proposal-based methods, and those based on a unified pipeline framework. The former type of method firstly generates a series of region proposals from an input image, and then uses a convolution neural network to extract features from the generated regions and construct a classifier for object classes. The region-based convolution neural network (R-CNN) method [11] is the earlier method used to introduce convolution neural networks into the field of object detection. It uses the selection search method to generate region proposals from the input images, and uses a convolution neural network to extract features from the generated region proposals. The extracted features are used to train the support vector machine. Based on the R-CNN method, Fast R-CNN [12] and Faster R-CNN [13] were also proposed to reduce training time and improve the mean average precision. Although region proposal-based methods have higher detection accuracy, the structure of the method is complex, and object detection is time-consuming. The latter type of method (based on a unified pipeline framework) directly predicts location information and class probabilities of objects with a single-feed forward convolution neural network from the whole image, and does not require the generation of region proposals and post-classification. Therefore, the structure of the unified pipeline framework approach is simple and can detect objects quickly; however, it is less accurate than the region proposal-based approach. The two kinds of methods have different advantages and are suitable for different applications. In this paper, we mainly discuss the unified pipeline framework-based approach.
Researchers have proposed a wide range of unified pipeline framework-based methods in recent years, one of which is the You Only Look Once v2 (YOLOv2) method [14]. YOLOv2 uses batch normalization to improve convergence and prevent overfitting, and anchor boxes to predict bounding boxes, in order to increase the recall. Other innovations include a high-resolution classifier, direct location prediction, dimension cluster, and multi-scale training, all of which lend greater detection accuracy. Pedoeem and Huang recently proposed a shallow real-time detection method for Non-GPU computers based on the YOLOv2 method [15]; their method reduces the size of input image by half in order to speed up the detection speed, and removes the batch normalization of shallow layers in order to reduce the amount of model parameters. Shafiee et al. have proposed the Fast YOLO method, whereby YOLOv2 can be applied to embedded devices [16]: this employs an evolutionary deep intelligence framework to generate an optimized network architecture. The optimized network architecture can be used in the motion-adaption inference framework to speed up the detection process and thus reduce the energy consumption of the embedded device. Simon et al. have developed a complex-YOLO method [17], which uses a specific complex regression strategy to estimate multi-class 3D boxes in Cartesian space for detecting RGB images; the authors report a significant improvement in the speed of 3D object detection. Liu et al. have developed the Single Shot MultiBox Detector (SSD) method [18], which generates multi-scale feature maps in order to detect objects of different sizes. This method strikes a careful balance between speed and accuracy of detection, but the expression ability of the feature map is insufficient in the shallow layer. In order to enhance the expression ability of shallow feature maps, Fu et al. have proposed the Deconvolutional Single Shot Detector (DSSD) method [19], which uses the ResNet extraction network (generating better features) [20], a deconvolution layer and skip connection in order to improve the expression ability of shallow feature maps. In order to improve the detection accuracy of the SSD method for small objects, Qin et al. have proposed a new SSD method based on the feature pyramid [21]. Their method applies a deconvolution network in the high-level of the feature pyramid in order to extract semantic information, and expands the convolution network so as to learn low-level position information. Their method constructs a multi-scale detection structure so as to improve the detection accuracy of small objects. Redmon and Farhadi have proposed applying the YOLOv3 method for using binary cross-entropy loss for class predictions [22], which employs scale prediction to predict boxes at different scales, and thus improves the detection accuracy with regard to small objects.
In this Section 1, we have reviewed recent developments related to object detection. In Section 2, we outline the concepts and processes of the YOLOv3 object detection method, and in Section 3 we describe our proposed method. In Section 4, we illustrate and discuss our simulation results.
2. YOLOv3 The YOLOv3 method considers object detection as a regression problem. It directly predicts class probabilities and bounding box offsets from full images with a single feed forward convolution neural network. It completely eliminates region proposal generation and feature resampling, and encapsulates all stages in a single network in order to form a true end-to-end detection system.
The YOLOv3 method divides the input image intoS×Ssmall grid cells. If the center of an object falls into a grid cell, the grid cell is responsible for detecting the object. Each grid cell predicts the position information ofBbounding boxes and computes the objectness scores corresponding to these bounding boxes. Each objectness score can be obtained as follows:
Cij=Pi,j(Object)∗IOUpredtruth
wherebyCijis the objectness score of thejth bounding box in theith grid cell.Pi,j(Object)is merely a function of the object. TheIOUpredtruthrepresents the intersection over union (IOU) between the predicted box and ground truth box. The YOLOv3 method uses binary cross-entropy of predicted objectness scores and truth objectness scores as one part of loss function. It can be expressed as follows:
E1=∑i=0S2∑j=0BWijobj[C^ijlog(Cij)−(1−C^ij)log(1−Cij)]
wherebyS2is the number of grid cells of the image, andBis the number of bounding boxes. TheCijandC^ijare the predicted abjectness score and truth abjectness score, respectively.
The position of each bounding box is based on four predictions:tx,ty,tw,th, on the assumption that(cx,cy)is the offset of the grid cell from the top left corner of the image. The center position of final predicted bounding boxes is offset from the top left corner of the image by(bx,by). Those are computed as follows:
bx=σ(tx)+cxby=σ(ty)+cy
wherebyσ()is a sigmoid function. The width and height of the predicted bounding box are calculated thus:
bw=pw etwbh=ph eth
wherebypw,phare the width and height of the bounding box prior. They are obtained by dimensional clustering.
The ground truth box consists of four parameters (gx,gy,gwandgh), which correspond to the predicted parametersbx,by,twandth, respectively. Based on (3) and (4), the truth values oft^x,t^y,t^wandt^hcan be obtained as follows:
σ(t^x)=gx−cxσ(t^y)=gy−cyt^w=log(gw/pw)t^h=log(gh/ph)
The YOLOv3 method uses the square error of coordinate prediction as one part of loss function. It can be expressed as follows:
E2=∑i=0S2∑j=0BWijobj[(σ(tx)ij−σ(t^x)ij)2+(σ(ty)ij−σ(t^y)ij)2]+∑i=0S2∑j=0BWijobj[((tw)ij−(t^w)ij)2+((th)ij−(t^h)ij)2]
3. Proposed Method
Before developing the YOLOv3 model, it is necessary to determine the width and height of bounding box priors (pwandphin (4) and (5), respectively), as they directly affect the performance of the YOLOv3 method. In the YOLOv3 method, it uses k-means clustering algorithm to select the representative width and height of bounding box priors to avoid consuming much time in adjusting the width and height. The complexity of k-means clustering method is expressed asO(nkd)for the data based onddimension andkcluster centers, wherebyn is the number of data. The larger the dataset, the more time-consuming the modelling process. In addition, the YOLOv3 method is sensitive to the initial cluster center. To overcome this problem, we apply the AFK-MC2 method [23] in order to estimate the width and height of bounding box priors.
For the purpose of convenience, we suppose that the width and height of ground truth boxes areφ={(w1,h1),(w2,h2),⋯,(wn,hn)}. Firstly, we randomly select a couple of width and height values(wi,hi)as one initial cluster centerc1from the setφ. To obtain the otherk−1initial cluster centers, we repeat the following procedure fork−1times in order to buildk−1Markov chains with lengthm. The procedure begins by computing all proposal distributions, orq(φj). Eachq(φj)is calculated as follows:
q(φj)=d(φj,c1)∑i=1nd(φi,c1)+1n
wherebyφj∈φ,j=1,2,⋯,n,c1is the first initial cluster center. The AFK-MC2 method directly uses the Euclidean distance to compute the distance between two parameters. In this paper, we use the intersection over union method to compute distance. This is expressed as:
d(φj,c1)=min(1−IOU(φj,c1))
wherebyIOU(φj,c1)is the intersection over union betwee j-th bounding boxesφj=(wj,hj)and the first initial cluster centerc1=(wi,hi). It is used to measure the overlap betweenφjandc1. If theIOU(φj,c1)is larger, it means that there are more overlaps betweenφjandc1.d(φj,c1)is distance fromφjto the initial cluster centerc1.
Secondly, we randomly select a couple of width and height valuesφias an initial point of the Markov chain. For the other points in the same Markov chain, we select a candidateφtfrom setφbased on proposal distributionq(φ)from setφ, and compute the sampling probabilityp(φt)as follows:
p(φt)=d(φt,C)∑i=1nd(φi,C)
wherebyCis the set of selected cluster centers,d(φt,C)is minimum value ofd(φt,ci)(i=1,2,⋯,k). We compute the distance from candidateφtcluster center in setCusing equation (8), and select the minimum value asd(φt,C). Based on the sampling probability and proposal distribution ofφt, we can compute the acceptance probability thatφtcan be accepted as the next point in the Markov chain. This can be expressed as follows:
α(φt,φt−1)=min(1,p(φt)p(φt−1)×q(φt−1)q(φt))=min(1,d(φt,C)d(φt−1,C)×q(φt−1)q(φt))
wherebyφt−1is the current point in the Markov chain. If the acceptance probabilityα(φt,φt−1)is greater than the thresholdN∈Unif(0,1), thenφtcan be accepted as the next point in the Markov chain. Otherwise,φt−1is also used as the next point in the Markov chain. Therefore, we can construct a Markov chain with lengthm. Based on the above procedure, we can constructk−1different Markov chains with lengthmand use the last point of every Markov chains as the initial cluster center. The obtained initial cluster centers of the Markov chains and the randomly set initial cluster center formkinitial cluster centersC=[c1,c2,⋯ck].
In constructing a Markov chain, each candidate point requires to calculate the distance between the candidate point and the selected cluster centers, if the selected candidate point is a point that has been selected as the cluster center, the distance between the candidate point and the selected cluster centers is 0, and the acceptance probability of the candidate point is 0. The candidate point will not be used as a point in the Markov chain. This avoids using the selected cluster center as one point of Markov chain in constructing different Markov chains. Therefore, the selected initial cluster centers are different.
Thirdly, we randomly selectSpoints from setφto form setΦ, and compute the distances between the each point in theΦandkcluster centers. If one point is closest to one cluster center, we assign the point to the cluster at which the cluster center is located. Therefore, we can constructkclusters usingSpoints andkcluster centers. We use the all-points mean in every cluster as the new cluster center, i.e.,
pw i=1|H|∑j=1Hwi,j
ph i=1|H|∑j=1Hhi,j
whereby(pw i,ph i)is the new cluster center in the new clusterϑi,i=1,2,⋯,k, and is the number of points in the clusterϑi,ϑi=[(wi,1,hi,1),(wi,2,hi,2),⋯,(wi,H,hi,H)]. Next, we reselectSpoints from setφand compute the distances between the points andk new cluster centers. We also construct the new cluster according the computed distance and use equations (11) and (12) to obtain a new cluster center. If the new cluster center is invariant, we can obtain the final cluster center. The flowchart of our proposed method is shown in Figure 1.
The k-means method used in YOLOv3 randomly selectskcouples of width and height values as initial cluster centers, so the k-means method is sensitive to the initial cluster center. Secondly, it requires computing the distances between all points andkcluster centers, and this will consume a large amount of time for large-scale detection dataset in adjusting cluster centers. Our proposed method only randomly selects a couple of width and height values as one initial cluster, and then selectsk−1cluster centers by constructingk−1different Markov chains of lengthm. Therefore, our proposed method reduces the sensitivity to the initial cluster centers. Besides, we only randomly selectSpoints instead of all points from setφ, and compute distance between theSpoints andkcluster centers. It requires a shorter running time compared to the k-means method, especially for large-scale detection datasets. Therefore, using the YOLOv3 method, we can use the cluster centers as the width and height of bounding box priors so as to realize object prediction.
4. Simulation and Discussion
In this paper, we used two datasets, PASCAL VOC (Pattern Analysis, Statical Modeling and Computational Learning Visual Object Classes) and MS COCO (Microsoft Common Objects in Context) [24]. The PASCAL VOC is a standardized dataset for image classification and object detection. The images contained in the PASCAL VOC dataset are from real scenes. These objects can be divided into twenty classes. There are 9963 images in the datasets which contain 24,640 annotated objects. The MS COCO is an authoritative and important benchmark tool used in the field of object recognition and detection. It is also used in the YOLOv3 method. It contains 117,264 training images and more than 5000 testing images with 80 classes. The Ubuntu 18.04 system is used for the simulations, and the method employs an Intel Xeon E5-2678 v3 CPU. The GPU is NVIDIA GeForce GTX 1080Ti, and the deep learning framework is PyTorch. The size of each image is416×416. The learning rate, momentum and decay are 0.001, 0.9 and 0.0005, respectively. The number of training images is 64 per batch. Our YOLOv3 model uses three output feature maps with different scales to detect differently sized objects, and we have tested it on 3, 6, 9, 12, 15 and 18 candidate cluster centers.
We use Avg IOU (Average Intersection over Union) between the boxes that are generated by using cluster centers and all ground truth boxes in order to measure the performance of each cluster method. This can be expressed as follows:
Avg IOU=1N∑j=1N(maxi∈[1,⋯,k]Ci∩ΨjCi∪Ψj)
wherebyNis the number of ground truth boxes (that is,Ψ=[Ψ1,Ψ2,⋯,ΨN]),kis the number of cluster centers, andCiis the box generated using the width and height in the cluster centers. The larger the Avg IOU value, the better the clustering effect. We also use recall, mean value of average precision (mAP), and F1-score to measure the performance of different methods. The recall is the ratio of the number of objects that are successfully detected and the number of samples that contain the detected objects. The mAP is the mean value of average precision for the detection of all classes. The F1-score is the harmonic mean of precision and recall, the maximum value is 1 and the minimum value is 0.
Below, we compare the performance of the proposed cluster method and AFK-MC2 method in terms of estimating the initial width and height of predicting boxes. The length of the Markov chain used in simulations for two methods is 200. We have also tried to increase and decrease the length of the Markov chain. When the length is increased, the Avg IOU and running time are also increased. When the length is decreased, the Avg IOU and running time are also decreased. For simplicity, we used 200 as the length of the Markov chain that is also used in [23]. On the MS COCO datasets, the Avg IOUs obtained by our proposed method and the AFK-MC2 method are shown in Figure 2: it can be seen that our proposed method has a larger Avg IOU than that of the AFK-MC2 method for a different number of cluster centers. This means that the proposed method has better performance than the AFK-MC2 method in terms of estimating the initial width and height of predicting boxes.
In order to compare the detection performance of the original YOLOv3 method and that based on our proposed cluster method, we use the same cluster center number that is used in the former method. The cluster center number is 9, and the results on the MS COCO and PASCAL VOC datasets are shown in Table 1 and Table 2, respectively. In Table 1, the Avg IOU values for proposed cluster method and k-means method used in YOLOv3 are 60.44 and 59.88, respectively. The running time for proposed cluster method and k-means method used in YOLOv3 are 1183.083s and 3.972 s, respectively. The running time of k-means used in YOLOv3 is about 297 times that of our proposed cluster method. This shows that the proposed cluster has a larger Avg IOU and smaller running time than the k-means method used in YOLOv3. In Table 2, the Avg IOU values for proposed cluster method and k-means method used in YOLOv3 are 67.34 and 67.45, respectively. The running time for the proposed cluster method and the k-means method used in YOLOv3 are 19.337 s and 0.239 s, respectively. The running time of k-means used in YOLOv3 is about 81 times that of our proposed cluster method. This also shows that the proposed cluster has a larger Avg IOU and smaller running time than the k-means method used in YOLOv3. The k-means method used in YOLOv3 requires computing the distance between all points andkcluster centers. This will consume a large amount of time for large-scale dataset detection. While we only randomly selectSpoints instead of all points from setφ, and compute distance between theSpoints andkcluster centers. Therefore, it requires a smaller running time compared with the k-means method, especially for large-scale detection dataset. The size of the MS COCO dataset is larger than PASCAL VOC dataset, so the difference of running time between our proposed method and the k-means method used in YOLOv3 is larger for the MS COCO dataset than for the PASCAL VOC dataset.
Table 3 shows the comparisons between the original YOLOv3 and improved YOLOV3 method (based on our proposed cluster method) on the MS COCO dataset: it can be seen that our YOLOv3 method produces larger recall, mAP and F1-score values, and therefore has better detection accuracy than the original YOLOv3 method.
We also randomly selected five images from the test sets of the MS COCO dataset in order to test the performance of small object detection; the object detection results are shown in Figure 3. Subfigures (a), (c), (e), (g) and (i) show object detection results generated using the original YOLOv3 method, and subfigures (b), (d), (f), (h) and (j) show the object detection results generated using our proposed method. For the first image, the YOLOv3 method detected three objects, while our proposed method detected four objects (subfigures (a) and (b)). With the second image, the YOLOv3 method detected three objects, while our proposed method detected four objects (subfigures (c) and (d)). For the first and second image, our proposed method detected more objects, and it has higher scores in terms of detecting small objects. With the third image, the YOLOv3 method and our proposed method detected three objects (subfigures (e) and (f)), and our proposed method has higher scores in terms of detecting objects, especially small objects such as people in the distance and skateboards. With the fourth image, the YOLOv3 method and our proposed method detected two objects (subfigures (g) and (h)), and our proposed method has higher scores in terms of detecting objects, especially cups. With the fifth image, the YOLOv3 method and our proposed method detected three objects (subfigures (i) and (j)), and our proposed method has higher scores in terms of detecting objects, especially the giraffe in the distance. These ten sub-figures indicate that our proposed method has better performance in terms of detecting objects, especially for some small objects such as sports balls, tennis rackets, bottles, people in the distance, skateboards, cups, a giraffe in the distance, etc.
5. Conclusions This paper proposes a new method for initializing the width and height of predicted bounding boxes. Our proposed method has a larger Avg IOU and smaller running time on the MS COCO dataset. The Avg IOU is 60.44%, which is 0.56% higher than original YOLOv3 method, and the running time is 1/297 that of the original YOLOv3 method. For the PASCAL VOC dataset, the average IOU is 67.45%, which is 0.13% higher than original YOLOv3 method, and the running time is 1/81 that of the original YOLOv3 method. It exhibits better performance in terms of initializing the width and height of predicted bounding boxes, as well as in terms of choosing the representative initial width and height. Besides, we randomly selected some images from the test set of the MS COCO dataset for detection. The object detection results indicate that our proposed method detected more objects in some test images. It also has better performance in terms of detecting small objects. Our proposed method also outperforms the original YOLOv3 method in terms of recall, mean average precision, and F1-score.
Figure 2. Avg IOU of the proposed cluster method and the AFK-MC2 method on the MS COCO dataset.
Figure 3. Object detection results using the YOLOv3 method (subfigures (a), (c), (e), (g) and (i), and object detection results using our proposed method (subfigures (b), (d), (f), (h) and (j)).
Figure 3. Object detection results using the YOLOv3 method (subfigures (a), (c), (e), (g) and (i), and object detection results using our proposed method (subfigures (b), (d), (f), (h) and (j)).
| Method | Avg IOU | Running Time |
|---|---|---|
| k-means used inYOLOv3 | 59.88 | 1183.038 |
| Proposed cluster method | 60.44 | 3.972 |
| Method | Avg IOU | Running Time |
|---|---|---|
| k-means used inYOLOv3 | 67.34 | 19.377 |
| Proposed cluster method | 67.45 | 0.239 |
| Method | Recall | mAP | F1-Score |
|---|---|---|---|
| YOLOv3 | 70.5 | 53.2 | 60.6 |
| Proposed cluster method | 71.3 | 53.3 | 61.0 |
Author Contributions
Conceptualization, formal analysis, investigation, and writing the original draft were performed by L.Z. and S.L. Experimental tests were performed by S.L. All authors have read and approved the final manuscript.
Funding
This research was funded by National Natural Science Foundation of China (61271115) and Science and Technology Innovation and Entrepreneurship Talent Cultivation Program of Jilin (20190104124).
Conflicts of Interest
The authors declare no conflict of interest.
1. Hanchinamani, S.R.; Sarkar, S.; Bhairannawar, S.S. Design and Implementation of High Speed Background Subtraction Algorithm for Moving Object Detection. In Proceedings of the IEEE International Conference on Advances in Computing, Communications and Informatics, Jaipur, India, 21-24 September 2016; pp. 367-374.
2. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D Object Detection Network for Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, HI, USA, 21-26 July 2017; pp. 6526-6534.
3. Mao, J.; Xiao, T.; Jiang, Y.; Cao, Z. What Can Help Pedestrian Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, HI, USA, 21-26 July 2017; pp. 3127-3136.
4. Christ, P.F.; Kaissis, G.; Ettlinger, F.; Kaissis, G.; Schlecht, S.; Ahmaddy, F.; Grün, F.; Menze, B.; Valentinitsch, A.; Ahmadi, S.-A.; et al. SurvivalNet: Predicting patient survival from diffusion weighted magnetic resonance images using cascaded fully convolutional and 3D Convolutional Neural Networks. In Proceedings of the IEEE International Conference on International Symposium on Biomedical Imaging, Melbourne, Australia, 18-21 April 2017; pp. 839-843.
5. Weimer, D.; Scholz-Reiter, B.; Shpitalni, M. Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection. CIRP Ann. 2016, 65, 417-420.
6. Senicic, M.; Matijevic, M.; Nikitovic, M. Teaching the methods of object detection by robot vision. In Proceedings of the IEEE International Convention on Information and Communication Technology, Electronics and Microelectronics, Opatija, Croatia, 21-25 May 2018; pp. 558-563.
7. Sreenu, G.; Durai, M. Intelligent video surveillance: A review through deep learning techniques for crowd analysis. J. Big Data 2019, 6, 48-75.
8. Li, K.; Cheng, G.; Bu, S.; You, X. Rotation-Insensitive and Context-Augmented Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2337-2348.
9. Zhou, X.; Gong, W.; Fu, W.; Du, F. Application of deep learning in object detection. In Proceedings of the IEEE/ACIS 16th International Conference on Computer and Information Science, Wuhan, China, 24-26 May 2017; pp. 631-634.
10. Zhao, Z.; Zheng, P.; Xu, S.; Wu, X. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212-3232.
11. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24-27 June 2014; pp. 125-138.
12. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7-13 December 2015; pp. 127-135.
13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans Pattern Anal Mach Intell. 2017, 39, 1137-1149.
14. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. IEEE Trans. Pattern Anal. 2017, 29, 6517-6525.
15. Pedoeem, J.; Huang, R. YOLO-LITE: A Real-Time Object Detection Algorithm Optimized for Non-GPU Computers. In Proceedings of the IEEE International Conference on Big Data, Seattle, WA, USA, 10-13 December 2018; pp. 2503-2510.
16. Shafiee, M.J.; Chywl, B.; Li, F.; Wong, A. Fast YOLO: A Fast You Only Look Once System for Real-time Embedded Object Detection in Video. J. Comput. Vis. Image Syst. 2017, 3, 171-173.
17. Simon, M.; Milz, S.; Amende, K.; Gross, H.M. Complex-YOLO: Real-time 3D Object Detection on Point Clouds. In Proceedings of the IEEE International Conference on European Conference on Computer Vision, Munich, Germany, 8-14 September 2018; pp. 197-209.
18. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the IEEE International Conference on European Conference on Computer Vision, Amsterdam, The Netherlands, 8-16 October 2016; pp. 21-37.
19. Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional single shot detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, HI, USA, 21-26 July 2017; pp. 1-8.
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June-1 July 2016; pp. 770-778.
21. Pinle, Q.; Chuanpeng, L.; Jun, C.; Chai, R. Research on improved algorithm of object detection based on feature pyramid. Multimed. Tools Appl. 2019, 78, 913-927.
22. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. IEEE Trans. Pattern Anal. 2018, 15, 1125-1131.
23. Bachem, O.; Lucic, M.; Hassani, H.; Krause, A. Fast and Provably Good Seedings for k-Means. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5-8 December 2016; pp. 55-63.
24. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the IEEE International Conference on European Conference on Computer Vision, Zurich, Switzerland, 6-12 September 2014; pp. 740-755.
Liquan Zhao* and Shuaiyang Li
Key Laboratory of Modern Power System Simulation and Control & Renewable Energy Technology, Ministry of Education (Northeast Electric Power University), Jilin 132012, China
*Author to whom correspondence should be addressed.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2020. This work is licensed under http://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
The ‘You Only Look Once’ v3 (YOLOv3) method is among the most widely used deep learning-based object detection methods. It uses the k-means cluster method to estimate the initial width and height of the predicted bounding boxes. With this method, the estimated width and height are sensitive to the initial cluster centers, and the processing of large-scale datasets is time-consuming. In order to address these problems, a new cluster method for estimating the initial width and height of the predicted bounding boxes has been developed. Firstly, it randomly selects a couple of width and height values as one initial cluster center separate from the width and height of the ground truth boxes. Secondly, it constructs Markov chains based on the selected initial cluster and uses the final points of every Markov chain as the other initial centers. In the construction of Markov chains, the intersection-over-union method is used to compute the distance between the selected initial clusters and each candidate point, instead of the square root method. Finally, this method can be used to continually update the cluster center with each new set of width and height values, which are only a part of the data selected from the datasets. Our simulation results show that the new method has faster convergence speed for initializing the width and height of the predicted bounding boxes and that it can select more representative initial widths and heights of the predicted bounding boxes. Our proposed method achieves better performance than the YOLOv3 method in terms of recall, mean average precision, and F1-score.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer





