1. Introduction
Conventional fire detection approaches rely on sensors that detect smoke and temperature or on data monitoring of electric and vent valves. With the development of deep learning, Yu and Liu [1] employed an improved Mask R-CNN (Mask Regions with CNN Features) for identification and segmentation of flame images, adopting bottom-up feature fusion and an improved loss function for high-precision detection. Wu [2] replaced the backbone network of YOLOv3 (You Only Look Once Version 3) [3] with DenseNet121 (Densely Connected Convolutional Networks) to strengthen the backbone's ability to capture flame and smoke features, and introduced focal loss for regression. Zhao et al. [4] proposed a target detection approach for complicated environments based on the CenterNet [5] algorithm. Li et al. [6] improved a flame detection model with a network structure built on depthwise separable convolution and adopted several data augmentation techniques to increase detection precision. Luo [7] adopted a YOLOv4 (You Only Look Once Version 4) framework on a UAV (Unmanned Aerial Vehicle) for real-time flame detection.
However, the models used by the above methods are complicated and difficult to deploy, with high computational load and few detection target categories. To address these problems, this study proposes an improved detection model, T-YOLOX, for multitarget detection of flame, smoke, and persons in complicated fire scenarios.
The method is based on the YOLOX [8] architecture. A light attention module adjusts the weight of each channel to improve the overall feature extraction ability of the network; the channel shuffle technique is incorporated to improve the communication between channels, increase the complexity of the model, and avoid overfitting; and the last layer of the backbone network is replaced with a MobileViT (Mobile-friendly Vision Transformer) [9] module, a light transformer [10] module that adds the ability to learn global features and improves the generality of the model. Experiments were conducted to demonstrate the effectiveness and advantages of the method.
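As a hedged illustration of the channel shuffle technique mentioned above, the minimal PyTorch sketch below shows the generic operation popularized by ShuffleNet V2 [19]; the exact position of the operation inside T-YOLOX's CSPLayer (after the concatenation, per the ablation study) is not reproduced here.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Reorder channels so information can flow between channel groups
    after a concatenation (generic sketch, not the paper's exact code)."""
    b, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    # (B, C, H, W) -> (B, g, C/g, H, W) -> swap the two channel axes -> flatten back
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

# Example: shuffle a feature map produced by concatenating two 64-channel branches.
features = torch.randn(1, 128, 40, 40)
shuffled = channel_shuffle(features, groups=2)
```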
2. Related Work
2.1. YOLOX
YOLOX is the result of recent improvements to the YOLO series. It incorporates the advantages of earlier YOLO networks: the CSPDarknet (Cross Stage Partial Darknet) feature extraction architecture of YOLOv4 [11–13], the focus channel augmentation technique of YOLOv5, and mosaic data augmentation; it also innovatively adds a decoupled prediction head, the anchor-free concept, and SimOTA (a label allocation strategy) for dynamic positive sample matching.
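The focus channel augmentation mentioned above can be illustrated with a short sketch: the input is sampled into four interleaved slices that are stacked along the channel axis (halving height and width, quadrupling channels) before a convolution. The channel sizes below are illustrative assumptions, not values taken from YOLOX.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Sketch of a YOLOv5-style focus operation (space-to-depth slicing + conv)."""
    def __init__(self, in_channels: int = 3, out_channels: int = 32, ksize: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels * 4, out_channels, ksize, padding=ksize // 2)

    def forward(self, x):
        # Interleaved sub-sampling: four offset pixel grids of the input.
        tl = x[..., ::2, ::2]
        bl = x[..., 1::2, ::2]
        tr = x[..., ::2, 1::2]
        br = x[..., 1::2, 1::2]
        return self.conv(torch.cat([tl, bl, tr, br], dim=1))

out = Focus()(torch.randn(1, 3, 640, 640))   # -> (1, 32, 320, 320)
```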
2.2. Transformer
The transformer, a groundbreaking neural network architecture of recent years, has been applied dominantly in the NLP (Natural Language Processing) field and, more recently, in the CV (Computer Vision) field with dramatic effect. ViT [14] (Vision Transformer) first trained on images with methods originally designed for text processing and achieved excellent results in image classification, and BoTNet (Bottleneck Transformers for Visual Recognition) [15–18] replaced the last layers of a convolutional neural network (CNN) with transformer modules to enhance the backbone network's capability to capture global information.
2.3. Improved YOLOX Algorithm: T-YOLOX
Though YOLOX has shown satisfactory detection performance and inference speed, it still requires improvements in the following aspects to solve the problems addressed in this study:
(1) The CSPLayer in YOLOX contains many residual connections. Residual operations effectively avoid vanishing gradients in deep networks, but they also transmit feature information, together with its noise, into the deeper layers, affecting the training of the backbone network
(2) Every residual operation connects input features to output features through a residual branch, but a plain concatenation of feature layers is not sufficient and may leave channel information poorly fused
(3) YOLOX uses the CNN-based CSPDarknet backbone, which captures local feature information through convolution kernels but may neglect relationships between global features
Therefore, considering the drawbacks of YOLOX when detecting complicated fire scenarios, we propose the T-YOLOX model, consisting of three parts, i.e., backbone, neck, and head, with the architecture shown in Figure 1.
[figure omitted; refer to PDF]
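The MobileViT module that replaces the last backbone layer combines convolutional local features with transformer-based global features. The block below is a strongly simplified, hypothetical stand-in for it (the real MobileViT block [9] unfolds the feature map into patches before attention, which is omitted here); it only illustrates the local-representation / global-representation / fusion idea, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SimplifiedMobileViTBlock(nn.Module):
    """Local conv features + transformer over flattened spatial tokens + fusion."""
    def __init__(self, channels: int, dim: int = 96, depth: int = 2, heads: int = 4):
        super().__init__()
        self.local_rep = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise
            nn.Conv2d(channels, dim, 1),                                   # pointwise
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=2 * dim, batch_first=True)
        self.global_rep = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Conv2d(dim, channels, 1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x):
        y = self.local_rep(x)                       # (B, dim, H, W): local features
        b, d, h, w = y.shape
        tokens = y.flatten(2).transpose(1, 2)       # (B, H*W, dim): spatial tokens
        tokens = self.global_rep(tokens)            # global self-attention
        y = self.proj(tokens.transpose(1, 2).reshape(b, d, h, w))
        return self.fuse(torch.cat([x, y], dim=1))  # fuse local and global branches
```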
3.1. Dataset

The collected data was screened and organized, and the pictures were labeled with the LabelImg tool to create the fire dataset, covering three categories: fire, smoke, and person. The labels for each picture were stored in an xml file containing the categories and coordinate information of the detection targets. Figure 6 shows the LabelImg labeling process.
[figure omitted; refer to PDF]
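Each annotation file produced by LabelImg follows the Pascal VOC xml layout. Below is a minimal sketch of reading one such file into (category, box) pairs; the file name in the usage comment is hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_labelimg_xml(xml_path: str):
    """Return (class_name, (xmin, ymin, xmax, ymax)) tuples from one annotation."""
    root = ET.parse(xml_path).getroot()
    targets = []
    for obj in root.iter("object"):
        name = obj.find("name").text          # "fire", "smoke", or "person"
        box = obj.find("bndbox")
        coords = tuple(int(float(box.find(tag).text))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        targets.append((name, coords))
    return targets

# Example (hypothetical file name):
# print(parse_labelimg_xml("fire_0001.xml"))
```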
The experiment is intended to detect flame, smoke, and persons using this dataset. Figure 7 visualizes the dataset statistics. As shown in Figure 7(a), nearly 7,000 flame instances were labeled, along with about 2,000 instances each of smoke and persons; Figure 7(b) shows the distribution of the labeled boxes, Figure 7(c) shows the location distribution of the target boxes, and Figure 7(d) shows the size of the target boxes relative to the pictures. It is apparent that the distribution and relative size of the labeled data are uniform and varied.
[figures omitted; refer to PDF]
3.2. Data Augmentation
Data augmentation effectively expands sample diversity and ensures higher robustness of the model in different environments. In our experiment, in addition to conventional data augmentation methods such as random zooming, cropping, and rotation, mosaic data augmentation was employed to splice four preprocessed images into one large image, enriching the background of the objects to be detected. Figure 8 shows the operation of mosaic data augmentation.
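A minimal sketch of the mosaic operation is given below. It assumes the four source images have already been resized to the output size (640 × 640 here) and omits the corresponding shifting of box coordinates, which the full augmentation pipeline would also perform.

```python
import random
import numpy as np

def mosaic(images, out_size: int = 640) -> np.ndarray:
    """Paste four pre-processed images into the quadrants of one canvas."""
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # grey background
    cx = random.randint(out_size // 4, 3 * out_size // 4)           # random mosaic centre
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        canvas[y1:y2, x1:x2] = img[:h, :w]    # crop each source to fit its quadrant
    return canvas
```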
[figure omitted; refer to PDF]
3.3. Experiment Environment and Program Design
The experiment was conducted in an environment of Python 3.8, CUDA 11.1, and PyTorch 1.9.1, and all models were trained and tested on an NVIDIA RTX 3060 GPU.
Before network training, the model weights were initialized with the method of He et al. [20]. For training, a Python script randomly divided the dataset into training and test sets at a ratio of 8 : 2, and the Adam [21] optimizer with a cosine annealing learning rate schedule was adopted, with a batch size of 4, an initial learning rate of 0.0001, and 300 training epochs in total. The input pictures were of size 640 × 640 × 3. After training, the performance of the proposed model was compared with that of the CenterNet, YOLOv3, and YOLOX models.
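The training setup described above can be sketched as follows. `FireDataset` and the `model(images, targets)` loss interface are hypothetical placeholders for the paper's own code; the hyperparameters (8 : 2 split, Adam, cosine annealing, batch size 4, initial learning rate 0.0001, 300 epochs) follow the text.

```python
import torch
from torch.utils.data import DataLoader, random_split

def train(model, dataset, epochs: int = 300, batch_size: int = 4, lr: float = 1e-4):
    # Random 8 : 2 split into training and test sets.
    n_train = int(0.8 * len(dataset))
    train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    for epoch in range(epochs):
        for images, targets in loader:      # images assumed to be 640 x 640 x 3
            loss = model(images, targets)   # hypothetical: forward pass returns the loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                    # cosine annealing, stepped once per epoch
    return model, test_set
```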
3.4. Model Evaluation
For the target detection task, precision and recall can be calculated for each category, and a PR (precision-recall) curve can be plotted for each category; each labeled picture may contain several detection targets.
In our study, the commonly used evaluation indexes for target detection models, mAP (mean average precision) and FPS (frames per second), were used for model evaluation. AP is the area under the PR curve, and mAP is the mean of the per-category AP values (the higher the AP and mAP, the better). They are calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN},$$
$$AP = \int_0^1 P(R)\,dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i,$$

where N is the number of categories and P(R) denotes precision as a function of recall.
TP (true positives) refers to positive samples correctly detected as positive.
FP (false positives) refers to negative samples incorrectly detected as positive.
FN (false negatives) refers to positive samples missed, i.e., incorrectly detected as negative.
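Based on these definitions, the per-class AP is the area under the precision-recall curve swept over the detection confidence threshold. The sketch below uses the common all-point interpolation; accumulating the ranked TP/FP counts into the recall and precision arrays is assumed to have been done beforehand.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the PR curve for one class (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing, then sum over recall steps.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is then the mean of the per-class AP values, e.g.:
# map_value = np.mean([ap_fire, ap_smoke, ap_person])
```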
3.5. Experiment Design and Result Analysis
For our study, one group of ablation experiments and one group of comparative experiments were designed in total, and model size was also taken into account to meet the requirements for model deployment.
All the comparisons were made on the self-developed fire dataset. The dataset was divided into training and test sets at a ratio of 8 : 2, and 10% of the training set was sampled as the verification set. The experiment results were evaluated with the two indexes described above. The loss of the proposed T-YOLOX model over 300 training epochs is shown in Figure 9. As seen in the figure, the loss decreases quickly at the early stage of training; as the number of epochs increases, the loss curve gradually flattens. When the epoch reaches around 200, the model has converged, and no overfitting was observed in the training process.
[figure omitted; refer to PDF]
3.5.1. Ablation Experiment
To analyze the effect of each proposed improvement on model performance, three groups of experiments were designed for comparison; each group was trained with the same training parameters but a different model configuration. Table 1 shows the detection results, where "√" indicates that a strategy is used in the improved model and "×" indicates that it is not. According to the results in Table 1, Improvement 1 adds a channel shuffle module after the concatenation operation (CONCAT) to increase the communication between channels and avoid overfitting, increasing mAP to some extent; Improvement 2 additionally adds a light attention module as an attention-enhancing branch on the CSPLayer, improving the channels' attention to spatial information and reducing the impact of noise on the deep network, increasing mAP by 1.05%; and Improvement 3 further adds the MobileViT module, fusing CNN and transformer so that the backbone network learns both local and global information, increasing mAP by a further 1.02%.
Table 1
Experiment results with different improvement methods.
Method | SC (channel shuffle) | Attention | MobileViT | mAP | FPS |
YOLOX | × | × | × | 67.28% | 63.69 |
Improvement 1 | √ | × | × | 67.47% | 63.09 |
Improvement 2 | √ | √ | × | 68.52% | 58.35 |
Improvement 3 | √ | √ | √ | 69.54% | 54.17 |
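The exact structure of the light attention module used in Improvements 2 and 3 is not reproduced here; as an illustration of the general idea of re-weighting channels on an attention-enhancing branch, a squeeze-and-excitation-style sketch is given below. The reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Illustrative channel attention: squeeze to per-channel statistics, excite to weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # squeeze: global spatial average
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                 # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.fc(x)                             # rescale each channel
```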
3.5.2. Model Comparison
To verify the detection performance of the improved T-YOLOX model, comparative experiments were conducted against mainstream target detection models, i.e., YOLOv3, CenterNet, and YOLOX, using mAP and FPS to evaluate each algorithm; the results are shown in Table 2. As seen in the table, the mAP of the T-YOLOX algorithm reaches 69.54%, an increase of 2.24% over the original YOLOX algorithm. According to the per-category AP values for fire, smoke, and person in the table, the proposed method improves the AP for flame, smoke, and person detection over the original YOLOX algorithm to different extents, and its detection performance is better than that of the other mainstream target detection models (CenterNet, YOLOv3). For the detection of persons (potential victims), T-YOLOX shows a significant advantage; its FPS does not fall significantly and remains higher than that of YOLOv3 and close to that of CenterNet and YOLOX, while maintaining high-precision detection.
Table 2
Comparison of performance between mainstream target detection models.
Model | Fire AP | Smoke AP | Person AP | mAP (%) | FPS |
CenterNet | 0.59 | 0.43 | 0.17 | 39.80 | 60.29 |
YOLOv3 | 0.74 | 0.58 | 0.54 | 62.05 | 48.60 |
YOLOX | 0.72 | 0.61 | 0.69 | 67.28 | 63.69 |
T-YOLOX | 0.75 | 0.62 | 0.72 | 69.54 | 54.17 |
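The FPS figures in Tables 1 and 2 can be reproduced in spirit with a simple timing loop such as the one below; the paper does not specify its exact measurement protocol (warm-up, batch size, precision), so this setup is an assumption.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, iters: int = 200, size: int = 640, device: str = "cuda"):
    """Average single-image inference throughput on a fixed-size dummy input."""
    model = model.to(device).eval()
    dummy = torch.randn(1, 3, size, size, device=device)
    for _ in range(10):                 # warm-up runs
        model(dummy)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(dummy)
    torch.cuda.synchronize()
    return iters / (time.time() - start)
```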
4. Conclusions
Considering that existing target detection models have difficulty giving timely and effective feedback in complicated fire scenarios, a fire scenario detection model, T-YOLOX, which improves on YOLOX, was proposed in this study. On the basis of the YOLOX model, the method adds a channel shuffle module to enhance channel communication, a CSPLayer_attention module to weight channel attention, and a MobileViT module integrating CNN and transformer, in order to detect persons, flame, and smoke in complicated fire scenarios; the detection results are shown in Figure 10. The experiments show that the proposed method performs well in complicated fire scenarios. Future work will consider how to further improve the detection accuracy of smoke in complicated fire scenarios.
[figure omitted; refer to PDF]
Acknowledgments
This research was funded by the National Natural Science Foundation of China (grant number 61803148).
[1] L. C. Yu, J. Q. Liu, "Flame image recognition algorithm based on improved mask R-CNN," Computer Engineering and Applications, vol. 56 no. 21, pp. 194-198, 2020.
[2] F. Wu, Research and Implementation of Fire Detection Algorithm Based on Deep Learning, 2020.
[3] J. Redmon, A. Farhadi, "YOLOv3: an incremental improvement," 2018. arXiv:1804.02767
[4] M. Zhao, Y. L. Ge, N. Ding, "Object detection in complex environment based on CenterNet algorithm," Journal of China Academy of Electronics Science, vol. 16 no. 7, pp. 654-660, 2021.
[5] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, Q. Tian, "CenterNet: keypoint triplets for object detection," Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6569-6578, 2019.
[6] X. J. Li, D. S. Zhang, L. L. Sun, "CNN-based lightweight flame detection method in complex scenes," Pattern Recognition and Artificial Intelligence, vol. 34 no. 5, pp. 415-422, 2021.
[7] Z. H. Luo, Research on Forest Fire Monitoring and Path Planning Based on UAV, 2021.
[8] Z. Ge, S. Liu, F. Wang, "YOLOX: exceeding YOLO series in 2021," 2021. arXiv: 2107.08430
[9] S. Mehta, M. Rastegari, "MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer," 2021. arXiv: 2110.02178
[10] A. Vaswani, N. Shazeer, N. Parmar, "Attention is all you need," 2017. arXiv: 1706.03762
[11] A. Bochkovskiy, C.-Y. Wang, H.-Y. M. Liao, "YOLOv4: optimal speed and accuracy of object detection," 2020. arXiv:2004.10934
[12] Z. Lv, L. Qiao, M. S. Hossain, B. J. Choi, "Analysis of using blockchain to protect the privacy of drone big data," IEEE Network, vol. 35 no. 1, pp. 44-49, DOI: 10.1109/MNET.011.2000154, 2021.
[13] T. Wang, W. Liu, J. Zhao, X. Guo, V. Terzija, "A rough set-based bio-inspired fault diagnosis method for electrical substations," International Journal of Electrical Power & Energy Systems, vol. 119, article 105961, DOI: 10.1016/j.ijepes.2020.105961, 2020.
[14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, "An image is worth 16x16 words: transformers for image recognition at scale," 2020. https://arxiv.org/abs/2010.11929
[15] A. Srinivas, T. Y. Lin, N. Parmar, "Bottleneck transformers for visual recognition," 2021. arXiv: 2101.11605
[16] Z. Lv, L. Qiao, I. You, "6G-enabled network in box for internet of connected vehicles," IEEE transactions on intelligent transportation systems, vol. 22 no. 8, pp. 5275-5282, DOI: 10.1109/TITS.2020.3034817, 2021.
[17] B. Li, G. Xiao, R. Lu, R. Deng, H. Bao, "On feasibility and limitations of detecting false data injection attacks on power grid state estimation using D-FACTS devices," IEEE Transactions on Industrial Informatics, vol. 16 no. 2, pp. 854-864, DOI: 10.1109/TII.2019.2922215, 2020.
[18] Z. Lv, W. Xiu, "Interaction of edge-cloud computing based on SDN and NFV for next generation IoT," IEEE Internet of Things Journal, vol. 7 no. 7, pp. 5706-5712, DOI: 10.1109/JIOT.2019.2942719, 2020.
[19] N. Ma, X. Zhang, H.-T. Zheng, J. Sun, "ShuffleNet V2: practical guidelines for efficient CNN architecture design," Computer Vision – ECCV 2018, pp. 116-131, DOI: 10.1007/978-3-030-01264-9_8, 2018.
[20] K. He, X. Zhang, S. Ren, "Delving deep into rectifiers: surpassing human-level performance on ImageNet classification," 2015. arXiv: 1502.01852
[21] D. P. Kingma, J. Ba, "Adam: a method for stochastic optimization," 2017. arXiv:1412.6980
Copyright © 2022 Jianfei Zhang and Sai Ke. This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Considering that existing target detection models are difficult to use in complicated fire scenarios and detect few target categories, an improved YOLOX fire scenario detection model is introduced to realize multitarget detection of flame, smoke, and persons. Firstly, a light attention module improves the overall detection performance of the model; secondly, the channel shuffle technique increases the communication between channels; and finally, the last layer of the backbone is replaced with a light transformer module to enhance the backbone's ability to capture global information. Experiments on a self-developed fire dataset show that the mAP of T-YOLOX increases by 2.24% compared with the benchmark model (YOLOX), and the detection accuracy is significantly improved compared with that of CenterNet and YOLOv3, demonstrating the effectiveness and advantages of the algorithm.