1. Introduction
The convolutional neural network (CNN) was first introduced in the 1980s. Lecun et al. [1] later proposed LeNet, a simply constructed CNN architecture that contains three convolutional layers, two subsampling layers, and a fully connected layer. LeNet was mainly used for handwritten digit recognition on the MNIST dataset and achieved the lowest error rate at the time. However, hardware was limited and graphics processing units were not yet available, so the development of CNNs was greatly restricted. In 2012, Krizhevsky et al. [2] developed AlexNet and won first place in the ImageNet large-scale visual recognition competition with a top-5 error of 15.3%. Compared with LeNet, AlexNet uses the rectified linear unit (ReLU) instead of the conventional sigmoid activation function to alleviate the vanishing gradient problem. Moreover, the dropout [3] regularization technique was introduced to reduce overfitting. The enlarged architecture of AlexNet, however, requires nearly 60 million parameters, and its floating-point operations (FLOPs) reach 0.7 giga FLOPs. Subsequently, researchers continued to deepen networks to improve accuracy, as in VGGNet [4].
Instead of deepening the CNN architecture, some researchers expand the width of the network. For instance, Szegedy et al. [5] first introduced the inception block, which encapsulates kernels of different sizes to extract global and local features and limits computation by adding a bottleneck layer of 1 × 1 convolutional filters before the large kernels. Furthermore, Srivastava et al. [6] designed the highway network to ease gradient-based training of very deep networks; it uses gating functions that adaptively bypass the input so that the network can go deeper. In addition, He et al. [7] proposed ResNet, taking inspiration from the bypass and bottleneck-layer approaches to reduce the amount of computation. Many improved network architectures have since been proposed and applied to tasks such as object detection [8] and semantic segmentation [9]. However, regardless of whether the architecture is deepened or widened, high computational cost and memory requirements remain the two main concerns.
To alleviate these two primary concerns, designing a lightweight architecture without compromising performance is necessary, especially when the CNN model is deployed on resource-constrained hardware. Howard et al. [10] adopted depthwise separable convolution in MobileNet to reduce the number of model parameters so that the model can be embedded in portable devices for mobile and embedded vision applications. Juefei-Xu et al. [11] proposed the local binary convolutional neural network, which adopts local binary convolution (LBC) as a substitute for conventional convolution; their experiments showed that the LBC module approximates a conventional convolutional layer well while greatly reducing the number of learnable parameters. Iandola et al. introduced SqueezeNet [12], which replaces 3 × 3 filters with 1 × 1 filters and decreases the number of input channels to the remaining 3 × 3 filters; according to their results, SqueezeNet uses 50× fewer parameters than AlexNet while preserving AlexNet-level accuracy on ImageNet. Other techniques such as parameter pruning and quantization remove redundant parameters, which reduces network complexity and mitigates overfitting. Furthermore, lightweight improvements of YOLO [13, 14] demonstrate that, without sacrificing accuracy, light CNNs can reduce training time and broaden the range of applications without being limited by hardware.
This study adopts three such convolutional modules, whose capabilities and advantages are as follows: depthwise separable convolution saves computation when the kernel size and the number of kernels are large; atrous convolution expands the field of view (FOV) of the filters without increasing the number of parameters; and the inception module extracts local and global features simultaneously. With these three modules, the proposed efficient light convolutional neural network (ELNet) reduces the parameters and operations of the CNN and is no longer limited by memory and computational constraints.
The rest of the paper is organized as follows. Section 2 briefly reviews the conventional CNN architecture. Section 3 introduces ELNet. Section 4 presents the experimental results on the CIFAR-10 and CIFAR-100 datasets and compares them with other state-of-the-art CNN architectures such as GoogLeNet, ResNet-50, and MobileNet. Finally, Section 5 draws conclusions.
2. Convolutional Neural Network (CNN)
The concept of neural networks mainly comes from biological neural systems; however, early networks were fully connected, which requires a large amount of computation when the input is large. Therefore, in the 1980s, the convolution kernel was introduced and has since been widely applied in image processing. A CNN has four main parts: the convolutional layer, the pooling layer, the activation function, and the fully connected layer. Feature extraction depends on the first three parts, and the fully connected layer classifies the extracted features. These parts are described as follows.
2.1. Convolutional Layer
A convolutional layer consists of a set of learnable filters (or kernels), each with a small receptive field that extends through the full depth of the input volume; sliding these filters over the input extracts the features. For an input $x$ with $C$ channels and a $K \times K$ kernel, the $k$-th output feature map is computed as

$$ y_{i,j,k} = \sum_{c=1}^{C}\sum_{m=1}^{K}\sum_{n=1}^{K} w_{m,n,c,k}\, x_{i+m-1,\, j+n-1,\, c} + b_k, \tag{1} $$

where $w$ denotes the kernel weights and $b_k$ is the bias of the $k$-th filter.
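As an illustration, the following minimal PyTorch sketch applies a standard convolutional layer of the form in equation (1); the channel counts and kernel size are illustrative choices, not the configuration used in ELNet.

```python
import torch
import torch.nn as nn

# A standard convolutional layer: each of the 16 kernels spans the full input
# depth (3 channels) and produces one feature map, as in equation (1).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 224, 224)   # a batch with one RGB image
y = conv(x)
print(y.shape)                    # torch.Size([1, 16, 224, 224])
```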
2.2. Pooling Layer
In order to extract features effectively, the convolution stride is usually set to 1; however, this setting leads to relatively more operations. Therefore, a pooling layer is usually added to the CNN to reduce the amount of computation. Equation (2) gives the calculation of max pooling and average pooling over a pooling region $R_{i,j}$:

$$ y_{i,j,k}^{\max} = \max_{(m,n)\in R_{i,j}} x_{m,n,k}, \qquad y_{i,j,k}^{\mathrm{avg}} = \frac{1}{|R_{i,j}|}\sum_{(m,n)\in R_{i,j}} x_{m,n,k}. \tag{2} $$
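A short sketch of both pooling operations in equation (2), again with illustrative sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 224, 224)

# Max pooling keeps the strongest response in each 2x2 window (equation (2), left);
# average pooling keeps the mean response of the window (equation (2), right).
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(x).shape, avg_pool(x).shape)   # both torch.Size([1, 16, 112, 112])
```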
2.3. Activation Function
The convolution kernel itself performs a linear operation; LeNet therefore adopts the sigmoid function as an activation function to handle nonlinear problems. As networks became deeper, researchers found that the gradient vanishes because the derivative of the sigmoid function approaches 0 in its saturation regions. ReLU was therefore introduced in AlexNet to address this problem; moreover, ReLU is computationally simpler than the sigmoid function. Later, many variants based on ReLU were proposed. For instance, Leaky ReLU [15] addresses the problem that ReLU is not activated when x is less than 0, PReLU [16] makes the negative slope a learnable parameter, and RReLU [17] randomizes the negative slope during training. Here, PReLU is selected as the activation function, as shown in Figure 1, and its equation is

$$ f(x) = \begin{cases} x, & x > 0, \\ a x, & x \le 0, \end{cases} \tag{3} $$

where $a$ is a learnable parameter.
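A minimal sketch of PReLU as in equation (3), assuming the PyTorch implementation with a single learnable slope shared across channels:

```python
import torch
import torch.nn as nn

# PReLU: f(x) = x for x > 0 and f(x) = a*x for x <= 0, where the slope a is
# learned during training. A per-channel slope is also possible via num_parameters.
prelu = nn.PReLU(init=0.25)

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(prelu(x))   # negative inputs are scaled by the learnable slope a
```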
[Figure 1: The PReLU activation function; figure omitted.]
3. Efficient Light Convolutional Neural Network (ELNet)
The architecture of the proposed ELNet is summarized in Table 1.
Table 1
ELNet architecture.
Block | Type/stride | Filter shape | Input size
I | Conv/2 | |
1 | Conv dw/1 | |
| Conv/1 | |
2 | Conv dw/2 | |
| Conv/1 | |
3 | Conv dw/1 | |
| Conv/1 | |
4 | Conv dw/2 | |
| Conv/1 | |
5 | Conv dw/1 | |
| Conv/1 | |
6 | Conv dw/2 | |
| Conv/1 | |
| (a) Atrous dw/1 | |
| (b) Atrous dw/1 | |
| (c) Atrous dw/1 | |
| Add | |
7 | Conv/1 | |
8 | Conv dw/2 | |
| Conv/1 | |
9 | Conv dw/1 | |
| Conv/1 | |
10 | Average pooling | Global pooling |
F | Fully connected | |
O | Softmax | Classification answer |
In Table 1, Conv dw denotes a depthwise separable convolution, Atrous dw denotes an atrous depthwise convolution, and the number after the slash is the stride of the layer. The three modules adopted in ELNet are described in the following subsections.
3.1. Depthwise Separable Convolution
Depthwise separable convolution splits the original convolution into two parts in order to reduce the number of operations, as shown in Figure 3.
[Figure 3: Depthwise separable convolution; figure omitted.]
In a conventional convolution, each kernel spans all input channels and generates only one feature map. In contrast, depthwise separable convolution first applies one filter per input channel to produce a feature map for each channel (depthwise convolution), and then a 1 × 1 pointwise convolution combines these feature maps across channels. This factorization markedly reduces the number of parameters and operations.
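The following is a minimal sketch of a depthwise separable convolution built from a depthwise convolution followed by a 1 × 1 pointwise convolution; the channel counts are illustrative and do not correspond to the exact ELNet configuration in Table 1. The parameter comparison at the end shows the reduction relative to a standard 3 × 3 convolution.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one filter per input channel) followed by a
    1x1 pointwise convolution that mixes the channels."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
block = DepthwiseSeparableConv(32, 64)
print(block(x).shape)   # torch.Size([1, 64, 56, 56])

# Parameter comparison with a standard 3x3 convolution of the same shape:
standard = nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False)
print(sum(p.numel() for p in standard.parameters()))   # 18432
print(sum(p.numel() for p in block.parameters()))      # 288 + 2048 = 2336
```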
3.2. Atrous Convolution
Atrous convolution [9], as shown in Figure 4, enlarges the FOV of a filter by incorporating a larger context without increasing the number of parameters. Its advantages are that it allows a filter to cover a larger context without resorting to a bigger kernel and that it reduces the need for pooling layers, which lowers the amount of computation and can improve accuracy; in addition, using fewer parameters also helps avoid overfitting.
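A minimal sketch comparing a dilated (atrous) 3 × 3 convolution with its non-dilated counterpart; the channel sizes are illustrative:

```python
import torch
import torch.nn as nn

# A 3x3 kernel with dilation (atrous rate) 2 covers a 5x5 region of the input
# while still using only 3x3 = 9 weights per channel, enlarging the FOV without
# adding parameters. padding=dilation keeps the spatial size unchanged.
atrous = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2, bias=False)
plain  = nn.Conv2d(64, 64, kernel_size=3, dilation=1, padding=1, bias=False)

x = torch.randn(1, 64, 28, 28)
print(atrous(x).shape, plain(x).shape)   # both torch.Size([1, 64, 28, 28])

# Same parameter count, larger receptive field for the atrous version.
print(sum(p.numel() for p in atrous.parameters()) ==
      sum(p.numel() for p in plain.parameters()))   # True
```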
[Figure 4: Atrous convolution; figure omitted.]
3.3. Inception Module
The inception module uses convolution kernels of various sizes to extract features so that the feature maps contain both local and global features. Figure 5 compares conventional stacked convolutional layers with the inception module. Although both methods can cover the same FOV, the local features in Figure 5(a) may be washed out by the successive layers, whereas this wash-out problem does not arise with the inception module (Figure 5(b)). However, fusing the multiple feature maps raises another question. In general, concatenation (Concat) and addition (Add) are the two common methods: the former retains the characteristics of each branch output but increases the channel dimension, whereas the latter keeps the dimension unchanged but may lose the independence of each output.
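The following toy sketch contrasts the Add and Concat fusion methods in an inception-style block with two branches; it is a simplified illustration under assumed channel sizes, not the exact inception module used in ELNet.

```python
import torch
import torch.nn as nn

class MiniInception(nn.Module):
    """Toy inception-style block: parallel branches with different kernel
    sizes, fused either by channel concatenation or element-wise addition."""
    def __init__(self, channels, fusion="add"):
        super().__init__()
        self.fusion = fusion
        self.branch3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)

    def forward(self, x):
        a, b = self.branch3(x), self.branch5(x)
        if self.fusion == "concat":
            return torch.cat([a, b], dim=1)   # doubles the channel dimension
        return a + b                          # keeps the channel dimension

x = torch.randn(1, 32, 28, 28)
print(MiniInception(32, "add")(x).shape)      # torch.Size([1, 32, 28, 28])
print(MiniInception(32, "concat")(x).shape)   # torch.Size([1, 64, 28, 28])
```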
[Figure 5: Conventional convolutional layers versus the inception module; figure omitted.]
4. Results and Discussion
To deploy the system on resource-constrained hardware for real-time data processing, large-scale datasets such as PASCAL VOC, ImageNet, and COCO are not considered. Thus, CIFAR-10 and CIFAR-100, two well-understood and widely used datasets, were used to verify the performance of ELNet. The experimental results, including parameters, FLOPs, and accuracy, were compared with those of other state-of-the-art CNN architectures, namely GoogLeNet [5], ResNet-50 [7], MobileNet [10], and the All Convolutional Net (All-CNN-C) [19]. The hardware specifications and predefined parameters used in this study are listed in Tables 2 and 3.
Table 2
Hardware specifications.
Hardware | Specification |
GPU | NVIDIA GTX 1080 Ti 11 GB
CPU | Intel Xeon E3-1225 v3 @ 3.2 GHz |
Table 3
Predefined parameters.
Parameter | Value |
Epoch | 120 |
Optimizer | Nesterov’s accelerated gradient |
Learning rate | 0.01 |
Learning rate decay | 0.9 |
Learning rate decay frequency | 40 (epochs/time) |
Momentum | 0.9 |
Batch size | 100 |
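A minimal training-setup sketch using the hyperparameters in Table 3; it assumes that "learning rate decay 0.9" means multiplying the learning rate by 0.9 every 40 epochs, and the model shown is only a placeholder so the snippet runs standalone.

```python
import torch

# model = ...  # any torch.nn.Module, e.g. an ELNet implementation
model = torch.nn.Linear(10, 10)   # placeholder model (assumption for illustration)

# Nesterov's accelerated gradient with the hyperparameters from Table 3.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, nesterov=True)

# Multiply the learning rate by 0.9 every 40 epochs ("decay frequency").
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.9)

for epoch in range(120):
    # ... one pass over the training set with batch size 100 ...
    scheduler.step()
```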
4.1. CIFAR-10 Dataset
The CIFAR-10 dataset includes 60,000 colour images of size 32 × 32 in 10 classes. To fit the proposed network, bilinear interpolation is used to resize the images to 224 × 224, which preserves more features than padding. Table 4 shows that larger CNN models such as GoogLeNet and ResNet-50 require very large numbers of parameters and MFLOPs; in other words, these models need longer training time and more operations. To make the model suitable for general hardware, models with fewer operations and lower complexity are more favourable. Therefore, the proposed model is also compared with MobileNet and All-CNN-C, which are likewise light models. According to the results, MobileNet uses fewer parameters and MFLOPs than the larger models, yet its accuracy is lower than that of ELNet. Although All-CNN-C has the fewest parameters, its MFLOPs are the highest, which means its training time could be shortened only by using better graphics processing units, increasing the cost of the hardware. ELNet reaches a trade-off between accuracy and parameters/MFLOPs, which is closest to the purpose of this study.
Table 4
Experimental results using the CIFAR-10 dataset.
Model | Parameters (M) | MFLOPs | Accuracy (%)
GoogLeNet [5] | 6.9 | 1,582 | 83.1 |
ResNet-50 [7] | 25.6 | 3,857 | 88.1 |
MobileNet [10] | 4.2 | 569 | 85.6 |
All-CNN-C [19] | 1.37 | 13,965 | 90.9 |
ELNet (proposed) | 2.1 | 257 | 92.3 |
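As a loading example for the preprocessing described above (bilinear resizing of the 32 × 32 CIFAR-10 images to 224 × 224), the following torchvision-based sketch can be used; the exact augmentation pipeline of the study is not specified, so this is only an assumed minimal setup.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Bilinear upsampling of the 32x32 CIFAR-10 images to the 224x224 network input.
preprocess = transforms.Compose([
    transforms.Resize((224, 224),
                      interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
])

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=preprocess)
loader = DataLoader(train_set, batch_size=100, shuffle=True)

images, labels = next(iter(loader))
print(images.shape)   # torch.Size([100, 3, 224, 224])
```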
4.2. CIFAR-100 Dataset
The CIFAR-100 dataset contains 100 classes, ten times as many as CIFAR-10. Therefore, the accuracies shown in Table 5 are clearly lower than those obtained on CIFAR-10; nevertheless, the accuracy of ELNet is still the highest.
Table 5
Experimental results using the CIFAR-100 dataset.
Model | Accuracy (%) |
GoogLeNet [5] | 56 |
ResNet-50 [7] | 57.3 |
MobileNet [10] | 65 |
All-CNN-C [19] | 66.3 |
ELNet (proposed) | 69 |
To evaluate the effectiveness of the three convolutional modules used in ELNet, Tables 6 and 7 show ablation results on the CIFAR-100 dataset. Table 6 shows that atrous convolution not only widens the FOV, raising the accuracy from 67% to 69%, but also matches the accuracy (69%) obtained with a larger 7 × 7 kernel while using fewer parameters and MFLOPs. In addition, the inception module extracts features with different convolution kernel sizes, and the method used to fuse these features can produce distinct results. From the experimental results in Table 7, concatenation yields slightly higher accuracy than the other two settings but requires more parameters and MFLOPs; thus, the addition method is the better choice for implementing the network in a resource-constrained environment.
Table 6
The comparisons of using the atrous convolution.
Model | Parameters (M) | MFLOPs | Accuracy (%)
ELNet (no atrous convolution) | 2.1 | 257 | 67 |
ELNet (7 × 7 kernel size) | 2.2 | 262 | 69 |
ELNet | 2.1 | 257 | 69 |
Table 7
The comparisons of using the inception module.
Model | Parameters (M) | MFLOPs | Accuracy (%)
ELNet (Add) | 2.1 | 257 | 69 |
ELNet (Concat) | 2.6 | 359 | 69.4 |
ELNet (no inception module) | 2.1 | 257 | 68 |
Overall, the proposed ELNet showed better performance than both the relatively large CNN architectures (GoogLeNet and ResNet-50) and the light CNN architectures (MobileNet and All-CNN-C). The accuracy of ELNet is acceptable when the deployment environment is considered. Although the proposed ELNet reaches 92.3% and 69% on the CIFAR-10 and CIFAR-100 datasets, respectively, the accuracy could be further improved by using more complex networks. The three modules (depthwise separable convolution, atrous convolution, and the inception module) can also be applied to such networks to lower the number of parameters and operations while preserving classification accuracy.
5. Conclusions
The contributions of this study, listed below, confirm that ELNet can effectively reduce model complexity while maintaining good accuracy:
(1) ELNet combines three convolutional modules (depthwise separable convolution, atrous convolution, and the inception module) to reduce the number of parameters and operations in the model
(2) ELNet requires only 2.1 million training parameters and 257 MFLOPs for an input image size of 224 × 224
(3) The accuracy of ELNet reached 92.3% and 69% on the CIFAR-10 and CIFAR-100 datasets, respectively
Therefore, the proposed ELNet can be applied to embedded systems for image classification applications. In future research, the architecture can be combined with other methods such as parameter pruning, recursion, or other learning methodologies to further optimize the network.
Acknowledgments
The authors would like to thank the support of the Intelligent Manufacturing Research Center (iMRC) from the Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan. This research was funded by the Ministry of Science and Technology of the Republic of China (Grant no. MOST 109-2221-E-167-027).
References
[1] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, DOI: 10.1109/5.726791, 1998.
[2] A. Krizhevsky, I. Sutskever, G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Proceedings of the International Conference on Neural Information Processing Systems, 2012.
[3] N. Srivastava, G. Hinton, A. Krizhevsky, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929-1958, 2014.
[4] K. Simonyan, A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Proceedings of the International Conference on Learning Representations, 2015.
[5] C. Szegedy, W. Liu, Y. Jia, "Going deeper with convolutions," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[6] R. K. Srivastava, K. Greff, J. Schmidhuber, "Highway networks," 2015, http://arxiv.org/abs/1505.00387.
[7] K. He, X. Zhang, S. Ren, "Deep residual learning for image recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[8] W. Liu, D. Anguelov, D. Erhan, "SSD: single shot MultiBox detector," Proceedings of the European Conference on Computer Vision, 2016.
[9] L. C. Chen, G. Papandreou, I. Kokkinos, "DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 834-848, 2017.
[10] A. G. Howard, M. Zhu, B. Chen, "MobileNets: efficient convolutional neural networks for mobile vision applications," 2017, http://arxiv.org/abs/1704.04861.
[11] F. Juefei-Xu, V. N. Boddeti, M. Savvides, "Local binary convolutional neural networks," 2017, http://arxiv.org/abs/1608.06049.
[12] F. N. Iandola, S. Han, M. W. Moskewicz, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," 2016, http://arxiv.org/abs/1602.07360.
[13] J. Redmon, A. Farhadi, "YOLO9000: better, faster, stronger," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[14] J. Redmon, A. Farhadi, "YOLOv3: an incremental improvement," 2018, http://arxiv.org/abs/1804.02767.
[15] A. L. Maas, A. Y. Hannun, A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," Proceedings of the International Conference on Machine Learning, 2013.
[16] K. He, X. Zhang, S. Ren, "Delving deep into rectifiers: surpassing human-level performance on ImageNet classification," Proceedings of the IEEE International Conference on Computer Vision, 2015.
[17] B. Xu, N. Wang, T. Chen, "Empirical evaluation of rectified activations in convolutional network," 2015, http://arxiv.org/abs/1505.00853.
[18] M. Lin, Q. Chen, S. Yan, "Network in network," Proceedings of the International Conference on Learning Representations, 2014.
[19] J. Springenberg, A. Dosovitskiy, T. Brox, "Striving for simplicity: the all convolutional net," Proceedings of the International Conference on Learning Representations, 2015.
Copyright © 2021 Cheng-Jian Lin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/
Abstract
Deep learning has achieved great success in computer vision applications such as self-driving vehicles, facial recognition, and robot control. A growing need to deploy systems in resource-limited or resource-constrained environments such as smart cameras, autonomous vehicles, robots, smartphones, and smart wearable devices drives one of the current mainstream developments of convolutional neural networks: reducing model complexity while maintaining good accuracy. In this study, the proposed efficient light convolutional neural network (ELNet) comprises three convolutional modules that allow it to perform classification with fewer computations, so that it can be implemented on resource-constrained hardware. Classification on the CIFAR-10 and CIFAR-100 datasets was used to verify the model performance. According to the experimental results, ELNet reached 92.3% and 69% accuracy on the CIFAR-10 and CIFAR-100 datasets, respectively; moreover, ELNet effectively lowered the computational complexity and number of parameters required in comparison with other CNN architectures.
Details
1 Department of Computer Science and Information Engineering, National Chin-Yi University of Technology, Taichung 411, Taiwan; College of Intelligence, National Taichung University of Science and Technology, Taichung 404, Taiwan
2 Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan
3 Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan; Intelligent Manufacturing Research Center, National Cheng Kung University, Tainan 701, Taiwan