Channel pruning is a method for compressing convolutional neural networks that can significantly reduce the number of model parameters and the computational cost. Current methods that focus on a model's internal parameters and feature mapping information rely on manually set a priori criteria or reflect filter attributes through partial feature mappings; they lack the ability to analyse and discriminate the feature extraction of channels and ignore the underlying reasons why channels are similar. This study developed a pruning method based on similar structural features of channels, called SSF. The method focuses on analysing the ability of channels to extract similar features and on exploring the characteristics of channels that produce similar feature mappings. First, adaptive threshold coding was introduced to numerically transform channel characteristics into structural features, and channels with similar coding results could generate highly similar feature mappings. Second, spatial distances were calculated over the structural feature matrix to obtain the similarity between channels. Moreover, to keep rich channel classes in the pruned network, the channels were divided into classes on the basis of this similarity and some channels in each class were randomly removed. Third, considering the differences in the overall similarity of different layers, appropriate pruning ratios were determined for each layer on the basis of the channel dispersion reflected by the similarity. Finally, extensive experiments were conducted on image classification tasks, and the results demonstrated the superiority of the SSF method over many existing techniques. On ILSVRC-2012, the SSF method reduced the floating-point operations (FLOPs) of the ResNet-50 model by 57.70% while reducing the Top-1 accuracy by only 1.01%.
Introduction
Convolutional neural networks (CNNs) have shown excellent performance in image recognition, natural language processing, and other tasks, and their capabilities scale with the size of the dataset and the number of model parameters (Raffel et al. 2020). As deep learning models grow deeper, their floating-point operations (FLOPs) increase. For example, the VGG-16 model (Simonyan and Zisserman 2015) requires more than 15.5 billion FLOPs to process a single image, and ResNet-50 (He et al. 2016) requires 4.1 billion FLOPs. Models with such high computational cost struggle to meet the real-time requirements of modern applications such as self-driving systems and embedded artificial intelligence, so it is challenging to deploy complex models on resource-constrained devices or in systems with low-latency requirements. However, models with massive parameters contain parameter redundancy, and pruning is a method that can effectively reduce model complexity. The number of parameters and FLOPs can be decreased by removing redundant parameters, channels, filters, or convolutional layers. Meanwhile, similar convolutional kernels express feature mappings that can substitute for each other, so sparsifying redundant convolutional kernels to a certain degree not only reduces the number of model parameters but also preserves rich feature extraction capability. The resulting sparse model is a concrete form of model miniaturisation and simplification.
Many studies have investigated filter redundancy. The mainstream methods are based either on feature mapping information or on model parameter attributes. The methods based on feature mapping information are mainly used (i) to reuse the input data to calculate the rank (Lin et al. 2020), entropy (Wang et al. 2021b), and singular values (Chen et al. 2022) of the feature mapping obtained by each filter, thereby judging whether the filter produces useless information; (ii) to analyse the colour and texture information of the feature mapping, thereby measuring the similarity between different filters and pruning redundant filters (Yao et al. 2021); and (iii) to calculate the classification contribution of filters, thus determining their importance and removing filters with low importance (Zhang et al. 2022; Pachón et al. 2022). Furthermore, (iv) ARPruning (Yuan et al. 2024) proposes pruning criteria based on the feature mapping attention graph to measure the importance of channels and uses a cross-layer search method to find the optimal pruning strategy; and (v) a global balanced iterative pruning method has been introduced based on the magnitude distribution of the intermediate features (Chang et al. 2022). These methods have limitations in reusing training data to make pruning decisions because they rely on the idealised assumption that a part of the data can stand in for the whole. A model should generalise well when dealing with realistic tasks; however, using only a part of the dataset to formulate a pruning decision in effect selects filters that benefit the current data. This causes the loss of filters with different capabilities and reduces the generalisation ability of the model.
The methods that determine redundancy based on model parameter attributes rely on prior knowledge: e.g. (i) channels with a smaller norm are attenuated gradually (Mousa-Pasandi et al. 2020); (ii) filters closer to the geometric median are removed (He et al. 2019); and (iii) filters with low sensitivity are removed (Yang and Liu 2022). Parameters with smaller magnitudes are often considered to contribute less to the network (Liu et al. 2017); in fact, appropriate weight decay can solve the overfitting problem to a certain extent. Currently, most methods adopt different criteria to independently assess the properties of each filter, whereas exploring the similarity between parameters avoids the choice of criterion. In Wang et al. (2021d), the similarity between filters was calculated, and the most similar filters were removed. However, that study only considered the expectation and variance of the filter; the analysis of the structural characteristics of the filter was not comprehensive, so the ability to process detailed information cannot be reflected. Meanwhile, using model parameters to calculate similarity is a strict similarity judgement, which may cause missed or wrong determinations of similar channels. For example, filters with the same effect would be considered dissimilar because of an overall offset between their parameters. As a result, this kind of method lacks tolerance in similarity determination.
The motivation for most pruning methods derives from prior knowledge that is assumed to be correct, which may fail and lead to poor pruning when dealing with different models and data tasks. It is also difficult to clearly elucidate the role of feature representation within the model when only partial feature mappings are used to reflect the information extraction capability of a filter. Dynamic pruning (Sanh et al. 2020; Chen et al. 2020), architecture-search-driven methods (Dong and Yang 2019; Liu et al. 2019), and other approaches avoid manual decision-making but suffer from long computation times when formulating pruning strategies. The current mainstream approaches focusing on the internal parameters and feature mapping information of the model lack analysis and discrimination of the feature extraction ability of channels, whereas different channels in a pre-trained model can extract a variety of information, including local texture, edges, colour, brightness, and other cues. It is very difficult to express the specific feature extraction capability of a channel explicitly in numerical form, but the problem can be simplified if the similarity of the feature extraction capability between different channels can be expressed numerically. At the same time, finding channels with similar feature extraction capability is the core idea of this pruning method. Therefore, this study focused on exploring the expressive role within the model and analysing the similar feature extraction ability between convolutional channels, and consequently developed a pruning strategy based on similar structural features (SSF) of the channels. The main differences between this method and related methods are shown in Fig. 1.
Fig. 1 [Images not available. See PDF.]
Differences in the realisation process between the proposed method and the related methods
In this study, SSF first constructed structural features from various aspects to reflect the feature extraction capability of each channel, using adaptive threshold coding to convert the channel into symbol features, magnitude features, expectation features, and magnitude-mean features. Traditional methods calculate similarity directly from the model parameters, neglecting to explore the capability of the convolution kernel. Instead, SSF introduced Euclidean distance and cosine similarity to compute the spatial distance values between structural features, and defined the average distance value of a channel within the filter as the channel similarity. Calculating similarity from structural features reflects a focus on the ability of the convolution kernel and strengthens the analysis of the role of feature representation within the model. Finally, on the basis of the channel similarity, SSF used the K-means clustering algorithm to divide the channels into subclasses, in which some redundant channels were randomly removed so that the final sub-network retained more classes of channels.
The main contributions of this study could be summarised as follows.
A pruning strategy driven by the similarity of channel feature extraction capabilities was developed, which analysed the feature extraction capabilities between different channels in the same convolutional layer, and removed the channels with redundant roles on the basis of the similarity result of the feature extraction capabilities to achieve an effective compression of CNNs.
A convolutional channel similarity determination mechanism was developed, which constructed the corresponding structural features of the channels from various aspects and reflected the similarity of the feature extraction capability between the channels on the basis of the spatial distance value of the structural feature matrix, thus realising the similarity determination of the feature extraction capability of the channels instead of the direct similarity determination of the channels.
Related work
The core idea of pruning is to eliminate redundant parameters in a model to reduce memory usage and computational complexity. According to structural properties, existing pruning methods can be divided into unstructured pruning (weight level; Han et al. 2015; Guo et al. 2016) and structured pruning (channel, filter, and layer level). Because the weight positions selected by unstructured pruning are irregular, special hardware platforms are needed to accelerate processing. Structured pruning removes channels or filters directly by disconnecting some connections in the model, so it is easy to implement on edge devices. Filter pruning removes an entire filter even if it contains important channels, while channel-level pruning has a finer granularity, enabling the selective removal of redundant channels and alleviating excessive performance loss in the pruned model. Compared with structured pruning methods based on feature mapping information, methods based on model parameter attributes are not limited by an insufficient data representation. Compared with methods based on minimising reconstruction error, architecture search, and automatic machine learning, the proposed method reduced the computational complexity of making pruning decisions more significantly. The classification of related work is shown in Table 1.
Table 1. Related work classification tables
| Types | Methods | Specificities | Works |
|---|---|---|---|
| Based on model parameter attributes | Direct reflection of channel or filter contribution based on parameters in pre-trained models | Is computationally simple, relies on a priori knowledge, and requires pre-trained models | Mousa-Pasandi et al. (2020), He et al. (2019), Yang and Liu (2022), Liu et al. (2017), Wang et al. (2021d), ours |
| Based on feature mapping information | Reflection of the contribution of the filter based on the amount of information contained in the feature mapping | Needs to use the dataset to generate partial feature mappings and easily leads to a loss of generalisation | Lin et al. (2020), Wang et al. (2021b), Yao et al. (2021), Zhang et al. (2022), Pachón et al. (2022), Chen et al. (2022), Chang et al. (2022), Yuan et al. (2024) |
| Based on the minimization of reconstruction error | Minimization of reconstruction error in feature output determines significant filters | Requires layer-by-layer computation of reconstruction error, which is computationally intensive | Luo et al. (2017), Zhuang et al. (2018), He et al. (2017), Yu et al. (2018) |
| Architecture-search-driven | Selection of compact sub-networks from larger networks | Emphasizes model structure and parameter connectivity, often using evolutionary algorithms to search for model structure and hyperparameters | Dong and Yang (2019), Liu et al. (2019), Lin et al. (2021b), Wang et al. (2021c) |
| Automatic-machine-learning-driven | Combination with other machine learning methods to form pruning decisions | Is mostly a combination of different methods with complex calculations | Jiang et al. (2023), Liu et al. (2019, 2020, 2021), Yu et al. (2018), Yuan et al. (2024), Zhuang et al. (2018, 2020) |
| Based on parametric gradient | Determination of the contribution of the channel on the basis of the gradient information generated by model fine-tuning or training | Requires correct gradient relationships to be established along with additional fine-tuning of the model | Lee et al. (2019), Molchanov et al. (2017) |
| Based on regularization | Addition of regular terms to the model to form pruning decisions during the training process | Needs the loss function or model structure to be changed and trained repeatedly | Wang et al. (2021a), Wang and Fu (2022), Wang et al. (2024), Liu et al. (2023a) |
| Others | | | Frankle and Carbin (2019), Bai et al. (2022) |
Based on the minimisation of reconstruction error. The filters to be removed are determined by minimising the reconstruction error of the feature output. ThiNet (Luo et al. 2017) defines filter pruning as an optimisation problem. With the statistical information from the next layer, the filters retained in the model can minimise the reconstruction error of the next layer. DCP (Zhuang et al. 2018) defines channel pruning as a sparsity-induced optimisation problem considering both the reconstruction error and the channel discrimination, and it uses a greedy algorithm to seek a solution. In He et al. (2017), the authors proposed LASSO regression-based channel selection and least squares reconstruction and pruned channels layer by layer. Neuron importance score propagation (NISP) (Yu et al. 2018) minimises the reconstruction error of important elements in the 'final response layer' and then propagates the importance to each neuron by using feature ranking algorithms to remove the least important neurons.
Architecture-search-driven criterion. Searching for the optimal model architecture during training is another form of channel-level pruning. In Dong and Yang (2019), the authors present a differentiable search algorithm that searches for the optimal size of the model rather than its topology. PruningNet (Liu et al. 2019) is used to predict the parameters of the sub-network, and an evolutionary algorithm is adopted to search for the best sub-network. Similarly, the artificial bee colony algorithm (Lin et al. 2021b) is applied to search for architectures, and the accuracy of each architecture is taken as its fitness. In Wang et al. (2021c), the authors propose using polynomial regression to fit the relationship between accuracy and model depth, width, and input resolution; the polynomial is maximised to obtain the optimal parameters, and the model is re-trained. However, architecture search methods often have a computationally intensive search process, which consumes excessive computing time.
Automatic-machine-learning-driven criterion. Other deep learning models have been used in the pruning process to determine the channels to be pruned and the corresponding pruning parameters, such as the pruning ratio. AMC (He et al. 2018) automatically generates the pruning ratios of convolutional layers through reinforcement learning, while AutoCompress (Liu et al. 2020) proposes a heuristic search technique to determine the pruning ratio of each layer. PruningNet (Liu et al. 2019) combines meta-learning with architecture search. GAL (Lin et al. 2019) uses the GAN network to guide the pruning process of the model. Genetic algorithms are used to find the optimal combination of pruning rates and remove the filter with the highest similarity on the basis of the results of the matrix similarity measurement algorithm (Dong et al. 2024). Bi-CNN-Pruning uses a co-evolutionary migration-based algorithm as a search method to solve the dual-layer optimisation problem of combining filters and channel pruning (Louati et al. 2024). In Jiang et al. (2023), the researchers propose to convert pruning into a multi-objective optimization problem by implementing the selection of feature mappings on the basis of an evolutionary algorithm. In Liu et al. (2023b), the authors first cluster the filters hierarchically by using the K-means++ method and subsequently search for the best compression structure by using the social group optimisation (SGO) algorithm.
Based on parametric gradient criterion. The importance of filters is determined by parameter changes or gradient information during model fine-tuning. SNIP (Lee et al. 2019) presents a pruning method based on the importance of weight connections, which determines the importance of different weight connections from gradient information. In Molchanov et al. (2017), the researchers conducted greedy criterion-based pruning and fine-tuning via back-propagation to achieve efficient pruning. Movement pruning (Sanh et al. 2020) exploits first-order information of the weights during fine-tuning and gradually removes connections that move towards zero.
Based on regularisation. These methods add regularisation terms to the parameters so that they are gradually sparsified during training. GReg-2 (Wang et al. 2021a) found that parameters in the pre-trained model have different local curvature structures and that weights with larger curvature move less when an L2-norm penalty term is applied; it therefore proposes a variant of L2 regularisation with an increasing penalty factor that gradually amplifies the magnitude difference between the weights until they are naturally separated. TPP (Wang and Fu 2022) adds a penalty term to the Gram matrix of the convolution kernels, thereby removing the correlation between filters and separating those to be retained from those to be removed; although sparsity regularisation forces the separation of unimportant weights, their magnitudes can hardly reach exactly zero, so the BN layers of the pruned feature mappings are additionally regularised towards zero. GlobalPru (Wang et al. 2024) makes all images agree on a global channel ranking with respect to channel saliency by learning an ordering regularisation and uses an efficient channel ranking algorithm to obtain a sparse network. SOKS (Liu et al. 2023a) introduces a regularisation coefficient matrix to locate significant kernel positions and performs stripe pruning by using a binary search algorithm to find the optimal kernel shape.
Lottery ticket hypothesis (LTH): Dense, randomly initialised, feed-forward networks contain subnetworks (winning tickets) that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations. The results of LTH (Frankle and Carbin 2019) suggest that (1) winning tickets are sub-networks with specific sparse architectures found within the original network, and (2) winning tickets tend to have stronger generalisation capabilities than the original model. The dual lottery ticket hypothesis (DLTH) (Bai et al. 2022) suggests that more than one winning ticket exists in the original network. This work randomly selects sub-networks from a randomly initialised original network and transforms them so that, trained individually, they achieve better performance. Thus, LTH can be described as finding a structure according to weights, because it prunes the pre-trained network to find a mask by using weight magnitude ranking, whereas DLTH can be described as finding weights for a given structure, because it transforms the weights of a randomly selected sparse network. The results of DLTH show that (1) randomly selected sub-networks from randomly initialised dense networks can be converted into a form with good trainability and (2) the flexibility of selecting the sparse structure gives control over the structure of the subnetwork, rather than leaving it to be decided by pruning.
Discussion. The core idea of these methods (such as the methods based on parameter gradient changes, searching optimal sub-networks, and reusing data for feature selection) is to remove redundant parameters of the model. However, the pruning strategy ignores the essential role of the convolutional kernel, which is the ability to extract different features of the input data. Each channel extracts a specific local feature, and analysing channel differences non-rigorously makes it easy to find channels with similar effects. Therefore, on the basis of the parameters in the pre-trained model, this study adopted adaptive threshold coding to analyse the basic structural features of channels from multiple perspectives, investigated the effect of similar coding results on feature mappings, and developed new methods to distinguish redundant channels.
For the lottery ticket hypothesis, the SSF method determined the similarity after converting the convolutional channels of the pre-trained model into structural features and clustered the results to remove some of the channels with higher similarity, a process similar to finding a winning ticket. The difference, however, was that SSF preferred the winning ticket not to be unique: the sub-network was obtained by random sparsification within the different similarity categories, and several experiments showed that the results were stable.
Structural features
The lower convolution kernels in a CNN model are good at extracting simple information such as the contour of the target. After multiple layers of nonlinear activation functions, the convolution kernels can extract high-order semantic features, and the information contained in the corresponding feature mappings is incomprehensible to humans. Therefore, methods based on the similarity of interpretable feature mappings have drawbacks when judging the similarity of feature mappings in an arbitrary convolutional layer. In Yao et al. (2021), the authors focused on the differences in feature mappings in colour and texture and analysed the ability of filters to extract features by computing histograms. When feature mappings are used to measure the importance of channels, a part of the dataset must be used as the model input; consequently, the selected sub-network may be biased towards the current dataset. This problem can be avoided by directly using the model parameters to determine channel similarity. In Wang et al. (2021d), Pearson's correlation was exploited to calculate the similarity of the weight parameters; only the expectation and variance of the weights participated in the calculation, so the similarity determination lacked comprehensive consideration. The image block matching technique referred to in Zou et al. (2018) can be used to find the local blocks in an image that are most similar to a fixed reference block, and this computation relies on spatial distance values. If this technique is used to discriminate the similarity between fixed-size convolution kernels, block matching uses the convolutional channel parameters directly to determine similarity, a process with computational criteria similar to those of Wang et al. (2021d). Similar convolution kernels with subtle numerical deviations may produce similar or very different feature mappings, so simply analysing the parameters of convolution kernels is insufficient. The purpose of the present study was to analyse the feature extraction ability of convolution kernels, explore their similar roles, and construct a variety of feature description operators to convert convolution kernels into corresponding structural features. In all, four structural features were developed, namely the symbol feature, magnitude feature, expectation feature, and magnitude mean feature, to establish the theoretical basis for judging channel similarity.
The completed local binary pattern (CLBP) (Guo et al. 2010) is a texture descriptor that can handle textured images and achieve high classification accuracy. CLBP decomposes the local structure of an image into two complementary components, the sign of the difference and the magnitude of the difference, which are used to discriminate the local texture information of an image. This study therefore drew on CLBP to establish feature description operators suited to the convolution kernel and adopted adaptive threshold coding for the convolution kernel so as to express its feature extraction ability qualitatively.
Symbol features. The weight parameters of different convolutional layers are at different scales. When analysing the structure of a channel, its mean value should be subtracted to centralise the parameters and place the channels on the same scale. For convolution kernels of size 3×3, 5×5, and 7×7, the symbol features can be represented as the symbol coding results of the centralised convolution kernels, as shown in Eq. (1).
$F^{s}_{u,v} = s\left(w_{u,v} - \mu\right), \quad u, v \in \{1, \dots, h\}, \qquad s(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases}$    (1)
where $w_{u,v}$ denotes the element of the channel at position $(u,v)$, $\mu$ indicates the expectation of the channel, $h$ represents the convolution kernel size, and $s(x)$ refers to the symbol function. Figure 2a shows a certain convolution kernel, and Fig. 2b shows the centred matrix after subtracting the expectation. The symbol feature matrix obtained by adaptive threshold encoding of this matrix is illustrated in Fig. 2d. In $F^{s}$, the elements greater than 0 were converted to 1, and those less than or equal to 0 were converted to 0. Hence, the vector [0, 1, 1, 1, 1, 0, 1, 0, 0] was obtained after $F^{s}$ was flattened. This vector presented the symbol results of the channel elements as compared with the mean and reflected the positive and negative gain relationships of different positions to the input elements. Because of nonlinear activation functions such as ReLU in the model, recognising the positive and negative gains of channels to elements at different positions was equivalent to discriminating the detailed information in the feature mapping. Thus, the symbol features could reflect the channel's attention to local features.

Magnitude features. Magnitude features can be expressed as the adaptive threshold coding results of comparing the magnitude of the centralisation matrix with the mean value of the magnitude, as shown in Eq. (2).
$F^{m}_{u,v} = s\left(M_{u,v} - \bar{m}\right), \qquad M_{u,v} = \left| w_{u,v} - \mu \right|$    (2)
where $\bar{m}$ denotes the mean of the centralised magnitude matrix $M$. According to the centralised magnitude matrix shown in Fig. 2c, the encoded magnitude feature matrix is shown in Fig. 2e. In $M$, the elements greater than $\bar{m}$ were converted to 1, while those less than or equal to $\bar{m}$ were converted to 0. Hence, the vector [1, 1, 1, 0, 0, 0, 0, 0, 1] was obtained after flattening $F^{m}$. It showed the symbol results of comparing the centralised magnitude matrix with the magnitude mean and reflected the gain strength relationships of different positions to the input elements. The symbol features described the specific information in the feature mapping, while the magnitude features described the strength of that information; the combination of the two features therefore formed an effective basis for judging channel similarity.

The symbol features and the magnitude features reflected, respectively, the positive or negative gain and the gain intensity of different positions in the weights to the input elements. By applying adaptive threshold coding to the convolution kernel, the obtained structural features not only fully reflected the feature extraction capability of the convolution kernel but also avoided similarity misjudgements caused by subtle numerical deviations.
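To make the coding concrete, the following is a minimal NumPy sketch of the adaptive threshold coding in Eqs. (1) and (2) for a single 3×3 channel; the kernel values and the helper names are illustrative and are not taken from the paper's code.

```python
import numpy as np

def symbol_feature(kernel: np.ndarray) -> np.ndarray:
    """Eq. (1): threshold the centralised kernel at zero (1 if > 0, else 0)."""
    centred = kernel - kernel.mean()
    return (centred > 0).astype(np.uint8)

def magnitude_feature(kernel: np.ndarray) -> np.ndarray:
    """Eq. (2): threshold the centralised magnitudes at their own mean."""
    magnitude = np.abs(kernel - kernel.mean())
    return (magnitude > magnitude.mean()).astype(np.uint8)

# A made-up 3x3 channel standing in for the kernel of Fig. 2a.
w = np.array([[-0.30, 0.25, 0.20],
              [ 0.15, 0.10, -0.05],
              [ 0.00, -0.10, 0.15]])

print(symbol_feature(w).flatten())     # [0 1 1 1 1 0 0 0 1] for this kernel
print(magnitude_feature(w).flatten())  # binary coding of the gain strength
```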
The above features analysed local parameters within the weights but lacked an overall analysis of the convolution kernel. As the L1 norm (the sum of the absolute values of the weights) is often used to determine the importance of a filter, this study drew on the L1 norm to construct two global feature description operators, the expectation feature and the magnitude mean feature, both of which can be regarded as decompositions of the L1 norm.
Fig. 2 [Images not available. See PDF.]
Example to illustrate the calculation of structural features
Expectation features. As the weight-sharing mechanism was applied to the convolution calculation, there was a linear relationship between the feature mapping obtained through convolution and the input data, and it was related to the convolution kernel. The image information was reflected by the overall expectation value. After multiple random convolution experiments, it was found that the expectation ratio between the feature mapping and the input data had a linear relationship with the expectation of the channel, as shown in Eq. (3). It was observed that the overall expectation value of the channel determines the overall level of the feature mapping elements, i.e. the brightness information. Thus, the expectation features are defined in Eq. (4).
$\dfrac{E(Y)}{E(X)} = k \, \mu$    (3)
$F^{e} = \mu = \dfrac{1}{h^{2}} \sum_{u=1}^{h} \sum_{v=1}^{h} w_{u,v}$    (4)
where $E(\cdot)$ is used to calculate the overall expectation of an image, $Y$ and $X$ denote the feature mapping and the input data, and the linear coefficient $k$ is constant for the same input image. For example, in Fig. 2f, the overall expectation value of the convolution kernel is taken as the expectation feature $F^{e}$, i.e. [0.05]. Note that the expectation feature also describes a positive or negative gain relationship, but it expresses global information, whereas the symbol features express the local information of the window.

Magnitude mean features. In pruning methods based on model parameters, channels with larger norms are considered important (Liu et al. 2017). Meanwhile, a loss function with a parameter-norm penalty term can be applied to improve the regularisation ability of the model. Therefore, similar to the L1 norm, the magnitude mean feature was used to measure the complexity of the channel, and the calculation formula is shown in Eq. (5). For example, in Fig. 2g, the mean of the centralised magnitude matrix is taken as the magnitude mean feature $F^{mm}$, i.e. [0.18].
$F^{mm} = \bar{m} = \dfrac{1}{h^{2}} \sum_{u=1}^{h} \sum_{v=1}^{h} \left| w_{u,v} - \mu \right|$    (5)
Intuitively, the convolution kernel mean controls the brightness of the feature mapping, and the magnitude mean of the centralisation matrix reflects the complexity of the channel, so both reflect the overall gain of the parameters on the input elements. Structural features are similar to the L1 norm in their focus on the parameter mean, but they pay more attention to the feature extraction capability, and the SSF method analyses the convolution kernel more comprehensively than the L1 norm.

Fig. 3 [Images not available. See PDF.]
Example to illustrate that channels with similar structural features generate similar feature mappings. The input image passes through four random filters to obtain different feature mappings. In each filter, the channels at the corresponding positions have the same symbol features and magnitude features, as well as approximate expectation features and magnitude mean features
The four structural features were exploited to analyse the feature extraction ability of the channel in terms of its gain relationship to the input and its complexity. In Fig. 2h, each channel has structural features with a total dimension of $2h^{2}+2$ after digitisation (the two $h \times h$ coded matrices plus the two scalar features). Structural features reflect the channel's attention to specific information, which can express the role of the channel irrespective of the input data. Moreover, adopting adaptive threshold coding helps to ignore small-scale differences in the weight parameters, so the result is a non-strict similarity judgement. Therefore, channels producing the same effects can be classified according to their structural features.
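Putting the four descriptors together, each channel can be digitised into a single feature vector of this length; the concatenation order in the sketch below is an illustrative assumption rather than the paper's exact layout.

```python
import numpy as np

def structural_features(kernel: np.ndarray) -> np.ndarray:
    """Digitise one channel into a vector of length 2*h*h + 2 by concatenating
    the symbol, magnitude, expectation, and magnitude-mean features."""
    centred = kernel - kernel.mean()
    symbol = (centred > 0).astype(np.float32).ravel()                            # Eq. (1)
    magnitude = np.abs(centred)
    magnitude_code = (magnitude > magnitude.mean()).astype(np.float32).ravel()   # Eq. (2)
    expectation = np.array([kernel.mean()], dtype=np.float32)                    # Eq. (4)
    magnitude_mean = np.array([magnitude.mean()], dtype=np.float32)              # Eq. (5)
    return np.concatenate([symbol, magnitude_code, expectation, magnitude_mean])

print(structural_features(np.random.randn(3, 3)).shape)  # (20,) for a 3x3 channel
```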
Structural features not only express the channel's capabilities numerically but also reflect them comprehensively, and channels with similar structural features produce similar feature mappings. In Fig. 3, an example is presented to illustrate that the feature mappings produced by random channels with similar structural features have high similarity. In this study, the structural similarity index (SSIM) (Wang et al. 2004) was used to objectively evaluate the similarity between feature mappings; it compares similarity in terms of average luminance, contrast, and structural information. The normalised image of the RGB channels was taken as the input, and the feature mappings were obtained using filtering and a ReLU activation function. The channels at corresponding positions in the filters had the same symbol features and magnitude features, and similar expectation features and magnitude mean features. As shown in Fig. 3b, the SSIM values between feature mappings 1, 2, and 3 were all 0.99, showing high similarity. The average SSIM value between feature mapping 4 and the remaining feature mappings was 0.85 because of the large difference in the expectation features and the magnitude mean features (the numerical sign was changed). Although the SSIM value decreased, the same symbol features and magnitude features corresponded to similar information such as details and contours, while the expectation features and magnitude mean features controlled the brightness and complexity information. In summary, random channels with similar structural features could extract similar feature mappings, which provided a theoretical basis for explaining the pruning process in Sect. 4.
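The check illustrated in Fig. 3 can be reproduced in miniature as follows, assuming scikit-image is available; the kernels and the random input are illustrative, and the second kernel is constructed so that it shares the symbol and magnitude codes of the first while its expectation and magnitude-mean features deviate slightly.

```python
import torch
import torch.nn.functional as F
from skimage.metrics import structural_similarity

# Two made-up 3x3 kernels with identical symbol/magnitude codes but
# slightly different expectation and magnitude-mean features.
k1 = torch.tensor([[-0.30, 0.25, 0.20],
                   [ 0.15, 0.10, -0.05],
                   [ 0.00, -0.10, 0.15]])
k2 = k1 * 1.1 + 0.01   # same coding result, small global deviation

img = torch.rand(1, 1, 64, 64)          # stand-in for one normalised input channel
fm1 = F.relu(F.conv2d(img, k1.view(1, 1, 3, 3), padding=1))[0, 0].numpy()
fm2 = F.relu(F.conv2d(img, k2.view(1, 1, 3, 3), padding=1))[0, 0].numpy()

rng = max(fm1.max(), fm2.max()) - min(fm1.min(), fm2.min())
print(structural_similarity(fm1, fm2, data_range=rng))  # close to 1 for similar channels
```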
Network compression
This section focuses on using structural features to determine the similarity between channels and making a pruning decision. In Sect. 4.1, the pruning decision is introduced. In Sect. 4.2, an adaptive pruning ratio is introduced to address the disadvantages of artificially determining the pruning ratio. Finally, network pruning with special structures is described in Sect. 4.3.
Firstly, shared notation is established to describe the decision process of the pruning strategy. Assume that there are N convolutional layers in a trained CNN. The filter set of the i-th convolutional layer can be represented as $\mathcal{W}_{i} = \{W_{i,1}, W_{i,2}, \dots, W_{i,n_{i}}\}$, where $n_{i}$ represents the number of filters in this layer and $c_{i}$ represents the number of channels in each filter. The channel set of the j-th filter can be denoted as $W_{i,j} = \{w_{i,j,1}, w_{i,j,2}, \dots, w_{i,j,c_{i}}\}$. $P$ denotes an artificially set global pruning ratio, and $p_{i}$ represents the compression ratio of the i-th convolutional layer.
Pruning strategies
In Sect. 3, this study numerically encoded the channel's attention to features, analysed the channel's characteristics in terms of its local gain and global gain on the input and its complexity, and used adaptive threshold coding to obtain the symbol, magnitude, expectation, and magnitude-mean features. In making pruning decisions, this study focused on investigating the channel's attention to image features and revealing the essential meaning of similar channels. Determining channel similarity from structural features is a non-strict similarity judgement: the structural features focus on the influence of channels on the gain of the data rather than on the raw parameters, and they are independent of the dataset. Thus, this study calculated the structural feature similarity with spatial distance values to reflect the similarity of channels in terms of feature attention. Channels with the same effects cause redundancy in the model. A clustering algorithm was therefore used to classify the channels, and random sparsification was performed within the categories so that the sub-network retained abundant feature extraction capabilities. Taking similar structural features as the basic theory, the detailed pruning process can be described as follows.
Constructing structural features. In each convolutional layer, the structural feature analysis was conducted channel by channel. Firstly, the expectation value of each channel in the channel set $W_{i,j}$ was calculated to obtain the expectation feature $F^{e}$, and this value was subtracted from each element to obtain the centralisation matrix. Then, the centralisation matrix was processed using Eqs. (1) and (2) to obtain the symbol feature $F^{s}$ and the magnitude feature $F^{m}$, and the magnitude mean feature $F^{mm}$ was calculated using Eq. (5). Each channel thus had corresponding structural features reflecting its ability to extract target features.
Note that there are 1×1 convolution kernels in the ResNet-50 and GoogLeNet models. To keep the numbers of filters in adjacent layers consistent, these kernels were also pruned. The centralisation matrix of a 1×1 convolution kernel is 0, and its expectation value reflects the overall gain relationship between the output and the input. Thus, only the expectation feature was used as the structural feature of 1×1 convolution kernels.
Determining the similarity. The selection of redundant channels was the focus of this study. Following Sect. 3, channels were transformed into corresponding structural features according to the characteristics analysis, and the distance in feature space reflects the similarity between two samples. Hence, the Euclidean distance was applied to measure the distance between structural features as the basis for similarity. As the dimension of the structural features is high and the discriminative power of the Euclidean distance decreases for high-dimensional data, cosine similarity was added to measure the cosine of the angle between the structural features. Compared with the Euclidean distance, cosine similarity pays more attention to differences in direction between samples, which compensates for the weakness of the Euclidean distance in high-dimensional data, while the Euclidean distance strengthens the focus on feature magnitudes to which cosine similarity is insensitive. Therefore, this study calculated the structural feature similarity between channels as shown in Eqs. (6) and (7).
$D_{E}(a, b) = \left\lVert F_{a} - F_{b} \right\rVert_{2} = \sqrt{\sum_{t} \left( F_{a}(t) - F_{b}(t) \right)^{2}}$    (6)
$D_{C}(a, b) = \dfrac{F_{a} \cdot F_{b}}{\left\lVert F_{a} \right\rVert_{2} \, \left\lVert F_{b} \right\rVert_{2}}$    (7)
where $D_{E}(\cdot,\cdot)$ represents the Euclidean distance, $D_{C}(\cdot,\cdot)$ denotes the cosine similarity, and $F_{a}$ and $F_{b}$ are the flattened structural feature vectors of channels a and b. The similarity calculation between channels could be extended to the filters, and the overall similarity of any channel in the filter could be determined using Eq. (8):

$Sim_{E}(a) = \dfrac{1}{c-1} \sum_{b \neq a} D_{E}(a, b), \qquad Sim_{C}(a) = \dfrac{1}{c-1} \sum_{b \neq a} D_{C}(a, b)$    (8)

where $\left(Sim_{E}(a), Sim_{C}(a)\right)$ denotes the channel similarity, i.e. the average similarity of channel a to all other channels (c − 1 in total) in the current convolutional layer. A small average Euclidean distance indicated that the structural feature was at the centre of the cluster in Euclidean space and had a higher similarity to the other channels. The results of the cosine similarity fell within [−1, 1], and larger average values indicated that the structural feature had a direction more similar to the other vectors. If a channel was highly correlated with other channels, its role could be taken over by them, and this channel could be deleted.

Fig. 4 [Images not available. See PDF.]
Visualisation results of channel outputs in randomly selected filters. All of the feature mappings were re-ordered using the channel similarity. The channels’ structural features corresponding to the feature mappings in the green boxes had the lowest similarity; in contrast, the channels’ structural features corresponding to the feature mappings in the orange boxes had the highest similarity. The channels’ structural features corresponding to the feature mappings in the red boxes had a relatively high similarity
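The two-dimensional channel similarity of Eq. (8) can be sketched as follows, given a matrix whose rows are the structural feature vectors of all channels in a layer (e.g. produced by a `structural_features` helper such as the one above); the function name and the return layout are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def channel_similarity(features: np.ndarray) -> np.ndarray:
    """features: (num_channels, feature_dim) structural feature matrix.
    Returns, per channel, the average Euclidean distance and the average
    cosine similarity to all other channels in the layer (Eqs. 6-8)."""
    n = features.shape[0]
    d_euc = squareform(pdist(features, metric="euclidean"))
    d_cos = 1.0 - squareform(pdist(features, metric="cosine"))  # cosine similarity
    np.fill_diagonal(d_cos, 0.0)                                # exclude self-similarity
    sim_euc = d_euc.sum(axis=1) / (n - 1)
    sim_cos = d_cos.sum(axis=1) / (n - 1)
    return np.stack([sim_euc, sim_cos], axis=1)                 # shape (n, 2)
```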
Clustering. Fig. 4 shows the visualised results of different channels in randomly selected filters sorted by channel similarity. Previous work (Liu et al. 2017) always removed or retained the channel with the most similar results, such as the feature mapping in the red box. This resulted in the discontinuity of the channel and the lack of categories, thereby causing performance degradation of the sub-network, which was verified by the ablation experiments discussed in Sect. 5.3.
As shown in Fig. 4, the channels corresponding to the feature mappings in the orange boxes had the structural features with the highest similarity. These feature mappings were in the center of the cluster and had the basic characteristics of most feature mappings. In contrast, the feature mappings in the green boxes were at the edge of the cluster and had completely different characteristics. Sorted from low to high channel similarity, the feature mappings indicated that they gradually tended to the basic characteristics from diverse characteristics. The position of any channel in space could be estimated by analyzing the structural features corresponding to the channels and determining the similarity. Hence, the categories could be determined according to the channel similarity. To make the sub-network adaptable to the diversity of the datasets, as many categories of channels as possible should be retained, thereby generating abundant feature mappings.
In this study, the K-means clustering algorithm was adopted to categorise the channels according to the two-dimensional similarity $\left(Sim_{E}, Sim_{C}\right)$. As the numbers of channels in different convolutional layers differed, the corresponding K value of each layer was determined using Eq. (9) as the number of initial cluster centres. Then, the distance between each sample and the cluster centres was calculated iteratively, and the centre points were adjusted. Finally, the minimum distance between each sample and the centre of the cluster to which it belonged was obtained, as shown in Eq. (10).
9
$\min \sum_{k=1}^{K} \sum_{x \in C_{k}} \left\lVert x - \mu_{k} \right\rVert_{2}^{2}$    (10)
where the set $C_{k}$ contains all samples in the k-th category, and $\mu_{k}$ denotes the centre point of $C_{k}$.

Pruning. The channels were divided into different categories according to the similarity of their structural features. To make the sub-network retain channels with multiple effects, this study randomly removed some channels within the categories, proceeding from the categories with more channels to those with fewer, at a certain pruning ratio. Moreover, it was guaranteed that the same number of channels was removed from each filter. The overall pruning process is shown in Algorithm 1.
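Algorithm 1 itself is not reproduced in this text, but the clustering-and-random-removal step can be sketched with scikit-learn as follows. The policy of always keeping one representative per cluster and the ordering from larger to smaller clusters are simplifying assumptions consistent with the description above, and the paper's Eq. (9) for choosing K is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_channels_to_prune(similarity_2d: np.ndarray, n_clusters: int,
                             prune_ratio: float, seed: int = 0) -> np.ndarray:
    """similarity_2d: (num_channels, 2) per-channel (Euclidean, cosine) similarity.
    Randomly removes channels cluster by cluster, starting from the largest
    cluster, until the requested ratio is reached."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(similarity_2d)
    n_remove = int(round(prune_ratio * len(labels)))
    clusters = sorted(np.unique(labels), key=lambda c: -(labels == c).sum())
    to_prune = []
    for c in clusters:
        members = np.flatnonzero(labels == c)
        # keep at least one representative of each cluster (assumption)
        take = min(len(members) - 1, n_remove - len(to_prune))
        if take > 0:
            to_prune.extend(rng.choice(members, size=take, replace=False))
    return np.sort(np.array(to_prune, dtype=int))
```

Keeping one channel per cluster mirrors the stated goal of retaining as many channel categories as possible in the sub-network.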
Adaptive pruning ratio
This study determined the structural features of the channels layer by layer and pruned them accordingly, with each convolutional layer treated independently of the others. As a result, setting a constant pruning ratio for all layers seriously degraded accuracy, while artificially determining the pruning ratio of each layer relied on prior knowledge or required constant adjustment and re-training, wasting computing resources. An effective solution was to adopt an adaptive pruning ratio according to the actual situation. This study used the difference in overall similarity between convolutional layers to determine the corresponding pruning ratios. As the calculation of structural features used the channels' centralisation parameters rather than normalisation parameters, the overall similarity could reflect the channel dispersion degree and category diversity. A higher pruning ratio was used for convolutional layers with higher overall similarity, while a lower pruning ratio was used for convolutional layers with scattered channels. When pruning layer by layer, the use of different pruning ratios was conducive to retaining more channel categories and achieving a greater compression ratio.
The calculation of the overall similarity of the i-th convolutional layer is shown in Eq. (11), which reflects the dispersion of channels in this layer. A convolutional layer with a higher similarity indicated a more concentrated channel effect with fewer categories.
11
Upon determining the overall similarity layer by layer, the set of overall similarities $\{S_{1}, S_{2}, \dots, S_{N}\}$ was obtained. Then, according to the overall pruning ratio $P$, the pruning ratio corresponding to each convolutional layer was obtained according to Eq. (12).

12
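Because Eqs. (11) and (12) are not reproduced in this text, the sketch below only illustrates the stated idea: layers whose channels are more concentrated (higher overall similarity) receive a larger share of the global pruning budget. The normalisation and the 0.9 cap are assumptions, not the paper's formula.

```python
import numpy as np

def adaptive_layer_ratios(layer_similarity: np.ndarray, global_ratio: float) -> np.ndarray:
    """layer_similarity: one overall-similarity score per convolutional layer
    (higher = more concentrated channels). Allocates per-layer pruning ratios
    proportional to similarity while keeping their mean equal to the global ratio."""
    weights = layer_similarity / layer_similarity.mean()
    ratios = np.clip(global_ratio * weights, 0.0, 0.9)  # cap to avoid emptying a layer
    return ratios
```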
Cross-layer connection adjustment
The pruning method proposed in this study can be applied directly to various architectures, but certain adjustments are required for residual networks with cross-layer connections, because the residual structure fuses the outputs of multiple layers and these outputs must keep a uniform size. To address this issue, this study pruned the residual structure with the assistance of the special operations from Wang et al. (2021f): the channel dimension increase (CDI) pads zero feature maps into the output, and the channel dimension decrease (CDD) removes the zero feature maps, so that networks with cross-layer connections can be pruned freely.
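A minimal sketch of the CDI/CDD idea described above (re-inserting and removing zero feature maps around a residual addition) might look as follows; this is an interpretation of the cited operations, not the implementation of Wang et al. (2021f).

```python
import torch

def channel_dimension_increase(x: torch.Tensor, kept_idx: list, out_channels: int) -> torch.Tensor:
    """Re-insert zero feature maps at pruned positions so that a pruned branch
    can still be added to a residual shortcut of the original width (CDI idea)."""
    n, _, h, w = x.shape
    full = x.new_zeros(n, out_channels, h, w)
    full[:, kept_idx] = x
    return full

def channel_dimension_decrease(x: torch.Tensor, kept_idx: list) -> torch.Tensor:
    """Drop the zero feature maps again and keep only the surviving channels (CDD idea)."""
    return x[:, kept_idx]
```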
Experiments
To verify the effectiveness of the proposed method in reducing model complexity, three classic deep networks (VGG, ResNet, and GoogLeNet) were used for testing, together with three datasets: CIFAR-10/100 (Krizhevsky 2009) and ImageNet ILSVRC-2012 (Russakovsky et al. 2015). The image size in the CIFAR-10 dataset is 32×32, and the dataset includes 50k training images and 10k test images divided into 10 categories; in CIFAR-100, the images are divided into 100 categories. The ImageNet ILSVRC-2012 classification dataset is more challenging: the image size is 224×224, and the dataset includes 1.28 million training images and 50k test images in 1000 categories. Both of the above are general object recognition datasets containing common object categories from everyday life. For domain-specific datasets, such as medical or biological images, the data characteristics differ significantly, so the proposed method and the reported results may not carry over directly. The widely used FLOPs and Top-1 accuracy were adopted as the test standards to evaluate the degree of model complexity reduction and the performance, respectively. All of the experiments were conducted on an NVIDIA Tesla P40 graphics card, and all of the models were implemented and trained using PyTorch (Paszke et al. 2017).
Detail settings
The experiments in this study were all conducted on pre-trained models. The stochastic gradient descent (SGD) algorithm was applied to solve the optimisation problem. In the experiments on CIFAR, the batch size was set to 128, the weight decay to 5e−4, the momentum factor to 0.9, and the initial learning rate to 0.01. To reduce the number of re-trainings, layer-by-layer pruning and cross-layer fine-tuning were adopted; for example, for the ResNet networks, fine-tuning was performed after pruning each residual block. The number of fine-tuning epochs was 25, and the learning rate was divided by 10 at the 10th and 20th epochs. In the experiments on ILSVRC-2012, layer-by-layer pruning and cross-layer fine-tuning were also adopted; the initial learning rate was set to 0.001, the batch size to 512, the weight decay to 1e−4, and the number of fine-tuning epochs to 25, with the learning rate again divided by 10 at the 10th and 20th epochs. The data augmentation strategy followed PyTorch's official random cropping and horizontal flipping operations. The CIFAR datasets were normalised with the mean set to (0.4914, 0.4822, 0.4465) and the standard deviation to (0.2023, 0.1994, 0.2010); for the ILSVRC-2012 dataset, the mean was set to (0.485, 0.456, 0.406) and the standard deviation to (0.229, 0.224, 0.225).
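For reference, the CIFAR fine-tuning setup described above corresponds roughly to the following PyTorch configuration; the crop padding, the placeholder model, and the use of MultiStepLR are assumptions.

```python
import torch
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),          # padding value is an assumption
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

model = torch.nn.Conv2d(3, 10, 3)         # placeholder for the pruned network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# learning rate divided by 10 at the 10th and 20th of the 25 fine-tuning epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 20], gamma=0.1)
```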
SSF adopts stochastic sparsification when generating pruning decisions, so the uncontrollable random process may affect the stability of the results. To confirm the reliability of the method, the experiments on CIFAR-10/100 were repeated three times under the same pruning decision and pruning rate conditions, and the average result was reported as the pruning result. In addition, uncertainty quantification (UQ) analyses of the pruned model were carried out with two methods in Sect. 5.5.
CIFAR-10/100 experimental results
Results on CIFAR-10 dataset
The VGG-16 model and the ResNet-56 model were first compressed by setting different overall pruning rates, and the results are shown in Tables 2 and 3. For the VGG-16 model, when the FLOPs of the model were reduced by 48.12%, the accuracy of the pruned model only decreased by 0.16%. Moreover, the accuracy of VGG-16 decreased by 0.80% when nearly 70% of the FLOPs were removed.
For the ResNet-56 model, when FLOPs of the model were reduced by 52.25%, the accuracy of the pruned model increased by 0.48%. Furthermore, the accuracy of ResNet-56 decreased by 0.70% when 69.93% of the FLOPs were removed.
Table 2. Performance of SSF method for pruning on VGG-16 and CIFAR-10 datasets
| Pruning rate | Top-1 Acc | Parameters | FLOPs |
|---|---|---|---|
| (BL) | 93.96% | 14.98M | 313.73M |
| | 93.80% | 9.09M | 162.75M |
| | 93.45% | 7.35M | 130.51M |
| | 93.41% | 6.94M | 117.60M |
| | 93.16% | 6.44M | 96.24M |
Table 3. Performance of SSF method for pruning on ResNet-56 and CIFAR-10 datasets
| Pruning rate | Top-1 Acc | Parameters | FLOPs |
|---|---|---|---|
| (BL) | 93.26% | 0.85M | 125.49M |
| | 93.74% | 0.42M | 59.92M |
| | 93.20% | 0.36M | 53.82M |
| | 92.90% | 0.27M | 41.35M |
| | 92.56% | 0.23M | 37.73M |
Comparison
The method proposed in this study was compared with some of the mainstream pruning methods, including CAC (Liu et al. 2019), EPruner (Lin et al. 2021a), SCOP (Tang et al. 2020), HRank (Lin et al. 2020), PR (Zhuang et al. 2020), SAI (Aketi et al. 2020), SENS (Yang and Liu 2022), SST (Lian et al. 2021), FPC (Chen et al. 2022), GlobalPru (Wang et al. 2024), MOPFMS (Jiang et al. 2023), and EACP (Liu et al. 2023b). The comparison results are presented in Tables 4 and 5.
Table 4. Pruning performance on CIFAR-10
| Model | Approach | Top-1 Acc | Acc. change | Parameters | Parameters ↓ | FLOPs | FLOPs ↓ |
|---|---|---|---|---|---|---|---|
| | SSF (ours) | 93.81% | −0.15% | 9.13M | 39.05% | 163.86M | 47.77% |
| | HRank (2020) | 93.43% | −0.53% | 2.51M | 82.90% | 145.61M | 53.50% |
| VGG-16 | SENS (2022) | 93.17% | −0.53% | N/A | N/A | N/A | 54.10% |
| | GlobalPru (2023) | 93.29% | −0.21% | N/A | 71.34% | N/A | 55.00% |
| | SSF (ours) | 93.45% | −0.51% | 7.37M | 50.80% | 131.63M | 58.14% |
| | CAC (2021) | 93.10% | +0.22% | 0.43M | 49.04% | 62.70M | 50.03% |
| | SSF (ours) | 93.72% | +0.46% | 0.41M | 51.76% | 59.92M | 52.25% |
| | SCOP (2020) | 93.64% | −0.06% | N/A | 56.30% | N/A | 56.00% |
| ResNet-56 | SENS (2022) | 93.45% | −0.33% | N/A | N/A | N/A | 56.80% |
| | EPruner (2021) | 93.18% | −0.08% | 0.39M | 54.20% | 49.35M | 61.33% |
| | FPC (2022) | 93.39% | −0.39% | 0.18M | 78.82% | 63.74M | 49.74% |
| | SSF (ours) | 93.19% | −0.07% | 0.36M | 57.64% | 53.82M | 57.11% |
| | EACP (2023) | 93.11% | −0.09% | 0.31M | 75.40% | 40.44M | 68.00% |
| | MOPFMS (2023) | 93.47% | −0.31% | N/A | 72.28% | N/A | 76.26% |
| | SST (2021) | 93.38% | −0.32% | N/A | N/A | 88.88M | 63.15% |
| | FPC (2022) | 93.83% | −0.60% | 0.29M | 83.23% | 90.53M | 64.53% |
| | FSIM-E (2023) | 93.48% | +0.18% | 0.39M | 54.10% | 59.24M | 53.20% |
| ResNet-110 | SSF (ours) | 94.06% | +0.56% | 0.61M | 64.53% | 89.29M | 64.69% |
| | EACP (2023) | 93.35% | +0.05% | 0.59M | 65.90% | 80.90M | 68.30% |
| | MOPFMS (2023) | 94.47% | +0.04% | N/A | 69.38% | N/A | 67.32% |
| | HRank (2020) | 94.53% | −0.52% | 2.74M | 55.40% | 0.69G | 54.90% |
| GoogLeNet | EPruner (2021) | 94.99% | −0.06% | 2.22M | 64.20% | 0.50G | 67.36% |
| | EACP (2023) | 94.80% | +0.08% | 2.49M | 59.60% | 0.56G | 62.30% |
| | SSF (ours) | 95.17% | +0.12% | 1.97M | 67.96% | 0.46G | 69.73% |
Bold value indicates the best result of the model
Table 5. Pruning performance on CIFAR-100
| Model | Approach | Top-1 Acc | Acc. change | Parameters | Parameters ↓ | FLOPs | FLOPs ↓ |
|---|---|---|---|---|---|---|---|
| | PR (2020) | 72.46% | −0.06% | N/A | N/A | N/A | 25.00% |
| ResNet-56 | FPC (2021) | 71.30% | −0.43% | 0.26M | 69.76% | 74.95M | 40.90% |
| | CAC (2021) | 70.10% | −0.80% | 0.46M | 45.69% | 62.70M | 50.02% |
| | SSF (ours) | 70.72% | −1.28% | 0.37M | 56.47% | 54.86M | 56.28% |
| | SAI (2020) | 72.35% | −0.93% | 1.33M | 23.00% | 154.27M | 39.00% |
| ResNet-110 | CAC (2021) | 72.16% | −0.37% | 0.93M | 45.65% | 126.40M | 50.02% |
| | FPC (2022) | 72.39% | +0.08% | 0.45M | 73.98% | 125.93M | 50.66% |
| | SSF (ours) | 71.12% | −1.00% | 0.57M | 66.86% | 82.79M | 67.26% |
| GoogLeNet | Random | 78.06% | +0.04% | 2.30M | 62.60% | 0.52G | 65.79% |
| | SSF (ours) | 78.65% | +0.63% | 2.30M | 62.60% | 0.52G | 65.79% |
Bold value indicates the best result of the model
CIFAR-10: Table 4 shows the results of various models on CIFAR-10. Overall, the SSF method achieved higher model acceleration than the other state-of-the-art methods. For VGG-16, SSF reduced the FLOPs by 58.14% while the accuracy decreased by 0.51% compared with the baseline model. For ResNet-56, the SSF method achieved a 52.25% reduction in FLOPs and a 51.76% reduction in parameters while the accuracy increased by 0.46% compared with the baseline model; when greater compression was pursued, the FLOPs decreased by 57.11% while the accuracy hardly changed. Compared with the FPC method (Chen et al. 2022), which is based on the contribution of feature mappings, and with the SENS method (Yang and Liu 2022), which calculates filter sensitivity from model parameters, the loss of model accuracy was more serious for those methods when a large amount of compression was applied, whereas at comparable FLOPs compression the SSF method reduced the accuracy drop. With respect to the MOPFMS and EACP methods, both achieved greater pruning rates with less loss on the ResNet-56 model; on the other models, SSF achieved results similar to EACP.
CIFAR-100: Table 5 shows the results of various models on CIFAR-100. For relatively complex datasets, the SSF method showed the advantage of adapting to specific datasets. The three models were subjected to different compression degrees, and the accuracy showed an upward trend. For ResNet-56, SSF introduced a 56.28% drop in FLOPs, and the accuracy was reduced by 1.28%. Similarly, for GoogLeNet, SSF achieved a 65.79% reduction in FLOPs while improving the accuracy by 0.61% compared with the baseline model.
Table 6. Pruning performance on ImageNet ILSVRC2012
| Approach | Top-1/Top-5 Acc | Acc. change | Parameters | Parameters ↓ | FLOPs | FLOPs ↓ |
|---|---|---|---|---|---|---|
| Model: ResNet-18 | | | | | | |
| CAC (2021) | 68.17%/88.35% | −0.29%/−0.29% | 7.70M | 29.97% | 1.30G | 25.01% |
| EPruner (2021) | 67.31%/87.70% | −2.35%/−1.38% | 6.05M | 48.25% | 1.02G | 43.88% |
| SSF (ours) | 67.02%/87.71% | −2.64%/−1.37% | 6.36M | 49.37% | 0.99G | 45.60% |
| Model: ResNet-50 | | | | | | |
| HRank (2020) | 74.98%/92.33% | −1.17%/−0.54% | 16.15M | 36.67% | 2.30G | 43.76% |
| WB (2022) | 75.32%/92.43% | −0.83%/−0.53% | N/A | N/A | 2.22G | 45.60% |
| EPruner (2021) | 74.26%/91.88% | −1.75%/−1.08% | 12.70M | 50.31% | 1.93G | 53.35% |
| DCP (2021) | 75.15%/92.30% | −0.86%/−0.63% | N/A | 54.99% | N/A | 52.41% |
| SCOP-B (2020) | 75.26%/92.53% | −0.89%/−0.34% | N/A | 51.80% | N/A | 54.60% |
| SRR-GR (2021) | 75.11%/92.86% | −1.02%/−0.95% | N/A | N/A | N/A | 55.10% |
| SENS (2022) | 75.23%/N/A | −0.92%/N/A | N/A | N/A | N/A | 56.30% |
| GReg-2 (2021) | 75.36%/N/A | −0.77%/N/A | N/A | N/A | N/A | 56.72% |
| TPP (2023) | 75.60%/N/A | −0.53%/N/A | N/A | N/A | N/A | 56.72% |
| EACP (2023) | 73.95%/91.79% | −1.87%/−1.10% | 11.59M | 54.70% | 1.86G | 53.80% |
| ARPruning (2023) | 72.31%/N/A | −3.84%/N/A | 11.02M | 56.80% | 1.80G | 56.00% |
| L1-norm | 73.40%/91.41% | −2.75%/−1.46% | 12.65M | 50.43% | 1.74G | 57.45% |
| SSF (ours) | 75.14%/92.43% | −1.01%/−0.44% | 12.64M | 50.43% | 1.73G | 57.70% |
Bold value indicates the best result of the model
ILSVRC-2012 experimental results
The method proposed in this study was compared with several popular methods, including WB (Zhang et al. 2022), CAC (Liu et al. 2019), SRR-GR (Wang et al. 2021e), SCOP (Tang et al. 2020), EPruner (Lin et al. 2021a), DCP (Liu et al. 2021), SENS (Yang and Liu 2022), GReg (Wang et al. 2021a), TPP (Wang and Fu 2022), EACP (Liu et al. 2023b), and ARPruning (Yuan et al. 2024). The results are shown in Table 6. ResNet-18 has fewer model parameters and was unstable at a larger pruning ratio, whereas ResNet-50 has more parameters and could adapt to complex classification tasks. When the FLOPs were reduced by 57.70%, the SSF method reduced the Top-1 accuracy by only 1.01%. In contrast, EPruner (Lin et al. 2021a), which also uses adaptive exemplar filters to simplify the pruning strategy, achieved a 53.35% reduction in FLOPs with a 1.75% reduction in accuracy. Compared with SRR-GR (Wang et al. 2021e), which identifies the most redundant convolutional layers in the model through redundancy discrimination and prunes the corresponding filters, SSF achieved greater compression for a comparable degree of accuracy degradation.
Previous methods have assumed that channels with smaller L1 norms contribute less to the model. Under the same experimental setup, this L1-norm criterion achieved only 73.40% Top-1 accuracy while reducing FLOPs by 57.45%, showing that SSF is more effective at analysing the similarity of channel roles than at ranking weight magnitudes. In contrast, GReg-2 and TPP achieved 75.36% and 75.60% Top-1 accuracy, respectively, when FLOPs were reduced by 56.72%, which is superior to the SSF result. Both of these methods use regularisation to gradually separate unimportant parameters from the network during training, whereas the SSF method converts the convolutional kernels to structural features, performs the similarity judgement and clustering once on the pre-trained model, and then prunes gradually during fine-tuning. The two types of methods therefore differ in when their pruning decisions are made, which may be one reason for the relatively weaker performance of SSF in this comparison. Nevertheless, SSF is, on the whole, a new pruning method that discriminates the similarity of model parameters, is computationally simple, and outperforms many other methods.
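For reference, the L1-norm baseline used in this comparison can be summarised by the minimal sketch below: the filters of a convolutional layer are ranked by the L1 norm of their weights, and the filters with the smallest norms are treated as the first pruning candidates. The layer dimensions are placeholders, and this is an illustration of the criterion rather than the exact implementation of the compared method.

```python
import torch
import torch.nn as nn

# Stand-in convolutional layer; in practice this would be a layer of the
# network being pruned (the channel counts here are arbitrary).
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)

# Weight shape: (out_channels, in_channels, kH, kW).
# L1 norm of each filter's weights, one value per output channel.
l1_per_filter = conv.weight.detach().abs().sum(dim=(1, 2, 3))

# Under the L1-norm criterion, filters with the smallest norms are assumed
# least important and are the first candidates for removal.
prune_order = torch.argsort(l1_per_filter)
print(prune_order[:8])
```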
In terms of the overall performance on the three datasets, the SSF method demonstrated favourable pruning effectiveness. This is attributable to the fact that it computes the similarity between the structural features of the channels, which is detached from the influence of any particular dataset and therefore remains effective after the dataset is replaced.
Fig. 5 [Images not available. See PDF.]
Visualisation results of channel distance distribution. The first sub-figure shows the original distribution and the categories divided by the clustering algorithm; the second one shows the distribution results after conducting random sparsity according to the categories; the third represents the categories divided by Eqs. (14) and (15)
Ablation study
Pruning decision As shown in Fig. 4, according to the similarity of structural features, the position of a channel within its cluster could be analysed to indirectly reflect the type of characteristics of the corresponding feature mapping. To verify which pruning decision gives the sub-network a strong generalisation ability, removing the channels with higher similarity and removing those with lower similarity were compared as pruning strategies. Since a lower Euclidean distance and a higher cosine similarity both indicate a higher similarity between channels, the two criteria were adopted jointly to determine channel similarity, as shown in Eq. (13). Combined with the actual pruning ratio of the convolutional layer, this equation defines an elliptical range centred on the minimum Euclidean distance and the maximum cosine similarity, with the pruning ratio determining the radius of the ellipse.
(13) [Equation not available. See PDF.]
where s represents the one-dimensional similarity of the channel. The removal of the channels with higher similarity according to a percentage can be expressed using Eq. (14), where the selection function returns the highest v% of values in the selection results. The removal of the channels with lower similarity according to a percentage can be expressed using Eq. (15), where the function returns the lowest v% of values in the selection results.
(14) [Equation not available. See PDF.]
(15) [Equation not available. See PDF.]
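Because Eqs. (13)–(15) are not reproduced here, the following sketch only illustrates the idea in simplified form: each channel's structural feature vector is scored by combining its Euclidean distance and cosine similarity to a reference point (here the mean structural feature, which is an assumption), and the top or bottom v% of channels by this score are selected in the spirit of Eqs. (14) and (15). The normalisation and the reference point are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def joint_similarity(features):
    """features: (num_channels, d) matrix of structural features."""
    centre = features.mean(axis=0)                      # assumed reference point
    dist = np.linalg.norm(features - centre, axis=1)    # Euclidean distance
    cos = features @ centre / (
        np.linalg.norm(features, axis=1) * np.linalg.norm(centre) + 1e-12
    )                                                   # cosine similarity
    # Smaller distance and larger cosine mean higher similarity; rescale both
    # to [0, 1] and combine into one score s (higher s = more similar).
    dist_n = (dist - dist.min()) / (np.ptp(dist) + 1e-12)
    cos_n = (cos - cos.min()) / (np.ptp(cos) + 1e-12)
    return cos_n - dist_n

def most_similar(s, v):
    """Indices of the top v% most similar channels (in the spirit of Eq. (14))."""
    k = max(1, int(len(s) * v / 100))
    return np.argsort(s)[-k:]

def least_similar(s, v):
    """Indices of the bottom v% least similar channels (in the spirit of Eq. (15))."""
    k = max(1, int(len(s) * v / 100))
    return np.argsort(s)[:k]

features = np.random.randn(64, 9)   # e.g. 64 channels with 3x3 structural codes
s = joint_similarity(features)
print(most_similar(s, 20), least_similar(s, 20))
```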
To intuitively express the differences in channel selection between the different methods, Fig. 5 visualises the distance distribution of all of the channels in the filter. In this figure, the second column of sub-figures shows the sparse results obtained after clustering the channels: channels of every category are retained, so the abundant feature extraction capability of the original network is maintained. However, as in the work of Liu et al. (2017), the pruning decisions formulated by Eqs. (14) and (15) have certain flaws. Dividing the categories by percentage according to the one-dimensional similarity of the channels leads to missing and discontinuous categories, as well as the loss of some feature extraction capability, as shown in the third column of sub-figures in Fig. 5.
Finally, using the CIFAR-10 dataset with the same hyper-parameters and pruning ratios, comparative experiments were conducted on the ablated pruning strategies. The experimental results are shown in Fig. 6. The method of retaining only the most similar channels kept only a few channel categories, resulting in poor feature extraction ability and the lowest accuracy. In contrast, the method of removing the most similar channels preserved more channel categories but discarded the channels in the centre of each cluster, which also resulted in reduced accuracy. The comparison experiments further demonstrated that a well-performing network must have abundant feature extraction capabilities. The cluster-based random sparsity method proposed in this study achieved the highest performance: for the VGG-16 model, it obtained 0.64% and 0.41% higher accuracy than the two alternative methods, respectively, and for the ResNet-56 model, 0.48% and 0.44% higher accuracy, respectively. In summary, retaining the ability of the original model to extract abundant features is the key to making pruning decisions.
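The cluster-then-randomly-sparsify idea compared in this ablation can be sketched as follows. The clustering algorithm (KMeans), the number of clusters, and the per-cluster retention rule below are stand-ins chosen for illustration and do not reproduce the exact procedure of SSF.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_random_prune(features, prune_ratio, n_clusters=8, seed=0):
    """Group channels by structural-feature similarity, then randomly drop a
    proportion inside each group so every category keeps representatives."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(features)
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        n_keep = max(1, int(round(len(idx) * (1 - prune_ratio))))
        keep.extend(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.array(keep))      # indices of channels to retain

features = np.random.randn(64, 9)       # placeholder structural features
print(cluster_random_prune(features, prune_ratio=0.5))
```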
Fig. 6 [Images not available. See PDF.]
Comparison of the method of removing the most similar channels, the method of retaining the most similar channels, and the proposed clustering method. Three different models were evaluated on the CIFAR-10 dataset, each with the same hyperparameters and pruning ratios. Top-1 accuracy is used to reflect the performance of the pruning results
Adaptive pruning ratio The pruning rate is a key hyperparameter in pruning work; however, because of the variability between convolutional layers, setting a fixed pruning rate degrades model performance, while setting layer-specific pruning rates relies heavily on a priori knowledge. In this subsection, the effectiveness of the adaptive pruning rate is verified by comparison against fixed pruning rates on different models, as shown in Table 7 (an illustrative sketch of one such allocation is given after the table). A fixed pruning rate of [0.34, 0.57*55] means that the first convolutional layer was pruned at a rate of 0.34 and each of the subsequent 55 convolutional layers at a rate of 0.57. The adaptive scheme only required setting the overall pruning rate; the pruning rate of each layer was then calculated using Eq. (12). The models obtained with the adaptive pruning ratio proposed in Sect. 4.2 showed different degrees of improvement over those obtained with a fixed pruning ratio, because appropriate pruning ratios could be assigned to different convolutional layers. For example, in the VGG-16 model, after setting the overall pruning rate to 0.58, the per-layer pruning rates were calculated as [0.34, 0.58, 0.55, 0.55, 0.55, 0.55, 0.59, 0.61, 0.6, 0.61, 0.58, 0.58, 0.58, 0.58], which resulted in less accuracy loss and greater parameter compression than the fixed pruning rate.
Table 7. Comparison experiment of adaptive pruning rate and constant pruning rate
Model | Pruning rate | Top-1 Acc | Parameters | FLOPs |
|---|---|---|---|---|
[0.34, 0.57*55] | 93.08% | 0.35M | 54.87M | |
ResNet-56 | Adaptive pruning ratio (overall rate 0.57) | 93.08% | 0.35M | 54.87M |
[0.34, 0.62*55] | 92.75% | 0.28M | 42.23M | |
Adaptive pruning ratio (overall rate 0.62) | 92.84% | 0.29M | 43.52M | |
[0.34, 0.65*109] | 93.57% | 0.58M | 87.34M | |
ResNet-110 | Adaptive pruning ratio (overall rate 0.65) | 94.06% | 0.61M | 89.29M |
[0.34, 0.74*109] | 93.08% | 0.42M | 63.78M | |
Adaptive pruning ratio (overall rate 0.74) | 93.21% | 0.42M | 64.17M | |
[0.34, 0.58*12] | 93.20% | 7.44M | 130.69M | |
VGG-16 | Adaptive pruning ratio (overall rate 0.58) | 93.45% | 7.37M | 131.63M |
[0.34, 0.62*12] | 93.07% | 6.96M | 119.44M | |
Adaptive pruning ratio (overall rate 0.62) | 93.44% | 7.38M | 120.57M |
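The sketch below illustrates one plausible way of allocating per-layer pruning rates from a per-layer dispersion measure so that layers with less dispersed (more similar) channels are pruned harder while the average rate matches an overall target, in the spirit of the adaptive scheme above. The dispersion definition, the inverse weighting, and the clipping bounds are assumptions and do not reproduce Eq. (12).

```python
import numpy as np

def adaptive_rates(dispersions, overall_rate, r_min=0.1, r_max=0.8):
    """Assign higher pruning rates to layers whose channels are less dispersed,
    scaled so that (before clipping) the mean rate equals the overall target."""
    d = np.asarray(dispersions, dtype=float)
    inv = 1.0 / (d + 1e-12)                  # low dispersion -> high rate
    rates = inv / inv.mean() * overall_rate  # pre-clip mean equals the target
    return np.clip(rates, r_min, r_max)

# e.g. a dispersion value measured for each of 13 convolutional layers
dispersions = np.random.uniform(0.5, 2.0, size=13)
print(adaptive_rates(dispersions, overall_rate=0.58))
```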
Uncertainty and sensitivity analysis
Uncertainty analysis
Table 8. Quantitative uncertainty analysis of pruned models
Model | Top-1 Acc. | MC-Dropout | | | VI | | |
|---|---|---|---|---|---|---|---|
| | | ECE | MCE | NLL | ECE | MCE | NLL |
93.96% (BL) | 0.0092 | 0.3301 | 0.3013 | 0.0092 | 0.3300 | 0.3012 | |
VGG-16 | 93.80% | 0.0078 | 0.2675 | 0.2620 | 0.0077 | 0.2660 | 0.2620 |
93.45% | 0.0085 | 0.3265 | 0.2696 | 0.0085 | 0.3430 | 0.2696 | |
93.26% (BL) | 0.0086 | 0.2549 | 0.2893 | 0.0086 | 0.2549 | 0.2893 | |
ResNet-56 | 93.72% | 0.0071 | 0.1587 | 0.2573 | 0.0075 | 0.2136 | 0.2615 |
93.19% | 0.0075 | 0.2119 | 0.2615 | 0.0071 | 0.1623 | 0.2573 | |
92.90% | 0.0067 | 0.1782 | 0.2587 | 0.0067 | 0.1807 | 0.2587 | |
ResNet-110 | 93.50% (BL) | 0.0088 | 0.2827 | 0.3057 | 0.0088 | 0.2776 | 0.3057 |
94.04% | 0.0072 | 0.2149 | 0.2449 | 0.0073 | 0.2154 | 0.2449 | |
GoogLeNet | 95.05% (BL) | 0.0048 | 0.1927 | 0.1852 | 0.0049 | 0.1927 | 0.1852 |
95.17% | 0.0051 | 0.1989 | 0.1791 | 0.0051 | 0.1988 | 0.1791 | |
Uncertainty quantification (UQ) is an important criterion for evaluating the performance of deep learning-based models. Because a model's predictions carry uncertainty and are susceptible to noise and erroneous inference, UQ analysis is often required to assess the stability and accuracy of a predictive model. Mainstream approaches quantify model uncertainty mainly through approximate Bayesian inference (Gal and Ghahramani 2016); building on this, Abbaszadeh Shahri et al. (2022) proposed a more efficient method that automatically and randomly deactivates connection weights. The main objective of this section is to further validate the effectiveness of the SSF method, which involves a stochastic sparsification process, by analysing the change in uncertainty metrics before and after pruning the model.
Specifically, the MC-Dropout and variational inference (VI) methods in the Bayesian Deep Learning Library (BayesDLL) tool (Kim and Hospedales 2023) were used to evaluate the uncertainty of the pruning results of the four models on the CIFAR-10 dataset. The expected calibration error (ECE), maximum calibration error (MCE), and negative log-likelihood (NLL) were adopted as the evaluation metrics; they measure the degree of matching between prediction accuracy and prediction confidence, and the closeness between the predicted and true label distributions, respectively. The quantitative uncertainty results of the four pruned representative models are shown in Table 8.
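As a reference for how these metrics are computed, the snippet below follows the standard definitions of ECE, MCE, and NLL (Guo et al. 2017) over a batch of model outputs. The logits and labels are dummy placeholders, and this is a minimal sketch rather than the BayesDLL implementation used in the experiments.

```python
import torch
import torch.nn.functional as F

def ece_mce_nll(logits, labels, n_bins=15):
    """Standard ECE / MCE / NLL over a batch of predictions."""
    probs = F.softmax(logits, dim=1)
    confidence, prediction = probs.max(dim=1)
    correct = prediction.eq(labels).float()
    nll = F.cross_entropy(logits, labels).item()

    bin_edges = torch.linspace(0, 1, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            # gap between empirical accuracy and mean confidence in this bin
            gap = (correct[in_bin].mean() - confidence[in_bin].mean()).abs().item()
            ece += in_bin.float().mean().item() * gap
            mce = max(mce, gap)
    return ece, mce, nll

# Dummy outputs standing in for a pruned model's predictions on CIFAR-10.
logits = torch.randn(1000, 10)
labels = torch.randint(0, 10, (1000,))
print(ece_mce_nll(logits, labels))
```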
In Table 8, the baseline models of VGG-16 and ResNet-56 have higher accuracies but also high ECE, MCE, and NLL. Consistent with Guo et al. (2017), pre-trained models may suffer from overconfidence, which leads to a miscalibration between accuracy and confidence. Guo et al. (2017) further showed that the higher the network complexity, the more serious the overconfidence problem; in addition, a pre-trained model has a potential overfitting problem, and the higher the fit, the higher the confidence and the greater the ECE. The data in Table 8 show that, although the accuracy of the models obtained after SSF pruning decreased, all of them reduced the model uncertainty, easing the mismatch between model accuracy and confidence. This confirms that the stochastic sparsification process in SSF removes redundant channels, reduces the model width, reduces excessive attention to local features, and thereby alleviates the overconfidence problem of the model.
Sensitivity analysis
Fig. 7 [Images not available. See PDF.]
Sensitivity analysis of SSF. VGGNet-16 with an overall pruning rate of 0.57 and ResNet-56 with an overall pruning rate of 0.55; the rest of the settings are consistent with Sect. 5.1
To verify the sensitivity of SSF to different input data, this section explores how changing the distribution of the input data affects the pruned model. Before training, the input data are normalised by mean and standard deviation to speed up model convergence. Therefore, the random seed was fixed during training, the overall pruning rate was set to 0.57 for VGGNet-16 and 0.55 for ResNet-56, and the mean value of the input data was used as the experimental independent variable to analyse the sensitivity of SSF to data with different distributions. The results of the sensitivity analysis are shown in Fig. 7, where m is the hyperparameter setting used in Sect. 5.1; for the remaining points on the horizontal axis, all three channels share the same mean value, with the standard deviation set to 0.5 for all three channels. When the mean is 0.5 and the standard deviation is 0.5, the three RGB channels are distributed as close to a normal form as possible within (−1, 1), and the model output can be activated to a greater extent. The point m, set according to the dataset sampling statistics, achieved the maximum activation of the output; hence, determining the optimal normal distribution of the input data can maximise the activation of the model output.
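As an illustration of the normalisation varied in this analysis, the snippet below shows how the per-channel mean shifts the range of normalised CIFAR-10-sized inputs when the standard deviation is fixed at 0.5; the mean values are example points and not the exact settings of Fig. 7.

```python
import torch
from torchvision import transforms

x = torch.rand(3, 32, 32)   # a CIFAR-10-sized image with values in [0, 1]
for mean in (0.3, 0.5, 0.7):
    normalise = transforms.Normalize(mean=[mean] * 3, std=[0.5] * 3)
    y = normalise(x)
    # with mean 0.5 and std 0.5, the normalised values fall within [-1, 1]
    print(f"mean={mean}: range [{y.min().item():.2f}, {y.max().item():.2f}]")
```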
Conclusion
In this paper, a novel convolutional channel pruning method named SSF was proposed. It discriminates the similarity of the feature extraction roles of channels, classifies the channels into multiple categories based on their overall similarity, and finally sparsifies the channels in each category so as to retain a pruned model with rich feature extraction capabilities. The effectiveness of SSF was verified on several classical CNN models using the CIFAR and ILSVRC-2012 datasets. On the ILSVRC-2012 dataset, SSF reduced the FLOPs of ResNet-50 by 57.70%, while the Top-1 accuracy was reduced by only 1.01%.
SSF, as an effective pruning method, can be applied to any model with a convolutional structure, significantly compressing model parameters and computation. However, the SSF method has not yet been applied to the acceleration and compression of other real-world visual tasks. Therefore, the next step is to explore the application of SSF in such real-world settings.
Author Contributions
Sun Chuanmeng: Conceptualization, Funding Acquisition, Methodology, Visualization, Writing-Review. Chen Jiaxin: Conceptualization, Methodology, Software, Writing-Original Draft. Li Yong: Conceptualization, Methodology, Funding Acquisition. Wang Yu: Funding Acquisition, Data Curation. Ma Tiehua: Supervision, Investigation.
Funding
This work was supported by the National Key Research and Development Program of China (2022YFC2905700), the National Key Research and Development Program of China (2022YFB3205800), and the Fundamental Research Programs of Shanxi Province (202203021212129, 202203021221106). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Data availability
The datasets that support the findings of this study are available in Krizhevsky (2009) and Russakovsky et al. (2015). Our code is available at https://github.com/sunchuanmeng/SSF_Pruning. The results of our experiments have also been uploaded to Google Cloud Drive; the link is provided in the Readme file on GitHub.
Declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
This article does not contain any research on animals by any of the authors.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Abbaszadeh Shahri, A; Shan, C; Larsson, S. A novel approach to uncertainty quantification in groundwater table modeling by automated predictive deep learning. Nat Resour Res; 2022; 31,
Aketi, SA; Roy, S; Raghunathan, A et al. Gradual channel pruning while training using feature relevance scores for convolutional neural networks. IEEE Access; 2020; 8, pp. 171924-171932. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3024992]
Bai Y, Wang H, Tao Z et al (2022) Dual lottery ticket hypothesis. In: 10th international conference on learning representations, ICLR 2022
Chang, J; Lu, Y; Xue, P et al. Global balanced iterative pruning for efficient convolutional neural networks. Neural Comput Appl; 2022; 34,
Chen, Z; Xu, TB; Du, C et al. Dynamical channel pruning by conditional accuracy change for deep neural networks. IEEE Trans Neural Netw Learn Syst; 2020; 32,
Chen, Y; Wen, X; Zhang, Y et al. FPC: filter pruning via the contribution of output feature map for deep convolutional neural networks acceleration. Knowl-Based Syst; 2022; 238, [DOI: https://dx.doi.org/10.1016/j.knosys.2021.107876]
Dong X, Yang Y (2019) Network pruning via transformable architecture search. Adv Neural Inf Process Syst 32
Dong, X; Yan, P; Wang, M et al. An optimization method for pruning rates of each layer in CNN based on the GA-SMSM. Memetic Comput; 2024; 16,
Frankle J, Carbin M (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In: 7th international conference on learning representations, ICLR 2019
Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: International conference on machine learning. PMLR, pp 1050–1059
Guo, Z; Zhang, L; Zhang, D. A completed modeling of local binary pattern operator for texture classification. IEEE Trans Image Process; 2010; 19,
Guo Y, Yao A, Chen Y (2016) Dynamic network surgery for efficient dnns. Advances in neural information processing systems 29
Guo C, Pleiss G, Sun Y et al (2017) On calibration of modern neural networks. In: International conference on machine learning. PMLR, pp 1321–1330
Han S, Pool J, Tran J et al (2015) Learning both weights and connections for efficient neural network. Adv Neural Inf Process Syst 28
He K, Zhang X, Ren S et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
He Y, Zhang X, Sun J (2017) Channel pruning for accelerating very deep neural networks. In: Proceedings of the IEEE international conference on computer vision, pp 1389–1397
He Y, Lin J, Liu Z et al (2018) AMC: automl for model compression and acceleration on mobile devices. In: Proceedings of the European conference on computer vision (ECCV), pp 784–800
He Y, Liu P, Wang Z et al (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4340–4349
Jiang, P; Xue, Y; Neri, F. Convolutional neural network pruning based on multi-objective feature map selection for image classification. Appl Soft Comput; 2023; 139, [DOI: https://dx.doi.org/10.1016/j.asoc.2023.110229] 110229.
Kim M, Hospedales T (2023) BayesDLL: Bayesian deep learning library. In: arXiv preprint arXiv:2309.12928
Krizhevsky A (2009) Learning multiple layers of features from tiny images. Technical report
Lee N, Ajanthan T, Torr PH (2019) SNIP: single-shot network pruning based on connection sensitivity. In: 7th International conference on learning representations, ICLR 2019
Lian, Y; Peng, P; Xu, W. Filter pruning via separation of sparsity search and model training. Neurocomputing; 2021; 462, pp. 185-194. [DOI: https://dx.doi.org/10.1016/j.neucom.2021.07.083]
Lin S, Ji R, Yan C et al (2019) Towards optimal structured CNN pruning via generative adversarial learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2790–2799
Lin M, Ji R, Wang Y et al (2020) HRank: filter pruning using high-rank feature map. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1529–1538
Lin, M; Ji, R; Li, S et al. Network pruning using adaptive exemplar filters. IEEE Trans Neural Netw Learn Syst; 2021; 33,
Lin M, Ji R, Zhang Y et al (2021b) Channel pruning via automatic structure search. In: Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence, pp 673–679
Liu Z, Li J, Shen Z et al (2017) Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE international conference on computer vision, pp 2736–2744
Liu Z, Mu H, Zhang X et al (2019) MetaPruning: meta learning for automatic neural network channel pruning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3296–3305
Liu N, Ma X, Xu Z et al (2020) AutoCompress: an automatic DNN structured pruning framework for ultra-high compression rates. In: Proceedings of the AAAI conference on artificial intelligence, pp 4876–4883
Liu, J; Zhuang, B; Zhuang, Z et al. Discrimination-aware network pruning for deep model compression. IEEE Trans Pattern Anal Mach Intell; 2021; 44,
Liu, G; Zhang, K; Lv, M. SOKS: automatic searching of the optimal kernel shapes for stripe-wise network pruning. IEEE Trans Neural Netw Learn Syst; 2023; 34,
Liu, Y; Wu, D; Zhou, W et al. EACP: an effective automatic channel pruning for neural networks. Neurocomputing; 2023; 526, pp. 131-142. [DOI: https://dx.doi.org/10.1016/j.neucom.2023.01.014]
Louati H, Louati A, Bechikh S et al (2024) Joint filter and channel pruning of convolutional neural networks as a bi-level optimization problem. Memetic Comput 1–20
Luo JH, Wu J, Lin W (2017) ThiNet: a filter level pruning method for deep neural network compression. In: Proceedings of the IEEE international conference on computer vision, pp 5058–5066
Molchanov P, Tyree S, Karras T et al (2017) Pruning convolutional neural networks for resource efficient inference. In: 5th international conference on learning representations, ICLR 2017—conference track proceedings
Mousa-Pasandi M, Hajabdollahi M, Karimi N et al (2020) Convolutional neural network pruning using filter attenuation. In: 2020 IEEE international conference on image processing (ICIP), pp 2905–2909
Pachón, CG; Ballesteros, DM; Renza, D. SeNPIS: sequential network pruning by class-wise importance score. Appl Soft Comput; 2022; 129, [DOI: https://dx.doi.org/10.1016/j.asoc.2022.109558] 109558.
Paszke A, Gross S, Chintala S et al (2017) Automatic differentiation in pytorch. NIPS Autodiff Workshop
Raffel, C; Shazeer, N; Roberts, A et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res; 2020; 21,
Russakovsky, O; Deng, J; Su, H et al. ImageNet large scale visual recognition challenge. Int J Comput Vis; 2015; 115, pp. 211-252.
Sanh, V; Wolf, T; Rush, A. Movement pruning: adaptive sparsity by fine-tuning. Adv Neural Inf Process Syst; 2020; 33, pp. 20378-20389.
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations
Tang, Y; Wang, Y; Xu, Y et al. SCOP: scientific control for reliable neural network pruning. Adv Neural Inf Process Syst; 2020; 33, pp. 10936-10947.
Wang H, Fu Y (2022) Trainability preserving neural pruning. In: International conference on learning representations
Wang, Z; Bovik, AC; Sheikh, HR et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process; 2004; 13,
Wang H, Qin C, Zhang Y et al (2021a) Neural pruning via growing regularization. In: ICLR 2021—9th international conference on learning representations
Wang, J; Jiang, T; Cui, Z et al. Filter pruning with a feature map entropy importance criterion for convolution neural networks compressing. Neurocomputing; 2021; 461, pp. 41-54. [DOI: https://dx.doi.org/10.1016/j.neucom.2021.07.034]
Wang W, Chen M, Zhao S et al (2021c) Accelerate CNNs from three dimensions: a comprehensive pruning framework. In: Proceedings of machine learning research, pp 10717–10726
Wang, W; Yu, Z; Fu, C et al. COP: customized correlation-based filter level pruning method for deep CNN compression. Neurocomputing; 2021; 464, pp. 533-545. [DOI: https://dx.doi.org/10.1016/j.neucom.2021.08.098]
Wang Z, Li C, Wang X (2021e) Convolutional neural network pruning with structural redundancy reduction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14913–14922
Wang, Z; Xie, X; Shi, G. RFPruning: a retraining-free pruning method for accelerating convolutional neural networks. Appl Soft Comput; 2021; 113, [DOI: https://dx.doi.org/10.1016/j.asoc.2021.107860]
Wang, Y; Guo, S; Guo, J et al. Towards performance-maximizing neural network pruning via global channel attention. Neural Netw; 2024; 171, pp. 104-113. [DOI: https://dx.doi.org/10.1016/j.neunet.2023.11.065]
Yang, C; Liu, H. Channel pruning based on convolutional neural network sensitivity. Neurocomputing; 2022; 507, pp. 97-106. [DOI: https://dx.doi.org/10.1016/j.neucom.2022.07.051]
Yao, K; Cao, F; Leung, Y et al. Deep neural network compression through interpretability-based filter pruning. Pattern Recogn; 2021; 119, [DOI: https://dx.doi.org/10.1016/j.patcog.2021.108056] 108056.
Yu R, Li A, Chen CF et al (2018) NISP: pruning networks using neuron importance score propagation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 9194–9203
Yuan T, Li Z, Liu B et al (2024) ARPruning: an automatic channel pruning based on attention map ranking. Neural Netw 106220
Zhang Y, Lin M, Lin CW et al (2022) Carrying out CNN channel pruning in a white box. IEEE Trans Neural Netw Learn Syst
Zhuang Z, Tan M, Zhuang B et al (2018) Discrimination-aware channel pruning for deep neural networks. Adv Neural Inf Process Syst 31
Zhuang, T; Zhang, Z; Huang, Y et al. Neuron-level structured pruning using polarization regularizer. Adv Neural Inf Process Syst; 2020; 33, pp. 9865-9877.
Zou, BJ; Guo, YD; He, Q et al. 3D filtering by block matching and convolutional neural network for image denoising. J Comput Sci Technol; 2018; 33, pp. 838-848. [DOI: https://dx.doi.org/10.1007/s11390-018-1859-7]