This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
1. Introduction
In recent years, with the rapid development of 5G, “Interconnection of All Things” has become a trend of information technology. The speed of information circulation on the Internet has reached an unprecedented level. The rapid information circulation has brought a dramatic increase in data and promoted the development of big data and artificial intelligence technology. However, the interconnection of more devices brings higher security risks. How to enjoy the benefits brought by artificial intelligence on the premise of ensuring data privacy has become a challenge.
In 2016, McMahan et al. [1] first proposed the concept of federated learning to respond to this challenge. In federated learning, multiple clients jointly train a model under the coordination of a central server or service provider. Each client uses its own data to train a local model and uploads the parameters of the local model to the server, which aggregates them into the global model. Therefore, the original data of each client stays local and is never exchanged or transmitted. At present, federated learning is widely applied at Google, Apple, and other enterprises to improve the user experience while protecting privacy. For example, Google has used federated learning in Gboard [2], Pixel phones [3], and Android Messages [4], and Apple has used it in iOS 13 [5].
In the FedAvg algorithm proposed by McMahan et al., the central server generates a global model by aggregation in each synchronization and then distributes the global model to a randomly selected subset of clients. When the clients receive the global model, they initialize their local models with its parameters and train them on their own data for the specified number of epochs. After local training, the clients send their local models to the server, and the server aggregates them with a weighted-average strategy. The global model gradually converges after several synchronizations. According to FedAvg, the update of the global model in each synchronization can be expressed by formula (1), where
In each synchronization of global model updating, FedAvg sets the same number of local epochs
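The weighted-average aggregation that FedAvg performs on the server can be illustrated with a minimal Python sketch. It assumes that each client reports its parameters as a dictionary of flat lists together with its local sample count, and that the averaging weights are proportional to those counts, as in McMahan et al. [1]. The function and variable names are illustrative and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def fedavg_aggregate(
    client_weights: Sequence[Dict[str, List[float]]],
    client_sizes: Sequence[int],
) -> Dict[str, List[float]]:
    """Average client parameters, weighting each client by its sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    global_weights: Dict[str, List[float]] = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        # Each global parameter is sum_k (n_k / n) * w_k for that position.
        global_weights[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return global_weights


# Hypothetical example: the second client holds three times as much data.
example = fedavg_aggregate(
    [{"w": [0.0, 4.0]}, {"w": [4.0, 0.0]}],
    [1, 3],
)
```

In this example the aggregated parameters lie closer to those of the second client, because its sample count is larger.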
Because idle time on faster clients prolongs the training of the global model, we propose to predict the training time of deep learning models on heterogeneous clients and to use the prediction to set the number of local epochs dynamically. In a deep learning task, the training time may be affected by the amount of training data and the hyperparameter settings. We refer to the factors in the training data and hyperparameters that may affect the training time as model features. Justus et al. [8] trained a multilayer perceptron (MLP) to predict the training time of layers in a neural network from the model features and training times collected on different GPUs. They divide model features into layer features and predict the training time of the complete model by accumulating the predictions for its layers. However, when the structure of the model is very complex or the number of layers is very large, collecting many features for each layer increases the burden on the system and slows the convergence of the global model. Moreover, when a new device is added to the federated learning system, a large amount of high-dimensional training data must be collected on this device to tune the current prediction model, which is very time-consuming and leaves training time prediction unavailable for the new device for a long period.
To solve these problems, we propose a method based on deep learning that reduces the number of model features and the amount of training data required for training time prediction. We first design a neural network to extract the influence of model features on training time, which can accurately interpret the relationship between model features and training time. Then, we propose a dimensionality reduction rule that extracts the key features based on this influence. Extensive experiments show that, compared with the current prediction method, the features of the convolutional layer and the dense layer can be reduced by 30% and 20%, respectively, while the error of training time prediction stays at the original level. At the same time, the training data is reduced by 25% for the convolutional layer and by 20% for the dense layer, which speeds up the adaptation of the time prediction model to a new device.
The rest of this paper is arranged as follows: in Section 2, we discuss the related work in time prediction; in Section 3, we introduce the method proposed in this work, including our neural network and dimensionality reduction rule. We also provide an algorithm to dynamically set the number of local epochs in this section. To verify our work, we set up a large number of experiments and interpret the experimental results in Section 4. Finally, we provide the summary of all work in Section 5.
2. Related Work
Early work often used machine learning regression algorithms, such as linear regression, random forest, and GBDT, to predict time series. Edelman et al. [9] use a linear regression model to predict the execution time of surgery; Yu et al. [10] train regression decision trees to predict the arrival time of buses using a random forest algorithm based on near neighbors; Cheng et al. [11] use GBDT to predict traffic time in different time ranges and identify the variables that have a great impact on the prediction error. These regression methods generalize well and can be used in many fields. However, their prediction error range is very large, so they can only be applied to scenarios with low sensitivity to time fluctuations.
To narrow the error range of time series prediction, some scholars have proposed methods that use domain-specific knowledge. Such methods study the computational characteristics of a specific domain and construct mathematical models for prediction; examples are PALEO [12] and Optimus [13]. PALEO predicts computing time by counting floating-point operations: it counts the number of floating-point operations required in one training epoch and multiplies that number by a scale factor to predict the training time. However, PALEO assumes that the whole training process is linearly related to the number of floating-point operations and ignores operations that are not, such as parameter transmission. Unlike PALEO, Optimus mathematically summarizes the factors that affect model training, establishes a performance model to evaluate the training speed, and can predict model convergence according to online resources. Compared with the regression methods, these works reduce the prediction error range of model training time to a certain extent, but the mathematical models they establish for training are imprecise and ignore some factors that contribute greatly to the training time, which makes the prediction unstable.
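The FLOP-counting idea described above can be illustrated with a deliberately simplified Python sketch. It is not PALEO's actual performance model: it assumes a single dense layer whose forward pass costs one multiply and one add per weight and per sample, a hypothetical device throughput, and a hypothetical scale factor. It also shows the limitation noted above, since costs that do not scale with floating-point operations (such as parameter transmission) do not appear anywhere in the estimate.

```python
def dense_layer_flops(batch_size: int, n_inputs: int, n_outputs: int) -> int:
    """Forward-pass FLOPs of a dense layer: one multiply and one add per weight and sample."""
    return 2 * batch_size * n_inputs * n_outputs


def flop_based_time_ms(flops: int, device_flops_per_s: float, scale: float = 1.0) -> float:
    """Estimate time as scale * FLOPs / throughput, converted to milliseconds."""
    if device_flops_per_s <= 0:
        raise ValueError("device throughput must be positive")
    return scale * flops / device_flops_per_s * 1e3


# Hypothetical numbers: a 1024 -> 512 dense layer, batch of 32,
# a device sustaining 1e12 FLOP/s, and an assumed scale factor of 2.0.
flops = dense_layer_flops(batch_size=32, n_inputs=1024, n_outputs=512)
estimate_ms = flop_based_time_ms(flops, device_flops_per_s=1e12, scale=2.0)
```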
Because of the excellent performance of deep learning models, researchers began to use deep learning methods to predict time series and to further reduce the prediction error. Xu et al. [14] combine linear regression with a deep belief network (DBN) to predict time series; PreVIous [15] trains an MLP model to predict the inference time of convolutional neural networks according to the throughput and energy consumption of Internet of Things vision devices; Petersen et al. [16] design a neural network that mixes convolutional layers and LSTM layers to accurately predict bus arrival time. These works achieve high prediction accuracy, but their application to training time prediction of deep learning models is limited by the specific model structure. Their time prediction models can only handle the network structures contained in their training data (such as VGG [17], ResNet [18], or user-defined networks). When a network with a new structure is encountered, their models need to be retrained; that is, they do not transfer to new deep learning models. Although Fathom [19] has shown that the inference time of a model can be estimated from another model with a similar structure and known performance, its prediction is very rough, and it remains to be shown whether this method can be used to predict training time. To accurately predict the training time of networks with different structures, Justus et al. divide the neural network into layers and classify these layers (such as convolutional layers and dense layers) according to their structural characteristics, then collect the layer features and train an MLP model to predict the training time of a single layer, which achieves high prediction accuracy. This method has good generality.
When a network with a new structure is encountered, the method only needs to predict the training time of each layer from its layer features; the training time of the whole model is then obtained by accumulating the layer training times. However, there are some problems when the method of Justus et al. is applied to federated learning: (1) The relationship between model features and training time is not accurately captured. The method assumes that almost every model feature is necessary for training time prediction, including features that have little or no impact on the final result. (2) Too many unimportant features need to be collected. When the neural network is very deep, collecting features for every layer increases the burden on the federated learning system, and it is usually difficult to obtain all details of the clients’ models. (3) Redundant training data causes high training cost. Unimportant features produce a lot of redundant training data, which increases the transfer training cost of the time prediction model on a newly added device and reduces the training efficiency of the global model.
To solve the above problems, we propose a training time prediction method based on deep learning, which can reduce the required model features and training data on the premise of ensuring low prediction error and improve the feasibility of practical application in federated learning. The contributions of this paper are as follows:
(1) We design a neural network to extract the influence of model features on training time according to the characteristics of the deep learning model, which provides an effective analysis of the relationship between model features and training time.
(2) We propose a dimensionality reduction rule to extract the key features that have a great impact on training time according to the influence of features, which can reduce the number of features required for predicting model training time without loss of prediction accuracy. By using the dimensionality reduction rule, 7 dimensions are extracted from convolutional layer features (10 dimensions in total), and 4 dimensions are extracted from dense layer features (5 dimensions in total).
(3) We train the time prediction model using dimension-reduced datasets. Compared with the method of Justus et al., the training data of the convolutional layer is reduced by 25%, and the training data of the dense layer is reduced by 20%, with the error of prediction remaining at the same level.
3. Methodology
In this section, we introduce the overall process and technical details of extracting the influence of model features on training time, the dimensionality reduction, and the algorithm for dynamically setting the number of local epochs. First, by analyzing the calculation process of training, we show that accumulating the layers’ training times is a feasible way to predict the training time of the whole network, and we describe in detail the layer features based on the work of Justus et al. Second, we introduce the structure of the neural network designed for extracting the influence of model features on the training time, which we call the weights model. Third, we propose the dimensionality reduction rule, which extracts the key features that have a great impact on training time based on the influence of model features. Finally, we provide a representative algorithm for dynamically setting the number of local epochs.
3.1. Feature Analysis
One training step of a neural network consists of forward propagation and backward propagation. With the widespread use of Batch Normalization [20], which can speed up the convergence of neural networks, a batch of forward propagations is usually performed before one backward propagation. A complete pass of training (including multiple batches) over the training set is called an epoch. Generally, a model needs to be trained for many epochs until it converges to high accuracy. At present, the number of epochs is set based on the experience of deep learning engineers, and different models need different numbers of epochs to achieve a specified accuracy. Therefore, it is a challenge to accurately predict the training time of models with different numbers of epochs. In addition, since the structures of models are heterogeneous, their training times differ significantly. For example, training a convolutional layer is usually more time-consuming than training a dense layer. Therefore, it is also a challenge to accurately predict the training time of models with different structures. To solve these problems, Justus et al. proposed a method to predict the training time of different layers for one batch. The training time of the whole model for one batch can be obtained by accumulating the prediction results of the layers, and the total training time can be predicted by accumulating the training time over batches. Furthermore, we show the feasibility of predicting the whole model’s training time through its layers by combining the training characteristics and structural characteristics of deep learning models.
Usually, a neural network needs to be trained repeatedly on the training set many times, which has obvious iteration characteristics. According to the iteration, the time required for model training can be expressed as formula (2), where
One training of the neural network consists of a batch of forward propagation and one backward propagation. Therefore, the time cost of the propagation can be expressed as formula (3),
Combining formula (2) and formula (3), the training time of a deep learning model can be described as the following formula:
A neural network is composed of many layers, and the output of the current layer is used as the input of the next layer. Besides the iteration characteristic of the training process, the neural network also has obvious hierarchical structure characteristics. According to this hierarchy characteristic, a complete neural network can be divided into layers, and its forward and backward propagation can be expressed as formulas (5) and (6) on layer level.
Combining formula (4) with formulas (5) and (6), the training time formula of the complete model with the layer’s training time as the unit can be derived, as shown in the following formula.
In summary, the training time of a single layer can be used as the basic unit of the whole model’s training time. For models with different numbers of epochs, the total training time can be obtained by accumulating the training time of each batch; for models with different structures, the training time of one batch can be obtained by accumulating the forward and backward propagation times of the layers. This addresses the two challenges in training time prediction described above.
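As a rough illustration of this accumulation, the following Python sketch sums per-layer forward and backward times into a per-batch time and scales it by the number of batches and epochs. It assumes that per-batch layer times are already available (for example from a per-layer predictor) and that every batch costs the same; the class name, function names, and timings are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LayerTime:
    """Per-batch forward and backward time of one layer, in milliseconds."""
    forward_ms: float
    backward_ms: float


def batch_time_ms(layers: Sequence[LayerTime]) -> float:
    """Time of one batch: sum of forward and backward times over all layers."""
    return sum(layer.forward_ms + layer.backward_ms for layer in layers)


def total_training_time_ms(
    layers: Sequence[LayerTime], num_samples: int, batch_size: int, epochs: int
) -> float:
    """Total time: epochs * batches per epoch * time of one batch."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return epochs * batches_per_epoch * batch_time_ms(layers)


# Hypothetical two-layer model: one convolutional and one dense layer.
model = [LayerTime(forward_ms=2.0, backward_ms=4.0), LayerTime(0.5, 1.5)]
total_ms = total_training_time_ms(model, num_samples=1000, batch_size=32, epochs=5)
```

With these made-up timings, one batch takes 8 ms, an epoch has 32 batches, and 5 epochs take 1280 ms.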
To predict the training time of a single layer, it is necessary to analyze the layer features of neural networks. We first classify the layer features into common features, dense features, convolutional features, and hardware features according to the device characteristics and computing characteristics, and then extract the features by category. The layer features are shown in Table 1. Since deep learning models contain a large number of convolutional layers and dense layers, we mainly focus on these two layer types in this paper. We train separate time prediction models for the convolutional layer and the dense layer, based on layer feature and training time data collected on six different types of GPUs (P100, V100, K40, K80, M60, and 1080Ti).
Table 1
The description of layer features.
Kind of features | Features | Description |
Common features | Activation function | The activation function applied to the neuron output; common ones are sigmoid, tanh, and relu. |
Common features | Optimizer | The optimization method of the model; common ones are SGD, Adadelta, Adagrad, Momentum, Adam, and RMSProp. |
Dense features | Number of inputs | The number of inputs to the layer. Since the MLP layers are fully connected, the input of each layer comes from the output of the previous layer. |
Dense features | Number of neurons | The number of neurons in the layer. |
Convolutional features | Matrix size | The size of the input data. |
Convolutional features | Kernel size | The size of the convolutional kernel. |
Convolutional features | Input depth | The number of input channels. |
Convolutional features | Output depth | The number of output channels. |
Convolutional features | Stride size | The step size of the convolutional kernel. |
Convolutional features | Input padding | The amount of edge padding used in the convolution. |
Hardware features | GPU clock speed | The GPU clock speed. |
Hardware features | GPU memory bandwidth | The GPU memory bandwidth. |
Hardware features | GPU core count | The number of GPU processing units, which corresponds to the number of CUDA cores in an NVIDIA GPU. |
3.2. Model Design
In the previous section, we analyzed the layer features that may affect the training time of neural networks by parsing the structure of different layers. However, we find that collecting layer features increases the system overhead for very deep neural networks. For example, for ResNet101 with 100 convolutional layers and 1 dense layer, if we collect 10 features for each convolutional layer and 5 features for the dense layer, we need to collect 100 × 10 + 1 × 5 = 1005 features in total to predict the training time of ResNet101. For clients with low computing power, the prediction process will take a lot of time.
Moreover, since federated learning involves a large number of heterogeneous devices, the time prediction model cannot predict the training time for a newly added device whose type is not in the set of preset device types. To predict the training time for the new device, we need to tune the parameters of the time prediction model using training data collected on this device, but collecting a large amount of high-dimensional training data for tuning takes a long time. As a result, the prediction model cannot quickly adapt to the new device, and the number of local epochs stays at a fixed value for a long time, which may leave clients with high training speed idle.
To cut down the time cost of prediction and accelerate the adaptation of the time prediction model to new devices, it is necessary to reduce the dimension of the layer features and the redundant training data. To preserve high prediction accuracy, we choose to exclude only the features that have little or no impact on training time. Therefore, the influence of model features on training time needs to be analyzed.
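The feature count in this example, and the count that results if only the 7 convolutional and 4 dense dimensions retained by the dimensionality reduction rule are collected, can be checked with a short Python calculation. The function name is illustrative; the layer counts and dimensions are the ones stated in this paper.

```python
def features_to_collect(n_conv: int, n_dense: int, conv_dims: int, dense_dims: int) -> int:
    """Number of feature values collected when every layer reports all its dimensions."""
    return n_conv * conv_dims + n_dense * dense_dims


# ResNet101 as described in the text: 100 convolutional layers and 1 dense layer.
full = features_to_collect(n_conv=100, n_dense=1, conv_dims=10, dense_dims=5)
# Same network with 7 convolutional and 4 dense dimensions per layer.
reduced = features_to_collect(n_conv=100, n_dense=1, conv_dims=7, dense_dims=4)
saved_fraction = (full - reduced) / full
```

Under these assumptions, `full` is 1005 and `reduced` is 704, so roughly 30% fewer values have to be collected for this network.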
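To make the idle-client problem concrete, the sketch below shows one plausible way to turn predicted per-epoch training times into per-client local epoch counts. It is not the algorithm of this paper: it simply assumes that the slowest client runs a base number of epochs and that every other client runs as many whole epochs as fit into the same time budget. The function name, client identifiers, and timings are hypothetical.

```python
import math
from typing import Dict


def assign_local_epochs(epoch_time_s: Dict[str, float], base_epochs: int = 1) -> Dict[str, int]:
    """Give faster clients more local epochs so that all finish at about the same time."""
    if not epoch_time_s:
        raise ValueError("need at least one client")
    if any(t <= 0 for t in epoch_time_s.values()):
        raise ValueError("predicted epoch times must be positive")
    # Time budget: what the slowest client needs for base_epochs epochs.
    budget_s = base_epochs * max(epoch_time_s.values())
    return {
        client: max(1, math.floor(budget_s / t))
        for client, t in epoch_time_s.items()
    }


# Hypothetical predicted times per local epoch, in seconds.
plan = assign_local_epochs({"a": 10.0, "b": 5.0, "c": 2.5}, base_epochs=2)
```

With a fixed setting of 2 epochs, clients "b" and "c" in this example would wait for "a" for most of the round; with the plan above, all three clients are busy for about 20 seconds.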
In order to extract the influence of features, we abstract the relationship between model features and training time as
In related work, we introduced some machine learning regression models, including linear models and nonlinear models. The linear regression model can directly extract the features’ weights
[figure omitted; refer to PDF]
Before applying the dimensionality reduction rule, we use formula (8) to calculate the ranking of feature weights in each weights dataset. The first step of the dimensionality reduction rule requires a measure of the fluctuation of each feature weight, namely the standard deviation. Therefore, we calculate the standard deviation of the weights ranking on each dataset and its average over the datasets; see Tables 8 and 9. For the second step of the dimensionality reduction rule, we calculate the average ranking of each feature’s weight on each weights dataset and obtain the overall average ranking across all datasets according to formula (9). Tables 10 and 11 show the average ranking of feature weights on each dataset, as well as the overall average ranking, for the convolutional layer and the dense layer, respectively.
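The ranking statistics described above can be sketched in Python as follows. Since formulas (8) and (9) are not reproduced here, the sketch makes two assumptions that may differ from the paper: rank 1 is assigned to the feature with the largest absolute weight, and the standard deviation is the population standard deviation over repeated weight extractions. All names and the example weights are hypothetical; only the Batchsize values in the last line come from Table 8.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def rank_features(weights: Sequence[float]) -> List[int]:
    """Rank features by absolute weight; rank 1 is the most influential (assumption)."""
    order = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    ranks = [0] * len(weights)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def rank_statistics(runs: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-feature mean rank and rank standard deviation over several weight vectors."""
    ranks_per_run = [rank_features(run) for run in runs]
    per_feature = list(zip(*ranks_per_run))  # one tuple of ranks per feature
    return [mean(r) for r in per_feature], [pstdev(r) for r in per_feature]


# Two hypothetical weight vectors for three features on one dataset.
mean_rank, rank_std = rank_statistics([[0.9, -0.1, 0.5], [0.8, 0.6, -0.2]])

# Averaging over datasets, using the Batchsize row of Table 8.
batchsize_mean_rank_std = mean([1.7653, 1.3045, 1.4026])
```

The last line reproduces the MeanRankStd entry for Batchsize in Table 8 (about 1.4908) as the plain average of the three per-dataset values.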
Table 8
The standard deviation of convolutional feature weights ranking.
Features | P100_Conv | V100_Conv | K40_Conv | MeanRankStd |
Batchsize | 1.7653 | 1.3045 | 1.4026 | 1.4908 |
Elements_matrix | 2.3233 | 2.0775 | 1.1578 | 1.8529 |
Elements_kernel | 1.7904 | 2.2949 | 1.4356 | 1.8403 |
Channels_in | 0.4440 | 0.4218 | 0.7599 | 0.5419 |
Channels_out | 3.0549 | 2.1915 | 2.1091 | 2.4518 |
Padding | 1.6192 | 1.8036 | 1.2830 | 1.5686 |
Strides | 2.4347 | 2.0012 | 3.2603 | 2.5654 |
Use_bias | 1.2580 | 1.5514 | 1.3097 | 1.3730 |
Opt_SGD | 2.3229 | 3.0940 | 1.6340 | 2.3503 |
Opt_Adadelta | 2.2164 | 2.8144 | 1.9606 | 2.3305 |
Opt_Adagrad | 2.7459 | 2.4789 | 1.1959 | 2.1402 |
Opt_Momentum | 2.6090 | 2.1952 | 1.1699 | 1.9914 |
Opt_Adam | 1.4019 | 2.1550 | 2.4311 | 1.9960 |
Opt_RMSProp | 2.0723 | 2.3218 | 0.9707 | 1.7883 |
Act_relu | 1.5234 | 1.0455 | 0.9251 | 1.1647 |
Act_tanh | 1.7225 | 1.1762 | 1.4122 | 1.4370 |
Act_sigmoid | 1.3042 | 1.1182 | 1.1107 | 1.1777 |
Table 9
The standard deviation of dense feature weight ranking.
Features | P100_Dense | V100_Dense | K40_Dense | MeanRankStd |
Batchsize | 1.8927 | 1.4465 | 1.5556 | 1.6316 |
Dim_input | 3.3842 | 2.2099 | 2.3296 | 2.6412 |
Dim_output | 2.4491 | 2.7088 | 1.7920 | 2.3166 |
Opt_SGD | 2.8395 | 2.9443 | 2.1840 | 2.6559 |
Opt_Adadelta | 3.3851 | 2.9903 | 1.3891 | 2.5882 |
Opt_Adagrad | 2.0481 | 3.0305 | 2.0840 | 2.3875 |
Opt_Momentum | 2.8527 | 2.9168 | 1.9885 | 2.5860 |
Opt_Adam | 2.7488 | 2.9804 | 2.4706 | 2.7333 |
Opt_RMSProp | 3.4788 | 1.7232 | 2.0469 | 2.4163 |
Act_relu | 1.8273 | 1.4856 | 1.0920 | 1.4683 |
Act_tanh | 2.1289 | 2.2291 | 0.7319 | 1.6966 |
Act_sigmoid | 2.2241 | 2.1797 | 1.1457 | 1.8498 |
Table 10
The mean ranking of convolutional feature weights.
Features | P100_Conv | V100_Conv | K40_Conv | MeanRank |
Batchsize | 8.692 | 9.574 | 5.702 | 7.9893 |
Elements_matrix | 5.8396 | 8.6824 | 7.5608 | 7.3609 |
Elements_kernel | 11.2952 | 11.7014 | 10.1 | 11.0322 |
Channels_in | 1.09 | 1.0334 | 1.0112 | 1.0449 |
Channels_out | 7.0608 | 5.6072 | 12.0992 | 8.2557 |
Padding | 14.1344 | 13.7096 | 13.9144 | 13.9195 |
Strides | 8.9108 | 9.8826 | 8.4852 | 9.0929 |
Use_bias | 14.5956 | 14.595 | 15.7176 | 14.9694 |
Opt_SGD | 6.3264 | 6.853 | 4.2328 | 5.8041 |
Opt_Adadelta | 3.764 | 7.3048 | 7.0828 | 6.0505 |
Opt_Adagrad | 7.732 | 5.3538 | 6.3752 | 6.4870 |
Opt_Momentum | 5.2476 | 5.3584 | 11.0724 | 7.2261 |
Opt_Adam | 10.1944 | 3.6814 | 5.1448 | 6.3402 |
Opt_RMSProp | 2.9736 | 4.107 | 5.9512 | 4.3439 |
Act_relu | 15.8232 | 15.6576 | 7.3332 | 12.9380 |
Act_tanh | 14.0576 | 15.8164 | 15.514 | 15.1293 |
Act_sigmoid | 15.2628 | 14.082 | 15.7032 | 15.0160 |
Table 11
The mean ranking of dense feature weights.
Features | P100_Dense | V100_Dense | K40_Dense | MeanRank |
Batchsize | 10.3196 | 10.6268 | 9.5132 | 10.1532 |
Dim_input | 6.0088 | 4.9544 | 3.854 | 4.9391 |
Dim_output | 7.1052 | 4.882 | 5.9032 | 5.9635 |
Opt_SGD | 4.7872 | 7.2696 | 3.9624 | 5.3397 |
Opt_Adadelta | 4.396 | 4.2684 | 2.2148 | 3.6264 |
Opt_Adagrad | 5.1912 | 6.3724 | 7.1868 | 6.2501 |
Opt_Momentum | 8.072 | 7.9004 | 3.562 | 6.5115 |
Opt_Adam | 5.4128 | 3.7576 | 5.1504 | 4.7736 |
Opt_RMSProp | 7.35 | 2.5372 | 5.1288 | 5.0053 |
Act_relu | 6.0432 | 10.2896 | 9.9068 | 8.7465 |
Act_tanh | 5.6348 | 7.3096 | 11.452 | 8.1321 |
Act_sigmoid | 7.6792 | 7.832 | 10.1656 | 8.5589 |
Taking the convolutional features as an example, since the optimizers and activation functions are represented by one-hot encoding, we regard all optimizer fields (feature name starting with opt_) as one feature opt and all activation function fields (feature name starting with act_) as one feature act.
According to step 1 of the dimensionality reduction rule, we select features with
According to step 2 of the dimensionality reduction rule, we select features with
Finally, by using the dimensionality reduction rule, we obtain batchsize, elements_matrix, elements_kernel, channels_in, channels_out, strides, and opt. The convolutional layer features are reduced by 3 dimensions (padding, use_bias, and act), and the training data is reduced by 5 dimensions (padding, use_bias, act_relu, act_tanh, and act_sigmoid).
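Filtering a convolutional layer training set down to the retained columns can be sketched in Python as below. The sketch assumes that each training sample is stored as a dictionary keyed by the lowercase field names used in the text, with one-hot columns for the optimizer and activation function; the `time_ms` target column, the helper name, and all values in the sample row are hypothetical.

```python
from typing import Dict, List

# Columns removed for the convolutional layer, as listed in the text.
DROPPED_COLUMNS = {"padding", "use_bias", "act_relu", "act_tanh", "act_sigmoid"}


def reduce_rows(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return copies of the rows without the dropped feature columns."""
    return [
        {name: value for name, value in row.items() if name not in DROPPED_COLUMNS}
        for row in rows
    ]


# One made-up training sample with the 17 convolutional columns plus the target.
sample = {
    "batchsize": 32, "elements_matrix": 4096, "elements_kernel": 9,
    "channels_in": 64, "channels_out": 128, "padding": 1, "strides": 1,
    "use_bias": 1,
    "opt_sgd": 1, "opt_adadelta": 0, "opt_adagrad": 0,
    "opt_momentum": 0, "opt_adam": 0, "opt_rmsprop": 0,
    "act_relu": 1, "act_tanh": 0, "act_sigmoid": 0,
    "time_ms": 3.2,
}
small = reduce_rows([sample])[0]
```

The reduced row keeps the six numeric features and the six optimizer columns, which together represent the 7 retained features, and drops exactly 5 columns.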
For the dense layer features, we use the same method. According to the dimensionality reduction rule, the features with less influence on training time are act_relu, act_tanh, act_sigmoid, and batchsize. It should be noted, however, that the dense layer is parameter-intensive, which means that the transmission of parameters takes a long time, and the time overhead of parameter transmission was ignored in our datasets. According to the forward propagation process of the neural network, the dense layer must perform one forward propagation calculation for each input sample, and therefore a batch of forward propagations for a batch of input data. The batchsize thus determines the number of times the parameters are transmitted, so it should not be excluded. We therefore retain batchsize and exclude only the activation function features.
After extracting the key features with the dimensionality reduction rule, we filter the training sets and obtain the dimension-reduced datasets for P100_Conv, V100_Conv, K40_Conv, All_Conv, P100_Dense, V100_Dense, K40_Dense, and All_Dense. We append the suffix _small to denote a dimension-reduced dataset; for example, the dimension-reduced dataset of P100_Conv is P100_Conv_small. To verify the validity of the dimension-reduced datasets, we use them to train the baseline, and the trained model is called baseline_small. Our experiments show that baseline_small also converges well. Figure 5 shows the comparison of predicted time and observed time of baseline_small on the test sets.
[figure omitted; refer to PDF]
To determine whether the dimension-reduced datasets cause a loss of prediction accuracy, we calculated the RMSE and MAPE of baseline and baseline_small on all datasets. The results are shown in Table 12. For the convolutional layer datasets, the test RMSE of baseline_small is 0.1385 ms lower than that of the baseline on average, and the validation RMSE is 0.3664 ms higher on average; for the dense layer datasets, the test RMSE of baseline_small is 0.0078 ms lower than that of the baseline on average, and the validation RMSE is 0.015 ms lower. It is worth mentioning that for the per-device dense layer datasets (P100_Dense, V100_Dense, and K40_Dense), both the RMSE and the MAPE of baseline_small are lower than those of the baseline. We attribute this to features in the dense layer that make no contribution to the training time; such features disturb the data distribution and increase the prediction error of the model.
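For reference, the two error metrics used in this comparison can be computed as in the following Python sketch. It uses the standard definitions of RMSE and MAPE; the function names and the example values are illustrative and are not taken from the experiments.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean squared error, in the same unit as the inputs (here ms)."""
    if not observed or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def mape(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent; observed values must be non-zero."""
    if not observed or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    ratios = [abs((p - o) / o) for p, o in zip(predicted, observed)]
    return 100.0 * sum(ratios) / len(ratios)


# Made-up predicted and observed layer times in milliseconds.
example_rmse = rmse([2.0, 4.0], [1.0, 5.0])
example_mape = mape([2.0, 4.0], [1.0, 5.0])
```

RMSE is dominated by layers with long absolute training times, whereas MAPE weights every sample by its relative error, which is why the two metrics can move in different directions in Table 12.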
Table 12
Baseline vs. baseline_small.
Datasets | Model | Test RMSE | Test MAPE | Validation RMSE | Validation MAPE |
P100_Conv | Baseline | 2.962 ms | 14.36% | 3.455 ms | 15.17% |
P100_Conv | Baseline_small | 2.838 ms | 15.27% | 3.0948 ms | 15.47% |
V100_Conv | Baseline | 1.722 ms | 11.44% | 1.547 ms | 11.13% |
V100_Conv | Baseline_small | 1.762 ms | 11.80% | 1.791 ms | 11.80% |
K40_Conv | Baseline | 10.136 ms | 15.59% | 10.361 ms | 16.39% |
K40_Conv | Baseline_small | 9.882 ms | 16.01% | 11.508 ms | 16.68% |
P100_Dense | Baseline | 0.033 ms | 3.15% | 0.032 ms | 3.04% |
P100_Dense | Baseline_small | 0.027 ms | 2.86% | 0.031 ms | 2.92% |
V100_Dense | Baseline | 0.051 ms | 6.26% | 0.046 ms | 5.84% |
V100_Dense | Baseline_small | 0.044 ms | 5.49% | 0.045 ms | 5.45% |
K40_Dense | Baseline | 0.157 ms | 7.57% | 0.179 ms | 7.92% |
K40_Dense | Baseline_small | 0.154 ms | 6.60% | 0.140 ms | 6.60% |
All_Conv | Baseline | 4.079 ms | 11.44% | 4.021 ms | 11.16% |
All_Conv | Baseline_small | 4.001 ms | 11.89% | 4.090 ms | 12.10% |
All_Dense | Baseline | 0.077 ms | 4.70% | 0.074 ms | 4.68% |
All_Dense | Baseline_small | 0.069 ms | 4.70% | 0.070 ms | 4.75% |
The experiments show that after reducing the convolutional layer features by 30% and the training data by 25%, the error level of prediction is still consistent with the original baseline; after reducing the dense layer features by 20% and the training data by 20%, the error level is generally lower than the baseline, which proves the effectiveness of our weights model and dimensionality reduction rule.
5. Conclusion
To address the problem of setting the number of local epochs for heterogeneous clients in federated learning, we propose predicting the training time of deep learning tasks on clients to guide the dynamic setting of the number of local epochs. We design the weights model to extract the weights of features and accurately interpret the relationship between model features and training time. This paper focuses on combining the weights model with the dimensionality reduction rule to extract the key features, thereby reducing the feature dimension and the redundant training data required by the time prediction model. The purpose of our work is to improve the feasibility of predicting the training time of deep learning models on heterogeneous clients in federated learning, so that the number of local epochs can be set dynamically for clients. Compared with the existing methods, the results of our experiments show that (1) the weights model converges well on heterogeneous devices; (2) the training time predicted by the weights model reaches the same error level as the baseline; (3) the dimensionality reduction rule in this paper reduces the features by 30% and the training data by 25% for the convolutional layer, and reduces the features by 20% and the training data by 20% for the dense layer, while maintaining high prediction accuracy.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant no. 62072146 and no. 61972358 and the Key Research and Development Program of Zhejiang Province (2019C01059, 2019C03135, and 2019C03134).
[1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. Y. Arcas, "Communication-efficient learning of deep networks from decentralized data," Artificial intelligence and statistics, pp. 1273-1282, 2017.
[2] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, D. Ramage, "Federated learning for mobile keyboard prediction," 2018. https://arxiv.org/abs/1811.03604
[3] "ai.google. Under the hood of the Pixel 2: How AI is supercharging hardware, 2018," 2018. https://ai.google/stories/ai-in-hardware/
[4] "support.google. Your chats stay private while Messages improves suggestions, 2019," 2019. http://support.google.com/messages/answer/9327902
[5] Apple, "Private federated learning (NeurIPS 2019 Expo Talk Abstract)," 2019. https://nips.cc/ExpoConferences/2019/schedule?talk_id=40
[6] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, V. Smith, "Federated optimization in heterogeneous networks," 2018. https://arxiv.org/abs/1812.06127
[7] J. Wang, Q. Liu, H. Liang, G. Joshi, H. V. Poor, Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization, 2020.
[8] D. Justus, J. Brennan, S. Bonner, A. S. McGough, "Predicting the computational cost of deep learning models," 2018 IEEE International Conference on Big Data (Big Data), DOI: 10.1109/bigdata.2018.8622396, 2018.
[9] E. R. Edelman, S. M. J. van Kuijk, A. Hamaekers, M. J. M. de Korte, G. G. van Merode, W. F. F. A. Buhre, "Improving the prediction of total surgical procedure time using linear regression modeling," Frontiers in Medicine, vol. 4, DOI: 10.3389/fmed.2017.00085, 2017.
[10] B. Yu, H. Wang, W. Shan, B. Yao, "Prediction of bus travel time using random forests based on near neighbors," Computer-Aided Civil and Infrastructure Engineering, vol. 33 no. 4, pp. 333-350, DOI: 10.1111/mice.12315, 2018.
[11] J. Cheng, G. Li, X. Chen, "Research on travel time prediction model of freeway based on gradient boosting decision tree," IEEE Access, vol. 7, pp. 7466-7480, DOI: 10.1109/ACCESS.2018.2886549, 2019.
[12] H. Qi, E. R. Sparks, A. Talwalkar, Paleo: A Performance Model for Deep Neural Networks, 2016.
[13] Y. Peng, Y. Bao, Y. Chen, C. Wu, C. Guo, Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters, 2018.
[14] W. Xu, H. Peng, X. Zeng, F. Zhou, X. Tian, X. Peng, "A hybrid modelling method for time series forecasting based on a linear regression model and deep learning," Applied Intelligence, vol. 49 no. 8, pp. 3002-3015, DOI: 10.1007/s10489-019-01426-3, 2019.
[15] D. Velasco-Montero, J. Fernandez-Berni, R. Carmona-Galan, A. Rodriguez-Vazquez, "PreVIous: a methodology for prediction of visual inference performance on IoT devices," IEEE Internet of Things Journal, vol. 7 no. 10, pp. 9227-9240, DOI: 10.1109/JIOT.2020.2981684, 2019.
[16] N. C. Petersen, F. Rodrigues, F. C. Pereira, "Multi-output bus travel time prediction with convolutional LSTM neural network," Expert Systems with Applications, vol. 120, pp. 426-435, DOI: 10.1016/j.eswa.2018.11.028, 2019.
[17] K. Simonyan, A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Computer Science, 2015. https://arxiv.org/abs/1409.1556
[18] K. He, X. Zhang, S. Ren, J. Sun, "Deep residual learning for image recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, DOI: 10.1109/CVPR.2016.90, 2016.
[19] R. Adolf, S. Rama, B. Reagen, G.-y. Wei, D. Brooks, "Fathom: reference workloads for modern deep learning methods," 2016 IEEE International Symposium on Workload Characterization (IISWC), DOI: 10.1109/iiswc.2016.7581275, 2016.
[20] S. Ioffe, C. Szegedy, "Batch Normalization: Accelerating deep network training by reducing internal covariate shift," JMLR, 2015. https://arxiv.org/abs/1502.03167
Copyright © 2022 Yan Zeng et al. This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Federated learning is a new machine learning framework in which models are trained locally on multiple clients and the local models are then uploaded to a server for aggregation, iteratively, until the global model converges. In most cases, the number of local epochs is set to the same value for all clients. In practice, however, clients are usually heterogeneous, so their training speeds differ. Faster clients remain idle for a long time while waiting for slower ones, which prolongs the overall training time. Because the time cost of a client's local training reflects its training speed and can therefore guide the dynamic setting of local epochs, we propose a deep learning based method to predict the training time of models on heterogeneous clients. First, a neural network is designed to extract the influence of different model features on training time. Second, based on this influence, we propose a dimensionality reduction rule to extract the key features that have a large impact on training time. Finally, we use the key features extracted by the dimensionality reduction rule to train the time prediction model. Our experiments show that, compared with the current prediction method, our method uses 30% fewer model features and 25% less training data for the convolutional layer, and 20% fewer model features and 20% less training data for the dense layer, while maintaining the same level of prediction error.
1 School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China; Key Laboratory of Complex Systems Modeling and Simulation Ministry of Education, Hangzhou 310018, China; Zhejiang Engineering Research Center of Data Security Governance, Hangzhou 310018, China
2 HDU-ITMO Joint Institute, Hangzhou Dianzi University, Hangzhou 310018, China
3 School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China